This article is the main portal for technical information about HDInsight Services for Windows and related Microsoft technologies. It provides a brief overview of Apache Hadoop, as well as information for the HDInsight Services provided by Microsoft for deployment on both Windows and Windows Azure. It also provides links to more detailed technical content in various formats.
Note: Contributions are welcome and appreciated. Please feel free to update this and other articles on this Wiki, and to add links to relevant content both from within and outside Microsoft.
Topics
Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable, distributed data store, and MapReduce, a parallel and distributed processing system. A Hadoop cluster can be made up of a single node or of thousands of nodes.
HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes the replicas across the nodes of the cluster. This layout makes storage reliable and lets computations run rapidly, close to the data.
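The splitting and replication described above can be illustrated with a small simulation. This is only a sketch, not HDFS code: the function names `split_into_blocks` and `place_replicas`, the node names, and the round-robin placement are assumptions made for illustration. Real HDFS uses much larger blocks (tens of megabytes) and a rack-aware placement policy, typically with three replicas per block.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Hypothetical round-robin policy, used here only to show that every
    block ends up on several different nodes.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [
            nodes[(start + k) % len(nodes)] for k in range(replication)
        ]
    return placement


# Tiny example: a 10-byte "file", 4-byte blocks, 4 nodes, 3 replicas per block.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on several nodes, losing one node does not lose data, and a task that needs a block can be scheduled on any node that holds a replica.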
Hadoop MapReduce is a software framework for writing applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. A MapReduce job usually splits the input data set into independent chunks. These chunks are processed by map tasks running across the nodes of the Hadoop cluster in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
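The flow described above (map over independent chunks, sort and group the map output, then reduce) can be sketched in plain Python with a word count. This is an illustrative, single-process sketch and does not use the Hadoop API: the function names `map_phase`, `shuffle_and_sort`, `reduce_phase`, and `run_job` are hypothetical, and the chunks are processed sequentially here, whereas Hadoop would run the map tasks in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(chunk: str):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.split():
        yield (word.lower(), 1)


def shuffle_and_sort(pairs):
    """Group the map output by key and sort by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce task: combine all values emitted for one key."""
    return (key, sum(values))


def run_job(chunks):
    """Run map over every chunk, then shuffle/sort, then reduce each key."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


# Three independent input chunks stand in for the splits of a large file.
result = run_job(["the quick fox", "the lazy dog", "The fox"])
print(result)
```

Each call to `map_phase` depends only on its own chunk, which is what allows a real cluster to run the map tasks independently and to re-execute a single failed task without restarting the whole job.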
Some of the main advantages of Hadoop are that it can process vast amounts of data (hundreds of terabytes or even petabytes) quickly and efficiently, process both structured and unstructured data, perform the processing where the data is located rather than moving the data to a separate processing location, and detect and handle failures by design.
There are two other key Apache technologies that are frequently used with Hadoop: Hive and Pig. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems such as HDFS. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. HiveQL also allows MapReduce programmers to plug in their own custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
For more details on Apache Hadoop, see http://hadoop.apache.org/.
This section contains links to resources useful in learning Hadoop, such as installation, configuration, and basic how-to information.
The links in this section provide information on deploying and using the Developer Preview of HDInsight Services on Windows.
The links in this section provide information on deploying and using Apache Hadoop on the Microsoft Windows Azure Platform. Instead of setting up and managing a Hadoop cluster on Azure yourself, you can use the HDInsight Services for Windows Azure dashboard that Microsoft has made available at hadooponazure.com. This is a preview of the HDInsight Services for Windows Azure, to which you can submit MapReduce jobs along with the data they process. It lets you process vast amounts of structured and unstructured data without having to set up, configure, maintain, and manage a Hadoop cluster manually.
This section contains links to the tutorials for the samples that are on the Hadoop on Windows Azure Portal.
This section contains information on developing solutions using Hadoop.
This section contains information on using Hadoop with other BI technologies.
Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS)
This section contains a list of Hadoop-related how-to articles.
This section contains a list of Hadoop-related examples.
This section contains a list of Hadoop-related videos.
This section contains a list of Hadoop-related books.
Microsoft plans to provide guidance on best practices in the future. If you have best-practices guidance that you'd like to share, please feel free to provide a link to it here. (Some suggestions.) It would be great to list some best practices around: