This tutorial shows two ways to run Hadoop MapReduce programs against data stored in the Hadoop Distributed File System (HDFS) by using HDInsight Services for Windows Azure.
You will learn how to run MapReduce jobs from the Create Job UI and from the Interactive Console. This tutorial is composed of two segments: running the Pi Estimator sample by using the Create Job UI, and running the WordCount sample by using a Pig query in the Interactive JavaScript Console.
To work through this tutorial, you must have an account to access Hadoop on Windows Azure and have created a cluster. To obtain an account and create a Hadoop cluster, follow the instructions outlined in the Getting started with Microsoft Hadoop on Windows Azure section of the Introduction to Apache Hadoop-based Service for Windows Azure topic.
From your Account page, click the Create Job icon in the Your Tasks section to bring up the Create Job UI.
To run a MapReduce program, specify the Job Name and the JAR File to use. Parameters are added to specify the name of the MapReduce program to run, the location of input and code files, and an output directory.
To see a simple example of how this interface is used to run a MapReduce job, let's look at the Pi Estimator sample.
From your Account page, scroll down to the Samples icon in the Manage your account section and click it.
Click the Pi Estimator sample icon in the Hadoop Sample Gallery.
The Pi Estimator page provides information about the application, along with downloads available for the Java MapReduce programs and the jar file that contains the files needed by Hadoop on Windows Azure to deploy the application.
To deploy the files to the cluster, click the Deploy to your cluster button on the right side.
The fields on the Create Job page are populated for you in this example. The first parameter value defaults to "pi 16 10000000". The first number indicates how many maps to create (default is 16) and the second number indicates how many samples are generated per map (10 million by default). So this program uses 160 million random points to make its estimate of Pi. The Final Command is automatically constructed for you from the specified parameters and jar file.
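The Final Command itself is an ordinary Hadoop command line. As a sketch of its shape (the jar file name here is illustrative; the actual name depends on the Hadoop version installed on your cluster), it looks something like:

Hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar pi 16 10000000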
To run the program on the Hadoop cluster, simply click the blue Execute job button on the right side of the page.
The status of the job is displayed on the page and changes to Completed Successfully when it is done. The result is displayed at the bottom of the Output(stdout) section. For the default parameters, the result is Pi = 3.14159155000000000000, which agrees with Pi to five decimal places when rounded.
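To see why sampling random points yields an estimate of Pi: points drawn uniformly from the unit square fall inside the inscribed quarter circle with probability Pi/4, so four times the observed fraction approximates Pi. The following single-machine JavaScript sketch illustrates the idea only conceptually; the distributed sample partitions the sampling across its 16 maps and totals the counts in a reduce step, and uses a quasi-random point sequence rather than Math.random():

// Conceptual single-machine sketch of the Monte Carlo Pi estimate.
function estimatePi(numSamples) {
    var inside = 0;
    for (var i = 0; i < numSamples; i++) {
        var x = Math.random();
        var y = Math.random();
        if (x * x + y * y <= 1) {
            inside++; // the point landed inside the quarter circle
        }
    }
    return 4 * inside / numSamples; // fraction inside approximates Pi/4
}
estimatePi(10000000); // roughly 3.1415..., improving with more samples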
This segment shows how to run a MapReduce job from a query by using the fluent API layered on Pig that the Interactive Console provides. This example requires an input data file: for the WordCount sample used here, that file has already been uploaded to the cluster. The sample does, however, require that the WordCount.js script be uploaded to the cluster, and this step demonstrates the procedure for uploading files to HDFS from the Interactive Console.
First, download a copy of the WordCount.js script to your local machine so that you can upload it to the cluster: click here and save the WordCount.js file to your local ../downloads directory. In addition, download The Notebooks of Leonardo Da Vinci, available here.
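It is worth glancing at what the script contains before uploading it. WordCount.js defines a map function that emits each word with a count of 1 and a reduce function that sums the counts per word, along the lines of the following sketch (the copy you download may differ in detail):

// Sketch of the map/reduce pair in WordCount.js.
var map = function (key, value, context) {
    // Split each input line into words and emit (word, 1) pairs.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Sum the counts emitted for each word.
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};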
To bring up the Interactive JavaScript console, return to your Account page, scroll down to the Your Cluster section, and click the Interactive Console icon.
To upload the WordCount.js file to the cluster, enter the upload command fs.put() at the js> console:
fs.put()
In the dialog that opens, click the Browse button for the Source, navigate to the ../downloads directory, and select the WordCount.js file. For the Destination, enter ./WordCount.js and click the Upload button.
Repeat this step to upload the davinci.txt file by using ./example/data/ for the Destination.
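If you want to confirm that an upload landed where you expect before running the job, you can read a small file back at the js> prompt with the same fs.read() command that this tutorial uses later to display results, for example:

file = fs.read("WordCount.js")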
Execute the MapReduce program from the js> console by using the following command:
pig.from("/example/data/gutenberg/davinci.txt").mapReduce("WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("DaVinciTop10Words")
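Here is the same query again, split across lines with comments to show what each step of the fluent pipeline does:

pig.from("/example/data/gutenberg/davinci.txt")    // read the input text from HDFS
   .mapReduce("WordCount.js", "word, count:long")  // apply the script's map/reduce functions; the second argument declares the output schema
   .orderBy("count DESC")                          // sort words by descending count
   .take(10)                                       // keep only the ten most frequent words
   .to("DaVinciTop10Words")                        // write the result to the DaVinciTop10Words directory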
Scroll to the right and click view log if you want to observe the details of the job's progress. This log also provides diagnostics if the job fails to complete.
To display the results in the DaVinciTop10Words directory once the job completes, use the file = fs.read("DaVinciTop10Words") command at the js> prompt.
file = fs.read("DaVinciTop10Words")
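Some builds of the Interactive JavaScript console also provide parse() and graph helpers for working with results; if yours does (treat this as an assumption to verify on your cluster), you can chart the ten words like this:

data = parse(file.data, "word, count:long")  // parse the file contents into typed records
graph.bar(data)                              // render a bar chart of word counts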
In this tutorial, you have seen two ways to run MapReduce jobs by using the Hadoop on Windows Azure portal. One used the Create Job UI to run a Java MapReduce program by using a jar file. The other used the Interactive Console to run a MapReduce job by using a .js script within a Pig query.