Hadoop on Azure WordCount Sample Tutorial

Overview

This tutorial shows two ways to use Hadoop on Azure to run a MapReduce program that counts word occurrences in a text: first, with a Hadoop .jar file by using the Create Job UI; second, with a query by using the fluent API layered on Pig that is provided by the Interactive Console. The first approach uses a MapReduce program written in Java; the second uses one written in JavaScript. The text file analyzed here is the Project Gutenberg eBook edition of The Notebooks of Leonardo Da Vinci. The Java MapReduce program outputs the total number of occurrences of each word in the text file. The query that uses the JavaScript MapReduce program outputs only the 10 most frequently occurring words.

 Important!
This wiki topic may be obsolete.
The wiki topics on Windows Azure HDInsight Service are no longer updated by Microsoft. We moved the content to windowsazure.com where we keep it current. This topic can be found at Getting Started with Windows Azure HDInsight Service.

The Hadoop MapReduce program reads the text file and counts how often each word occurs. The output is a new text file consisting of lines, each of which contains a word and a count (a tab-separated key/value pair) of how often that word occurred in the document. This process is done in two stages. The mapper (cat.exe in this sample) takes each line from the input text as input and breaks it into words; each time a word occurs, it emits a key/value pair consisting of the word followed by a 1. The reducer (wc.exe in this sample) then sums these individual counts for each word and emits a single key/value pair that contains the word followed by the sum of its occurrences.
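The two stages can be sketched in plain JavaScript. This is a hypothetical in-memory illustration of the emit-then-sum logic, not the sample's actual mapper and reducer, which run as distributed Hadoop tasks:

```javascript
// Stage 1 (map): break each line into words and emit (word, 1) pairs.
function map(line) {
  return line.split(/\W+/).filter(Boolean).map(word => [word.toLowerCase(), 1]);
}

// Stage 2 (reduce): sum the emitted 1s for each distinct word.
function reduce(pairs) {
  const counts = {};
  for (const [word, n] of pairs) counts[word] = (counts[word] || 0) + n;
  return counts;
}

// Illustrative input lines (made up, not from davinci.txt).
const lines = ["the notebooks of", "the da vinci notebooks"];
const pairs = [].concat(...lines.map(map));
console.log(reduce(pairs)); // { the: 2, notebooks: 2, of: 1, da: 1, vinci: 1 }
```

In the real job, Hadoop shuffles the mapper's pairs so that all pairs for a given word arrive at the same reducer before the summing step runs.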

You can use the Hadoop on Azure interactive JavaScript console to view the results of either approach. You can also use commands provided by the console to display the 10 most frequently occurring words in the text as a bar chart.

Goals

In this tutorial you see three things:

  1. How to use the portal provided by Hadoop on Azure to deploy a MapReduce program written in Java with a jar file.

  2. How to run queries from the interactive console that deploy a MapReduce program written in JavaScript by using the fluent API layered on Apache Pig provided by Hadoop on Azure.

  3. How to use the interactive console to read the results of a MapReduce program and display them graphically.

Key technologies

  • MapReduce
  • Hadoop on Azure
  • Hadoop on Azure Interactive JavaScript Console

Setup and configuration

You must have an account to access Hadoop on Azure and have created a cluster to work through this tutorial. To obtain an account and create a Hadoop cluster, follow the instructions outlined in the Getting started with Microsoft Hadoop on Azure section of the Introduction to Hadoop on Azure topic.


Tutorial

This tutorial is composed of the following segments:

  1. Deploy the WordCount MapReduce program to your Hadoop cluster and run it using a jar file
  2. View the results by using the interactive JavaScript console
  3. Analyze the results for the top 10 words by using a JavaScript program from the interactive console and display the results graphically

Deploy the WordCount MapReduce program to a Hadoop cluster and run it

When you log in from the Hadoop on Azure page, you land on your Account page. To bring up the Hadoop Sample Gallery, scroll down to the Manage your account section and click the Samples icon.

To bring up the deployment page for the sample, click the WordCount sample icon in the Hadoop Sample Gallery.

The WordCount deployment page provides information about the application and the downloads that are available: the Java MapReduce program, the input text, and the .jar files that contain the files Hadoop on Azure needs to deploy the application. You can inspect the Java code on this page by scrolling down to the Details section. To begin the deployment, click the Deploy to your cluster button on the right-hand side of the page.

The deployment process brings up the Create Job page for the WordCount sample. The job name and parameters have been assigned default values. The Job Name is "WordCountExample". Parameter0 is simply the name of the program, "wordcount". Parameter1 specifies the path/name of the input file (/example/data/gutenberg/davinci.txt) and the output directory where the results are saved (DaVinciAllTopWords). Note that the output directory assumes a default path relative to the /user/ folder.

The Final Command contains the Hadoop jar command that executes the MapReduce program with the parameter values provided above. See the documentation on .jar syntax for details. To run the program with these default values on your Hadoop cluster, simply click the blue Execute job button on the right-hand side.
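The Final Command follows the general `hadoop jar` pattern. A representative form is shown below; the jar file name here is illustrative, and the Create Job page displays the exact command for your cluster:

```shell
# Illustrative only -- the jar name varies by release; the Create Job page
# shows the exact Final Command. General shape:
#   hadoop jar <jar-file> <program> <input> <output>
hadoop jar hadoop-examples.jar wordcount \
    /example/data/gutenberg/davinci.txt DaVinciAllTopWords
```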

The status of the deployment is provided on the page. When the program has completed, the status is set to "Completed Successfully".

View the results by using the interactive JavaScript console

To get to the Interactive JavaScript console, return to your Account page. To bring up the Interactive JavaScript console, scroll down to the Your Cluster section and click the Interactive Console icon. 

To confirm that the DaVinciAllTopWords folder contains the part-r-00000 output file with the results, enter the command #ls DaVinciAllTopWords in the console and check that this file is listed.

To view the word counts, enter the command file = fs.read("DaVinciAllTopWords") at the console prompt. It is a large file; scroll up to see the long list of words and their counts. In the next segment, we see how to focus the analysis on a subset of the data and present it in a more useful way.

Analyze the results for the top 10 words by using a JavaScript program from the interactive console and display the results graphically

In your browser, go to http://isoprodstore.blob.core.windows.net/isotopectp/examples/WordCount.js and save a copy of the WordCount.js file to your local ../downloads directory.

Enter the upload command fs.put() at the js> console and then enter the following parameters into the Upload window that pops up:

**Source**: ../downloads/WordCount.js   
**Destination**: ./WordCount.js    



Click the Browse button for the Source, navigate to the ../downloads directory and select the WordCount.js file. Enter the Destination value as shown and click the Upload button.

Confirm that the file has been uploaded with the #ls command at the js> prompt. It should appear in the default directory.

You can examine the JavaScript code for the map and reduce functions in the WordCount.js file by using the #cat WordCount.js command.

Execute the MapReduce program from the js> console with the pig.from("/example/data/gutenberg/davinci.txt").mapReduce("WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("DaVinciTop10Words") command.
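The fluent query chains a sort on the count column (descending) with a limit of 10 before writing the result. As a hypothetical in-memory analogue, with made-up counts (the real query runs as a Pig job on the cluster), the orderBy/take steps amount to:

```javascript
// Made-up (word, count) pairs standing in for the MapReduce output.
const counts = [["of", 3], ["the", 5], ["and", 2], ["in", 1]];

const top3 = counts
  .slice()                       // leave the original array untouched
  .sort((a, b) => b[1] - a[1])   // orderBy("count DESC")
  .slice(0, 3);                  // take(3); the tutorial uses take(10)

console.log(top3); // highest-count pairs first: the, of, and
```

The "word, count:long" argument in the query is the output schema: it tells Pig to treat the first field as a chararray named word and the second as a long named count, which is what makes orderBy("count DESC") possible.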

Scroll to the right and click view log if you want to observe the job progress. This log also provides diagnostics if the job fails to complete. When the job does complete, you see the message:
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

To display the results in the DaVinciTop10Words directory, use the file = fs.read("DaVinciTop10Words") command at the js> prompt.

Parse the contents of the file into a data file with the data = parse(file.data, "word, count:long") command.

Plot the data by using the graph.bar(data) command.
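The read/parse steps above can be sketched locally. This is a hypothetical stand-in for the console's parse() helper, splitting the tab-separated word/count lines that fs.read returns into records; the file contents below are made-up sample data, not the real job output:

```javascript
// Made-up sample of the "word<TAB>count" lines the job writes out.
const fileData = "the\t5\nof\t3\nand\t2";

// Hypothetical local equivalent of parse(file.data, "word, count:long").
function parseWordCounts(text) {
  return text.trim().split("\n").map(line => {
    const [word, count] = line.split("\t");
    return { word, count: Number(count) }; // count:long in the schema string
  });
}

console.log(parseWordCounts(fileData));
```

Once parsed into records like these, graph.bar(data) in the console has named, typed columns to plot.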


Summary

You have seen two ways to use Hadoop on Azure to run MapReduce programs that count word occurrences in a text.

  • Using a Hadoop jar file from the Create Job UI.
  • Using a query with the fluent API from the Interactive Console.

You have also seen how to use the Interactive Console to graph the results obtained from the MapReduce analysis.
