MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. Most MapReduce jobs are written in Java. Hadoop provides a streaming API to MapReduce that enables you to write map and reduce functions in languages other than Java. This tutorial shows how to run the C# Streaming sample from the HDInsight Sample Gallery and how to use C# programs with the Hadoop streaming interface.
Before running this tutorial, you must have a Windows Azure HDInsight cluster provisioned. For more information on provisioning an HDInsight cluster, see Getting Started with Windows Azure HDInsight Service.
Estimated time to complete: 30 minutes.
There is a communication protocol between the Map/Reduce framework and the streaming mapper/reducer utility that enables this approach. The utility creates a Map/Reduce job, submits the job to an appropriate cluster, and monitors the progress of the job until it completes.
When an executable is specified for mappers, each mapper task launches the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, the entire line is treated as the key and the value is null.
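The default key/value rule described above can be sketched in C# as follows. This is only an illustration of the splitting rule, not code from the sample or from Hadoop itself; the class name StreamingLine and the method Parse are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;

static class StreamingLine
{
    // Splits one line of streaming output at the FIRST tab character.
    // Text before the tab is the key; text after it is the value.
    // If the line has no tab, the whole line is the key and the value is null.
    public static KeyValuePair<string, string> Parse(string line)
    {
        int tab = line.IndexOf('\t');
        if (tab < 0)
        {
            return new KeyValuePair<string, string>(line, null);
        }
        return new KeyValuePair<string, string>(
            line.Substring(0, tab),
            line.Substring(tab + 1));
    }

    static void Main()
    {
        // Hypothetical sample lines, used only to exercise the rule.
        foreach (string line in new[] { "hadoop\t3", "a\tb\tc", "no-tab-here" })
        {
            KeyValuePair<string, string> pair = Parse(line);
            Console.WriteLine("key=[{0}] value=[{1}]", pair.Key, pair.Value ?? "(null)");
        }
    }
}
```

Note that only the first tab separates key from value, so for the second sample line the key is "a" and the value keeps its embedded tab.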
When an executable is specified for reducers, each reducer task launches the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
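To make the reducer side of the protocol concrete, here is a minimal sketch of a streaming reducer that sums a numeric value per key. It is not the wc.exe reducer used in this tutorial; the class name SumByKeyReducer and the "key, tab, count" input layout are assumptions made for illustration. It relies on the fact that Hadoop delivers reducer input sorted by key, so all lines for one key arrive together.

```csharp
using System;
using System.IO;

static class SumByKeyReducer
{
    // Reads "key<TAB>count" lines that arrive sorted by key and writes one
    // "key<TAB>total" line per distinct key. A missing or non-numeric
    // value is counted as 0.
    public static void Reduce(TextReader input, TextWriter output)
    {
        string currentKey = null;
        long total = 0;
        string line;
        while ((line = input.ReadLine()) != null)
        {
            int tab = line.IndexOf('\t');
            string key = tab < 0 ? line : line.Substring(0, tab);
            long value = 0;
            if (tab >= 0)
            {
                long.TryParse(line.Substring(tab + 1), out value);
            }

            // A change of key means the previous group is complete.
            if (currentKey != null && key != currentKey)
            {
                output.WriteLine(currentKey + "\t" + total);
                total = 0;
            }
            currentKey = key;
            total += value;
        }

        // Emit the last group, if any input was read.
        if (currentKey != null)
        {
            output.WriteLine(currentKey + "\t" + total);
        }
    }

    static void Main()
    {
        // As a streaming reducer: read from stdin, write to stdout.
        Reduce(Console.In, Console.Out);
    }
}
```

Taking a TextReader and TextWriter instead of using the console directly keeps the logic testable locally with StringReader and StringWriter before the executable is uploaded to a cluster.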
For more information on the Hadoop streaming interface, see Hadoop Streaming.
The MapReduce program uses the cat.exe application as the mapper to stream the text to the console, and the wc.exe application as the reducer to count the number of words that are streamed from a document. Both the mapper and reducer read characters, line by line, from the standard input stream (stdin) and write to the standard output stream (stdout).
The C# Streaming page contains the description, details, and downloads for the sample. The downloads include:
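The source code for cat.exe (Mapper) is:

    using System;
    using System.IO;

    namespace cat
    {
        class cat
        {
            static void Main(string[] args)
            {
                // If a file path is passed, read from that file instead of stdin.
                if (args.Length > 0)
                {
                    Console.SetIn(new StreamReader(args[0]));
                }

                // Copy every input line unchanged to stdout.
                string line;
                while ((line = Console.ReadLine()) != null)
                {
                    Console.WriteLine(line);
                }
            }
        }
    }

The mapper code in the cat.cs file reads the incoming stream line by line with the Console.ReadLine method and writes each line unchanged to the standard output stream with the static Console.WriteLine method. When a file path is passed as the first argument, a StreamReader object redirects the console input to that file.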
The source code for wc.exe (Reducer) is:
    using System;
    using System.IO;
    using System.Linq;

    namespace wc
    {
        class wc
        {
            static void Main(string[] args)
            {
                string line;
                var count = 0;

                // If a file path is passed, read from that file instead of stdin.
                if (args.Length > 0)
                {
                    Console.SetIn(new StreamReader(args[0]));
                }

                // Count the space and '\n' characters in each input line.
                while ((line = Console.ReadLine()) != null)
                {
                    count += line.Count(cr => (cr == ' ' || cr == '\n'));
                }

                // Write the total to stdout.
                Console.WriteLine(count);
            }
        }
    }
The reducer code in the wc.cs file reads the lines that the cat.exe mapper has written, taking them from the standard input stream (or from a StreamReader object when a file path is passed as the first argument). As it reads each line with the Console.ReadLine method, it counts the words by counting the space and end-of-line characters at the end of each word, and then it writes the total to the standard output stream with the Console.WriteLine method.
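The counting logic can be tried locally without a cluster. The sketch below wraps the same per-line test that wc.cs uses in a method, alongside a hypothetical whitespace-splitting variant for comparison. The class name SeparatorCount, both method names, and the sample text are assumptions for this illustration and are not part of the sample. One detail worth knowing: ReadLine returns each line without its line terminator, so the '\n' test does not match within a line read this way, and the sample's total is in effect the number of spaces.

```csharp
using System;
using System.IO;
using System.Linq;

static class SeparatorCount
{
    // Same per-line test as wc.cs: count ' ' and '\n' characters.
    // ReadLine strips the line terminator, so only spaces are counted.
    public static int CountAsInSample(TextReader input)
    {
        int count = 0;
        string line;
        while ((line = input.ReadLine()) != null)
        {
            count += line.Count(c => c == ' ' || c == '\n');
        }
        return count;
    }

    // Hypothetical variant: count whitespace-separated tokens per line.
    public static int CountWords(TextReader input)
    {
        char[] separators = { ' ', '\t' };
        int count = 0;
        string line;
        while ((line = input.ReadLine()) != null)
        {
            count += line.Split(separators, StringSplitOptions.RemoveEmptyEntries).Length;
        }
        return count;
    }

    static void Main()
    {
        const string text = "the quick brown fox\njumps over\n";
        Console.WriteLine(CountAsInSample(new StringReader(text))); // prints 4
        Console.WriteLine(CountWords(new StringReader(text)));      // prints 6
    }
}
```

For the two-line sample text, the first method reports the four spaces and the second reports the six words, which shows how the two approaches differ by one per non-empty line.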
Parameter 0 is the path and file name of the mapper and reducer; parameter 1 is the path and name of the input file and the output file; and parameter 2 designates the mapper and reducer executables.
The Final Command shows the actual command that will be used.
If the job is taking too long, press F5 in the browser to refresh the screen.
You can use the Interactive JavaScript console to check the MapReduce job results.
#lsr /example/data/StreamingOutput
#cat /example/data/StreamingOutput/wc.txt/part-00000
part-00000 is the default Hadoop output file name.
In this tutorial, you have seen how to use C# programs with the Hadoop streaming interface.