Hadoop-based Services for Windows Azure includes several samples you can use for learning and testing. One sample is the 10GB GraySort, which is a scaled-down version of the Hadoop Terasort benchmark. There are three jobs to run, and in this video Developer Brad Sarsfield walks you through Terasort.
Hi, my name is Brad Sarsfield and I’m a Developer on the Hadoop Services for Windows and Windows Azure team.
In video one of this series, I generated 10GB of data using the teragen example. In this video I'll run terasort to sort that data. Terasort has two phases: the map phase and then the reduce phase. The map tasks partition the data into 25 ranges, and the reduce tasks sort the data and write it to HDFS.
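For reference, the teragen job from video one looks something like the following; the jar name and HDFS paths here are illustrative, not necessarily the exact ones from the video. It writes 100,000,000 rows of 100 bytes each, which comes to 10GB:

    hadoop jar hadoop-examples.jar teragen 100000000 /example/data/10GB-sort-input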
The command-line content I added to the Parameters is translated into the Final Command below.
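On my cluster, the Final Command ends up looking something like this; the examples jar name and paths are illustrative and may differ on your cluster:

    hadoop jar hadoop-examples.jar terasort -Dmapred.map.tasks=50 -Dmapred.reduce.tasks=25 /example/data/10GB-sort-input /example/data/10GB-sort-output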
Each map task takes a file and, based on the contents of that file, creates 25 buckets, or data-range partitions. It doesn't sort the data yet; it just partitions it, sampling the keys to figure out which records fall into which range. Then that map task internally sorts the data it is responsible for: only the data in that one file.
After the map tasks have completed, the reduce tasks start up. Each reduce task is put in charge of one of those buckets. It reaches out to each node and asks for any data that falls into its bucket. For example, if my bucket is the numbers 60 through 88, I collect those numbers from each map task; that is my piece of the job. Once all of that data has been read from the cluster, the reduce task brings it all in, sorts it, and writes its part of the final output. In parallel, the other 24 reduce tasks are doing the same thing.
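Because each reduce task writes its own file, the sorted output ends up as 25 part files in HDFS. Assuming the output path used above, a quick listing shows one file per reduce task (named part-00000 through part-00024 on this version of Hadoop):

    hadoop fs -ls /example/data/10GB-sort-output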
The map task results are stored in temporary files – they are not saved in HDFS and are not persisted past the life of the job.
The job is in the reduce phase now, working through the reduce tasks. I could configure my cluster to do this work faster; with more nodes, for instance, it would run faster. But for this video I have only 4 worker nodes dividing up the work.
When I submit the job and tell it to use 50 maps and 25 reduces, the job scheduler decides on the placement and execution of tasks based on the available slots. Once a task finishes, the next task in the queue begins.
Each of my medium VM nodes is set to accept 2 map tasks and 1 reduce task. So in this case, 4 worker nodes, each with 2 slots available for map tasks, means I am running 8 map tasks at a time.
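Those slot counts come from the tasktracker configuration on each node. As a sketch, this assumes mapred-site.xml settings along these lines on each medium VM:

    mapred.tasktracker.map.tasks.maximum = 2
    mapred.tasktracker.reduce.tasks.maximum = 1

With 4 worker nodes and 2 map slots each, at most 8 of the 50 map tasks run concurrently; the rest wait in the queue for a free slot.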
The job scheduler tries to affinitize each task to the location of its data: it places the task on the same machine as the data when it can, or as close to it as possible.
Now it’s time to validate the sort. Validating the data output is covered in the next video in this series.
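For reference, the validation step uses the teravalidate example from the same jar and follows the same pattern; the paths here are illustrative:

    hadoop jar hadoop-examples.jar teravalidate /example/data/10GB-sort-output /example/data/10GB-sort-validate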
Thank you for watching. I hope you found it helpful.