Hadoop-based Services for Windows Azure includes several samples you can use for learning and testing. One sample is the 10GB GraySort, which is a scaled-down version of the Hadoop TeraSort benchmark. The sample runs three jobs. In this video, Developer Brad Sarsfield walks you through the first one, TeraGen.
Hi, my name is Brad Sarsfield and I’m a Developer on the Hadoop Services for Windows and Windows Azure team.
In this video I will show you how to use the 10GB GraySort sample to generate 10GB of data and store it in HDFS. The GraySort sample is a scaled-down version of the TeraSort I/O benchmark (see sortbenchmark.org). This video is part 1 in the series. In part 2 I’ll submit a MapReduce job to sort the data (terasort) and write it back to disk. In part 3 I’ll validate that the data has been sorted (teravalidate).
So let’s get started.
But I know that it’s better to split the data that I’m generating into much smaller pieces, keeping in mind that map tasks can fail for a variety of reasons and would need to be restarted. Ideally, each task should take one to one-and-a-half minutes at most, so that tasks fail fast. If a task fails, Hadoop will restart that task either on the same node or on a different node. In fact, if Hadoop thinks one of the tasks is running too slowly, it will start up a parallel copy of that task on another node and use the results of whichever one finishes first. This is called Speculative Execution.
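To make the sizing reasoning concrete, here is a minimal sketch of the arithmetic in Python. It relies on one fact about TeraGen, that every generated row is 100 bytes, and on several assumptions that are not from the video: 10GB is taken as 10 × 10^9 bytes, the map task count of 50 is a hypothetical value, and the per-task write throughput of 4 MB/s is a made-up figure you would replace with a measurement from your own cluster. The function names are illustrative only.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_plan(total_bytes, map_tasks):
    """Return (total_rows, rows_per_task, bytes_per_task) for an even split."""
    total_rows = total_bytes // ROW_BYTES
    rows_per_task = total_rows // map_tasks
    return total_rows, rows_per_task, rows_per_task * ROW_BYTES


def tasks_for_target(total_bytes, bytes_per_second, target_seconds):
    """Smallest task count that keeps each task at or under target_seconds.

    bytes_per_second is an ASSUMED per-task write throughput; measure it
    on your own cluster rather than trusting the value used below.
    """
    bytes_per_task = bytes_per_second * target_seconds
    return -(-total_bytes // bytes_per_task)  # integer ceiling division


if __name__ == "__main__":
    ten_gb = 10 * 10**9  # assumption: decimal gigabytes

    rows, rows_per_task, bytes_per_task = teragen_plan(ten_gb, map_tasks=50)
    print(f"total rows:     {rows:,}")
    print(f"rows per task:  {rows_per_task:,}")
    print(f"bytes per task: {bytes_per_task:,}")

    # Hypothetical throughput of 4 MB/s per map task, 60-second target.
    print("tasks for a 60 s target:", tasks_for_target(ten_gb, 4 * 10**6, 60))
```

The row count is the number you would hand to TeraGen, and the task count controls how much work is lost and repeated when a single task fails or is speculatively re-run. A larger task count means shorter tasks, at the price of more scheduling overhead.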
Let’s take a look at the data generated. There are several ways to do this:
Now it’s time to sort that data. Sorting is covered in the next video in this series.
Thank you for watching. I hope you found it helpful.