This tutorial covers several techniques for storing and importing data for use in Hadoop MapReduce jobs run with the Windows Azure HDInsight Service (formerly Apache™ Hadoop™-based Services for Windows Azure). Apache Hadoop is a software framework that supports data-intensive distributed applications. While Hadoop is designed to store data for such applications in its own distributed file system (HDFS), cloud-based on-demand processing can also use other forms of cloud storage, such as Windows Azure storage. Collecting and importing data in such scenarios is the subject of this tutorial.
You will learn:
- How to use Windows Azure Blob storage as input and output for Hadoop MapReduce jobs.
- How to write a Hadoop streaming job with a C# mapper and reducer.
- How to import relational data into HDFS with Sqoop.
While HDFS is the natural storage solution for Hadoop jobs, the data a job needs can also reside in large, scalable, cloud-based storage systems such as Windows Azure storage. It is reasonable to expect that Hadoop, when running on Windows Azure, is able to read data directly from such cloud storage.
In this tutorial you will analyze IIS logs located in Windows Azure storage by using a standard Hadoop streaming MapReduce job. The scenario demonstrates a Windows Azure web role that generates IIS logs using the Windows Azure diagnostic infrastructure. A simple Hadoop job reads the logs directly from Windows Azure storage and finds the 5 most popular URIs (web pages).
You must have a Windows Azure storage account for storing the IIS log file and the results of the MapReduce job. For instructions, see How to Create a Storage Account. After the account is created, write down the storage account name and the access key; you will use this information later in the tutorial.
To find the storage account name and its key
To simplify this tutorial, create two new containers in your storage account: iislogsinput for the source IIS log file, and iislogsoutput for the MapReduce job results. (Blob container names must be lowercase.)
To create Windows Azure storage containers
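You can create the containers with a storage browsing tool (described later in this tutorial) or programmatically. The following is a minimal sketch using the Windows Azure Storage client library (Microsoft.WindowsAzure.Storage); the account name and key placeholders are the values you noted earlier:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class CreateContainers
{
    static void Main()
    {
        // Placeholder connection string - substitute your storage account name and key.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=[accountName];AccountKey=[accountKey]");
        var blobClient = account.CreateCloudBlobClient();

        // Container for the source IIS log file.
        blobClient.GetContainerReference("iislogsinput").CreateIfNotExists();
        // Container for the MapReduce job results.
        blobClient.GetContainerReference("iislogsoutput").CreateIfNotExists();
    }
}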
If you have an IIS log file, you can skip the next procedure.
To generate IIS logs and place them in Windows Azure storage, you can enable Windows Azure Diagnostics in an ASP.NET web role and configure the diagnostic monitor to transfer the IIS log files to blob storage. For more information on Windows Azure diagnostics, see Enabling Diagnostics in Windows Azure.
The following are the prerequisites for creating the web role:
In the web role (typically in its OnStart method), configure and start the diagnostic monitor so that the IIS log files are transferred to the wad-iis-logfiles blob container:

// Configure IIS logging.
DiagnosticMonitorConfiguration diagMonitorConfig = DiagnosticMonitor.GetDefaultInitialConfiguration();

// Transfer the diagnostic infrastructure logs every minute.
diagMonitorConfig.DiagnosticInfrastructureLogs.ScheduledTransferLogLevelFilter = LogLevel.Information;
diagMonitorConfig.DiagnosticInfrastructureLogs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

// Transfer the configured log directory to the wad-iis-logfiles blob container every minute.
diagMonitorConfig.Directories.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
diagMonitorConfig.Directories.DataSources.Add(new DirectoryConfiguration()
{
    Container = "wad-iis-logfiles",
    Path = "logfiles"
});

DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", diagMonitorConfig);
Important: You must leave the application running in the compute emulator for more than one minute so that the log files are transferred to the Windows Azure storage account.
Uploading, downloading, and browsing files in Windows Azure Blob storage is an easy task if you install a blob storage browsing application such as Azure Storage Explorer or CloudBerry Explorer for Windows Azure Blob storage.
The following steps are for the Azure Storage Explorer application; you can use the same techniques with CloudBerry Explorer, but the steps may differ.
To copy the log file to the iislogsinput container on Windows Azure storage
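If you prefer to upload the log file from code rather than with a storage browser, here is a minimal sketch using the same Windows Azure Storage client library; the connection string and local path are placeholders, and the iislogsinput container is the one created earlier:

using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class UploadLog
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=[accountName];AccountKey=[accountKey]");
        var container = account.CreateCloudBlobClient().GetContainerReference("iislogsinput");

        // Upload the local IIS log file as the blob iislogs.txt.
        var blob = container.GetBlockBlobReference("iislogs.txt");
        using (var stream = File.OpenRead(@"C:\temp\iislogs.txt"))  // placeholder local path
        {
            blob.UploadFromStream(stream);
        }
    }
}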
Hadoop Streaming is a utility that lets you create and run MapReduce jobs with an executable or a script written in any language. Both the mapper and the reducer read their input from STDIN and write their output to STDOUT. For more information about Hadoop Streaming, see the Hadoop streaming documentation.
To create a MapReduce streaming job
The following is the mapper (map.exe), a C# console application. It reads the IIS log lines from STDIN (or from a file passed as the first argument) and emits a tab-separated URI and running count for every URI it finds:

using System;
using System.Collections.Generic;
using System.IO;

// map.exe: the streaming mapper.
class Map
{
    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            // Allow a file to be passed for local testing; otherwise read STDIN.
            Console.SetIn(new StreamReader(args[0]));
        }

        var counters = new Dictionary<string, int>();
        string line;

        while ((line = Console.ReadLine()) != null)
        {
            var words = line.Split(' ');
            foreach (var uri in words)
            {
                if (uri.StartsWith(@"http://") || uri.EndsWith(".aspx") || uri.EndsWith(".html"))
                {
                    if (!counters.ContainsKey(uri))
                        counters.Add(uri, 1);
                    else
                        counters[uri]++;

                    // Emit the URI and its running count; the reducer keeps the maximum.
                    Console.WriteLine(string.Format("{0}\t{1}", uri, counters[uri]));
                }
            }
        }
    }
}
The code looks for strings that start with "http://" or end with ".aspx" or ".html", and keeps a running count for each one. You can customize this filter if your URIs look different.
The following is the reducer (reduce.exe). It keeps the highest count seen for each URI and prints the five most requested URIs:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// reduce.exe: the streaming reducer.
class Reduce
{
    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            // Allow a file to be passed for local testing; otherwise read STDIN.
            Console.SetIn(new StreamReader(args[0]));
        }

        // Counter for each uri.
        var uriCounters = new Dictionary<string, int>();
        // List of the uris, ordered by the counter value.
        var topUriList = new SortedList<int, string>();

        string line;
        while ((line = Console.ReadLine()) != null)
        {
            // Parse the uri and the number of requests.
            var values = line.Split('\t');
            string uri = values[0];
            int numOfRequests = int.Parse(values[1]);

            // Save the maximum number of requests seen for each uri.
            if (!uriCounters.ContainsKey(uri))
                uriCounters.Add(uri, numOfRequests);
            else if (uriCounters[uri] < numOfRequests)
                uriCounters[uri] = numOfRequests;
        }

        // Build a list ordered by request count; uris that share a count are grouped.
        foreach (var keyValue in uriCounters)
        {
            if (!topUriList.ContainsKey(keyValue.Value))
                topUriList.Add(keyValue.Value, keyValue.Key);
            else
                topUriList[keyValue.Value] = string.Format("{0} , {1}", topUriList[keyValue.Value], keyValue.Key);
        }

        // Make the list descending.
        var lst = topUriList.Reverse().ToArray();

        // Print the top five results.
        for (int i = 0; (i < 5) && (i < lst.Count()); i++)
            Console.WriteLine(string.Format("{0} {1}", lst[i].Key, lst[i].Value));
    }
}
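Before uploading the executables, you can sanity-check them locally from a regular command prompt by piping a sample log file through the same STDIN/STDOUT pipeline that Hadoop streaming will use (the sample file name iislogs.txt and its location next to the two executables are assumptions):

type iislogs.txt | map.exe | sort | reduce.exe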
By default, the executable files are stored in the Map\bin\Debug and Reduce\bin\Debug folders under the project folder. You must upload the executables to HDFS before you can execute them.
To copy map.exe and reduce.exe to HDFS
From the JavaScript interactive console on the Hadoop on Windows Azure portal, run fs.put() and upload map.exe and reduce.exe from the project output folders to /example/apps:

fs.put()

Verify that the files were uploaded by listing the directory:

#ls /example/apps

You should see both map.exe and reduce.exe listed.
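Alternatively, if you are working from the Hadoop command prompt on the cluster rather than the interactive console, the standard hadoop fs -put command copies the local files into HDFS (the local paths below are placeholders):

hadoop fs -put c:\local\path\map.exe /example/apps/map.exe
hadoop fs -put c:\local\path\reduce.exe /example/apps/reduce.exe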
From HDInsight, you can connect to Windows Azure storage using the asv protocol. For example, within Hadoop you would normally list the files in HDFS from the command-line interface:
#ls /
To access files in Windows Azure Blob storage instead, you can run the following command to list the iislogs.txt file that you uploaded to the iislogsinput container earlier in the tutorial.
#ls asv://iislogsinput/
Before you can access Windows Azure storage from the cluster, you must configure ASV with your storage account name and access key.
To set up ASV
To create and execute a new Hadoop job
The hadoop-streaming.jar can be downloaded from http://www.java2s.com/Code/Jar/h/Downloadhadoopstreamingjar.htm.
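Whether you submit the job from the portal or from the Hadoop command prompt, the streaming invocation takes roughly the following shape. This is a sketch that assumes the executables were uploaded to /example/apps and uses the container and file names from earlier in this tutorial:

hadoop jar hadoop-streaming.jar -files "hdfs:///example/apps/map.exe,hdfs:///example/apps/reduce.exe" -input "asv://iislogsinput/iislogs.txt" -output "asv://iislogsoutput/results.txt" -mapper "map.exe" -reducer "reduce.exe"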
After the job completes, open the blob results.txt/part-00000 in the iislogsoutput container using Azure Storage Explorer and review the results: the five most requested URIs and their request counts.
While Hadoop is a natural choice for processing unstructured and semi-structured data like logs and files, there may be a need to process structured data stored in relational databases as well. Sqoop (SQL-to-Hadoop) is a tool that allows you to import structured data to Hadoop and use it in MapReduce and HIVE jobs.
In this tutorial, you will install the AdventureWorks community sample databases into a Windows Azure SQL Database server, and then use Sqoop to import the data into HDFS.
To install the AdventureWorks databases into Windows Azure SQL Database
Because Sqoop currently adds square brackets around the table name, create a synonym to support two-part naming for SQL Server tables by running the following query:
CREATE SYNONYM [Sales.SalesOrderDetail] FOR Sales.SalesOrderDetail
Run the following query and review its results to verify that the synonym works.
select top 200 * from [Sales.SalesOrderDetail]
To import data using Sqoop
In the Hadoop command prompt, change the directory to "c:\Apps\dist\sqoop\bin" and run the following command:
sqoop import --connect "jdbc:sqlserver://[serverName].database.windows.net;username=[userName]@[serverName];password=[password];database=AdventureWorks2012" --table Sales.SalesOrderDetail --target-dir /data/lineitemData -m 1
Go to the Hadoop on Windows Azure portal and open the interactive console. Run the #lsr command to list the files and directories in HDFS.
Run the #tail command to view selected results from the part-m-00000 file.
#tail /user/RAdmin/data/SalesOrderDetail/part-m-00000
In this tutorial you have seen how various data sources can be used for MapReduce jobs in Hadoop on Windows Azure. Data for Hadoop jobs can be on cloud storage or on HDFS. You have also seen how relational data can be imported into HDFS using Sqoop and then be used in Hadoop jobs.