Hadoop on Azure - Creating and Running a simple Java MapReduce
Apache Hadoop has a variety of APIs for developing MapReduce applications: you can use the streaming API to create MapReduce applications with almost any programming language, Hadoop Pipes adds native support for C++ applications, and Hadoop on Azure provides its IsotopeJS library for creating JavaScript MapReduce jobs. You can also use a variety of higher-level abstractions and libraries such as Pig and Hive. That said, it is also useful to know how to develop MapReduce applications using Hadoop’s native Java API. This API allows you to develop richer, more powerful MapReduce applications, and there is a staggering number of samples for it around the Internet.
If you come from a .NET background (like I do), creating your first MapReduce job might be a little frustrating. But once you get past the initial hurdles you will be writing Java MapReduce code in your sleep.
Setting up your IDE
After playing around with quite a few IDEs I have found JetBrains IntelliJ IDEA to be the best choice. I am currently using the community edition that can be downloaded free of charge from here.
Now we need to create a new project. The easiest way to create a project is to use Maven. Maven is basically a build automation tool, but one of its coolest features is the ability to create a new project based on a template (called an archetype). I used this post to set up Maven on my machine.
Next we need to get an archetype that matches the Hadoop version used for the Microsoft Hadoop Distribution (0.20.2). Matthias Friedrich created such a repository which creates a sample word count project. In order to use this Maven repository you need to run the following command line from your working folder:
mvn archetype:generate -DarchetypeCatalog=http://dev.mafr.de/repos/maven2/ -DgroupId=com.hadoop.example -DartifactId=MyFirstMapReduce
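Note that MyFirstMapReduce (the artifactId) is the name of my package. Maven uses it as the name of the project folder and as the prefix of the JAR file it builds.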
Next you will be prompted to select the archetype from the list of archetypes in the repository:
Select 1 and press Enter.
You will be asked to change the version property. Press Enter to keep the default value of 1.0-SNAPSHOT and when prompted again select Y to confirm the details of your project.
A Look Around Our Project
Open IntelliJ IDEA, select File->Open Project, choose the folder created by Maven and click OK:
The newly created project should load. It contains the classic MapReduce sample, word count, including all its dependencies as well as some unit tests.
Open the TokenizingMapper class and view its implementation:
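import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Every occurrence of a word is emitted with a count of 1.
    private static final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line on whitespace and emit one (word, 1) pair per token.
        StringTokenizer tok = new StringTokenizer(value.toString());
        while (tok.hasMoreTokens()) {
            Text word = new Text(tok.nextToken());
            context.write(word, one);
        }
    }
}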
This common sample reads the input file line by line and emits a key-value pair for every word in the input, with the word as the key and the number 1 as the value. The sample uses the IntSumReducer class, which sums the values of all pairs that share the same key. Let’s take a look at the WordCount class, which implements the Tool interface. The Tool interface is used to handle the command-line arguments and configure the MapReduce job.
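import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        // Strip Hadoop's generic options; what remains are the input and output paths.
        String[] remainingArgs = new GenericOptionsParser(getConf(), args).getRemainingArgs();
        if (remainingArgs.length < 2) {
            System.err.println("Usage: WordCount <in> <out>");
            ToolRunner.printGenericCommandUsage(System.err);
            return 1;
        }

        Job job = new Job(getConf(), "WordCount");
        job.setJarByClass(getClass());

        // The reducer doubles as the combiner, so counts are pre-summed on the map side.
        job.setMapperClass(TokenizingMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(remainingArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(remainingArgs[1]));

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }
}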
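To make the data flow between the mapper, the combiner and the reducer a bit more concrete, here is a minimal in-memory sketch in plain Java. It does not use Hadoop at all; the class WordCountSketch and its methods are hypothetical names I made up for illustration, and treating each input line as its own map task is a simplifying assumption. The point it shows is that summing is safe to apply twice (once as a combiner, once as a reducer) because addition is associative and commutative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the word count data flow.
public class WordCountSketch {

    // Map step: emit a (word, 1) pair for every whitespace-separated token,
    // mirroring what TokenizingMapper does for one line of input.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(line);
        while (tok.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tok.nextToken(), 1));
        }
        return pairs;
    }

    // Combine/reduce step: sum the values of all pairs that share a key.
    // The same function serves both roles, as IntSumReducer does in the job.
    public static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs map -> combine (per line) -> reduce (over everything) in memory.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        for (String line : lines) {
            // Assumption: each line plays the role of one map task's output.
            combined.addAll(sumByKey(map(line)).entrySet());
        }
        return sumByKey(combined);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to do is to be");
        // Prints {be=3, do=1, is=1, not=1, or=1, to=4}
        System.out.println(wordCount(lines));
    }
}
```

In the real job Hadoop takes care of grouping the pairs by key between the map and reduce phases; the sketch folds that grouping into sumByKey to stay short.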
Now we can build the package. To do this we will use Maven again by running the following command from the project’s directory:
mvn package
After Maven finishes building your package you should see the following output:
Deploying and Running the Job
Before we start, we need some text to analyze. You can use any text file for that; I am going to use the 2010 CIA World Factbook, which I downloaded from Project Gutenberg.
To deploy the job you will first need to set up a Hadoop on Azure cluster. This is fairly easy: all you need to do is give your cluster a DNS name, select the size of the cluster you would like to set up and create a cluster login using the cluster request form:
Once the cluster is up, click the Remote Desktop icon in the “Your Cluster” group and log in using the credentials you set during the cluster request.
Copy your input file to the local file system on the remote server and open the Hadoop command shell using the desktop shortcut. Copy the file to HDFS using the following command:
hadoop fs -copyFromLocal c:\ciabook.txt input/ciabook.txt
Let’s list the files in our HDFS using the following command:
hadoop fs -lsr
We’re now ready to run the word count application. Let’s go back to the portal and create a new job. Click the Create Job icon in the “Your Tasks” group.
In the Create Job page, give the job a friendly name and select the JAR file. The JAR file can be found under target in the project’s directory, and it should be named [package name]-1.0-SNAPSHOT-job.jar, where [package name] is the name of your package.
Next you need to add two parameters by clicking the Add Parameter button (once for each parameter). Set the value of the first parameter (parameter 0) to input (the input directory we just created) and set the second parameter (parameter 1) to output (this will be the output directory created by the MapReduce job). Click the Execute Job button.
Your job should start executing. For the CIA Factbook (approx. 12 MB in size) the job executes in less than 2 minutes on average. Once the job has completed you should see that its status is “Completed Sucessfully” (the typo, embarrassingly, comes from the Hadoop on Azure portal).
Click the Interactive Console icon in the “Your Cluster” group and run the #lsr command to list the files in our HDFS:
Then use the #tail command to see the last few lines of the output file:
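If you copy the output file back to your machine, a few lines of plain Java are enough to find the most frequent words. The sketch below assumes each output line has the form word, tab, count, which is what Hadoop’s default text output format writes when a job (like ours) does not set another format. The class TopWordsSketch and the sample lines are made up for illustration and are not part of the generated project.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper for inspecting word count output lines.
public class TopWordsSketch {

    // Parses "word<TAB>count" lines and returns the n most frequent words,
    // highest count first; ties are broken alphabetically.
    public static List<Map.Entry<String, Integer>> topWords(List<String> outputLines, int n) {
        List<Map.Entry<String, Integer>> counts = new ArrayList<>();
        for (String line : outputLines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // not a key/value line, skip it
            }
            try {
                int count = Integer.parseInt(line.substring(tab + 1).trim());
                counts.add(new SimpleEntry<>(line.substring(0, tab), count));
            } catch (NumberFormatException e) {
                // The value is not a number, skip the line.
            }
        }
        counts.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(counts.subList(0, Math.min(n, counts.size())));
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the real job output.
        List<String> sample = Arrays.asList("be\t3", "do\t1", "to\t4");
        // Prints [to=4, be=3]
        System.out.println(topWords(sample, 2));
    }
}
```

To run it against the real output you would read the file’s lines into a list first, for example with java.nio.file.Files.readAllLines.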
Summary
Coming from a .NET background, writing Hadoop Java MapReduce code can be a little bit intimidating. But just like in the .NET world, the right tool can make all the difference.




