Hadoop on Azure – Creating and Running a simple Java MapReduce

June 4, 2012


Apache Hadoop has a variety of APIs for developing MapReduce applications: you can use the streaming API to create MapReduce applications with almost any programming language, Hadoop Pipes adds native support for C++ applications, and Hadoop on Azure provides its IsotopeJS library for creating JavaScript MapReduce jobs. You can also use a variety of higher-level abstractions and libraries such as Pig and Hive. With that said and done, it is also useful to know how to develop MapReduce applications using Hadoop's most natural and fundamental Java API. This API allows you to develop richer, more powerful MapReduce apps and has a staggering amount of samples around the Internet.

If you come from a .NET background (like I do), creating your first MapReduce job might be a little bit frustrating the first time around. But once you get past some of the initial hurdles you will be writing Java MapReduce code in your sleep.

Setting up your IDE

After playing around with quite a few IDEs I have found JetBrains IntelliJ IDEA to be the best choice. I am currently using the community edition that can be downloaded free of charge from here.

Now we need to create a new project. The easiest way to create a project is using Maven. Maven is basically a build automation tool, but one of its coolest features is the ability to create a new project based on a template (called an archetype). I used this post to set up Maven on my machine.

Next we need to get an archetype that matches the Hadoop version used by the Microsoft Hadoop distribution (0.20.2). Matthias Friedrich created such an archetype, which generates a sample word count project. In order to use his Maven repository you need to run the following command line from your working folder:

mvn archetype:generate -DarchetypeCatalog=http://dev.mafr.de/repos/maven2/ -DgroupId=com.hadoop.example -DartifactId=MyFirstMapReduce

Note that MyFirstMapReduce is the artifactId, i.e. the name of the project, and com.hadoop.example is the name of my package.

Next you will be prompted to select the archetype from the list of archetypes in the repository:

Maven mvn archetype for Hadoop on Azure

Select 1 and press Enter.

You will be asked to change the version property. Press Enter to keep the default value of 1.0-SNAPSHOT and when prompted again select Y to confirm the details of your project.

Maven mvn archetype for Hadoop on Azure

A Look Around Our Project

Open IntelliJ IDEA, select File->Open Project, choose the folder created by Maven, and click OK:

IntelliJ project for Hadoop on Azure

The newly created project should load. It comes with the classic MapReduce sample, word count, including all its dependencies as well as some unit tests.

IntelliJ project for Hadoop on Azure

Open the TokenizingMapper class and view its implementation:

public class TokenizingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into whitespace-delimited tokens
        // and emit a (word, 1) pair for each one.
        StringTokenizer tok = new StringTokenizer(value.toString());
        while (tok.hasMoreTokens()) {
            Text word = new Text(tok.nextToken());
            context.write(word, one);
        }
    }
}
This common sample reads the input file line by line and generates a key-value output with every input word as its key and the digit 1 as its value. The sample then uses the IntSumReducer class, which basically sums the values of the inputs grouped by their key. Let's take a look at the WordCount class that implements the Tool interface. The Tool interface is used to handle the command-line arguments and configure the MapReduce job.
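To see what the map and reduce phases are each contributing, here is the same word count logic as a plain-Java sketch with no Hadoop dependencies (the WordCountSketch class and its method names are illustrative, not part of the archetype):

```java
import java.util.*;

public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every whitespace-delimited token,
    // mirroring what TokenizingMapper does per input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tok = new StringTokenizer(line);
            while (tok.hasMoreTokens()) {
                pairs.add(new AbstractMap.SimpleEntry<>(tok.nextToken(), 1));
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by key and sum the values,
    // which is what IntSumReducer does for us.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            Integer current = counts.get(pair.getKey());
            counts.put(pair.getKey(), (current == null ? 0 : current) + pair.getValue());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog");
        // prints {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(reduce(map(lines)));
    }
}
```

The real framework, of course, distributes the map calls across the cluster and sorts and groups the intermediate pairs between the two phases; the sketch only shows the data flow.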

public class WordCount extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        String[] remainingArgs
                = new GenericOptionsParser(getConf(), args).getRemainingArgs();
        if (remainingArgs.length < 2) {
            System.err.println("Usage: WordCount <in> <out>");
            return 1;
        }
        Job job = new Job(getConf(), "WordCount");
        // Wire up the job (elided above for brevity): the jar, the mapper
        // and reducer classes, and the output key/value types.
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizingMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(remainingArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(remainingArgs[1]));
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }
}
Now we can build the package. To do this we will use Maven again by running the following command from the project’s directory:

mvn package

After Maven finishes building your package you should see the following output:

Maven mvn build for Hadoop on Azure

Deploying and Running the Job

Before we start, we need some text to analyze. You can use any text file for that; I am going to use the 2010 CIA World Factbook, which I downloaded from Project Gutenberg.

To deploy the job you will first need to set up a Hadoop on Azure cluster. This is fairly easy: all you need to do is give your cluster a DNS name, select the size of the cluster you would like to set up, and create a cluster login using the cluster request form:

Creating a cluster using Hadoop on Azure 

Once the cluster is up, click on the Remote Desktop icon in the "Your Cluster" group and log in using the credentials you set during the cluster request.

rdp using Hadoop on Azure

Copy your input file to the local file system on the remote server and open the Hadoop command shell using the desktop shortcut. Copy the file to HDFS using the following command:

hadoop fs -copyFromLocal c:\ciabook.txt input/ciabook.txt

Let's list the files in our HDFS using the following command:

hadoop fs -lsr


We're now ready to run the word count application. Let's go back to the portal and create a new job. Click on the Create Job icon in the "Your Tasks" group:

Creating a job using Hadoop on Azure

In the Create Job page, give the job a friendly name and supply the JAR file. The JAR file can be found under target in the project's directory, and it should be named [package name]-1.0-SNAPSHOT-job.jar, where [package name] is the name of your package.

Next you need to add two parameters by clicking the Add Parameter button (once for each parameter). Set the value of the first parameter (Parameter 0) to input (the input directory we just created) and set the second parameter (Parameter 1) to output (this will be the output directory created by the MapReduce job). Click the Execute Job button.

Creating a job using Hadoop on Azure

Your job should start executing. For the CIA Factbook (approx. 12MB in size) the job executes in less than 2 minutes on average. Once the job has completed, you should see that its status is "Completed Sucessfully" (the typo is, embarrassingly, from the Hadoop on Azure portal).

Creating a job using Hadoop on Azure

Open the Interactive Console (its icon is in the "Your Cluster" group) and run the #lsr command to list the files in our HDFS:

lsr using Hadoop on Azure interactive console

And use the #tail command to see the last few lines of the output file:

tail using Hadoop on Azure interactive console


Coming from a .NET background, writing Hadoop Java MapReduce code can be a little bit intimidating. But just like in the .NET world, the right tool can make all the difference.
