Introduction
Hadoop Map-Reduce is a framework for processing large data sets in parallel. If you are new to Hadoop, first visit here. In this article, I will help you quickly get started with writing the simplest Map-Reduce job: the famous "WordCount" job, the first one most people write.
WordCount is a simple application that counts the number of occurrences of each word in a given input set.
This code example is from the MapReduce tutorial available here. You can check out the source code directly from this small GitHub project I created.
Step 1. Install and start Hadoop server
In this tutorial, I assume your Hadoop installation is ready. For a single-node setup, visit here.
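If you want a quick sanity check that the installation is usable (assuming Hadoop lives at /usr/local/hadoop-1.0.2, as in the sessions below), print the version first:
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop version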
Start Hadoop
amresh@ubuntu:/home/amresh$ cd /usr/local/hadoop-1.0.2/
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/start-all.sh
amresh@ubuntu:/usr/local/hadoop-1.0.2$ sudo jps
6098 JobTracker
8024 Jps
5783 DataNode
5997 SecondaryNameNode
5571 NameNode
6310 TaskTracker
(Make sure the NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode processes are all running.)
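If any of these daemons is missing, its log usually explains why. As a hedged example, assuming the default log directory under the Hadoop home (the exact file names include your user name and hostname):
amresh@ubuntu:/usr/local/hadoop-1.0.2$ tail -n 50 logs/hadoop-*-namenode-*.log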
Step 2. Write Map-Reduce Job for Wordcount
Map.java (Mapper Implementation)
package com.impetus.code.examples.hadoop.mapred.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    // Reused across calls to avoid allocating a new object per word.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per input line; emits (word, 1) for every token in the line.
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
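To see what the mapper emits without running a cluster, here is a minimal standalone sketch of the same tokenize-and-emit logic (the class name MapLogicDemo is mine, not part of the job):

import java.util.StringTokenizer;

public class MapLogicDemo
{
    public static void main(String[] args)
    {
        String line = "Hello World Bye World"; // same content as file01 below
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            // The real mapper calls output.collect(word, one) here.
            System.out.println(tokenizer.nextToken() + "\t1");
        }
        // Prints: Hello 1, World 1, Bye 1, World 1 -- one (word, 1) pair per token.
    }
}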
Reduce.java (Reducer Implementation)
package com.impetus.code.examples.hadoop.mapred.wordcount;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    // Called once per distinct word; sums up all the 1s emitted for that word.
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
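Again, a standalone sketch of just the summing logic for one key (ReduceLogicDemo is a hypothetical name for illustration):

import java.util.Arrays;
import java.util.Iterator;

public class ReduceLogicDemo
{
    public static void main(String[] args)
    {
        // "World" appears twice in file01, so the reducer receives two 1s for it.
        Iterator<Integer> values = Arrays.asList(1, 1).iterator();
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next();
        }
        System.out.println("World\t" + sum); // World 2
    }
}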
WordCount.java (Job)
package com.impetus.code.examples.hadoop.mapred.wordcount;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount
{
    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the (key, value) pairs the job emits.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // The reducer doubles as a combiner: summing counts is associative,
        // so partial sums can safely be computed on the map side.
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output directories come from the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
Step 3. Compile and create the jar file
I prefer Maven for building my Java projects. You can find the POM file here and add it to your project; it pulls in the Hadoop jar dependency for you. Make sure the pom.xml sits parallel to your src directory.
Just run:
amresh@ubuntu:/usr/local/hadoop-1.0.2$ cd ~/development/hadoop-examples
amresh@ubuntu:/home/amresh/development/hadoop-examples$ mvn clean install
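If you would rather write your own POM, the key piece is the Hadoop dependency, roughly as below (a minimal sketch assuming Hadoop 1.0.2 from Maven Central; match the version to your installation):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.0.2</version>
</dependency>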
Step 4. Create input files with words to count
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -mkdir /home/amresh/wordcount/input
amresh@ubuntu:/usr/local/hadoop-1.0.2$ sudo vi file01 (contents: Hello World Bye World)
amresh@ubuntu:/usr/local/hadoop-1.0.2$ sudo vi file02 (contents: Hello Hadoop Goodbye Hadoop)
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -copyFromLocal file01 /home/amresh/wordcount/input/
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -copyFromLocal file02 /home/amresh/wordcount/input/
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -ls /home/amresh/wordcount/input/
Found 2 items
-rw-r--r--   1 amresh supergroup         22 2012-05-08 14:51 /home/amresh/wordcount/input/file01
-rw-r--r--   1 amresh supergroup         28 2012-05-08 14:51 /home/amresh/wordcount/input/file02
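To double-check the uploads, you can cat a file straight from HDFS:
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -cat /home/amresh/wordcount/input/file01
Hello World Bye World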
Step 5. Run the Map-Reduce job you wrote
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop jar ~/development/hadoop-examples/target/hadoop-examples-1.0.jar com.impetus.code.examples.hadoop.mapred.wordcount.WordCount /home/amresh/wordcount/input /home/amresh/wordcount/output
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -ls /home/amresh/wordcount/output/
Found 3 items
-rw-r--r-- 1 amresh supergroup 0 2012-05-08 15:23 /home/amresh/wordcount/output/_SUCCESS
drwxr-xr-x - amresh supergroup 0 2012-05-08 15:22 /home/amresh/wordcount/output/_logs
-rw-r--r-- 1 amresh supergroup 41 2012-05-08 15:23 /home/amresh/wordcount/output/part-00000
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -cat /home/amresh/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
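One follow-up note: FileOutputFormat refuses to write into an existing directory, so the job will fail if you re-run it as-is. Remove the output directory first:
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -rmr /home/amresh/wordcount/output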