Let’s run the Word Count Program in Java
Requirements:
1. Hadoop must be installed
2. Java SDK
Steps:
1. Open Eclipse and follow the steps below.
2. File - New - Java Project - (give the project name as MapreduceDemo) - Finish.
3. Right click on the project - New - Package (give the package name as Company) - Finish.
4. Right click on the package Company - New - Class (name it WordCount).
5. Now add the below jar files: Right click on the project - Build Path - Add External JARs
hadoop-core.jar
commons-cli-1.2.jar
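On a typical Hadoop 1.x installation these two jars usually sit under the Hadoop install directory, roughly as below (the paths and the hadoop-core version number are only assumptions; check your own installation):
$HADOOP_HOME/hadoop-core-1.2.1.jar
$HADOOP_HOME/lib/commons-cli-1.2.jar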
Now type the code:
package Company;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    // Mapper: splits each input line into words and emits (word, 1) for every word.
    public static class WordCount_Mapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one1 = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one1);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts emitted for each word.
    public static class WordCount_Reducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it. args[0] = input path, args[1] = output path.
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCount_Mapper.class);
        conf.setCombinerClass(WordCount_Reducer.class);
        conf.setReducerClass(WordCount_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        removeDir(args[1], conf); // delete the output directory if it already exists
        JobClient.runJob(conf);
    }

    // Deletes the given directory so the job does not fail on an existing output path.
    private static void removeDir(String pathToDirectory, Configuration conf)
            throws IOException {
        Path pathToRemove = new Path(pathToDirectory);
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(pathToRemove)) {
            fileSystem.delete(pathToRemove, true);
        }
    }
}
6. Create the jar file of this program, name it WordCount.jar, and keep it in a local directory (here /New_dir); then copy the input file named file.csv into HDFS.
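The jar can be exported from Eclipse (right click on the project - Export - JAR file) or built from the Linux console. A rough sketch, assuming a Hadoop 1.x install under $HADOOP_HOME, the source file at Company/WordCount.java, and /New_dir as the local target folder (adjust the paths and the hadoop-core version to your setup):
mkdir classes
javac -classpath $HADOOP_HOME/hadoop-core-1.2.1.jar:$HADOOP_HOME/lib/commons-cli-1.2.jar -d classes Company/WordCount.java
jar -cvf /New_dir/WordCount.jar -C classes .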
Input file (file.csv) contents:
MapReduce is a programming model by which large dataset is processed in a parallel .
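file.csv must be available in HDFS before the job can read it. Assuming the file is in the current local directory and /New_dir/input is the HDFS path used in the next step, it can be copied with the standard HDFS shell commands:
hadoop fs -mkdir /New_dir
hadoop fs -mkdir /New_dir/input
hadoop fs -put file.csv /New_dir/input/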
7. Now run the jar file by typing the below command in the Linux console:
hadoop jar /New_dir/WordCount.jar /New_dir/input/file.csv /New_dir/output
8. The output file is stored in /New_dir/output/part-00000
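To print the result directly in the console, the part file can be read back from HDFS with:
hadoop fs -cat /New_dir/output/part-00000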
Output:
MapReduce 1
is 2
a 2
programming 1
model 1
by 1
which 1
large 1
dataset 1
processed 1
in 1
parallel 1