ITechShree-Data-Analytics-Technologies

Working Model of MapReduce: A Quick-Start Guide to MapReduce Programming






MapReduce is a programming model for processing large data sets in a parallel and distributed manner on a cluster.

The concept was first introduced by Google and later became part of the Apache Hadoop project.



How does MapReduce work?

The processing unit of the MapReduce model is the key-value pair. The model processes large data sets, whether structured or unstructured, and generates key-value pairs.
Generally, MapReduce consists of two phases, illustrated with a small example below.

·The Map function is the first phase, in which a block of data is read and processed to produce key-value pairs as intermediate output.
·The Reduce function receives the key-value pairs from multiple map tasks and aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of key-value pairs, which is the final output.
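
For example, counting the words in the line "cat dog cat" (a made-up three-word input, used only to illustrate the data flow) would proceed roughly as follows:

Map input:           (0, "cat dog cat")
Intermediate output: ("cat", 1), ("dog", 1), ("cat", 1)
Reduce input:        ("cat", [1, 1]), ("dog", [1])
Final output:        ("cat", 2), ("dog", 1)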


Let's walk through the MapReduce workflow.

1. Input data:
These are the input files, stored in HDFS.

2. Input Splits:
An InputSplit is the logical representation of the data processed by a single map task in the MapReduce program. Thus the number of map tasks is equal to the number of InputSplits. Each split is further divided into records, and each record is processed by the mapper.

3. Record Reader:
It communicates with the InputSplit and converts the data it reads into key-value pairs.

4. Mapper:
It processes each input record (received from the Record Reader) and generates new key-value pairs, which are written to the local disk.

5. Combiner:
It acts on the mappers' output and performs local aggregation, which helps minimize the data transfer between the mappers and the reducers.

6. Partitioner:
It comes into the picture when there is more than one reducer. The number of partitions is equal to the number of reducers.

Each combiner (or mapper) output is partitioned so that all records with the same key go into the same partition, and each partition is then sent to one reducer. In this way the mapper output is distributed across the reducers; see the sketch after this list.

7. Shuffling and Sorting:

Shuffling and sorting in Hadoop occur simultaneously.
Shuffling is the physical movement of data over the network: the mappers' intermediate output is transferred to the reducers. The intermediate key-value pairs generated by the mappers are automatically sorted by key, and in the sort phase the map outputs are merged and sorted before being passed to the reducer.

8. Reducer:
The intermediate output of the mappers is passed to the reducer, which processes it and writes the final output to HDFS.
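
To make step 6 concrete, here is a minimal sketch of a custom partitioner, written against the older org.apache.hadoop.mapred API that the word count example below also uses. The class name WordCount_Partitioner and the hash-based scheme are only illustrative; Hadoop's default HashPartitioner already behaves this way, so you normally do not need to write one.

package Company;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative partitioner: every record with the same key is routed to the same reducer.
public class WordCount_Partitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) {
        // No configuration is needed for this simple example.
    }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Same key -> same hash -> same partition -> same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered on the job with conf.setPartitionerClass(WordCount_Partitioner.class).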

MapReduce API:
The main interfaces and classes used in MapReduce programming are:

1. JobContext interface:
It is used to define different jobs in MapReduce and acts as the super-interface for the job-related classes.

2. Job class:
It plays an important role in the MapReduce API, allowing the user to configure a job, submit it, control its execution, and query its state.

3. Mapper class:
It is used to define the map job.

4. Reducer class:
The reduce job in MapReduce is defined by the Reducer class.

A minimal driver skeleton using these classes is sketched below.
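
For orientation, here is a minimal driver skeleton, assuming the newer org.apache.hadoop.mapreduce API. The class name JobSkeleton is just a placeholder, and the commented-out mapper/reducer lines stand in for your own subclasses of the Mapper and Reducer classes. Note that the word count walkthrough below uses the older org.apache.hadoop.mapred API (JobConf, JobClient) instead, which works the same way conceptually.

package Company;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Skeleton driver: the Job class handles configuration, submission and monitoring.
public class JobSkeleton {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "MyJob");      // create and name the job
        job.setJarByClass(JobSkeleton.class);          // jar that contains the job classes
        // job.setMapperClass(MyMapper.class);         // your subclass of Mapper
        // job.setReducerClass(MyReducer.class);       // your subclass of Reducer
        // job.setOutputKeyClass(Text.class);          // key type of the final output
        // job.setOutputValueClass(IntWritable.class); // value type of the final output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // submit and wait for completion
    }
}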

Let's run the word count program in Java.

Prerequisites:
1. Hadoop must be installed.
2. Java SDK must be installed.



Steps:

  1. First, open Eclipse and create a new project:
     File - New - Java Project - (give the project name as MapreduceDemo) - Finish.

  2. Right-click on the project - New - Package (give the package name as Company) - Finish.

  3. Right-click on the package Company - New - Class (name it WordCount).

  4. Now add the jar files below:
     Right-click on the project - Build Path - Add External JARs:
       1. hadoop-core.jar
       2. commons-cli-1.2.jar

  5. Now type the code:



package Company;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    // The mapper and reducer are static nested classes so that the whole
    // program fits in the single file WordCount.java created in step 3.

    // Mapper: emits (word, 1) for every token of each input line.
    public static class WordCount_Mapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one1 = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one1);
            }
        }
    }

    // Reducer: sums up the counts received for each word.
    public static class WordCount_Reducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("WordCount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCount_Mapper.class);
        conf.setCombinerClass(WordCount_Reducer.class);
        conf.setReducerClass(WordCount_Reducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Delete the output directory if it already exists, so the job can be re-run.
        removeDir(args[1], conf);

        JobClient.runJob(conf);
    }

    private static void removeDir(String pathToDirectory, Configuration conf)
            throws IOException {
        Path pathToRemove = new Path(pathToDirectory);
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(pathToRemove)) {
            fileSystem.delete(pathToRemove, true);
        }
    }
}





    6. Create the jar file for this program and name it WordCount.jar. Keep the jar on the local file system (here /New_dir is used as an example directory) and upload the input file named file.csv to HDFS; example commands follow the input format below.

    Input file format:

    MapReduce is a programming model by which large dataset is processed in a parallel .
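
    Assuming the example paths used in this post (/New_dir on the local file system for the jar, /New_dir/input in HDFS for the input), the input file can be uploaded with commands along these lines:

    hdfs dfs -mkdir -p /New_dir/input
    hdfs dfs -put file.csv /New_dir/input/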

    7. Now run the jar by typing the command below in the Linux console:
    hadoop jar /New_dir/WordCount.jar /New_dir/input/file.csv /New_dir/output

    8. The output file is stored in /New_dir/output/part-00000.
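
    The result can be viewed with, for example:

    hdfs dfs -cat /New_dir/output/part-00000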

Output:

MapReduce 1
is 2
a 2
programming 1
model 1
by 1
which 1
large 1
dataset 1
processed 1
in 1
parallel 1

Now you have a basic understanding of the MapReduce framework and have seen how it helps us process the huge amounts of data stored in HDFS.



#Mapreduce API  #Working model of Mapreduce  #Writing Mapreduce programming

See you in my next blog!! 













