Map Reduce-LO2
MapReduce Overview
• A method for distributing computation across multiple nodes
• Each node processes the data that is stored at that node
• Consists of two main phases
• Map
• Reduce
Map Reduce
• Hadoop Ecosystem component ‘MapReduce’ works by breaking the
processing into two phases:
• Map phase
• Reduce phase
• Each phase has key-value pairs as input and output. In addition, the
programmer also specifies two functions: the map function and the reduce
function (sketched below)
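• As a rough sketch (using the new-API Java method signatures that the later Java example also uses; K1/V1, K2/V2, K3/V3 are placeholder type parameters, not names from these slides):
void map(K1 key, V1 value, Context context)                // emits intermediate (K2, V2) pairs
void reduce(K2 key, Iterable<V2> values, Context context)  // emits final (K3, V3) pairs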
Map Reduce
• Map function takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples
(key/value pairs).
• Reduce function takes the output from the Map as input and combines
those data tuples based on the key, aggregating the values associated
with each key.
MR – Important Notes
• Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
Hadoop creates one map task for each split, which runs the user-defined map function for each
record in the split.
• For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by default
• For best performance, the optimal split size is the same as the block size:
it is the largest size of input that can be guaranteed to be stored on a single node. If the split
spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some of the
split would have to be transferred across the network to the node running the map task, which is
clearly less efficient than running the whole map task using local data.
• Map tasks write their output to the local disk, not to HDFS.
• If the node running the map task fails before the map output has been consumed by the reduce
task, then Hadoop will automatically rerun the map task on another node to re-create the map
output.
• The input to a single reduce task is normally the output from all mappers. The output of the reduce
is normally stored in HDFS for reliability.
• The number of reducers can be zero, one, or more than one
Map Reduce – Detail Flow Process
MapReduce Features
• Automatic parallelization and distribution
• Fault-Tolerance
• Provides a clean abstraction for programmers to use
The Mapper
• All map output values with the same key are guaranteed to go to the same
machine (the same reducer)
The Reducer
• Called once for each unique key
• Gets a list of all values associated with a key as input
• The reducer outputs zero or more final key/value pairs
• Usually just one output per input key
MR- Data Flow
RDBMS Vs Map Reduce
• MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion.
An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver
low-latency retrieval and update times of a relatively small amount of data.
• MapReduce suits applications where the data is written once and read many times, whereas a
relational database is good for datasets that are continually updated.
Map Reduce Paradigm
• Map and Reduce are based on functional programming
MR Example For Weather Dataset
• Weather sensors collect data every hour at many locations across the globe
and gather a large volume of log data, which is a good candidate for analysis
with MapReduce because we want to process all the data, and the data is
semi-structured and record-oriented.
• What’s the highest recorded global temperature for each year in the dataset?
• Data Format:
• The data is stored using a line-oriented ASCII format, in which each line is a record. The
format supports a rich set of meteorological elements, many of which are optional or
with variable data lengths. For simplicity, we focus on the basic elements, such as
temperature, which are always present and are of fixed width.
Map Reduce -Example
Map Reduce
• Data Files example:
• Datafiles are organized by date and weather station. There is a
directory for each year from 1901 to 2001, each containing a gzipped
file for each weather station with its readings for that year
Map Reduce –Example for Weather Data set
• Unix command-line tools can also give you the result, but the process
becomes very slow with a large dataset.
• Let's see how MapReduce works.
• MapReduce works by breaking the processing into two phases: the
map phase and the reduce phase.
• Each phase has key-value pairs as input and output, the types of
which may be chosen by the programmer.
• The programmer also specifies two functions: the map function and
the reduce function.
Map Reduce –Example for Weather Data set
• The input to our map phase is the raw data. We choose a text input format
that gives us each line in the dataset as a text value.
• The key is the offset of the beginning of the line from the beginning of the
file, but as we have no need for this, we ignore it.
• Our map function is simple. We pull out the year and the air temperature,
because these are the only fields we are interested in. In this case, the map
function is just a data preparation phase, setting up the data in such a way
that the reduce function can do its work on it: finding the maximum
temperature for each year.
• The map function is also a good place to drop bad records: here we filter out
temperatures that are missing, suspect, or erroneous.
Map Reduce –Example for Weather Data set
• To visualize the way the map works, consider the following sample
lines of input data
• These lines are presented to the map function as the key-value pairs:
Map Reduce –Example for Weather Data set
• The keys are the line offsets within the file, which we ignore in our
map function. The map function merely extracts the year and the air
temperature, and emits them as its output (the temperature values
have been interpreted as integers):
• (1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78)
• The output from the map function is processed by the MapReduce
framework before being sent to the reduce function. This processing
sorts and groups the key-value pairs by key.
• So, continuing the example, our reduce function sees the following
input: (1949, [111, 78]) (1950, [0, 22, −11])
Map Reduce –Example for Weather Data set
• Each year appears with a list of all its air temperature readings.
• All the reduce function has to do now is iterate through the list and
pick up the maximum reading: (1949, 111) (1950, 22).
• This is the final output: the maximum global temperature recorded
in each year.
Map Reduce- Example for Weather Data set
MR - Example for Weather Data set
• Let's code the above explanation.
• We need three things:
• a map function
• a reduce function
• and some code to run the job.
• The map function is represented by the Mapper class, which declares
an abstract map() method.
Java MR - Example for Weather Data set
• Mapper for the maximum temperature example
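• The slide shows the mapper as a code listing that is not reproduced here; below is a minimal sketch using the new (org.apache.hadoop.mapreduce) API. The class name MaxTemperatureMapper and the fixed-width column offsets in substring() are illustrative assumptions about the NCDC record layout, not taken from these slides.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;   // sentinel for a missing temperature

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);              // year columns (assumed offsets)
    int airTemperature;
    if (line.charAt(87) == '+') {                       // parseInt does not accept a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));  // drop bad records, emit (year, temp)
    }
  }
}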
Java MR - Example for Weather Data set
• The Mapper class is a generic type, with four formal type parameters that specify the input key,
input value, output key, and output value types of the map function. For the present example, the
input key is a long integer offset, the input value is a line of text, the output key is a year, and the
output value is an air temperature (an integer). Output from the mapper: (1950, 0) (1950, 22)
(1950, −11) (1949, 111) (1949, 78)
• Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and
IntWritable (like Java Integer).
• The map() method is passed a key and a value. We convert the Text value containing the line of
input into a Java String, then use its substring() method to extract the columns we are interested
in.
• The map() method also provides an instance of Context to write the output to. In this case, we
write the year as a Text object (since we are just using it as a key), and the temperature is
wrapped in an IntWritable. We write an output record only if the temperature is present.
Java MR - Example for Weather Data set
• Reducer for the maximum temperature example
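• Again, the slide's listing is an image; a minimal sketch of such a reducer in the new API (the class name is an assumption) is:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());   // keep the highest reading seen so far
    }
    context.write(key, new IntWritable(maxValue));  // emit (year, max temperature)
  }
}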
Java MR - Example for Weather Data set
• Again, four formal type parameters are used to specify the input and output types, this time for
the reduce function.
• The input types of the reduce function must match the output types of the map function: Text
and IntWritable.
• In this case, the output types of the reduce function are Text and IntWritable, for a year and its
maximum temperature. We find the maximum by iterating through the temperatures and comparing
each with a record of the highest found so far.
Java MR - Example for Weather Data set
• Application to find the maximum temperature in the weather dataset
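• The slide's listing is an image; below is a minimal sketch of the driver, matching the steps described on the following slides. Class names are assumptions, and Job.getInstance() is the Hadoop 2+ way of constructing a Job.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperature.class);   // Hadoop locates the JAR containing this class
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input file, directory, or pattern
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);  // true = verbose progress output
  }
}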
Java MR - Example for Weather Data set
• A Job object forms the specification of the job and gives you control over how the job is run.
When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop
will distribute around the cluster).
• Rather than explicitly specifying the name of the JAR file, we can pass a class in the Job’s
setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the
JAR file containing this class.
• Having constructed a Job object, we specify the input and output paths.
• An input path is specified by calling the static addInputPath() method on FileInputFormat, and it
can be a single file, a directory (in which case, the input forms all the files in that directory), or a
file pattern. As the name suggests, addInputPath() can be called more than once to use input
from multiple paths.
Java MR - Example for Weather Data set
• The output path (of which there is only one) is specified by the static setOutputPath() method on
FileOutputFormat. It specifies a directory where the output files from the reduce function are
written.
• The directory shouldn’t exist before running the job because Hadoop will complain and not run the
job. This precaution is to prevent data loss (it can be very annoying to accidentally overwrite the
output of a long job with that of another).
• Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass()
methods.
• The setOutputKeyClass() and setOutputValueClass() methods control the output types for the reduce
function, and must match what the Reduce class produces.
• The map output types default to the same types, so they do not need to be set if the mapper
produces the same types as the reducer (as it does in our case). However, if they are different, the
map output types must be set using the setMapOutputKeyClass() and setMapOutputValueClass()
methods.
Java MR - Example for Weather Data set
• The input types are controlled via the input format, which we have not explicitly set because we
are using the default TextInputFormat.
• After setting the classes that define the map and reduce functions, we are ready to run the job.
The waitForCompletion() method on Job submits the job and waits for it to finish.
• The single argument to the method is a flag indicating whether verbose output is generated.
When true, the job writes information about its progress to the console.
• The return value of the waitForCompletion() method is a Boolean indicating success (true) or
failure (false), which we translate into the program’s exit code of 0 or 1.
Java MR - Example for Weather Data set
• Now, test it on a few sample records. The log will look like:
Java MR - Example for Weather Data set
• Log output (continued):
Java MR - Example for Weather Data set
• When the hadoop command is invoked with a classname as the first argument, it launches a Java
virtual machine (JVM) to run the class.
• The hadoop command adds the Hadoop libraries (and their dependencies) to the classpath and
picks up the Hadoop configuration, too. To add the application classes to the classpath, we’ve
defined an environment variable called HADOOP_CLASSPATH, which the hadoop script picks up.
• The output from running the job provides some useful information.
• For example, we can see that the job was given an ID of job_local26392882_0001, and it ran one map task and one
reduce task (with the following IDs: attempt_local26392882_0001_m_000000_0 and
attempt_local26392882_0001_r_000000_0). Knowing the job and task IDs can be very useful when debugging
MapReduce jobs.
• Counters from Log: we can follow the number of records that went through the system: five map
input records produced five map output records (since the mapper emitted one output record for
each valid input record), then five reduce input records in two groups (one for each unique key)
produced two reduce output records.
• The output was written to the output directory, which contains one output file per reducer. The
job had a single reducer, so we find a single file, named part-r-00000:
Map Reduce
• That’s all. Well Done.
Another Example For Map Reduce - Word Count
• Mapper
• Input: value: lines of text of input
• Output: key: word, value: 1
• Reducer
• Input: key: word, value: set of counts
• Output: key: word, value: sum
• Launching program
• Defines this job
• Submits job to cluster
Word Count Mapper and Launching Program
// assumes the usual old-API imports: org.apache.hadoop.mapred.*, org.apache.hadoop.io.*,
// java.util.StringTokenizer, java.io.IOException
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);        // emit (word, 1) for every token
    }
  }
}
// Launching program (old-API JobConf driver)
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
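• The launching program above references a Reduce class that is not shown on these slides; a minimal old-API sketch of it (names assumed) would be:
// assumes the same old-API imports plus java.util.Iterator
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();        // add up the 1s emitted for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}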
Input and Output Formats
• A Map/Reduce job may specify how its input is to be read by
specifying an InputFormat to be used
• A Map/Reduce job may specify how its output is to be written
by specifying an OutputFormat to be used
• These default to TextInputFormat and TextOutputFormat,
which process line-based text data
• Another common choice is SequenceFileInputFormat and
SequenceFileOutputFormat for binary data
• These are file-based, but they are not required to be
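• For example, in the old-API JobConf style used in the WordCount launching program above (conf is assumed to be a JobConf), switching to sequence files would be:
conf.setInputFormat(SequenceFileInputFormat.class);    // read binary key/value sequence files
conf.setOutputFormat(SequenceFileOutputFormat.class);  // write binary key/value sequence files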
How many Maps and Reduces
• Maps
• Usually as many as the number of HDFS blocks being processed; this is the default
• Otherwise, the number of maps can be specified as a hint
• The number of maps can also be controlled by specifying the minimum split size
• The actual sizes of the map inputs are computed by:
max(min(block_size, total_data_size / num_maps), min_split_size)
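• As a sketch in the same old-API style (the property name is the classic Hadoop 1.x one; newer releases use mapreduce.input.fileinputformat.split.minsize):
conf.setNumMapTasks(20);                                   // a hint only, not a guarantee
conf.setLong("mapred.min.split.size", 128 * 1024 * 1024);  // raise the minimum split size to 128 MB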
How many Maps and Reduces
• Reduces
• Unless the amount of data being processed is small, a common rule of thumb is:
• 0.95 * num_nodes * mapred.tasktracker.reduce.tasks.maximum
Some handy tools
• Partitioners
• Combiners
• Compression
• Counters
• Speculation
• Zero Reduces
• Distributed File Cache
• Tool
Partitioners
• Partitioners are application code that define how
keys are assigned to reduces
• Default partitioning spreads keys evenly, but
randomly
• Uses key.hashCode() % num_reduces
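• A sketch of a custom partitioner in the old API (the class name and the year-based keying are illustrative assumptions, not from these slides):
public static class YearPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) { }           // no configuration needed here

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    int year = Integer.parseInt(key.toString());   // key is assumed to be a year
    return (year / 10) % numPartitions;            // send each decade to the same reduce
  }
}
// in the driver: conf.setPartitionerClass(YearPartitioner.class);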
Combiners
• When maps produce many repeated keys
• It is often useful to do a local aggregation following the
map
• Done by specifying a Combiner
• Goal is to decrease size of the transient data
• Combiners have the same interface as Reduces, and
often are the same class
• Combiners must not have side effects, because they run an
indeterminate number of times
• In WordCount,
conf.setCombinerClass(Reduce.class);
Compression
Counters
• Often Map/Reduce applications have countable events
• For example, framework counts records in to and out of
Mapper and Reducer
• To define user counters:
static enum Counter {EVENT1, EVENT2};
reporter.incrCounter(Counter.EVENT1, 1);
• Define nice names in a MyClass_Counter.properties file
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2
Speculative execution
• The framework can run multiple instances of slow tasks
• Output from instance that finishes first is used
• Controlled by the configuration properties
mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution
• Can dramatically shorten the long tail of a job
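• For example, in an old-API driver (property names from classic Hadoop 1.x):
conf.setBoolean("mapred.map.tasks.speculative.execution", true);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);  // often disabled if the reduce has external side effects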
Zero Reduces
• Frequently, we only need to run a filter on the input
data
• No sorting or shuffling required by the job
• Set the number of reduces to 0
• Output from maps will go directly to OutputFormat and
disk
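• In an old-API driver this is a single call:
conf.setNumReduceTasks(0);   // map output goes straight to the OutputFormat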
Distributed File Cache
• Sometimes need read-only copies of data on the local
computer
• Downloading 1GB of data for each Mapper is expensive
• Define list of files you need to download in JobConf
• Files are downloaded once per computer
• Add to launching program:
DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);
• Add to task:
Path[] files = DistributedCache.getLocalCacheFiles(conf);
Tool
• Handle “standard” Hadoop command line options
• -conf file - load a configuration file named file
• -D prop=value - define a single configuration property prop
• Class looks like:
public class MyApp extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyApp(), args));
  }
  public int run(String[] args) throws Exception {
    // ... getConf() returns the Configuration with -conf/-D options already applied ...
    return 0;
  }
}
File Formats & Compression
Logical Table