
UNIT - III

Streaming in Spark, Streaming features, Streaming Fundamentals. Use case on streaming.

Machine Learning, Spark MLlib Overview, Tools, Algorithms - Classification, Regression,
Clustering, Dimensionality Reduction, Feature Extraction. MapReduce Advanced
Programming - Chaining MapReduce jobs, joining data from different sources. Use case.

Streaming in Spark:
Spark Streaming is an extension of the core Spark API that allows data engineers and
data scientists to process real-time data from various sources including (but not limited to)
Kafka, Flume, and Amazon Kinesis. This processed data can be pushed out to file systems,
databases, and live dashboards.
Just as Spark is built on the concept of RDDs, Spark Streaming provides an abstraction called
DStreams, or discretized streams. A DStream is a sequence of data arriving over time.
Internally, each DStream is represented as a sequence of RDDs arriving at each time step
(hence the name "discretized"). DStreams can be created from various input sources, such as
Flume, Kafka, or HDFS. Once built, they offer two types of operations: transformations,
which yield a new DStream, and output operations, which write data to an external system.
DStreams provide many of the same operations available on RDDs, plus new operations
related to time, such as sliding windows.
Spark Streaming is available only in Java and Scala. Experimental Python support
was added in Spark 1.2, though it supports only text data.
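For illustration, a minimal Java sketch of a streaming application might look like the following; the local master, the TCP source on localhost:9999, and the filter on the word "error" are all assumptions of this example, not part of any particular application.
// A minimal Spark Streaming sketch in Java (assumptions: local master, TCP source on port 9999)
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSketch {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch");
    // Batch interval of 1 second: each interval of input becomes one RDD in the DStream
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    // Create a DStream from text data received over a TCP socket
    JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
    // Transformation: keep only the lines that contain "error"
    JavaDStream<String> errors = lines.filter(line -> line.contains("error"));
    // Output operation: print the first 10 elements of each batch
    errors.print();
    jssc.start();             // start receiving and processing data
    jssc.awaitTermination();  // block until the streaming computation is stopped
  }
}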
Streaming features:
1. Fast and general-purpose engine for large-scale data processing
a. Not a modified version of Hadoop
b. The leading candidate for "successor to MapReduce"
2. Spark can efficiently support more types of computations
a. For example, interactive queries, stream processing
3. Can read/write to any Hadoop-supported system (e.g., HDFS)
4. Speed: in-memory data storage for very fast iterative queries
a. The system is also more efficient than MapReduce for complex applications
running on disk
b. Reported to be up to 40x faster than Hadoop for some workloads
c. Ingests data from many sources: Kafka, Twitter, HDFS, TCP sockets
d. Results can be pushed out to file systems, databases, live dashboards, and more
Architecture and Abstraction:
1. Spark Streaming uses a "micro-batch" architecture, where the streaming computation
is treated as a continuous series of batch computations on small batches of data.
2. Spark Streaming receives data from various input sources and groups it into small
batches. New batches are created at regular time intervals.
3. At the beginning of each time interval a new batch is created, and any data that arrives
during that interval gets added to that batch. At the end of the time interval the batch
is done growing.
4. The size of the time intervals is determined by a parameter called the batch interval.
The batch interval is typically between 500 milliseconds and several seconds, as
configured by the application developer.
5. Each input batch forms an RDD, and is processed using Spark jobs to create other
RDDs. The processed results can then be pushed out to external systems in batches.
This high-level architecture is shown in the following Figure.

High-level architecture of Spark Streaming


The programming abstraction in Spark Streaming is a discretized stream or a
DStream, which is a sequence of RDDs, where each RDD has one time slice of the data in the
stream and is shown in the following Figure.

DStream as a continuous series of RDDs

The execution of Spark Streaming within Spark's driver-worker components is shown
in the following Figure. For each input source, Spark Streaming launches receivers, which are
tasks running within the application's executors that collect data from the input source and
save it as RDDs. These receive the input data and replicate it (by default) to another executor
for fault tolerance. This data is stored in the memory of the executors in the same way as
cached RDDs. The StreamingContext in the driver program then periodically runs Spark
jobs to process this data and combine it with RDDs from previous time steps.

Execution of Spark Streaming within Spark’s components


Transformations: Transformations on DStreams can be grouped into either stateless or
stateful.
1. In stateless transformations the processing of each batch does not depend on the data
of its previous batches. They include the common RDD transformations like map(),
filter(), and reduceByKey().
2. Stateful transformations, in contrast, use data or intermediate results from previous
batches to compute the results of the current batch. They include transformations
based on sliding windows and on tracking state across time.

Examples of stateless DStream transformations

Stateful Transformations: These are operations on DStreams that track data across time;
that is, some data from previous batches is used to generate the results for a new batch. The
two main types are windowed operations and the updateStateByKey() transformation.
Windowed operations compute results across a longer time period than the
StreamingContext's batch interval, by combining results from multiple batches. The
updateStateByKey() transformation is used to track state across events for each key (e.g., to
build up an object representing each user session).
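As an illustration, a small Java sketch of a windowed count might look like this; the 30-second window and 10-second slide are arbitrary choices, and an existing JavaDStream<String> named lines (for example from socketTextStream) is assumed.
// Windowed count in Java (assumes an existing JavaDStream<String> named lines)
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Pair each line with the count 1
JavaPairDStream<String, Integer> ones = lines.mapToPair(line -> new Tuple2<>(line, 1));
// Count identical lines over a 30-second window, recomputed every 10 seconds
JavaPairDStream<String, Integer> windowedCounts =
    ones.reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(30), Durations.seconds(10));
windowedCounts.print();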
Output Operations:
1. Output operations specify what needs to be done with the final transformed data in a
stream (e.g., pushing it to an external database or printing it to the screen).
2. A common debugging output operation is print(). This grabs the first 10 elements
from each batch of the DStream and prints the results.
3. Once we've debugged our program, we can also use output operations to save results.
4. Spark Streaming has similar save() operations for DStreams, each of which takes a
directory to save files into and an optional suffix.
5. The results of each batch are saved as subdirectories in the given directory, with the
time and the suffix in the filename.
Machine Learning with Spark MLlib:
Machine learning is a method of data analysis that automates analytical model
building. It is a branch of artificial intelligence based on the idea that systems can learn from
data, identify patterns and make decisions with minimal human intervention. The tools of
Machine Learning are shown in the following Figure.

Machine Learning tools


Machine learning is closely related to computational statistics, which also focuses on
prediction-making through the use of computers. It has strong ties to mathematical
optimization, which delivers methods, theory and application domains to the field. Within the
field of data analytics, machine learning is a method used to devise complex models and
algorithms that lend themselves to a prediction which in commercial use is known as
predictive analytics. There are three categories of Machine learning tasks:
1. Supervised Learning: Supervised learning is where you have input variables (x) and an
output variable (Y) and you use an algorithm to learn the mapping function from the input to
the output.
2. Unsupervised Learning: Unsupervised learning is a type of machine learning algorithm
used to draw inferences from datasets consisting of input data without labeled responses.
3. Reinforcement Learning: A computer program interacts with a dynamic environment in
which it must perform a certain goal (such as driving a vehicle or playing a game against an
opponent). The program is provided feedback in terms of rewards and punishments as it
navigates its problem space. This concept is called reinforcement learning.
Spark MLlib Overview:
Spark MLlib is used to perform machine learning in Apache Spark. MLlib consists of
popular algorithms and utilities.
1. spark.mllib contains the original API built on top of RDDs. It is currently in
maintenance mode.
2. spark.ml provides higher level API built on top of DataFrames for constructing ML
pipelines. spark.ml is the primary Machine Learning API for Spark at the moment.
Spark MLlib Tools: This provides the following tools:
1. ML Algorithms: ML Algorithms form the core of MLlib. These include common
learning algorithms such as classification, regression, clustering and collaborative
filtering.
2. Featurization: Featurization includes feature extraction, transformation,
dimensionality reduction and selection.
3. Pipelines: Pipelines provide tools for constructing, evaluating and tuning ML
Pipelines.
4. Persistence: Persistence helps in saving and loading algorithms, models and Pipelines.
5. Utilities: Utilities for linear algebra, statistics and data handling.
Data Types:
MLlib contains a few specific data types, located in the org.apache.spark.mllib
package (Java/Scala) or pyspark.mllib (Python). The main ones are:
1. Vector: A mathematical vector. MLlib supports both dense vectors, where every entry is
stored, and sparse vectors, where only the nonzero entries are stored to save space. We will
discuss the different types of vectors shortly. Vectors can be constructed with the
mllib.linalg.Vectors class.
2. LabeledPoint: A labeled data point for supervised learning algorithms such as classification
and regression. It includes a feature vector and a label (which is a floating-point value).
Located in the mllib.regression package.
3. Rating: A rating of a product by a user, used in the mllib.recommendation package for
product recommendation.
4. Various Model classes: Each Model is the result of a training algorithm, and typically has a
predict() method for applying the model to a new data point or to an RDD of new data points.
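For example, a dense vector, a sparse vector, and a labeled point can be constructed as follows; this is a small Java sketch and the numeric values are arbitrary.
// Constructing MLlib data types in Java
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// Dense vector: every entry is stored explicitly
Vector dense = Vectors.dense(1.0, 0.0, 3.0);
// Sparse vector of size 3 with nonzero entries at indices 0 and 2
Vector sparse = Vectors.sparse(3, new int[]{0, 2}, new double[]{1.0, 3.0});
// Labeled point: a label (here 1.0, e.g., the positive class) plus a feature vector
LabeledPoint point = new LabeledPoint(1.0, sparse);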
MLlib Algorithms: The popular algorithms and utilities in Spark MLlib are:
1. Basic Statistics
2. Classification and Regression
3. Clustering
4. Collaborative Filtering and Recommendation
5. Dimensionality Reduction
6. Feature Extraction
7. Optimization
1. Basic Statistics: This includes the most basic of machine learning techniques. These
include:
a. Summary Statistics: Examples include mean, variance, count, max, min and
numNonZeros.
b. Correlations: Spearman and Pearson are some ways to find correlation.
c. Stratified Sampling: These include sampleByKey and sampleByKeyExact.
d. Hypothesis Testing: Pearson's chi-squared test is an example of hypothesis testing.
e. Random Data Generation: The RandomRDDs utility generates random data, for example
from normal and Poisson distributions.
2. Classification and Regression: Classification is the problem of identifying to which of a
set of categories (sub-populations) a new observation belongs, on the basis of a training set of
data containing observations (or instances) whose category membership is known. It is an
example of pattern recognition.
Regression analysis is a statistical process for estimating the relationships among
variables. It includes many techniques for modeling and analyzing several variables when the
focus is on the relationship between a dependent variable and one or more independent
variables.
Classification and regression are two common forms of supervised learning, where
algorithms attempt to predict a variable from features of objects using labeled training data
(i.e., examples where we know the answer). The difference between them is the type of
variable predicted: in classification, the variable is discrete (i.e., it takes on a finite set of
values called classes); for example, classes might be spam or non-spam for emails, or the
language in which the text is written. In regression, the variable predicted is continuous (e.g.,
the height of a person given her age and weight).
Both classification and regression use the LabeledPoint class in MLlib which resides
in the mllib.regression package. A LabeledPoint consists simply of a label (which is always a
Double value, but can be set to discrete integers for classification) and a features vector.
MLlib includes a variety of methods for classification and regression, including simple linear
methods and decision trees and forests.
a. Linear regression: Linear regression is one of the most common methods for regression,
predicting the output variable as a linear combination of the features. MLlib also supports L1
and L2 regularized regression, commonly known as Lasso and ridge regression.
The linear regression algorithms are available through the
mllib.regression.LinearRegressionWithSGD, LassoWithSGD, and
RidgeRegressionWithSGD classes. These follow a common naming pattern throughout
MLlib, where problems involving multiple algorithms have a "With" part in the class name to
specify the algorithm used. Here, SGD is Stochastic Gradient Descent. These classes all have
several parameters to tune the algorithm:
i) numIterations - Number of iterations to run (default: 100).
ii) stepSize - Step size for gradient descent (default: 1.0).
iii) intercept - Whether to add an intercept or bias feature to the data—that is, another feature
whose value is always 1 (default: false).
iv) regParam - Regularization parameter for Lasso and ridge (default: 1.0).
# Linear regression in Python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
points = ...  # (create RDD of LabeledPoint)
model = LinearRegressionWithSGD.train(points, iterations=200, intercept=True)
print("weights: %s, intercept: %s" % (model.weights, model.intercept))
# Linear regression in Scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
val points: RDD[LabeledPoint] = // ...
val lr = new LinearRegressionWithSGD().setNumIterations(200).setIntercept(true)
val model = lr.run(points)
println("weights: %s, intercept: %s".format(model.weights, model.intercept))

// Linear regression in Java
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
import org.apache.spark.mllib.regression.LinearRegressionModel;
JavaRDD<LabeledPoint> points = // ...
LinearRegressionWithSGD lr =
new LinearRegressionWithSGD().setNumIterations(200).setIntercept(true);
LinearRegressionModel model = lr.run(points.rdd());
System.out.printf("weights: %s, intercept: %s\n",
model.weights(), model.intercept());

b. Logistic regression: Logistic regression is a binary classification method that identifies a
linear separating plane between positive and negative examples. In MLlib, it takes
LabeledPoints with label 0 or 1 and returns a LogisticRegressionModel that can predict new
points.
The logistic regression algorithm has a very similar API to linear regression, covered
in the previous section. One difference is that there are two algorithms available for solving
it: SGD and LBFGS. LBFGS is generally the best choice, but is not available in some earlier
versions of MLlib (before Spark 1.2). These algorithms are available in the
mllib.classification.LogisticRegressionWithLBFGS and LogisticRegressionWithSGD classes, which have
interfaces similar to LinearRegressionWithSGD. They take all the same parameters as linear
regression.
The LogisticRegressionModel from these algorithms computes a score between 0 and
1 for each point, as returned by the logistic function. It then returns either 0 or 1 based on a
threshold that can be set by the user: by default, if the score is at least 0.5, it will return 1. We
can change this threshold via setThreshold(). You can also disable it altogether via
clearThreshold(), in which case predict() will return the raw scores.
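As an illustration, a small Java sketch might look like the following; it assumes an existing JavaRDD<LabeledPoint> named points, as in the linear regression example above, and the test feature values are arbitrary.
// Logistic regression in Java (assumes an existing JavaRDD<LabeledPoint> named points)
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.linalg.Vectors;

LogisticRegressionWithLBFGS lr = new LogisticRegressionWithLBFGS();
lr.setNumClasses(2);                         // binary classification: labels 0 or 1
LogisticRegressionModel model = lr.run(points.rdd());
// Predict the class of a new point (arbitrary values; must match the training feature dimension)
double prediction = model.predict(Vectors.dense(1.0, 0.5, -0.2));
// Return raw scores between 0 and 1 instead of 0/1 labels
model.clearThreshold();
double score = model.predict(Vectors.dense(1.0, 0.5, -0.2));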
c. Support Vector Machines: Support Vector Machines, or SVMs, are another binary
classification method with linear separating planes, again expecting labels of 0 or 1. They are
available through the SVMWithSGD class, with similar parameters to linear and logistic
regression. The returned SVMModel uses a threshold for prediction like
LogisticRegressionModel.
d. Naive Bayes: Naive Bayes is a multiclass classification algorithm that scores how well each
point belongs in each class based on a linear function of the features. It is commonly used in
text classification with TF-IDF features, among other applications. MLlib implements
Multinomial Naive Bayes, which expects nonnegative frequencies (e.g., word frequencies) as
input features.
In MLlib, you can use Naive Bayes through the mllib.classification.NaiveBayes class.
It supports one parameter, lambda (or lambda_ in Python), used for smoothing. We can call it
on an RDD of LabeledPoints, where the labels are between 0 and C–1 for C classes. The
returned NaiveBayesModel can predict() the class in which a point best belongs.
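As an illustration, a small Java sketch might look like this; it assumes an existing JavaRDD<LabeledPoint> named points with nonnegative feature values, and the test vector is arbitrary.
// Naive Bayes in Java (assumes an existing JavaRDD<LabeledPoint> named points)
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.Vectors;

// Train with the smoothing parameter lambda = 1.0
NaiveBayesModel model = NaiveBayes.train(points.rdd(), 1.0);
// Predict the best class for a new feature vector (arbitrary nonnegative values)
double predictedClass = model.predict(Vectors.dense(0.0, 2.0, 1.0));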
e. Decision trees and random forests: Decision trees are a flexible model that can be used for
both classification and regression. They represent a tree of nodes, each of which makes a
binary decision based on a feature of the data (e.g., is a person's age greater than 20?), and
where the leaf nodes in the tree contain a prediction (e.g., is the person likely to buy a
product?). Decision trees are attractive because the models are easy to inspect and because
they support both categorical and continuous features. The following Figure shows an
example tree.

An example decision tree predicting whether a user might buy a product


In MLlib, you can train trees using the mllib.tree.DecisionTree class, through the
static methods trainClassifier() and trainRegressor(). Unlike in some of the other algorithms,
the Java and Scala APIs also use static methods instead of a DecisionTree object with setters.
The training methods take the following parameters:
i) data - RDD of LabeledPoint.
ii) numClasses (classification only) - Number of classes to use.
iii) impurity - Node impurity measure; can be gini or entropy for classification, and must be
variance for regression.
iv) maxDepth - Maximum depth of tree (default: 5).
v) maxBins- Number of bins to split data into when building each node (suggested value: 32).
vi) categoricalFeaturesInfo - A map specifying which features are categorical, and how many
categories they each have. For example, if feature 1 is a binary feature with labels 0 and 1,
and feature 2 is a three-valued feature with values 0, 1, and 2, you would pass {1: 2, 2: 3}.
Use an empty map if no features are categorical.
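As an illustration, a small Java sketch of training a decision tree classifier might look like this; it assumes an existing JavaRDD<LabeledPoint> named points, and the parameter values simply follow the defaults and suggestions listed above.
// Decision tree classifier in Java (assumes an existing JavaRDD<LabeledPoint> named points)
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;

Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();  // empty: all features continuous
int numClasses = 2;
String impurity = "gini";
int maxDepth = 5;
int maxBins = 32;
DecisionTreeModel model = DecisionTree.trainClassifier(
    points, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins);
// The trained tree can be inspected and used for prediction
System.out.println(model.toDebugString());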
In Spark 1.2, MLlib adds an experimental RandomForest class in Java and Scala to
build ensembles of trees, also known as random forests. It is available through
RandomForest.trainClassifier and trainRegressor. Apart from the per-tree parameters just
listed, RandomForest takes the following parameters:
i) numTrees - How many trees to build. Increasing numTrees decreases the likelihood of
overfitting on training data.
ii) featureSubsetStrategy - Number of features to consider for splits at each node; can be auto
(let the library select it), all, sqrt, log2, or onethird; larger values are more expensive.
iii) seed - Random-number seed to use.
Random forests return a WeightedEnsembleModel that contains several trees (in the
weakHypotheses field, weighted by weakHypothesisWeights) and can predict() an RDD or
Vector. It also includes a toDebugString to print all the trees.
3. Clustering: Clustering is the unsupervised learning task of grouping objects into clusters
of high similarity. Unlike classification, it does not rely on labeled training data; instead, it
tries to discover natural groupings within the observations themselves.
K-means: MLlib includes the popular K-means algorithm for clustering, as well as a variant
called K-means|| that provides better initialization in parallel environments. K-means|| is
similar to the K-means++ initialization procedure often used in single-node settings. The
most important parameter in K-means is a target number of clusters to generate, K. Apart
from K, K-means in MLlib takes the following parameters:
i) initializationMode - The method to initialize cluster centers, which can be either
"k-means||" or "random"; k-means|| (the default) generally leads to better results but is
slightly more expensive.
ii) maxIterations - Maximum number of iterations to run (default: 100).
iii) runs - Number of concurrent runs of the algorithm to execute. MLlib's K-means supports
running from multiple starting positions concurrently and picking the best result, which is a
good way to get a better overall model (as K-means runs can stop in local minima).
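As an illustration, a small Java sketch might look like this; it assumes an existing JavaRDD<Vector> of feature vectors named vectors, and K = 2 with 20 iterations are arbitrary choices for the example.
// K-means in Java (assumes an existing JavaRDD<Vector> named vectors)
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

int k = 2;               // target number of clusters
int maxIterations = 20;
KMeansModel clusters = KMeans.train(vectors.rdd(), k, maxIterations);
// Assign a new point (arbitrary values) to the nearest cluster center
int clusterIndex = clusters.predict(Vectors.dense(0.5, 1.5));
for (Vector center : clusters.clusterCenters()) {
    System.out.println("cluster center: " + center);
}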
4. Collaborative Filtering and Recommendation: Collaborative filtering is a technique for
recommender systems wherein users' ratings and interactions with various products are used
to recommend new ones. Collaborative filtering is attractive because it only needs to take in a
list of user/product interactions: either "explicit" interactions (i.e., ratings on a shopping site)
or "implicit" ones (e.g., a user browsed a product page but did not rate the product). Based
solely on these interactions, collaborative filtering algorithms learn which products are
similar to each other (because the same users interact with them) and which users are similar
to each other, and can make new recommendations.
While the MLlib API talks about "users" and "products," you can also use
collaborative filtering for other applications, such as recommending users to follow on a
social network, tags to add to an article, or songs to add to a radio station.
Alternating Least Squares: MLlib includes an implementation of Alternating Least Squares
(ALS), a popular algorithm for collaborative filtering that scales well on clusters. It is
located in the mllib.recommendation.ALS class.
ALS works by determining a feature vector for each user and product, such that the
dot product of a user's vector and a product's is close to their score. It takes the following
parameters:
i) rank - Size of feature vectors to use; larger ranks can lead to better models but are more
expensive to compute (default: 10).
ii) iterations - Number of iterations to run (default: 10).
iii) lambda - Regularization parameter (default: 0.01).
iv) alpha - A constant used for computing confidence in implicit ALS (default: 1.0).
v) numUserBlocks, numProductBlocks - Number of blocks to divide user and product data
into, to control parallelism; we can pass -1 to let MLlib automatically determine this (the
default behavior).
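As an illustration, a small Java sketch might look like this; it assumes an existing JavaRDD<Rating> named ratings, and the parameter values are simply the defaults listed above. The user and product IDs in the prediction calls are arbitrary.
// Alternating Least Squares in Java (assumes an existing JavaRDD<Rating> named ratings)
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

int rank = 10;        // size of the feature vectors
int iterations = 10;
double lambda = 0.01; // regularization parameter
MatrixFactorizationModel model = ALS.train(ratings.rdd(), rank, iterations, lambda);
// Predict the rating user 1 would give product 2, and recommend 5 products for user 1
double predictedRating = model.predict(1, 2);
Rating[] recommendations = model.recommendProducts(1, 5);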
5. Dimensionality Reduction: Dimensionality Reduction is the process of reducing the
number of random variables under consideration, via obtaining a set of principal variables. It
can be divided into feature selection and feature extraction.
a. Feature Selection: Feature selection finds a subset of the original variables (also called
features or attributes).
b. Feature Extraction: This transforms the data in the high-dimensional space to a space
of fewer dimensions. The data transformation may be linear, as in Principal
Component Analysis (PCA), but many nonlinear dimensionality reduction techniques
also exist.
PCA is currently available only in Java and Scala (as of MLlib 1.2). To invoke it, we
must first represent our matrix using the mllib.linalg.distributed.RowMatrix class, which
stores an RDD of Vectors, one per row. We can then call PCA as shown in the following
Example.
// PCA in Scala
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val points: RDD[Vector] = // ...
val mat: RowMatrix = new RowMatrix(points)
val pc: Matrix = mat.computePrincipalComponents(2)
// Project points to low-dimensional space
val projected = mat.multiply(pc).rows
// Train a k-means model on the projected 2-dimensional data
val model = KMeans.train(projected, 10)
In this example, the projected RDD contains a two-dimensional version of the original
points RDD, and can be used for plotting or performing other MLlib algorithms, such as
clustering via K-means. computePrincipalComponents() returns an mllib.linalg.Matrix
object, which is a utility class representing dense matrices, similar to Vector. You can get at
the underlying data with toArray().
Singular value decomposition: MLlib also provides the lower-level singular value
decomposition (SVD) primitive. The SVD factorizes an m × n matrix A into three matrices,
A ≈ UΣVᵀ, where:
a. U is an orthonormal matrix, whose columns are called left singular vectors.
b. Σ is a diagonal matrix with nonnegative diagonals in descending order, whose
diagonals are called singular values.
c. V is an orthonormal matrix, whose columns are called right singular vectors.
For large matrices, usually we don't need the complete factorization but only the top
singular values and its associated singular vectors. This can save storage, denoise, and
recover the low-rank structure of the matrix. To achieve the decomposition, we call
computeSVD on the RowMatrix class, as shown in following Example.
// SVD in Scala
// Compute the top 20 singular values of a RowMatrix mat and their singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] =
mat.computeSVD(20, computeU=true)
val U: RowMatrix = svd.U // U is a distributed RowMatrix.
val s: Vector = svd.s // Singular values are a local dense vector.
val V: Matrix = svd.V // V is a local dense matrix.
6. Feature Extraction: This starts from an initial set of measured data and builds derived
values (features) intended to be informative and non-redundant, facilitating the subsequent
learning and generalization steps, and in some cases leading to better human interpretations.
This is related to dimensionality reduction.
7. Optimization: Optimization is the selection of the best element (with regard to some
criterion) from some set of available alternatives.
In the simplest case, an optimization problem consists of maximizing or minimizing a
real function by systematically choosing input values from within an allowed set and
computing the value of the function. The generalization of optimization theory and
techniques to other formulations comprises a large area of applied mathematics. More
generally, optimization includes finding "best available" values of some objective function
given a defined domain (or input), including a variety of different types of objective functions
and different types of domains.
Chaining MapReduce jobs:
Many complex tasks need to be broken down into simpler subtasks, each
accomplished by an individual MapReduce job. For example, finding the ten most cited
patents from the citation data set requires a sequence of two MapReduce jobs. The first one
creates the "inverted" citation data set and counts the number of citations for each patent, and
the second job finds the top ten in that "inverted" data.
Chaining MapReduce jobs in a sequence: Though we can execute the two jobs manually one
after the other, it's more convenient to automate the execution sequence. We can chain
MapReduce jobs to run sequentially, with the output of one MapReduce job being the input
to the next. Conceptually this is analogous to Unix pipes:
mapreduce-1 | mapreduce-2 | mapreduce-3 | ...
In practice, because a call such as JobClient.runJob() blocks until its job completes, the driver
can simply submit each job in turn, pointing the input path of one job at the output path of the
previous one.
Chaining MapReduce jobs with complex dependency: Hadoop has a mechanism to simplify
the management of nonlinear job dependencies via the Job and JobControl classes. A Job
object is a representation of a MapReduce job. We instantiate a Job object by passing a
JobConf object to its constructor. In addition to holding job configuration information, Job
also holds dependency information, specified through the addDependingJob() method.
For Job objects x and y,
x.addDependingJob(y)
means x will not start until y has finished.
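A minimal Java sketch of how these classes fit together might look like the following; it assumes two JobConf objects, conf1 and conf2, that have been configured elsewhere, and uses the classic org.apache.hadoop.mapred API.
// Managing job dependencies with Job and JobControl (classic mapred API)
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public static void runDependentJobs(JobConf conf1, JobConf conf2) throws Exception {
    Job job1 = new Job(conf1);
    Job job2 = new Job(conf2);
    job2.addDependingJob(job1);       // job2 will not start until job1 has finished

    JobControl jc = new JobControl("chained jobs");
    jc.addJob(job1);
    jc.addJob(job2);

    Thread runner = new Thread(jc);   // JobControl is a Runnable that monitors its jobs
    runner.start();
    while (!jc.allFinished()) {
        Thread.sleep(500);            // poll until every job in the group has finished
    }
    jc.stop();
}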
Chaining preprocessing and postprocessing steps: A lot of data processing tasks involve
record-oriented preprocessing and postprocessing. For example, in processing documents for
information retrieval, we may have one step to remove stop words (words like a, the, and is
that occur frequently but aren't too meaningful), and another step for stemming (converting
different forms of a word into the same form, such as finishing and finished into finish). We
can write a separate MapReduce job for each of these pre- and postprocessing steps and chain
them together, using IdentityReducer (or no reducer at all) for these steps. This approach
is inefficient as each step in the chain takes up I/O and storage to process the intermediate
results.
Another approach is to have the mapper call all the preprocessing steps beforehand
and the reducer call all the postprocessing steps afterward. Hadoop introduced
the ChainMapper and the ChainReducer classes in version 0.19.0 to simplify the
composition of pre- and postprocessing.
For example, suppose we have four mappers (Map1, Map2, Map3, and Map4) and one
reducer (Reduce), and they're chained into a single MapReduce job in this sequence:
Map1 | Map2 | Reduce | Map3 | Map4
We need to make sure the key and value outputs of one task have matching types
(classes) with the inputs of the next task, as the driver sketch below illustrates.
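A sketch of such a driver might look like the following; it assumes the classic org.apache.hadoop.mapred API, with the mapper classes Map1 through Map4 and the reducer class Reduce defined elsewhere, and ChainDriver is a hypothetical driver class used only for this example.
// Chaining mappers and a reducer into one job (classic mapred API)
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

JobConf job = new JobConf(ChainDriver.class);   // the "global" job configuration
job.setJobName("ChainJob");
// input/output paths and formats are set on this global JobConf (omitted here)

// Each stage gets its own local JobConf; the input key/value classes of each stage
// must match the output key/value classes of the previous stage.
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class,
    LongWritable.class, Text.class, Text.class, Text.class, true, map1Conf);

JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job, Map2.class,
    Text.class, Text.class, LongWritable.class, Text.class, true, map2Conf);

JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job, Reduce.class,
    LongWritable.class, Text.class, Text.class, Text.class, true, reduceConf);

JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper(job, Map3.class,
    Text.class, Text.class, LongWritable.class, Text.class, true, map3Conf);

JobConf map4Conf = new JobConf(false);
ChainReducer.addMapper(job, Map4.class,
    LongWritable.class, Text.class, LongWritable.class, Text.class, true, map4Conf);

JobClient.runJob(job);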

The driver first sets up the "global" JobConf object with the job's name, input path,
output path, and so forth.

Joining Data from Different Sources:
Unfortunately, joining data in Hadoop is more involved than it is in a relational database,
and there are several possible approaches with different trade-offs. We use a couple of toy
datasets to better illustrate
joining in Hadoop. Let's take a comma-separated Customers file where each record has three
fields: Customer ID, Name, and Phone Number; for illustration, we put four records in the
file. We store customer orders in a separate file, called Orders, which is also in CSV format,
with four fields: Customer ID, Order ID, Price, and Purchase Date. An inner join of the two
data sets produces one combined record for every pair of Customers and Orders records that
share a Customer ID, as illustrated below.
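For instance, with hypothetical sample records like the following, the inner join on Customer ID produces the combined records shown last; customer 4 has no orders, so it is absent from the inner join output.
Customers:
1,Stephanie Leung,555-555-5555
2,Edward Kim,123-456-7890
3,Jose Madriz,281-330-8004
4,David Stork,408-555-0000

Orders:
3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,D,25.02,22-Jan-2009

Inner join output:
1,Stephanie Leung,555-555-5555,B,88.25,20-May-2008
2,Edward Kim,123-456-7890,C,32.00,30-Nov-2007
3,Jose Madriz,281-330-8004,A,12.95,02-Jun-2008
3,Jose Madriz,281-330-8004,D,25.02,22-Jan-2009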
Hadoop can also perform outer joins. But we focus on inner joins.
Reduce-side joining: Hadoop has a contrib package called datajoin that works as a
generic framework for data joining in Hadoop. Its jar file is at
contrib/datajoin/hadoop-*-datajoin.jar. To distinguish it from other joining
techniques, it's called the reduce-side join, as we do most of the processing on the
reduce side. It's also known as the repartitioned join (or the repartitioned sort-merge join),
as it's the same as the database technique of the same name. Although it's not
the most efficient joining technique, it's the most general and forms the basis of some more
advanced techniques (such as the semijoin).
Reduce-side join introduces some new terminologies and concepts, namely, data
source, tag, and group key. A data source is similar to a table in relational databases. We
have two data sources in our toy example: Customers and Orders. A data source can be a
single file or multiple files. The important point is that all the records in a data source have
the same structure, equivalent to a schema.

The MapReduce paradigm calls for processing each record one at a time in a stateless
manner. If we want some state information to persist, we have to tag the record with such
state. For example, given our two files, a record may look to a mapper like a bare line such as
3,Jose Madriz,281-330-8004, where the record type (Customers or Orders) is dissociated
from the record itself. Tagging the record ensures that this metadata always travels with the
record. For the purpose of data joining, we want to tag each record with its data source.
The group key functions like a join key in a relational database. For our example,
the group key is the Customer ID.
DATA FLOW OF A REDUCE-SIDE JOIN: The following Figure illustrates the data flow of a
repartitioned join on the toy data sets Customers and Orders, up to the reduce stage.

The reduce() function takes its input and does a full cross-product on the values.
reduce() creates all combinations of the values, with the constraint that a combination does
not contain more than one value with the same tag. In cases where reduce() sees at most one
value per tag, the cross-product is trivially the original set of values. In our example, this is
the case for group keys 1, 2, and 4. The following Figure illustrates the cross-product for
group key 3. We have three values, one tagged with Customers and two tagged with Orders.
The cross-product creates two combinations. Each combination consists of the Customers
value and one of the Orders values.

IMPLEMENTING JOIN WITH THE DATAJOIN PACKAGE: Hadoop's datajoin package has
three abstract classes that we inherit and make concrete: DataJoinMapperBase,
DataJoinReducerBase, and TaggedMapOutput. As the names suggest, our MapClass
will extend DataJoinMapperBase, and our Reduce class will extend
DataJoinReducerBase. The datajoin package has already implemented the map() and
reduce() methods in these respective base classes to perform the join dataflow.
Replicated joins using DistributedCache: Hadoop has a mechanism called distributed
cache that's designed to distribute files to all nodes in a cluster. Distributed cache is handled
by the appropriately named class DistributedCache. In a replicated join (a map-side join), the
smaller data source (e.g., Customers) is distributed to every node via the distributed cache;
each mapper loads it into memory and joins the records of the larger source (Orders) against
that in-memory table, so no reduce phase is needed.
Semijoin: reduce-side join with map-side filtering: Often only a small subset of one data
source actually participates in the join. We can first extract the relevant join keys, for
example the IDs of customers in the 415 area code, into a small file called CustomerID415
and distribute it to the mappers. When processing records from Customers and Orders, the
mapper then drops any record whose key is not in the set CustomerID415, so that only the
much smaller filtered data goes through the reduce-side join. This is sometimes called a
semijoin, taking the terminology from the database world.
Last but not least, what if the file CustomerID415 is still too big to fit in memory?
Or maybe CustomerID415 does fit in memory but its size makes replicating it across all the
mappers inefficient. This situation calls for a data structure called a Bloom filter. A
Bloom filter is a compact representation of a set that supports only membership ("contains")
queries, and it may return false positives but never false negatives.
