Analysis of MapReduce Algorithms
Harini Padmanaban
Computer Science Department
San Jose State University
San Jose, CA 95192
408-924-1000
[email protected]
3. MAPREDUCE ALGORITHMS
Having understood the Hadoop architecture and basic MapReduce concepts, let us now look at some MapReduce algorithms that involve huge volumes of data and understand how the parallelism achieved through MapReduce helps improve efficiency.
3.1 Information Retrieval
Information retrieval mainly deals with processing and storing huge amounts of data in a specific format and then querying the stored data to get some information. The two important tasks in information retrieval are indexing the data (processing and storing it) and searching (querying the stored data). Let us consider how MapReduce algorithms can be used for each of these tasks.
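To make the indexing task concrete, here is a minimal sketch of inverted-index construction phrased as map and reduce functions. It is plain Python that simulates the Hadoop runtime in memory; the function names and driver are illustrative, not taken from [5].

from collections import defaultdict

def index_map(filename, text):
    # Map: emit one (term, filename) pair per distinct term in the file.
    for term in set(text.lower().split()):
        yield (term, filename)

def index_reduce(term, filenames):
    # Reduce: the posting list for a term is the sorted set of files containing it.
    return (term, sorted(set(filenames)))

def run_index(docs):
    # In-memory driver simulating the shuffle phase (grouping by term).
    groups = defaultdict(list)
    for filename, text in docs.items():
        for term, fname in index_map(filename, text):
            groups[term].append(fname)
    return dict(index_reduce(t, fs) for t, fs in groups.items())

# run_index({"a.txt": "map reduce", "b.txt": "reduce"})
# -> {"map": ["a.txt"], "reduce": ["a.txt", "b.txt"]}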
Searching:
Given a set of files where each file contains some text, the goal is to find the files that contain a given pattern. An implementation of this is as follows [5]: [(filename, text), (pattern)] is mapped to (filename, 1/0), where 1 is emitted when the pattern exists in the file and 0 otherwise. The reducer simply collects the filenames corresponding to the emission of 1. Figure 3, which is modified and reproduced from [5], gives a scenario for this implementation.
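This search job can be sketched in plain Python as follows; the in-memory driver stands in for the Hadoop runtime, and the names are illustrative.

def search_map(filename, text, pattern):
    # Map: emit 1 when the pattern occurs in the file's text, 0 otherwise.
    yield (filename, 1 if pattern in text else 0)

def search_reduce(filename, flags):
    # Reduce: keep the filename if any mapper emitted a 1 for it.
    return [filename] if any(flags) else []

def run_search(docs, pattern):
    # In-memory driver; on a real cluster each map call runs in parallel.
    matches = []
    for filename, text in docs.items():
        for fname, flag in search_map(filename, text, pattern):
            matches += search_reduce(fname, [flag])
    return matches

# run_search({"a.txt": "hello world", "b.txt": "foo"}, "world") -> ["a.txt"]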
In the process of searching, the pattern in the query must be checked against all the data files (documents), and the documents need to be ranked based on their relevance to the query. To achieve this, we need to compute a certain metric from the data in the documents; this metric is called the TF-IDF score of the documents.

TF-IDF refers to term frequency-inverse document frequency. It is a metric used to quantify the relative importance of a word to a document in a collection or corpus [3]. It increases with the number of occurrences of that term in the document.
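One standard formulation of the score [3] (several variants exist) is:

    tf-idf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the number of occurrences of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t. The logarithmic factor weights down terms that occur in many documents, so common words such as "the" do not dominate the relevance score.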
STEP 1: Compute the frequency of each term in each document. The mapper emits ((term, document id), 1) for each occurrence of a term, and the reducer sums these to produce ((term, document id), count).

STEP 2: Compute the total number of terms in a document. The mapper takes as input the output of the step 1 reducer and emits the term counts grouped by document. The reducer in turn sums the individual counts of the words belonging to the same document, while also retaining the word-specific counts. The functions are as given below, as discussed in [5].

Mapper:
Input: ((term, document id), count)
Output: (document id, (term, count))

Reducer: takes as input the records corresponding to the same document id
Output: ((term, document id), (count, totalCount))
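Both steps can be sketched in plain Python, with dictionaries standing in for the distributed shuffle (an illustrative sketch, not the code from [5]):

from collections import defaultdict

def step1_term_counts(docs):
    # STEP 1 -- map: ((term, doc_id), 1) per occurrence; reduce: sum the 1s.
    counts = defaultdict(int)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            counts[(term, doc_id)] += 1
    return dict(counts)

def step2_doc_totals(counts):
    # STEP 2 -- map: ((term, doc_id), count) -> (doc_id, (term, count)).
    per_doc = defaultdict(list)
    for (term, doc_id), count in counts.items():
        per_doc[doc_id].append((term, count))
    # STEP 2 -- reduce: total terms per document, keeping word-specific counts.
    output = {}
    for doc_id, pairs in per_doc.items():
        total = sum(c for _, c in pairs)
        for term, count in pairs:
            output[(term, doc_id)] = (count, total)
    return output

# step2_doc_totals(step1_term_counts({"d1": "a a b"}))
# -> {("a", "d1"): (2, 3), ("b", "d1"): (1, 3)}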
Figure: Pseudo code for parallel breadth-first search, reproduced from reference [6]

3.6 Machine Learning and MapReduce
Machine learning is an important application area for MapReduce, since most machine learning algorithms are data intensive. Some of the common algorithms in the field of machine learning include Naïve Bayes classification and k-means clustering.

Naïve Bayes Classification
Here the input to the algorithm is a huge data set that contains the values of multiple attributes and the corresponding classifier. The algorithm needs to learn the correlation of the attributes with the classifier.

K-Means Clustering
K-means repeatedly associates each data point with its nearest cluster centroid and then recomputes each centroid as the mean of the points assigned to it [8]. As can be seen from these steps, the algorithm is amenable to the MapReduce architecture, where the cluster association and the cluster computation can be done in parallel. In [7], it has been shown that the complexity can be reduced by a factor of P, where P is the number of cores.
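A single k-means iteration can be sketched as one map and one reduce step. The sketch below uses one-dimensional points for brevity and plain Python in place of the cluster runtime; it is illustrative, not the implementation evaluated in [7].

from collections import defaultdict

def assign_map(point, centroids):
    # Map (cluster association): emit the point keyed by its nearest centroid.
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    yield (nearest, point)

def centroid_reduce(cluster_id, points):
    # Reduce (cluster computation): new centroid = mean of the assigned points.
    return (cluster_id, sum(points) / len(points))

def kmeans_step(points, centroids):
    groups = defaultdict(list)
    for p in points:                               # each map call is independent,
        for cid, pt in assign_map(p, centroids):   # so this loop parallelizes
            groups[cid].append(pt)
    return [centroid_reduce(cid, pts)[1] for cid, pts in sorted(groups.items())]

# kmeans_step([1.0, 2.0, 9.0, 10.0], [2.0, 9.0]) -> [1.5, 9.5]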
4. CONCLUSION
Parallel computation has started dominating today's world. Given any problem, the best solution needs to achieve the results efficiently; solving the problem is no longer enough. In order to survive in this competitive world, everything should be performed time-efficiently, and to achieve that we need to move toward parallel computing. Programmers can no longer be kept blind to the underlying hardware architecture. In order to scale up and utilize the parallel infrastructure, they must know how to exploit the potential of such architectures and code efficiently. The MapReduce paradigm helps programmers achieve this by providing an abstraction that hides the implementation details of the parallelism. But the programmer still needs to decide which operations should be parallelized; that is, it is the duty of the programmer to decide what the map and reduce functions do. From there on, Hadoop takes care of the implementation of the parallelism, making it easy for the developer. Dynamic programming is another algorithmic concept that helps in solving highly complex problems.

By analyzing the various MapReduce algorithms, we can realize the potential of parallelization. There are many more MapReduce algorithms solving various problems, so we need to clearly identify which problems fit the MapReduce model and efficiently utilize the parallel infrastructure.

5. REFERENCES
[1] What is MapReduce? Retrieved April 5, 2014, from https://www-01.ibm.com/software/data/infosphere/hadoop/MapReduce/
[2] What is Hadoop Distributed File System (HDFS)? Retrieved April 5, 2014, from https://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/
[3] TF-IDF. Retrieved April 10, 2014, from http://en.wikipedia.org/wiki/Tf%E2%80%93idf
[4] Buttcher, S., Clarke, C. L. A., and Cormack, G. V. 2010. Information Retrieval: Implementing and Evaluating Search Engines.
[5] MapReduce Algorithms. Retrieved April 10, 2014, from http://courses.cs.washington.edu/courses/cse490h/08au/lectures/algorithms.pdf
[6] Lin, J., and Dyer, C. 2010. Data-Intensive Text Processing with MapReduce. Retrieved February 10, 2014, from http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
[7] Chu, C.-T., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G., Ng, A., and Olukotun, K. 2006. Map-Reduce for machine learning on multicore. In Proceedings of Neural Information Processing Systems Conference (NIPS). Vancouver, Canada.
[8] K-means clustering. Retrieved April 5, 2014, from http://en.wikipedia.org/wiki/K-means_clustering