Summary Hadoop
If you are ready to dive into the MapReduce framework for processing
large datasets, this practical book takes you step by step through
the algorithms and tools you need to build distributed MapReduce
applications with Apache Hadoop or Apache Spark. Each chapter provides
a recipe for solving a massive computational problem, such as building a
recommendation system. You’ll learn how to implement the appropriate
MapReduce solution with code that you can use in your projects.
Dr. Mahmoud Parsian covers basic design patterns, optimization techniques,
and data mining and machine learning solutions for problems in bioinformatics,
genomics, statistics, and social network analysis. This book also includes an
overview of MapReduce, Hadoop, and Spark.
Topics include:
■ Market basket analysis for a large set of transactions
■ Data mining algorithms (K-means, KNN, and Naive Bayes)
■ Using huge genomic data to sequence DNA and RNA
■ Naive Bayes theorem and Markov chains for data and market
prediction
■ Recommendation algorithms and pairwise document similarity
■ Linear regression, Cox regression, and Pearson correlation
■ Allelic frequency and mining DNA
■ Social network analysis (recommendation systems, counting
triangles, sentiment analysis)
Mahmoud Parsian, PhD in Computer Science, is a practicing software professional with
30 years of experience as a developer, designer, architect, and author. Currently the leader
of Illumina’s Big Data team, he’s spent the past 15 years working with Java (server-side),
databases, MapReduce, and distributed computing. Mahmoud is the author of JDBC
Recipes and JDBC Metadata, MySQL, and Oracle Recipes (both Apress).
This chapter shows you how to implement a left outer join in the MapReduce environment. I provide three distinct implementations in MapReduce/Hadoop and Spark:
• MapReduce/Hadoop solution using the classic map() and reduce() functions
• Spark solution without using the built-in JavaPairRDD.leftOuterJoin()
• Spark solution using the built-in JavaPairRDD.leftOuterJoin()
Left Outer Join Example
Consider a company such as Amazon, which has over 200 million users and can do
hundreds of millions of transactions per day. To understand the concept of a left
outer join, assume we have two types of data: users and transactions. The users data
consists of users’ location information (say, location_id) and the transactions data
includes user identity information (say, user_id), but no direct information about a
user’s location. Given users and transactions, then:
users(user_id, location_id)
transactions(transaction_id, product_id, user_id, quantity, amount)
our goal is to find the number of unique locations in which each product has been
sold.
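As a preview of the third solution listed above, here is a minimal Spark sketch of this computation using the built-in JavaPairRDD.leftOuterJoin(). The class name, input paths, and comma-separated field layouts are assumptions for illustration, not the book's code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.Optional; // Spark 2.x+; Spark 1.x returns Guava's Optional here
import scala.Tuple2;

public class UniqueLocationsPerProduct {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("LeftOuterJoin"));

        // users.txt lines (assumed layout): user_id,location_id
        JavaPairRDD<String, String> users = sc.textFile("users.txt")
            .mapToPair(line -> {
                String[] f = line.split(",");
                return new Tuple2<>(f[0], f[1]);      // (user_id, location_id)
            });

        // transactions.txt lines (assumed layout):
        // transaction_id,product_id,user_id,quantity,amount
        JavaPairRDD<String, String> transactions = sc.textFile("transactions.txt")
            .mapToPair(line -> {
                String[] f = line.split(",");
                return new Tuple2<>(f[2], f[1]);      // (user_id, product_id)
            });

        // Left outer join keeps every transaction, even when the user's
        // location is unknown.
        JavaPairRDD<String, Tuple2<String, Optional<String>>> joined =
            transactions.leftOuterJoin(users);

        // Count unique locations per product.
        JavaPairRDD<String, Integer> counts = joined
            .mapToPair(t -> new Tuple2<>(
                t._2()._1(),                          // product_id
                t._2()._2().isPresent() ? t._2()._2().get() : "unknown"))
            .distinct()                               // unique (product_id, location_id)
            .mapToPair(t -> new Tuple2<>(t._1(), 1))
            .reduceByKey(Integer::sum);

        counts.saveAsTextFile("output/unique-locations-per-product");
        sc.stop();
    }
}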
But what exactly is a left outer join? Let T1 (a left table) and T2 (a right table) be two relations. The result of a left outer join of T1 with T2 contains every record of the left table T1: where the join condition finds a matching record in the right table T2, the two records are combined, and where it finds none, the T2 columns of the result are null.
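For example, given these rows:

users: (u1, l1), (u2, l2)
transactions: (t1, p1, u1, 1, 100), (t2, p2, u3, 1, 200)

the left outer join of transactions (left) with users (right) on user_id is:

(t1, p1, u1, 1, 100, l1)
(t2, p2, u3, 1, 200, null)

Every transaction appears exactly once: u3 has no users row, so its location is null, and the unmatched user u2 contributes nothing to the result.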
The core pieces of the left outer join data flow are as follows:
Transaction mapper
The transaction map() reads (transaction_id, product_id, user_id, quantity, amount) and emits a key-value pair composed of (user_id, product_id).
User mapper
The user map() reads (user_id, location_id) and emits a key-value pair composed of (user_id, location_id).
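A minimal Hadoop sketch of these two version-1 mappers, assuming comma-separated text input (class names are illustrative, and both classes are shown in one listing for brevity):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Version 1: both mappers key their output by user_id alone.
class TransactionMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line: transaction_id,product_id,user_id,quantity,amount
        String[] fields = value.toString().split(",");
        context.write(new Text(fields[2]), new Text(fields[1])); // (user_id, product_id)
    }
}

class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line: user_id,location_id
        String[] fields = value.toString().split(",");
        context.write(new Text(fields[0]), new Text(fields[1])); // (user_id, location_id)
    }
}

Because both mappers emit a bare Text value under the same user_id key, the reducer receives a mixed bag of product_ids and location_ids, which leads to the ambiguity discussed next.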
The reducer for phase 1 gets both the user’s location_id and product_id and emits
(product_id, location_id). Now, the question is how the reducer will distinguish
location_id from product_id. In Hadoop, the order of reducer values is undefined.
Therefore, the reducer for a specific key (user_id) has no clue how to process the
values. To remedy this problem, we modify the transaction and user mappers/reducers (which we will call version 2):
Transaction mapper (version 2)
As shown in Example 4-1, the transaction map() reads (transaction_id, product_id, user_id, quantity, amount) and emits the key pair (user_id, 2) and the value pair (“P”, product_id). By adding a “2” to the reducer key, we guarantee that product_id(s) arrive at the end. This will be accomplished through the secondary sorting technique described in Chapters 1 and 2. We added “P” to the value to identify products. In Hadoop, to imp
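A minimal sketch of this version-2 transaction mapper, encoding the composite key (user_id, 2) and the tagged value ("P", product_id) as comma-separated Text; the book's Example 4-1 may instead use custom pair classes with a secondary-sort comparator:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Version 2: the key carries a "2" so that, after secondary sort, products
// arrive after the user's location (keyed with "1"), and the value carries
// a "P" tag so the reducer can identify products.
public class TransactionMapperV2 extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line: transaction_id,product_id,user_id,quantity,amount
        String[] fields = value.toString().split(",");
        String productId = fields[1];
        String userId = fields[2];
        context.write(new Text(userId + ",2"),      // composite key (user_id, 2)
                      new Text("P," + productId));  // tagged value ("P", product_id)
    }
}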