Data Science With Python - Lesson 12 - Python Integration With Hadoop
We have seen how big data is generated, and we understand that extracting insights depends more on proper analysis of the data than on its sheer size.
Quick Recap: Need for Real-Time Analytics
Real-time analytics is the rage right now because it helps extract information
from different data sources almost instantly.
[Figure: Real-time analytics and the data science workflow: Acquire, Wrangle, Explore, Model, and Visualize (e.g., with Bokeh).]
Disparity in Programming Languages
However, big data is accessed through Hadoop, which is developed and implemented entirely in Java, while analytics platforms are written in a variety of other programming languages.
[Figure: Python code reaches the Hadoop infrastructure (HDFS) through Python APIs, and Spark provides a big data analytics platform for data science across multiple programming languages.]
Hadoop
Hadoop has two core components: HDFS (Hadoop Distributed File System) and MapReduce.
This example illustrates the Hadoop system architecture and how data is stored in a cluster.
[Figure: HDFS architecture. A large file from the data sources is split into file blocks (64 MB or 128 MB) that are distributed across the data nodes of the Hadoop cluster; the name node tracks where each block lives, supported by a secondary name node.]
MapReduce
The second core component of Hadoop is MapReduce, the framework that processes the data stored in HDFS.
[Figure: MapReduce data flow. Input on HDFS is divided into splits (split 0, split 1, split 2), each processed by a map task; the intermediate output is copied, sorted, and merged; reduce tasks then write the final output parts (part 0, part 1) back to HDFS with replication.]
MapReduce: Mapper and Reducer
Let us discuss the MapReduce functions, mapper and reducer, in detail.
Mapper
• Mappers run locally on the data nodes to avoid network traffic.
• Multiple mappers run in parallel, each processing a portion of the input data.
• The mapper reads data in the form of key-value pairs.
• If the mapper generates an output, it is written in the form of key-value pairs.
Reducer
• All intermediate values for a given intermediate key are combined into a list and given to a reducer. This step is known as shuffle and sort.
• The reducer outputs zero or more final key-value pairs, which are written to HDFS.
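To make these stages concrete, here is a plain-Python walk-through of map, shuffle and sort, and reduce for a word count. No Hadoop is involved, and the two input lines are made up for illustration.
from itertools import groupby
from operator import itemgetter

records = ["big data", "big deal"]

# Map: emit one (key, value) pair per word
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle and sort: group all values belonging to the same intermediate key
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in group]
           for key, group in groupby(mapped, key=itemgetter(0))}

# Reduce: combine each key's list of values into a final (key, value) pair
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}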
Hadoop Streaming: Python API for Hadoop
Hadoop Streaming acts as a bridge between your Python code and the Java-based Hadoop framework, letting you access Hadoop clusters and execute MapReduce tasks.
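With Hadoop Streaming, a mapper is simply an executable that reads lines from standard input and writes tab-separated key-value pairs to standard output. A minimal word-count mapper sketch (the file name mapper.py is our choice, not part of the original slides):
#!/usr/bin/env python
# mapper.py: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))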
You can now sum the numbers using the reduce function:
import functools as f
a = [1, 4, 9, 16]  # for example, a list of squared numbers
sum_squared = f.reduce(lambda x, y: x + y, a)
Cloudera provides an enterprise-ready Hadoop big data platform that also supports Python.
To set up the Cloudera Hadoop environment, visit the Cloudera link:
http://www.cloudera.com/downloads/quickstart_vms/5-7.html
Cloudera recommends that you use 7-Zip to extract these files. To download and install it, visit the link:
http://www.7-zip.org/
Cloudera QuickStart VM: Prerequisites
• These 64-bit VMs require a 64-bit host OS and a virtualization product that can support a 64-bit guest OS.
• To use a VMware VM, you must use a player compatible with Workstation 8.x or higher:
• Player 4.x or higher
• Fusion 4.x or higher
• Older versions of Workstation can be used to create a new VM using the same virtual disk (VMDK file), but some features in VMware Tools are not available.
• The amount of RAM required varies by the run-time option you choose.
Launching VMware Image
https://www.vmware.com/products/player/playerpro-evaluation.html
https://www.vmware.com/products/fusion/fusion-evaluation.html
QuickStart VMware Image
• Launch VMware Player with the Cloudera VM
• Launch the terminal
Account:
username: cloudera
password: cloudera
QuickStart VM Terminal
Unix commands:
• pwd to verify the present working directory
• ls -lrt to list files and directories in long format, oldest first
Using Hadoop Streaming for Calculating Word Count
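Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so a reducer only has to sum counts for consecutive identical words. A matching reducer sketch (again, the file name reducer.py is our choice):
#!/usr/bin/env python
# reducer.py: sum the counts per word; input arrives sorted by key
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current:
        total += int(count)
    else:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, int(count)
if current is not None:
    print("%s\t%d" % (current, total))
You can test the pair locally with a Unix pipeline such as cat input.txt | python mapper.py | sort | python reducer.py, and then submit the same scripts to the cluster with the hadoop-streaming jar (its exact path varies by installation).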
Apache Spark Uses In-Memory Processing Instead of Disk I/O
[Figure: MapReduce vs. Spark. MapReduce reads from and writes to HDFS on disk between every iteration and query, whereas Spark loads the input once and serves iterations and queries from distributed memory (RAM) across the cluster's CPUs.]
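A small PySpark sketch of why keeping data in memory helps: cache an RDD once, then run several actions against it without re-reading from disk. It assumes a live SparkContext sc and a hypothetical file data/input.txt.
# Cache the RDD so repeated actions are served from RAM
lines = sc.textFile("data/input.txt").cache()
print(lines.count())                                 # first action reads the file and fills the cache
print(lines.filter(lambda l: "error" in l).count())  # second action reuses the cached data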
Apache Spark Resilient Distributed Datasets (RDD)
PySpark is the Spark Python API, which enables data scientists to access the Spark programming model.
[Figure: PySpark exposes RDD transformations and actions, plus the Spark libraries: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).]
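The RDD model splits work into lazy transformations and eager actions. A minimal sketch, assuming a SparkContext sc is available:
rdd = sc.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x)            # transformation: nothing executes yet
evens = squares.filter(lambda x: x % 2 == 0)  # still a lazy transformation
print(evens.collect())                        # action: the job runs now -> [4, 16]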
Spark
• Download Spark from http://spark.apache.org/downloads.html
• Extract it to [installed directory]\spark-1.6.1-bin-hadoop2.4\spark-1.6.1-bin-hadoop2.4
• Set up the PySpark notebook-specific variables
• Check the SparkContext
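One way to set the PySpark notebook variables is the findspark package, which points Python at the extracted Spark directory; the package and the path below are assumptions, not part of the original setup.
import findspark
findspark.init(r"C:\spark-1.6.1-bin-hadoop2.4")  # path from the download step above

import pyspark
sc = pyspark.SparkContext(appName="lesson12")
print(sc.version)  # check that the SparkContext is live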
Using PySpark to Determine Word Count
Demonstrate how to use the Jupyter-integrated PySpark API to determine the word count of a given dataset.
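A minimal PySpark word-count sketch, assuming the SparkContext sc from the setup above and a hypothetical text file data/input.txt:
counts = (sc.textFile("data/input.txt")
            .flatMap(lambda line: line.split())  # split lines into words
            .map(lambda word: (word, 1))         # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))    # sum the counts per word
for word, n in counts.take(10):
    print(word, n)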
Practice
Use the given dataset to count and display all the airports based in New York using PySpark. Perform the following steps (a starting sketch follows the list):
• View all the airports listed in the dataset
• View only the first 10 records
• Filter the data for all airports located in New York
• Clean up the dataset if required
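A starting sketch for the practice steps. The file name airports.csv, the comma-separated layout, and the city column index are assumptions; adjust them to the dataset you are given.
airports = sc.textFile("airports.csv")
print(airports.take(10))                              # view only the first 10 records

header = airports.first()
rows = (airports.filter(lambda line: line != header)  # clean up: drop the header row
                .map(lambda line: line.split(",")))

ny = rows.filter(lambda cols: "New York" in cols[3])  # filter airports located in New York
print(ny.count())                                     # count the New York airports
for record in ny.collect():                           # display all of them
    print(record)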
Knowledge Check
1. What are the core components of Hadoop? Select all that apply.
a. MapReduce
b. HDFS
c. Spark
d. RDD
Answer: a and b. HDFS and MapReduce are the two core components of Hadoop.
Knowledge Check
2. MapReduce is a data processing framework which gets executed _____.
a. at DataNode
b. at NameNode
c. on client side
d. in memory
Answer: a. MapReduce tasks execute on the DataNodes, close to where the data blocks are stored.
Knowledge Check
3. _____
a. Reducer
b. Mapper
c. Partitioner
Knowledge Check
4. What transforms input key-value pairs to a set of intermediate key-value pairs?
a. Mapper
b. Reducer
c. Combiner
d. Partitioner
Answer: a. The mapper transforms each input key-value pair into a set of intermediate key-value pairs.
Import the financial data using the Yahoo data reader for the following companies (a starting sketch follows the list):
• Yahoo
• Apple
• Amazon
• Microsoft
• Google
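A hedged starting sketch using the pandas_datareader Yahoo reader (whether the yahoo source works depends on the installed version); the ticker symbols are our assumptions for the listed companies.
import datetime
import pandas_datareader.data as web

start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2016, 1, 1)
tickers = ["YHOO", "AAPL", "AMZN", "MSFT", "GOOG"]

# Download one OHLCV DataFrame per company
data = {t: web.DataReader(t, "yahoo", start, end) for t in tickers}
print(data["AAPL"].head())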
On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew aboard. The tragedy shocked the world and led to better safety regulations for ships. Here, we ask you to analyze the data using exploratory data analysis techniques. In particular, we want you to apply machine learning tools to predict which passengers survived. A starting sketch follows.
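A minimal starting sketch, not a full solution: it assumes the standard Kaggle-style train.csv with columns such as Survived, Pclass, Sex, Age, and Fare.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("train.csv")
print(df.describe())  # quick exploratory look at the numeric columns

# Basic clean-up: encode sex as 0/1 and fill missing ages with the median
df["Sex"] = (df["Sex"] == "female").astype(int)
df["Age"] = df["Age"].fillna(df["Age"].median())

X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]
model = LogisticRegression().fit(X, y)
print(model.score(X, y))  # training accuracy as a first baseline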