Data Science With Python - Lesson 12 - Python Integration With Hadoop
We have seen how big data is generated, and we understand that extracting insights depends more on proper analysis of the data than on its sheer size.
Quick Recap: Need for Real-Time Analytics
Real-time analytics is the rage right now because it helps extract information
from different data sources almost instantly.
[Figure: Real-time analytics and the data science workflow: Acquire, Wrangle, Explore, Model, and Visualize (e.g., with Bokeh).]
Disparity in Programming Languages
However, big data is accessed through Hadoop, which is developed and implemented entirely in Java, while analytics platforms are written in a variety of other programming languages.
[Figure: Python code reaches the Hadoop infrastructure (HDFS) through Python APIs, and Spark provides a big data analytics platform for data science across multiple programming languages.]
Hadoop
Hadoop has two core components: HDFS (Hadoop Distributed File System) and MapReduce.
This example illustrates the Hadoop system architecture and how data is stored in a cluster.
[Figure: HDFS architecture. A large file from the data sources is split into file blocks (64 MB or 128 MB) that are distributed across the data nodes of the Hadoop cluster; the name node tracks where each block lives, supported by a secondary name node.]
MapReduce
The second core component of Hadoop is MapReduce, the framework that processes the data stored in HDFS.
[Figure: MapReduce data flow. Input on HDFS is divided into splits (split 0, split 1, split 2), each processed by a map task; the intermediate output is copied, sorted, and merged; reduce tasks then write the final output parts (part 0, part 1) back to HDFS with replication.]
MapReduce: Mapper and Reducer
Let us discuss the MapReduce functions, mapper and reducer, in detail.
Mapper
• Mappers run locally on the data nodes to avoid network traffic.
• Multiple mappers run in parallel, each processing a portion of the input data.
• The mapper reads data in the form of key-value pairs.
• If the mapper generates an output, it is written in the form of key-value pairs.
Reducer
• All intermediate values for a given intermediate key are combined into a list and given to a reducer. This step is known as shuffle and sort.
• The reducer outputs zero or more final key-value pairs, which are written to HDFS.
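To make these stages concrete, here is a plain-Python walk-through of map, shuffle and sort, and reduce for a word count. No Hadoop is involved, and the two input lines are made up for illustration.
from itertools import groupby
from operator import itemgetter

records = ["big data", "big deal"]

# Map: emit one (key, value) pair per word
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle and sort: group all values belonging to the same intermediate key
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in group]
           for key, group in groupby(mapped, key=itemgetter(0))}

# Reduce: combine each key's list of values into a final (key, value) pair
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}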
Hadoop Streaming: Python API for Hadoop
Hadoop Streaming acts as a bridge between your Python code and the Java-based Hadoop framework, letting you access Hadoop clusters and execute MapReduce tasks.
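With Hadoop Streaming, a mapper is simply an executable that reads lines from standard input and writes tab-separated key-value pairs to standard output. A minimal word-count mapper sketch (the file name mapper.py is our choice, not part of the original slides):
#!/usr/bin/env python
# mapper.py: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))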
You can now sum the numbers using the reduce function:
import functools as f
a = [1, 4, 9, 16]  # for example, a list of squared numbers
sum_squared = f.reduce(lambda x, y: x + y, a)
Cloudera provides an enterprise-ready Hadoop big data platform that also supports Python.
To set up the Cloudera Hadoop environment, visit the Cloudera link:
http://www.cloudera.com/downloads/quickstart_vms/5-7.html
Cloudera recommends that you use 7-Zip to extract these files. To download and install it, visit the link:
http://www.7-zip.org/
Cloudera QuickStart VM: Prerequisites
• These 64-bit VMs require a 64-bit host OS and a virtualization product that can support a 64-bit guest OS.
• To use a VMware VM, you must use a player compatible with Workstation 8.x or higher:
• Player 4.x or higher
• Fusion 4.x or higher
• Older versions of Workstation can be used to create a new VM using the same virtual disk (VMDK file), but some features in VMware Tools are not available.
• The amount of RAM required varies by the run-time option you choose.
Launching VMware Image
https://www.vmware.com/products/player/playerpro-evaluation.html
https://www.vmware.com/products/fusion/fusion-evaluation.html
QuickStart VMware Image
• Launch VMware Player with the Cloudera VM
• Launch the terminal
Account:
username: cloudera
password: cloudera
QuickStart VM Terminal
Unix commands:
• pwd to verify the present working directory
• ls -lrt to list files and directories in long format, oldest first
Using Hadoop Streaming for Calculating Word Count
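Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so a reducer only has to sum counts for consecutive identical words. A matching reducer sketch (again, the file name reducer.py is our choice):
#!/usr/bin/env python
# reducer.py: sum the counts per word; input arrives sorted by key
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current:
        total += int(count)
    else:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, int(count)
if current is not None:
    print("%s\t%d" % (current, total))
You can test the pair locally with a Unix pipeline such as cat input.txt | python mapper.py | sort | python reducer.py, and then submit the same scripts to the cluster with the hadoop-streaming jar (its exact path varies by installation).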
Apache Spark Uses In-Memory Processing Instead of Disk I/O
[Figure: MapReduce vs. Spark. MapReduce reads from and writes to HDFS on disk between every iteration and query, whereas Spark loads the input once and serves iterations and queries from distributed memory (RAM) across the cluster's CPUs.]
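A small PySpark sketch of why keeping data in memory helps: cache an RDD once, then run several actions against it without re-reading from disk. It assumes a live SparkContext sc and a hypothetical file data/input.txt.
# Cache the RDD so repeated actions are served from RAM
lines = sc.textFile("data/input.txt").cache()
print(lines.count())                                 # first action reads the file and fills the cache
print(lines.filter(lambda l: "error" in l).count())  # second action reuses the cached data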
Apache Spark Resilient Distributed Datasets (RDD)
PySpark is the Spark Python API, which enables data scientists to access the Spark programming model.
[Figure: PySpark exposes RDD transformations and actions, plus the Spark libraries: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).]
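The RDD model splits work into lazy transformations and eager actions. A minimal sketch, assuming a SparkContext sc is available:
rdd = sc.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x)            # transformation: nothing executes yet
evens = squares.filter(lambda x: x % 2 == 0)  # still a lazy transformation
print(evens.collect())                        # action: the job runs now -> [4, 16]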
Spark
• Download Spark from http://spark.apache.org/downloads.html
• Extract it to [installed directory]\spark-1.6.1-bin-hadoop2.4\spark-1.6.1-bin-hadoop2.4
• Set up the PySpark notebook-specific variables
• Check the SparkContext
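One way to set the PySpark notebook variables is the findspark package, which points Python at the extracted Spark directory; the package and the path below are assumptions, not part of the original setup.
import findspark
findspark.init(r"C:\spark-1.6.1-bin-hadoop2.4")  # path from the download step above

import pyspark
sc = pyspark.SparkContext(appName="lesson12")
print(sc.version)  # check that the SparkContext is live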
Using PySpark to Determine Word Count
Demonstrate how to use the Jupyter-integrated PySpark API to determine the word count of a given dataset.
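A minimal PySpark word-count sketch, assuming the SparkContext sc from the setup above and a hypothetical text file data/input.txt:
counts = (sc.textFile("data/input.txt")
            .flatMap(lambda line: line.split())  # split lines into words
            .map(lambda word: (word, 1))         # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))    # sum the counts per word
for word, n in counts.take(10):
    print(word, n)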
Practice
Use the given dataset to count and display all the airports based in New York using PySpark. Perform the following steps (a starting sketch follows the list):
• View all the airports listed in the dataset
• View only the first 10 records
• Filter the data for all airports located in New York
• Clean up the dataset if required
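A starting sketch for the practice steps. The file name airports.csv, the comma-separated layout, and the city column index are assumptions; adjust them to the dataset you are given.
airports = sc.textFile("airports.csv")
print(airports.take(10))                              # view only the first 10 records

header = airports.first()
rows = (airports.filter(lambda line: line != header)  # clean up: drop the header row
                .map(lambda line: line.split(",")))

ny = rows.filter(lambda cols: "New York" in cols[3])  # filter airports located in New York
print(ny.count())                                     # count the New York airports
for record in ny.collect():                           # display all of them
    print(record)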
Knowledge Check
1. What are the core components of Hadoop? Select all that apply.
a. MapReduce
b. HDFS
c. Spark
d. RDD
Answer: a and b. HDFS and MapReduce are the two core components of Hadoop.
Knowledge Check
2. MapReduce is a data processing framework which gets executed _____.
a. at DataNode
b. at NameNode
c. on client side
d. in memory
Answer: a. MapReduce tasks execute on the DataNodes, close to where the data blocks are stored.
Knowledge Check
3. _____
a. Reducer
b. Mapper
c. Partitioner
Knowledge Check
4. What transforms input key-value pairs to a set of intermediate key-value pairs?
a. Mapper
b. Reducer
c. Combiner
d. Partitioner
Answer: a. The mapper transforms each input key-value pair into a set of intermediate key-value pairs.
Import the financial data using the Yahoo data reader for the following companies (a starting sketch follows the list):
• Yahoo
• Apple
• Amazon
• Microsoft
• Google
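A hedged starting sketch using the pandas_datareader Yahoo reader (whether the yahoo source works depends on the installed version); the ticker symbols are our assumptions for the listed companies.
import datetime
import pandas_datareader.data as web

start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2016, 1, 1)
tickers = ["YHOO", "AAPL", "AMZN", "MSFT", "GOOG"]

# Download one OHLCV DataFrame per company
data = {t: web.DataReader(t, "yahoo", start, end) for t in tickers}
print(data["AAPL"].head())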
On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew aboard. The tragedy shocked the world and led to better safety regulations for ships. Here, we ask you to analyze the data using exploratory data analysis techniques. In particular, we want you to apply machine learning tools to predict which passengers survived. A starting sketch follows.
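A minimal starting sketch, not a full solution: it assumes the standard Kaggle-style train.csv with columns such as Survived, Pclass, Sex, Age, and Fare.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("train.csv")
print(df.describe())  # quick exploratory look at the numeric columns

# Basic clean-up: encode sex as 0/1 and fill missing ages with the median
df["Sex"] = (df["Sex"] == "female").astype(int)
df["Age"] = df["Age"].fillna(df["Age"].median())

X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]
model = LogisticRegression().fit(X, y)
print(model.score(X, y))  # training accuracy as a first baseline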