BDA Ch 05 L05 Python Libraries for Analysis


Lesson 5

Python and its Libraries with Spark for Data Analysis

“Big Data Analytics”, Ch.05 L05: Spark and Big Data Analytics, 2019
Raj Kamal and Preeti Saxena © McGraw-Hill Education (India)

Python
• A general-purpose, interpreted, interactive, object-oriented and high-level programming language
• Defines the basic data types and containers (lists, dictionaries, sets and tuples), as well as functions and classes
• Expressive programming statements
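
The points above can be illustrated with a minimal sketch. All names and sample values here (costs, Toy, cost_plus and so on) are made up for illustration and do not come from the lesson.

```python
# Built-in containers (all sample values are illustrative).
costs = [8.0, 12.0, 20.0]                 # list: ordered, mutable
size = (30, 40)                           # tuple: ordered, immutable
categories = {"puzzle", "car", "puzzle"}  # set: duplicates collapse
stock = {"puzzle": 5, "car": 2}           # dictionary: key -> value


def total(values):
    """Function: add up a sequence of numbers."""
    return sum(values)


class Toy:
    """Class: bundles data with behaviour."""

    def __init__(self, name, cost):
        self.name = name
        self.cost = cost

    def cost_plus(self, rate):
        """Return the cost increased by the given fractional rate."""
        return self.cost * (1 + rate)


# An expressive statement: a list comprehension builds a list in one line.
squares = [n * n for n in range(5)]
```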

Python Libraries
• Extensive Python Standard Library
• Libraries for regular expressions
• Documentation generation
• Unit testing
• Web browsers
• Threading

Python Libraries
• Databases
• CGI
• Email
• Image manipulation
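
A short sketch of three of the standard-library areas listed above (regular expressions, databases and unit testing). The function extract_costs, the table toys and the sample values are hypothetical and chosen only for illustration.

```python
import re
import sqlite3
import unittest


def extract_costs(text):
    """Regular expressions: pull every decimal number out of a string."""
    return [float(match) for match in re.findall(r"\d+\.\d+", text)]


# Databases: sqlite3 ships with Python; ":memory:" keeps the data in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toys (name TEXT, cost REAL)")
conn.executemany("INSERT INTO toys VALUES (?, ?)",
                 [("puzzle", 8.5), ("car", 20.0)])
total_cost = conn.execute("SELECT SUM(cost) FROM toys").fetchone()[0]
conn.close()


class ExtractCostsTest(unittest.TestCase):
    """Unit testing: one small test case for extract_costs."""

    def test_finds_both_costs(self):
        self.assertEqual(extract_costs("puzzle 8.50, car 20.00"), [8.5, 20.0])


# Run the test case programmatically instead of calling unittest.main(),
# which would exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExtractCostsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```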

Python and Spark Binding
• Gives a strong combination of performance and features in a single bundle of code
• Spark SQL binds with Python easily

Python and Spark Binding
• Spark SQL features together with
Python help a programmer to build
challenging applications for Big Data

Spark added Python API support for UDFs
• These functions take one row at a time, which requires overhead (additional code) for serialization and deserialization (SerDe)
• To avoid that overhead, programmers defined the UDFs in Java or Scala and then invoked them from Python
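
A row-at-a-time Python UDF might look like the sketch below. The function normalize_name and the column names name and clean_name are hypothetical. The Spark wiring is kept inside a helper with lazy imports, so the plain-Python logic can be tried without a Spark installation; the helper itself assumes PySpark is available.

```python
def normalize_name(name):
    """Row-at-a-time logic: receives one value, returns one value."""
    return None if name is None else name.strip().title()


def add_clean_name(df):
    """Attach normalize_name to a Spark DataFrame as a Python UDF.

    Assumes PySpark is installed and that df has a string column "name"
    (both are assumptions of this sketch).
    """
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    clean = udf(normalize_name, StringType())
    # Spark serializes each row's value, ships it to a Python worker,
    # calls normalize_name once per row, and deserializes the result.
    return df.withColumn("clean_name", clean(col("name")))
```

The per-row round trip in the last step is the SerDe overhead that the slide refers to.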

Spark 2.3 Arrow Support for VUDFs and GVUDFs
• Supports vectorized UDFs (VUDFs) and grouped vectorized UDFs (GVUDFs)
• Spark and Apache Arrow facilitate VUDFs, which enable high-performance Python UDFs, SerDe and data pipelines
• Provides statistical functions

Python for Data Analysis and Plotting
• NumPy for numerical (Num) analysis
• SciPy for scientific (Sci) computations
• Scikit-learn
• Pandas
• StatsModels
• matplotlib functions for plotting mathematical functions

Python Pandas for Panel Data (Grouped Vectors Data) Analytics
• An open-source Python package consisting of BSD-licensed library functions; the name Pandas derives from panel data
• Pandas gives high-performance, easy-to-use data structures and data analysis tools
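
A minimal sketch of the Pandas DataFrame structure and two typical analysis steps, assuming pandas is installed. The table toys and its values are invented for illustration.

```python
import pandas as pd

# A DataFrame is a labelled, column-oriented table (sample data is made up).
toys = pd.DataFrame({
    "product": ["puzzle_a", "puzzle_b", "car_a"],
    "category": ["puzzle", "puzzle", "car"],
    "cost_usd": [8.0, 12.0, 20.0],
})

# Filter rows with a boolean condition.
expensive = toys[toys["cost_usd"] > 10]

# Group rows by category and compute the mean cost of each group.
mean_cost = toys.groupby("category")["cost_usd"].mean()
```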

Figure 5.7 Main features of Pandas for data analysis

StatsModels and NumPy
• StatsModels provides statistical functions
• NumPy includes (i) N-dimensional array objects and vector mathematics; (ii) linear algebraic functions, Fourier transform and random number functions; (iii) sophisticated (broadcasting) functions; (iv) integration with C and Fortran code

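
The NumPy capabilities listed above can be sketched as follows, assuming NumPy is installed. The array values are arbitrary.

```python
import numpy as np

# N-dimensional array object: a 2 x 3 array built from a range.
a = np.arange(6).reshape(2, 3)          # [[0, 1, 2], [3, 4, 5]]

# Broadcasting: the 1-D row is stretched across both rows of a.
row = np.array([10, 20, 30])
b = a + row

# Linear algebra: invert a small diagonal matrix.
m = np.array([[2.0, 0.0], [0.0, 4.0]])
m_inv = np.linalg.inv(m)

# Fourier transform of a constant signal: all energy lands in bin 0.
spectrum = np.fft.fft(np.ones(4))

# Random numbers from a seeded generator (reproducible between runs).
rng = np.random.default_rng(seed=0)
samples = rng.random(3)
```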
NumPy
• Table 5.5 gives examples of NumPy functions for data analysis problems
• NumPy provides efficient multi-dimensional containers of generic data and allows definitions of arbitrary data types.

NumPy
• Integrates easily with a wide variety
of databases
• NumPy provides import and export (load/save) of files
• Creation of arrays
• Inspection of properties
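
A small sketch of array creation, property inspection and load/save, assuming NumPy is installed. In-memory buffers stand in for files on disk so that the example leaves nothing behind; the values are arbitrary.

```python
import io

import numpy as np

# Creation of arrays.
zeros = np.zeros((2, 3))
steps = np.linspace(0.0, 1.0, 5)        # 5 evenly spaced values from 0 to 1

# Inspection of properties.
shape, ndim, dtype, size = zeros.shape, zeros.ndim, zeros.dtype, zeros.size

# Save and load in NumPy's binary .npy format (a buffer replaces a file path).
buffer = io.BytesIO()
np.save(buffer, steps)
buffer.seek(0)
restored = np.load(buffer)

# Import delimited text, such as a CSV file.
text = io.StringIO("1,2,3\n4,5,6\n")
table = np.loadtxt(text, delimiter=",")
```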

NumPy
• Copying, sorting and reshaping; addition and removal of elements in the arrays; indexing, sub-setting and slicing of the arrays; scalar and vector mathematics, such as +, −, ×, ÷, power, sqrt, sin, log, ceil (round up to the nearest integer), floor (round down to the nearest integer) and round (round to the nearest integer)

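
The array operations named above can be tried with a short sketch, assuming NumPy is installed. The sample values are arbitrary.

```python
import numpy as np

a = np.array([3.2, -1.5, 2.5, 0.7])

# Copying and sorting: sort the copy so that a keeps its original order.
b = a.copy()
b.sort()

# Reshaping, adding and removing elements (each call returns a new array).
grid = a.reshape(2, 2)
longer = np.append(a, 9.0)
shorter = np.delete(a, 1)               # drop the element at index 1

# Indexing, slicing and sub-setting with a boolean mask.
first_two = a[:2]
positives = a[a > 0]

# Scalar and vector mathematics apply element by element.
doubled = a * 2
rounded_up = np.ceil(a)
rounded_down = np.floor(a)
nearest = np.round(a)                   # exact halves go to the even integer
```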
SciPy
• Builds on top of NumPy
• SciPy defines some useful functions for computing distances between a set of points
• Includes support for MATLAB files, special functions, and routines for numerical integration and optimization
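
A brief sketch of three of the SciPy features mentioned above (distances between points, numerical integration and optimization), assuming SciPy and NumPy are installed. The points and functions are arbitrary examples.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.spatial.distance import pdist, squareform

# Distances between a set of points: pdist returns the pairwise Euclidean
# distances, and squareform arranges them as a symmetric matrix.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
distances = squareform(pdist(points))

# Numerical integration of x**2 over the interval [0, 3].
area, abs_error = quad(lambda x: x ** 2, 0.0, 3.0)

# Optimization: find the x that minimizes (x - 2)**2.
best = minimize_scalar(lambda x: (x - 2.0) ** 2)
```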

User-Defined Functions (UDFs)
• Spark SQL registers the UDFs and calls them
• Exposes advanced functionality to SQL users
• User code calls the UDFs from within SQL statements without rewriting the detailed code

Example of Using UDFs
• Example 5.4 explains the creation of a UDF, udfCostPlus(), in pandas
• The table column puzzleCost is created from an RDD built from jigsaw_puzzle_info.txt
• The UDF increases the costs in the column puzzle_cost_USD by 10%.
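
The following is not the book's Example 5.4 but a hedged sketch of the same idea: register a cost-plus function with Spark SQL and call it from a SQL statement. The view name puzzle_costs, the rate parameter and the helper costs_with_markup are assumptions of this sketch; only udfCostPlus and puzzle_cost_USD are taken from the slide. The Spark part assumes PySpark is installed and is wrapped in a helper so that the plain function can be checked on its own.

```python
def udfCostPlus(cost, rate=0.10):
    """Return the cost increased by rate (10% by default)."""
    return None if cost is None else cost * (1.0 + rate)


def costs_with_markup(spark, df):
    """Register udfCostPlus with Spark SQL and use it in a query.

    Assumes spark is a SparkSession and df has a numeric column
    puzzle_cost_USD; the view name puzzle_costs is hypothetical.
    """
    from pyspark.sql.types import DoubleType

    # Registration makes the Python function callable by name from SQL.
    spark.udf.register("udfCostPlus", udfCostPlus, DoubleType())
    df.createOrReplaceTempView("puzzle_costs")
    return spark.sql(
        "SELECT puzzle_cost_USD, "
        "udfCostPlus(puzzle_cost_USD) AS cost_plus "
        "FROM puzzle_costs"
    )
```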

Vectorized User Defined Functions
(VUDFs)
• Apache Arrow support in Spark facilitates columnar in-memory analytics, which results in high performance of Python UDFs, SerDe and data pipelines
• Example 5.5 explains the creation of a vectorized UDF (VUDF)

Creation of a vectorized UDF
(VUDF)
• First define a pandas_UDFCostPlus for increasing the cost puzzle_cost_USD of toys in the puzzle_Costs RDD created from jigsaw_puzzle_info.txt

VUDF Code Example
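from pyspark.sql.functions import pandas_udf

# Vectorized UDF: v arrives as a pandas Series (one batch of the column)
@pandas_udf('double')
def vectorized_plusTenPercent(v):
    return v * 1.1  # add 10% to every value in the batch

df = df.withColumn('v4', vectorized_plusTenPercent(df.v))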

Grouped Vectorized UDFs
(GVUDFs)
• Use the Pandas library's split-apply-combine pattern in data analysis
• Operate on all the data for a group, for example, “for each car showroom, compute yearly sales”

Step 1 for GVUDF
1. Splits a Spark DataFrame into
groups based on the conditions
specified in the groupBy operator

Step 2 for GVUDF
2. Applies a vectorized user-defined
function (pandas.DataFrame ->
pandas.DataFrame) to each group

Steps 3 and 4 in GVUDF
3. Combines the results from all the groups
4. Returns the results as a new Spark
DataFrame
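
The four steps can be walked through in plain pandas, using the car showroom example mentioned earlier. This is a sketch under stated assumptions: the columns showroom and monthly_sales and the sample figures are invented, and the Spark helper uses applyInPandas, which is the Spark 3.0+ API (Spark 2.3, which the lesson targets, expressed the same step as a grouped-map pandas UDF). The pandas part runs without Spark.

```python
import pandas as pd


def yearly_sales(group: pd.DataFrame) -> pd.DataFrame:
    """Apply step: pandas.DataFrame in, pandas.DataFrame out (one row)."""
    return pd.DataFrame({
        "showroom": [group["showroom"].iloc[0]],
        "yearly_sales": [int(group["monthly_sales"].sum())],
    })


# Hypothetical sample data: two showrooms, two monthly figures each.
sales = pd.DataFrame({
    "showroom": ["north", "south", "north", "south"],
    "monthly_sales": [10, 7, 12, 9],
})

# Step 1: split the rows into one DataFrame per showroom.
groups = [group for _, group in sales.groupby("showroom")]
# Step 2: apply the function to each group.
partials = [yearly_sales(group) for group in groups]
# Step 3: combine the per-group results.
combined = pd.concat(partials, ignore_index=True)


def yearly_sales_spark(spark_df):
    """Steps 1 to 4 on Spark (assumes PySpark 3.0+ and the same columns)."""
    # Step 4: the combined result comes back as a new Spark DataFrame.
    return spark_df.groupBy("showroom").applyInPandas(
        yearly_sales, schema="showroom string, yearly_sales long")
```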

Example
• Example 5.6 explains a GVUDF for adding 10% to the cost of a group of rows for toy products.

Summary
We learnt:
• Python integration with Spark
• Spark support for Python UDFs
• Apache Arrow support in Spark for VUDFs and GVUDFs
• Pandas analytics tools in Python

End of Lesson 5 on
Python and its Libraries with
Spark for Data Analysis

