BDA Ch 05 L05 Python Libraries for Analysis
“Big Data Analytics “, Ch.05 L05: Spark and Big Data Analytics
2019 1
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
Python
• A general-purpose, interpreted,
interactive, object-oriented and
high-level programming language
• Defines the basic data types,
containers (lists, dictionaries, sets,
tuples), functions and classes
• Expressive programming statements
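A minimal sketch of the built-in containers, a function and a class named above. All names and values here (prices, Product, and so on) are invented for illustration and are not from the lesson.

```python
# Basic built-in containers (the values are made up for illustration).
prices = [120.0, 80.5, 99.0]          # list: ordered, mutable
point = (3, 4)                        # tuple: ordered, immutable
tags = {"toy", "puzzle", "toy"}       # set: duplicates collapse
stock = {"puzzle": 12, "car": 3}      # dictionary: key -> value


def total(values):
    """Return the sum of a sequence of numbers."""
    return sum(values)


class Product:
    """A small class bundling a name with a price."""

    def __init__(self, name, price):
        self.name = name
        self.price = price

    def discounted(self, percent):
        """Return the price after taking off `percent` percent."""
        return self.price * (100 - percent) / 100


item = Product("puzzle", 200.0)
expensive = [p for p in prices if p > 90]   # list comprehension
```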
Python Libraries
• Extensive Python Standard Library
• Libraries for regular expressions
• Documentation generation
• Unit testing
• Web browsers
• Threading
Python Libraries
• Databases
• CGI
• Email
• Image manipulation
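As a small illustration of two of the listed areas, the sketch below uses only the standard library modules re (regular expressions) and sqlite3 (databases). The sample text, table name and column names are assumptions made for the example.

```python
import re
import sqlite3

# Regular expressions: pull the numeric costs out of free text.
line = "puzzle A costs 12.50 USD, puzzle B costs 8 USD"
costs = [float(m) for m in re.findall(r"(\d+(?:\.\d+)?) USD", line)]

# Databases: sqlite3 ships with Python; ":memory:" keeps the data in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE puzzle (name TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO puzzle VALUES (?, ?)",
    [("A", costs[0]), ("B", costs[1])],
)
total_cost = conn.execute("SELECT SUM(cost) FROM puzzle").fetchone()[0]
conn.close()
```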
Python and Spark Binding
• Gives a strong combination of
performance and features in the same
code base
• Spark SQL binds with Python easily
Python and Spark Binding
• Spark SQL features together with
Python help a programmer to build
challenging applications for Big Data
Spark added Python API support for
UDFs
• These UDFs take one row at a time,
which incurs overhead for serialization
and deserialization (SerDe)
• To avoid this overhead, programmers
defined the UDFs in Java or Scala, and
then invoked them from Python
Spark 2.3 Arrow Support for VUDFs
and GVUDFs
• Adds support for vectorized UDFs
(VUDFs) and grouped vectorized UDFs
(GVUDFs)
• Spark together with Apache Arrow
facilitates VUDFs, which enable
high-performance Python UDFs for
SerDe and data pipelines
• Provisions statistical functions
Python for data analysis and Plotting
• NumPy for numerical (Num) analysis
• SciPy for scientific (Sci) computations
• Scikit-learn
• Pandas
• StatsModels
• matplotlib functions for plotting the
mathematical functions
Python Pandas for Panel data
(Grouped Vectors Data) Analytics
• An open-source Python package of
BSD-licensed library functions; the
name Pandas derives from Panel Data
• Pandas gives high-performance,
easy-to-use data structures and data
analysis tools
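A brief sketch of the Pandas DataFrame structure and a few analysis operations on it. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: column names and values are invented for illustration.
df = pd.DataFrame({
    "product": ["puzzle", "car", "puzzle", "doll"],
    "cost_usd": [10.0, 25.0, 14.0, 8.0],
})

mean_cost = df["cost_usd"].mean()        # statistic over one column
costly = df[df["cost_usd"] > 9.0]        # boolean row filter
per_product = df.groupby("product")["cost_usd"].sum()   # grouped total
```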
Figure 5.7 Main features of Pandas for data analysis
Support for VUDFs and GVUDFs
StatsModels and NumPy
• StatsModels provisions statistical
functions
• NumPy includes (i) N-dimensional
array objects and vector mathematics;
(ii) linear algebraic functions, Fourier
transform and random number
functions; (iii) sophisticated
(broadcasting) functions; and (iv)
integration with C and Fortran code
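The NumPy features listed above can be sketched as follows. The array values are arbitrary, and the seeded generator is just one way to draw repeatable random numbers.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2-D array object
row = np.array([10.0, 20.0])

# Broadcasting: the 1-D row is stretched across both rows of `a`.
shifted = a + row

# Linear algebra: solve a @ x = b for x.
b = np.array([5.0, 11.0])
x = np.linalg.solve(a, b)

# Fourier transform of a constant signal: all energy in the zero bin.
spectrum = np.fft.fft(np.ones(4))

# Random numbers from a seeded generator, so runs are repeatable.
rng = np.random.default_rng(seed=0)
samples = rng.random(3)
```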
NumPy
• Table 5.5 gives examples of NumPy
functions for data analysis problems
• NumPy provides efficient
multi-dimensional containers of generic
data and definitions of arbitrary data types.
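One way to read "definitions of arbitrary data types" is NumPy's structured dtype, sketched below. The field names and records are invented for illustration.

```python
import numpy as np

# A structured dtype: each element holds a name and a cost.
# Field names and values are invented for illustration.
puzzle_dtype = np.dtype([("name", "U10"), ("cost", "f8")])
puzzles = np.array([("A", 12.5), ("B", 8.0)], dtype=puzzle_dtype)

costs = puzzles["cost"]        # one field across all records
grid = np.zeros((2, 3, 4))     # a 3-dimensional container
```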
NumPy
• Integrates easily with a wide variety
of databases
• Provides import and export
(load/save) of files
• Creation of arrays
• Inspection of array properties
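A short sketch of array creation, property inspection, and a save/load round trip. The file name steps.csv and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

# Creation of arrays.
zeros = np.zeros((2, 3))
steps = np.arange(0, 10, 2)          # 0, 2, 4, 6, 8
grid = np.linspace(0.0, 1.0, 5)      # 5 evenly spaced points

# Inspection of properties.
shape, ndim, size, dtype = zeros.shape, zeros.ndim, zeros.size, zeros.dtype

# Export and import through a text file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "steps.csv")
    np.savetxt(path, steps, delimiter=",")
    loaded = np.loadtxt(path, delimiter=",")
```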
NumPy
• Copying, sorting and reshaping of
arrays
• Addition and removal of elements in
the arrays
• Indexing, sub-setting and slicing of
the arrays
• Scalar and vector mathematics (such
as +, −, ×, ÷, power, sqrt, sin, log)
• Rounding: ceil (round up to the
nearest int), floor (round down to the
nearest int), round (round to the
nearest integer)
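The array operations listed above can be sketched on a small array as follows. The values are arbitrary, and note that NumPy rounds ties to the nearest even value, so 2.5 rounds to 2.

```python
import numpy as np

arr = np.array([3.7, -1.2, 2.5, 0.4])

copy = arr.copy()                # independent copy
ordered = np.sort(arr)           # sorted copy, ascending
matrix = arr.reshape(2, 2)       # same data viewed as 2 x 2
longer = np.append(arr, 9.0)     # add an element at the end
shorter = np.delete(arr, 1)      # remove the element at index 1
first_two = arr[:2]              # slicing
positive = arr[arr > 0]          # sub-setting with a boolean mask

squares = np.power(arr, 2)       # element-wise power
roots = np.sqrt(positive)        # element-wise square root
up = np.ceil(arr)                # round up
down = np.floor(arr)             # round down
nearest = np.round(arr)          # round to nearest, ties go to even
```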
SciPy
• Builds on top of NumPy
• SciPy defines some useful functions
for computing distances between a set
of points
• Includes input/output for MATLAB
files, special functions, and routines
for numerical integration and
optimization
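A minimal sketch of the SciPy capabilities mentioned above: pairwise distances, numerical integration and optimization. The points and the functions being integrated and minimized are chosen only for illustration.

```python
import numpy as np
from scipy import integrate, optimize
from scipy.spatial.distance import pdist, squareform

# Pairwise Euclidean distances between three 2-D points.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
distances = squareform(pdist(points, metric="euclidean"))

# Numerical integration of x**2 over [0, 3]; the exact value is 9.
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 3.0)

# Optimization: minimize (x - 2)**2, whose minimum is at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
```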
User-Defined Functions (UDFs)
• Spark SQL registers the UDFs and
calls them
• Exposes advanced functionality to
SQL users
• User code calls UDFs within SQL
statements without rewriting the
detailed code
Example of Using UDFs
• Example 5.4 explains creation of a
UDF, udfCostPlus(), in pandas
• The table column puzzleCost is
created from an RDD built using
jigsaw_puzzle_info.txt
• The UDF increases the costs in the
column puzzle_cost_USD by 10%.
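The sketch below is not the book's Example 5.4; it only illustrates the idea of a row-at-a-time function that raises a cost by 10%. The function name udf_cost_plus, the field puzzle_code and the row values are hypothetical, and the Spark registration is shown only as a comment because it assumes PySpark and a running SparkSession.

```python
def udf_cost_plus(cost):
    """Return `cost` increased by 10%, rounded to cents."""
    return round(cost * 1.1, 2)


# Hypothetical rows standing in for the puzzle table; values are invented.
rows = [
    {"puzzle_code": "P1", "puzzle_cost_USD": 10.0},
    {"puzzle_code": "P2", "puzzle_cost_USD": 25.0},
]

# A row-at-a-time UDF is called once per row, as in this loop.
increased = [udf_cost_plus(row["puzzle_cost_USD"]) for row in rows]

# With PySpark installed and a SparkSession named `spark` (assumed here),
# the same function could be registered for use in SQL statements:
#   spark.udf.register("udfCostPlus", udf_cost_plus, "double")
```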
Vectorized User Defined Functions
(VUDFs)
• Apache Arrow support in Spark
facilitates columnar in-memory
analytics, which results in high
performance of Python UDFs, SerDe
and data pipelines
• Example 5.5 explains creation of a
vectorized UDF (VUDF)
Creation of a vectorized UDF
(VUDF)
• First define a pandas_UDFCostPlus
for increasing the cost puzzle_cost_USD
of toys in the puzzle_Costs RDD created
from jigsaw_puzzle_info.txt.
VUDF Code Example
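    from pyspark.sql.functions import pandas_udf

    @pandas_udf(returnType='double')
    def vectorized_plusTenPercent(v):
        # v is a pandas.Series holding a batch of column values
        return v * 1.1  # increase each value by 10%

    # df is assumed to be a Spark DataFrame with a numeric column v
    df.withColumn('v4', vectorized_plusTenPercent(df.v))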
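To see what "vectorized" means without a Spark cluster, the sketch below runs the same 10% increase on a plain pandas.Series, once as a single whole-batch call and once element by element. The function name plus_ten_percent and the batch values are invented for illustration.

```python
import pandas as pd


def plus_ten_percent(v):
    """Vectorized: `v` is a whole pandas.Series, handled in one call."""
    return v * 1.1


# Hypothetical batch of costs; the values are invented.
batch = pd.Series([10.0, 20.0, 30.0])

vectorized = plus_ten_percent(batch)          # one call for the batch
row_wise = batch.map(lambda c: c * 1.1)       # one call per element
```

Both paths give the same numbers; the vectorized form avoids a Python function call per row, which is the overhead that VUDFs are meant to reduce.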
Grouped Vectorized UDFs
(GVUDFs)
• Uses the Pandas library's
split-apply-combine pattern in data
analysis
• Operates on all the data for a group,
for example, "for each car showroom,
compute yearly sales"
Step 1 for GVUDF
1. Splits a Spark DataFrame into
groups based on the conditions
specified in the groupBy operator
Step 2 for GVUDF
2. Applies a vectorized user-defined
function (pandas.DataFrame ->
pandas.DataFrame) to each group
Steps 3 and 4 in GVUDF
3. Combines the results from all the
groups
4. Returns the results as a new Spark
DataFrame
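The split, apply and combine steps can be sketched locally in pandas, using the car showroom case as the example. This is a plain pandas illustration rather than Spark code, and the DataFrame contents and the function name yearly_sales are invented.

```python
import pandas as pd

# Hypothetical monthly sales per car showroom; all values are invented.
sales = pd.DataFrame({
    "showroom": ["North", "South", "North", "South", "North"],
    "month": [1, 1, 2, 2, 3],
    "cars_sold": [12, 7, 9, 11, 14],
})


def yearly_sales(group):
    """pandas.DataFrame -> pandas.DataFrame for a single showroom."""
    return pd.DataFrame({
        "showroom": [group["showroom"].iloc[0]],
        "yearly_sales": [group["cars_sold"].sum()],
    })


# Step 1: split the rows into one DataFrame per showroom.
groups = [group for _, group in sales.groupby("showroom")]
# Step 2: apply the function to each group independently.
results = [yearly_sales(group) for group in groups]
# Steps 3 and 4: combine the per-group results into one new DataFrame.
combined = pd.concat(results, ignore_index=True)
```

In Spark 2.3 the same pattern would run on a Spark DataFrame through groupBy followed by a grouped map pandas UDF, with Spark doing the splitting and combining across the cluster.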
Example
• Example 5.6 explains a GVUDF for
adding 10% to the cost in a group of
rows for toy products.
Summary
We learnt:
• Python integration with Spark
• Spark support for Python UDFs
• Apache Arrow support in Spark for
VUDFs and GVUDFs
• Pandas analytics tools in Python
End of Lesson 5 on
Python and its Libraries with
Spark for Data Analysis