PySpark FP Course ID 58339
In this course, you will learn the concepts of Apache Spark using Spark's Python API, PySpark.
PySpark
RDD
DataFrame
If we are able to analyze big data properly, we gain a number of advantages.
The best tool to handle big data in real time and perform analysis is Spark.
Data
The Large Hadron Collider produces about 30 petabytes of data per year
The New York Stock Exchange generates about 4 terabytes of data per day
Apache Spark is an open-source, lightning-fast big data framework designed to enhance
computational speed.
More on Spark
Capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, S3, …
Has many other workflows, i.e. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey,
collect, count, first…
In-memory caching of data (for iterative, graph, and machine learning algorithms, etc.)
Features of Spark
The main feature is its in-memory cluster computing, which in turn increases the processing speed
Features Continued....
Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data
processing. It achieves this speed through controlled partitioning.
Deployment: It can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster
manager.
Real-Time: Spark's latency is very low and, in addition, it offers computation in real time. Both
of these are achieved by in-memory computation.
A Polyglot API
Spark has now become a polyglot framework that the user can interface with using Scala, Java,
Python, or the R language.
Scala lacks the breadth of data science libraries and tools that Python has.
These limitations in data analytics and data science led to a separate Python API for
Spark.
What is PySpark ?
PySpark is nothing but the Python API for Apache Spark.
It offers the PySpark shell, which connects the Python API to the Spark core
and in turn initializes the SparkContext.
In the PySpark shell, SparkContext is available as sc by default, so creating a new SparkContext won't work.
Py4J
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects
in a Java Virtual Machine.
Methods are called as if the Java objects resided in the Python interpreter, and Java
collections can be accessed through standard Python collection methods.
In the Python driver program, SparkContext uses Py4J to launch a JVM and create
a JavaSparkContext.
Py4J is used on the driver to establish local communication between the Python and Java
SparkContext objects.
By default, PySpark requires python to be available on the system PATH and uses it to run
programs.
All of PySpark's library dependencies, including Py4J, are bundled with PySpark and
automatically imported.
Getting Started
We can enter Spark's Python environment by running the following
command in the shell.
./bin/pyspark
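Once the shell starts, the pre-created SparkContext can be used right away. A minimal sketch (the exact values depend on your installation):
sc.version                                # the Spark version string
sc.parallelize([1, 2, 3, 4]).count()      # returns 4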
1. According to Spark advocates, how much faster can Apache Spark potentially run batch-
processing programs in memory than MapReduce can?
10 times
20 times
50 times
100 times - Answer
True
False - Answer
One can invoke the PySpark shell by running the _________ command in the shell.
./bin/pyspark - Answer
./bin/spark-shell
4. For any Spark functionality, the entry point is SparkContext.
True - Answer
Resilient Distributed Datasets (RDDs) are the main abstraction in Spark.
It is a partitioned collection of objects spread across a cluster, and can be persisted in memory or
on disk.
Features Of RDDs
Resilient, i.e. tolerant to faults using the RDD lineage graph and therefore able to
recompute damaged or missing partitions due to node failures.
Distributed, since the data resides on multiple nodes across a cluster.
Dataset - a collection of partitioned data with primitive values or values of values, for example, records
or tuples.
Creating RDDs
Parallelizing a collection in the driver program.
E.g., here is how to create a parallelized collection holding the
numbers 1 to 5:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
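Once created, the distributed dataset (distData) can be operated on in parallel. A minimal sketch of such an operation:
distData.reduce(lambda a, b: a + b)       # sums the elements and returns 15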
Creating RDDs
Referencing one dataset in an external storage system, like
a shared filesystem, HBase, HDFS, or any data source providing a
Hadoop InputFormat.
For example, text file RDDs can be created using
SparkContext's textFile method.
This method takes a URI for the file (a local path on the machine, or an
hdfs://, s3n://, etc. URI) and reads it as a collection of
lines to produce the RDD.
distFile = sc.textFile("data.txt")
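Once created, distFile can be acted on by dataset operations. A small sketch (data.txt is the illustrative file from above) that adds up the sizes of all the lines:
lineLengths = distFile.map(lambda s: len(s))          # transformation: length of each line
totalLength = lineLengths.reduce(lambda a, b: a + b)  # action: total number of characters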
RDD Operations
RDDs support two types of operations: transformations, which
create a new dataset from an existing one, and actions, which return a
value to the driver program after running a computation on the
dataset.
For example, map is a transformation that passes each dataset element
through a function and returns a new RDD representing the results.
Similarly, reduce is an action that aggregates all RDD elements
using some function and then returns the final result to the driver
program.
Transformations are functions that use an RDD as the input and return one or more RDDs as the
output.
randomSplit, cogroup, join, reduceByKey, filter, and map are a few examples of transformations.
Transformations do not change the input RDD, but always create one or more new RDDs by
utilizing the computations they represent.
By using transformations, you incrementally create an RDD lineage with all the parent RDDs of
the last RDD.
Transformations are lazy, i.e. they are not run immediately; they are executed on demand.
Examples Of Transformations
filter(func): Returns a new dataset (RDD) created by choosing the elements of the
source on which the function returns true.
map(func): Passes each element of the RDD through the supplied function.
union(): The new RDD contains the elements of both the source RDD and the argument RDD.
intersection(): The new RDD includes only the elements common to the source RDD and the argument RDD.
cartesian(): The new RDD is the cross product of all elements from the source RDD and the argument RDD.
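A short sketch of these transformations on an illustrative RDD of numbers; note that nothing is computed yet, since transformations are lazy:
nums = sc.parallelize([1, 2, 3, 4, 5])
evens = nums.filter(lambda x: x % 2 == 0)       # keep elements where the function returns true
squares = nums.map(lambda x: x * x)             # pass each element through the function
both = evens.union(squares)                     # elements of both RDDs
common = evens.intersection(squares)            # only the common elements
pairs = evens.cartesian(squares)                # cross product of the two RDDs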
Actions
Actions trigger execution: the lineage graph is used to load the data into the original RDD, then
all intermediate transformations are executed and the final result is written out to the file system or
returned to the driver program.
count, collect, reduce, take, and first are a few actions in Spark.
Example of Actions
reduce(func): Aggregates the data elements of an RDD using a function that takes two
arguments and returns one.
take(n): Fetches the first n data elements of an RDD and returns them to the driver program.
foreach(func): Executes the function for each data element in the RDD; usually used to update an
accumulator or to interact with external systems.
saveAsTextFile(path): Writes the content of the RDD to a text file or a set of text files on the local file
system/HDFS.
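A minimal sketch of these actions on an illustrative RDD (the output path is hypothetical):
nums = sc.parallelize([1, 2, 3, 4, 5])
nums.reduce(lambda a, b: a + b)                 # 15: aggregates all elements
nums.take(3)                                    # [1, 2, 3]: first three elements to the driver
nums.foreach(lambda x: print(x))                # runs the function on every element
nums.saveAsTextFile("output/nums")              # writes the RDD as text files (hypothetical path)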
What is a DataFrame ?
In general, a DataFrame can be defined as a data structure that is
tabular in nature. It consists of rows, each of which contains a
number of observations.
Rows can have a variety of data formats (heterogeneous), whereas a
column holds data of the same data type (homogeneous).
DataFrames also contain some metadata in addition to the data, such as column and
row names.
Features of DataFrame
Distributed
Lazy Evaluation
Immutable
Features Explained
DataFrames are distributed in nature, which makes them a fault-tolerant and highly available data
structure.
Lazy evaluation is an evaluation strategy that holds the evaluation of an expression until its
value is needed.
DataFrames are immutable in nature, which means they are objects whose state cannot be
modified after they are created.
DataFrame Sources
For constructing a DataFrame, a wide range of sources is available, such as:
Tables in Hive
External Databases
Existing RDDs
Spark SQL
Spark introduces a programming module for structured data
processing called Spark SQL.
It provides a programming abstraction called DataFrame and can act
as a distributed SQL query engine.
The main capabilities Spark SQL offers for structured and semi-structured data include:
Spark SQL can read and write data from Hive tables, JSON, and Parquet in various structured
formats.
For more details about Spark SQL, refer to the Fresco course Spark SQL.
More On Classes
More On Creation
Import the sql module from pyspark
from pyspark.sql import *
Student = Row("firstName", "lastName", "age", "telephone")
s1 = Student('David', 'Julian', 22, 100000)
s2 = Student('Mark', 'Webb', 23, 658545)
StudentData = [s1, s2]
df = spark.createDataFrame(StudentData)
df.show()
Result of show()
Once show() is executed, we can view the following result in
the PySpark shell
+---------+--------+---+---------+
|firstName|lastName|age|telephone|
+---------+--------+---+---------+
| David| Julian| 22| 100000|
| Mark| Webb| 23| 658545|
+---------+--------+---+---------+
1. DataFrames being immutable in nature implies they are a fault-tolerant and highly available data
structure.
True - Answer
False
2. Spark SQL can read and write data from Hive Tables.
False
True - Answer
foreach(func) - Answer
map(func)
cartesian()
union()
collect()
take(n)
groupByKey([numPartitions]) - Answer
count()
Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface.
A DataFrame can be operated on using relational transformations and can also be used to create
a temporary view.
Registering a DataFrame as a temporary view allows you to run SQL queries over its data.
This chapter describes the general methods for loading and saving data using the Spark Data
Sources.
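For instance, a DataFrame can be registered as a temporary view and then queried with SQL. A small sketch, where people.json and the column names are illustrative:
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 20").show()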
Apache Parquet
Apache Parquet is a columnar storage format available to all projects in the Hadoop ecosystem,
irrespective of the choice of the framework used for data processing, the model of data or
programming language used.
Spark SQL provides support for both reading and writing Parquet files.
When writing Parquet files, all columns are automatically converted to be nullable. This is done
for compatibility reasons.
Compared with the traditional row-oriented way of storing data, Parquet's columnar layout is
more efficient.
Parquet is a common choice for big data because it serves both needs: efficiency and performance in
both storage and processing.
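A brief sketch of reading and writing Parquet with Spark SQL (the file name is illustrative):
df.write.parquet("people.parquet")                # write a DataFrame as Parquet
parquetDF = spark.read.parquet("people.parquet")  # read it back into a DataFrame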
Binary
Columnar - Answer
Spark SQL does not provide support for both reading and writing Parquet files.
True
False – Answer
In this chapter, you will learn how to perform some advanced operations on DataFrames.
CSV is a file format which allows the user to store data in a tabular format.
CSV stands for comma-separated values.
CSV Loading
To load a CSV dataset, the user has to make use of the spark.read.csv method to load it into a DataFrame.
Here we are loading a football player dataset using the Spark CSV reader.
CSV Loading
inferSchema (default false): Infers the input schema automatically from the data.
header (default false): Uses the first line of the file as the column names.
The argument 2 passed to show() will display the first two rows of the resulting DataFrame.
For every example from now on, we will be using the football player DataFrame.
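A sketch of this loading step; the file name players.csv is an assumption, while inferSchema and header are the options just described:
df = spark.read.csv("players.csv", inferSchema=True, header=True)
df.show(2)                                      # the argument 2 displays the first two rows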
Schema of DataFrame
printSchema lists the different columns in our DataFrame, along with their data types and nullable conditions.
df.printSchema()
root
To find the column names and the counts of rows and columns, we can use the following
methods.
df.columns
['ID', 'Name', 'Age', 'Nationality', 'Overall', 'Potential', 'Club', 'Value', 'Wage', 'Special']
Row count
df.count()
17981
Column count
len(df.columns)
10
To get the summary of any particular column, make use of the describe method.
This method gives the statistical summary of the given column; if no column is specified, it provides
the statistical summary of the whole DataFrame.
df.describe('Name').show()
+-------+-------------+
|summary| Name|
+-------+-------------+
| count| 17981|
| mean| null|
| stddev| null|
| min| A. Abbas|
| max|Óscar Whalley|
+-------+-------------+
df.describe('Age').show()
+-------+------------------+
|summary| Age|
+-------+------------------+
| count| 17981|
| mean|25.144541460430453|
| stddev| 4.614272345005111|
| min| 16|
| max| 47|
+-------+------------------+
For selecting particular columns from the DataFrame, one can use the select method.
show() is optional.
One can load the result into another DataFrame by simply assigning it, i.e.
Selection Operation
Selecting the columns ID and Name and loading the result into a new DataFrame.
dfnew = df.select('ID', 'Name')
dfnew.show(5)
+------+-----------------+
| ID| Name|
+------+-----------------+
| 20801|Cristiano Ronaldo|
|158023| L. Messi|
|190871| Neymar|
|176580| L. Suárez|
|167495| M. Neuer|
+------+-----------------+
Filtering Data
df.filter(df.Club=='FC Barcelona').show(3)
(output table: the first 3 rows of players whose Club is 'FC Barcelona')
Only the top 3 rows are shown, since we passed 3 as the argument to show().
In PySpark, sorting is done in ascending order by default, but we can change it to descending order as well.
Sorting
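The orderBy call itself does not appear in the notes; a hedged sketch of what such an operation could look like on the football player DataFrame (the choice of column is illustrative):
df.orderBy(df.Age).show(3)                      # ascending order by default
df.orderBy(df.Age, ascending=False).show(3)     # descending order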
The result of the first order by operation results in the following output.
(output table: the rows of the DataFrame returned by the orderBy operation)
DataFrames were introduced in Spark 1.3 to make operations with Spark easier.
Random Data Generation
Random data generation is useful when we want to test algorithms and to implement new ones.
In Spark, under sql.functions, we have methods to generate random data, e.g., uniform (rand) and
standard normal (randn).
from pyspark.sql.functions import rand

df = spark.range(0, 7)
df1 = df.select("id").orderBy(rand()).limit(4)
df1.show()
+---+
| id|
+---+
| 0|
| 3|
| 6|
| 1|
+---+
By using the uniform distribution and the normal distribution, we generate two more columns.
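The code that adds these columns is not shown in the notes; a plausible sketch using rand and randn (the seeds and aliases are illustrative):
from pyspark.sql.functions import rand, randn

df = spark.range(0, 7)
df = df.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df.show()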
+---+-------------------+-------------------+
| id|            uniform|             normal|
+---+-------------------+-------------------+
| 0|0.41371264720975787| 0.5888539012978773|
| 1| 0.1982919638208397|0.06157382353970104|
| 2|0.12030715258495939| 1.0854146699817222|
| 3|0.44292918521277047|-0.4798519469521663|
+---+-------------------+-------------------+
The first operation to perform after importing data is to get some sense of what it looks like.
The function describe returns a DataFrame containing information such as number of non-null entries
(count), mean, standard deviation, and minimum and maximum value for each numerical column.
df.describe('uniform', 'normal').show()
+-------+-------------------+--------------------+
|summary|            uniform|              normal|
+-------+-------------------+--------------------+
| mean| 0.3841685645682706|-0.15825812884638607|
| stddev|0.31309395532409323| 0.963345903544872|
| min|0.03650707717266999| -2.1591956435415334|
+-------+-------------------+--------------------+
Descriptive Statistics
In the same way, we can also make use of some standard statistical functions.
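The call producing the output below is not included in the notes; a plausible sketch using standard aggregate functions on the uniform column created above:
from pyspark.sql.functions import mean, min, max

df.select(mean("uniform"), min("uniform"), max("uniform")).show()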
+------------------+-------------------+------------------+
|      avg(uniform)|       min(uniform)|      max(uniform)|
+------------------+-------------------+------------------+
|0.3841685645682706|0.03650707717266999|0.8898784253886249|
+------------------+-------------------+------------------+
In statistics, covariance indicates how one random variable changes with respect to another.
More On Co-Variance
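The rand1 and rand2 columns used below are not defined in the notes; a sketch of how such a DataFrame could be built (the seeds are illustrative):
from pyspark.sql.functions import rand

df = spark.range(0, 10).withColumn("rand1", rand(seed=10)).withColumn("rand2", rand(seed=27))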
df.stat.cov('rand1', 'rand2')
0.031109767020625314
From the above, we can infer that the covariance of the two random columns is close to zero.
Correlation
df.stat.corr('rand1', 'rand2')
0.30842745432650953
Cross tabulation provides a frequency distribution table for a given set of variables.
It is one of the powerful tools in statistics for observing the statistical independence of variables.
Consider an example (a contingency table).
To apply cross tabulation, we can make use of the crosstab method.
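The name/item DataFrame used below is not shown in the notes; a small sketch of how such purchase data could be created (the names and items are illustrative):
names = ["Alice", "Bob", "Mike"]
items = ["milk", "bread", "butter", "apples", "oranges"]
df = spark.createDataFrame([(names[i % 3], items[i % 5]) for i in range(100)], ["name", "item"])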
df.stat.crosstab("name", "item").show()
+---------+------+-----+------+----+-------+
|name_item|apples|bread|butter|milk|oranges|
+---------+------+-----+------+----+-------+
| Bob| 6| 7| 7| 6| 7|
| Mike| 7| 6| 7| 7| 6|
| Alice| 7| 7| 6| 7| 7|
+---------+------+-----+------+----+-------+
Step 4: From the DataFrame, display the associates who are mapped to ‘JAVA’ stream. Save the resultant
DataFrame to a parquet file with name JavaEmployees.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("employees.json")
df.show()
df.write.parquet("Employees")

# Step 4: filter the associates mapped to the 'JAVA' stream
# (the column name 'Stream' is an assumption about the JSON structure)
java_employees_df = df.filter(df.Stream == "JAVA")
java_employees_df.show()
java_employees_df.write.parquet("JavaEmployees")
Step 3: Create 10 random values as a column and name the column rand1
Step 4: Create another 10 random values as a column and name the column rand2
Step 5: Create a new DataFrame with the header names "Stats" and "Value"
Step 6: Fill the new DataFrame with the obtained values as "Co-variance" and "Correlation"
Step 7: Save the resultant DataFrame to a CSV file with the name Result
Note:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, lit
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
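A minimal sketch of Steps 3 to 7, relying on the imports listed in the Note above; the 10-row base DataFrame, the seeds, and the output options are assumptions:
spark = SparkSession.builder.getOrCreate()

# Steps 3 and 4: a base DataFrame with two columns of random values (assumed construction)
df = spark.range(0, 10) \
    .withColumn("rand1", rand(seed=10)) \
    .withColumn("rand2", rand(seed=27))

# Compute covariance and correlation between the two random columns
cov = df.stat.cov("rand1", "rand2")
corr = df.stat.corr("rand1", "rand2")

# Steps 5 and 6: a new DataFrame with headers "Stats" and "Value"
schema = StructType([
    StructField("Stats", StringType(), True),
    StructField("Value", DoubleType(), True)
])
stats_df = spark.createDataFrame([("Co-variance", cov), ("Correlation", corr)], schema)

# Step 7: save the resultant DataFrame to a CSV file named Result
stats_df.coalesce(1).write.csv("Result", header=True, mode="overwrite")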
Apache Hive
The Apache Hive data warehouse software allows reading, writing, and managing large datasets residing
in distributed storage and querying them using SQL syntax.
Features of Apache Hive
Apache Hive is built on top of Apache Hadoop.
The below mentioned are the features of Apache Hive.
Apache Hive provides tools that allow easy and quick access to data using SQL, thus enabling data
warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
Mechanisms for imposing structure on a variety of data formats.
Access to files stored either directly in Apache HDFS or in other data storage systems such
as Apache HBase.
Features Of Apache Hive
Query execution via Apache Tez, Apache Spark, or MapReduce.
A procedural language with HPL-SQL.
Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
from os.path import abspath
from pyspark.sql import SparkSession

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
Creating Hive Table From Spark
We can easily create a table in the Hive warehouse programmatically from Spark.
The syntax for creating a table is as follows:
spark.sql("CREATE TABLE IF NOT EXISTS table_name(column_name_1 DataType,column_name_2
DataType,......,column_name_n DataType) USING hive")
To load a DataFrame into the table:
df.write.insertInto("table name", overwrite=True)
Now verify the result by using a select statement.
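For example (table_name is the placeholder used above), the inserted data can be verified with:
spark.sql("SELECT * FROM table_name").show()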
Hive External Table
3. If the schema of the table does not match with the data types present in the file
containing the table, then Hive ________
What is HBase ?
HBase is a distributed column-oriented data store built on top of HDFS.
HBase is an Apache open-source project whose goal is to
provide storage for Hadoop distributed computing.
Data is logically organized into tables, rows and columns.
More On HBase
Various Stages
Transformation
Cleaning
Validation
Writing of the data received from the various sources
Transformation And Cleaning
Data Transformation
Data Cleaning
Data Validation
In this stage, we can validate the data with respect to some standard validations such as length,
patterns and so on.
Writing
At last, data passed from the previous three stages is passed on to the writing application which
simply writes this final set of data to HBase for further data analysis.
Uber, the online taxi company, is an apt example of a Spark user. It gathers terabytes of
event data from its various users.
Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline.
Converts raw unstructured data into structured data as it is collected.
Uses it for further complex analytics and optimization of operations.
Pinterest leverages Spark Streaming to gain immediate insight into how users all over the world
are engaging with Pins in real time.
It can make more relevant recommendations as people navigate the site.
Recommends related Pins.
Helps determine which products to buy, or destinations to visit.
Conviva uses Spark to reduce customer churn by managing live video
traffic and optimizing video streams.
This helps them maintain a consistently smooth, high-quality viewing experience.
Capital One makes use of Spark and data science algorithms for a better understanding of its
customers.
Course Summary
Congratulations! You have completed the course on PySpark.
We covered Spark concepts, the need for
PySpark, RDDs, DataFrames, and Hive and HBase connections.
Now it's time to check your understanding
Step 1: Import the SparkSession package
Step 3: Create a DataFrame with the following details under the headers "ID", "Name",
"Age", "Area of Interest"
Step 5: Use the describe method on the Age column, observe the statistical parameters, and save the
data into a parquet file in a folder with the name "Age" inside /projects/challenge/.
Step 6: Select the columns ID, Name, and Age, with Name in descending order. Save
the resultant DataFrame into a parquet file with the name "NameSorted"
inside /projects/challenge/.
Note:
# Step 3: create the DataFrame (data and columns are defined as per the earlier steps)
df = spark.createDataFrame(data, schema=columns)

# Step 5: Use describe method on Age column and save the data into a parquet file
age_stats = df.select("Age").describe()
age_stats.show()  # To observe the statistical parameters
age_stats.coalesce(1).write.parquet("/projects/challenge/Age", mode="overwrite")

# Step 6: Select the columns ID, Name, and Age, sort by Name in descending order
sorted_df = df.select("ID", "Name", "Age").orderBy("Name", ascending=False)
sorted_df.coalesce(1).write.parquet("/projects/challenge/NameSorted", mode="overwrite")
5. Spark SQL can read and write data from Hive Tables.
Answer:-(1)True
6. Which of the following can be used to launch Spark jobs inside MapReduce ?
Please choose the correct option from the below list
(1)SIM
(2)SIR
(3)RIS
(4)SIMR
Answer:-(4)SIMR
(1)union()
(2)map(func)
(3)foreach(func)
(4)cartesian()
Answer:-(3)foreach(func)
According to Spark advocates, how much faster can Apache Spark potentially run batch-processing
programs in memory than MapReduce can?
Please choose the correct option from the below list
(1)50 times
(2)10 times
(3)20 times
(4)100 times
Answer:-(4)100 times
Answer:-(2)Columnar
Spark SQL does not provide support for both reading and writing Parquet files.
Please choose the correct option from below list
(1)True
(2)False
Answer:-(2)False
HBase is a distributed ________ database built on top of the Hadoop file system.
a) Column-oriented - Ans
b) Row-oriented
c) Tuple-oriented
d) None of the mentioned
In case of External Tables, data needs to remain in the underlying location even after the
user drops the table.
(1)False
(2)True
Answer:-(2)True
One can invoke the PySpark shell by running the _________ command in the shell.
Please choose the correct option from the below list
(1)./bin/spark-shell
(2)./bin/pyspark
Answer:-(2)./bin/pyspark
If the schema of the table does not match with the data types present in the file
containing the table, then Hive ________
Please choose the correct option from the below list
(1)Automatically drops the file
(2)Reports Null values for mismatched data
(3)Automatically corrects the file
df.count
df.rowcount()
df.rowCount()
df.count() - Answer
A distributed collection of data grouped into named columns is ___________
pyspark.sql.DataFrame - Answer
pyspark.sql.DataFrameNaFunctions
pyspark.sql.GroupedData
pyspark.sql.Window
Answer:-(1)groupByKey([numPartitions])
Registering a DataFrame as a ________ view allows you to run SQL queries over its data.
Permanent
Temporary – Ans
Select the correct statement.
RSS abstraction provides distributed task dispatching, scheduling, and basic I/O functionalities
Step 3: Read the JSON file and create a DataFrame with the JSON data. Display the DataFrame.
Save the DataFrame to a parquet file with the name Employees.
Step 4: From the DataFrame, display the associates who are mapped to the 'JAVA' stream. Save
the resultant DataFrame to a parquet file with the name JavaEmployees.
Note:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 3: Read the JSON file and create a DataFrame with the JSON data
json_file_path = "path_to_json_file"
df = spark.read.json(json_file_path)
print("Original DataFrame:")
df.show()
df.coalesce(1).write.mode("overwrite").parquet("Employees")

# Step 4: display the associates mapped to the 'JAVA' stream
# (the column name 'Stream' is an assumption about the JSON structure)
java_employees = df.filter(df.Stream == "JAVA")
java_employees.show()

# Save the resultant DataFrame to a Parquet file with the name 'JavaEmployees'
java_employees.coalesce(1).write.mode("overwrite").parquet("JavaEmployees")

spark.stop()