PySpark
It offers the PySpark Shell, which connects the Python API to the Spark core and in turn initializes the SparkContext.
More on PySpark
SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext.
By default, PySpark has SparkContext available as sc, so creating a new SparkContext won't work.
Py4J
PySpark is built on top of Spark's Java API.
Data is processed in Python and cached/shuffled in the JVM.
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine.
Methods are called as if the Java objects resided in the Python interpreter, and Java collections can be accessed through standard Python collection methods.
More on Py4J
In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
To establish local communication between the Python and Java SparkContext objects, Py4J is used on the driver.
Py4J ships with PySpark and is imported automatically.
Getting Started
We can enter Spark's Python environment by running the given command in the shell:

./bin/pyspark
This will start your PySpark shell.
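Inside the shell, the SparkContext is already available as sc and can be used directly. A minimal sketch (the sample data is an assumption):

rdd = sc.parallelize(range(10))   # distribute a small collection across the cluster
print(rdd.count())                # 10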
Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs) are the main abstraction in Spark.
An RDD is a partitioned collection of objects spread across a cluster, and it can be persisted in memory or on disk.
Once created, RDDs are immutable.
RDDs can be created in two ways:
1. Parallelizing a collection in the driver program.
2. Referencing a dataset in an external storage system, such as a shared filesystem, HBase, HDFS, or any data source providing a Hadoop InputFormat.
Features Of RDDs
Resilient - tolerant to faults using the RDD lineage graph, and therefore able to recompute damaged or missing partitions caused by node failures.
Dataset - a set of partitioned data with primitive values or values of values, for example records or tuples.
Distributed - data resides on multiple nodes in a cluster.
Creating RDDs
Parallelizing a collection in the driver program.
E.g., here is how to create a parallelized collection holding the numbers 1 to 5:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
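Once created, the parallelized collection can be operated on in parallel. For instance, a small sketch that sums its elements using the reduce action covered later in this section:

distData.reduce(lambda a, b: a + b)   # returns 15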
Creating RDDs
The textFile method takes a URI for the file (a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines to produce the RDD.
distFile = sc.textFile("data.txt")
RDD Operations
RDDs support two types of operations: transformations, which create a new
dataset from an existing one, and actions, which return a value to the driver program
after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a
function and returns a new RDD representing the results.
Similarly, reduce is an action that aggregates all the RDD elements using some function and returns the final result to the driver program.
More On RDD Operations
As a recap to RDD basics, consider the simple program shown below:
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
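Because transformations are lazy, lineLengths is only computed when the reduce action runs. If we also wanted to reuse lineLengths later, we could persist it before the reduce, e.g.:

lineLengths.persist()   # keep lineLengths in memory after it is first computed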
Transformations
Transformations are functions that use an RDD as the input and return one or
more RDDs as the output.
randomSplit, cogroup, join, reduceByKey, filter, and map are examples of a few transformations.
Transformations do not change the input RDD, but always create one or more new RDDs by utilizing the computations they represent.
By using transformations, you incrementally create an RDD lineage with all the parent RDDs of the last RDD.
Transformations are lazy, i.e. they are not run immediately; they are computed on demand.
Examples Of Transformations
filter(func): Returns a new dataset (RDD) formed by choosing the elements of the source on which the function returns true.
map(func): Passes each element of the RDD via the supplied function.
union(): The new RDD contains elements from the source RDD and the argument RDD.
intersection(): The new RDD includes only the common elements from the source RDD and the argument RDD.
cartesian(): The new RDD is the cross product of all elements from the source RDD and the argument RDD.
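A minimal sketch of these transformations in the PySpark shell; the sample RDD contents are assumptions chosen for illustration:

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5, 6])

evens = a.filter(lambda x: x % 2 == 0)   # keeps 2 and 4
squares = a.map(lambda x: x * x)         # 1, 4, 9, 16
both = a.union(b)                        # 1, 2, 3, 4, 3, 4, 5, 6
common = a.intersection(b)               # 3 and 4
pairs = a.cartesian(b)                   # (1, 3), (1, 4), ... all 16 pairs

print(common.collect())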
Actions
Actions return the final results of RDD computations.
Actions trigger execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and write the final results out to the file system or return them to the driver program.
count, collect, reduce, take, and first are a few actions in Spark.
Example of Actions
count(): Get the number of data elements in the RDD.
collect(): Get all the data elements in the RDD as an array.
reduce(func): Aggregate the data elements in the RDD using a function which takes two arguments and returns one.
take(n): Fetch the first n data elements of the RDD and return them to the driver program.
foreach(func): Execute the function for each data element in the RDD; usually used to update an accumulator or interact with external systems.
first(): Retrieve the first data element of the RDD. It is similar to take(1).
saveAsTextFile(path): Write the content of the RDD to a text file, or a set of text files, on the local file system or HDFS.
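A short sketch exercising these actions on a small RDD; the sample data and output path are assumptions:

nums = sc.parallelize([5, 3, 1, 4, 2])

print(nums.count())                     # 5
print(nums.collect())                   # [5, 3, 1, 4, 2]
print(nums.reduce(lambda a, b: a + b))  # 15
print(nums.take(3))                     # [5, 3, 1]
print(nums.first())                     # 5
nums.saveAsTextFile("nums_output")      # writes the RDD as text files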
What is a DataFrame?
In general, a DataFrame can be defined as a data structure which is tabular in nature.
It represents rows, each of which consists of a number of observations.
Rows can have a variety of data formats (heterogeneous), whereas a column can have data of the same data type (homogeneous).
DataFrames mainly contain some metadata in addition to data, like column and row names.
Why DataFrames?
They can process structured as well as semi-structured data.
They have the ability to handle petabytes of data.
In conclusion, a DataFrame is data organized into named columns.
Features of DataFrame
Distributed
Lazy Evals
Immutable
Features Explained
DataFrames are distributed in nature, which makes them a fault-tolerant and highly available data structure.
Lazy evaluation is an evaluation strategy which holds the evaluation of an expression until its value is needed.
DataFrames are immutable in nature, which means they are objects whose state cannot be modified after they are created.
DataFrame Sources
For constructing a DataFrame, a wide range of sources are available, such as:
Structured data files
Tables in Hive
External databases
Existing RDDs
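As a sketch of the last source, an existing RDD of tuples can be converted into a DataFrame; the column names and data here are assumptions, and spark is an existing SparkSession:

rdd = sc.parallelize([(1, "Alice"), (2, "Bob")])   # an existing RDD of tuples
df = spark.createDataFrame(rdd, ["id", "name"])    # name the columns explicitly
df.show()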
Spark SQL
Spark introduces a programming module for structured data processing called Spark
SQL.
Spark SQL provides the main capabilities for using structured and semi-structured data.
For more details about Spark SQL, refer to the Fresco course Spark SQL.
pyspark.sql.GroupedData: Aggregation methods, returned by DataFrame.groupBy().
More On Classes
pyspark.sql.DataFrameNaFunctions: Methods for handling missing data (null values).
pyspark.sql.DataFrameStatFunctions: Methods for statistics functionality.
pyspark.sql.functions: List of built-in functions available for DataFrame.
pyspark.sql.types: List of data types available.
pyspark.sql.Window: For working with window functions.
More On Creation
from pyspark.sql import Row

Student = Row("firstName", "lastName", "age", "telephone")
s1 = Student('David', 'Julian', 22, 100000)
s2 = Student('Mark', 'Webb', 23, 658545)
StudentData=[s1,s2]
df=spark.createDataFrame(StudentData)
df.show()
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
passenger = Row("Name", "age", "source", "destination")
s1 = passenger('David', 22, 'London', 'Paris')
s2 = passenger('Steve', 22, 'New York', 'Sydney')
x = [s1,s2]
df1=spark.createDataFrame(x)
df1.show()
Result of show()
Once show() is executed, we can view the following result in the PySpark shell:
+---------+--------+---+---------+
|firstName|lastName|age|telephone|
+---------+--------+---+---------+
|    David|  Julian| 22|   100000|
|     Mark|    Webb| 23|   658545|
+---------+--------+---+---------+
Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface.
A DataFrame can be operated on using relational transformations and can also be registered as a temporary view, which allows running SQL queries over its data.
This chapter describes the general methods for loading and saving data using the Spark Data Sources.

df = spark.read.load("file path")
# Spark loads the data source from the defined file path

df.select("column name", "column name").write.save("file name")
# The DataFrame is saved in the defined format
# By default it is saved in the Spark Warehouse

The file path can be on the local machine as well as on HDFS.
Manually Specifying Options
You can also manually specify the data source that will be used, along with any extra options that you would like to pass to the data source.
Data sources are specified by their fully qualified name, but for built-in sources you can also use their short names (json, parquet, jdbc, orc, libsvm, csv, text).
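For example, a brief sketch of loading and saving with an explicitly specified source; the file names and column names are assumptions:

df = spark.read.load("people.json", format="json")
df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")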
Apache Parquet
Spark SQL provides support for both reading and writing Parquet files.
Automatic conversion to nullable occurs when one tries to write Parquet files, for compatibility reasons.
Reading A Parquet File
Here we are loading a json file into a DataFrame, writing it out as Parquet, and querying it with Spark SQL.

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("emp.json")
df.show()
df.write.parquet("Employees")
df.createOrReplaceTempView("data")
res = spark.sql("select age,name,stream from data where stream='JAVA'")
res.show()
res.write.parquet("JavaEmployees")
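The Parquet output written above can be read back into a DataFrame; a brief sketch:

emp = spark.read.parquet("Employees")   # read the Parquet files written earlier
emp.show()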
Parquet is the choice for Big Data because it serves both needs: efficiency and performance in both storage and processing.
What is a CSV file?
CSV is a file format which allows the user to store the data in tabular format.
CSV stands for comma-separated values.
Its data fields are most often separated, or delimited, by a comma.
CSV Loading
To load a CSV dataset, the user has to make use of the spark.read.csv method to load it into a DataFrame.
Here we are loading a football player dataset using the Spark csv reader.
inferSchema (default false): infers the input schema automatically from the data.
header (default false): uses the first line as column names.
To verify, we can run df.show(2).
The argument 2 will display the first two rows of the resulting DataFrame.
For every example from now onwards, we will be using the football player DataFrame.
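A sketch of the load described above; the file name is an assumption:

df = spark.read.csv("football_players.csv", inferSchema=True, header=True)
df.show(2)   # display the first two rows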
Schema of DataFrame
What is meant by schema? It's just the structure of the DataFrame.
To check the schema, one can make use of the printSchema method.
It results in the different columns in our DataFrame, along with the datatype and the nullable conditions.
df.printSchema()
root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Overall: integer (nullable = true)
 |-- Potential: integer (nullable = true)
 |-- Club: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: integer (nullable = true)
df.columns
['ID', 'Name', 'Age', 'Nationality', 'Overall', 'Potential', 'Club', 'Value',
'Wage', 'Special']
Row count
df.count()
17981
Column count
len(df.columns)
10
Describing a Particular Column
To get the summary of any particular column, make use of the describe method.
This method gives the statistical summary of the given column; if not specified, it provides the statistical summary of the DataFrame.
df.describe('Name').show()
The result will be as shown below.
+-------+-------------+
|summary|         Name|
+-------+-------------+
|  count|        17981|
|   mean|         null|
| stddev|         null|
|    min|     A. Abbas|
|    max|Óscar Whalley|
+-------+-------------+
df.describe('Age').show()
+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|             17981|
|   mean|25.144541460430453|
| stddev| 4.614272345005111|
|    min|                16|
|    max|                47|
+-------+------------------+
Selecting Multiple Columns
For selecting particular columns from the DataFrame, one can use the select method.
The syntax for performing the selection operation is df.select("column name", "column name").show() (show() is optional).
One can load the result into another DataFrame by simply equating, i.e. dfnew = df.select(...).
Selection Operation
Selecting the columns ID and Name and loading the result to a new DataFrame:

dfnew=df.select('ID','Name')
dfnew.show(5)
+------+-----------------+
|    ID|             Name|
+------+-----------------+
| 20801|Cristiano Ronaldo|
|158023|         L. Messi|
|190871|           Neymar|
|176580|        L. Suárez|
|167495|         M. Neuer|
+------+-----------------+
only showing top 5 rows
Filtering Data
For filtering the data, the filter command is used.

df.filter(df.Club=='FC Barcelona').show(3)

The result will be as follows:
+------+----------+---+-----------+-------+---------+------------+------+-----+-------+
|    ID|      Name|Age|Nationality|Overall|Potential|        Club| Value| Wage|Special|
+------+----------+---+-----------+-------+---------+------------+------+-----+-------+
|158023|  L. Messi| 30|  Argentina|     93|       93|FC Barcelona| €105M|€565K|   2154|
|176580| L. Suárez| 30|    Uruguay|     92|       92|FC Barcelona|  €97M|€510K|   2291|
|168651|I. Rakitić| 29|    Croatia|     87|       87|FC Barcelona|€48.5M|€275K|   2129|
+------+----------+---+-----------+-------+---------+------------+------+-----+-------+
only showing top 3 rows, since we had given 3 as the argument to show().
Verify the same on your own.
To filter our data based on multiple conditions (AND or OR):
df.filter((df.Club=='FC Barcelona') & (df.Nationality=='Spain')).show(3)
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
|    ID|           Name|Age|Nationality|Overall|Potential|        Club| Value| Wage|Special|
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
|152729|          Piqué| 30|      Spain|     87|       87|FC Barcelona|€37.5M|€240K|   1974|
|    41|        Iniesta| 33|      Spain|     87|       87|FC Barcelona|€29.5M|€260K|   2073|
|189511|Sergio Busquets| 28|      Spain|     86|       86|FC Barcelona|  €36M|€250K|   1998|
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
only showing top 3 rows
Sorting
For sorting, the orderBy operation is used. The result of the first orderBy operation is shown in the following output.
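A sketch of an orderBy call consistent with the output below, sorting the filtered Spanish FC Barcelona players by Age in descending order (the exact call is an assumption):

df.filter((df.Club == 'FC Barcelona') & (df.Nationality == 'Spain')) \
  .orderBy(df.Age.desc()).show(5)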
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
|    ID|           Name|Age|Nationality|Overall|Potential|        Club| Value| Wage|Special|
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
|    41|        Iniesta| 33|      Spain|     87|       87|FC Barcelona|€29.5M|€260K|   2073|
|152729|          Piqué| 30|      Spain|     87|       87|FC Barcelona|€37.5M|€240K|   1974|
|189332|     Jordi Alba| 28|      Spain|     85|       85|FC Barcelona|€30.5M|€215K|   2206|
|189511|Sergio Busquets| 28|      Spain|     86|       86|FC Barcelona|  €36M|€250K|   1998|
|199564|  Sergi Roberto| 25|      Spain|     81|       86|FC Barcelona|€19.5M|€140K|   2071|
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
only showing top 5 rows
Random Data Generation
Random data generation is useful when we want to test algorithms and to implement new ones.
In Spark, under sql.functions we have methods to generate random data, e.g., uniform (rand) and standard normal (randn).
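The id column shown below can be produced with the range method; a minimal sketch, assuming an existing SparkSession named spark:

df = spark.range(0, 4)   # DataFrame with a single `id` column holding 0..3
df.show()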
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
+---+
By using the uniform distribution and the normal distribution, we generate two more columns.
df.select("id", rand(seed=10).alias("uniform"),
randn(seed=27).alias("normal")).show()
+---+-------------------+-------------------+
| id| uniform| normal|
+---+-------------------+-------------------+
| 0|0.41371264720975787| 0.5888539012978773|
| 1| 0.1982919638208397|0.06157382353970104|
In
| 2|0.12030715258495939| 1.0854146699817222|
ed
ed
| 3|0.44292918521277047|-0.4798519469521663|
+---+-------------------+-------------------+
nk
nk
Li
Li
Summary and Descriptive Statistics
The first operation to perform after importing data is to get some sense of what it looks like.
The function describe returns a DataFrame containing information such as the number of non-null entries (count), mean, standard deviation, and minimum and maximum value for each numerical column.
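A sketch of the describe call that produces a summary like the one below (column names as generated above):

df.describe('uniform', 'normal').show()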
+-------+-------------------+--------------------+
|summary|            uniform|              normal|
+-------+-------------------+--------------------+
|  count|                 10|                  10|
|   mean| 0.3841685645682706|-0.15825812884638607|
| stddev|0.31309395532409323|   0.963345903544872|
|    min|0.03650707717266999| -2.1591956435415334|
+-------+-------------------+--------------------+
Descriptive Statistics
In the same way, we can also make use of some standard statistical functions.
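For instance, a brief sketch using mean, min, and max from pyspark.sql.functions on the uniform column generated above:

from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()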
Sample Co-Variance and Correlation
In statistics, co-variance indicates how one random variable changes with respect to another.
A positive value indicates a trend of increase when the other increases.
A negative value indicates a trend of decrease when the other increases.
The sample co-variance of two columns of a DataFrame can be calculated as follows:
More On Co-Variance
from pyspark.sql.functions import rand
df = sqlContext.range(0, 10).withColumn('rand1', rand(seed=10)).withColumn('rand2', rand(seed=27))
df.stat.cov('rand1', 'rand2')
0.031109767020625314
From the above, we can infer that the co-variance of the two random columns is near zero.
Correlation
Correlation provides the statistical dependence of two random variables.
df.stat.corr('rand1', 'rand2')
0.30842745432650953
Cross Tabulation (Contingency Table)
Cross tabulation provides a frequency distribution table for a given set of variables.
It is one of the powerful tools in statistics to observe the statistical independence of variables.
Consider an example. A DataFrame with name and item columns can be built as follows (a sketch consistent with the crosstab output shown below):

names = ["Alice", "Bob", "Mike"]
items = ["milk", "bread", "butter", "apples", "oranges"]
df = sqlContext.createDataFrame([(names[i % 3], items[i % 5]) for i in range(100)], ["name", "item"])

For applying the cross tabulation, we can make use of the crosstab method.
df.stat.crosstab("name", "item").show()
+---------+------+-----+------+----+-------+
|name_item|apples|bread|butter|milk|oranges|
+---------+------+-----+------+----+-------+
|      Bob|     6|    7|     7|   6|      7|
|     Mike|     7|    6|     7|   7|      6|
|    Alice|     7|    7|     6|   7|      7|
+---------+------+-----+------+----+-------+
Putting it together - computing the co-variance and correlation of two random columns and writing the result as Parquet:

from pyspark.sql import Row
from pyspark.sql.functions import rand

df = sqlContext.range(0, 10).withColumn('rand1', rand(seed=10)).withColumn('rand2', rand(seed=27))
a = df.stat.cov('rand1', 'rand2')
b = df.stat.corr('rand1', 'rand2')
df1 = Row("Measure", "Value")   # assumed Row template for the result rows
s1 = df1("Co-variance", a)
s2 = df1("Correlation", b)
x = [s1, s2]
df2 = spark.createDataFrame(x)
df2.show()
df2.write.parquet("Result")
Spark SQL blurs the lines between RDDs and relational tables.
By integrating these powerful features, Spark makes it easy for developers to use SQL commands for querying external data with complex analytics, all within a single application.
Performing SQL Queries
We can also pass SQL queries directly to any DataFrame.
For that, we need to create a table from the DataFrame using the registerTempTable method.
After that, use sqlContext.sql() to pass the SQL queries.
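A minimal sketch; the view name and the query columns are assumptions based on the football player DataFrame used earlier:

df.registerTempTable("players")
sqlContext.sql("select Name, Age from players where Age > 30").show()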
Apache Hive
The Apache Hive data warehouse software allows reading, writing, and managing large datasets residing in distributed storage and querying them using SQL syntax.
Apache Hive is built on top of Apache Hadoop.
The features of Apache Hive are mentioned below.
Apache Hive has tools that allow easy and quick access to data using SQL, enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
Mechanisms for imposing structure on a variety of data formats.
Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
Query execution via Apache Tez, Apache Spark, or MapReduce.
A procedural language with HPL-SQL.
Sub-second query retrieval via Hive LLAP, Apache YARN, and Apache Slider.
What Hive Provides?
Apache Hive provides standard SQL functionality, which includes many of the later SQL:2003 and SQL:2011 features for analytics.
We can extend Hive's SQL with user code by using user-defined functions (UDFs), user-defined aggregates (UDAFs), and user-defined table functions (UDTFs).
Hive comes with built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet, Apache ORC, and other formats.
Spark SQL supports reading and writing data stored in Hive.
Connecting Hive From Spark
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
from os.path import abspath

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
Creating Hive Table From Spark
We can easily create a table in the hive warehouse programmatically from Spark.
The syntax for creating a table is as follows:
spark.sql("CREATE TABLE IF NOT EXISTS table_name(column_name_1
DataType,column_name_2 DataType,......,column_name_n DataType) USING hive")
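For instance, a brief sketch with hypothetical table and column names:

spark.sql("CREATE TABLE IF NOT EXISTS employees(id INT, name STRING, age INT) USING hive")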
External tables are used to store data outside the hive warehouse.
The data needs to remain in the underlying location even after the user drops the table.
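A hive external table can be created from Spark in a similar way by specifying a location outside the warehouse; a sketch with hypothetical names and path:

spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS emp_ext(id INT, name STRING) STORED AS PARQUET LOCATION '/user/hive/external/emp_ext'")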
Loading Data From Spark To The Hive Table
We can load data into a hive table from a DataFrame.
For doing so, the schema of both the hive table and the DataFrame should be equal.
Let us take a sample CSV file.
We can read the CSV file by making use of the Spark csv reader.
The schema of the DataFrame will be the same as the schema of the CSV file itself.
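A sketch of reading such a CSV file; the file name is an assumption:

df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df.printSchema()   # should match the schema of the hive table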
Data Loading To External Table
For loading the data we have to save the dataframe in external hive table location.
df.write.mode('overwrite').format("format").save("location")
Since our hive external table is in parquet format, in place of format we have to mention 'parquet'.
The location should be the same as the hive external table location in the HDFS directory.
If the schema matches, the data will be loaded automatically into the hive table.
We can verify it by querying the hive table.
What is HBase ?
HBase is an Apache open source project whose goal is to provide storage for Hadoop Distributed Computing.
More On HBase
HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original Bigtable paper.
Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through the REST, Avro, or Thrift gateway APIs.
It is a column-oriented key-value data store and has been widely adopted because of its lineage with Hadoop and HDFS.
HBase runs on top of HDFS and is well suited for fast read and write operations on large datasets with high throughput and low input/output latency.
How To Connect Spark and HBase
To connect, we require HDFS, Spark, and HBase installed on the local machine.
Make sure that your versions are compatible with each other.
Copy all the HBase jar files to the Spark lib folder.
Once done, set SPARK_CLASSPATH in spark-env.sh to the lib folder.
Real Time Pipeline using HDFS, Spark and HBase
Various Stages
It has 4 main stages, which include the transformation, cleaning, validation, and writing of the data received from the various sources.
Data Transformation
Data Cleaning
Writing
At last, the data passed from the previous three stages is passed on to the writing application, which simply writes this final set of data to HBase for further data analysis.
Spark In Real World
Uber – the online taxi company is an apt example for Spark. It gathers terabytes of event data from its various users.
It uses Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline.
It converts raw unstructured data into structured data as it is collected.
It uses this further for complex analytics and optimization of operations.
Pinterest – leverages Spark Streaming to gain immediate insight into how users all over the world are engaging with Pins in real time.
It can make more relevant recommendations as people navigate the site.
It recommends related Pins.
It determines which products to buy, or destinations to visit.
Conviva – uses Spark for reducing customer churn by managing live video traffic and optimizing video streams.
Spark In Real World
Capital One – makes use of Spark and data science algorithms for a better understanding of its customers.
from pyspark.sql import *
spark = SparkSession.builder.getOrCreate()
df = Row("ID","Name","Age","AreaofInterest")
s1 = df("1","Jack",22,"Data Science")
s2 = df("2","Leo",21,"Data Analytics")
s3 = df("3","Luke",24,"Micro Services")
s4 = df("4","Mark",21,"Data Analytics")
x = [s1,s2,s3,s4]
df1 = spark.createDataFrame(x)
df3 = df1.describe("Age")
df3.show()
df3.write.parquet("Age")
df1.createOrReplaceTempView("data")
df4 = spark.sql("select ID,Name,Age from data order by ID desc")
df4.show()
df4.write.parquet("NameSorted")