Chapter 3
PySpark DataFrames
Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
What are PySpark DataFrames?
PySpark SQL is a Spark library for structured data. It provides more information about the structure of the data and of the computation being performed.
Designed for processing both structured data (e.g. relational databases) and semi-structured data (e.g. JSON)
DataFrames in PySpark support both SQL queries ( SELECT * FROM table ) and expression methods ( df.select() )
DataFrames are created from various data sources (CSV, JSON, TXT) using the SparkSession's read method
The schema provides information about column names, the type of data in each column, empty values, etc.
pyspark.sql.dataframe.DataFrame
DataFrame operators in PySpark
DataFrame operations fall into two classes: transformations and actions
DataFrame transformations:
select(), filter(), groupBy(), orderBy(), dropDuplicates() and withColumnRenamed()
DataFrame actions:
printSchema(), head(), show(), count(), columns and describe()
df_age = test_df.select('Age')
df_age.show(3)
+---+
|Age|
+---+
| 17|
| 17|
| 17|
+---+
only showing top 3 rows
+-------+------+---+
|User_ID|Gender|Age|
+-------+------+---+
|1000002| M| 55|
|1000003| M| 26|
|1000004| M| 46|
+-------+------+---+
only showing top 3 rows
test_df_age_group = test_df.groupby('Age')
test_df_age_group.count().show(3)
+---+------+
|Age| count|
+---+------+
| 26|219587|
| 17| 4|
| 55| 21504|
+---+------+
only showing top 3 rows
test_df_age_group.count().orderBy('Age').show(3)
+---+-----+
|Age|count|
+---+-----+
| 0|15098|
| 17| 4|
| 18|99660|
+---+-----+
only showing top 3 rows
5892
+-------+---+---+
|User_ID|Sex|Age|
+-------+---+---+
|1000001| F| 17|
|1000001| F| 17|
|1000001| F| 17|
+-------+---+---+
test_df.printSchema()
test_df.columns
test_df.describe().show()
+-------+------------------+------+------------------+
|summary| User_ID|Gender| Age|
+-------+------------------+------+------------------+
| count| 550068|550068| 550068|
| mean|1003028.8424013031| null|30.382052764385495|
| stddev|1727.5915855307312| null|11.866105189533554|
| min| 1000001| F| 0|
| max| 1006040| M| 55|
+-------+------------------+------+------------------+
DataFrame API vs SQL queries
In PySpark, you can interact with Spark SQL through both the DataFrame API and SQL queries
The DataFrame API provides a programmatic domain-specific language (DSL) for working with data
The sql() method takes a SQL statement as an argument and returns the result as a DataFrame
df.createOrReplaceTempView("table1")
query = '''SELECT Product_ID FROM table1'''
test_product_df = spark.sql(query)
test_product_df.show(5)
+----------+
|Product_ID|
+----------+
| P00069042|
| P00248942|
| P00087842|
| P00085442|
| P00285442|
+----------+
query = '''SELECT Age, max(Purchase) FROM table1 GROUP BY Age'''
spark.sql(query).show(5)
+-----+-------------+
| Age|max(Purchase)|
+-----+-------------+
|18-25| 23958|
|26-35| 23961|
| 0-17| 23955|
|46-50| 23960|
|51-55| 23960|
+-----+-------------+
only showing top 5 rows
query = '''SELECT Age, Purchase, Gender FROM table1 WHERE Purchase > 20000 AND Gender = "F"'''
spark.sql(query).show(5)
+-----+--------+------+
| Age|Purchase|Gender|
+-----+--------+------+
|36-45| 23792| F|
|26-35| 21002| F|
|26-35| 23595| F|
|26-35| 23341| F|
|46-50| 20771| F|
+-----+--------+------+
only showing top 5 rows
What is Data visualization?
Data visualization is a way of representing your data in graphs or charts
Plotting graphs from PySpark DataFrames can be done using three methods:
pyspark_dist_explore library
toPandas()
HandySpark library
test_df_age = test_df.select('Age')
test_df_age_pandas = test_df_age.toPandas()
test_df_age_pandas.hist('Age')
hdf = test_df.toHandy()
hdf.cols["Age"].hist()