Chapter 3

This document provides an introduction to PySpark DataFrames. It explains that PySpark SQL uses DataFrames for structured data processing. DataFrames are immutable, distributed collections of data with named columns that can handle both structured and semi-structured data. The SparkSession object is the main entry point for working with DataFrames and for executing SQL queries. DataFrames can be created from existing RDDs or by reading data from files. Common operations on DataFrames include selecting columns, filtering rows, grouping, ordering, and aggregating data. SQL queries can also be used to perform the same operations on DataFrames.


Introduction to PySpark DataFrames
BIG DATA FUNDAMENTALS WITH PYSPARK

Upendra Devisetty
Science Analyst, CyVerse
What are PySpark DataFrames?
PySpark SQL is the Spark library for structured data processing. It provides more information
about the structure of the data and of the computation being performed

A PySpark DataFrame is an immutable distributed collection of data with named columns

Designed for processing both structured (e.g., relational database tables) and semi-structured
data (e.g., JSON)

The DataFrame API is available in Python, R, Scala, and Java

DataFrames in PySpark support both SQL queries ( SELECT * FROM table ) and expression
methods ( df.select() ); both styles are sketched below
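
A rough sketch of the two styles (the DataFrame and column names are made up for illustration, and spark is assumed to be an existing SparkSession, as in the PySpark shell):

# Hypothetical DataFrame with 'name' and 'age' columns
people_df = spark.createDataFrame([("Alice", 34), ("Bob", 45)],
                                  schema=["name", "age"])

# Expression-method (DataFrame API) style
people_df.select("name").show()

# SQL-query style: register a temporary view first, then query it
people_df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()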



SparkSession - Entry point for DataFrame API
SparkContext is the main entry point for creating RDDs

SparkSession provides a single point of entry to interact with Spark DataFrames

SparkSession is used to create DataFrames, register DataFrames as tables/views, and execute SQL queries

SparkSession is available in the PySpark shell as spark ; a standalone script creates one explicitly, as sketched below
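
A minimal sketch of creating a SparkSession outside the shell (the application name is arbitrary):

from pyspark.sql import SparkSession

# Build (or reuse) a session; in the PySpark shell this step is unnecessary
# because the session is already available as `spark`
spark = SparkSession.builder \
    .appName("dataframe_basics") \
    .getOrCreate()

# The underlying SparkContext remains reachable for RDD work
sc = spark.sparkContext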



Creating DataFrames in PySpark
Two different methods of creating DataFrames in PySpark
From existing RDDs using SparkSession's createDataFrame() method

From various data sources (CSV, JSON, TXT) using SparkSession's read method

A schema describes the data and helps Spark optimize queries on the DataFrame

A schema provides information such as the column names, the type of data in each column, and
whether empty (null) values are allowed; it can also be declared explicitly, as sketched below
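
A sketch of declaring a schema explicitly with StructType (the columns shown are illustrative):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Each field: column name, data type, and whether null (empty) values are allowed
schema = StructType([
    StructField("Model", StringType(), nullable=False),
    StructField("Year", IntegerType(), nullable=True),
])

df = spark.createDataFrame([("XS", 2018), ("XR", 2018)], schema=schema)
df.printSchema()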



Create a DataFrame from RDD
iphones_RDD = sc.parallelize([
("XS", 2018, 5.65, 2.79, 6.24),
("XR", 2018, 5.94, 2.98, 6.84),
("X10", 2017, 5.65, 2.79, 6.13),
("8Plus", 2017, 6.23, 3.07, 7.12)
])

names = ['Model', 'Year', 'Height', 'Width', 'Weight']

iphones_df = spark.createDataFrame(iphones_RDD, schema=names)


type(iphones_df)

pyspark.sql.dataframe.DataFrame



Create a DataFrame from reading a CSV/JSON/TXT
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)

df_json = spark.read.json("people.json")

df_txt = spark.read.text("people.txt")

Each method takes the path to the file; csv() also accepts two optional parameters


header=True , inferSchema=True

(JSON infers its schema automatically; text() reads each line into a single value column)
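
When the column types are already known, an explicit schema can be passed instead of inferSchema=True, which avoids the extra pass over the data that schema inference requires. A sketch (the people.csv columns are assumptions for illustration):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

people_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# schema= replaces inferSchema=True; header=True still skips the header row
df_csv = spark.read.csv("people.csv", header=True, schema=people_schema)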



Let's practice
Interacting with PySpark DataFrames

Upendra Devisetty
Science Analyst, CyVerse
DataFrame operators in PySpark
DataFrame operations: Transformations and Actions

DataFrame Transformations:
select(), filter(), groupby(), orderBy(), dropDuplicates() and withColumnRenamed()

DataFrame Actions:
printSchema(), head(), show(), count(), columns and describe()
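
Transformations only build up a query plan and return a new DataFrame; nothing is computed until an action runs. A small sketch, assuming a test_df with Gender and Age columns:

# Transformations: each call returns a new DataFrame, no job is launched yet
adults = test_df.filter(test_df.Age > 21).select('Gender', 'Age')

# Action: triggers the actual computation and prints the first rows
adults.show(5)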



select() and show() operations
select() transformation subsets the columns in the DataFrame

df_id_age = test_df.select('Age')

show() action prints the first 20 rows of the DataFrame by default

df_id_age.show(3)

+---+
|Age|
+---+
| 17|
| 17|
| 17|
+---+
only showing top 3 rows



filter() and show() operations
filter() transformation selects the rows that satisfy a given condition

new_df_age21 = new_df.filter(new_df.Age > 21)


new_df_age21.show(3)

+-------+------+---+
|User_ID|Gender|Age|
+-------+------+---+
|1000002| M| 55|
|1000003| M| 26|
|1000004| M| 46|
+-------+------+---+
only showing top 3 rows
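
The same condition can be written in a few equivalent forms; a brief sketch:

from pyspark.sql.functions import col

# Attribute, col() helper, and SQL-string forms are all equivalent
new_df.filter(new_df.Age > 21)
new_df.filter(col('Age') > 21)
new_df.filter('Age > 21')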



groupby() and count() operations
groupby() operation groups the DataFrame by one or more columns

test_df_age_group = test_df.groupby('Age')
test_df_age_group.count().show(3)

+---+------+
|Age| count|
+---+------+
| 26|219587|
| 17| 4|
| 55| 21504|
+---+------+
only showing top 3 rows
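
Grouped data is not limited to count(); agg() can compute several aggregates in one pass. A sketch, assuming test_df also has a Purchase column:

from pyspark.sql.functions import avg, max as max_

# Multiple aggregations per Age group in a single pass
test_df.groupby('Age') \
       .agg(avg('Purchase').alias('avg_purchase'),
            max_('Purchase').alias('max_purchase')) \
       .show(3)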



orderBy() Transformation
orderBy() operation sorts the DataFrame based on one or more columns

test_df_age_group.count().orderBy('Age').show(3)

+---+-----+
|Age|count|
+---+-----+
| 0|15098|
| 17| 4|
| 18|99660|
+---+-----+
only showing top 3 rows
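
orderBy() sorts in ascending order by default; descending order can be requested explicitly. A sketch:

from pyspark.sql.functions import desc

# Largest groups first: either the ascending flag or a desc() column works
test_df_age_group.count().orderBy('count', ascending=False).show(3)
test_df_age_group.count().orderBy(desc('count')).show(3)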



dropDuplicates()
dropDuplicates() removes the duplicate rows of a DataFrame

test_df_no_dup = test_df.select('User_ID','Gender', 'Age').dropDuplicates()


test_df_no_dup.count()

5892



withColumnRenamed Transformations
withColumnRenamed() renames a column in the DataFrame

test_df_sex = test_df.withColumnRenamed('Gender', 'Sex')


test_df_sex.show(3)

+-------+---+---+
|User_ID|Sex|Age|
+-------+---+---+
|1000001| F| 17|
|1000001| F| 17|
|1000001| F| 17|
+-------+---+---+



printSchema()
printSchema() operation prints the schema of the DataFrame: each column's name, data type, and whether it can contain null values

test_df.printSchema()

root
 |-- User_ID: integer (nullable = true)
 |-- Product_ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Occupation: integer (nullable = true)
 |-- Purchase: integer (nullable = true)



columns
The columns attribute returns a list with the column names of a DataFrame

test_df.columns

['User_ID', 'Gender', 'Age']



describe() actions
describe() operation computes summary statistics of the columns in the DataFrame

test_df.describe().show()

+-------+------------------+------+------------------+
|summary| User_ID|Gender| Age|
+-------+------------------+------+------------------+
| count| 550068|550068| 550068|
| mean|1003028.8424013031| null|30.382052764385495|
| stddev|1727.5915855307312| null|11.866105189533554|
| min| 1000001| F| 0|
| max| 1006040| M| 55|
+-------+------------------+------+------------------+



Let's practice
Interacting with DataFrames using PySpark SQL

Upendra Devisetty
Science Analyst, CyVerse
DataFrame API vs SQL queries
In PySpark, you can interact with Spark SQL through the DataFrame API and through SQL queries

The DataFrame API provides a programmatic domain-specific language (DSL) for working with data

DataFrame transformations and actions are easier to construct programmatically

SQL queries can be more concise, easier to understand, and portable across tools

The operations on DataFrames can also be done using SQL queries; both forms are sketched below
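
A sketch of one query written both ways, assuming test_df has Age and Purchase columns:

# DataFrame API version
test_df.filter(test_df.Purchase > 20000).select('Age', 'Purchase').show(5)

# Equivalent SQL version: register a temporary view, then query it
test_df.createOrReplaceTempView("test_table")
spark.sql("SELECT Age, Purchase FROM test_table WHERE Purchase > 20000").show(5)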



Executing SQL Queries
The SparkSession sql() method executes a SQL query

sql() takes a SQL statement as an argument and returns the result as a DataFrame

df.createOrReplaceTempView("table1")

df2 = spark.sql("SELECT field1 AS f1, field2 AS f2 FROM table1")


df2.collect()

[Row(f1=1, f2='row1'), Row(f1=2, f2='row2'), Row(f1=3, f2='row3')]



SQL query to extract data
test_df.createOrReplaceTempView("test_table")

query = '''SELECT Product_ID FROM test_table'''

test_product_df = spark.sql(query)
test_product_df.show(5)

+----------+
|Product_ID|
+----------+
| P00069042|
| P00248942|
| P00087842|
| P00085442|
| P00285442|
+----------+



Summarizing and grouping data using SQL queries
test_df.createOrReplaceTempView("test_table")

query = '''SELECT Age, max(Purchase) FROM test_table GROUP BY Age'''

spark.sql(query).show(5)

+-----+-------------+
| Age|max(Purchase)|
+-----+-------------+
|18-25| 23958|
|26-35| 23961|
| 0-17| 23955|
|46-50| 23960|
|51-55| 23960|
+-----+-------------+
only showing top 5 rows



Filtering columns using SQL queries
test_df.createOrReplaceTempView("test_table")

query = '''SELECT Age, Purchase, Gender FROM test_table WHERE Purchase > 20000 AND Gender == "F"'''

spark.sql(query).show(5)

+-----+--------+------+
| Age|Purchase|Gender|
+-----+--------+------+
|36-45| 23792| F|
|26-35| 21002| F|
|26-35| 23595| F|
|26-35| 23341| F|
|46-50| 20771| F|
+-----+--------+------+
only showing top 5 rows



Time to practice!
Data Visualization in PySpark using DataFrames

Upendra Devisetty
Science Analyst, CyVerse
What is Data visualization?
Data visualization is a way of representing your data in graphs or charts

Open-source plotting tools that aid visualization in Python:


Matplotlib, Seaborn, Bokeh, etc.

Plotting graphs from PySpark DataFrames can be done using three methods:
pyspark_dist_explore library

toPandas()

HandySpark library



Data Visualization using Pyspark_dist_explore
Pyspark_dist_explore library provides quick insights into DataFrames

Three functions are currently available: hist() , distplot() and pandas_histogram()

from pyspark_dist_explore import hist

test_df = spark.read.csv("test.csv", header=True, inferSchema=True)

test_df_age = test_df.select('Age')

hist(test_df_age, bins=20, color="red")



Using Pandas for plotting DataFrames
It's easy to create charts from pandas DataFrames

test_df = spark.read.csv("test.csv", header=True, inferSchema=True)

# Sample the (potentially large) DataFrame first; the fraction here is illustrative
test_df_sample = test_df.sample(False, 0.1)

test_df_sample_pandas = test_df_sample.toPandas()

test_df_sample_pandas.hist('Age')



Pandas DataFrame vs PySpark DataFrame
Pandas DataFrames are in-memory, single-server structures, whereas operations on PySpark
DataFrames run in parallel across the cluster

Pandas operations are evaluated eagerly, producing results as soon as they are applied, whereas
PySpark DataFrame operations are evaluated lazily

Pandas DataFrames are mutable, while PySpark DataFrames are immutable

The pandas API supports more operations than the PySpark DataFrame API
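
The difference in evaluation is easy to see in a small sketch (the column used is illustrative, and test_df is assumed to exist):

import pandas as pd

pandas_df = pd.DataFrame({'Age': [17, 26, 55]})

# pandas: evaluated eagerly, the new column is computed immediately
pandas_df['Age_plus_one'] = pandas_df['Age'] + 1

# PySpark: withColumn only records the transformation in the query plan ...
lazy_df = test_df.withColumn('Age_plus_one', test_df.Age + 1)

# ... the work happens only when an action such as show() or count() runs
lazy_df.show(3)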



HandySpark method of visualization
HandySpark is a package designed to improve the PySpark user experience

test_df = spark.read.csv('test.csv', header=True, inferSchema=True)

hdf = test_df.toHandy()

hdf.cols["Age"].hist()



Let's visualize DataFrames
