1 - Introduction to PySpark

The document provides an introduction to PySpark, a tool for distributed data processing that supports various data formats and integrates SQL for querying. It covers key concepts such as Spark clusters, SparkSessions, DataFrames, and essential functions for data manipulation and analytics. Additionally, it highlights the creation of DataFrames from different data sources and the importance of schema inference and data types in PySpark.


Introduction to PySpark

Benjamin Schmidt
Data Engineer
Meet your instructor
Almost a decade of data experience with PySpark
Used PySpark for machine learning, ETL tasks, and much more

Enthusiastic teacher of new tools for all!

What is PySpark?
Distributed data processing: designed to handle large datasets across clusters

Supports various data formats, including CSV, Parquet, and JSON

SQL integration allows querying data using both Python and SQL syntax (see the sketch after this list)

Optimized for speed at scale
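
As a concrete illustration of the SQL integration point above, here is a minimal sketch; the sales data, view name, and column names are made up for the example.

# Illustrative sketch: querying the same data with SQL and with Python syntax
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLExample").getOrCreate()

# Hypothetical example data, not from the course
sales_df = spark.createDataFrame(
    [("North", 100), ("South", 250), ("North", 75)],
    ["region", "amount"])

# Register the DataFrame as a temporary view so SQL can reference it
sales_df.createOrReplaceTempView("sales")

# The same aggregation, once in SQL and once in DataFrame syntax
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
sales_df.groupBy("region").sum("amount").show()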

When would we use PySpark?
Big data analytics
Distributed data processing

Real-time data streaming

Machine learning on large datasets

ETL and ELT pipelines

Working with diverse data sources:

1. CSV

2. JSON

3. Parquet

4. Many more

Spark cluster
Master node: manages the cluster, coordinates tasks, and schedules jobs.

Worker nodes: execute the tasks assigned by the master node; they perform the actual computations and store data in memory or on disk.
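
How your code reaches these nodes depends on the master URL handed to the session builder. A minimal sketch, assuming a local machine stands in for the cluster; the URLs here are illustrative, not from the slides.

# "local[*]" runs Spark on this machine, using all CPU cores as workers
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("ClusterDemo")
         .getOrCreate())

# Against a real cluster you would point at its master instead, e.g.:
# .master("spark://master-host:7077")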

SparkSession
SparkSessions allow you to access your Spark cluster and are critical for using PySpark.

# Import SparkSession
from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()

.builder sets up a session

.getOrCreate() creates a new session or retrieves the existing one (see the sketch below)

.appName() names the application, which helps identify and manage multiple sessions
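
One detail worth seeing in code: getOrCreate() reuses an already-running session rather than starting a second one. A small sketch, not from the slides, demonstrating this:

# getOrCreate() returns the existing active session if there is one
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
spark_again = SparkSession.builder.getOrCreate()

print(spark is spark_again)  # True: the same session is reused

spark.stop()  # release resources when you are finished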

PySpark DataFrames
Similar to other DataFrames, but optimized for distributed processing in Spark

# Import and initialize a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()

# Create a DataFrame from a CSV file, naming the columns explicitly
census_df = (spark.read.csv("census.csv")
             .toDF("gender", "age", "zipcode", "salary_range_usd", "marriage_status"))

# Show the DataFrame
census_df.show()

Let's practice!

Introduction to PySpark DataFrames

Benjamin Schmidt
Data Engineer
About DataFrames
DataFrames: tabular format (rows and columns)

Supports SQL-like operations

Comparable to a pandas DataFrame or a SQL table

Designed for structured data

Creating DataFrames from filestores
# Create a DataFrame from CSV
census_df = spark.read.csv('path/to/census.csv', header=True, inferSchema=True)

Printing the DataFrame
# Show the first 5 rows of the DataFrame
census_df.show(5)

+---+-------------+--------------+-----------------+------+
|age|education.num|marital.status|       occupation|income|
+---+-------------+--------------+-----------------+------+
| 90|            9|       Widowed|                ?| <=50K|
| 82|            9|       Widowed|  Exec-managerial| <=50K|
| 66|           10|       Widowed|                ?| <=50K|
| 54|            4|      Divorced|Machine-op-inspct| <=50K|
| 41|           10|     Separated|   Prof-specialty| <=50K|
+---+-------------+--------------+-----------------+------+

Printing DataFrame Schema
# Show the schema
census_df.printSchema()
Output:
root
|-- age: integer (nullable = true)
|-- education.num: integer (nullable = true)
|-- marital.status: string (nullable = true)
|-- occupation: string (nullable = true)
|-- income: string (nullable = true)

Basic analytics on PySpark DataFrames
# .count() returns the total number of rows in the DataFrame
row_count = census_df.count()
print(f'Number of rows: {row_count}')

# .groupBy() enables SQL-like aggregations
census_df.groupBy('gender').agg({'salary_usd': 'avg'}).show()

Other aggregate functions (combined in the sketch below):

sum()

min()

max()
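
These can also be called explicitly through pyspark.sql.functions, which allows several aggregations in one pass. A hedged sketch below; the gender and salary_usd values are illustrative stand-ins for the census data.

# Compute several aggregates at once with pyspark.sql.functions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("AggDemo").getOrCreate()

# Illustrative stand-in for the census data on this slide
census_df = spark.createDataFrame(
    [("Female", 52000.0), ("Male", 48000.0), ("Female", 61000.0)],
    ["gender", "salary_usd"])

census_df.groupBy("gender").agg(
    F.avg("salary_usd").alias("avg_salary"),
    F.min("salary_usd").alias("min_salary"),
    F.max("salary_usd").alias("max_salary"),
    F.sum("salary_usd").alias("total_salary")).show()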

Key functions for PySpark analytics
.select() : Selects specific columns from the DataFrame

.filter() : Filters rows based on specific conditions

.groupBy() : Groups rows based on one or more columns

.agg() : Applies aggregate functions to grouped data

Key functions: an example
# Using filter and select, we can narrow down our DataFrame
filtered_census_df = census_df.filter(census_df['age'] > 50).select('age', 'occupation')
filtered_census_df.show()
Output
+---+-----------------+
|age|       occupation|
+---+-----------------+
| 90|                ?|
| 82|  Exec-managerial|
| 66|                ?|
| 54|Machine-op-inspct|
+---+-----------------+

Let's practice!

More on Spark DataFrames

Benjamin Schmidt
Data Engineer
Creating DataFrames from various data sources
CSV files: common for structured, delimited data. Example: spark.read.csv("path/to/file.csv")

JSON files: semi-structured, hierarchical data format. Example: spark.read.json("path/to/file.json")

Parquet files: optimized for storage and querying, often used in data engineering. Example: spark.read.parquet("path/to/file.parquet")

1 https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_csv

Schema inference and manual schema definition
Spark can infer schemas from the data with inferSchema=True
Manually defining a schema gives better control, which is useful for fixed data structures (see the sketch below)
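
A minimal sketch contrasting the two approaches; the file path and the DDL column list are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SchemaDemo").getOrCreate()

# Option 1: let Spark sample the file and guess the column types
inferred_df = spark.read.csv("path/to/census.csv", header=True, inferSchema=True)

# Option 2: state the schema up front as a DDL string; no inference
# pass is needed, so this is faster and safer for fixed structures
ddl_schema = "age INT, occupation STRING, income STRING"
manual_df = spark.read.csv("path/to/census.csv", header=True, schema=ddl_schema)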

DataTypes in PySpark DataFrames
IntegerType : Whole numbers
E.g., 1 , 3478 , -1890456

LongType : Larger whole numbers (8-byte signed integers)
E.g., 922334775806

FloatType and DoubleType : Floating-point numbers for decimal values
E.g., 3.14159

StringType : Used for text or string data
E.g., "This is an example of a string."

...

DataTypes Syntax for PySpark DataFrames
# Import the necessary types as classes
from pyspark.sql.types import (StructType,
                               StructField, IntegerType,
                               StringType, ArrayType)

# Construct the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True)
])

# Example rows matching the schema (illustrative values)
data = [(1, "Alice", [85, 90]), (2, "Bob", [78, 81])]

# Set the schema
df = spark.createDataFrame(data, schema=schema)

DataFrame operations - selection and filtering
Use .select() to choose specific columns
Use .filter() or .where() to filter rows based on conditions

Use .sort() to order rows by one or more columns

# Select and show only the name and age columns
df.select("name", "age").show()

# Filter on age > 30
df.filter(df["age"] > 30).show()

# Use .where() to match a specific value
df.where(df["age"] == 30).show()

Sorting and dropping missing values
Order data using .sort() or .orderBy()
Use .na.drop() to remove rows with null values

# Sort descending by the age column
df.sort("age", ascending=False).show()

# Drop rows with missing values
df.na.drop().show()

Cheatsheet
spark.read.json() : Load data from JSON

spark.read.schema() : Define schemas explicitly when reading

.na.drop() : Drop rows with missing values

.select() , .filter() , .sort() , .orderBy() : Basic data manipulation functions, combined in the sketch below
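
To tie these together, here is a hedged end-to-end sketch combining the cheatsheet functions; the file name and columns are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CheatsheetDemo").getOrCreate()

df = spark.read.json("path/to/people.json")

(df.na.drop()                        # drop rows with missing values
   .select("name", "age")            # keep only the needed columns
   .filter("age > 30")               # filter rows with a SQL condition
   .orderBy("age", ascending=False)  # sort descending by age
   .show())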

Let's practice!