1 - Introduction to PySpark
Benjamin Schmidt
Data Engineer
Meet your instructor
Almost a decade of data experience with PySpark
Used PySpark for machine learning, ETL tasks, and much more
What is PySpark?
Distributed data processing: Designed to handle large datasets across clusters
SQL integration: lets you query data using both Python and SQL syntax, as sketched below
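As a hedged illustration of the SQL integration, assuming a SparkSession named spark and a DataFrame census_df (both created in later snippets):

# Register the DataFrame as a temporary SQL view
census_df.createOrReplaceTempView("census")

# Query it with plain SQL; the result is again a DataFrame
over_50 = spark.sql("SELECT age, occupation FROM census WHERE age > 50")
over_50.show()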
When would we use PySpark?
Big data analytics
Distributed data processing
Working with many data formats:
1. CSV
2. JSON
3. Parquet
4. Many more
Spark cluster
Master node: Manages the cluster, coordinates tasks, and schedules jobs
Worker nodes: Execute the tasks assigned by the master; responsible for the actual computations and for storing data in memory or on disk
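As a hedged sketch of how an application attaches to a cluster: the master URL tells Spark which cluster manager coordinates the workers. The local[*] URL below is an assumption for local experimentation (one worker thread per CPU core); a real deployment would use, e.g., a spark:// or yarn master URL.

# Connect to a cluster manager via its master URL
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ClusterDemo")
         .master("local[*]")
         .getOrCreate())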
SparkSession
A SparkSession is your entry point to the Spark cluster and is required for working with PySpark.
# Import SparkSession
from pyspark.sql import SparkSession
# Initialize a SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
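getOrCreate() returns an existing SparkSession if one is already running and creates a new one otherwise, so it is safe to call repeatedly.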
PySpark DataFrames
Similar to DataFrames in other tools, but optimized for Spark's distributed engine
# Create a DataFrame from a CSV file and name its columns
census_df = spark.read.csv("census.csv").toDF(
    "gender", "age", "zipcode", "salary_range_usd", "marriage_status")
Let's practice!
2 - Introduction to PySpark DataFrames
Benjamin Schmidt
Data Engineer
About DataFrames
DataFrames store data in a tabular format (rows and columns)
Well suited for structured data
Creating DataFrames from filestores
# Create a DataFrame from CSV
census_df = spark.read.csv('path/to/census.csv', header=True, inferSchema=True)
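Here, header=True treats the first row as column names, and inferSchema=True has Spark sample the data to assign each column a type.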
Printing the DataFrame
# Show the first 5 rows of the DataFrame
census_df.show(5)
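.show() prints 20 rows by default; passing a number, as here, limits the output.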
Printing DataFrame Schema
# Show the schema
census_df.printSchema()
Output:
root
|-- age: integer (nullable = true)
|-- education.num: integer (nullable = true)
|-- marital.status: string (nullable = true)
|-- occupation: string (nullable = true)
|-- income: string (nullable = true)
Basic analytics on PySpark DataFrames
# .count() returns the total number of rows in the DataFrame
row_count = census_df.count()
print(f'Number of rows: {row_count}')
Other aggregation helpers include sum(), min(), and max(); see the sketch below
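A minimal sketch of these aggregations, assuming census_df has a numeric age column; the helpers live in pyspark.sql.functions:

# Compute min, max, and sum of the age column in one pass
from pyspark.sql import functions as F

census_df.agg(
    F.min("age"),  # smallest age in the data
    F.max("age"),  # largest age
    F.sum("age")   # sum of all ages
).show()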
Key functions for PySpark analytics
.select() : Selects specific columns from the DataFrame
.filter() : Keeps only the rows that satisfy a condition
Key functions: an example
# Using filter and select, we can narrow down our DataFrame
filtered_census_df = census_df.filter(census_df['age'] > 50).select('age', 'occupation')
filtered_census_df.show()
Output:
+---+-----------------+
|age|       occupation|
+---+-----------------+
| 90|                ?|
| 82|  Exec-managerial|
| 66|                ?|
| 54|Machine-op-inspct|
+---+-----------------+
Let's practice!
3 - More on Spark DataFrames
Benjamin Schmidt
Data Engineer
Creating DataFrames from various data sources
CSV files: Common for structured, delimited data. Example: spark.read.csv("path/to/file.csv")
1 https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_csv
Schema inference and manual schema definition
Spark can infer schemas from data with inferSchema=True
Manually define a schema for better control - useful for fixed data structures (a sketch follows the type imports below)
DataTypes in PySpark DataFrames
IntegerType : Whole numbers
E.g., 1, 3478, -1890456
...
DataTypes Syntax for PySpark DataFrames
# Import the necessary types as classes
from pyspark.sql.types import (StructType,
StructField, IntegerType,
StringType, ArrayType)
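Building on these imports, here is a minimal sketch of a manual schema definition; the column names and the census.csv path are assumptions carried over from earlier examples:

# Define the schema explicitly instead of inferring it
schema = StructType([
    StructField("age", IntegerType(), True),        # third argument: nullable
    StructField("occupation", StringType(), True),
])

census_df = spark.read.csv("census.csv", schema=schema)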
DataFrame operations - selection and filtering
Use .select() to choose specific columns
Use .filter() or .where() to filter rows based on conditions
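As a hedged sketch combining the two, reusing the census_df assumed earlier (.where() is an alias for .filter()):

# Filter rows on a condition, then keep two columns
census_df.where(census_df["age"] > 50).select("age", "occupation").show(5)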
Sorting and dropping missing values
Order data using .sort() or .orderBy()
Use .na.drop() to remove rows with null values
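A minimal sketch chaining both operations, again assuming census_df:

# Drop rows containing nulls, then sort by age, oldest first
census_df.na.drop().orderBy("age", ascending=False).show(5)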
Cheatsheet
spark.read.csv() : Load data from CSV
spark.read.json() : Load data from JSON
Let's practice!