1 - Introduction to PySpark

The document provides an introduction to PySpark, a tool for distributed data processing that supports various data formats and integrates SQL for querying. It covers key concepts such as Spark clusters, SparkSessions, DataFrames, and essential functions for data manipulation and analytics. Additionally, it highlights the creation of DataFrames from different data sources and the importance of schema inference and data types in PySpark.


Introduction to PySpark

Benjamin Schmidt
Data Engineer
Meet your instructor
Almost a decade of data experience with PySpark
Used PySpark for machine learning, ETL tasks, and much more

Enthusiastic teacher of new tools for all!

What is PySpark?
Distributed data processing: designed to handle large datasets across clusters

Supports various data formats, including CSV, Parquet, and JSON

SQL integration allows querying data using both Python and SQL syntax (see the sketch after this list)

Optimized for speed at scale
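
As a concrete illustration of the SQL integration point above, here is a minimal sketch; the sales data, view name, and column names are made up for the example.

# Illustrative sketch: querying the same data with SQL and with Python syntax
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLExample").getOrCreate()

# Hypothetical example data, not from the course
sales_df = spark.createDataFrame(
    [("North", 100), ("South", 250), ("North", 75)],
    ["region", "amount"])

# Register the DataFrame as a temporary view so SQL can reference it
sales_df.createOrReplaceTempView("sales")

# The same aggregation, once in SQL and once in DataFrame syntax
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
sales_df.groupBy("region").sum("amount").show()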

When would we use PySpark?
Big data analytics
Distributed data processing

Real-time data streaming

Machine learning on large datasets

ETL and ELT pipelines

Working with diverse data sources:

1. CSV

2. JSON

3. Parquet

4. Many more

Spark cluster
Master node: manages the cluster, coordinates tasks, and schedules jobs.

Worker nodes: execute the tasks assigned by the master node; they perform the actual computations and store data in memory or on disk.
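
How your code reaches these nodes depends on the master URL handed to the session builder. A minimal sketch, assuming a local machine stands in for the cluster; the URLs here are illustrative, not from the slides.

# "local[*]" runs Spark on this machine, using all CPU cores as workers
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("ClusterDemo")
         .getOrCreate())

# Against a real cluster you would point at its master instead, e.g.:
# .master("spark://master-host:7077")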

SparkSession
SparkSessions allow you to access your Spark cluster and are critical for using PySpark.

# Import SparkSession
from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()

.builder sets up a session

.getOrCreate() creates a new session or retrieves the existing one (see the sketch below)

.appName() names the application, which helps identify and manage multiple sessions
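
One detail worth seeing in code: getOrCreate() reuses an already-running session rather than starting a second one. A small sketch, not from the slides, demonstrating this:

# getOrCreate() returns the existing active session if there is one
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
spark_again = SparkSession.builder.getOrCreate()

print(spark is spark_again)  # True: the same session is reused

spark.stop()  # release resources when you are finished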

PySpark DataFrames
Similar to other DataFrames, but optimized for distributed processing in Spark

# Import and initialize a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()

# Create a DataFrame from a CSV file, naming the columns explicitly
census_df = (spark.read.csv("census.csv")
             .toDF("gender", "age", "zipcode", "salary_range_usd", "marriage_status"))

# Show the DataFrame
census_df.show()

Let's practice!

Introduction to PySpark DataFrames

Benjamin Schmidt
Data Engineer
About DataFrames
DataFrames: tabular format (rows and columns)

Supports SQL-like operations

Comparable to a pandas DataFrame or a SQL table

Designed for structured data

Creating DataFrames from filestores
# Create a DataFrame from CSV
census_df = spark.read.csv('path/to/census.csv', header=True, inferSchema=True)

Printing the DataFrame
# Show the first 5 rows of the DataFrame
census_df.show(5)

+---+-------------+--------------+-----------------+------+
|age|education.num|marital.status|       occupation|income|
+---+-------------+--------------+-----------------+------+
| 90|            9|       Widowed|                ?| <=50K|
| 82|            9|       Widowed|  Exec-managerial| <=50K|
| 66|           10|       Widowed|                ?| <=50K|
| 54|            4|      Divorced|Machine-op-inspct| <=50K|
| 41|           10|     Separated|   Prof-specialty| <=50K|
+---+-------------+--------------+-----------------+------+

Printing DataFrame Schema
# Show the schema
census_df.printSchema()
Output:
root
|-- age: integer (nullable = true)
|-- education.num: integer (nullable = true)
|-- marital.status: string (nullable = true)
|-- occupation: string (nullable = true)
|-- income: string (nullable = true)

Basic analytics on PySpark DataFrames
# .count() returns the total number of rows in the DataFrame
row_count = census_df.count()
print(f'Number of rows: {row_count}')

# .groupBy() enables SQL-like aggregations
census_df.groupBy('gender').agg({'salary_usd': 'avg'}).show()

Other aggregate functions (combined in the sketch below):

sum()

min()

max()
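
These can also be called explicitly through pyspark.sql.functions, which allows several aggregations in one pass. A hedged sketch below; the gender and salary_usd values are illustrative stand-ins for the census data.

# Compute several aggregates at once with pyspark.sql.functions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("AggDemo").getOrCreate()

# Illustrative stand-in for the census data on this slide
census_df = spark.createDataFrame(
    [("Female", 52000.0), ("Male", 48000.0), ("Female", 61000.0)],
    ["gender", "salary_usd"])

census_df.groupBy("gender").agg(
    F.avg("salary_usd").alias("avg_salary"),
    F.min("salary_usd").alias("min_salary"),
    F.max("salary_usd").alias("max_salary"),
    F.sum("salary_usd").alias("total_salary")).show()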

Key functions for PySpark analytics
.select() : Selects specific columns from the DataFrame

.filter() : Filters rows based on specific conditions

.groupBy() : Groups rows based on one or more columns

.agg() : Applies aggregate functions to grouped data

Key functions: an example
# Using filter and select, we can narrow down our DataFrame
filtered_census_df = census_df.filter(census_df['age'] > 50).select('age', 'occupation')
filtered_census_df.show()
Output
+---+-----------------+
|age|       occupation|
+---+-----------------+
| 90|                ?|
| 82|  Exec-managerial|
| 66|                ?|
| 54|Machine-op-inspct|
+---+-----------------+

Let's practice!

More on Spark DataFrames

Benjamin Schmidt
Data Engineer
Creating DataFrames from various data sources
CSV files: common for structured, delimited data. Example: spark.read.csv("path/to/file.csv")

JSON files: semi-structured, hierarchical data format. Example: spark.read.json("path/to/file.json")

Parquet files: optimized for storage and querying, often used in data engineering. Example: spark.read.parquet("path/to/file.parquet")

1 https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_csv

Schema inference and manual schema definition
Spark can infer schemas from the data with inferSchema=True
Manually defining a schema gives better control, which is useful for fixed data structures (see the sketch below)
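
A minimal sketch contrasting the two approaches; the file path and the DDL column list are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SchemaDemo").getOrCreate()

# Option 1: let Spark sample the file and guess the column types
inferred_df = spark.read.csv("path/to/census.csv", header=True, inferSchema=True)

# Option 2: state the schema up front as a DDL string; no inference
# pass is needed, so this is faster and safer for fixed structures
ddl_schema = "age INT, occupation STRING, income STRING"
manual_df = spark.read.csv("path/to/census.csv", header=True, schema=ddl_schema)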

DataTypes in PySpark DataFrames
IntegerType : Whole numbers
E.g., 1 , 3478 , -1890456

LongType : Larger whole numbers (8-byte signed integers)
E.g., 922334775806

FloatType and DoubleType : Floating-point numbers for decimal values
E.g., 3.14159

StringType : Used for text or string data
E.g., "This is an example of a string."

...

DataTypes Syntax for PySpark DataFrames
# Import the necessary types as classes
from pyspark.sql.types import (StructType,
                               StructField, IntegerType,
                               StringType, ArrayType)

# Construct the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True)
])

# Example rows matching the schema (illustrative values)
data = [(1, "Alice", [85, 90]), (2, "Bob", [78, 81])]

# Set the schema
df = spark.createDataFrame(data, schema=schema)

DataFrame operations - selection and filtering
Use .select() to choose specific columns
Use .filter() or .where() to filter rows based on conditions

Use .sort() to order rows by one or more columns

# Select and show only the name and age columns
df.select("name", "age").show()

# Filter on age > 30
df.filter(df["age"] > 30).show()

# Use .where() to match a specific value
df.where(df["age"] == 30).show()

Sorting and dropping missing values
Order data using .sort() or .orderBy()
Use .na.drop() to remove rows with null values

# Sort descending by the age column
df.sort("age", ascending=False).show()

# Drop rows with missing values
df.na.drop().show()

Cheatsheet
spark.read.json() : Load data from JSON

spark.read.schema() : Define schemas explicitly when reading

.na.drop() : Drop rows with missing values

.select() , .filter() , .sort() , .orderBy() : Basic data manipulation functions, combined in the sketch below
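
To tie these together, here is a hedged end-to-end sketch combining the cheatsheet functions; the file name and columns are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CheatsheetDemo").getOrCreate()

df = spark.read.json("path/to/people.json")

(df.na.drop()                        # drop rows with missing values
   .select("name", "age")            # keep only the needed columns
   .filter("age > 30")               # filter rows with a SQL condition
   .orderBy("age", ascending=False)  # sort descending by age
   .show())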

Let's practice!