PySpark Foundations:
Process, analyze, and
summarize data
Estimated Time
45 minutes
Meet Olayinka Arimoro
● 5 years experience analyzing data.
● 3+ years experience building machine learning models working with health data.
● Top-rated Coursera guided project instructor with 40 courses and about 100K learners.
Recommended Background
Basic Python Programming
● Knowledge of Python syntax
● Basic data structures and data frame operations such as filtering, grouping, and summarizing data.
INTRODUCTION
Project Outcome
By the end of this project-based course, learners will be able to process,
analyze, and summarize large datasets using PySpark, including performing
data cleaning and aggregation to derive data insights.
Learning Objectives
● Process large datasets using PySpark, including data loading,
cleaning, and preprocessing.
● Perform data exploration and visualization using dataframe
operations.
● Perform data aggregation and summarization using PySpark
and dataframe operations.
Apache Spark
Source: https://spark.apache.org/
PySpark
● PySpark provides an interface for Apache Spark in
Python.
● You can write Python and SQL-like commands to
manipulate and analyze data in a distributed
processing environment.
Project Scenario
Your Role: Entry-level data analyst/scientist
In this project, you will take on the role of a junior or
entry-level data analyst/scientist and use the
employees/salaries data to perform analyses covering
key areas such as employee distribution across
departments, average salaries, and age demographics.
These analyses will give key decision-makers insight
into how to compensate, retain, and hire.
TASK OBJECTIVE
Task 1
Set up and get an overview of the project
You will get an overview of the project and set up its building blocks.
Task Summary
Key Takeaways
● PySpark is an interface for Apache Spark in Python to manipulate
and analyze data in a distributed processing environment.
● SparkSession.builder creates an entry point to using PySpark;
appName() names the Spark application; getOrCreate() retrieves
an existing Spark session or creates a new one.
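To make this concrete, here is a minimal sketch of creating the entry point described above; the application name "PySpark Foundations" is an illustrative choice, not a required value.

from pyspark.sql import SparkSession

# appName() labels the Spark application; getOrCreate() returns an
# existing session if one is already running, otherwise it starts one.
spark = SparkSession.builder \
    .appName("PySpark Foundations") \
    .getOrCreate()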
TASK OBJECTIVE
Task 2
Load the data
You will load the employees.csv and updated_salaries.csv data.
Task Summary
Key Takeaways
● spark.read.csv: a method in Spark that reads a CSV file into a data
frame.
● header=True: specifies that the first row of the CSV file contains the header (column names).
● inferSchema=True: tells Spark to automatically infer the data types
of each column based on the values in the CSV file.
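Putting these options together, a short sketch of the loading step might look like the following; the file paths assume the CSVs sit in the working directory, so adjust them to your workspace.

# Read each CSV into a Spark data frame, treating the first row as the
# header and letting Spark infer each column's data type.
employees_df = spark.read.csv("employees.csv", header=True, inferSchema=True)
salaries_df = spark.read.csv("updated_salaries.csv", header=True, inferSchema=True)

employees_df.printSchema()  # verify the inferred column types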
TASK OBJECTIVE
Task 3
Clean and process data
You will perform quick data cleaning by converting columns to proper
data types.
Task Summary
Key Takeaways
● Formatting allows for standardization and normalization of data. It
aids in error detection and data cleaning, setting the stage for
reliable data analytics.
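As one illustration of this formatting step, the sketch below casts columns to proper types; the column names (salary, from_date, to_date) are assumptions drawn from the project scenario, and the date format is assumed to be yyyy-MM-dd.

from pyspark.sql.functions import col, to_date

# Cast salary to a numeric type and the date columns to DateType so
# they sort, filter, and aggregate correctly in later tasks.
salaries_df = (
    salaries_df
    .withColumn("salary", col("salary").cast("double"))
    .withColumn("from_date", to_date(col("from_date"), "yyyy-MM-dd"))
    .withColumn("to_date", to_date(col("to_date"), "yyyy-MM-dd"))
)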
TASK OBJECTIVE
Task 4
Explore the data
You will explore the salaries data by computing summary statistics
and visualizing the salary column.
Task Summary
Key Takeaways
● Data exploration, one of the first steps in data preparation, is a way
to get to know data before working with it.
● toPandas() converts the Spark data frame to a Pandas data frame.
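A sketch of this exploration step, assuming the salaries data frame has a salary column and that matplotlib is available for the plot:

import matplotlib.pyplot as plt

# Summary statistics (count, mean, stddev, min, max) for salary.
salaries_df.describe("salary").show()

# Convert only the salary column to Pandas, then plot a histogram.
salary_pdf = salaries_df.select("salary").toPandas()
salary_pdf["salary"].plot(kind="hist", bins=30, title="Salary Distribution")
plt.show()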
Practice Activity
This task is optional and ungraded.
The goal is to check your understanding.
Practice Activity
In this practice task, you will explore the employees data. To complete this practice task, please follow the instructions below:
● Create a sum of missing values per column in the employees data.

Things to Note
➔ When returning the counts, you may decide to use or not use the f-string format to print the return.
(Note to the learner: Pause the video to complete the task and unpause to see the solution once the task is complete.)
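One possible solution sketch for this practice task, to compare against after you attempt it; it assumes the employees data frame loaded in Task 2.

from pyspark.sql.functions import col, count, when

# For each column, count the rows where the value is null. when() with
# no otherwise() yields null for non-missing rows, so count() skips them.
missing_counts = employees_df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in employees_df.columns]
)
missing_counts.show()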
TASK OBJECTIVE
Task 5
Aggregate and summarize the data
You will perform data aggregation and summarization using the
salaries data.
Task Summary
Key Takeaways
● The groupBy operation allows you to perform aggregate functions
(such as sum, count, max, min, etc.) on grouped data, based on one
or more columns.
● The agg() function is used to apply aggregate functions to the
grouped data. You would typically pass functions inside agg() to
specify what operation to perform on the grouped data.
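A sketch of the groupBy/agg pattern described above; grouping the salaries data by emp_no and computing the average and maximum salary is an illustrative choice of columns and functions, not the only aggregation the task covers.

from pyspark.sql.functions import avg, max as spark_max

# Average and maximum salary per employee across all salary records.
salary_summary = (
    salaries_df
    .groupBy("emp_no")
    .agg(
        avg("salary").alias("avg_salary"),
        spark_max("salary").alias("max_salary"),
    )
)
salary_summary.show(5)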
TASK OBJECTIVE
Task 6
Join the data sets
You will join the salaries and employees data using the employee number.
Task Summary
Key Takeaways
● Use left join when you need all records from the left table and the
matching records from the right table. Unmatched records from the
right table will have NULL values.
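A sketch of the join for this task, assuming both data frames share an emp_no employee-number column:

# Keep every employee record and attach matching salary records;
# employees with no salary match get NULLs in the salary columns.
joined_df = employees_df.join(salaries_df, on="emp_no", how="left")
joined_df.show(5)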
Congratulations on
completing your Guided
Project!
Next steps?
Thank you!!!
Cumulative Activity
This activity is optional and ungraded.
The goal is for you to apply the knowledge and skills
learned within this Guided Project to boost your
confidence.
Cumulative Activity Scenario
As a junior data analyst at a growing company, you are tasked with analyzing employee retention. Your aim is
to find the departments with the highest number of employees who have worked longer than ten years. This will
assist HR in enhancing employee engagement and retention strategies. To complete this activity, you will use
the employee dataset and create a data frame with the count of employees in each department who have worked
for more than ten years (calculated from from_date and to_date). Finally, you'll visualize how long-term
employees are spread across departments via a bar chart.
Your Task
1. Calculate the years worked based on the difference between 'to_date' and 'from_date'.
2. Group by emp_no and dept_no to sum the years worked.
3. Filter employees who have worked more than 10 years.
4. Group by department and count distinct employees who have worked more than 10 years.
5. Create a bar chart to visualize the distribution of long-term employees across departments.
Cumulative Activity
Note that we have repeated rows for employees, where each row represents a period of employment with corresponding dates (from_date and to_date). To handle this, you can follow the steps outlined under "Your Task" above.

Things to Note
➔ Make sure that you convert the Spark data frame to Pandas before creating the bar chart.
(Note to the learner: Pause the video to complete the task and unpause to see the solution once the task is complete.)
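For reference after you attempt the activity, here is one hedged solution sketch. The name dept_emp_df stands in for whichever data frame in your workspace holds emp_no, dept_no, from_date, and to_date, and dividing by 365.25 days is an approximation of years.

import matplotlib.pyplot as plt
from pyspark.sql.functions import col, countDistinct, datediff, sum as spark_sum

# 1. Years worked for each employment period (datediff returns days).
tenure_df = dept_emp_df.withColumn(
    "years_worked", datediff(col("to_date"), col("from_date")) / 365.25
)

# 2.-3. Sum the years per employee and department, then keep only
# employees with more than ten years in a department.
long_term = (
    tenure_df.groupBy("emp_no", "dept_no")
    .agg(spark_sum("years_worked").alias("total_years"))
    .filter(col("total_years") > 10)
)

# 4. Count distinct long-term employees in each department.
dept_counts = long_term.groupBy("dept_no").agg(
    countDistinct("emp_no").alias("long_term_employees")
)

# 5. Convert to Pandas before plotting, then draw the bar chart.
dept_counts.toPandas().plot(
    kind="bar", x="dept_no", y="long_term_employees",
    title="Long-Term Employees by Department",
)
plt.show()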
Thank you!!!