PySpark_slides

This guided project focuses on using PySpark to process, analyze, and summarize large datasets, particularly employee and salary data. Learners will gain skills in data loading, cleaning, exploration, aggregation, and visualization, culminating in a project that provides insights for decision-making in HR. The course is designed for individuals with basic Python programming knowledge and aims to enhance their data analysis capabilities.

Guided Project

PySpark Foundations:
Process, analyze, and
summarize data
Estimated Time
45 minutes

Meet Olayinka Arimoro
(Talking head goes here; a transitional effect to enlarge and shrink the circle between the preceding and succeeding slides is optional.)
● 5 years of experience analyzing data.
● 3+ years of experience building machine learning models working with health data.
● Top-rated Coursera guided project instructor with 40 courses and about 100K learners.
Recommended Background
Basic Python Programming
● Knowledge of Python syntax
● Basic data structures and data frame operations
like filtering, grouping, and summarizing data.
INTRODUCTION

Project Outcome
By the end of this project-based course, learners will be able to process,
analyze, and summarize large datasets using PySpark, including performing
data cleaning and aggregation for data insights.

Learning Objectives
● Process large datasets using PySpark, including data loading,
cleaning, and preprocessing.
● Perform data exploration and visualization using dataframe
operations.
● Perform data aggregation and summarization using PySpark
and dataframe operations.
Apache Spark

Source: https://spark.apache.org/
PySpark
● PySpark provides an interface for Apache Spark in
Python.
● You can write Python and SQL-like commands to
manipulate and analyze data in a distributed
processing environment.

Project Scenario
Your Role: Entry-level data analyst/scientist
In this project, you will take on the role of a junior or
entry-level data analyst/scientist and use the
employees/salaries data to perform analyses that
cover key areas such as employee distribution
across departments, average salaries, and age
demographics. These analyses will give key
decision-makers insight into how to compensate,
retain, and hire employees.

TASK OBJECTIVE

Task 1
Set up and overview of the project
You will get an overview of the project and set up its building blocks.
Task Summary

Set up and overview of the project

Key Takeaways
● PySpark is an interface for Apache Spark in Python to manipulate
and analyze data in a distributed processing environment.
● SparkSession.builder creates an entry point to using PySpark;
appName() names the Spark application; getOrCreate() retrieves
an existing Spark session or creates a new one.
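
For reference, a minimal sketch of this entry point; the application name "PySpark Foundations" is an illustrative choice, not one prescribed by the project:

    # Build or retrieve the Spark entry point for the session.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("PySpark Foundations") \
        .getOrCreate()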
TASK OBJECTIVE

Task 2
Load the data
You will load the employees.csv and updated_salaries.csv data.
Task Summary

Load the data

Key Takeaways
● spark.read.csv: a method in Spark that reads a CSV file into a data
frame.
● header=True: specifies that the first row of the CSV file contains the
header (column names).
● inferSchema=True: tells Spark to automatically infer the data types
of each column based on the values in the CSV file.
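
A sketch of loading the two files, assuming they sit in the working directory (adjust the paths to your environment):

    # Read both CSV files, using the first row as column names and
    # letting Spark infer the column types.
    employees_df = spark.read.csv("employees.csv", header=True, inferSchema=True)
    salaries_df = spark.read.csv("updated_salaries.csv", header=True, inferSchema=True)

    # Inspect the inferred schema and a few rows.
    employees_df.printSchema()
    employees_df.show(5)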
TASK OBJECTIVE

Task 3
Clean and process data
You will perform quick data cleaning by converting columns to proper
data types.
Task Summary

Clean and process data

Key Takeaways
● Formatting allows for standardization and normalization of data. It
aids in error detection and data cleaning, setting the stage for
reliable data analytics.
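
A sketch of this kind of conversion using withColumn() and cast(); the column names (salary, from_date, to_date) are assumptions drawn from the project scenario, not confirmed from the actual files:

    from pyspark.sql.functions import col

    # Cast each column to an appropriate type (assumed column names).
    salaries_df = (
        salaries_df
        .withColumn("salary", col("salary").cast("double"))
        .withColumn("from_date", col("from_date").cast("date"))
        .withColumn("to_date", col("to_date").cast("date"))
    )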
TASK OBJECTIVE

Task 4
Explore the data
You will explore the salaries data by computing summary statistics
and visualizing the salary column.
Task Summary

Explore the data

Key Takeaways
● Data exploration, one of the first steps in data preparation, is a way
to get to know data before working with it.
● toPandas() converts the Spark data frame to a Pandas data frame.
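
A sketch of exploring the salary column, assuming matplotlib is available for plotting:

    import matplotlib.pyplot as plt

    # Summary statistics (count, mean, stddev, min, max) for salary.
    salaries_df.describe("salary").show()

    # Convert to Pandas and plot a histogram of salaries.
    salaries_pd = salaries_df.select("salary").toPandas()
    salaries_pd["salary"].plot(kind="hist", bins=30, title="Salary distribution")
    plt.xlabel("Salary")
    plt.show()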
Practice Activity
This task is optional and ungraded.
The goal is to check your understanding.
Practice Activity
In this practice task, you will explore the employees data. To complete
this practice task, please follow the instructions below:
● Create a sum of missing values per column in the employees data.
● Count the number of rows in the employees data.
● Count the number of unique first names in the employees data.

Things to Note
➔ When returning the counts, you may decide whether or not to use the
f-string format to print the results.

Pro Tip
★ For this activity, you may review the task on "Data exploration".

(Note to the learner: Pause the video to complete the task and unpause to see the solution once the task is complete.)
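
One possible sketch of these checks; the column name first_name is an assumption about the employees data:

    from pyspark.sql.functions import col, count, countDistinct, when

    # Missing values per column: count the rows where each column is null.
    employees_df.select(
        [count(when(col(c).isNull(), c)).alias(c) for c in employees_df.columns]
    ).show()

    # Number of rows.
    print(f"Rows: {employees_df.count()}")

    # Number of unique first names (column name assumed).
    print(f"Unique first names: {employees_df.select(countDistinct('first_name')).first()[0]}")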
TASK OBJECTIVE

Task 5
Aggregate and summarize the data
You will perform data aggregation and summarization using the
salaries data.
Task Summary

Aggregate and summarize the data

Key Takeaways
● The groupBy operation allows you to perform aggregate functions
(such as sum, count, max, min, etc.) on grouped data, based on one
or more columns.
● The agg() function is used to apply aggregate functions to the
grouped data. You would typically pass functions inside agg() to
specify what operation to perform on the grouped data.
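
A sketch of groupBy() and agg() on the salaries data; grouping by emp_no is an assumed example, not the only grouping the task may use:

    from pyspark.sql.functions import avg, max as spark_max, min as spark_min

    # Average, maximum, and minimum salary per employee.
    salaries_df.groupBy("emp_no").agg(
        avg("salary").alias("avg_salary"),
        spark_max("salary").alias("max_salary"),
        spark_min("salary").alias("min_salary"),
    ).show(5)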
TASK OBJECTIVE

Task 6
Join the data sets
You will join the salaries and employees data using the employee number.
Task Summary

Join the data sets

Key Takeaways
● Use left join when you need all records from the left table and the
matching records from the right table. Unmatched records from the
right table will have NULL values.
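
A sketch of the left join, assuming emp_no is the shared key in both data frames:

    # Keep every employee; unmatched salary columns become NULL.
    joined_df = employees_df.join(salaries_df, on="emp_no", how="left")
    joined_df.show(5)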
Congratulations on
completing your Guided
Project!
Next steps?

Thank you!!!

Cumulative Activity
This activity is optional and ungraded.
The goal is for you to apply the knowledge and skills
learned within this Guided Project to boost your
confidence.
Cumulative Activity Scenario
As a junior data analyst at a growing company, you are tasked with analyzing employee retention. Your aim is
to find the departments with the highest number of employees who have worked longer than ten years. This will
assist HR in enhancing employee engagement and retention strategies. To complete this activity, you will use
the employee dataset and create a data frame with the counts of employees in each department who have worked
for over ten years (calculated from from_date and to_date). Finally, you'll visualize how long-term employees
are spread across departments via a bar chart.

Your Task
1. Calculate the years worked based on the difference between 'to_date' and
'from_date'.
2. Group by emp_no and dept_no to sum the years worked.
3. Filter employees who have worked more than 10 years.
4. Group by department and count distinct employees who worked more than ten years.
5. Create a bar chart to visualize the distribution of long-term employees across
departments.
Cumulative Activity
The employees data contains repeated rows, where each row represents a
period of employment with corresponding dates (from_date and to_date).
To handle this, you can:
● Aggregate the years worked by each employee over multiple periods.
● That is, after you have calculated the number of years worked for each
row (the difference between from_date and to_date), group by emp_no and
dept_no, and sum the years worked across all periods.

Things to Note
➔ Make sure that you convert the Spark data frame to Pandas before
creating the bar chart.

Pro Tip
★ Review the task titled "Aggregate and summarize the data".

(Note to the learner: Pause the video to complete the task and unpause to see the solution once the task is complete.)
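
One possible solution sketch; dept_emp_df is a hypothetical name for whatever data frame in your session holds emp_no, dept_no, from_date, and to_date, and matplotlib is assumed to be available:

    from pyspark.sql.functions import col, countDistinct, datediff, sum as spark_sum
    import matplotlib.pyplot as plt

    # 1. Years worked per employment period (datediff returns days).
    tenure_df = dept_emp_df.withColumn(
        "years_worked", datediff(col("to_date"), col("from_date")) / 365.25
    )

    # 2. Total years worked per employee and department.
    totals_df = tenure_df.groupBy("emp_no", "dept_no").agg(
        spark_sum("years_worked").alias("total_years")
    )

    # 3. Keep employees with more than ten years of service.
    long_term_df = totals_df.filter(col("total_years") > 10)

    # 4. Count distinct long-term employees per department.
    dept_counts = long_term_df.groupBy("dept_no").agg(
        countDistinct("emp_no").alias("long_term_employees")
    )

    # 5. Convert to Pandas and draw the bar chart.
    dept_pd = dept_counts.toPandas()
    dept_pd.plot(kind="bar", x="dept_no", y="long_term_employees",
                 title="Long-term employees by department")
    plt.show()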
Thank you!!!
