PySpark Foundations:
Process, analyze, and
summarize data
Estimated Time
45 minutes
Meet Olayinka Arimoro
● 5 years experience analyzing data.
● 3+ years experience building machine learning models working with health data.
● Top-rated Coursera guided project instructor with 40 courses and about 100K learners.
Recommended Background
Basic Python Programming
● Knowledge of Python syntax
● Basic data structures and data frame operations such as filtering, grouping, and summarizing data.
INTRODUCTION
Project Outcome
By the end of this project-based course, learners will be able to process,
analyze, and summarize large datasets using PySpark, including performing
data cleaning and aggregation to derive data insights.
Learning Objectives
● Process large datasets using PySpark, including data loading,
cleaning, and preprocessing.
● Perform data exploration and visualization using dataframe
operations.
● Perform data aggregation and summarization using PySpark
and dataframe operations.
Apache Spark
Source: https://spark.apache.org/
PySpark
● PySpark provides an interface for Apache Spark in
Python.
● You can write Python and SQL-like commands to
manipulate and analyze data in a distributed
processing environment.
Project Scenario
Your Role: Entry-level data analyst/scientist
In this project, you will take on the role of a junior or
entry-level data analyst/scientist and use the
employees/salaries data to perform analyses covering
key areas such as employee distribution across
departments, average salaries, and age demographics.
These analyses will give key decision-makers insight
into how to compensate, retain, and hire.
TASK OBJECTIVE
Task 1
Set up and get an overview of the project
You will get an overview of the project and set up its building blocks.
Task Summary
Key Takeaways
● PySpark is an interface for Apache Spark in Python to manipulate
and analyze data in a distributed processing environment.
● SparkSession.builder creates an entry point to using PySpark;
appName() names the Spark application; getOrCreate() retrieves
an existing Spark session or creates a new one.
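To make this concrete, here is a minimal sketch of creating the entry point described above; the application name "PySpark Foundations" is an illustrative choice, not a required value.

from pyspark.sql import SparkSession

# appName() labels the Spark application; getOrCreate() returns an
# existing session if one is already running, otherwise it starts one.
spark = SparkSession.builder \
    .appName("PySpark Foundations") \
    .getOrCreate()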
TASK OBJECTIVE
Task 2
Load the data
You will load the employees.csv and updated_salaries.csv data.
Task Summary
Key Takeaways
● spark.read.csv: a method in Spark that reads a CSV file into a data
frame.
● header=True: specifies that the first row of the CSV file contains the header (column names).
● inferSchema=True: tells Spark to automatically infer the data types
of each column based on the values in the CSV file.
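Putting these options together, a short sketch of the loading step might look like the following; the file paths assume the CSVs sit in the working directory, so adjust them to your workspace.

# Read each CSV into a Spark data frame, treating the first row as the
# header and letting Spark infer each column's data type.
employees_df = spark.read.csv("employees.csv", header=True, inferSchema=True)
salaries_df = spark.read.csv("updated_salaries.csv", header=True, inferSchema=True)

employees_df.printSchema()  # verify the inferred column types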
TASK OBJECTIVE
Task 3
Clean and process data
You will perform quick data cleaning by converting columns to proper
data types.
Task Summary
Key Takeaways
● Formatting allows for standardization and normalization of data. It
aids in error detection and data cleaning, setting the stage for
reliable data analytics.
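As one illustration of this formatting step, the sketch below casts columns to proper types; the column names (salary, from_date, to_date) are assumptions drawn from the project scenario, and the date format is assumed to be yyyy-MM-dd.

from pyspark.sql.functions import col, to_date

# Cast salary to a numeric type and the date columns to DateType so
# they sort, filter, and aggregate correctly in later tasks.
salaries_df = (
    salaries_df
    .withColumn("salary", col("salary").cast("double"))
    .withColumn("from_date", to_date(col("from_date"), "yyyy-MM-dd"))
    .withColumn("to_date", to_date(col("to_date"), "yyyy-MM-dd"))
)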
TASK OBJECTIVE
Task 4
Explore the data
You will explore the salaries data by computing summary statistics
and visualizing the salary column.
Task Summary
Key Takeaways
● Data exploration, one of the first steps in data preparation, is a way
to get to know data before working with it.
● toPandas() converts the Spark data frame to a Pandas data frame.
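A sketch of this exploration step, assuming the salaries data frame has a salary column and that matplotlib is available for the plot:

import matplotlib.pyplot as plt

# Summary statistics (count, mean, stddev, min, max) for salary.
salaries_df.describe("salary").show()

# Convert only the salary column to Pandas, then plot a histogram.
salary_pdf = salaries_df.select("salary").toPandas()
salary_pdf["salary"].plot(kind="hist", bins=30, title="Salary Distribution")
plt.show()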
Practice Activity
This task is optional and ungraded.
The goal is to check your understanding.
Practice Activity
In this practice task, you will explore the employees data. To complete this practice task, please follow the instructions below:
● Create a sum of missing values per column in the employees data.

Things to Note
➔ When returning the counts, you may decide to use or not use the f-string format to print the return.
(Note to the learner: Pause the video to complete the task and unpause to see the solution once the task is complete.)
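One possible solution sketch for this practice task, to compare against after you attempt it; it assumes the employees data frame loaded in Task 2.

from pyspark.sql.functions import col, count, when

# For each column, count the rows where the value is null. when() with
# no otherwise() yields null for non-missing rows, so count() skips them.
missing_counts = employees_df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in employees_df.columns]
)
missing_counts.show()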
TASK OBJECTIVE
Task 5
Aggregate and summarize the data
You will perform data aggregation and summarization using the
salaries data.
Task Summary
Key Takeaways
● The groupBy operation allows you to perform aggregate functions
(such as sum, count, max, min, etc.) on grouped data, based on one
or more columns.
● The agg() function is used to apply aggregate functions to the
grouped data. You would typically pass functions inside agg() to
specify what operation to perform on the grouped data.
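A sketch of the groupBy/agg pattern described above; grouping the salaries data by emp_no and computing the average and maximum salary is an illustrative choice of columns and functions, not the only aggregation the task covers.

from pyspark.sql.functions import avg, max as spark_max

# Average and maximum salary per employee across all salary records.
salary_summary = (
    salaries_df
    .groupBy("emp_no")
    .agg(
        avg("salary").alias("avg_salary"),
        spark_max("salary").alias("max_salary"),
    )
)
salary_summary.show(5)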
TASK OBJECTIVE
Task 6
Join the data sets
You will join the salaries and employees data using the employee number.
Task Summary
Key Takeaways
● Use left join when you need all records from the left table and the
matching records from the right table. Unmatched records from the
right table will have NULL values.
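A sketch of the join for this task, assuming both data frames share an emp_no employee-number column:

# Keep every employee record and attach matching salary records;
# employees with no salary match get NULLs in the salary columns.
joined_df = employees_df.join(salaries_df, on="emp_no", how="left")
joined_df.show(5)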
Congratulations on
completing your Guided
Project!
Next steps?
Thank you!!!
Cumulative Activity
This activity is optional and ungraded.
The goal is for you to apply the knowledge and skills
learned within this Guided Project to boost your
confidence.
Cumulative Activity Scenario
As a junior data analyst at a growing company, you are tasked with analyzing employee retention. Your aim is
to find the departments with the highest number of employees who have worked longer than ten years. This will
assist HR in enhancing employee engagement and retention strategies. To complete this activity, you will use
the employee dataset and create a data frame with the count of employees in each department who have worked
for more than ten years (calculated from from_date and to_date). Finally, you'll visualize how long-term
employees are spread across departments via a bar chart.
Your Task
1. Calculate the years worked based on the difference between 'to_date' and 'from_date'.
2. Group by emp_no and dept_no to sum the years worked.
3. Filter employees who have worked more than 10 years.
4. Group by department and count distinct employees who have worked more than 10 years.
5. Create a bar chart to visualize the distribution of long-term employees across departments.
Cumulative Activity
Note that we have repeated rows for employees, where each row represents a period of employment with corresponding dates (from_date and to_date). To handle this, you can follow the steps outlined under "Your Task" above.

Things to Note
➔ Make sure that you convert the Spark data frame to Pandas before creating the bar chart.
(Note to the learner: Pause the video to complete the task and unpause to see the solution once the task is complete.)
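For reference after you attempt the activity, here is one hedged solution sketch. The name dept_emp_df stands in for whichever data frame in your workspace holds emp_no, dept_no, from_date, and to_date, and dividing by 365.25 days is an approximation of years.

import matplotlib.pyplot as plt
from pyspark.sql.functions import col, countDistinct, datediff, sum as spark_sum

# 1. Years worked for each employment period (datediff returns days).
tenure_df = dept_emp_df.withColumn(
    "years_worked", datediff(col("to_date"), col("from_date")) / 365.25
)

# 2.-3. Sum the years per employee and department, then keep only
# employees with more than ten years in a department.
long_term = (
    tenure_df.groupBy("emp_no", "dept_no")
    .agg(spark_sum("years_worked").alias("total_years"))
    .filter(col("total_years") > 10)
)

# 4. Count distinct long-term employees in each department.
dept_counts = long_term.groupBy("dept_no").agg(
    countDistinct("emp_no").alias("long_term_employees")
)

# 5. Convert to Pandas before plotting, then draw the bar chart.
dept_counts.toPandas().plot(
    kind="bar", x="dept_no", y="long_term_employees",
    title="Long-Term Employees by Department",
)
plt.show()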
Thank you!!!