PYSPARK TUTORIAL - 3

PySpark Tutorial 3: Advanced DataFrame Transformations

Complete the given assignment and upload it to https://forms.gle/Uw9Z7TyoayHzDUiV6

Dr. J Geetha, Associate Professor, Department of CSE, RIT



Objective

In this tutorial, you will perform advanced transformations such as joins, window functions, and pivots on large academic datasets.

Prerequisites

Students need Python and PySpark installed so that they can run the code examples.

1. Advanced DataFrame Operations


Objective:
Perform advanced transformations like joins, window functions, and pivots on
large academic datasets.
Example: Analyzing Student Performance
A college has a dataset of student grades across subjects:
columns = ["StudentID", "Name", "Subject", "Marks"]
Exercise 1:

• Calculate average marks for each student.
• Rank students based on their average marks.
• Add a column categorizing students into grades (e.g., A, B, C) based on their marks.
• Pivot the dataset to show subjects as columns and marks as values.
• Find the top-scoring student for each subject.

Additional Exercises (optional):

1. Analyze attendance records for each student and compute their attendance percentage for a semester.
2. Use a window function to calculate the cumulative GPA for students across multiple semesters.
3. Identify students who have consistently scored below a certain threshold across all subjects.

2. PySpark SQL for Querying Data

Objective:

Use PySpark SQL for complex queries on college-related data.

Example: Library Data Analysis


columns = ["StudentID", "Name", "Book", "BorrowedDate", "DaysBorrowed"]


Exercises:

1. Query to find the total books borrowed by each student.
2. Query to find books borrowed for more than 5 days.
3. Find the student who borrowed the most books.
4. Identify the most frequently borrowed book.
5. Compute the average borrowing period for each student.

Additional Exercises:

1. Create a query to find the most popular elective courses based on enrollment numbers.
2. Generate a report showing the average marks of students grouped by department and year.
3. Identify trends in late submission rates for assignments over different semesters.

3. Machine Learning with PySpark MLlib

Objective:

Apply machine learning techniques to analyze college admission trends.

Example: Predicting Admission Chances


columns = ["GRE", "TOEFL", "GPA", "Research", "AdmitChance"]
• Prepare features and label.
• Train a linear regression model.
• Predict admission chances for new data.

Exercises:

1. Use the dataset to classify students into "High", "Medium", and "Low" admission chances.
2. Train a decision tree regressor and compare its performance (for example, RMSE) with linear regression.
3. Evaluate the model using RMSE and R² metrics.



4. Real-Time Data Streaming

Objective:

Process real-time data streams, such as campus event registrations.

Example: Event Registration Stream


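from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Note: the DStream API used here is deprecated in recent Spark releases in favor of Structured Streaming.
# A local run needs at least two threads: one for the socket receiver and one for processing.
spark = SparkSession.builder.master("local[2]").appName("EventRegistrations").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 5)  # 5-second batch interval
lines = ssc.socketTextStream("localhost", 9999)
# Key each registration by its first comma-separated field (assumed to be the student name)
registrations = lines.map(lambda x: (x.split(",")[0], 1))
# Count registrations per student within each batch
count = registrations.reduceByKey(lambda a, b: a + b)
count.pprint()
ssc.start()
ssc.awaitTermination()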

Exercises:

1. Modify the script to filter out duplicate registrations.
2. Aggregate registration data into hourly windows and count unique registrations.
3. Create an alert system for when the total registrations exceed a threshold.

Additional Exercises:

1. Implement a real-time leaderboard for a college quiz competition using streaming data.
2. Process live streaming data to identify the busiest times for campus facilities, such as libraries or cafeterias.
3. Monitor and detect anomalies in a live student activity data stream, such as sudden spikes in logins or downloads.

5. Optimizing PySpark Jobs

Objective:

Learn techniques like partitioning, caching, and using broadcast variables.


Example: Optimizing Department Workload Analysis

columns = ["DeptID", "DeptName", "Students"]


• Optimize using caching.
• Broadcast department information.

Exercises:

1. Partition data by department and count the number of records in each partition.
2. Measure the performance improvement of caching on a large dataset.
3. Use broadcast variables to enrich a dataset with additional metadata.

6. Handling Missing and Inconsistent Data

Objective:

Clean messy datasets using PySpark.

Example: Fixing Missing Grades

columns = ["StudentID", "Name", "Subject", "Marks"]


• Fill missing values.
• Drop rows with missing names.

Exercises:

1. Replace missing values with the mean or median for numeric columns.
2. Identify and remove duplicate rows from a dataset.
3. Detect and correct inconsistent data (e.g., invalid scores like -1).

Additional Exercises:

1. Create a script to identify and replace invalid values in a dataset, such as marks greater than 100 or negative scores.
2. Detect missing information in registration forms and categorize incomplete records for follow-up.
3. Implement a cleaning pipeline that normalizes inconsistent formats, such as date fields or department codes.
