
Course: Big Data

Lab 05
PySpark - DataFrame

Question 1:
You are given a TSV file, WHO-COVID-19-20210601-213841.tsv, corresponding to the WHO
Coronavirus (COVID-19) Dashboard.

Students are required to create a folder named lab05 in the /content directory of Google Colab
and then copy the TSV file to /content/lab05/input/

Take a screenshot to show your work.
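
Below is a minimal sketch of one way to do this from a Colab cell (the assumption that the TSV was first uploaded to /content is mine, not part of the assignment):

import os
import shutil

# Create /content/lab05/input, doing nothing if it already exists
os.makedirs("/content/lab05/input", exist_ok=True)

# Copy the uploaded TSV into the input folder
# (assumes the file was uploaded to /content first)
shutil.copy("/content/WHO-COVID-19-20210601-213841.tsv",
            "/content/lab05/input/WHO-COVID-19-20210601-213841.tsv")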

Question 2:
Write a PySpark program, saved as ASEANCaseCount.py, that uses DataFrames:
● to count the number of cumulative total cases among ASEAN countries (the South-East
Asia Region in the given data table);
● to find the country with the maximum number of cumulative total cases among ASEAN
countries;
● to find the top 3 countries with the lowest number of cumulative cases among ASEAN
countries.

● Insert your source code into the table below.


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace
from pyspark.sql.functions import sum as _sum

# Initialize a SparkSession (the modern entry point for DataFrames)
spark = SparkSession.builder \
    .master("local") \
    .appName("ASEAN COVID-19 Case Count") \
    .getOrCreate()

# Read the tab-separated WHO data, treating the first row as a header
file_path = "/content/lab05/input/WHO-COVID-19-20210601-213841.tsv"
df = spark.read.option("header", True).option("delimiter", "\t").csv(file_path)

# Keep only the rows for the South-East Asia Region
asean_df = df.filter(col("WHO Region") == "South-East Asia")

# The cumulative totals contain thousands separators (e.g. "1,234,567"),
# so strip the commas and cast the column to a numeric type
asean_df = asean_df.withColumn(
    "Cases - cumulative total",
    regexp_replace(col("Cases - cumulative total"), ",", "").cast("double"))

# 1. Total cumulative cases across the region
total_cases_df = asean_df.select(_sum("Cases - cumulative total").alias("Total Cases"))
total_cases = total_cases_df.first()["Total Cases"]
print(f"Total cumulative cases in ASEAN countries: {total_cases}")

# 2. Country with the highest cumulative total
max_cases_country = asean_df.orderBy(col("Cases - cumulative total").desc()).first()
print(f"Country with the highest cumulative cases: "
      f"{max_cases_country['Name']} with {max_cases_country['Cases - cumulative total']}")

# 3. Top 3 countries with the lowest cumulative totals
top3_min_cases_df = asean_df.orderBy("Cases - cumulative total").limit(3)
print("Three countries with the lowest cumulative cases in ASEAN:")
for row in top3_min_cases_df.rdd.toLocalIterator():
    print(f"{row['Name']}: {row['Cases - cumulative total']}")

● Take a screenshot of the terminal to show the program's result.
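
To produce the result for the screenshot, one way to run the program from a Colab cell is sketched below (the script path /content/lab05/ASEANCaseCount.py and the pip install step are assumptions; adjust them to where you saved the file):

!pip install -q pyspark                    # install PySpark in the Colab runtime, if needed
!python /content/lab05/ASEANCaseCount.py   # run the program and print its output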


Submission Notice

● Export your answer file as a PDF.
● Rename the PDF following the format:
lab05_<student number>_<full name>.pdf
E.g. lab05_123456_NguyenThanhAn.pdf
If you have not been assigned a student number yet, use 123456 instead.
● Careless mistakes in filename, format, question order, etc. are not accepted (0 pts).
