
Course: Big Data

Lab 05
PySpark - DataFrame

Question 1:
You are given a TSV file, WHO-COVID-19-20210601-213841.tsv, corresponding to the WHO
Coronavirus (COVID-19) Dashboard.

Students are required to create a folder named lab05 in the /content directory of Google Colab
and then copy the TSV file to /content/lab05/input/

Take a screenshot to show your work.
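
Below is a minimal sketch of one way to do this from a Colab cell (the assumption that the TSV was first uploaded to /content is mine, not part of the assignment):

import os
import shutil

# Create /content/lab05/input, doing nothing if it already exists
os.makedirs("/content/lab05/input", exist_ok=True)

# Copy the uploaded TSV into the input folder
# (assumes the file was uploaded to /content first)
shutil.copy("/content/WHO-COVID-19-20210601-213841.tsv",
            "/content/lab05/input/WHO-COVID-19-20210601-213841.tsv")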

Question 2:
Write a PySpark program, saved as ASEANCaseCount.py, that uses DataFrames:
● to count the number of cumulative total cases among ASEAN countries (the South-East
Asia Region in the given data table);
● to find the country with the maximum number of cumulative total cases among ASEAN
countries;
● to find the top 3 countries with the lowest number of cumulative cases among ASEAN
countries.

● Insert your source code into the table below.


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace
from pyspark.sql.functions import sum as _sum

# Initialize a SparkSession (the modern entry point for DataFrames)
spark = SparkSession.builder \
    .master("local") \
    .appName("ASEAN COVID-19 Case Count") \
    .getOrCreate()

# Read the tab-separated WHO data, treating the first row as a header
file_path = "/content/lab05/input/WHO-COVID-19-20210601-213841.tsv"
df = spark.read.option("header", True).option("delimiter", "\t").csv(file_path)

# Keep only the rows for the South-East Asia Region
asean_df = df.filter(col("WHO Region") == "South-East Asia")

# The cumulative totals contain thousands separators (e.g. "1,234,567"),
# so strip the commas and cast the column to a numeric type
asean_df = asean_df.withColumn(
    "Cases - cumulative total",
    regexp_replace(col("Cases - cumulative total"), ",", "").cast("double"))

# 1. Total cumulative cases across the region
total_cases_df = asean_df.select(_sum("Cases - cumulative total").alias("Total Cases"))
total_cases = total_cases_df.first()["Total Cases"]
print(f"Total cumulative cases in ASEAN countries: {total_cases}")

# 2. Country with the highest cumulative total
max_cases_country = asean_df.orderBy(col("Cases - cumulative total").desc()).first()
print(f"Country with the highest cumulative cases: "
      f"{max_cases_country['Name']} with {max_cases_country['Cases - cumulative total']}")

# 3. Top 3 countries with the lowest cumulative totals
top3_min_cases_df = asean_df.orderBy("Cases - cumulative total").limit(3)
print("Three countries with the lowest cumulative cases in ASEAN:")
for row in top3_min_cases_df.rdd.toLocalIterator():
    print(f"{row['Name']}: {row['Cases - cumulative total']}")

● Take a screenshot of the terminal to show the program's result.
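
To produce the result for the screenshot, one way to run the program from a Colab cell is sketched below (the script path /content/lab05/ASEANCaseCount.py and the pip install step are assumptions; adjust them to where you saved the file):

!pip install -q pyspark                    # install PySpark in the Colab runtime, if needed
!python /content/lab05/ASEANCaseCount.py   # run the program and print its output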


Submission Notice

● Export your answer file as a PDF.
● Rename the PDF following the format:
lab05_<student number>_<full name>.pdf
E.g. lab05_123456_NguyenThanhAn.pdf
If you have not been assigned a student number yet, use 123456 instead.
● Careless mistakes in filename, format, question order, etc. are not accepted (0 pts).
