Task2 - Part 2

The document outlines a PySpark task focused on analyzing COVID-19 patient data from a CSV file named PatientInfo. It describes the structure of the dataset, including patient details such as ID, sex, age, and infection case, and provides code snippets for loading the data and performing basic operations like counting released patients. Additionally, it includes tasks for handling null values and modifying the dataset.

PySpark Task 2 - Part 2 (.ipynb, Colab)


In this task we will be using the "[NeurIPS 2020] Data Science for COVID-19 (DS4C)" dataset.

The CSV file that we will be using in this task is PatientInfo.csv.

PatientInfo.csv

patient_id: the ID of the patient

sex: the sex of the patient

age: the age of the patient

country: the country of the patient

province: the province of the patient

city: the city of the patient

infection_case: how the patient was infected

infected_by: the ID of the patient who infected this patient

contact_number: the number of people the patient was in contact with

symptom_onset_date: the date of symptom onset

confirmed_date: the date the case was confirmed

released_date: the date the patient was released

deceased_date: the date the patient died

state: isolated / released / deceased

Import and create SparkSession


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("COVID-19 Analysis").getOrCreate()

Load the PatientInfo.csv file and show the first 5 rows
df = spark.read.csv("/content/PatientInfo.csv", header=True, inferSchema=True)
df.show(5)


+----------+------+---+-------+--------+-----------+--------------------+
|patient_id|   sex|age|country|province|       city|      infection_case|
+----------+------+---+-------+--------+-----------+--------------------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|
|1000000003|  male|50s|  Korea|   Seoul|  Jongno-gu|contact with patient|
|1000000004|  male|20s|  Korea|   Seoul|    Mapo-gu|     overseas inflow|
|1000000005|female|20s|  Korea|   Seoul|Seongbuk-gu|contact with patient|
+----------+------+---+-------+--------+-----------+--------------------+
only showing top 5 rows (remaining columns truncated in the export)

Display the schema of the dataset


df.printSchema()

root
|-- patient_id: long (nullable = true)
|-- sex: string (nullable = true)
|-- age: string (nullable = true)
|-- country: string (nullable = true)
|-- province: string (nullable = true)
|-- city: string (nullable = true)
|-- infection_case: string (nullable = true)
|-- infected_by: string (nullable = true)
|-- contact_number: string (nullable = true)
|-- symptom_onset_date: string (nullable = true)
|-- confirmed_date: date (nullable = true)
|-- released_date: date (nullable = true)
|-- deceased_date: date (nullable = true)
|-- state: string (nullable = true)

Using the state column

How many people survived (released)?

df.filter(df.state == "released").count()

2929

Bonus Question!

Display the number of null values in each column. We didn't cover how to do this,
but we covered something very similar. Check this link for a hint:
https://sparkbyexamples.com/pyspark/pyspark-find-count-of-null-none-nan-values/
If you get stuck on this, don't worry, just view the solutions.



Fill the nulls in the infected_by column with the string "Unknown"

Use Shift + Tab on the fill function for a hint.

Try to drop the column infection_case



