Task 2 - Part 2
Load the PatientInfo.csv file and show the first 5 rows
# Read the CSV with a header row and let Spark infer the column types
df = spark.read.csv("/content/PatientInfo.csv", header=True, inferSchema=True)
df.show(5)
df.printSchema()
+----------+------+---+-------+--------+-----------+--------------------+
|patient_id|   sex|age|country|province|       city|      infection_case|
+----------+------+---+-------+--------+-----------+--------------------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|
|1000000003|  male|50s|  Korea|   Seoul|  Jongno-gu|contact with patient|
|1000000004|  male|20s|  Korea|   Seoul|    Mapo-gu|     overseas inflow|
|1000000005|female|20s|  Korea|   Seoul|Seongbuk-gu|contact with patient|
+----------+------+---+-------+--------+-----------+--------------------+
only showing top 5 rows
(the remaining columns, infected_by onward, were cut off in the export)
root
|-- patient_id: long (nullable = true)
|-- sex: string (nullable = true)
|-- age: string (nullable = true)
|-- country: string (nullable = true)
|-- province: string (nullable = true)
|-- city: string (nullable = true)
|-- infection_case: string (nullable = true)
|-- infected_by: string (nullable = true)
|-- contact_number: string (nullable = true)
|-- symptom_onset_date: string (nullable = true)
|-- confirmed_date: date (nullable = true)
|-- released_date: date (nullable = true)
|-- deceased_date: date (nullable = true)
|-- state: string (nullable = true)
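inferSchema left contact_number and symptom_onset_date as strings. If typed columns are needed later, they can be cast after the read; a minimal sketch (the df_typed name is just for illustration, not part of the task):

from pyspark.sql.functions import col

# Cast two of the string columns to more useful types;
# values that do not parse become null instead of failing the job
df_typed = (df
    .withColumn("contact_number", col("contact_number").cast("int"))
    .withColumn("symptom_onset_date", col("symptom_onset_date").cast("date")))
df_typed.printSchema()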
Count the patients whose state is "released"
df.filter(df.state == "released").count()
2929
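The 2929 figure can be sanity-checked by counting every state at once; a quick sketch, assuming df is the DataFrame loaded above:

# Count patients per state; the "released" row should show 2929
df.groupBy("state").count().orderBy("count", ascending=False).show()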