Assignment Sujith S

This document reads in COVID-19 case data from a CSV file and performs some initial data cleaning and exploration. It loads the data into a Pandas DataFrame, drops some columns, renames others, and uses a simple imputer to fill in missing values. It also lists the unique continent categories and shows the data types and shape of the DataFrame before and after processing.

Uploaded by

sujith

11/1/22, 12:49 PM assignment.ipynb - Colaboratory

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

# Read the data:

d = pd.read_csv(r"/owid-covid-data.csv")

df = pd.DataFrame(d)
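A minimal sketch of the loading step on a made-up three-row sample that borrows a few column names from the real file. `pd.read_csv` already returns a DataFrame, so the extra `pd.DataFrame(d)` wrap is optional. The `parse_dates` argument is an assumption added here so that `date` is stored as a datetime rather than as `object`; the original notebook does not use it.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for owid-covid-data.csv.
sample_csv = io.StringIO(
    "iso_code,continent,location,date,total_cases\n"
    "AFG,Asia,Afghanistan,2020-02-24,5.0\n"
    "AFG,Asia,Afghanistan,2020-02-25,5.0\n"
    "AFG,Asia,Afghanistan,2020-02-26,\n"
)

# read_csv returns a DataFrame directly; parse_dates converts the date column.
sample_df = pd.read_csv(sample_csv, parse_dates=["date"])

print(type(sample_df).__name__)
print(sample_df.dtypes)
```

With real data the same call would take the file path in place of `sample_csv`.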

#View the Data:

df.shape

(231176, 67)

df.head()

  iso_code continent     location        date  total_cases  new_cases  new_cases_smoothed  ...
0      AFG      Asia  Afghanistan  2020-02-24          5.0        5.0                 NaN
1      AFG      Asia  Afghanistan  2020-02-25          5.0        0.0                 NaN
2      AFG      Asia  Afghanistan  2020-02-26          5.0        0.0                 NaN
3      AFG      Asia  Afghanistan  2020-02-27          5.0        0.0                 NaN
4      AFG      Asia  Afghanistan  2020-02-28          5.0        0.0                 NaN

5 rows × 67 columns

df.dtypes

iso_code object

continent object

location object

date object

total_cases float64

...

https://colab.research.google.com/drive/1AJFCKFDfnSSH-YAv3_nCI5B9eA7zmP7v#scrollTo=32WSiJI7AmH2&printMode=true 1/13

population float64

excess_mortality_cumulative_absolute float64

excess_mortality_cumulative float64

excess_mortality float64

excess_mortality_cumulative_per_million float64

Length: 67, dtype: object

df.describe(include = "all")

       iso_code continent location        date   total_cases     new_cases  new_cases_smo
count    231176    218126   231176      231176  2.180760e+05  2.177610e+05  2.165650
unique      248         6      248        1034           NaN           NaN
top         MEX    Europe   Mexico  2021-09-14           NaN           NaN
freq       1033     52934     1033         247           NaN           NaN
mean        NaN       NaN      NaN         NaN  4.479755e+06  1.234293e+04  1.238437
std         NaN       NaN      NaN         NaN  2.798278e+07  8.510011e+04  8.31688
min         NaN       NaN      NaN         NaN  1.000000e+00  0.000000e+00  0.000000
25%         NaN       NaN      NaN         NaN  4.427000e+03  0.000000e+00  6.000000
50%         NaN       NaN      NaN         NaN  4.975150e+04  5.300000e+01  9.328600
75%         NaN       NaN      NaN         NaN  5.259428e+05  9.510000e+02  1.12157
max         NaN       NaN      NaN         NaN  6.299857e+08  4.081968e+06  3.436032

11 rows × 67 columns

df.columns

Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
       'icu_patients_per_million', 'hosp_patients',
       'hosp_patients_per_million', 'weekly_icu_admissions',
       'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
       'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'new_vaccinations', 'new_vaccinations_smoothed',
       'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
       'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred',
       'new_vaccinations_smoothed_per_million',
       'new_people_vaccinated_smoothed',
       'new_people_vaccinated_smoothed_per_hundred', 'stringency_index',
       'population_density', 'median_age', 'aged_65_older', 'aged_70_older',
       'gdp_per_capita', 'extreme_poverty', 'cardiovasc_death_rate',
       'diabetes_prevalence', 'female_smokers', 'male_smokers',
       'handwashing_facilities', 'hospital_beds_per_thousand',
       'life_expectancy', 'human_development_index', 'population',
       'excess_mortality_cumulative_absolute', 'excess_mortality_cumulative',
       'excess_mortality', 'excess_mortality_cumulative_per_million'],
      dtype='object')

,'new_cases_per_million','total_cases_per_million'],axis = 1, inplace = True)

# Shape of the table after dropping some columns

df.shape

(231176, 63)
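The drop call above is cut off in the export, so the full list of removed columns is not recoverable. As a sketch of the same step on a made-up frame, `DataFrame.drop(columns=...)` returns a new frame, which avoids `inplace=True`. The frame `toy` and its values are hypothetical; only the two column names visible above are reused.

```python
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
toy = pd.DataFrame({
    "location": ["Afghanistan", "Zimbabwe"],
    "total_cases": [5.0, 257893.0],
    "new_cases_per_million": [0.1, 0.0],
    "total_cases_per_million": [0.1, 16000.0],
})

to_drop = ["new_cases_per_million", "total_cases_per_million"]

# drop returns a copy without the listed columns; toy itself is unchanged.
trimmed = toy.drop(columns=to_drop)

print(toy.shape, "->", trimmed.shape)
```

Comparing `shape` before and after, as the notebook does, is a quick check that the expected number of columns was removed.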

ndex. In our dataset we will rename the columns:

ion':'Country','continent':'Continent','iso_code':'ISO_code'},inplace = True )
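The rename call is also cut off on the left. A sketch of the same idea on a one-row made-up frame follows; the `continent` and `iso_code` entries are visible above, while the `location` and `date` entries are assumptions inferred from the `Country` and `Date` columns that appear in later outputs.

```python
import pandas as pd

# Hypothetical one-row frame with the original column names.
toy = pd.DataFrame({
    "iso_code": ["AFG"],
    "continent": ["Asia"],
    "location": ["Afghanistan"],
    "date": ["2020-02-24"],
})

# 'location' and 'date' mappings are inferred, not read from the notebook.
new_names = {
    "location": "Country",
    "date": "Date",
    "continent": "Continent",
    "iso_code": "ISO_code",
}

# rename returns a new frame; columns not in the mapping keep their names.
renamed = toy.rename(columns=new_names)
print(list(renamed.columns))
```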

#List the continent names:

continent =  list(df.Continent.unique())

continent

['Asia', nan, 'Europe', 'Africa', 'North America', 'South America', 'Oceania']

#simple imputer:

#Simple imputer helps with missing values in a dataset. In the below code, a simple imputer w

imputer = SimpleImputer(strategy='constant')

df2 = pd.DataFrame(imputer.fit_transform(df),columns=df.columns)
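Because `df` mixes text and numeric columns, the imputer above writes the string `missing_value` into numeric columns as well, which is visible in the `df2` output. One alternative, sketched here with plain pandas `fillna` rather than `SimpleImputer`, fills each kind of column separately so that numeric columns stay numeric. The toy frame and the fill values (0 and `missing_value`) are assumptions chosen to match the later cells.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both a text and two numeric columns.
toy = pd.DataFrame({
    "Continent": ["Asia", np.nan, "Africa"],
    "total_cases": [5.0, np.nan, 257893.0],
    "total_deaths": [np.nan, np.nan, 5606.0],
})

numeric_cols = toy.select_dtypes(include="number").columns
other_cols = toy.columns.difference(numeric_cols)

filled = toy.copy()
# Numeric gaps become 0; text gaps become a placeholder string.
filled[numeric_cols] = filled[numeric_cols].fillna(0)
filled[other_cols] = filled[other_cols].fillna("missing_value")

print(filled.dtypes)
```

Keeping the numeric columns as floats means comparisons such as `total_deaths > 1000000` work without a later string-to-number replacement step.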


df2

       ISO_code Continent      Country        Date total_cases new_cases   total_deaths    n
0           AFG      Asia  Afghanistan  2020-02-24         5.0       5.0  missing_value  mis
1           AFG      Asia  Afghanistan  2020-02-25         5.0       0.0  missing_value  mis
2           AFG      Asia  Afghanistan  2020-02-26         5.0       0.0  missing_value  mis
3           AFG      Asia  Afghanistan  2020-02-27         5.0       0.0  missing_value  mis
4           AFG      Asia  Afghanistan  2020-02-28         5.0       0.0  missing_value  mis
...         ...       ...          ...         ...         ...       ...            ...
231171      ZWE    Africa     Zimbabwe  2022-10-25    257893.0       0.0         5606.0
231172      ZWE    Africa     Zimbabwe  2022-10-26    257893.0       0.0         5606.0
231173      ZWE    Africa     Zimbabwe  2022-10-27    257893.0       0.0         5606.0
231174      ZWE    Africa     Zimbabwe  2022-10-28    257893.0       0.0         5606.0
231175      ZWE    Africa     Zimbabwe  2022-10-29    257893.0       0.0         5606.0

231176 rows × 63 columns

df2.groupby(['Date','Country'])[['Date','Country','total_cases','total_deaths','total_vaccina

df2


       ISO_code Continent      Country        Date total_cases new_cases   total_deaths    n
0           AFG      Asia  Afghanistan  2020-02-24         5.0       5.0  missing_value  mis
1           AFG      Asia  Afghanistan  2020-02-25         5.0       0.0  missing_value  mis
2           AFG      Asia  Afghanistan  2020-02-26         5.0       0.0  missing_value  mis
3           AFG      Asia  Afghanistan  2020-02-27         5.0       0.0  missing_value  mis
4           AFG      Asia  Afghanistan  2020-02-28         5.0       0.0  missing_value  mis
...         ...       ...          ...         ...         ...       ...            ...
231171      ZWE    Africa     Zimbabwe  2022-10-25    257893.0       0.0         5606.0
231172      ZWE    Africa     Zimbabwe  2022-10-26    257893.0       0.0         5606.0
231173      ZWE    Africa     Zimbabwe  2022-10-27    257893.0       0.0         5606.0
231174      ZWE    Africa     Zimbabwe  2022-10-28    257893.0       0.0         5606.0
231175      ZWE    Africa     Zimbabwe  2022-10-29    257893.0       0.0         5606.0

231176 rows × 63 columns

df3 = df2.groupby(['Date','Country'])[['Date','Country','total_cases','total_deaths','total_v

df3.tail(10)


              Date            Country    total_cases   total_deaths  total_vaccinations
231166  2022-10-29  Wallis and Futuna          761.0            7.0       missing_value
231167  2022-10-29              World    629985701.0      6588602.0       12886036833.0
231168  2022-10-29              Yemen        11939.0         2158.0       missing_value
231169  2022-10-29             Zambia       333674.0         4017.0       missing_value
231170  2022-10-29           Zimbabwe       257893.0         5606.0       missing_value
231171  2022-10-30            Austria  missing_value  missing_value       missing_value
231172  2022-10-30            Germany  missing_value  missing_value       missing_value
231173  2022-10-30             Israel  missing_value  missing_value       missing_value
231174  2022-10-30           Malaysia  missing_value  missing_value       missing_value
231175  2022-10-30             Russia  missing_value  missing_value       missing_value

#change missing_value to 0

df3['total_cases'].replace({'missing_value':0},inplace=True)

df3['total_deaths'].replace({'missing_value':0},inplace=True)

df3['total_vaccinations'].replace({'missing_value':0},inplace=True)

df3

              Date    Country  total_cases  total_deaths  total_vaccinations
0       2020-01-01  Argentina          0.0           0.0                 0.0
1       2020-01-01     Mexico          0.0           0.0                 0.0
2       2020-01-02  Argentina          0.0           0.0                 0.0
3       2020-01-02     Mexico          0.0           0.0                 0.0
4       2020-01-03  Argentina          0.0           0.0                 0.0
...            ...        ...          ...           ...                 ...
231171  2022-10-30    Austria          0.0           0.0                 0.0
231172  2022-10-30    Germany          0.0           0.0                 0.0
231173  2022-10-30     Israel          0.0           0.0                 0.0
231174  2022-10-30   Malaysia          0.0           0.0                 0.0
231175  2022-10-30     Russia          0.0           0.0                 0.0

231176 rows × 5 columns
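The three `replace` calls can also be written as a single conversion. The sketch below assumes the columns hold numbers plus the placeholder string `missing_value`: `pd.to_numeric(..., errors="coerce")` turns anything non-numeric into NaN, and `fillna(0)` then replaces it. The frame `toy` is made up, reusing a few values from the output above.

```python
import pandas as pd

# Hypothetical frame mixing floats and the placeholder string.
toy = pd.DataFrame({
    "Country": ["World", "Yemen", "Austria"],
    "total_cases": [629985701.0, 11939.0, "missing_value"],
    "total_vaccinations": [12886036833.0, "missing_value", "missing_value"],
})

value_cols = ["total_cases", "total_vaccinations"]

cleaned = toy.copy()
# Coerce each value column to numbers, then fill the resulting NaNs with 0.
cleaned[value_cols] = (
    cleaned[value_cols]
    .apply(pd.to_numeric, errors="coerce")
    .fillna(0)
)

print(cleaned.dtypes)
```

Unlike a string replacement, this also leaves the columns with a float dtype, so later sums and comparisons operate on numbers rather than Python objects.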

#total countries where total_deaths is greater than 1000000

df4=df3[df3['total_deaths']>1000000]

df4


Date Country total_cases total_deaths total_vaccinations

45117 2020-09-16 World 29933149.0 1004634.0 0.000000e+00

45345 2020-09-17 World 30248717.0 1010285.0 0.000000e+00

45573 2020-09-18 World 30575342.0 1016140.0 0.000000e+00

45801 2020-09-19 World 30868423.0 1021413.0 0.000000e+00

46029 2020-09-20 World 31123528.0 1025401.0 0.000000e+00

... ... ... ... ... ...

231093 2022-10-29 North America 115473140.0 1525509.0 0.000000e+00

231132 2022-10-29 South America 64273975.0 1332168.0 9.290533e+08

231158 2022-10-29 United States 97447532.0 1070264.0 0.000000e+00

231159 2022-10-29 Upper middle income 138798900.0 2497566.0 5.321340e+09

231167 2022-10-29 World 629985701.0 6588602.0 1.288604e+10

4770 rows × 5 columns

#unique countries where total_deaths is greater than 1000000
countries = df4['Country'].unique()
print(len(countries))

print()
print("conuntry_deaths_greater_than_1000000 : ")
print()

conuntry_deaths_greater_than_1000000 = list(df4['Country'].unique())
conuntry_deaths_greater_than_1000000

10

conuntry_deaths_greater_than_1000000 :

['World',

'High income',

'Upper middle income',

'Europe',

'South America',

'Asia',

'Lower middle income',

'North America',

'European Union',

'United States']
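Every entry in this list except `United States` is an aggregate (a continent, an income group, or the world total) rather than a single country. The earlier continent listing contains `nan`, and the sketch below assumes that aggregate rows are the ones with a missing continent; that assumption should be checked against the real data before relying on it. The frame is made up, and the `Europe` figure in particular is a placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the Europe value is a placeholder, not real data.
toy = pd.DataFrame({
    "Country": ["World", "Europe", "United States", "Zimbabwe"],
    "Continent": [np.nan, np.nan, "North America", "Africa"],
    "total_deaths": [6588602.0, 2000000.0, 1070264.0, 5606.0],
})

# Assumption: aggregate rows carry no continent, so notna() keeps countries.
countries_only = toy[toy["Continent"].notna()]
over_million = countries_only[countries_only["total_deaths"] > 1_000_000]

print(over_million["Country"].unique().tolist())
```

Applied before the threshold filter, a step like this would keep the per-country plots from being dominated by overlapping aggregates.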

New Section


#plotting the trend
for country in countries:
    # Rows for one country, re-indexed so the x axis starts at 0
    C = df4[df4['Country'] == country].reset_index()
    days = np.arange(0, len(C))
    plt.scatter(days, C['total_cases'], color="blue", label="total_cases")
    plt.scatter(days, C['total_deaths'], color="red", label="total_deaths")
    plt.scatter(days, C['total_vaccinations'], color="green", label="total_vaccinations")
    plt.title(country)
    plt.xlabel("Number of days since first suspect")
    plt.ylabel("Number of cases")
    plt.legend()
    plt.show()


#group the countries

df5 = df4.groupby('Country')[['total_cases', 'total_deaths']].sum().reset_index()
df5

Country total_cases total_deaths

0 Asia 5.521382e+10 5.806319e+08

1 Europe 7.241052e+10 8.742173e+08

2 European Union 3.696045e+10 2.760102e+08

3 High income 1.213026e+11 1.303327e+09

4 Lower middle income 3.455630e+10 5.294324e+08

5 North America 3.665789e+10 5.602120e+08

6 South America 2.429292e+10 6.009055e+08

7 United States 1.583746e+10 1.799059e+08

8 Upper middle income 5.491293e+10 1.225825e+09

9 World 2.286620e+11 3.447812e+09
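`total_cases` and `total_deaths` are cumulative, so summing them over every date counts the same cases repeatedly. That is why the figures above are far larger than the latest totals shown earlier (for example, 6588602 deaths for World on 2022-10-29). If the goal is one total per country, one option is to take the most recent row in each group instead. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical cumulative series for two groups over two days.
toy = pd.DataFrame({
    "Date": ["2022-10-28", "2022-10-29", "2022-10-28", "2022-10-29"],
    "Country": ["World", "World", "Asia", "Asia"],
    "total_deaths": [100.0, 110.0, 40.0, 45.0],
})

# Most recent row per country: sort by date, then keep the last of each group.
latest = (
    toy.sort_values("Date")
    .groupby("Country", as_index=False)
    .last()
)

# For comparison: summing the cumulative values over dates inflates the total.
summed = toy.groupby("Country", as_index=False)["total_deaths"].sum()

print(latest)
print(summed)
```

Sorting works here because the dates are ISO-formatted strings, which order correctly as text.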

C = df5

plt.scatter(np.arange(0, len(C)), C['total_cases'], color="blue", label="total_cases")

plt.scatter(np.arange(0, len(C)), C['total_deaths'], color="red", label="total_deaths")

plt.title("World")

plt.xlabel("Number of days since first suspect")

plt.ylabel("Number of cases")

plt.legend()

plt.show()


#analysis by date where total_deaths is greater than 1000000

date = df4['Date'].unique()
len (date)

774

df6 = df4.groupby('Date')[['total_cases', 'total_deaths']].sum().reset_index()

df6

Date total_cases total_deaths

0 2020-09-16 2.993315e+07 1004634.0

1 2020-09-17 3.024872e+07 1010285.0

2 2020-09-18 3.057534e+07 1016140.0

3 2020-09-19 3.086842e+07 1021413.0

4 2020-09-20 3.112353e+07 1025401.0

... ... ... ...

769 2022-10-25 2.128946e+09 21647461.0

770 2022-10-26 2.130554e+09 21656263.0

771 2022-10-27 2.132099e+09 21665448.0

772 2022-10-28 2.133303e+09 21670280.0

773 2022-10-29 2.133942e+09 21672141.0

774 rows × 3 columns

#graph plotting by Date
C = df6
plt.scatter(np.arange(0, len(C)), C['total_cases'], color="blue", label="total_cases")
plt.scatter(np.arange(0, len(C)), C['total_deaths'], color="red", label="total_deaths")
plt.title("World")
plt.xlabel("Number of days since first suspect")
plt.ylabel("Number of cases")
plt.legend()
plt.show()
