Report - Data Visualization and Exploration

Data Visualization COURSEWORK 1

Brief Project Introduction


This project examines the impact of COVID-19 around the world. The analysis makes use of three datasets:

1. COVID-19 Confirmed Cases
2. COVID-19 Confirmed Deaths
3. Government Stringency Index

As part of our analysis, we later derive a fourth dataset, COVID-19 Confirmed Daily Cases, from the COVID-19 Confirmed Cases data. The three source datasets arrive as a single Excel file of three sheets, collected and updated in real time by a team of dozens of students and staff at Oxford University. The dataset, titled OxCGRT (Oxford COVID-19 Government Response Tracker), provides an API (Application Programming Interface) that gives us access to data on COVID-19 confirmed cases, confirmed deaths, and the stringency index.

Covid-19 Confirmed Cases

This sheet gives the total number of confirmed cases for each country at a specific time. The data is cumulative: the value recorded on each day is the running total of all cases up to and including that day. This sheet is later used to derive another dataset used in our analysis, COVID-19 Confirmed Daily Cases.

Covid-19 Confirmed Deaths

As part of our analysis, we look at the total number of confirmed deaths for each country at a specific time. As with confirmed cases, the data is cumulative: each day's value is the running total of deaths up to that day.

Covid-19 Government Stringency Index

The stringency index measures governments' responses to confirmed cases and confirmed deaths. It is calculated as an index from 0 to 100, indicating how strictly a government has applied measures to contain the spread of COVID-19.
PROJECT PREREQUISITE
(A) Importing Libraries

The libraries used for the successful execution of this project include:

• NumPy - NumPy is a fundamental package for scientific computing with Python. It provides support for arrays, matrices, and mathematical functions that operate on these data structures efficiently.
• Pandas - Pandas is a powerful data manipulation and analysis library. It provides data
structures like DataFrame and Series, along with tools for reading and writing data
between in-memory data structures and various file formats.
• Seaborn - Seaborn is a statistical data visualization library based on Matplotlib. It
provides a high-level interface for drawing attractive and informative statistical graphics.
• SciPy - SciPy is a library used for scientific and technical computing. It builds on NumPy
and provides additional modules for optimization, integration, interpolation, linear algebra,
statistics, and more.
• Matplotlib - Matplotlib is a comprehensive plotting library in Python. It provides a
MATLAB-like interface for creating static, interactive, and animated visualizations.
• Joblib - Joblib is a library for lightweight pipelining in Python. It provides utilities for
saving and loading Python objects (e.g., trained machine learning models) to disk.
• Requests - Requests is an HTTP library for making HTTP requests in Python. It provides
a simple and elegant API for sending HTTP requests and handling responses.
• Ydata_Profiling - Ydata_Profiling is a library used for exploratory data analysis (EDA). It
automatically generates a comprehensive report with statistics, visualizations, and
insights about the dataset.
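Assuming the libraries above are installed, the project's import block might look like the following sketch (ydata_profiling is commented out here, as it is only needed for the automated EDA reports):

```python
# Illustrative import block for the libraries listed above.
import numpy as np               # arrays and numerical routines
import pandas as pd              # DataFrame/Series manipulation
import scipy.stats as stats      # statistics (e.g. correlation tests)
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # statistical graphics on top of Matplotlib
import joblib                    # saving/loading Python objects to disk
import requests                  # HTTP requests to the OxCGRT API
# from ydata_profiling import ProfileReport  # automated EDA reports
```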

(B) Defining Functions


For this project, the following functions were defined to aid the programming workflow. The project uses functional programming throughout as its means of answering the proposed questions.

• eda: Performs thorough exploratory data analysis on a dataset.
• extracting_data: Takes the data extracted from the API as JSON, converts it into the required tables, and returns them for analysis.
• get_data_from_api: This function is used to extract data from the api.
• date_to_string_date_format: Used to convert the dates into the desired format e.g.
20/05/2022 to 20May2022.
• set_value_missing_data: A simple function allowing us to fill missing values in the
required positions for the countries.
• count_daily_cases: Converts the cumulative confirmed cases data into a daily confirmed cases dataset. A particularly interesting function, as it tackles a serious issue found during EDA after the initial conversion of the data to daily cases.
• drop_columns_in_range: Drops columns within a specified range.
• categorize_deaths: This function bins the total confirmed deaths and is used to aid
visualization of the impact of deaths and confirmed cases of the covid-19 around the
world.
• find_negative_daily_cases: Extracts the countries whose daily confirmed cases are negative on specific days. It is a helper used to draw more insight before choosing the method to apply with the count_daily_cases function.
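As one concrete illustration, the date_to_string_date_format helper described above could be sketched as follows (the notebook's exact implementation may differ):

```python
from datetime import datetime

def date_to_string_date_format(date_str: str) -> str:
    """Convert a date like '20/05/2022' into the '20May2022' format."""
    return datetime.strptime(date_str, "%d/%m/%Y").strftime("%d%b%Y")

example = date_to_string_date_format("20/05/2022")
```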
Q1.
In the code section, we work from a pickled Python object (a dictionary) of the data retrieved from the API. This avoids sending repeated requests to the server on every rerun. All steps are provided in the Jupyter notebook file.
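A minimal sketch of this caching pattern, assuming a hypothetical cache path and that the API returns JSON (the real notebook's function may differ):

```python
import os
import pickle
import requests

def get_data_from_api(url, cache_path="oxcgrt_cache.pkl"):
    """Return the API response, reading from a local pickle when available.

    Hypothetical sketch: hit the network once, then reuse the pickled
    dictionary on every subsequent rerun.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = requests.get(url, timeout=30).json()
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```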

Q2.
SOLUTION 1: Identifying Missing Values in the Data of Each of the Three Measures
After successfully extracting the 3 sheets from the OxCGRT_summary.xlsx excel file using
pd.read_excel() and setting the parameter sheet_name to the index of each of the sheets
(confirmedcases, confirmeddeaths, stringencyindex) respectively, we conduct exploratory data
analysis (EDA) on the 3 datasets to find the total missing values on each sheet.

confirmedcases has 110 missing values.

confirmeddeaths has 110 missing values.

stringencyindex has 226 missing values.

Upon further analysis, we can detect the exact location of missing values in our data by
combining the .isna() pandas command and the .any() function to find any location in the
confirmedcases, confirmeddeaths, and stringencyindex where we have missing values. From
this, we find the following:
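On a toy frame shaped like one OxCGRT sheet (countries as rows, dates as columns; illustrative values only), the missing-value check looks like:

```python
import numpy as np
import pandas as pd

# Toy stand-in for one sheet: countries as rows, dates as columns.
sheet = pd.DataFrame(
    {"22Jan2020": [5.0, np.nan, 0.0], "23Jan2020": [7.0, np.nan, 1.0]},
    index=["Aruba", "Turkmenistan", "Zimbabwe"],
)

total_missing = int(sheet.isna().sum().sum())                 # overall count
countries_with_gaps = sheet.index[sheet.isna().any(axis=1)]   # which countries
```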

confirmedcases and confirmeddeaths:

The country "Turkmenistan" has no data recorded at all: every value from the 22nd of January 2020 to the 10th of May 2020 is missing. This accounts for the 110 missing values in each of the confirmedcases and confirmeddeaths datasets.

stringencyindex:

Here, the 226 missing values fall into 3 categories:

(1) Data missing across all dates for a specific country

(2) Data missing from a certain date until the last date, the 10th of May 2020

(3) Data missing between dates within the 22nd of January 2020 to 10th of May 2020 range

Of these, the countries Comoros and Grenada fall into category one and account for 220 of the missing values in the stringency index. Monaco falls into category two, with data missing from the 8th of May 2020 to the 10th of May 2020. Finally, Mali has data missing from the 5th of May 2020 to the 7th of May 2020 and falls under the third category.
SOLUTION 2: Choose an Appropriate Strategy to Handle them in Each One of the Three
Measures. Justify your Choice and Write the Code Needed to Implement it.
Considering our findings in Solution 1, and treating confirmedcases, confirmeddeaths, and stringencyindex as separate entities to be analysed, we devise a way to handle the missing values in each dataset and justify each approach.

Because the missing values in confirmedcases and confirmeddeaths share the same pattern, we define a single approach covering both.

• DATASET -> Confirmedcases and Confirmeddeaths
• APPROACH -> Drop Row

JUSTIFICATION -> An entire run of missing values from the 22nd of January 2020 to the 10th of May 2020 for Turkmenistan, in both confirmedcases and confirmeddeaths, amounts to saying "NO DATA". Filling with the mean, median, or mode would be a poor technique, as the dataset would not be well represented; neither a backfill nor a forward fill is useful when no data exists at all. Inserting any value in place of the missing values would introduce bias and fabricate the course of events in Turkmenistan. It is also worth noting that Turkmenistan contributes no usable data to the confirmed deaths sheet. Turkmenistan does have a stringency index, peaking at 50.930 on the 25th of March, so filling its rows of missing values with zeros would be inconsistent with its recorded stringency index. A further preprocessing step could be to remove Turkmenistan from the stringency index data as well.
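A sketch of the drop, assuming the sheet is indexed by country (toy values):

```python
import numpy as np
import pandas as pd

cases = pd.DataFrame(
    {"09May2020": [100.0, np.nan], "10May2020": [120.0, np.nan]},
    index=["Aruba", "Turkmenistan"],
)

# Drop any country whose entire row of dates is missing.
clean_cases = cases.dropna(how="all")
```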

• DATASET -> StringencyIndex
• APPROACH -> Fillna with 0 (Comoros and Grenada)

JUSTIFICATION -> The decision to fill the missing values for Comoros and Grenada rests on the following findings:

1) Comoros and Grenada appear in the confirmed cases data, the confirmed deaths data, and the stringency index data, so they appear to hold some level of insight for analysis.

2) When we look at the confirmed cases and confirmed deaths for these countries to understand why their stringency index values are missing, we observe something very interesting. While both countries had some COVID-19 cases, Comoros recorded a maximum of one confirmed death, and Grenada recorded none.

These findings lead us to believe that the missing stringency index values for Comoros and Grenada reflect governments that established no laws or measures to curb the spread of COVID-19. There was little pressure to do so: the combined recorded deaths across the two countries never exceeded one person.
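This fill can be sketched as follows (toy values; only the two fully-missing countries are touched):

```python
import numpy as np
import pandas as pd

stringency = pd.DataFrame(
    {"09May2020": [np.nan, np.nan, 75.0], "10May2020": [np.nan, np.nan, 75.0]},
    index=["Comoros", "Grenada", "Monaco"],
)

# Fill the two fully-missing countries with 0; leave the others untouched.
for country in ["Comoros", "Grenada"]:
    stringency.loc[country] = stringency.loc[country].fillna(0.0)
```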

• DATASET -> StringencyIndex
• APPROACH -> Forward fill, or fillna with 75.0 (Monaco)

JUSTIFICATION -> The missing values in the row for Monaco run from the 8th of May 2020 to the 10th of May 2020. In the 4 days before that (the 4th to the 7th of May 2020), the stringency index was held constant at 75 out of 100. We can therefore assume the last 3 days, from the 8th to the 10th of May 2020, held the same position, with a stringency index of 75. This can be achieved with a forward fill, or with fillna and an explicit fill value. It is more reasonable than filling with a centre measure (mean, median, or mode) or dropping the rows or columns with missing values. For our report, we settled on fillna with the value 75.0. The approach implies that the government maintained the same laws and measures over this short timeframe.

• DATASET -> StringencyIndex
• APPROACH -> Forward fill, or fillna with 69.440 (Mali)

JUSTIFICATION -> The choice to fill with 69.440 (or to forward fill, which achieves the same result) becomes clear when we note that the stringency value is 69.440 both immediately before and immediately after the 3 consecutive missing values. Unlike confirmedcases and confirmeddeaths, the stringency index is not cumulative, so we can consult the values on either side of the gap. From the 1st of May 2020 to the 4th of May 2020, Mali's stringency index is 69.440; the missing values occur from the 5th to the 7th of May 2020; and on the 8th and 9th of May 2020 the index is again 69.440. It is therefore safe to assume the values in between follow the same pattern, so we set the missing values for Mali to 69.440, again using fillna with an explicit value. As with Monaco, the approach implies that the government maintained the same laws and measures over this short timeframe.

NOTE: Forward fill, as used here, describes filling missing values from the previous date. In this dataset that is a column-wise movement, not a row-wise one, because the dates are columns rather than rows. We also created a function called "set_value_missing_data" that filters the data for Monaco and Mali respectively, fills the missing values, and returns the dataframe with the missing values handled.
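A hypothetical sketch of set_value_missing_data applied to both countries (toy values; the notebook's version may differ in detail):

```python
import numpy as np
import pandas as pd

def set_value_missing_data(df, country, value):
    """Hypothetical sketch: fill one country's missing stringency values."""
    df.loc[country] = df.loc[country].fillna(value)
    return df

stringency = pd.DataFrame(
    {"07May2020": [75.0, np.nan], "08May2020": [np.nan, 69.440]},
    index=["Monaco", "Mali"],
)
stringency = set_value_missing_data(stringency, "Monaco", 75.0)
stringency = set_value_missing_data(stringency, "Mali", 69.440)
```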
Q3.
Understanding Linear and Logarithmic Scales

Linear Scale:

A linear scale on the x-axis or y-axis gives equal visual distance to equal changes in value. It is used when priority is placed on reading exact value changes. For example, moving from 30 units to 45 units and moving from 10 units to 25 units are both changes of 15 units, and both produce lines of identical slope. A major disadvantage of the linear scale is its limited effective range: when values span several orders of magnitude, the smaller values are compressed into illegibility.

Logarithmic Scale:

The logarithmic scale is primarily employed when the data to be visualized spans a very wide range of values. Unlike the linear scale, it handles this by emphasising relative (percentage) changes between data points. For example, a movement from 30 units to 45 units is a 50% increase, while a movement from 10 units to 25 units is a 150% increase, and the plotted lines reflect these percentage changes rather than the unit differences. The logarithmic scale also solves the loss of insight for smaller values within a large data range.

Impact on our Dataset Analysis and Figure 1:

In Figure 1, the logarithmic scale is applied only to the x-axis, while the y-axis remains linear. Leaving the y-axis linear does not hurt the analysis, as the stringency index ranges only from 0 to 100 and is easily represented as-is. The x-axis, however, shows the reported number of COVID-19 cases, which ranges from 0 to over a million in the United States, with similarly large values in other countries. Choosing a scale that shows these changes properly, from 0 cases through the hundreds of thousands to a million, without losing insight, is crucial. The logarithmic scale proves more suitable, as the range of the data is too large for a linear scale to represent while preserving the patterns of each country's trajectory.
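The scale choice can be reproduced in a minimal sketch (illustrative values only, not the real dataset):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Toy values spanning several orders of magnitude, as in Figure 1.
cases = np.array([10, 1_000, 100_000, 1_000_000])
stringency = np.array([20, 55, 80, 73])

fig, ax = plt.subplots()
ax.scatter(cases, stringency)
ax.set_xscale("log")   # log scale on reported cases only
ax.set_ylim(0, 100)    # the stringency axis stays linear
ax.set_xlabel("Reported COVID-19 cases (log scale)")
ax.set_ylabel("Stringency index")
```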

Conclusion:

While the linear scale remains important for data visualization, it has limitations: it cannot accurately represent changes in small values when the overall value range is large, and it cannot convey relative (percentage) change. The logarithmic scale addresses both. Given that the number of reported cases plotted on the x-axis spans a range from 0 to over 1,000,000, the decision to use a logarithmic scale over a linear one is clear and provides a better basis for interpreting the insights the analysis was designed to uncover.
Q4.
In our initial calculation of the daily confirmed cases, we discover a possible problem for the analysis:

- NEGATIVE VALUES appear among the daily cases. This makes no sense, as we cannot have a negative number of new cases in a day. The problem is most likely due to data entry.

The problem was found while performing exploratory data analysis on the daily confirmed cases: the minimum values in the descriptive statistics of each column show evidence of these negative values. Given the nature of our analysis and the source of the data, we make 2 possible assumptions about the reason for the negative cases. The conclusions drawn from these assumptions define the possible fixes.

1. DATA ENTRY: The assumption that the decreases in cumulative confirmed cases which produce negative daily cases are errors made when recording the values. While this assumption can be considered valid, the fact that values after the initial drop (the down-peak) often rise progressively back to, and past, the value before the drop argues against it.

APPROACH: The function "count_daily_cases" handles this possibility via its "method" parameter set to "ffill" (forward fill): every down-peak observed while progressively calculating the daily confirmed cases is replaced with the cumulative total that precedes it, before the daily figures are computed.

2. POSSIBLE MISDIAGNOSIS: Another assumption to consider is that the down-peaks in the confirmed cases data reflect people initially diagnosed with COVID-19, based on symptoms, who over subsequent days were found not to have it.

APPROACH: We set the "method" parameter of "count_daily_cases" to "bfill" (backward fill). This technique locates each down-peak while moving through the cumulative cases and replaces all earlier values greater than the down-peak with the down-peak's value, before computing the daily cases.

The misdiagnosis assumption treats the down-peak as the correct value and all greater prior values as errors to correct, whereas the data-entry assumption treats the down-peaks themselves as the errors and replaces each with the value immediately preceding it.

For our data, and based on the context of our analysis, we treat all negative values in the daily confirmed cases as instances of POSSIBLE MISDIAGNOSIS and fix them with the technique described above.
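The two repair strategies can be sketched with cumulative extrema. This is an illustrative reimplementation, not necessarily the notebook's exact code:

```python
import pandas as pd

def count_daily_cases(cumulative, method="bfill"):
    """Sketch of the two down-peak repair strategies described above.

    "ffill": treat each drop as an entry error and carry the previous,
    higher total forward (cumulative maximum).
    "bfill": treat the drop as the true value and lower the earlier,
    inflated totals down to it (reverse cumulative minimum).
    """
    if method == "ffill":
        repaired = cumulative.cummax()
    else:
        repaired = cumulative[::-1].cummin()[::-1]
    # Day-over-day differences; the first day is the first total itself.
    return repaired.diff().fillna(repaired.iloc[0])

totals = pd.Series([10, 25, 20, 30])  # the drop from 25 to 20 is the anomaly
daily = count_daily_cases(totals, method="bfill")
```

Note that a naive totals.diff() would report -5 on the third day; under either repair method, every daily count comes out non-negative.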
The countries where the issue of negative values for daily confirmed cases were found include:

- Afghanistan
- China
- Ecuador
- Spain
- France
- Guyana
- Israel
- Liberia
- New Zealand
- Puerto Rico
- Portugal
- Seychelles
- Uganda
- Uruguay
- Zimbabwe

The requested graph for QUESTION 4 is seen below:


Q5.
The requested graph for QUESTION 5 is seen below:
Q6.
Analysis of Stringency Index and Confirmed Cases (As of May 4, 2020)

Filtering Data:

Our analysis of stringency index and confirmed cases as of May 4, 2020, involves filtering both
individual dataframes (clean_confirmedcases and clean_stringencyindex) to include
only the country name and their respective values on May 4, 2020.

Merge:

For merging the two dataframes after filtering, we use a LEFT MERGE. This decision is made
because we are analyzing confirmed cases and the stringency index attached to them. It's crucial
to get the stringency index in relation to the confirmed cases. Therefore, a left merge captures all
possible countries' confirmed cases, and then we find the associated stringency index for them.
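The merge can be sketched as follows, using two of the values quoted later in this question (column names here are illustrative):

```python
import pandas as pd

cases = pd.DataFrame(
    {"country": ["USA", "Spain"], "confirmed": [1_185_709, 218_011]}
)
stringency = pd.DataFrame(
    {"country": ["USA", "Spain"], "stringency": [72.69, 81.94]}
)

# Left merge: keep every country with confirmed cases, attach its index.
merged = cases.merge(stringency, on="country", how="left")
```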

Checking for Missing Values:

After successfully merging and checking for missing values, we find no missing values in our
data.

Further Analysis:

The correlation coefficient between the total number of confirmed cases and the government stringency index is -0.07, indicating essentially no correlation between the two variables. The result is not statistically significant: with a p-value of 0.54, well above the significance level of 0.05, we cannot reject the null hypothesis. We therefore lack sufficient evidence to conclude that the observed correlation is anything other than random chance.
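The test can be sketched with scipy.stats.pearsonr. The data below are synthetic stand-ins drawn independently, so any correlation is pure chance, mirroring the weak result reported above:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic, independent stand-ins for cases and stringency.
rng = np.random.default_rng(0)
cases = rng.uniform(0, 1_000_000, size=100)
stringency = rng.uniform(0, 100, size=100)

r, p = pearsonr(cases, stringency)  # correlation coefficient and p-value
```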

From our graph, we focus the discussion on countries with high and extremely high death rates. The countries in this category are:

• United States of America (Extremely High)
• France (High)
• Spain (High)
• United Kingdom (High)
• Italy (High)

The countries with high death rates are coloured purple in our graph, while the United States of America, the only country in the extremely high category with over 72,365 confirmed deaths, is coloured brown.

Focusing on the countries mentioned above, we record the following:

1. United States of America (USA) - Confirmed Cases (1,185,709) - Stringency Index (72.69) - Confirmed Deaths (72,365)
2. Spain - Confirmed Cases (218,011) - Stringency Index (81.940) - Confirmed Deaths (25,428)
3. Italy - Confirmed Cases (211,938) - Stringency Index (75) - Confirmed Deaths (29,079)
4. United Kingdom (UK) - Confirmed Cases (191,843) - Stringency Index (79.63) - Confirmed Deaths (28,490)
5. France - Confirmed Cases (169,405) - Stringency Index (87.96) - Confirmed Deaths (25,168)

For our analysis, we introduce some new variables to help draw insights from our graph. These
new variables are:

1. Death Rate
2. Confirmed Cases to Confirmed Deaths Ratio

DEATH RATE:

The death rate classifies the number of confirmed deaths recorded into the following categories:

• Low
• Moderate-Low
• Average
• Moderate-High
• High
• Extremely High

For this, we define an index:

• if the number of confirmed deaths < 100 -> Low
• if the number of confirmed deaths < 1,000 -> Moderate-Low
• if the number of confirmed deaths < 5,000 -> Average
• if the number of confirmed deaths < 10,000 -> Moderate-High
• if the number of confirmed deaths < 50,000 -> High
• if the number of confirmed deaths >= 50,000 -> Extremely High
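The categorize_deaths function described earlier could implement this index with pandas binning; a hypothetical sketch:

```python
import pandas as pd

def categorize_deaths(deaths):
    """Hypothetical sketch of the death-rate index described above."""
    bins = [-1, 99, 999, 4_999, 9_999, 49_999, float("inf")]
    labels = ["Low", "Moderate-Low", "Average",
              "Moderate-High", "High", "Extremely High"]
    return pd.cut(deaths, bins=bins, labels=labels)

# USA and Italy death counts from this report, plus two toy values.
deaths = pd.Series([72_365, 29_079, 850, 12])
categories = categorize_deaths(deaths)
```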

It was through this index that we were able to filter our data to annotate points where death rates
are either high or extremely high. Here, USA, Spain, Italy, UK, and France were highlighted and
coloured black.

CONFIRMED CASE TO CONFIRMED DEATH RATIO:

This is calculated by dividing the total number of confirmed cases by the total number of confirmed deaths. The value tells us how many cases we expect to see in a country for each recorded death. A low value indicates many deaths relative to confirmed cases; a high value indicates few deaths relative to confirmed cases. This measure serves as a better way to evaluate the true level of deaths across the countries involved. For our analysis, we focus on the 5 countries with the lowest ratio: Belgium, France, the UK, Italy, and Peru, in ascending order. These countries are highlighted in grey in our graph; however, France, the UK, and Italy were already highlighted among the countries with the highest death rates, so they keep their original black annotation colour, and only Peru and Belgium are added with grey.
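The ratio calculation can be sketched on three of the countries quoted earlier in this question:

```python
import pandas as pd

# Confirmed cases and deaths for three countries, as quoted above.
df = pd.DataFrame({
    "country": ["France", "Italy", "UK"],
    "confirmed": [169_405, 211_938, 191_843],
    "deaths": [25_168, 29_079, 28_490],
})

# Cases recorded per death: lower means the country was hit harder.
df["cases_per_death"] = df["confirmed"] / df["deaths"]
most_affected = df.sort_values("cases_per_death").head(5)
```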

In conclusion, our analysis of the data and the graph shows that the countries with the highest death rates are also the countries with the most confirmed cases. This led us to seek a measure that better reflects the true impact of COVID-19 on each country: the ratio of confirmed cases to confirmed deaths. By this ratio, Belgium was the country most affected by COVID-19 as of the 4th of May, recording the lowest confirmed cases to confirmed deaths ratio. This gives a clearer view of impact than raw death counts alone, by which the USA ranks first.

The requested graph for QUESTION 6 is seen below:


