Report - Data Visualization and Exploration
As part of our analysis, we later derive another dataset, Covid-19 Confirmed Daily Cases, from the Covid-19 Confirmed Cases data. The source dataset is an Excel file of three sheets, collected and updated in real time by a team of dozens of students and staff at Oxford University. The dataset, titled OxCGRT (Oxford Covid-19 Government Response Tracker), also provides an API (Application Programming Interface) that allows us to access data on Covid-19 confirmed cases, confirmed deaths, and the stringency index.
This sheet provides the total number of confirmed cases for each country at a specific time. The data is cumulative: the value recorded on each day is the total of all cases up to that day. This data will be referenced in the creation of another dataset used in our analysis, Covid-19 Confirmed Daily Cases.
As part of our analysis, we look at the total number of confirmed deaths for each country at a specific time. The data is cumulative: each day's value is the running total of recorded deaths.
The stringency index measures a government's reaction to the total confirmed cases and confirmed deaths. It is calculated as an index from 0 to 100, indicating the level to which the government applies and sets measures in place to handle the spread of Covid-19.
PROJECT PREREQUISITE
(A) Importing Libraries
The libraries used for the successful execution of this project include:
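The import list itself does not survive in this extract. Based on the operations described later in the report (pd.read_excel, .isna(), merging, Pearson correlation, and the seaborn/matplotlib plotting), a plausible reconstruction is the following; the exact set used in the original notebook is an assumption:

```python
# Assumed imports, reconstructed from the operations described in this report.
import pandas as pd              # reading Excel sheets, merging, fillna
import numpy as np               # numeric helpers
import matplotlib.pyplot as plt  # plotting, log-scale axes
import seaborn as sns            # colour palettes and scatter plots
from scipy import stats          # Pearson correlation and p-value
```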
Q2.
SOLUTION 1: Identifying Missing Values in the Data of Each of the Three Measures
After successfully extracting the 3 sheets from the OxCGRT_summary.xlsx excel file using
pd.read_excel() and setting the parameter sheet_name to the index of each of the sheets
(confirmedcases, confirmeddeaths, stringencyindex) respectively, we conduct exploratory data
analysis (EDA) on the 3 datasets to find the total missing values on each sheet.
Upon further analysis, we can detect the exact location of missing values in our data by
combining the .isna() pandas command and the .any() function to find any location in the
confirmedcases, confirmeddeaths, and stringencyindex where we have missing values. From
this, we find the following:
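As a minimal sketch of this detection step (with a toy frame standing in for the real confirmedcases sheet, since OxCGRT_summary.xlsx is not bundled here):

```python
import pandas as pd
import numpy as np

# In the report the real data is loaded per sheet, e.g.:
# confirmedcases = pd.read_excel("OxCGRT_summary.xlsx", sheet_name=0)

# Toy stand-in: countries as rows, dates as columns, with some gaps.
confirmedcases = pd.DataFrame(
    {"01Jan2020": [1.0, np.nan], "02Jan2020": [2.0, np.nan]},
    index=["Aruba", "Turkmenistan"],
)

# Total number of missing values on the sheet.
total_missing = confirmedcases.isna().sum().sum()

# Rows (countries) that contain at least one missing value.
rows_with_gaps = confirmedcases.index[confirmedcases.isna().any(axis=1)]
print(total_missing)          # 2
print(list(rows_with_gaps))   # ['Turkmenistan']
```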
stringencyindex:
(1) Data missing for the entire date range
(2) Data missing after a certain date till the last date on 10th of May 2020
(3) Data missing between two dates within the range
From the above categories, the countries Comoros and Grenada fall into category one and account for 220 missing values in the stringency index. Monaco falls into category two, with data missing from the 8th of May 2020 till the 10th of May 2020. Finally, Mali has data missing between the 5th of May 2020 and the 7th of May 2020 and falls under the third category.
SOLUTION 2: Choose an Appropriate Strategy to Handle them in Each One of the Three
Measures. Justify your Choice and Write the Code Needed to Implement it.
Considering our findings in (1A), and treating confirmedcases, confirmeddeaths, and stringencyindex as separate entities we want to perform analysis on, we come up with ways to handle the missing values in each dataset and provide reasonable justification for each approach.
Given that the nature of the missing values in confirmedcases and confirmeddeaths is similar, we define a single approach that covers both cases.
JUSTIFICATION -> An entire row of missing values from the 22nd of January 2020 till the 10th of May 2020 for Turkmenistan, in both confirmedcases and confirmeddeaths, is as good as saying "NO DATA". Filling with the mean, median, or mode would be a poor technique, as the country would not be represented faithfully; neither would a backfill or forward fill be useful, as no data exists to propagate. Inserting any value to replace the missing values would introduce bias and "lies" about the nature of events that occurred in Turkmenistan. It is also worth noting that Turkmenistan has no recorded data at all for confirmed deaths. Turkmenistan does, however, have a stringency index, which peaks at 50.930 on the 25th of March, indicating that filling the Turkmenistan row of missing values with ZEROs would also be biased, as it would not accord with its stringency index. A further preprocessing step could be to remove Turkmenistan from the stringency index data as well.
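This "no data" handling can be sketched as follows, assuming dates are columns and countries are the index (toy frame with hypothetical values):

```python
import pandas as pd
import numpy as np

confirmedcases = pd.DataFrame(
    {"22Jan2020": [0.0, np.nan], "10May2020": [1200.0, np.nan]},
    index=["Afghanistan", "Turkmenistan"],
)

# Rows missing for the entire date range carry no information, so we
# drop them rather than invent values (mean/ffill would introduce bias).
clean_confirmedcases = confirmedcases.dropna(how="all")
print(list(clean_confirmedcases.index))  # ['Afghanistan']
```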
JUSTIFICATION -> The decision to fill the missing values for the countries Comoros and Grenada rests on the following findings:
The findings above lead us to believe that the missing stringency index values for Comoros and Grenada result from the governments not establishing any laws or measures to curb the spread of Covid-19 in these countries. There was little need to, as the number of recorded deaths was never alarming: the maximum combined deaths between the two countries was one person.
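The fill code for this case is not shown in the extract; under the justification above (no measures in place), filling the stringency gap with zero is the natural reading, but the zero fill value is our assumption. A toy sketch:

```python
import pandas as pd
import numpy as np

# Toy stringency frame: Comoros and Grenada entirely missing, Mali not.
stringencyindex = pd.DataFrame(
    {"22Jan2020": [np.nan, np.nan, 11.11], "23Jan2020": [np.nan, np.nan, 11.11]},
    index=["Comoros", "Grenada", "Mali"],
)

# Fill only the two countries whose missing index reflects an absence of
# measures; 0.0 is an assumed fill value, not stated in the report.
stringencyindex.loc[["Comoros", "Grenada"]] = (
    stringencyindex.loc[["Comoros", "Grenada"]].fillna(0.0)
)
print(stringencyindex.loc["Comoros", "22Jan2020"])  # 0.0
```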
JUSTIFICATION -> The missing values for Monaco run from the 8th of May 2020 till the 10th of May 2020. In the 4 days prior, from the 4th of May 2020 till the 7th of May 2020, the stringency index was maintained at 75 out of 100. We can therefore assume the last 3 days, from the 8th till the 10th of May 2020, held the same position, with a stringency index of 75. This can be achieved with a forward fill, or with fillna and a fixed fill value; either is more reasonable than filling with a measure of centre (mean, median, mode) or dropping the rows or columns with missing values. For our report, we settled on FILLNA with the value 75.0. This approach implies that the government maintained the same level of laws and measures over this short timeframe.
JUSTIFICATION -> Our choice of filling the missing values with 69.440 (or a forward fill, which achieves the same result) is easy to understand once we note that the stringency value immediately before the 3 consecutive missing values is 69.440, and is maintained at 69.440 immediately after them. It is safe to assume that the values in between follow the same pattern. Unlike confirmedcases and confirmeddeaths, the stringency index is not cumulative, so we can consider the values on both sides of the gap: from the 1st of May 2020 till the 4th of May 2020 the stringencyindex for Mali is 69.440; the missing values occur between the 5th and the 7th of May 2020; and on the 8th and 9th of May 2020 the index is again 69.440. Therefore, a valid approach, which we settled on, is to set the missing values for Mali with FILLNA and the value 69.440, again implying that the government maintained the same level of laws and measures over this short timeframe.
NOTE: Forward fill here describes filling values from the previous date into the missing values. In this dataset that is a column-wise movement rather than a row-wise one, because the dates appear as columns, not rows. We also created a function called "set_value_missing_data" that filters the data for Monaco and Mali respectively, fills the missing values, and returns the dataframe with the missing values handled.
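The set_value_missing_data helper is described but not shown in this extract; a sketch consistent with that description (the signature and the toy frame are our assumptions) could be:

```python
import pandas as pd
import numpy as np

def set_value_missing_data(df, country, fill_value):
    """Fill the missing values in one country's row with a constant.

    Equivalent here to a column-wise forward fill, since the values
    on either side of the gap are the same constant.
    """
    df = df.copy()
    df.loc[country] = df.loc[country].fillna(fill_value)
    return df

# Toy frame: Monaco missing at the end, Mali missing in the middle.
stringencyindex = pd.DataFrame(
    {"04May2020": [75.0, 69.44], "05May2020": [75.0, np.nan],
     "08May2020": [np.nan, 69.44], "10May2020": [np.nan, 69.44]},
    index=["Monaco", "Mali"],
)

stringencyindex = set_value_missing_data(stringencyindex, "Monaco", 75.0)
stringencyindex = set_value_missing_data(stringencyindex, "Mali", 69.44)
print(stringencyindex.isna().sum().sum())  # 0
```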
Q3.
Understanding Linear and Logarithmic Scales
Linear Scale:
A linear scale on the x-axis and y-axis shows the unit change of the values in our data, and is used where priority is placed on seeing the exact change in value. For example, moving a data point from 30 units to 45 units and moving one from 10 units to 25 units both show a change of 15 units, and the line segments generated will look the same. A major disadvantage of the linear scale is its limited range.
Logarithmic Scale:
The logarithmic scale is primarily employed when the data to be visualized spans a wide range of distinct values. Unlike the linear scale, it handles this by focusing on relative (percentage) changes between data points. For example, a movement from 30 units to 45 units and one from 10 units to 25 units are a 50% increase and a 150% increase respectively, and are plotted according to those percentage changes rather than the common 15-unit difference. The logarithmic scale also solves the loss of insight for smaller values within a large data range.
In Figure 1, we see the logarithmic scale applied to only the x-axis, while the y-axis remains linear. Leaving the y-axis linear doesn't affect our analysis, as the stringency index is bounded between 0 and 100 and is easily represented on the graph. However, the x-axis, which shows the reported number of Covid-19 cases, spans a wide range of values: from 0 reported cases to over a million in the United States, with similarly large changes in other countries. Choosing a scale that properly shows these changes, from 0 cases to the hundreds of thousands and beyond a million without losing insight, is crucial. The logarithmic scale proves more beneficial in capturing these changes, as the range of the data is too large to be properly represented by a linear scale while maintaining the patterns in each country's curve.
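The scaling choice described above amounts to a single axis call in matplotlib; a minimal sketch with toy values (using the non-interactive Agg backend):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Toy data: confirmed cases spanning several orders of magnitude on x,
# stringency index bounded at 0-100 on y.
cases = [10, 1_000, 100_000, 1_200_000]
stringency = [20, 55, 80, 72]

fig, ax = plt.subplots()
ax.plot(cases, stringency, marker="o")
ax.set_xscale("log")  # wide-ranging cases -> logarithmic x-axis
ax.set_xlabel("Reported Covid-19 cases (log scale)")
ax.set_ylabel("Stringency index")
print(ax.get_xscale())  # log
```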
Conclusion:
While the linear scale remains important for data visualization, it has limitations, such as failing to accurately represent changes in small values when the value range is large, and not showing the relative (percentage) change between values. These limitations are addressed by the logarithmic scale.
Given that the number of reported cases being plotted on the x-axis spans a large value range
from 0 to over 1,000,000 cases, the decision to use a logarithmic scale over a linear scale for
analysis is clear and provides a better solution for interpreting the insights the analysis was
designed to discover.
Q4.
In our initial calculation of the daily confirmed cases, we highlight a possible problem for our analysis.
This problem was found after performing exploratory data analysis on the daily confirmed cases. Looking at the minimum values in the descriptive statistics of each column, we see evidence of negative values in our data. Given the nature of our analysis and the source of the data, we make two possible assumptions about the reason for these negative cases. The conclusions drawn from these assumptions define the possible steps that can be taken towards fixing the issue.
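With dates as columns, the daily series and the negative-value check described above can be sketched as (toy frame):

```python
import pandas as pd

confirmedcases = pd.DataFrame(
    {"01May2020": [100, 50], "02May2020": [120, 40], "03May2020": [130, 60]},
    index=["CountryA", "CountryB"],
)

# Daily confirmed cases: difference between consecutive date columns.
daily_cases = confirmedcases.diff(axis=1)

# Countries whose daily series dips below zero.
negative_countries = daily_cases.index[(daily_cases < 0).any(axis=1)]
print(list(negative_countries))  # ['CountryB']
```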
1. DATA ENTRY: The assumption that the decrease in values across the confirmed
cases, leading to negative daily cases, is caused by errors when recording
the values. While this assumption can be considered valid, the fact that some
of the subsequent values after the initial drop in value (down-peak) rise
progressively before reaching or exceeding the value that preceded the
down-peak says otherwise.
2. POSSIBLE MISDIAGNOSIS: The assumption that the earlier, higher cumulative
values were over-reported (for example, misdiagnosed cases later removed from
the count) and that the lower down-peak values are the correct figures.
The possible misdiagnosis assumption treats the down-peak as the correct value and all prior values greater than the down-peak as errors to be corrected, while the data entry assumption treats the down-peaks themselves as the errors and replaces each down-peak with the immediately preceding value.
For our data, and based on the context of our analysis, we treat all negative values in the daily confirmed cases as instances of POSSIBLE MISDIAGNOSIS and handle them with the technique specified above.
The countries where the issue of negative values for daily confirmed cases were found include:
- Afghanistan
- China
- Ecuador
- Spain
- France
- Guyana
- Israel
- Liberia
- New Zealand
- Puerto Rico
- Portugal
- Seychelles
- Uganda
- Uruguay
- Zimbabwe
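The "possible misdiagnosis" treatment described above (lower every earlier value to the later down-peak, making the cumulative series non-decreasing) can be implemented with a reverse cumulative minimum along the date columns. The exact code used in the report is not shown, so this is our reconstruction on a toy frame:

```python
import pandas as pd

confirmedcases = pd.DataFrame(
    {"01May2020": [100], "02May2020": [80], "03May2020": [90]},
    index=["CountryX"],
)

# Treat the down-peak (80) as correct: every prior value above it is
# lowered to it, so the cumulative series becomes non-decreasing and
# the derived daily cases can no longer be negative.
corrected = confirmedcases.iloc[:, ::-1].cummin(axis=1).iloc[:, ::-1]
print(corrected.loc["CountryX"].tolist())  # [80, 80, 90]
```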
Filtering Data:
Our analysis of stringency index and confirmed cases as of May 4, 2020, involves filtering both
individual dataframes (clean_confirmedcases and clean_stringencyindex) to include
only the country name and their respective values on May 4, 2020.
Merge:
For merging the two dataframes after filtering, we use a LEFT MERGE. This decision is made
because we are analyzing confirmed cases and the stringency index attached to them. It's crucial
to get the stringency index in relation to the confirmed cases. Therefore, a left merge captures all
possible countries' confirmed cases, and then we find the associated stringency index for them.
After successfully merging and checking for missing values, we find no missing values in our
data.
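The filtering and left merge described above can be sketched as follows; the toy frames, values, and renamed column labels are our assumptions, standing in for the real clean_confirmedcases and clean_stringencyindex:

```python
import pandas as pd

clean_confirmedcases = pd.DataFrame(
    {"CountryName": ["US", "Mali"], "04May2020": [1_150_000, 612]}
)
clean_stringencyindex = pd.DataFrame(
    {"CountryName": ["US", "Mali"], "04May2020": [72.69, 69.44]}
)

# Keep only the country name and the 4 May 2020 value from each frame.
cases_may4 = clean_confirmedcases[["CountryName", "04May2020"]].rename(
    columns={"04May2020": "confirmed_cases"}
)
stringency_may4 = clean_stringencyindex[["CountryName", "04May2020"]].rename(
    columns={"04May2020": "stringency_index"}
)

# LEFT merge: every country with confirmed cases is kept, and its
# stringency index is attached where available.
merged = cases_may4.merge(stringency_may4, on="CountryName", how="left")
print(merged.isna().sum().sum())  # 0
```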
Further Analysis:
The correlation coefficient between the total number of confirmed cases and the government stringency index is -0.07, indicating essentially no linear correlation between the two variables. The result is not statistically significant: with a p-value of 0.54, well above the 0.05 significance level, we cannot reject the null hypothesis. We therefore lack sufficient evidence to conclude that the observed correlation is anything other than random chance.
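The test reported above (r ≈ -0.07, p ≈ 0.54) is a Pearson correlation, which can be computed with scipy.stats.pearsonr; a toy sketch with hypothetical vectors:

```python
from scipy import stats

# Toy vectors standing in for confirmed cases and stringency index.
confirmed = [10, 200, 3_000, 40_000, 500_000]
stringency = [30.0, 80.0, 55.0, 60.0, 70.0]

# pearsonr returns the correlation coefficient and its two-sided p-value.
r, p_value = stats.pearsonr(confirmed, stringency)

# Decision rule used in the report: reject H0 only if p < 0.05.
significant = p_value < 0.05
print(round(r, 2), significant)
```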
From our graph, we can focus the discussion on countries with high and extremely high death rates. The countries that fall into this category are:
The countries categorized by high death rates are coloured purple in our graph, while the United States of America, the only country falling into the extremely high category with over 72,365 confirmed deaths, is coloured brown.
For our analysis, we introduce some new variables to help draw insights from our graph. These
new variables are:
1. Death Rate
2. Confirmed Cases to Confirmed Deaths Ratio
DEATH RATE:
The death rate tells us the degree to which we classify the amounts of confirmed deaths
registered between the following categories:
• Low
• Moderately Low
• Average
• Moderately High
• High
• Extremely High
It was through this index that we were able to filter our data to annotate points where death rates
are either high or extremely high. Here, USA, Spain, Italy, UK, and France were highlighted and
coloured black.
CONFIRMED CASES TO CONFIRMED DEATHS RATIO:
This is calculated by dividing the total number of confirmed cases by the total number of confirmed deaths. The returned value tells us how many cases we expect to see in a specific country before we record one death. A low value indicates high deaths in relation to confirmed cases, and a high value indicates few registered deaths in relation to confirmed cases. This measure serves as a better way to evaluate the true level of deaths across all countries involved. For our analysis, we focus on the 5 countries with the lowest Confirmed Cases to Confirmed Deaths Ratio: Belgium, France, UK, Italy, and Peru record the lowest values, in that order. These countries are highlighted in grey in our graph; however, France, UK, and Italy had already been highlighted among the countries with the highest death rates, so they keep their initial annotation colour of black, and only Peru and Belgium are added to the annotations in grey.
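The ratio and the bottom-five selection can be sketched as follows (toy numbers for Belgium and the UK, not the real May 4 figures; the US deaths figure is taken from this report):

```python
import pandas as pd

df = pd.DataFrame(
    {"confirmed_cases": [50_000, 170_000, 1_150_000],
     "confirmed_deaths": [8_000, 25_000, 72_365]},
    index=["Belgium", "UK", "US"],
)

# Cases seen per recorded death: low values mean deaths are high
# relative to confirmed cases.
df["cases_to_deaths"] = df["confirmed_cases"] / df["confirmed_deaths"]

# Countries with the lowest ratio, i.e. hardest hit by this measure.
hardest_hit = df["cases_to_deaths"].nsmallest(2)
print(list(hardest_hit.index))  # ['Belgium', 'UK']
```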
In conclusion, we learn from our analysis of the data and the graph that the countries with the highest death rates are those with the most confirmed cases. This led us to seek a better measure of the true impact of Covid-19 on the respective countries by considering the ratio of confirmed cases to confirmed deaths. From this, we learn that Belgium was the country most affected by Covid-19 as of the 4th of May, recording the lowest confirmed cases to confirmed deaths ratio. This gives a better view of the impact than raw death counts alone, by which the USA ranks first.