Introduction R For DS
Yuni Astari
2025-04-15
Introduction
The COVID-19 pandemic has significantly impacted countries worldwide, influencing public health systems,
economies, and daily life. One of the most important strategies in managing the spread of the virus has
been large-scale testing. Accurate and accessible testing data is essential to understanding the scope of the
outbreak and implementing timely interventions.
In this project, I take on the role of a data analyst for a news channel’s data science team. The team is
preparing a feature story on global COVID-19 testing efforts, and I have been assigned to gather and analyze
real-world testing data to support the story with data-driven insights.
#install.packages("httr")
#install.packages("rvest")
library(httr)
library(rvest)
get_wiki_covid19_page()
## Response [https://en.wikipedia.org/w/index.php?title=Template%3ACOVID-19_testing_by_country]
## Date: 2025-04-15 04:15
## Status: 200
## Content-Type: text/html; charset=UTF-8
## Size: 452 kB
## <!DOCTYPE html>
## <html class="client-nojs vector-feature-language-in-header-enabled vector-fea...
## <head>
## <meta charset="UTF-8">
## <title>Template:COVID-19 testing by country - Wikipedia</title>
## <script>(function(){var className="client-js vector-feature-language-in-heade...
## RLSTATE={"ext.globalCssJs.user.styles":"ready","site.styles":"ready","user.st...
## <script>(RLQ=window.RLQ||[]).push(function(){mw.loader.impl(function(){return...
## }];});});</script>
## <link rel="stylesheet" href="/w/load.php?lang=en&modules=ext.cite.styles%...
## ...
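The definition of `get_wiki_covid19_page()` is not shown in this report. A minimal sketch of how such a helper could be written with `httr::GET`, assuming the page is requested from the `index.php` endpoint with the template name passed as the `title` query parameter (which is consistent with the response URL printed above):

```r
# Hypothetical reconstruction; the original function body is not shown in this report.
get_wiki_covid19_page <- function() {
  # Base endpoint for Wikipedia page requests
  wiki_base_url <- "https://en.wikipedia.org/w/index.php"
  # The template name is passed as a query parameter; httr builds and
  # encodes the final query string
  wiki_params <- list(title = "Template:COVID-19_testing_by_country")
  # Send the HTTP GET request and return the response object
  httr::GET(wiki_base_url, query = wiki_params)
}
```

Calling `httr::status_code()` on the returned object should give 200 when the request succeeds, matching the `Status: 200` line in the output above.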
Task 2: Extract COVID-19 testing data table from the wiki HTML page
## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-pa
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...
## {xml_nodeset (4)}
## [1] <table class="box-Update plainlinks ombox ombox-content ambox-Update" rol ...
## [2] <table class="wikitable plainrowheaders sortable collapsible autocollapse ...
## [3] <table class="plainlinks ombox mbox-small ombox-notice" role="presentatio ...
## [4] <table class="wikitable mw-templatedata-doc-params">\n<caption><p class=" ...
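The output above shows the parsed document followed by four `<table>` nodes, of which the second is the `wikitable` holding the testing data. The code for this step is not shown, so the following is only a sketch of the same idea on a tiny inline HTML string (made up for illustration so that it runs offline), assuming the `rvest` package is installed:

```r
library(rvest)

# Toy HTML standing in for the Wikipedia page (illustrative only)
toy_html <- '<html><body>
  <table class="ombox"><tr><td>Notice</td></tr></table>
  <table class="wikitable">
    <tr><th>Country</th><th>Tested</th></tr>
    <tr><td>Angola</td><td>399228</td></tr>
    <tr><td>Andorra</td><td>300307</td></tr>
  </table>
</body></html>'

# Parse the HTML text into a document root node
root_node <- read_html(toy_html)

# Collect every <table> node, then pick the one that holds the data
table_nodes <- html_nodes(root_node, "table")
toy_table <- as.data.frame(html_table(table_nodes[[2]]))

print(toy_table)
```

On the real page, the index passed to `table_nodes[[ ]]` depends on the page layout at the time of the request, so it is worth printing the node set first, as done above.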
summary(data_covid)
return(data_frame)
}
proper_data_covid <- preprocess_covid_data(data_covid)
head(proper_data_covid)
## 4 Andorra 23 Feb 2022 300307 37958 12.6
## 5 Angola 2 Feb 2021 399228 20981 5.3
## 6 Antigua and Barbuda 6 Mar 2021 15268 832 5.4
## tested.population.ratio confirmed.population.ratio
## 1 0.40 0.130
## 2 15.00 3.400
## 3 0.53 0.130
## 4 387.00 49.000
## 5 1.30 0.067
## 6 15.90 0.860
summary(proper_data_covid)
write.csv(proper_data_covid, file = "covid-19(2023).csv", row.names = FALSE)
## [1] "C:/Users/LENOVO/OneDrive/Documents/Coursera/IBM/covid.csv"
file.exists(file_path)
## [1] FALSE
# My saved file with new name
file_path <- file.path(wd, "covid-19(2023).csv")
print(file_path)
## [1] "C:/Users/LENOVO/OneDrive/Documents/Coursera/IBM/covid-19(2023).csv"
file.exists(file_path)
## [1] TRUE
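The first check returned `FALSE` because the path pointed at `covid.csv`, while the file had been written as `covid-19(2023).csv`. A small sketch of the write, check, and read-back cycle, using a temporary directory and a toy data frame built from rows shown earlier in this report (the `tested` column name and the file name `covid-toy.csv` are assumptions for illustration):

```r
# Toy rows copied from the preview earlier in this report
toy_covid <- data.frame(
  country = c("Andorra", "Angola", "Antigua and Barbuda"),
  tested = c(300307, 399228, 15268),
  confirmed = c(37958, 20981, 832),
  stringsAsFactors = FALSE
)

# Build the path with file.path() so separators are handled consistently
csv_path <- file.path(tempdir(), "covid-toy.csv")

# Write without row names, as in the report
write.csv(toy_covid, file = csv_path, row.names = FALSE)

# Confirm the file exists before reading it back
if (file.exists(csv_path)) {
  toy_reloaded <- read.csv(csv_path, header = TRUE)
  print(toy_reloaded)
}
```

Building the path once and reusing the same variable for both writing and reading avoids the name mismatch seen above.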
## country confirmed
## 5 Angola 20981
## 6 Antigua and Barbuda 832
## 7 Argentina 9060495
## 8 Armenia 422963
## 9 Australia 10112229
## 10 Austria 5789991
## [1] 431434555
## [1] 5396881644
## [1] 0.08
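The three values above are consistent with a total of confirmed cases, a total of tests, and their ratio rounded to two decimals (431434555 / 5396881644 is roughly 0.0799). The code that produced them is not shown, so this is only a sketch, assuming the columns are named `confirmed` and `tested`:

```r
# Toy subset using rows shown earlier in this report (column names assumed)
toy_covid <- data.frame(
  country = c("Andorra", "Angola", "Antigua and Barbuda"),
  tested = c(300307, 399228, 15268),
  confirmed = c(37958, 20981, 832),
  stringsAsFactors = FALSE
)

# Totals are plain column sums
total_confirmed <- sum(toy_covid$confirmed)
total_tested <- sum(toy_covid$tested)

# Overall positive ratio, rounded to two decimals as in the output above
positive_ratio <- round(total_confirmed / total_tested, 2)
print(positive_ratio)

# The same formula applied to the totals printed in the report
report_ratio <- round(431434555 / 5396881644, 2)
print(report_ratio)
```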
Task 6: Get a sorted name list of countries that reported their testing data
covid_data$country <- as.character(covid_data$country)
# Sort A to Z
sort(covid_data$country)
## [145] "Slovakia" "Slovenia" "South Africa"
## [148] "South Korea" "South Sudan" "Spain"
## [151] "Sri Lanka" "Sudan" "Sweden"
## [154] "Switzerland[l]" "Taiwan[m]" "Tanzania"
## [157] "Thailand" "Togo" "Trinidad and Tobago"
## [160] "Tunisia" "Turkey" "Uganda"
## [163] "Ukraine" "United Arab Emirates" "United Kingdom"
## [166] "United States" "Uruguay" "Uzbekistan"
## [169] "Venezuela" "Vietnam" "Zambia"
## [172] "Zimbabwe"
# Sort Z to A
ztoa_country <- sort(covid_data$country, decreasing = TRUE)
print(ztoa_country)
## [118] "Finland" "Fiji" "Faroe Islands"
## [121] "Ethiopia" "Eswatini" "Estonia"
## [124] "Equatorial Guinea" "El Salvador" "Egypt"
## [127] "Ecuador" "DR Congo" "Dominican Republic"
## [130] "Dominica" "Djibouti" "Denmark[e]"
## [133] "Czechia" "Cyprus[d]" "Cuba"
## [136] "Croatia" "Costa Rica" "Colombia"
## [139] "China[c]" "Chile" "Chad"
## [142] "Canada" "Cameroon" "Cambodia"
## [145] "Burundi" "Burkina Faso" "Bulgaria"
## [148] "Brunei" "Brazil" "Botswana"
## [151] "Bosnia and Herzegovina" "Bolivia" "Bhutan"
## [154] "Benin" "Belize" "Belgium"
## [157] "Belarus" "Barbados" "Bangladesh"
## [160] "Bahrain" "Bahamas" "Azerbaijan"
## [163] "Austria" "Australia" "Armenia"
## [166] "Argentina" "Antigua and Barbuda" "Angola"
## [169] "Andorra" "Algeria" "Albania"
## [172] "Afghanistan"
# Find country names that contain a space (i.e., countries with multiple words in their name)
space_matches <- grep(" ", covid_data$country, value = TRUE)
print(space_matches)
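To make the behaviour of `grep()` easier to see, here is a small sketch on a handful of names taken from the sorted output earlier in this report. The variable names are illustrative. It also shows how the footnote markers that survived the scrape (such as `[l]`) could be located with an escaped bracket:

```r
# Small sample of names from the sorted output earlier in this report
sample_countries <- c("Spain", "Sri Lanka", "Sweden", "Switzerland[l]",
                      "Taiwan[m]", "Trinidad and Tobago")

# value = TRUE returns the matching strings themselves
with_space <- grep(" ", sample_countries, value = TRUE)
print(with_space)

# Without value = TRUE, grep() returns the positions of the matches
space_positions <- grep(" ", sample_countries)
print(space_positions)

# "[" is a regex metacharacter, so it must be escaped to match it literally
with_marker <- grep("\\[", sample_countries, value = TRUE)
print(with_marker)
```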
Task 8: Pick two countries you are interested in, and then review their testing data
germany
Task 9: Compare which one of the selected countries has a larger ratio of confirmed cases to population
## [1] "India has a higher COVID-19 infection rate per population than Germany."
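The comparison code itself is not shown in this report. One way such a message could be produced is a simple `if`/`else` on the two ratios; the helper name `compare_ratio` and its arguments are hypothetical:

```r
# Hypothetical helper: reports which of two countries has the larger
# confirmed-cases-to-population ratio
compare_ratio <- function(name_a, ratio_a, name_b, ratio_b) {
  if (ratio_a > ratio_b) {
    sprintf("%s has a higher COVID-19 infection rate per population than %s.",
            name_a, name_b)
  } else if (ratio_a < ratio_b) {
    sprintf("%s has a higher COVID-19 infection rate per population than %s.",
            name_b, name_a)
  } else {
    sprintf("%s and %s have the same COVID-19 infection rate per population.",
            name_a, name_b)
  }
}

# Generic placeholder inputs, not real country figures
print(compare_ratio("A", 2, "B", 1))
```

With one-row data frames for each country (for example `germany`, and an assumed counterpart `india`), the call would be `compare_ratio("India", india$confirmed.population.ratio, "Germany", germany$confirmed.population.ratio)`.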
Task 10: Find countries with a confirmed cases to population ratio less than a threshold
# Get a subset of any countries with `confirmed.population.ratio` less than the threshold
low_risk_countries <- covid_data[covid_data$confirmed.population.ratio < 1, ]
head(low_risk_countries)
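A small sketch of the same filter with the cut-off held in a variable, applied to rows 4 to 6 of the preview shown earlier in this report (the variable names `toy_covid`, `threshold`, and `low_risk` are illustrative):

```r
# Rows 4 to 6 of the preview shown earlier in this report
toy_covid <- data.frame(
  country = c("Andorra", "Angola", "Antigua and Barbuda"),
  confirmed.population.ratio = c(49, 0.067, 0.86),
  stringsAsFactors = FALSE
)

# Keep the cut-off in a variable so it is easy to change
threshold <- 1

# Logical row subsetting: keep rows whose ratio is below the threshold
low_risk <- toy_covid[toy_covid$confirmed.population.ratio < threshold, ]
print(low_risk$country)
```

If the ratio column contains `NA` values, logical subsetting returns extra all-`NA` rows; wrapping the condition in `which()` drops them.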