Machine Learning and OLAP On Big COVID-19 Data
Alfredo Cuzzocrea
Big Data Engineering and Analytics Lab, University of Calabria
Rende, Italy
Abstract: In the current technological era, huge amounts of big data are generated and collected from a wide variety of rich data sources. These big data can be of different levels of veracity in the sense that some of them are precise while some others are imprecise and uncertain. Embedded in these big data are useful information and valuable knowledge to be discovered. An example of these big data is healthcare and epidemiological data such as data related to patients who suffered from epidemic diseases like the coronavirus disease 2019 (COVID-19). Knowledge discovered from these epidemiological data, via data science techniques such as machine learning, data mining, and online analytical processing (OLAP), helps researchers, epidemiologists and policy makers to get a better understanding of the disease, which may inspire them to come up with ways to detect, control and combat the disease. In this paper, we present a machine learning and big data analytic tool for processing and analyzing COVID-19 epidemiological data. Specifically, the tool makes good use of taxonomy and OLAP to generalize some specific attributes into some generalized attributes for effective big data analytics. Instead of ignoring unknown or unstated values of some attributes, the tool provides users with the flexibility of including or excluding these values, depending on their preference and applications. Moreover, the tool discovers frequent patterns and their related patterns, which help reveal some useful knowledge such as absolute and relative frequency of the patterns. Furthermore, the tool learns from the patterns discovered from historical data and predicts useful information such as clinical outcomes for future data. As such, the tool helps users to get a better understanding of information about the confirmed cases of COVID-19. Although this tool is designed for machine learning and analytics of big epidemiological data, it would be applicable to machine learning and analytics of big data in many other real-life applications and services.

Keywords: big data, machine learning, online analytical processing, OLAP, data science, data analytics, data mining, coronavirus disease, COVID-19, epidemiological data

I. INTRODUCTION

In the current technological era, big data are everywhere. To elaborate, huge amounts of data have been easily generated and collected from a wide variety of rich data sources at a rapid rate. These big data can be of different levels of veracity (e.g., precise data, imprecise and uncertain data [1-3]). Examples of big data include:
• network (e.g., social network) data [4-10],
• financial time series [11-13],
• transportation data [14-17],
• omic data (e.g., genomic data) [18, 19],
• disease reports [20-22], as well as
• epidemiological data and statistics.

Useful information and valuable knowledge are usually embedded in these big data. This calls for data science [23], which aims to discover knowledge from these big data via data mining algorithms [24-26], machine learning tools [27-29], online analytical processing (OLAP) techniques [30-32], mathematical and statistical models [33, 34], data analytics, and visual analytics. The discovered knowledge is useful. For instance, knowledge discovered from these epidemiological data helps researchers, epidemiologists and policy makers to get a better understanding of the disease, which may inspire them to come up with ways to detect, prevent, and/or control diseases such as viral diseases. Examples of viral diseases include:
• severe acute respiratory syndrome (SARS), with outbreak in 2002–2004;
• Middle East respiratory syndrome (MERS), with outbreak in 2012–2015; and
• coronavirus disease 2019 (COVID-19), with an outbreak that started in 2019 and became a pandemic in 2020.

Due to the COVID-19 pandemic, many researchers have focused on different aspects of the COVID-19 disease. These include clinical and treatment information [35, 36], as well as drug discovery [18, 37], related to research in medical and health sciences. In contrast, as computer scientists, we focus on other aspects of COVID-19 data, namely epidemiological data.

Many existing works on the COVID-19 epidemiological data focused on showing the numbers of confirmed cases and mortality spatially and/or temporally. In other words, they show:
• spatial differences among different continents, countries, regions, or sovereignties; and/or
• temporal differences among weeks or days along the timeline, e.g., to show the effects of public health strategies and mitigation techniques such as social/
© IEEE 2021. This article is free to access and download, along with rights for full text and data mining, re-use and analysis.
Authorized licensed use limited to: Walchand Institute Of Technology. Downloaded on November 30,2021 at 11:20:27 UTC from IEEE Xplore. Restrictions apply.
physical distancing, stay-at-home orders, and lockdowns in “flattening the (epidemic) curve”.

As the numbers of inhabitants and tests both play roles in the data and their analyses, they help in the computation of figures like (a) the numbers of confirmed cases and mortality per thousand/million inhabitants and (b) the number of tests per thousand inhabitants.

While the numbers of confirmed cases and mortality are important in showing the severity of the disease in a certain location at a specific time or time interval, there is other important knowledge that can be discovered from the epidemiological data for revealing additional information associated with the disease. For instance, knowing that more confirmed cases and deaths were reported today than yesterday indicates the severity of the COVID-19 situation in Canada. However, these numbers do not reveal information such as:
• Which age groups tend to be more vulnerable to the disease (i.e., who is most at risk for COVID-19)?
• Which age groups tend to be less vulnerable to the disease?
• What is the likelihood of recovery for COVID-19 survivors who were admitted to the intensive care units (ICU)?

In this paper, we present a machine learning and big data analytic tool to discover this additional information associated with the disease from the epidemiological data. The tool collects a wide variety of data, such as (a) administrative information, (b) case details, (c) symptoms, (d) clinical course and outcomes, (e) exposures, etc., from different data sources. With the increasing number of cases in Canada (and around the world), these data are big and updated frequently. Due to the nature of the data, it is not unusual to have different levels of veracity, i.e., with known values for some of the attributes (e.g., known hospitalization status like “hospitalized and ICU admitted”) and unknown/NULL values for some others (e.g., unstated transmission methods of disease). Moreover, some data are quite detailed (e.g., “on January 23, a 56-year old male presented to Sunnybrook Health Sciences Centre in Toronto with a new onset of fever and non-productive cough following return from Wuhan, China, the day prior” [38]). Some other data are more abstract and general (e.g., “on Week 3 (i.e., the third full week) of 2020, a male in his 50s, who was infected through international travel, in the province of Ontario showed symptoms of fever and cough”), for preserving the privacy [39-42] of the individuals.

It becomes logical to have taxonomy to perform OLAP such as generalizing some very specific details into their generalized or aggregated forms to give an overview (i.e., a big picture) of data for data analytics. With the taxonomy, one can drill down if detailed information is needed. As a side-benefit, aggregated counts for attributes reduce the dimensions of the data and the search space for machine learning on big data. Aggregated counts for many of these attributes are expected to be sufficiently frequent to be qualified as frequent patterns. The discovered frequent patterns can then be used in training a supervised learning model for associative classification to predict the clinical course and outcomes for new data.

Our key contributions of this paper include our design and development of a machine learning tool for big COVID-19 epidemiological data. Our tool incorporates:
• OLAP techniques, with taxonomy for summarizing specific details of COVID-19 cases by their more generalized forms for purposes like preserving privacy of COVID-19 cases and preparing for data analytics of big data;
• handling of NULL values, which allows users to include or exclude NULL values in the analysis;
• data mining algorithms for the discovery of frequent patterns; and
• machine learning procedures for conducting supervised learning such that the resulting associative classifier, which was trained on historical data, can predict the clinical course and outcomes for new data.

Our tool helps users (e.g., researchers, epidemiologists and policy makers) to get a better understanding of information about the confirmed cases of COVID-19. This, in turn, may inspire them to come up with ways to detect, control and combat the disease. Moreover, although this tool is designed for machine learning and analytics of big epidemiological data, it is applicable to machine learning and analytics of big data in many other real-life applications and services.

The remainder of this paper is organized as follows. The next section discusses some background and related work. Section III presents our machine learning tool. Section IV shows evaluation results, and Section V draws the conclusions.

II. BACKGROUND AND RELATED WORKS

A. COVID-19 Research

Because of the COVID-19 pandemic, many researchers have explored different aspects of the COVID-19 disease. These led to numerous works on COVID-19. Examples include:
• systematic reviews on literature about medical and health science research on COVID-19 [43, 44];
• clinical and treatment information [35, 36], as well as drug discovery and vaccine development [18, 37], which focus more on the medical and health science aspects;
• crisis management for the COVID-19 outbreak [45], which focuses more on the social science aspects;
• artificial intelligence (AI)-driven informatics, sensing, imaging for tracking, testing, diagnosis, treatment and prognosis [46] such as imaging-based diagnosis of COVID-19 using chest computed tomography (CT) images [47, 48]; and
• mathematical modelling of the spread of COVID-19 [49].

In contrast, the current paper focuses more on natural sciences and engineering aspects and, in particular, takes on a more computational flavor. Moreover, our designed and developed
machine learning tool examines textual-based COVID-19 epidemiological data (rather than images). Instead of projecting the spread of the disease, our tool predicts the clinical outcomes (e.g., whether the case recovered or died from the disease). Furthermore, our tool conducts machine learning on big data, and it helps users to get a better understanding of information about the confirmed cases of COVID-19. Although this tool is designed for machine learning and analytics of big epidemiological data, it would be applicable to machine learning and analytics of big data in many other real-life applications and services.

B. Confirmed Cases and Mortality

Many existing works on the COVID-19 epidemiological data focused on reporting the numbers of confirmed cases and mortality spatially, which highlights spatial differences among different continents, countries, regions, or sovereignties. Examples of these works include data and dashboards reported by organizations like:
• World Health Organization (WHO) [50];
• Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU)1;
• European Centre for Disease Prevention and Control (ECDC)2;
• governments (e.g., Government of Canada3); as well as
• major news channels/media/networks (e.g., newspaper, TV4) and Wikipedia5.

See Tables I, II, III and IV for some examples showing top-10 countries with new (or cumulative) cases (or deaths) based on the WHO data [50]. Specifically:
• Table I lists the top-10 countries with the highest daily number of new COVID-19 cases, as well as the global daily number of new cases, on November 15, 2020.
• Table II lists the top-10 countries with the highest daily number of new COVID-19 deaths, as well as the global daily number of new deaths, on November 15, 2020.
• Table III lists the top-10 countries with the highest cumulative number of COVID-19 cases, as well as the global cumulative number of cases, as of November 15, 2020.
• Table IV lists the top-10 countries with the highest cumulative number of COVID-19 deaths, as well as the global cumulative number of deaths, as of November 15, 2020.

TABLE I. COUNTRIES WITH THE TOP-10 NUMBER OF NEW COVID-19 CASES ON NOVEMBER 15, 2020
Rank  Country    New Cases
      Global     594,000
1     USA        181,066
2     India      41,100
3     Italy      37,249
4     France     32,059
5     Brazil     29,070
6     UK         26,860
7     Poland     25,571
8     Russia     22,571
9     Germany    16,947
10    Argentina  11,859

TABLE II. COUNTRIES WITH THE TOP-10 NUMBER OF NEW DEATHS FROM COVID-19 ON NOVEMBER 15, 2020
Rank  Country  New Deaths
      Global   8,212
1     USA      1,356
2     Mexico   568
3     Poland   546
4     Italy    544
5     UK       462
6     Brazil   456
7     Iran     452
8     India    447
9     France   354
10    Russia   352

TABLE III. COUNTRIES WITH THE TOP-10 CUMULATIVE TOTAL NUMBER OF COVID-19 CASES AS OF NOVEMBER 15, 2020
Rank  Country    Cumulative Cases
      Global     53,766,728
1     USA        10,641,431
2     India      8,814,579
3     Brazil     5,810,652
4     Russia     1,925,825
5     France     1,918,345
6     Spain      1,458,591
7     UK         1,344,360
8     Argentina  1,296,378
9     Colombia   1,182,697
10    Italy      1,144,552

TABLE IV. COUNTRIES WITH THE TOP-10 CUMULATIVE TOTAL NUMBER OF DEATHS FROM COVID-19 AS OF NOVEMBER 15, 2020
Rank  Country  Cumulative Deaths
      Global   1,308,975
1     USA      242,542
2     Brazil   164,737
3     India    129,635
4     Mexico   97,624
5     UK       51,766
6     Italy    44,683
7     France   43,913
8     Iran     41,034
9     Spain    40,769
10    Peru     35,106

As observed from these tables, several countries (such as Brazil, France, India, UK, and USA) have been hit hard by COVID-19, as they appear in all four tables.
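The observation that some countries appear in all four tables can be checked mechanically. The following is a minimal sketch in Python, with the country lists transcribed from Tables I to IV; the dictionary keys and the function name `countries_in_all_tables` are illustrative choices and not part of the tool described in this paper.

```python
# Top-10 country lists transcribed from Tables I to IV (WHO data, November 15, 2020).
top10 = {
    "new_cases": ["USA", "India", "Italy", "France", "Brazil",
                  "UK", "Poland", "Russia", "Germany", "Argentina"],
    "new_deaths": ["USA", "Mexico", "Poland", "Italy", "UK",
                   "Brazil", "Iran", "India", "France", "Russia"],
    "cumulative_cases": ["USA", "India", "Brazil", "Russia", "France",
                         "Spain", "UK", "Argentina", "Colombia", "Italy"],
    "cumulative_deaths": ["USA", "Brazil", "India", "Mexico", "UK",
                          "Italy", "France", "Iran", "Spain", "Peru"],
}


def countries_in_all_tables(tables):
    """Return the sorted countries that appear in every top-10 list."""
    sets = [set(names) for names in tables.values()]
    return sorted(set.intersection(*sets))


in_all_four = countries_in_all_tables(top10)
print(in_all_four)
```

Under this transcription, the intersection contains the five countries named in the text and also Italy, which is consistent with the "such as" wording of the observation.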
1 https://coronavirus.jhu.edu/map.html
2 https://qap.ecdc.europa.eu/public/extensions/COVID-19/COVID-19.html
3 https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html
4 https://newsinteractives.cbc.ca/coronavirustracker/
5 https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Canada, https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/Canada_medical_cases
Given that the numbers of inhabitants and tests both play roles in the data and analyses, the numbers of confirmed cases and mortality are sometimes represented in terms of per million inhabitants. See Tables V and VI, which reveal COVID-19 situations in terms of infection rate and death rate (cf. Tables I to IV, which show the absolute numbers of infections and deaths) based on the WHO data [50]. As observed from these tables, some geographically small or sparsely populated countries have been hit hard by COVID-19. For example, in Andorra, with a population of around 77,000, the infection rate is at a worrisome level of around one case per 13.5 inhabitants, and the death rate is around one death per 1,030 inhabitants (cf. around a case per 171.8 inhabitants and a death per 1,350 inhabitants in the USA).

TABLE V. COUNTRIES WITH THE TOP-K CUMULATIVE TOTAL NUMBER OF COVID-19 CASES PER MILLION INHABITANTS AS OF NOVEMBER 15, 2020
Rank  Country (or Region)  Cumulative Cases per 1M Population
1     Andorra              74,095.6
2     Bahrain              49,673.4
3     Qatar                47,055.7
4     Belgium              45,832.7
5     Aruba                43,450.2
6     Montenegro           42,810.8
7     Czechia              42,789.2
8     French Polynesia     41,672.0
9     Luxembourg           41,424.8
10    Armenia              39,597.5
      Global               6,887.6

TABLE VI. COUNTRIES WITH THE TOP-K CUMULATIVE TOTAL NUMBER OF DEATHS PER MILLION INHABITANTS FROM COVID-19 AS OF NOVEMBER 15, 2020
Rank  Country (or Region)  Cumulative Deaths per 1M Population
1     San Marino           1,237.6
2     Belgium              1,234.1
3     Peru                 1,064.7
4     Andorra              970.7
5     Spain                872.0
6     Argentina            775.4
7     Brazil               775.0
8     Chile                773.0
9     UK                   762.5
10    Mexico               757.2
      Global               167.7

As “a picture is worth a thousand words”, the numbers of cases and mortality are sometimes represented in graphical forms by using bubble maps. In a bubble map, the number of cases (or deaths) for each country is indicated by the radius of the bubble representing the country. The larger the bubble representing a country, the more severe its COVID-19 situation. On the one hand, users can then easily spot those countries with severe COVID-19 situations due to their large bubble sizes. On the other hand, bubbles may overlap. As such, the overlapping and/or containment of bubbles can make it difficult for users to visualize the severity of the disease in dense regions such as Eastern Caribbean and Southeastern Europe.

Alternatively, the numbers of cases and mortality are sometimes represented by choropleth maps. These maps use different shading, coloring, or placing of symbols within predefined areas to indicate the number of cases (or deaths) for each country. The darker the shading of a country, the more severe its COVID-19 situation. On the one hand, users can then easily spot those countries with severe COVID-19 situations due to their shading. On the other hand, small countries in terms of geographic areas or sizes (e.g., Andorra, Monaco, San Marino) may not be easily visible on the map, let alone visualizing their shading.

These numbers of confirmed cases and mortality are important in showing the severity of the disease in a certain location at a specific time or time interval. However, it is equally important to explore and discover other useful knowledge from the epidemiological data because the discovered knowledge can reveal useful information (e.g., some characteristics of COVID-19 cases) associated with the disease. This, in turn, helps users to get a better understanding of the characteristics of the confirmed cases of COVID-19 (rather than just the numbers of cases).

III. OUR MACHINE LEARNING TOOL

In this section, we describe our machine learning tool for big data analytics of COVID-19 epidemiological data.

A. Collection and Integration of Data

Big COVID-19 epidemiological data can be of a wide variety (e.g., different types of data). They are usually generated and collected from various data sources.

As a concrete example, in Canada, health care is a responsibility of provincial governments. So, Canadian COVID-19 epidemiological data are gathered from each province (or territory), and provincial data are obtained from health regions (which are also known as health authorities) within the province. For instance, in the province of Manitoba, COVID-19 data can be gathered from Winnipeg Regional Health Authority (WRHA) and four other health authorities6. Similarly, data for the province of British Columbia (BC) can be gathered from five health authorities such as Vancouver Coastal Health (VCH), which obtained data from 14 local health areas (LHA) within the three health service delivery areas (HSDA) in the VCH. In BC, there are 88 LHA within the 16 HSDA among the five health authorities7. As a third example, data from the province of Ontario can be gathered from public health units within the 14 provincial local health integration networks (LHIN)8.

In terms of data types, COVID-19 epidemiological data usually include:
• administrative information, which includes:
  o a unique privacy-preserving identifier for each case,
  o its location, and
  o episode day (i.e., symptom onset day or its closest day).
6 https://www.gov.mb.ca/health/rha/
7 https://www2.gov.bc.ca/gov/content/data/geographic-data-services/land-use/administrative-boundaries/health-boundaries
8 http://www.lhins.on.ca/
• case details, which include:
  o gender,
  o age, and
  o specific occupation of the cases.
• symptom-related data, which include additional information for the case who is not asymptomatic (i.e., symptomatic case) such as:
  o onset day of symptoms, and
  o a collection of symptoms (e.g., cough, fever, chills, sore throat, runny nose, shortness of breath, nausea, headache, weakness, pain, irritability, diarrhea, and other symptoms).
• clinical course and outcomes, which include:
  o hospital status (e.g., hospitalized in the intensive care unit (ICU), non-ICU hospitalized, not hospitalized).
  o For a recovered case, it also includes additional information such as the recovery day.
  o For a case who has not recovered, it indicates that the case died while infected by COVID-19.
• exposures, which include transmission methods.

B. Handling of NULL values

After collecting and integrating data from heterogeneous sources, we observe that there is some missing, unstated or unknown information (i.e., NULL values). Given the nature of these COVID-19 cases, it is not unusual to have NULL values because values may not be available or recorded. For some other attributes related to case details (e.g., personal information like gender, age), patients may prefer not to report them due to privacy concerns.

To elaborate, since data are collected from administrative health regions, their locations (or generalized regions within a country) are known. For other attributes, their values can be NULL to indicate that they are unknown or not stated. Although NULL values are usually ignored in many other real-life applications, our tool captures and counts NULL values instead. The rationale is that, due to the nature of the COVID-19 cases (e.g., for timely reporting of cases, privacy-preservation of the identity of cases), it is not too surprising to observe a significant number of NULL values. Ignoring these many NULL values may lead to inaccurate or incomplete analysis of the data. Hence, for each of these nullable attributes, in addition to those stated values, our tool captures and counts NULL values.

C. Preprocessing of Data with Taxonomy and OLAP

In addition to observing NULL values in the data, we also observe that values for some attributes are too specific (e.g., reported symptom onset day, which may be inaccurate, partially due to delays in testing). As another example, due to numerous values for some attributes (e.g., age, occupation), it would be logical to group similar values into a mega-value (say, ages can be binned into age groups). Hence, our tool generalizes some attributes by exploiting taxonomy and OLAP. In other words, data can be stored in a data cube so that users can (a) drill down to find more details and (b) drill/roll up to get aggregate values (e.g., usually count or sum of values). To elaborate, our tool generalizes data by:
• applying taxonomy to case locations to group them into local health regions, which then generalize to become provinces, and then to regions within a country (e.g., Prairies, Atlantic region);
• applying temporal hierarchy to group days into weeks (e.g., episode week, onset week of symptoms, recovery week);
• grouping ages into age groups (e.g., ≤ 19 years old, 20-29 years old, ..., 70-79 years old, ≥ 80 years old);
• generalizing specific occupations of the cases to some generalized key occupation groups, say, (a) health care workers, (b) school or daycare workers, (c) long-term care residents, and (d) others;
• generalizing specific transmission methods to some generalized key transmission methods, say, (a) community exposures, (b) travel exposures, and (c) others; as well as
• transforming any set of m symptoms (with potential set size of 1 to m) into m Boolean attributes, each of which indicates whether a symptom is reported or not.

Note that generalization of some data helps preserve privacy of some COVID-19 cases. Another side-benefit is to increase the frequency of some attributes in preparation for frequent pattern mining.

D. Mining of Frequent Patterns

After preprocessing and generalizing data, our tool conducts big data analytics on the resulting COVID-19 data. With at least 11 attributes and m symptoms (e.g., m = 13 symptoms listed above), there can be a total of (11+m) dimensions in a data cube when data are stored in the cube. For each dimension, there can be n_i stated values for the attribute/dimension. Then, with the NULL value and ALL value, there can be (n_i + 2) values for the attribute. The total number of cells in the cube can be the product of the number of values (i.e., n_i + 2) in each dimension over at least (11+m) dimensions. Thus, the search space can be large.

Our tool first provides users with insights about each dimension D. It can do so by setting all other dimensions to ALL and enumerating all values of D. It repeats the same procedure for each dimension. The dimension value with the highest frequency gives the most frequent singleton pattern. On the other end of the spectrum, the dimension value with the lowest frequency gives the rarest singleton pattern.

To a further extent, by setting all but k dimensions to ALL and enumerating all values for the k dimensions, cells with high frequency give frequent non-singleton patterns. Conversely, cells with low frequency give rare non-singleton patterns.

Given the large search space, finding frequent or rare patterns with the aforementioned procedure can be time consuming. Our tool also provides an alternative by applying
traditional frequent pattern mining algorithms to find frequent and rare patterns. Benefits of using these algorithms include the constant pruning of search space provided by the exploitation of the property that a super-pattern is infrequent if any of its sub-patterns is infrequent.

In addition to finding frequent patterns, our tool also provides users the flexibility to find patterns related to (or complementary to) the mined frequent patterns. This gives insights about the relative importance of the mined frequent patterns. Specifically, our tool provides the relative percentages of the frequent patterns when compared with their related patterns (with NULL values included or excluded). The tool first finds frequent patterns by applying traditional frequent pattern mining algorithms and then looks up the frequency of related patterns by enumerating values for the attributes in the frequent patterns. The related patterns can be looked up from the data cube or the mined patterns.

E. Prediction of Outcomes by Supervised Learning

Once the frequent patterns are mined, they can be used for associative classification, which is a supervised learning technique. By training our tool with different combinations of attribute-values, it can make predictions. A useful prediction is to predict the likelihood of clinical outcomes (e.g., recovered or deceased).

A. A Case Study on Real-Life COVID-19 Data

   e) Atlantic (i.e., New Brunswick, Nova Scotia, Prince Edward Island, Newfoundland and Labrador)
3. Episode week (or onset week of symptoms): From Week 3 (i.e., week of January 12-18, 2020) to now, and NULL
4. Gender, including NULL
5. Age group: ≤ 19, 20s, 30s, 40s, 50s, 60s, 70s, ≥ 80s, and NULL (e.g., unknown, prefer not to declare)
6. Occupation group, including:
   a) health care worker,
   b) school or daycare worker (or attendee),
   c) long-term care resident,
   d) other occupation, and
   e) NULL
7. Asymptomatic: Yes, No, and NULL
8. Hospital status, including:
   a) hospitalized in the ICU,
   b) hospitalized but not in the ICU,
   c) not hospitalized, and
   d) NULL
9. Transmission method, including:
   a) community exposures,
   b) travel exposures, and
10. Clinical outcome: Recovered, deceased, and NULL

3) Preprocessing of Data with Taxonomy and OLAP

On the one hand, capturing COVID-19 cases in the dataset in a data cube provides users flexibility to apply OLAP operations such as drill downs to the details of specific cases and roll ups to some aggregated counts. On the other hand, having an ALL value capturing aggregated counts for each dimension further increases the number of combinations. This is also the total number of cells in the data cube, namely 1,161,600,000 cells, i.e., around 1.2 billion cells.
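The cell count follows from multiplying the number of values per dimension, where each dimension has its stated values plus ALL, plus NULL when the attribute is nullable. The following is a minimal sketch of that arithmetic; the function name `cube_cell_count`, the `nullable` flag, and the dimension sizes are hypothetical and are not the dimensions behind the 1,161,600,000 figure, whose factors are not listed here.

```python
from math import prod


def cube_cell_count(stated_values_per_dim, nullable=None):
    """Number of cells in a data cube.

    Each dimension contributes its n_i stated values plus the ALL value,
    plus a NULL value when the dimension is nullable (n_i + 2 in total).
    """
    if nullable is None:
        nullable = [True] * len(stated_values_per_dim)
    sizes = [n + 1 + (1 if is_nullable else 0)
             for n, is_nullable in zip(stated_values_per_dim, nullable)]
    return prod(sizes)


# Hypothetical example: 5 regions (never NULL), 2 genders, 8 age groups,
# and 3 hospital statuses (the last three are nullable).
hypothetical_dims = [5, 2, 8, 3]
count = cube_cell_count(hypothetical_dims, nullable=[False, True, True, True])
print(count)  # (5+1) * (2+2) * (8+2) * (3+2) cells
```

Even with four small dimensions the count reaches the thousands, which illustrates why the full cube over all attributes and Boolean symptom dimensions grows to the order of a billion cells.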
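The roll-ups in this preprocessing step rely on the generalizations described in Section III-C. The following is a minimal sketch of three of them (location to region, day to week, age to age group) plus the symptom-to-Boolean transformation. All function names and the record layout are hypothetical; only the Atlantic grouping, the Week 3 anchor (January 12-18, 2020), the age groups, and the 13 symptom names are taken from the text.

```python
from datetime import date

# Province-to-region taxonomy. Only the Atlantic grouping is spelled out in
# the text; other regions would be added in the same way (assumption).
REGION_OF = {
    "New Brunswick": "Atlantic",
    "Nova Scotia": "Atlantic",
    "Prince Edward Island": "Atlantic",
    "Newfoundland and Labrador": "Atlantic",
}

# The 13 symptoms listed in Section III-A.
SYMPTOMS = ["cough", "fever", "chills", "sore throat", "runny nose",
            "shortness of breath", "nausea", "headache", "weakness", "pain",
            "irritability", "diarrhea", "other symptoms"]

WEEK3_START = date(2020, 1, 12)  # Week 3 is the week of January 12-18, 2020


def age_group(age):
    """Bin an age into <=19, 20s, ..., 70s, >=80; None (NULL) stays None."""
    if age is None:
        return None
    if age <= 19:
        return "<=19"
    if age >= 80:
        return ">=80"
    return f"{age // 10 * 10}s"


def episode_week(day):
    """Map a day to its week number, counting from Week 3; None stays None."""
    if day is None:
        return None
    return 3 + (day - WEEK3_START).days // 7


def symptom_flags(reported):
    """Turn a set of reported symptoms into one Boolean attribute per symptom."""
    return {name: (name in reported) for name in SYMPTOMS}


def generalize(case):
    """Generalize one case record (a dict with hypothetical keys)."""
    return {
        "region": REGION_OF.get(case.get("province")),
        "episode_week": episode_week(case.get("episode_day")),
        "age_group": age_group(case.get("age")),
        **symptom_flags(case.get("symptoms", ())),
    }


example = generalize({"province": "Nova Scotia",
                      "episode_day": date(2020, 1, 15),
                      "age": 56,
                      "symptoms": {"fever", "cough"}})
print(example["region"], example["episode_week"], example["age_group"])
```

Keeping NULL as `None` through every generalization step is a design choice in this sketch, so that unstated values can still be counted later rather than being dropped.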
With this setting, frequent patterns can be found by setting TABLE IX. DISTRIBUTION OF CLINICAL OUTCOME
some attribute values to ALL. Specifically, setting values of all Frequency
attributes—except the transmission method—to ALL, i.e., the Clinical Outcome
Absolute Relative
cell ALL, ..., ALL, transmission method = “community Recovered 158,528 94.3% 75.6%
exposures”, ALL, ALL, gives frequency of a singleton pattern Deceased 9,541 5.7% 4.5%
#cases w/ stated clinical outcome 168,069 100% 80.1%
{community exposures}. This reveals that 164,280 COVID-19
Unstated clinical outcomes 41,742 19.9%
cases in Canada were transmitted through community Total #cases 209,811 100%
exposures. Replacing the value of transmission method gives
two related singleton patterns {travel exposures} and {unstated TABLE X. DISTRIBUTION OF HOSPITAL STATUS
exposures}. The two corresponding cells in the data cube reveal
that (a) 4,810 cases were transmitted through travel exposures Frequency
Hospital Status
Absolute Relative
and (b) transmissions of other 40,721 cases were unstated. Not hospitalized 122,174 89.5% 58.23%
With these frequency values for related attribute-values Hospitalized but not ICU admitted 11,140 8.2% 5.31%
about transmission methods, our tool provides users with ICU admitted 3,203 2.3% 1.53%
#cases w/ stated hospital status 136,517 100% 65.07%
relative frequency information. Specifically, it reports to the Unstated hospital status 73,294 34.93%
users that, among 209,811 cases, 78.3% of cases were Total #cases 209,811 100%
transmitted through community exposures, 2.3% of cases were
transmitted through travel exposures and transmissions of the TABLE XI. DISTRIBUTION OF GENDER
remaining 19.4% were unstated. See Table VII.
Frequency
Gender
Absolute Relative
TABLE VII. DISTRIBUTION OF TRANSMISSION METHOD
Female 106,878 53.0% 50.9%
Transmission Absolute Frequency Male 94,736 47.0% 45.2%
Relative Frequency
Method (i.e., #Cases) #cases w/ stated gender 201,614 100% 96.1%
Community exposures 164,280 78.3% Unstated gender 8,197 3.9%
Travel exposures 4,810 2.3% Total #cases 209,811 100%
Unstated transmission
40,721 19.4%
method TABLE XII. DISTRIBUTION OF AGE GROUP
Total 209,811 100%
Frequency
Age Group
Absolute Relative
Moreover, in addition to showing the relative frequencies 20s 40,029 19.12% 19.08%
with NULL category for the attribute transmission methods, our 30s 32,904 15.72% 15.68%
tool also provides users with flexible of ignoring the NULL 40s 30,637 14.63% 14.60%
category for the attribute and focusing only on stated/known 50s 28,960 13.83% 13.80%
values. Specifically, it reports to the users that, among ≤ 19 years old 24,487 11.70% 11.67%
169,090 cases with known transmission methods, 97.2% of ≥ 80s 22,241 10.62% 10.60%
cases were transmitted through community exposures and the 60s 18,206 8.70% 8.68%
70s 11,890 5.68% 5.67%
remaining 2.8% were transmitted through travel exposures. See #cases w/ stated age group 209,354 100% 99.78%
Table VIII. Unstated age group 457 0.22%
Total #cases 209,811 100%
TABLE VIII. DISTRIBUTION OF STATED TRANSMISSION METHOD
Stated Transmission Method   Absolute Frequency (i.e., #Cases)    Relative Frequency
Community exposures          164,280                              97.2%
Travel exposures             4,810                                2.8%
Total                        169,090                              100%

Similarly, our tool provides data distributions for some other attributes. See Tables IX to XII, which reveal knowledge like:
• A majority of cases recovered.
• More than half of cases were not hospitalized.
• There is no significant difference between the genders, though there were slightly more female cases than male cases.
• There is also no significant difference among most age groups, though there were slightly more young cases (especially those in their 20s) than elderly cases.

4) Mining of Frequent Patterns
While our tool makes good use of the data cube in providing users with insight about distributions of different attributes (i.e., singleton patterns and their related patterns), searching through numerous cells in a data cube can be time consuming. Hence, our tool provides users with an alternative by using traditional frequent pattern mining algorithms to find frequent patterns. See Table XIII for top-10 frequent patterns, Table XIV for top-5 frequent singleton patterns (i.e., patterns involving only one attribute), and Table XV for top-10 frequent non-singleton patterns (i.e., patterns involving more than one attribute). They reveal knowledge like:
• A majority of cases were (a) transmitted via community exposures and (b) recovered.
• A majority of those community-exposed cases were (a) recovered and (b) not hospitalized.
• More than half of the cases were not hospitalized.
• Slightly more than half of the cases were females.
• Many cases were not hospitalized and recovered.

TABLE XIII. TOP-10 FREQUENT PATTERNS
Frequent Pattern                                       Absolute   Relative
{community exposures}                                  164,280    78.3%
{recovered}                                            158,528    75.6%
{community exposures, recovered}                       130,291    62.1%
{not hospitalized}                                     122,174    58.2%
{community exposures, not hospitalized}                115,448    55.0%
{female}                                               106,878    50.9%
{not hospitalized, recovered}                          97,422     46.4%
{male}                                                 94,736     45.2%
{community exposures, not hospitalized, recovered}     92,559     44.1%
{female, community exposures}                          88,480     42.2%
All Canadian COVID-19 cases                            209,811    100%

TABLE XIV. TOP-5 FREQUENT SINGLETON PATTERNS
Frequent Singleton Pattern     Absolute   Relative
{community exposures}          164,280    78.3%
{recovered}                    158,528    75.6%
{not hospitalized}             122,174    58.2%
{female}                       106,878    50.9%
{male}                         94,736     45.2%
All Canadian COVID-19 cases    209,811    100%

TABLE XV. TOP-10 FREQUENT NON-SINGLETON PATTERNS
Frequent Non-singleton Pattern                         Absolute   Relative
{community exposures, recovered}                       130,291    62.1%
{community exposures, not hospitalized}                115,448    55.0%
{not hospitalized, recovered}                          97,422     46.4%
{community exposures, not hospitalized, recovered}     92,559     44.1%
{female, community exposures}                          88,480     42.2%
{female, recovered}                                    84,371     40.2%
{male, community exposures}                            75,294     35.9%
{male, recovered}                                      72,786     34.7%
{female, community exposures, recovered}               71,138     33.9%
{female, not hospitalized}                             65,204     31.1%
All Canadian COVID-19 cases                            209,811    100%

Based on the discovered frequent patterns, our tool allows users to further explore and expand them. To elaborate, after finding a frequent singleton pattern {community exposures}, our tool allows users to expand the pattern to explore the hospital status. As shown in Table XVI, the expanded pattern {community exposures, not hospitalized}, which is a frequent non-singleton pattern, reveals that a majority of cases transmitted via community exposures did not need to be hospitalized. Along this direction, the users can further explore the expanded pattern to find patterns like {community exposures, not hospitalized, recovered}, which reveals that a majority of cases transmitted via community exposures were not hospitalized but recovered.

In addition to showing these frequent patterns, our tool also returns other patterns (which may not be so frequent) related to the frequent patterns. As a side benefit, these related patterns, as shown in Tables XVI to XVIII, provide additional information such as the relative frequency of the frequent patterns with respect to all cases and/or groups of cases. For instance, Table XVI reveals that most (i.e., 78.3%) of the cases were exposed through the community (rather than other transmission methods).

TABLE XVI. SOME FREQUENT PATTERNS AND THEIR RELATED PATTERNS
Pattern                                                                Absolute   Relative
{community exposures}                                                  164,280    78.3%
{community exposures, not hospitalized}                                115,448    55.0%
{community exposures, not hospitalized, recovered}                     92,559     44.1%
{community exposures, not hospitalized, unstated clinical outcome}     20,328     9.7%
{community exposures, not hospitalized, deceased}                      2,561      1.2%
{community exposures, unstated hospital status}                        35,869     17.1%
{community exposures, non-ICU hospitalized}                            10,190     4.9%
{community exposures, ICU hospitalized}                                2,773      1.3%
{unstated transmission method}                                         40,721     19.4%
{travel exposures}                                                     4,810      2.3%
All Canadian COVID-19 cases                                            209,811    100%

TABLE XVII. SOME FREQUENT PATTERNS ABOUT COMMUNITY EXPOSURES AND THEIR RELATED PATTERNS
Pattern with Community Exposures & Hospital Status                         Absolute   Relative (stated)   Relative (all)
{community exposures, not hospitalized}                                    115,448    89.9%               70.3%
{community exposures, non-ICU hospitalized}                                10,190     7.9%                6.2%
{community exposures, ICU hospitalized}                                    2,773      2.2%                1.7%
Total for all stated hospital status assoc with {community exposures}      128,411    100%                78.2%
{community exposures, unstated hospital status}                            35,869                         21.8%
Total for all hospital status assoc with {community exposures}             164,280                        100%

TABLE XVIII. SOME FREQUENT PATTERNS ABOUT COMMUNITY EXPOSURES & NON-HOSPITALIZATION, AND THEIR RELATED PATTERNS
Pattern with Community Exposures, Non-hospitalization & Clinical Outcome          Absolute   Relative (stated)   Relative (all)
{com. exp., not hosp., recovered}                                                 92,559     97.3%               80.2%
{com. exp., not hosp., deceased}                                                  2,561      2.7%                2.2%
Total for all stated clinical outcomes assoc w/ {com. exp., not hospitalized}     95,120     100%                82.4%
{com. exp., not hosp., unstated clinical outcome}                                 20,328                         17.6%
Total for all clinical outcomes assoc w/ {community exp., not hospitalized}       115,448                        100%

Among these 164,280 community-exposed cases, 115,448 (i.e., 70.3% as indicated in Table XVII) of them, which is 55.0% of all COVID-19 cases, were not hospitalized. Only 7.9% were hospitalized (6.2% non-ICU and 1.7% ICU admitted). Table XVII also reveals that, when considering only the 128,411 community-exposed cases with stated hospital status (by ignoring those 35,869 cases with unstated hospital status, which account for 21.8% of community-exposed cases or 17.1% of all cases), 89.9% of them did not need to be hospitalized.

To a further extent, among the 115,448 non-hospitalized community-exposed cases, 92,559 (i.e., 80.2% as indicated in Table XVIII) of them, which account for 56.3% of community-exposed cases and 44.1% of all COVID-19 cases, recovered. The table also reveals that, when considering only the 95,120 non-hospitalized community-exposed cases with stated clinical outcomes (by ignoring those 20,328 non-hospitalized
community-exposed cases with unstated clinical outcomes, which account for 17.6% of non-hospitalized community-exposed cases, 12.4% of community-exposed cases, or 9.7% of all cases), 97.3% of them recovered.

5) Prediction of Outcomes by Supervised Learning
Once frequent patterns (especially frequent non-singleton patterns) are discovered, our tool makes good use of them in forming association rules. These rules are then used in associative classification (i.e., associative supervised learning) for predicting the clinical outcomes.

For example, based on frequent patterns {community exposures, not hospitalized, recovered} and {community exposures, not hospitalized} with respective frequencies of 92,559 and 115,448, our tool infers an associative classification rule:
{community exposures, not hospitalized} → recovered,
which is supported by 92,559 COVID-19 cases and with 80% confidence. Similarly, based on frequent patterns {community exposures, recovered} and {community exposures} with respective frequencies of 130,291 and 164,280, our tool infers another rule:
{community exposures} → recovered,
which is supported by more cases (i.e., 130,291 cases) and with 79% confidence. Some additional samples of associative classification rules for the prediction of clinical outcomes are shown in Table XIX.

TABLE XIX. SAMPLE RULES FOR CLINICAL OUTCOME PREDICTION
Associative Classifier                   Prediction   Support   Confidence
{travel exp., not hospitalized}          recovered    3,003     97.4%
{male, travel exp., not hospitalized}    recovered    1,608     97.5%
{40s, community exposures}               recovered    20,688    85.4%
{50s, community exposures}               recovered    19,270    84.5%
{40s, not hospitalized}                  recovered    16,683    83.6%
{50s, not hospitalized}                  recovered    14,821    83.5%
{30s, community exposures}               recovered    21,343    83.1%
{20s, community exposures}               recovered    24,444    81.9%
{60s, community exposures}               recovered    11,394    80.9%

B. Functionality Check with Related Works
After demonstrating the features and usefulness of our machine learning tool in analyzing real-life COVID-19 data, let us evaluate its functionality when compared with related works. First, most of the related works are observed to report only the numbers of cases and deaths. They do not provide privacy-preserving details and epidemiological characteristics of those COVID-19 cases, which are provided by our tool. Second, those related works that provide overall data distribution of cases are mostly confined to single dimensions/attributes. In contrast, our tool provides multi-dimensional information such as relationships among attributes in the form of frequent patterns (and their related patterns) and associative classification rules. Third, related works focused on prediction mostly predict the trends (e.g., number of new cases) instead of clinical outcomes. In contrast, our tool makes good use of the frequent patterns discovered from historical data to predict clinical outcomes for future data.

V. CONCLUSIONS
In this paper, we presented a machine learning tool for big data analytics on big COVID-19 epidemiological data. The tool makes good use of taxonomy and OLAP to generalize some attributes for effective analysis. Instead of ignoring unstated values of some attributes, the tool provides users with flexibility of including or excluding these values. Moreover, the tool also discovers frequent patterns and their related patterns, which help reveal some useful knowledge such as absolute and relative frequency of the patterns. Our tool trains a supervised learning model based on the frequent patterns discovered from historical data, and predicts clinical outcomes (e.g., recovered or deceased from COVID-19) for future data. Evaluation results show the practicality of our tool in providing rich knowledge about characteristics of COVID-19 cases. This helps researchers, epidemiologists and policy makers to get a better understanding of the disease, which may inspire them to come up with ways to detect, control and combat the disease.

As ongoing and future work, we transfer knowledge learned from the current work to machine learning and analytics of big data in many other real-life applications and services. Moreover, we explore the incorporation of our machine learning tool with a COVID-19 visualizer [52] such that the machine learning serves as a back-end engine for big data analytics and the visualizer serves as a front-end interface for information visualization and visual analytics of big COVID-19 epidemiological data.

ACKNOWLEDGMENT
This work is partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), as well as the University of Manitoba.

REFERENCES
[1] A. Alim, X. Zhao, J. Cho, F. Chen, “Uncertainty-aware opinion inference under adversarial attacks,” in IEEE BigData 2019, pp. 6-15.
[2] F. Jiang, C.K. Leung, “A data analytic algorithm for managing, querying, and processing uncertain big data in cloud environments,” Algorithms 8(4), 2015, pp. 1175-1194.
[3] C.K. Leung, et al., “Fast algorithms for frequent itemset mining from uncertain data,” in IEEE ICDM 2014, pp. 893-898.
[4] G. Chatzimilioudis, et al., “A novel distributed framework for optimizing query routing trees in wireless sensor networks via optimal operator placement,” JCSS 79(3), 2013, pp. 349-368.
[5] A. Cuzzocrea, “Combining multidimensional user models and knowledge representation and management techniques for making web services knowledge-aware,” WIAS 4(3), 2006, pp. 289-312.
[6] C. He, S. Sun, B. Li, X. Tu, D. Yu, “Finding mutual X at WeChat-scale social network in ten minutes,” in IEEE BigData 2019, pp. 288-297.
[7] F. Jiang, C.K. Leung, S.K. Tanbeer, “Finding popular friends in social networks,” in CGC 2012, pp. 501-508.
[8] C.K. Leung, C.L. Carmichael, “Exploring social networks: a frequent pattern visualization approach,” in IEEE SocialCom 2010, pp. 419-424.
[9] C.K. Leung, A. Cuzzocrea, J.J. Mai, D. Deng, F. Jiang, “Personalized DeepInf: enhanced social influence prediction with deep learning and transfer learning,” in IEEE BigData 2019, pp. 2871-2880.
[10] C.K. Leung, F. Jiang, “Big data analytics of social networks for the discovery of "following" patterns,” in DaWaK 2015, pp. 123-135.
[11] A.K. Chanda, et al., “A new framework for mining weighted periodic patterns in time series databases,” ESWA 79, 2017, pp. 207-224.
[12] C.K. Leung, R.K. MacKinnon, Y. Wang, “A machine learning approach for stock price prediction,” in IDEAS 2014, pp. 274-277.
[13] R. Sharma, A. Mateush, J. Übi, “Tale of three states: analysis of large person-to-person online financial transactions in three Baltic countries,” in IEEE BigData 2019, pp. 1497-1505.
[14] P.P.F. Balbin, et al., “Predictive analytics on open big data for supporting smart transportation services,” Procedia Computer Science 176, 2020, pp. 3009-3018.
[15] C.K. Leung, et al., “An innovative fuzzy logic-based machine learning algorithm for supporting predictive analytics on big transportation data,” in FUZZ-IEEE 2020. doi:10.1109/FUZZ48607.2020.9177823
[16] C.K. Leung, et al., “Data mining on open public transit data for transportation analytics during pre-COVID-19 era and COVID-19 era,” in INCoS 2020, pp. 133-144.
[17] C.K. Leung, et al., “Urban analytics of big transportation data for supporting smart cities,” in DaWaK 2019, pp. 24-33.
[18] D. Barh, et al., “Multi-omics-based identification of SARS-CoV-2 infection biology and candidate drugs against COVID-19,” Comput. Biol. Medicine 126, 2020, pp. 104051:1-104051:13.
[19] O.A. Sarumi, C.K. Leung, “Exploiting anti-monotonic constraints for mining palindromic motifs from big genomic data,” in IEEE BigData 2019, pp. 4864-4873.
[20] P. Gupta, et al., “Vertical data mining from relational data and its application to COVID-19 data,” Big Data Analyses, Services, and Smart Data, 2021, pp. 106-116.
[21] J. Souza, et al., “An innovative big data predictive analytics framework over hybrid big data sources with an application for disease analytics,” in AINA 2020, pp. 669-680.
[22] S. Tsumoto, et al., “Estimation of disease code from electronic patient records,” in IEEE BigData 2019, pp. 2698-2707.
[23] C.K. Leung, F. Jiang, “A data science solution for mining interesting patterns from uncertain big data,” in IEEE BDCloud 2014, pp. 235-242.
[24] A. Fariha, et al., “Mining frequent patterns from human interactions in meetings using directed acyclic graphs,” in PAKDD 2013 (I), pp. 38-49.
[25] C.K. Leung, “Uncertain frequent pattern mining,” Frequent Pattern Mining, 2014, pp. 417-453.
[26] A.Y. Shahir, et al., “Mining vessel trajectories for illegal fishing detection,” in IEEE BigData 2019, pp. 1917-1927.
[27] S. Ahn, et al., “A fuzzy logic based machine learning tool for supporting big data business analytics in complex artificial intelligence environments,” in FUZZ-IEEE 2019, pp. 1259-1264.
[28] J.A. Brown, et al., “A machine learning system for supporting advanced knowledge discovery from chess game data,” in IEEE ICMLA 2017, pp. 649-654.
[29] K.J. Morris, et al., “Token-based adaptive time-series prediction by ensembling linear and non-linear estimators: a machine learning approach for predictive analytics on big stock data,” in IEEE ICMLA 2018, pp. 1486-1491.
[30] A. Cuzzocrea, et al., “OLAP analysis of multidimensional tweet streams for supporting advanced analytics,” in ACM SAC 2016, pp. 992-999.
[31] A. Cuzzocrea, C.K. Leung, “Efficiently compressing OLAP data cubes via R-tree based recursive partitions,” in ISMIS 2012, pp. 455-465.
[32] A. Cuzzocrea, I. Song, “Big graph analytics: the state of the art and future research agenda,” in DOLAP 2014, pp. 99-101.
[33] S. Hirai, K. Yamanishi, “Detecting model changes and their early warning signals using MDL change statistics,” in IEEE BigData 2019, pp. 84-93.
[34] C.K. Leung, “Mathematical model for propagation of influence in a social network,” Encyclopedia of Social Network Analysis and Mining, 2e, 2018, pp. 1261-1269.
[35] A.A. Ardakani, et al., “Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: results of 10 convolutional neural networks,” Comput. Biol. Medicine 121, 2020, pp. 103795:1-103795:9.
[36] M.B. Jamshidi, et al., “Artificial intelligence and COVID-19: deep learning approaches for diagnosis and treatment,” IEEE Access 8, 2020, pp. 109581-109595.
[37] B. Robson, “COVID-19 coronavirus spike protein analysis for synthetic vaccines, a peptidomimetic antagonist, and therapeutic drugs, and analysis of a proposed achilles' heel conserved region to minimize probability of escape mutations and drug resistance,” Comput. Biol. Medicine 121, 2020, pp. 103749:1-103749:28.
[38] X. Marchand-Senécal, et al., “Diagnosis and management of first case of COVID-19 in Canada: lessons applied from SARS-CoV-1,” Clinical Infectious Diseases, 2020. doi:10.1093/cid/ciaa227
[39] C.S. Eom, et al., “Effective privacy preserving data publishing by vectorization,” Information Sciences 527, 2020, pp. 311-328.
[40] C.K. Leung, et al., “Privacy-preserving frequent pattern mining from big uncertain data,” in IEEE BigData 2018, pp. 5101-5110.
[41] A.M. Olawoyin, et al., “Privacy-preserving spatio-temporal patient data publishing,” in DEXA 2020 (II), pp. 407-416.
[42] B.H. Wodi, et al., “Fast privacy-preserving keyword search on encrypted outsourced data,” in IEEE BigData 2019, pp. 6266-6275. doi:10.1109/BigData47090.2019.9046058
[43] W.T. Li, et al., “Using machine learning of clinical data to diagnose COVID-19: a systematic review and meta-analysis,” BMC Medical Informatics Decis. Mak. 20(1), 2020, pp. 247:1-247:13.
[44] A.S. Albahri, et al., “Role of biological data mining and machine learning techniques in detecting and diagnosing the novel coronavirus (COVID-19): a systematic review,” J. Medical Syst. 44(7), 2020, pp. 122:1-122:11.
[45] W. Kuo, J. He, “Guest editorial: crisis management - from nuclear accidents to outbreaks of COVID-19 and infectious diseases,” IEEE Trans. Reliab. 69(3), 2020, pp. 846-850.
[46] A.A. Amini, et al., “Editorial special issue on "AI-driven informatics, sensing, imaging and big data analytics for fighting the COVID-19 pandemic",” IEEE JBHI 24(10), 2020, pp. 2731-2732.
[47] D. Shen, et al., “Guest editorial: special issue on imaging-based diagnosis of COVID-19,” IEEE TMI 39(8), 2020, pp. 2569-2571.
[48] Y. Zhang, et al., “A five-layer deep convolutional neural network with stochastic pooling for chest CT-based COVID-19 diagnosis,” Mach. Vis. Appl. 32(1), 2021, pp. 14:1-14:13.
[49] A. Viguerie, et al., “Simulating the spread of COVID-19 via a spatially-resolved susceptible-exposed-infected-recovered-deceased (SEIRD) model with heterogeneous diffusion,” Appl. Math. Lett. 111, 2021, pp. 106617:1-106617:9.
[50] World Health Organization, WHO coronavirus disease (COVID-19) dashboard. https://covid19.who.int/
[51] Public Health Agency of Canada, “Detailed preliminary information on confirmed cases of COVID-19 (revised),” Statistics Canada Table 13-10-0781-01. doi:10.25318/1310078101-eng
[52] C.K. Leung, et al., “Big data visualization and visual analytics of COVID-19 data,” in IV 2020, pp. 387-392. doi:10.1109/IV51561.2020.00073
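
As a supplementary illustration of the rule derivation described under Prediction of Outcomes by Supervised Learning, the support and confidence of an associative classification rule can be obtained from two frequent-pattern counts. The following is a minimal sketch under stated assumptions, not the tool's actual implementation: the names `Rule`, `make_rule` and `predict` are hypothetical, and choosing the most confident matching rule is one simple classification strategy assumed here for illustration. The counts are those reported in Table XIII.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional

Pattern = FrozenSet[str]


class Rule(NamedTuple):
    antecedent: Pattern
    outcome: str
    support: int       # number of cases matching antecedent and outcome
    confidence: float  # support / number of cases matching antecedent


def make_rule(antecedent: Pattern, outcome: str, counts: Dict[Pattern, int]) -> Rule:
    """Form the rule antecedent -> outcome from two frequent-pattern counts."""
    support = counts[antecedent | {outcome}]
    return Rule(antecedent, outcome, support, support / counts[antecedent])


def predict(case: Pattern, rules: List[Rule]) -> Optional[str]:
    """Return the outcome of the most confident rule whose antecedent the case
    satisfies (an assumed strategy), or None if no rule applies."""
    matching = [rule for rule in rules if rule.antecedent <= case]
    if not matching:
        return None
    return max(matching, key=lambda rule: rule.confidence).outcome


# Absolute frequencies of frequent patterns, as reported in Table XIII.
counts: Dict[Pattern, int] = {
    frozenset({"community exposures"}): 164_280,
    frozenset({"community exposures", "recovered"}): 130_291,
    frozenset({"community exposures", "not hospitalized"}): 115_448,
    frozenset({"community exposures", "not hospitalized", "recovered"}): 92_559,
}

rules = [
    make_rule(frozenset({"community exposures", "not hospitalized"}), "recovered", counts),
    make_rule(frozenset({"community exposures"}), "recovered", counts),
]
for rule in rules:
    print(sorted(rule.antecedent), "->", rule.outcome, rule.support, f"{rule.confidence:.0%}")

# A hypothetical new case described by its known attribute values.
new_case = frozenset({"female", "community exposures", "not hospitalized"})
print(predict(new_case, rules))
```

The confidence of a rule is the frequency of the longer pattern divided by the frequency of its antecedent, which is why both patterns must already have been mined before the rule can be formed.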