Exploratory Data Analysis
Contents
About this guide
1. Introduction
Appendix
1. Introduction
Machine learning (ML) projects typically start with a comprehensive exploration of the
provided datasets. It is critical that ML practitioners gain a deep understanding of:
2. Statistical data analysis
This section outlines the different statistical analyses performed, the motivation behind them, and examples of each. The goal of these analyses is to determine the quality of the features and their predictive power with respect to the target value or label. They provide a more comprehensive understanding of the data and should be the first step in studying any dataset, not just those for ML projects.
The exploration of the data is conducted from three different angles: descriptive, correlative, and contextual. Each type introduces information on the predictive power of the features and enables an informed decision based on the outcome of the analysis. The methodology and process outlined in this section lay the foundation for the decision process described in Section 4.
2.1 Descriptive analysis (univariate analysis)
Descriptive analysis (or univariate analysis) provides an understanding of the characteristics of each attribute of the dataset. It also offers important evidence for feature selection at a later stage.
The following table lists the suggested analyses for attributes that are common, numerical, categorical, and textual.

Attribute type | Statistic/calculation | Details
Numerical | Quantile statistics | Q1, Q2, Q3, min, max, range, interquartile range
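For a numerical attribute, the quantile statistics above can be computed directly with pandas; the following is a sketch using a synthetic series.

```python
import pandas as pd

# Synthetic numerical attribute used only for illustration.
values = pd.Series([3, 7, 8, 5, 12, 14, 21, 13, 18], name="trip_duration_min")

# Quantile statistics: Q1, Q2 (median), Q3, min, max, range, IQR.
q1, q2, q3 = values.quantile([0.25, 0.50, 0.75])
summary = {
    "min": values.min(),
    "Q1": q1,
    "Q2 (median)": q2,
    "Q3": q3,
    "max": values.max(),
    "range": values.max() - values.min(),
    "IQR": q3 - q1,
}
print(summary)
```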
● Quantitative analysis. A quantitative test of the relationship between X and Y, based on a hypothesis-testing framework. This perspective provides a formal and mathematical methodology to quantitatively determine the existence and/or strength of the relationship.
The motivation for performing correlation analysis is to help determine:
● Which attributes are not predictive, in terms of correlation with the target value. Such attributes usually need special attention (for example, a transformation) to reveal a stronger relationship with the target.
● Which attributes hold redundant information that can be replaced with derived attributes. Including them all might only increase the resource demand without introducing any gain in the ML process.
2.2.1 Qualitative analysis
Qualitative analysis is a primarily exploratory analysis used to gain an understanding of
underlying reasons, opinions, and motivations. It provides insights into the problem and helps
to develop ideas or hypotheses for potential quantitative research.
The following table lists the statistical analyses that could be performed between two features of categorical or numerical type. No qualitative analysis is listed here for numerical-numerical pairs, which are usually examined with a sampled scatter plot.
Attribute types | Analysis
Example
The following contingency table shows the relationship between sex and handedness.
Sex \ Handedness | Right-handed | Left-handed | Total
Male | 43 | 9 | 52
X \ Y | Categorical | Numerical
Categorical | Chi-square test, Information gain | Student T-test, ANOVA, Logistic regression, or discretize Y and use the tests in the left column
Example
The corresponding quantitative analysis that can be performed on the attributes sex and handedness is the chi-square test. The following steps detail the process.
Hypothesis:
● H0: sex and handedness are independent
● H1: sex and handedness are not independent
If the p-value is less than a threshold (for example, 0.05), then the null hypothesis is rejected. Here, however, the p-value is larger than the typical threshold, which means that the null hypothesis cannot be rejected and, correspondingly, that there is not strong enough evidence to show that "sex and handedness are not independent." This contradicts the observations made in the previous section, and it demonstrates the importance of quantitative analysis.
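As a sketch of this chi-square test with scipy: only the male row is shown in the table above, so the female counts below are hypothetical placeholders used solely to make the example runnable.

```python
from scipy.stats import chi2_contingency

# Contingency table: rows = sex (male, female), columns = handedness
# (right-handed, left-handed). The male row (43, 9) comes from the table
# above; the female row is a hypothetical placeholder.
observed = [
    [43, 9],
    [44, 4],
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")

# Reject H0 (independence) only if the p-value falls below the threshold.
if p_value < 0.05:
    print("Reject H0: sex and handedness appear dependent.")
else:
    print("Cannot reject H0: not enough evidence of dependence.")
```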
2.3 Contextual analysis
Because neither descriptive analysis nor correlation analysis requires context information, both are generic enough to be performed on any (structured) dataset. To further understand or profile the given dataset and to gain more domain-specific insights, two generic contextual information-based analyses are recommended: time-based and agent-based.
The contextual analysis results, combined with domain knowledge, can further verify the quality of the dataset.
2.3.1 Time-based analysis
In many real-world datasets, the timestamp (or similar time-related attributes) is one of the
key pieces of contextual information. For example, operation logs for an online API service
usually contain the time that the log was generated and/or the time when a logged event
happened. Transaction logs for a retail company usually contain the time at which a
transaction occurred. Observing and/or understanding the characteristics of the data along
the time dimension, with various granularities, is essential to understanding the data
generation process and ensuring data quality.
With the timestamp attribute, the following analyses could be performed:
Example
The following figure displays the average number of train trips per hour originating from and
ending at one particular location based on a simulated dataset.
The conclusion that may be drawn is that peak times are around 8:30 a.m. and 5:30 p.m., which
is consistent with the intuition that these are the times when people would typically leave
home in the morning and return after a day of work.
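A sketch of this kind of hourly aggregation with pandas, assuming a hypothetical trips DataFrame with a start_time timestamp column:

```python
import pandas as pd

# Hypothetical trips DataFrame with a 'start_time' timestamp column.
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2024-01-01 08:25", "2024-01-01 08:40",
        "2024-01-01 17:35", "2024-01-02 08:31",
    ]),
})

# Count trips per hour of day to reveal peak periods.
trips_per_hour = trips["start_time"].dt.hour.value_counts().sort_index()
print(trips_per_hour)
```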
2.3.2 Agent-based analysis
As an alternative to the timestamp, another common attribute is a unique identification (ID) for each record. These IDs provide important contextual information, including, for example:
In the figure, a long-tail distribution is observed with mean 26.6 and median 5. Domain knowledge can be used to check whether the distribution makes sense. If it does not, there could be issues at the data generation and/or collection stage.
To learn how visualization can be used properly, see the comprehensive seaborn tutorial.
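For example, the per-ID record counts described above can be visualized with seaborn; this sketch assumes a hypothetical logs DataFrame with an agent_id column.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical logs DataFrame with one row per record and an 'agent_id'
# column identifying the agent that produced the record.
logs = pd.DataFrame({"agent_id": ["a", "a", "b", "a", "c", "b", "a"]})

# Records per agent; a long-tail distribution here may warrant a
# domain-knowledge check of the data generation/collection process.
records_per_agent = logs.groupby("agent_id").size()
print(records_per_agent.describe())

sns.histplot(records_per_agent, bins=30)
plt.xlabel("records per agent")
plt.show()
```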
● High percentage of missing values. The identified problem is that the attribute is missing in a significant proportion of the data points. The threshold can be set based on business domain knowledge (a combined sketch of this and the following checks appears after this list).
There are two options to handle this, depending on the business scenario:
- A value can be missing due to misconfiguration, issues with data collection, or untraceable random reasons, in which case the feature can simply be discarded if the historic data cannot be reconstituted.
More generally, missing values can be categorized into three cases:
- Missing at random. The propensity for a data point to be missing is not related to the missing data itself, but it is related to some of the observed data.
- Missing completely at random. A missing value has nothing to do with its hypothetical value or with the values of other variables.
- Missing not at random. The missing value depends on the hypothetical value (for example, people with high salaries generally do not want to reveal their incomes in surveys), or the missing value is dependent on some other variable's value (for example, women generally don't want to reveal their age).
In the first two cases, it is safe to remove the data with missing values, depending on the percentage, while in the third case removing observations with missing values can produce a bias in the model.
● Low variance of numeric attributes. The identified problem is a very small variance
of the feature compared to the typical value range of the feature, or the distribution is
a sharp bell curve. In most cases, it is safe to remove numeric attributes with low
variance. This will not harm the performance of the model, and it can reduce the
complexity of the model.
● Low entropy of categorical attributes. The identified problem is a very small entropy of the feature, which means that most of the records have the same categorical values. In most cases, it is safe to remove categorical attributes with low entropy. This will not harm the performance of the model, and it can reduce the complexity of the model.
● Imbalance of categorical target (class imbalance). A dataset is said to be "highly class imbalanced" if the number of samples from one target class is significantly higher than from the others. This can be treated as a special case of the previous "low entropy of categorical attributes." In an imbalanced dataset, the class with a higher number of instances is called a major class, while those with relatively fewer instances are called minor classes.
In this case, most classifiers are biased towards the major classes and, hence, display poor classification rates on minor classes. It is also possible that the classifier predicts everything as the major class and ignores the minor classes. A variety of techniques have been proposed to handle class imbalance:
- Alternative metric and/or loss function. Accuracy is not the metric to use when working with an imbalanced dataset, because it can be misleading. There are metrics that have been designed to tell a more truthful story when working with imbalanced classes, for example, the area under the curve (AUC) and the F1 score (the harmonic mean of precision and recall). It is also possible to alter the loss function by imposing an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class.
● Skew distribution. The identified problem is that the distribution of the numeric attribute exhibits a long-tail shape.
There are several options for handling this, depending on the business scenario:
- Sometimes outliers originate from incorrect data, in which case an understanding of the source of the error may allow replacing the outlier with a plausible value (for example, 99999 cigarettes per day, because the default value of the input field is 99999).
- In certain cases, the extreme values are actually caused by outliers. Filtering out the extreme values or outliers would bring the distribution of the attribute back to normal. It is important to note that outlier removal should be applied consistently to both training and serving.
There are two typical methods for removing such outliers:
- Applying a hash trick, which converts a high-cardinality categorical attribute to a fixed-size one-hot-encoded space.
- Removing the feature if the number of occurrences per unique value of the attribute is too low (for example, less than 5), which can typically happen in cases like transaction ID or event ID.
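A combined sketch of the checks above (missing values, low variance, low entropy, class imbalance, and the hash trick), using a small synthetic DataFrame; the thresholds would normally come from business domain knowledge.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Synthetic dataset used only for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan, 52],
    "almost_constant": [1.0, 1.0, 1.0, 1.0, 1.001, 1.0],
    "country": ["US", "US", "US", "US", "US", "CA"],
    "transaction_id": [f"tx_{i}" for i in range(6)],
    "label": [0, 0, 0, 0, 0, 1],
})

# High percentage of missing values.
missing_ratio = df.isna().mean()
print("missing ratio:\n", missing_ratio)

# Low variance of numeric attributes (compare against the value range).
numeric = df.select_dtypes(include="number")
print("variance:\n", numeric.var())

# Low entropy of categorical attributes (in nats).
def entropy(s: pd.Series) -> float:
    p = s.value_counts(normalize=True)
    return float(-(p * np.log(p)).sum())

print("entropy(country):", entropy(df["country"]))

# Class imbalance of the categorical target.
print("class distribution:\n", df["label"].value_counts(normalize=True))

# Hash trick for a high-cardinality categorical attribute.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[v] for v in df["transaction_id"]])
print("hashed shape:", hashed.shape)
```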
4.2 Feature selection based on correlation analysis
The correlation analysis examines the relationship between two attributes. The typical action
points triggered by the correlation analysis in the context of feature selection or feature
engineering can be summarized as follows:
● Low correlation between feature and target. If the correlation between feature and target is found to be low, there are two possible reasons:
- The feature is not useful in terms of predicting the desired target, and
therefore it can be removed from the study.
- The feature is not useful given its available form, and transformation is required
to reveal a stronger relationship with the target. One possible example is when
the longitude and latitude don’t provide clear information in the raw form.
Bucketizing them to create the notion of “location” is often a useful
transformation that could increase their correlation with the target.
For these reasons, before removing the features from the scope, it is recommended to
have a domain expert review the analysis in order to make the final decision.
● High correlation between features. Another result that comes out of the correlation
analysis and that requires special attention is the high correlation between features.
Having highly correlated features could be a problem because:
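Both checks in this section can be sketched with pandas; the DataFrame below is synthetic, and the thresholds of 0.1 and 0.9 are illustrative assumptions rather than recommendations from this guide.

```python
import numpy as np
import pandas as pd

# Synthetic data: x1 is predictive, x2 is nearly a copy of x1, x3 is noise.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=n),
    "x3": rng.normal(size=n),
    "target": 2.0 * x1 + rng.normal(size=n),
})

corr = df.corr(method="pearson")

# Low correlation between feature and target: candidates for transformation
# or removal, subject to domain-expert review.
target_corr = corr["target"].drop("target").abs()
low_corr_features = target_corr[target_corr < 0.1].index.tolist()

# High correlation between features: candidates for dropping or combining.
feature_corr = corr.drop("target").drop(columns="target").abs()
redundant_pairs = [
    (a, b, feature_corr.loc[a, b])
    for i, a in enumerate(feature_corr.columns)
    for b in feature_corr.columns[i + 1:]
    if feature_corr.loc[a, b] > 0.9
]
print(low_corr_features, redundant_pairs)
```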
● Facets. This data visualization tool for machine learning calculates data statistics and
allows for data distribution and comparative feature visualization on both training and
validation datasets.
● Cloud Dataprep. This tool is a cloud-based service built upon the Cloud Dataflow service that supports visually exploring, cleaning, and preparing data for analysis. It can generate feature statistics and perform transformations.
● TensorFlow Data Validation. This tool provides calculation of summary statistics for the
training and test datasets. It includes anomaly detection to identify anomalies (such as
missing features, out-of-range values, or wrong feature types). It is focused on
recurrent data validation in production pipelines rather than on initial data exploration.
● AutoML Tables. This tool computes the basic statistics of each attribute of the imported dataset before the model is trained.
● Auto Data Exploration and Feature Recommendation Tool (Auto EDA). This tool
automates the data analysis described in this guide, regardless of the scale of the data,
using BigQuery as the backend compute engine. The result of the analysis is an
automatically generated report presenting the findings in a compelling manner.
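As a sketch of the TensorFlow Data Validation workflow described above (the file paths are hypothetical):

```python
import tensorflow_data_validation as tfdv

# Compute summary statistics for the training and evaluation data.
train_stats = tfdv.generate_statistics_from_csv("train.csv")
eval_stats = tfdv.generate_statistics_from_csv("eval.csv")

# Infer a schema from the training statistics, then check the evaluation
# data against it for anomalies such as missing features or
# out-of-range values.
schema = tfdv.infer_schema(train_stats)
anomalies = tfdv.validate_statistics(eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```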
Comparison of the tools:
● Auto EDA: no textual attributes; quantitative analysis; analysis report in Markdown; simple.
● TensorFlow Data Validation: no textual attributes; no quantitative analysis; Facets visualization; advanced.
● Facets: no textual attributes; no quantitative analysis; Facets visualization; intermediate.
● Pandas profiling: quantitative analysis including Pearson correlation; analysis report in HTML; simple.
In summary, when selecting the appropriate exploration tool, consider the following:
Histograms of numerical attributes
6.3.1 Categorical versus categorical
Use information gain to find the dependency of two categorical variables from a different perspective: determine how much the uncertainty of categorical variable A can be reduced by knowing categorical variable B.
In the following table, which displays information gain, income_bracket becomes more certain if the person's relationship and marital_status are also known. Conversely, it does not seem to be affected if the race is known. Thus, relationship and marital_status are expected to be significant features in predicting a person's income bracket.
Target (S)          Feature (A)         H(S)        H(S|A)      IG
income_bracket workclass 0.552011 0.537059 0.014952
income_bracket education 0.552011 0.487139 0.064872
income_bracket marital_status 0.552011 0.443514 0.108497
income_bracket occupation 0.552011 0.487602 0.064409
income_bracket relationship 0.552011 0.437388 0.114623
income_bracket race 0.552011 0.546204 0.005807
income_bracket sex 0.552011 0.526246 0.025765
income_bracket native_country 0.552011 0.545984 0.006027
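These values can be computed with pandas, using natural-log entropy consistent with the table above; the census DataFrame and its column names are assumed from the table, and the data itself is not included here.

```python
import numpy as np
import pandas as pd

def entropy(series: pd.Series) -> float:
    """Shannon entropy H(S) of a categorical series, in nats."""
    p = series.value_counts(normalize=True)
    return float(-(p * np.log(p)).sum())

def information_gain(df: pd.DataFrame, target: str, feature: str) -> float:
    """IG(A, S) = H(S) - H(S|A) for categorical target S and feature A."""
    h_s = entropy(df[target])
    weights = df[feature].value_counts(normalize=True)
    h_s_given_a = sum(
        w * entropy(df.loc[df[feature] == a, target])
        for a, w in weights.items()
    )
    return h_s - h_s_given_a

# Hypothetical usage on a census-style DataFrame `census`:
# information_gain(census, target="income_bracket", feature="relationship")
```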
6.3.2 Numerical versus numerical
Using the following correlation heat map and table as reference, observe that there are no
strong correlations between any two numerical variables, which signifies that these features
may be able to provide complementary information when building the ML model to predict
income_bracket.
Heatmap of Pearson correlation (attributes: age, functional_weight, education_num, capital_gain, capital_loss, hours_per_week)
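Such a heatmap can be produced with seaborn; this sketch assumes the census dataset is available as a CSV file (the path is hypothetical) with the numerical columns named in the caption above.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

census = pd.read_csv("census.csv")  # hypothetical path to the census dataset

numerical_cols = [
    "age", "functional_weight", "education_num",
    "capital_gain", "capital_loss", "hours_per_week",
]
corr = census[numerical_cols].corr(method="pearson")

# Annotated heatmap of the pairwise Pearson correlations.
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")
plt.title("Pearson correlation")
plt.show()
```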
Appendix
A. Hypothesis testing
A statistical hypothesis is an assumption about a population parameter that may or may not
be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or
reject statistical hypotheses.
The best way to determine whether a statistical hypothesis is true is by examining the entire
population. Since that is often impractical, researchers typically examine a random sample
from the population. If the sample data is not consistent with the statistical hypothesis, the
hypothesis is rejected.
There are two types of statistical hypotheses:
● Null hypothesis. The null hypothesis, denoted by H0, is usually the hypothesis that
sample observations result purely from chance.
● Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample observations are influenced by some non-random cause.
Statisticians follow a formal process to determine whether to reject the null hypothesis based on sample data. This process, called hypothesis testing, consists of the following steps:
1. State the hypotheses. This involves stating the null and alternative hypotheses in such
a way that they are mutually exclusive: that is, if one is true, the other must be false.
2. Formulate an analysis plan. The analysis plan describes how to use sample data to
evaluate the null hypothesis. The evaluation often focuses around a single test statistic.
3. Analyze sample data. Find the value of the test statistic (for example, a mean score, proportion, t statistic, or z-score) described in the analysis plan.
4. Interpret results. Apply the decision rule described in the analysis plan. If the value of
the test statistic is unlikely based on the null hypothesis, reject the null hypothesis.
B. Pearson correlation
The Pearson correlation coefficient between two variables x and y is defined as

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where r ∈ [−1, 1].
The Pearson correlation can be used as a ranking measurement for the linear fit of individual continuous variables:
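For example, a sketch of such a ranking with scipy (the feature names and data are hypothetical):

```python
import numpy as np
from scipy import stats

# Rank continuous features by the absolute Pearson correlation with a
# continuous target, using synthetic data for illustration.
rng = np.random.default_rng(0)
target = rng.normal(size=200)
features = {
    "x1": target * 0.8 + rng.normal(size=200),  # correlated with target
    "x2": rng.normal(size=200),                 # unrelated to target
}

ranking = sorted(
    ((name, stats.pearsonr(values, target)[0]) for name, values in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: r={r:+.3f}")
```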
C. Student T-test
The T-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.
A T-test can be applied for verifying the relationship between features or between feature
and target variables in the following cases:
● Testing whether the distribution of the input variable is the same between two groups (split by a categorical variable)
● Testing the null hypothesis that the true correlation coefficient ρ between two
variables is equal to 0, based on the value of the sample correlation coefficient r
The following are examples of independent two-sample T-tests:
$$t = \frac{\bar{x}_A - \bar{x}_B}{SE}, \qquad \text{where } SE = \sqrt{\frac{S_A^2}{n_A} + \frac{S_B^2}{n_B}} \text{ and } df = \min(n_A - 1, n_B - 1)$$
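A sketch of an independent two-sample T-test with scipy, using two synthetic groups of a numerical feature split by a categorical variable:

```python
import numpy as np
from scipy import stats

# Synthetic groups used only for illustration.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=50)
group_b = rng.normal(loc=5.4, scale=1.2, size=60)

# equal_var=False uses Welch's T-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t={t_stat:.3f}, p-value={p_value:.3f}")
```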
D. Chi-square test
● A test of homogeneity compares the distribution of counts for two or more groups using the same categorical variable (for example, the activity choices of high school graduates, such as college, military, employment, or travel, reported a year after graduation and sorted by graduation year, to see whether the number of graduates choosing a given activity has changed from class to class, or from decade to decade).
Statistical test
Hypothesis:
● H0: feature and target are independent
● H1: feature and target are not independent
Test statistics:
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

and $p_j = \frac{O_j}{N}$, where N is the sample size.
* If the p-value is less than 0.05, then the null hypothesis is rejected; otherwise, the null hypothesis cannot be rejected.
E. ANOVA (analysis of variance)
Hypothesis:
● H0: μ_0 = μ_1 = μ_2 = ... = μ_k
● H1: μ_i ≠ μ_j, where μ_i and μ_j are the sample means of any two samples considered for the test
Test statistics:
$$F = \frac{MSG}{MSE}$$
* If the p-value is less than 0.05, then the null hypothesis is rejected; otherwise, the null hypothesis cannot be rejected.
For the residual/error term, the sum of squares is SSE (= SSW), the degrees of freedom are n − k, and $MSE = \frac{SSW}{n-k}$.
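A sketch of a one-way ANOVA with scipy, using synthetic groups:

```python
import numpy as np
from scipy import stats

# Synthetic groups of a numerical attribute split by a categorical variable.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=1.0, size=40) for m in (5.0, 5.2, 5.9)]

# F = MSG / MSE; reject H0 of equal group means if the p-value is small.
f_stat, p_value = stats.f_oneway(*groups)
print(f"F={f_stat:.3f}, p-value={p_value:.3f}")
```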
F. Information gain
Information gain is a measure of the mutual dependence between two variables, which can be computed as

$$\mathrm{InformationGain}(A, S) = H(S) - H(S|A)$$

A larger information gain indicates a stronger dependence between A and S.