Chapter 1
Introduction to Data Analysis

[Figure: the data analysis workflow: data collection, data wrangling and exploratory data analysis, drawing conclusions, and communicating results]
Data collection
Data collection is the natural first step for any data analysis: we can't analyze data we don't have. In reality, our analysis can begin even before we have the data: when we decide what we want to investigate or analyze, we have to think of what kind of data we can collect that will be useful for our analysis. While data can come from anywhere, we will explore a variety of sources throughout this book.
Data wrangling
Data wrangling is the process of preparing the data and getting it into a format that can be used for analysis. The unfortunate reality of data is that it is often dirty, meaning that it requires cleaning (preparation) before it can be used. The following are some issues we may encounter with our data:

- Human error: Multiple versions of the same entry get recorded, such as New York City, NYC, and nyc
- Computer error: Perhaps we weren't recording entries for a while (missing data)
- Unexpected values: Maybe whoever was recording the data decided to use ? for a missing value in a numeric column, so now all the entries in the column will be treated as text instead of numeric values
Most of these data quality issues can be remedied, but some cannot, such as when the
data is collected daily and we need it on an hourly resolution. It is our responsibility
to carefully examine our data and to handle any issues, so that our analysis doesn't
get distorted. We will cover this process in depth in Chapter 3, Data Wrangling with
Pandas, and Chapter 4, Aggregating Pandas DataFrames.
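To make this concrete, here is a minimal sketch of how the last two issues might be handled with pandas; the column names and values are invented for illustration:

import numpy as np
import pandas as pd

# hypothetical data with inconsistent city names and a '?' placeholder
df = pd.DataFrame({
    'city': ['New York City', 'NYC', 'nyc'],
    'temp': ['32', '?', '35']
})

# treat '?' as missing so the column can be converted back to numeric values
df['temp'] = pd.to_numeric(df['temp'].replace('?', np.nan))

# collapse the multiple versions of the same entry into one
df['city'] = df['city'].replace({'NYC': 'New York City', 'nyc': 'New York City'})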
Data visualizations are very powerful; unfortunately, they can often be misleading. One common issue stems from the scale of the y-axis: most plotting tools will zoom in by default to show the pattern up close. It would be difficult for software to know what the appropriate axis limits are for every possible plot; therefore, it is our job to properly adjust the axes before presenting our results. You can read about some more ways plots can mislead here: https://venngage.com/blog/misleading-graphs/.
In the workflow diagram we saw earlier, EDA and data wrangling shared a box. This is because they are closely tied: insights from EDA often reveal additional wrangling that needs to be done, and wrangling is what makes EDA possible in the first place.
When calculating summary statistics, we must keep the type of data we collected in mind. Data can be quantitative (measurable quantities) or categorical (descriptions, groupings, or categories). Within these classes of data, we have further subdivisions that let us know what types of operations we can perform on them.
For example, categorical data can be nominal, where we assign a numeric value to each level of the category, such as on = 1 / off = 0, but we can't say that one is greater than the other because that distinction is meaningless. The fact that on is greater than off has no meaning because we arbitrarily chose those numbers to represent the states on and off. Note that, in this case, we can represent the data with a Boolean (True/False value): is_on. Categorical data can also be ordinal, meaning that we can rank the levels (for instance, we can have low < medium < high).
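As a quick sketch of how this distinction plays out in code, pandas lets us mark ordinal data as an ordered categorical; the example values here are invented:

import pandas as pd

# ordinal: the levels have a meaningful ranking, so comparisons make sense
sizes = pd.Categorical(
    ['low', 'high', 'medium', 'low'],
    categories=['low', 'medium', 'high'],
    ordered=True
)
sizes.min(), sizes.max()  # ('low', 'high')

# nominal on/off data is better represented as a Boolean
is_on = pd.Series([True, False, True])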
With quantitative data, we can be on an interval scale or a ratio scale. The interval scale includes things such as temperature: we can measure temperatures in Celsius and compare the temperatures of two cities, but it doesn't mean anything to say one city is twice as hot as the other. Therefore, interval scale values can be meaningfully compared using addition/subtraction, but not multiplication/division. The ratio scale, then, contains those values that can be meaningfully compared with ratios (using multiplication and division). Examples of the ratio scale include prices, sizes, and counts.
Drawing conclusions
After we have collected the data for our analysis, cleaned it up, and performed some thorough EDA, it is time to draw conclusions. This is where we summarize our findings from EDA and decide on the next steps.
If we decide to model the data, this falls under machine learning and statistics. While not technically data analysis, it is usually the next step, and we will cover it in Chapter 9, Getting Started with Machine Learning in Python, and Chapter 10, Making Better Predictions - Optimizing Models. In addition, we will see how this entire process works in practice in Chapter 11, Machine Learning Anomaly Detection. As a reference, in the Machine learning workflow section in the appendix, there is a workflow diagram depicting the full process from data analysis to machine learning. Chapter 7, Financial Analysis - Bitcoin and the Stock Market, and Chapter 8, Rule-Based Anomaly Detection, will focus on drawing conclusions from data analysis, rather than building models.
Statistical foundations
When we want to make observations about the data we are analyzing, we are often, if
not always, turning to statistics in some fashion. The data we have is referred to as the
sample, which was observed from (and is a subset of) the population. Two broad
categories of statistics are descriptive and inferential statistics. With descriptive
statistics, as the name implies, we are looking to describe the sample. Inferential
statistics involves using the sample statistics to infer, or deduce, something about the population, such as the underlying distribution.
Often, the goal of an analysis is to create a story for the data; unfortunately, it is very easy to misuse statistics. It's the subject of a famous quote:

"There are three kinds of lies: lies, damned lies, and statistics."

- Benjamin Disraeli
This is especially true of inferential statistics, which are used in many scientific studies and papers to show the significance of their findings. This is a more advanced topic, and, since this isn't a statistics book, we will only briefly touch upon some of the tools and principles behind inferential statistics, which can be pursued further. We will focus on descriptive statistics to help explain the data we are analyzing.
Sampling
There's an important thing to remember before we attempt any analysis: our sample must be a random sample that is representative of the population. This means that the data must be sampled without bias (for example, if we are asking people if they like a certain sports team, we can't only ask fans of the team) and that we should have (ideally) members of all distinct groups from the population in our sample (in the sports team example, we can't just ask men).
There are many methods of sampling. You can read about them, along with their strengths and weaknesses, here: https://www.khanacademy.org/math/statistics-probability/designing-studies/sampling-methods-stats/a/sampling-methods-review.
Descriptive statistics
We will begin our discussion of descriptive statistics with univariate statistics; univariate simply means that these statistics are calculated from one (uni) variable. Everything in this section can be extended to the whole dataset, but the statistics will be calculated per variable we are recording (meaning that if we had 100 observations of speed and distance pairs, we could calculate the averages across the dataset, which would give us the average speed and the average distance statistics).
Descriptive statistics are used to describe and/or summarize the data we are working with. We can start our summarization of the data with a measure of central tendency, which describes where most of the data is centered around, and a measure of spread or dispersion, which indicates how far apart values are.
Mean
Perhaps the most common statistic for summarizing data is the average, or mean. The population mean is denoted by the Greek symbol mu (μ), and the sample mean is written as x̄ (pronounced X-bar). The sample mean is calculated by summing all the values and dividing by the count of values; for example, the mean of [0, 1, 1, 2, 9] is 2.6 ((0 + 1 + 1 + 2 + 9)/5):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
One important thing to note about the mean is that it is very sensitive to outliers (values created by a different generative process than our distribution). We were dealing with only five values; nevertheless, the 9 is much larger than the other numbers and pulled the mean higher than all but the 9.
Median
In cases where we suspect outliers to be present in our data, we may want to use the median as our measure of central tendency. Unlike the mean, the median is robust to outliers. Think of income in the US; the top 1% is much higher than the rest of the population, so this will skew the mean to be higher and distort the perception of the average person's income.

The median represents the 50th percentile of our data; this means that 50% of the values are greater than the median and 50% are less than the median. It is calculated by taking the middle value from an ordered list of values; in cases where we have an even number of values, we take the average of the middle two values. If we take the numbers [0, 1, 1, 2, 9] again, our median is 1.
The ith percentile is the value at which i% of the observations are less than that value; so, the 99th percentile is the value in X where 99% of the x's are less than it.
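We can see the outlier's effect on the mean, and the median's robustness to it, with a quick check using Python's standard library:

from statistics import mean, median

data = [0, 1, 1, 2, 9]
mean(data)    # 2.6 -- pulled upward by the outlier (9)
median(data)  # 1 -- unaffected by the outlier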
Mode

The mode is the most common value in the data. For example, the mode of [0, 1, 1, 2, 9] is 1, since 1 is the only value that appears more than once.
Understanding the concept of the mode comes in handy when describing continuous distributions; however, most of the time when we're describing our data, we will use either the mean or the median as our measure of central tendency.
Measures of spread
Knowing where the center of the distribution is only gets us partially to being able to summarize the distribution of our data; we also need to know how values fall around the center and how far apart they are. Measures of spread tell us how the data is dispersed; this will indicate how thin (low dispersion) or wide (very spread out) our distribution is. As with measures of central tendency, we have several ways to describe the spread of a distribution, and which one we choose will depend on the situation and the data.
Range
The range is the distance between the smallest value (minimum) and the largest value (maximum):

$$range = \max(X) - \min(X)$$

The units of the range will be the same units as our data. Therefore, unless two distributions of data are in the same units and measuring the same thing, we can't compare their ranges and say one is more dispersed than the other.
Variance
Just from the definition of the range, we can see why that wouldn't always be the best way to measure the spread of our data. It gives us upper and lower bounds on what we have in the data; however, if we have any outliers in our data, the range will be rendered useless.
Another problem with the range is that it doesn't tell us how the data is dispersed around its center; it really only tells us how dispersed the entire dataset is. Enter the variance, which describes how far apart observations are spread out from their average value (the mean). The population variance is denoted as sigma-squared (σ²), and the sample variance is written as s².

The variance is calculated as the average squared distance from the mean. The distances must be squared so that distances below the mean don't cancel out those above the mean. If we want the sample variance to be an unbiased estimator of the population variance, we divide by n - 1 instead of n to account for using the sample mean instead of the population mean; this is called Bessel's correction (https://en.wikipedia.org/wiki/Bessel%27s_correction). Most statistical tools will give us the sample variance by default, since it is very rare that we would have data for the entire population:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Standard deviation
The variance gives us a statistic with squared units. This means that if we started with data on gross domestic product (GDP) in dollars ($), then our variance would be in dollars squared ($²). This isn't really useful when we're trying to see how this describes the data; we can use the magnitude (size) itself to see how spread out something is (large values mean a large spread), but beyond that, we need a measure of spread with units that are the same as our data.

For this purpose, we use the standard deviation, which is simply the square root of the variance. By performing this operation, we get a statistic in units that we can make sense of again ($ for our GDP example):

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
The population standard deviation is represented as σ, and the sample standard deviation is denoted as s.
Coefficient of variation
When we moved from variance to standard deviation, we were looking to get to units that made sense; however, if we then want to compare the level of dispersion of one dataset to another, we would need to have the same units once again. One way around this is to calculate the coefficient of variation (CV), which is the ratio of the standard deviation to the mean. It tells us how big the standard deviation is relative to the mean:

$$CV = \frac{s}{\bar{x}}$$
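The following sketch computes all three mean-based measures of spread on the same five numbers we have been using; note the ddof=1 argument, which applies Bessel's correction:

import numpy as np

data = np.array([0, 1, 1, 2, 9])

variance = data.var(ddof=1)  # 13.3 (sample variance, dividing by n - 1)
std_dev = data.std(ddof=1)   # ~3.65 (same units as the data)
cv = std_dev / data.mean()   # ~1.40 (standard deviation relative to the mean)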
Interquartile range
So far, other than the range, we have discussed mean-based measures of dispersion; now, we will look at how we can describe the spread with the median as our measure of central tendency. As mentioned earlier, the median is the 50th percentile, or the second quartile (Q2). Percentiles and quartiles are both quantiles: values that divide data into equal groups, each containing the same percentage of the total data. Percentiles give this in 100 parts, while quartiles give it in four (25%, 50%, 75%, and 100%).
Since quantiles neatly divide up our data, and we know how much of the data goes in each section, they are a perfect candidate for helping us quantify the spread of our data. One common measure for this is the interquartile range (IQR), which is the distance between the third and first quartiles:

$$IQR = Q_3 - Q_1$$

The IQR gives us the spread of data around the median and quantifies how much dispersion we have in the middle 50% of our distribution. It can also be useful to determine outliers, which we will cover in Chapter 8, Rule-Based Anomaly Detection.
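Since quartiles are just percentiles, one way to compute the IQR is with np.percentile; here we reuse our five-number example:

import numpy as np

data = np.array([0, 1, 1, 2, 9])

# the first and third quartiles are the 25th and 75th percentiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # 1.0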
Summarizing data
We have seen many examples of descriptive statistics that we can use to summarize our data by its center and dispersion; in practice, looking at the 5-number summary or visualizing the distribution prove to be helpful first steps before diving into some of the other aforementioned metrics. The 5-number summary, as its name indicates, provides five descriptive statistics that summarize our data:

Statistic         Percentile    Quartile
minimum           0th           N/A
first quartile    25th          Q1
median            50th          Q2
third quartile    75th          Q3
maximum           100th         N/A
Looking at the 5-number summary is a quick and efficient way of getting a sense of our data. At a glance, we have an idea of the distribution of the data and can move on to visualizing it.
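In pandas, the describe() method gives us the 5-number summary, along with the count, mean, and standard deviation, in a single call:

import pandas as pd

data = pd.Series([0, 1, 1, 2, 9])
data.describe()  # count, mean, std, min, 25%, 50%, 75%, max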
Box plot
[Figure: a box plot, in which the box spans the IQR with a line at the median, the whiskers extend to Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, and points beyond the whiskers are drawn as outliers]
Scaling data
In order to compare variables from different distributions, we would have to scale the data, which we could do with the range by using min-max scaling. We take each data point, subtract the minimum of the dataset, then divide by the range. This normalizes our data (scales it to the range [0, 1]):

$$x_{scaled} = \frac{x - \min(X)}{range(X)}$$

Another way to scale the data is standardizing: we convert each value into a Z-score by subtracting the mean and dividing by the standard deviation. A Z-score tells us how many standard deviations a value is from the mean; for example, a value half a standard deviation below the mean has a Z-score of -0.5.
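Both scaling approaches are one-liners with NumPy; a minimal sketch:

import numpy as np

data = np.array([0, 1, 1, 2, 9])

# min-max scaling: squeeze the data into the range [0, 1]
min_max_scaled = (data - data.min()) / (data.max() - data.min())

# standardizing: convert each value into a Z-score
z_scores = (data - data.mean()) / data.std(ddof=1)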
Quantifying relationships between variables

In the previous sections, we were dealing with univariate statistics and were only able to say something about the variable we were looking at. With multivariate statistics, we can look to quantify relationships between variables. This allows us to look into things such as correlations (how one variable changes with respect to another). To quantify how two variables move together, we use the covariance, which is defined in terms of expected values:

$$cov(X, Y) = E[(X - E[X])(Y - E[Y])]$$
E[X] is new notation for us. It is read as the expected value of X or the expectation of X, and it is calculated by summing all the possible values of X multiplied by their probability; it's the long-run average of X.
The magnitude of the covariance isn't easy to interpret, but its sign tells us if the variables are positively or negatively correlated. However, we would also like to quantify how strong the relationship is between the variables, which brings us to correlation. Correlation tells us how variables change together, both in direction (same or opposite) and in magnitude (the strength of the relationship). To find the correlation, we calculate the Pearson correlation coefficient, symbolized by ρ (the Greek letter rho), by dividing the covariance by the product of the standard deviations of the variables:

$$\rho_{X,Y} = \frac{cov(X, Y)}{s_X s_Y}$$
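A short sketch of both calculations with NumPy, using made-up x and y values:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

# np.cov returns the covariance matrix; the off-diagonal entry is cov(x, y)
cov_xy = np.cov(x, y)[0, 1]

# Pearson correlation: covariance over the product of the standard deviations
rho = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# np.corrcoef computes the same quantity directly
rho_check = np.corrcoef(x, y)[0, 1]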
One very important thing to remember is that, while we may find a correlation between X and Y, it doesn't mean that X causes Y or that Y causes X. There could be some Z that actually causes both; perhaps X causes some intermediary event that causes Y, or perhaps it is actually just a coincidence. Keep in mind that we often don't have enough information to report causation.

Also note that the correlation coefficient quantifies the strength of a linear relationship; the data may actually be quadratic, exponential, or logarithmic rather than following a linear function. Both of the following plots depict data with strong positive correlations, but it's pretty obvious, when looking at the scatter plots, that these are not linear. The one on the left is logarithmic, while the one on the right is exponential:
[Figure: two scatter plots of non-linear data with strong positive correlations (ρ = 0.69 and ρ = 0.80); the left relationship is logarithmic, the right is exponential]
Pitfalls of summary statistics

Not only can correlation coefficients be misleading; so can summary statistics. There is a very interesting dataset illustrating how careful we must be when only using summary statistics and correlation coefficients to describe our data. It also shows us that plotting is not optional.
Anscombe's quartet is a collection of four different datasets that have identical summary statistics and correlation coefficients, but when plotted, it is obvious that they are not similar:
[Figure: Anscombe's Quartet: four scatter plots (one linear, the others non-linear) that all share the same regression line (y = 0.50x + 3.0), correlation coefficient (ρ = 0.82), means, and variances]
Summary statistics are very helpful when we're getting to know the data, but be wary of relying exclusively on them. Remember, statistics can mislead; be sure to also plot the data before drawing any conclusions or proceeding with the analysis. You can read more about Anscombe's quartet here: https://en.wikipedia.org/wiki/Anscombe%27s_quartet.
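If you want to verify this for yourself, the seaborn plotting library ships a copy of Anscombe's quartet (an internet connection is needed the first time the dataset is downloaded):

import seaborn as sns

anscombe = sns.load_dataset('anscombe')

# nearly identical summary statistics for each of the four datasets
anscombe.groupby('dataset').agg(['mean', 'std'])

# plotting, however, reveals how different they are
sns.lmplot(data=anscombe, x='x', y='y', col='dataset')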
Prediction and forecasting

Say an ice cream shop has been recording its daily ice cream sales along with the temperature on each of those days. Plotting sales against temperature gives us a scatter plot to examine:

[Figure: scatter plot of ice cream sales versus temperature in °C]
We can observe an upward trend in the scatter plot: more ice creams are sold at
higher temperatures. In order to help out the ice cream shop, though, we need to find
a way to make predictions from this data. We can use a technique called regression to
model the relationship between temperature and ice cream sales with an equation.
Using this equation, we will be able to predict ice cream sales at a given temperature.
In Chapter 9, Getting Started with Machine Learning in Python, we will go over regression in depth, so this discussion will be a high-level overview. There are many types of regression that will yield a different type of equation, such as linear and logistic. Our first step will be to identify the dependent variable, which is the quantity we want to predict (ice cream sales), and the variables we will use to predict it, which are called independent variables. While we can have many independent variables, our ice cream sales example only has one: temperature. Therefore, we will use simple linear regression to model the relationship as a line:
[Figure: Using regression to predict ice cream sales: a scatter plot with the fitted regression line (solid), its extrapolation beyond the observed temperatures (dotted), and the equation y = 1.50x - 27.96; temperature in °C is on the x-axis]
The regression line in the previous scatter plot yields the following equation for the relationship:

ice cream sales = 1.5 × temperature - 27.96

Today the temperature is 35°C, so we plug that in for temperature in the equation. The result predicts that the ice cream shop will sell 24.54 ice creams. This prediction is along the red line in the previous plot. Note that the ice cream shop can't actually sell fractions of an ice cream.
Before leaving the model in the hands of the ice cream shop, it's important to discuss
the difference between the dotted and solid portions of the regression line that we
obtained. When we make predictions using the solid portion of the line, we are
using interpolation, meaning that we will be predicting ice cream sales for
temperatures the regression was created on. On the other hand, if we try to predict
how many ice creams will be sold at 45°C, it is called extrapolation (dotted portion of
the line), since we didn't have any temperatures this high when we ran the regression.
Extrapolation can be very dangerous, as many trends don't continue indefinitely. It may be so hot that people decide not to leave their houses. This means that instead of selling the predicted 39.54 ice creams, they would sell zero.
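To illustrate, here is a minimal simple linear regression with np.polyfit; the (temperature, sales) pairs are made up, since the ice cream shop's actual data isn't reproduced here:

import numpy as np

# hypothetical observations of temperature (°C) and ice creams sold
temps = np.array([20, 22, 25, 27, 30, 32, 35])
sales = np.array([2, 5, 9, 13, 17, 20, 24])

# fit a line (degree-1 polynomial); returns the slope and intercept
slope, intercept = np.polyfit(temps, sales, 1)

# interpolation: predicting within the range of observed temperatures
predicted = slope * 30 + intercept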
We can also predict categories. Imagine that the ice cream shop wants to know which flavor of ice cream will sell the most on a given day; this would be a classification problem rather than a regression one.
The moving average puts equal weight on each time period in the past involved in the calculation. In practice, this isn't always a realistic expectation of our data. Sometimes, all past values are important, but they vary in their influence on future data points. For these cases, we can use exponential smoothing, which allows us to put more weight on more recent values and less weight on values further away from what we are predicting.
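In pandas, both techniques are available on any Series; a sketch with invented daily sales numbers:

import pandas as pd

sales = pd.Series([10, 12, 13, 12, 15, 16, 18])

# moving average: equal weight on each of the last 3 observations
moving_avg = sales.rolling(window=3).mean()

# exponential smoothing: recent observations get more weight
smoothed = sales.ewm(alpha=0.5).mean()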
Note that we aren't limited to predicting numbers; in fact, depending on the data, the predictions could be categorical in nature, such as determining what color the next observation will be or whether an email is spam or not. We will cover more on regression, time series analysis, and other methods of prediction using machine learning in later chapters.
Inferential statistics
As mentioned earlier, inferential statistics deals with inferring or deducing things from the sample data we have in order to make statements about the population as a whole. When we're looking to state our conclusions, we have to be mindful of whether we conducted an observational study or an experiment. An observational study is where the independent variable is not under the control of the researchers, and so we are observing those taking part in our study (think about studies on smoking: we can't force people to smoke). The fact that we can't control the independent variable means that we cannot conclude causation.
Inferential statistics gives us tools to translate our understanding of the sample data into a statement about the population. Remember that the sample statistics we discussed earlier are estimators for the population parameters. Since these estimates may not be exact, we use confidence intervals, which provide a point estimate and a margin of error around it. This is the range in which the true population parameter will be at a certain confidence level. At the 95% confidence level, 95% of the confidence intervals that are calculated from random samples of the population contain the true population parameter. The 95% confidence level is the most frequent choice in statistics, although 90% and 99% are also common; the higher the confidence level, the wider the interval.
Hypothesis tests allow us to test whether the true population parameter is less than,
greater than, or not equal to some value at a certain significance level (called alpha).
The process of performing a hypothesis test involves stating our initial assumption or
null hypothesis: for example, the true population mean is 0. We pick a level of statistical
significance, usually 5%, which is the probability of rejecting the null hypothesis
when it is true. Then, we calculate the critical value for the test statistic, which will
depend on the amount of data we have and the type of statistic (such as the mean of
one population or the proportion of votes for a candidate) we are testing. The critical
value is compared to the test statistic from our data, and we decide to either reject or
fail to reject the null hypothesis. Hypothesis tests are closely related to confidence
intervals. The significance level is equivalent to 1 minus the confidence level. This
means that a result is statistically significant if the null hypothesis value is not in the
confidence interval.
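As a sketch of this process, SciPy's one-sample t-test checks whether the true population mean differs from a hypothesized value; the sample values here are invented:

import numpy as np
from scipy import stats

sample = np.array([0.5, -0.2, 0.9, 1.1, 0.3, 0.8, -0.1, 0.7])

# null hypothesis: the true population mean is 0
t_stat, p_value = stats.ttest_1samp(sample, 0)

# reject the null hypothesis if the p-value is below our significance level
reject = p_value < 0.05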