Introduction to Data Analysis
Chapter 1

Fundamentals of data analysis


Data analysis is a highly iterative process involving collection, preparation (wrangling), exploratory data analysis (EDA), and drawing conclusions. During an analysis, we will frequently revisit each of these steps. The following diagram depicts a generalized workflow:

[Diagram: the data analysis workflow — collect data, data wrangling, EDA, and draw conclusions; if more data is needed, return to collection, otherwise communicate results.]

In practice, this process is heavily skewed towards the data preparation side. Surveys have found that, although data scientists enjoy the data preparation side of their job the least, it makes up 80% of their work (https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#419ce7b36f63). This data preparation step is where pandas really shines.


Data collection
Data collection is the natural first step for any data analysis: we can't analyze data we don't have. In reality, our analysis can begin even before we have the data: when we decide what we want to investigate or analyze, we have to think of what kind of data we can collect that will be useful for our analysis. While data can come from anywhere, we will explore the following sources throughout this book:

- Web scraping to extract data from a website's HTML (often with Python packages such as selenium, requests, scrapy, and beautifulsoup)
- Application Programming Interfaces (APIs) for web services, from which we can collect data with the requests package (see the sketch after this list)
- Databases (data can be extracted with SQL or another database-querying language)
- Internet resources that provide data for download, such as government websites or Yahoo! Finance
- Log files
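As a quick illustration of the API route, the snippet below sketches how data might be pulled from a web service with requests; the endpoint and parameters here are hypothetical placeholders, not ones used in this book.

```python
import requests

# Hypothetical endpoint and parameters, for illustration only.
response = requests.get(
    'https://api.example.com/v1/data',
    params={'start': '2019-01-01', 'end': '2019-01-31'},
)
response.raise_for_status()  # stop here if the request failed
payload = response.json()    # most web services return JSON we can load into pandas
```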
Chapter 2, Working with Pandas DataFrames, will give us the skills we need to work with the aforementioned data sources. Chapter 12, The Road Ahead, provides countless resources for finding data sources.
We are surrounded by data, so the possibilities are limitless. It is important, however, to make sure that we are collecting data that will help us draw conclusions. For example, if we are trying to determine if hot chocolate sales are higher when the temperature is lower, we should collect data on the amount of hot chocolate sold and the temperature each day. While it might be interesting to see how far people traveled to get the hot chocolate, it's not relevant to our analysis.
TIP: Don't worry too much about finding the perfect data before beginning an analysis. Odds are, there will always be something we want to add/remove from the initial dataset, reformat, merge with other data, or change in some way. This is where data wrangling comes into play.


Data wrangling
Data wrangling is the process of preparing the data and getting it into a format that can be used for analysis. The unfortunate reality of data is that it is often dirty, meaning that it requires cleaning (preparation) before it can be used. The following are some issues we may encounter with our data:

- Human errors: Data is recorded (or even collected) incorrectly, such as putting 100 instead of 1000, or typos. In addition, there may be multiple versions of the same entry recorded, such as New York City, NYC, and nyc.
- Computer error: Perhaps we weren't recording entries for a while (missing data).
- Unexpected values: Maybe whoever was recording the data decided to use ? for a missing value in a numeric column, so now all the entries in the column will be treated as text instead of numeric values.
- Incomplete information: Think of a survey with optional questions; not everyone will answer them, so we have missing data, but not due to computer or human error.
- Resolution: The data may have been collected per second, while we need hourly data for our analysis.
- Relevance of the fields: Often, data is collected or generated as a product of some process rather than explicitly for our analysis. In order to get it to a usable state, we will have to clean it up.
- Format of the data: The data may be recorded in a format that isn't conducive to analysis, which will require that we reshape it.
- Misconfigurations in the data-recording process: Data coming from sources such as misconfigured trackers and/or webhooks may be missing fields or passing them in the wrong order.

Most of these data quality issues can be remedied, but some cannot, such as when the
data is collected daily and we need it on an hourly resolution. It is our responsibility
to carefully examine our data and to handle any issues, so that our analysis doesn't
get distorted. We will cover this process in depth in Chapter 3, Data Wrangling with
Pandas, and Chapter 4, Aggregating Pandas DataFrames.
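To make a few of these issues concrete, here is a minimal sketch on a made-up dataset of how some of them might be addressed with pandas; the real techniques are covered in the chapters just mentioned.

```python
import pandas as pd

# Made-up data exhibiting some of the issues above: inconsistent entries,
# a '?' standing in for a missing value, and a duplicated row.
df = pd.DataFrame({
    'city': ['New York City', 'NYC', 'nyc', 'Boston', 'Boston'],
    'temp': ['31', '30', '?', '25', '25'],
})

df['city'] = df['city'].replace({'NYC': 'New York City', 'nyc': 'New York City'})
df['temp'] = pd.to_numeric(df['temp'], errors='coerce')  # '?' becomes NaN
df = df.drop_duplicates()
```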


Exploratory data analysis


During EDA, we use visualizations and summary statistics to get a better understanding of the data. Since the human brain excels at picking out visual patterns, data visualization is essential to any analysis. In fact, some characteristics of the data can only be observed in a plot. Depending on our data, we may create plots to see how a variable of interest has evolved over time, compare how many observations belong to each category, find outliers, look at distributions of continuous and discrete variables, and much more. In Chapter 5, Visualizing Data with Pandas and Matplotlib, and Chapter 6, Plotting with Seaborn and Customization Techniques, we will learn how to create these plots for both EDA and presentation.

Data visualizations are very powerful; unfortunately, they can often be misleading. One common issue stems from the scale of the y-axis. Most plotting tools will zoom in by default to show the pattern up close. It would be difficult for software to know what the appropriate axis limits are for every possible plot; therefore, it is our job to properly adjust the axes before presenting our results. You can read about some more ways plots can mislead here: https://venngage.com/blog/misleading-graphs/.

In the workflow diagram we saw earlier, EDA and data wrangling shared a box. This
is because they are closely tied:

- Data needs to be prepped before EDA.
- Visualizations that are created during EDA may indicate the need for additional data cleaning.
- Data wrangling uses summary statistics to look for potential data issues, while EDA uses them to understand the data. Improper cleaning will distort the findings when we're conducting EDA. In addition, data wrangling skills will be required to get summary statistics across subsets of the data.

When calculating summary statistics, we must keep the type of data we collected in mind. Data can be quantitative (measurable quantities) or categorical (descriptions, groupings, or categories). Within these classes of data, we have further subdivisions that let us know what types of operations we can perform on them.

For example, categorical data can be nominal, where we assign a numeric value to each level of the category, such as on = 1 and off = 0, but we can't say that one is greater than the other because that distinction is meaningless. The fact that on is greater than off has no meaning because we arbitrarily chose those numbers to represent the states on and off. Note that, in this case, we can represent the data with a Boolean (True/False) value: is_on. Categorical data can also be ordinal, meaning that we can rank the levels (for instance, we can have low < medium < high).

With quantitative data, we can be on an interval scale or a ratio scale. The interval scale includes things such as temperature. We can measure temperatures in Celsius and compare the temperatures of two cities, but it doesn't mean anything to say one city is twice as hot as the other. Therefore, interval scale values can be meaningfully compared using addition/subtraction, but not multiplication/division. The ratio scale, then, covers those values that can be meaningfully compared with ratios (using multiplication and division). Examples of the ratio scale include prices, sizes, and counts.
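As a small illustration of these distinctions, pandas can store nominal on/off data as Booleans and ordinal data as an ordered categorical; this is only a sketch of the idea, not an example from the text.

```python
import pandas as pd

# Nominal on/off data represented as a Boolean rather than arbitrary 1/0 codes.
is_on = pd.Series([True, False, True])

# Ordinal data: the levels can be ranked (low < medium < high).
severity = pd.Series(['low', 'high', 'medium', 'low']).astype(
    pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
)
severity.sort_values()  # respects the low < medium < high ordering
```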

Drawing conclusions
After we have collected the data for our analysis, cleaned it up, and performed some thorough EDA, it is time to draw conclusions. This is where we summarize our findings from EDA and decide the next steps:

- Did we notice any patterns or relationships when visualizing the data?
- Does it look like we can make accurate predictions from our data? Does it make sense to move on to modeling the data?
- Do we need to collect new data points?
- How is the data distributed?
- Does the data help us answer the questions we have or give insight into the problem we are investigating?
- Do we need to collect new or additional data?


If we decide to model the data, this falls under machine learning and statistics. While not technically data analysis, it is usually the next step, and we will cover it in Chapter 9, Getting Started with Machine Learning in Python, and Chapter 10, Making Better Predictions - Optimizing Models. In addition, we will see how this entire process works in practice in Chapter 11, Machine Learning Anomaly Detection. As a reference, in the Machine learning workflow section in the appendix, there is a workflow diagram depicting the full process from data analysis to machine learning. Chapter 7, Financial Analysis - Bitcoin and the Stock Market, and Chapter 8, Rule-Based Anomaly Detection, will focus on drawing conclusions from data analysis, rather than building models.

Statistical foundations
When we want to make observations about the data we are analyzing, we are often, if not always, turning to statistics in some fashion. The data we have is referred to as the sample, which was observed from (and is a subset of) the population. Two broad categories of statistics are descriptive and inferential statistics. With descriptive statistics, as the name implies, we are looking to describe the sample. Inferential statistics involves using the sample statistics to infer, or deduce, something about the population, such as the underlying distribution.

The sample statistics are used as estimators of the population parameters, meaning that we have to quantify their bias and variance. There are a multitude of methods for this; some will make assumptions on the shape of the distribution (parametric) and others won't (non-parametric). This is all well beyond the scope of this book, but it is good to be aware of.

Often, the goal of an analysis is to create a story for the data; unfortunately, it is very easy to misuse statistics. It's the subject of a famous quote:

"There are three kinds of lies: lies, damned lies, and statistics."
- Benjamin Disraeli

This is especially true of inferential statistics, which are used in many scientific studies and papers to show the significance of their findings. This is a more advanced topic, and, since this isn't a statistics book, we will only briefly touch upon some of the tools and principles behind inferential statistics, which can be pursued further. We will focus on descriptive statistics to help explain the data we are analyzing.


The next few sections will be a review of statistics; those with statistical knowledge can skip to the Setting up a virtual environment section.

Sampling
There's an important thing to remember before we attempt any analysis: our sample must be a random sample that is representative of the population. This means that the data must be sampled without bias (for example, if we are asking people if they like a certain sports team, we can't only ask fans of the team) and that we should have (ideally) members of all distinct groups from the population in our sample (in the sports team example, we can't just ask men).

There are many methods of sampling. You can read about them, along with their strengths and weaknesses, here: https://www.khanacademy.org/math/statistics-probability/designing-studies/sampling-methods-stats/a/sampling-methods-review.

When we discuss machine learning in Chapter 9, Getting Started with Machine Learning in Python, we will need to sample our data, which will be a sample to begin with. This is called resampling. Depending on the data, we will have to pick a different method of sampling. Often, our best bet is a simple random sample: we use a random number generator to pick rows at random. When we have distinct groups in the data, we want our sample to be a stratified random sample, which will preserve the proportion of the groups in the data. In some cases, we don't have enough data for the aforementioned sampling strategies, so we may turn to random sampling with replacement (bootstrapping); this is a bootstrap sample. Note that our underlying sample needs to have been a random sample or we risk increasing the bias of the estimator (we could pick certain rows more often because they are in the data more often if it was a convenience sample, while in the true population these rows aren't as prevalent). We will see an example of this in Chapter 8, Rule-Based Anomaly Detection.
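The following sketch shows what these three sampling strategies might look like with pandas on a made-up DataFrame; the exact approach used later in the book may differ.

```python
import pandas as pd

df = pd.DataFrame({'group': ['A'] * 80 + ['B'] * 20, 'value': range(100)})

# Simple random sample: pick rows at random.
simple = df.sample(n=10, random_state=0)

# Stratified random sample: preserve each group's proportion (80/20 here).
stratified = df.groupby('group', group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)

# Bootstrap sample: random sampling with replacement.
bootstrap = df.sample(frac=1, replace=True, random_state=0)
```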

A thorough discussion of the theory behind bootstrapping and its consequences is well beyond the scope of this book, but watch this video for a primer: https://www.youtube.com/watch?v=gcPIyeqymOU.


Descriptive statistics
We will begin our discussion of descriptive statistics with univariate statistics; univariate simply means that these statistics are calculated from one (uni) variable. Everything in this section can be extended to the whole dataset, but the statistics will be calculated per variable we are recording (meaning that if we had 100 observations of speed and distance pairs, we could calculate the averages across the dataset, which would give us the average speed and the average distance statistics).

Descriptive statistics are used to describe and/or summarize the data we are working with. We can start our summarization of the data with a measure of central tendency, which describes where most of the data is centered around, and a measure of spread or dispersion, which indicates how far apart values are.

Measures of central tendency


Measures of central tendency describe the center of our distribution of data. There are
three common statistics that are used as measures of center: mean, median, and
mode. Each has its own strengths, depending on the data we are working with.

Mean
Perhaps the most common statistic for summarizing data is the average, or mean. The
population mean is denoted by the Greek symbol mu (u), and the sample mean is
written as T (pronounced X-bar). The sample mean is calculated by summing all the
values and dividing by the count of values; lor example, the mean of [0, 1, 1, 2,
9] is 2.6 ( ( 0 + 1 + 1 + 2 + 9)/5):

TIP: We use x_i to represent the ith observation of the variable X. Note how the variable as a whole is represented with a capital letter, while the specific observation is lowercase. Σ (Greek capital letter sigma) is used to represent a summation, which, in the equation for the mean, goes from 1 to n, which is the number of observations.


One important thing to note about the mean is that it is very sensitive to outliers (values created by a different generative process than our distribution). We were dealing with only five values; nevertheless, the 9 is much larger than the other numbers and pulled the mean higher than all but the 9.

Median
In cases where we suspect outliers to be present in our data, we may want to use the median as our measure of central tendency. Unlike the mean, the median is robust to outliers. Think of income in the US; the top 1% is much higher than the rest of the population, so this will skew the mean to be higher and distort the perception of the average person's income.

The median represents the 50th percentile of our data; this means that 50% of the values are greater than the median and 50% are less than the median. It is calculated by taking the middle value from an ordered list of values; in cases where we have an even number of values, we take the average of the middle two values. If we take the numbers [0, 1, 1, 2, 9] again, our median is 1.

TIP: The ith percentile is the value at which i% of the observations are less than that value, so the 99th percentile is the value in X where 99% of the x's are less than it.

Mode


The mode is the most common value in the data (if we have [0, 1, 1, 2, 9], then 1 is the mode). In practice, this isn't as useful as it would seem, but we will often hear things like the distribution is bimodal or multimodal (as opposed to unimodal) in cases where the distribution has two or more most popular values. This doesn't necessarily mean that each of them occurred the same amount of times, but, rather, they are more common than the other values by a significant amount. As shown in the following plots, a unimodal distribution has only one mode (at 0), a bimodal distribution has two (at -2 and 3), and a multimodal distribution has many (at -2, 0.4, and 3):


[Plots: unimodal, bimodal, and multimodal distributions]

Understanding the concept of the mode comes in handy when describing continuous distributions; however, most of the time when we're describing our data, we will use either the mean or the median as our measure of central tendency.
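Using the same five numbers as above, pandas computes these measures directly (a quick sketch; mode() returns a Series because there can be ties):

```python
import pandas as pd

data = pd.Series([0, 1, 1, 2, 9])

data.mean()    # 2.6 -- pulled upward by the outlier (9)
data.median()  # 1.0 -- robust to the outlier
data.mode()    # a Series containing 1, the most common value
```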

Measures of spread
Knowing where the center of the distribution is only gets us partially to being able to summarize the distribution of our data: we need to know how values fall around the center and how far apart they are. Measures of spread tell us how the data is dispersed; this will indicate how thin (low dispersion) or wide (very spread out) our distribution is. As with measures of central tendency, we have several ways to describe the spread of a distribution, and which one we choose will depend on the situation and the data.

Range
The range is the distance between the smallest value (minimum) and the largest value (maximum):

$$range = \max(X) - \min(X)$$

The units of the range will be the same units as our data. Therefore, unless two distributions of data are in the same units and measuring the same thing, we can't compare their ranges and say one is more dispersed than the other.

Variance
Just from the definition of the range, we can see why that wouldn't always be the best way to measure the spread of our data. It gives us upper and lower bounds on what we have in the data; however, if we have any outliers in our data, the range will be rendered useless.


Another problem with the range is that it doesn't tell us how the data is dispersed around its center; it really only tells us how dispersed the entire dataset is. Enter the variance, which describes how far apart observations are spread out from their average value (the mean). The population variance is denoted as sigma-squared (σ²), and the sample variance is written as s².

The variance is calculated as the average squared distance from the mean. The distances must be squared so that distances below the mean don't cancel out those above the mean. If we want the sample variance to be an unbiased estimator of the population variance, we divide by n - 1 instead of n to account for using the sample mean instead of the population mean; this is called Bessel's correction (https://en.wikipedia.org/wiki/Bessel%27s_correction). Most statistical tools will give us the sample variance by default, since it is very rare that we would have data for the entire population:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Standard deviation
The variance gives us a statistic with squared units. This means that if we started with data on gross domestic product (GDP) in dollars ($), then our variance would be in dollars squared ($²). This isn't really useful when we're trying to see how this describes the data; we can use the magnitude (size) itself to see how spread out something is (large values = large spread), but beyond that, we need a measure of spread with units that are the same as our data.

For this purpose, we use the standard deviation, which is simply the square root of the variance. By performing this operation, we get a statistic in units that we can make sense of again ($ for our GDP example):

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
The population standard deviation is represented as σ, and the sample standard deviation is denoted as s.
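As a quick sketch, pandas applies Bessel's correction by default, so .var() and .std() return the sample statistics:

```python
import pandas as pd

data = pd.Series([0, 1, 1, 2, 9])

data.var()        # sample variance (divides by n - 1 by default)
data.std()        # sample standard deviation, the square root of the variance
data.var(ddof=0)  # population variance (divides by n), if that's what we need
```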


We can use the standard deviation to see how far from the mean data points are on average. A small standard deviation means that values are close to the mean; a large standard deviation means that values are dispersed more widely. This can be tied to how we would imagine the distribution curve: the smaller the standard deviation, the skinnier the peak of the curve; the larger the standard deviation, the fatter the peak of the curve. The following plot is a comparison of standard deviations of 0.5 to 2:

[Plot: Different Population Standard Deviations]

Coefficient of variation
When we moved from variance to standard deviation, we were looking to get to units that made sense; however, if we then want to compare the level of dispersion of one dataset to another, we would need to have the same units once again. One way around this is to calculate the coefficient of variation (CV), which is the ratio of the standard deviation to the mean. It tells us how big the standard deviation is relative to the mean:

$$CV = \frac{s}{\bar{x}}$$

Interquartile range
So far, other than the range, we have discussed mean-based measures of dispersion; now, we will look at how we can describe the spread with the median as our measure of central tendency. As mentioned earlier, the median is the 50th percentile or the 2nd quartile (Q2). Percentiles and quartiles are both quantiles: values that divide data into equal groups each containing the same percentage of the total data; percentiles give this in 100 parts, while quartiles give it in four (25%, 50%, 75%, and 100%).

Since quantiles neatly divide up our data, and we know how much of the data goes in each section, they are a perfect candidate for helping us quantify the spread of our data. One common measure for this is the interquartile range (IQR), which is the distance between the 3rd and 1st quartiles:

$$IQR = Q_3 - Q_1$$

The IQR gives us the spread of data around the median and quantifies how much dispersion we have in the middle 50% of our distribution. It can also be useful to determine outliers, which we will cover in Chapter 8, Rule-Based Anomaly Detection.

Quartile coefficient of dispersion

Just like we had the coefficient of variation when using the mean as our measure of central tendency, we have the quartile coefficient of dispersion when using the median as our measure of the center. This statistic is also unitless, so it can be used to compare datasets. It is calculated by dividing the semi-quartile range (half the IQR) by the midhinge (the midpoint between the first and third quartiles):

$$QCD = \frac{(Q_3 - Q_1)/2}{(Q_3 + Q_1)/2} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
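A quick sketch of both of these median-based measures with pandas, using the same five numbers from earlier:

```python
import pandas as pd

data = pd.Series([0, 1, 1, 2, 9])

q1, q3 = data.quantile([0.25, 0.75])
iqr = q3 - q1                # interquartile range
qcd = (q3 - q1) / (q3 + q1)  # quartile coefficient of dispersion
```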

Summarizing data
We have seen many examples of descriptive statistics that we can use to summarize our data by its center and dispersion; in practice, looking at the 5-number summary or visualizing the distribution proves to be helpful first steps before diving into some of the other aforementioned metrics. The 5-number summary, as its name indicates, provides five descriptive statistics that summarize our data:
Quartile   Statistic   Percentile
Q0         minimum     0th
Q1         N/A         25th
Q2         median      50th
Q3         N/A         75th
Q4         maximum     100th


Looking at the 5-number summary is a quick and efficient way of getting a sense of our data. At a glance, we have an idea of the distribution of the data and can move on to visualizing it.
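In pandas, the describe() method returns the 5-number summary (along with the count, mean, and standard deviation) in a single call:

```python
import pandas as pd

data = pd.Series([0, 1, 1, 2, 9])
data.describe()  # count, mean, std, min, 25%, 50% (median), 75%, max
```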

The box plot (or box and whisker plot) is the visual representation of the 5-number summary. The median is denoted by a thick line in the box. The top of the box is Q3 and the bottom of the box is Q1. Lines (whiskers) extend from both sides of the box boundaries toward the minimum and maximum. Based on the convention our plotting tool uses, though, they may only extend to a certain statistic; any values beyond these statistics are marked as outliers (using points). For this book, the lower bound of the whiskers will be Q1 - 1.5 * IQR and the upper bound will be Q3 + 1.5 * IQR, which is called the Tukey box plot:

[Plot: a Tukey box plot, with the median, the IQR box, whiskers at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, and outliers marked as points]


Scaling data
In order to compare variables from different distributions, we would have to scale the data, which we could do with the range by using min-max scaling. We take each data point, subtract the minimum of the dataset, and then divide by the range. This normalizes our data (scales it to the range [0, 1]):

$$x_{scaled} = \frac{x - \min(X)}{range(X)}$$

This isn't the only way to scale data; we can also use the mean and standard deviation. In this case, we would subtract the mean from each observation and then divide by the standard deviation to standardize the data:

$$z = \frac{x - \bar{x}}{s}$$

This gives us what is known as a Z-score. We are left with a normalized distribution with a mean of 0 and a standard deviation (and variance) of 1. The Z-score tells us how many standard deviations from the mean each observation is; the mean has a Z-score of 0, while an observation of 0.5 standard deviations below the mean will have a Z-score of -0.5.

and the one w e end up


There are, of course, additional ways to scale our data,
on o u r data. By keeping
the m e a s u r e s of central tendency
choosing will be dependent the scaling of
of dispersion in mind, you will be able to identify how
and measures

done in any other methods you come across


data is being
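As a minimal sketch, both of these scaling approaches can be written directly with pandas:

```python
import pandas as pd

data = pd.Series([0, 1, 1, 2, 9])

# Min-max scaling: squeeze the values into the range [0, 1].
min_max_scaled = (data - data.min()) / (data.max() - data.min())

# Standardizing: Z-scores with a mean of 0 and a standard deviation of 1.
z_scores = (data - data.mean()) / data.std()
```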

Quantifying relationships between variables

In the previous sections, we were dealing with univariate statistics and were only able to say something about the variable we were looking at. With multivariate statistics, we can look to quantify relationships between variables. This allows us to look into things such as correlations (how one variable changes with respect to another) and attempt to make predictions for future behavior.


The covariance is a statistic for quantifying the relationship between variables by showing their joint variance:

$$cov(X, Y) = E[(X - E[X])(Y - E[Y])]$$


E[X] is new notation for us. It is read as the expected value of X or the expectation of X, and it is calculated by summing all the possible values of X multiplied by their probability; it's the long-run average of X.

The magnitude of the covariance isn't easy to interpret, but its sign tells us if the variables are positively or negatively correlated. However, we would also like to quantify how strong the relationship is between the variables, which brings us to correlation. Correlation tells us how variables change together, both in direction (same or opposite) and in magnitude (strength of the relationship). To find the correlation, we calculate the Pearson correlation coefficient, symbolized by ρ (the Greek letter rho), by dividing the covariance by the product of the standard deviations of the variables:

$$\rho_{X,Y} = \frac{cov(X, Y)}{s_X s_Y}$$

This normalizes the covariance and results in a statistic bounded between -1 and 1, making it easy to describe both the direction of the correlation (sign) and the strength of it (magnitude). Correlations of 1 are said to be perfect positive (linear) correlations, while those of -1 are perfect negative correlations. Values near 0 aren't correlated. If correlation coefficients are near 1 in absolute value, then the variables are said to be strongly correlated; those closer to 0.5 are said to be weakly correlated.
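A small sketch with made-up data showing how pandas computes both statistics:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
x = pd.Series(np.random.uniform(size=100))
y = 2 * x + np.random.normal(scale=0.1, size=100)  # y moves with x, plus some noise

x.cov(y)   # positive -> the variables increase together
x.corr(y)  # Pearson correlation coefficient, close to +1 here
```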
Let's look at some examples using scatter plots. In the leftmost plot (ρ = 0.11), we see that there is no correlation between the variables: they appear to be random noise with no pattern. The next plot, with ρ = -0.52, has weak negative correlation: we can see that the variables appear to move together, with the x variable increasing while the y variable decreases, but there is still a bit of randomness. In the third plot from the left (ρ = 0.87), there is a strong positive correlation: x and y are increasing together. The rightmost plot, with ρ = -0.99, has near perfect negative correlation: as x increases, y decreases. We can also see how the points form a line:

[Scatter plots illustrating the four correlation values described above]

One very important thing to remember is that, while we may find a correlation between X and Y, it doesn't mean that X causes Y or that Y causes X. There could be some Z that actually causes both; perhaps X causes some intermediary event that causes Y, or perhaps it is actually just a coincidence. Keep in mind that we often don't have enough information to report causation:

"Correlation does not imply causation."


To quickly eyeball the strength and direction of the relationship between two variables (and see if there even seems to be one), we will often use scatter plots rather than calculating the exact correlation coefficient. This is for a couple of reasons:

- It's easier to find patterns in visualizations, but it's more work to arrive at the same conclusion by looking at numbers and tables.
- We might see that the variables seem related, but they may not be linearly related. Looking at a visual representation will make it easy to see if our data is actually quadratic, exponential, logarithmic, or some other non-linear function.
Both of the following plots depict data with strong positive correlations, but it's pretty obvious, when looking at the scatter plots, that these are not linear. The one on the left is logarithmic, while the one on the right is exponential:
[Scatter plots of logarithmic and exponential data with strong positive correlations (ρ = 0.69 and ρ = 0.80)]

Pitfalls of summary statistics

Not only can correlation coefficients be misleading; so can summary statistics. There is a very interesting dataset illustrating how careful we must be when only using summary statistics and correlation coefficients to describe our data. It also shows us that plotting is not optional.

Anscombe's quartet is a collection of four different datasets that have identical summary statistics and correlation coefficients, but, when plotted, it is obvious that they are not similar:

[Plots: Anscombe's Quartet: four panels (linear, non-linear, linear with outlier, vertical with outlier), each with the same regression line (y = 0.5x + 3.0) and correlation coefficient (ρ = 0.82) despite very different shapes]
Summary statistics are very helpful when we're getting to know the data, but be wary of relying exclusively on them. Remember, statistics can mislead; be sure to also plot the data before drawing any conclusions or proceeding with the analysis. You can read more about Anscombe's quartet here: https://en.wikipedia.org/wiki/Anscombe%27s_quartet.

Prediction and forecasting


Say our favorite ice cream shop has asked us to help predict how many ice creams they can expect to sell on a given day. They are convinced that the temperature outside has a strong influence on their sales, so they collected data on the number of ice creams sold at a given temperature. We agree to help them, and the first thing we do is make a scatter plot of the data they gave us:

[Scatter plot: Ice cream sales at a given temperature]

We can observe an upward trend in the scatter plot: more ice creams are sold at
higher temperatures. In order to help out the ice cream shop, though, we need to find
a way to make predictions from this data. We can use a technique called regression to
model the relationship between temperature and ice cream sales with an equation.
Using this equation, we will be able to predict ice cream sales at a given temperature.

In Chapter 9, Getting Started with Machine Learning in Python, we will go over regression in depth, so this discussion will be a high-level overview. There are many types of regression that will yield a different type of equation, such as linear and logistic. Our first step will be to identify the dependent variable, which is the quantity we want to predict (ice cream sales), and the variables we will use to predict it, which are called independent variables. While we can have many independent variables, our ice cream sales example only has one: temperature. Therefore, we will use simple linear regression to model the relationship as a line:
[Plot: Using regression to predict ice cream sales, showing the regression line (y = 1.5x - 27.96) and its extrapolated portion]

The regression line in the previous scatter plot yields the following equation for the relationship:

$$ice\ cream\ sales = 1.5 \times temperature - 27.96$$

Today the temperature is 35°C, so we plug that in for temperature in the equation. The result predicts that the ice cream shop will sell 24.54 ice creams. This prediction is along the red line in the previous plot. Note that the ice cream shop can't actually sell fractions of an ice cream.
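Written out as code, the fitted line makes these predictions easy to reproduce:

```python
def predict_sales(temperature):
    """Ice cream sales predicted by the regression line fitted above."""
    return 1.5 * temperature - 27.96

predict_sales(35)  # 24.54 ice creams at 35°C
predict_sales(45)  # 39.54 ice creams at 45°C (extrapolation, discussed below)
```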

Remember that correlation does not imply causation. People may buy ice cream when it is warmer, but warmer temperatures don't cause people to buy ice cream.


Before leaving the model in the hands of the ice cream shop, it's important to discuss the difference between the dotted and solid portions of the regression line that we obtained. When we make predictions using the solid portion of the line, we are using interpolation, meaning that we will be predicting ice cream sales for temperatures the regression was created on. On the other hand, if we try to predict how many ice creams will be sold at 45°C, it is called extrapolation (the dotted portion of the line), since we didn't have any temperatures this high when we ran the regression. Extrapolation can be very dangerous as many trends don't continue indefinitely. It may be so hot that people decide not to leave their houses. This means that instead of selling the predicted 39.54 ice creams, they would sell zero.
We can also predict categories. Imagine that the ice cream shop wants to know which flavor of ice cream will sell the most on a given day. This type of prediction will be introduced in Chapter 9, Getting Started with Machine Learning in Python.

When working with time series, our terminology is a little different: we often look to forecast future values based on past values. Forecasting is a type of prediction for time series. Before we try to model the time series, however, we will often use a process called time series decomposition to split the time series into components, which can be combined in an additive or multiplicative fashion and may be used as parts of a model.
behavior of the time series in the long term
The trend component describes the
effects. Using the trend, we can make
without accounting for the seasonal or cyclical Earth is
series in the long run, such as the population of
broad statements about the time series
a time
Facebook stock is stagnating. Seasonality of
increasing or the value of calendar-related movements of a time series. For
explains the systematic and York City is high in
ice cream trucks on the streets of New
example, the number of in the winter; this pattern repeats every year,
the summer and drops to nothing the cyclical
actual amount each s u m m e r is the same. Lastly,
regardless of whether the with the time series;
for anything else unexplained or irregular
component accounts the number of ice cream trucks
as a hurricane driving
this could be something such is difficult
term because it isn't safe to be outside. This component
down in the short
nature.
with a forecast due to its unexpected
to anticipate
We can use Python to decompose the time series into trend, seasonality, and noise or residuals. The cyclical component is captured in the noise (random, unpredictable data); after we remove the trend and seasonality from the time series, what we are left with is the residual.
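One common way to do this decomposition in Python is with the statsmodels package; the following is a rough sketch on stand-in data (the time series work later in the book uses its own data and settings, and older statsmodels versions name the period argument freq):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Stand-in daily series purely for illustration.
index = pd.date_range('2019-01-01', periods=730, freq='D')
series = pd.Series(range(730), index=index)

result = seasonal_decompose(series, model='additive', period=365)
result.trend     # long-term behavior
result.seasonal  # repeating, calendar-related movements
result.resid     # what's left over (the residuals)
```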

The moving average puts equal weight on each time period in the past involved in the calculation. In practice, this isn't always a realistic expectation of our data. Sometimes, all past values are important, but they vary in their influence on future data points. For these cases, we can use exponential smoothing, which allows us to put more weight on more recent values and less weight on values further away from what we are predicting.
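With pandas, the contrast between the two might look something like this (made-up numbers, purely illustrative):

```python
import pandas as pd

data = pd.Series([20, 22, 21, 25, 30, 28, 35])

data.rolling(window=3).mean()  # moving average: equal weight on the last 3 values
data.ewm(span=3).mean()        # exponential smoothing: recent values weigh more
```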
Note that we aren't limited to predicting numbers; in fact, depending on the data, the predictions could be categorical in nature: things such as determining what color the next observation will be or whether an email is spam or not. We will cover more on regression, time series analysis, and other methods of prediction using machine learning in later chapters.

Inferential statistics
As mentioned earlier, inferential statistics deals with inferring or deducing things from the sample data we have in order to make statements about the population as a whole. When we're looking to state our conclusions, we have to be mindful of whether we conducted an observational study or an experiment. An observational study is where the independent variable is not under the control of the researchers, and so we are observing those taking part in our study (think about studies on smoking; we can't force people to smoke). The fact that we can't control the independent variable means that we cannot conclude causation.

An experiment is where we are able to directly influence the independent variable and randomly assign subjects to the control and test groups, like A/B tests (for anything from website redesigns to ad copy). Note that the control group doesn't receive treatment; they can be given a placebo (depending on what the study is). The ideal setup for this will be double-blind, where the researchers administering the treatment don't know which treatment is the placebo and also don't know which subject belongs to which group.

We can often find reference to Bayesian inference and frequentist inference. These are based on two different ways of approaching probability. Frequentist statistics focuses on the frequency of the event, while Bayesian statistics uses a degree of belief when determining the probability of an event. We will see an example of this in Chapter 11, Machine Learning Anomaly Detection. You can read more about how these methods differ here: https://www.probabilisticworld.com/frequentist-bayesian-approaches-inferential-statistics/.


Inferential statistics gives us tools to translate our understanding of the sample data
to a statement about the population. Remember that the sample statistics we
discussed earlier are estimators for the population parameters. Our estimators need
confidence intervals, which provide a point estimate and a margin of error around it.
This is the range that the true population parameter will be in at a certain confidence
level. At the 95% confidence level, 95% of the confidence intervals that are calculated
from random
samples of the population contain the true population parameter.
Frequently, 95% is chosen for the confidence level and other purposes in statistics,
although 90% and 99% are also common; the higher the confidence level, the wider
the interval.

Hypothesis tests allow us to test whether the true population parameter is less than,
greater than, or not equal to some value at a certain significance level (called alpha).
The process of performing a hypothesis test involves stating our initial assumption or
null hypothesis: for example, the true population mean is 0. We pick a level of statistical
significance, usually 5%, which is the probability of rejecting the null hypothesis
when it is true. Then, we calculate the critical value for the test statistic, which will
depend on the amount of data we have and the type of statistic (such as the mean of
one population or the proportion of votes for a candidate) we are testing. The critical
value is compared to the test statistic from our data, and we decide to either reject or
fail to reject the null hypothesis. Hypothesis tests are closely related to confidence
intervals. The significance level is equivalent to 1 minus the confidence level. This
means that a result is statistically significant if the null hypothesis value is not in the
confidence interval.
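As one concrete (and simplified) illustration, a one-sample t-test of the null hypothesis that the true population mean is 0 could be run with scipy; picking the right test statistic for real data is a separate question, as noted below.

```python
import numpy as np
from scipy import stats

np.random.seed(0)
sample = np.random.normal(loc=0.5, scale=1, size=30)  # made-up sample data

# Null hypothesis: the true population mean is 0.
t_statistic, p_value = stats.ttest_1samp(sample, popmean=0)
reject_null = p_value < 0.05  # compare against the 5% significance level
```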

There are many things we have to be aware of when picking the method to calculate a confidence interval or the proper test statistic for a hypothesis test. This is beyond the scope of this book, but check out the link in the Further reading section at the end of this chapter for more information. Also be sure to look at some of the mishaps with p-values, such as p-hacking, here: https://en.wikipedia.org/wiki/Misunderstandings_of_p-values.
