Notation

Uploaded by

Solomon Asfaw
Copyright
© All Rights Reserved

Notation

Notation is a common language used to communicate mathematical ideas. Think of it
as a universal language that academic and industry professionals use to convey
those ideas. In the next videos, you might see things that seem confusing. Use the
quizzes to assist with your understanding of the concepts.
You likely already know some notation. The plus, minus, multiplication, division,
and equals signs are all mathematical symbols that you are likely familiar with.
Each of these symbols stands for an idea about how numbers interact with one
another. In the coming concepts, you will be introduced to some additional ideas
related to notation. Though you will not need to use notation to complete the
project, it does have the following benefits:
• Understanding how to correctly use notation makes you seem really smart. Knowing
how to read and write in notation is like learning a new language, one that is
used to convey ideas associated with mathematics.
• It allows you to read documentation and apply an idea to your own problem.
Notation is used to convey how problems are solved all the time. One really
popular mathematical algorithm that is used to solve some of the world's most
difficult problems is known as Gradient Boosting. Explanations of how it solves
problems are written in notation. If you really want to understand how this
algorithm works, you need to be able to read and understand notation.
• It makes ideas that are hard to say in words easier to convey. Sometimes we just
don't have the right words to say. For those situations, I prefer to use notation
to convey the message. Similar to the way an emoji or meme might convey a feeling
better than words, notation can convey an idea better than words. Usually, those
ideas are related to mathematics, but I am not here to stifle your creativity.
Example to Introduce Notation

There is a lot going on in this video - here is a recap of the big ideas.
Rows and Columns

If you aren't familiar with spreadsheets, they will be covered in detail in future lessons.
Spreadsheets are a common way to hold data. They are composed of rows and
columns. Rows run horizontally, while columns run vertically. Each column in a
spreadsheet commonly holds a specific variable, while each row is commonly called
an instance or individual.
The example used in the video is shown below.
Date | Day of Week | Time Spent On Site (X) | Buy (Y)
June 15 | Thursday | 5 | No
June 15 | Thursday | 10 | Yes
June 16 | Friday | 20 | Yes

This is a row:
Date | Day of Week | Time Spent On Site (X) | Buy (Y)
June 15 | Thursday | 5 | No

This is a column:
Time Spent On Site (X)
5
10
20

Before Collecting Data

Before collecting data, we usually start with a question, or multiple questions,
that we would like to answer. The purpose of data is to help us in answering these
questions.
Random Variables

A random variable is a placeholder for the possible values of some process (the
term 'some process' is admittedly a bit ambiguous). As was stated before, notation
is useful because it helps us take complex ideas and simplify them, often to a
single letter or symbol. Random variables are represented by capital letters; X, Y,
and Z are common choices.
We might have the random variable X, which is a placeholder for the possible values
of the amount of time someone spends on our site, or the random variable Y, which
is a placeholder for the possible values of whether or not an individual purchases
a product. X 'holds' the values that could possibly occur for the amount of time
spent on our website: any number from 0 to infinity, really.

Example Dataset
An example of the data we might have collected in the previous
video is shown here:

Date | Day of Week | Time Spent On Site (X) | Buy (Y)
June 15 | Thursday | 5 | No
June 15 | Thursday | 10 | Yes
June 16 | Friday | 20 | Yes

Question 1 of 2
What type of variable is the random variable X in the video in the
previous concept?

Categorical - Ordinal

Categorical - Nominal

Quantitative - Continuous

Quantitative - Discrete
Question 2 of 2
What type of variable is the random variable Y in the video in the
previous concept?

Categorical - Ordinal

Categorical - Nominal

Quantitative - Continuous
Quantitative - Discrete

Capital vs. Lower Case Letters


Random variables are represented by capital letters. Once we
observe an outcome of one of these random variables, we notate it with
the lowercase version of the same letter.

Example 1
For example, the amount of time someone spends on our
site is a random variable (we are not sure what the outcome will
be for any particular visitor), and we would notate this with X. Then
when the first person visits the website, if they spend 5 minutes, we
have now observed this outcome of our random variable. We would
notate any outcome as a lowercase letter with a subscript
associated with the order that we observed the outcome.

If 5 individuals visit our website, and the first spends 10 minutes, the
second spends 20 minutes, the third spends 45 minutes, the fourth
spends 12 minutes, and the fifth spends 8 minutes, we can notate
this problem in the following way:

X is the amount of time an individual spends on the website.

$x_1 = 10$, $x_2 = 20$, $x_3 = 45$, $x_4 = 12$, $x_5 = 8$.

The capital X is associated with this idea of a random variable,
while the observations of the random variable take on
lowercase x values.

Example 2
Taking this one step further, we could ask:

What is the probability someone spends more than 20
minutes on our website?

In notation, we would write:

P(X > 20)?

Here P stands for probability, while the parentheses enclose
the statement for which we would like to find the probability.
Since X represents the amount of time spent on the website, this
notation represents the probability the amount of time on the
website is greater than 20.

We could find this in the above example by noticing that only one of
the 5 observations exceeds 20. So, we would say there is a 1 (the
45) in 5 or 20% chance that an individual spends more than 20
minutes on our website (based on this dataset).

Example 3
If we asked, "What is the probability of an individual spending
20 or more minutes on our website?", we could notate this as:

P(X ≥ 20)?
We could then find this by noticing there are two out of the five
individuals that spent 20 or more minutes on the website. So this
probability is 2 out of 5 or 40%.
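
If you are curious how this looks in code, here is a minimal Python sketch that checks the two probabilities above. The list `times` holds the five observations from Example 1, and the helper name `empirical_probability` is made up for this illustration; it just counts the share of observations that satisfy a condition.

```python
# Observed times (in minutes) for the five visitors in Example 1.
times = [10, 20, 45, 12, 8]


def empirical_probability(values, condition):
    """Return the share of observed values for which condition(value) is True."""
    matches = sum(1 for value in values if condition(value))
    return matches / len(values)


# P(X > 20): only the 45 counts.
p_more_than_20 = empirical_probability(times, lambda x: x > 20)

# P(X >= 20): the 20 and the 45 count.
p_at_least_20 = empirical_probability(times, lambda x: x >= 20)

print(p_more_than_20)  # 0.2
print(p_at_least_20)   # 0.4
```

Like the hand calculation, these values are based only on this small dataset.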

Notation for Calculating the Mean


We know that the mean is calculated as the sum of all our values
divided by the number of values in our dataset.

In our current notation, adding all of our values together can be
extremely tedious. If we want to add 3 values of some random
variable together, we would use the notation:

$x_1 + x_2 + x_3$

If we want to add 6 values together, we would use the notation:

$x_1 + x_2 + x_3 + x_4 + x_5 + x_6$

To extend this to add one hundred, one thousand, or one million
values would be ridiculous! How can we make this easier to
communicate?
Aggregations
An aggregation is a way to turn multiple numbers into fewer
numbers (commonly one number).

Summation is a common aggregation. The notation used to sum
our values is the capital Greek letter sigma, Σ.

Example 1
Imagine we are looking at the amount of time individuals spend on
our website. We collect data from nine individuals:

$x_1 = 10$, $x_2 = 20$, $x_3 = 45$, $x_4 = 12$, $x_5 = 8$, $x_6 = 12$,
$x_7 = 3$, $x_8 = 68$, $x_9 = 5$

If we want to sum the first three values together in our previous
notation, we write:

$x_1 + x_2 + x_3$

In our new notation, we can write:

$\sum_{i=1}^{3} x_i$

Notice that our notation starts at the first observation ($i = 1$) and
ends at 3 (the number at the top of our summation).

So all of the following are equal to one another:

$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 = 10 + 20 + 45 = 75$
Example 2
Now, imagine we want to sum the last three values together:

$x_7 + x_8 + x_9$

In our new notation, we can write:

$\sum_{i=7}^{9} x_i$

Notice that our notation starts at the seventh observation ($i = 7$) and
ends at 9 (the number at the top of our summation).
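
The two summation examples can also be sketched in a few lines of Python. This is only an illustration: the helper name `summation` is an assumption, and it takes 1-based, inclusive start and end indices so that it reads like the numbers at the bottom and top of the sigma.

```python
# Observed values x_1 through x_9 from Example 1.
x = [10, 20, 45, 12, 8, 12, 3, 68, 5]


def summation(values, start, end):
    """Sum values from index start to index end (1-based, inclusive)."""
    # Python lists are 0-based, so x_i lives at values[i - 1].
    return sum(values[i - 1] for i in range(start, end + 1))


first_three = summation(x, 1, 3)  # x_1 + x_2 + x_3
last_three = summation(x, 7, 9)   # x_7 + x_8 + x_9

print(first_three)  # 75
print(last_three)   # 76
```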

Other Aggregations
The Σ sign is used for aggregating by summation, which is one of the most
common ways to aggregate. However, we might need to aggregate in other
ways. If we wanted to multiply all of our values together, we would use
the product sign Π, the capital Greek letter pi. The way we aggregate
continuous values is with something known as integration (a common
technique in calculus), which uses the symbol ∫, which is just a long S.
We will not be using integrals or products for quizzes in this class, but
you may see them in the future!

Final Steps for Calculating the Mean


To finalize our calculation of the mean, we introduce n as the total
number of values in our dataset. We can use this notation both at
the top of our summation and for the value that we divide by
when calculating the mean:

$\frac{1}{n} \sum_{i=1}^{n} x_i$

Instead of writing out all of the above, we commonly write $\bar{x}$ to
represent the mean of a dataset. As in the first video, though, we could
use any variable. Therefore, we might also write $\bar{y}$, or the bar
over any other letter.

We could also index using any other letter, not just $i$. We could just
as easily use $j$, $k$, or $m$ to index each of our data values. The
quizzes on the next concept will help reinforce this idea.
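
As a small, hedged sketch of the mean formula in Python, the code below reuses the five observations from the earlier example. The variable names are chosen for this illustration only.

```python
# Five observed values x_1 through x_5.
x = [10, 20, 45, 12, 8]

n = len(x)         # n, the total number of values
total = sum(x)     # the summation from i = 1 to n of x_i
x_bar = total / n  # divide the sum by n to get the mean

print(total)  # 95
print(x_bar)  # 19.0
```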

Notice
At 0:12 in the video, this should say
$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5$.
The $x_i$ term is missing from the summation shown in the video.
Notation Recap
Notation is an essential tool for communicating mathematical ideas.
We have introduced the fundamentals of notation in this lesson that
will allow you to read, write, and communicate with others using
your new skills!

Notation and Random Variables


As a quick recap, capital letters signify random variables. When
we look at individual instances of a particular random variable,
we identify these as lowercase letters with subscripts attached
to each specific observation.

For example, we might have X be the amount of time an individual
spends on our website. Our first visitor arrives and spends 10
minutes on our website, and we would say $x_1$ is 10 minutes.

We might imagine the random variables as columns in our dataset,
while a particular value would be notated with the lowercase
letter.

Notation | English | Example
X | A random variable | Time spent on website
$x_1$ | First observed value of the random variable X | 15 mins
$\sum_{i=1}^{n} x_i$ | Sum values beginning at the first observation and ending at the last | 5 + 2 + ... + 3
$\frac{1}{n} \sum_{i=1}^{n} x_i$ | Sum values beginning at the first observation and ending at the last and divide by the number of observations (the mean) | (5 + 2 + 3)/3
$\bar{x}$ | Exactly the same as the above - the mean of our data. | (5 + 2 + 3)/3

Notation for the Mean


We took our notation even further by introducing the notation for
summation, Σ. Using this, we were able to calculate the mean as:

$\frac{1}{n} \sum_{i=1}^{n} x_i$

In the next section, you will see this notation used to assist in your
understanding of calculating various measures of spread. Notation
can take time to fully grasp. Understanding notation not only helps
in conveying mathematical ideas but also in writing computer
programs - if you decide you want to learn that too! Soon you will
analyze data using spreadsheets. When that happens, many of
these operations will be hidden by the functions you will be using.
But until we get to spreadsheets, it is important to understand how
mathematical ideas are commonly communicated. This isn't easy,
but you can do it!

Lesson Overview
In this lesson, we will continue to cover more topics related to
analyzing quantitative variables, and you will learn to use measures
of spread. Measures of spread give us an idea of how spread out
our data are.

In this lesson you will:

• Evaluate measures of spread
  • Range
  • Interquartile Range (IQR)
  • Standard Deviation
  • Variance
• Analyze outliers
• Evaluate descriptive and inferential statistics

Throughout this lesson, you will learn how to calculate these, as well
as why we would use one measure of spread over another.

Histograms
Histograms are super useful for understanding the different aspects
of data and they are the most common visual used for quantitative
data. In the upcoming concepts, you will see histograms used all the
time to help you understand the four aspects we outlined earlier
regarding a quantitative variable:

• center
• spread
• shape
• outliers

How are Histograms constructed?


First, we need to bin our data. Each bin represents a range of
values in a dataset. The number of values that fall in the range of
each bin determines the height of each histogram bar. As shown in
the video above, changing the range of our bins can result in slightly
different visuals. However, there is no right or wrong answer in
choosing how to bin, and in most cases, the software you use will
choose the appropriate bins for you.
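
To make the idea of binning concrete, here is a minimal Python sketch. The data are hypothetical (they are not the data from the video), and the helper `bin_counts` is a name made up for this illustration. It counts how many values fall in each bin, which is the height of each histogram bar, and shows how a different bin width changes the picture.

```python
def bin_counts(values, bin_width, start):
    """Count how many values fall in each bin [left, left + bin_width)."""
    counts = {}
    for value in values:
        # Find the left edge of the bin that contains this value.
        left = start + ((value - start) // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical data: number of dogs seen on each of ten days.
dogs_seen = [9, 11, 12, 13, 13, 13, 14, 15, 16, 18]

narrow = bin_counts(dogs_seen, bin_width=2, start=8)
wide = bin_counts(dogs_seen, bin_width=5, start=5)

# Draw each bin as a row of '#' characters, one per value in the bin.
for left, count in narrow.items():
    print(f"{left}-{left + 1}: {'#' * count}")
for left, count in wide.items():
    print(f"{left}-{left + 4}: {'#' * count}")
```

Both sets of bins describe the same ten values; only the grouping differs.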
The two histograms below illustrate the number of dogs Josh saw on
weekdays versus weekends. The measures of center for both
histograms (mean, median, mode) are basically the same and
centered about the highest bin for both histograms, 13.
Visually, the difference between the histograms is the range or
spread of dogs Josh sees during each time period. In the upcoming
lessons, we will discuss the most common ways to measure the
spread of our data.

Calculating the 5 Number Summary


The five-number summary consists of 5 values:

1. Minimum: The smallest number in the dataset.

2. $Q_1$: The value such that 25% of the data fall below.

3. $Q_2$: The value such that 50% of the data fall below.

4. $Q_3$: The value such that 75% of the data fall below.

5. Maximum: The largest value in the dataset.

In the above video, we saw that calculating each of these values
was essentially just finding the median of a bunch of different
datasets. Because we are essentially calculating a bunch of
medians, the calculation depends on whether we have an odd or
even number of values.
Range
The range is then calculated as the difference between
the maximum and the minimum.

IQR
The interquartile range is calculated as the difference
between $Q_3$ and $Q_1$.

In the upcoming sections, you will practice this with Katie and on
your own.
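
The five-number summary, range, and IQR can be sketched in Python as follows. This is one possible approach, not the only one: it assumes the "median of the halves" idea described above and, as a further assumption, leaves the middle value out of both halves when the number of values is odd. Software packages often use slightly different quartile rules, so their answers may differ a little. The data are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return min, Q1, Q2, Q3, and max using medians of the two halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values below the middle position
    upper = data[half + n % 2:]  # values above it (skips the middle if n is odd)
    return {
        "min": data[0],
        "q1": median(lower),
        "q2": median(data),
        "q3": median(upper),
        "max": data[-1],
    }


# Hypothetical dataset with an even number of values.
summary = five_number_summary([1, 3, 5, 8, 9, 10, 12, 15])
value_range = summary["max"] - summary["min"]
iqr = summary["q3"] - summary["q1"]

print(summary)      # {'min': 1, 'q1': 4.0, 'q2': 8.5, 'q3': 11.0, 'max': 15}
print(value_range)  # 14
print(iqr)          # 7.0
```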
Looking back at the histograms Josh created for the number of dogs
he recorded seeing on weekdays and weekends, we can use the
histograms to mark the values of the 5 number summary and create
a box plot.

• Box plots are useful for quickly comparing the spread of two
data sets across some key metrics, like quartiles, maximum,
and minimum.

How do we create the box plot?

1. The beginning of the line to the left of the box and the end of
the line to the right of the box represent the minimum and
maximum values in a dataset.

2. The visual distance between these markings is an indication of
the range of the values.

3. The box itself represents the IQR. The box begins at the $Q_1$
value, ends at the $Q_3$ value, and $Q_2$, or the median, is
represented by a line within the box.

From both the histograms and box plots, we can see that the
number of dogs seen on weekends varies much more than on
weekdays.

However, instead of depending on a visual of the 5 number
summary to compare our data, in the next lesson, we will learn
about using a single value to compare the spread of the two
distributions: the standard deviation.

Standard Deviation and Variance


Video Transcript
0:00
The most common way that professionals measure the spread of a
data
0:04
set with a single value is with the standard deviation or variance.
0:09
Here, we will focus on the standard deviation,
0:13
but we will actually learn how to calculate the variance in the
process.
0:17
If you have never heard of these measures before,
0:20
this calculation will probably look pretty complex.
0:23
When all's said and done with this calculation,
0:26
the standard deviation will tell us on average
0:30
how far every point is from the mean of the points.
0:35
As a quick mental picture,
0:37
imagine we wanted to know how far employees were located from
their place of work.
0:42
One person might be 15 miles,
0:45
another 35, another only one mile,
0:49
and another might be remote and is 103 miles.
0:52
We could aggregate all of these distances together to show that
0:56
the average distance employees are located from their work is 18
miles.
1:02
But now, we want to know how the distance to work varies from one
employee to the next.
1:09
We could use the five number summary as a description.
1:12
But if we wanted just one number to talk about the spread,
1:16
we'd probably choose the standard deviation.
1:19
The standard deviation is on average how much each observation
varies from the mean.
1:26
For this example this is,
1:28
how much on average the distance each person is from
1:32
work differs from the average distance all of them are from work.
1:37
So, this one is three miles farther from work than the average
1:41
while this individual is four miles closer to work than the average.
1:45
The standard deviation is how far on
1:48
average are individuals located from this mean distance.
1:54
So, it is like the average of all of these distances.
1:58
We will take a closer look at this but hopefully this gives you
2:01
a strong conceptual understanding of what we'll be calculating in
the next sections.
Standard Deviation and Variance
The standard deviation is one of the most common measures for
talking about the spread of data. It is defined as the average
distance of each observation from the mean.

In the above video, we saw this as how far individuals were from the
average distance from work. The example distances shown are taken
from the full dataset; the mean of just those 4 numbers is 38.5. The
mean of 18 shown later in the video is the mean of the full dataset,
which is not shown in the video. In the next video, you will
see exactly how this is calculated.

Standard Deviation Calculation

Video Transcript
0:00
In the last video,
0:01
we got an idea of what the standard deviation is measuring.
0:05
In this video, we will look at
0:07
the math that actually occurs when calculating this measure.
0:11
We will work with data to calculate this measure as well as associate
notation with it.
0:18
It is worth noting that after this lesson,
0:21
you probably won't calculate this measure by hand ever again,
0:25
because you'll learn software to do it for you.
0:27
But calculating it yourself will give you intuition behind what it's
actually doing.
0:34
And this intuition is necessary to become good at
0:37
understanding data and choosing the right analysis for your
situation.
0:42
Imagine we have a data set with four values,
0:45
10, 14, 10, and 6.
0:49
The first thing we need to do to calculate the standard deviation is
to find the mean.
0:55
In notation, we have this as X bar.
0:58
For our values, the sum is 40,
1:01
and we have four numbers.
1:04
So the mean is 40 over 4 or 10.
1:09
Then we want to look at the distance of each observation from this
mean.
1:14
Two of these observations are exactly equal to the mean.
1:18
So the distance here is zero.
1:21
One is 4 larger, the 14,
1:24
while the other is 4 smaller, the 6.
1:27
In notation, each of these is x_i minus x bar.
1:34
Then, if we were to average these distances,
1:37
the positive would cancel with the negative value.
1:41
And the value of zero isn't a great measure of the spread here.
1:45
Zero would suggest that all the values are the same or that there's
no spread.
1:52
So instead, we need to make all of these values positive.
1:56
The way we do this when calculating the standard deviation is by
squaring them all.
2:02
If we do that here,
2:03
our negative and positive 4 values will become 16s.
2:09
Now, we could average these to find the average
2:13
squared distance of each observation from the mean.
2:18
This is called the variance.
2:20
Finding the average, just as we did before,
2:24
means adding all of these values and dividing by how many there
are.
2:29
In our case, we had 0, 16,
2:33
0, 16 and we divide by 4 because we have four observations.
2:38
However, this is an average of
2:41
squared values which we only did to get positive values in the first
place.
2:48
So to get our standard deviation,
2:49
we take the square root of this ending value.
2:53
Here, our standard deviation is 2.83.
2:57
So this is on average how far each point in our data set is from the
mean,
3:03
which is the definition of the standard deviation.
Note: at 2:00, the 4 in (14 - 10) = 4 = 16 should be squared. So it
should be $(14 - 10)^2 = 4^2 = 16$.

Example: Calculating the Standard Deviation


The dataset for the example is 10, 14, 10, 6.

1. First, calculate the mean:

$\bar{x} = \frac{\sum_{i=1}^{4} x_i}{n} = \frac{40}{4} = 10$

2. Next, calculate the distance of each observation from the
mean and square the value:

$(x_i - \bar{x})^2$:
$(10 - 10)^2 = 0^2 = 0$
$(14 - 10)^2 = 4^2 = 16$
$(10 - 10)^2 = 0^2 = 0$
$(6 - 10)^2 = (-4)^2 = 16$

3. Then calculate the variance, the average squared difference
of each observation from the mean:

$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{4}(0 + 16 + 0 + 16) = \frac{32}{4} = 8$

4. Finally, calculate the standard deviation, the square root of
the variance:

$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{8} \approx 2.83$

The standard deviation is, on average, how far each point in our
dataset is from the mean.
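
The same four steps can be followed in a short Python sketch. It uses the dataset from this example and divides by n, as the calculation above does; the variable names are just for illustration.

```python
import math

data = [10, 14, 10, 6]
n = len(data)

# Step 1: the mean.
mean = sum(data) / n

# Step 2: the squared distance of each observation from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: the variance, the average of the squared distances.
variance = sum(squared_diffs) / n

# Step 4: the standard deviation, the square root of the variance.
std_dev = math.sqrt(variance)

print(mean)               # 10.0
print(squared_diffs)      # [0.0, 16.0, 0.0, 16.0]
print(variance)           # 8.0
print(round(std_dev, 2))  # 2.83
```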

Other Measures of Spread


5 Number Summary
In the previous sections, we have seen how to calculate the values
associated with the five-number summary (min, $Q_1$, $Q_2$, $Q_3$,
max), as well as the measures of spread associated with
these values (range and IQR).

For datasets that are not symmetric, the five-number summary
and a corresponding box plot are a great way to get started with
understanding the spread of your data. Although I still prefer a
histogram in most cases, box plots can make it easier to compare
two or more groups. You will see this in the quizzes towards the
end of this lesson.

Variance and Standard Deviation


Two additional measures of spread that are used all the time are
the variance and standard deviation. At first glance, the variance
and standard deviation can seem overwhelming. If you do not
understand the expressions below, don't panic! In this section, I just
want to give you an overview of what the next sections will cover.
We will walk through each of these parts thoroughly in the next few
sections, but the big picture goal is to generally understand the
following:

1. How the mean, variance, and standard deviation are
calculated.

2. Why the measures of variance and standard deviation make
sense to capture the spread of our data.

3. Fields where you might see these values used.

4. Why we might use the standard deviation or variance as
opposed to the values associated with the 5 number summary
for a particular dataset.

Calculation
We calculate the variance in the following way:

$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$

The variance is the average squared difference of each
observation from the mean.

To calculate the variance of a set of 10 values in a spreadsheet
application, with our 10 data points in cells A1 through A10, we
would create a new column B by typing something like
=A1-AVERAGE(A$1:A$10) in cell B1 and copying this down for all 10
rows. This gives us the difference between each data point and the
mean of all the data. Then we create a new column C holding the
square of these differences, using the formula =B1^2 in cell C1
and copying that down for all rows. Then in the cell below this new
column, cell C11, type in =SUM(C1:C10). This adds up all the
values in column C. Finally, in cell C12, we divide this sum by the
number of data points we have, in this case ten: =C11/10. Cell
C12 now contains the variance for our 10 data points.

More detailed guidance on using spreadsheets like this may be
included in a future lesson in your program.

The standard deviation is the square root of the variance. Therefore,
the formula for the standard deviation is the following:

$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

In the same spreadsheet as above, to find the standard deviation of
our same set of 10 data values, we would use another cell like C13
to take the square root of our variance measure, by typing
in =SQRT(C12).
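
For readers who want to see the spreadsheet steps as code, here is a rough Python sketch that mirrors the columns and cells described above. The ten values in `column_a` are hypothetical stand-ins for cells A1 through A10, and the variable names simply echo the cell names.

```python
import math

# Hypothetical values standing in for cells A1:A10.
column_a = [4, 8, 15, 16, 23, 42, 7, 9, 11, 5]

average = sum(column_a) / len(column_a)     # AVERAGE(A$1:A$10)
column_b = [a - average for a in column_a]  # =A1-AVERAGE(A$1:A$10), copied down
column_c = [b ** 2 for b in column_b]       # =B1^2, copied down

c11 = sum(column_c)         # =SUM(C1:C10)
c12 = c11 / len(column_a)   # =C11/10, the variance
c13 = math.sqrt(c12)        # =SQRT(C12), the standard deviation

print(c12)            # 117.0
print(round(c13, 2))  # 10.82
```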

The standard deviation is a measurement that has the same units as
our original data, while the units of the variance are the square of
the units in our original data. For example, if the units in our original
data were dollars, then units of the standard deviation would also be
dollars, while the units of the variance would be dollars squared.

Again, this section is designed as background knowledge for
the following sections. If it doesn't make sense on this first pass,
do not worry. You will be guided in future sections in performing
these calculations, and building your intuition, as you work through
an example using the salary data. Then we will provide context
about why these calculations are important, and where you might
see them!
Standard deviation is a common metric used to compare the
spread of two datasets. The benefits of using a single metric instead
of the 5 number summary are:

• It simplifies the amount of information needed to give a
measure of spread

• It is useful for inferential statistics

Important Final Points


1. The variance is used to compare the spread of two different
groups. A set of data with higher variance is more spread out
than a dataset with lower variance. Be careful though, there
might just be an outlier (or outliers) that is increasing the
variance when most of the data are actually very close.
2. When comparing the spread between two datasets, the units of
each must be the same.
3. When data are related to money or the economy, higher
variance (or standard deviation) is associated with higher risk.
4. The standard deviation is used more often in practice than the
variance because it shares the units of the original dataset.
Use in the World
The standard deviation is associated with risk in finance, assists in
determining the significance of drugs in medical studies, and
measures the error of our results for predicting anything from the
amount of rainfall we can expect tomorrow to your predicted
commute time tomorrow.

These applications are beyond the scope of this lesson as they
pertain to specific fields, but know that understanding the spread of
a particular set of data is extremely important to many areas. In this
lesson, you mastered the calculation of the most common measures
of spread.
Variable Types
We have covered a lot up to this point! We started with identifying
data types as either categorical or quantitative. We then learned we could
identify quantitative variables as either continuous or discrete. We also
found we could identify categorical variables as
either ordinal or nominal.

Categorical Variables
When analyzing categorical variables, we commonly just look at the
count or percent of a group that falls into each level of a category.
For example, suppose we had two levels of a dog category: lab and not lab.
We might say 32% of the dogs were labs (percent), or we might say
32 of the 100 dogs I saw were labs (count).

However, the 4 aspects associated with describing quantitative
variables are not used to describe categorical variables.

Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:

1. Measures of Center

2. Measures of Spread

3. Shape of the Distribution

4. Outliers

We looked at calculating measures of Center


1. Means

2. Medians

3. Modes

We also looked at calculating measures of Spread

1. Range

2. Interquartile Range

3. Standard Deviation

4. Variance

Calculating Variance
We saw that we could calculate the variance as:

$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$

You will also see:

$\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

The reason for this is beyond the scope of what we have covered
thus far, but you can find an explanation in the Supporting
Materials below.

You can commonly find answers to your questions with a
quick Google search. Now is a great time to
get started with this practice! This answer should make more sense
at the completion of this lesson.
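
To see that the two versions of the formula give different numbers, here is a brief Python sketch using the standard library's `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The data are the four values from the earlier standard deviation example; this only shows the difference in results, not the reason behind it.

```python
from statistics import pvariance, variance

data = [10, 14, 10, 6]

divide_by_n = pvariance(data)         # sum of squared differences / n
divide_by_n_minus_1 = variance(data)  # sum of squared differences / (n - 1)

print(divide_by_n)                    # 8
print(round(divide_by_n_minus_1, 2))  # 10.67
```

The gap between the two shrinks as the number of observations grows.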

Standard Deviation vs. Variance


The standard deviation is the square root of the variance. In
practice, you usually use the standard deviation rather than the
variance. This is because the standard deviation
shares the same units with our original data, while the variance has
squared units.
What Next?
In the next sections, we will be looking at the last two aspects of
quantitative variables: shape and outliers. What we know about
measures of center and measures of spread will assist in your
understanding of these final two aspects.
Supporting Materials

• Calculating Variance

Histograms
We learned how to build a histogram in this video, as this is
the most popular visual for quantitative data.

Shape
From a histogram, we can quickly identify the shape of our
data, which helps influence all of the measures we learned in
the previous concepts. We learned that the distribution of our
data is frequently associated with one of three shapes:

1. Right-skewed
2. Left-skewed
3. Symmetric (frequently normally distributed)

Summary
Shape | Mean vs. Median | Real-World Applications
Symmetric (Normal) | Mean equals Median | Height, Weight, Errors, Precipitation
Right-skewed | Mean greater than Median | Amount of drug remaining in a bloodstream, Time between phone calls at a call center, Time until light bulb dies
Left-skewed | Mean less than Median | Grades as a percentage in many universities, Age of death, Asset price changes

The mode of a distribution is essentially the tallest bar in a
histogram. There may be multiple modes depending on the
number of peaks in our histogram.
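
The relationship between the mean and the median for each shape can be checked with a small Python sketch. The three datasets below are hypothetical and were chosen only to illustrate each shape; the helper name `compare_mean_median` is made up for this example.

```python
from statistics import mean, median

# Hypothetical datasets, one for each shape.
shapes = {
    "symmetric": [1, 2, 3, 4, 5],
    "right-skewed": [1, 2, 2, 3, 3, 4, 20],      # long tail of large values
    "left-skewed": [1, 17, 18, 18, 19, 19, 20],  # long tail of small values
}


def compare_mean_median(values):
    """Describe how the mean relates to the median for these values."""
    m, med = mean(values), median(values)
    if m > med:
        return "mean greater than median"
    if m < med:
        return "mean less than median"
    return "mean equals median"


results = {name: compare_mean_median(values) for name, values in shapes.items()}
for name, result in results.items():
    print(f"{name}: {result}")
```

Real data will rarely give a mean exactly equal to the median, so treat the symmetric case as "approximately equal" in practice.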

The Shape For Data In The World


Video Transcript
0:00
If you're working with data,
0:02
you can always build a Quick Plot to see the shape.
0:07
Just to apply some context,
0:09
some examples of approximately Bell-Shaped data include heights
and weights,
0:15
standardized test scores, precipitation amounts,
0:19
the mean of a distribution,
0:20
or errors in manufacturing processes.
0:23
Common data that follow Left Skewed Distributions include GPAs,
0:27
the age of death,
0:29
and asset price changes.
0:32
Common data that follow approximately
0:34
Right Skewed Distributions include the amount of drug left in your
bloodstream over time,
0:38
the distribution of wealth,
0:40
and human athletic abilities.
0:43
There are links below in
0:44
the instructor notes in case you want to learn more about each of
these cases.
0:48
Though these three, Right Skewed,
0:51
Left Skewed and Symmetric,
0:53
are the most common distributions,
0:55
data in the real world can be messy and it might not follow any of
these distributions.
1:00
We will talk about this more in the next section.
When working with data, building a quick plot lets you quickly see
the shape of your data.

Distribution Shape | Types of Data
Bell Shaped | Heights, Weight, Scores
Left Skewed | GPA, Age of Death, Price
Right Skewed | Distribution of Wealth, Athletic Abilities

References
These are the references used to pull the applications of each
shape.

• Quora
• University of Texas
• Stack Exchange
Common Techniques
When outliers are present, we should consider the following points.

1. Note that they exist and their impact on summary statistics.

2. If an outlier is a typo, remove or fix it.

3. Understand why they exist and their impact on the questions we
are trying to answer about our data.

4. Report the five-number summary, which is often a better
indication than measures like the mean and standard deviation
when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.
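
To see why the five-number summary can be a better indication when outliers are present, consider the following minimal Python sketch. The salary figures are invented, five_number_summary is a hypothetical helper, and the quartiles use the "inclusive" method of statistics.quantiles, which is only one of several quartile conventions and may differ slightly from the one used in the course.

```python
from statistics import mean, median, stdev, quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # 'inclusive' is an assumption; other quartile conventions exist.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical salaries in thousands (made up for this example).
salaries = [40, 42, 45, 47, 50, 52, 55]
with_outlier = salaries + [400]

for label, data in [("no outlier", salaries), ("with outlier", with_outlier)]:
    print(label)
    print("  mean:", round(mean(data), 2), " std:", round(stdev(data), 2))
    print("  five-number summary:", five_number_summary(data))
```

Adding one extreme value moves the mean and standard deviation a great deal, while the quartiles and median barely change. Only the maximum in the five-number summary reflects the outlier, which makes it easy to spot and report.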

Outliers Advice
Below are my guidelines for working with any column (random
variable) in your dataset.

1. Plot your data to identify whether you have outliers.

2. Handle outliers accordingly via the previous methods.

3. If there are no outliers and your data follow a normal
distribution, use the mean and standard deviation to describe your
dataset, and report that the data are normally distributed.

Side note
If you aren't sure whether your data are normally distributed, there
are plots called normal quantile plots and statistical methods like
the Kolmogorov-Smirnov test that are aimed at helping you understand
whether or not your data are normally distributed. Implementing this
test is beyond the scope of this class, but it is worth knowing that
it exists.

4. If you have skewed data or outliers, use the five-number
summary to summarize your data and report the outliers.
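
The guidelines above could be sketched in Python as follows. This is an illustration under stated assumptions, not the course's method: the lesson recommends plotting to find outliers, whereas this sketch swaps in the common 1.5 x IQR rule of thumb, and it does not check normality at all. The function names and data are hypothetical.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the quartiles.

    The 1.5 * IQR rule is a common convention, assumed here; it is not
    taken from the lesson.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def suggested_summary(values):
    """Suggest which summary to report, following the guidelines above."""
    if find_outliers(values):
        return "five-number summary (and report the outliers)"
    # Normality still has to be checked separately, e.g. with a plot.
    return "mean and standard deviation (if the data look normal)"


# Hypothetical data (made up for this example).
clean = [40, 42, 45, 47, 50, 52, 55]
messy = clean + [400]

print(find_outliers(messy), "->", suggested_summary(messy))
print(find_outliers(clean), "->", suggested_summary(clean))
```

A rule like this is only a screening aid. Whether a flagged value is a typo or a genuine observation still has to be decided case by case.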
Recap

Variable Types
We have covered a lot up to this point! We started with identifying
data types as either categorical or quantitative. We then learned we could
identify quantitative variables as either continuous or discrete. We also
found we could identify categorical variables as
either ordinal or nominal.

Categorical Variables
When analyzing categorical variables, we commonly just look at the
count or percent of a group that falls into each level of a category.
For example, if we had two levels of a dog category, lab and not lab,
we might say 32% of the dogs were labs (percent), or we might say
32 of the 100 dogs I saw were labs (count).

However, the four aspects associated with describing quantitative
variables are not used to describe categorical variables.
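
The dog example can be reproduced with a short Python sketch. The list of 100 dogs is constructed to match the numbers in the text (32 labs), and the variable names are hypothetical.

```python
from collections import Counter

# 100 observed dogs, built to match the example: 32 labs, 68 not labs.
dogs = ["lab"] * 32 + ["not lab"] * 68

# Count: how many observations fall into each level of the category.
counts = Counter(dogs)

# Percent: each count as a share of all observations.
percents = {level: 100 * n / len(dogs) for level, n in counts.items()}

for level in counts:
    print(f"{level}: {counts[level]} of {len(dogs)} ({percents[level]:.0f}%)")
```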
Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:

1. Measures of Center

2. Measures of Spread

3. Shape of the Distribution

4. Outliers

Measures of Center
We looked at calculating three measures of center:

1. Means

2. Medians

3. Modes
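
As a brief refresher, all three measures of center can be computed with Python's standard statistics module. The data below are made up for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical data (made up for this example).
data = [1, 2, 2, 3, 3, 10]

center = {
    "mean": mean(data),         # sum of the values divided by their count
    "median": median(data),     # middle value (average of the two middle values here)
    "modes": multimode(data),   # every value tied for most frequent
}

print(center)
```

Using multimode rather than mode reflects the earlier point that a dataset may have more than one mode. Here both 2 and 3 appear twice.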

Measures of Spread
We also looked at calculating four measures of spread:

1. Range

2. Interquartile Range

3. Standard Deviation

4. Variance
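
The four measures of spread can be sketched the same way. Two choices here are assumptions rather than something this recap specifies: the quartiles use the "inclusive" method of statistics.quantiles, and the variance and standard deviation are the population versions (pvariance, pstdev). The sample versions (variance, stdev) divide by n - 1 instead of n.

```python
from statistics import pstdev, pvariance, quantiles

# Hypothetical data (made up for this example).
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: distance between the largest and smallest values.
data_range = max(data) - min(data)

# Interquartile range: distance between the third and first quartiles.
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Variance: average squared distance from the mean (population version).
var = pvariance(data)

# Standard deviation: square root of the variance.
std = pstdev(data)

print("range:", data_range, "IQR:", iqr, "variance:", var, "std:", std)
```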

Shape
We learned that the distribution of our data is frequently associated
with one of the three shapes:
1. Right-skewed

2. Left-skewed

3. Symmetric (frequently normally distributed)

Depending on the shape associated with our dataset, certain
measures of center or spread may be better for summarizing our
dataset.

When we have data that follows a normal distribution, we can
completely understand our dataset using the mean and standard
deviation.

However, if our dataset is skewed, the five-number summary (and
measures of center associated with it) might be better to summarize
our dataset.

Outliers
We learned that outliers have a larger influence on measures like
the mean than on measures like the median. We learned that we
should work with outliers on a case-by-case basis. Common
techniques include:

1. At least note that they exist and their impact on summary
statistics.

2. If an outlier is a typo, remove or fix it.

3. Understand why they exist and their impact on the questions we
are trying to answer about our data.

4. Report the five-number summary, which is often a better
indication than measures like the mean and standard deviation
when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.


Histograms and Box Plots
We also looked at histograms and box plots to visualize our
quantitative data. Identifying outliers and the shape of the
distribution of our data is easier with a visual than with
summary statistics alone.

What Next?
Up to this point, we have only looked at Descriptive Statistics,
because we are describing our collected data. In the final sections of
this lesson, we will be looking at the difference
between Descriptive Statistics and Inferential Statistics.

Descriptive vs. Inferential Statistics


In this section, we learned about how Inferential Statistics differs
from Descriptive Statistics.

Descriptive Statistics
Descriptive statistics is about describing our collected data using
the measures discussed throughout this lesson: measures of center,
measures of spread, the shape of our distribution, and outliers. We
can also use plots of our data to gain a better understanding.

Inferential Statistics
Inferential statistics is about using our collected data to draw
conclusions about a larger population. Performing inferential
statistics well requires that we take a sample that accurately
represents our population of interest.

A common way to collect data is via a survey. However, surveys
may be extremely biased depending on the types of questions that
are asked and the way the questions are asked. This is a topic you
should think about when tackling the first project.

We looked at specific examples that allowed us to identify the
following:

1. Population - our entire group of interest

2. Parameter - a numeric summary about a population

3. Sample - a subset of the population

4. Statistic - a numeric summary about a sample

Looking Ahead
Though we will not be diving deep into inferential statistics within
this course, you are now aware of the difference between these two
branches of statistics. If you have ever conducted a hypothesis test
or built a confidence interval, you have performed inferential
statistics. The way we perform inferential statistics is changing as
technology evolves. Many career paths involving Machine
Learning and Artificial Intelligence are aimed at using collected
data to draw conclusions about entire populations at an individual
level. It is an exciting time to be a part of this space, and you are
now well on your way to joining the other practitioners!
