Notation
There is a lot going on in this video - here is a recap of the big ideas.
Rows and Columns
If you aren't familiar with spreadsheets, they will be covered in detail in future lessons.
Spreadsheets are a common way to hold data. They are composed of rows and
columns. Rows run horizontally, while columns run vertically. Each column in a
spreadsheet commonly holds a specific variable, while each row is commonly called
an instance or individual.
The example used in the video is shown below.
Date     Day of Week  Time Spent On Site (X)  Buy (Y)
June 15  Thursday     5                       No
This is a row:
Date Day of Week Time Spent On Site (X) Buy (Y)
June 15 Thursday 5 No
This is a column:
Time Spent On Site (X)
10
20
A random variable is a placeholder for the possible values of some process (the
term 'some process' is admittedly a bit ambiguous). As was stated before, notation is
useful because it helps us take complex ideas and simplify them, often to a single
letter or symbol. Random variables are represented by capital letters (X, Y, and Z are
common choices).
We might have the random variable X, which is a placeholder for the possible values of
the amount of time someone spends on our site. Or the random variable Y, which is a
placeholder for the possible values of whether or not an individual purchases a product.
X holds the values that could possibly occur for the amount of time spent on
our website. In principle, that could be any number from 0 upward.
Example Dataset
An example of the data we might have collected in the previous
video is shown here:
Date     Day of Week  Time Spent On Site (X)  Buy (Y)
June 15  Thursday     5                       No
June 15  Thursday     10                      Yes
June 16  Friday       20                      Yes
Question 1 of 2
What type of variable is the random variable X in the video in the
previous concept?
Categorical - Ordinal
Categorical - Nominal
Quantitative - Continuous
Quantitative - Discrete
Question 2 of 2
What type of variable is the random variable Y in the video in the
previous concept?
Categorical - Ordinal
Categorical - Nominal
Quantitative - Continuous
Quantitative - Discrete
Example 1
For example, the amount of time someone spends on our
site is a random variable (we are not sure what the outcome will
be for any particular visitor), and we would notate this with X. Then
when the first person visits the website, if they spend 5 minutes, we
have now observed this outcome of our random variable. We would
notate any outcome as a lowercase letter with a subscript
associated with the order that we observed the outcome.
Example 2
Taking this one step further, we could ask: What is the probability
of an individual spending more than 20 minutes on our website? We
could notate this as:
P(X > 20)?
We could find this in the above example by noticing that only one of
the 5 observations (the 45) exceeds 20. So, we would say there is a
1 in 5, or 20%, chance that an individual spends more than 20
minutes on our website (based on this dataset).
Example 3
If we asked: What is the probability of an individual spending
20 or more minutes on our website? We could notate this as:
P(X ≥ 20)?
We could then find this by noticing there are two out of the five
individuals that spent 20 or more minutes on the website. So this
probability is 2 out of 5 or 40%.
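The two probabilities above can be checked with a short Python sketch. The values 5, 10, 20, and 45 appear in this lesson; the full five-observation dataset is not shown here, so the value 15 is an assumed stand-in for the fifth observation, and the helper name `proportion` is made up for illustration.

```python
# Observations of X (minutes spent on the site).
# 5, 10, 20 and 45 appear in the lesson; 15 is an ASSUMED fifth value.
times = [5, 10, 15, 20, 45]


def proportion(values, condition):
    """Return the share of observations for which condition(value) is True."""
    matches = sum(1 for v in values if condition(v))
    return matches / len(values)


p_more_than_20 = proportion(times, lambda x: x > 20)   # P(X > 20)
p_at_least_20 = proportion(times, lambda x: x >= 20)   # P(X >= 20)

print(p_more_than_20)  # 0.2
print(p_at_least_20)   # 0.4
```

The only difference between the two questions is whether the observation equal to 20 is counted, which is why the second probability is larger.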
$x_1 + x_2 + x_3$
If we want to add 6 values together, we would use the notation:
$x_1 + x_2 + x_3 + x_4 + x_5 + x_6$
To extend this to add one hundred, one thousand, or one million
values would be ridiculous! How can we make this easier to
communicate?!
Aggregations
An aggregation is a way to turn multiple numbers into fewer
numbers (commonly one number).
Example 1
Imagine we are looking at the amount of time individuals spend on
our website. We collect data from nine individuals:
If we want to sum the first three values together, we could write:
$x_1 + x_2 + x_3$
In our new notation, we can write:
$\sum_{i=1}^{3} x_i$
$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 = 10 + 20 + 45 = 75$
Example 2
Now, imagine we want to sum the last three values together.
$x_7 + x_8 + x_9$
In our new notation, we can write:
$\sum_{i=7}^{9} x_i$
Notice, our notation starts at the seventh observation ($i = 7$) and
ends at 9 (the number at the top of our summation).
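As a rough illustration, the summation notation maps directly onto a loop over indices. In the sketch below only the first three values (10, 20, 45) come from this lesson; the other six values and the helper name `sigma` are assumptions made up for the example.

```python
# Time on site for nine visitors.
# Only 10, 20, 45 come from the lesson; the remaining six values are HYPOTHETICAL.
x = [10, 20, 45, 5, 30, 15, 25, 35, 40]


def sigma(values, start, stop):
    """Sum x_start through x_stop, using 1-based inclusive indices as in the notation."""
    return sum(values[i - 1] for i in range(start, stop + 1))


first_three = sigma(x, 1, 3)  # x_1 + x_2 + x_3
last_three = sigma(x, 7, 9)   # x_7 + x_8 + x_9

print(first_three)  # 75
print(last_three)   # 100
```

The `i - 1` is needed because Python lists start counting at 0, while the notation starts counting observations at 1.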
Other Aggregations
The Σ sign is used for aggregating using summation, and summing is
one of the most common ways to aggregate. However, we might need to
aggregate in other ways. If we wanted to multiply all of our
values together, we would use a product sign Π, the capital Greek
letter pi. The way we aggregate continuous values is with something
known as integration (a common technique in calculus), which uses
the symbol ∫, which is just a long s. We will not be using
integrals or products for quizzes in this class, but you may see them
in the future!
$\frac{1}{n}\sum_{i=1}^{n} x_i$
We also could index using any other letter, not just $i$. We could just
as easily use $j$, $k$, or $m$ to index each of our data values. The
quizzes on the next concept will help reinforce this idea.
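A small sketch of these aggregations in Python, using the three values 10, 20, and 45 that were summed earlier. The variable names are illustrative only, and `math.prod` assumes Python 3.8 or later.

```python
import math

values = [10, 20, 45]  # the three values summed earlier in the lesson
n = len(values)

# Summation: the index letter is only a label, so i and j give the same total.
total_i = sum(values[i] for i in range(n))
total_j = sum(values[j] for j in range(n))

# Product: multiply all of the values together (the capital pi aggregation).
product = math.prod(values)

# Mean: the sum of the values divided by how many values there are.
mean = total_i / n

print(total_i, total_j)  # 75 75
print(product)           # 9000
print(mean)              # 25.0
```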
Notice
At second 0:12, this should
say $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5$.
The $x_i$ is missing here in front of the summation.
Notation Recap
Notation is an essential tool for communicating mathematical ideas.
We have introduced the fundamentals of notation in this lesson that
will allow you to read, write, and communicate with others using
your new skills!
In the next section, you will see this notation used to assist in your
understanding of calculating various measures of spread. Notation
can take time to fully grasp. Understanding notation not only helps
in conveying mathematical ideas but also in writing computer
programs - if you decide you want to learn that too! Soon you will
analyze data using spreadsheets. When that happens, many of
these operations will be hidden by the functions you will be using.
But until we get to spreadsheets, it is important to understand how
mathematical ideas are commonly communicated. This isn't easy,
but you can do it!
Lesson Overview
In this lesson, we will continue to cover more topics related to
analyzing quantitative variables and you will learn to use measures
of spread. Measures of spread are used to provide us an idea of
how spread-out our data are from one another.
Range
Standard Deviation
Variance
Analyze outliers
Throughout this lesson, you will learn how to calculate these, as well
as why we would use one measure of spread over another.
Histograms
Histograms are super useful for understanding the different aspects
of data and they are the most common visual used for quantitative
data. In the upcoming concepts, you will see histograms used all the
time to help you understand the four aspects we outlined earlier
regarding a quantitative variable:
center
spread
shape
outliers
1. Minimum: The smallest value in the dataset.
2. Q1: The value such that 25% of the data fall below.
3. Q2: The value such that 50% of the data fall below.
4. Q3: The value such that 75% of the data fall below.
5. Maximum: The largest value in the dataset.
IQR
The interquartile range is calculated as the difference
between Q3 and Q1.
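The sketch below shows one way to compute these values in Python. It assumes the "median of each half" convention for quartiles (other conventions exist and can give slightly different answers). The data and the function name `five_number_summary` are hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return min, Q1, Q2, Q3, max using the median-of-halves convention."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # With an odd number of values, the middle value is left out of both halves.
    upper = values[half:] if n % 2 == 0 else values[half + 1:]
    return {
        "min": values[0],
        "q1": median(lower),
        "q2": median(values),
        "q3": median(upper),
        "max": values[-1],
    }


data = [8, 1, 5, 2, 7, 3, 6, 4]  # HYPOTHETICAL data, not from the lesson
summary = five_number_summary(data)
iqr = summary["q3"] - summary["q1"]            # interquartile range
data_range = summary["max"] - summary["min"]   # range

print(summary)     # {'min': 1, 'q1': 2.5, 'q2': 4.5, 'q3': 6.5, 'max': 8}
print(iqr)         # 4.0
print(data_range)  # 7
```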
In the upcoming sections, you will practice this with Katie and on
your own.
Looking back at the histograms Josh created for the number of dogs
he recorded seeing on weekdays and weekends, we can use the
histograms to mark the values of the 5 number summary and create
a box plot.
Box plots are useful for quickly comparing the spread of two
data sets across some key metrics, like quartiles, maximum,
and minimum.
1. The beginning of the line to the left of the box and the end of
the line to the right of the box represent the minimum and
maximum values in a dataset.
3. The box itself represents the IQR. The box begins at the Q1
value, ends at the Q3 value, and Q2, or the median, is
represented by a line within the box.
From both the histograms and box plots, we can see that the
number of dogs seen on weekends varies much more than on
weekdays.
In the above video, we saw this as how far individuals were from the
average distance from work. The distances shown are examples taken
from the full data set; the mean of just those 4 numbers is 38.5.
The mean of 18 shown later in the video is the mean of the full
data set, which is not shown in the video. In the next video, you will
see exactly how this is calculated.
should be $(14 - 10)^2 = 4^2 = 16$
1. First, calculate the mean:
$\bar{x} = \frac{\sum_{i=1}^{4} x_i}{n} = \frac{40}{4} = 10$
2. Next, calculate the distance of each observation from the
mean and square the value:
$(x_i - \bar{x})^2 =$
$(10 - 10)^2 = 0^2 = 0$
$(14 - 10)^2 = 4^2 = 16$
$(10 - 10)^2 = 0^2 = 0$
$(6 - 10)^2 = (-4)^2 = 16$
3. Then calculate the variance, the average squared difference
of each observation from the mean:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{4}(0 + 16 + 0 + 16) = \frac{32}{4} = 8$
4. Finally, calculate the standard deviation, the square root of
the variance:
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{8} \approx 2.83$
The standard deviation is, on average, how far each point in our
dataset is from the mean.
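The same calculation can be sketched step by step in Python with the four values from the worked example (10, 14, 10, 6). The variable names are illustrative only.

```python
import math

data = [10, 14, 10, 6]  # the four observations from the worked example
n = len(data)

# Step 1: the mean.
mean = sum(data) / n

# Step 2: squared distance of each observation from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: the variance is the average of those squared distances.
variance = sum(squared_diffs) / n

# Step 4: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean)               # 10.0
print(squared_diffs)      # [0.0, 16.0, 0.0, 16.0]
print(variance)           # 8.0
print(round(std_dev, 2))  # 2.83
```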
Calculation
We calculate the variance in the following way:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
The variance is the average squared difference of each
observation from the mean.
The standard deviation is the square root of the variance:
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
In the same spreadsheet as above, to find the standard deviation of
our same set of 10 data values, we would use another cell like C13
to take the square root of our variance measure, by typing
in =sqrt(C12).
Categorical Variables
When analyzing categorical variables, we commonly just look at the
count or percent of a group that falls into each level of a category.
For example, if we had two levels of a dog category, lab and not lab,
we might say 32% of the dogs were labs (percent), or we might say
32 of the 100 dogs I saw were labs (count).
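A minimal sketch of counting and computing percents for a categorical variable. The list of 100 dogs is constructed to match the 32-out-of-100 example above; it is not real data.

```python
from collections import Counter

# Constructed to match the example: 32 labs out of 100 dogs (not real data).
dogs = ["lab"] * 32 + ["not lab"] * 68

# Count: how many observations fall into each level of the category.
counts = Counter(dogs)

# Percent: each count as a share of all observations.
percents = {level: 100 * count / len(dogs) for level, count in counts.items()}

print(counts["lab"])    # 32
print(percents["lab"])  # 32.0
```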
Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:
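1. Measures of Center
2. Measures of Spread
3. Shape
4. Outliers
Measures of Center
We looked at calculating measures of Center
1. Means
2. Medians
3. Modes
Measures of Spread
We also looked at calculating measures of Spread
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance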
Calculating Variance
We saw that we could calculate the variance as:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
You will also see:
$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
The reason for this is beyond the scope of what we have covered
thus far, but you can find an explanation here.
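To see the two formulas side by side, here is a sketch using Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The data are the four values from the worked example earlier in this lesson (10, 14, 10, 6).

```python
from statistics import pvariance, variance

data = [10, 14, 10, 6]  # the four values from the earlier worked example
n = len(data)
mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)  # sum of squared differences (32)

# Divide by n: the formula used in this lesson.
divide_by_n = sum_sq / n

# Divide by n - 1: the version you will also see.
divide_by_n_minus_1 = sum_sq / (n - 1)

# The standard library provides both versions.
print(divide_by_n, pvariance(data))
print(divide_by_n_minus_1, variance(data))
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, and the gap shrinks as the number of observations grows.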
Histograms
We learned how to build a histogram in this video, as this is
the most popular visual for quantitative data.
Shape
From a histogram, we can quickly identify the shape of our
data, which influences all of the measures we learned in
the previous concepts. We learned that the distribution of our
data is frequently associated with one of three shapes:
1. Right-skewed
2. Left-skewed
3. Symmetric (frequently normally distributed)
Summary
Shape               Mean vs. Median           Real-World Applications
Symmetric (Normal)  Mean equals Median        Height, Weight, Errors, Precipitation
Right-skewed        Mean greater than Median  Amount of drug remaining in a bloodstream, Time between phone calls at a call center, Time until light bulb dies
Left-skewed         Mean less than Median     Grades as a percentage in many universities, Age of death, Asset price changes
The mode of a distribution is essentially the tallest bar in a
histogram. There may be multiple modes depending on the
number of peaks in our histogram.
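The relationship between shape, mean, and median can be checked on small made-up datasets. All three lists below are hypothetical and chosen only to have the stated shape. `multimode` (Python 3.8 or later) returns every value tied for most frequent, which corresponds to multiple peaks.

```python
from statistics import mean, median, multimode

# HYPOTHETICAL datasets chosen to illustrate each shape.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]        # long tail to the right
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]   # long tail to the left
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

print(mean(right_skewed), median(right_skewed))  # 4.75 3.0   (mean > median)
print(mean(left_skewed), median(left_skewed))    # 16.25 18.0 (mean < median)
print(mean(symmetric), median(symmetric))        # 3 3        (mean = median)

# A dataset with two equally common values has two modes.
two_peaks = [1, 1, 2, 3, 3]
print(multimode(two_peaks))  # [1, 3]
```

In the skewed examples the single extreme value pulls the mean toward the tail, while the median barely moves.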
References
These are the references used to pull the applications of each
shape.
Stack Exchange
Common Techniques
When outliers are present we should consider the following points.
Outliers Advice
Below are my guidelines for working with any column (random
variable) in your dataset.
Recap
Variable Types
We have covered a lot up to this point! We started with identifying
data types as either categorical or quantitative. We then learned we could
identify quantitative variables as either continuous or discrete. We also
found we could identify categorical variables as
either ordinal or nominal.
Categorical Variables
When analyzing categorical variables, we commonly just look at the
count or percent of a group that falls into each level of a category.
For example, if we had two levels of a dog category, lab and not lab,
we might say 32% of the dogs were labs (percent), or we might say
32 of the 100 dogs I saw were labs (count).
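Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:
1. Measures of Center
2. Measures of Spread
3. Shape
4. Outliers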
Measures of Center
We looked at calculating measures of Center
1. Means
2. Medians
3. Modes
Measures of Spread
We also looked at calculating measures of Spread
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance
Shape
We learned that the distribution of our data is frequently associated
with one of the three shapes:
1. Right-skewed
2. Left-skewed
3. Symmetric (frequently normally distributed)
Outliers
We learned that outliers have a larger influence on measures like
the mean than on measures like the median. We learned that we
should work with outliers on a situation by situation basis. Common
techniques include:
What Next?
Up to this point, we have only looked at Descriptive Statistics,
because we are describing our collected data. In the final sections of
this lesson, we will be looking at the difference
between Descriptive Statistics and Inferential Statistics.
Descriptive Statistics
Descriptive statistics is about describing our collected data using the
measures discussed throughout this lesson: measures of center,
measures of spread, the shape of our distribution, and outliers. We
can also use plots of our data to gain a better understanding.
Inferential Statistics
Inferential statistics is about using our collected data to draw
conclusions about a larger population. Performing inferential
statistics well requires that we take a sample that accurately
represents our population of interest.