Week2 Modified
Total 10 1
Histograms & Barplots
● Histograms extend frequency tables one step further by showing
them visually.
● They tell the same story as the table, just in a different form.
● They are often the better choice: with many bins, a table
becomes very hard to take in at a glance.
● Visually, we can spot trends and patterns far more easily than in
tables full of numbers and sections.
● Barplots work like histograms, except that the data is already
categorical: we do not need to bin numbers together, and each
category simply gets its own bar.
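To make the binning step concrete, here is a minimal Python sketch that groups the Heights values used in this deck's examples into fixed-width bins and prints a text histogram. The 20 cm bin width and the function names are assumptions for illustration; the bins in the slides' own frequency table may differ.

```python
from collections import Counter

# Heights (cm) taken from the examples in these slides.
heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

BIN_WIDTH = 20  # assumed bin width; the slides' own bins may differ


def histogram_counts(values, width=BIN_WIDTH):
    """Count how many values fall into each [lower, lower + width) bin."""
    counts = Counter((v // width) * width for v in values)
    first, last = min(counts), max(counts)
    # Keep empty bins as well, so gaps in the data stay visible.
    return {lower: counts.get(lower, 0)
            for lower in range(first, last + width, width)}


def category_counts(labels):
    """Barplot counterpart: each category is its own bar, no binning needed."""
    return dict(Counter(labels))


# Print one row per bin, with one '#' per observation.
for lower, n in histogram_counts(heights).items():
    print(f"{lower}-{lower + BIN_WIDTH - 1} cm | {'#' * n}")
```

The empty 200-219 cm bin is kept on purpose: a histogram shows gaps on a numeric axis, while a barplot only has the categories that exist.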
Histograms: Tables vs. Graphs
[Figure: a frequency table versus the equivalent graph]
Histograms Example
● Converting our frequency table for Heights into a chart, we get
the following:
● We can see that the lower and upper ends of the box plot mark the range of “typical” values.
● One outlier is identified here: the height of 230 cm.
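One common convention for flagging box plot outliers is the 1.5 × IQR rule. The sketch below applies it to the Heights values. Both the rule and the quartile method (the default of Python's `statistics.quantiles`) are assumptions here, since the slides do not say how the outlier was identified.

```python
import statistics

# Heights (cm) taken from the examples in these slides.
heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


print(iqr_outliers(heights))  # [230]
```

Under these assumptions, 230 cm is the only value outside the fences, which matches the single outlier noted above.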
Scatterplots
● Oftentimes, looking at just one variable at a time isn’t
meaningful, or doesn’t give us the insight we are after.
● Scatterplots are a way of comparing pairs of values across your
entire data set simultaneously. This lets you see the
relationship between two (or sometimes more) variables.
● When looking at two variables, the x-axis represents one
variable and the y-axis represents the other.
○ Each point represents one observation’s pair of
values.
Scatterplots Example
● FINALLY! Let’s look at another variable besides Height…
Weight!
● Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
Weights = {115, 110, 182, 210, 104, 100, 109, 121, 124, 131}
● Laid out as a table, this data might look something like
the following:
…
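As a rough sketch of how the two lists become scatterplot points, the Python below pairs each height with the weight at the same position, so that every (x, y) pair is one observation. The function names are hypothetical, and the plotting helper assumes the third-party matplotlib library is installed; it is defined but not called here.

```python
# Values taken from the Heights and Weights lists in these slides.
heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]
weights = [115, 110, 182, 210, 104, 100, 109, 121, 124, 131]


def to_points(xs, ys):
    """Pair values position by position: one (x, y) point per observation."""
    if len(xs) != len(ys):
        raise ValueError("each observation needs both an x and a y value")
    return list(zip(xs, ys))


def plot_points(points):
    """Draw the scatterplot (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt

    xs, ys = zip(*points)
    plt.scatter(xs, ys)
    plt.xlabel("Height")
    plt.ylabel("Weight")
    plt.show()


points = to_points(heights, weights)
print(points[0])  # (150, 115)
```

Calling `plot_points(points)` would then put Height on the x-axis and Weight on the y-axis.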
Scatterplots Example cont.
[Figure: scatterplot with labeled points Sailor Jupiter, Sailor Moon, and Sailor Venus]
Scatterplots Example cont.
Say we had another variable that was categorical, such as whether each person is
strong or weak.
Scatterplots Example cont.
[Figure: scatterplot with points marked as Weak or Strong]
We can add that variable to our scatterplot to enhance our insights. If it gives more information,
it is worth showing!
Scatterplots Example cont.
Poisson Distribution: µ = 3, σ² = 3
Discrete
Normal Distributions
● The normal distribution (Gaussian distribution) is one of the
most commonly used sampling distributions.
● The following are key facts about the normal distribution:
○ The mean, median and mode are all equal.
○ The distribution is symmetric around the mean, µ.
○ Exactly 50% of the data lies on each side of the
mean.
○ About 68% of the data lies within one standard deviation of the
mean, and about 95% lies within two standard deviations.
● A big misconception is that most data in the world behaves
“normally” (is normally distributed). In practice, it is statistics
computed from the data, such as sample means, that tend to
follow a normal distribution.
○ Much raw data follows a long-tail distribution (data that is
skewed, typically to the right).
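The 50%, 68% and 95% figures above can be checked numerically. This is a small sketch using `NormalDist` from Python's standard library; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist


def share_within(k, dist=NormalDist()):
    """Share of the distribution within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


standard = NormalDist()  # mean 0, standard deviation 1

print(standard.cdf(standard.mean))  # 0.5 -> half the data lies below the mean
print(round(share_within(1), 4))    # 0.6827
print(round(share_within(2), 4))    # 0.9545
```

The same shares hold for any mean and standard deviation, for example `share_within(1, NormalDist(mu=100, sigma=15))`.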
Normal Distributions cont.
Normal Distributions: Standard Normal
● The standard normal distribution is the normal distribution
with a mean of 0 and a standard deviation of 1.
● You can standardize any set of values with the following
equation:
Z = (x - µ)/σ
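As an illustration of the formula, the sketch below standardizes the Heights values from the earlier examples. It treats the ten values as the whole population, so µ and σ are the population mean and standard deviation; that choice is an assumption, as is the function name.

```python
import statistics

# Heights (cm) taken from the examples in these slides.
heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]


def standardize(values):
    """Apply Z = (x - mu) / sigma to every value."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]


z_scores = standardize(heights)

# After standardizing, the values have mean 0 and standard deviation 1
# (up to floating-point rounding).
print(round(statistics.fmean(z_scores), 6))
print(round(statistics.pstdev(z_scores), 6))
```

A positive Z means the value sits above the mean, a negative Z below it, measured in standard deviations.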
● P( X < x ) = 0.7
Normal Distributions: Example 3 cont.
● The tricky part here is that we know the probability; we
just don’t know which value of x satisfies it under the parameters
provided.
● We can use software to compute the inverse of this cumulative probability very easily!
● For P( X < x ) = 0.7, x must be 552.44, which we round up to 553.
● Since Tom scored greater than 553, he will be admitted! Yay!
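The inverse lookup can be sketched with `NormalDist.inv_cdf` from Python's standard library. The example's parameters are not restated on these slides, so the mean of 500 and standard deviation of 100 below are assumptions, chosen because they reproduce the 552.44 quoted above.

```python
from statistics import NormalDist

# Assumed parameters (not stated on these slides); they reproduce x = 552.44.
ASSUMED_MEAN = 500
ASSUMED_SD = 100

scores = NormalDist(mu=ASSUMED_MEAN, sigma=ASSUMED_SD)

# inv_cdf answers: "below which x does 70% of the distribution fall?"
cutoff = scores.inv_cdf(0.7)
print(round(cutoff, 2))  # 552.44

# Sanity check: feeding the cutoff back into the CDF returns the probability.
print(round(scores.cdf(cutoff), 2))  # 0.7
```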
Confidence Intervals
● Placing our trust in a single value for an analysis is very risky.
● Say we wanted to forecast budgets and stated that we expect
the average savings to be $10,000 next month. When next
month’s financial results come in, we find out we actually only
saved $8,000, but the business trusted us so much that it had already
allocated the missing $2,000 to some other product. Now we
are in trouble!
● To avoid this issue without resorting to “I don’t know” or “maybe we expect
somewhere around $10,000”, we instead give a range of what we
can expect!
● So, a confidence interval is a range of values that we expect to
contain the true value, together with a confidence level (such as 95%)
that says how often ranges built this way would capture it.
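As a rough illustration of reporting a range instead of a single number, the sketch below builds a 95% interval for the mean of the Heights values from the earlier examples. It uses a normal approximation (mean ± z × standard error), which is an assumption: with only ten observations, a t-based interval would be somewhat wider. The function name is hypothetical.

```python
import math
import statistics
from statistics import NormalDist

# Heights (cm) taken from the examples in these slides.
heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)  # sample std. deviation
    z = NormalDist().inv_cdf(0.5 + confidence / 2)     # about 1.96 for 95%
    margin = z * std_err
    return mean - margin, mean + margin


lower, upper = mean_confidence_interval(heights)
print(f"mean = {statistics.fmean(heights):.1f} cm")
print(f"95% interval: {lower:.1f} cm to {upper:.1f} cm")
```

Asking for more confidence, for example `confidence=0.99`, widens the range: being surer of capturing the true value costs precision.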