
Statistical Analysis

Lecture 2
Dr. Shahid Iqbal
Fall 2021
Topics
1. Central Dogma of Statistics
2. Statistical Data Distributions
 Binomial Distribution
 Normal Distribution
 Poisson Distribution
3. Populations and Samples
4. Data Science Process
5. Exploratory Data Analysis
6. Correlation Analysis
 Pearson Correlation Coefficient
 Spearman Rank Correlation Coefficient
Statistics and Data Science

“A data scientist is someone who knows more statistics
than a computer scientist and more computer science
than a statistician.”
— Josh Blumenstock (Univ. of Washington)
The Central Dogma of Statistics

Process of statistical reasoning.


The Central Dogma ………
 Suppose we have a population of possible things that we can
observe.
 A relatively small subset of them are actually sampled,
ideally at random, meaning that we can observe properties
of the sampled items.
 Probability theory describes “what properties our sample
should have”, given the properties of the underlying
population.
 But statistical inference tries to “deduce what the full
population is like given analysis of the sample”.
Statistical Data Distributions
 Every observed random variable defines a particular
frequency/probability distribution.

 That reflects how often each particular value arises.

 The unique properties of variables like height, weight, and


IQ are captured by their distributions.

 But the shapes of these distributions are themselves not


unique.
Significance of Classical Distributions
These classical distributions have two properties:
1. They describe shapes of frequency distributions that
arise often in practice.
2. They can often be described mathematically using
closed-form expressions with very few parameters.
Once abstracted from specific data observations, they
become probability distributions.
But your observed data does not necessarily come from a
particular classical distribution just because its shape looks
similar.
Statistical Data Distributions
Some distributions occur often in practice/theory:

1. Binomial Distribution
2. Normal Distribution
3. Poisson Distribution
Binomial Distributions (BD)
 It can be thought of as simply the probability of a
SUCCESS or FAILURE outcome in an experiment or
survey that is repeated multiple times.

 The binomial is a type of distribution that has two


possible outcomes (the prefix “bi” means two, or twice).

 For example, a coin toss has only two possible


outcomes: head or tail and taking a test could have two
possible outcomes: pass or fail.
Criteria
BD must meet the following three criteria:
1. The number of observations or trials is fixed
 We can only work out the probability of something
happening if the experiment is repeated a fixed number of times.
 If you toss a coin once, the probability of getting a tail
is 50%. If you toss a coin 20 times, the probability of
getting at least one tail is very close to 100%.
2. Each observation or trial is independent.
 In other words, none of your trials have an effect on the
probability of the next/other trials.
Criteria
3. The probability of getting a success (tails/heads)
must remain the same from one trial to another.

Real Life Examples:


 If a new drug is introduced to cure a disease, it either
cures the disease (it’s successful) or it doesn’t cure the
disease (it’s a failure).
 If you purchase a lottery ticket, you’re either going to win
money, or you aren’t.
 Basically, anything you can think of, that can only be a
success or a failure can be represented by a binomial
distribution.
Binomial Distributions
 The binomial distribution is closely related to the
Bernoulli distribution.

 A binomial experiment is a series of independent

Bernoulli trials. Each Bernoulli trial has exactly two possible
outcomes, “success” or “failure”.

 In each trial, the probability of success, P(S) = p, is the

same. The probability of failure is just 1 minus the
probability of success: P(F) = 1 – p.
Binomial Distributions
The binomial distribution formula is:

P(X = x) = nCx * p^x * (1 – p)^(n – x)

Where
x = number of “successes”
p = Probability of a success in an individual trial
n = number of trials
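As a rough illustration (not part of the original slides), a minimal Python sketch of this formula, using only the standard library:

from math import comb

def binom_pmf(x, n, p):
    # Probability of exactly x successes in n independent trials,
    # each with success probability p
    return comb(n, x) * (p ** x) * ((1 - p) ** (n - x))

# Example 1 below: a fair coin tossed 10 times, exactly 6 heads
print(binom_pmf(6, 10, 0.5))   # ~0.2051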
Example 1
A fair coin is tossed 10 times. What is the probability of
getting exactly 6 heads?

Solution:

The number of trials (n) is 10

The probability of success p (“tossing a head”) is 0.5,
so 1 – p = 0.5
x = 6
P(X = 6) = 10C6 * 0.5^6 * 0.5^4
= 210 * 0.015625 * 0.0625
= 0.205078125
Example 2
80% of people who purchase pet insurance are women.
If 9 pet insurance owners are randomly selected, find
the probability that exactly 6 are women.

Solution:
Step 1: n = number of randomly selected items = 9.
Step 2: x = number you are asked to find the probability

for, is 6.
Step 3: Compute the number of combinations:
nCx = n! / ((n – x)! * x!)
9! / ((9 – 6)! * 6!) = 84
Example 2……..
 Step 4: Find p and q. We are given p = 80%, or 0.8, so
the probability of failure is q = 1 – 0.8 = 0.2 (20%).

 Step 5: p^x = 0.8^6 = 0.262144

 Step 6: q^(n – x) = 0.2^(9 – 6) = 0.2^3 = 0.008

 Step 7:
84 × 0.262144 × 0.008 ≈ 0.176
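As a check on the hand calculation above, a short sketch using SciPy (assuming scipy is installed) gives the same value:

from scipy.stats import binom

# n = 9 randomly selected pet-insurance owners, p = 0.8 that an owner is a woman
print(binom.pmf(6, 9, 0.8))   # ~0.176, matching the step-by-step result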
Properties of Binomial Distributions

Binomial probability mass function and normal probability distribution function approximation for n = 6 and p = 0.5 (figure)
Normal Distribution
 A normal distribution (bell curve), is a distribution that
occurs naturally in many situations.

 For example: the bell curve is seen in tests like the SAT and

GRE. The bulk of students score around the average (C), a
smaller number score B or D, and a very small percentage
score F or A.

 This creates a distribution that resembles a bell.

 The bell curve is symmetrical. Half of the data will fall to


the left of the mean; half will fall to the right.
Normal Distribution
Many groups follow this type of pattern. That’s why it’s
widely used in businesses and statistics:

 Heights of people.
 Measurement errors.
 Blood pressure.
 Points on a test.
 IQ scores.
 Salaries.
Normal Distribution
 The empirical rule tells you what percentage of your data
falls within a certain number of standard deviations of
the mean.

 68% of the data falls within one standard deviation of the


mean.

 95% of the data falls within two standard deviations of the


mean.

 99.7% of the data falls within three standard deviations of


the mean.
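A quick way to verify the empirical rule numerically is with SciPy's standard normal CDF; this is a minimal sketch assuming scipy is installed:

from scipy.stats import norm

# Fraction of normally distributed data within k standard deviations of the mean
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))   # ~0.68, ~0.95, ~0.997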
Normal Distribution
Standard deviation
1. It controls the spread of the distribution. A smaller
standard deviation indicates that the data is tightly
clustered around the mean; the normal distribution will
be taller.

2. A larger standard deviation indicates that the data is


spread out around the mean; the normal distribution will
be flatter and wider.
Properties of a normal distribution
 The mean, mode, and median are all equal.

 The curve is symmetric at the center (i.e. around the


mean, μ).

 Exactly half of the values are to the left of center and


exactly half the values are to the right.

 The total area under the curve is 1.


Standard Normal Model
 A standard normal model has a normal distribution with a
mean of 0 and a standard deviation of 1.
Standard Normal Model
 In the standard normal model, about 5 percent of your
data would fall into the “tails” (dark orange color in the
image) and 95 percent will be in between.

 Not all bell-shaped distributions are normal but it is


generally a reasonable start.
Example 1
A group of students with normally distributed salaries, earn
an average of $6,800 with a standard deviation of $2,500.
What proportion of students earn between $6,500 and
$7,300?

Solution:
 Step 1: μ = 6800
 Step 2: σ = 2500.
 Step 3: a = 6500 // lower value
 Step 4: b = 7300 //upper value.
 Step 5: P(a < X < b) = P(X < b) – P (X < a)
Example 1
Solution:
 Step 6: Apply the z-score formula: z = (x – μ) / σ
 Step 7: P(X < b) = P(X < 7300)
z = (7300 – 6800) / 2500 = 0.2
from the z-table, P(X < 7300) = 0.57926
 Step 8: P(X < a) = P(X < 6500)
z = (6500 – 6800) / 2500 = –0.12
from the z-table, P(X < 6500) = 0.45224
Example 1
Solution:
 Step 9: P(a < X < b) = P(X < b) – P(X < a)
= 0.57926 – 0.45224
= 0.12702

About 12.7% (roughly 13%) of students earn between $6,500 and $7,300.
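The same answer can be obtained without z-tables; a minimal sketch using SciPy's normal CDF (assuming scipy is installed):

from scipy.stats import norm

mu, sigma = 6800, 2500
p = norm.cdf(7300, mu, sigma) - norm.cdf(6500, mu, sigma)
print(p)   # ~0.127, i.e. about 13% of students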


Poisson Distribution
 It is a statistical distribution that shows how many times
an event is likely to occur within a specified period of
time.

 It is used for independent events that occur at a constant


rate within a given interval of time.

 The Poisson distribution is a discrete distribution, meaning

that the number of events can only take whole-number
values (the event count is 0, 1, 2, …).
Example 1
The average number of major storms in your city is 2
per year. What is the probability that exactly 3 storms
will hit your city next year?

Answer:

 Step 1: μ = 2 (average number of storms per year,


historically)

 Step 2: x = 3 (the number of storms we think might hit


next year)

e = 2.71828 (e is Euler’s number, a constant)


Example 1

Step 3: Plug the values into the Poisson distribution


formula:

 P(x; μ) = (e^(–μ) · μ^x) / x!

 = (2.71828^(–2) · 2^3) / 3!

 = (0.13534 · 8) / 6

 = 0.180
Practical Uses of the Poisson…
 A textbook store sells an average of 200 books every
Saturday night. Using this data, you can predict the
probability that more books will be sold (perhaps 300 or
400) on the following Saturday nights.
 Another example is the number of diners in a certain
restaurant every day. If the average number of diners for
seven days is 500, you can predict the probability of a
certain day having more diners.
 Because of this application, Poisson distributions are
used by businessmen to make forecasts about the
number of customers or sales on certain days or seasons
of the year.
Populations and Samples
 The word population immediately makes us think of the
entire world’s population of 7 billion people.

When we take a sample:


 We take a subset of the units of size n in order to
examine the observations to draw conclusions and make
inferences about the population.

 There are different ways of sampling, and we should be

aware of the particular sampling mechanism used, because
it can introduce biases into the data and distort it.
Populations and Samples
 If the subset is not a representative version of the
population, then any conclusions you draw will simply be
wrong and distorted.

In the BigCorp email example:


 We have a list of all the employees; we select 1/10th of
those people at random and take all the email they ever
sent (this is our sample; see the sketch below).
Alternatively, we could sample 1/10th of all email sent
each day at random.
Both these methods are reasonable and yield the same
sample size.
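A minimal sketch of the first sampling scheme (employee-level sampling), using a hypothetical list of employee IDs; only the standard library is needed:

import random

# Hypothetical list of BigCorp employee IDs
employees = ["emp_{}".format(i) for i in range(10000)]

# Select 1/10th of the employees at random; all of their email forms the sample
random.seed(42)  # fixed seed only to make the illustration reproducible
sample = random.sample(employees, k=len(employees) // 10)
print(len(sample))   # 1000 sampled employees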
Populations and Samples
But if we took them and counted
 “how many email messages each person sent”

and used that to estimate the underlying distribution of


emails sent by all individuals at BigCorp, then

 We might get entirely different answers.


Populations and Samples of big data
 In the age of Big Data, we can record all users’ actions all
the time.

 “How much data we need depends on what our goal is”.

 For analysis or inference purposes, we don’t need to


store all the data all the time (sampling is a solution).

 Alternatively, for serving purposes if we have to render


the correct information for a user, then we need to have
all the information for that particular user.
Populations and Samples of big data
For example: Bias
“Even if we have access to all of Facebook’s or Google’s or
Twitter’s data corpus, any inferences we make from that
data should not be extended to draw conclusions about
humans beyond those sets of users, or even those users
for any particular day”
Populations and Samples of big data
 For example: Kate Crawford, in her Strata talk, “Hidden
Biases of Big Data,”
 If we analyzed tweets immediately before and after
Hurricane Sandy, we would find that most people were
shopping pre-Sandy and partying post-Sandy.
 However, most of those tweets came from New Yorkers,
first because they are heavier Twitter users than coastal
New Jerseyans,
 and second because the coastal New Jerseyans were
worrying about their houses falling down and didn’t have
time to tweet.
Populations and Samples of big data
 You would think that Hurricane Sandy wasn’t all that bad
if you used tweet data to understand it.

 The only conclusion we can draw is what

“Hurricane Sandy was like for the subset of Twitter users
(who themselves are not representative), whose situation
was not so bad that they had no time to tweet.”

Let’s rethink what the population and the sample are in


various contexts.
Populations and Samples of big data
Now we have new kinds of data:
1. Traditional: numerical, categorical, or binary
2. Text: emails, tweets, New York Times articles
3. Records: user-level data, timestamped event data,
JSON-formatted log files
4. Geo-based location data.
5. Network data.
6. Sensor data
7. Images
These new kinds of data require us to think more carefully
about what sampling means in these contexts.
Data Science Process/Cycle

Data Science Process/Cycle
 Real World: In the real world, lots of people are busy with
various activities, such as using Google+, competing in the
Olympics, or sending spam.

 We’ll start with raw data—logs, Olympics records, Enron
employee emails, or recorded genetic material. We want to
process these to make them clean for analysis.

 So we build and use pipelines of data munging: joining,


scraping and wrangling. To do this we use tools such as
Python, shell scripts, R, or SQL, or all of the above.
Data Science Process/Cycle
 EDA: Once we have a clean dataset, we do some kind of
EDA.

 While doing EDA, we may realize that it isn’t actually


clean because of duplicates, missing values, absurd
outliers, and data that wasn’t actually logged or
incorrectly logged.

 If that’s the case, we may have to go back to collect more


data, or spend more time cleaning the dataset.
Data Science Process/Cycle
 Design Model: Next, we design our model using some
ML algorithm, such as k-nearest neighbors (k-NN), linear
regression, or Naive Bayes.

 The model we choose depends on the type of problem


we’re trying to solve e.g. a classification problem, a
prediction problem, or a basic description problem.

 Then we can interpret, visualize, report, or


communicate our results. This could take the form of
reporting the results up to our boss or coworkers etc.
Data Science Process/Cycle
 Our goal may be to build or prototype a “data product”;
e.g., a spam classifier, or a search ranking algorithm, or a
recommendation system.

 Now here, what makes data science distinct from


statistics is that this data product then gets incorporated
back into the real world

 Then users interact with that product, and that generates


more data, which creates a feedback loop.
Data Science Process/Cycle
 It is very different from predicting the weather, where your
model doesn’t influence the outcome at all. e.g., you’re
not going to cause it to rain.

 But if you instead build a recommendation system that


generates evidence that “lots of people love this book,”
then you will know that you caused that feedback loop.

 Take this loop into account in any analysis you do by


adjusting for any biases your model caused.

 Your models are not just predicting the future, but


causing it!
Data Scientist’s Role in This Process

Exploratory Data Analysis (EDA)
 “EDA” is an attitude, a state of flexibility, a willingness to
look for those things that we believe are not there, as well
as those we believe to be there.

 EDA is the first step toward building a model.

 It’s traditionally presented as a bunch of histograms and


stem-and-leaf plots.

 But EDA is a critical part of the data science process, and


also represents a philosophy or way of doing statistics.

 In EDA, there is no hypothesis and there is no model.


Exploratory Data Analysis
 The “exploratory” aspect means that your understanding
of the problem you are solving is changing as you go.

Basic tools of EDA are plots, graphs, and summary

statistics. Generally, it’s a method of systematically going
through the data:
1. Plotting distributions of all variables (using box plots),
plotting time series of data, transforming variables,

2. Looking at all pairwise relationships between variables,

and generating summary statistics for all of them, e.g.
computing their mean, minimum, maximum, the upper
and lower quartiles, and identifying outliers (see the
sketch below).
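A minimal pandas sketch of this first EDA pass; the file name is hypothetical, and the exact plots and statistics will depend on your data:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")              # hypothetical dataset

print(df.describe())                      # mean, min, max, quartiles per variable
print(df.isna().sum())                    # missing values per column
print(df.select_dtypes("number").corr())  # pairwise correlations

df.select_dtypes("number").plot(kind="box")   # box plots of numeric variables
plt.show()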
Exploratory Data Analysis
EDA is also a mindset, which is about our relationship
with the data. We want to understand the data—gain
intuition, understand the shape of it.

Try to connect our understanding of the process that


generated the data to the data itself.

EDA happens between us and the data and isn’t about


proving anything to anyone else yet.
Philosophy of EDA
Reasons For EDA:
 To gain intuition about the data; to make comparisons
between distributions; for sanity checking (make sure the
data is on the scale you expect, in the format you thought
it should be).

 To find out where data is missing or if there are outliers;


and to summarize the data.

 In the context of data generated from logs, EDA also


helps with debugging the logging process.
Philosophy of EDA
Reasons For EDA:
 E.g. “patterns” you find in the data could actually be
something wrong in the logging process that needs to be
fixed.

 If you never go to the trouble of debugging, you’ll


continue to think your patterns are real.

 In the end, EDA helps you make sure the product is


performing as intended.
Correlation Analysis
 Given two variables x and y, represented by a sample of n
points of the form (xi, yi), for 1 ≤ i ≤ n.

 We say that x and y are correlated when the value of x has


some predictive power on the value of y.

 The correlation coefficient r(X , Y) is a statistic that


measures the degree to which Y is a function of X.

 The value of the correlation coefficient ranges from –1 to 1,

 where 1 (or –1) means fully correlated and 0 implies no

linear relationship.
Correlation Analysis
 Negative correlations imply that the variables are anti-
correlated, meaning that when X goes up, Y goes down.

 Note that negative correlations are just as good for


predictive purposes as positive ones.

 Correlations around 0 are useless for forecasting.

Example: Does financial status affect health? The


observed correlation between household income and the
prevalence of coronary artery disease is r = - 0.717.

 So the wealthier you are, the lower your risk of having a

heart attack (a strong negative correlation).
The Pearson Correlation Coefficient

r(X, Y) = Σ(xi – x̄)(yi – ȳ) / √[ Σ(xi – x̄)² · Σ(yi – ȳ)² ]
Example
Subject   Age (x)   Glucose Level (y)
1         43        99
2         21        65
3         25        79
4         42        75
5         57        87
6         59        81

The correlation coefficient is r ≈ 0.53
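A short SciPy sketch (assuming scipy is installed) reproduces this value for the age/glucose data:

from scipy.stats import pearsonr

age     = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

r, p_value = pearsonr(age, glucose)
print(r)   # ~0.53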
Spearman Rank Correlation Coefficient
 The Pearson correlation coefficient defines the degree to
which a linear predictor of the form y = mx + b can fit the
observed data.

 This generally does a good job of measuring the degree of

the relationship between linearly related variables.

OR

 It measures the strength of the linear relationship


between normally distributed variables.

Spearman Rank Correlation Coefficient
 First, test the distribution of your data: does it follow a
normal distribution or not?

 Pearson is a parametric test whereas Spearman is a


non-parametric test, that assesses how well the
relationship between two variables can be described
using a monotonic function.

 When the variables are not normally distributed or the


relationship between the variables is not linear, it may be
more appropriate to use the Spearman rank correlation.

Spearman Rank Correlation Coefficient

ρ = 1 – (6 Σd²) / (n(n² – 1))

where d is the difference between the two ranks of each
observation and n is the number of observations.
Example
The scores for nine students in physics and math are:

Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28


Mathematics: 30, 33, 45, 23, 8, 49, 12, 4, 31

Compute the students’ ranks in both subjects and then


compute the Spearman rank correlation.

Example
Steps 1–3: Rank the scores within each subject (here, highest
score = rank 1; the direction of ranking does not affect d²),
find the difference d between each student’s two ranks, and
square it:

Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28 → ranks 3, 5, 1, 6, 7, 2, 8, 9, 4
Maths:   30, 33, 45, 23, 8, 49, 12, 4, 31 → ranks 5, 3, 2, 6, 8, 1, 7, 9, 4
d:  –2, 2, –1, 0, –1, 1, 1, 0, 0
d²:  4, 4, 1, 0, 1, 1, 1, 0, 0
Example
Step 4: Sum (add up) all of your d-squared values.
4 + 4 + 1 + 0 + 1 + 1 + 1 + 0 + 0 = 12.

Step 5: Insert values into the formula.

= 1 – (6*12)/(9(81-1))
= 1 – 72/720
= 1-0.1
= 0.9
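The same result, sketched with SciPy's built-in Spearman correlation (assuming scipy is installed):

from scipy.stats import spearmanr

physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
maths   = [30, 33, 45, 23, 8, 49, 12, 4, 31]

rho, p_value = spearmanr(physics, maths)
print(rho)   # 0.9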

Any Questions?
