Tian Statistics Lesson 5 and 6 Probability Distributions

The document covers the fundamentals of data analytics and statistics, focusing on distributions, statistical tests, and the identification of outliers through graphical techniques. It explains key concepts such as independent and dependent variables, probability distributions, and the characteristics of discrete and continuous random variables. Additionally, it discusses the expected value and variance for random variables, particularly in the context of binomial distributions.

Uploaded by

vineetpjoshi.71
Fundamentals of Data Analytics and Statistics
Topic 1: Introducing Distributions and Statistical Tests -
Concepts of distribution, frequency, quintiles, outliers (cont.)

• An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

• In a sense, this definition leaves it up to the analyst to decide what will be considered abnormal.

• There are two graphical techniques for identifying outliers: scatter plots and box plots.


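/* Small example data set with two numeric variables */
data test;
  input v1 v2;
  datalines;
1.1 2.3
2.4 2.2
3.2 1.8
4.4 1.9
5 4
6.1 2
;
run;

/* Scatter plot of v2 against v1; the point (5, 4) lies well above the others */
proc sgplot data=test;
  scatter x=v1 y=v2;
run;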
/* Creating Scatter Plots */

/* The & modifier lets name contain single blanks, so each name  */
/* in the data lines is followed by two blanks to end the value. */
data wings;
  input name & $13. type $ length wingspan @@;
  datalines;
Robin  S 28 41 Bald Eagle  R 102 244 Barn Owl  R 50 110
Osprey  R 66 180 Cardinal  S 23 31 Goldfinch  S 11 19
Golden Eagle  R 100 234 Crow  S 53 100 Magpie  S 60 60
Elf Owl  R 15 27 Condor  R 140 300 Rt Robin  S 24 70
;
run;

/* Display labels for the one-letter bird type codes */
proc format;
  value $birdtype
    'S'='Songbirds'
    'R'='Raptors';
run;

/* Basic scatter plot */
proc sgplot data=wings;
  scatter x=wingspan y=length;
  title 'Comparison of Wingspan vs. Length';
run;

/* Separate marker style for each bird type */
proc sgplot data=wings;
  scatter x=wingspan y=length / group=type;
  format type $birdtype.;
  title 'Comparison of Wingspan vs. Length';
run;

/* Label each point with the bird name */
proc sgplot data=wings;
  scatter x=wingspan y=length / group=type datalabel=name;
  format type $birdtype.;
  title 'Comparison of Wingspan vs. Length';
run;

/* datalabel without a variable labels each point with its Y value (length) */
proc sgplot data=wings;
  scatter x=wingspan y=length / group=type datalabel;
  format type $birdtype.;
  title 'Comparison of Wingspan vs. Length';
run;

The scatter plot here reveals:

(1) A basic linear relationship between X and Y for most of the data, and

(2) A single outlier (at X=375). Outlier detection is important for effective modeling.
Independent and dependent variables
In general, the independent variable is manipulated by the researcher and its effects on the dependent variable are measured.

For example, an experimenter might compare the effectiveness of four types of antidepressants. In this case, the variable is “type of antidepressant.” When a variable is manipulated by an experimenter, it is called an independent variable.

The experiment seeks to determine the effect of the independent variable on relief from depression. In this case, relief from depression is called the dependent variable.

Definition of Probability Distribution

• A probability distribution is a table or an equation that links each outcome of a statistical experiment with its probability of occurrence.

• Usually, an uppercase X denotes the random variable and a lowercase x denotes one of its values. P(X) represents the probability of X, and P(X = x) refers to the probability that the random variable X is equal to a particular value x.


Example: When you flip a coin two times, there are 4 possible outcomes: HH, HT, TH and TT. What is the probability distribution for the number of Tails?

Solution: Let the variable X represent the number of Tails. Then the random variable X takes the values 0, 1, 2. Its probability distribution is shown in the following table.

# of Tails (X)    Probability P(X)
0                 0.25
1                 0.5
2                 0.25

[Figure: bar chart of P(x) against x = 0, 1, 2, with bar heights .25, .5 and .25]

P(x) = .25 if x = 0 or 2
P(x) = .5 if x = 1
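
The coin-flip distribution can also be obtained by listing every outcome and counting tails. The following is a minimal Python sketch using only the standard library; the function name `tails_distribution` is an illustrative choice, not part of the course material, and a fair coin is assumed.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def tails_distribution(flips=2):
    """Return {number of tails: probability} for `flips` fair coin flips."""
    # Every equally likely outcome, e.g. HH, HT, TH, TT when flips == 2
    outcomes = list(product("HT", repeat=flips))
    # Count how many outcomes contain 0, 1, 2, ... tails
    counts = Counter(outcome.count("T") for outcome in outcomes)
    return {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}

dist = tails_distribution()
for x, p in dist.items():
    print(x, float(p))
# 0 0.25
# 1 0.5
# 2 0.25
```

Exact fractions are used so that the probabilities sum to exactly 1, which avoids floating-point rounding when checking the result.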

A cumulative probability refers to the probability that the value of a random variable falls within a specified range.

Q: What is the probability that the two coin flips result in one or fewer tails?

P(X ≤ 1) = P(X=0) + P(X=1) = 0.25 + 0.5 = 0.75

A cumulative probability distribution can be represented by the following table:

# of Tails (X)    Probability P(X)    Cumulative Probability P(X ≤ x)
0                 0.25                0.25
1                 0.5                 0.75
2                 0.25                1
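
The cumulative column is a running total of the probability column. A short Python sketch of that calculation follows; the variable names are illustrative.

```python
from itertools import accumulate

values = [0, 1, 2]            # number of tails
probs = [0.25, 0.5, 0.25]     # P(X = x)

# Running totals give P(X <= x) for each x
cumulative = list(accumulate(probs))

for x, p, c in zip(values, probs, cumulative):
    print(x, p, c)
# 0 0.25 0.25
# 1 0.5 0.75
# 2 0.25 1.0
```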

Topic 2: Different kinds of distribution, shape and attributes

There are two main types of random variables: discrete and continuous.

Discrete: the possible values form a finite or countably infinite set.

Continuous: the possible values form an interval, which is an uncountable set.


Correspondingly, most probability distributions can be classified as one of two kinds:

Discrete probability distributions

Continuous probability distributions


Probability Distributions

Discrete Probability Distributions                Continuous Probability Distributions
Binomial Probability Distributions                Normal Probability Distributions
Multinomial Probability Distributions             Student's t Distributions
Poisson Probability Distributions                 Chi-square Distributions
Negative Binomial Probability Distributions       F Distributions
Hypergeometric Probability Distributions

Topic 3: Different kinds of distribution, shape and attributes

An important aspect of the distribution of a data set is its shape. Some common distribution shapes are:

• Bell-shaped
• Triangular
• Uniform
• Reverse J-shaped
• J-shaped
• Right skewed
• Left skewed
• Bimodal
• Multimodal

In considering the shape of a distribution, it is helpful to observe the number of peaks (highest points).

• Unimodal: a distribution has one peak.

• Bimodal: a distribution has two peaks.

• Multimodal: a distribution has three or more peaks.

Technically, a distribution is bimodal or multimodal only if the peaks are the same height. However, in practice, distributions with pronounced but not necessarily equal height peaks are often referred to as bimodal or multimodal.



Symmetry: the distribution can be divided into two pieces that are mirror images of one another, e.g. bell-shaped (normal), triangular, rectangular, inverted U-shaped, etc. Bimodal and multimodal distributions are sometimes symmetric, but not always.

• For the discrete probability distributions, only the most popular one, the binomial probability distribution, is introduced in the following sections.

• For the continuous probability distributions, two distributions (uniform and normal) are introduced in the following sections.


Discrete probability distribution:
A table, formula or graph that lists all possible values a discrete random variable can assume, together with their associated probabilities.

Requirements of a discrete probability distribution:

If a random variable X can take values xi, then the following must be true:
1. 0 <= p(xi) <= 1 for all xi
2. The sum of p(xi) over all xi equals 1
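
The two requirements translate directly into a check on a list of probabilities. This is a minimal Python sketch; the function name `is_valid_distribution` and the tolerance value are assumptions made for illustration.

```python
import math

def is_valid_distribution(probs, tol=1e-9):
    """Check the two requirements of a discrete probability distribution."""
    # Requirement 1: every probability lies between 0 and 1
    in_range = all(0 <= p <= 1 for p in probs)
    # Requirement 2: the probabilities sum to 1 (within a small tolerance,
    # because decimal fractions are not stored exactly as floats)
    sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tol)
    return in_range and sums_to_one

print(is_valid_distribution([0.25, 0.5, 0.25]))  # True
print(is_valid_distribution([0.5, 0.6]))         # False, sum exceeds 1
print(is_valid_distribution([-0.5, 1.5]))        # False, values outside [0, 1]
```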

In practice, probabilities assigned to values of a random variable are often estimated from relative frequencies in random samples, e.g. from tossing a coin 10 or 20 times. The resulting frequency distribution table is used to estimate the population probability distribution.


Practice:
1. Consider a random variable X with the following probability distribution

x       -4    0     1     2
P(x)    .2    .3    .4    .1

Find the following probabilities

A. P(X > 0)
B. P(X >= 0)
C. P(0 <= X <= 1)
D. P(X = -4)
E. P(X = -2)
F. P(X < 2)

2. Determine which of the following are not valid probability distributions, and why

A: x 1 2 3 4
p(x) .2 .2 .3 .4

B: x 0 2 4 5
p(x) -.1 .2 .3 .4

C: x -2 -1 1 2
p(x) .1 .1 .1 .7
Topic 2: Different kinds of distribution, shape and attributes

Mean of a discrete random variable

• The mean of a discrete random variable x (its expected value) is defined by

μx = ∑ x · P(x)

• Suppose a random variable x is defined to be the value of a randomly selected member of a finite population. Then the mean of the random variable equals the population mean; that is, μx = μ.

• In a large number of independent observations of x (a sample), the average value of those observations will be approximately equal to μx. The larger the number of observations, the closer the average tends to be to μx.


Expected value and variance

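Given a discrete random variable X with values xi that occur with probabilities p(xi), the expected value of X is

$E(X) = \mu = \sum_{\text{all } x_i} x_i \, p(x_i)$

For comparison, the mean of a population of N values xi is

$\mu = \frac{\sum x_i}{N} = \sum x_i \cdot \frac{1}{N}$

which is the same formula with every value given probability 1/N.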

Example: A fair six-sided die is tossed. You win $2 if the result is a “1”, you win $1 if the result is a “6”, but otherwise you lose $1.

The probability distribution for X = amount won or lost:

X              +$2    +$1    -$1
Probability    1/6    1/6    4/6

E(X) = $2(1/6) + $1(1/6) + (-$1)(4/6) = -$0.17

The interpretation is that if you play many times, the average outcome is losing 17 cents per play. Thus, over time you should expect to lose money.
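
The expected value of the dice game can be reproduced as a probability-weighted sum. A short Python sketch follows, with exact fractions; the names `payoffs` and `expected_value` are illustrative.

```python
from fractions import Fraction

# Payoff in dollars mapped to its probability
payoffs = {2: Fraction(1, 6), 1: Fraction(1, 6), -1: Fraction(4, 6)}

def expected_value(dist):
    """Sum of each value times its probability."""
    return sum(x * p for x, p in dist.items())

mean = expected_value(payoffs)
print(mean, round(float(mean), 2))  # -1/6 -0.17
```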

Laws of expected value

If X and Y are random variables and c is any constant then

1. E(c) = c
2. E(cX) = cE(X)
3. E(X+Y) = E(X) + E(Y) and E(X−Y) = E(X) − E(Y)
4. E(XY) = E(X)E(Y), if X and Y are independent

Variance for a discrete random variable

Definition: σ² = Var(X) = E((X − μ)²), which is equivalent to E(X²) − μ².

The standard deviation σ is the positive square root of the variance of X.


Example: Going back to the previous example for expectation involving the dice game, we calculate the standard deviation for this discrete distribution by first calculating the variance:

σ² = E(X²) − μ² = [2²(1/6) + 1²(1/6) + (−1)²(4/6)] − (−1/6)² = 1.472

The standard deviation = sqrt(1.472) = 1.213
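
The variance calculation for the dice game can be checked the same way. This Python sketch computes E(X²) − μ² with exact fractions and then takes the square root; the variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt

# Payoff in dollars mapped to its probability (dice game)
payoffs = {2: Fraction(1, 6), 1: Fraction(1, 6), -1: Fraction(4, 6)}

mean = sum(x * p for x, p in payoffs.items())              # -1/6
second_moment = sum(x**2 * p for x, p in payoffs.items())  # E(X^2) = 3/2
variance = second_moment - mean**2                         # 53/36
sd = sqrt(float(variance))

print(round(float(variance), 3), round(sd, 3))  # 1.472 1.213
```

Computing the variance from the definition, as the sum of (x − μ)² · p(x), gives the same value of 53/36.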

Laws of variance

If X and Y are random variables and c is a constant, then

1. V(c) = 0
2. V(cX) = c²V(X)
3. V(X+c) = V(X)
4. V(X+Y) = V(X) + V(Y) and V(X−Y) = V(X) + V(Y), if X and Y are independent
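
These laws can be verified exactly on a small case. The Python sketch below uses two independent fair dice as an illustrative example (the dice and the helper names `var`, `transform` and `combine` are assumptions, not part of the slides) and builds each derived distribution by enumeration.

```python
from fractions import Fraction
from itertools import product

# Distribution of one fair die: value -> probability
die = {face: Fraction(1, 6) for face in range(1, 7)}

def mean(dist):
    return sum(x * p for x, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

def transform(dist, f):
    """Distribution of f(X)."""
    out = {}
    for x, p in dist.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

def combine(dist_x, dist_y, f):
    """Distribution of f(X, Y) for independent X and Y."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        out[f(x, y)] = out.get(f(x, y), 0) + px * py
    return out

v = var(die)                                                    # 35/12
print(var(transform(die, lambda x: 3 * x)) == 9 * v)            # V(cX) = c^2 V(X)
print(var(transform(die, lambda x: x + 10)) == v)               # V(X + c) = V(X)
print(var(combine(die, die, lambda x, y: x + y)) == 2 * v)      # V(X + Y)
print(var(combine(die, die, lambda x, y: x - y)) == 2 * v)      # V(X - Y)
```

All four comparisons print True. Note that the variance of the difference is a sum, not a difference, of the two variances.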

Binomial random variable: A specific type of discrete random variable that counts how often a particular event occurs in a fixed number of tries or trials. All of the following conditions must be met:

• There are a fixed number of trials (a fixed sample size)

• On each trial, the event of interest either occurs or does not

• The probability of occurrence (or not) is the same on each trial

• Trials are independent of one another


Binomial Distribution: a probability distribution for independent events for which there are only two possible outcomes, such as a coin flip. If one of the two outcomes is defined as a success, then the probability of exactly x successes out of N trials (events) is given by:

$P(x) = \frac{N!}{x!\,(N-x)!} \, p^x (1-p)^{N-x}$

N = number of trials

p = probability of success on one trial

x = number of successes
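Example: Flip a coin three times.

[Figure: tree diagram of three coin flips, each flip branching into H and T]

X: the number of heads, with possible values 0, 1, 2, 3; n = 3, p = 1/2

P(0) = 3!/(0!·3!) · p^0 · (1−p)^3 = (1−p)^3 = 1/8
P(1) = 3!/(1!·2!) · p^1 · (1−p)^2 = 3p(1−p)^2 = 3/8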

Example: Cross-fertilizing a red and a white flower produces offspring with red flowers 25% of the time. Now we cross-fertilize five pairs of red and white flowers and produce five offspring. Find the probability that there will be no red-flowered plants among the five offspring.

Solution: X = # of red-flowered plants among the five offspring.

The number of red-flowered plants has a binomial distribution with n = 5, p = .25

P(X = 0) = 5!/(0!(5−0)!) × 0.25^0 × (1 − 0.25)^5 = 1 × 0.75^5 = 0.237

There is a 23.7% chance that none of the five plants will be red flowered.
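
The binomial probabilities in the flower example and the three-coin example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `binomial_pmf` is an illustrative choice, and `math.comb` supplies the n!/(x!(n−x)!) term.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Flower example: n = 5, p = 0.25, no red-flowered offspring
print(round(binomial_pmf(0, 5, 0.25), 3))            # 0.237

# Three coin flips: probabilities of 0, 1, 2, 3 heads
print([binomial_pmf(x, 3, 0.5) for x in range(4)])   # [0.125, 0.375, 0.375, 0.125]
```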


Expected Value and Variance for Binomial Random Variable:

The expected value is: μ = np

The variance is: σ² = np(1−p)

n = number of trials

p = probability that the event of interest occurs on any one trial


Example: A roulette wheel has 38 slots: 18 are red, 18 are black, and 2 are green. You play five games and always bet on red. How many times are you expected to win?

Solution: Let X be the number of wins (red outcomes) in the five games.

n = 5 and p = red slots / total slots = 18 / 38

μ = np = 5(18 / 38) = 2.3684

σ = √(np(1−p)) = √(5(18 / 38)(1 − 18 / 38)) = 1.1165

Out of 5 games, you can expect to win 2.3684 times (with a standard deviation of 1.1165).
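
The roulette figures follow directly from μ = np and σ = √(np(1−p)). A short Python check, with illustrative variable names:

```python
from math import sqrt

n = 5          # games played
p = 18 / 38    # probability of red on one spin

mean = n * p
sd = sqrt(n * p * (1 - p))

print(round(mean, 4), round(sd, 4))  # 2.3684 1.1165
```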


If a random variable is a continuous variable, its probability distribution is called a continuous probability distribution. A continuous probability distribution differs from a discrete probability distribution in several ways:

• The probability that a continuous random variable will assume a particular value is 0, i.e. P(X = x) = 0 for any x.

• A continuous probability distribution can’t be expressed in tabular form.

• An equation or formula is used to describe a continuous probability distribution.

When dealing with a continuous random variable X, we attempt to find a function f(x), called a probability density function (PDF), for which

$P(a < X < b) = \int_a^b f(x)\,dx$

A probability density function f(x) must satisfy two conditions:

1. f(x) is nonnegative.

2. The total area under the curve representing f(x) equals 1.

A continuous random variable also has an expected value and a variance:

$E(X) = \mu = \int x f(x)\,dx \qquad V(X) = \int (x-\mu)^2 f(x)\,dx$

where both integrals run over all possible values of x.
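
The integrals above can be approximated numerically when a density is given. The Python sketch below uses a simple midpoint rule and, purely as an assumed illustration, the density f(x) = 2x on the interval from 0 to 1; neither the density nor the helper `integrate` comes from the slides.

```python
def integrate(g, a, b, steps=10_000):
    """Approximate the integral of g from a to b with the midpoint rule."""
    width = (b - a) / steps
    return sum(g(a + (i + 0.5) * width) for i in range(steps)) * width

def f(x):
    """Illustrative density (an assumption): f(x) = 2x for 0 <= x <= 1."""
    return 2 * x

total_area = integrate(f, 0, 1)                              # should be 1
prob = integrate(f, 0.25, 0.5)                               # P(0.25 < X < 0.5)
mu = integrate(lambda x: x * f(x), 0, 1)                     # E(X)
variance = integrate(lambda x: (x - mu) ** 2 * f(x), 0, 1)   # V(X)

print(round(total_area, 4), round(prob, 4), round(mu, 4), round(variance, 4))
# 1.0 0.1875 0.6667 0.0556
```

For this density the exact answers are 1, 3/16, 2/3 and 1/18, so the numerical values can be compared against them.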



Example: A survey finds the following probability distribution for the age of a rented car.

What is the probability that a rented car is between 0 and 4 years old?


Solution: The histogram of this distribution is shown on the left of the figure below. The curve on the right is the graph of some function f, which we call a probability density function. The domain of f is [0, +∞). The probability that a rented car is between 0 and 4 years old is

P(0 ≤ X ≤ 4) = 0.1 + 0.26 + 0.28 + 0.2 = 0.84

Practice:

Consider a random variable X with the probability density function

f(x) = −.5x + 1,  0 <= x <= 2

a. Verify that f(x) is a probability density function
b. Find P(X >= 1)
c. Find P(X <= .5)
d. Find P(X = 1.5)

Hint: an antiderivative of f(x) is −.5(x²/2) + x.
Uniform distribution:

A random variable X defined over an interval a <= x <= b is uniformly distributed if its density function is given by

f(x) = 1/(b−a),  a <= x <= b

[Figure: rectangle of height 1/(b−a) over the interval from a to b, with area A = 1]

E(X) = (1/2)(a+b)    V(X) = (b−a)²/12
Example:

Suppose X is uniformly distributed between 100 and 180.

A. Define the density function.
B. What is the probability that 120 <= X <= 150?

Solution:

A. f(x) = 1/(180−100) = 1/80,  100 <= x <= 180

B. P(120 <= X <= 150) = (150−120)(1/80) = .375
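
For a uniform distribution, a probability is the width of the interval times the constant density. A minimal Python sketch of the example follows; the function name `uniform_probability` is illustrative.

```python
def uniform_probability(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniformly distributed on [a, b]."""
    # Clip the requested interval to the support [a, b]
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0) / (b - a)

a, b = 100, 180
print(uniform_probability(120, 150, a, b))  # 0.375

# Mean and variance from the formulas (a + b)/2 and (b - a)^2 / 12
mean = (a + b) / 2
variance = (b - a) ** 2 / 12
print(mean, round(variance, 2))             # 140.0 533.33
```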

The most commonly encountered type of continuous random variable is a normal random variable, and its distribution is called the normal distribution.

A random variable X with mean μ and standard deviation σ is normally distributed if its probability density function is given by

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

where x ranges from negative infinity to positive infinity.

It is sometimes referred to as a ‘bell-shaped’ distribution.
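
The normal density, f(x) = exp(−(x−μ)²/(2σ²)) / (σ√(2π)), can be evaluated directly. The Python sketch below writes it out by hand and compares it with `statistics.NormalDist` from the standard library; the values μ = 65 and σ = 5 are chosen only for illustration.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

mu, sigma = 65, 5   # illustrative parameter values

# The hand-written formula agrees with the library implementation
print(abs(normal_pdf(70, mu, sigma) - NormalDist(mu, sigma).pdf(70)) < 1e-12)  # True

# The curve is symmetric about the mean, which gives the bell shape
print(abs(normal_pdf(mu - 3, mu, sigma) - normal_pdf(mu + 3, mu, sigma)) < 1e-12)  # True
```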



The Central Limit Theorem

• For a relatively large sample size, the sample mean x̄ is approximately normally distributed, regardless of the population’s distribution. The approximation becomes better and better with increasing sample size.

• If the sample size n is 30 or more, then probabilities for x̄ are approximately equal to areas under the normal curve with parameters μ and σ/√n.
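
A small simulation can illustrate the theorem. The Python sketch below repeatedly draws samples of size 30 from a strongly skewed population and looks at the resulting sample means; the exponential population (mean 1, standard deviation 1), the seed and the number of repetitions are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)        # fixed seed so the run is repeatable

n = 30                # sample size
repetitions = 5000    # number of samples drawn

# Each entry is the mean of one sample of size n from an exponential population
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(repetitions)
]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)

print(round(mean_of_means, 3))                      # close to the population mean, 1
print(round(sd_of_means, 3), round(1 / sqrt(n), 3)) # close to sigma / sqrt(n)
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is heavily right skewed.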

The normal distribution has the following properties:

• Mean = median = mode

• Symmetric about the center

• 50% of values are less than the mean and 50% are greater than the mean


Standard Score (Z): usually used with the normal distribution. The transformation from a raw score X to a Z score can be done using the following formula:

$Z = \frac{X - \mu}{\sigma}$

Transforming a variable in this way is called ‘standardizing’ the variable. It should be kept in mind that if X is not normally distributed then the transformed variable will not be normally distributed either.


Example:

Vehicle speeds at a highway location have a normal distribution with a mean of 65 mph and a standard deviation of 5 mph. What is the probability that a randomly selected car is going 73 mph or less?

Solution: The graph on the previous page shows that more than half of the area under the curve is shaded. Standardizing,

P(X < 73) = P((X − 65)/5 < (73 − 65)/5) = P(Z < 1.6), where Z = (X − 65)/5 ~ N(0,1)

Looking up z = 1.6 in the z table gives the cumulative probability P(Z < 1.6) = .9452, so P(X < 73) = .9452.

There is a 94.52% chance of randomly selecting a vehicle that is going 73 mph or less.
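
The table lookup can be replaced by a direct calculation. This Python sketch standardizes the speed and evaluates the standard normal cumulative distribution with `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 65, 5   # mean and standard deviation of vehicle speeds (mph)
x = 73

z = (x - mu) / sigma              # standard score
p = NormalDist().cdf(z)           # P(Z < z) for the standard normal

print(z, round(p, 4))             # 1.6 0.9452

# The same probability without standardizing first
print(round(NormalDist(mu, sigma).cdf(x), 4))   # 0.9452
```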

Practice:
Determine the following probabilities
A. P(Z >= 1.47)
B. P(0 <= Z <= 1.85)
C. P(1.65 <= X <= 2.36)

where Z ~ N(0,1) and X ~ N(1,2)


Exercise 1:

Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?

Exercise 2:

The data set of N=90 ordered observations as shown below is examined for outliers:

Exercise 3:

IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. What IQ score separates the bottom 30% from the top 70%? This is also known as the 30th percentile.

