
Probability Distributions in R

Lecture 9
Lecture Outline
• R - Normal Distribution
• R - Binomial Distribution
• R - Poisson Regression
Random Variable

• A random variable x takes on a defined set of values with different probabilities.
• For example, if you roll a die, the outcome is random (not fixed) and there are 6 possible outcomes, each of which occurs with probability one-sixth.
• For example, if you poll people about their voting preferences, the percentage of the sample that responds “Yes on Proposition 100” is also a random variable (the percentage will differ slightly every time you poll).

• Roughly, probability is how frequently we expect different outcomes to occur if we repeat the experiment over and over (the “frequentist” view)
Random variables can be discrete or continuous

• Discrete random variables have a countable number of outcomes
– Examples: dead/alive, treatment/placebo, dice, counts, etc.
• Continuous random variables have an infinite continuum of possible values.
– Examples: blood pressure, weight, the speed of a car, the real numbers from 1 to 6.
Probability functions

• A probability function maps the possible values of x against their respective probabilities of occurrence, p(x).
• p(x) is a number from 0 to 1.0.
• The area under a probability function is always 1.
Discrete example: roll of a die

[Figure: pmf of a fair die, a flat bar at p(x) = 1/6 for each x = 1, 2, 3, 4, 5, 6]

\sum_{\text{all } x} P(x) = 1
Probability mass function (pmf)
x    p(x)
1    p(x=1) = 1/6
2    p(x=2) = 1/6
3    p(x=3) = 1/6
4    p(x=4) = 1/6
5    p(x=5) = 1/6
6    p(x=6) = 1/6
Total: 1.0
Cumulative distribution function (CDF)

[Figure: step plot of the die CDF, rising from P(x≤1) = 1/6 at x = 1 to 1.0 at x = 6]
Cumulative distribution function
x    P(x≤A)
1    P(x≤1) = 1/6
2    P(x≤2) = 2/6
3    P(x≤3) = 3/6
4    P(x≤4) = 4/6
5    P(x≤5) = 5/6
6    P(x≤6) = 6/6
Examples
1. What’s the probability that you roll a 3 or less?
P(x≤3) = 1/2

2. What’s the probability that you roll a 5 or higher?
P(x≥5) = 1 – P(x≤4) = 1 – 2/3 = 1/3
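These die-roll probabilities are easy to check in R. A minimal sketch (the vectors x and p below simply encode the pmf from this slide):

# pmf of a fair die.
x <- 1:6
p <- rep(1/6, 6)
# P(x <= 3): sum the pmf over outcomes 1 to 3.
sum(p[x <= 3])       # 0.5
# P(x >= 5) via the complement of the CDF at 4.
1 - sum(p[x <= 4])   # 0.3333333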
R - Normal Distribution
• In a random collection of data from independent sources, it is generally observed that the
distribution of the data is normal. This means that if we plot a graph with the values of the variable
on the horizontal axis and the count of the values on the vertical axis, we get a bell-shaped curve.
The center of the curve represents the mean of the data set. In the graph, fifty percent of the
values lie to the left of the mean and the other fifty percent lie to the right. This is referred to as
the normal distribution in statistics.
• R has four built-in functions to generate the normal distribution. They are described below.
dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)
• Following is the description of the parameters used in the above functions −
x is a vector of numbers.
p is a vector of probabilities.
n is the number of observations (sample size).
mean is the mean value of the sample data. Its default value is zero.
sd is the standard deviation. Its default value is 1.
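Before looking at each function in turn, here is a quick sketch of how the four relate, using the standard normal defaults (mean = 0, sd = 1):

# Density of the standard normal at its mean.
dnorm(0)     # 0.3989423
# CDF at the mean: half the mass lies below it.
pnorm(0)     # 0.5
# qnorm inverts pnorm.
qnorm(0.5)   # 0
# Three random draws from the standard normal.
rnorm(3)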
Important discrete distributions
• Binomial
– Yes/no outcomes (dead/alive, treated/untreated,
smoker/non-smoker, sick/well, etc.)
• Poisson
– Counts (e.g., how many cases of disease in a given
area)
Expected Value and Variance
• All probability distributions are characterized
by an expected value and a variance (standard
deviation squared).
For example, the bell-curve (normal) distribution:

[Figure: normal curve with the mean (μ) at the center and one standard deviation (σ) marked on either side]
Expected value, or mean

• If we understand the underlying probability function of a certain phenomenon, then we can
make informed decisions based on how we expect x to behave on average over the long
run (the so-called “frequentist” theory of probability).

• Expected value is just the weighted average or mean (µ) of random variable x. Imagine
placing the masses p(x) at the points x on a beam; the balance point of the beam is the
expected value of x.
Example: expected value

• Recall the following probability distribution of ship arrivals:

x      10   11   12   13   14
P(x)   .4   .2   .2   .1   .1

E(X) = \sum_{i} x_i \, p(x_i) = 10(.4) + 11(.2) + 12(.2) + 13(.1) + 14(.1) = 11.3
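The same calculation in R, with x and p taken from the table above:

# Outcomes and their probabilities from the ship-arrival table.
x <- c(10, 11, 12, 13, 14)
p <- c(.4, .2, .2, .1, .1)
# Expected value is the probability-weighted sum.
sum(x * p)   # 11.3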
Expected value, formally

Discrete case:

E(X) = \sum_{\text{all } x_i} x_i \, p(x_i)

Continuous case:

E(X) = \int_{\text{all } x} x \, p(x) \, dx
Empirical Mean is a special case of Expected Value…

Sample mean, for a sample of n subjects:

\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \sum_{i=1}^{n} x_i \left( \frac{1}{n} \right)

The probability (frequency) of each person in the sample is 1/n.
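A small sketch of this weighted-average view in R (the vector x is arbitrary illustrative data):

# Five arbitrary observations.
x <- c(4, 8, 15, 16, 23)
n <- length(x)
# Weight each observation by 1/n ...
sum(x * (1/n))   # 13.2
# ... which is exactly what mean() computes.
mean(x)          # 13.2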
Variance/standard deviation
“The average (expected) squared distance (or deviation) from the mean”

\sigma^2 = Var(x) = E[(x - \mu)^2] = \sum_{\text{all } x_i} (x_i - \mu)^2 \, p(x_i)

**We square because squaring has better properties than absolute value. Take the square root to get back a linear average distance from the mean (the “standard deviation”).
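For example, the variance of a fair die roll can be computed straight from its pmf:

# pmf of a fair die.
x <- 1:6
p <- rep(1/6, 6)
# Mean, then variance, from the definitions above.
mu <- sum(x * p)            # 3.5
sum((x - mu)^2 * p)         # 2.916667
# Square root gives the standard deviation.
sqrt(sum((x - mu)^2 * p))   # 1.707825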
Variance, formally

Discrete case:

Var(X) = \sigma^2 = \sum_{\text{all } x_i} (x_i - \mu)^2 \, p(x_i)

Continuous case:

Var(X) = \sigma^2 = \int_{\text{all } x} (x - \mu)^2 \, p(x) \, dx
Similarity to empirical variance

The variance of a sample:

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \left( \frac{1}{n - 1} \right)

Division by n - 1 reflects the fact that we have lost a “degree of freedom” (a piece of information) because we had to estimate the sample mean before we could estimate the sample variance.
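R’s built-in var() uses exactly this n - 1 divisor, which we can verify by hand:

# Reuse the five observations from the sample-mean sketch.
x <- c(4, 8, 15, 16, 23)
n <- length(x)
# Manual sample variance with the n - 1 divisor ...
sum((x - mean(x))^2) / (n - 1)   # 54.7
# ... matches R's built-in estimator.
var(x)                           # 54.7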
Symbol of Variance
• Var(X) = σ²
– these symbols are used interchangeably
dnorm()
• This function gives the height of the probability density curve at each point, for a
given mean and standard deviation.
# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
# Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)
# Give the chart file a name.
png(file = "dnorm.png")
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces a plot of the bell-shaped density
curve centered at 2.5 and saves it to dnorm.png.
pnorm()
• This function gives the probability that a normally distributed random number is
less than the value of a given number. It is also called the "Cumulative Distribution
Function".

# Create a sequence of numbers between -10 and 10 incrementing by 0.2.
x <- seq(-10, 10, by = .2)
# Choose the mean as 2.5 and standard deviation as 2.
y <- pnorm(x, mean = 2.5, sd = 2)
# Give the chart file a name.
png(file = "pnorm.png")
# Plot the graph.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces an S-shaped plot of cumulative
probabilities and saves it to pnorm.png.
qnorm()
• This function takes the probability value and gives a number whose cumulative value
matches the probability value.

# Create a sequence of probability values incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
# Choose the mean as 2 and standard deviation as 1.
y <- qnorm(x, mean = 2, sd = 1)
# Give the chart file a name.
png(file = "qnorm.png")
# Plot the graph.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces a plot of the quantile function and
saves it to qnorm.png.
rnorm()
• This function is used to generate random numbers whose distribution is normal. It
takes the sample size as input and generates that many random numbers. We draw a
histogram to show the distribution of the generated numbers.

# Create a sample of 50 numbers which are normally distributed.
y <- rnorm(50)
# Give the chart file a name.
png(file = "rnorm.png")
# Plot the histogram for this sample.
hist(y, main = "Normal Distribution")
# Save the file.
dev.off()
When we execute the above code, it produces a histogram of the 50 generated
values and saves it to rnorm.png.
R - Binomial Distribution
• The binomial distribution model deals with finding the probability of success of an event
which has only two possible outcomes in a series of experiments. For example, tossing a
coin always gives a head or a tail. The probability of finding exactly 3 heads when tossing a
coin 10 times, for example, is computed using the binomial distribution (see the sketch
after the parameter list below).
• R has four built-in functions to generate the binomial distribution. They are described below.
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
• Following is the description of the parameters used −
x is a vector of numbers.
p is a vector of probabilities.
n is the number of observations.
size is the number of trials.
prob is the probability of success of each trial.
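For instance, the “exactly 3 heads in 10 tosses” probability mentioned above is a one-liner:

# P(exactly 3 heads in 10 fair coin tosses).
dbinom(3, 10, 0.5)   # 0.1171875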
dbinom()
• This function gives the probability mass at each point.

# Create a sequence of numbers from 0 to 50 incrementing by 1.
x <- seq(0, 50, by = 1)
# Create the binomial distribution.
y <- dbinom(x,50,0.5)
# Give the chart file a name.
png(file = "dbinom.png")
# Plot the graph for this sample.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it produces a plot of the binomial probabilities
and saves it to dbinom.png.
pbinom()
• This function gives the cumulative probability of an event. It is a single
value representing the probability.

# Probability of getting 26 or fewer heads from 51 tosses of a coin.
x <- pbinom(26, 51, 0.5)
print(x)
• When we execute the above code, it produces
the following result:
• [1] 0.610116
qbinom()
• This function takes the probability value and gives a number whose
cumulative value matches the probability value.

# Find the number of heads whose cumulative probability is 0.25 when a coin
# is tossed 51 times.
x <- qbinom(0.25, 51, 1/2)
print(x)
• When we execute the above code, it produces
the following result −
• [1] 23
rbinom()
• This function generates the required number of random values from a binomial
distribution with a given number of trials and success probability.

# Generate 8 random values from a binomial distribution with 150 trials and
# success probability 0.4.
x <- rbinom(8, 150, .4)
print(x)
• When we execute the above code, it produces
the following result −
• [1] 58 61 59 66 55 60 61 67
R - Poisson Regression
• Poisson regression involves regression models in which the response variable is in the form of counts
and not fractional numbers. For example, the number of births or the number of wins in a football
match series. Also, the values of the response variable follow a Poisson distribution.
• The general mathematical equation for Poisson regression is −
log(y) = a + b1x1 + b2x2 + ... + bnxn
• Following is the description of the parameters used −
y is the response variable.
a and b are the numeric coefficients.
x is the predictor variable.
• The function used to create the Poisson regression model is the glm() function.
• Syntax
• The basic syntax for glm() function in Poisson regression is −
glm(formula,data,family)
• Following is the description of the parameters used in above functions −
formula is the symbol representing the relationship between the variables.
data is the data set giving the values of these variables.
family is an R object to specify the details of the model. Its value is 'poisson' for Poisson regression.
Example
• We have the built-in data set "warpbreaks", which
describes the effect of wool type (A or B) and tension
(low, medium, or high) on the number of warp breaks
per loom. Let's consider "breaks" as the response
variable, which is a count of the number of breaks. The
wool "type" and "tension" are taken as predictor
variables.
• Input Data
• input <- warpbreaks
• print(head(input))
When we execute the above code, it prints the first six rows of the warpbreaks
data set (columns breaks, wool, and tension).
Create Regression Model
output <- glm(formula = breaks ~ wool + tension, data = warpbreaks,
              family = poisson)
print(summary(output))
When we execute the above code, it prints the model summary, including the
coefficient table with estimates, standard errors, z values, and p-values.
• In the summary, we look for the p-value in the
last column of the coefficient table to be less
than 0.05 to conclude that a predictor variable
has an impact on the response variable. As seen,
wool type B and tension levels M and H all have
an impact on the count of breaks.
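Because the model is fit on the log scale, the coefficients are easier to interpret after exponentiating them. A short follow-up sketch, using the output object fitted above:

# Exponentiate the coefficients to get multiplicative effects on the
# expected count of breaks.
exp(coef(output))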
