22amh32 - Data Analytics and Data Science Unit I & Probability Distributions and Fitting A Model 1. Probability Distributions and Fitting A Model
DATA SCIENCE
Figure 1: Frequency_distribution_example_egg_weight
She can get a rough idea of the probability of different egg sizes directly from this frequency
distribution. For example, she can see that there’s a high probability of an egg being around
1.9 oz., and there’s a low probability of an egg being bigger than 2.1 oz.
Suppose the farmer wants more precise probability estimates. One option is to improve her
estimates by weighing many more eggs.
A better option is to recognize that egg size appears to follow a common probability
distribution called a normal distribution. The farmer can make an idealized version of the egg
weight distribution by assuming the weights are normally distributed:
Figure 2: Normal_distribution_example_egg_weight
Since normal distributions are well understood by statisticians, the farmer can calculate
precise probability estimates, even with a relatively small sample size.
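As a rough illustration of this idea, the sketch below fits a normal distribution to a small sample and uses it to estimate a tail probability. The sample values are hypothetical and chosen only for illustration; they are not the farmer's data. It uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Hypothetical sample of egg weights in ounces (illustrative values only).
egg_weights_oz = [1.78, 1.85, 1.88, 1.90, 1.91, 1.93, 1.95, 1.99, 2.03, 2.08]

# Fit the idealized model: estimate the mean and standard deviation from the sample.
egg_model = NormalDist.from_samples(egg_weights_oz)

# Probability that an egg weighs more than 2.1 oz under the fitted model.
p_heavy = 1 - egg_model.cdf(2.1)

print(f"fitted mean = {egg_model.mean:.3f} oz, sd = {egg_model.stdev:.3f} oz")
print(f"P(weight > 2.1 oz) = {p_heavy:.4f}")
```

Even with only ten eggs, the fitted model gives a probability for weights above 2.1 oz, although none of the sampled eggs is that heavy. A raw frequency count on this sample would report zero.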
Variables that follow a probability distribution are called random variables. There’s special
notation you can use to say that a random variable follows a specific distribution:
Random variables are usually denoted by X.
The ~ (tilde) symbol means “follows the distribution.”
The distribution is denoted by a capital letter (usually the first letter of the
distribution’s name), followed by brackets that contain the distribution’s parameters.
For example, the notation X ~ N(µ, σ²) means “the random variable X follows a normal
distribution with a mean of µ and a variance of σ².”
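One practical consequence of this notation is that it is written in terms of the variance, while many software tools expect the standard deviation instead. A minimal sketch, assuming illustrative values for µ and σ² and using Python's standard-library `NormalDist`:

```python
from math import sqrt
from statistics import NormalDist

mu = 1.9         # assumed mean in oz (illustrative)
variance = 0.01  # assumed variance sigma^2 in oz^2 (illustrative)

# X ~ N(mu, sigma^2): the notation carries the variance, but NormalDist
# expects the standard deviation, so take the square root first.
X = NormalDist(mu=mu, sigma=sqrt(variance))

print(X.mean, X.stdev, X.variance)
```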
Greeting              Probability
“Greetings, human!”   0.6
“Hi!”                 0.1
“Howdy!”              0.1
Distribution         Description                                               Example
Poisson              Describes count data. It gives the probability of         The number of text
                     an event happening k number of times within a             messages received per
                     given interval of time or space.                          day
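To make the Poisson description concrete, the sketch below evaluates the standard Poisson probability mass function, P(X = k) = λ^k e^(−λ) / k!, where λ is the average number of events per interval. The rate of 4 messages per day is an assumed value for illustration, and `poisson_pmf` is a helper name introduced here.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson random variable with mean rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 4.0  # assumed average number of text messages per day (illustrative)

# Probabilities of receiving 0, 1, 2, ... 49 messages in one day.
probs = [poisson_pmf(k, lam) for k in range(50)]

for k in range(8):
    print(f"P(X = {k}) = {probs[k]:.4f}")
```

Because the counts 0, 1, 2, ... cover every possible outcome, the probabilities sum to 1 (up to a negligible tail beyond k = 49).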
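The probability density function of the normally distributed egg weights is:
f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
Where:
f(x) is the probability density of egg weight
µ is the mean egg weight in the population ( oz., in this case)
σ is the standard deviation of egg weight in the population ( oz., in this
case)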
The probability of an egg being exactly 2 oz. is zero. Although an egg can weigh very close
to 2 oz., it is extremely improbable that it will weigh exactly 2 oz. Even if a regular scale
measured an egg’s weight as being 2 oz., an infinitely precise scale would find a tiny
difference between the egg’s weight and 2 oz.
The probability that an egg is within a certain weight interval, such as between 1.98 and
2.04 oz., is greater than zero and can be represented in the graph of the probability density
function as a shaded region.
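The area of such a shaded region can be computed as a difference of two cumulative probabilities. The sketch below assumes a mean of 1.9 oz. and a standard deviation of 0.1 oz. purely for illustration; these are not necessarily the values used in the example above.

```python
from statistics import NormalDist

# Assumed parameters for illustration only.
egg = NormalDist(mu=1.9, sigma=0.1)

# P(1.98 <= weight <= 2.04): area under the density between the two bounds.
p_interval = egg.cdf(2.04) - egg.cdf(1.98)

# A single point is a zero-width interval, so its probability is zero.
p_exact = egg.cdf(2.0) - egg.cdf(2.0)

print(f"P(1.98 <= weight <= 2.04) = {p_interval:.4f}")
print(f"P(weight == 2.0) = {p_exact}")
```

Note that the density f(2.0) itself is not zero. Only the area over a zero-width interval is zero, which is why probabilities for continuous variables are always stated for intervals.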
Distribution         Description                                               Example
Normal distribution  Describes data with values that become less probable      SAT scores
                     the farther they are from the mean, with a bell-shaped
                     probability density function.
Continuous uniform   Describes data for which equal-sized intervals have       The amount of time cars
                     equal probability.                                        wait at a red light
Log-normal           Describes right-skewed data. It’s the probability         The average body weight
                     distribution of a random variable whose logarithm is      of different mammal
                     normally distributed.                                     species
Exponential          Describes data that has higher probabilities for small    Time between
                     values than large values. It’s the probability            earthquakes
                     distribution of time between independent events.
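Two of the descriptions in this table can be checked numerically with the standard closed-form cumulative distribution functions. In the sketch below, the event rate, the 60-second light cycle, and the helper names `exponential_cdf` and `uniform_cdf` are all assumptions made for illustration.

```python
from math import exp

def exponential_cdf(t: float, rate: float) -> float:
    """P(T <= t) for an exponential waiting time with the given event rate."""
    return 1.0 - exp(-rate * t) if t >= 0 else 0.0

def uniform_cdf(x: float, low: float, high: float) -> float:
    """P(X <= x) for a continuous uniform variable on [low, high]."""
    return min(1.0, max(0.0, (x - low) / (high - low)))

rate = 0.5  # assumed events per unit time (illustrative)

# Exponential: of two equal-width intervals, the one nearer zero is more probable.
p_early = exponential_cdf(1, rate) - exponential_cdf(0, rate)
p_late = exponential_cdf(4, rate) - exponential_cdf(3, rate)

# Continuous uniform: equal-width intervals carry the same probability,
# here for an assumed wait time between 0 and 60 seconds.
p_a = uniform_cdf(20, 0, 60) - uniform_cdf(10, 0, 60)
p_b = uniform_cdf(50, 0, 60) - uniform_cdf(40, 0, 60)

print(f"exponential: P(0-1) = {p_early:.4f}, P(3-4) = {p_late:.4f}")
print(f"uniform:     P(10-20) = {p_a:.4f}, P(40-50) = {p_b:.4f}")
```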
DISCUSSION QUESTIONS:
1. What are the advantages and limitations of different methods for fitting probability
distributions to empirical data?
2. How can understanding the characteristics and parameters of probability distributions
enhance the accuracy of predictive modeling?
3. In what ways do the properties of specific probability distributions influence their
suitability for modeling different types of data?