Normal Distribution

The document provides an overview of normal distribution, its significance in probability and statistics, and its applications in data science and machine learning. It explains the characteristics of normal distribution, including the bell-shaped curve and the Central Limit Theorem, which states that the sum of many random variables tends to be normally distributed. Additionally, it discusses methods to check and transform data to achieve normality, as well as potential problems associated with assuming normality in various contexts.

Uploaded by

varmakdc

Normal distribution

Dhanya N.M.
Taxonomy of Probability Distributions

Discrete probability distributions


–Binomial distribution
–Multinomial distribution
–Poisson distribution
–Hypergeometric distribution

Continuous probability distributions


–Normal distribution
–Standard normal distribution
–Gamma distribution
–Exponential distribution
–Chi square distribution
–Lognormal distribution
–Weibull distribution
Normal Distribution

– What is so special about the normal probability distribution?
– Why do so many data science and machine learning articles revolve around the normal probability distribution?
Agenda

– What is a probability distribution?
– What does a normal distribution mean?
– Which variables exhibit a normal distribution?
– How to check the distribution of your data set in Python
– How to make a variable normally distributed in Python
– Problems with normality
A Little Background First

– The normal distribution is also known as the Gaussian distribution.
– It is named after Carl Friedrich Gauss.
– Simple predictive models are usually the most widely used models because they can be explained and are well understood.
– The normal distribution is simple, and that simplicity makes it extremely popular.
What Does Probability
Distribution Mean?
Let me explain by building the appropriate building blocks first.
– Consider the predictive models we might be interested in building in our data science projects.
– If we want to predict a variable accurately, the first task is to understand the underlying behaviour of our target variable.
– To do that, we first determine the possible outcomes of the target variable and whether those outcomes are discrete (distinct, countable values) or continuous (any value within a range).
– For the sake of simplicity, if we are estimating the behaviour of a die, the first step is to know that it can take any whole-number value from 1 to 6 (discrete).
– The next step is to start assigning probabilities to the events (values). If a value cannot occur, it is assigned a probability of 0%.
The higher the probability, the
more likely it is for the event to
occur.
– For instance, we can repeat an experiment a large number of times and note the values we observe for the variable.
– We can then group the values into categories/buckets.
– For each bucket, we record the number of times the variable took the value of that bucket.
– For example, we can throw a die 10000 times and, as there are 6 possible values that a die can take, create 6 buckets.
– We then record the number of occurrences of each value.
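The die experiment described above can be sketched in Python. This is a minimal illustration that assumes NumPy is available; the seed and the variable names are arbitrary choices made for the example, not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
n_throws = 10_000

# Simulate throwing a fair six-sided die n_throws times.
throws = rng.integers(low=1, high=7, size=n_throws)  # high is exclusive

# One bucket per face: count how often each value 1..6 occurred.
counts = np.bincount(throws, minlength=7)[1:]

# Relative frequencies estimate the probability of each face (about 1/6 each).
probabilities = counts / n_throws

for face, (count, p) in enumerate(zip(counts, probabilities), start=1):
    print(f"face {face}: {count} throws, estimated probability {p:.3f}")
```

Dividing each bucket count by the number of throws turns the counts into estimated probabilities, which add up to 1.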
Probability Distribution

– We can plot the chart and it will form a curve.
– This curve is known as the probability distribution curve. The probability distribution of the variable describes how likely the target variable is to take each of its possible values.
– Once we understand how the values are distributed, we can start estimating the probabilities of the events, including by means of formulas (known as probability distribution functions).
– As a result, we can start understanding the variable's behaviour better.
– The shape of a probability distribution is summarised by its moments, such as the mean, standard deviation, skewness and kurtosis.
– If you add up all of the probabilities, they sum to 100%.
– There are a large number of probability distributions, and the most widely used one is known as the “normal distribution”.
Let’s Now Move Onto
Normal Probability
Distribution
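– If you plot the probability distribution and it forms a symmetric, bell-shaped curve in which the mean, mode and median of the sample are equal, then the variable is a good candidate for a normal distribution. Other symmetric distributions, such as Student's t, share these properties, so a formal check is still worthwhile.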
– This is an example normal distribution bell shaped curve:
It is important to understand and
estimate the probability distribution of
your target variable.
The following variables are close to normally distributed:
– Height of a population
– Blood pressure of adult humans
– Position of a particle that experiences diffusion
– Measurement errors
– Residuals in regression
– Shoe size of a population
– Amount of time it takes for employees to reach home
– A large number of educational measures
– Additionally, a large number of variables around us can be treated as normal with some confidence level x%, where x < 100; normality is an approximation, never a certainty.
What Is Normal
Distribution?
– A normal distribution is a distribution that is fully determined by two parameters: its mean and its standard deviation.
– Mean: the average value of all the points in the sample.
– Standard Deviation: indicates how far, on average, the points deviate from the mean of the sample.
This characteristic makes the distribution extremely simple for statisticians to work with, and hence a variable that exhibits a normal distribution is comparatively easy to forecast.
Normal Distribution Is Simply … The
Normal Behaviour That We Are Just
So Familiar With
– Now, what is phenomenal to note is that once you find the probability distributions of many of the variables in nature, a large number of them approximately follow a normal distribution.
– The normal distribution is simple to explain. The reasons are:
– The mean, mode and median of the distribution are equal.
– We only need the mean and standard deviation to describe the entire distribution.
But how are so many variables
approximately normally distributed?
What is the logic behind it?
– The idea revolves around the theorem that when you add up a large number of independent random variables, each with finite variance, the distribution of their sum will be very close to normal, whatever the distributions of the individual variables.
– As the height of a person is a random variable that is based on many other random variables, such as the amount of nutrition the person consumes, the environment they live in, their genetics and so on, the combined effect of these variables ends up being very close to normally distributed.
– This is known as the Central Limit Theorem.
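The effect described by the Central Limit Theorem can be seen in a small simulation. The sketch below assumes NumPy and SciPy; the choice of an exponential distribution, of 50 variables per sum, and of skewness as the yardstick are illustrative assumptions, not something taken from the slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Each row holds 50 independent draws from a (non-normal) exponential distribution.
n_variables, n_repeats = 50, 20_000
draws = rng.exponential(scale=1.0, size=(n_repeats, n_variables))

# A single exponential variable is strongly skewed (theoretical skewness 2).
single_skew = stats.skew(draws[:, 0])

# The sum of 50 such variables is far closer to symmetric.
sums = draws.sum(axis=1)
sum_skew = stats.skew(sums)

print(f"skewness of one variable: {single_skew:.2f}")
print(f"skewness of the sum:      {sum_skew:.2f}")
```

A normal distribution has skewness 0, so the drop in skewness from the single variable to the sum shows the sum moving toward a bell shape. Adding more variables per sum moves it closer still.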
This brings us to the core
of the article:
– We understood from the section above that the normal distribution arises as the approximate distribution of the sum of many independent random variables.
– If we plot the normal distribution density function, its curve has the following characteristics:
Characteristics

– The bell-shaped curve above has a mean of 100 and a standard deviation of 1.
– The mean is the center of the curve. This is the highest point of the curve, as values close to the mean are the most likely.
– The curve is symmetric: half of the probability lies on each side of the mean.
– The total area under the curve is the total probability of all of the values that the variable can take.
– The total area under the curve is therefore 100%.
Characteristics
– Approximately 68.27% of all of the points are within 1 standard deviation of the mean (from -1 to 1).
– About 95.45% of all of the points are within 2 standard deviations of the mean (from -2 to 2).
– About 99.73% of all of the points are within 3 standard deviations of the mean (from -3 to 3).
– This allows us to easily estimate how volatile a variable is and, given a confidence level, what its likely value is going to be.
– For instance, in the gray bell-shaped curve above, there is a 68.27% chance that the value of the variable will be between 99 and 101.
– Imagine the confidence you can now have when making future decisions with that information!
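The percentages for 1, 2 and 3 standard deviations can be checked numerically. This is a minimal sketch that assumes SciPy is installed and uses the mean of 100 and standard deviation of 1 from the example curve.

```python
from scipy.stats import norm

mean, std = 100.0, 1.0  # values of the example curve in the text

coverage = {}
for k in (1, 2, 3):
    lower, upper = mean - k * std, mean + k * std
    # P(lower < X < upper) is the difference of the CDF at the two ends.
    coverage[k] = norm.cdf(upper, loc=mean, scale=std) - norm.cdf(lower, loc=mean, scale=std)
    print(f"within {k} standard deviation(s): {coverage[k]:.4f}")
```

The same three proportions come out for any mean and standard deviation, because they depend only on how many standard deviations the range spans.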
Normal Probability
Distribution Function
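– The probability density function of the normal distribution with mean μ and standard deviation σ is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
– The probability density function gives the relative likelihood of a continuous random variable taking a value near x. The probability of any single exact value is zero; probabilities come from the area under the curve over a range.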


Normal distribution is a bell-shaped curve where
mean=mode=median.
– If you plot the probability distribution curve using its probability density function, then the area under the curve for a given range gives the probability of the target variable being in that range.
– This probability distribution curve is based on a probability distribution function, which is itself defined by a number of parameters such as the mean and standard deviation of the variable.
– We could use this probability distribution function to find the chance of a random variable taking a value within a range. For instance, we could record the daily returns of a stock, group them into appropriate buckets and then find the probability of the stock making a 20–40% gain in the future.
– The larger the standard deviation, the greater the volatility in the sample.
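The link between the density function and the probability of a range can be sketched as follows. The example assumes SciPy is available; the mean, standard deviation and range are hypothetical numbers chosen for illustration and do not come from real stock data.

```python
import math
from scipy import integrate
from scipy.stats import norm


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 0.01, 0.02   # hypothetical daily return: mean 1%, standard deviation 2%
low, high = 0.0, 0.03    # range of interest: a gain between 0% and 3%

# Area under the density curve over [low, high], by numerical integration.
area, _ = integrate.quad(normal_pdf, low, high, args=(mu, sigma))

# The same probability from the cumulative distribution function.
prob = norm.cdf(high, mu, sigma) - norm.cdf(low, mu, sigma)

print(f"area under the curve: {area:.4f}")
print(f"CDF difference:       {prob:.4f}")
```

Both routes give the same number, which is why the cumulative distribution function is normally used in practice instead of integrating the density by hand.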
How Do I Find Feature
Distribution In Python?
– The simplest method I follow is to load all of the features into a data frame and then plot a histogram of each one.
– Use the Python pandas library:
– DataFrame.hist(bins=10)
– # Makes a histogram of each numeric column of the DataFrame.

– It shows us the empirical distribution of each of the variables.
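A histogram is a visual check; numeric summaries can back it up. The sketch below assumes pandas, NumPy and SciPy are installed. The data frame, its column names and the use of the Shapiro-Wilk test are illustrative assumptions and are not taken from the slides.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(seed=2)

# Hypothetical data frame: one roughly normal feature and one skewed feature.
df = pd.DataFrame({
    "height_cm": rng.normal(loc=170, scale=10, size=500),
    "income": rng.lognormal(mean=10, sigma=1, size=500),
})

# df.hist(bins=10) would draw one histogram per column (requires matplotlib).
# Numeric summaries give a quick check without plotting:
summary = pd.DataFrame({
    "skewness": df.skew(),
    "excess_kurtosis": df.kurt(),
    # Shapiro-Wilk test: a small p-value is evidence against normality.
    "shapiro_p": df.apply(lambda col: stats.shapiro(col)[1]),
})
print(summary)
```

Skewness and excess kurtosis close to 0 are consistent with a normal distribution, while a very small Shapiro-Wilk p-value, as expected for the skewed column here, suggests the feature is not normal.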


What Does It Mean For A Variable
To Have Normal Distribution?

– Now what is even more fascinating is that once you add together a large number of independent random variables with differing distributions, the new variable will end up having an approximately normal distribution. This is essentially the Central Limit Theorem.
– Sums and linear transformations of normally distributed variables are again normally distributed. For instance, if A and B are two independent variables with normal distributions then:
– A + B is normally distributed
– c x A + d is normally distributed for constants c ≠ 0 and d
– The product A x B, however, is not normally distributed in general.
– As a result, it is extremely simple to forecast such a variable and find the probability of it falling within a range of values because of the well-known probability distribution function.
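How sums and products of normal variables behave can be checked with a short simulation. This sketch assumes NumPy and SciPy, takes A and B to be independent standard normal variables, and uses excess kurtosis (0 for a normal distribution) as a simple yardstick; these are choices made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
n = 200_000

# Two independent normal variables.
a = rng.normal(loc=0.0, scale=1.0, size=n)
b = rng.normal(loc=0.0, scale=1.0, size=n)

# Excess kurtosis is 0 for a normal distribution.
kurt_sum = stats.kurtosis(a + b)       # close to 0: the sum is still normal
kurt_product = stats.kurtosis(a * b)   # well above 0: the product has heavy tails

print(f"excess kurtosis of A + B: {kurt_sum:.2f}")
print(f"excess kurtosis of A x B: {kurt_product:.2f}")
```

The sum of two independent normal variables stays normal, whereas their product has much heavier tails than a normal distribution and so cannot be treated as normal.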
What If The Sample
Distribution Is Not Normal?

– You can often transform a feature so that its distribution is closer to a normal distribution.
– I have used a number of techniques to bring a feature closer to a normal distribution:
1. Linear Transformation

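– Once we gather a sample for a variable, we can compute the Z-score by linearly transforming the sample:
– Calculate the mean μ
– Calculate the standard deviation σ
– For each value x, compute Z using: $$Z = \frac{x - \mu}{\sigma}$$
– Note that this standardises the sample to mean 0 and standard deviation 1 but does not change the shape of its distribution; a skewed sample stays skewed.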
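The Z-score steps above can be sketched with NumPy. The sample values are hypothetical, and the population standard deviation (NumPy's default, ddof=0) is an assumption made for the example.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical sample

mean = x.mean()        # 5.0 for this sample
std = x.std()          # population standard deviation (ddof=0), 2.0 here
z = (x - mean) / std   # z-score of every value

print(z)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```

After the transformation the sample has mean 0 and standard deviation 1, and each z value states how many standard deviations the original value lies from the mean.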
2. Using Box-Cox
Transformation

– You can use the SciPy package for Python to apply a Box-Cox transformation, which requires strictly positive data and can bring it closer to a normal distribution:
– scipy.stats.boxcox(x, lmbda=None, alpha=None)
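A short usage sketch of the Box-Cox call, assuming NumPy and SciPy are installed. The lognormal sample is hypothetical data generated for the example, and skewness is used here only as a rough indicator of how close the result is to a symmetric, bell-shaped distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
x = rng.lognormal(mean=0.0, sigma=0.8, size=2_000)  # positive, right-skewed

# With lmbda=None, boxcox returns the transformed data and the fitted lambda.
x_transformed, fitted_lambda = stats.boxcox(x)

skew_before = stats.skew(x)
skew_after = stats.skew(x_transformed)

print(f"fitted lambda:   {fitted_lambda:.3f}")
print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

For lognormal data the fitted lambda is expected to be near 0, which corresponds to a log transformation. Box-Cox raises an error if the data contain zero or negative values.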
3. Using Yeo-Johnson
Transformation
– Additionally, the Yeo-Johnson power transformation can be used; unlike Box-Cox, it also accepts zero and negative values. Python's scikit-learn provides the appropriate class:
– sklearn.preprocessing.PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
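A minimal usage sketch of the transformer, assuming scikit-learn, NumPy and SciPy are installed. The shifted exponential sample is hypothetical and was chosen so that the data include negative values.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(seed=5)
# Skewed data that includes negative values, which Box-Cox cannot handle.
x = rng.exponential(scale=2.0, size=2_000) - 1.0

transformer = PowerTransformer(method="yeo-johnson", standardize=True)
# scikit-learn expects a 2-D array of shape (n_samples, n_features).
x_transformed = transformer.fit_transform(x.reshape(-1, 1)).ravel()

skew_before = stats.skew(x)
skew_after = stats.skew(x_transformed)

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

With standardize=True the output is also rescaled to mean 0 and standard deviation 1. The fitted transformer can be reused on new data with transformer.transform.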
Min-Max Normalization

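– Min-max normalization, a form of feature scaling, performs a linear transformation on the original data: $x' = (x - x_{min}) / (x_{max} - x_{min})$.
– This technique maps all of the scaled data into the range [0,1]. Like the Z-score, it changes the scale of the data but not the shape of its distribution.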
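Min-max normalization can be sketched in a few lines of NumPy. The feature values below are hypothetical.

```python
import numpy as np

x = np.array([10.0, 20.0, 25.0, 40.0, 50.0])  # hypothetical feature values

x_min, x_max = x.min(), x.max()
# Subtract the minimum and divide by the range so the result lies in [0, 1].
x_scaled = (x - x_min) / (x_max - x_min)

print(x_scaled)  # [0.    0.25  0.375 0.75  1.   ]
```

The smallest value maps to 0 and the largest to 1. If all values are identical the range is 0 and the division is undefined, so that case needs separate handling.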
Problems With Normality

– As the normal distribution is simple and well understood, it is also overused in predictive projects.
– Assuming normality has its own flaws.
– For instance, we cannot assume that a stock price follows a normal distribution, as the price cannot be negative.
– The stock price is therefore often modelled with a lognormal distribution, which ensures it is never below zero.
– Returns can be negative, so returns can be modelled with a normal distribution.
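The difference between returns and prices can be illustrated with a small simulation. This sketch assumes NumPy; the starting price, the volatility and the use of log returns are hypothetical modelling choices made for the example, not a claim about any real stock.

```python
import numpy as np

rng = np.random.default_rng(seed=6)

# Hypothetical daily log returns, drawn from a normal distribution.
log_returns = rng.normal(loc=0.0, scale=0.02, size=250)

# Prices built from normal log returns follow a lognormal distribution.
start_price = 100.0
prices = start_price * np.exp(np.cumsum(log_returns))

print(f"lowest return: {log_returns.min():.4f}")  # negative
print(f"lowest price:  {prices.min():.2f}")       # still positive
```

The returns take negative values, but because the price is obtained by exponentiating their running sum, it can never fall below zero.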
Problems With Normality

– It is not wise to assume that a variable follows a normal distribution without any analysis.
– A variable can follow a Poisson, Student's t or binomial distribution, for instance, and falsely assuming that it follows a normal distribution can lead to inaccurate results.
Problem

– The population distribution of SAT scores is normal with a mean of μ = 500 and
a standard deviation of 100. Given this information about the population and
the known proportions for a normal distribution, we can determine the
probabilities associated with specific samples. For example, what is the
probability of randomly selecting an individual from this population who has an
SAT score greater than 700?
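The SAT question can be worked through numerically. This is a minimal sketch that assumes SciPy is installed and uses the mean of 500 and standard deviation of 100 given in the problem.

```python
from scipy.stats import norm

mu, sigma = 500, 100   # population mean and standard deviation from the problem
score = 700

# Step 1: convert the score into a z-score.
z = (score - mu) / sigma   # (700 - 500) / 100 = 2.0

# Step 2: the survival function gives P(Z > z) = 1 - CDF(z).
p_greater = norm.sf(z)

print(f"z = {z:.1f}, P(X > {score}) = {p_greater:.4f}")  # z = 2.0, P(X > 700) = 0.0228
```

A score of 700 lies 2 standard deviations above the mean, so about 2.28% of the population scores higher.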
Find probabilities
Answers

– 0.1587
– 0.9335
– 0.3085
Problem

– It is known that IQ scores form a normal distribution with μ = 100 and σ =15.
Given this information, what is the probability of randomly selecting an
individual with an IQ score less than 120?

– 1. Transform the X values into z-scores.


– 2. Use the unit normal table to look up the proportions corresponding to the z-
score values.
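The two steps for the IQ problem can be sketched in Python, with SciPy's cumulative distribution function standing in for the unit normal table. SciPy is assumed to be installed.

```python
from scipy.stats import norm

mu, sigma = 100, 15   # IQ mean and standard deviation from the problem
x = 120

# Step 1: transform the X value into a z-score.
z = (x - mu) / sigma   # (120 - 100) / 15, about 1.33

# Step 2: look up the proportion below z (the CDF replaces the table).
p_less = norm.cdf(z)

print(f"z = {z:.2f}, P(X < {x}) = {p_less:.4f}")  # z = 1.33, P(X < 120) = 0.9088
```

About 90.9% of the population has an IQ score below 120. A printed table entered at the rounded value z = 1.33 gives 0.9082, which differs slightly because of the rounding.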
– The highway department conducted a study measuring driving speeds on a
local section of interstate highway. They found an average speed of μ = 58 miles
per hour with a standard deviation of σ = 10. The distribution was
approximately normal.

– Given this information, what proportion of the cars are traveling between 55
and 65 miles per hour? Using probability notation, we can express the problem
as p(55 < X < 65) ?
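The driving speed question asks for the proportion between two values, which is the difference between two cumulative probabilities. A minimal sketch, assuming SciPy is installed:

```python
from scipy.stats import norm

mu, sigma = 58, 10     # mean and standard deviation of driving speed (mph)
low, high = 55, 65

# z-scores of the two boundaries.
z_low = (low - mu) / sigma     # (55 - 58) / 10 = -0.3
z_high = (high - mu) / sigma   # (65 - 58) / 10 = 0.7

# P(55 < X < 65) = P(Z < 0.7) - P(Z < -0.3)
p_between = norm.cdf(z_high) - norm.cdf(z_low)

print(f"P({low} < X < {high}) = {p_between:.4f}")  # P(55 < X < 65) = 0.3759
```

Roughly 37.6% of the cars are travelling between 55 and 65 miles per hour.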
