Normal Distribution
Dhanya N.M.
Taxonomy of Probability Distributions
– The normal distribution is also known as the Gaussian distribution.
– It is named after Carl Friedrich Gauss.
– Simple predictive models are usually the most widely used because they can be explained and are well understood.
– The normal distribution is simple, and that simplicity makes it extremely popular.
What Does Probability Distribution Mean?
Let me explain by laying out the building blocks first.
– Consider the predictive models we might want to build in our data science projects.
– To predict a variable accurately, we first need to understand the underlying behavior of the target variable.
– The first step is to determine the possible outcomes of the target variable and whether they are discrete (a countable set of distinct values) or continuous (any value within a range).
– For simplicity, if we are estimating the behaviour of a die, the first step is to know that it can take any integer value from 1 to 6 (discrete).
– The next step is to assign probabilities to the events (values). If a value cannot occur, it is assigned a probability of 0%.
The higher the probability, the more likely it is for the event to occur.
– For instance, we can repeat an experiment a large number of times and note the values we observe for the variable.
– We can then group the values into categories/buckets.
– For each bucket, we record the number of times the variable took a value in that bucket.
– For example, we can throw a die 10,000 times and, as there are 6 possible values a die can take, create 6 buckets.
– We then record the number of occurrences for each value. Dividing each count by the number of throws gives an estimate of that value's probability.
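The die experiment described above can be sketched in a few lines. This is only an illustration: the fair die is simulated with NumPy's random generator, and the seed and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
n_rolls = 10_000

# Simulate 10,000 throws of a fair six-sided die (values 1 to 6).
rolls = rng.integers(low=1, high=7, size=n_rolls)

# One bucket per face: count how often each value occurred.
faces = np.arange(1, 7)
counts = np.array([(rolls == face).sum() for face in faces])

# Relative frequencies estimate the probability of each face.
probabilities = counts / n_rolls

for face, count, p in zip(faces, counts, probabilities):
    print(f"face {face}: {count} rolls, estimated probability {p:.3f}")
```

Each estimated probability should land close to 1/6, and the six estimates always sum to 1.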
Probability Distribution
– The bell-shaped curve above has a mean of 100 and a standard deviation of 1.
– The mean is the center of the curve. It is also the highest point of the curve, because values are most densely concentrated around the mean.
– The curve is symmetric: half of the values lie on each side of the mean.
– The total area under the curve is the total probability of all of the values that the variable can take.
– The total area under the curve is therefore 100%.
Characteristics
– Approximately 68.3% of all points lie within 1 standard deviation of the mean (-1 to +1).
– About 95.4% of all points lie within 2 standard deviations of the mean (-2 to +2).
– About 99.7% of all points lie within 3 standard deviations of the mean (-3 to +3).
– This lets us estimate how volatile a variable is and, for a given confidence level, the range its value is likely to fall in.
– For instance, for the bell-shaped curve above, there is a 68.3% chance that the value of the variable will be between 99 and 101.
– Imagine the confidence you can have when making future decisions with that information!
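These percentages can be checked numerically. The sketch below assumes SciPy is available and uses the mean of 100 and standard deviation of 1 from the curve described above; the `coverage` dictionary is just a name chosen for the example.

```python
from scipy.stats import norm

mu, sigma = 100.0, 1.0  # parameters of the curve described in the text

coverage = {}
for k in (1, 2, 3):
    lower, upper = mu - k * sigma, mu + k * sigma
    # Probability of a range = CDF at the upper end minus CDF at the lower end.
    coverage[k] = norm.cdf(upper, loc=mu, scale=sigma) - norm.cdf(lower, loc=mu, scale=sigma)
    print(f"within {k} sd ({lower:.0f} to {upper:.0f}): {coverage[k]:.1%}")
```

The same percentages come out for any mean and standard deviation, because the rule depends only on the distance from the mean measured in standard deviations.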
Normal Probability Distribution Function
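– The probability density function of the normal distribution, with mean μ and standard deviation σ, is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
– The probability density function gives the relative likelihood of a continuous random variable taking values near a given point. The probability of any single exact value is zero; probabilities come from the area under the curve over a range.
– The normal distribution is a bell-shaped curve where mean = mode = median.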
– If you plot the curve of the probability density function, the area under the curve over a given range gives the probability of the target variable falling in that range.
– The curve is determined by the parameters of the probability density function: the mean and the standard deviation of the variable.
– We can use the probability density function to find the chance of a random variable taking a value within a range. For instance, we could record the daily returns of a stock, group them into appropriate buckets and then estimate the probability of the stock making a 20–40% gain in the future.
– The larger the standard deviation, the greater the volatility in the sample.
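To make the link between density and probability concrete, here is a minimal sketch that evaluates the normal density directly and approximates the area under it with the trapezoidal rule. The helper names `normal_pdf` and `area_under_pdf` are made up for this example, not taken from a library.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def area_under_pdf(lo, hi, mu, sigma, n_points=10_001):
    """Approximate P(lo < X < hi) as the area under the density (trapezoidal rule)."""
    x = np.linspace(lo, hi, n_points)
    y = normal_pdf(x, mu, sigma)
    return float(np.sum((y[:-1] + y[1:]) / 2 * np.diff(x)))

# Area between 99 and 101 for the curve with mean 100 and standard deviation 1.
p_within_one_sd = area_under_pdf(99, 101, mu=100, sigma=1)
print(f"P(99 < X < 101) is approximately {p_within_one_sd:.4f}")

# A density is not a probability: for a small standard deviation the peak exceeds 1,
# yet the total area under the curve is still 1.
peak_narrow = float(normal_pdf(0.0, mu=0.0, sigma=0.1))
print(f"peak density for sigma = 0.1: {peak_narrow:.2f}")
```

A smaller standard deviation gives a taller, narrower curve and a larger one gives a flatter, wider curve, which is the volatility point made above.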
How Do I Find Feature Distribution In Python?
– The simplest method I follow is to load all of the features into a data frame and then plot a histogram of each one.
– Use the Python pandas library:
– DataFrame.hist(bins=10)
– # Make a histogram of each numeric column of the DataFrame.
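A fuller version of the histogram call might look like the sketch below. The two feature columns are synthetic data invented for the example, and the non-interactive `Agg` backend is an assumption so that the script also runs without a display; `DataFrame.hist` needs matplotlib installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to view plots on screen

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical features: one roughly bell-shaped, one strongly right-skewed.
df = pd.DataFrame({
    "normal_feature": rng.normal(loc=100, scale=1, size=1_000),
    "skewed_feature": rng.exponential(scale=2.0, size=1_000),
})

# One histogram per numeric column, 10 buckets each; returns an array of Axes.
axes = df.hist(bins=10)

# Skewness close to 0 suggests a symmetric, bell-like shape.
print(df.skew())
```

In an interactive session, calling `matplotlib.pyplot.show()` afterwards displays the figure.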
– What's even more fascinating is that when you add together a large number of independent random variables, even ones with differing distributions, the sum tends toward a normal distribution (provided each has finite variance and none dominates the sum). This is essentially the Central Limit Theorem.
– Normally distributed variables stay normal under linear operations. For instance, if A and B are two independent variables with normal distributions and c is a constant, then:
– c x A is normally distributed
– A + B is normally distributed
– Note that the product A x B of two normal variables is, in general, not normally distributed.
– As a result, once a variable is known to be approximately normal, it is straightforward to find the probability of it falling within a range of values, because the probability density function is well known.
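The Central Limit Theorem mentioned above can be illustrated with a small simulation. This is a sketch under assumed settings: 50 uniform and 50 exponential variables per sum, all independent, and a hand-written `skewness` helper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_terms = 20_000, 50

# Two non-normal ingredients: flat (uniform) and strongly right-skewed (exponential).
uniform_part = rng.uniform(0.0, 1.0, size=(n_samples, n_terms))
expo_part = rng.exponential(1.0, size=(n_samples, n_terms))

# Each row is one draw of the sum of 100 independent variables.
total = uniform_part.sum(axis=1) + expo_part.sum(axis=1)

def skewness(x):
    """Third standardized moment; 0 for a perfectly symmetric distribution."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3))

single_skew = skewness(expo_part[:, 0])  # one exponential variable on its own
summed_skew = skewness(total)            # the sum is far more symmetric

# Fraction of sums within one standard deviation of their mean.
within_one_sd = float(np.mean(np.abs(total - total.mean()) < total.std()))

print(f"skewness of one exponential variable: {single_skew:.2f}")
print(f"skewness of the sum: {summed_skew:.2f}")
print(f"fraction within 1 sd: {within_one_sd:.3f}")
```

The sum is not exactly normal, but its skewness is much smaller than that of a single exponential variable and the share of values within one standard deviation is close to the normal figure of about 68%.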
What If The Sample Distribution Is Not Normal?
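– Once we gather a sample for a variable, we can compute the Z-score by linearly transforming the sample:
– Calculate the mean μ
– Calculate the standard deviation σ
– For each value x, compute Z using: $$Z = \frac{x - \mu}{\sigma}$$
– The Z-score rescales the sample to mean 0 and standard deviation 1 but does not change the shape of its distribution, so a skewed sample stays skewed.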
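The three steps above translate directly into NumPy. The sample values are made up for illustration, and the population standard deviation (`ddof=0`) is an assumption; use `ddof=1` for the sample standard deviation.

```python
import numpy as np

# Hypothetical sample of a variable.
sample = np.array([12.0, 15.5, 9.8, 14.2, 11.1, 18.7, 10.4, 13.3])

mean = sample.mean()       # step 1: calculate the mean
std = sample.std(ddof=0)   # step 2: calculate the standard deviation

z_scores = (sample - mean) / std  # step 3: Z = (x - mean) / std for each value

print(np.round(z_scores, 3))
```

After the transformation the values have mean 0 and standard deviation 1, so each Z-score reads as "how many standard deviations from the mean".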
2. Using Box-Cox Transformation
– You can use the SciPy package to transform data toward a normal distribution (Box-Cox requires strictly positive values):
– scipy.stats.boxcox(x, lmbda=None, alpha=None)
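As a rough illustration of the call above, the sketch below applies Box-Cox to a synthetic right-skewed sample. The log-normal test data and the variable names are assumptions for the example; with `lmbda=None`, SciPy fits the power parameter itself and returns it alongside the transformed data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical right-skewed, strictly positive data (Box-Cox needs x > 0).
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=2_000)

# With lmbda=None the function estimates lambda by maximum likelihood.
transformed, fitted_lambda = stats.boxcox(skewed)

print(f"skewness before: {stats.skew(skewed):.2f}")
print(f"skewness after:  {stats.skew(transformed):.2f}")
print(f"fitted lambda:   {fitted_lambda:.2f}")
```

A fitted lambda near 0 corresponds to a log transform, which is what one would expect for log-normal data.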
3. Using Yeo-Johnson Transformation
– Additionally, the Yeo-Johnson power transformation can be used; unlike Box-Cox, it also accepts zero and negative values. Python's scikit-learn provides the appropriate class:
– sklearn.preprocessing.PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
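A minimal usage sketch for the transformer above, assuming scikit-learn is installed. The single feature is synthetic and deliberately includes negative values, which Yeo-Johnson handles; the variable names are chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(3)

# Hypothetical skewed feature with some negative values, shaped (n_samples, n_features).
X = rng.exponential(scale=2.0, size=(1_000, 1)) - 1.0

pt = PowerTransformer(method="yeo-johnson", standardize=True, copy=True)
X_t = pt.fit_transform(X)  # fits one lambda per feature, then transforms

print("fitted lambda per feature:", pt.lambdas_)
print("mean after transform:", float(X_t.mean()))
print("std after transform:", float(X_t.std()))
```

With `standardize=True` the output is also rescaled to zero mean and unit variance. Fitting on training data and reusing the fitted transformer with `pt.transform` on new data keeps the two consistent.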
Min-Max Normalization
– The population distribution of SAT scores is normal with a mean of μ = 500 and a standard deviation of σ = 100. Given this information about the population and the known proportions for a normal distribution, we can determine the probabilities associated with specific samples. For example, what is the probability of randomly selecting an individual from this population who has an SAT score greater than 700?
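One way to work the SAT question is to convert 700 to a Z-score and take the upper-tail probability. The sketch assumes SciPy; `norm.sf` is its survival function, which is 1 minus the CDF.

```python
from scipy.stats import norm

mu, sigma = 500.0, 100.0  # SAT population parameters from the problem
score = 700.0

z = (score - mu) / sigma      # 700 is 2 standard deviations above the mean
p_above = norm.sf(z)          # P(Z > 2), the area in the upper tail

print(f"z = {z:.1f}, P(X > {score:.0f}) = {p_above:.4f}")
```

The result is about 0.0228, so roughly 2.3% of the population scores above 700.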
Find probabilities
Answers
– 0.1587
– 0.9335
– 0.3085
Find probability
Answers
Problem
– It is known that IQ scores form a normal distribution with μ = 100 and σ = 15. Given this information, what is the probability of randomly selecting an individual with an IQ score less than 120?
– Given this information, what proportion of the cars are traveling between 55 and 65 miles per hour? Using probability notation, we can express the problem as p(55 < X < 65).
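The IQ question above can be worked with the standard library alone. This sketch writes the normal CDF in terms of the error function; `normal_cdf` and `prob_between` are helper names invented for the example. The car-speed question is not evaluated because its mean and standard deviation are not given in this text, but `prob_between` shows the form such a calculation would take.

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X < x) for a normal variable with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_between(lo, hi, mu, sigma):
    """P(lo < X < hi) as the difference of two CDF values."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# IQ problem: mu = 100, sigma = 15, so 120 corresponds to z = 20 / 15, about 1.33.
p_iq_below_120 = normal_cdf(120, mu=100, sigma=15)
print(f"P(IQ < 120) = {p_iq_below_120:.4f}")
```

The probability comes out at roughly 0.91. A range question such as p(55 < X < 65) would be a single call to `prob_between(55, 65, mu, sigma)` once the population parameters are known.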