
Probability

Sample vs population:

Why n-1 in the denominator?
Dividing by n-1 instead of n (Bessel's correction) makes the sample variance an unbiased estimator of the population variance: the squared deviations are measured around the sample mean, which is itself fit to the data, so they come out systematically too small.
But when we use the true mean instead of the sample mean in the formula, we don't need the bias correction.
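The n-1 correction can be checked by simulation. A sketch (the standard normal population, sample size, and trial count are assumed for illustration): averaging the three variance estimators over many small samples shows that dividing by n biases the estimate down by a factor of (n-1)/n, while dividing by n-1, or using the true mean, does not.

```python
import random

random.seed(0)
TRUE_MEAN, TRUE_VAR = 0.0, 1.0  # standard normal population
n, trials = 5, 200_000

sum_biased = sum_unbiased = sum_known_mean = 0.0
for _ in range(trials):
    x = [random.gauss(TRUE_MEAN, TRUE_VAR ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ss_sample = sum((xi - xbar) ** 2 for xi in x)     # around sample mean
    ss_true = sum((xi - TRUE_MEAN) ** 2 for xi in x)  # around true mean
    sum_biased += ss_sample / n          # divide by n: biased low
    sum_unbiased += ss_sample / (n - 1)  # Bessel's correction: unbiased
    sum_known_mean += ss_true / n        # true mean known: no correction

print(sum_biased / trials)      # ~ 0.8, i.e. (n-1)/n times the true variance
print(sum_unbiased / trials)    # ~ 1.0
print(sum_known_mean / trials)  # ~ 1.0
```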
● Geometric and exponential distributions are memoryless.
● A chi-square RV is the sum of squares of independent standard normal RVs.
● Markov's inequality gives an upper bound on the tail probability.

Markov property: A stochastic process has the Markov property if the future state of the process
depends only on the present state, and not on the sequence of events that preceded it.

It is memoryless.
● Choose the t-distribution when the sample size is small (n < 30) and the population variance is unknown. It generalizes the z-distribution: its tails are heavier, so choose it when more probability mass sits in the tails.

● The chi-square distribution is used to check goodness of fit and independence of categorical data, and/or when the sample size is small. It is the sum of squares of independent standard normal RVs.

● Choose Normal distribution when distribution is symmetric and data is distributed near the mean.

● When two RVs X, Y are orthogonal, E(XY) = 0.


● When two RVs are independent, they are also uncorrelated. But the converse is not true.

Skewness:
It is a measure of how much a pdf deviates from being symmetric.
https://en.wikipedia.org/wiki/Skewness

With skew, the mode lies at the peak, the mean is pulled furthest toward the tail, and the median lies between them. The median is the best measure of center under skewness (presence of outliers).
Positively skewed: income distribution, housing prices, sales data of a store.
Negatively skewed: age at retirement, exam scores.
Zero skew: height data, IQ scores.

Kurtosis: It tells us about the tailedness of a pdf, i.e. how heavy its tails are. It is the fourth standardized moment. The normal distribution has excess kurtosis 0 (raw kurtosis 3). If the tails are heavier than the normal's, excess kurtosis is positive; otherwise negative.

Law of large numbers:


It states that as the number of samples increases, the sample mean tends to the true mean.
1. Weak LLN: Convergence in Probability

2. Strong LLN: Almost sure convergence
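The LLN can be illustrated with a quick simulation (a sketch assuming a fair six-sided die, true mean 3.5): the sample mean gets closer to the true mean as the number of rolls grows.

```python
import random

random.seed(42)

# Fair six-sided die: the true mean is 3.5.
for n in (100, 10_000, 1_000_000):
    sample_mean = sum(random.randint(1, 6) for _ in range(n)) / n
    print(n, sample_mean)  # the sample mean drifts toward 3.5 as n grows
```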

Central Limit theorem


The sampling distribution of the mean will be approximately normal, as long as the sample size is large enough. Regardless of whether the population has a normal, Poisson, binomial, or any other distribution, the sampling distribution of the mean approaches normality when n is large (n > 30). Here n is the number of RVs chosen in sampling.

That is, Z = (x_bar - mu) / (sigma / sqrt(n)), with Z ~ N(0, 1).
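A sketch of the CLT in action (the exponential population, sample size, and repetition count are assumed): means of Exp(1) samples, standardized as above, end up looking standard normal even though the population is strongly skewed.

```python
import random
import statistics

random.seed(1)

mu, sigma = 1.0, 1.0  # Exp(1): mean 1, std 1, very non-normal
n, reps = 50, 20_000

z = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z.append((xbar - mu) / (sigma / n ** 0.5))  # standardize per the CLT

print(statistics.mean(z))   # ~ 0
print(statistics.stdev(z))  # ~ 1
```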

Kernel Density estimation:

It is a non-parametric method to estimate the pdf of an RV using kernels as weights. The idea is to center a kernel function (a normal pdf, for example) on each data point and sum the contributions to estimate the actual pdf. A smoothing parameter known as the bandwidth (the variance of the normal kernel) controls the smoothness of the estimated pdf.
https://en.wikipedia.org/wiki/Kernel_density_estimation
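A minimal from-scratch sketch of a Gaussian KDE (the sample values and bandwidth below are illustrative assumptions): each data point contributes one normal bump, and the bumps are averaged.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating the pdf via a Gaussian kernel."""
    n = len(data)
    def pdf(x):
        # Average one normal bump centered on each data point.
        return sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
            / (bandwidth * math.sqrt(2 * math.pi))
            for xi in data
        ) / n
    return pdf

# Bimodal sample: the estimate should show mass near both 0 and 5.
sample = [-0.2, 0.0, 0.1, 0.3, 4.8, 5.0, 5.1, 5.3]
f = gaussian_kde(sample, bandwidth=0.5)
print(f(0.0), f(2.5), f(5.0))  # high, low, high
```

A larger bandwidth smooths the two modes together; a smaller one produces a spike per point.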
Percentiles and quantiles:
A percentile is a measure that indicates the value below which a given percentage of the data falls. For
example, the 25th percentile is the value below which 25% of the data points are found.

A quantile is a general term for values that divide a dataset into equal-sized intervals. The data is divided
into q equal parts, and the k-th quantile is the value below which k/q of the data falls.
Special Types of Quantiles:

● Quartiles: Divide the data into 4 equal parts (25%, 50%, 75%, 100%).
● Deciles: Divide the data into 10 equal parts (10%, 20%, ..., 100%).
● Percentiles: Divide the data into 100 equal parts (1%, 2%, ..., 100%).
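The definitions above can be checked with Python's statistics module (requires Python 3.8+; the data 1..100 is an assumed toy example):

```python
import statistics

data = list(range(1, 101))  # 1..100

# Quartiles: the three cut points splitting the data into 4 equal parts.
q1, q2, q3 = statistics.quantiles(data, n=4)
# Deciles: 9 cut points for 10 equal parts.
deciles = statistics.quantiles(data, n=10)
print(q1, q2, q3)   # the middle cut is the median, 50.5
print(deciles[0])   # the 10th percentile: 10% of the data lies below it
```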

Q-Q (quantile-quantile) plot:
It is a graphical technique used to check whether given data points (or an RV X) follow a specific distribution (normal or any other), or whether two RVs X, Y follow the same distribution.
Chebyshev’s Inequality:
If we know the data follows a normal distribution, we can say what percentage lies within mu +- k*sigma using the 68-95-99.7 rule. But what if we don’t know the distribution? If we know the mean and std deviation, we can use this inequality to answer questions like: what percentage of individuals have a salary in the range [20k, 60k]?
It provides an upper bound on the probability that a random variable deviates from its mean by more than k standard deviations: P(|X - mu| >= k*sigma) <= 1/k^2.
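An empirical sketch of the salary question (the mean 40k, std 10k, and normal population are assumed; the bound itself holds for any distribution with that mean and std):

```python
import random

random.seed(7)

# Chebyshev: P(|X - mu| >= k*sigma) <= 1/k**2, whatever the distribution.
# Salary example: mean 40k, std 10k; what fraction lies outside [20k, 60k]?
mu, sigma, k = 40_000, 10_000, 2
samples = [random.gauss(mu, sigma) for _ in range(100_000)]  # one possible population
tail = sum(abs(x - mu) >= k * sigma for x in samples) / len(samples)
print(tail)      # empirical tail mass for this population
print(1 / k**2)  # Chebyshev's bound: 0.25
# So at least 1 - 1/k**2 = 75% of salaries lie in [20k, 60k], for any pdf.
```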
Log-Normal distribution:
If X ~ Lognormal(mu, sigma^2), then Y = log(X) ~ N(mu, sigma^2).

https://en.wikipedia.org/wiki/Log-normal_distribution — check the occurrences and applications section on Wikipedia.
● The length of comments posted in Internet discussion forums follows a log-normal distribution.
● The length of chess games tends to follow a log-normal distribution.
● In economics, there is evidence that the income of 97%–99% of the population is distributed log-normally. (The distribution of higher-income individuals follows a Pareto distribution.)
You can take log(X) and check whether log(X) follows a normal distribution (use a Q-Q plot). If so, we can apply all the normal-theory mathematics. Generally, the log-normal distribution is common in human behavior and internet-company data.

Power-law distribution:
Power-law functions have a very long tail, and they follow the 80-20 rule: roughly 80% of the mass falls within the first 20% of the range. The Pareto distribution is a power-law distribution.
A power law means the relative change in one quantity is proportional to a power of another quantity, independent of their initial values.

https://en.wikipedia.org/wiki/Power_law
https://en.wikipedia.org/wiki/Pareto_distribution
To check whether data follows a Pareto or other power-law distribution, use a log-log plot. If it gives a decreasing straight line, then it follows a power law.

x_m is the scale parameter: below x_m the Pareto distribution is not defined.
Kepler’s third law is a power-law relationship.
Wealth distribution: a small percentage of the population holds most of the wealth.
Book sales: a small percentage of books produce the majority of the income.

Power transform (Box-Cox transform):

It transforms an RV X toward a normal distribution.
https://builtin.com/data-science/box-cox-transformation-target-variable
Lambda is chosen to maximize the log-likelihood of the transformed data y_i.
For lambda = 0, the transform reduces to log(x_i), so it normalizes log-normal data.
For lambda = 1, x_i is already normally distributed; only a shift happens.
Limitations:
1. It is sensitive to outliers.
2. It is applicable only to positive data.
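A hand-rolled sketch of the transform itself for a fixed lambda (in practice a routine such as scipy.stats.boxcox picks lambda by maximum likelihood; the log-normal sample below is an assumed example):

```python
import math
import random

def box_cox(x, lam):
    """Box-Cox transform of a positive value x for parameter lambda."""
    if lam == 0:
        return math.log(x)       # lambda = 0 reduces to the log transform
    return (x ** lam - 1) / lam  # lambda = 1 is just a shift: x - 1

random.seed(3)
# Log-normal data: log(X) is normal, so lambda = 0 should normalize it.
data = [math.exp(random.gauss(0, 1)) for _ in range(5)]
print([round(box_cox(x, 0), 3) for x in data])  # back to normal draws
print([round(box_cox(x, 1), 3) for x in data])  # merely shifted by -1
```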

So, what’s the recipe?

You are given data; you check which distribution it came from using a Q-Q plot. Otherwise, you can transform it toward a Gaussian using the Box-Cox transform. Then make inferences, like what % of the population has age > 60.

How to measure how two RVs are related?


Covariance, Pearson correlation coefficient, Spearman rank correlation coefficient.

Covariance: It gives a qualitative measure of the tendency toward a linear relationship between two jointly distributed RVs; it does not quantify the strength.
If greater values of one variable mainly correspond with greater values of the other variable, and the same holds for lesser values (that is, the variables tend to show similar behavior), the covariance is positive.
A change in units of measurement changes the result. This is fixed by the correlation coefficient.

Pearson correlation coeff:


https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Both covariance and Pearson’s coefficient assume the relationship is linear. But what if it is not? Also, neither captures the slope of the relationship.

Spearman rank correlation coeff:


Better to use when there is a non-linear but monotonic relationship. It is Pearson’s correlation coefficient applied to the ranks of the dataset: assign ranks to the data points and find the Pearson correlation of the ranks r_x and r_y.
It is also robust to outliers.

https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
“Correlation does not imply causation”: the co-occurrence of two events does not require that one causes the other.
For example:

Nobel prize winner per capita correlates with chocolate consumption per capita. But this does not mean
that chocolate consumption causes winning nobel prize.
Confidence Intervals (C.I)
It gives a range of values, computed from the sample, constructed so that x% of such intervals contain the true mean (or parameter) of the given RV.
Let x_bar represent the sample mean and mu represent the population mean. Now, if we repeat the
sampling multiple times, each time, we get a different value of sample mean, x_bar. In 95% of the
sampling experiments, mu will be between the endpoints of the C.I calculated using x_bar, but in 5% of
the cases, it will not be. 95% C.I does NOT mean that mu lies in the interval with a probability of 95%.

A CI is a better measure than a point estimate because point estimates vary from sample to sample.

How to calculate CI:

CI for mean of a rv:


Use the CLT to get a CI from the sampling distribution of the mean.
For the mean: if we know the std deviation, we can use the CLT to get a CI for the true mean. If we don’t, we can use Student’s t-distribution.
If we want a CI for some other parameter, like sigma or the median, we can use a bootstrap CI.

Bootstrap method to get CI

1. Let’s say we want to find the CI for the median.
2. Let’s say we have the input x of size n.
3. Resample m (<= n) points with replacement from x and find the median of this resample.
4. Perform step 3 several times, say k = 1000 times.
5. Now you have 1000 medians of the resampled values. Sort them and take the percentile values: for a 95% CI, take the 2.5th percentile as the lower bound and the 97.5th percentile as the upper bound.
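The steps above can be sketched as follows (the sample data, seed, and m = n choice are assumptions):

```python
import random
import statistics

random.seed(0)
x = [random.gauss(50, 10) for _ in range(200)]  # observed sample, n = 200

k = 1000
medians = []
for _ in range(k):
    resample = random.choices(x, k=len(x))  # draw n points with replacement
    medians.append(statistics.median(resample))

medians.sort()
lo = medians[int(0.025 * k)]  # 2.5th percentile of the bootstrap medians
hi = medians[int(0.975 * k)]  # 97.5th percentile
print(f"95% bootstrap CI for the median: [{lo:.1f}, {hi:.1f}]")
```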
Hypothesis testing:
It is a statistical method used to determine whether an assumption about a population parameter is true, given the experimental data.
Steps:
The null hypothesis is the default assumption about the data, i.e. statements like “no difference” or “no effect”.
The null and alternate hypotheses are mutually exclusive; the test proceeds like a proof by contradiction: assume the null and check whether the data are consistent with it.

How to read it: for the above case, if the null hypothesis is true (there is no difference in the population means of the distributions), then there is a 90% chance of observing a difference x = mu1 - mu2 as large as 10 cm.

If the p-value is high, fail to reject the null hypothesis; else reject it in favor of the alternate hypothesis.


What is p-value:
It is the probability of observing data at least as extreme as the observation, given that my assumption (the null hypothesis) is true:
p-value = P(obs | H0).
So choose a null hypothesis under which you can derive the pdf of the data.

Significance level:
It is the threshold for rejecting null hypothesis.
If p <= alpha, reject H0

Example of coin toss:


Permutation test:
K-S test:
It is used to check whether given samples came from a given reference pdf (one-sample test), or whether two samples came from the same pdf (two-sample K-S test).

One sample test: "How likely is it that we would see a collection of samples like this if they were drawn
from that probability distribution?"
Two sample test: “How likely is it that we would see two sets of samples like this if they were drawn from
the same (but unknown) probability distribution?"
Null Hypothesis for two sample: Two samples came from a population with the same pdf.

D_{n,m} is the maximum difference between the two empirical cdfs.
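A from-scratch sketch of the two-sample statistic D_{n,m} (in practice scipy.stats.ks_2samp also supplies the p-value; the tiny samples below are illustrative):

```python
def ks_statistic(a, b):
    """Two-sample K-S statistic: max gap between the empirical cdfs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    d = 0.0
    for x in points:
        cdf_a = sum(v <= x for v in a) / len(a)  # empirical cdf of sample a
        cdf_b = sum(v <= x for v in b) / len(b)  # empirical cdf of sample b
        d = max(d, abs(cdf_a - cdf_b))
    return d

same = ks_statistic([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])          # identical samples
shifted = ks_statistic([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])  # disjoint samples
print(same, shifted)  # 0.0 and 1.0
```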
