Probability
Sample vs population: A population is the entire group of interest; a sample is a subset drawn from it, used to estimate the population's parameters.
Markov property: A stochastic process has the Markov property if the future state of the process
depends only on the present state, and not on the sequence of events that preceded it.
It is memoryless.
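A minimal sketch of the Markov property, assuming a made-up two-state weather model: the next state is sampled using only the current state, never the history.

import numpy as np

states = ["sunny", "rainy"]
P = np.array([[0.9, 0.1],   # transition probabilities given "sunny"
              [0.5, 0.5]])  # transition probabilities given "rainy"

rng = np.random.default_rng(0)
state = 0  # start sunny
for _ in range(5):
    # the next state depends only on the current state (memoryless)
    state = rng.choice(2, p=P[state])
    print(states[state])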
● Choose the t-distribution when the sample size is small (n < 30) and the population standard deviation is not known. It is a generalization of the z-distribution: as the degrees of freedom increase, it approaches the standard normal. Its tails are heavier, so choose it when more probability mass sits in the tails (see the sketch after this list).
● The chi-square distribution is used to check goodness of fit, for categorical data, and/or when the sample size is small. It is the distribution of a sum of squares of independent standard normal rvs.
● Choose the normal distribution when the distribution is symmetric and the data is concentrated near the mean.
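A minimal sketch comparing tail probabilities of the t and standard normal distributions (scipy assumed); df = 5 and the cutoff of 2 are illustrative choices.

from scipy import stats

# tail mass beyond 2: heavier for t than for the standard normal
for dist, name in [(stats.norm, "normal"), (stats.t(df=5), "t(df=5)")]:
    print(name, "P(X > 2) =", dist.sf(2))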
Skewness:
It is a measure of how much a pdf deviates from being symmetric.
https://en.wikipedia.org/wiki/Skewness
In a skewed distribution, the mode lies where the peak forms, the mean is pulled furthest toward the tail (the direction of the skew), and the median lies in between them. The median is the best measure of center when there is skewness (presence of outliers).
Positively skewed: Income distribution, Housing prices, Sales data of a store.
Negatively skewed: Age at retirement, Exam scores.
Zero skew: Height data, IQ scores.
Kurtosis: It tells us about the tailedness of a pdf, i.e., how heavy its tails are. It is the standardized fourth central moment. The normal distribution has an excess kurtosis of 0. If the tails are heavier than normal, excess kurtosis is positive; if lighter, negative.
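A minimal sketch computing sample skewness and excess kurtosis with scipy; the exponential sample is an illustrative assumption (right-skewed, so skewness should come out positive).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10_000)

print("skewness:", stats.skew(x))             # > 0: right-skewed
print("excess kurtosis:", stats.kurtosis(x))  # Fisher definition: normal = 0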
A quantile is a general term for values that divide a dataset into equal-sized intervals. The data is divided
into q equal parts, and the k-th quantile is the value below which k/q of the data falls.
Special Types of Quantiles:
● Quartiles: Divide the data into 4 equal parts (cut points at 25%, 50%, 75%).
● Deciles: Divide the data into 10 equal parts (cut points at 10%, 20%, ..., 90%).
● Percentiles: Divide the data into 100 equal parts (cut points at 1%, 2%, ..., 99%).
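A minimal sketch of quartiles, deciles, and percentiles with numpy; the 1..100 data array is an illustrative assumption.

import numpy as np

x = np.arange(1, 101)  # the values 1..100

print("quartiles:", np.quantile(x, [0.25, 0.50, 0.75]))
print("deciles:  ", np.quantile(x, np.arange(0.1, 1.0, 0.1)))
print("95th percentile:", np.percentile(x, 95))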
Q-Q (quantile-quantile) plot:
It is a graphical technique used to check whether given data points (or an rv X) follow a specific distribution (normal or any other), or whether two rvs X and Y follow the same distribution.
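A minimal sketch of a Q-Q plot against the normal distribution using scipy.stats.probplot; the generated sample is an illustrative assumption.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=2, size=500)

# points falling near the reference line suggest the data is normal
stats.probplot(x, dist="norm", plot=plt)
plt.show()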
Chebyshev’s Inequality:
If we know the given data follows a normal distribution, we can say what fraction of it lies within mu +/- k*sigma using the 68-95-99.7 rule. But what if we don't know the distribution? If we know the mean and the standard deviation, we can leverage this inequality to answer questions like: what percentage of individuals have a salary in the range [20k, 60k]?
It provides an upper bound on the probability that a random variable deviates from its mean by more than k standard deviations: P(|X - mu| >= k*sigma) <= 1/k^2.
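A worked sketch of the salary question, assuming (illustratively) mean = 40k and std = 10k, so that [20k, 60k] is mu +/- 2*sigma.

# Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2, for any distribution
mu, sigma = 40_000, 10_000   # illustrative assumptions
low, high = 20_000, 60_000

k = (high - mu) / sigma      # k = 2 here
print("at least", 1 - 1 / k**2, "of salaries lie in [20k, 60k]")  # >= 0.75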
Log-Normal distribution:
If X ~ Lognormal(mu, sigma), then Y = log(X) ~ N(mu, sigma).
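A minimal sketch checking this relationship with numpy; the parameters mu = 1.0, sigma = 0.5 are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.5, size=100_000)

y = np.log(x)
print(y.mean(), y.std())  # should be close to mu = 1.0 and sigma = 0.5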
Power-law distribution:
Power-law distributions have very long tails. They follow the 80-20 rule: roughly 80% of the mass falls within the first 20% of the range. The Pareto distribution is a specific power-law distribution.
A power law is a relationship of the form f(x) = a * x^k: a relative change in one quantity produces a proportional relative change in the other, independent of their initial sizes.
https://en.wikipedia.org/wiki/Power_law
https://en.wikipedia.org/wiki/Pareto_distribution
To check whether a distribution follows a Pareto or power-law distribution, you can use a log-log plot. If it gives a decreasing straight line, the data follows a power law (see the sketch below).
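A minimal sketch of the log-log check on Pareto samples, assuming scipy and matplotlib; the shape parameter b = 2 (with x_m = 1) is an illustrative choice. Plotting the empirical CCDF P(X > x) on log-log axes should give a decreasing straight line.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.sort(stats.pareto.rvs(b=2, size=10_000, random_state=0))
ccdf = 1.0 - np.arange(1, len(x) + 1) / len(x)  # empirical P(X > x)

plt.loglog(x[:-1], ccdf[:-1])  # drop the last point, where ccdf = 0
plt.xlabel("x")
plt.ylabel("P(X > x)")
plt.show()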
x_m is the scale parameter: the Pareto distribution is not defined below x_m.
Kepler’s third law (T² ∝ a³) is an example of a power-law relationship.
Wealth distribution: A small percentage of population holds most of the wealth.
Book sales: A small percentage of books generate the majority of the income.
Covariance: It gives a qualitative measure of the tendency of two jointly distributed rvs to vary together linearly; it does not quantify the strength of that relationship. Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)].
If greater values of one variable mainly correspond with greater values of the other variable, and the same
holds for lesser values (that is, the variables tend to show similar behavior), the covariance is positive.
The units of measurement change the value of the covariance. This is fixed by the correlation coefficient, which divides by the standard deviations, rho = Cov(X, Y) / (sigma_X * sigma_Y), and is therefore dimensionless (see the sketch below).
https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
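A minimal sketch showing that covariance changes with the units while the correlation coefficient does not; the height/weight data is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, 1_000)
weight = 40 * height_m + rng.normal(0, 3, 1_000)

# covariance scales with the units; correlation is unchanged
for h, unit in [(height_m, "metres"), (height_m * 100, "cm")]:
    print(unit,
          "cov:", np.cov(h, weight)[0, 1],
          "corr:", np.corrcoef(h, weight)[0, 1])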
“Correlation does not imply causation.” The co-occurrence of two events does not require that one causes the other. For example:
Nobel prize winners per capita correlate with chocolate consumption per capita. But this does not mean that chocolate consumption causes winning a Nobel prize.
Confidence Intervals (C.I)
It is a range of values, computed from a sample, constructed so that across repeated sampling, x% of such intervals contain the true mean (or parameter) of the given rv.
Let x_bar represent the sample mean and mu represent the population mean. Now, if we repeat the
sampling multiple times, each time, we get a different value of sample mean, x_bar. In 95% of the
sampling experiments, mu will be between the endpoints of the C.I calculated using x_bar, but in 5% of
the cases, it will not be. 95% C.I does NOT mean that mu lies in the interval with a probability of 95%.
A C.I. is a better measure than a point estimate because point estimates vary from sample to sample.
How to read it for a two-sample case: a C.I. for the difference in sample means x_bar1 - x_bar2 gives the range of differences between the population means (mu1 - mu2) consistent with the data; if the null hypothesis is true (there is no difference in the population means), the interval will usually contain 0.
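A minimal sketch of computing a 95% C.I. for a sample mean with the t-distribution (small sample, population std unknown); the data values are an illustrative assumption, and scipy is assumed available.

import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])

# t-based interval: sample mean +/- t * standard error of the mean
ci = stats.t.interval(0.95, df=len(x) - 1,
                      loc=x.mean(), scale=stats.sem(x))
print("95% C.I. for the mean:", ci)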
Significance level:
It is the probability threshold (alpha) for rejecting the null hypothesis.
If p <= alpha, reject H0.
One sample test: "How likely is it that we would see a collection of samples like this if they were drawn
from that probability distribution?"
Two sample test: “How likely is it that we would see two sets of samples like this if they were drawn from
the same (but unknown) probability distribution?"
Null hypothesis for a two-sample test: the two samples came from populations with the same pdf.
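A minimal sketch of a two-sample t-test using scipy.stats.ttest_ind; the samples and alpha = 0.05 are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(170, 6, 40)  # sample 1
b = rng.normal(173, 6, 40)  # sample 2

# H0: both samples come from populations with the same mean
t_stat, p = stats.ttest_ind(a, b)
alpha = 0.05
print("p =", p, "-> reject H0" if p <= alpha else "-> fail to reject H0")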