Stats Intro
com/krishnaik06/The-Grand-Complete-Data-Science-Materials/tree/main
https://docs.google.com/document/d/1JXswnn4qSZhHQO7RaWLYWTj7kRyL9Y7PjrCNbGLIEvA/edit?tab=t.0
https://dou.ua/forums/topic/44769/
AB Testing Project:
1. https://www.youtube.com/watch?v=FTpmwX94_Yo&t=10247s&pp=ygUvY29tcGxldGUgYWIgdGVzdGluZyB0dXRvcmlhbGFiIHRlc3RpbmcgdHV0b3JpYWw%3D
2. https://www.youtube.com/watch?v=qhYsZWrTiuM&list=PLHS1p0ot3SVjQg0q1eEPrmOmPUY_AT1vB
STATISTICS
Types of Sampling
Simple Random Sampling – every member of the population has an equal chance of being selected
Stratified Sampling – the population is split into non-overlapping groups (strata) and a sample is drawn from each group
Systematic Sampling – survey every nth person, e.g. every nth person in front of malls, offices, etc.
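The three sampling schemes can be sketched in a few lines of Python. This is only an illustration: the population of 100 people, the two groups "A" and "B", and the 10% sampling rate are all made-up assumptions, not part of the notes.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100 people, each tagged with a group (stratum).
population = [{"id": i, "group": "A" if i < 60 else "B"} for i in range(100)]

# Simple random sampling: every member has the same chance of selection.
simple_sample = rng.sample(population, k=10)

# Stratified sampling: draw separately from each non-overlapping group,
# here taking 10% of every group.
strata = {}
for person in population:
    strata.setdefault(person["group"], []).append(person)
stratified_sample = []
for members in strata.values():
    stratified_sample.extend(rng.sample(members, k=len(members) // 10))

# Systematic sampling: pick a random start, then take every nth member.
n = 10
start = rng.randrange(n)
systematic_sample = population[start::n]
```

Note that the stratified sample always keeps the 60/40 split of the population, while a simple random sample only matches it on average.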
Type of Variable
Central Tendency
Dispersion
Variance measures how far the data are spread out around the mean
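A small sketch of how variance can be computed with Python's standard library. The data values are hypothetical; the point is only the difference between dividing by N (population) and by N - 1 (sample).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = statistics.mean(data)
population_variance = statistics.pvariance(data)  # average squared deviation, divides by N
sample_variance = statistics.variance(data)       # divides by N - 1
population_std = statistics.pstdev(data)          # square root of the population variance

print(mean, population_variance, sample_variance, population_std)
```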
STEPS:
Declare a lower fence (below which all values are outliers) and a higher fence (above which all values are outliers)
Find Q1, Q3, IQR, LF and HF by formula: IQR = Q3 - Q1, LF = Q1 - 1.5 × IQR, HF = Q3 + 1.5 × IQR
So Lower Fence (LF) = -3
Higher Fence (HF) = 13
27 is above the higher fence, so it is an outlier. The max value shown on the boxplot is therefore 9, not 27
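The fence calculation can be sketched as follows. The dataset is an assumption, chosen so that the fences come out to the -3 and 13 quoted above. The quartile rule (`method="exclusive"`, i.e. the (n + 1) × p position rule) is also an assumption; other quartile conventions give slightly different fences.

```python
import statistics

# Hypothetical data, chosen to reproduce LF = -3 and HF = 13.
data = [1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, _median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
higher_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > higher_fence]
inliers = [x for x in data if lower_fence <= x <= higher_fence]

print(q1, q3, iqr, lower_fence, higher_fence)
print(outliers, max(inliers))
```

With this data the only outlier is 27, and the largest value inside the fences is 9.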
Boxplot
Standardisation vs Normalisation
NORMAL DISTRIBUTION
CENTRAL LIMIT THEOREM
Suppose our population/original data follows a normal, log-normal, or any other distribution
Now we take, say, 100 different samples, each of size n>=30, and compute the mean of each sample.
Eg: For the 1st sample the mean is x1
For the 2nd sample the mean is x2
Similarly for the 100th sample the mean is x100
In order to apply the central limit theorem, there are four conditions that must be met:
1. Randomization: The data must be sampled randomly such that every member in a
population has an equal probability of being selected to be in the sample.
2. Independence: The sample values must be independent of each other.
3. The 10% Condition: When the sample is drawn without replacement, the sample size should
be no larger than 10% of the population.
4. Large Sample Condition: The sample size should be sufficiently large (n>=30).
No matter what the shape of the original data is, if you take enough random samples and average
them, the distribution of those averages will start to look like a normal distribution.
Now if all the mean values above (x1, x2, …, x100) are plotted, they form a bell curve
which follows a normal distribution
The central limit theorem states that if you take sufficiently large samples (n>=30) from a
population, the samples’ means will be approximately normally distributed, even if the population
isn’t normally distributed.
Example: A population follows a Poisson distribution (left image). If we take 10,000 samples from
the population, each with a sample size of 50, the sample means follow a normal distribution, as
predicted by the central limit theorem (right image).
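A rough simulation of this example, using only the standard library. The Poisson rate of 2 is an assumption (the notes do not give one), the helper `poisson_draw` is a hypothetical name, and 2,000 samples are used instead of 10,000 just to keep the run short.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility

def poisson_draw(lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

LAM = 2.0           # assumed Poisson rate (population mean and variance)
SAMPLE_SIZE = 50    # n for each sample, as in the example above
NUM_SAMPLES = 2000  # fewer than 10,000 to keep the sketch quick

# One mean per sample: these are the x1, x2, ... values from the notes.
sample_means = [
    statistics.fmean(poisson_draw(LAM) for _ in range(SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = math.sqrt(LAM / SAMPLE_SIZE)  # sigma / sqrt(n)

print(mean_of_means, sd_of_means, predicted_sd)
```

The sample means should centre on the population mean, with a spread close to sigma / sqrt(n). A histogram of `sample_means` would show the bell shape.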
Skewness
In probability, a real-valued function, defined over the sample space of a random experiment, is
called a random variable. That is, the values of the random variable correspond to the outcomes of
the random experiment
A probability function is a rule or formula that assigns a probability (a value between 0 and 1) to
each possible outcome of a random event or variable.
🔹 Two Main Types:
🔁 Summary:
Thing | Description
Random Variable (X) | Describes the possible outcomes numerically
Probability Function | Assigns a probability to each possible value of X
o The CDF accumulates this density to give the total probability up to a certain point.
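To make the three ideas concrete, here is a small discrete sketch. The two-dice experiment is an assumed example. In the discrete case the CDF adds up probabilities rather than accumulating a density.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def X(outcome):
    """Random variable: maps an outcome (d1, d2) to the sum of the dice."""
    return outcome[0] + outcome[1]

# Probability function (PMF): P(X = x) for each possible value x.
pmf = {}
for outcome in sample_space:
    pmf[X(outcome)] = pmf.get(X(outcome), 0) + Fraction(1, len(sample_space))

def cdf(x):
    """Cumulative distribution function: P(X <= x)."""
    return sum(p for value, p in pmf.items() if value <= x)

print(pmf[7], cdf(4))
```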
Steps
Statistical Tests Table
Test Type | Conditions | Test Name | H0 | H1
Mean | 1 sample, σ known | Z-test | μ = μ₀ | μ ≠ μ₀ (two-tailed) or μ > μ₀ / μ < μ₀ (one-tailed)
Mean | 1 sample, σ unknown | T-test | μ = μ₀ | μ ≠ μ₀ (two-tailed) or μ > μ₀ / μ < μ₀ (one-tailed)
Mean | 2 samples, independent, σ unknown (like a drug test on 2 different sets of people) | Independent T-test (pooled variance) | μ₁ = μ₂ | μ₁ ≠ μ₂ (two-tailed) or μ₁ > μ₂ / μ₁ < μ₂ (one-tailed)
Mean | >2 independent samples | ANOVA (F-test) | μ₁ = μ₂ = μ₃ ... = μₙ | At least one mean differs
Median | Non-parametric, 1 sample | Sign Test | Median = Median₀ | Median ≠ Median₀
Chi-Square | Goodness of fit, non-parametric & nominal variables | Chi-Square Test | Observed distribution = Expected distribution | Observed distribution ≠ Expected distribution
Here's a chart with examples for each statistical test, showing when and where they are
commonly applied:
Chi-square test - A statistical method used to test for a difference or association between the observed
and expected frequencies of categorical variables in the dataset.
Example: A food delivery company wants to find the relationship between gender, location, and food
choices of people.
The Chi-Square test is used for two main purposes. Both compare observed vs. expected distributions,
but in slightly different contexts:
🔹 1. Chi-Square Test of Independence
(Also called association test)
👉 Purpose:
🧠 Example:
🧪 You’re testing:
If the distributions across age groups differ significantly, you conclude the variables are associated.
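A minimal sketch of the test of independence on a 2x2 table. The counts are invented for illustration, no continuity correction is applied, and the closed-form p-value used here is valid only for 1 degree of freedom.

```python
import math

# Hypothetical 2x2 contingency table of observed counts:
# rows = two groups of people, columns = two food choices.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# For exactly 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))
reject_independence = p_value < 0.05

print(chi_square, dof, p_value, reject_independence)
```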
🔹 2. Chi-Square Goodness of Fit Test
👉 Purpose:
To test if a single categorical variable follows a specific distribution (often a uniform or expected one).
🧠 Example:
🧪 You’re testing:
If the difference is too large, the test statistic (Chi-Square) becomes large, and you reject the null hypothesis.
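A sketch of the goodness-of-fit calculation for one categorical variable. The counts and the uniform null hypothesis are assumptions. Three categories are used because with 2 degrees of freedom the p-value has a simple closed form.

```python
import math

# Hypothetical observed counts for three categories.
observed = [50, 30, 20]
# Null hypothesis: the three categories are equally likely (uniform).
expected_share = [1 / 3, 1 / 3, 1 / 3]

n = sum(observed)
expected = [share * n for share in expected_share]

# Chi-Square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1

# With exactly 2 degrees of freedom, P(chi2 > x) = exp(-x / 2).
p_value = math.exp(-chi_square / 2)
critical_value = -2 * math.log(0.05)  # about 5.99 for alpha = 0.05, dof = 2
reject_null = chi_square > critical_value

print(chi_square, critical_value, p_value, reject_null)
```

Here the statistic is far above the critical value, so the uniform hypothesis is rejected.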
✅ Quick Summary:
For categorical variables, it is not meaningful to talk about whether the distribution is
"normal" or not because:
o Many tests (like t-tests and ANOVA) assume that the variances across groups are
approximately equal.
o Useful when comparing categories like gender, treatment types, or experimental groups.
3. Robust to Non-Normality:
o Levene's test is less sensitive to departures from normality compared to other tests (like
Bartlett’s test).
A Confidence Interval gives a range of values that is likely to contain the true population parameter (like
mean or proportion), based on your sample data.
Confidence Level - The long-run share of confidence intervals, built the same way from repeated samples,
that actually contain the true population parameter.
95% confidence level = If you repeated the experiment 100 times, about 95 of the resulting confidence
intervals would contain the true parameter
Relation to CI:
The higher the confidence level, the wider the confidence interval.
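The relation between confidence level and interval width can be checked with a short sketch. It assumes a z-based interval for a mean with known population standard deviation. The function name and the numbers (mean 100, sigma 15, n = 36) are hypothetical.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sigma, n, confidence):
    """z-based interval for a mean, assuming the population sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Same hypothetical sample, three confidence levels.
ci_90 = mean_confidence_interval(100, 15, 36, 0.90)
ci_95 = mean_confidence_interval(100, 15, 36, 0.95)
ci_99 = mean_confidence_interval(100, 15, 36, 0.99)

for label, (low, high) in (("90%", ci_90), ("95%", ci_95), ("99%", ci_99)):
    print(label, round(low, 2), round(high, 2), "width:", round(high - low, 2))
```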
P-Value: A p-value, or probability value, is the probability of obtaining data at least as extreme as the
observed data, assuming the null hypothesis of a statistical test is true
A small p-value (like less than 0.05) means the results would be unlikely if the null hypothesis were true,
so you reject the null hypothesis.
A large p-value means the results could easily happen under the null hypothesis, so you don't reject the null
hypothesis.
p-value | "How likely is my result if there's no effect?" | "Only a 2% chance of a result this extreme if there's no effect."
Z-test Example
Similarly, from the calculation below we can find the actual values of the lower and upper bounds of the
confidence interval
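Since the worked numbers are not reproduced here, the following is a stand-in sketch of a one-sample Z-test with made-up values (H0: μ = 100, sample mean 105, σ = 15, n = 36). It also computes the lower and upper bounds of the 95% confidence interval.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return (z, two-tailed p-value) for H0: mu = mu0 with known sigma."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical inputs, not taken from the notes.
sample_mean, mu0, sigma, n = 105, 100, 15, 36

z, p_value = one_sample_z_test(sample_mean, mu0, sigma, n)
reject_h0 = p_value < 0.05

# 95% confidence interval around the sample mean.
z_critical = NormalDist().inv_cdf(0.975)  # about 1.96
margin = z_critical * sigma / math.sqrt(n)
lower_bound, upper_bound = sample_mean - margin, sample_mean + margin

print(z, p_value, reject_h0, lower_bound, upper_bound)
```

The hypothesised mean of 100 falls just outside the interval, which agrees with rejecting H0 at the 5% level.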
Example of T-test
The values +2.045 and -2.045 are found in the t-test table, specifically in the 2-tailed
table. The value 2.045 is called the critical value
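A sketch of how the critical value is used. In standard t-tables, 2.045 is the two-tailed value for 29 degrees of freedom at α = 0.05, so the sketch assumes n = 30. The sample mean, standard deviation and hypothesised mean are invented.

```python
import math

# Hypothetical summary statistics for a sample of n = 30 (df = 29).
n = 30
sample_mean = 104.0
sample_sd = 10.0   # sample standard deviation (sigma is unknown)
mu0 = 100.0        # mean under the null hypothesis

T_CRITICAL = 2.045  # from the two-tailed t-table: df = 29, alpha = 0.05

standard_error = sample_sd / math.sqrt(n)
t_statistic = (sample_mean - mu0) / standard_error

# Reject H0 when the statistic falls outside [-2.045, +2.045].
reject_h0 = abs(t_statistic) > T_CRITICAL

# The same critical value gives the 95% confidence interval for the mean.
lower_bound = sample_mean - T_CRITICAL * standard_error
upper_bound = sample_mean + T_CRITICAL * standard_error

print(t_statistic, reject_h0, lower_bound, upper_bound)
```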
Confusion matrix - A confusion matrix is a table used to evaluate the performance of
a classification model. It compares the predicted labels with the actual labels and
provides insights into how well the model is performing. The matrix is especially
useful for binary and multi-class classification problems.
Explanation of Terms
True Positive (TP) – The model correctly predicted the positive class.
False Positive (FP) (Type I Error) – The model incorrectly predicted positive
when it was actually negative.
False Negative (FN) (Type II Error) – The model incorrectly predicted negative
when it was actually positive.
True Negative (TN) – The model correctly predicted the negative class
Q) When to reduce FN and when to reduce FP?
It is domain specific.
For diseases, reducing FN should be more important. Eg: Cancer. An FN
means the patient has cancer but is predicted as No Cancer
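The four cells can be counted directly from actual and predicted labels. The labels below are made up, and `confusion_counts` is a hypothetical helper rather than a library function. Recall is included because it is the metric that drops when FN rises.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative (Type I error)
        elif a == positive:
            fn += 1   # predicted negative, actually positive (Type II error)
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn

# Hypothetical labels: 1 = Cancer, 0 = No Cancer.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(tp, fp, fn, tn, precision, recall)
```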
Point Estimate & Margin of Error
Z = critical value
Interpreting Margin of Error
If a survey reports a 60% approval rating with a ±3% margin of error, the actual approval
is estimated to lie between 57% and 63% (at the survey's confidence level).
A larger sample size reduces the margin of error, making estimates more precise.
A higher confidence level increases the margin of error, as a wider range is needed to capture
the true value more often.
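These three statements can be checked numerically. The sketch assumes the usual normal-approximation formula for a proportion, margin = Z × sqrt(p(1 - p) / n). The sample size of 1024 is a hypothetical value chosen because it gives roughly ±3% for a 60% rating.

```python
import math
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence=0.95):
    """Margin of error for a sample proportion (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z = critical value
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.60  # point estimate: 60% approval

moe = margin_of_error(p_hat, n=1024)                                 # about 0.03
moe_larger_sample = margin_of_error(p_hat, n=4096)                   # smaller margin
moe_higher_confidence = margin_of_error(p_hat, n=1024, confidence=0.99)  # larger margin

print(round(p_hat - moe, 3), round(p_hat + moe, 3))
print(moe_larger_sample, moe_higher_confidence)
```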
3) Degrees of Freedom
4)
Chi Square Test
1. The 1st table is the original population data (percentages given on the population scale) and the 2nd
table is the data gathered after sampling
2. So we are trying to find out whether there is a change between these 2 datasets: did the
population change in 10 years?
3.
4.
CoVariance
🎯 Intuitive Explanation
Spearman's rank correlation does not assume linearity but instead measures if the relationship is monotonic
(consistently increasing or decreasing).
It works with ranked data, meaning it converts data into ranks before calculating correlation.
It is ideal for ordinal data or data that doesn't meet the assumptions of Pearson correlation (like normal
distribution).
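A from-scratch sketch of rank correlation: convert both variables to ranks (tied values share their average rank), then apply the Pearson formula to the ranks. The function names and the data (y = x³, monotonic but not linear) are illustrative assumptions.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic but non-linear

print(pearson(x, y), spearman(x, y))
```

Because y always increases with x, the rank correlation is 1 even though the Pearson correlation is below 1.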
Rank is like this
OR
Probability Basics
Basic Probability Notes for Data Analysis Roles
1. Basics of Probability
Definitions:
2. Types of Probability
Classical (Theoretical) Probability: Based on predefined rules (e.g., rolling a fair die).
Empirical (Experimental) Probability: Based on observations/data.
Subjective Probability: Based on intuition or expert knowledge.
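Classical and empirical probability can be compared with a quick simulation. The event (rolling a 6) and the number of simulated rolls are arbitrary choices for illustration.

```python
import random
from fractions import Fraction

# Classical probability: 1 favourable outcome out of 6 equally likely ones.
classical = Fraction(1, 6)

# Empirical probability: relative frequency observed in simulated rolls.
rng = random.Random(1)  # fixed seed for reproducibility
rolls = [rng.randint(1, 6) for _ in range(60_000)]
empirical = rolls.count(6) / len(rolls)

print(float(classical), empirical)
```

With more rolls, the empirical value tends to move closer to the classical 1/6.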
3. Probability Rules
4. Conditional Probability