Stats Intro

The document provides a comprehensive overview of statistical concepts including sampling methods, types of variables, measures of central tendency and dispersion, and hypothesis testing. It explains the Central Limit Theorem, various statistical tests, and the importance of confidence intervals and tests for normality. Additionally, it covers the Chi-Square test for independence and goodness-of-fit, as well as the Levene's test for homogeneity of variances.


https://github.com/krishnaik06/The-Grand-Complete-Data-Science-Materials/tree/main

Krish Naik - https://www.youtube.com/watch?v=6Z8SdN52GuU&list=PLZoTAELRMXVMhVyr3Ri9IQ-t5QPBtxzJO&index=44

https://docs.google.com/document/d/1JXswnn4qSZhHQO7RaWLYWTj7kRyL9Y7PjrCNbGLIEvA/edit?tab=t.0

https://dou.ua/forums/topic/44769/

AB Testing Project:

1. https://www.youtube.com/watch?v=FTpmwX94_Yo&t=10247s&pp=ygUvY29tcGxldGUgYWIgdGVzdGluZyB0dXRvcmlhbGFiIHRlc3RpbmcgdHV0b3JpYWw%3D
2. https://www.youtube.com/watch?v=qhYsZWrTiuM&list=PLHS1p0ot3SVjQg0q1eEPrmOmPUY_AT1vB

STATISTICS

Types of Sampling

 Simple Random Sampling – every member of the population has an equal chance of being selected
 Stratified Sampling – sampling from non-overlapping groups (strata)
 Systematic Sampling – surveying every nth person, e.g., people in front of malls, offices, etc.
Type of Variable

Central Tendency

Dispersion
Variance shows the spread of the data.

Standard deviation (SD) shows how far, on average, the data points lie from the mean.


Percentile

25th percentile = Q1 = (25/100) × (n+1); this value gives the position of Q1

75th percentile = Q3 = (75/100) × (n+1); this value gives the position of Q3

Interquartile Range (IQR) = Q3 - Q1 → used to find outliers via a boxplot


Removing Outlier

STEPS:

 Declare a lower fence (below which all numbers are outliers) and a higher fence (above which all numbers are outliers)
 Find Q1, Q3 and IQR, then apply the formulas LF = Q1 - 1.5 × IQR and HF = Q3 + 1.5 × IQR

 So Lower Fence (LF) = -3
 Higher Fence (HF) = 13
 27 is an outlier, so the boxplot's maximum (upper whisker) is 9, not 27, because 27 is an outlier
 Boxplot (see the sketch below)
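A minimal sketch of the fence method in NumPy, on a hypothetical dataset. Note that NumPy's default percentile interpolation can differ slightly from the (n+1) positional formula above.

```python
import numpy as np

# Hypothetical data; 27 is the suspected outlier, as in the notes' example
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 27])

q1, q3 = np.percentile(data, [25, 75])  # NumPy interpolates, which may differ
iqr = q3 - q1                           # slightly from the (n+1) position rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
cleaned = data[(data >= lower_fence) & (data <= upper_fence)]
print(lower_fence, upper_fence, outliers)  # 27 falls above the upper fence
```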

SECOND METHOD TO REMOVE OUTLIERS: Z-SCORE


Applications of Z-score:
1. Standardisation
2. Finding outlier
3. Comparing scores between different distributions.
Eg: India's cricket scores in the 2015 and 2019 World Cups, where we know the
average, SD and final score for both tournaments. Which final score is better
relative to its own tournament? (See the sketch below.)
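A small sketch of using z-scores to compare scores across distributions. The tournament numbers here are hypothetical placeholders, not real World Cup data.

```python
def z_score(x, mean, sd):
    """Standardise a value: how many SDs it lies from its distribution's mean."""
    return (x - mean) / sd

# Hypothetical tournament stats (illustrative numbers only)
z_2015 = z_score(330, mean=290, sd=25)   # final score vs the 2015 distribution
z_2019 = z_score(340, mean=310, sd=30)   # final score vs the 2019 distribution

# The larger z-score is the stronger score relative to its own tournament
print(z_2015, z_2019)
```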

Standardisation vs Normalisation
NORMAL DISTRIBUTION
CENTRAL LIMIT THEOREM
 Suppose our population/original data follows a normal, log-normal, or any other distribution
 Now we take, say, 100 different samples, each with n >= 30, and compute the mean of each sample set
 E.g., for the 1st sample set the mean is x1
 For the 2nd sample set, the mean is x2
 Similarly, for the 100th set the mean is x100

In order to apply the central limit theorem, there are four conditions that must be met:

1. Randomization: The data must be sampled randomly such that every member in a
population has an equal probability of being selected to be in the sample.

2. Independence: The sample values must be independent of each other.

3. The 10% Condition: When the sample is drawn without replacement, the sample size should
be no larger than 10% of the population.

4. Large Sample Condition: The sample size needs to be sufficiently large (commonly n >= 30).

No matter what the shape of the original data is, if you take enough random samples and average
them, the distribution of those averages will start to look like a normal distribution.

 Now, if all the mean values above (x1, x2, …, x100) are plotted, a bell curve is going to form,
which follows a normal distribution

The central limit theorem states that if you take sufficiently large samples (n>=30) from a
population, the samples’ means will be normally distributed, even if the population isn’t normally
distributed.

Example: A population follows a Poisson distribution (left image). If we take 10,000 samples from
the population, each with a sample size of 50, the sample means follow a normal distribution, as
predicted by the central limit theorem (right image).
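A quick simulation sketch of the theorem, using a Poisson population as in the example above; the sample size and counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.poisson(lam=3, size=100_000)  # skewed, non-normal population

# Draw 10,000 samples of size 50 and record each sample mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(10_000)]

print(np.mean(sample_means))  # ~ population mean (3)
print(np.std(sample_means))   # ~ population SD / sqrt(50)
# A histogram of sample_means (e.g., matplotlib's plt.hist) looks roughly normal.
```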
Skewness

Different Probability functions


What is a Random Variable?

In probability, a real-valued function defined over the sample space of a random experiment is
called a random variable. That is, the values of the random variable correspond to the outcomes of
the random experiment.

A probability function is a rule or formula that assigns a probability (a value between 0 and 1) to
each possible outcome of a random event or variable.
🔹 Two Main Types:

1. Probability Mass Function (PMF)


→ For discrete random variables (things you can count: dice rolls, coin flips)

2. Probability Density Function (PDF)


→ For continuous random variables (things you measure: height, weight, time)

🔁 Summary:
Thing Description
Random Variable (X) Describes the possible outcomes numerically
Probability Function Assigns a probability to each possible value of X

Relation between PDF & CDF

Summary of Relationships in Action:

 Continuous Example (Normal Distribution):

o The PDF describes the probability density at any given point.

o The CDF accumulates this density to give the total probability up to a certain point.

o The CDF is the integral of the PDF.

 Discrete Example (Dice Roll):

o The PMF gives the probability for each outcome.


o The CDF gives the cumulative probability for all outcomes less than or equal to a
specific value.

o The CDF is the cumulative sum of the PMF.
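A short sketch of these relationships with scipy.stats: the CDF as the integral of the PDF in the continuous case, and as the cumulative sum of the PMF in the discrete case.

```python
import numpy as np
from scipy import stats

# Continuous case: standard normal
x = 1.0
pdf_at_x = stats.norm.pdf(x)   # density at x (not a probability by itself)
cdf_at_x = stats.norm.cdf(x)   # P(X <= x): the integral of the PDF up to x
print(pdf_at_x, cdf_at_x)      # ~0.2420, ~0.8413

# Discrete case: a fair six-sided die
pmf = np.full(6, 1 / 6)        # P(X = k) for k = 1..6
cdf = np.cumsum(pmf)           # P(X <= k) is the running sum of the PMF
print(cdf[2])                  # P(X <= 3) = 0.5
```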


Hypothesis Testing – take a sample and infer something about the population

Steps
Statistical Tests Table
Test Type | Conditions | Test Name | H0 | H1
Mean | 1 sample, σ known | Z-test | μ = μ₀ | μ ≠ μ₀ (two-tailed) or μ > μ₀ / μ < μ₀ (one-tailed)
Mean | 1 sample, σ unknown | T-test | μ = μ₀ | μ ≠ μ₀ (two-tailed) or μ > μ₀ / μ < μ₀ (one-tailed)
Mean | 2 samples, independent, σ unknown (like a drug test on 2 different sets of people) | Independent T-test (pooled variance) | μ₁ = μ₂ | μ₁ ≠ μ₂ (two-tailed) or μ₁ > μ₂ / μ₁ < μ₂ (one-tailed)
Mean | 2 samples, paired (e.g., effect of diet on weight for the same people before and after) | Paired T-test (d = paired difference) | μd = 0 | μd ≠ 0 (two-tailed) or μd > 0 / μd < 0 (one-tailed)
Mean | >2 independent samples | ANOVA (F-test) | μ₁ = μ₂ = μ₃ … = μₙ | At least one mean differs
Median | Non-parametric, 1 sample | Sign Test | Median = Median₀ | Median ≠ Median₀
Rank | Non-parametric, 2 independent samples | Mann-Whitney U Test (MWU Test) | Distributions are equal | One distribution is shifted
Rank | Non-parametric, paired (or dependent) samples | Wilcoxon Signed-Rank Test | Medians are equal | Medians are different
Rank | Non-parametric, >2 groups | Kruskal-Wallis Test (KW Test) | Distributions are equal | At least one distribution is different
Proportion | 1 categorical variable, 1 sample | Z-test | p = p₀ | p ≠ p₀ (two-tailed) or p > p₀ / p < p₀ (one-tailed)
Proportion | 1 categorical variable, 2 independent samples | Z-test | p₁ = p₂ | p₁ ≠ p₂ (two-tailed) or p₁ > p₂ / p₁ < p₂ (one-tailed)
Chi-Square | Goodness of fit, non-parametric & nominal variables | Chi-Square Test | Observed distribution = Expected distribution | Observed distribution ≠ Expected distribution
Chi-Square | Test of independence (nominal variables) | Chi-Square Test | Variables are independent | Variables are dependent

Here's a chart with examples for each statistical test, showing when and where they are
commonly applied:

Test Type | Conditions | Test Name | Example Use Case
Mean (parametric) | 1 sample, σ known | Z-test | Testing if the average height of students is 170 cm when the population standard deviation is known.
Mean | 1 sample, σ unknown | T-test | Checking if the average weight of a product is 500 g when the population standard deviation is unknown.
Mean | 2 samples, independent, σ unknown (like a drug test on 2 different sets of people) | Independent T-test (pooled variance) | Comparing the average test scores of two different classrooms.
Mean | 2 samples, paired | Paired T-test | Measuring the effect of a diet on weight before and after for the same group of people.
Mean | >=3 independent samples | ANOVA (F-test) | Comparing the average sales across 3 different regions.
Median (non-parametric) | Non-parametric, 1 sample | Sign Test | Testing if the median house price is $300,000.
Rank (non-parametric) | Non-parametric, 2 independent samples, ordinal data | Mann-Whitney U Test | Comparing customer satisfaction rankings of two different stores.
Median | Non-parametric, paired (or dependent) samples, ordinal data | Wilcoxon Signed-Rank Test | Measuring the effect of therapy on stress levels before and after treatment.
Median | Non-parametric, >2 groups, ordinal data | Kruskal-Wallis Test | Comparing the effectiveness of 3 different fertilizers on plant growth.
Proportion | 1 categorical variable (only 2 options), 1 sample | Z-test | Testing if the proportion of people who prefer tea differs from a claimed value, e.g. 50%.
Proportion | 1 categorical variable (only 2 options), 2 independent samples | Z-test | Comparing the proportion of men vs women who prefer online shopping, or tea preference between two cities (City 1 and City 2).
Chi-Square (non-parametric; checks for association among variables) | Goodness of fit, no parametric assumptions, nominal variables (no order/rank) | Chi-Square Test | Testing whether a die produces all numbers with equal frequency.
Chi-Square | Test of independence (nominal variables, no order/rank) | Chi-Square Test | Checking if gender is related to political preference, or whether gender (Male, Female) is associated with preferred beverage (Tea, Coffee, Water).
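A hedged sketch of how a few of these tests look in scipy.stats, run on made-up data; the groups and effect sizes are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
class_a = rng.normal(70, 10, 40)   # hypothetical test scores, classroom A
class_b = rng.normal(74, 10, 40)   # hypothetical test scores, classroom B

# Independent two-sample t-test (H0: μ1 = μ2)
t_stat, p_ind = stats.ttest_ind(class_a, class_b)

# Paired t-test, e.g. weight before/after a diet (H0: μd = 0)
before = rng.normal(80, 5, 30)
after = before - rng.normal(1, 2, 30)
t_pair, p_pair = stats.ttest_rel(before, after)

# One-way ANOVA across 3 groups (H0: all means equal)
f_stat, p_anova = stats.f_oneway(class_a, class_b, rng.normal(72, 10, 40))

# Non-parametric alternative for two independent samples
u_stat, p_mwu = stats.mannwhitneyu(class_a, class_b)

print(p_ind, p_pair, p_anova, p_mwu)  # compare each against α = 0.05
```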

Chi-square test - a statistical method used to test for a difference or association between the observed
and expected frequencies of categorical variables in a dataset.
Example: A food delivery company wants to find the relationship between gender, location, and the food
choices of people.

In Chi-Square, this difference is tested like this:


 Null Hypothesis (H₀): The observed distribution is the same as expected (no relationship, fair dice,
etc.).
 Alternative Hypothesis (H₁): The distribution is different (not fair, variables are related, etc.).
If your test shows a significant difference, you reject H₀ and say:
“The distribution is different” = “What we see doesn’t match what we expected under no effect.”

The Chi-Square test is actually used for two main purposes, and they both revolve around comparing observed vs.
expected distributions — but in slightly different contexts:
🔹 1. Chi-Square Test of Independence
(Also called association test)

👉 Purpose:

To test if two categorical variables are associated or independent.

🧠 Example:

 Is ice cream preference related to age group?


 Is gender associated with product purchase?

🧪 You’re testing:

"Is the distribution of preferences different across groups?"

If the distributions across age groups differ significantly, you conclude the variables are associated.

🔹 2. Chi-Square Goodness-of-Fit Test


👉 Purpose:

To test if a single categorical variable follows a specific distribution (often a uniform or expected one).

🧠 Example:

 Do people prefer all brands equally?


 Does a die produce all numbers with equal frequency?

🧪 You’re testing:

"Is the observed distribution of a variable different from what we expected?"

Test | Checks What? | Obs vs Exp Based On…
Goodness-of-Fit | Is the data evenly/randomly distributed? | Assumed probabilities
Test of Independence | Are 2 categorical variables related? | Frequencies from a 2-way table

🔁 Common Ground: Observed vs. Expected


In both tests, you're comparing:

 Observed frequencies → from actual data


 Expected frequencies → under the null hypothesis

If the difference is too large, the test statistic (Chi-Square) becomes large, and you reject the null hypothesis.
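A minimal sketch of both Chi-Square variants with scipy.stats; the counts below are hypothetical.

```python
import numpy as np
from scipy import stats

# Goodness-of-fit: is a die fair? Observed counts from 60 hypothetical rolls.
observed = np.array([8, 12, 9, 11, 10, 10])
chi2_gof, p_gof = stats.chisquare(observed)   # expected defaults to uniform
print(chi2_gof, p_gof)

# Test of independence: gender vs preferred beverage (hypothetical 2-way counts)
table = np.array([[30, 20, 10],               # rows: Male, Female
                  [25, 25, 15]])              # cols: Tea, Coffee, Water
chi2_ind, p_ind, dof, expected = stats.chi2_contingency(table)
print(chi2_ind, p_ind, dof)                   # small p → variables are associated
```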
✅ Quick Summary:

 Parametric Tests: Assume normal distribution; used for means.


 Non-Parametric Tests: No assumption of normality; used for medians/ranks.
 Chi-Square Tests: Used for categorical data (counts/frequencies).

For categorical variables, it is not meaningful to talk about whether the distribution is
"normal" or not because:

 Normal distribution applies to continuous variables (e.g., height, weight, income)


where values are measured on a continuous scale.
 Categorical variables (e.g., gender, device type, yes/no responses) represent
categories or labels, not numerical values.
 Since categorical data is based on counts or proportions, concepts like mean and
standard deviation don’t apply in the same way.

Note: The Shapiro-Wilk test is used to determine whether a distribution is normal or not.
Levene's test is used to check whether the variances of two or more groups are equal, also known as
homogeneity of variances. This is a crucial assumption for many parametric tests, such as t-tests and ANOVA.

✅ When to Use Levene's Test:

1. Before Running Parametric Tests:

o Many tests (like t-tests and ANOVA) assume that the variances across groups are
approximately equal.

o Levene's test helps validate this assumption.

2. For Comparing Two or More Groups:

o Works for 2 or more independent groups.

o Useful when comparing categories like gender, treatment types, or experimental groups.

3. Robust to Non-Normality:

o Levene's test is less sensitive to departures from normality compared to other tests (like
Bartlett’s test).

o Ideal when data is non-normal or has outliers.
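A small sketch of both checks with scipy.stats, on simulated groups; one group is given a deliberately larger spread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(50, 5, 40)    # hypothetical measurements
group2 = rng.normal(52, 5, 40)
group3 = rng.normal(51, 9, 40)    # deliberately larger variance

# Shapiro-Wilk: H0 = the sample comes from a normal distribution
w_stat, p_normal = stats.shapiro(group1)

# Levene: H0 = all groups have equal variances (homogeneity)
lev_stat, p_levene = stats.levene(group1, group2, group3)

print(p_normal, p_levene)   # a small p_levene suggests unequal variances
```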

A Confidence Interval gives a range of values that is likely to contain the true population parameter (like
mean or proportion), based on your sample data.

Confidence Levels - It’s the probability (or certainty level) that your confidence interval actually contains the
true population parameter.

 Common confidence levels: 90%, 95%, 99%

 95% confidence level = if you repeated the experiment 100 times, about 95 of the resulting
intervals would contain the true parameter

Relation to CI:

The higher the confidence level, the wider the confidence interval.
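A sketch of computing a 95% confidence interval for a mean with scipy; the data are hypothetical product weights in grams.

```python
import numpy as np
from scipy import stats

data = np.array([498, 502, 499, 501, 500, 497, 503, 499, 500, 501])  # hypothetical weights (g)
mean = data.mean()
sem = stats.sem(data)   # standard error of the mean

# 95% CI using the t-distribution (σ unknown, small sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(mean, (ci_low, ci_high))
# A 99% interval (0.95 → 0.99) would be wider, as noted above.
```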

P-Value :- A p-value, or probability value, is the probability of obtaining results at least as extreme as
the observed data, assuming the null hypothesis of the statistical test is true.

 A small p-value (like less than 0.05) means the results are unlikely due to chance, so you might reject
the null hypothesis.

 A large p-value means the results could easily happen by chance, so you don't reject the null
hypothesis.

✅ Practical Interpretation in Interviews:


Concept | What It Answers | Example
Confidence Interval | "Where could the true value lie?" | "The true mean is likely between 180 and 220."
Confidence Level | "How confident are you in that range?" | "We are 95% confident that range is correct," i.e., if I repeat the experiment 100 times, about 95 of the intervals will contain the true value.
p-value | "How likely is my result if there's no effect?" | "Only a 2% chance it happened randomly."
Significance level (α) | "What's my cut-off for surprise?" | "If p < 0.05, I'll consider the result significant."
When to use Z-test and t-test

Z-test Example
Similarly, from below we can calculate the actual values of the lower and upper bounds of the
confidence interval.

Example of t-test

 Analyses the difference of means


Degrees of Freedom = n - 1 = 30 - 1 = 29

The values +2.045 and -2.045 are found in the t-table, specifically in the two-tailed
table (df = 29, α = 0.05). The value 2.045 is called the critical value. (See the lookup sketch below.)
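Rather than a printed t-table, the same critical value can be looked up with scipy (df = 29, two-tailed α = 0.05):

```python
from scipy import stats

alpha, df = 0.05, 29
t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical value
print(t_crit)                             # ~2.045, matching the t-table
```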
Confusion matrix - A confusion matrix is a table used to evaluate the performance of
a classification model. It compares the predicted labels with the actual labels and
provides insights into how well the model is performing. The matrix is especially
useful for binary and multi-class classification problems.

Explanation of Terms
 True Positive (TP) – The model correctly predicted the positive class.
 False Positive (FP) (Type I Error) – The model incorrectly predicted positive
when it was actually negative.
 False Negative (FN) (Type II Error) – The model incorrectly predicted negative
when it was actually positive.
 True Negative (TN) – The model correctly predicted the negative class.
Q) When should we focus on reducing FN vs FP?
 It is domain specific.
 For disease detection, reducing FN is usually more important. E.g., cancer: a FN
means the patient has cancer but is predicted as "No Cancer".
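A minimal sketch with scikit-learn's confusion_matrix; the labels below are hypothetical (1 = cancer).

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual labels (1 = cancer)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model predictions

# For binary 0/1 labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
# Here FN=1: one cancer patient predicted as "No Cancer" - the costly error above
```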

Point Estimate & Margin of Error
Margin of Error = Z × standard error, where Z is the critical value.
Interpreting Margin of Error
 If a survey reports a 60% approval rating with a ±3% margin of error, the actual approval
could be anywhere between 57% and 63%.

 A larger sample size reduces the margin of error, making estimates more precise.

 A higher confidence level increases the margin of error, as it requires a wider range to ensure
accuracy.
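A sketch of the margin-of-error arithmetic behind the survey example above (60% approval; n = 1,000 is an assumed sample size).

```python
import math
from scipy import stats

p_hat, n = 0.60, 1000            # hypothetical: 60% approval from 1,000 respondents
z = stats.norm.ppf(0.975)        # z critical value for 95% confidence (~1.96)

moe = z * math.sqrt(p_hat * (1 - p_hat) / n)    # margin of error for a proportion
print(moe, (p_hat - moe, p_hat + moe))          # ~±3%, i.e. roughly 57% to 63%
```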

Example to calculate the confidence interval values


This shows that hypothesis testing (Z-test) and confidence intervals are closely related:
 In a Z-test, we check whether the Z-statistic exceeds a critical value to determine statistical
significance.
 In a confidence interval, we check whether a hypothesized mean μ lies inside or
outside the interval.
Key Insight:
 The confidence interval checks whether the population mean falls within a range built
around the provided sample mean.
 The Z-test checks whether the sample mean is significantly different from the provided
population mean; equivalently, whether the sample mean lies within the 95% acceptance
region (imagine the 95% bell curve).
ANOVA Test → F-distribution (a right-skewed distribution)

1. Concept example of Factors & Levels

2. Assumptions to be considered before ANOVA


3. Types of ANOVA
STEPS

Example: One-Way ANOVA


1) Question

2) Null & Alternate Hypothesis

3) Degrees of Freedom

4)
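A hedged one-way ANOVA sketch matching these steps, on hypothetical sales data for three regions (factor = region, levels = A/B/C):

```python
from scipy import stats

# Hypothetical sales from 3 regions (one factor, 3 levels)
region_a = [20, 22, 19, 24, 25]
region_b = [28, 30, 27, 26, 29]
region_c = [21, 23, 22, 20, 24]

# H0: μA = μB = μC    H1: at least one mean differs
f_stat, p_value = stats.f_oneway(region_a, region_b, region_c)

# Critical value from the (right-skewed) F-distribution
df_between, df_within = 3 - 1, 15 - 3
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f_stat, f_crit, p_value)   # reject H0 if f_stat > f_crit (or p < 0.05)
```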
Chi Square Test

1. The 1st table is the original population data (percentages provided at population scale) and the
2nd table is the data gathered after sampling.
2. So we are trying to find out if there is a change between these two datasets. Did the
population change in 10 years?
3.

Here we calculate the expected counts based on the population percentages provided.

4.

Note: n = number of categories (not the sample size of 500), hence n = 3 and degrees of freedom = n - 1 = 2.


5.

From Chi Square table, critical value is 5.99


6.
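A sketch of this worked example in scipy; the population percentages and sample counts below are hypothetical stand-ins for the tables above, but the setup (n = 500, 3 categories, df = 2, critical value 5.99) matches.

```python
import numpy as np
from scipy import stats

n = 500
pop_share = np.array([0.60, 0.30, 0.10])   # hypothetical 10-year-old population %
observed = np.array([280, 160, 60])        # hypothetical sampled counts (sum = 500)

expected = pop_share * n                   # expected counts under H0
chi2, p = stats.chisquare(observed, f_exp=expected)

crit = stats.chi2.ppf(0.95, df=len(observed) - 1)   # df = 3 - 1 = 2 → 5.99
print(chi2, crit, p)   # reject H0 (population changed) if chi2 > 5.99
```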

CoVariance

Covariance is a measure of how two variables change together.

🎯 Intuitive Explanation

 If both variables increase together, covariance is positive.


 If one variable increases while the other decreases, covariance is negative.
 If there's no consistent pattern, covariance is close to zero.

For a sample, use (n-1) in the denominator: cov(X, Y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1)


Covariance and correlation are both measures of how two variables change together, but they
differ in terms of scale, interpretation, and standardization.

 Covariance tells you the direction of the relationship.


 Correlation tells you the direction and strength of the relationship, and it's easier to interpret! 🚀
This correlation formula is also called the Pearson correlation coefficient. Its main limitation is that it
only captures linear relationships between the variables.
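A short sketch contrasting the two measures on toy data: covariance gives the direction but depends on the variables' scales, while Pearson's r is standardised to [-1, 1].

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

cov_xy = np.cov(x, y)[0, 1]     # sample covariance (n-1 denominator)
r, p = stats.pearsonr(x, y)     # Pearson correlation: direction + strength

print(cov_xy, r)   # covariance is scale-dependent; r is always in [-1, 1]
```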

Spearman Rank Correlation


Spearman can also capture monotonic non-linear relationships (see the figure below).
The Spearman Rank Correlation (or Spearman's rho, denoted ρ) is a non-parametric measure of
the strength and direction of the monotonic relationship between two variables.

🎯 Key Points about Spearman Rank Correlation

 It does not assume linearity but instead measures if the relationship is monotonic (consistently
increasing or decreasing).

 Works with ranked data, meaning it converts data into ranks before calculating correlation.

 Less sensitive to outliers than Pearson, because it is based on ranks, not raw values.

 Ideal for ordinal data or data that doesn't meet the assumptions of Pearson correlation (like normal
distribution).
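A sketch showing where Spearman and Pearson differ, using a monotonic but non-linear toy relationship (y = x³):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6])
y = x ** 3                               # monotonic but clearly non-linear

r_pearson, _ = stats.pearsonr(x, y)      # < 1: relationship is not linear
rho, _ = stats.spearmanr(x, y)           # = 1.0: the ranks agree perfectly

print(r_pearson, rho)
```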

Probability Basics
Basic Probability Notes for Data Analysis Roles

1. Basics of Probability
Definitions:

2. Types of Probability
 Classical (Theoretical) Probability: Based on predefined rules (e.g., rolling a fair die).
 Empirical (Experimental) Probability: Based on observations/data.
 Subjective Probability: Based on intuition or expert knowledge.

3. Probability Rules
4. Conditional Probability

5. Important Probability Distributions for Data Analysis


1. Bernoulli Distribution: Binary outcomes (0 or 1), like coin flips.
2. Binomial Distribution: Number of successes in n independent Bernoulli trials.
3. Poisson Distribution: Counts of events happening in a fixed time interval.
4. Uniform Distribution: All outcomes are equally likely.
5. Normal Distribution (Gaussian): Bell-shaped curve, common in real-world data.
6. Exponential Distribution: Time until an event occurs (e.g., time between arrivals in a queue).
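A quick sketch mapping each distribution above to a NumPy sampler; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

bernoulli   = rng.binomial(1, 0.5, size=10)        # single coin flips (0 or 1)
binomial    = rng.binomial(n=10, p=0.5, size=5)    # successes in 10 trials
poisson     = rng.poisson(lam=4, size=5)           # event counts per interval
uniform     = rng.uniform(0, 1, size=5)            # all outcomes equally likely
normal      = rng.normal(loc=0, scale=1, size=5)   # bell-shaped curve
exponential = rng.exponential(scale=2, size=5)     # waiting time until an event
```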

6. Expected Value & Variance


7. Applications in Data Analysis
 A/B Testing: Probability helps determine if a change in a product leads to statistically significant improvements.
 Predictive Modeling: Many machine learning models (e.g., Naive Bayes) are based on probability.
 Anomaly Detection: Outliers can be identified using probability distributions.
 Bayesian Statistics: Used for updating beliefs with new data.
