Statistics Guide
2. Data Types
🔹 Quantitative (Numerical):
Measurable quantities (height, weight).
❖ Discrete:
Countable (e.g., number of customers).
❖ Continuous:
Measurable with infinite possibilities (e.g., temperature).
🔹 Qualitative (Categorical):
Descriptive labels (gender, color).
❖ Ordinal:
Ordered categories (e.g., education level).
❖ Nominal:
Unordered categories (e.g., colors).
3. Descriptive Statistics
1. Measures of Central Tendency
🔹 Mean
Example:
Suppose we have test scores: 50, 60, 70, 80, 90
Mean = (50+60+70+80+90)/5 = 70
Key Points:
Sensitive to outliers (Extreme values can shift the mean).
Used for normally distributed data.
Helps in summarizing data.
🔹 Median
Example:
Odd data: [10, 20, 30, 40, 50] → Median = 30 (middle value).
Even data: [10, 20, 30, 40] → Median = (20+30)/2=25
Key Points:
Not affected by outliers.
Best for skewed data distributions.
🔹 Mode
Example:
Dataset: [1, 2, 3, 3, 4, 5] → Mode = 3
If all numbers appear equally, there's no mode.
Key Points:
Used in categorical data (e.g., most popular product color).
Can be bimodal (two modes) or multimodal (more than two modes).
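All three measures of central tendency are available in Python's built-in statistics module; a minimal sketch reusing the examples above:

```python
# Mean, median, and mode with the statistics module.
import statistics

scores = [50, 60, 70, 80, 90]
mean = statistics.mean(scores)               # (50+60+70+80+90)/5 = 70
median = statistics.median([10, 20, 30, 40]) # even count: (20+30)/2 = 25
mode = statistics.mode([1, 2, 3, 3, 4, 5])   # most frequent value = 3

print(mean, median, mode)  # 70 25 3
```

`statistics.multimode` returns all modes when the data is bimodal or multimodal.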
2. Measures of Dispersion (Variability in Data)
These measures show how spread out the data points are.
🔹 Range
Range=Max value−Min value
Example:
Data [5, 10, 15, 20],
Range =20−5=15
Key Point:
Ignores data distribution and only considers the extremes.
🔹 Variance (σ²)
Variance tells how much the data deviates from the mean.
Formula:
σ² = Σ(xᵢ − μ)² / N
Example:
Data [2, 4, 6, 8]
Mean = 5
σ² = ((2−5)² + (4−5)² + (6−5)² + (8−5)²) / 4
   = (9+1+1+9)/4 = 5
Key Point:
Higher variance = More spread out data.
🔹 Standard Deviation (σ)
The square root of the variance, expressed in the same units as the data.
Formula:
σ = √(Σ(xᵢ − μ)² / N)
Helpful visualizations:
➢ Mean vs. Median (Effect of Outliers) – Shows how outliers pull the mean but not the median.
➢ Standard Deviation & Variance – Displays a normal distribution with standard deviation lines.
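The dispersion measures above can be checked in a few lines of Python, reusing the [2, 4, 6, 8] example:

```python
# Range, population variance, and standard deviation for the example data.
import math

data = [2, 4, 6, 8]
rng = max(data) - min(data)                                # 8 - 2 = 6
mean = sum(data) / len(data)                               # 5.0
variance = sum((x - mean) ** 2 for x in data) / len(data)  # population variance = 5.0
std_dev = math.sqrt(variance)                              # √5 ≈ 2.236

print(rng, variance, round(std_dev, 3))
```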
3. Shape of Distribution
🔹 Kurtosis
Kurtosis measures the "tailedness" of a distribution. It tells us whether the distribution has
more or fewer extreme values than a normal distribution.
Types of Kurtosis:
❖ Mesokurtic (≈ 3): Tails similar to a normal distribution.
❖ Leptokurtic (> 3): Heavy tails, more extreme values.
❖ Platykurtic (< 3): Light tails, fewer extreme values.
Formula:
Kurtosis = E[(X − μ)⁴] / σ⁴
Where:
• μ = mean, σ = standard deviation
➢ A higher kurtosis (>3) indicates more extreme values.
➢ A lower kurtosis (<3) suggests fewer extreme values.
🔹 Skewness
Skewness measures the asymmetry of a probability distribution. It indicates whether data
points are more spread out on one side of the mean than the other.
Types of Skewness
❖ Zero skewness (symmetric):
Example:
A normal distribution.
❖ Positive (right) skewness:
Example:
Income distribution, where a few people earn extremely high salaries.
❖ Negative (left) skewness:
Example:
Exam scores where most students score high, but a few score very low.
Formula:
Skewness = E[(X − μ)³] / σ³
Where:
• μ = mean, σ = standard deviation
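Skewness and kurtosis can be computed directly from their moment definitions; a small sketch (the datasets are made up for illustration):

```python
# Moment-based (population) skewness and kurtosis:
# skew = E[(X-mu)^3]/sigma^3, kurt = E[(X-mu)^4]/sigma^4 (normal ~ 3).
def skewness(data):
    n = len(data)
    mu = sum(data) / n
    sigma = (sum((x - mu) ** 2 for x in data) / n) ** 0.5
    return sum((x - mu) ** 3 for x in data) / n / sigma ** 3

def kurtosis(data):
    n = len(data)
    mu = sum(data) / n
    sigma = (sum((x - mu) ** 2 for x in data) / n) ** 0.5
    return sum((x - mu) ** 4 for x in data) / n / sigma ** 4

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 10]  # long right tail, like incomes

print(skewness(symmetric))          # 0.0 for perfectly symmetric data
print(skewness(right_skewed) > 0)   # True: the tail pulls skewness positive
```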
4. Percentiles and Quartiles
🔹 Percentile:
A measure indicating the value below which a given percentage of observations fall.
Example:
The 90th percentile is the value below which 90% of the data lie.
🔹 Quartiles:
Special percentiles that divide the data into four equal parts (Q1, Q2 [median], Q3).
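Quartiles can be computed with `statistics.quantiles`; a quick sketch on a hypothetical dataset:

```python
# n=4 cut points split the sorted data into four equal parts (quartiles).
import statistics

data = [10, 20, 30, 40, 50, 60, 70, 80, 90]
q1, q2, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method

print(q1, q2, q3)  # 25.0 50.0 75.0 — Q2 is the median
```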
5. Outlier Detection
🔹Z-score Method:
➢ Measures how many standard deviations a data point is from the mean.
➢ Assumes the data follows a normal distribution.
➢ Most data (99.7%) falls within ±3 standard deviations. Outliers lie in the far tails.
Formula:
Z = (x − μ) / σ
Where:
• x = data point, μ = mean, σ = standard deviation
🔹 IQR Method:
➢ Uses the interquartile range, IQR = Q3 − Q1.
Outlier Thresholds:
• Lower Bound: Q1 − 1.5 × IQR
• Upper Bound: Q3 + 1.5 × IQR
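Both detection methods can be sketched in a few lines. Note how, in this small made-up sample, the single outlier inflates the standard deviation enough that the |z| > 3 rule misses it, while the IQR rule flags it:

```python
# Z-score vs. IQR outlier detection.
import statistics

data = [10, 12, 11, 13, 12, 11, 14, 13, 100]  # 100 is an obvious outlier

# --- Z-score method (assumes roughly normal data) ---
mean = statistics.mean(data)
std = statistics.pstdev(data)  # population standard deviation
z_outliers = [x for x in data if abs((x - mean) / std) > 3]
# Empty here: the outlier itself inflates std, so |z| for 100 is only ~2.8

# --- IQR method (no normality assumption) ---
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower or x > upper]

print(z_outliers, iqr_outliers)  # [] [100]
```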
4. Probability Basics
Probability measures how likely an event is to occur.
🔹Probability of an Event:
If an event A has n(A) favorable outcomes out of n(S) equally likely outcomes, its probability is:
P(A) = n(A) / n(S)
📌 Example:
Rolling a fair die, P(rolling 4)=1/6.
🔹Complementary Events:
➢ Two events are complementary if one event's occurrence means the other cannot
occur.
➢ Together, they cover the entire sample space.
📌 Example:
Probability of not rolling a 6 in a fair die roll:
P(not 6) = 1 − P(6) = 1 − 1/6 = 5/6
🔹 Law of Total Probability:
➢ The probability of an event is the weighted sum of its conditional probabilities across all scenarios.
📌 Example:
A factory has 3 machines:
• Machine 1: Produces 50% of items, defect rate = 2%.
• Machine 2: Produces 30% of items, defect rate = 3%.
• Machine 3: Produces 20% of items, defect rate = 4%.
P(Defect) = 0.5×0.02 + 0.3×0.03 + 0.2×0.04 = 0.027, so 2.7% of all items are defective.
❖ Independent Events
One event does not affect the probability of the other.
📌 Example:
Tossing two coins: Probability of both heads = 0.5 × 0.5 = 0.25
❖ Dependent Events
One event influences the probability of the other.
📌 Example:
Drawing 2 cards from a deck without replacement:
Probability both are aces: (4/52) × (3/51) = 1/221
🔹Addition Rule (For Union of Events)
For two mutually exclusive events A and B: P(A∪B) = P(A) + P(B)
📌 Example:
Probability of rolling a 2 or a 3 on a die:
P(2) = 1/6, P(3) = 1/6
Since these are mutually exclusive: P(2 or 3) = 1/6 + 1/6 = 1/3
🔹Multiplication Rule (For Intersection of Events)
For two independent events A and B: P(A∩B) = P(A) × P(B)
🔹Conditional Probability
Conditional probability is the probability of event A occurring, given that event B has
already occurred.
📌 Formula:
P(A∣B) = P(A∩B) / P(B)
📌 Example:
A deck of 52 cards has 13 hearts.
If you draw one heart (Event B) and then another (Event A), the probability of the second heart is:
P(A∣B) = 12/51, since 12 hearts remain among the 51 remaining cards.
🔹Bayes’ Theorem
Bayes' Theorem is a fundamental concept in probability and statistics, especially in
fields like machine learning, data science, and AI. It describes the relationship
between conditional probabilities, helping us update our beliefs based on new
evidence.
📌 The Formula:
P(A∣B) = P(B∣A) × P(A) / P(B)
Where:
• P(A∣B): Posterior – probability of A given the evidence B.
• P(B∣A): Likelihood – probability of the evidence given A.
• P(A): Prior probability of A.
• P(B): Total probability of the evidence.
📌 Example:
• A disease affects 1% of the population: P(Disease) = 0.01
• The test detects the disease 95% of the time: P(Positive∣Disease) = 0.95
• The test has a 5% false positive rate if the person doesn't have the disease: P(Positive∣NoDisease) = 0.05
📌 Question:
If you test positive, what is the probability that you actually have the
disease?
P(Disease∣Positive) = (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99) = 0.0095 / 0.059 ≈ 0.161
(taking the disease prevalence to be P(Disease) = 0.01)
📌 Final Answer:
If you test positive, there's only about a 16.1% chance that you actually have the
disease, despite the test being 95% accurate. This is a classic example of how base
rates can significantly influence outcomes in Bayesian inference.
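The arithmetic behind this answer is easy to reproduce; a short sketch, using the 1% prevalence consistent with the 16.1% result above:

```python
# Reproducing the disease-test example with Bayes' theorem.
p_disease = 0.01                 # assumed prior: 1% prevalence
p_pos_given_disease = 0.95       # sensitivity
p_pos_given_no_disease = 0.05    # false positive rate

# Law of total probability: overall chance of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_no_disease * (1 - p_disease))

# Bayes' theorem: P(Disease | Positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.161
```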
6. Sampling Methods
🔹 1. Simple Random Sampling (SRS)
➢ Every member of the population has an equal chance of being selected.
➢ Use random number generators or lottery systems.
📌 Example:
Randomly selecting 100 employees from a company of 1,000.
🔹 2. Systematic Sampling
➢ Select every k-th member from a list.
➢ Calculate the sampling interval: k = population size (N) / sample size (n).
📌 Example:
Selecting every 5th customer entering a store.
🔹 3. Stratified Sampling
➢ Divide the population into subgroups (strata) based on a characteristic, then
sample each subgroup.
➢ Use proportional or equal allocation.
📌 Example:
Surveying students by selecting equal numbers from each grade.
🔹 4. Cluster Sampling
➢ Divide the population into clusters (naturally occurring groups) and randomly
select entire clusters.
➢ One-stage (all members in clusters) or two-stage (randomly sample within
clusters).
📌 Example:
Choosing 3 classrooms at random and surveying all students in them.
🔹 5. Convenience Sampling (Non-Probability)
➢ Select whoever is easiest to reach.
📌 Example:
Surveying customers at a single mall. (Prone to bias)
🔹 6. Snowball Sampling (Non-Probability)
➢ Existing participants recruit new participants.
📌 Example:
Surveying a hidden population like people with a rare medical condition.
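The probability-based methods above can be sketched with the `random` module; the employee IDs and strata here are invented for illustration:

```python
# Simple random, systematic, and stratified sampling sketches.
import random

random.seed(42)  # reproducible
population = list(range(1, 1001))  # e.g. 1,000 employee IDs

# 1. Simple random sampling: 100 members, each with an equal chance
srs = random.sample(population, 100)

# 2. Systematic sampling: every k-th member, k = N / n, random start
k = len(population) // 100
start = random.randrange(k)
systematic = population[start::k]

# 3. Stratified sampling: proportional allocation within each subgroup
strata = {"junior": list(range(1, 601)), "senior": list(range(601, 1001))}
stratified = []
for name, members in strata.items():
    n = round(100 * len(members) / len(population))  # proportional share
    stratified.extend(random.sample(members, n))

print(len(srs), len(systematic), len(stratified))  # 100 100 100
```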
7. Probability Distributions
A probability distribution describes how values in a dataset are distributed.
Choosing the right distribution depends on the nature of the data and the problem you're
solving.
🔹 Normal Distribution
A continuous, symmetric, bell-shaped distribution centered on the mean.
Mathematical Representation:
X ∼ N(μ, σ²), where μ is the mean and σ² is the variance.
Examples:
✓ Human Heights & Weights: Most people’s height is around an average, with fewer
extremely short or tall individuals.
✓ IQ Scores: Most people have an IQ near the mean, with fewer people having very
high or very low IQs.
🔹 Binomial Distribution
A discrete distribution used for scenarios with two possible outcomes (success/failure).
Mathematical Representation:
X∼Bin(n,p) where n is the number of trials, and p is the probability of success.
Examples:
✓ Flipping a Coin: If you flip a fair coin 10 times, how many heads will you get?
✓ Medical Tests: If a test detects a disease with 95% accuracy, how many correct
diagnoses will 10 tests yield?
✓ Customer Purchases: If 30% of website visitors make a purchase, how many
purchases from 50 visitors?
Why We Use It:
✓ When outcomes are binary (yes/no, pass/fail).
✓ Helps in probability estimation for repeated independent events.
🔹 Poisson Distribution
A discrete distribution used to model the number of events happening in a fixed time/space
interval.
Mathematical Representation:
X∼Pois(λ), where λ is the expected number of occurrences in an interval.
Example:
✓ Number of emails received per hour.
✓ Number of Customer Arrivals: How many customers arrive at a store per hour?
✓ Call Center Inquiries: How many calls a call center receives in a minute?
✓ Typos in a Book: How many spelling errors appear per 100 pages?
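The binomial and Poisson probabilities above can be computed directly from their formulas with the `math` module:

```python
# Probability mass functions from their definitions.
import math

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1-p)^(n-k)"""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) = lam^k * e^(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

# 10 fair coin flips: probability of exactly 5 heads
print(round(binomial_pmf(5, 10, 0.5), 4))  # 0.2461

# Store averaging 3 customers/hour: probability of exactly 0 arrivals
print(round(poisson_pmf(0, 3), 4))         # 0.0498
```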
🔹 Uniform Distribution
A distribution where all values in a range have equal probability.
Mathematical Representation:
X∼U(a,b) , where all values between a and b are equally likely.
Examples:
✓ Rolling a Fair Die: Each face (1-6) has an equal chance of appearing.
✓ Lottery Numbers: Each number has the same probability of being drawn.
✓ Random Password Generation: If each character has an equal probability of being
selected.
🔹 Bernoulli Distribution
Bernoulli Distribution is a discrete probability distribution for a random variable that has
exactly two possible outcomes: success (1) and failure (0).
Key Characteristics
• Parameter: p= Probability of success (1)
• Probability of failure (0) = 1−p
• Discrete: Only two possible outcomes.
• Example: Coin toss (Heads = 1, Tails = 0)
Probability Mass Function:
P(X = x) = p^x (1 − p)^(1 − x)
where:
• x = 0 or 1
• p = Probability of success
Example:
✓ Coin Toss: Head or Tail
✓ Email Classification: Spam or Not Spam
✓ Medical Test: Positive or Negative
🔹 Exponential Distribution
Used to model the time until an event occurs.
Mathematical Representation:
X∼Exp(λ) , where λ is the event rate.
Real-Life Examples:
✓ Time Between Calls: How long until the next call arrives at a support center?
✓ Lifespan of Devices: The time until a light bulb or machine component fails.
✓ Arrival of Buses: Time between consecutive buses at a station.
8. Correlation, Regression, and Causation
🔹 Correlation:
Measures the strength and direction of the linear relationship between two variables (correlation coefficient r, ranging from −1 to +1).
Types of Correlation
❖ Positive correlation (r > 0):
Both variables move in the same direction.
Example:
Study Hours & Exam Scores: the more hours a student studies, the higher their exam score.
❖ Negative correlation (r < 0):
One variable increases as the other decreases.
Example:
Temperature & Hot Coffee Sales: as temperature increases, hot coffee sales decrease.
❖ No correlation (r ≈ 0):
No relationship.
Example:
Shoe Size & Intelligence: bigger feet don't mean a higher IQ.
🔹 Regression:
Predicts one variable based on another.
Types of Regression
❖ Linear Regression
Predicting a numeric value based on one variable.
Example:
Predicting House Prices – square footage of a house to predict house price
❖ Multiple Regression
Predicting a numeric value based on multiple variables
Example:
Predicting Car Prices – depends on car age, mileage, and engine size.
❖ Logistic Regression
Predicting a category/class
Example:
Spam Email Detection – number of capital letters, presence of certain words, and
number of links used to predict Spam (Yes/No)
🔹 Causation:
Causation means that one event directly influences another: changing one variable
causes a change in the other. Unlike correlation, causation implies a cause-and-effect
relationship. It requires controlled experiments or statistical techniques (like
regression) to confirm.
Example:
Smoking causes lung cancer. (direct cause-effect)
9. Hypothesis Testing
Used to test if an assumption about a dataset is true or false.
❖ Null Hypothesis (H0):
The default assumption (no effect or no difference).
❖ Alternative Hypothesis (H1):
What we suspect instead of H0.
❖ p-value:
The probability of obtaining the observed results (or more extreme) assuming H0 is
true.
❖ Type I Error:
Rejecting a true H0 (false positive).
❖ Type II Error:
Failing to reject a false H0 (false negative).
🔹 Z-Test
You want to compare a sample mean to a known population mean, and the population
standard deviation is known.
Example:
A company claims their battery lasts 20 hours on average. You test 50 batteries and want to
check if the claim is valid.
Formula:
Z = (X̄ − μ) / (σ / √n)
Where:
X̄ : Sample mean
μ : Population mean
σ : Population standard deviation
n : Sample size
Types:
❖ One-sample Z-test: Compares a sample mean to a known population mean.
Example:
To test whether the average height of students in your class differs from the national
average height.
❖ Two-sample Z-test: Compares the means of two independent samples.
Example:
To compare the average height of students in two different classes (Class A and Class
B).
🔹 T-Test
You compare sample means when the population standard deviation is unknown.
Types:
❖ One-sample T-test: Compares a sample mean to a claimed population mean.
Example:
You own a bakery and claim that the average weight of a cookie is 30 grams. You test
this hypothesis by randomly selecting a sample of cookies and calculating their
average weight.
❖ Two-sample T-test (Independent): Compares two independent groups.
Example:
You want to compare the effectiveness of two different diets. You measure the
weight loss of people on Diet A and Diet B.
❖ Paired T-test: Compares two related measurements on the same subjects.
Example:
Medical Drug Effectiveness: Compare blood pressure before and after taking
medicine.
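The one-sample t statistic for the bakery example can be computed from its definition, t = (x̄ − μ₀) / (s / √n); the cookie weights below are hypothetical:

```python
# One-sample t statistic against the claimed mean of 30 g.
import math
import statistics

cookies = [29.1, 30.4, 28.7, 29.9, 30.2, 28.5, 29.3, 30.1]  # hypothetical sample
mu0 = 30  # claimed population mean

n = len(cookies)
x_bar = statistics.mean(cookies)
s = statistics.stdev(cookies)  # sample standard deviation (n - 1 denominator)

t = (x_bar - mu0) / (s / math.sqrt(n))
print(round(t, 3))  # compare |t| with the critical value for df = n - 1
```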
🔹 Chi-Square Test
You check if categorical variables are related.
Example:
✓ Marketing Research: Do gender and favorite product category have a relationship?
✓ COVID-19 Analysis: Does wearing a mask reduce infection rates?
Types:
❖ Chi-Square Goodness-of-Fit Test:
Used to check if an observed distribution follows an expected theoretical
distribution.
Example:
Does the distribution of colors in an M&M packet match the expected company
proportions?
❖ Chi-Square Test of Independence:
Used to check whether two categorical variables are related.
Example:
Does gender (Male/Female) affect product preference (A/B/C)?
🔹 ANOVA (Analysis of Variance)
Compares the means of three or more groups using the F-statistic.
If the F-value is large, it suggests that at least one group mean is different.
Example:
✓ Comparing Exam Scores: Do students from different schools perform differently?
✓ Effect of Diet on Weight Loss: Comparing weight loss among different diets.
ANOVA Assumptions
✓ Independence – Observations are independent within and across groups.
✓ Normality – Data within each group should be approximately normally distributed.
✓ Homogeneity of Variance – Variances across groups should be roughly equal (use
Levene’s test to check).
Types:
❖ One-Way ANOVA:
Compares means across one categorical independent variable with multiple groups.
Example:
Do three different diets lead to different weight loss results?
❖ Two-Way ANOVA:
Examines the effect of two categorical independent variables on one continuous
dependent variable.
Example:
Do diet type and exercise level affect weight loss?
❖ Repeated Measures ANOVA:
Compares means of the same group measured at multiple time points.
Example:
Measuring students' test scores before, during, and after a training program.
🔹 Mann-Whitney U Test
Used to compare the medians of two independent (unpaired) groups. It determines whether
one group tends to have higher values than the other.
Example:
Imagine you are analyzing the customer ratings (1-10) of two food delivery services: Service
A and Service B.
If ratings were normally distributed, you'd use an independent t-test.
Since they are skewed, the Mann-Whitney U test is appropriate.
Steps
• Rank all the observations from both groups combined.
• Calculate the sum of ranks for each group.
• Compute the U-statistic and compare it with a critical value or p-value.
➢ If p < 0.05, we reject the null hypothesis and conclude that the ratings differ
significantly between Service A and B.
➢ If p ≥ 0.05, there’s no strong evidence of a difference.
🔹 Wilcoxon Signed-Rank Test
Used to compare two related (paired) samples when the differences are not normally
distributed, e.g. participants' weights before and after a diet.
Steps
• Compute the difference between paired values.
• Rank the absolute differences, ignoring signs.
• Sum the ranks of positive and negative differences separately.
• Compare the Wilcoxon test statistic to determine significance.
➢ If p < 0.05, we reject the null hypothesis and conclude that the diet significantly
affected weight.
➢ If p ≥ 0.05, we conclude no significant change in weight due to the diet.
🔹 Kruskal-Wallis Test
Used when comparing three or more independent groups to check if at least one differs
significantly in median values. Equivalent to one-way ANOVA, but for non-parametric data.
Example:
A teacher wants to compare exam scores of students from three different schools to see if
performance differs.
Steps
• Rank all observations from all groups together.
• Calculate the rank sum for each group.
• Compute the H-statistic and compare it with a chi-square distribution.
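The steps above can be sketched for tie-free data using the standard formula H = 12/(N(N+1)) · Σ Rᵢ²/nᵢ − 3(N+1); the school scores are invented:

```python
# Kruskal-Wallis H statistic (simplified: assumes no tied values).
def kruskal_wallis_h(*groups):
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # no ties, so ranks are unique
    n_total = len(pooled)
    rank_sum_term = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)

school_1 = [65, 70, 75, 80]  # hypothetical exam scores, no ties
school_2 = [62, 68, 72, 78]
school_3 = [90, 92, 95, 98]

h = kruskal_wallis_h(school_1, school_2, school_3)
print(round(h, 3))  # 7.538 > 5.991 (chi-square critical value, df = 2) -> significant
```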
10. Confidence Intervals
A confidence interval (CI) gives a range of values that is likely to contain the true population parameter.
Key Components:
✓ Point Estimate: Single value estimate of a population parameter (e.g., sample mean).
✓ Margin of Error (MOE): The amount of error expected due to sampling variability.
✓ Confidence Level (CL): Probability that the CI contains the true population parameter
(e.g., 90%, 95%, or 99%).
Mathematical formula:
CI = X̄ ± Z × (σ / √n)
Where:
• X̄ = sample mean, σ = population standard deviation, n = sample size
Common Z-values:
• 90% CI → 1.645
• 95% CI → 1.96
• 99% CI → 2.576
When the population standard deviation is unknown, we use the t-distribution instead of
the Z-distribution to construct a confidence interval for the population mean (μ).
Mathematical formula:
CI = X̄ ± t × (s / √n), where s is the sample standard deviation and df = n − 1
Example:
Suppose you have a sample of 36 students with a sample mean score of 78 and a sample
standard deviation of 10. To construct a 95% confidence interval for the true mean score
(using a t-distribution because the population SD is unknown):
Given:
n = 36, X̄ = 78, s = 10
Step 1: Calculate the t-critical value
Since we use the t-distribution when the population standard deviation is unknown, we
need the t-value for 95% confidence with df = 35, which is approximately 2.030.
Step 2: Calculate the margin of error
MOE = t × s/√n = 2.030 × 10/6 ≈ 3.38
Final Answer:
The 95% confidence interval for the true mean score is:
(74.62, 81.38)
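The interval is easy to reproduce in code, using the t-critical value 2.030 for df = 35 from a t-table:

```python
# Reproducing the confidence-interval example: n=36, x_bar=78, s=10, 95% CI.
import math

n, x_bar, s = 36, 78, 10
t_crit = 2.030  # t-value for 95% confidence, df = 35 (from a t-table)

margin = t_crit * s / math.sqrt(n)  # 2.030 * 10 / 6 ~ 3.38
ci = (round(x_bar - margin, 2), round(x_bar + margin, 2))
print(ci)  # (74.62, 81.38)
```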
🔹 Central Limit Theorem (CLT)
The CLT states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the original population.
Example:
Rolling dice multiple times and averaging the values → Results in a normal distribution!
Why is it useful?
✓ Allows us to use the normal distribution even if the original data is skewed, uniform,
or any other shape.
✓ A sample size of 30 or more is usually considered sufficient for the CLT to hold.
✓ CLT helps in estimating population parameters using sample statistics.
✓ It enables the use of confidence intervals and hypothesis testing.
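The dice example can be simulated: single rolls are uniform on 1–6, but means of 30 rolls cluster in a bell shape around 3.5:

```python
# CLT in action: averaging 30 uniform die rolls, many times over.
import random
import statistics

random.seed(0)  # reproducible
sample_means = [
    statistics.mean(random.randint(1, 6) for _ in range(30))  # mean of 30 rolls
    for _ in range(2000)
]

grand_mean = statistics.mean(sample_means)
spread = statistics.pstdev(sample_means)
# Expect grand_mean near 3.5 and spread near sigma/sqrt(30) ~ 1.71/5.48 ~ 0.31
print(round(grand_mean, 2), round(spread, 2))
```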
11. A/B Testing
A/B testing compares two versions of something (Version A and Version B) by randomly
splitting users between them. We then measure a key metric (e.g., conversion rate,
click-through rate, sales, engagement) and determine if there is a statistically
significant difference between the two groups.
A Real-Life Example
An e-commerce site wants to know whether changing its checkout button from blue
(Version A) to green (Version B) increases purchases.
We show Version A to 50% of visitors and Version B to the other 50%, then measure the
purchase completion rate over one month.
Version A (Blue button): 5,000 visitors → 250 purchases (5% conversion rate).
Version B (Green button): 5,000 visitors → 300 purchases (6% conversion rate).
Question:
Is the 1% improvement due to the new button real or just random chance?
To answer this, we perform hypothesis testing (usually a two-sample Z-test or T-test).
➔ 1. Formulate Hypotheses:
H0: There is no difference between the two versions; H1: There is a difference.
➔ 2. Split Users Randomly:
Assign each visitor to group A or group B at random.
➔ 3. Run the Experiment:
Expose each group to its version for a fixed period.
➔ 4. Collect Data:
Measure the key metric for each group (e.g., conversions).
➔ 5. Run a Statistical Test:
Compute the test statistic and p-value (usually a two-sample Z-test or T-test).
➔ 6. Make a Decision:
▪ If p < 0.05, reject H0 → The new version performs significantly better.
▪ If p ≥ 0.05, fail to reject H0 → The change is not significant.
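For the button example, a two-proportion Z-test answers the question; a sketch using only the math module (the two-sided p-value comes from the normal tail via erfc):

```python
# Two-proportion Z-test: 250/5000 (A) vs 300/5000 (B) conversions.
import math

x_a, n_a = 250, 5000  # Version A: 5% conversion
x_b, n_b = 300, 5000  # Version B: 6% conversion

p_a, p_b = x_a / n_a, x_b / n_b
p_pool = (x_a + x_b) / (n_a + n_b)  # pooled conversion rate under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

print(round(z, 2), round(p_value, 3))  # z ~ 2.19, p ~ 0.028 -> significant
```

Since p < 0.05, the 1% improvement is unlikely to be random chance, so we would reject H0.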