Statistics
1. Descriptive Statistics
o Measures of Central Tendency (Mean, Median, Mode)
o Measures of Dispersion (Range, Variance, Standard Deviation, IQR)
o Data Visualization (Histogram, Boxplot, Scatterplot)
o Skewness and Kurtosis
2. Probability
o Basics of Probability
o Types of Events (Independent, Dependent, Mutually Exclusive)
o Conditional Probability
o Bayes' Theorem
3. Probability Distributions
o Discrete Distributions (Binomial, Poisson)
o Continuous Distributions (Normal, Uniform, Exponential)
o Properties of Normal Distribution (Z-Scores, Empirical Rule)
4. Sampling and Sampling Techniques
o Types of Sampling (Random, Stratified, Cluster, Systematic)
o Importance of Sample Size
o Sampling Error
5. Inferential Statistics
o Confidence Intervals
o Margin of Error
o Central Limit Theorem (CLT)
o Bootstrapping
6. Hypothesis Testing
o Null and Alternative Hypotheses
o Type I and Type II Errors
o Steps in Hypothesis Testing
o One-Tailed vs. Two-Tailed Tests
o Test Statistics (Z-Test, T-Test, Chi-Square Test, ANOVA)
7. Correlation and Covariance
o Definition and Differences
o Pearson’s Correlation Coefficient
o Spearman’s Rank Correlation
o Covariance Formula and Interpretation
8. Regression Analysis
o Linear Regression (Simple and Multiple)
o Assumptions of Linear Regression
o Coefficients (Intercept and Slope)
o R-Squared and Adjusted R-Squared
o Multicollinearity and Variance Inflation Factor (VIF)
o Logistic Regression
9. Analysis of Variance (ANOVA)
o One-Way ANOVA
o Assumptions of ANOVA
o F-Test
o Post Hoc Tests (Tukey’s Test)
10. Chi-Square Tests
o Chi-Square Goodness-of-Fit Test
o Chi-Square Test for Independence
11. Data Transformation and Standardization
o Log Transformation
o Min-Max Scaling
o Z-Score Standardization
12. Outliers and Missing Data
o Detection of Outliers (IQR, Z-Scores)
o Handling Outliers (Winsorizing, Capping, Removing)
o Missing Data Imputation (Mean, Median, Mode, Regression Imputation)
13. Measures of Association
o Odds Ratio
o Relative Risk
o Contingency Tables
Descriptive Statistics
1. Basics of Data
• Data Types:
o Quantitative (Numerical):
▪ Discrete: Countable values (e.g., number of students in a class)
▪ Continuous: Infinite possible values within a range (e.g., height,
weight)
o Qualitative (Categorical):
▪ Nominal: Categories without a specific order (e.g., gender, colors)
▪ Ordinal: Categories with a specific order (e.g., rankings, satisfaction
levels)
• Levels of Measurement:
o Nominal: Labels with no inherent order (e.g., male/female)
o Ordinal: Rank-ordered data with unequal intervals (e.g., class grades: A, B,
C)
o Interval: Equal intervals, but no true zero (e.g., temperature in Celsius)
o Ratio: Equal intervals and a true zero (e.g., weight, income)
2. Measures of Central Tendency
1. Mean (Average):
o Formula: $\text{Mean} = \frac{\Sigma X}{n}$
o Sensitive to outliers.
o Example: The average of 2, 3, 5 is $(2+3+5)/3 = 3.33$.
2. Median:
o The middle value when data is sorted.
o If $n$ is odd: Middle value.
o If $n$ is even: Average of the two middle values.
o Not affected by outliers.
3. Mode:
o The most frequently occurring value.
o Can have multiple modes (bimodal, multimodal).
o Example: In 1, 2, 2, 3, the mode is 2.
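As a quick check of these definitions, here is a minimal Python sketch (made-up numbers) that computes all three measures with the standard library:

```python
# Minimal sketch: central tendency on a small, made-up dataset.
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7]
print(mean(data))    # 4.0 -> pulled toward large values; sensitive to outliers
print(median(data))  # 3   -> middle of the sorted data; robust to outliers
print(mode(data))    # 3   -> most frequent value
```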
3. Measures of Dispersion
1. Range:
o Formula: $\text{Range} = \text{Max} - \text{Min}$
o Example: For 2, 4, 6, 8, Range = $8 - 2 = 6$.
2. Variance:
o Measures the average squared deviation from the mean.
o Formula:
▪ Population: $\sigma^2 = \frac{\Sigma (X - \mu)^2}{N}$
▪ Sample: $s^2 = \frac{\Sigma (X - \bar{X})^2}{n - 1}$
o Larger variance = greater spread.
3. Standard Deviation:
o Square root of variance.
o Represents data spread in the same units as the data.
o Formula: $\sigma = \sqrt{\sigma^2}$.
4. Interquartile Range (IQR):
o Measures the middle 50% of the data.
o Formula: $\text{IQR} = Q3 - Q1$
▪ $Q1$: 25th percentile
▪ $Q3$: 75th percentile
o Helps detect outliers.
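The sketch below (assumed sample data) computes these dispersion measures with NumPy; ddof=1 gives the sample (n-1) versions:

```python
# Minimal sketch: dispersion measures on a small, made-up sample.
import numpy as np

data = np.array([2, 4, 6, 8])
print(data.max() - data.min())         # Range = 6
print(data.var(ddof=1))                # sample variance (divides by n-1)
print(data.std(ddof=1))                # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
print(q3 - q1)                         # IQR
```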
4. Skewness and Kurtosis
1. Skewness:
o Describes asymmetry in data distribution.
o Positive skew: Tail on the right (e.g., income data).
o Negative skew: Tail on the left.
2. Kurtosis:
o Measures the "tailedness" of the distribution.
o Types:
▪ Mesokurtic: Normal distribution
▪ Leptokurtic: Heavy tails
▪ Platykurtic: Light tails
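A short SciPy sketch makes these shape measures concrete; the data here is deliberately drawn from a right-skewed (exponential) distribution:

```python
# Minimal sketch: quantifying distribution shape with SciPy.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=1000)  # right-skewed by construction
print(skew(data))      # > 0 indicates positive (right) skew
print(kurtosis(data))  # excess kurtosis: 0 ~ mesokurtic, > 0 leptokurtic
```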
5. Data Visualization
1. Histograms:
o Show frequency distribution.
o Useful for understanding data shape (normal, skewed).
2. Bar Charts:
o Represent categorical data.
3. Box Plots:
o Show data spread, median, and outliers.
4. Pie Charts:
o Represent proportions of a whole.
5. Scatter Plots:
o Display relationships between two variables.
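For illustration, the matplotlib sketch below (random, made-up data) draws three of the plots above: a histogram, a box plot, and a scatter plot:

```python
# Minimal sketch: three core exploratory plots with matplotlib.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=20)        # frequency distribution / shape
axes[0].set_title("Histogram")
axes[1].boxplot(x)              # median, spread, outliers
axes[1].set_title("Box plot")
axes[2].scatter(x, y, s=10)     # relationship between two variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```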
Probability
1. Basics of Probability
• Probability measures how likely an event is, on a scale from 0 (impossible) to 1 (certain). For equally likely outcomes: $P(E) = \frac{\text{favorable outcomes}}{\text{total outcomes}}$.
3. Types of Probability
1. Classical Probability:
o Based on equally likely outcomes.
o Example: Probability of rolling a 4 on a die → $P(4) = \frac{1}{6}$.
2. Empirical Probability:
o Based on observed data.
o Example: Probability of rain based on historical weather data.
3. Subjective Probability:
o Based on intuition or experience.
o Example: The probability of a startup succeeding.
4. Rules of Probability
1. Addition Rule:
o For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$
o For non-mutually exclusive events: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
2. Multiplication Rule:
o For independent events: $P(A \cap B) = P(A) \times P(B)$
3. Complement Rule:
o Probability of an event not occurring: $P(E^c) = 1 - P(E)$
5. Conditional Probability
• The probability of event $A$ occurring given that $B$ has occurred: $P(A|B) = \frac{P(A \cap B)}{P(B)}$ (if $P(B) > 0$)
6. Bayes’ Theorem
• A formula to calculate conditional probabilities: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
o Applications: Spam detection, medical testing, machine learning.
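As an illustration of the medical-testing application, here is a sketch with made-up rates (the prevalence, sensitivity, and false-positive rate are all assumptions):

```python
# Minimal sketch: Bayes' theorem for a diagnostic test (all rates made up).
p_disease = 0.01            # P(A): prevalence
p_pos_given_disease = 0.99  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # false-positive rate

# Law of total probability: P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.167 despite a 99%-sensitive test
```

The low posterior shows why base rates matter: most positives come from the much larger healthy group.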
7. Probability Distributions
8. Examples
1. Example 1: A coin is flipped twice. What is the probability of getting at least one
head?
o Sample space: $S = \{HH, HT, TH, TT\}$
o Event $A$: At least one head → $A = \{HH, HT, TH\}$
o Probability: $P(A) = \frac{3}{4}$
2. Example 2: A bag contains 4 red balls and 6 blue balls. What is the probability of
drawing a red ball?
o Total balls = 10, Red balls = 4.
o Probability: $P(\text{Red}) = \frac{4}{10} = 0.4$.
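Both answers are easy to sanity-check by simulation; the sketch below approximates them with random draws:

```python
# Minimal sketch: Monte Carlo check of the two worked examples.
import random

random.seed(42)
N = 100_000

# Example 1: at least one head in two coin flips (exact: 3/4)
hits = sum(1 for _ in range(N)
           if "H" in (random.choice("HT"), random.choice("HT")))
print(hits / N)  # ~0.75

# Example 2: one draw from a bag of 4 red + 6 blue balls (exact: 0.4)
bag = ["red"] * 4 + ["blue"] * 6
reds = sum(1 for _ in range(N) if random.choice(bag) == "red")
print(reds / N)  # ~0.4
```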
Sampling and Sampling Techniques
Sampling is the process of selecting a subset (a sample) from a larger population in order to draw conclusions about the entire population. Sampling makes data analysis practical and efficient when studying large populations.
1. Types of Sampling
Sampling methods are broadly divided into probability sampling (where every member of
the population has a known chance of being selected) and non-probability sampling (where
selection is not random). Below are the key types:
A. Random Sampling
• Definition: Every member of the population has an equal chance of being selected.
• Method: Selection is done using random number generators or lottery methods.
• Example: Drawing names out of a hat to select participants for a survey.
• Advantages:
o Reduces selection bias.
o High likelihood of representing the population.
• Disadvantages:
o May be impractical for very large populations.
B. Stratified Sampling
• Definition: The population is divided into homogeneous subgroups (strata), and a random sample is drawn from each stratum, often in proportion to its size.
• Example: Grouping students by grade level and sampling randomly within each grade.
C. Cluster Sampling
• Definition: The population is divided into clusters, and entire clusters are randomly
selected.
• Example: Selecting a few schools randomly and surveying all students in those
schools.
• Advantages:
o Cost-efficient for large populations.
o Useful when population data is geographically spread out.
• Disadvantages:
o Less accurate if clusters are not homogeneous.
o May increase sampling error.
D. Systematic Sampling
• Definition: Every nth member of the population is selected after randomly selecting
the starting point.
• Example: Selecting every 10th person in a list of 1,000 names.
• Advantages:
o Simple and quick to implement.
o Ensures uniform coverage of the population.
• Disadvantages:
o Patterns in the population may bias the sample (e.g., cyclical data).
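The sketch below (an assumed population of 1,000 numbered members) shows how three of these probability-sampling schemes look in NumPy:

```python
# Minimal sketch: random, systematic, and cluster sampling (made-up population).
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)

# Simple random sampling: each member has an equal chance
random_sample = rng.choice(population, size=50, replace=False)

# Systematic sampling: every k-th member after a random start
k = len(population) // 50
start = rng.integers(k)
systematic_sample = population[start::k]

# Cluster sampling: randomly pick 2 of 10 clusters, keep every member
clusters = np.array_split(population, 10)
chosen = rng.choice(len(clusters), size=2, replace=False)
cluster_sample = np.concatenate([clusters[i] for i in chosen])
```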
2. Importance of Sample Size
The sample size plays a crucial role in ensuring the reliability and validity of results.
• Key Considerations:
1. Population Size: A smaller sample may suffice for small populations, but
larger populations require a larger sample.
2. Margin of Error: A smaller margin of error requires a larger sample.
3. Confidence Level: A higher confidence level (e.g., 95% vs. 99%) requires a
larger sample.
• Impact of Sample Size:
o Small Sample: May lead to underrepresentation and unreliable results.
o Large Sample: Reduces sampling error but can be costly and time-consuming.
3. Sampling Error
• Definition: Sampling error is the difference between the sample statistic (e.g., sample
mean) and the true population parameter (e.g., population mean).
It arises because a sample is only a subset of the population.
• Causes:
o Variability in the population.
o Sample size being too small.
o Non-representative sampling techniques.
• Reduction Methods:
o Use random sampling to minimize bias.
o Increase the sample size.
o Use stratified sampling to ensure all groups are proportionally represented.
Inferential Statistics
1. Population and Sample
• Population: The entire group of individuals or items under study (e.g., all voters in a
country).
• Sample: A subset of the population used for analysis (e.g., 1,000 voters surveyed).
3. Sampling Distribution
• The probability distribution of a statistic (e.g., sample mean) when drawn from a
population repeatedly.
• Central Limit Theorem (CLT):
o For large sample sizes ($n > 30$), the sampling distribution of the sample mean approximates a normal distribution, regardless of the population's distribution.
4. Estimation
1. Point Estimation:
o Provides a single value as an estimate of a parameter (e.g., sample mean $\bar{x}$ for population mean $\mu$).
2. Interval Estimation:
o Provides a range of values (confidence interval) likely to contain the
population parameter.
o Example: $\text{CI} = \bar{x} \pm z \cdot \frac{s}{\sqrt{n}}$
o $z$: Z-score for the confidence level (e.g., 1.96 for a 95% CI).
o $s$: Sample standard deviation.
o $n$: Sample size.
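Here is a sketch of that interval formula on an assumed sample; for a sample this small, swapping z for a t critical value (scipy.stats.t.ppf) would be the safer choice, matching the t-test guidance below:

```python
# Minimal sketch: 95% confidence interval for a mean (made-up sample).
import numpy as np

sample = np.array([52, 48, 50, 55, 47, 53, 49, 51, 50, 54])
x_bar = sample.mean()
s = sample.std(ddof=1)   # sample standard deviation
n = len(sample)
z = 1.96                 # z-score for 95% confidence

margin = z * s / np.sqrt(n)
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
```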
5. Hypothesis Testing
1. Z-Test:
o Used when population variance is known and sample size is large ($n > 30$).
o Example: Testing the mean of a large dataset.
2. T-Test:
o Used when population variance is unknown or sample size is small ($n < 30$).
o Types:
▪ One-sample t-test.
▪ Independent (two-sample) t-test.
▪ Paired t-test.
3. Chi-Square Test:
o Used to test the association between categorical variables or goodness-of-fit.
o Example: Testing independence between gender and product preference.
4. ANOVA (Analysis of Variance):
o Used to compare means of more than two groups.
o Example: Comparing exam scores across three teaching methods.
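As one worked sketch of these tests, the code below runs an independent two-sample t-test on made-up scores for two teaching methods:

```python
# Minimal sketch: independent two-sample t-test with SciPy (made-up data).
from scipy import stats

method_a = [78, 82, 88, 75, 90, 84]
method_b = [71, 69, 80, 74, 68, 73]

t_stat, p_value = stats.ttest_ind(method_a, method_b)
alpha = 0.05
print(t_stat, p_value)
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```

The same module exposes stats.chisquare, stats.chi2_contingency, and stats.f_oneway for the chi-square and ANOVA cases covered later.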
Hypothesis Testing
Key Concepts
1. Hypotheses
o Null Hypothesis ($H_0$): The default claim of no effect or no difference.
o Alternative Hypothesis ($H_1$): The claim of a significant effect or difference.
2. One-Tailed vs. Two-Tailed Tests
1. One-Tailed Test:
o Tests if the sample mean is significantly greater or smaller than the population
mean.
o Example: Testing if a new drug increases recovery rate.
2. Two-Tailed Test:
o Tests if the sample mean is significantly different (either higher or lower) from
the population mean.
o Example: Testing if a new teaching method affects scores (positively or
negatively).
5. Calculate the P-Value:
o The p-value is the probability of observing data at least this extreme if $H_0$ is true.
6. Make a Decision:
o If $p \leq \alpha$: Reject $H_0$ (there is enough evidence to support $H_1$).
o If $p > \alpha$: Fail to reject $H_0$ (not enough evidence to support $H_1$).
4. Types of Errors
o Type I Error ($\alpha$): Rejecting $H_0$ when it is actually true (a false positive).
o Type II Error ($\beta$): Failing to reject $H_0$ when it is actually false (a false negative).
6. Interpreting Results
7. Practical Applications
Regression Analysis
Key Concepts
2. Intercept:
o The value of the dependent variable when all predictors are zero.
3. Residuals:
o The difference between observed and predicted values:
$\text{Residual} = Y_{\text{observed}} - Y_{\text{predicted}}$
4. R-Squared ($R^2$):
o Measures the proportion of variance in the dependent variable explained by
the independent variables.
o Value ranges from 0 to 1. Higher values indicate a better fit.
5. Adjusted R-Squared:
o Adjusted for the number of predictors. Used when comparing models with
different numbers of predictors.
6. P-Value:
o Tests whether the coefficients are statistically significant.
o $p < 0.05$: The predictor significantly impacts the dependent variable.
Common Applications
1. Business:
o Forecasting sales, revenue, or market trends.
2. Healthcare:
o Predicting patient outcomes based on clinical variables.
3. Economics:
o Studying relationships between economic indicators (e.g., inflation vs. GDP
growth).
4. Education:
o Evaluating the impact of study hours on exam scores.
Practical Example
Scenario:
You want to predict monthly sales ($Y$) based on advertising spend ($X$).
Steps:
1. Fit a simple linear regression $Y = \beta_0 + \beta_1 X$ to the data.
2. Check the slope's p-value and the model's $R^2$.
3. Use the fitted equation to forecast sales for a planned spend, as in the sketch below.
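A sketch of those steps with scikit-learn, on made-up spend/sales pairs (statsmodels would additionally report p-values for the coefficients):

```python
# Minimal sketch: simple linear regression for the sales scenario (made-up data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[10], [15], [20], [25], [30]])  # advertising spend
y = np.array([100, 130, 150, 180, 200])       # monthly sales

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])  # estimated beta_0 and beta_1
print(model.score(X, y))                 # R-squared
print(model.predict([[22]]))             # forecast sales for spend = 22
```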
Advanced Topics
1. Multicollinearity:
o Occurs when independent variables are highly correlated.
o Solution: Remove redundant predictors or use Ridge/Lasso regression.
2. Interaction Effects:
o Examines if the effect of one predictor depends on the value of another.
o Example: The impact of advertising may differ based on the season.
3. Model Diagnostics:
o Residual plots to check assumptions.
o Use AIC/BIC for model comparison.
Correlation and Covariance
Correlation and covariance measure the relationship between two variables, but they differ in
how they express that relationship. Let’s dive into each aspect.
Correlation Analysis
Correlation Analysis Overview
3. Direction of Correlation:
o Positive Correlation: Both variables increase or decrease together (e.g., height and weight).
o Negative Correlation: One variable increases while the other decreases (e.g., speed and time taken to travel a fixed distance).
o No Correlation: No apparent relationship between variables (e.g., shoe size
and IQ).
4. Strength of Correlation:
o $0.0 \leq |r| < 0.3$: Weak correlation.
o $0.3 \leq |r| < 0.7$: Moderate correlation.
o $0.7 \leq |r| \leq 1.0$: Strong correlation.
Types of Correlation
1. Pearson Correlation:
o Measures linear relationships between two continuous variables.
o Assumes normal distribution and no significant outliers.
2. Spearman Rank Correlation:
o Measures monotonic relationships (increasing or decreasing, not necessarily
linear).
o Uses ranks of data instead of actual values.
3. Kendall’s Tau:
o Measures ordinal association between two variables.
o Used when datasets have tied ranks or smaller sample sizes.
4. Partial Correlation:
o Measures the relationship between two variables while controlling for the
effect of one or more additional variables.
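The first three coefficients are one-liners in SciPy; the sketch below uses assumed study-hours/exam-score data:

```python
# Minimal sketch: Pearson, Spearman, and Kendall coefficients (made-up data).
from scipy import stats

study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
exam_scores = [52, 55, 61, 58, 65, 70, 72, 75, 80, 85]

r, p_r = stats.pearsonr(study_hours, exam_scores)       # linear
rho, p_rho = stats.spearmanr(study_hours, exam_scores)  # monotonic, rank-based
tau, p_tau = stats.kendalltau(study_hours, exam_scores)
print(r, rho, tau)
```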
Correlation vs. Causation
• Correlation:
o Shows a relationship between two variables but does not imply causation.
o Example: Ice cream sales and drowning incidents may be correlated due to a
third factor (hot weather).
• Causation:
o One variable directly affects another.
o Requires experimental evidence to confirm.
Applications of Correlation
1. Business:
o Analyze the relationship between marketing spend and sales.
2. Healthcare:
o Study the relationship between physical activity and heart disease risk.
3. Education:
o Investigate the correlation between study hours and exam performance.
4. Finance:
o Examine the relationship between stock prices and market indices.
Steps in Correlation Analysis
1. Data Collection:
o Gather data on the two variables of interest.
2. Visualize the Data:
o Use scatterplots to identify patterns or trends.
3. Calculate the Correlation Coefficient:
o Use statistical software or formulas.
4. Interpret Results:
o Determine the strength and direction of the relationship.
5. Validate Assumptions:
o Ensure data meets the assumptions of the chosen correlation method.
Practical Example
Scenario:
• You have data on 10 students' study hours and their corresponding exam scores.
• Hypothesis: Students who study more tend to score higher.
Steps:
1. Visualize: Plot study hours vs. exam scores to check for a linear relationship.
2. Calculate $r$:
o Use Pearson correlation formula.
o Let’s assume $r = 0.85$.
3. Interpretation:
o $r = 0.85$: Strong positive correlation. Students who study more tend to score higher.
Correlation Matrix
Variable      Study Hours   Exam Scores   Attendance
Study Hours   1             0.85          0.6
Exam Scores   0.85          1             0.7
Attendance    0.6           0.7           1
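A matrix like this is one call in pandas; the sketch below uses made-up values, so its numbers will not match the table exactly:

```python
# Minimal sketch: building a correlation matrix with pandas (made-up data).
import pandas as pd

df = pd.DataFrame({
    "study_hours": [2, 4, 5, 7, 8],
    "exam_scores": [55, 62, 70, 78, 85],
    "attendance":  [60, 70, 75, 85, 90],
})
print(df.corr())  # pairwise Pearson correlations; diagonal is always 1
```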
Practical Notes
Analysis of Variance (ANOVA)
Types of ANOVA
1. One-Way ANOVA:
o Used when there is one independent variable (factor) with multiple levels.
o Example: Comparing test scores of students across three teaching methods.
2. Two-Way ANOVA:
o Used when there are two independent variables (factors).
o Example: Studying the effect of teaching method (factor 1) and study
environment (factor 2) on student performance.
3. Repeated Measures ANOVA:
o Used when the same subjects are measured multiple times under different
conditions.
o Example: Testing the effect of a drug on the same patients at different time
intervals.
4. MANOVA (Multivariate Analysis of Variance):
o Extends ANOVA to multiple dependent variables.
o Example: Studying the effect of a training program on both performance and
motivation.
Assumptions of ANOVA
1. Normality:
o The dependent variable is normally distributed within each group.
2. Homogeneity of Variance:
o Variances of the groups being compared are approximately equal.
3. Independence:
o Observations in each group are independent of each other.
Terms in ANOVA
1. Factors:
o Independent variables being studied.
2. Levels:
o Categories or groups within a factor.
3. F-Ratio:
o The test statistic in ANOVA.
o $F = \frac{MS_B}{MS_W}$, the ratio of between-group to within-group variance (see the worked example below).
Example: One-Way ANOVA
Scenario:
A researcher wants to test whether three diets (A, B, and C) lead to different average weight
losses.
1. Hypotheses:
o $H_0$: $\mu_A = \mu_B = \mu_C$.
o $H_a$: At least one mean is different.
2. Data:
o Diet A: [5, 6, 7]
o Diet B: [8, 9, 10]
o Diet C: [3, 4, 5]
3. Calculate Sum of Squares:
o $SS_B$: Variability between group means.
o $SS_W$: Variability within each group.
4. Degrees of Freedom:
o $df_B = k - 1 = 3 - 1 = 2$.
o $df_W = n - k = 9 - 3 = 6$.
5. Calculate F-Ratio:
o Use $F = \frac{MS_B}{MS_W}$, where $MS = \frac{SS}{df}$.
6. Compare P-Value and $\alpha$:
o If $p \leq 0.05$: Reject $H_0$.
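SciPy collapses steps 3 to 6 into a single call; the sketch below uses the diet data from this example:

```python
# Minimal sketch: one-way ANOVA on the diet example with SciPy.
from scipy import stats

diet_a = [5, 6, 7]
diet_b = [8, 9, 10]
diet_c = [3, 4, 5]

f_stat, p_value = stats.f_oneway(diet_a, diet_b, diet_c)
print(f_stat, p_value)  # F = 19.0 for this data
print("Reject H0" if p_value <= 0.05 else "Fail to reject H0")
```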
Example: Two-Way ANOVA
Scenario:
A researcher studies the effect of gender (male, female) and exercise type (yoga, cardio) on
stress levels.
1. Factors:
o Factor 1: Gender (2 levels).
o Factor 2: Exercise Type (2 levels).
2. Hypotheses:
o Main Effects:
▪ $H_0$: No difference in stress levels by gender.
▪ $H_0$: No difference in stress levels by exercise type.
o Interaction Effect:
▪ $H_0$: No interaction between gender and exercise type.
3. Steps:
o Compute main effects and interaction effect.
o Compare F-ratio for each.
Post-Hoc Tests
If ANOVA shows significant results, post-hoc tests determine which groups differ:
1. Tukey's HSD:
o Compares all possible pairs of group means.
2. Bonferroni Correction:
o Adjusts $\alpha$ to reduce Type I error.
3. Dunnett’s Test:
o Compares all groups to a control group.
Applications of ANOVA
1. Business:
o Testing the effectiveness of different advertising campaigns.
2. Education:
o Comparing average scores across multiple teaching methods.
3. Healthcare:
o Evaluating the effectiveness of various treatments.
Chi-Square Test
Formula: $\chi^2 = \sum \frac{(O - E)^2}{E}$
Where:
• $O$: Observed frequency
• $E$: Expected frequency
Chi-Square Test for Independence
Steps:
1. State the Hypotheses:
o Null Hypothesis (H0H_0H0): The two variables are independent.
o Alternative Hypothesis (HaH_aHa): The two variables are not independent.
2. Set the Significance Level ($\alpha$):
o Commonly $\alpha = 0.05$.
3. Create a Contingency Table:
o Summarize the observed frequencies of the variables in a matrix format.
4. Calculate Expected Frequencies:
o Use the formula: $E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$
5. Compute the Chi-Square Statistic:
o Plug observed ($O$) and expected ($E$) frequencies into the formula.
6. Determine Degrees of Freedom:
o $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns.
7. Compare $\chi^2$ Value to Critical Value:
o Use a Chi-Square distribution table or statistical software.
8. Make a Decision:
o If $\chi^2 \geq$ critical value or $p \leq \alpha$: Reject $H_0$.
o Otherwise, fail to reject $H_0$.
Example: Test for Independence
Observed Data: a 3×3 contingency table of age group by product preference, with a grand total of 240.
1. Expected Frequencies:
o For Youth & Product A:
$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} = \frac{60 \times 90}{240} = 22.5$
2. Chi-Square Statistic:
o Calculate $\chi^2 = \sum \frac{(O - E)^2}{E}$.
3. Degrees of Freedom:
o $df = (3 - 1)(3 - 1) = 4$.
4. Decision:
o Compare $\chi^2$ to the critical value at $df = 4$ and $\alpha = 0.05$.
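In SciPy the whole procedure is one call; since the original observed table is not reproduced above, the 3×3 counts below are stand-ins chosen to match the stated margins (Youth row total 60, Product A column total 90, grand total 240):

```python
# Minimal sketch: chi-square test for independence (stand-in 3x3 counts).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [25, 20, 15],   # e.g., Youth   (row total 60)
    [30, 40, 20],   # e.g., Adults
    [35, 30, 25],   # e.g., Seniors
])
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)    # dof = (3-1)*(3-1) = 4
print(expected[0, 0])  # 22.5, matching the hand calculation above
print("Reject H0" if p <= 0.05 else "Fail to reject H0")
```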
Example: Goodness-of-Fit
Scenario: A die is rolled 60 times, and the observed frequencies of each side are:
Hypotheses:
• $H_0$: The die is fair ($E = \frac{60}{6} = 10$ for each side).
• $H_a$: The die is not fair.
1. Expected Frequencies:
o $E = 10$ for all sides.
2. Chi-Square Statistic:
o Calculate $\chi^2 = \sum \frac{(O - E)^2}{E}$.
3. Degrees of Freedom:
o $df = 6 - 1 = 5$.
4. Decision:
o Compare $\chi^2$ to the critical value at $df = 5$ and $\alpha = 0.05$.
Applications of Chi-Square Tests
1. Business:
o Analyzing customer preferences across regions.
2. Healthcare:
o Studying the association between a disease and a risk factor.
3. Education:
o Determining if student performance differs by teaching method.
Data Transformation and Standardization
Log Transformation
Definition:
Log transformation is the process of applying the logarithm function to transform a dataset. It
is used to reduce skewness and normalize data.
Formula:
For a value $x$: $x' = \log(x)$ (natural log or log base 10; requires $x > 0$).
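A quick sketch on made-up, strongly right-skewed values:

```python
# Minimal sketch: log transform compresses large values (made-up data).
import numpy as np

data = np.array([1, 10, 100, 1000])
print(np.log10(data))  # [0. 1. 2. 3.]: multiplicative gaps become additive
print(np.log(data))    # natural log; use np.log1p when zeros are present
```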
Min-Max Scaling
Definition:
Min-Max scaling transforms data to a fixed range, typically between 0 and 1.
Formula:
For a value $x$: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
Example:
Original data: [10, 20, 30]
Scaled data: [0, 0.5, 1]
Advantages:
• Preserves the relationships between data points.
• Easy to implement.
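In code, the scaling is one line (scikit-learn's MinMaxScaler does the same on whole feature matrices):

```python
# Minimal sketch: min-max scaling of the example data.
import numpy as np

x = np.array([10, 20, 30])
scaled = (x - x.min()) / (x.max() - x.min())
print(scaled)  # [0.  0.5 1. ]
```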
Z-Score Standardization
Definition:
Z-Score standardization transforms data to have a mean of 0 and a standard deviation of 1. It is generally preferred over min-max scaling when the dataset contains outliers, because the result is not squeezed into a fixed range by the extremes.
Formula:
For a value $x$: $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ the standard deviation.
Example:
Original data: [10, 20, 30]
Mean: 20, sample standard deviation: 10
Z-Scores: [-1, 0, 1]
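The same example in code; ddof=1 reproduces the sample standard deviation of 10 used above (scikit-learn's StandardScaler uses the population version instead):

```python
# Minimal sketch: z-score standardization of the example data.
import numpy as np

x = np.array([10, 20, 30])
z = (x - x.mean()) / x.std(ddof=1)  # sample std = 10
print(z)  # [-1.  0.  1.]
```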
Key Points:
Outliers and Missing Data
Detection of Outliers
1. Using the IQR Method:
• Steps:
1. Calculate Q1 (25th percentile) and Q3 (75th percentile).
2. Compute IQR: $\text{IQR} = Q3 - Q1$.
3. Define outlier boundaries:
▪ Lower boundary: $Q1 - 1.5 \times \text{IQR}$
▪ Upper boundary: $Q3 + 1.5 \times \text{IQR}$
4. Values outside these boundaries are considered outliers.
Example:
Data: [1, 2, 3, 4, 5, 6, 100]
Q1 = 2.5, Q3 = 5.5, IQR = 3
Lower boundary: $2.5 - 1.5(3) = -2$
Upper boundary: $5.5 + 1.5(3) = 10$
Outlier: 100
2. Using Z-Scores:
Z-Scores help detect outliers by measuring how far a data point is from the mean in terms of
standard deviations.
Example:
Data: [10, 20, 30, 1000], Mean = 265, sample Std Dev ≈ 490.07
Z-Score for 1000: $\frac{1000 - 265}{490.07} \approx 1.5$, below the usual $|z| > 3$ cutoff, so the rule does not flag it. (A single extreme value inflates the standard deviation and can mask itself.)
Handling Outliers
1. Winsorizing:
• Replacing extreme values with a less extreme value (e.g., replacing them with the
nearest boundary).
2. Capping:
• Setting a maximum and minimum threshold. Values beyond these thresholds are
capped.
3. Removing:
• Deleting outlier rows entirely, typically when they stem from data-entry errors.
Missing Data Imputation
1. Mean/Median/Mode Imputation:
• Replace missing values with the mean, median, or mode of the column.
• Works well for small datasets with low missingness.
2. Regression Imputation:
• Predict missing values from the other variables using a fitted regression model.
3. Advanced Techniques:
• For example, k-nearest-neighbors (KNN) imputation or multiple imputation.
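A pandas sketch of the simple imputation strategies on a made-up table with gaps:

```python
# Minimal sketch: median/mode imputation with pandas (made-up data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, 30, np.nan, 40, np.nan],
    "city": ["NY", None, "NY", "LA", "NY"],
})
df["age"] = df["age"].fillna(df["age"].median())      # numeric: median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical: mode
print(df)
```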
Measures of Association
Measures of association describe the relationship or dependency between two variables.
These are crucial in determining how one variable changes concerning another. Below are
key topics under this category:
Odds Ratio
• Definition: Odds ratio measures the strength of association between two binary
variables.
It's often used in case-control studies to determine how strongly an exposure is
associated with an outcome.
• Formula: $\text{OR} = \frac{a/b}{c/d} = \frac{a \times d}{b \times c}$
Where:
o $a$: exposed with the outcome, $b$: exposed without it
o $c$: unexposed with the outcome, $d$: unexposed without it
• Example:
In a study of 100 people:
o 40 people drink coffee, and 30 of them report improved focus (10 do not).
o 60 people don't drink coffee, and 20 of them report improved focus (40 do not).
o Odds ratio: $\text{OR} = \frac{30/10}{20/40} = \frac{3}{0.5} = 6$, so the odds of improved focus are 6 times higher among coffee drinkers.
Relative Risk
• Definition: The ratio of the probability of the outcome in the exposed group to the probability in the unexposed group: $\text{RR} = \frac{P(\text{outcome} \mid \text{exposed})}{P(\text{outcome} \mid \text{unexposed})}$
• Example:
Using the same data as above:
o Probability of improved focus for coffee drinkers: $\frac{30}{40} = 0.75$
o Probability of improved focus for non-coffee drinkers: $\frac{20}{60} \approx 0.33$
o Relative risk: $\frac{0.75}{0.33} \approx 2.25$, so coffee drinkers are about 2.25 times as likely to report improved focus.
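Both measures fall out of the 2x2 counts directly; the sketch below recomputes them for the coffee example:

```python
# Minimal sketch: odds ratio and relative risk from the coffee example.
a, b = 30, 10  # coffee drinkers: improved focus / no improvement
c, d = 20, 40  # non-drinkers:    improved focus / no improvement

odds_ratio = (a / b) / (c / d)
print(odds_ratio)  # 6.0

relative_risk = (a / (a + b)) / (c / (c + d))
print(round(relative_risk, 2))  # 2.25
```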