
Index of Topics Covered in Statistics

1. Descriptive Statistics
o Measures of Central Tendency (Mean, Median, Mode)
o Measures of Dispersion (Range, Variance, Standard Deviation, IQR)
o Data Visualization (Histogram, Boxplot, Scatterplot)
o Skewness and Kurtosis
2. Probability
o Basics of Probability
o Types of Events (Independent, Dependent, Mutually Exclusive)
o Conditional Probability
o Bayes' Theorem
3. Probability Distributions
o Discrete Distributions (Binomial, Poisson)
o Continuous Distributions (Normal, Uniform, Exponential)
o Properties of Normal Distribution (Z-Scores, Empirical Rule)
4. Sampling and Sampling Techniques
o Types of Sampling (Random, Stratified, Cluster, Systematic)
o Importance of Sample Size
o Sampling Error
5. Inferential Statistics
o Confidence Intervals
o Margin of Error
o Central Limit Theorem (CLT)
o Bootstrapping
6. Hypothesis Testing
o Null and Alternative Hypotheses
o Type I and Type II Errors
o Steps in Hypothesis Testing
o One-Tailed vs. Two-Tailed Tests
o Test Statistics (Z-Test, T-Test, Chi-Square Test, ANOVA)
7. Correlation and Covariance
o Definition and Differences
o Pearson’s Correlation Coefficient
o Spearman’s Rank Correlation
o Covariance Formula and Interpretation
8. Regression Analysis
o Linear Regression (Simple and Multiple)
o Assumptions of Linear Regression
o Coefficients (Intercept and Slope)
o R-Squared and Adjusted R-Squared
o Multicollinearity and Variance Inflation Factor (VIF)
o Logistic Regression
9. Analysis of Variance (ANOVA)
o One-Way ANOVA
o Assumptions of ANOVA
o F-Test
o Post Hoc Tests (Tukey’s Test)
10. Chi-Square Tests
o Chi-Square Goodness-of-Fit Test
o Chi-Square Test for Independence
11. Data Transformation and Standardization
o Log Transformation
o Min-Max Scaling
o Z-Score Standardization
12. Outliers and Missing Data
o Detection of Outliers (IQR, Z-Scores)
o Handling Outliers (Winsorizing, Capping, Removing)
o Missing Data Imputation (Mean, Median, Mode, Regression Imputation)
13. Measures of Association
o Odds Ratio
o Relative Risk
o Contingency Tables

Descriptive Statistics

1. Basics of Data

• Data Types:
o Quantitative (Numerical):
▪ Discrete: Countable values (e.g., number of students in a class)
▪ Continuous: Infinite possible values within a range (e.g., height,
weight)
o Qualitative (Categorical):
▪ Nominal: Categories without a specific order (e.g., gender, colors)
▪ Ordinal: Categories with a specific order (e.g., rankings, satisfaction
levels)
• Levels of Measurement:
o Nominal: Labels with no inherent order (e.g., male/female)
o Ordinal: Rank-ordered data with unequal intervals (e.g., class grades: A, B,
C)
o Interval: Equal intervals, but no true zero (e.g., temperature in Celsius)
o Ratio: Equal intervals and a true zero (e.g., weight, income)

2. Measures of Central Tendency

These measures represent the center of the data.

1. Mean (Average):
o Formula: $\text{Mean} = \frac{\Sigma X}{n}$
o Sensitive to outliers.
o Example: The average of 2, 3, 5 is $(2+3+5)/3 = 3.33$.
2. Median:
o The middle value when data is sorted.
o If $n$ is odd: Middle value.
o If $n$ is even: Average of two middle values.
o Not affected by outliers.
3. Mode:
o The most frequently occurring value.
o Can have multiple modes (bimodal, multimodal).
o Example: In 1, 2, 2, 3, the mode is 2.

3. Measures of Dispersion (Variability)

These describe how spread out the data is.

1. Range:
o Formula: $\text{Range} = \text{Max} - \text{Min}$
o Example: For 2, 4, 6, 8, Range = $8 - 2 = 6$.
2. Variance:
o Measures the average squared deviation from the mean.
o Formula:
▪ Population: $\sigma^2 = \frac{\Sigma (X - \mu)^2}{N}$
▪ Sample: $s^2 = \frac{\Sigma (X - \bar{X})^2}{n-1}$
o Larger variance = greater spread.
3. Standard Deviation:
o Square root of variance.
o Represents data spread in the same units as the data.
o Formula: $\sigma = \sqrt{\sigma^2}$.
4. Interquartile Range (IQR):
o Measures the middle 50% of the data.
o Formula: $\text{IQR} = Q_3 - Q_1$
▪ $Q_1$: 25th percentile
▪ $Q_3$: 75th percentile
o Helps detect outliers.
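A minimal Python sketch of these measures, using numpy on a small made-up dataset (the numbers are illustrative only):

```python
import numpy as np

data = np.array([2, 4, 4, 6, 8, 9, 100])   # hypothetical data with one outlier

mean = data.mean()
median = np.median(data)
variance = data.var(ddof=1)                 # sample variance (n - 1 denominator)
std_dev = data.std(ddof=1)                  # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                               # spread of the middle 50%

print(f"mean={mean:.1f} median={median} var={variance:.1f} sd={std_dev:.1f} IQR={iqr}")
# The outlier (100) pulls the mean far above the median, which is why
# the median is the safer measure of centre for skewed data.
```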

4. Skewness and Kurtosis

1. Skewness:
o Describes asymmetry in data distribution.
o Positive skew: Tail on the right (e.g., income data).
o Negative skew: Tail on the left.
2. Kurtosis:
o Measures the "tailedness" of the distribution.
o Types:
▪ Mesokurtic: Normal distribution
▪ Leptokurtic: Heavy tails
▪ Platykurtic: Light tails

5. Data Visualization

Helps to visually understand and describe data:

1. Histograms:
o Show frequency distribution.
o Useful for understanding data shape (normal, skewed).
2. Bar Charts:
o Represent categorical data.
3. Box Plots:
o Show data spread, median, and outliers.
4. Pie Charts:
o Represent proportions of a whole.
5. Scatter Plots:
o Display relationships between two variables.

Summary of Descriptive Statistics Workflow:

1. Understand your data: Types and levels of measurement.
2. Calculate central tendency: Mean, median, mode.
3. Assess variability: Range, variance, standard deviation, IQR.
4. Check distribution: Skewness, kurtosis.
5. Visualize: Use appropriate charts and plots.

Probability

1. Basics of Probability

• Definition: Probability measures the likelihood of an event occurring. It ranges from 0 (impossible event) to 1 (certain event).
o Formula: $P(E) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}$

2. Key Probability Terms

1. Experiment: A process with well-defined outcomes (e.g., rolling a die, flipping a coin).
2. Sample Space (S): The set of all possible outcomes.
o Example: Rolling a die → $S = \{1, 2, 3, 4, 5, 6\}$
3. Event (E): A subset of the sample space (e.g., rolling an even number → $E = \{2, 4, 6\}$).
4. Complement of an Event ($E^c$): All outcomes not in $E$ (e.g., $E^c = \{1, 3, 5\}$).
5. Mutually Exclusive Events: Two events that cannot occur simultaneously.
6. Independent Events: The occurrence of one event does not affect the probability of
the other.

3. Types of Probability

1. Classical Probability:
o Based on equally likely outcomes.
o Example: Probability of rolling a 4 on a die → $P(4) = \frac{1}{6}$.
2. Empirical Probability:
o Based on observed data.
o Example: Probability of rain based on historical weather data.
3. Subjective Probability:
o Based on intuition or experience.
o Example: The probability of a startup succeeding.

4. Rules of Probability

1. Addition Rule:
o For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$
o For non-mutually exclusive events: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
2. Multiplication Rule:
o For independent events: $P(A \cap B) = P(A) \times P(B)$
3. Complement Rule:
o Probability of an event not occurring: $P(E^c) = 1 - P(E)$

5. Conditional Probability

• The probability of event $A$ occurring given that $B$ has occurred:
$P(A|B) = \frac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$.

6. Bayes’ Theorem
• A formula to calculate conditional probabilities: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
o Applications: Spam detection, medical testing, machine learning.
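A small worked sketch of Bayes' theorem in Python. The rates are hypothetical numbers for a medical test (1% prevalence, 95% sensitivity, 90% specificity), chosen only to illustrate the calculation:

```python
# P(disease | positive test) via Bayes' theorem, with assumed rates
p_disease = 0.01                 # prior: 1% prevalence (hypothetical)
p_pos_given_disease = 0.95       # sensitivity
p_pos_given_healthy = 0.10       # false-positive rate (1 - specificity)

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")   # ≈ 0.088
# Even with a sensitive test, a rare condition yields mostly false positives.
```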

7. Probability Distributions

1. Discrete Probability Distributions:
o Binomial Distribution:
▪ Number of successes in $n$ independent trials.
▪ Formula: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
o Poisson Distribution:
▪ Number of events in a fixed interval (time/space).
2. Continuous Probability Distributions:
o Normal Distribution:
▪ Bell-shaped curve.
▪ Properties: Symmetrical, mean = median = mode.
o Exponential Distribution:
▪ Time between events in a Poisson process.
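A brief scipy.stats sketch of these distributions (all parameter values are arbitrary examples):

```python
from scipy import stats

# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5
print(stats.binom.pmf(k=3, n=10, p=0.5))    # ≈ 0.117

# Poisson: probability of 2 events when the average rate is 4 per interval
print(stats.poisson.pmf(k=2, mu=4))         # ≈ 0.147

# Normal: P(X <= 1.96) for a standard normal variable
print(stats.norm.cdf(1.96))                 # ≈ 0.975

# Exponential: P(waiting time <= 1) when the mean wait is 2
print(stats.expon.cdf(1, scale=2))          # ≈ 0.393
```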

8. Examples

1. Example 1: A coin is flipped twice. What is the probability of getting at least one
head?

o Sample space: $S = \{HH, HT, TH, TT\}$
o Event $A$: At least one head → $A = \{HH, HT, TH\}$
o Probability: $P(A) = \frac{3}{4}$
2. Example 2: A bag contains 4 red balls and 6 blue balls. What is the probability of
drawing a red ball?
o Total balls = 10, red balls = 4.
o Probability: $P(\text{Red}) = \frac{4}{10} = 0.4$.

9. Applications of Probability in Data Analysis

• Estimating uncertainties in predictions.
• Creating probabilistic models (e.g., weather forecasting).
• Forming the basis for inferential statistics (e.g., hypothesis testing).

Sampling and Sampling Techniques
Sampling is the process of selecting a subset (sample) from a larger population to infer
conclusions about the entire population. Sampling is essential for practical and efficient data
analysis when studying large populations.

1. Types of Sampling

Sampling methods are broadly divided into probability sampling (where every member of
the population has a known chance of being selected) and non-probability sampling (where
selection is not random). Below are the key types:

A. Random Sampling

• Definition: Every member of the population has an equal chance of being selected.
• Method: Selection is done using random number generators or lottery methods.
• Example: Drawing names out of a hat to select participants for a survey.
• Advantages:
o Reduces selection bias.
o High likelihood of representing the population.
• Disadvantages:
o May be impractical for very large populations.

B. Stratified Sampling

• Definition: The population is divided into strata (groups) based on a characteristic, and samples are drawn from each stratum proportionally.
• Example: Sampling students by gender (male/female) or income brackets
(low/middle/high).
• Advantages:
o Ensures representation of all subgroups.
o Increases precision of results.
• Disadvantages:
o Requires detailed knowledge of the population.
o Complex and time-consuming.

C. Cluster Sampling

• Definition: The population is divided into clusters, and entire clusters are randomly
selected.
• Example: Selecting a few schools randomly and surveying all students in those
schools.
• Advantages:

o Cost-efficient for large populations.
o Useful when population data is geographically spread out.
• Disadvantages:
o Less accurate if clusters are not homogeneous.
o May increase sampling error.

D. Systematic Sampling

• Definition: Every nth member of the population is selected after randomly selecting
the starting point.
• Example: Selecting every 10th person in a list of 1,000 names.
• Advantages:
o Simple and quick to implement.
o Ensures uniform coverage of the population.
• Disadvantages:
o Patterns in the population may bias the sample (e.g., cyclical data).
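A compact sketch of the four probability-sampling methods with numpy. The population, cluster structure, and sample sizes are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(1000)            # hypothetical population of IDs 0..999

# A. Simple random sampling: every member has an equal chance
random_sample = rng.choice(population, size=50, replace=False)

# B. Stratified sampling: sample proportionally from two strata (60% / 40%)
stratum_a, stratum_b = population[:600], population[600:]
stratified = np.concatenate([
    rng.choice(stratum_a, size=30, replace=False),   # 60% of 50
    rng.choice(stratum_b, size=20, replace=False),   # 40% of 50
])

# C. Cluster sampling: pick whole clusters at random, survey everyone in them
clusters = population.reshape(100, 10)               # 100 clusters of 10
chosen = rng.choice(100, size=5, replace=False)
cluster_sample = clusters[chosen].ravel()

# D. Systematic sampling: every 20th member after a random start
start = rng.integers(0, 20)
systematic = population[start::20]
```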

2. Importance of Sample Size

The sample size plays a crucial role in ensuring the reliability and validity of results.

• Key Considerations:
1. Population Size: A smaller sample may suffice for small populations, but
larger populations require a larger sample.
2. Margin of Error: A smaller margin of error requires a larger sample.
3. Confidence Level: A higher confidence level (e.g., 95% vs. 99%) requires a
larger sample.
• Impact of Sample Size:
o Small Sample: May lead to underrepresentation and unreliable results.
o Large Sample: Reduces sampling error but can be costly and time-
consuming.

3. Sampling Error

• Definition: Sampling error is the difference between the sample statistic (e.g., sample
mean) and the true population parameter (e.g., population mean).
It arises due to the fact that a sample is only a subset of the population.
• Causes:
o Variability in the population.
o Sample size being too small.
o Non-representative sampling techniques.
• Reduction Methods:
o Use random sampling to minimize bias.
o Increase the sample size.
o Use stratified sampling to ensure all groups are proportionally represented.

Inferential Statistics.

Inferential Statistics Overview

• Definition: Inferential statistics involves using sample data to make generalizations or predictions about a larger population.
• It helps to estimate population parameters and test hypotheses.

Key Concepts in Inferential Statistics

1. Population and Sample

• Population: The entire group of individuals or items under study (e.g., all voters in a
country).
• Sample: A subset of the population used for analysis (e.g., 1,000 voters surveyed).

2. Parameters and Statistics

• Parameter: A numerical value summarizing a population (e.g., population mean $\mu$).
• Statistic: A numerical value summarizing a sample (e.g., sample mean $\bar{x}$).

3. Sampling Distribution

• The probability distribution of a statistic (e.g., sample mean) when drawn from a
population repeatedly.
• Central Limit Theorem (CLT):
o For large sample sizes ($n > 30$), the sampling distribution of the sample mean approximates a normal distribution, regardless of the population's distribution.
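A quick simulation sketch of the CLT: means of samples drawn from a skewed (exponential) population look approximately normal once $n$ is large. All numbers here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily skewed population: exponential with mean 1
sample_means = [rng.exponential(scale=1.0, size=40).mean()
                for _ in range(5000)]

print(f"mean of sample means ≈ {np.mean(sample_means):.3f}")   # ≈ 1.0
print(f"sd of sample means   ≈ {np.std(sample_means):.3f}")    # ≈ 1/sqrt(40) ≈ 0.158
# A histogram of sample_means would look roughly bell-shaped,
# even though the underlying population is strongly skewed.
```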

4. Estimation

Estimation involves determining population parameters based on sample data.

1. Point Estimation:
o Provides a single value as an estimate of a parameter (e.g., sample mean $\bar{x}$ for population mean $\mu$).
2. Interval Estimation:
o Provides a range of values (confidence interval) likely to contain the population parameter.
o Example: $\text{CI} = \bar{x} \pm z \cdot \frac{s}{\sqrt{n}}$
o $z$: Z-score for confidence level (e.g., 1.96 for 95% CI).
o $s$: Sample standard deviation.
o $n$: Sample size.
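A minimal sketch of a 95% confidence interval for a mean, assuming summary statistics from a hypothetical sample (mean 32, sd 4, n = 50):

```python
import math
from scipy import stats

x_bar, s, n = 32.0, 4.0, 50          # hypothetical sample mean, sd, size
z = stats.norm.ppf(0.975)            # ≈ 1.96 for a 95% CI

margin = z * s / math.sqrt(n)        # margin of error
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
# → roughly (30.89, 33.11)
```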

5. Hypothesis Testing

• Hypothesis testing is a formal procedure to test assumptions about population parameters.

1. Steps in Hypothesis Testing:
o Define the null hypothesis ($H_0$) and alternative hypothesis ($H_a$).
o Choose a significance level ($\alpha$, commonly 0.05 or 0.01).
o Select a test statistic (e.g., t-test, z-test).
o Compute the test statistic and compare it with the critical value or p-value.
o Make a decision: Reject or fail to reject $H_0$.
2. Types of Errors:
o Type I Error: Rejecting $H_0$ when it is true (false positive).
o Type II Error: Failing to reject $H_0$ when it is false (false negative).

6. Confidence Intervals vs. Hypothesis Testing

• Confidence intervals estimate the range of a parameter.
• Hypothesis testing evaluates whether a parameter meets a specified condition.

7. Common Tests in Inferential Statistics

1. Z-Test:
o Used when population variance is known and sample size is large ($n > 30$).
o Example: Testing the mean of a large dataset.
2. T-Test:
o Used when population variance is unknown or sample size is small ($n < 30$).
o Types:
▪ One-sample t-test.
▪ Independent (two-sample) t-test.
▪ Paired t-test.
3. Chi-Square Test:
o Used to test the association between categorical variables or goodness-of-fit.
o Example: Testing independence between gender and product preference.
4. ANOVA (Analysis of Variance):
o Used to compare means of more than two groups.
o Example: Comparing exam scores across three teaching methods.

8. Examples

1. Example 1: A company claims its average delivery time is 30 minutes. A sample of 50 deliveries has a mean of 32 minutes with a standard deviation of 4 minutes. At $\alpha = 0.05$, can we conclude the claim is false?
o Use a one-sample t-test.
2. Example 2: Does a new drug significantly reduce blood pressure compared to a
placebo?
o Use an independent two-sample t-test.
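A sketch of Example 1 as a one-sample t-test. Since scipy's ttest_1samp needs raw data and we only have summary statistics, the statistic is formed directly from the formula:

```python
import math
from scipy import stats

x_bar, mu0, s, n = 32.0, 30.0, 4.0, 50   # sample mean, claimed mean, sd, size

t = (x_bar - mu0) / (s / math.sqrt(n))   # t ≈ 3.54
p = 2 * stats.t.sf(abs(t), df=n - 1)     # two-tailed p-value

print(f"t = {t:.2f}, p = {p:.4f}")
# p is well below 0.05, so we reject H0: the data are inconsistent
# with the claimed 30-minute average delivery time.
```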

9. Applications of Inferential Statistics

• Predicting market trends from surveys.
• Testing the effectiveness of new treatments.
• Quality control in manufacturing processes.

Hypothesis Testing.

Hypothesis Testing Overview

• Definition: Hypothesis testing is a statistical method used to make inferences or decisions about a population based on sample data.
• It helps determine whether there is enough evidence to reject a null hypothesis.

Key Concepts

1. Hypotheses

• Null Hypothesis ($H_0$): A statement of no effect or no difference.
Example: "There is no difference in the mean scores of two groups."
• Alternative Hypothesis ($H_1$): A statement that contradicts the null hypothesis.
Example: "There is a significant difference in the mean scores of two groups."

2. Types of Hypothesis Tests

1. One-Tailed Test:
o Tests if the sample mean is significantly greater or smaller than the population
mean.
o Example: Testing if a new drug increases recovery rate.
2. Two-Tailed Test:
o Tests if the sample mean is significantly different (either higher or lower) from
the population mean.
o Example: Testing if a new teaching method affects scores (positively or
negatively).

3. Steps in Hypothesis Testing

1. State the Hypotheses: Define $H_0$ and $H_1$.
Example: $H_0$: The average salary is $50,000. $H_1$: The average salary is not $50,000.
2. Set the Significance Level ($\alpha$): Common values are 0.05 or 0.01.
3. Choose the Appropriate Test:
o t-test (small sample size or unknown population standard deviation).
o z-test (large sample size or known population standard deviation).
o Chi-square test (categorical data).
o ANOVA (comparing means of 3 or more groups).
4. Compute the Test Statistic: Use formulas or statistical software to calculate.
5. Determine the p-Value:
o The p-value tells us the probability of observing the data if $H_0$ is true.
6. Make a Decision:
o If $p \leq \alpha$: Reject $H_0$ (there is enough evidence to support $H_1$).
o If $p > \alpha$: Fail to reject $H_0$ (no sufficient evidence to support $H_1$).

4. Types of Errors

1. Type I Error ($\alpha$): Rejecting a true null hypothesis.
Example: Convicting an innocent person.
2. Type II Error ($\beta$): Failing to reject a false null hypothesis.
Example: Letting a guilty person go free.

5. Examples of Hypothesis Testing

1. Two-Sample t-Test: Comparing means of two groups.
Example: Do male and female students perform differently in exams?
2. Paired t-Test: Comparing means of the same group before and after a treatment.
Example: Measuring the effect of a diet plan on weight loss.
3. Chi-Square Test: Testing independence or goodness of fit for categorical data.
Example: Are gender and preference for a product related?
4. ANOVA: Comparing means across three or more groups.
Example: Does the average income differ across three cities?
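A short sketch of the two-sample case with scipy (the exam scores are made-up):

```python
from scipy import stats

group_a = [78, 85, 69, 91, 74, 88]   # hypothetical exam scores
group_b = [82, 79, 95, 88, 90, 84]

# Independent two-sample t-test (Welch's version, which does not
# assume equal variances in the two groups)
t, p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")   # reject H0 only if p <= alpha
```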

6. Interpreting Results

• Critical Value Approach:
o Compare the test statistic with critical values from a statistical table.
o If the test statistic exceeds the critical value, reject $H_0$.
• P-Value Approach:
o If the p-value is less than $\alpha$, reject $H_0$.
o Example: If $p = 0.03$ and $\alpha = 0.05$, reject $H_0$.

7. Practical Applications

1. Business: Testing the effectiveness of marketing campaigns.
2. Healthcare: Evaluating the impact of new drugs.
3. Education: Assessing teaching methods.
4. Manufacturing: Testing the quality of products.

Regression Analysis

Regression Analysis Overview

• Definition: Regression analysis is a statistical technique used to understand the relationship between one dependent variable (target) and one or more independent variables (predictors).
• It helps predict outcomes and identify trends or patterns.

Types of Regression Analysis

1. Simple Linear Regression:
o Examines the relationship between one dependent variable and one independent variable.
o Example: Predicting sales based on advertising expenditure.
o Equation: $Y = b_0 + b_1 X + \epsilon$, where:
$Y$ = Dependent variable
$X$ = Independent variable
$b_0$ = Intercept
$b_1$ = Slope (rate of change)
$\epsilon$ = Error term
2. Multiple Linear Regression:
o Examines the relationship between one dependent variable and two or more
independent variables.
o Example: Predicting house prices based on size, location, and number of
bedrooms.
o Equation: $Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n + \epsilon$
3. Logistic Regression:
o Used when the dependent variable is categorical (e.g., yes/no, 0/1).
o Example: Predicting whether a customer will buy a product (yes/no).
4. Polynomial Regression:
o Extends linear regression by fitting a curve to the data.
o Example: Predicting growth rates in non-linear relationships.
5. Ridge and Lasso Regression (Regularization Techniques):
o Used to prevent overfitting in models with many predictors by penalizing
large coefficients.

Key Terms in Regression

1. Coefficient ($b_1, b_2, \dots$):
o Measures the strength and direction of the relationship between each predictor and the target variable.
2. Intercept ($b_0$):
o The value of the dependent variable when all predictors are zero.
3. Residuals:
o The difference between observed and predicted values: $\text{Residual} = Y_{\text{observed}} - Y_{\text{predicted}}$
4. R-Squared ($R^2$):
o Measures the proportion of variance in the dependent variable explained by the independent variables.
o Value ranges from 0 to 1. Higher values indicate a better fit.
5. Adjusted R-Squared:
o Adjusted for the number of predictors. Used when comparing models with different numbers of predictors.
6. P-Value:
o Tests whether the coefficients are statistically significant.
$p < 0.05$: The predictor significantly impacts the dependent variable.

Steps in Regression Analysis

1. Define the Problem:
o Identify the dependent and independent variables.
2. Visualize the Data:
o Use scatterplots to examine relationships between variables.
3. Check Assumptions:
o Linearity: Relationship between independent and dependent variables should
be linear.
o Independence: Observations should be independent.
o Homoscedasticity: Variance of residuals should be constant across all levels
of the independent variable.
o Normality: Residuals should be approximately normally distributed.
4. Fit the Model:
o Use statistical software (e.g., Excel, Python, R) to estimate coefficients.
5. Evaluate the Model:
o Look at $R^2$, p-values, and residual plots.
6. Make Predictions:
o Use the regression equation to predict new outcomes.

Common Applications

1. Business:
o Forecasting sales, revenue, or market trends.
2. Healthcare:
o Predicting patient outcomes based on clinical variables.
3. Economics:
o Studying relationships between economic indicators (e.g., inflation vs. GDP
growth).
4. Education:

o Evaluating the impact of study hours on exam scores.

Practical Example

Scenario:

You want to predict monthly sales ($Y$) based on advertising spend ($X$).

Steps:

1. Data Collection: Gather data on monthly sales and advertising expenditure.
2. Visualize: Plot sales vs. advertising spend to check for a linear trend.
3. Fit Model: $Y = b_0 + b_1 X + \epsilon$
o Software will calculate $b_0$ (intercept) and $b_1$ (slope).
4. Interpret Results:
o If $b_1 = 50$: For every $1 spent on advertising, sales increase by $50.
o Check $R^2$ to see how well the model explains sales variance.
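A minimal sketch of this scenario with scipy; the advertising and sales figures are invented for illustration:

```python
import numpy as np
from scipy import stats

ad_spend = np.array([10, 20, 30, 40, 50])        # hypothetical spend
sales = np.array([520, 1050, 1480, 2100, 2490])  # hypothetical sales

fit = stats.linregress(ad_spend, sales)
print(f"intercept b0 = {fit.intercept:.1f}")
print(f"slope     b1 = {fit.slope:.1f}")      # change in sales per unit of spend
print(f"R-squared    = {fit.rvalue**2:.3f}")  # variance explained
print(f"p-value      = {fit.pvalue:.4f}")     # significance of the slope

# Predict sales for a new advertising budget
new_spend = 35
print(f"predicted sales: {fit.intercept + fit.slope * new_spend:.0f}")
```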

Advanced Topics

1. Multicollinearity:
o Occurs when independent variables are highly correlated.
o Solution: Remove redundant predictors or use Ridge/Lasso regression.
2. Interaction Effects:
o Examines if the effect of one predictor depends on the value of another.
o Example: The impact of advertising may differ based on the season.
3. Model Diagnostics:
o Residual plots to check assumptions.
o Use AIC/BIC for model comparison.

Correlation and Covariance
Correlation and covariance measure the relationship between two variables, but they differ in
how they express that relationship. Let’s dive into each aspect.

1. Definition and Differences

• Definition: Correlation measures the strength and direction of the linear relationship between two variables; covariance measures how two variables vary together.
• Range: Correlation is always between -1 and +1; covariance can take any value, positive or negative.
• Unit: Correlation is unitless (standardized); covariance is expressed in the units of the variables.
• Interpretation: Correlation indicates both strength and direction of the relationship; covariance indicates direction, but not strength.

Correlation Analysis.
Correlation Analysis Overview

• Definition: Correlation analysis measures the strength and direction of a linear relationship between two variables.
• Purpose:
o Understand how two variables are related.
o Identify patterns and trends in data.

Key Features of Correlation

1. Correlation Coefficient ($r$):
o A numerical value between $-1$ and $+1$ that represents the strength and direction of the relationship.
o Formula for Pearson Correlation Coefficient:
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
2. Range of $r$:
o $r = +1$: Perfect positive correlation.
o $r = -1$: Perfect negative correlation.
o $r = 0$: No correlation.
3. Direction of Correlation:
o Positive Correlation: Both variables increase together (e.g., height and
weight).

o Negative Correlation: One variable increases while the other decreases (e.g.,
speed and time taken to travel a fixed distance).
o No Correlation: No apparent relationship between variables (e.g., shoe size
and IQ).
4. Strength of Correlation:
o $0.0 \leq |r| < 0.3$: Weak correlation.
o $0.3 \leq |r| < 0.7$: Moderate correlation.
o $0.7 \leq |r| \leq 1.0$: Strong correlation.

Types of Correlation

1. Pearson Correlation:
o Measures linear relationships between two continuous variables.
o Assumes normal distribution and no significant outliers.
2. Spearman Rank Correlation:
o Measures monotonic relationships (increasing or decreasing, not necessarily
linear).
o Uses ranks of data instead of actual values.
3. Kendall’s Tau:
o Measures ordinal association between two variables.
o Used when datasets have tied ranks or smaller sample sizes.
4. Partial Correlation:
o Measures the relationship between two variables while controlling for the
effect of one or more additional variables.

Assumptions of Correlation Analysis

1. Quantitative Data: Variables should be continuous or ordinal for most correlation methods.
2. Linear Relationship: Assumes a straight-line relationship for Pearson correlation.
3. Homogeneity: Data should have similar variance.
4. Normal Distribution: For Pearson correlation, the data should be approximately
normally distributed.

Correlation vs. Causation

• Correlation:
o Shows a relationship between two variables but does not imply causation.
o Example: Ice cream sales and drowning incidents may be correlated due to a
third factor (hot weather).
• Causation:
o One variable directly affects another.
o Requires experimental evidence to confirm.

Applications of Correlation

1. Business:
o Analyze the relationship between marketing spend and sales.
2. Healthcare:
o Study the relationship between physical activity and heart disease risk.
3. Education:
o Investigate the correlation between study hours and exam performance.
4. Finance:
o Examine the relationship between stock prices and market indices.

Steps to Perform Correlation Analysis

1. Data Collection:
o Gather data on the two variables of interest.
2. Visualize the Data:
o Use scatterplots to identify patterns or trends.
3. Calculate the Correlation Coefficient:
o Use statistical software or formulas.
4. Interpret Results:
o Determine the strength and direction of the relationship.
5. Validate Assumptions:
o Ensure data meets the assumptions of the chosen correlation method.

Example: Analyzing Study Hours and Exam Scores

Scenario:

• You have data on 10 students' study hours and their corresponding exam scores.
• Hypothesis: Students who study more tend to score higher.

Steps:

1. Visualize: Plot study hours vs. exam scores to check for a linear relationship.
2. Calculate $r$:
o Use the Pearson correlation formula.
o Let's assume $r = 0.85$.
3. Interpretation:
o $r = 0.85$: Strong positive correlation. Students who study more tend to score higher.

Correlation Matrix

• Definition: A table showing correlation coefficients between multiple variables.
• Example:

Variable     Study Hours  Exam Scores  Attendance
Study Hours  1            0.85         0.6
Exam Scores  0.85         1            0.7
Attendance   0.6          0.7          1

• Useful for identifying relationships in large datasets.
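A short pandas sketch computing Pearson and Spearman correlations and a correlation matrix; the data are invented to roughly match the example above:

```python
import pandas as pd

df = pd.DataFrame({
    "study_hours": [2, 4, 5, 6, 8, 9, 10, 12, 13, 15],
    "exam_scores": [50, 55, 60, 58, 70, 72, 78, 85, 84, 92],
    "attendance":  [60, 70, 65, 72, 80, 78, 85, 88, 90, 95],
})

# Pairwise Pearson correlation matrix
print(df.corr(method="pearson").round(2))

# Spearman rank correlation: captures monotonic (not just linear)
# relationships and is more robust to outliers
print(df.corr(method="spearman").round(2))

# Covariance matrix, for comparison (units: product of the column units)
print(df.cov().round(1))
```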

Practical Notes

1. Correlation does not detect non-linear relationships (e.g., U-shaped patterns).
2. Outliers can distort correlation coefficients.
3. Always complement correlation analysis with additional tests (e.g., regression,
experiments).

Analysis of Variance (ANOVA).

Analysis of Variance (ANOVA) Overview

• Definition: ANOVA is a statistical technique used to compare the means of three or more groups to determine if at least one group mean is significantly different from the others.
• Purpose:
o Test differences among group means.
o Analyze the impact of one or more factors on a dependent variable.

Types of ANOVA

1. One-Way ANOVA:
o Used when there is one independent variable (factor) with multiple levels.
o Example: Comparing test scores of students across three teaching methods.
2. Two-Way ANOVA:
o Used when there are two independent variables (factors).
o Example: Studying the effect of teaching method (factor 1) and study
environment (factor 2) on student performance.
3. Repeated Measures ANOVA:
o Used when the same subjects are measured multiple times under different
conditions.
o Example: Testing the effect of a drug on the same patients at different time
intervals.
4. MANOVA (Multivariate Analysis of Variance):
o Extends ANOVA to multiple dependent variables.
o Example: Studying the effect of a training program on both performance and
motivation.

Key Assumptions of ANOVA

1. Normality:
o The dependent variable is normally distributed within each group.
2. Homogeneity of Variance:
o Variances of the groups being compared are approximately equal.
3. Independence:
o Observations in each group are independent of each other.

Terms in ANOVA

1. Factors:

o Independent variables being studied.
2. Levels:
o Categories or groups within a factor.
3. F-Ratio:
o The test statistic in ANOVA.
o $F = \frac{\text{Variance Between Groups}}{\text{Variance Within Groups}}$

4. Sum of Squares (SS):
o Between Groups (SSB): Variability due to differences between group means.
o Within Groups (SSW): Variability due to differences within each group.
5. Degrees of Freedom (df):
o Between Groups: $k - 1$, where $k$ is the number of groups.
o Within Groups: $n - k$, where $n$ is the total number of observations.
6. Mean Square (MS):
o Variance estimates:
▪ $MS_B = \frac{SS_B}{df_B}$
▪ $MS_W = \frac{SS_W}{df_W}$
7. P-Value:
o Probability of observing the F-ratio under the null hypothesis.

Steps in Conducting ANOVA

1. State the Hypotheses:
o Null Hypothesis ($H_0$): All group means are equal.
o Alternative Hypothesis ($H_a$): At least one group mean is different.
2. Set the Significance Level ($\alpha$):
o Commonly $\alpha = 0.05$.
3. Calculate the F-Ratio:
o Compare the variance between groups to the variance within groups.
4. Determine the P-Value:
o Use statistical software or F-distribution tables.
5. Make a Decision:
o If $p \leq \alpha$: Reject $H_0$ (evidence suggests group means are different).
o If $p > \alpha$: Fail to reject $H_0$.

One-Way ANOVA Example

Scenario:

A researcher wants to test whether three diets (A, B, and C) lead to different average weight
losses.

1. Hypotheses:
o $H_0$: $\mu_A = \mu_B = \mu_C$.
o $H_a$: At least one mean is different.
2. Data:
o Diet A: [5, 6, 7]
o Diet B: [8, 9, 10]
o Diet C: [3, 4, 5]
3. Calculate Sum of Squares:
o $SS_B$: Variability between group means.
o $SS_W$: Variability within each group.
4. Degrees of Freedom:
o $df_B = k - 1 = 3 - 1 = 2$.
o $df_W = n - k = 9 - 3 = 6$.
5. Calculate F-Ratio:
o Use $F = \frac{MS_B}{MS_W}$, where $MS = \frac{SS}{df}$.
6. Compare P-Value and $\alpha$:
o If $p \leq 0.05$: Reject $H_0$.
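A sketch of this example with scipy's one-way ANOVA, using the diet data above:

```python
from scipy import stats

diet_a = [5, 6, 7]
diet_b = [8, 9, 10]
diet_c = [3, 4, 5]

f_stat, p = stats.f_oneway(diet_a, diet_b, diet_c)
print(f"F = {f_stat:.2f}, p = {p:.4f}")
# The group means (6, 9, 4) are far apart relative to the within-group
# spread, so F is large, p falls below 0.05, and we reject H0.
```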

Two-Way ANOVA Example

Scenario:

A researcher studies the effect of gender (male, female) and exercise type (yoga, cardio) on
stress levels.

1. Factors:
o Factor 1: Gender (2 levels).
o Factor 2: Exercise Type (2 levels).
2. Hypotheses:
o Main Effects:
▪ $H_0$: No difference in stress levels by gender.
▪ $H_0$: No difference in stress levels by exercise type.
o Interaction Effect:
▪ $H_0$: No interaction between gender and exercise type.
3. Steps:
o Compute main effects and interaction effect.
o Compare F-ratio for each.

Post-Hoc Tests

If ANOVA shows significant results, post-hoc tests determine which groups differ:

1. Tukey's HSD:
o Compares all possible pairs of group means.
2. Bonferroni Correction:
o Adjusts $\alpha$ to reduce Type I error.
3. Dunnett’s Test:
o Compares all groups to a control group.

Applications of ANOVA

1. Business:
o Testing the effectiveness of different advertising campaigns.
2. Education:
o Comparing average scores across multiple teaching methods.
3. Healthcare:
o Evaluating the effectiveness of various treatments.

Chi-Square Test.

Chi-Square Test Overview

The Chi-Square Test is a non-parametric statistical test used to determine if there is a significant association between categorical variables or if an observed distribution matches an expected one.

Types of Chi-Square Tests

1. Chi-Square Test of Independence:
o Tests whether two categorical variables are independent of each other.
o Example: Checking if gender and product preference are related.
2. Chi-Square Goodness-of-Fit Test:
o Tests whether the observed distribution of a single categorical variable
matches an expected distribution.
o Example: Verifying if a die is fair by comparing observed and expected
frequencies of each side.

Key Assumptions of the Chi-Square Test

1. The data are categorical.
2. Observations are independent.
3. The sample size is large enough:
o Expected frequency in each cell should be at least 5.

Chi-Square Test Formula

The test statistic ($\chi^2$) is calculated as:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Where:

• $O$: Observed frequency.
• $E$: Expected frequency.

Chi-Square Test of Independence

Steps:

1. State the Hypotheses:
o Null Hypothesis ($H_0$): The two variables are independent.
o Alternative Hypothesis ($H_a$): The two variables are not independent.
2. Set the Significance Level ($\alpha$):
o Commonly $\alpha = 0.05$.
3. Create a Contingency Table:
o Summarize the observed frequencies of the variables in a matrix format.
4. Calculate Expected Frequencies:
o Use the formula: $E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$
5. Compute the Chi-Square Statistic:
o Plug observed ($O$) and expected ($E$) frequencies into the formula.
6. Determine Degrees of Freedom:
o $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns.
7. Compare the $\chi^2$ Value to the Critical Value:
o Use a Chi-Square distribution table or statistical software.
8. Make a Decision:
o If $\chi^2 \geq$ critical value or $p \leq \alpha$: Reject $H_0$.
o Otherwise, fail to reject $H_0$.

Example: Chi-Square Test of Independence

Scenario: A company wants to determine if product preference (Product A, B, C) is related to customer age group (Youth, Adult, Senior).

Observed Data:

             Product A  Product B  Product C  Row Total
Youth        30         20         10         60
Adult        40         30         20         90
Senior       20         30         40         90
Column Total 90         80         70         240

1. Expected Frequencies:
o For Youth & Product A: $E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} = \frac{60 \times 90}{240} = 22.5$

Similarly, calculate $E$ for all cells.

2. Chi-Square Statistic:
o Calculate $\chi^2 = \sum \frac{(O - E)^2}{E}$.
3. Degrees of Freedom:
o $df = (3 - 1)(3 - 1) = 4$.
4. Decision:
o Compare $\chi^2$ to the critical value at $df = 4$ and $\alpha = 0.05$.
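A sketch of this test with scipy, using the observed table above:

```python
import numpy as np
from scipy import stats

observed = np.array([
    [30, 20, 10],   # Youth
    [40, 30, 20],   # Adult
    [20, 30, 40],   # Senior
])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, df = {dof}")
print(expected.round(1))   # expected counts, e.g. 22.5 for Youth / Product A
# A small p-value (p <= 0.05) indicates preference and age group are related.
```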

Chi-Square Goodness-of-Fit Test

Steps:

1. State the Hypotheses:
o Null Hypothesis ($H_0$): The observed frequencies match the expected distribution.
o Alternative Hypothesis ($H_a$): The observed frequencies do not match the expected distribution.
2. Set the Significance Level ($\alpha$).
3. Calculate Expected Frequencies:
o Use the total sample size and proportions.
4. Compute the Chi-Square Statistic:
o Use $\chi^2 = \sum \frac{(O - E)^2}{E}$.
5. Determine Degrees of Freedom:
o $df = k - 1$, where $k$ is the number of categories.
6. Compare $\chi^2$ to the Critical Value.

Example: Goodness-of-Fit

Scenario: A die is rolled 60 times, and the observed frequencies of each side are:

• 1: 8, 2: 10, 3: 12, 4: 8, 5: 14, 6: 8.

Hypotheses:

• $H_0$: The die is fair ($E = \frac{60}{6} = 10$ for each side).
• $H_a$: The die is not fair.

1. Expected Frequencies:
o $E = 10$ for all sides.
2. Chi-Square Statistic:
o Calculate $\chi^2 = \sum \frac{(O - E)^2}{E}$.
3. Degrees of Freedom:
o $df = 6 - 1 = 5$.
4. Decision:
o Compare $\chi^2$ to the critical value at $df = 5$ and $\alpha = 0.05$.
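The same test in scipy; chisquare defaults to equal expected frequencies, which matches the fair-die hypothesis:

```python
from scipy import stats

observed = [8, 10, 12, 8, 14, 8]   # rolls of each face in 60 throws

chi2, p = stats.chisquare(observed)   # expected = 10 per face by default
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
# chi2 = 3.20 here, far below the df=5 critical value of 11.07,
# so we fail to reject H0: no evidence the die is unfair.
```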

Applications of the Chi-Square Test

1. Business:

o Analyzing customer preferences across regions.
2. Healthcare:
o Studying the association between a disease and a risk factor.
3. Education:
o Determining if student performance differs by teaching method.

Data Transformation and Standardization
Log Transformation

Definition:
Log transformation is the process of applying the logarithm function to transform a dataset. It
is used to reduce skewness and normalize data.

Why Use It?

• Reduces the impact of outliers.
• Handles data with an exponential growth pattern.
• Helps stabilize the variance in heteroscedastic data.

When to Use It?

• When the data has positive skewness.
• When values range across several orders of magnitude.

Formula:
For a value $x$:

• $y = \log(x)$ (commonly base 10 or natural log)

Example: Original data: [10, 100, 1000]
Log-transformed data (base 10): [1, 2, 3]

Min-Max Scaling

Definition:
Min-Max scaling transforms data to a fixed range, typically between 0 and 1.

Why Use It?

• To standardize features for machine learning models.
• To ensure all features contribute equally.

Formula:
For a value $x$:

• $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$

Example:
Original data: [10, 20, 30]
Scaled data: [0, 0.5, 1]

Advantages:

• Preserves the relationships between data points.
• Easy to implement.

Z-Score Standardization

Definition:
Z-Score standardization transforms data to have a mean of 0 and a standard deviation of 1. It is often preferred over min-max scaling when the dataset contains outliers, since extreme values do not compress the rest of the data into a narrow range.

Why Use It?

• To standardize data with different units or scales.
• For algorithms sensitive to feature scaling (e.g., SVM, PCA).

Formula:
For a value $x$:

• $z = \frac{x - \mu}{\sigma}$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

Example:
Original data: [10, 20, 30]
Mean: 20, Standard deviation: 10
Z-Scores: [-1, 0, 1]

Key Points:

• Z-Scores help compare data from different distributions.
• Often used in statistical tests and machine learning.
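All three transformations in a few lines of numpy, using the same sample data as the examples above:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])

# Log transformation (base 10); requires strictly positive values
log_x = np.log10(np.array([10.0, 100.0, 1000.0]))   # → [1, 2, 3]

# Min-max scaling to [0, 1]
minmax_x = (x - x.min()) / (x.max() - x.min())      # → [0, 0.5, 1]

# Z-score standardization (ddof=1 matches the sample sd of 10 used above)
z_x = (x - x.mean()) / x.std(ddof=1)                # → [-1, 0, 1]

print(log_x, minmax_x, z_x)
```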

Outliers and Missing Data
Detection of Outliers

1. Using Interquartile Range (IQR):

Outliers are values that fall significantly below or above most of the data. IQR is a common method to detect them.

• Steps:
1. Calculate Q1 (25th percentile) and Q3 (75th percentile).
2. Compute IQR: $\text{IQR} = Q3 - Q1$.
3. Define outlier boundaries:
▪ Lower boundary: $Q1 - 1.5 \times \text{IQR}$
▪ Upper boundary: $Q3 + 1.5 \times \text{IQR}$
4. Values outside these boundaries are considered outliers.

Example:
Data: [1, 2, 3, 4, 5, 6, 100]
Q1 = 2.5, Q3 = 5.5, IQR = 3
Lower boundary: $2.5 - 1.5(3) = -2$
Upper boundary: $5.5 + 1.5(3) = 10$
Outlier: 100
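The IQR rule in numpy, applied to the example data (numpy's default percentile interpolation reproduces Q1 = 2.5 and Q3 = 5.5 here):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(f"bounds = ({lower}, {upper}), outliers = {outliers}")   # → [100]
```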

2. Using Z-Scores:
Z-Scores help detect outliers by measuring how far a data point is from the mean in terms of
standard deviations.

• Rule: Any value with $|z| > 3$ is considered an outlier.
• Formula: $z = \frac{x - \mu}{\sigma}$

Example:
Data: [10, 20, 30, 1000], Mean = 265, Std Dev = 476.74
Z-Score for 1000: $\frac{1000 - 265}{476.74} \approx 1.54$ (not an outlier).

Handling Outliers

1. Winsorizing:

• Replacing extreme values with a less extreme value (e.g., replacing them with the
nearest boundary).

2. Capping:

• Setting a maximum and minimum threshold. Values beyond these thresholds are
capped.

3. Removing:

• Completely removing outliers if they are errors or irrelevant to the analysis.

Missing Data Imputation

1. Mean, Median, Mode Imputation:

• Replace missing values with the mean, median, or mode of the column.
• Works well for small datasets with low missingness.

2. Regression Imputation:

• Predict missing values using regression models.
• Example: If a column $Y$ has missing values, it can be predicted using other columns $X_1, X_2, \dots$.

3. Advanced Techniques:

• K-Nearest Neighbors (KNN): Uses similar rows to fill in missing values.
• Multiple Imputation: Repeats imputation several times and combines the results for robustness.
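A sketch of mean and KNN imputation with scikit-learn; the small matrix is invented, and np.nan marks the missing entries:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])

# Mean imputation: fill each column's missing values with its column mean
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: fill from the 2 most similar rows
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_filled)
print(knn_filled)
```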

Measures of Association
Measures of association describe the relationship or dependency between two variables.
These are crucial in determining how one variable changes concerning another. Below are
key topics under this category:

1. Odds Ratio (OR)

• Definition: Odds ratio measures the strength of association between two binary
variables.
It's often used in case-control studies to determine how strongly an exposure is
associated with an outcome.
• Formula:

$\text{Odds Ratio} = \frac{\text{Odds of Event in Group A}}{\text{Odds of Event in Group B}}$

Where:

$\text{Odds} = \frac{\text{Probability of Event}}{\text{Probability of No Event}}$

• Example:
In a study of 100 people:
o 40 people drink coffee, and 30 of them report improved focus.
o 60 people don’t drink coffee, and 20 of them report improved focus.

Odds of improved focus for coffee drinkers: $\frac{30}{10} = 3$

Odds of improved focus for non-coffee drinkers: $\frac{20}{40} = 0.5$

Odds Ratio: $\frac{3}{0.5} = 6$

Interpretation: The odds of reporting improved focus are 6 times higher for coffee drinkers than for non-coffee drinkers.

2. Relative Risk (RR)

• Definition: Relative risk compares the probability of an event occurring in two groups.
Unlike the odds ratio, it directly compares probabilities rather than odds.
• Formula:

$\text{Relative Risk} = \frac{\text{Probability of Event in Group A}}{\text{Probability of Event in Group B}}$

• Example:
Using the same data as above:
o Probability of improved focus for coffee drinkers: $\frac{30}{40} = 0.75$
o Probability of improved focus for non-coffee drinkers: $\frac{20}{60} = 0.33$

Relative Risk: $\frac{0.75}{0.33} = 2.27$

Interpretation: Coffee drinkers are 2.27 times more likely to report improved focus compared to non-coffee drinkers.

3. Contingency Tables (Cross-tabulation)

• Definition: A contingency table shows the frequency distribution of variables. It is used to analyze relationships between categorical variables.
• Structure:
A 2x2 contingency table example for a study of smokers and lung disease:

            Disease Present  Disease Absent  Total
Smokers     50               30              80
Non-Smokers 20               50              70
Total       70               80              150

• Measures Derived from Contingency Tables:
1. Risk: $\frac{\text{Disease Present}}{\text{Total Group Size}}$
2. Odds Ratio and Relative Risk: Computed as shown above.
• Chi-Square Test: Used to assess whether the observed distribution in the table is significantly different from what we would expect if there were no relationship between the variables.
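A short sketch computing risk, relative risk, the odds ratio, and the chi-square test from the smoking table above:

```python
import numpy as np
from scipy import stats

#                 disease, no disease
table = np.array([[50, 30],    # smokers
                  [20, 50]])   # non-smokers

risk_smokers = 50 / 80                           # 0.625
risk_nonsmokers = 20 / 70                        # ≈ 0.286
relative_risk = risk_smokers / risk_nonsmokers   # ≈ 2.19

odds_ratio = (50 * 50) / (30 * 20)               # cross-product ratio ≈ 4.17

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"RR = {relative_risk:.2f}, OR = {odds_ratio:.2f}, "
      f"chi2 = {chi2:.2f}, p = {p:.4f}")
```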
