
MODULE IV STATISTICAL CONCEPTS FOR DATA SCIENCE

Role of Statistics in Data Science; Population vs. Sample; Descriptive vs. Inferential
statistics; Probability distributions: Poisson, Normal, Binomial, Uniform; Bayes'
theorem and conditional probability; Descriptive statistics: Measures of central
tendency: Mean, median, mode; Measures of dispersion: Variance, standard deviation;
Inferential statistics: Hypothesis testing: Null and alternative hypotheses, p-values;
Confidence intervals, ANOVA, Chi-square test, T-test; Correlation and Covariance.
Role of statistics in data science:
Statistics plays a fundamental role in data science by providing the tools and
techniques necessary for understanding and extracting insights from data. Here are
some key roles of statistics in data science:
1. Descriptive Statistics: Descriptive statistics summarize and describe features of a
dataset, such as mean, median, mode, standard deviation, and percentiles. These
statistics help in understanding the basic properties of the data.
2. Inferential Statistics: Inferential statistics involve making predictions or inferences
about a population based on a sample of data. Techniques like hypothesis testing,
confidence intervals, and regression analysis are used to draw conclusions about larger
populations from limited data.
3. Probability Theory: Probability theory is essential for understanding uncertainty and
randomness in data. It provides the foundation for statistical modeling and inference,
enabling data scientists to quantify uncertainty and make probabilistic predictions.
4. Statistical Modeling: Statistical modeling involves building mathematical models
that describe the relationships between variables in a dataset. Techniques such as
linear regression, logistic regression, time series analysis, and machine learning
algorithms rely on statistical principles to model complex relationships and make
predictions.
5. Experimental Design: Statistics helps in designing experiments and observational
studies to collect data in a systematic and unbiased manner. Proper experimental
design ensures that the collected data is reliable and can lead to valid conclusions.
6. Data Mining and Pattern Recognition: Statistical methods are used to identify
patterns, trends, and relationships within large datasets. Techniques such as clustering,
classification, and association rule mining help in discovering useful insights from
data.
7. Model Evaluation and Validation: Statistics provides methods for evaluating the
performance of predictive models and assessing their accuracy and reliability.
Techniques like cross-validation, hypothesis testing, and goodness-of-fit tests are used
to validate models and ensure their effectiveness.
8. Bayesian Statistics: Bayesian methods are increasingly used in data science for
modeling complex phenomena and updating beliefs based on new evidence. Bayesian
inference allows data scientists to incorporate prior knowledge and uncertainty into
statistical analysis, leading to more robust and interpretable results.
In essence, statistics provides the theoretical foundation and analytical tools necessary
for extracting meaningful information from data, making data-driven decisions, and
solving real-world problems in various domains. Statistics is extensively applied in
various real-life scenarios within the realm of data science.
1. Business Analytics: Companies use statistics to analyze sales data, customer
demographics, and market trends to make informed decisions about pricing, marketing
strategies, inventory management, and resource allocation.
2. Healthcare Analytics: Healthcare providers employ statistics to analyze patient data,
clinical trials, and medical records to identify patterns, predict disease outbreaks,
assess treatment effectiveness, and improve patient outcomes.
3. Financial Analysis: Financial institutions use statistical models to analyze stock
market data, predict asset prices, assess investment risks, detect fraudulent activities,
and optimize portfolio management strategies.
4. Marketing and Customer Analytics: Marketers use statistics to segment customers
based on demographics, behavior, and preferences, conduct A/B testing for marketing
campaigns, analyze website traffic and user engagement, and personalize marketing
messages for targeted audiences.
5. Predictive Maintenance: Industries such as manufacturing and transportation use
statistical models to predict equipment failures, schedule maintenance activities,
optimize asset performance, and minimize downtime and operational costs.
6. Social Media Analytics: Social media platforms analyze user data, interactions, and
engagement metrics using statistical techniques to personalize content
recommendations, target advertisements, detect trends and sentiment, and improve
user experience.
7. Environmental Science: Environmental scientists use statistics to analyze climate
data, predict natural disasters such as hurricanes and earthquakes, assess
environmental risks, and make policy recommendations for conservation and
sustainability efforts.
8. Supply Chain Management: Logistics companies employ statistics to optimize
supply chain operations, forecast demand for goods and services, minimize inventory
costs, improve transportation routes, and enhance overall efficiency and
responsiveness.
9. Sports Analytics: Sports teams and organizations use statistics to analyze player
performance, assess team strategies, predict game outcomes, optimize player
recruitment and draft selections, and gain a competitive edge in sports competitions.
10. Fraud Detection: Banks, insurance companies, and e-commerce platforms use
statistical models to detect fraudulent transactions, identify suspicious patterns and
anomalies in financial data, and prevent financial losses due to fraudulent activities.
Population vs. Sample:
1. Population:
 The population refers to the entire group of individuals, items, or events that we are
interested in studying and about which we want to draw conclusions.
 It represents the entire target of our research or analysis.
 For example, if we are studying the heights of all adult males in a country, the
population would be all adult males in that country.
2. Sample:
 A sample is a subset of the population that is selected for study or analysis.
 It is chosen to represent the larger population, with the aim of drawing conclusions or
making inferences about the population based on the characteristics observed in the
sample.
 Samples are often used because it is impractical or impossible to study every
individual or item in the entire population.
 The process of selecting a sample from a population should ideally be done in a way
that avoids bias and ensures that the sample is representative of the population.
 For example, in the scenario of studying the heights of all adult males in a country, we
might select a sample of 1000 adult males from different regions and demographics
within the country to represent the entire population.
Key differences between population and sample:
 Size: The population includes all individuals or items of interest, whereas the sample
is a smaller subset of the population.
 Representativeness: The population represents the entire group being studied, while
the sample should ideally be representative of the population to ensure that findings
can be generalized.
 Practicality: It is often impractical or impossible to study the entire population, so
samples are used for practical reasons.
 Inference: Statistical analyses performed on a sample are used to make inferences or
draw conclusions about the population from which the sample was drawn.
In summary, while the population represents the entire group being studied, the
sample is a smaller subset of the population that is selected for analysis. The goal of
sampling is to obtain a representative sample that accurately reflects the
characteristics of the population, allowing for valid inferences to be made about the
population based on the observed sample data.
A population in statistics refers to the entire group of individuals, items, or events that
we are interested in studying and about which we want to draw conclusions. It
encompasses all possible subjects that meet specific criteria and is the complete set
from which a sample is drawn.
Key characteristics of a population include:
1. Comprehensiveness: The population includes all elements that meet the criteria for
inclusion in the study. It represents the entire target of our research or analysis.
2. Defined Parameters: Each individual or item in the population possesses certain
characteristics or parameters that are of interest for the study.
3. Finite or Infinite: A population can be finite, meaning it consists of a fixed number of
elements, or infinite, meaning it continues indefinitely.
4. Homogeneity or Heterogeneity: Populations can vary in terms of the similarity or
diversity of their elements. They can be homogeneous if all elements are similar in
certain characteristics, or heterogeneous if they are diverse.
5. Accessibility: While some populations are easily accessible and well-defined, others
may be difficult to access or define precisely.
6. Stability: The characteristics of a population may change over time due to various
factors such as growth, migration, or other external influences.
Examples of populations include:
 All registered voters in a country.
 Every smartphone user in a particular city.
 All cars manufactured by a specific automobile company in a given year.
 All students enrolled in a university program.
Understanding the population of interest is essential for determining the scope and
objectives of a study, selecting appropriate sampling methods, and making valid
inferences about the entire group based on the observed data.
A sample in statistics refers to a subset of the population that is selected for study or
analysis. It is chosen to represent the larger population, with the aim of drawing
conclusions or making inferences about the population based on the characteristics
observed in the sample.
Key characteristics of a sample include:
1. Representativeness: The sample should ideally be representative of the population
from which it is drawn. This means that the characteristics of the sample should
closely resemble those of the population in terms of relevant attributes.
2. Randomness: Random sampling methods are often used to select a sample from the
population, which helps to minimize bias and ensure that each member of the
population has an equal chance of being included in the sample.
3. Size: The size of the sample refers to the number of individuals or items included in
the sample. Larger samples generally provide more precise estimates of population
parameters, but the appropriate sample size depends on factors such as the variability
of the population and the desired level of confidence.
4. Validity: The validity of the sample refers to the extent to which the conclusions
drawn from the sample accurately reflect the characteristics of the population. Validity
can be compromised if the sample is not representative or if there are biases in the
sampling process.
5. Sampling Methods: Various sampling methods can be used to select a sample from a
population, including simple random sampling, stratified sampling, cluster sampling,
and systematic sampling. The choice of sampling method depends on factors such as
the nature of the population and the research objectives.
Examples of samples include:
 Survey responses from a randomly selected group of individuals in a population.
 Measurements of a random sample of products from a production line to assess
quality.
 Test scores from a sample of students in a school to evaluate academic performance.
 Blood samples collected from a subset of patients in a clinical trial to study the
effectiveness of a new treatment.
Sampling is an essential aspect of statistical analysis, as it allows researchers to study
populations without needing to collect data from every individual or item in the
population. However, it is important to ensure that the sample is representative and
that appropriate sampling methods are used to minimize bias and maximize the
reliability of the study results.
Descriptive and inferential statistics are two branches of statistical analysis that serve
different purposes:
Descriptive vs. Inferential Statistics
1. Descriptive Statistics:
 Descriptive statistics are used to summarize and describe the main features of a
dataset.
 They provide simple summaries about the sample or population under study.
 Common measures of descriptive statistics include measures of central tendency (e.g.,
mean, median, mode) and measures of dispersion or variability (e.g., range, variance,
standard deviation).
 Descriptive statistics help to organize and simplify large amounts of data, making it
easier to understand and interpret.
 These statistics are often used to present the basic characteristics of a dataset, identify
patterns, and provide insights into the data without making inferences beyond the
dataset itself.
 Example: Calculating the average height of students in a class, or the percentage of
customers who purchased a particular product.
2. Inferential Statistics:
 Inferential statistics are used to make inferences or predictions about a population
based on sample data.
 They involve generalizing from a sample to a population and drawing conclusions
about the population parameters.
 Inferential statistics allow researchers to test hypotheses, assess relationships between
variables, and make predictions about future outcomes.
 Techniques such as hypothesis testing, confidence intervals, and regression analysis
are commonly used in inferential statistics.
 These statistics help to determine whether observed differences or relationships in the
sample data are statistically significant and can be generalized to the larger
population.
 Example: Using a sample of voters to estimate the proportion of the population that
supports a particular political candidate, or testing whether a new drug treatment is
effective based on results from a clinical trial.
In summary, descriptive statistics are used to summarize and describe data, whereas
inferential statistics are used to make inferences or predictions about populations
based on sample data. Both branches of statistics play important roles in analyzing and
interpreting data, with descriptive statistics providing insights into the characteristics
of the data and inferential statistics allowing researchers to make broader conclusions
about populations based on sample data.
Probability Distributions:
Probability distributions describe how the values of a random variable are distributed.
Poisson distribution:
The Poisson probability distribution is used to model the number of events
occurring within a fixed interval of time or space when these events happen with a
known constant mean rate and independently of the time since the last event. It is
particularly useful for modeling rare events.

Characteristics of Poisson distribution


1. Mean and Variance: Both the mean and the variance of the Poisson distribution are
equal to λ.
2. Skewness: The Poisson distribution is skewed to the right, especially for smaller
values of λ. As λ increases, the distribution becomes more symmetric and approaches
a normal distribution.
Example Calculation
Suppose a call center receives an average of 5 calls per hour (λ=5). We want to find
the probability that exactly 3 calls are received in an hour.
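The probability can be computed from the Poisson formula P(X = k) = λ^k e^(−λ) / k!. The short Python sketch below (scipy is one convenient option; the direct formula gives the same result) is an illustrative example for the numbers above:

```python
from math import exp, factorial
from scipy.stats import poisson

lam, k = 5, 3  # average rate (calls per hour) and the desired count

# Direct use of the Poisson formula: P(X = k) = lambda^k * e^(-lambda) / k!
p_formula = lam**k * exp(-lam) / factorial(k)

# Same probability via scipy's Poisson probability mass function
p_scipy = poisson.pmf(k, lam)

print(p_formula, p_scipy)  # both are approximately 0.1404
```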

Applications of Poisson distribution


The Poisson distribution is widely used in various fields. Here are a few examples:
1. Telecommunications: Modeling the number of phone calls received at a call
center.
2. Traffic Flow: Counting the number of cars passing through a toll booth in a
given time period.
3. Biology: Modeling the number of mutations in a given length of DNA
sequence.
4. Retail: Estimating the number of customers arriving at a store within an hour.
5. Insurance: Predicting the number of claims that an insurance company might
receive in a given period.
Properties of Poisson Distribution
1. Additivity: If X and Y are independent Poisson-distributed random variables with
parameters λ1 and λ2, then X + Y is Poisson-distributed with parameter λ1+λ2.
2. No Upper Limit: Theoretically, there is no upper limit to the number of events in the
Poisson distribution, although the probability of a very large number of events
happening in a short interval is very small.
The Poisson distribution is a powerful tool for modeling the occurrence of events over
a fixed interval, particularly when these events are rare and independent. Its mean and
variance are both equal to the rate parameter λ, and it finds applications in many real-
world scenarios where event occurrences are of interest.
Normal distribution:
The Normal distribution, also known as the Gaussian distribution, is a continuous
probability distribution characterized by its symmetrical bell-shaped curve. It is
widely used in statistics due to the central limit theorem, which states that the sum of a
large number of independent and identically distributed random variables will be
approximately normally distributed, regardless of the original distribution.
Characteristics of the Normal Distribution
1. Symmetry: The distribution is symmetric about its mean, μ.
2. Mean, Median, and Mode: For a normal distribution, the mean, median, and mode
are all equal.
3. Bell-shaped Curve: The probability density function (PDF) forms a bell-shaped
curve.
4. Defined by Two Parameters: The normal distribution is completely defined by its
mean (μ) and standard deviation (σ).
5. Asymptotic: The tails of the normal distribution approach, but never touch, the
horizontal axis.
Example Calculation
Suppose we have a dataset of heights of adult males with a mean (μ) of 70 inches and
a standard deviation (σ) of 3 inches. We want to find the probability that a randomly
selected male has a height between 68 and 72 inches.
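A minimal sketch of this calculation in Python using scipy's normal CDF (the values follow the example above):

```python
from scipy.stats import norm

mu, sigma = 70, 3  # mean and standard deviation of heights (inches)

# P(68 < X < 72) = F(72) - F(68), where F is the normal CDF
p = norm.cdf(72, loc=mu, scale=sigma) - norm.cdf(68, loc=mu, scale=sigma)

print(round(p, 4))  # approximately 0.4950
```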

Applications of Normal Distribution


1. Quality Control: In manufacturing, to ensure products meet specifications.
2. Finance: To model asset returns and risk.
3. Psychometrics: To standardize test scores.
4. Natural Phenomena: To describe biological measurements such as heights, weights,
and blood pressure.
Properties of Normal Distribution
1. 68-95-99.7 Rule: Approximately 68% of the data lies within one standard deviation of
the mean, 95% within two standard deviations, and 99.7% within three standard
deviations.
2. Central Limit Theorem: The distribution of the sample mean will be approximately
normal if the sample size is large enough, regardless of the population distribution.
The normal distribution is fundamental in statistics due to its mathematical
properties and applicability across various fields. Its bell-shaped curve, defined by
mean and standard deviation, makes it an essential tool for probability and statistical
inference.
Uniform Distribution
The uniform distribution is a type of probability distribution in which all outcomes are
equally likely. It comes in two main types: discrete uniform distribution and
continuous uniform distribution.
Applications of Uniform Distribution:
Random Number Generation: Many algorithms for generating random numbers
assume a uniform distribution.
Simulation: Used in simulations where each outcome needs to be equally likely.
Quality Control: Modeling the distribution of a uniformly mixed batch of items.
Gaming: Designing fair games where each outcome has an equal chance of occurring.
Summary:
The uniform distribution is fundamental in probability and statistics due to its
simplicity and wide applicability. In both its discrete and continuous forms, it models
situations where all outcomes are equally likely. Understanding the properties, such as
mean and variance, and how to calculate probabilities is essential for using the
uniform distribution effectively in various fields.
Problems in Uniform Distribution:
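As a hypothetical illustration (the interval [0, 10] and the probability query below are assumed, not taken from the original problem set), the key quantities of a continuous uniform distribution can be computed as follows:

```python
from scipy.stats import uniform

a, b = 0, 10  # assumed interval endpoints for illustration
dist = uniform(loc=a, scale=b - a)  # continuous uniform on [a, b]

mean = dist.mean()             # (a + b) / 2 = 5.0
variance = dist.var()          # (b - a)^2 / 12 ≈ 8.33
p = dist.cdf(7) - dist.cdf(3)  # P(3 < X < 7) = (7 - 3) / (b - a) = 0.4

print(mean, variance, p)
```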
Bayes' Theorem
Bayes' Theorem is a fundamental theorem in probability theory that describes how to
update the probabilities of hypotheses when given evidence. It is widely used in
various fields such as statistics, machine learning, medicine, and more for making
decisions and predictions based on incomplete or uncertain information.
The Formula
Bayes' Theorem is mathematically expressed as:
P(A|B) = P(B|A) · P(A) / P(B)
where P(A|B) is the posterior probability of A given the evidence B, P(B|A) is the
likelihood of the evidence given A, P(A) is the prior probability of A, and P(B) is the
overall probability of the evidence.
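A minimal sketch of the theorem in Python, using a hypothetical diagnostic-test example (the 1% prevalence, 95% sensitivity, and 10% false-positive rate are assumed for illustration):

```python
# Hypothetical diagnostic-test example (all numbers assumed for illustration)
p_disease = 0.01            # prior: P(disease)
p_pos_given_disease = 0.95  # likelihood: P(positive | disease), sensitivity
p_pos_given_healthy = 0.10  # P(positive | no disease) = 1 - specificity

# Total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive)
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # ≈ 0.088, i.e. about 8.8%
```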
Summary
Bayes' Theorem is a powerful tool for updating probabilities based on new evidence.
It combines prior beliefs with the likelihood of observed data to produce a posterior
probability, which reflects updated beliefs. This theorem is essential in fields requiring
probabilistic reasoning and decision-making under uncertainty.
Conditional Probability
The conditional probability of an event A given that event B has occurred is defined as
P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0.
So, the probability that a defective item was produced by Machine 2 is approximately
0.526 (or 52.6%).
Explanation:
In this problem, we used conditional probability and Bayes' Theorem to determine the
probability that a defective item was produced by Machine 2. By breaking down the
problem into identifying events, calculating individual probabilities, and then applying
the theorem, we can find the desired probability. This process illustrates the power of
conditional probability in practical scenarios involving uncertainty and dependence
between events.
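A short Python sketch of this kind of calculation. The production shares and defect rates below are assumed for illustration (chosen so that the result matches the stated answer of about 0.526):

```python
# Assumed figures for illustration (not from the original problem statement):
# Machine 1 produces 60% of items with a 3% defect rate,
# Machine 2 produces 40% of items with a 5% defect rate.
p_m1, p_def_given_m1 = 0.60, 0.03
p_m2, p_def_given_m2 = 0.40, 0.05

# Total probability of a defective item
p_def = p_m1 * p_def_given_m1 + p_m2 * p_def_given_m2

# Bayes' theorem: P(Machine 2 | defective)
p_m2_given_def = p_m2 * p_def_given_m2 / p_def
print(round(p_m2_given_def, 3))  # ≈ 0.526
```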
Descriptive Statistics:
Measures of Central Tendency: Mean, Median, Mode
Measures of central tendency are statistical tools used to summarize a set of data by
identifying the center point within that dataset. The three main measures of central
tendency are the mean, median, and mode. Here's an example along with calculations
for each measure:
Example Dataset
Consider the following set of numbers representing the ages of a group of people:
23,29,20,32,25,23,27,28,23,30
Mean
The mean (average) is calculated by summing all the numbers in the dataset and then
dividing by the count of numbers.
Mean = (23 + 29 + 20 + 32 + 25 + 23 + 27 + 28 + 23 + 30) / 10 = 260 / 10 = 26
Median
The median is the middle value of a dataset when it is ordered in ascending or
descending order. If the number of observations is odd, the median is the middle
number. If the number of observations is even, the median is the average of the two
middle numbers.
First, we arrange the dataset in ascending order:
20,23,23,23,25,27,28,29,30,32
Since there are 10 numbers (an even count), the median will be the average of the 5th
and 6th numbers:
Median = (25 + 27) / 2 = 26
Mode
The mode is the number that appears most frequently in the dataset. A dataset can
have one mode, more than one mode, or no mode at all if no number repeats.
In our dataset, the number 23 appears three times, which is more frequent than any
other number.
Mode=23
Summary
For the dataset 23,29,20,32,25,23,27,28,23,30
Mean: 26
Median: 26
Mode: 23
These measures provide different perspectives on the central tendency of the data. The
mean gives the arithmetic average, the median provides the middle value, and the
mode identifies the most frequently occurring value.
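A minimal sketch of these three measures in Python using the standard-library statistics module:

```python
import statistics

ages = [23, 29, 20, 32, 25, 23, 27, 28, 23, 30]

print(statistics.mean(ages))    # 26
print(statistics.median(ages))  # 26.0
print(statistics.mode(ages))    # 23 (appears three times)
```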
Measures of Dispersion: Variance, Standard deviation
Measures of dispersion describe the spread or variability of a dataset. The most
commonly used measures of dispersion are the range, variance, standard deviation,
and interquartile range (IQR). Here's a detailed look at each, using the same dataset as
before:
Example Dataset
Consider the following set of numbers representing the ages of a group of people:
23,29,20,32,25,23,27,28,23,30
Range
The range is the difference between the highest and lowest values in the dataset.
Range = Maximum value − Minimum value
For our dataset:
Maximum value=32
Minimum value=20
Range=32−20=12
Interquartile Range (IQR)
First, arrange the dataset in ascending order:
20,23,23,23,25,27,28,29,30,32
Next, find Q1 and Q3. Q1 is the median of the first half of the data, and Q3 is the
median of the second half of the data.
For our dataset:
 Q1 (the median of 20, 23, 23, 23, 25): 23
 Q3 (the median of 27, 28, 29, 30, 32): 29
IQR = Q3 − Q1 = 29 − 23 = 6
Variance and Standard Deviation
The sample variance is the average of the squared deviations from the mean, computed
with an n − 1 divisor. With a mean of 26, the squared deviations sum to 130, so the
variance is 130 / 9 ≈ 14.44 and the standard deviation is √14.44 ≈ 3.80.
Summary
For the dataset 23,29,20,32,25,23,27,28,23,30:
Range: 12
Variance: 14.44
Standard Deviation: 3.80
Interquartile Range (IQR): 6
These measures of dispersion provide insights into the variability and spread of the
data, complementing the measures of central tendency.
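A minimal sketch of these dispersion measures in Python (numpy is used for convenience; the variance uses the n − 1 divisor to match the values above):

```python
import numpy as np

ages = np.array([23, 29, 20, 32, 25, 23, 27, 28, 23, 30])

data_range = ages.max() - ages.min()   # 12
variance = ages.var(ddof=1)            # ≈ 14.44 (sample variance, n - 1 divisor)
std_dev = ages.std(ddof=1)             # ≈ 3.80

# Quartiles computed as the medians of the lower and upper halves,
# matching the method used in the text above
s = np.sort(ages)
q1 = np.median(s[:5])                  # 23
q3 = np.median(s[5:])                  # 29
iqr = q3 - q1                          # 6

print(data_range, round(variance, 2), round(std_dev, 2), iqr)
```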
Hypothesis Testing: Null and Alternative Hypothesis
Hypothesis testing is a fundamental method used in statistics to make inferences about
a population based on sample data. It involves the following steps:
1. Formulating Hypotheses:
 Null Hypothesis (H0): This is a statement of no effect or no difference, and it serves
as the default or starting assumption. For example, H0: μ=0 (where μ is the population
mean).
 Alternative Hypothesis (Ha or H1): This is a statement that contradicts the null
hypothesis, indicating the presence of an effect or a difference. For example,
Ha: μ ≠ 0.
2. Selecting a Significance Level (α):
 The significance level, often denoted by α, is the probability of rejecting
the null hypothesis when it is true. Common choices are 0.05, 0.01, and
0.10.
3. Choosing a Test Statistic:
 The test statistic is a standardized value that is calculated from sample
data during a hypothesis test. Examples include the t-statistic, z-statistic,
chi-square statistic, and F-statistic. The choice depends on the nature of
the data and the hypotheses.
4. Determining the Sampling Distribution:
 The sampling distribution of the test statistic under the null hypothesis is
determined. This distribution is used to calculate the probability of
observing the test statistic under the null hypothesis.
5. Calculating the Test Statistic:
Using the sample data, compute the test statistic. For instance, if testing the
mean of a population, the test statistic might be calculated as
t = (x̄ − μ0) / (s / √n)
where x̄ is the sample mean, μ0 is the hypothesized population mean, s is the sample
standard deviation, and n is the sample size.
6. Making a Decision:
 p-value approach: Calculate the p-value, which is the probability of
observing the test statistic or something more extreme, assuming the
null hypothesis is true. If the p-value is less than or equal to α, reject the
null hypothesis.
 Critical value approach: Compare the test statistic to a critical value
from the sampling distribution. If the test statistic falls into the critical
region (beyond the critical value), reject the null hypothesis.
7. Conclusion:
 Based on the decision, conclude whether there is sufficient evidence to
reject the null hypothesis in favor of the alternative hypothesis. This
conclusion helps in understanding whether the observed data
significantly deviates from what was expected under the null hypothesis.
Hypothesis testing is widely used in many fields, such as medicine,
psychology, economics, and social sciences, to test theories and make decisions based
on data.
Problem 1: Testing a Population Mean
A company claims that the average weight of its product is 50 grams. A quality
control manager wants to test this claim. A sample of 30 products shows an average
weight of 51 grams with a standard deviation of 2 grams. Test the company's claim at
the 0.05 significance level.
1. Formulate Hypotheses:
 H0: μ=50 grams (null hypothesis: the mean weight is 50 grams)
 Ha: μ ≠50 grams (alternative hypothesis: the mean weight is not 50 grams)
2. Select Significance Level:
 α=0.05
3. Choose a Test Statistic:
 Since the population standard deviation is unknown (only the sample standard
deviation of 2 grams is available), we use the t-statistic.
4. Determine the Sampling Distribution:
 Under H0, the sampling distribution of the sample mean is approximately normal due
to the Central Limit Theorem.
5. Calculate the Test Statistic:
 t = (x̄ − μ0) / (s / √n) = (51 − 50) / (2 / √30) ≈ 1 / 0.365 ≈ 2.74
6. Determine the Critical Value:
 For α=0.05 and a two-tailed test with df=29, the critical value from the t-distribution
table is approximately ±2.045.
7. Make a Decision:
 Since 2.74 > 2.045, we reject the null hypothesis.
8. Conclusion:
 There is sufficient evidence at the 0.05 significance level to conclude
that the mean weight of the product is not 50 grams.
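A minimal sketch of this one-sample t-test in Python (scipy is used for the critical value and p-value; the summary statistics above are used instead of raw data):

```python
from math import sqrt
from scipy import stats

mu0 = 50               # hypothesized population mean (grams)
x_bar, s, n = 51, 2, 30

# Test statistic
t_stat = (x_bar - mu0) / (s / sqrt(n))          # ≈ 2.74

# Two-tailed critical value and p-value with n - 1 = 29 degrees of freedom
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # ≈ 2.045
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(round(t_stat, 2), round(t_crit, 3), round(p_value, 4))
# Reject H0, since |t| > t_crit (equivalently, p_value < 0.05)
```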
In hypothesis testing, the null and alternative hypotheses form the basis of the
test. They are statements about a population parameter (such as a mean, proportion, or
variance) that we aim to test based on sample data.
Null Hypothesis (H0)
The null hypothesis is a statement that there is no effect, no difference, or no
relationship in the population, and it serves as the default assumption. It is
often a statement of equality (e.g., H0:μ=μ0, where μ is the population mean).

Alternative Hypothesis (Ha or H1)


The alternative hypothesis is a statement that contradicts the null hypothesis. It
indicates the presence of an effect, difference, or relationship. It is what we want to
prove.
The alternative hypothesis can take several forms:
 Two-tailed (non-directional): Suggests that the parameter is different from the null
value (e.g., Ha:μ ≠ μ0).
 One-tailed (directional): Suggests that the parameter is either greater than or less
than the null value (e.g., Ha:μ>μ0 or Ha:μ<μ0).
p-values in statistics
The p-value is a crucial concept in hypothesis testing in statistics. It helps to determine
the significance of the results obtained from a sample data set.
Definition of p-value
The p-value (or probability value) is the probability of obtaining test results at least as
extreme as the observed results, under the assumption that the null hypothesis is true.
It quantifies the evidence against the null hypothesis:
 Low p-value (< α): Strong evidence against the null hypothesis, leading to its
rejection.
 High p-value (> α): Weak evidence against the null hypothesis, leading to a failure to
reject it.
Steps Involving p-values in Hypothesis Testing
1. State the null and alternative hypotheses.
2. Choose a significance level α (commonly 0.05).
3. Compute the test statistic from the sample data.
4. Calculate the p-value corresponding to the observed test statistic.
5. Compare the p-value with α: reject H0 if p ≤ α; otherwise, fail to reject H0.
Confidence Intervals in Statistics
Confidence intervals (CIs) are a range of values used to estimate a population
parameter. The interval has an associated confidence level that quantifies the level of
confidence that the parameter lies within the interval. Common confidence levels
include 90%, 95%, and 99%.
Key Concepts
1. Point Estimate: A single value estimate of a population parameter (e.g., sample
mean).
2. Margin of Error (ME): The range within which the true population parameter is
expected to lie.
3. Confidence Level: The probability that the confidence interval contains the true
population parameter.
4. Interval Estimate: The range of values (point estimate ± margin of error)
representing the confidence interval.
Calculation of Confidence Intervals
Confidence Interval for the Mean
When the population standard deviation (σ) is known:
CI = x̄ ± z(α/2) · (σ / √n)
where x̄ is the sample mean, z(α/2) is the critical value from the standard normal
distribution (for example, 1.96 for a 95% confidence level), and n is the sample size.
When σ is unknown, the sample standard deviation s and the t-distribution with n − 1
degrees of freedom are used instead.
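A minimal sketch of the calculation in Python, with hypothetical values (x̄ = 51, σ = 2, n = 30 are assumed for illustration) and a 95% confidence level:

```python
from math import sqrt
from scipy import stats

x_bar, sigma, n = 51, 2, 30   # hypothetical sample mean, known sigma, sample size
confidence = 0.95

z = stats.norm.ppf(1 - (1 - confidence) / 2)   # ≈ 1.96
margin_of_error = z * sigma / sqrt(n)

ci_lower = x_bar - margin_of_error
ci_upper = x_bar + margin_of_error
print(round(ci_lower, 2), round(ci_upper, 2))  # ≈ (50.28, 51.72)
```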
Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA) is a statistical method used to compare the means of
three or more samples to determine if at least one sample mean is significantly
different from the others. It helps in assessing whether the observed differences
among sample means are due to actual differences in the population or merely due to
random variation.

Types of ANOVA

1. One-Way ANOVA: Tests for differences among means of three or more independent
(unrelated) groups.
2. Two-Way ANOVA: Tests for differences among means when there are two
independent variables, allowing for the study of interaction effects between the
variables.
3. Repeated Measures ANOVA: Used when the same subjects are used for each
treatment (i.e., repeated measurements).

Key Concepts

 Null Hypothesis (H0): Assumes that all group means are equal.
 Alternative Hypothesis (Ha): Assumes that at least one group mean is different.
 F-Statistic: The test statistic used in ANOVA, which is the ratio of the variance
between the groups to the variance within the groups.
 P-Value: Used to determine the significance of the results. If the p-value is less than
the significance level (usually 0.05), the null hypothesis is rejected.

One-Way ANOVA

Assumptions

1. Independence: Samples must be independent.


2. Normality: The data in each group should be approximately normally distributed.
3. Homogeneity of Variances: The variances among the groups should be
approximately equal.

Steps in One-Way ANOVA
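A minimal sketch of a one-way ANOVA in Python using scipy.stats.f_oneway, with three hypothetical groups assumed for illustration:

```python
from scipy import stats

# Hypothetical measurements for three independent groups (assumed for illustration)
group_a = [23, 25, 27, 22, 26]
group_b = [30, 28, 29, 31, 27]
group_c = [24, 26, 25, 28, 27]

# One-way ANOVA: F-statistic and p-value
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

print(round(f_stat, 2), round(p_value, 4))
# If p_value < 0.05, reject H0 and conclude at least one group mean differs
```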


Chi-Square test
The Chi-Square test is a statistical procedure for determining the difference
between observed and expected data. It can also be used to determine whether there
is an association between the categorical variables in our data. It helps to find out
whether a difference between two categorical variables is due to chance or to a
relationship between them.
Chi-Square Test Definition
A chi-square test is a statistical test that is used to compare observed and expected
results. The goal of this test is to identify whether a disparity between actual and
predicted data is due to chance or to a link between the variables under consideration.
As a result, the chi-square test is an ideal choice for aiding in our understanding and
interpretation of the connection between our two categorical variables.
A chi-square test or comparable nonparametric test is required to test a hypothesis
regarding the distribution of a categorical variable. Categorical variables, which
indicate categories such as animals or countries, can be nominal or ordinal. They
cannot have a normal distribution since they can only have a few particular values.
For example, a meal delivery firm in India wants to investigate the link between
gender, geography, and people's food preferences.
It is used to determine whether the difference between two categorical variables is:
 a result of chance, or
 due to a relationship between them.
The chi-square statistic is calculated as
χ²c = Σ (Oi − Ei)² / Ei
where
c = Degrees of freedom
O = Observed Value
E = Expected Value
The degrees of freedom in a statistical calculation represent the number of
variables that can vary in a calculation. The degrees of freedom can be calculated
to ensure that chi-square tests are statistically valid. These tests are frequently used
to compare observed data with data that would be expected to be obtained if a
particular hypothesis were true. The Observed values are those we gather
ourselves. The expected values are the frequencies expected based on the null
hypothesis.
A Chi-Square test is fundamentally a data analysis based on the observations of
a random set of variables. It computes how a model equates to actual observed
data. A Chi-Square statistic test is calculated based on the data, which must be raw,
random, drawn from independent variables, drawn from a wide-ranging sample and
mutually exclusive. In simple terms, two sets of statistical data are compared; for
instance, the observed results of tossing a coin are compared with the results expected
for a fair coin. Karl Pearson introduced this test in 1900 for categorical data analysis
and distribution. This test is also known as 'Pearson's Chi-Squared Test'.
Chi-Squared Tests are most commonly used in hypothesis testing. A hypothesis
is an assumption that any given condition might be true, which can be tested
afterwards. The Chi-Square test estimates the size of inconsistency between the
expected results and the actual results when the size of the sample and the number
of variables in the relationship is mentioned.
These tests use degrees of freedom to determine if a particular null hypothesis
can be rejected based on the total number of observations made in the experiments.
The larger the sample size, the more reliable the result.
There are two main types of Chi-Square tests namely:
Independence
Goodness-of-Fit
Independence
The Chi-Square Test of Independence is an inferential statistical test which
examines whether two sets of categorical variables are likely to be related to each
other or not. This test is used when we have counts of values for two nominal or
categorical variables and is considered a non-parametric test. A relatively large
sample size and independence of observations are the required criteria for
conducting this test.
Example:
In a movie theatre, suppose we made a list of movie genres. Let us consider
this as the first variable. The second variable is whether or not the people who
came to watch those genres of movies have bought snacks at the theatre. Here the
null hypothesis is that the genre of the film and whether people bought snacks or
not are unrelated. If this is true, the movie genres do not impact snack sales.
Goodness-Of-Fit
In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines
whether a variable is likely to come from a given distribution or not. We must have
a set of data values and the idea of the distribution of this data. We can use this test
when we have value counts for categorical variables. This test demonstrates a way
of deciding if the data values have a “good enough” fit for our idea or if it is a
representative sample data of the entire population.
Example:
Suppose we have bags of balls with five different colours in each bag. The
given condition is that the bag should contain an equal number of balls of each
colour. The idea we would like to test here is whether the proportions of the five
colours of balls in each bag are in fact equal.
Example
Let's say we want to know if gender has anything to do with political party
preference. We poll 440 voters in a simple random sample to find out which
political party they prefer. The results of the survey are shown in the table below:
To see if gender is linked to political party preference, perform a Chi-Square
test of independence using the steps below.
Step 1: Define the Hypothesis
H0: There is no link between gender and political party preference.
H1: There is a link between gender and political party preference.
Step 2: Calculate the Expected Values
Now we will calculate the expected frequency for each cell using
Expected frequency = (row total × column total) / grand total
Similarly, we can calculate the expected value for each of the cells.

Step 3: Calculate (O − E)² / E for Each Cell in the Table
Now we will calculate the (O − E)² / E for each cell in the table,
where
O = Observed Value
E = Expected Value
Step 4: Calculate the Test Statistic χ²
χ² is the sum of all the values in the last table
= 0.743 + 2.05 + 2.33 + 3.33 + 0.384 + 1
= 9.837
The degrees of freedom in this case are equal to the table's number of rows
minus one multiplied by the table's number of columns minus one, or (r − 1)(c − 1). We
have (3 − 1)(2 − 1) = 2.
Finally, we compare our obtained statistic to the critical statistic found in the
chi-square table. As we can see, for an alpha level of 0.05 and two degrees of
freedom, the critical statistic is 5.991, which is less than our obtained statistic of
9.837. Therefore, we can reject our null hypothesis since 9.837 is greater than 5.991.
Chi-Square Test in Statistics
The Chi-Square test is a statistical method used to determine if there is a significant
association between two categorical variables. It compares the observed frequencies in
each category of a contingency table with the expected frequencies that would be
observed if there were no association between the variables.
Types of Chi-Square Tests
1. Chi-Square Test for Independence: Used to determine if there is a significant
association between two categorical variables.
2. Chi-Square Goodness-of-Fit Test: Used to determine if sample data match an
expected distribution.
Chi-Square Test for Independence
Key Concepts
 Null Hypothesis (H0): Assumes no association between the variables (they are
independent).
 Alternative Hypothesis (Ha): Assumes there is an association between the variables
(they are not independent).
 Observed Frequencies (O): The actual count of cases in each category from the data.
 Expected Frequencies (E): The count of cases that would be expected in each
category if the null hypothesis were true.
Calculation Steps
1. Create a Contingency Table: This table shows the frequency distribution of the
variables.
2. Calculate Expected Frequencies: For each cell in the table, the expected frequency
is calculated using:
Eij = (Row Totali × Column Totalj) / Grand Total
where Eij is the expected frequency for cell (i,j).
3. Calculate the Chi-Square Statistic (χ²):
χ² = Σ (Oij − Eij)² / Eij
where Oij is the observed frequency and Eij is the expected frequency.
4. Determine Degrees of Freedom (df):
df=(Number of Rows−1)(Number of Columns−1)
5. Compare χ2 to the Critical Value: Use the chi-square distribution table to find the
critical value at the desired significance level (usually 0.05). If χ2 is greater than the
critical value, reject the null hypothesis.
Example:
Consider a study to determine if there is an association between gender (male, female)
and preference for a new product (like, dislike).

Problems Involving Chi-Square Tests


Chi-square tests are widely used in statistics to determine if there is a significant
association between categorical variables (Chi-Square Test for Independence) or if
sample data fit an expected distribution (Chi-Square Goodness-of-Fit Test). Here are
some example problems and solutions involving chi-square tests.
Example 1: Chi-Square Test for Independence
Problem: A researcher wants to know if there is an association between gender and
preference for a new product. The following data was collected from a survey of 100
people:
Assignment Problem:
A researcher wants to investigate whether there is a relationship between gender and
preferred mode of transportation among commuters in a city. They collect data from a
random sample of 200 commuters and record both their gender and preferred mode of
transportation. The data is summarized in the contingency table below:
          Car    Bus    Train    Total
Male       40     30      20       90
Female     20     50      40      110
Total      60     80      60      200
Perform a chi-square test to determine whether there is a significant association
between gender and preferred mode of transportation at a significance level of 0.05.
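A minimal sketch of this test in Python using scipy.stats.chi2_contingency, with the observed counts taken from the table above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = gender (Male, Female), columns = (Car, Bus, Train)
observed = np.array([[40, 30, 20],
                     [20, 50, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(round(chi2, 2), dof, round(p_value, 4))
print(expected)  # expected frequencies under independence
# If p_value < 0.05 (equivalently, chi2 > 5.991 for df = 2), reject H0 and
# conclude gender and preferred mode of transportation are associated.
```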
T-Test in Statistics
A t-test is a statistical test used to compare the means of two groups. It helps
determine if the differences between the groups are statistically significant. There are
different types of t-tests depending on the study design and data structure:
1. One-Sample T-Test: Compares the mean of a single group to a known value.
2. Independent Two-Sample T-Test: Compares the means of two independent groups.
3. Paired Sample T-Test: Compares the means of two related groups (e.g.,
measurements before and after a treatment on the same subjects).
Key Concepts
 Null Hypothesis (H0): Assumes no difference between the groups.
 Alternative Hypothesis (Ha): Assumes a difference between the groups.
 Test Statistic (t): A standardized value used to determine the probability of observing
the test results under the null hypothesis.
 Degrees of Freedom (df): Reflects the number of independent values in the data that
can vary.
 p-Value: The probability of obtaining a test statistic as extreme as, or more extreme
than, the observed value under the null hypothesis.
 Significance Level (α): The threshold for rejecting the null hypothesis, commonly set
at 0.05.
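A minimal sketch of an independent two-sample t-test in Python using scipy, with hypothetical measurements assumed for illustration:

```python
from scipy import stats

# Hypothetical measurements for two independent groups (assumed for illustration)
group_1 = [51, 52, 49, 50, 53, 51, 48, 52]
group_2 = [47, 49, 46, 48, 50, 47, 45, 48]

# Independent two-sample t-test (equal variances assumed)
t_stat, p_value = stats.ttest_ind(group_1, group_2)

print(round(t_stat, 2), round(p_value, 4))
# If p_value < 0.05, reject H0 and conclude the group means differ
```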
Summary
The t-test is a versatile tool for comparing means across different scenarios. By
understanding the types of t-tests and their appropriate applications, we can effectively
analyze data and draw meaningful conclusions.
Correlation in Statistics
Correlation is a statistical measure that describes the strength and direction of a
relationship between two variables. It quantifies how changes in one variable are
associated with changes in another variable.
Key Concepts
1. Correlation Coefficient (r):
 The correlation coefficient, typically denoted as r, ranges from -1 to 1.
 r=1: Perfect positive correlation (as one variable increases, the other increases
proportionally).
 r=−1: Perfect negative correlation (as one variable increases, the other decreases
proportionally).
 r=0: No correlation (no linear relationship between the variables).
2. Types of Correlation:
 Positive Correlation: Both variables increase or decrease together.
 Negative Correlation: One variable increases while the other decreases.
 No Correlation: No consistent pattern of relationship.
3. Correlation vs. Causation:
 Correlation does not imply causation. Two variables may be correlated without one
causing the other.
Pearson Correlation Coefficient
The Pearson correlation coefficient measures the linear relationship between two
continuous variables. It is calculated as
r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² · Σ(yi − ȳ)² ]
where x̄ and ȳ are the sample means of the two variables.
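A minimal sketch of the Pearson correlation in Python with hypothetical paired data (numpy's corrcoef is one common way to compute it):

```python
import numpy as np

# Hypothetical paired observations (assumed for illustration)
hours_studied = np.array([2, 4, 5, 7, 8, 10])
exam_score = np.array([55, 60, 62, 70, 75, 82])

# Pearson correlation coefficient from the correlation matrix
r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(round(r, 3))  # close to +1, indicating a strong positive linear relationship
```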
Covariance in Statistics
Covariance is a measure of the joint variability of two random variables. It indicates
the direction of the linear relationship between the variables. If the greater values of
one variable mainly correspond with the greater values of the other variable and the
lesser values correspond similarly, the covariance is positive. If greater values of one
variable mainly correspond with lesser values of the other, the covariance is negative.
Key Concepts
1. Covariance Calculation: The formula for the sample covariance between two variables X
and Y is:
Cov(X, Y) = Σ(xi − x̄)(yi − ȳ) / (n − 1)
where x̄ and ȳ are the sample means and n is the number of paired observations.
2. Interpretation:
 Positive Covariance: Indicates that as one variable increases, the other variable tends
to increase.
 Negative Covariance: Indicates that as one variable increases, the other variable
tends to decrease.
 Zero Covariance: Indicates no linear relationship between the variables.
3. Units of Covariance: The value of covariance is not standardized and is dependent on
the units of the variables. This can make it difficult to compare the strength of
relationships between different pairs of variables.
Examples:
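A minimal sketch in Python with hypothetical paired data (numpy's cov gives the sample covariance with the n − 1 divisor):

```python
import numpy as np

# Hypothetical paired observations (assumed for illustration)
temperature = np.array([20, 22, 25, 27, 30, 32])             # degrees Celsius
ice_cream_sales = np.array([150, 170, 200, 230, 270, 300])   # units sold

# Sample covariance (off-diagonal entry of the covariance matrix)
cov_xy = np.cov(temperature, ice_cream_sales)[0, 1]

# Correlation standardizes the covariance to the range [-1, 1]
corr_xy = np.corrcoef(temperature, ice_cream_sales)[0, 1]

print(round(cov_xy, 2), round(corr_xy, 3))
# Positive covariance: as temperature increases, sales tend to increase
```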
Covariance vs. Correlation
While covariance indicates the direction of the linear relationship between variables, it
is not standardized, making it difficult to interpret the strength of the relationship.
Correlation, on the other hand, standardizes the covariance by dividing it by the
product of the standard deviations of the variables, resulting in a dimensionless value
between -1 and 1. This makes correlation easier to interpret and compare across
different data sets.
Summary
Covariance is a fundamental measure in statistics that helps to understand the
relationship between two variables. It forms the basis for more advanced analyses like
correlation and regression. However, its interpretation is often less straightforward
than correlation due to its dependency on the units of the variables involved.
