Unit IV
Statistics plays a crucial role in data science, providing the foundation for making informed decisions and
drawing meaningful insights from data. Here are some key aspects of statistics in the context of data
science:
1. Descriptive Statistics:
Mean, Median, and Mode: Measures of central tendency that describe the center of a dataset.
Variance and Standard Deviation: Measures of dispersion that quantify the spread of data
points.
Percentiles and Quartiles: Divide the data into specified percentage intervals.
2. Inferential Statistics:
Hypothesis Testing: Evaluates a hypothesis about a population parameter based on a sample of
data.
Confidence Intervals: Provides a range of values within which the true population parameter is
likely to fall.
Regression Analysis: Examines the relationship between variables and makes predictions.
3. Probability:
Probability Distributions: Describes the likelihood of different outcomes in a random
experiment.
Bayesian Statistics: Incorporates prior knowledge to update probabilities as new data becomes
available.
4. Sampling Techniques:
Random Sampling: Ensures each member of the population has an equal chance of being
included in the sample.
Stratified Sampling: Divides the population into subgroups and then randomly samples from
each subgroup.
Cluster Sampling: Divides the population into clusters and randomly selects entire clusters for
the sample.
5. Statistical Models:
Linear Regression: Models the relationship between a dependent variable and one or more
independent variables.
Logistic Regression: Models the probability of a binary outcome.
Decision Trees, Random Forests, and Support Vector Machines: Machine learning algorithms
based on statistical principles.
6. Exploratory Data Analysis:
Box Plots, Histograms, and Scatter Plots: Visual representations of data to identify patterns and
outliers.
Correlation Analysis: Measures the strength and direction of relationships between variables.
7. Experimental Design:
Controlled Experiments: Ensures that changes in the dependent variable are due to the
manipulated independent variable.
Randomized Controlled Trials (RCTs): Randomly assigns subjects to different experimental
conditions to minimize bias.
8. Statistical Software:
Tools like R, Python with libraries like NumPy, SciPy, and pandas, as well as statistical packages
like SPSS or SAS, are commonly used for statistical analysis in data science.
In data science, statistical methods are employed to preprocess and clean data, explore relationships
between variables, build predictive models, and draw reliable conclusions from data. The combination
of statistical techniques and machine learning methods is often used to extract meaningful insights and
patterns from complex datasets.
Key Terminology in Data Science and Statistics
1. Data:
Raw Data: Unprocessed information collected for analysis.
Dataset: A collection of data, usually tabular, organized for analysis.
2. Variables:
Dependent Variable: The outcome or response variable being predicted or explained.
Independent Variable: The variable that is manipulated or used to predict the dependent
variable.
3. Descriptive Statistics:
Mean: The average value of a set of numbers.
Median: The middle value in a dataset when arranged in ascending or descending order.
Mode: The most frequently occurring value in a dataset.
Range: The difference between the maximum and minimum values in a dataset.
4. Inferential Statistics:
Population: The entire group of individuals or instances about whom the conclusions are drawn.
Sample: A subset of the population used to make inferences about the entire population.
5. Probability:
Probability: A measure of the likelihood of a particular event occurring.
Event: A specific outcome or result of an experiment.
6. Distribution:
Normal Distribution (Gaussian Distribution): A symmetrical bell-shaped distribution.
Skewness: A measure of the asymmetry of a distribution.
Kurtosis: A measure of the "tailedness" of a distribution.
7. Statistical Tests:
Hypothesis: A statement about a population parameter that is tested using statistical methods.
Null Hypothesis (H0): The hypothesis that there is no significant difference or effect.
Alternative Hypothesis (H1 or Ha): The hypothesis that there is a significant difference or effect.
8. Regression:
Linear Regression: A statistical method to model the relationship between a dependent variable
and one or more independent variables.
Coefficient: The value that represents the change in the dependent variable for a one-unit
change in the independent variable.
9. Machine Learning:
Supervised Learning: A type of machine learning where the algorithm is trained on a labeled
dataset.
Unsupervised Learning: A type of machine learning where the algorithm is not provided with
labeled output, and it discovers patterns on its own.
10. Resampling:
Bootstrapping: A statistical technique that involves sampling with replacement to estimate the
distribution of a statistic.
Cross-Validation: A technique to assess how well a model will generalize to an independent
dataset.
11. Bias and Variance:
Bias: The error introduced by approximating a real-world problem, which may be complex, by a
simplified model.
Variance: The amount by which a model's prediction can change for a different training dataset.
These terms provide a foundational understanding of the key concepts in data science and statistics. As
we delve deeper into the field, we'll encounter more specialized terminology and concepts.
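To make the descriptive terms above concrete, here is a minimal Python sketch using only the standard-library statistics module; the data values are invented for illustration:

import statistics

data = [4, 8, 6, 5, 3, 8, 9, 7, 8, 5]  # invented sample values

mean = statistics.mean(data)        # average: 6.3
median = statistics.median(data)    # middle of the sorted data: 6.5
mode = statistics.mode(data)        # most frequent value: 8
data_range = max(data) - min(data)  # maximum minus minimum: 6

print(f"mean={mean}, median={median}, mode={mode}, range={data_range}")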
Population
In the context of statistics and data science, a population refers to the entire group of individuals,
events, or observations about whom or which the researcher or analyst is interested in making
generalizations or drawing conclusions. The population is the larger set from which a sample is drawn
for study and analysis. Here are some key points related to the concept of population:
1. Population Parameters:
Parameter: A numerical value that summarizes a characteristic of the entire population.
Population Mean (μ): The average of all values in the population.
Population Standard Deviation (σ): A measure of the dispersion of values in the population.
2. Characteristics of a Population:
A population can be finite or infinite, depending on the context. For example, the population of
students in a school is finite, while the population of all potential customers for a product might
be considered infinite.
Populations can be homogeneous (similar characteristics) or heterogeneous (diverse
characteristics).
3. Sampling:
Due to practical limitations, it is often impractical or impossible to study an entire population.
Instead, researchers use sampling to study a subset, or sample, of the population.
Random sampling methods are commonly used to ensure that each member of the population
has an equal chance of being included in the sample.
4. Inference:
The goal of statistical inference is to draw conclusions about a population based on the analysis
of a sample.
Inferential statistics, such as confidence intervals and hypothesis testing, are used to make
predictions and inferences about population parameters.
5. Examples of Populations:
The population of all registered voters in a country.
The population of all households in a city.
The population of all measurements of a certain variable collected during a specific time period.
Understanding the population is fundamental in statistical analysis because it helps ensure the validity
and reliability of the conclusions drawn from a study or analysis. The choice of an appropriate sampling
method and the careful consideration of the characteristics of the population are critical steps in the
research process.
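When the population is small and finite, its parameters can be computed exactly rather than estimated. A minimal NumPy sketch (the values are invented; note ddof=0, which divides by N, the population size):

import numpy as np

# Ages of every student in a hypothetical ten-student class: the full population
population = np.array([12, 13, 13, 14, 14, 14, 15, 15, 16, 17])

mu = population.mean()          # population mean (μ)
sigma = population.std(ddof=0)  # population standard deviation (σ), divisor N

print(f"μ = {mu}, σ = {sigma:.3f}")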
Sample
In statistics and data science, a sample is a subset of a population that is selected for study and analysis.
The process of selecting a sample from a population is known as sampling. The goal of sampling is to
make inferences or draw conclusions about the entire population based on the characteristics observed
in the sample. Here are some key points related to samples:
1. Random Sampling:
Random Sample: A sample in which every member of the population has an equal chance of
being selected.
Simple Random Sampling: The most straightforward method of random sampling, where each
individual in the population has an equal chance of being chosen.
2. Types of Samples:
Stratified Sampling: Divides the population into subgroups (strata) based on certain
characteristics, and then random samples are taken from each stratum.
Cluster Sampling: Divides the population into clusters, randomly selects some clusters, and then
includes all members from the selected clusters in the sample.
Systematic Sampling: Selects every kth individual from a list after choosing a random starting
point.
3. Sample Size:
Sample Size (n): The number of observations or individuals included in a sample.
The appropriate sample size depends on the research question, the variability within the
population, and the desired level of precision.
4. Sampling Bias:
Sampling Bias: Occurs when certain members of the population are more or less likely to be
included in the sample, leading to an unrepresentative sample.
Efforts are made to minimize sampling bias to ensure that the sample accurately reflects the
characteristics of the population.
5. Representativeness:
A representative sample is one that accurately reflects the characteristics of the population
from which it is drawn.
The goal is to ensure that the sample is not biased and can be used to make valid inferences
about the population.
6. Inferential Statistics:
The results obtained from analyzing a sample are used to make inferences or predictions about
the population using inferential statistics.
Techniques such as confidence intervals and hypothesis testing are commonly employed for this
purpose.
7. Examples of Samples:
A survey conducted among a random sample of 500 households to understand consumer
preferences.
A clinical trial studying the effects of a new drug on a randomly selected group of patients.
A social media analysis based on a sample of 1,000 posts to infer trends and sentiments in a
larger online community.
Sampling is a critical step in the research process, and the quality of the sample can greatly impact the
validity of the conclusions drawn from the study. Careful consideration of the sampling method and
efforts to minimize bias contribute to the reliability of statistical analyses and generalizations to the
broader population.
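Each of the sampling schemes described above can be sketched in a few lines of Python. The population below is invented, and pandas' groupby(...).sample(...) is used as one convenient way to stratify:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population of 1,000 people with a gender attribute
pop = pd.DataFrame({
    "id": np.arange(1000),
    "gender": rng.choice(["M", "F"], size=1000),
})

# Simple random sampling: every member has an equal chance of selection
srs = pop.sample(n=100, random_state=0)

# Stratified sampling: draw 10% from each gender stratum
stratified = pop.groupby("gender").sample(frac=0.1, random_state=0)

# Systematic sampling: every k-th member after a random starting point
k = 10
start = int(rng.integers(k))
systematic = pop.iloc[start::k]

print(len(srs), len(stratified), len(systematic))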
Parameter
In statistics, a parameter is a numerical or characteristic measure that describes a certain aspect of a
population. Parameters are used to quantify and summarize the features or properties of an entire
population. Unlike statistics, which are values calculated from a sample and used to estimate population
parameters, parameters are fixed and specific to the population being studied. Here are some common
types of parameters:
1. Population Mean (μ):
The average of all values in the population.
2. Population Variance (σ²) and Standard Deviation (σ):
Variance measures the average squared deviation of each value from the population mean.
Standard deviation is the square root of the variance, providing a measure of the spread or
dispersion of the population values.
3. Population Median:
The middle value in a population when the data is arranged in ascending or descending order.
Parameters provide a summary of the entire population's characteristics and are often denoted by
Greek letters (e.g., μ for mean, σ for standard deviation). In practice, it's usually impossible to measure
parameters for an entire population, so researchers use statistical methods to estimate them from
samples. These sample estimates, known as statistics, are then used to make inferences about the
population parameters.
Understanding and estimating parameters are fundamental aspects of statistical analysis and are crucial
for drawing meaningful conclusions and making predictions about populations based on observed
sample data.
Estimator:
An estimator is a function or a statistical method used to calculate an estimate of a population
parameter.
Denoted by a specific symbol, an estimator is a formula or algorithm applied to sample data to
obtain a numerical value that is intended to be close to the true, unknown parameter.
Estimate:
An estimate is the specific numerical value calculated by an estimator based on observed
sample data.
It serves as the best guess or approximation for the true value of a population parameter.
Point Estimate:
A single value that is the best guess for the true value of a population parameter.
For example, the sample mean (x̄) is a point estimate for the population mean (μ).
Interval Estimate:
An estimate that provides a range of values within which the true value of a population
parameter is likely to fall.
Confidence intervals are common examples of interval estimates.
Bias:
Bias refers to the systematic error or deviation of an estimator from the true value of the
parameter.
An unbiased estimator, on average, provides estimates that are equal to the true parameter.
Efficiency:
Efficiency measures how well an estimator performs in terms of precision and variability.
An efficient estimator has a smaller variance, providing more precise estimates.
Consistency:
Consistency indicates that as the sample size increases, the estimator converges to the true
value of the parameter.
Consistent estimators are desirable for accurate inferences with larger sample sizes.
Examples of Estimators:
Sample Mean (x̄): An estimator for the population mean (μ).
Sample Variance (s²): An estimator for the population variance (σ²).
Sample Proportion (p̂): An estimator for the population proportion (π).
Estimators play a crucial role in statistical inference. When using sample data to estimate population
parameters, it's important to consider the properties of the estimator, such as bias, efficiency, and
consistency. Well-designed and unbiased estimators contribute to accurate and reliable statistical
analyses.
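A short simulation makes the bias property concrete: the sample variance with the n − 1 divisor is unbiased for σ², while the n divisor systematically underestimates it. A sketch with NumPy, using a population with known σ² = 4:

import numpy as np

rng = np.random.default_rng(42)

n_divisor, n_minus_1_divisor = [], []
for _ in range(20_000):
    sample = rng.normal(loc=0.0, scale=2.0, size=10)  # σ² = 4, sample size 10
    n_divisor.append(sample.var(ddof=0))              # divide by n: biased low
    n_minus_1_divisor.append(sample.var(ddof=1))      # divide by n - 1: unbiased

print("true σ² = 4")
print(f"average estimate, n divisor:     {np.mean(n_divisor):.3f}")          # about 3.6
print(f"average estimate, n - 1 divisor: {np.mean(n_minus_1_divisor):.3f}")  # about 4.0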
Sampling Distribution
The sampling distribution refers to the distribution of a statistic (such as the mean, variance, proportion,
etc.) calculated from multiple samples drawn from the same population. Understanding the properties
of the sampling distribution is crucial in statistical inference, as it allows researchers to make statements
about the precision and reliability of estimators. Here are key concepts related to the sampling
distribution:
1. Central Limit Theorem (CLT):
As the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the population distribution.
2. Standard Error:
The standard error of a statistic (e.g., standard error of the mean) is a measure of the variability of that statistic across different samples.
For the sample mean, the standard error SE(x̄) = σ/√n is the standard deviation of the population divided by the square root of the sample size (n).
Understanding the sampling distribution is essential for making statistical inferences and constructing
confidence intervals or conducting hypothesis tests. It helps researchers quantify the variability of
sample statistics and make statements about the precision of estimates. The Central Limit Theorem is
particularly valuable in this context, as it allows for the use of normal distribution properties, even when
dealing with non-normally distributed populations.
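The sampling distribution and the Central Limit Theorem are easy to see by simulation. The sketch below repeatedly samples from a deliberately non-normal (exponential) population and checks that the spread of the resulting sample means matches σ/√n:

import numpy as np

rng = np.random.default_rng(1)
n = 50  # sample size

# Exponential population: skewed, not normal (mean = 1, σ = 1)
sample_means = np.array([rng.exponential(scale=1.0, size=n).mean()
                         for _ in range(10_000)])

print(f"mean of sample means: {sample_means.mean():.3f} (population mean = 1)")
print(f"std of sample means:  {sample_means.std(ddof=1):.3f} (σ/√n = {1 / np.sqrt(n):.3f})")
# A histogram of sample_means would be approximately bell-shaped despite
# the skewed population, illustrating the Central Limit Theorem.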
Standard Error
The standard error is a measure of the variability or precision of a sample statistic, providing an estimate
of how much the sample statistic is expected to vary from the true population parameter. It is
particularly important when making inferences about a population based on a sample. The standard
error is closely related to the standard deviation, but it specifically refers to the variability of a sample
statistic.
The standard error is commonly used in the context of the sample mean (x̄), the sample proportion (p̂), or other sample statistics. Here are some key points:
1. Standard Error of the Mean:
SE(x̄) = σ/√n; when σ is unknown, it is estimated as s/√n using the sample standard deviation s.
2. Standard Error of a Proportion:
SE(p̂) = √( p̂(1 − p̂) / n ).
3. Interpretation:
A smaller standard error indicates less variability in the sample statistic and greater precision.
A larger standard error suggests greater variability and lower precision.
4. Use in Inference:
The standard error is crucial when constructing confidence intervals or conducting hypothesis
tests.
It is used to estimate the likely range within which the true population parameter lies.
The Central Limit Theorem is related to the standard error, stating that as the sample size
increases, the sampling distribution of the mean approaches a normal distribution with a mean
equal to the population mean and a standard deviation equal to the standard error.
The standard error is a key concept in statistics, providing information about the precision of sample
statistics and helping researchers make informed inferences about population parameters. It is a critical
component when estimating confidence intervals and conducting hypothesis tests.
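As a small worked sketch of the ideas above: compute SE(x̄) = s/√n from an invented sample and use it to form an approximate 95% confidence interval for the population mean:

import numpy as np

data = np.array([64.1, 65.3, 63.8, 66.0, 64.9, 65.5, 63.2, 64.7])  # invented heights (inches)

n = len(data)
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)  # standard error of the mean, s/√n

# Approximate 95% CI using the normal critical value 1.96
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f}, SE = {se:.3f}, 95% CI = ({low:.2f}, {high:.2f})")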
Properties of a Good Estimator
1. Unbiasedness:
An estimator is unbiased if, on average, it gives an estimate that is equal to the true population parameter. In mathematical terms, an estimator θ̂ is unbiased for a parameter θ if E(θ̂) = θ, where E denotes the expected value.
An unbiased estimator does not systematically overestimate or underestimate the true
parameter.
2. Efficiency:
An efficient estimator has a small standard error, providing precise estimates. The standard
error measures the variability of the estimator.
Among unbiased estimators, the one with the smallest variance is considered the most efficient.
3. Consistency:
A consistent estimator approaches the true parameter value as the sample size increases.
Formally, an unbiased estimator θ̂ is consistent for a parameter θ if lim (n→∞) Var(θ̂) = 0.
Consistency ensures that the estimator becomes increasingly accurate with larger sample sizes.
4. Sufficiency:
A sufficient statistic contains all the information in the sample needed to make inferences about
the population parameter. An estimator based on a sufficient statistic is considered more
efficient.
Sufficiency is a concept related to the reduction of data without losing information about the
parameter.
5. Invariance:
An estimator is invariant if its value does not change when the scale or location of the data is
altered. Invariance is a desirable property when dealing with transformations of the data.
For example, the sample mean is invariant under linear transformations of the data.
6. Asymptotic Normality:
Asymptotic normality refers to the property that, as the sample size increases, the sampling
distribution of the estimator approaches a normal distribution. This property is often associated
with the Central Limit Theorem.
It allows for the use of normal distribution properties in statistical inference.
7. Robustness:
A robust estimator is not greatly influenced by outliers or deviations from the assumed
distribution. Robustness ensures that the estimator performs well even when data deviates
from ideal conditions.
Robust estimators are particularly useful in the presence of non-normally distributed data.
8. Finite-Sample Efficiency:
Efficiency is not only an asymptotic concept; finite-sample efficiency considers the efficiency of
an estimator for a specific sample size.
An estimator may be efficient in large samples but less so in small samples.
A good estimator balances these properties based on the specific context and goals of the analysis.
Different estimators may be preferred under different circumstances, and the choice often involves
a trade-off between properties like bias and efficiency.
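Consistency (property 3 above) can be watched directly: as the sample size grows, the sample mean's error relative to the true mean shrinks. A minimal simulation sketch:

import numpy as np

rng = np.random.default_rng(7)
true_mean = 5.0

for n in (10, 100, 1_000, 10_000, 100_000):
    sample = rng.normal(loc=true_mean, scale=2.0, size=n)
    err = abs(sample.mean() - true_mean)
    print(f"n = {n:>6}: sample mean = {sample.mean():.4f}, error = {err:.4f}")
# The error tends toward 0 as n increases: the sample mean is a
# consistent estimator of the population mean.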
Measures of Centers
Measures of central tendency are statistical measures that describe the center or average of a set of
data values. They provide a single representative value that summarizes the entire dataset. The three
most common measures of central tendency are the mean, median, and mode:
1. Mean:
The mean, also known as the average, is calculated by summing up all the values in a dataset
and dividing by the number of observations.
Formula: Mean = (Σ xᵢ) / n, where the xᵢ are the individual values and n is the number of observations.
The mean is sensitive to extreme values and outliers.
2. Median:
The median is the middle value of a dataset when it is arranged in ascending or descending
order. If there is an even number of observations, the median is the average of the two middle
values.
For an odd number of observations: Median = Middle Value
The median is less affected by extreme values than the mean and is often used when the data is
skewed.
3. Mode:
The mode is the value that occurs most frequently in a dataset.
A dataset may have no mode, one mode (unimodal), or multiple modes (multimodal).
The mode is suitable for categorical and discrete data, but it can also be used for continuous
data.
Each measure of central tendency has its strengths and is appropriate in different situations.
It's often advisable to use a combination of these measures and consider the characteristics of the data
when interpreting the central tendency. Additionally, measures like the weighted mean or geometric
mean may be used in specific contexts.
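The outlier sensitivity noted above is easy to demonstrate: adding one extreme value shifts the mean dramatically while barely moving the median. A short sketch (the salary figures are invented):

import statistics

salaries = [40_000, 42_000, 45_000, 47_000, 50_000]
with_outlier = salaries + [1_000_000]  # one extreme value

print(statistics.mean(salaries), statistics.median(salaries))          # 44800 45000
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 204000 46000.0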
Measures of Spread
Measures of spread, also known as measures of dispersion or variability, quantify the extent to which
data points in a dataset deviate from the central tendency (such as the mean, median, or mode). These
measures provide insights into the degree of variability or scatter within the data. Common measures of
spread include:
1. Range:
The range is the simplest measure of spread and is calculated as the difference between the
maximum and minimum values in a dataset.
Range = Maximum Value - Minimum Value
While easy to calculate, the range is sensitive to extreme values and may not provide a robust
measure of variability.
2. Interquartile Range (IQR):
The interquartile range is a measure of the spread of the middle 50% of the data. It is less
sensitive to outliers than the range.
IQR = Q3 (upper quartile) - Q1 (lower quartile)
Quartiles divide the dataset into four equal parts, and IQR focuses on the variability within the
central portion of the data.
3. Variance:
Variance measures the average squared deviation of each data point from the mean. It is a
comprehensive measure of the spread.
Sample Variance: s² = Σ(xᵢ − x̄)² / (n − 1)
Population Variance: σ² = Σ(xᵢ − μ)² / N
Squaring the deviations emphasizes larger deviations and can be affected by outliers.
4. Standard Deviation:
The standard deviation is the square root of the variance. It provides a measure of spread in the
same units as the original data.
Sample Standard Deviation: s = √[ Σ(xᵢ − x̄)² / (n − 1) ]
Population Standard Deviation: σ = √[ Σ(xᵢ − μ)² / N ]
Like variance, the standard deviation is sensitive to outliers.
The choice of a particular measure of spread depends on the characteristics of the data and the specific
goals of the analysis. For example, the interquartile range is robust against outliers, while the standard
deviation provides a commonly used and interpretable measure of variability. Researchers often use a
combination of these measures to gain a comprehensive understanding of data spread.
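All four spread measures can be computed in a few NumPy lines; the data values are invented:

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

value_range = data.max() - data.min()  # range: 7
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                          # interquartile range
variance = data.var(ddof=1)            # sample variance (n - 1 divisor)
std_dev = data.std(ddof=1)             # sample standard deviation

print(f"range={value_range}, IQR={iqr}, variance={variance:.3f}, std={std_dev:.3f}")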
Probability with Examples
Probability is a measure of the likelihood that a particular event will occur. It is expressed as a number
between 0 and 1, where 0 indicates an impossible event, 1 indicates a certain event, and values in
between represent varying degrees of likelihood. Here are some key concepts and examples related to
probability:
1. Probability Basics:
The probability of an event A, written P(A), is calculated as the ratio of the number of favorable outcomes n(A) to the total number of possible outcomes n(S):
P(A) = n(A) / n(S)
Probability ranges from 0 (impossible event) to 1 (certain event).
2. Examples of Probability:
Coin Toss:
o Probability of getting heads in a fair coin toss: P(Heads) = 1/2 = 0.5
Dice Roll:
o Probability of rolling a 6 on a fair six-sided die: P(6) = 1/6 ≈ 0.167
Deck of Cards:
o Probability of drawing an Ace from a standard deck of 52 cards: P(Ace) = 4/52 = 1/13 ≈ 0.077
Weather Forecast:
o Probability of rain tomorrow based on a weather forecast, for example P(Rain) = 0.3 (a 30% chance).
3. Complementary Probability:
The probability of the complement of an event A, denoted P(A′), is equal to 1 minus the probability of the event A: P(A′) = 1 − P(A)
4. Joint Probability:
For two independent events A and B, the joint probability P(A ∩ B) is the probability that both events occur:
P(A ∩ B) = P(A) × P(B)
5. Conditional Probability:
Conditional probability is the probability of one event given that another event has occurred. It is denoted P(A|B), the probability of event A given that event B has occurred:
P(A|B) = P(A ∩ B) / P(B)
6. Addition Rule:
For two mutually exclusive events A and B, the probability that either event occurs is the sum of their probabilities:
P(A ∪ B) = P(A) + P(B)
These examples illustrate fundamental concepts in probability. Probability theory is widely applied in
various fields, including statistics, machine learning, finance, and decision-making. It provides a formal
framework for reasoning about uncertainty and making predictions based on available information.
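These rules reduce to simple arithmetic, as in this small Python sketch covering the complement, joint, and conditional cases (the face-card example is added here purely for illustration):

# Complement rule: drawing an ace from a standard 52-card deck
p_ace = 4 / 52
p_not_ace = 1 - p_ace  # about 0.923

# Joint probability of independent events: heads AND rolling a six
p_heads_and_six = (1 / 2) * (1 / 6)  # = 1/12

# Conditional probability: P(king | face card), face cards being J, Q, K
p_face = 12 / 52
p_king_and_face = 4 / 52  # every king is a face card
p_king_given_face = p_king_and_face / p_face  # = 1/3

print(p_ace, p_not_ace, p_heads_and_six, p_king_given_face)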
Normal Distribution
The normal (Gaussian) distribution is a continuous probability distribution with the following key properties:
1. The mean (average), median, and mode are all equal and located at the center of the distribution.
2. The distribution is symmetric around the mean.
3. The standard deviation controls the spread or width of the distribution.
The probability density function (PDF) of a normal distribution is given by the formula:
f(x | μ, σ) = (1 / (σ√(2π))) × e^(−(x − μ)² / (2σ²))
where:
x is the variable,
μ is the mean,
σ is the standard deviation.
Here are a few examples of scenarios where the normal distribution is commonly observed:
1. Height of a Population:
Human height tends to follow a normal distribution. For example, in a population, most
individuals will have an average height around the mean, with fewer individuals being
exceptionally tall or short.
2. IQ Scores:
Intelligence Quotient (IQ) scores are often modeled as a normal distribution with a mean of 100
and a standard deviation of 15. Most people fall within one standard deviation of the mean.
3. Measurement Errors:
Measurement errors in scientific experiments often follow a normal distribution. This is
particularly relevant when precise measurements are involved.
4. Exam Scores:
In large-scale exams, scores are often normally distributed. The bulk of the scores cluster around
the mean, with fewer scores at the extremes.
5. Financial Returns:
Daily or monthly financial returns of stocks are often assumed to be normally distributed. This
assumption is used in financial models and risk analysis.
6. Blood Pressure:
Blood pressure in a population can be modeled as a normal distribution. Most people will have
blood pressure close to the mean, with fewer individuals having extremely high or low blood
pressure.
7. Errors in Manufacturing:
Variability in manufacturing processes, such as the length of a manufactured component, may
follow a normal distribution.
It's important to note that while many real-world phenomena exhibit a roughly normal distribution, not
all do. Some distributions may deviate from normality, and understanding the characteristics of the data
is crucial for accurate modeling and analysis. The normal distribution is nonetheless a powerful and
widely used concept in statistics and probability theory.
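The IQ example above (mean 100, standard deviation 15) can be explored with scipy.stats.norm, assuming SciPy is available. A minimal sketch:

from scipy import stats

iq = stats.norm(loc=100, scale=15)  # IQ ~ N(100, 15²)

print(iq.pdf(100))    # density at the mean, about 0.0266
print(iq.cdf(115))    # P(IQ <= 115), one SD above the mean, about 0.841
print(iq.sf(130))     # P(IQ > 130), two SDs above, about 0.0228
print(iq.ppf(0.975))  # 97.5th percentile, about 129.4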
Binomial Distribution:
The binomial distribution describes the number of successes in a fixed number of independent and
identical Bernoulli trials, where each trial has only two possible outcomes: success or failure. The
probability of success is denoted by p, and the probability of failure is 1 − p. The probability of exactly k successes in n trials is given by:
P(X = k) = C(n, k) × p^k × (1 − p)^(n − k)
where n is the number of trials, k is the number of successes, and C(n, k) = n! / (k! (n − k)!) is the number of ways to choose k successes from n trials. Here are some examples:
1. Coin Flips:
Example: Tossing a fair coin (p=0.5) 5 times. The probability of getting exactly 3 heads (k=3) is given by P(X=3) = C(5, 3) × (0.5)³ × (0.5)² = 0.3125.
2. Exam Questions:
Example: In a multiple-choice exam with 4 options per question (p=0.25), if a student guesses the answers for 8 questions (n=8), the probability of getting exactly 5 correct answers (k=5) is given by P(X=5) = C(8, 5) × (0.25)⁵ × (0.75)³ ≈ 0.023.
3. Defective Items in a Batch:
Example: In a manufacturing process, if 10% of items are defective (p=0.1), the probability of finding exactly 2 defective items (k=2) in a sample of 8 items (n=8) is given by P(X=2) = C(8, 2) × (0.1)² × (0.9)⁶ ≈ 0.149.
4. Customer Purchases:
Example: In an online store, if the probability of a customer making a purchase is 0.3 (p=0.3), the probability of having exactly 4 purchases (k=4) in a sample of 10 customers (n=10) is given by P(X=4) = C(10, 4) × (0.3)⁴ × (0.7)⁶ ≈ 0.200.
The binomial distribution is widely applicable in various real-world scenarios where there are a fixed
number of independent trials, each with two possible outcomes.
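These calculations can be verified with scipy.stats.binom (assuming SciPy is available); the sketch below reproduces each of the examples above:

from scipy import stats

print(stats.binom.pmf(k=3, n=5, p=0.5))   # coin flips: P(X = 3) = 0.3125
print(stats.binom.pmf(k=5, n=8, p=0.25))  # exam guessing: P(X = 5), about 0.023
print(stats.binom.pmf(k=2, n=8, p=0.1))   # defective items: P(X = 2), about 0.149
print(stats.binom.pmf(k=4, n=10, p=0.3))  # purchases: P(X = 4), about 0.200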
Hypothesis Testing
Hypothesis testing is a structured procedure for using sample data to evaluate a claim about a population parameter. The typical steps are:
1. Formulate Hypotheses:
Null Hypothesis (H0): A statement of no effect or no difference.
Alternative Hypothesis (H1 or Ha): A statement suggesting an effect or a difference.
Example:
H0: The population mean height is 65 inches (μ = 65). Ha: The population mean height is not 65 inches (μ ≠ 65).
2. Choose Significance Level (α):
Select a significance level for the test (common choices include 0.05 and 0.01).
Example:
Choose α=0.05.
3. Compute the Test Statistic:
Calculate the appropriate test statistic (e.g., a t-statistic) from the sample data.
Example:
Suppose a sample of 30 individuals has a mean height of 64.2 inches and a standard deviation of 2.5 inches. Then t = (64.2 − 65) / (2.5/√30) ≈ −1.75.
4. Determine the Critical Region:
Identify the critical region(s) in the distribution of the test statistic. This is the region where we
would reject the null hypothesis.
Example:
For a two-tailed test at α=0.05, the critical region would be the extreme tails of the distribution.
5. Make a Decision:
Compare the calculated test statistic to the critical value(s). If the test statistic falls in the critical
region, reject the null hypothesis; otherwise, do not reject the null hypothesis.
Example:
If the calculated test statistic falls in the tails beyond the critical values, reject H0 and conclude
that the mean height is not 65 inches.
6. Draw a Conclusion:
Based on the decision, draw a conclusion about the population parameter.
Example:
Conclude whether there is enough evidence to suggest that the population mean height is
different from 65 inches.
7. Interpret Results:
Interpret the results in the context of the specific problem and make any necessary
recommendations or conclusions.
Example:
Provide practical implications of the findings in terms of the mean height of the population.
It's important to note that hypothesis testing involves the risk of making a Type I error (rejecting a true
null hypothesis) or a Type II error (failing to reject a false null hypothesis). The choice of the significance
level and the power of the test influence these error rates.
The example provided is a simplified illustration, and the specific test and parameters would depend on
the nature of the data and the research question. Common statistical tests include t-tests, chi-square
tests, ANOVA, and regression analysis.
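As a numerical companion to the height example: since only summary statistics are given (x̄ = 64.2, s = 2.5, n = 30), the sketch below computes the t-statistic directly and gets a two-tailed p-value from SciPy's t distribution:

import math
from scipy import stats

x_bar, s, n = 64.2, 2.5, 30  # sample mean, std deviation, size (from the example)
mu_0 = 65.0                  # hypothesized population mean (H0: μ = 65)

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))     # about -1.753
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-tailed p-value, about 0.090

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# Since p > 0.05, this particular sample would fail to reject H0 at the 5% level.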
Chi-Square Test
The chi-square test of independence examines whether two categorical variables are associated. The steps are:
1. Formulate Hypotheses:
Null Hypothesis (H0): There is no significant association between the two categorical variables.
Alternative Hypothesis (H1 or Ha): There is a significant association between the two categorical
variables.
Example:
H0: There is no significant association between gender and smoking status. Ha: There is a significant association between gender and smoking status.
2. Collect Data in a Contingency Table:
Organize the observed frequencies of the two variables in a table.
Example:
         Non-Smoker   Smoker
Male     200          150
Female   250          100
3. Choose Significance Level (α):
Choose a significance level (α) for the test (common choices include 0.05, 0.01).
Example:
Choose α=0.05.
4. Compute Expected Frequencies:
For each cell, the expected frequency is (row total × column total) / grand total.
Example:
For the example, the expected frequency for the "Male - Non-Smoker" cell would be (350 × 450) / 700 = 225, and similarly for other cells.
5. Calculate the Chi-Square Statistic:
Calculate the chi-square test statistic from the observed (O) and expected (E) frequencies: χ² = Σ (O − E)² / E.
6. Determine Degrees of Freedom:
df = (number of rows − 1) × (number of columns − 1).
Example:
For the example, with 2 rows and 2 columns, the degrees of freedom would be (2 − 1) × (2 − 1) = 1.
7. Find the Critical Value:
Look up the critical chi-square value for the chosen α and degrees of freedom.
Example:
For df = 1 and α = 0.05, the critical chi-square value is 3.841.
8. Make a Decision:
If the test statistic is greater than the critical value, or the p-value is less than α, reject the null hypothesis.
Example:
If the calculated chi-square statistic is greater than 3.841, reject the null hypothesis.
9. Draw a Conclusion:
Conclude whether there is enough evidence to suggest a significant association between the
two categorical variables.
Example:
Conclude whether there is a significant association between gender and smoking status based
on the test results.
Chi-square tests are widely used in various fields, including social sciences, medicine, and market
research, to examine associations between categorical variables. It's important to note that the chi-
square test assumes that the data is categorical and the observations are independent.
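The entire gender-smoking example can be run with scipy.stats.chi2_contingency, which returns the test statistic, p-value, degrees of freedom, and expected frequencies in one call (for 2x2 tables it applies Yates' continuity correction by default):

import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table from the example (rows: Male, Female)
observed = np.array([[200, 150],
                     [250, 100]])

chi2, p, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, p = {p:.5f}, dof = {dof}")
print(expected)  # [[225. 125.] [225. 125.]], matching the hand calculation above
# p is well below 0.05, so we reject H0: gender and smoking status appear associated.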