Statistical Concepts Unit-2 DA
Unit - 2
Data Exploration:
Data exploration refers to the initial step in data analysis, in which data analysts use data
visualization and statistical techniques to describe dataset characteristics, such as size,
quantity, and accuracy, in order to better understand the nature of the data.
Raw data is typically reviewed with a combination of manual workflows and automated
data-exploration techniques to visually explore data sets, look for similarities, patterns,
and outliers, and to identify the relationships between different variables.
This is also sometimes referred to as exploratory data analysis, which is a statistical
technique employed to analyze raw data sets in search of their broad characteristics.
Exploratory Data Analysis
•Raw data are not very informative. Exploratory Data Analysis (EDA) is how we make sense of the data by
converting them from their raw form to a more informative one.
Probability Sampling
• Any element can be chosen randomly from the population. It deals with choosing the sample randomly.
• The most critical requirement of probability sampling is that everyone in your population has a known and equal chance of getting selected.
• Ex. When an unbiased coin is tossed (randomly), the probability of getting a head is 1/2.
• Ex. The probability of getting a particular number, say 6, when a die is thrown is 1/6.
Non-Probability Sampling
• Every element is chosen on subjective judgment (purposefully/intentionally) from the population on the basis of certain past experience and knowledge rather than random selection.
• A sampling process where individual elements in the population may not have an opportunity to be chosen as part of the sample.
• For example, one person could have a 10% chance of being selected and another person could have a 50% chance.
Probability Sampling
Simple Random Sampling
Systematic Sampling
Stratified Sampling
Cluster sampling
Simple Random Sampling
Example: Suppose we would like to select 10 students from a class consisting of 75
students. Write the roll number of each student on a separate chit, put the chits in a
container, and draw 10 chits from the container one by one randomly. Here the probability
of selection is 1/75 on each draw.
Advantage: Every element has an equal chance of getting selected to be part of the
sample.
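The chit-drawing procedure above can be sketched with Python's `random.sample`, which draws without replacement; the roll numbers and sample size mirror the example:

```python
import random

# Roll numbers 1..75, one per student (one chit each in the container).
roll_numbers = list(range(1, 76))

# Draw 10 chits without replacement; every student has the same
# chance of ending up in the sample.
random.seed(42)  # fixed seed only so the run is reproducible
sample = random.sample(roll_numbers, k=10)

print(sorted(sample))
print(len(sample))  # 10
```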
Systematic Sampling
Each member of the sample comes
after an equal interval from its
previous member.
All the elements are put together in
a sequence first, where each
element has an equal chance of
being selected.
Select a random starting point and
then select the individual at regular
intervals
Example: Suppose we would like to select 10 students from a class consisting of
75 students. Choose a random starting roll number, then select every 5th student.
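A minimal sketch of systematic sampling; note that here the interval is computed as population size divided by sample size (75 // 10 = 7), rather than the fixed 5 of the example:

```python
import random

population = list(range(1, 76))          # roll numbers 1..75
n_samples = 10
interval = len(population) // n_samples  # 75 // 10 = 7

random.seed(0)
start = random.randrange(interval)       # random starting point in [0, interval)
# Take every `interval`-th element from the starting point.
sample = population[start::interval][:n_samples]

print(sample)
```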
Stratified Sampling
The population is first divided into homogeneous subgroups (strata), and elements are chosen from each stratum.
Example: Suppose we would like to select some students from a class
consisting of 75 students. The students will be divided into groups of boys and
girls. Then some students will be chosen from the boys and some from the girls.
Advantage: Members of each category or group will be chosen without any bias.
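A small sketch of stratified sampling, assuming a hypothetical split of 40 boys and 35 girls and sampling roughly 20% from each stratum:

```python
import random

# Hypothetical class of 75 students: 40 boys, 35 girls (the strata).
boys = [f"B{i}" for i in range(1, 41)]
girls = [f"G{i}" for i in range(1, 36)]

random.seed(1)
# Sample from each stratum separately: 8 boys + 7 girls (about 20% of each).
sample = random.sample(boys, 8) + random.sample(girls, 7)

print(len(sample))  # 15
```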
Cluster Sampling
From the big population, choose a small
group by dividing it into clusters/sections,
e.g., area-wise.
The clusters are randomly selected.
All the elements of the cluster are used for
sampling.
Example: Suppose we would like to know the awareness about COVID in a city.
Instead of going the details survey of the entire city one can divide the city into
clusters and randomly choose a cluster from that. All the members of the cluster
will be considered.
Example: An airline company wants to survey its customers one day, so they
randomly select a few of that day's flights (the clusters) and survey every passenger on them.
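Cluster sampling can be sketched as follows; the ward names and sizes are hypothetical:

```python
import random

# Hypothetical city split into 10 clusters (wards) of 20 residents each.
clusters = {f"ward_{w}": [f"resident_{w}_{i}" for i in range(20)]
            for w in range(10)}

random.seed(2)
# Randomly select 2 whole clusters; every member of a chosen cluster is surveyed.
chosen = random.sample(list(clusters), 2)
sample = [person for ward in chosen for person in clusters[ward]]

print(len(sample))  # 40
```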
Multi Stage Sampling
Population is divided into multiple clusters and then these clusters are further
divided and grouped into various sub groups (strata) based on similarity.
One or more clusters can be randomly selected from each stratum.
This process continues until the cluster can’t be divided anymore.
Example: A country can be divided into states, cities, and urban and rural areas, and all
the areas with similar characteristics can be merged together to form a stratum.
Non-Probability Sampling
Every element will be chosen purposefully/intentionally from the population on the basis of certain past
experience and knowledge.
It is a less stringent method.
This sampling method depends heavily on the expertise of the researchers.
It is carried out by observation, and researchers use it widely for qualitative research.
Mainly classified into
Quota Sampling
Purposive Sampling/Judgemental Sampling
Convenience Sampling
Referral / Snowball Sampling
Quota Sampling
Quota sampling works by first dividing the selected population into exclusive subgroups.
The proportions of each subgroup are measured, and the ratio of selected subgroups are then
used in the final sampling process.
The proportions of the selected subgroups are used as boundaries for selecting a sample population of
proportionally represented subgroups.
There are two types of quota sampling:
proportional
non-proportional
Proportional Quota Sampling
In proportional quota sampling you want to represent the major characteristics of the population by sampling a
proportional amount of each.
The problem here is that you have to decide the specific characteristics on which you will base the quota.
Will it be by gender, age, education, race, religion, etc.?
For example, if you know the population has 40% women and 60% men, and that you want a total sample
size of 100, you will continue sampling until you get those percentages and then you will stop. So, if you’ve
already got the 40 women for your sample, but not the sixty men, you will continue to sample men but even
if legitimate women respondents come along, you will not sample them because you have already “met
your quota.”
Non-Proportional Quota Sampling
Use when it is important to ensure that a number of sub-groups in the field of study are well-covered.
Use when you want to compare results across sub-groups.
Use when there is likely to be a wide variation in the studied characteristic within minority groups.
Identify sub-groups from which you want to ensure sufficient coverage. Specify a minimum sample size
from each sub-group.
Here, you’re not concerned with having numbers that match the proportions in the population.
Instead, you simply want to have enough to assure that you will be able to talk about even small groups
in the population.
Example:A study of the prosperity of ethnic groups across a city, specifies that a minimum of 50 people in ten
named groups must be included in the study. The distribution of incomes across each ethnic group is then
compared against one another.
Purposive Sampling/Judgemental
Sampling
Samples are chosen only on the basis of the researcher’s
knowledge and judgement.
It enables the researcher to select cases that will best
enable him to answer his research questions and meet the
objective.
A sample is chosen because it serves a certain
purpose.
Example-1: In online live voting for selecting a GOOD Singer from a competition, the people
who have interest in singing can be selected in the sample .
Example-2: If we want to understand the thought process of the people who are interested in pursuing
master’s degree then the selection criteria would be “Are you interested for Masters in..?”
All the people who respond with a “No” will be excluded from our sample.
Convenience Sampling
Convenience sampling (also called accidental sampling or grab sampling) is where you include people
who are easy to reach.
Samples are taken mainly on the basis of what is readily available.
A sample which is convenient to the researcher or the data analyst can be chosen. The task is done without
any principles or theories.
For example, you could survey people from:
Your workplace,
Your school,
A club you belong to,
The local mall.
Example: Suppose I would like to select 5 students from a class consisting of
75 students. I choose the 5 students who sit near me, without any principle of selection.
Referral / Snowball Sampling
Snowball sampling method is purely based on referrals and
that is how a researcher is able to generate a sample.
So the researcher will take help from the first element
which he selects from the population and ask him to
recommend others who will fit the description of the
sample needed.
This referral technique goes on, increasing the size of
the sample like a snowball.
Example: If you are studying the level of customer satisfaction among the members of an elite country club,
you will find it extremely difficult to collect primary data sources unless a member of the club agrees to have
a direct conversation with you and provides the contact details of the other members of the club.
Sampling Errors
Sampling error is a statistical error that occurs when an analyst does not select a
sample that represents the entire population of data.
The results found in the sample thus do not represent the results that
would be obtained from the entire population.
Sampling error can be reduced by randomizing sample selection and by
increasing the number of observations.
It mainly happens when the sample size is very small (10 to 100).
For example, suppose you wanted to figure out how many people out of a thousand
were under 18, and you came up with the figure 19.357%. If the actual percentage
equals 19.300%, the difference (19.357 − 19.300) of 0.057 is the sampling error.
Formula: the formula for the margin of error is 1/√n, where n is the size of the
sample. For example, a random sample of 1,000 has a margin of error of about
1/√1000 ≈ 3.2%, or roughly 3%. If you continued to take samples of 1,000 people,
you would probably get slightly different statistics, 19.1%, 18.9%, 19.5%, etc., but they
would all be around the same figure. This is one of the reasons that you’ll often see
sample sizes of 1,000 or 1,500 in surveys: they produce a very acceptable margin
of error of about 3%.
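The 1/√n rule of thumb is easy to check numerically:

```python
import math

def margin_of_error(n):
    """Rough margin of error for a sample of size n (the 1/sqrt(n) rule)."""
    return 1 / math.sqrt(n)

print(round(margin_of_error(1000), 3))   # ~0.032, i.e. about 3.2%
print(round(margin_of_error(1500), 3))   # ~0.026, i.e. about 2.6%
```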
Five Common Types of Sampling
Errors
Population Specification Error—This error occurs when the researcher does not understand who
they should survey.
Sample Frame Error—A frame error occurs when the wrong sub-population is used to select a
sample.
Selection Error—This occurs when respondents self-select their participation in the study –
only those that are interested respond. Selection error can be controlled by going extra lengths
to get participation.
Non-Response—Non-response errors occur when respondents differ from those who do
not respond. This may occur because either the potential respondent was not contacted or they
refused to respond.
Sampling Errors—These errors occur because of variation in the number or
representativeness of the sample that responds. Sampling errors can be controlled by (1)
careful sample designs, (2) large samples, and (3) multiple contacts to assure representative
response.
Data Sets, Variables, and
Observations
• A data set is usually a rectangular array of data, with variables in columns and observations in
rows.
• A variable (or field or attribute) is a characteristic of members of a population, such as
height, gender, or salary.
• An observation (or case or record) is a list of all variable values for a single member of a
population.
The data set includes observations on 10 people who
responded to a questionnaire on the president’s
environmental policies.
Variables include age, gender, state, children, salary,
and opinion.
Include a row that lists the variable names; each
subsequent row is one observation.
Data Types
• Variables (or attributes, dimensions, features)
•A variable is a characteristic that can be measured and that can assume
different values. (Height, age, income, province or country of birth, grades
obtained at college and type of housing are all examples of variables.)
• Types of variables
•Variables may be classified into two main categories:
• Categorical (Qualitative)
• (A categorical variable (also called a qualitative variable) refers to a characteristic
that cannot be quantified.)
• Numeric (Quantitative)
• (A variable is numeric if meaningful arithmetic can be performed on it.)
A variable is numerical if meaningful arithmetic can be performed on it.
Otherwise, the variable is categorical.
Categorical (Qualitative)
Variable/Attribute
• Nominal: Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things
for example,
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• The values do not have any meaningful order about them.
• Binary: Nominal attribute with only 2 states (0 and 1), where 0 typically means that the attribute is absent, and 1
means that it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false.
• Symmetric binary: both outcomes equally important, e.g., gender
• Asymmetric binary: outcomes not equally important, e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal: Values have a meaningful order (ranking) but magnitude between successive values is not known.
• Size = {small, medium, large}, grades, army rankings
• Other examples of ordinal attributes include Grade (e.g., A+, A, A−, B+, and so on) and
• Professional rank. Professional ranks can be enumerated in a sequential order, such as assistant, associate, and
full for professors.
Categorical (Qualitative)
Variable/Attribute (cont..)
• Categorical variables can be coded numerically.
• Gender has not been coded, whereas Opinion
has been coded.
• This is largely a matter of taste; coding a
categorical variable does not make it
numerical and appropriate for
arithmetic operations.
• Now Opinion has been replaced by text, and
Gender has been coded as 1 for males and 0
for females.
• This 0−1 coding for a categorical variable is
very common. Such a variable is called a
dummy variable; it often simplifies the analysis.
A dummy variable is a 0−1 coded variable for a specific category.
It is coded as 1 for all observations in that category and 0 for all
observations not in that category.
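The 0−1 dummy coding described above can be sketched in a few lines (the records are hypothetical; pandas users would typically reach for `pd.get_dummies` instead):

```python
# Hypothetical survey records with a Gender column.
people = [
    {"name": "Ann", "gender": "F"},
    {"name": "Bob", "gender": "M"},
    {"name": "Cara", "gender": "F"},
]

# Dummy-code Gender: 1 for males and 0 for females, as in the text.
for p in people:
    p["gender_male"] = 1 if p["gender"] == "M" else 0

print([p["gender_male"] for p in people])   # [0, 1, 0]
```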
Categorical (Qualitative)
Variable/Attribute (cont..)
• A binned (or
discretized) variable
corresponds to a
numerical variable that
has been categorized
into discrete categories.
• These categories are usually called bins.
• The Age variable has been
categorized as “young” (34 years or
younger), “middle-aged” (from 35 to
59 years), and “elderly” (60 years or older).
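Binning a numerical Age variable into these categories can be sketched as follows (assuming, as above, that the "elderly" bin starts at 60; `pandas.cut` does the same job on whole columns):

```python
def bin_age(age):
    """Categorize a numerical age into the text's three bins."""
    if age <= 34:
        return "young"
    elif age <= 59:
        return "middle-aged"
    else:
        return "elderly"

ages = [25, 34, 35, 59, 60, 72]
print([bin_age(a) for a in ages])
# ['young', 'young', 'middle-aged', 'middle-aged', 'elderly', 'elderly']
```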
Numerical (Quantitative)
Variable/Attribute
• Numerical variables can be classified as discrete or continuous.
• The basic distinction is whether the data arise from counts or continuous
measurements.
The variable Children is clearly a count (discrete),
whereas the variable Salary is best treated as continuous.
• Numeric attributes can be interval-scaled or ratio-scaled.
Interval
• Measured on a scale of equal-sized units. The values of interval-scaled
attributes have order and can be positive, 0, or negative, e.g.,
temperature in °C or °F, calendar dates. There is no true zero-point.
Ratio
• Has an inherent zero-point. We can speak of values as being an order of
magnitude larger than the unit of measurement (10 K is twice as
high as 5 K), e.g., temperature in Kelvin, length, counts.
Data sets can also be categorized as cross-sectional or time series
A time series data set generally has the same layout—variables in columns and observations
in rows—but now each variable is a time series. Also, one of the columns usually indicates the
time period.
It has quarterly observations on revenues from toy sales over a four-year period in column B,
with the time periods listed chronologically in column A.
Properties of Attribute Values
• The type of an attribute depends on which of the
following properties/operations it possesses:
• Distinctness : = and ≠
• Order : <, ≤, >, and ≥
• Addition : + and −
(Differences are meaningful)
• Multiplication : * and /
(Ratios are meaningful)
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & meaningful differences
• Ratio attribute: all 4 properties/operations
Descriptive measures for Numerical
variables
• Measure of Central Tendency.
• Measure of Variability.
• Measure of Shape (Kurtosis and Skewness).
Measures of Central Tendency
There are three common measures of central tendency, all of which try to
answer the basic question of which value is most “typical.”
• These are the mean, the median, and the mode.
• Mean of a Sample.
A measure of central tendency is a number that
represents the typical value in a collection of
numbers.
Mean = sum of all data points/n (The mean is
also known as the "average" or the "arithmetic
average.")
• If the data set represents a sample from some larger population, this
measure is called the sample mean and is denoted by X̄ (x-bar).
• If the data set represents the entire population, it is called the
population mean and is denoted by µ
Measures of Central Tendency
• A trimmed mean (sometimes called a truncated mean) is similar to a mean, but it trims any
outliers. Outliers can affect the mean (especially if there are just one or two very large values), so
a trimmed mean can often be a better fit for data sets with erratic high or low values or for
extremely skewed distributions. Even a small number of extreme values can corrupt the mean.
• For example, the mean salary at a company may be substantially pushed up by that of a few
highly paid managers. Similarly, the mean score of a class in an exam could be pulled down quite
a bit by a few very low scores.
• Which is the mean obtained after chopping off values at the high and low extremes.
• Example: Find the 20% trimmed mean for the following test scores: 60, 81, 83, 91, 99.
• Step 1: Trim the top and bottom 20% from the data (one value at each end: 60 and 99).
That leaves us with the middle three values: 81, 83, 91.
• Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 = 85.
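The trimming procedure can be sketched directly (SciPy users could use `scipy.stats.trim_mean` instead):

```python
def trimmed_mean(values, trim=0.2):
    """Mean after chopping off the lowest and highest `trim` fraction."""
    data = sorted(values)
    k = int(len(data) * trim)        # number of values to drop at each end
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

scores = [60, 81, 83, 91, 99]
print(trimmed_mean(scores))          # (81 + 83 + 91) / 3 = 85.0
```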
Measures of Central Tendency
(cont..)
• Median of a Sample.
• The median of a set of data is the “middle element” when the data is arranged in ascending order. To
determine the median:-
If there are an odd number of data points, the median will be the number in the absolute middle.
If there is an even number of data points, the median is the mean of the two center data points,
meaning the two center values should be added together and divided by 2.
• Example:
• Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
• Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
• Step 2: Determine the absolute middle of the data: 9, 10, 12, 13, 14, 14, 17, 17, 20. The median is 14.
Measures of Central Tendency
(cont..)
• The Mode of Sample:-
The mode is the most frequently occurring measurement in a data set.
There may be one mode; multiple modes, if more than one number occurs most frequently; or no mode at all,
if every number occurs only once.
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
To determine the mode:
1. Put the data in order from smallest to largest, as you did to find your median.
2. Look for any value that occurs more than once.
3. Determine which of the values from Step 2 occurs most frequently.
• Example:-
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 2: Look for any number that occurs more than once. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 3: Determine which of those occur most frequently. 14 and 17 both occur twice.
The modes of this data set are 14 and 17.
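All three measures can be computed with Python's `statistics` module on the data set used in the examples above (`multimode` requires Python 3.8+):

```python
import statistics

data = [17, 10, 9, 14, 13, 17, 12, 20, 14]

print(statistics.mean(data))        # 14  (sum 126 / 9 values)
print(statistics.median(data))      # 14  (middle of the sorted data)
print(statistics.multimode(data))   # both 14 and 17 occur twice
```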
Measures of Central Tendency
(cont..)
Frequency, Relative Frequency, and Cumulative Relative Frequency.
• Frequency (or event) recording is a way to measure
the number of times a behavior occurs within a
given period.
• The advantage of using frequency distributions is
that they present raw data in an organized, easy-to-
read format. The most frequently occurring scores
are easily identified, as are score ranges, lower and
upper limits, cases that are not common, outliers,
and total number of observations between any
given scores.
Estimating the Median from Grouped Data
• To estimate the mean from grouped data, we add all the values up and divide by 21.
The quick way to do it is to multiply each class midpoint by its frequency, sum the
products, and divide by the total frequency.
• To estimate the median, use:
Estimated Median = L + ((n/2 − B) / G) × w
where:
L is the lower class boundary of the group
containing the median
n is the total number of values
B is the cumulative frequency of the groups before
the median group
G is the frequency of the median group
w is the group width
For our example:
L = 60.5, n = 21, B = 2 + 7 = 9, G = 8, w = 5
Estimated Median = 60.5 + ((21/2 − 9) / 8) × 5 = 60.5 + 0.9375 ≈ 61.4
Estimating the Mode from Grouped Data
• Again, looking at our data:
• We can easily find the modal group (the group with the highest
frequency), which is 61 - 65
• We can say "the modal group is 61 - 65"
• But the actual Mode may not even be in that group! Or there may
be more than one mode. Without the raw data we don't really
know. But, we can estimate the Mode using the following
formula:
Estimated Mode = L + ((fm − fm−1) / ((fm − fm−1) + (fm − fm+1))) × w
where:
L is the lower class boundary of the modal group
fm−1 is the frequency of the group before the modal group
fm is the frequency of the modal group
fm+1 is the frequency of the group after the modal group
w is the group width
For our example:
L = 60.5, fm−1 = 7, fm = 8, fm+1 = 4, w = 5
Estimated Mode = 60.5 + ((8 − 7) / ((8 − 7) + (8 − 4))) × 5 = 60.5 + 1 = 61.5
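The two grouped-data formulas can be checked in a few lines; the arguments mirror the worked examples (L = 60.5, etc.):

```python
def grouped_median(L, n, B, G, w):
    """Estimated median = L + ((n/2 - B) / G) * w"""
    return L + ((n / 2 - B) / G) * w

def grouped_mode(L, fm_prev, fm, fm_next, w):
    """Estimated mode = L + ((fm - fm_prev) / ((fm - fm_prev) + (fm - fm_next))) * w"""
    return L + ((fm - fm_prev) / ((fm - fm_prev) + (fm - fm_next))) * w

print(grouped_median(60.5, 21, 9, 8, 5))   # 61.4375
print(grouped_mode(60.5, 7, 8, 4, 5))      # 61.5
```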
Example (ages): The ages of the 112 people who live on
a tropical island are grouped as follows (frequency table not reproduced here).
Example (baby carrots): You grew fifty baby carrots using
special soil. You dig them up, measure their lengths (to the
nearest mm), and group the results (frequency table not reproduced here).
Measures of Variability
• Measures of variability give a sense of how spread out the response values are.
• The range, standard deviation and variance each reflect different aspects of
spread.
• Percentiles and quartiles certainly tell you something about variability.
• Specifically, for any percentage p, the pth percentile is the value such that a
percentage p of all values are less than it. Similarly, the first, second, and third
quartiles are the percentiles corresponding to p = 25%, p = 50%, and p = 75%.
These three values divide the data into four groups, each with (approximately) a
quarter of all observations.
• Note that the second quartile is equal to the median by definition.
For example, if you learn that your score in the verbal SAT test is at the 93rd percentile, this
means that you scored better than 93% of those taking the test.
Measures of Variability
(cont..)
• Range
• Interquartile Range
• Variance
• Standard Deviation
Range
The range gives you an idea of how far apart the most extreme response scores are. To
find the range, simply subtract the lowest value from the highest value.
Measures of Variability
(cont..)
Interquartile range
• The Interquartile range measures the variability, based on dividing an ordered set
of data into quartiles.
• Quartiles are three values or cuts that divide each respective part as the first, second,
and third quartiles, denoted by Q1, Q2, and Q3
Q1= It is the cut in the first half of the rank-ordered data set
Q2= It is the median value of the set
Q3= It is the cut in the second half of the rank-ordered data
set.
We use variance to see how individual numbers relate to each other within a data set.
Variance analysis helps an organization to be proactive in achieving their business
targets.
School of Computer Engineering
Measures of Variability
(cont..)
Standard deviation
A fundamental problem with variance is that it is in squared units.
A more natural measure is the standard deviation, which is the square root of the variance.
Important Points
• For two data sets with the same mean, the one with the larger standard
deviation is the one in which the data is more spread out from the center.
• Standard deviation is equal to 0 if all values are equal (because all values
are then equal to the mean).
Measures of Variability
(cont..)
Standard deviation
There are six steps for finding the standard deviation:
1. Find the mean of the data.
2. Subtract the mean from each value to get the deviations.
3. Square each deviation.
4. Add up the squared deviations.
5. Divide by n (for a population) or n − 1 (for a sample) to get the variance.
6. Take the square root of the variance.
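A small sketch of computing the sample standard deviation step by step (the data set is illustrative):

```python
import math

def sample_std(values):
    # 1. Find the mean.
    mean = sum(values) / len(values)
    # 2-3. Subtract the mean from each value and square the deviations.
    squared = [(x - mean) ** 2 for x in values]
    # 4-5. Sum the squares and divide by n - 1 for the sample variance.
    variance = sum(squared) / (len(values) - 1)
    # 6. Take the square root.
    return math.sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(sample_std(data), 3))   # 2.138
```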
Measures of Shape
Skewness
How about deriving a measure that captures the horizontal distance between the Mode and the
Mean of the distribution? It is intuitive that the higher the skewness, the further apart these
measures will be.
Pearson’s Coefficient of Skewness: This method is most frequently used for measuring skewness.
The formula for the coefficient of skewness is given by
Sk = (Mean − Mode) / Standard deviation
The value of this coefficient would be zero in a symmetrical distribution. If the mean is greater
than the mode, the coefficient of skewness would be positive, otherwise negative. The value of
Pearson’s coefficient of skewness usually lies between ±1 for a moderately skewed distribution.
Measures of Shape
(cont..)
Skewness
If the mode is not well defined, we use the formula
Sk = 3(Mean − Median) / Standard deviation
If this value is between:
• −0.5 and 0.5, the distribution of the value is almost symmetrical
• −1 and −0.5, the data is negatively skewed, and if it is between 0.5 and 1,
the data is positively skewed. The skewness is moderate.
• If the skewness is lower than −1 (negatively skewed) or greater than 1
(positively skewed), the data is highly skewed.
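The median-based variant of Pearson's coefficient can be sketched as follows; the data set is illustrative, with a high value dragging the mean above the median:

```python
import statistics

def pearson_skew_median(values):
    """Second Pearson coefficient: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    return 3 * (mean - median) / sd

data = [1, 2, 2, 3, 3, 3, 4, 10]   # the 10 drags the mean up
print(round(pearson_skew_median(data), 2))   # positive -> right-skewed
```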
• Contextual Outliers: Contextual outliers are those values of data points that deviate quite a lot from the rest
of the data points that are in the same context, however, in a different context, it may not be an outlier at all. For
example, a sudden surge in orders for an e-commerce site at night can be a contextual outlier.
Outliers can lead to vague or misleading predictions when using machine learning models. Specific models like
linear regression, logistic regression, and support vector machines are susceptible to outliers. Outliers reduce
the predictive power of these models, and thus the output of the models becomes unreliable.
Outliers
(cont..)
• When there are no outliers in a sample, the mean and standard deviation are used
to summarize a typical value and the variability in the sample, respectively.
• When there are outliers in a sample, the median and interquartile range are used
to summarize a typical value and the variability in the sample, respectively.
Tukey Fences
• Tukey fences flag as outliers any values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
• In the previous example, for the diastolic blood pressures, the lower limit is 64 − 1.5(77 − 64) =
44.5 and the upper limit is 77 + 1.5(77 − 64) = 96.5. The diastolic blood pressures range
from 62 to 81. Therefore there are no outliers.
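The Tukey-fence rule can be sketched with the `statistics` module (note that `statistics.quantiles` computes quartiles by interpolation, so the limits may differ slightly from hand methods; the blood-pressure values below are illustrative):

```python
import statistics

def tukey_fences(values):
    """Return the lower and upper limits Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [62, 64, 68, 70, 72, 74, 77, 81]   # hypothetical diastolic pressures
lo, hi = tukey_fences(data)
outliers = [x for x in data if x < lo or x > hi]

print(outliers)   # no value falls outside the fences here
```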
Outliers
(cont..)
Example : The Full Framingham Cohort Data
• The Framingham Heart Study is a long-term, ongoing cardiovascular cohort study on
residents of the city of Framingham, Massachusetts. The study began in 1948 with 5,209
adult subjects from Framingham, and is now on its third generation of participants.
• Table 1 displays the means, standard deviations, medians, quartiles and interquartile ranges
for each of the continuous variables in the subsample of n=10 participants who attended the
seventh examination of the Framingham Offspring Study.
Table 1 - Summary Statistics on n=10 Participants
Outliers
(cont..)
• Table 2 displays the observed minimum and maximum values along with the limits to determine
outliers using the quartile rule for each of the variables in the subsample of n=10 participants.
• Are there outliers in any of the variables? Which statistics are most appropriate to summarize
the average or typical value and the dispersion?
Table 2 - Limits for Assessing Outliers in Characteristics Measured in the n=10 Participants
Since there are no suspected outliers in the subsample of n=10 participants, the mean and standard deviation are the
most appropriate statistics to summarize average values and dispersion, respectively, of each of these characteristics.
Outliers
(cont..)
• For clarity, we have so far used a very small subset of the Framingham Offspring Cohort to
illustrate calculations of summary statistics and determination of outliers. For your interest,
Table 3 displays the means, standard deviations, medians, quartiles and interquartile ranges
for each of the continuous variable displayed in Table 1 in the full sample (n=3,539) of
participants who attended the seventh examination of the Framingham Offspring Study.
Table 3-Summary Statistics on Sample of (n=3,539) Participants
Outliers
(cont..)
• Table 4 displays the observed minimum and maximum values along with the limits to determine
outliers using the quartile rule for each of the variables in the full sample (n=3,539).
Table 4 - Limits for Assessing Outliers in Characteristics Presented in Table 3
There is a systematic relationship between the propensity of values to be missing and the observed data,
but not the missing data. All that is required is a probabilistic relationship (this is the Missing at Random, MAR, mechanism).
Types of Missing Values
(cont..)
Missing not at Random (MNAR) - Nonignorable
Data was obtained from 31 women, of whom 14 were located six months later. Of these, three had exited from
homelessness, so the estimated proportion to have exited homelessness is 3/14 = 21%. As there is no data for the 17
women who could not be contacted, it is possible that none, some, or all of these 17 may have exited from
homelessness. This means that potentially the proportion to have exited from homelessness in the sample is
between 3/31 = 10% and 20/31 = 65%. As a result, reporting 21% as being the correct result is misleading. In this
example the missing data is nonignorable.
Strategies to handle MNAR are to find more data about the causes for the missingness, or to perform
what-if analyses to see how sensitive the results are under various scenarios.
Types of Missing Values
(cont..)
We can formalize these definitions.
Let X represent a matrix of the data we “expect” to have; X = {Xo, Xm}, where Xo is
the observed data and Xm the missing data, and let R indicate which entries are missing.
1. MCAR: P(R | Xo, Xm) = P(R)
2. MAR: P(R | Xo, Xm) = P(R | Xo)
3. MNAR: P(R | Xo, Xm) depends on Xm as well.
• The most meaningful way to describe a categorical variable is with counts, possibly expressed as
percentages of totals, and corresponding charts of the counts.
• We can find the counts of the categories of either variable separately, and more
importantly, we can find counts of the joint categories of the two variables, such as the
count of all nondrinkers who are also nonsmokers.
• It is customary to display all such counts in a table called a crosstabs (for crosstabulations).
This is also sometimes called a contingency table.
Relationships Among Categorical
Variables (cont..)
(Categorical vs Categorical)
Do the data indicate that smoking and drinking habits are related? For
example,
• Do nondrinkers tend to be nonsmokers?
• Do heavy smokers tend to be heavy drinkers?
The first two arguments are for the condition on smoking;
the second two are for the condition on drinking.
You can then sum across rows and down columns to get the totals.
It is useful to express the counts as percentages of row in the middle table
and as percentages of column in the bottom table.
The latter two tables indicate, in complementary ways, that there is
definitely a relationship between smoking and drinking.
These tables indicate that smoking and drinking habits tend to go with one
another. These tendencies are reinforced by the column charts of the two
percentage tables
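A crosstab of smoking against drinking can be sketched with stdlib counters (the responses below are hypothetical; in practice `pandas.crosstab` builds the whole table in one call):

```python
from collections import Counter

# Hypothetical survey: one (smoking, drinking) pair per respondent.
responses = [
    ("nonsmoker", "nondrinker"), ("nonsmoker", "nondrinker"),
    ("nonsmoker", "drinker"),    ("smoker", "drinker"),
    ("smoker", "drinker"),       ("smoker", "nondrinker"),
]

# Joint counts -> the cells of the crosstab (contingency table).
joint = Counter(responses)

# Row totals (counts by smoking status alone).
row_totals = Counter(smoke for smoke, _ in responses)

# Row percentages: each cell as a share of its row total.
row_pct = {cell: count / row_totals[cell[0]] for cell, count in joint.items()}

print(joint[("nonsmoker", "nondrinker")])          # 2
print(round(row_pct[("smoker", "drinker")], 2))    # 0.67
```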
Relationships Among Categorical
& Numerical Variables
(Categorical vs Numerical)
It describes a very common situation where the goal is to break down a numerical
variable such as salary by a categorical variable such as gender.
• This general problem, typically referred to as the comparison problem, is one of the
most important problems in data analysis.
• It occurs whenever you want to compare a numerical measure across two or more
subpopulations. Here are some examples:
The subpopulations are males and females, and the numerical measure is salary.
The subpopulations are different regions of the country, and the numerical measure is the cost of living.
The subpopulations are different days of the week, and the numerical measure is the number of customers
going to a particular fast-food chain.
The subpopulations are different machines in a manufacturing plant, and the numerical measure is the number
of defective parts produced per day.
The subpopulations are patients who have taken a new drug and those who have taken a placebo, and the
numerical measure is the recovery rate from a particular disease.
The subpopulations are undergraduates with various majors (business, English, history, and so on), and the
numerical measure is the starting salary after graduating.
Relationships Among Categorical & Numerical
Variables (cont..)
(Categorical vs Numerical)
The data are stacked if there are two “long” variables, Gender and Salary, as indicated in the figure. Occasionally you will see data in unstacked format. (Note that both tables list exactly the same data.)
Example questions for the baseball salary data:
• Do pitchers (or any other positions) earn more than others?
• Does one league pay more than the other, or do any divisions pay more than others?
• How does the notoriously high Yankees payroll compare to the others?
Relationships Among Categorical & Numerical
Variables (cont..)
(Categorical vs Numerical)
Data set includes an observation (Golf Stats) for each of the top 200 earners on the PGA Tour.
Relationships Among Numerical & Numerical
Variables (cont..)
(Scatterplot)
This example is typical in that there are many numerical
variables, and it is up to you to search for possible
relationships. A good first step is to ask some interesting
questions and then try to answer them with scatterplots.
For example,
Do younger players play more events?
Are earnings related to age?
Which is related most strongly to earnings:
driving, putting, or greens in regulation?
Do the answers to these questions remain the
same from year to year?
This example is all about exploring the data.
Relationships Among Numerical & Numerical
Variables (cont..)
(Scatterplot)
Relationships Among Numerical & Numerical
Variables
(Scatterplot)(cont..)
• Once you have a scatterplot, it enables you to superimpose one of several
trend lines on the scatterplot.
• A trend line is a line or curve that “fits” the scatter as well as possible.
• This could be a straight line, or it could be one of several types of curves.
Relationships Among Numerical & Numerical
Variables (cont..)
Correlation and Covariance
• Correlation and covariance measure the strength and direction of a linear
relationship between two numerical variables. (Bi-Variate Measures)
• The relationship is “strong” if the points in a scatterplot cluster tightly around some
straight line.
If this straight line rises from left to right, the relationship is positive and the measures will be
positive numbers.
If it falls from left to right, the relationship is negative and the measures will be negative
numbers.
• The two numerical variables must be “paired” variables.
They must have the same number of observations, and the values for any observation should be
naturally paired.
Specifically, each measures the strength and direction of a linear relationship between two numerical variables.
Relationships Among Numerical & Numerical
Variables (cont..)
• Covariance is essentially an average of products of deviations from means.
• With this in mind, let Xi and Yi be the paired values for observation i, and let n be the
number of observations. Then the covariance between X and Y, denoted by Covar(X, Y), is
Covar(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1), summed over i = 1 to n.
• Covariance has a serious limitation as a descriptive measure because it is very
sensitive to the units in which X and Y are measured.
For example, the covariance can be inflated by a factor of 1000 simply by measuring X in dollars rather
than in thousands of dollars. In contrast, the correlation, denoted by Correl(X, Y), remedies this problem.
Relationships Among Numerical & Numerical
Variables (cont..)
• Correlation is a unitless quantity that is unaffected by the measurement scale.
• For example, the correlation is the same regardless of whether the variables
are measured in dollars, thousands of dollars, or millions of dollars.
Finally, correlations (and covariances) are symmetric in that the correlation between any two variables X
and Y is the same as the correlation between Y and X.
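A minimal Python sketch of these two measures (the paired values below are made up purely for illustration) shows the scale-sensitivity of covariance and the unitlessness and symmetry of correlation:

```python
import math

def covariance(xs, ys):
    # Sample covariance: average of products of deviations from the means,
    # using the n - 1 divisor
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    # Correlation: covariance scaled by both standard deviations,
    # which makes it unitless and bounded by -1 and +1
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

# Hypothetical paired observations
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

# Measuring X in dollars rather than thousands of dollars inflates the
# covariance by a factor of 1000 but leaves the correlation unchanged
cov_plain = covariance(xs, ys)
cov_scaled = covariance([x * 1000 for x in xs], ys)
corr_plain = correlation(xs, ys)
corr_scaled = correlation([x * 1000 for x in xs], ys)
```

Note that `correlation(xs, ys)` equals `correlation(ys, xs)`, matching the symmetry property above.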
Relationships Among Numerical & Numerical
Variables (cont..)
Correlation
The bigger the sample size, the narrower the confidence interval will be.
How are the lower and upper confidence limits determined?
Confidence limit = Mean ± Z-score × (Standard deviation / √(Sample size))
where the Z-score is a measure of how many standard deviations a value lies
below or above the population mean.
Z-scores for commonly used confidence intervals are as follows:
50% → 0.674   80% → 1.282   90% → 1.645
95% → 1.96    98% → 2.326   99% → 2.576
(Refer to the Appendix for further details.)
Interval Estimation
Interval estimation for large samples is studied under the following four headings:
• Confidence interval or limits for the population mean
• Confidence interval or limits for the population proportion P
• Confidence interval or limits for the population standard deviation
• Determination of a proper sample size for estimating the population proportion
and the population mean
Confidence Interval or Limits for the
Population Mean
• The determination of the confidence interval or limits for the population
mean in the case of a large sample, that is n > 30, requires the use of the normal
distribution.
Example
• A random sample of 100 observations yields sample mean = 150
and sample variance = 400. Compute 95% and 99% confidence intervals for
the population mean.
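The example can be worked through with a short Python sketch of the large-sample formula x̄ ± z·σ/√n, using the z-values from the table earlier in this unit:

```python
import math

def mean_ci(xbar, sd, n, z):
    # Large-sample confidence interval for the population mean:
    # xbar +/- z * sd / sqrt(n)
    se = sd / math.sqrt(n)
    return xbar - z * se, xbar + z * se

# Sample of n = 100 with mean 150 and variance 400 (so sd = 20, se = 2)
lo95, hi95 = mean_ci(150, math.sqrt(400), 100, 1.96)   # 95% -> (146.08, 153.92)
lo99, hi99 = mean_ci(150, math.sqrt(400), 100, 2.576)  # 99% -> (144.848, 155.152)
```

As expected, the 99% interval is wider than the 95% interval for the same data.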
Confidence Interval or Limits for the
Population Proportion P
• Though the sampling distribution associated with proportions is the
binomial distribution, the normal distribution can be used as an
approximation provided the sample is large, i.e., n > 30 and np ≥ 5.
Here n is the size of the sample, p is the proportion of successes, and
q = 1 − p.
Example
• Out of 1200 tosses of a coin, 480 came up heads and 720 tails. Find the
95% confidence interval for the proportion of heads.
Example
• A random sample of 1000 households in a city revealed that 500 of
these had a car. Find 95% and 99% confidence limits for the proportion
of households in the city with a car.
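Both proportion examples can be sketched with the normal-approximation interval p̂ ± z·√(pq/n):

```python
import math

def prop_ci(successes, n, z):
    # Large-sample CI for a population proportion:
    # p_hat +/- z * sqrt(p * q / n), with q = 1 - p
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Coin example: 480 heads out of 1200 tosses, 95% interval
coin_lo, coin_hi = prop_ci(480, 1200, 1.96)     # about (0.372, 0.428)

# Household example: 500 of 1000 households own a car
hh_lo95, hh_hi95 = prop_ci(500, 1000, 1.96)     # about (0.469, 0.531)
hh_lo99, hh_hi99 = prop_ci(500, 1000, 2.576)    # about (0.459, 0.541)
```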
Confidence Interval or Limits for the
Population Standard Deviation
• The determination of the confidence interval or limits for the population
S.D. in the case of a large sample requires the use of the normal distribution.
Example
• A random sample of 50 observations gave a standard deviation equal
to 24.5. Construct a 95% confidence interval for the population
standard deviation.
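For large samples the standard error of the sample standard deviation is approximately s/√(2n), so the interval can be sketched as s ± z·s/√(2n) (this approximation is the usual large-sample one; the slides do not state the formula explicitly):

```python
import math

def sd_ci(s, n, z):
    # Large-sample CI for the population S.D.:
    # s +/- z * s / sqrt(2n)  (normal approximation)
    se = s / math.sqrt(2 * n)
    return s - z * se, s + z * se

# s = 24.5, n = 50: se = 24.5 / sqrt(100) = 2.45
lo, hi = sd_ci(24.5, 50, 1.96)  # -> (19.698, 29.302)
```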
Determination of a Proper Sample Size for Estimating
the Population Proportion and the Population Mean
So far we have calculated confidence intervals on the assumption that the
sample size n is known. In most practical situations, however, the sample
size is not known in advance. The method of determining a proper sample
size is studied under two headings:
• Sample size for estimating a population mean
• Sample size for estimating a population proportion
Sample Size for Estimating a
Population Mean:
In order to determine the sample size for estimating a population mean,
the following three factors must be known:
• the desired confidence level and the corresponding value of Z;
• the permissible sampling error E;
• the standard deviation, or an estimate of the population S.D., σ.
Once these factors are known, the sample size n is given
by n = (Zσ / E)².
Example
• A cigarette manufacturer wishes to use a random sample to estimate
the average nicotine content. The sampling error should not be more than
one milligram above or below the true mean, with a 99% confidence
level. The population standard deviation is 4 milligrams. What sample
size should the company use in order to satisfy these requirements?
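The computation for this example, using n = (Zσ/E)² and rounding up to the next whole observation:

```python
import math

def sample_size_mean(z, sigma, e):
    # n = (z * sigma / E)^2, rounded up to the next integer
    return math.ceil((z * sigma / e) ** 2)

# 99% confidence (z = 2.576), sigma = 4 mg, permissible error E = 1 mg
n = sample_size_mean(2.576, 4, 1)  # (2.576 * 4)^2 = 106.17 -> 107
```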
Sample Size for Estimating a
Population Proportion
In order to determine the sample size for estimating a population
proportion, the following three factors must be known:
• the desired level of confidence and the corresponding value of Z;
• the permissible sampling error E;
• the actual or estimated true proportion of success P.
The size of the sample n is then given by n = Z² P (1 − P) / E².
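A sketch of the same computation in Python; the confidence level, proportion, and error below are hypothetical values chosen for illustration (P = 0.5 is the most conservative choice when the true proportion is unknown):

```python
import math

def sample_size_prop(z, p, e):
    # n = z^2 * P * (1 - P) / E^2, rounded up to the next integer
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

# Hypothetical: 95% confidence (z = 1.96), P = 0.5, error E = 0.05
n = sample_size_prop(1.96, 0.5, 0.05)  # 384.16 -> 385
```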
This helps us figure out how likely our calculated average from one specific
group is to represent the true average height of the whole town. The
sampling distribution helps us understand if our result is just by chance or
if it's a reliable estimate for the entire population
A Sampling Distribution
We are moving from descriptive statistics to inferential statistics.
Inferential statistics allow you to say that the candidate gets support from 47% of
the population with a margin of error of +/- 4%.
This means that the support in the population is likely somewhere between 43%
and 51%.
A Sampling Distribution
Margin of error is taken directly from a sampling distribution.
A Sampling Distribution
Let’s create a sampling distribution of means…
Take another sample of size 1,500 from the US. Record the mean income. Our census said the
mean is $30K.
$30K
A Sampling Distribution
Let’s create a sampling distribution of means…
Let’s repeat sampling of sizes 1,500 from the US. Record the mean incomes. Our census said the
mean is $30K.
$30K
A Sampling Distribution
Say that the standard deviation of this distribution is $10K.
Think back to the empirical rule: what are the odds you would get a sample mean that is more
than $20K off?
$30K
Our graphic display indicates that chances are good that the mean of our one
sample will not precisely represent the population’s mean. This is called sampling
error.
For example, if the variability is low, we can trust our number more
than if the variability is high.
A Sampling Distribution
Which sampling distribution has the lower variability or standard deviation?
a b
Sa < Sb
The first sampling distribution above, a, has a lower standard error.
Now a definition!
The standard deviation of a normal sampling distribution is called the standard error.
A Sampling Distribution
Statisticians have found that the standard error of a sampling distribution is quite
directly affected by the number of cases in the sample(s), and the variability of
the population distribution.
Population Variability:
For example, Americans’ incomes are quite widely distributed, from $0 to Bill
Gates’.
Americans’ car values are less widely distributed, from about $50 to about $50K.
The standard error of the latter’s sampling distribution will be a lot less variable.
A Sampling Distribution
Population Variability:
[Figure: population distributions of car values and incomes, and their sampling distributions]
σY-bar = σ / √n
If the population income were distributed with mean μ = $30K and standard deviation σ = $10K,
the sampling distribution changes for varying sample sizes.
A Sampling Distribution
So why are sampling distributions less variable when
sample size is larger?
Example 1:
Think about what kind of variability you would
get if you collected income through repeated
samples of size 1 each.
Contrast that with the variability you would get
if you collected income through repeated
samples of size N – 1 (or 300 million minus
one) each.
Example 2:
Think about drawing the population distribution and playing “darts” where the mean is the bull’s-eye. Record each one of your attempts.
Contrast that with playing “darts” but doing it in rounds of 30 and recording the average of each round.
What kind of variability will you see in the first versus the second way of recording your scores?
A Sampling Distribution
An Example:
A population’s car values are = $12K with = $4K.
Which sampling distribution is for sample size 625 and which is for 2500? What are their s.e.’s?
s.e. = $4K/√625 = $4K/25 = $160        s.e. = $4K/√2500 = $4K/50 = $80
[Figure: the two sampling distributions, each centered at $12K; 95% of sample means (M’s)
fall within ±2 s.e. of $12K, a much narrower band for n = 2500 than for n = 625]
A Sampling Distribution
A population’s car values are = $12K with = $4K.
Which sampling distribution is for sample size 625 and which is for 2500?
Which sample will be more precise? If you get a particularly bad sample, which sample size will help you be sure that you are
closer to the true mean?
A Sampling Distribution
Some rules about the sampling distribution of the
mean…
1. For a random sample of size n from a population having mean μ and standard
deviation σ, the sampling distribution of Y-bar has mean μ and
standard error σY-bar = σ / √n.
2. The Central Limit Theorem says that for random sampling, as the sample size n
grows, the sampling distribution of Y-bar approaches a normal distribution.
3. The sampling distribution will be approximately normal no matter what the population
distribution’s shape as long as n > 30.
4. If n < 30, the sampling distribution is likely normal only if the underlying population’s
distribution is normal.
5. As n increases, the standard error (remember that this word means the standard
deviation of the sampling distribution) gets smaller.
6. The precision provided by any given sample increases as the sample size n increases.
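Rules 2 and 5 can be illustrated with a small simulation. The exponential population and the sample sizes below are arbitrary choices for the sketch; any skewed population would show the same pattern:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

def se_of_sample_means(sample_size, num_samples=2000):
    # Draw repeated samples from a skewed (exponential) population and
    # record each sample mean; the standard deviation of those means is
    # the empirical standard error of the sampling distribution.
    means = [
        statistics.mean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]
    return statistics.stdev(means)

# Rule 5: larger samples give a smaller standard error
se_small = se_of_sample_means(10)    # roughly 1/sqrt(10), about 0.32
se_large = se_of_sample_means(100)   # roughly 1/sqrt(100), about 0.10
```

The population standard deviation here is 1, so the empirical standard errors track the theoretical σ/√n closely.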
A Sampling Distribution
So we know in advance of ever collecting a sample, that if sample size is sufficiently large:
• The standard error will be a function of the population variability and sample size
• The larger the sample size, the more precise, or efficient, a particular sample is
• 95% of all sample means will fall between +/- 2 s.e. from the population mean
Probability Distributions
• A Note: Not all theoretical probability distributions are Normal. One example of many is the binomial
distribution.
• The binomial distribution gives the discrete probability distribution of obtaining exactly n successes out
of N trials where the result of each trial is true with known probability of success and false with the
inverse probability.
• The binomial distribution has a formula and changes shape with each probability of success and number
of trials.
[Figure: a binomial distribution over 0–12 successes]
• However, in this class the normal probability distribution is the most useful!
Sampling Distributions
There are two types of sampling distributions:
• Sampling Distribution of Mean – [Discussed earlier]
• T-Distribution
T-Distribution
• The t-distribution describes the standardized distances of sample means to the
population mean when the population standard deviation is not known, and the
observations come from a normally distributed population.
• The t-distribution is similar to normal distribution but flatter and shorter than a normal distribution
i.e., it is symmetrical, bell-shaped distribution, similar to the standard normal curve.
• The height of the t-distribution depends on the degrees of freedom (df), which refers to the maximum
number of logically independent values (values that have the freedom to vary) in the sample.
Degree of freedom
The easiest way to understand degrees of freedom conceptually is through several examples.
• Consider a data sample consisting of five positive integers. The values of the five integers
must have an average of six. If four of the items within the data set are {3, 8, 5, and 4}, the fifth
number must be 10. Because the first four numbers can be chosen at random, the degrees of
freedom is four.
• Consider a data sample consisting of one integer. That integer must be odd. Because there
are constraints on the single item within the data set, the degrees of freedom is zero.
• The formula to determine degrees of freedom is df = N – 1 where N is sample size.
• For example, imagine a task of selecting 10 baseball players whose batting average must
average to .250. The total number of players that will make up our data set is the sample
size, so N = 10. In this example, 9 (10 - 1) baseball players can theoretically be picked at
random, with the 10th baseball player having to have a specific batting average to adhere
to the .250 batting average constraint.
T-Distribution cont…
• As the df increases, the t-distribution will get closer and closer to matching the standard
normal distribution.
• The values of the t-statistic is : t = [ x̄ - μ ] / [ s / √ n ] where,
• t = t score,
• x̄ = sample mean,
• μ = population mean,
• s = standard deviation of the sample,
• n = sample size
Note: A t-score is equivalent to the number of standard deviations away from the mean of the t-
distribution.
• A law school claims its graduates earn an average of $300 per hour. A sample of 15
graduates is selected and found to have a mean salary of $280 with a sample standard
deviation of $50. Assuming the school’s claim is true, what is the t-score?
Solution: t= (280 – 300) / (50/ √ 15) = -20 / 12.909945 = -1.549.
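The arithmetic of this solution as a quick Python check:

```python
import math

def t_score(xbar, mu, s, n):
    # t = (xbar - mu) / (s / sqrt(n))
    return (xbar - mu) / (s / math.sqrt(n))

# Law school example: sample mean 280, claimed mean 300, s = 50, n = 15
t = t_score(280, 300, 50, 15)  # about -1.549
```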
T-Distribution cont…
Student’s t-distribution is used when
• the sample size is 30 or less,
• the population standard deviation (σ) is unknown, and
• the population distribution is unimodal and not heavily skewed (approximately normal).
Note:
The t-score represents the number of standard errors by which the
sample mean differs from the population mean. For example, if a t-score
is 2.5, the sample mean is 2.5 standard errors above the population
mean. If a t-score is −2.5, the sample mean is 2.5 standard errors below
the population mean.
Inferential Statistics
• Statistics can be classified into two different categories i.e., descriptive statistics and
inferential statistics.
• Descriptive statistics summarize the features of the dataset, whereas inferential statistics
help to draw conclusions from the data.
• Inferential statistics is the process of using a sample to infer the properties of a population
and allows one to generalize from the sample to the population.
• In general, inference means drawing a conclusion about something. So,
statistical inference means making inferences about the population.
• Let’s look at a real flu vaccine study for an example of making a statistical inference. The
scientists for this study want to evaluate whether a flu vaccine effectively reduces flu cases
in the general population. However, the general population is much too large to include in their
study, so they must use a representative sample to make a statistical inference about the
vaccine’s effectiveness.
• Hypothesis testing is one type of inferential statistics.
Hypothesis
• A hypothesis is defined as a formal statement, which gives the explanation about the
relationship between the two or more variables of the specified population i.e., it
includes components like variables, population and the relation between the
variables.
• Hypothesis example:
• Two variables - if you eat more vegetables, you will lose weight faster. Here,
eating more vegetables is an independent variable, while losing weight is the
dependent variable.
• Two or more dependent variables and two or more independent variables -
Eating more vegetables and fruits leads to weight loss, glowing skin, and
reduces the risk of many diseases such as heart disease.
• Consumption of sugary drinks every day leads to obesity
• If a person gets 7 hours of sleep, then he will feel less fatigue than if he sleeps less.
Hypothesis Testing
• In today’s data-driven world, decisions are based on data all the time. Hypothesis plays
a crucial role in that process, whether it may be making business decisions, in the health
sector, academia, or in quality improvement. Without hypothesis & hypothesis tests, you risk
drawing the wrong conclusions and making bad decisions.
• Hypothesis testing is a type of statistical analysis in which assumptions about a
population parameter are put to the test. It is used to estimate the relationship between
variables.
• Examples:
• A faculty assumes that 60% of his students come from higher-middle-class families.
• A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.
• It involves setting up a null hypothesis and an alternative hypothesis. These two hypotheses
will always be mutually exclusive. This means that if the null hypothesis is true then the
alternative hypothesis is false and vice versa.
Null Hypothesis and Alternate
Hypothesis
• The null hypothesis is the statement that there is no effect in the population. A
null hypothesis has no bearing on the study’s outcome unless it is rejected.
• Example:
• Smokers are no more susceptible to heart disease than nonsmokers.
• The new drug has a cure rate no higher than other drugs on the market.
• H0 is the symbol for it, and it is pronounced H-naught.
• Hypothesis testing is used to conclude if the null hypothesis can be rejected or
not. Suppose an experiment is conducted to check if girls are shorter than
boys at the age of 5. The null hypothesis will say that they are the same height.
• The alternative hypothesis is the hypothesis that we are trying to
prove (there is an effect in the population); it is accepted if
we have sufficient evidence to reject the null hypothesis.
• It indicates that there is a statistical significance between two
possible outcomes and can be denoted as Ha.
• For the above-mentioned example, the alternative hypothesis would
be that “girls are shorter than boys at the age of 5”.
• The null hypothesis is usually the current thinking, or status quo.
The alternative hypothesis is usually the hypothesis to be proved.
The burden of proof is on the alternative hypothesis.
Null Hypothesis and Alternate
Hypothesis cont…
• A sanitizer manufacturer claims that its product kills 95 percent
of germs on average. To put this company's claim to the test,
create a null and alternate hypothesis.
• H0 (Null Hypothesis): Average = 95%.
• Alternative Hypothesis (Ha): The average is less than 95%.
• Example research question: does a newly passed bill affect the loans of farmers?
H0: There is no significant effect of the new bill on the loans of farmers.
Ha: The new bill has a significant effect on the loans of farmers. The intention is to
check whether the new bill can affect loans in either direction, an increase or a
decrease, so a two-tailed test is used.
Types of Error
• Regardless of whether the investigator decides to accept or reject the
null hypothesis, it might be the wrong decision.
• The investigator might incorrectly reject the null hypothesis when it is
true, and might incorrectly accept the null hypothesis when it is false.
• In the tradition of hypothesis testing, these two types of errors
have acquired the names i.e., type I and type II errors.
• In general, a type I error occurs when one incorrectly rejects a
null hypothesis that is true. On the other hand, a type II error occurs
when one incorrectly accepts a null hypothesis that is false.
                             Truth
Decision                 H0 is true      Ha is true
Reject H0                Type I error    No error
Do not reject H0         No error        Type II error
Rejection Region
• The question, then, is how strong the evidence in favor of the
alternative hypothesis must be to reject the null hypothesis.
• This is done by means of a p-value. The p-value is the probability of
seeing a random sample at least as extreme as the observed sample,
given that the null hypothesis is true. The smaller the p-value, the
more evidence there is in favor of the alternative hypothesis.
• The p-values are expressed as decimals and can be converted
into percentages. For example, a p-value of 0.0237 is 2.37%, which
means there is a 2.37% chance of seeing results at least this extreme
purely by chance if the null hypothesis is true.
• In the hypothesis test, if the value is:
• A small p value (<=0.05), reject the null hypothesis.
• A large p value (>0.05), do not reject the null hypothesis
• The p-values are usually calculated using p-value tables, or
calculated automatically using statistical software like R, SPSS,
Python etc.
• Note: Another way to decide the rejection region is with a critical
z-score; this applies when the sample size is at least 30 (for smaller
samples, a critical t-score is used instead).
Hypothesis Testing Example
• An investor says that the performance of their investment portfolio is
equivalent to that of the Standard & Poor’s (S&P) 500 Index. The person
performs a two-tailed test to determine this.
• The null hypothesis here says that the portfolio’s returns are equivalent to the
returns of S&P 500, while the alternative hypothesis says that the returns of
the portfolio and the returns of the S&P 500 are not equivalent.
• The p-value hypothesis test gives a measure of how much evidence is present
to reject the null hypothesis. The smaller the p value, the higher the evidence
against null hypothesis.
• Therefore, if the investor gets a P value of .001, it indicates strong evidence against
null hypothesis. So he confidently deduces that the portfolio’s returns and the
S&P 500’s returns are not equivalent.
Hypothesis Testing Numerical
• Problem Statement: A telecom service provider claims that
individual customers pay on average Rs. 400 per month, with a
standard deviation of Rs. 25. A random sample of 50 customer bills
during a given month has a mean of 250 and a standard
deviation of 15. What can be said about the claim made by the
service provider?
• Solution:
H0 (Null Hypothesis) : μ = 400
H1 (Alternate Hypothesis): μ ≠ 400 (Not equal means either μ > 400
or μ < 400 Hence it will be validated with two tailed test )
σ = 25 (Population Standard Deviation)
n = 50 (Sample size)
x̄ = 250 (Sample mean)
s = 15 (sample Standard deviation)
n > = 30 hence will go with z-test
Step 1:
Calculate z using the z-test formula:
z = (x̄ − μ) / (σ / √n) = (250 − 400) / (25 / √50) = −42.43
Step 2:
Get the z critical values from the z table for α = 5%:
z critical values = (−1.96, +1.96)
To accept the claim, the calculated z should lie between −1.96 and +1.96;
but the calculated z (−42.43) < −1.96, so we reject the null hypothesis.
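The same decision can be reproduced with a short Python sketch of the two-tailed z-test:

```python
import math

def z_test(xbar, mu, sigma, n):
    # z = (xbar - mu) / (sigma / sqrt(n))
    return (xbar - mu) / (sigma / math.sqrt(n))

# Telecom example: sample mean 250, claimed mean 400, sigma = 25, n = 50
z = z_test(250, 400, 25, 50)   # about -42.43
reject_h0 = abs(z) > 1.96      # two-tailed test at the 5% level
```

Since |z| vastly exceeds 1.96, the null hypothesis is rejected, contradicting the provider's claim.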
Chi-square test for independence
• A chi-square test of independence is to test whether two categorical variables are
related to each other or not.
• Example 1: we have a list of movie genres; this is the first variable. The second
variable is whether or not the patrons of those genres bought snacks at the theater.
The idea (or null hypothesis) is that the type of movie and whether or not people
bought snacks are unrelated. The owner of the movie theater wants to estimate how
many snacks to buy. If movie type and snack purchases are unrelated, estimating will be
simpler than if the movie types impact snack sales.
• Example 2: a veterinary clinic has a list of dog breeds they see as patients. The second
variable is whether owners feed dry food, canned food or a mixture. The idea (or
null hypothesis) is that the dog breed and types of food are unrelated. If this is true,
then the clinic can order food based only on the total number of dogs, without
consideration for the breeds.
Chi-square Test for Independence
Example
• Let’s take a closer look at the movie snacks example. Suppose we collect data for 600 people at our
theater. For each person, we know the type of movie they saw and whether or not they bought snacks.
• For the valid Chi-square test, the following conditions to be satisfied:
• Data values that are a simple random sample from the population of interest.
• Two categorical or nominal variables.
• For each combination of the levels of the two variables, we need at least five expected values. When
we have fewer than five for any one combination, the test results are not reliable. To confirm this, we
need to know the total counts for each type of movie and the total counts for whether snacks were
bought or not. For now, we assume we meet this requirement and will check it later.
Statistical details
The null hypothesis is that the type of movie and snack purchases are
independent. It is written as: H0:Movie Type and Snack
purchases are independent
The alternative hypothesis is the opposite i.e., Ha: Movie Type and Snack
purchases are not independent.
Chi-square Test for Independence
Example cont…
• The data summarized in a contingency table is as follows:
Type of movie Snacks No snacks
Action 50 75
Comedy 125 175
Family 90 30
Horror 45 10
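The test statistic for this table can be sketched in plain Python by computing expected counts from the row and column totals (at df = 3 the 5% critical value of the chi-square distribution is 7.815):

```python
def chi_square(table):
    # table: list of rows of observed counts
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence: row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

observed = [
    [50, 75],    # Action:  Snacks, No snacks
    [125, 175],  # Comedy
    [90, 30],    # Family
    [45, 10],    # Horror
]
chi2, df = chi_square(observed)  # chi2 about 65.0, df = 3
reject_independence = chi2 > 7.815  # 5% critical value for df = 3
```

All expected counts here are well above 5, so the validity condition stated earlier is met, and the large chi-square value leads to rejecting the null hypothesis of independence between movie type and snack purchases.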