
Statistical Concepts

Unit - 2
Data Exploration:
 Data exploration refers to the initial step in data analysis, in which data analysts use data
visualization and statistical techniques to describe dataset characteristics, such as size,
quantity, and accuracy, in order to better understand the nature of the data.

 Raw data is typically reviewed with a combination of manual workflows and automated
data-exploration techniques to visually explore data sets, look for similarities, patterns
and outliers and to identify the relationships between different variables.
 This is also sometimes referred to as exploratory data analysis, which is a statistical
technique employed to analyze raw data sets in search of their broad characteristics.
Exploratory Data Analysis
•Raw data are not very informative. Exploratory Data Analysis (EDA) is how we make sense of the data by
converting them from their raw form to a more informative one.

 In particular, EDA consists of:


 organizing and summarizing the raw data,
 discovering important features and patterns in the data and any
striking deviations from those patterns, and then
 interpreting our findings in the context of the problem
 EDA can be useful for:
 describing the distribution of a single variable (center, spread,
shape, outliers)
 checking data (for errors or other problems)
 checking assumptions for more complex statistical analyses
 investigating relationships between variables
Important Features of Exploratory
Data Analysis
There are two important features to the structure of EDA:

Examining Distributions — exploring data one variable at a time (univariate).
Examining Relationships — exploring data two variables at a time (bivariate).

In Exploratory Data Analysis, our exploration of data will always consist of the following two elements:
 visual displays, supplemented by numerical measures.
Populations and
Observations/Samples
•Normally, when an experiment involving random variables is performed, data are generated.
These data points are often measurements of some kind.
 In statistics, all instances of a random variable are called observations of that variable.
 Furthermore, the set of data points actually collected is termed a sample of a given
population of observations.

•To generalize: the population consists of the entire set of all possible observations, while the
sample is a subset of the population.
Definition of Population
•The different types of population are discussed as under:
 Finite Population: When the number of elements of the population is fixed and thus making it
possible to enumerate it in totality, the population is said to be finite.
 Infinite Population: When the number of units in a population is uncountable, so that it is
impossible to observe all the items of the universe, the population is considered infinite.
 Existent Population: The population which comprises objects that exist in reality is called an
existent population.
 Hypothetical Population: Hypothetical or imaginary population is the population which exists
hypothetically.
•Examples
• The population of all workers working in the sugar factory.
• The population of motorcycles produced by a particular company.
• The population of mosquitoes in a town.
• The population of tax payers in India.
Definition of Sample
• A part of the population chosen at random for participation in the study.
• The sample so selected should represent the population in all its characteristics,
and it should be free from bias, so as to produce a miniature cross-section,
as the sample observations are used to make generalisations about the population.
• In other words, the respondents selected out of the
population constitute a ‘sample’.
• The process of selecting respondents is known as
‘sampling.’
• The units under study are called sampling units.
• The number of units in a sample is called sample size.
Sampling
 Sampling is a technique of selecting individual members or a subset of the population to
make statistical conclusions on the basis of evidence from them and to estimate characteristics of
the whole population.
 A sample is a “subgroup of a population”.
 A sample is a way of obtaining a group of people or objects to study that is representative of a
larger population or universe of interest. (Stacks & Hocking, 1999)
Sampling Methods
Probability Sampling
• Any element can be chosen randomly from the population. It deals with choosing the
sample randomly.
• The most critical requirement of probability sampling is that everyone in your population
has a known and equal chance of getting selected.
• Ex. When an unbiased coin is tossed (randomly), the probability of getting a head is ½.
• Ex. The probability of getting a particular number, e.g. a 6, when a die is thrown (= 1/6).

Non-Probability Sampling
• Every element will be chosen on the subjective judgment (purposefully/intentionally) of the
researcher, from the population, on the basis of certain past experience and
knowledge rather than random selection.
• A sampling process where individual elements in the population may not have an
opportunity to be chosen as part of the sample.
• For example, one person could have a 10% chance of being selected and
another person could have a 50% chance.
Probability Sampling
 Simple Random Sampling

 Systematic Sampling

 Stratified Sampling

 Cluster sampling

 Multi Stage Sampling


Simple Random Sampling
 Randomly any element can be chosen
 Chance of selection is totally in a
randomized fashion.
 No previous knowledge, criterion, or procedure is
followed at the time of selecting the sample
from the population.

Example: Suppose we would like to select 10 students from a class of 75 students.
Write the roll number of each student on a separate chit, put the chits in a container,
and draw 10 chits from the container one by one at random. Here the probability of selecting
any particular student on the first draw is 1/75, and overall each student has a 10/75
chance of being in the sample.

Advantage: Every element has an equal chance of getting selected to be part of the sample.
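As a sketch of the chit-drawing procedure above in Python (the roll numbers and the seed are illustrative):

```python
import random

# Hypothetical class: 75 students identified by roll numbers 1..75.
roll_numbers = list(range(1, 76))

random.seed(0)  # fixed seed only so the sketch is reproducible
# Draw 10 roll numbers without replacement, like drawing 10 chits
# from a container; every student has the same chance of selection.
sample = random.sample(roll_numbers, k=10)

print(len(sample))       # 10 students selected
print(len(set(sample)))  # 10 -- no student is drawn twice
```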
Systematic Sampling
 Each member of the sample comes
after an equal interval from its
previous member.
 All the elements are put together in
a sequence first where each
element has the equal chance of
being selected.
 Select a random starting point and
then select the individual at regular
intervals

Example: Suppose we would like to select 10 students from a class of 75 students.
Choose a random starting roll number, then select every 5th student.

Advantage: As each student has a chance of getting selected, there is no bias in
selection.
Systematic Sampling (cont..)
 For a sample of size n, we divide our population of size N into subgroups of k
elements.
 We select our first element randomly from the first subgroup of k elements.
 To select the other elements of the sample, proceed as follows:
 The number of elements in each subgroup is k = N/n
 So if our first element is n1, then the second element is n2 = n1 + k
 The third element is n3 = n2 + k, and so on. Taking an example with N = 20 and n = 5:
 The number of elements in each subgroup is k = N/n = 20/5 = 4
 Now randomly select the first element from the first subgroup. If we select
n1 = 3, then n2 = n1 + k = 3 + 4 = 7 and n3 = n2 + k = 7 + 4 = 11
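The N = 20, n = 5 walkthrough above can be sketched as:

```python
# Systematic sampling: population of N = 20, sample of n = 5,
# so each subgroup holds k = N // n = 4 elements.
N, n = 20, 5
k = N // n

start = 3  # first element, chosen at random from the first subgroup (as in the example)
sample = [start + i * k for i in range(n)]
print(sample)  # [3, 7, 11, 15, 19]
```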
Stratified Sampling
 The population is divided into
smaller homogeneous groups or strata
by some characteristics.
 i.e., the elements within a group are
homogeneous, while the groups formed are
heterogeneous with respect to one another.
 The samples are selected randomly
from these strata.
 We need to have prior information
about the population to create
subgroups

Example: Suppose we would like to select some students from a class of 75
students. The students will be divided into groups of boys and girls. Then some
students will be chosen from the boys and some from the girls.

Advantage: Members of each category or group will be chosen without any bias.
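A minimal sketch of stratified selection for the boys/girls example; the stratum sizes and the 1-in-15 sampling fraction are illustrative assumptions:

```python
import random

random.seed(1)
# Hypothetical strata: the class is split by a known characteristic.
strata = {
    "boy":  [f"boy_{i}" for i in range(1, 46)],    # 45 boys
    "girl": [f"girl_{i}" for i in range(1, 31)],   # 30 girls
}

sample = []
for label, members in strata.items():
    n_stratum = len(members) // 15   # proportional allocation: 3 boys, 2 girls
    sample += random.sample(members, n_stratum)

print(len(sample))  # 5
```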
Cluster Sampling
 From the big population, choose a small
group by dividing it into clusters/sections,
e.g. area-wise.
 The clusters are randomly selected.
 All the elements of the cluster are used for
sampling.

Example: Suppose we would like to know the awareness about COVID in a city.
Instead of going the details survey of the entire city one can divide the city into
clusters and randomly choose a cluster from that. All the members of the cluster
will be considered.

Cluster sampling can be done in following ways:


 Single Stage Cluster Sampling
 Two Stage Cluster Sampling
Single and Two stage Cluster
Sampling
 Single Stage Cluster Sampling: Divide the entire population into clusters.
Out of the many clusters, one cluster is selected randomly for sampling.

 Two Stage Cluster Sampling: Divide the entire population into clusters.
Randomly select two or more clusters, and then from those selected clusters
again randomly select elements for sampling.

Example: An airline company wants to survey its customers one day, so they
Multi Stage Sampling
 Population is divided into multiple clusters and then these clusters are further
divided and grouped into various sub groups (strata) based on similarity.
 One or more clusters can be randomly selected from each stratum.
 This process continues until the cluster can’t be divided anymore.
 Example : A country can be divided into states, cities, urban and rural and all
the areas with similar characteristics can be merged together to form a strata.
Non-Probability Sampling
 Every element will be chosen purposefully/intentionally from the population on the basis of certain past
experience and knowledge.
 It is a less stringent method.
 This sampling method depends heavily on the expertise of the researchers.
 It is carried out by observation, and researchers use it widely for qualitative research.
 Mainly classified into
 Quota Sampling
 Purposive Sampling/Judgemental Sampling
 Convenience Sampling
 Referral / Snowball Sampling
Quota Sampling
 Quota sampling works by first dividing the selected population into exclusive subgroups.
 The proportions of each subgroup are measured, and the ratios of the subgroups are then
used in the final sampling process.
 The proportions of the selected subgroups are used as boundaries for selecting a sample population of
proportionally represented subgroups.
 There are two types of quota sampling:
 proportional
 non proportional
Proportional Quota Sampling
 In proportional quota sampling you want to represent the major characteristics of the population by sampling a
proportional amount of each.
 The problem here is that you have to decide the specific characteristics on which you will base the quota.
Will it be gender, age, education, race, religion, etc.?
 For example, if you know the population has 40% women and 60% men, and you want a total sample
size of 100, you will continue sampling until you reach those percentages and then stop. So, if you’ve
already got the 40 women for your sample but not the 60 men, you will continue to sample men;
even if legitimate women respondents come along, you will not sample them, because you have
already “met your quota.”
Non-Proportional Quota Sampling
 Use when it is important to ensure that a number of sub-groups in the field of study are well-covered.
 Use when you want to compare results across sub-groups.
 Use when there is likely to be a wide variation in the studied characteristic within minority groups.
 Identify sub-groups from which you want to ensure sufficient coverage. Specify a minimum sample size
from each sub-group.
 Here, you’re not concerned with having numbers that match the proportions in the population.
Instead, you simply want to have enough to assure that you will be able to talk about even small groups
in the population.
 Example: A study of the prosperity of ethnic groups across a city specifies that a minimum of 50 people
in each of ten named groups must be included in the study. The distributions of incomes across the
ethnic groups are then compared against one another.
Purposive Sampling/Judgemental
Sampling
 Samples are chosen only on the basis of the researcher’s
knowledge and judgement.
 It enables the researcher to select cases that will best
enable him to answer his research questions that meet the
objective.
 Choosing a sample because it serves a certain
purpose.
 Example-1: In online live voting for selecting a GOOD Singer from a competition, the people
who have interest in singing can be selected in the sample .
 Example-2: If we want to understand the thought process of the people who are interested in pursuing
master’s degree then the selection criteria would be “Are you interested for Masters in..?”
 All the people who respond with a “No” will be excluded from our sample.
Convenience Sampling
 Convenience sampling (also called accidental sampling or grab sampling) is where you include people
who are easy to reach.
 Samples are taken mainly on the basis of ready availability.
 A sample that is convenient for the researcher or the data analyst can be chosen. The task is done
without any principles or theories.
 For example, you could survey people from:
 Your workplace,
 Your school,
 A club you belong to,
 The local mall.

 Example: Suppose I would like to select 5 students from a class of
75 students: I choose the 5 students who sit near me, without any principle of selection.
Referral / Snowball Sampling
 Snowball sampling method is purely based on referrals and
that is how a researcher is able to generate a sample.
 So the researcher will take help from the first element
which he selects from the population and ask him to
recommend others who fit the description of the
sample needed.
 So this referral technique goes on, increasing the size of
the sample like a snowball.

Example: If you are studying the level of customer satisfaction among the members of an elite country club,
you will find it extremely difficult to collect primary data sources unless a member of the club agrees to have
a direct conversation with you and provides the contact details of the other members of the club.
Sampling Errors
 Sampling error is a statistical error that occurs when an analyst does not select a
sample that represents the entire population of data.
 The results found in the sample thus do not represent the results that
would be obtained from the entire population.
 Sampling error can be reduced by randomizing sample selection and by
increasing the number of observations.
 It mainly happens when the sample size is very small (10 to 100).
For example, suppose you wanted to figure out how many people out of a thousand
were under 18, and you came up with the figure 19.357%. If the actual percentage
equals 19.300%, the difference (19.357 − 19.300) of 0.057 percentage points is the
sampling error. If you continued to take samples of 1,000 people, you’d probably get
slightly different statistics (19.1%, 18.9%, 19.5%, etc.), but they would all be around
the same figure. This is one of the reasons that you’ll often see sample sizes of 1,000
or 1,500 in surveys: they produce a very acceptable margin of error of about 3%.

Formula: the formula for the margin of error is 1/√n, where n is the size of the
sample. For example, a random sample of 1,000 has a margin of error of about
1/√1000 ≈ 3.2%.
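The 1/√n rule of thumb from the passage above, as a small function:

```python
from math import sqrt

def margin_of_error(n):
    """Rule-of-thumb margin of error 1/sqrt(n) for a sample of size n."""
    return 1 / sqrt(n)

# Sample sizes commonly seen in surveys:
print(round(100 * margin_of_error(1000), 1))  # 3.2 (percent)
print(round(100 * margin_of_error(1500), 1))  # 2.6 (percent)
```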
Five Common Types of Sampling
Errors
 Population Specification Error—This error occurs when the researcher does not understand who
they should survey.
 Sample Frame Error—A frame error occurs when the wrong sub-population is used to select a
sample.
 Selection Error—This occurs when respondents self-select their participation in the study –
only those that are interested respond. Selection error can be controlled by going extra lengths
to get participation.
 Non-Response—Non-response errors occur when those who respond are different from those who do
not respond. This may occur because the potential respondent was not contacted or because they
refused to respond.
 Sampling Errors—These errors occur because of variation in the number or
representativeness of the sample that responds. Sampling errors can be controlled by (1)
careful sample designs, (2) large samples, and (3) multiple contacts to assure representative
response.
Data Sets, Variables, and
Observations
• A data set is usually a rectangular array of data, with variables in columns and observations in
rows.
• A variable (or field or attribute) is a characteristic of members of a population, such as
height, gender, or salary.
• An observation (or case or record) is a list of all variable values for a single member of a
population.
 The data set includes observations on 10 people who
responded to a questionnaire on the president’s
environmental policies.
 Variables include age, gender, state, children, salary,
and opinion.
 Include a row that lists the variable names.
 Include a column that shows an index of the
observations.
Data Types
• Variables (or attributes, dimensions, features)
•A variable is a characteristic that can be measured and that can assume
different values. (Height, age, income, province or country of birth, grades
obtained at college and type of housing are all examples of variables.)
• Types of variables
•Variables may be classified into two main categories:
• Categorical (Qualitative)
• (A categorical variable (also called a qualitative variable) refers to a characteristic
that can’t be quantified.)
• Numeric (Quantitative)
• (A variable is numeric if meaningful arithmetic can be performed on it.)
A variable is numerical if meaningful arithmetic can be performed on it.
Otherwise, the variable is categorical.
Categorical (Qualitative)
Variable/Attribute
• Nominal: Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things
for example,
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• The values do not have any meaningful order about them.
• Binary: Nominal attribute with only 2 states (0 and 1), where 0 typically means that the attribute is absent, and 1
means that it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false.
• Symmetric binary: both outcomes equally important, e.g., gender
• Asymmetric binary: outcomes not equally important, e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal: Values have a meaningful order (ranking) but magnitude between successive values is not known.
• Size = {small, medium, large}, grades, army rankings
• Other examples of ordinal attributes include Grade (e.g., A+, A, A−, B+, and so on) and
• Professional rank. Professional ranks can be enumerated in a sequential order, such as assistant, associate, and
full for professors,
Categorical (Qualitative)
Variable/Attribute (cont..)
• Categorical variables can be coded numerically.
• Gender has not been coded, whereas Opinion
has been coded.
• This is largely a matter of taste: coding a
categorical variable does not make it
numerical and appropriate for
arithmetic operations.
• Now Opinion has been replaced by text, and
Gender has been coded as 1 for males and 0
for females.
• This 0−1 coding for a categorical variable is
very common. Such a variable is called a
dummy variable; it often simplifies the analysis.
A dummy variable is a 0−1 coded variable for a specific category.
It is coded as 1 for all observations in that category and 0 for all
observations not in that category.
Categorical (Qualitative)
Variable/Attribute (cont..)
• A binned (or discretized) variable
corresponds to a numerical variable that
has been categorized into discrete categories.
• These categories are usually called bins.
• The Age variable has been categorized as
“young” (34 years or younger),
“middle-aged” (from 35 to 59 years), and
“elderly” (60 years or older).
Numerical (Quantitative)
Variable/Attribute
• Numerical variables can be classified as discrete or continuous.
• The basic distinction is whether the data arise from counts or continuous
measurements.
 The variable Children is clearly a count (discrete),
 whereas the variable Salary is best treated as continuous.
• Numeric attributes can be interval-scaled or ratio-scaled.
 Interval
• Measured on a scale of equal-sized units. The values of interval-scaled
attributes have order and can be positive, 0, or negative, e.g.,
temperature in °C or °F, calendar dates. There is no true zero-point.
 Ratio
• Inherent zero-point. We can speak of values as being an order of
magnitude larger than the unit of measurement (10 K is twice as
high as 5 K).
Data sets can also be categorized as cross-sectional or time series

• Cross-sectional data are data on a cross section of a population at a


distinct point in time.
 The opinion data set is cross-sectional.
• Time series data are data collected over time.
 a time series data set tracks one or more variables through time.

A time series data set generally has the same layout—variables in columns and observations
in rows—but now each variable is a time series. Also, one of the columns usually indicates the
time period.
It has quarterly observations on revenues from toy sales over a four-year period in column B,
with the time periods listed chronologically in column A.
Properties of Attribute Values
• The type of an attribute depends on which of the following
properties/operations it possesses:
• Distinctness : = and ≠
• Order : <, ≤, >, and ≥
• Addition : + and − (differences are meaningful)
• Multiplication : * and / (ratios are meaningful)
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & meaningful differences
• Ratio attribute: all 4 properties/operations
Descriptive measures for Numerical
variables
• Measure of Central Tendency
• Measure of Variability
• Measure of Shape (Kurtosis and Skewness)
Measures of Central Tendency
There are three common measures of central tendency, all of which try to
answer the basic question of which value is most “typical.”
• These are the mean, the median, and the mode.
• Mean of the Sample.
A measure of central tendency is a number that
represents the typical value in a collection of
numbers.
Mean = sum of all data points/n (The mean is
also known as the "average" or the "arithmetic
average.")
• If the data set represents a sample from some larger population, this
measure is called the sample mean and is denoted by X̄.
• If the data set represents the entire population, it is called the
population mean and is denoted by µ.
Measures of Central Tendency
• A trimmed mean (sometimes called a truncated mean) is similar to a mean, but it trims any
outliers. Outliers can affect the mean (especially if there are just one or two very large values), so
a trimmed mean can often be a better fit for data sets with erratic high or low values or for
extremely skewed distributions. Even a small number of extreme values can corrupt the mean.
• For example, the mean salary at a company may be substantially pushed up by that of a few
highly paid managers. Similarly, the mean score of a class in an exam could be pulled down quite
a bit by a few very low scores.
• Which is the mean obtained after chopping off values at the high and low extremes.
• Example: Find the 20% trimmed mean for the following test scores: 60, 81, 83, 91, 99.
• Step 1: Trim the top and bottom 20% from the data (one value from each end). That
leaves us with the middle three values: 81, 83, 91.
• Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 = 85.
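The two steps above can be sketched as a small helper (`trimmed_mean` is an illustrative name, not a library function; `scipy.stats.trim_mean` offers a ready-made version):

```python
def trimmed_mean(values, proportion):
    """Mean after chopping off `proportion` of the values at each extreme."""
    data = sorted(values)
    k = int(len(data) * proportion)        # number of values to drop at each end
    trimmed = data[k:len(data) - k] if k else data
    return sum(trimmed) / len(trimmed)

scores = [60, 81, 83, 91, 99]
# 20% of 5 values = 1 value trimmed from each end: 60 and 99 are dropped.
print(trimmed_mean(scores, 0.20))  # 85.0
```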
Measures of Central Tendency
(cont..)
• Median of a Sample
• The median of a set of data is the “middle element” when the data is arranged in ascending order. To
determine the median:

1. Put the data in order from smallest to largest.


2. Determine the number in the exact center.

If there are an odd number of data points, the median will be the number in the absolute middle.
If there is an even number of data points, the median is the mean of the two center data points,
meaning the two center values should be added together and divided by 2.

• Example:
• Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
• Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
• Step 2: Determine the absolute middle of the data: 9, 10, 12, 13, 14, 14, 17, 17, 20. The median is 14.
Measures of Central Tendency
(cont..)
• The Mode of a Sample:
The mode is the most frequently occurring measurement in a data set.
There may be one mode; multiple modes, if more than one number occurs most frequently; or no mode at all,
if every number occurs only once.
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
To determine the mode:
1. Put the data in order from smallest to largest, as you did to find your median.
2. Look for any value that occurs more than once.
3. Determine which of the values from Step 2 occurs most frequently.
• Example:-
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 2: Look for any number that occurs more than once. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 3: Determine which of those occur most frequently. 14 and 17 both occur twice.
The modes of this data set are 14 and 17.
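All three measures on the example data set, using Python's statistics module:

```python
import statistics

data = [17, 10, 9, 14, 13, 17, 12, 20, 14]

print(statistics.mean(data))       # 14  (126 / 9)
print(statistics.median(data))     # 14  (middle of the sorted data)
print(statistics.multimode(data))  # both 14 and 17 occur twice
```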
Measures of Central Tendency
(cont..)
Frequency, Relative Frequency and
Cumulative Relative Frequency
• Frequency (or event) recording is a way to measure
the number of times a behavior occurs within a
given period.
• The advantage of using frequency distributions is
that they present raw data in an organized, easy-to-
read format. The most frequently occurring scores
are easily identified, as are score ranges, lower and
upper limits, cases that are not common, outliers,
and the total number of observations between any
given scores.
• A relative frequency distribution shows the
proportion of the total number of observations
associated with each value or class of values and is
related to a probability distribution.
• Advantage: within the overall number of
observations, a relative frequency reflects how
frequently a given type of event occurs within
that total number of observations.
• Cumulative frequency represents the sum of the
relative frequencies.
• Cumulative frequency is used to determine the
number of observations that lie above (or below) a
particular value in a data set.
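A sketch of the three measures on a small data set (reusing the data set from the central-tendency examples):

```python
from collections import Counter
from itertools import accumulate

data = [17, 10, 9, 14, 13, 17, 12, 20, 14]
n = len(data)

freq = Counter(sorted(data))                # frequency of each value
rel = {v: c / n for v, c in freq.items()}   # relative frequency
cum = list(accumulate(rel.values()))        # cumulative relative frequency

print(freq[14])           # 2 -- the value 14 occurs twice
print(round(rel[14], 3))  # 0.222
print(round(cum[-1], 3))  # 1.0 -- cumulative relative frequency ends at 1
```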
Exercises
1. Find the measures of central tendency: mean, median, and mode.

2. The table represents the heights, in inches, of a sample
of 100 male semiprofessional soccer players.
a) From the table, find the percentage of heights that
are less than 65.95 inches.
b) Find the percentage of heights that fall
between 61.95 and 65.95 inches.
c) Find the number of players in the sample who are
between 61.95 and 71.95 inches tall.
Mean, Median and Mode from Grouped
Frequencies
The Race...
This starts with some raw data (not a
grouped frequency yet).

After sorting the time points in the dataset:
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62,
62, 62, 64, 65, 65, 67, 68, 68, 70
Mean, Median and Mode from Grouped
Frequencies
• Grouped Frequency Table
• Alex then makes a Grouped Frequency Table:
• So 2 runners took between 51 and 55 seconds, 7 took
between 56 and 60 seconds, etc
Estimating the Mean from
Grouped Data
 We can estimate the Mean by using the midpoints.
 Let's now make the table using midpoints:
Estimating the Mean from
Grouped Data
• Our thinking is: "2 people took 53 sec, 7 people took 58
sec, 8 people took 63 sec and 4 took 68 sec". In other
words we imagine the data looks like this:
53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68

• Then we add them all up and divide by 21. The quick way
to do it is to multiply each midpoint by its frequency:

And then our estimate of the mean time to complete the
race is:
(53 × 2 + 58 × 7 + 63 × 8 + 68 × 4) / 21 = 1288 / 21 ≈ 61.33 seconds
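The midpoint calculation above, sketched in code:

```python
# Grouped frequency table from the race example: (midpoint, frequency)
# for the groups 51-55, 56-60, 61-65, 66-70.
groups = [(53, 2), (58, 7), (63, 8), (68, 4)]

total = sum(f for _, f in groups)                      # 21 runners
estimated_mean = sum(m * f for m, f in groups) / total

print(round(estimated_mean, 2))  # 61.33 seconds
```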
Estimating the Median from Grouped
Data
• Let's look at our data again:
• The median is the middle value, which in our
case is the 11th one, which is in the 61 - 65
group:
• We can say "the median group is 61 - 65"
• But if we want an estimated Median value we
need to look more closely at the 61 - 65 group.
Estimating the Median
from Grouped Data
• At 60.5 we already have 9 runners, and by the next
boundary at 65.5 we have 17 runners.
• By drawing a straight line in between we can pick out
where the median frequency of n/2 runners is:
And this handy formula does the calculation:

Estimated Median = L + ((n/2 − B) / G) × w

where:
L is the lower class boundary of the group
containing the median
n is the total number of values
B is the cumulative frequency of the groups before
the median group
G is the frequency of the median group
w is the group width

For our example:
L = 60.5
n = 21
B = 2 + 7 = 9
G = 8
w = 5

Estimated Median = 60.5 + ((21/2 − 9) / 8) × 5 = 60.5 + 0.9375 ≈ 61.4
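Plugging the example values into the formula (`grouped_median` is an illustrative helper name):

```python
def grouped_median(L, n, B, G, w):
    """Estimated median from grouped data: L + ((n/2 - B) / G) * w."""
    return L + ((n / 2 - B) / G) * w

# Median group 61-65: lower boundary 60.5, 21 values in total,
# 9 values before the group, 8 values in it, group width 5.
print(round(grouped_median(L=60.5, n=21, B=9, G=8, w=5), 2))  # 61.44
```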
Estimating the Mode
from Grouped
• Again, looking at our data:
Data
• We can easily find the modal group (the group with the highest
frequency), which is 61 - 65
• We can say "the modal group is 61 - 65"
• But the actual Mode may not even be in that group! Or there may
be more than one mode. Without the raw data we don't really
know. But, we can estimate the Mode using the following
formula:

where:
L is the lower class boundary of the modal group fm is the frequency of the modal
is the frequency of the group before the modal group group
fm-1
is the frequency of the group after the modal group w is the group width
f
For our
example:
L = 60.5
fm-1 =7
fm =8
fm+1 =4
w =5
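The same calculation as a sketch (`grouped_mode` is an illustrative helper name):

```python
def grouped_mode(L, fm, fm_prev, fm_next, w):
    """Estimated mode from grouped data:
    L + (fm - fm_prev) / ((fm - fm_prev) + (fm - fm_next)) * w."""
    return L + (fm - fm_prev) / ((fm - fm_prev) + (fm - fm_next)) * w

# Modal group 61-65: L = 60.5, fm = 8, neighbours 7 (before) and 4 (after), width 5.
print(grouped_mode(L=60.5, fm=8, fm_prev=7, fm_next=4, w=5))  # 61.5
```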
Example (Age): The ages of the 112 people who live on
a tropical island are grouped as follows:

Example (Baby Carrots): You grew fifty baby carrots using
special soil. You dig them up, measure
their lengths (to the nearest mm) and
group the results:
Measures of
Variability
• Measures of variability give a sense of how spread out the response values are.
• The range, standard deviation and variance each reflect different aspects of
spread.
• Percentiles and quartiles certainly tell you something about variability.
• Specifically, for any percentage p, the pth percentile is the value such that a
percentage p of all values are less than it. Similarly, the first, second, and third
quartiles are the percentiles corresponding to p = 25%, p = 50%, and p = 75%.
These three values divide the data into four groups, each with (approximately) a
quarter of all observations.
• Note that the second quartile is equal to the median by definition.
For example, if you learn that your score in the verbal SAT test is at the 93rd percentile, this
means that you scored better than 93% of those taking the test.
Measures of Variability
(cont..)
• Range
• Interquartile Range
• Variance
• Standard Deviation
Range
The range gives you an idea of how far apart the most extreme response scores are. To
find the range, simply subtract the lowest value from the highest value.
Measures of Variability
(cont..)
Interquartile range
• The Interquartile range measures the variability, based on dividing an ordered set
of data into quartiles.
• Quartiles are three values or cuts that divide each respective part as the first, second,
and third quartiles, denoted by Q1, Q2, and Q3

Q1 = the cut in the first half of the rank-ordered data set (the lower quartile)
Q2 = the median value of the set
Q3 = the cut in the second half of the rank-ordered data set (the upper
quartile)

So it is really the range of the middle 50% of the data.
School of Computer Engineering
Measures of Variability
(cont..)
Interquartile range(IQR) = Upper Quartile – Lower Quartile
IQR=Q3–Q1
Where,
IQR = Interquartile range
Q1 = (1/4)(n + 1)th term
Q3 = (3/4)(n + 1)th term
n = number of data points
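Under this positional rule (which assumes (n + 1) is divisible by 4 so the quartile positions land on whole terms), the range and IQR can be sketched as follows; the data row reuses the small outlier example that appears later in this unit:

```python
def iqr_positional(data):
    """Q1 and Q3 via the (n + 1)/4 positional rule from the slide.
    Assumes (n + 1) is divisible by 4 so positions are whole numbers."""
    xs = sorted(data)
    n = len(xs)
    q1 = xs[(n + 1) // 4 - 1]        # (1/4)(n + 1)th term, 1-indexed
    q3 = xs[3 * (n + 1) // 4 - 1]    # (3/4)(n + 1)th term
    return q3 - q1

data = [10, 15, 22, 330, 30, 45, 60]         # n = 7, so (n + 1)/4 = 2
print("Range =", max(data) - min(data))      # 330 - 10 = 320
print("IQR   =", iqr_positional(data))       # 60 - 15 = 45
```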
Measures of Variability
(cont..)
• If the median of the data set is located to the right of the center of the box, the distribution is negatively skewed.
• If the median of the data set is located to the left of the center of the box, the distribution is positively skewed.
Measures of Variability
(cont..)
Variance
The variance is essentially the average of the squared deviations from the mean, where if Xi is a typical observation, its squared deviation from the mean is (Xi − mean)². As in the discussion of the mean, there is a sample variance, denoted by s², and a population variance, denoted by σ².
We use variance to see how individual numbers relate to each other within a data set. Variance analysis helps an organization be proactive in achieving its business targets.
Measures of Variability
(cont..)
Standard deviation
A fundamental problem with variance is that it is in squared units. A more natural measure is the standard deviation, which is the square root of the variance.
• The sample standard deviation, denoted by s, is the square root of the sample variance.
• The population standard deviation, denoted by σ, is the square root of the population variance.
Measures of Variability
(cont..)
Standard deviation: Important Points
• Standard deviation is sensitive to extreme values. A single very extreme value can increase the standard deviation and misrepresent the dispersion.
• For two data sets with the same mean, the one with the larger standard deviation is the one in which the data is more spread out from the center.
• Standard deviation is equal to 0 if all values are equal (because all values are then equal to the mean).
Measures of Variability
(cont..)
Standard deviation
There are six steps for finding the standard deviation:
1. List each score and find their mean.
2. Subtract the mean from each score to get the deviation from the mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by N − 1.
6. Find the square root of the number you found.
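The six steps map directly onto code. A minimal sketch with made-up scores (the function name `sample_std` is illustrative):

```python
import math

def sample_std(scores):
    n = len(scores)
    mean = sum(scores) / n                       # step 1: find the mean
    deviations = [x - mean for x in scores]      # step 2: deviation from the mean
    squared = [d ** 2 for d in deviations]       # step 3: square each deviation
    total = sum(squared)                         # step 4: sum of squared deviations
    variance = total / (n - 1)                   # step 5: divide by N - 1
    return math.sqrt(variance)                   # step 6: take the square root

scores = [2, 4, 4, 4, 5, 5, 7, 9]                # hypothetical scores, mean = 5
print(round(sample_std(scores), 4))              # sqrt(32 / 7) ≈ 2.1381
```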
Measures of Variability
(cont..)
Standard deviation
Empirical Rules for Interpreting Standard Deviation
 The interpretation of the standard deviation can be stated as three empirical rules.
 “Empirical” means that they are based on commonly observed data, as opposed to theoretical mathematical arguments.
 If the values of a variable are approximately normally distributed (symmetric and bell-shaped), then the following rules hold:
 Approximately 68% of the observations are within one standard deviation of the mean, that is, in the interval (mean − s, mean + s).
 Approximately 95% of the observations are within two standard deviations of the mean, that is, in the interval (mean − 2s, mean + 2s).
 Approximately 99.7% of the observations are within three standard deviations of the mean, that is, in the interval (mean − 3s, mean + 3s).
Measures of Shape
There are two final measures of a distribution you will hear occasionally:
 skewness and
 kurtosis.
Skewness
• Skewness refers to the degree of symmetry,
or more precisely, the degree of lack of
symmetry.
• Skewness is used to measure the level of
asymmetry in our graph.
• It is the measure of asymmetry that
occurs
when our data deviates from the norm.
Measures of Shape
(cont..)
Skewness
• Sometimes, the normal distribution tends to tilt more on one side.
• This happens because the probability of observing values above the mean differs from the probability of observing values below it, which makes the distribution asymmetrical.
• This also means that the data is not equally distributed.
• The skewness can be of two types:
1. Positively Skewed: In a positively skewed distribution, the values are more concentrated towards the left side, and the right tail is spread out. The mean is pulled towards the right-hand side, and the skewness value is positive. In this distribution, Mean > Median > Mode.
2. Negatively Skewed: In a negatively skewed distribution, the data points are more concentrated towards the right-hand side, and the left tail is spread out. This pulls the mean, and to a lesser extent the median, towards the left, and the skewness value is negative. In this distribution, Mode > Median > Mean.
Measures of Shape
(cont..)
Skewness
Consider how these central tendency measures spread out when the normal distribution is distorted. For the nomenclature, just follow the direction of the tail:
— If the graph has its tail to the right, it is right-skewed (positively skewed).
— If the tail is to the left, it is left-skewed (negatively skewed).

How about deriving a measure that captures the horizontal distance between the Mode and the
Mean of the distribution?
It’s intuitive to think that the higher the skewness, the more apart these measures will be.
Pearson’s Coefficient of Skewness: This method is most frequently used for measuring skewness. The formula for measuring the coefficient of skewness is given by
Sk1 = (Mean − Mode) / Standard Deviation
The value of this coefficient would be zero in a symmetrical distribution. If the mean is greater than the mode, the coefficient of skewness would be positive, otherwise negative. The value of Pearson’s coefficient of skewness usually lies between ±1 for a moderately skewed distribution.
Measures of Shape
(cont..)
Skewness
If this value is between:
• −0.5 and 0.5, the distribution of the value is almost symmetrical.
• −1 and −0.5, the data is negatively skewed, and if it is between 0.5 and 1, the data is positively skewed. The skewness is moderate.
• If the skewness is lower than −1 (negatively skewed) or greater than 1 (positively skewed), the data is highly skewed.
If the mode is not well defined, we use the empirical relationship Mode ≈ 3 Median − 2 Mean. Substituting this in Pearson’s first coefficient gives us Pearson’s second coefficient and the formula for skewness:
Sk2 = 3 (Mean − Median) / Standard Deviation
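Both coefficients are one-liners. The summary statistics below are hypothetical, chosen to land in the "moderately positive" band described above:

```python
def pearson_skew_1(mean, mode, std):
    # Sk1 = (Mean - Mode) / standard deviation
    return (mean - mode) / std

def pearson_skew_2(mean, median, std):
    # Sk2 = 3 * (Mean - Median) / standard deviation
    return 3 * (mean - median) / std

# Hypothetical summary statistics
print(pearson_skew_1(70.5, 65.0, 10.0))   # (70.5 - 65.0) / 10 = 0.55
print(pearson_skew_2(70.5, 68.0, 10.0))   # 3 * 2.5 / 10 = 0.75
```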
Measures of Shape
(cont..)
What Is Kurtosis?
• Kurtosis gives a measure of the flatness of a distribution.
• We need another measure to get a complete idea about the shape of the distribution, which can be studied with the help of kurtosis.
• Kurtosis is associated with the “movement of probability mass from the shoulders of a distribution into its center and tails.”
• The degree of kurtosis of a distribution is measured relative to that of a normal curve.
• The curves with greater peakedness than the normal curve are called “Leptokurtic”.
• The curves which are flatter than the normal curve are called “Platykurtic”.
• The normal curve is called “Mesokurtic.”
Measures of Shape
(cont..)
Kurtosis
Kurtosis is used to find the presence of outliers in data.
It gives us the total degree of outliers present.
• A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0).
• Any distribution with kurtosis ≈ 3 (excess ≈ 0) is called mesokurtic.
• A distribution with kurtosis < 3 (excess kurtosis < 0) is called platykurtic. (Its tails are shorter and thinner, and often its central peak is lower and broader.)
• A distribution with kurtosis > 3 (excess kurtosis > 0) is called leptokurtic. (Its tails are longer and fatter, and often its central peak is higher and sharper.)
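These classifications can be sketched with the moment-based definition of kurtosis (fourth central moment divided by the squared second central moment, so a normal distribution gives about 3). The sample and the mesokurtic tolerance below are illustrative assumptions:

```python
def kurtosis(xs):
    """Moment-based kurtosis: m4 / m2**2 (normal curve -> about 3)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n    # second central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n    # fourth central moment
    return m4 / m2 ** 2

def classify(k, tol=0.5):                        # tolerance is an assumption
    if abs(k - 3) < tol:
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"

flat = list(range(1, 10))                        # uniform-like, flatter than normal
k = kurtosis(flat)
print(round(k, 2), classify(k))                  # 1.77 platykurtic
```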
Measures of Sample Skewness and Kurtosis
Examples: Calculate Sample Skewness and Sample Kurtosis from the following grouped
data
Outliers and Missing Values
Outliers
• An outlier is a value or an entire observation (row) that lies well outside of the norm.
• Some statisticians define an outlier as any value more than three standard deviations from
the mean, but this is only a rule of thumb.
• Even if values are not unusual by themselves, there still might be unusual combinations of values.
• When dealing with outliers, it is best to run the analyses two ways: with the
outliers and without them.
• Let’s just agree to define outliers as extreme values, and then for any particular data
set, you can decide how extreme a value needs to be to qualify as an outlier.
For example, let us consider a row of data [10,15,22,330,30,45,60]. In this dataset, we can easily conclude that 330
is way off from the rest of the values in the dataset, thus 330 is an outlier. It was easy to figure out the outlier in such
a small dataset, but when the dataset is huge, we need various methods to determine whether a certain value is an
outlier or necessary information.
Outliers
(cont..)
Types of outliers :
There are three types of outliers
• Global Outliers: The data point or points whose values are far outside everything else in the dataset are global
outliers. Suppose we look at a taxi service company’s number of rides every day. The rides suddenly dropped to zero
due to the pandemic-induced lockdown. This sudden decrease in the number is a global outlier for the taxi company.
• Collective Outliers: Some data points collectively as a whole deviates from the dataset. These data points
individually may not be a global or contextual outlier, but they behave as outliers when aggregated together. For
example, closing all shops in a neighborhood is a collective outlier as individual shops keep on opening and closing,
but all shops together never close down; hence, this scenario will be considered a collective outlier.

• Contextual Outliers: Contextual outliers are those values of data points that deviate quite a lot from the rest
of the data points that are in the same context, however, in a different context, it may not be an outlier at all. For
example, a sudden surge in orders for an e-commerce site at night can be a contextual outlier.

Outliers can lead to vague or misleading predictions when using machine learning models. Specific models like linear regression, logistic regression, and support vector machines are susceptible to outliers. Outliers reduce the statistical power of these models, and thus the output of the models becomes unreliable.
Outliers
(cont..)
• When there are no outliers in a sample, the mean and standard deviation are used
to summarize a typical value and the variability in the sample, respectively.
• When there are outliers in a sample, the median and interquartile range are used
to summarize a typical value and the variability in the sample, respectively.

Tukey Fences
• Outliers are values below Q1 − 1.5(Q3 − Q1) or above Q3 + 1.5(Q3 − Q1), or equivalently, values below Q1 − 1.5 IQR or above Q3 + 1.5 IQR.
• In the previous example, for the diastolic blood pressures, the lower limit is 64 − 1.5(77 − 64) = 44.5 and the upper limit is 77 + 1.5(77 − 64) = 96.5. The diastolic blood pressures range from 62 to 81. Therefore there are no outliers.
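A sketch of the fence computation, first reproducing the blood-pressure limits above, then flagging outliers in a small hypothetical sample:

```python
def tukey_fences(q1, q3):
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Diastolic blood pressure quartiles from the example
low, high = tukey_fences(64, 77)
print(low, high)                     # 44.5 96.5

# Flagging outliers in a hypothetical sample, given its quartiles
data = [10, 15, 22, 30, 45, 60, 330]
q1, q3 = 15, 60                      # quartiles via the (n + 1)/4 rule
low, high = tukey_fences(q1, q3)
print([x for x in data if x < low or x > high])   # [330]
```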
Outliers
(cont..)
Example : The Full Framingham Cohort Data
• The Framingham Heart Study is a long-term, ongoing cardiovascular cohort study on
residents of the city of Framingham, Massachusetts. The study began in 1948 with 5,209
adult subjects from Framingham, and is now on its third generation of participants.
• Table 1 displays the means, standard deviations, medians, quartiles and interquartile ranges
for each of the continuous variables in the subsample of n=10 participants who attended the
seventh examination of the Framingham Offspring Study.
Table 1 - Summary Statistics on n=10 Participants
Outliers
(cont..)
• Table 2 displays the observed minimum and maximum values along with the limits to determine
outliers using the quartile rule for each of the variables in the subsample of n=10 participants.
• Are there outliers in any of the variables? Which statistics are most appropriate to summarize
the average or typical value and the dispersion?
Table 2 - Limits for Assessing Outliers in Characteristics Measured in the n=10 Participants

Since there are no suspected outliers in the subsample of n=10 participants, the mean and standard deviation are the
most appropriate statistics to summarize average values and dispersion, respectively, of each of these characteristics.
Outliers
(cont..)
• For clarity, we have so far used a very small subset of the Framingham Offspring Cohort to
illustrate calculations of summary statistics and determination of outliers. For your interest,
Table 3 displays the means, standard deviations, medians, quartiles and interquartile ranges
for each of the continuous variable displayed in Table 1 in the full sample (n=3,539) of
participants who attended the seventh examination of the Framingham Offspring Study.
Table 3-Summary Statistics on Sample of (n=3,539) Participants
Outliers
(cont..)
• Table 4 displays the observed minimum and maximum values along with the limits to determine
outliers using the quartile rule for each of the variables in the full sample (n=3,539).
Table 4 - Limits for Assessing Outliers in Characteristics Presented in Table 3

Are there outliers in any of the variables? Which statistics are most appropriate to summarize the average or typical values and the dispersion for each variable?
Outliers
(cont..)
Observations on
example......
• In the full sample, each of the characteristics has outliers on the upper end of the distribution as
the maximum values exceed the upper limits in each case. There are also outliers on the low end
for diastolic blood pressure and total cholesterol, since the minimums are below the lower limits.
• For some of these characteristics, the difference between the upper limit and the maximum (or
the lower limit and the minimum) is small (e.g., height, systolic and diastolic blood pressures),
while for others (e.g., total cholesterol, weight and body mass index) the difference is much
larger. This method for determining outliers is a popular one but not generally applied as a hard
and fast rule. In this application it would be reasonable to present means and standard
deviations for height, systolic and diastolic blood pressures and medians and interquartile ranges
for total cholesterol, weight and body mass index.
Outliers
(cont..)
Boxplot Analysis
Box plots are a simple way to visualize data through quantiles and detect outliers. A boxplot incorporates the five-number summary: Minimum, Q1, Median, Q3, Maximum.
• Data is represented with a box.
• The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR. The IQR (interquartile range) is the basic mathematics behind boxplots.
• The median is marked by a line within the box.
• Two lines (called whiskers) outside the box extend to the smallest (minimum) and largest (maximum) observations. The top and bottom whiskers can be understood as the boundaries of the data, and any data lying outside them will be an outlier.
• Outliers: points beyond a specified outlier threshold, plotted individually.
Statistical detection: Removing and modifying the outliers using statistical detection techniques is a widely followed method:
• Z-Score
• Density-based spatial clustering
• Regression analysis
• Proximity-based clustering
• IQR scores
Outliers
(cont..)
• Z-Score for Outlier Detection.
While data points are referred to as x in a normal distribution, they are called z or z scores in the z
distribution. A z score is a standard score that tells you how many standard deviations away from the mean
an individual value (x) lies:
• A positive z score means that your x value is greater than the mean.
• A negative z score means that your x value is less than the mean.
• A z score of zero means that your x value is equal to the mean.
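A sketch of z-score-based flagging on the small dataset from earlier in this section. Note that in a sample of n = 7 the largest possible |z| (using the sample standard deviation) is about 2.27, so the three-standard-deviation rule of thumb can never fire here; the |z| > 2 threshold below is an illustrative choice, not from the slides:

```python
import statistics

def z_scores(xs):
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)               # sample standard deviation
    return [(x - mean) / sd for x in xs]

data = [10, 15, 22, 330, 30, 45, 60]
zs = z_scores(data)
flagged = [x for x, z in zip(data, zs) if abs(z) > 2]
print(flagged)                              # [330]
```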
Outliers and Missing Values
Missing Values Detection and Handling.
• Related to Pre-processing.
• Consumes most of the time in Data Analytics.
• Handling missing values is one of the challenges of data analysis

• Reasons for Missing Values:
 Improper maintenance of past data.
 Observations are not recorded for certain fields due to some reasons.
 Failure in recording the values due to human error.
 The user has not provided the values intentionally.
 Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
 Information is not collected (e.g., people decline to give their age and weight).
• Handling missing values:
 Eliminate data objects or variables
 Estimate missing values
 Ignore the missing value during analysis
 Replace with all possible values (weighted by their probabilities)
Types of Missing
Values
Some definitions are based on representation: missing data is the lack of a recorded answer for a particular field.
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
Types of Missing
Values
Missing Completely at Random (MCAR)
• Missingness of a value is independent of attributes
• Fill in values based on the attribute
• Analysis may be unbiased overall
• The missingness on the variable is completely unsystematic.
Example when we take a random sample of a population, where each member has the same chance of being included
in the sample.

When we make this assumption, we are assuming that whether or not the person has missing data is completely unrelated to the other information in the data.
When data is missing completely at random, it means that we can undertake analyses using only
observations that have complete data (provided we have enough of such observations).
Types of Missing
Values
Missing at Random (MAR)
• Missingness is related to other variables
• Fill in values based other values
• Almost always produces a bias in the analysis
Example of MAR is when we take a sample from a population, where the probability to be included depends on some
known property.

A simple predictive model is that income can be predicted based on gender and age. Looking at the table, we note that our missing value is for a female aged 30 or more, and the observations say the other females aged 30 or more have a high income. As a result, we can predict that the missing value should be High.

There is a systematic relationship between the propensity of missing values and the observed data, but not the missing data itself. All that is required is a probabilistic relationship.
Types of Missing Values
(cont..)
Missing not at Random (MNAR) - Nonignorable

• Missingness is related to unobserved measurements.
• When the missing values on a variable are related to the values of that variable itself, even after controlling for other variables.
MNAR means that the probability of being missing varies for reasons that are unknown to us.

Data was obtained from 31 women, of whom 14 were located six months later. Of these, three had exited from
homelessness, so the estimated proportion to have exited homelessness is 3/14 = 21%. As there is no data for the 17
women who could not be contacted, it is possible that none, some, or all of these 17 may have exited from
homelessness. This means that potentially the proportion to have exited from homelessness in the sample is
between 3/31 = 10% and 20/31 = 65%. As a result, reporting 21% as being the correct result is misleading. In this
example the missing data is nonignorable.

Strategies to handle MNAR are to find more data about the causes for the missingness, or to perform
what-if analyses to see how sensitive the results are under various scenarios.
Types of Missing Values
(cont..)
Can Formalize these Definitions..
Let X represent a matrix of the data we “expect” to have; X = {Xo, Xm}, where Xo is the observed data and Xm the missing data. Let’s define R as a matrix with the same dimensions as X, where Ri,j = 1 if the datum is missing, and 0 otherwise.
1. MCAR: P(R | Xo, Xm) = P(R)
2. MAR: P(R | Xo, Xm) = P(R | Xo)
3. MNAR: No simplification.
Finding Relationships among Variables
This is an important first step in any exploratory data analysis.
• To look closely at variables one at a time, but it is almost never the last step.
• The primary interest is usually in relationships between variables.
For a variable such as baseball salary, the entire focus
was on how salaries were distributed over some range.
It is natural to ask what drives baseball salaries.
• Does it depend on qualitative factors, such as;
 Player’s team or position?
• Does it depend on quantitative factors, such as;
 Number of hits the player gets or the number of
strikeouts?
To answer these questions, you have to examine
relationships between various variables and salary.
Types of Relationships among Variables
a. Categorical vs Categorical
b. Categorical vs Numerical
c. Numerical vs Numerical
Relationships Among Categorical
Variables
(Categorical vs Categorical)
Consider a data set with at least two categorical variables, Smoking and Drinking.
The categories are: Smoking: Non Smoker (NS), Occasional Smoker (OS), Heavy Smoker (HS); Drinking: Non Drinker (ND), Occasional Drinker (OD), Heavy Drinker (HD).
Do the data indicate that smoking and drinking habits are related? For example,
• Do nondrinkers tend to be nonsmokers?
• Do heavy smokers tend to be heavy drinkers?
• The most meaningful way to describe a categorical variable is with counts, possibly expressed as percentages of totals, and corresponding charts of the counts.
• We can find the counts of the categories of either variable separately and, more importantly, the counts of the joint categories of the two variables, such as the count of all nondrinkers who are also nonsmokers.
It is customary to display all such counts in a table called a crosstabs (for crosstabulations). This is also sometimes called a contingency table.
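With pandas, the joint counts and the row/column percentage tables described above come straight from `pd.crosstab`; the survey rows below are hypothetical:

```python
import pandas as pd

# Hypothetical survey responses using the slide's category codes
df = pd.DataFrame({
    "Smoking":  ["NS", "NS", "NS", "OS", "OS", "HS", "HS", "HS"],
    "Drinking": ["ND", "ND", "OD", "OD", "OD", "OD", "HD", "HD"],
})

counts = pd.crosstab(df["Smoking"], df["Drinking"])                       # joint counts
row_pct = pd.crosstab(df["Smoking"], df["Drinking"], normalize="index")   # % of row
col_pct = pd.crosstab(df["Smoking"], df["Drinking"], normalize="columns") # % of column
print(counts)
```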
Relationships Among Categorical
Variables (cont..)
(Categorical vs Categorical)
Do the data indicate that smoking and drinking habits are related? For
example,
• Do nondrinkers tend to be nonsmokers?
• Do heavy smokers tend to be heavy drinkers?
The 1st two arguments are for the condition on smoking;
the 2nd two are for the condition on drinking.
 You can then sum across rows and down columns to get the totals.
 It is useful to express the counts as percentages of row in the middle table
and as percentages of column in the bottom table.
 The latter two tables indicate, in complementary ways, that there is
definitely a relationship between smoking and drinking.
These tables indicate that smoking and drinking habits tend to go with one
another. These tendencies are reinforced by the column charts of the two
percentage tables
Relationships Among Categorical & Numerical Variables
(Categorical vs Numerical)
It describes a very common situation where the goal is to break down a numerical
variable such as salary by a categorical variable such as gender.
• This general problem, typically referred to as the comparison problem, is one of the
most important problems in data analysis.
• It occurs whenever you want to compare a numerical measure across two or more
subpopulations. Here are some examples:
 The subpopulations are males and females, and the numerical measure is salary.
 The subpopulations are different regions of the country, and the numerical measure is the cost of living.
 The subpopulations are different days of the week, and the numerical measure is the number of customers
going to a particular fast-food chain.
 The subpopulations are different machines in a manufacturing plant, and the numerical measure is the number
of defective parts produced per day.
 The subpopulations are patients who have taken a new drug and those who have taken a placebo, and the
numerical measure is the recovery rate from a particular disease.
 The subpopulations are undergraduates with various majors (business, English, history, and so on), and the
numerical measure is the starting salary after graduating.
Relationships Among Categorical & Numerical
Variables (cont..)
(Categorical vs Numerical)

There are two possible data formats you will see: stacked and unstacked.
Data of baseball salaries
The data are stacked if there are two “long” variables, Gender and Salary, as indicated in the figure. Occasionally you will see data in unstacked format. (Note that both tables list exactly the same data.)
• Do pitchers (or any other positions) earn more than others?
• Does one league pay more than the other, or do any divisions pay more than others?
• How does the notoriously high Yankees payroll compare to the others?
Relationships Among Categorical & Numerical
Variables (cont..)
(Categorical vs Numerical)

This table lists each of the requested summary measures for each of the nine positions in the data set.
If you want to see salaries broken down by team or any other categorical variable, you can easily run this analysis again and choose a different categorical variable.
Relationships Among Categorical & Numerical
Variables (cont..)
(Categorical vs Numerical)
From these box plots, we can conclude the following:
 Pitchers make somewhat less than other players, although there are many outliers in each group.
 The Yankees payroll is indeed much larger than the payrolls for the rest of the teams. In fact, it is so large that its stars’ salaries aren’t even considered outliers relative to the rest of the team.
These side-by-side box plots are so easy to obtain, you can generate a lot of them to provide insights into the salary data.
Relationships Among Numerical & Numerical
Variables (cont..)
(Numerical vs Numerical)

• To study relationships among numerical variables, a new type of chart, called a scatterplot, and two new summary measures, correlation and covariance, are used.
• These measures can be applied to any variables that are displayed numerically.
• However, they are appropriate only for truly numerical variables, not for categorical variables that have been coded numerically.
Relationships Among Numerical & Numerical
Variables (cont..)
(Scatterplot)
• A scatterplot is a scatter of points, where each point denotes the values of
an observation for two selected variables.
• It is a graphical method for detecting relationships between two numerical
variables.
• The two variables are often labeled generically as X and Y, so a scatterplot is
sometimes called an X-Y chart.
• The purpose of a scatterplot is to make a relationship (or the lack of it)
apparent.

Data set includes an observation (Golf Stats) for each of the top 200 earners on the PGA Tour.
Relationships Among Numerical & Numerical
Variables (cont..)
(Scatterplot)
This example is typical in that there are many numerical
variables, and it is up to you to search for possible
relationships. A good first step is to ask some interesting
questions and then try to answer them with scatterplots.
For example,
 Do younger players play more events?
 Are earnings related to age?
 Which is related most strongly to earnings:
driving, putting, or greens in regulation?
 Do the answers to these questions remain the
same from year to year?
This example is all about exploring the data.
Relationships Among Numerical & Numerical
Variables (cont..)
(Scatterplot)
Relationships Among Numerical & Numerical
Variables
(Scatterplot)(cont..)
• Once you have a scatterplot, it enables you to superimpose one of several
trend lines on the scatterplot.
• A trend line is a line or curve that “fits” the scatter as well as possible.
• This could be a straight line, or it could be one of several types of curves.
Relationships Among Numerical & Numerical
Variables (cont..)
Correlation and Covariance
• Correlation and covariance measure the strength and direction of a linear
relationship between two numerical variables. (Bi-Variate Measures)
• The relationship is “strong” if the points in a scatterplot cluster tightly around some
straight line.
 If this straight line rises from left to right, the relationship is positive and the measures will be
positive numbers.
 If it falls from left to right, the relationship is negative and the measures will be negative
numbers.
• The two numerical variables must be “paired” variables.
 They must have the same number of observations, and the values for any observation should be
naturally paired.
Specifically, each measures the strength and direction of a linear relationship between two numerical variables.
Relationships Among Numerical & Numerical
Variables (cont..)
• Covariance is essentially an average of products of deviations from means.
• With this in mind, let Xi and Yi be the paired values for observation i, and let n be the number of observations. Then the covariance between X and Y, denoted by Covar(X, Y), is
Covar(X, Y) = Σ (Xi − mean(X)) (Yi − mean(Y)) / (n − 1)
• Covariance has a serious limitation as a descriptive measure because it is very sensitive to the units in which X and Y are measured.
For example, the covariance can be inflated by a factor of 1000 simply by measuring X in dollars rather than thousands of dollars. In contrast, the correlation, denoted by Correl(X, Y), remedies this problem.
Relationships Among Numerical & Numerical
Variables (cont..)
• Correlation is a unitless quantity that is unaffected by the measurement scale.
• For example, the correlation is the same regardless of whether the variables
are measured in dollars, thousands of dollars, or millions of dollars.

• The correlation is defined by the equation
Correl(X, Y) = Covar(X, Y) / (Stdev(X) × Stdev(Y))
where Stdev(X) and Stdev(Y) denote the standard deviations of X and Y, and Covar(X, Y) denotes the covariance of X and Y.
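A quick numerical check of both definitions and of the unit-invariance claim, using NumPy (the data values are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.0, 8.2, 10.1])        # roughly y = 2x

cov_xy = np.cov(x, y, ddof=1)[0, 1]             # sample covariance
corr_xy = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Same value from NumPy's built-in correlation matrix
assert np.isclose(corr_xy, np.corrcoef(x, y)[0, 1])

# Correlation is unitless: rescaling y by 1000 inflates the covariance
# by a factor of 1000 but leaves the correlation unchanged
assert np.isclose(np.corrcoef(x, 1000 * y)[0, 1], corr_xy)
print(round(corr_xy, 4))
```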
Relationships Among Numerical & Numerical
Variables
Correlation (cont..)
Pearson’s correlation coefficient formula:
r = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )
Both variables are quantitative and normally distributed with no outliers, so we calculate a Pearson’s r correlation coefficient.
• The closer r (the correlation) is to zero, the weaker the linear relationship.
• Positive r values indicate a positive correlation, where the values of both variables tend to increase together.
• Negative r values indicate a negative correlation, where the values of one variable tend to increase when the values of the other variable decrease.
Relationships Among Numerical & Numerical
Variables
Correlation (cont..)
The resulting table of correlations appears in Figure. You can ignore the 1.000 values along
the diagonal because a variable is always perfectly correlated with itself.

Finally, correlations (and covariances) are symmetric in that the correlation between any two variables X
and Y is the same as the correlation between Y and X.
Relationships Among Numerical & Numerical
Variables (cont..)
Correlation

• For example, the scatterplot corresponding to the 0.884 correlation between Cuts Made and Rounds appears in the figure. (We also superimposed a trend line.)
• This chart shows the strong linear relationship between cuts made and rounds played, but it also shows that there is still considerable variability around the best-fitting straight line, even with a correlation as large as 0.884.
Sampling and distributions
Estimators and estimates
• For the purpose of estimating a population parameter, we can use various sample statistics like the sample mean, sample median, sample variance, etc.; these are called estimators, and the actual values taken by the estimators are called estimates.
Point Estimate
• A single value of a statistic that is used to estimate the unknown population parameter is called a point estimate. For example, the sample mean, which we use for estimating the population mean, is a point estimator of the population mean. Similarly, the statistic s² is a point estimator of the population variance, where the value of s² is computed from a random sample.
Interval Estimate
• An interval estimate refers to the probable range within which the real value of a parameter is expected to lie. The two extreme limits of such a range are called a confidence interval. These are determined on the basis of sample studies of a population.
Properties of a good estimator
A good estimator is one which is as close to the true value of the parameter as possible. A good estimator possesses the following properties or characteristics:
• Unbiased estimator
• Consistent estimator
• Efficient estimator
• Sufficient estimator
Unbiased estimator
• An estimator is said to be an unbiased estimator of a population parameter if the mean of the sampling distribution of the estimator is equal to that parameter. In other words, an estimator is unbiased if the expected value of the estimator is equal to the parameter being estimated.
• The sample variance is an unbiased estimator of the population variance when
• E(sample variance) = population variance
• The sample mean is an unbiased estimator of the population mean when
• E(sample mean) = population mean
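These two expectations can be checked empirically. The sketch below is illustrative only: it assumes a N(50, 10) population, draws many samples, and averages the sample means and the (n − 1)-divisor sample variances:

```python
import random
import statistics

random.seed(42)
mu, sigma = 50, 10          # assumed population mean and SD (illustrative)
n, trials = 20, 5000

means, variances = [], []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(sample))
    # statistics.variance divides by n - 1, which makes it unbiased
    variances.append(statistics.variance(sample))

print(round(statistics.mean(means), 1))      # close to the true mean, 50
print(round(statistics.mean(variances), 1))  # close to the true variance, 100
```

Dividing by n instead of n − 1 would make the average of `variances` come out noticeably below 100, which is exactly the bias the (n − 1) divisor removes.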
Consistent estimator
• An estimator is said to be consistent if the estimator approaches the population parameter as the sample size increases. In other words, an estimator is a consistent estimator of the population parameter if the probability that the estimate approaches the population parameter tends to 1 as n becomes larger and larger.
• For example, the sample mean is consistent because:
• E(sample mean) → population mean as n → ∞
• var(sample mean) → 0 as n → ∞
Efficient estimator
• Efficiency is a relative term. The efficiency of an estimator is generally defined by comparing it with another estimator.
Let us take two unbiased estimators a and b. The estimator a is called a more efficient estimator of x if the variance of a is less than the variance of b, that is, var(a) < var(b).
Sufficient estimator
• The last property that an estimator should possess is sufficiency. An estimator is said to be a sufficient estimator of a parameter x if it contains all the information that the given sample can furnish about the population.
Confidence Interval Estimation
 The lower bound (in the example, 5%) is called a lower confidence limit and
the upper bound (in the example, 15%) is called an upper confidence limit.

 The bigger the sample size, the more narrow the confidence interval will be.
 How to determine the lower and upper confidence limit?
The confidence limits are built from the sample mean, the standard deviation, the sample size, and a z-score (a measure of how many standard deviations a value is below or above the population mean):

confidence limits = sample mean ± z × (standard deviation / √sample size)

 Z-scores for commonly used confidence intervals are as follows (refer to the Appendix for further details):
 50% → 0.674, 80% → 1.282, 90% → 1.645, 95% → 1.96, 98% → 2.326, 99% → 2.576
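These table values can be reproduced from the standard normal inverse CDF. A quick check using only the Python standard library:

```python
from statistics import NormalDist

# For a two-sided interval at confidence level c, the z-score is the
# (1 + c)/2 quantile of the standard normal distribution.
for conf in (0.50, 0.80, 0.90, 0.95, 0.98, 0.99):
    z = NormalDist().inv_cdf((1 + conf) / 2)
    print(f"{conf:.0%} -> z = {z:.3f}")
```

Running this prints the same values as the table above (0.674, 1.282, 1.645, 1.960, 2.326, 2.576).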
Interval estimation
Interval estimation for large samples: in large samples, interval estimation is further studied under 4 headings:
• Confidence interval or limits for the population mean
• Confidence interval or limits for the population proportion P
• Confidence interval or limits for the population standard deviation
• Determination of a proper sample size for estimating a population proportion and population mean
Confidence interval or limits for population mean
• The determination of the confidence interval or limits for the population mean in the case of a large sample, that is n > 30, requires the use of the normal distribution.
Example
• A random sample of 100 observations yields sample mean = 150 and sample variance = 400. Compute 95% and 99% confidence intervals for the population mean.
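A minimal sketch of the computation for this example, with z values 1.96 and 2.576 taken from the standard normal table:

```python
import math

n, xbar, variance = 100, 150, 400
se = math.sqrt(variance) / math.sqrt(n)   # sigma/sqrt(n) = 20/10 = 2.0

for conf, z in [(95, 1.96), (99, 2.576)]:
    print(f"{conf}% CI: ({xbar - z * se:.2f}, {xbar + z * se:.2f})")
```

This prints approximately (146.08, 153.92) for 95% and (144.85, 155.15) for 99%; note the 99% interval is wider, since higher confidence demands a wider range.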
Confidence interval or limits for population proportion P
• Though the sampling distribution associated with proportions is the binomial distribution, the normal distribution can be used as an approximation provided the sample is large, i.e., n > 30 and np ≥ 5. Here n is the size of the sample, p is the proportion of successes, and q = 1 − p.
Example
• Out of 1,200 tosses of a coin, 480 were heads and 720 were tails. Find the 95% confidence interval for the proportion of heads.
Example
• A random sample of 1,000 households in a city revealed that 500 of them had a car. Find 95% and 99% confidence limits for the proportion of households in the city with a car.
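Both proportion examples follow the same pattern, p ± z·√(pq/n). A sketch (the helper name `proportion_ci` is ours, not from the source):

```python
import math

def proportion_ci(successes, n, z):
    """Large-sample (normal-approximation) CI for a proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Coin example: 480 heads in 1,200 tosses, 95% confidence
print([round(x, 3) for x in proportion_ci(480, 1200, 1.96)])

# Household example: 500 of 1,000 households own a car, 95% and 99%
print([round(x, 3) for x in proportion_ci(500, 1000, 1.96)])
print([round(x, 3) for x in proportion_ci(500, 1000, 2.576)])
```

The coin interval comes out near (0.372, 0.428) and the household intervals near (0.469, 0.531) and (0.459, 0.541).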
Confidence interval or limits for population standard deviation
• The determination of the confidence interval or limits for the population S.D. in the case of a large sample requires the use of the normal distribution.
Example
• A random sample of 50 observations gave a standard deviation equal to 24.5. Construct a 95% confidence interval for the population standard deviation.
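For large n, the standard error of the sample standard deviation is approximately s/√(2n). A sketch of the computation for this example under that approximation:

```python
import math

n, s = 50, 24.5
se_s = s / math.sqrt(2 * n)                  # large-sample s.e. of the sample SD
lower, upper = s - 1.96 * se_s, s + 1.96 * se_s
print(round(lower, 2), round(upper, 2))      # roughly 19.7 and 29.3
```

Here se_s = 24.5/10 = 2.45, so the 95% interval is 24.5 ± 1.96 × 2.45.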
Determination of a proper sample size for estimating a population proportion and population mean
So far we have calculated the confidence intervals based on the assumption that the sample size n is known. In most practical situations, however, the sample size is not known in advance. The method of determining a proper sample size is studied under 2 headings:
• Sample size for estimating a population mean
• Sample size for estimating a population proportion
Sample size for estimating a population mean
In order to determine the sample size for estimating a population mean, the following 3 factors must be known:
• the desired confidence level and the corresponding value of Z.
• the permissible sampling error E.
• the standard deviation, or an estimate of the population S.D.
After the above-mentioned factors are known, the sample size n is given by n = (Zσ/E)².
Example
• A cigarette manufacturer wishes to use a random sample to estimate the average nicotine content. The sampling error should not be more than one milligram above or below the true mean, with a 99% confidence level. The population standard deviation is 4 milligrams. What sample size should the company use in order to satisfy these requirements?
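Plugging Z = 2.576 (99% confidence), σ = 4, and E = 1 into n = (Zσ/E)²:

```python
import math

z, sigma, e = 2.576, 4, 1
n = (z * sigma / e) ** 2
print(math.ceil(n))   # 107 -- round up so the error bound is guaranteed
```

The raw value is about 106.2; sample sizes are always rounded up, so the company needs 107 observations.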
Sample size for estimating a population proportion
In order to determine the sample size for estimating a population proportion, the following 3 factors must be known:
• the desired level of confidence and the corresponding value of Z.
• the permissible sampling error E.
• the actual or estimated true proportion of success P.
The size of sample n is given by n = Z²PQ/E², where Q = 1 − P.
• Z and E are predetermined; the value of the population proportion P is taken from past experience or a preliminary sample.
Example
• A firm wishes to determine, with a maximum allowable error of 0.05 and at a 99% level of confidence, the proportion of consumers who prefer its product. How large a sample will be required in order to make such an estimate if preliminary sales reports indicate that 25% of all consumers prefer the firm’s product?
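With Z = 2.576, P = 0.25 (so Q = 0.75), and E = 0.05, the formula n = Z²PQ/E² gives:

```python
import math

z, p, e = 2.576, 0.25, 0.05
n = (z ** 2) * p * (1 - p) / e ** 2
print(math.ceil(n))   # 498
```

The raw value is about 497.7, so a sample of 498 consumers is required.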
Interval estimation for small samples
The determination of confidence intervals in the case of small samples is studied under two headings:
• Confidence interval or limits for the population mean
• Confidence interval or limits for the population variance in the case of small samples
Confidence interval or limits for population mean
• When the sample size is small and the population S.D. is unknown, the desired confidence interval or limits for the population mean can be found by making use of the t-distribution.
Example
• A random sample of size 16 has 50 as its mean, with s = 3. Obtain 98% confidence limits for the mean of the population.
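A sketch of this computation, with the 98% critical value for df = 15 hard-coded from a t-table:

```python
import math

n, xbar, s = 16, 50, 3
t_crit = 2.602                    # t-table value, 98% confidence, df = 15
margin = t_crit * s / math.sqrt(n)
print(round(xbar - margin, 2), round(xbar + margin, 2))   # about 48.05 and 51.95
```

The margin is 2.602 × 3/4 ≈ 1.95, so the limits are roughly 48.05 and 51.95.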
Example
• A random sample of 16 items from a normal population showed a mean of 53, and the sum of squares of deviations from this mean is equal to 150. Obtain 95% and 99% confidence limits for the mean of the population.
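Here s² = 150/15 = 10, so s = √10. A sketch using the t-table values 2.131 (95%) and 2.947 (99%) for df = 15:

```python
import math

n, xbar, sum_sq_dev = 16, 53, 150
s = math.sqrt(sum_sq_dev / (n - 1))               # s^2 = 150/15 = 10

for conf, t_crit in [(95, 2.131), (99, 2.947)]:   # t-table, df = 15
    margin = t_crit * s / math.sqrt(n)
    print(f"{conf}%: ({xbar - margin:.2f}, {xbar + margin:.2f})")
```

The 95% limits come out near (51.32, 54.68) and the 99% limits near (50.67, 55.33).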
Confidence interval or limits for population variance in case of small samples
• The determination of the confidence interval or limits for the population variance requires the use of the chi-square distribution. Here chi-square values are used in place of t values.
Example
• A random sample of size 15 selected from a normal population has a standard deviation s = 2.5. Construct a 95% confidence interval for the S.D. and variance.
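The interval for the variance is ((n−1)s²/χ²_upper, (n−1)s²/χ²_lower); taking the square roots gives the interval for the S.D. A sketch with the chi-square table values for df = 14 hard-coded:

```python
import math

n, s = 15, 2.5
df = n - 1
chi2_upper, chi2_lower = 26.119, 5.629   # chi-square table, df = 14, 95% level

var_low = df * s**2 / chi2_upper
var_high = df * s**2 / chi2_lower
print(round(var_low, 2), round(var_high, 2))                        # variance CI
print(round(math.sqrt(var_low), 2), round(math.sqrt(var_high), 2))  # S.D. CI
```

The variance interval comes out near (3.35, 15.54) and the S.D. interval near (1.83, 3.94).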
A Sampling Distribution
• A distribution of a statistic made from multiple simple random samples drawn from a specific population.
A Sampling Distribution
Imagine you're trying to find the average height of people in a town. You take different random groups of people and calculate the average height for each group. Now, if you do this many times, you'll get a bunch of different averages.

The theoretical distribution of all these averages is called a sampling distribution. It shows all the possible average heights and how likely they are to happen in your town.

This helps us figure out how likely our calculated average from one specific group is to represent the true average height of the whole town. The sampling distribution helps us understand if our result is just by chance or if it's a reliable estimate for the entire population.
A Sampling Distribution
We are moving from descriptive statistics to inferential statistics.

Inferential statistics allow the researcher to come to conclusions about a population on the basis of descriptive statistics about a sample.
A Sampling Distribution
For example:

Your sample says that a candidate gets support from 47%.

Inferential statistics allow you to say that the candidate gets support from 47% of
the population with a margin of error of +/- 4%.

This means that the support in the population is likely somewhere between 43%
and 51%.
A Sampling Distribution
Margin of error is taken directly from a sampling distribution.

[Figure: a normal sampling distribution centered on your sample mean of 47%, with 95% of possible sample means falling between 43% and 51%.]
A Sampling Distribution
Let’s create a sampling distribution of means…
Take a sample of size 1,500 from the US. Record the mean income. Our census said the mean is
$30K.

$30K
A Sampling Distribution
Let’s repeat sampling of size 1,500 from the US many times, recording the mean income each time. Our census said the mean is $30K.

The sample means would stack up in a normal curve: a normal sampling distribution, centered on $30K.
A Sampling Distribution
Say that the standard deviation of this distribution is $10K.
Think back to the empirical rule: what are the odds you would get a sample mean that is more than $20K off?

Since $20K is two standard deviations from the center of the normal sampling distribution at $30K, only about 5% of sample means would fall that far out: 2.5% in each tail.
A Sampling Distribution
Social Scientists usually get only one chance to sample.

Our graphic display indicates that chances are good that the mean of our one
sample will not precisely represent the population’s mean. This is called sampling
error.

If we can determine the variability (standard deviation) of the sampling


distribution, we can make estimates of how far off our sample’s mean will be
from the population’s mean.
A Sampling Distribution
Knowing the likely variability of the sample means from repeated sampling gives us a context within which to judge how much we can trust the number we got from our sample.

For example, if the variability is low, we can trust our number more than if the variability is high.
A Sampling Distribution
Which sampling distribution has the lower variability or standard deviation?

[Figure: two sampling distributions, a (narrower) and b (wider), with Sa < Sb.]

The first sampling distribution above, a, has a lower standard error.

Now a definition!
The standard deviation of a normal sampling distribution is called the standard error.
A Sampling Distribution
Statisticians have found that the standard error of a sampling distribution is quite
directly affected by the number of cases in the sample(s), and the variability of
the population distribution.

Population Variability:
For example, Americans’ incomes are quite widely distributed, from $0 to Bill
Gates’.

Americans’ car values are less widely distributed, from about $50 to about $50K.

The standard error of the latter’s sampling distribution will be a lot less variable.
A Sampling Distribution
Population Variability:

[Figure: the population and sampling distributions of car values are much narrower than those of income.]

The standard error of income’s sampling distribution will be a lot higher than car price’s.
A Sampling Distribution
Standard error = population standard deviation / square root of sample size:

σ_Ȳ = σ/√n

If the population income were distributed with mean μ = $30K and standard deviation σ = $10K:

n = 2,500: σ_Ȳ = $10K/50 = $200
n = 25: σ_Ȳ = $10K/5 = $2,000

…the sampling distribution changes for varying sample sizes.
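The σ_Ȳ = σ/√n calculation above, sketched for both sample sizes:

```python
import math

sigma = 10_000                 # population SD of income, $10K
for n in (25, 2500):
    se = sigma / math.sqrt(n)  # standard error shrinks as n grows
    print(f"n = {n}: standard error = ${se:,.0f}")
```

A 100-fold increase in sample size shrinks the standard error by a factor of √100 = 10, from $2,000 to $200.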
A Sampling Distribution
So why are sampling distributions less variable when sample size is larger?

Example 1:
 Think about what kind of variability you would get if you collected income through repeated samples of size 1 each.
 Contrast that with the variability you would get if you collected income through repeated samples of size N – 1 (or 300 million minus
one) each.

Example 2:
 Think about drawing the population distribution and playing “darts” where the mean is the bull’s-eye. Record each one of your attempts.
 Contrast that with playing “darts” but doing it in rounds of 30 and recording the average of each round.
 What kind of variability will you see in the first versus the second way of recording your scores.

…Now, do you trust larger samples to be more accurate?


A Sampling Distribution
An Example:
A population’s car values are μ = $12K with σ = $4K.
Which sampling distribution is for sample size 625 and which is for 2500? What are their s.e.’s?

s.e. = $4K/√625 = $4K/25 = $160; s.e. = $4K/√2500 = $4K/50 = $80

For n = 625 (s.e. = $160), 95% of sample means fall between roughly $11,680 and $12,320; for n = 2,500 (s.e. = $80), between roughly $11,840 and $12,160.
A Sampling Distribution
A population’s car values are μ = $12K with σ = $4K.
Which sampling distribution is for sample size 625 and which is for 2500?

Which sample will be more precise? If you get a particularly bad sample, which sample size will help you be sure that you are closer to the true mean?
A Sampling Distribution
Some rules about the sampling distribution of the mean…
1. For a random sample of size n from a population having mean μ and standard deviation σ, the sampling distribution of Ȳ has mean μ and standard error σ_Ȳ = σ/√n.
2. The Central Limit Theorem says that for random sampling, as the sample size n grows, the sampling distribution of Ȳ approaches a normal distribution.
3. The sampling distribution will be normal no matter what the population distribution’s shape as long as n > 30.
4. If n < 30, the sampling distribution is likely normal only if the underlying population’s distribution is normal.
5. As n increases, the standard error (remember that this word means the standard deviation of the sampling distribution) gets smaller.
6. The precision provided by any given sample increases as sample size n increases.
A Sampling Distribution
So we know in advance of ever collecting a sample, that if sample size is sufficiently large:

• Repeated samples would pile up in a normal distribution

• The sample means will center on the true population mean

• The standard error will be a function of the population variability and sample size

• The larger the sample size, the more precise, or efficient, a particular sample is

• 95% of all sample means will fall between +/- 2 s.e. from the population mean
Probability Distributions
• A Note: Not all theoretical probability distributions are Normal. One example of many is the binomial
distribution.
• The binomial distribution gives the discrete probability distribution of obtaining exactly n successes out
of N trials where the result of each trial is true with known probability of success and false with the
inverse probability.
• The binomial distribution has a formula and changes shape with each probability of success and number
of trials.
[Figure: a binomial distribution, showing probabilities for 0–12 successes.]
• However, in this class the normal probability distribution is the most useful!
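The binomial formula mentioned above, P(k) = C(n, k)·pᵏ·(1−p)ⁿ⁻ᵏ, can be written directly with the standard library (the helper name `binom_pmf` is ours):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n trials, each succeeding with probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g. the probability of exactly 6 heads in 12 fair coin tosses
print(round(binom_pmf(6, 12, 0.5), 4))   # about 0.2256
```

Summing the pmf over k = 0…12 returns 1, as every probability distribution must.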
Sampling Distributions
There are 2 types of sampling distributions i.e.,
• Sampling Distribution of Mean – [Discussed earlier]
• T-Distribution
T-Distribution
• The t-distribution describes the standardized distances of sample means to the
population mean when the population standard deviation is not known, and the
observations come from a normally distributed population.
• The t-distribution is similar to the normal distribution but flatter and shorter, i.e., it is a symmetrical, bell-shaped distribution, similar to the standard normal curve.
• The height of the t-distribution depends on the degrees of freedom (df), which is the maximum number of logically independent values (values that have the freedom to vary) in the sample.
Degree of freedom
The easiest way to understand degrees of freedom conceptually is through several examples.
• Consider a data sample consisting of five positive integers. The values of the five integers
must have an average of six. If four of the items within the data set are {3, 8, 5, and 4}, the fifth
number must be 10. Because the first four numbers can be chosen at random, the degrees of
freedom is four.
• Consider a data sample consisting of one integer. That integer must be odd. Because there
are constraints on the single item within the data set, the degrees of freedom is zero.
• The formula to determine degrees of freedom is df = N – 1 where N is sample size.
• For example, imagine a task of selecting 10 baseball players whose batting averages must average to .250. The total number of players that make up our data set is the sample size, so N = 10. In this example, 9 (10 − 1) baseball players can theoretically be picked at random, with the 10th baseball player having to have a specific batting average to adhere to the .250 batting-average constraint.
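The first example above can be checked directly: once the mean is fixed, the last value has no freedom to vary. A tiny illustrative snippet:

```python
values, target_mean = [3, 8, 5, 4], 6

# With five values constrained to average 6, the fifth value is determined:
fifth = target_mean * 5 - sum(values)
print(fifth)   # 10 -> only the first four were free to vary, so df = 4
```

Any other choice for the first four values would likewise pin down the fifth, which is why df = N − 1 here.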
T-Distribution cont…
• As the df increases, the t-distribution will get closer and closer to matching the standard
normal distribution.
• The values of the t-statistic is : t = [ x̄ - μ ] / [ s / √ n ] where,
• t = t score,
• x̄ = sample mean,
• μ = population mean,
• s = standard deviation of the sample,
• n = sample size
Note: A t-score is equivalent to the number of standard deviations away from the mean of the t-
distribution.
• A law school claims its graduates earn an average of $300 per hour. A sample of 15 graduates is selected and found to have a mean salary of $280 with a sample standard deviation of $50. Assuming the school’s claim is true, what is the t-score?
Solution: t = (280 − 300) / (50/√15) = −20 / 12.909945 = −1.549.
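The t-score formula from above, applied to the law-school example (the helper name `t_score` is ours):

```python
import math

def t_score(xbar, mu, s, n):
    """t = (sample mean - population mean) / (s / sqrt(n))."""
    return (xbar - mu) / (s / math.sqrt(n))

print(round(t_score(280, 300, 50, 15), 3))   # about -1.549
```

The sample mean sits roughly 1.55 standard errors below the claimed population mean.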
T-Distribution cont…
Student’s t-distribution is used when
• The sample size is 30 or less.
• The population standard deviation (σ) is unknown.
• The population distribution is unimodal and approximately normal (not markedly skewed).
Note:
The t-score represents the number of standard errors by which the
sample mean differs from the population mean. For example, if a t-score
is 2.5, the sample mean is 2.5 standard errors above the population
mean. If a t-score is −2.5, the sample mean is 2.5 standard errors below
the population mean.
Inferential Statistics
• Statistics can be classified into two different categories i.e., descriptive statistics and
inferential statistics.
• The descriptive statistics summarizes the features of the dataset, whereas inferential statistics
help to make conclusion from the data.
• Inferential statistics is the process of using a sample to infer the properties of a population
and allows to generalize the population.
• In general, inference means “guess”, which means making inference about something. So,
statistical inference means, making inference about the population.
• Let’s look at a real flu vaccine study for an example of making a statistical inference. The
scientists for this study want to evaluate whether a flu vaccine effectively reduces flu cases
in the general population. However, the general population is much too large to include in their
study, so they must use a representative sample to make a statistical inference about the
vaccine’s effectiveness.
• Hypothesis testing is one type of inferential statistics.
Hypothesis
• A hypothesis is defined as a formal statement, which gives the explanation about the
relationship between the two or more variables of the specified population i.e., it
includes components like variables, population and the relation between the
variables.
• Hypothesis example:
• Two variables - if you eat more vegetables, you will lose weight faster. Here,
eating more vegetables is an independent variable, while losing weight is the
dependent variable.
• Two or more dependent variables and two or more independent variables -
Eating more vegetables and fruits leads to weight loss, glowing skin, and
reduces the risk of many diseases such as heart disease.
• Consumption of sugary drinks every day leads to obesity
• If a person gets 7 hours of sleep, then he will feel less fatigue than if he sleeps less.
Hypothesis Testing
• In today’s data-driven world, decisions are based on data all the time. Hypothesis plays
a crucial role in that process, whether it may be making business decisions, in the health
sector, academia, or in quality improvement. Without hypothesis & hypothesis tests, you risk
drawing the wrong conclusions and making bad decisions.
• Hypothesis testing is a type of statistical analysis in which assumptions are put about a
population parameter to the test. It is used to estimate the relationship between
variables.
• Examples:
• A faculty assumes that 60% of his students come from higher-middle-class families.
• A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.
• It involves setting up a null hypothesis and an alternative hypothesis. These two hypotheses
will always be mutually exclusive. This means that if the null hypothesis is true then the
alternative hypothesis is false and vice versa.
Null Hypothesis and Alternate
Hypothesis
• The null hypothesis is the statement that there is no effect in the population. A null hypothesis has no bearing on the study's outcome unless it is rejected.
• Example:
• Smokers are no more susceptible to heart disease than nonsmokers.
• The new drug has a cure rate no higher than other drugs on the market.
• H0 is the symbol for it, and it is pronounced H-naught.
• Hypothesis testing is used to conclude if the null hypothesis can be rejected or
not. Suppose an experiment is conducted to check if girls are shorter than
boys at the age of 5. The null hypothesis will say that they are the same height.
• The alternative hypothesis is the hypothesis that we are trying to prove (there is an effect in the population), and it is accepted if we have sufficient evidence to reject the null hypothesis.
• It indicates that there is a statistical significance between two
possible outcomes and can be denoted as Ha.
• For the above-mentioned example, the alternative hypothesis would
be that “girls are shorter than boys at the age of 5”.
• The null hypothesis is usually the current thinking, or status quo.
The alternative hypothesis is usually the hypothesis to be proved.
The burden of proof is on the alternative hypothesis.
Null Hypothesis and Alternate
Hypothesis cont…
• A sanitizer manufacturer claims that its product kills 95 percent
of germs on average. To put this company's claim to the test,
create a null and alternate hypothesis.
• H0 (Null Hypothesis): Average = 95%.
• Alternative Hypothesis (Ha): The average is less than 95%.
Research question: Does tooth flossing affect the number of cavities?
H0: Tooth flossing has no effect on the number of cavities.
Ha: Tooth flossing has an effect on the number of cavities.

Research question: Does the amount of text highlighted in the textbook affect exam scores?
H0: The amount of text highlighted in the textbook has no effect on exam scores.
Ha: The amount of text highlighted in the textbook has an effect on exam scores.

Research question: Does daily meditation…
Null Hypothesis and Alternate
Hypothesis cont…
• How to write null and alternate hypothesis - The only thing to know are the
dependent (DV) variables and independent variables (IV). To write null
hypothesis, and alternative hypothesis, fill in the following sentences with variables
i.e., does independent variable affect dependent variable?
• Null hypothesis (H0): IV does not affect DV.
• Alternative Hypothesis (Ha): IV affects DV.
• Characteristics of a Hypothesis
• It has to be clear and accurate in order to look reliable.
• It has to be specific.
• There should be scope for further investigation and experiments.
• It should be explained in simple language while retaining its significance.
• IVs and DVs must be included with the relationship between them.
Tails of distributions
• The tails of a distribution are the appendages on the sides of a distribution. Although the idea applies to any set of data, it makes more sense when the data are graphed, because the tails become easily visible.

The lower tail contains the lower values in a distribution; the upper tail contains the upper values.
Hypothesis Testing cont…
• The purpose of statistical inference is to draw conclusions about a
population on the basis of data obtained from a sample of that population.
• Hypothesis testing is the process used to evaluate the strength of
evidence from the sample and provides a framework for making
determinations related to the population, i.e, it provides a method for
understanding how reliably one can extrapolate observed findings in a
sample under study to the larger population from which the sample was
drawn.
• The investigator formulates a specific hypothesis, evaluates data from the
sample, and uses these data to decide whether they support the
specific hypothesis.
• The first step in testing hypotheses is the transformation of the
research question into a null hypothesis, and an
alternative hypothesis. Subsequently, the hypothesis testing.
• In hypothesis testing, a one-tailed test and a two-tailed test are
alternative ways of computing the statistical significance of a
parameter inferred from a data set, in terms of a test statistic.
One-Tailed Hypothesis Testing
• A one-tailed test is based on a unidirectional hypothesis where the area of rejection is on only one side of the sampling distribution.
• It determines whether a particular population parameter is larger or smaller than the predefined parameter. It uses one single critical value to test the data.
• Example: the effect of students’ participation in a coding competition on their fear level, with H0: participation in a coding competition has no significant effect on students’ fear level. The main intention is to check for a decreased fear level when students participate in a coding competition.
Two-Tailed Hypothesis Testing
• A two-tailed test is also called a non-directional hypothesis. For checking whether the sample is greater or less than a range of values, the two-tailed test is used. It is used for null hypothesis testing.
• Example: the effect of a new bill on the loans of farmers, with H0: there is no significant effect of the new bill on the loans of farmers. The main intention is to check whether the new bill can affect the loans of farmers in either direction, i.e., increase or decrease them.
Types of Error
• Regardless of whether the investigator decides to accept or reject the null hypothesis, it might be the wrong decision.
• The investigator might incorrectly reject the null hypothesis when it is true, and might incorrectly accept the null hypothesis when it is false.
• In the tradition of hypothesis testing, these two types of errors have acquired names: type I and type II errors.
• In general, a type I error occurs when one incorrectly rejects a null hypothesis that is true. On the other hand, a type II error occurs when one incorrectly accepts a null hypothesis that is false.

Decision vs. truth:
• Reject H0 when H0 is true: Type I error; when Ha is true: no error.
• Do not reject H0 when H0 is true: no error; when Ha is true: Type II error.
Rejection Region
• The question, then, is how strong the evidence in favor of the
alternative hypothesis must be to reject the null hypothesis.
• This is done by means of a p-value. The p-value is the probability of
seeing a random sample at least as extreme as the observed sample,
given that the null hypothesis is true. The smaller the p-value, the
more evidence there is in favor of the alternative hypothesis.
• The p-values are expressed as decimals and can be converted
into percentage. For example, a p-value of 0.0237 is 2.37%, which
means there's a 2.37% chance of the results being random or having
happened by chance.
• In the hypothesis test, if the value is:
• A small p value (<=0.05), reject the null hypothesis.
• A large p value (>0.05), do not reject the null hypothesis
• The p-values are usually calculated using p-value tables, or
calculated automatically using statistical software like R, SPSS,
Python etc.
• Note: Another way to decide the rejection region is with a critical value of the test statistic: a z critical value when the sample size is greater than 30, or a t critical value for smaller samples.
Hypothesis Testing Example
• An investor says that the performance of their investment portfolio is
equivalent to that of the Standard & Poor’s (S&P) 500 Index. The person
performs a two-tailed test to determine this.
• The null hypothesis here says that the portfolio’s returns are equivalent to the
returns of S&P 500, while the alternative hypothesis says that the returns of
the portfolio and the returns of the S&P 500 are not equivalent.
• The p-value hypothesis test gives a measure of how much evidence is present
to reject the null hypothesis. The smaller the p value, the higher the evidence
against null hypothesis.
• Therefore, if the investor gets a P value of .001, it indicates strong evidence against
null hypothesis. So he confidently deduces that the portfolio’s returns and the
S&P 500’s returns are not equivalent.
Hypothesis Testing Numerical
• Problem Statement: A telecom service provider claims that individual customers pay on average Rs. 400 per month, with a standard deviation of Rs. 25. A random sample of 50 customers’ bills during a given month is taken, with a mean of 250 and a standard deviation of 15. What can we say with respect to the claim made by the service provider?
• Solution:
H0 (Null Hypothesis): μ = 400
H1 (Alternate Hypothesis): μ ≠ 400 (not equal means either μ > 400 or μ < 400, hence it will be validated with a two-tailed test)
σ = 25 (population standard deviation)
LoS (α) = 5% (take 5% if not given in the question)
n = 50 (sample size)
x̄ = 250 (sample mean)
s = 15 (sample standard deviation)
n ≥ 30, hence we will go with the z-test.

Step 1:
Calculate z using the z-test formula as below:

z = (x̄ - μ) / (σ/√n)

z = (250 - 400) / (25/√50)
z = -42.43

Step 2:
Get the z critical values from the z-table for α = 5%:
z critical values = (-1.96, +1.96)
For the claim to hold, the calculated z should lie between them:
-1.96 < z < +1.96

But the calculated z (-42.43) < -1.96, which means we reject the null hypothesis; the data do not support the service provider's claim.
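The arithmetic above can be checked with a few lines of Python (variable names are ours, not from the slide):

```python
import math

# Values from the problem statement
mu = 400      # claimed population mean (rs.)
sigma = 25    # population standard deviation (rs.)
n = 50        # sample size
xbar = 250    # sample mean (rs.)

# z-test statistic: z = (x̄ - μ) / (σ / √n)
z = (xbar - mu) / (sigma / math.sqrt(n))

# Two-tailed test at α = 5%: reject H0 when |z| > 1.96
reject_h0 = abs(z) > 1.96
```

z comes out around -42.4, far outside (-1.96, +1.96), so the null hypothesis is rejected.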
Chi-square test for independence
• A chi-square test of independence tests whether two categorical variables are related to each other or not.
• Example 1: we have a list of movie genres; this is the first variable. The second
variable is whether or not the patrons of those genres bought snacks at the theater.
The idea (or null hypothesis) is that the type of movie and whether or not people
bought snacks are unrelated. The owner of the movie theater wants to estimate how
many snacks to buy. If movie type and snack purchases are unrelated, estimating will be
simpler than if the movie types impact snack sales.
• Example 2: a veterinary clinic has a list of dog breeds they see as patients. The second
variable is whether owners feed dry food, canned food or a mixture. The idea (or
null hypothesis) is that the dog breed and types of food are unrelated. If this is true,
then the clinic can order food based only on the total number of dogs, without
consideration for the breeds.
Chi-square Test for Independence
Example
• Let’s take a closer look at the movie snacks example. Suppose we collect data for 600 people at our
theater. For each person, we know the type of movie they saw and whether or not they bought snacks.
• For a valid Chi-square test, the following conditions must be satisfied:
• Data values that are a simple random sample from the population of interest.
• Two categorical or nominal variables.
• For each combination of the levels of the two variables, we need at least five expected values. When
we have fewer than five for any one combination, the test results are not reliable. To confirm this, we
need to know the total counts for each type of movie and the total counts for whether snacks were
bought or not. For now, we assume we meet this requirement and will check it later.
Statistical details
 The null hypothesis is that the type of movie and snack purchases are independent. It is written as: H0: Movie Type and Snack purchases are independent.
 The alternative hypothesis is the opposite, i.e., Ha: Movie Type and Snack purchases are not independent.
Chi-square Test for Independence
Example cont…
• The data summarized in a contingency table is as follows:
Type of movie    Snacks    No snacks
Action             50         75
Comedy            125        175
Family             90         30
Horror             45         10

• Before we go any further, let's check the assumption of five expected values in each category. The data has more than five counts in each combination of Movie Type and Snacks.
• To find expected counts for each Movie-Snack combination, we
first need the row and column totals, which are shown below:
Type of movie    Snacks                 No snacks              Row Totals
Action           50                     75                     50 + 75 = 125
Comedy           125                    175                    125 + 175 = 300
Family           90                     30                     90 + 30 = 120
Horror           45                     10                     45 + 10 = 55
Column Totals    50+125+90+45 = 310     75+175+30+10 = 290     Grand Total = 600
Chi-square Test for Independence
Example cont…
• The expected counts for each Movie-Snack combination are based
on the row and column totals. We multiply the row total by the
column total and then divide by the grand total. This gives us the
expected count for each cell in the table.
• For example, for the Action-Snacks cell: (125 * 310) / 600 = 64.58, rounded to 65 in the table below. If there were no relationship between movie type and snack purchasing, we would expect about 65 people to have watched an action film and bought snacks.
• For the Action-No Snacks cell: (125 * 290) / 600 = 60.42, rounded to 60. The other cells are computed similarly.
• The expected count appears in bold beneath the actual count.
Type of movie    Snacks                 No snacks              Row Totals
Action           50                     75                     125
                 125*310/600 ≈ 65       125*290/600 ≈ 60
Comedy           125                    175                    300
                 300*310/600 = 155      300*290/600 = 145
Family           90                     30                     120
                 120*310/600 = 62       120*290/600 = 58
Horror           45                     10                     55
                 55*310/600 ≈ 28        55*290/600 ≈ 27
Column Totals    310                    290                    Grand Total = 600
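A minimal Python sketch of the row-total × column-total / grand-total rule (kept unrounded here, so Action-Snacks comes out as 64.58 rather than the rounded 65 shown in the table):

```python
# Observed counts: rows = Action, Comedy, Family, Horror; columns = Snacks, No snacks
observed = [[50, 75], [125, 175], [90, 30], [45, 10]]

row_totals = [sum(row) for row in observed]        # [125, 300, 120, 55]
col_totals = [sum(col) for col in zip(*observed)]  # [310, 290]
grand_total = sum(row_totals)                      # 600

# Expected count for each cell = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
```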
Chi-square Test for Independence
Example cont…
• All of the expected counts for our data are larger than five, so
we meet the requirement for applying the independence test.
• If we look at each of the cells, we can see that some expected
counts are close to the actual counts but most are not.
• If there is no relationship between the movie type and snack
purchases, the actual and expected counts will be similar. If
there is a relationship, the actual and expected counts will be
different.
Performing the Chi-square Test
• The basic idea in calculating the test statistic is to compare actual and
expected values, given the row and column totals that we have in the data.
• First, we calculate the difference between the actual and expected counts for each Movie-Snacks combination.
• Next, we square that difference. Squaring gives the same
importance to combinations with fewer actual values than expected and
combinations with more actual values than expected.
• Next, we divide by the expected value for the combination. We add up
these values for each Movie-Snacks combination. This gives the test
statistic.
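The procedure above can be sketched in Python. Because this version keeps the expected counts unrounded, the statistic comes out near 65.0; hand arithmetic that rounds each expected count first will differ slightly.

```python
# Observed counts: rows = Action, Comedy, Family, Horror; columns = Snacks, No snacks
observed = [[50, 75], [125, 175], [90, 30], [45, 10]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (obs - exp) ** 2 / exp
```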
Chi-square Test for Independence
Example cont…
Type of movie    Snacks                              No snacks                           Row Totals
Action           Actual: 50                          Actual: 75                          125
                 Expected: 65                        Expected: 60
                 Difference: 50 - 65 = -15           Difference: 75 - 60 = 15
                 Squared Difference = 225            Squared Difference = 225
                 Divide by Expected: 225/65 = 3.46   Divide by Expected: 225/60 = 3.75
Comedy           Actual: 125                         Actual: 175                         300
                 Expected: 155                       Expected: 145
                 Difference: 125 - 155 = -30         Difference: 175 - 145 = 30
                 Squared Difference = 900            Squared Difference = 900
                 Divide by Expected: 900/155 = 5.81  Divide by Expected: 900/145 = 6.21
Family           Actual: 90                          Actual: 30                          120
                 Expected: 62                        Expected: 58
                 Difference: 90 - 62 = 28            Difference: 30 - 58 = -28
                 Squared Difference = 784            Squared Difference = 784
                 Divide by Expected: 784/62 = 12.65  Divide by Expected: 784/58 = 13.52
Horror           Actual: 45                          Actual: 10                          55
                 Expected: 28                        Expected: 27
                 Difference: 45 - 28 = 17            Difference: 10 - 27 = -17
                 Squared Difference = 289            Squared Difference = 289
                 Divide by Expected: 289/28 = 10.32  Divide by Expected: 289/27 = 10.70
Column Totals    310                                 290                                 Grand Total = 600
Chi-square Test for Independence
Example cont…
• Lastly, to get our test statistic, we add the numbers in the final row for each cell: 3.46 + 3.75 + 5.81 + 6.21 + 12.65 + 13.52 + 10.32 + 10.70 = 66.42
• Now, we need to find the critical value from the Chi-square distribution based on the degrees of freedom and significance level. This is the cut-off: if the test statistic exceeds it, we reject the hypothesis that the two variables are independent.
• The degrees of freedom depend on how many rows and how many
columns we have. The degrees of freedom (df) are calculated as
df=(r−1)×(c−1) where r is the number of rows, and c is the number of
columns in the contingency table. From the example, r is 4 and c is 2.
Hence, df = (4−1)×(2−1)=3×1=3.
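Assuming SciPy is available, the critical-value lookup can be reproduced from the chi-square quantile function rather than the printed table:

```python
from scipy.stats import chi2

# Upper 5% critical value of the chi-square distribution with 3 degrees of freedom
critical_value = chi2.ppf(1 - 0.05, df=3)
# critical_value is approximately 7.815, matching the table lookup
```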
• The Chi-square critical value with α = 0.05 (given, and representing the probability of rejecting the null hypothesis when it is true) and three degrees of freedom is 7.815. Note: this value of 7.815 is read from the Chi-square distribution table; refer to the Appendix for further details.
• We compare the value of our test statistic (66.42) to the Chi-square critical value. Since 66.42 > 7.815, we reject the idea that movie type and snack purchases are independent.
• Therefore, we conclude that there is some relationship between
movie type and snack purchases.
• Consequently, the owner of the movie theater cannot estimate how many snacks to buy without regard to the type of movies being shown. Instead, the owner must take the type of movies into account when estimating snack purchases.
• It's important to note that we cannot conclude that the type of
movie causes a snack purchase. The independence test tells us only
whether there is a relationship or not; it does not tell that one
variable causes the other.
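If SciPy is available, the whole test can be run in one call; `scipy.stats.chi2_contingency` uses unrounded expected counts, so its statistic is about 65.0, and it confirms the decision reached above:

```python
from scipy.stats import chi2_contingency

# Observed counts: rows = Action, Comedy, Family, Horror; columns = Snacks, No snacks
observed = [[50, 75], [125, 175], [90, 30], [45, 10]]

# Returns the statistic, the p-value, the degrees of freedom,
# and the table of expected counts
stat, p_value, dof, expected = chi2_contingency(observed)

# dof = (4 - 1) * (2 - 1) = 3, and p_value is far below 0.05,
# so the null hypothesis of independence is rejected
```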
Example-2
H0: no association between foot and hand length (independent)
H1: there is an association
Appendix
Z-values for confidence interval
Confidence Level    Z value
0.50                0.674
0.70                1.04
0.75                1.15
0.80                1.28
0.85                1.44
0.90                1.64
0.92                1.75
0.95                1.96
0.96                2.05
0.98                2.33
0.99                2.58
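Assuming SciPy is available, these table entries can be reproduced from the standard normal quantile function:

```python
from scipy.stats import norm

def z_for_confidence(level):
    """Two-sided critical z value for a confidence level, e.g. 0.95 -> 1.96."""
    return norm.ppf((1 + level) / 2)
```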
Appendix
Chi-square Distribution Table