Relationship Analysis
Hypothesis Testing Strategies
Parametric Tests: Applications
Parametric tests usually assume certain properties of the population from which we draw samples, e.g.:
• Observations come from a normal population
Hypothesis Testing: Non-Parametric Tests
Non-parametric tests:
o Do not rely on assumptions about the population distribution
o Require only nominal or ordinal data
Note: Non-parametric tests need the entire population (or a very large sample size).
Relationship Analysis
Example: Wage Data
A large dataset of wages for a group of employees from the eastern region of India is given.
Relationship Analysis
Example: Wage Data
Employee's age and wage: how do wages vary with age?
Interpretation: On average, wage increases with age until about 60 years of age, at which point it begins to decline.
Relationship Analysis
Example: Wage Data
How do wages vary with time?
Relationship Analysis
Example: Wage Data
Wage and calendar year: how do wages vary across years?
Interpretation: There is a slow but steady increase in the average wage between 2010 and 2016.
Relationship Analysis
Example: Wage Data
Are wages related to education?
Relationship Analysis
Example: Wage Data
Wage and education level: do wages vary with employees' education levels?
Relationship Analysis
Given an employee's wage, can we predict their age?
Does wage have any association with both year and education level?
And so on.
An Open Challenge!
Suppose there are countably infinite points in the . We need a huge memory to store all
such points.
Is there any way out to store this information with a least amount of memory?
Say, with two values only.
Yahoo!
y = ax + b
Note: Here, the trick was to find a relationship among all the points; the two values a and b suffice.
Measures of Relationship
Univariate population: a population consisting of only one variable.
Measures of Relationship
Multivariate population: if the data happen to be on more than two variables.
[Figure: a 3-D plot with axes Volume, Temperature, and Pressure]
Measures of Relationship
In the case of bivariate and multivariate populations, we usually have to answer two types of questions:
Q1: Does there exist a correlation (i.e., association) between two (or more) variables?
If yes, of what degree?
Q2: Is there any cause-and-effect relationship between the two variables (in the case of a bivariate population), or between one variable on one side and two or more variables on the other side (in the case of a multivariate population)?
If yes, of what degree and in which direction?
Correlation Analysis
Correlation Analysis
In statistics, the word correlation is used to denote some form of
association between two variables.
Example: Weight is correlated with height
Example:
[Figure: scatter plot with values from 10 to 100 on the y-axis versus hours of study (1 to 7) on the x-axis]
Correlation Analysis
Do you find any correlation between X and Y as shown in the table?
[Table: # CD versus # Cigarette; values elided]
Note:
In data analytics, correlation analysis makes sense only when the relationship makes sense: there should be a cause-and-effect relationship.
Correlation Analysis
[Figure: three scatter plots illustrating positive correlation, negative correlation, and zero correlation]
Correlation Coefficient
The correlation coefficient is used to measure the degree of association.
It is usually denoted by r.
Correlation Coefficient
[Figure: four scatter plots, two labelled "High Positive Correlation" and "Low Positive Correlation"]
Correlation Coefficient
[Figure: scatter plots annotated with correlation coefficients R = +0.60, R = +0.80, R = +0.80, and R = +0.40]
Measuring Correlation Coefficients
There are three methods known to measure the correlation coefficients
Pearson’s Correlation Coefficient
Karl Pearson's Correlation Coefficient
This is also called Pearson's product-moment correlation. For n pairs $(x_i, y_i)$, it is defined as

$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \; \sum_{i=1}^{n}(y_i-\bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the means of x and y, respectively.
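As a quick illustration, here is a minimal sketch of computing Pearson's r directly from the formula above, in plain Python; the sample data are made up for demonstration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data for demonstration:
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))  # ~0.77
```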
Karl Pearson’s coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
A small study is conducted involving 17 infants to investigate the association between
gestational age at birth, measured in weeks, and birth weight, measured in grams.
Karl Pearson’s coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
We wish to estimate the association between gestational age and infant birth weight.
In this example, birth weight is the dependent variable and gestational age is the
independent variable. Thus Y = birth weight and X = gestational age.
The data are displayed in a scatter diagram in the figure below.
Karl Pearson's coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
For the given data, it can be shown that

$$ r = 0.82 $$
Karl Pearson's coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
Significance Test
To test whether the association is merely apparent and might have arisen by chance, use the t-test with the following calculation:

$$ t = r\sqrt{\frac{n-2}{1-r^2}} $$

The number of pairs of observations is 17. Hence,

$$ t = 0.82\sqrt{\frac{17-2}{1-0.82^2}} \approx 5.55 $$

Consulting the t-table at 15 degrees of freedom for $\alpha = 0.05$, we find t = 1.753. Since the computed value far exceeds this, the Pearson's correlation coefficient in this case may be regarded as highly significant.
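The significance calculation above can be checked with a few lines of Python; the values r = 0.82 and n = 17 are the ones given in the example.

```python
from math import sqrt

r, n = 0.82, 17
t = r * sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))  # ~5.55, well above t = 1.753 at 15 degrees of freedom
```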
Rank Correlation Coefficient
Charles Spearman's Correlation Coefficient
This correlation measurement is also called rank correlation. It is computed on the ranks assigned to the observations rather than on their raw values:

$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} $$

where $d_i$ is the difference between the ranks of the i-th pair and n is the number of pairs.
Example:
[Table: observations with their assigned ranks]
Charles Spearman's Coefficient of Correlation
Example 7.2: The hypothesis is that the depth of a river does not progressively increase with the width of the river.
A sample of size 10 is collected to test the hypothesis using Spearman's correlation coefficient.
Charles Spearman's Coefficient of Correlation
Step 1: Assign a rank to each observation. It is customary to assign rank 1 to the largest value, rank 2 to the next largest, and so on.
Note: If two or more observations have the same value, the mean rank should be used.
Charles Spearman's Coefficient of Correlation
Step 2: The table of ranks and their differences will look like the following, from which

$$ r_s = 0.9757 $$
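For reference, here is a sketch of the Spearman computation with mean ranks for ties, following Steps 1 and 2; the width/depth values below are hypothetical stand-ins, since the lecture's sample data are not reproduced here.

```python
def ranks(values):
    # Rank 1 goes to the largest value; tied values share the mean rank.
    order = sorted(values, reverse=True)
    return [sum(i + 1 for i, v in enumerate(order) if v == x) / order.count(x)
            for x in values]

def spearman_rs(xs, ys):
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical river measurements (width, depth), sample size 10:
width = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
depth = [1.0, 1.2, 1.1, 1.4, 1.6, 1.9, 2.0, 2.2, 2.5, 2.4]
print(round(spearman_rs(width, depth), 4))  # close to +1: strong rank agreement
```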
Charles Spearman's Coefficient of Correlation
Step 3: To see whether this value is significant, Spearman's rank significance table (or graph) must be consulted.
[Graph: Spearman's rank correlation coefficient (0.1 to 1.0 on the y-axis) versus sample size (2 to 10 on the x-axis), with significance curves at 0.1%, 1%, and 5%]
Charles Spearman's Coefficient of Correlation
Step 4: Final conclusion
From the graph, we see that $r_s = 0.9757$ lies above the 0.1% significance curve (at 8 degrees of freedom). Hence, there is a greater than 99.9% chance that the relationship is significant (i.e., not random), and the hypothesis should be rejected.
Thus, we reject the hypothesis and conclude that, in this case, the depth of a river progressively increases with its width.
χ²-Correlation Analysis
Chi-Squared Test of Correlation
This method is also termed Pearson's χ²-test, or simply the χ²-test.
It is applicable to categorical (discrete) data only.
χ²-Test Methodology
Contingency Table
Given a data set, it is customary to draw a contingency table, whose structure is given below.
χ²-Test Methodology
Entry in the Contingency Table: Observed Frequency
In the contingency table, an entry $o_{ij}$ denotes the observed frequency of the event that attribute A takes the value $a_i$ and attribute B takes the value $b_j$ (i.e., $A = a_i$, $B = b_j$).
χ²-Test Methodology
Entry in the Contingency Table: Expected Frequency
In the contingency table, an entry $e_{ij}$ denotes the expected frequency, which can be calculated as

$$ e_{ij} = \frac{Count(A=a_i)\times Count(B=b_j)}{Grand\ Total} = \frac{A_i \times B_j}{N} $$
χ²-Test
The χ² statistic aggregates, over all cells, the discrepancy between the observed and expected frequencies:

$$ \chi^2 = \sum_{i}\sum_{j}\frac{(o_{ij}-e_{ij})^2}{e_{ij}} $$
χ²-Test
The cells that contribute the most to the χ² value are those whose observed count is very different from the expected count.
χ²-Test
Example 7.3: Survey on Gender versus Hobby
Suppose a survey was conducted among a population of size 1500. In this survey, the gender of each person and their hobby, either "book" or "computer", was noted.
The survey results were obtained in a table like the following.
We have to find whether there is any association between the gender and the hobby of a person; that is, we are to test whether "gender" and "hobby" are correlated.
χ²-Test
Example 7.3: Survey on Gender versus Hobby
From the survey table, the observed frequencies are counted and entered into the contingency table, which is shown below.
[Contingency table: HOBBY (Book, Computer) × GENDER (Male, Female), with row and column totals; observed counts elided]
χ²-Test
Example 7.3: Survey on Gender versus Hobby
From the survey table, the expected frequencies are calculated and entered into the contingency table, which is shown below.
[Contingency table: HOBBY (Book, Computer) × GENDER (Male, Female), with row and column totals; expected counts elided]
χ²-Test
Using the equation for χ² computation, we sum the four cell contributions $(o_{ij}-e_{ij})^2 / e_{ij}$ to obtain the χ² value.
This value needs to be compared with the tabulated value of χ² (available in any standard book on statistics) with 1 degree of freedom (for a table of m × n, the degrees of freedom is (m − 1) × (n − 1); here m = 2, n = 2).
For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828. Since the computed value is above this, we reject the hypothesis that "Gender" and "Hobby" are independent and conclude that the two attributes are strongly correlated for the given group of people.
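As a sketch of the whole procedure, the following Python fragment computes the expected frequencies and the χ² statistic for a 2 × 2 table; the observed counts are hypothetical, chosen only to be consistent with a survey of 1500 people.

```python
# Hypothetical observed counts: rows = Hobby (Book, Computer),
# columns = Gender (Male, Female); total 1500.
observed = [[250, 200],
            [50, 1000]]

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / n      # expected frequency e_ij
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))  # compare with 10.828 (df = 1, 0.001 level)
```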
χ²-Test
Example 7.4: Hypothesis on "accident proneness" versus "driver's handedness"
Consider the following data on car accidents among left-handed and right-handed drivers, with a sample size of 175.
The hypothesis is that "fatality of accidents is independent of the driver's handedness".
[Contingency table: accident fatality × HANDEDNESS, with a Fatal row and totals; counts elided]
Find the correlation between fatality and handedness, and test the significance of the correlation at the 0.1% significance level.
Regression Analysis
Regression Analysis
The regression analysis is a statistical method to deal with the formulation of
mathematical model depicting relationship amongst variables, which can be used
for the purpose of prediction of the values of dependent variable, given the values
of independent variables.
Classification of Regression Analysis Models
Linear regression models
1. Simple linear regression
2. Multiple linear regression
Non-linear regression models
Y Y Y
X X X
Simple linear regression Z Multiple linear regression Non-linear regression
Simple Linear Regression Model
In simple linear regression, we have only two variables:
Dependent variable (also called the response), usually denoted Y.
Independent variable (alternatively called the regressor), usually denoted x.
A reasonable form of relationship between the response and the regressor is the linear relationship, that is, of the form

$$ Y = \alpha + \beta x $$

[Figure: the line Y = α + βx, with intercept α and slope β = tan θ]
Note:
There are an infinite number of such lines (and hence of (α, β) pairs).
Regression analysis deals with finding the best relationship between Y and x (and hence the best-fitted values of α and β), quantifying the strength of that relationship.
Regression Analysis
Given a set of data involving n pairs of values, our objective is to find the "true" or population regression line

$$ Y = \alpha + \beta x + \varepsilon $$

Here, ε is a random variable with $E(\varepsilon) = 0$ and $Var(\varepsilon) = \sigma^2$. The quantity σ² is often called the error variance.
Note:
$E(\varepsilon) = 0$ implies that at a specific x, the y values are distributed around the "true" regression line (i.e., the positive and negative errors around the true line balance out).
α and β are called regression coefficients.
The fitted line $\hat{Y} = a + bx$ estimates the true line $Y = \alpha + \beta x$.
Least Squares Method to Estimate a and b
This method uses the concept of a residual. A residual is essentially an error in the fit of the model $\hat{Y} = a + bx$. Thus, the i-th residual is

$$ e_i = y_i - \hat{y}_i = y_i - (a + b x_i) $$

[Figure: fitted line Ŷ = a + bx and true line Y = α + βx, showing the residual eᵢ and the model error εᵢ]
Least Squares Method
The residual sum of squares is often called the sum of squares of the errors about the fitted line and is denoted SSE:

$$ SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - a - b x_i\right)^2 $$

We minimize SSE to determine the parameters a and b.
Least Square method to estimate
Thus we set
+b=
These two equations can be solved to determine the values of and b, and it can be
calculated that
R²: Measure of Quality of Fit
A quantity R², called the coefficient of determination, is used to measure the proportion of variability explained by the fitted model. We have

$$ R^2 = 1 - \frac{SSE}{SST}, \qquad SST = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 $$

Note:
If the fit is perfect, all residuals are zero and thus R² = 1.0 (very good fit).
If SSE is only slightly smaller than SST, then R² ≈ 0 (very poor fit).
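Putting the last few slides together, here is a minimal least-squares fit in plain Python that estimates a and b from the closed-form solution above and reports R²; the data points are illustrative.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing SSE for the model y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    return a, b

def r_squared(xs, ys, a, b):
    """Coefficient of determination R^2 = 1 - SSE/SST."""
    my = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - sse / sst

xs = [1, 2, 3, 4, 5]                 # illustrative data
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = fit_line(xs, ys)
print(a, b, r_squared(xs, ys, a, b))  # near-perfect linear fit, R^2 close to 1
```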
Multiple Linear Regression
When more than one variable are independent variable, then the regression
can be estimated as a multiple regression model
When this model is linear in coefficients, it is called multiple linear regression
model
If k-independent variables , …………, are associated, the multiple linear
regression model is given by
++
++
Multiple Linear Regression
Estimating the coefficients
Let the data points given to us be

$$ (x_{1i}, x_{2i}, \ldots, x_{ki}, y_i), \quad i = 1, 2, \ldots, n $$

Thus,

$$ y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \varepsilon_i \quad\text{and}\quad y_i = b_0 + b_1 x_{1i} + \cdots + b_k x_{ki} + e_i $$

where $\varepsilon_i$ and $e_i$ are the random error and the residual error, respectively, associated with the true response and the fitted response.
Using the least squares method to estimate $b_0, b_1, \ldots, b_k$, we minimize the expression

$$ SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_{1i} - \cdots - b_k x_{ki}\right)^2 $$
Multiple Linear Regression
Differentiating SSE in turn with respect to and equating to zero, we generate the set of
(k+1) normal estimation equations for multiple linear regression.
++
+
… … … … … …
… … … … … …
+
The system of linear equations can be solved for by any appropriate method for solving
system of linear equations.
Hence, the multiple linear regression model can be built.
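A compact way to solve the normal equations is in matrix form, $(X^\top X)\,b = X^\top y$. The sketch below uses numpy; the two-regressor data are illustrative.

```python
import numpy as np

# Illustrative data: two regressors x1, x2 and a response y.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 5.9, 11.2, 11.8, 16.1])

Xd = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
b = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)    # solve the normal equations
print(b)  # [b0, b1, b2]
```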
Non Linear Regression Model
When the regression equation is in terms of r-degree, r>1, then it is called nonlinear
regression model. When more than one independent variables are there, then it is
called Multiple Non linear Regression model. Also, alternatively termed as
polynomial regression model. In general, it takes the form
++
++
Solving for Polynomial Regression Model
Given that (); i = 1,2,…,n are n pairs of observations. Each observations would satisfy the
equations:
++
and ++ +
where, r is the degree of polynomial
= is the random error
= is the residual error
Note: The number of observations, n, must be at least as large as r+1, the number of
parameters to be estimated.
The polynomial model can be transformed into a general linear regression model setting ,
…, = . Thus, the equation assumes the form:
++
++r +
This model then can be solved using the procedure followed for multiple linear
regression model.
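The transformation to a linear model is exactly what the short numpy sketch below does: it builds a design matrix with columns 1, x, x², …, x^r and solves by least squares. The data and the degree r = 2 are illustrative.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 0.9, 2.8, 7.1, 13.0, 21.2])  # roughly 1 - x + x^2
r = 2                                           # degree of the polynomial

# Columns 1, x, x^2, ..., x^r: the substitution x_j = x^j from above.
Xd = np.vander(x, r + 1, increasing=True)
coeffs, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(coeffs)  # [a, b1, ..., br]
```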
Auto-Regression Analysis
Auto-Regression Analysis
Regression analysis for time-ordered data is known as auto-regression analysis.
Time-series data are data collected on the same observational unit at multiple time periods.
Auto-Regression Analysis
Examples: Which of the following are time-series data?
Aggregate consumption and GDP for a country (for example, 20 years of quarterly observations = 80 observations)
Yen/$, pound/$, and euro/$ exchange rates (daily data for 1 year = 365 observations)
Cigarette consumption per capita in a state, by year
Auto-Regression Analysis
Examples: Which of the following graphs is due to time-series data?
Use of Time Series Data
To develop forecast model
What will the rate of inflation be next year?
Rates of inflation and unemployment in the country can be observed only over
time!
Modeling with Time-Series Data
Correlation over time: serial correlation, also called autocorrelation
Calculating standard errors
How to estimate the model?
Forecasting models
Auto-Regression Model for Forecasting
Some Notations and Concepts
Yt = Value of Y in a period t
Data set [Y1, Y2, … YT-1, YT]: T observations on the time series random variable
Y
Assumptions
We consider only consecutive, evenly spaced observations
For example, monthly, 2000-2015, no missing months
A time series Yt is stationary if its probability distribution does not change over
time, that is, if the joint distribution of (Yi+1, Yi+2, …, Yi+T) does not depend on i.
Stationary property implies that history is relevant. In other words, Stationary requires the
future to be like the past (in a probabilistic sense).
Auto Regression analysis assumes that Yt is stationary.
Some Notations and Concepts
There are four ways to have the time series data for AutoRegression analysis
Difference: The fist difference of a series, Yt is its change between period t and t-
1, that is, yt = Yt - Yt-1
Percentage:
Some Notations and Concepts
Autocorrelation
The correlation of a series with its own lagged values is called autocorrelation (also called serial correlation). The j-th autocorrelation is the correlation between $Y_t$ and $Y_{t-j}$.
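The quantities above are easy to compute directly; in the sketch below the series ys is made up, and the lag-1 autocorrelation is approximated with numpy's correlation coefficient on the lagged slices.

```python
import numpy as np

ys = np.array([100.0, 102.0, 101.0, 105.0, 108.0, 107.0, 111.0, 115.0])

diff = ys[1:] - ys[:-1]          # first difference: Y_t - Y_{t-1}
pct = 100.0 * diff / ys[:-1]     # percentage change

def autocorr(y, j):
    # Correlation of the series with itself lagged by j periods.
    return np.corrcoef(y[j:], y[:-j])[0, 1]

print(diff)
print(pct)
print(autocorr(ys, 1))
```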
Auto-Regression Model for Forecasting
A natural starting point for a forecasting model is to use past values of Y, that is, $Y_{t-1}, Y_{t-2}, \ldots$, to predict $Y_t$.
p-th Order Auto-Regression Model
Definition 7.5: p-th order auto-regression model
The p-th order auto-regression model, AR(p), expresses $Y_t$ as a linear function of its p most recent values:

$$ Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p} + \varepsilon_t $$
Computing AR Coefficients
A number of techniques known for computing the AR coefficients
The most common method is called Least Squares Method (LSM)
The LSM is based upon the Yule-Walker equations
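To make the AR(p) idea concrete, here is a least-squares sketch: each row of the design matrix holds the p lagged values of the (made-up) series, and the fitted coefficients give a one-step-ahead forecast. This is the plain regression route, not the Yule-Walker route.

```python
import numpy as np

def fit_ar(ys, p):
    """Fit Y_t = b0 + b1*Y_{t-1} + ... + bp*Y_{t-p} by least squares."""
    rows = [[1.0] + [ys[t - k] for k in range(1, p + 1)]
            for t in range(p, len(ys))]
    X = np.array(rows)
    y = np.array(ys[p:])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs  # [b0, b1, ..., bp]

ys = [1.0, 1.3, 1.1, 1.6, 1.4, 1.9, 1.7, 2.2, 2.0, 2.5]  # made-up series
b = fit_ar(ys, p=2)
print(b)
# One-step-ahead forecast from the two most recent observations:
print(b[0] + b[1] * ys[-1] + b[2] * ys[-2])
```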