
Relation Analysis

Hypothesis Testing Strategies

 There are two types of tests of hypotheses:

 Parametric tests (also called standard tests of hypotheses)

 Non-parametric tests (also called distribution-free tests of hypotheses)
Parametric Tests: Applications
 Parametric tests usually assume certain properties of the population from which we draw samples:

• Observations come from a normal population.

• Sample size is small.

• Population parameters like mean and variance hold good (i.e., are meaningful for the data).

• They require measurement equivalent to interval-scaled data.
Hypothesis Testing: Non-Parametric Tests
Non-parametric tests
o Do not rest on any assumption about the population distribution.
o Require only nominal or ordinal data.

Note: Non-parametric tests need the entire population (or a very large sample size).
Relationship Analysis
Example: Wage Data

A large data set of wages for a group of employees from the eastern region of India is given.

In particular, we wish to understand the following relationships:

 Employee’s age and wage: How do wages vary with age?
 Calendar year and wage: How do wages vary with time?
 Employee’s education and wage: Are wages in any way related to employees’ education levels?
Relationship Analysis
 Example: Wage Data

 Case I. Wage versus Age

 From the data set, we have a graphical representation, as follows:

[Figure: scatter plot of wage versus age]

How do wages vary with age?
Relationship Analysis
 Example: Wage Data
 Employee’s age and wage: How do wages vary with age?

Interpretation: On average, wage increases with age until about 60 years of age, at which point it begins to decline.
Relationship Analysis
 Example: Wage Data

 Case II. Wage versus Year

 From the data set, we have a graphical representation, as follows:

[Figure: scatter plot of wage versus calendar year]

How do wages vary with time?
Relationship Analysis
 Example: Wage Data
 Wage and calendar year: How do wages vary across years?

Interpretation: There is a slow but steady increase in the average wage between 2010 and 2016.
Relationship Analysis
 Example: Wage Data

 Case III. Wage versus Education

 From the data set, we have a graphical representation, as follows:

[Figure: plot of wage versus education level]

Are wages related to education?
Relationship Analysis
 Example: Wage Data
 Wage and education level: Do wages vary with employees’ education levels?

Interpretation: On average, wage increases with the level of education.
Relationship Analysis
Given an employee’s wage, can we predict his age?

Does wage have any association with both year and education level?

etc.
An Open Challenge!

Suppose there are countably infinite points in the X-Y plane. We need a huge memory to store all such points.

Is there any way to store this information with the least amount of memory?
Say, with two values only.
Yahoo!
y = ax + b

Just decide the values of a and b

(as if storing one point’s data only!)

Note: Here, the trick was to find a relationship among all the points.
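Below is a minimal Python sketch of the idea: if every point lies on the line y = ax + b, two stored numbers reconstruct any point. The sample points are hypothetical.

```python
# Two observed points, assumed to lie on the common line y = a*x + b
p1, p2 = (1.0, 5.0), (3.0, 11.0)

a = (p2[1] - p1[1]) / (p2[0] - p1[0])  # slope
b = p1[1] - a * p1[0]                  # intercept

print(a, b)          # 3.0 2.0 -- the only two values we need to store
print(a * 10.0 + b)  # reconstruct the point at x = 10
```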
Measures of Relationship
 Univariate population: A population consisting of only one variable.

Here, simple statistical measures suffice.

 Bivariate population: Here, the data happen to be on two variables.
Measures of Relationship
 Multivariate population: Here, the data happen to be on more than two variables.

[Figure: a three-way relationship among Pressure, Volume, and Temperature]

What if we add another variable, say viscosity, in addition to Pressure, Volume, and Temperature?
Measures of Relationship
In the case of bivariate and multivariate populations, we usually have to answer two types of questions:

Q1: Does there exist correlation (i.e., association) between two (or more) variables?
If yes, of what degree?

Q2: Is there any cause-and-effect relationship between the two variables (in the case of a bivariate population), or between one variable on one side and two or more variables on the other (in the case of a multivariate population)?
If yes, of what degree and in which direction?

To find answers to the above questions, two approaches are known:

 Correlation Analysis
 Regression Analysis
Correlation Analysis

Correlation Analysis
 In statistics, the word correlation denotes some form of association between two variables.
 Example: Weight is correlated with height.

The correlation may be positive, negative, or zero.

 Positive correlation: The value of attribute A increases with an increase in the value of attribute B, and vice versa.
 Negative correlation: The value of attribute A decreases with an increase in the value of attribute B, and vice versa.
 Zero correlation: The values of attribute A vary at random with respect to B, and vice versa.
Correlation Analysis
 We need a way to measure the degree of correlation between two attributes.

[Figure: scatter plot with values from 10 to 100 on the vertical axis against hours of study (1 to 7) on the horizontal axis]
Correlation Analysis
 Do you find any correlation between X and Y as shown in the table?

[Table: X = number of CDs (# CD), Y = number of cigarettes (# Cigarette)]

Note:
In data analytics, correlation analysis makes sense only when the relationship itself makes sense: there should be a cause-effect relationship.
Correlation Analysis

[Figure: three scatter plots illustrating positive correlation, negative correlation, and zero correlation]
Correlation Coefficient
 The correlation coefficient is used to measure the degree of association.

 It is usually denoted by r.

 The value of r lies between +1 and -1.

 Positive values of r indicate positive correlation between the two variables, whereas negative values of r indicate negative correlation.

 A value of r nearer to +1 or -1 indicates a high degree of correlation between the two variables.

 r = 0 implies there is no correlation.
Correlation Coefficient

[Figure: four scatter plots illustrating high positive, low positive, high negative, and low negative correlation]
Correlation Coefficient

[Figure: example scatter plots with correlation coefficients R = +0.60, R = +0.80, R = +0.80, and R = +0.40]
Measuring Correlation Coefficients
 There are three methods known to measure correlation coefficients:

 Karl Pearson’s coefficient of correlation
 This method is applicable for finding the correlation coefficient between two numerical attributes.

 Charles Spearman’s coefficient of correlation
 This method is applicable for finding the correlation coefficient between two ordinal attributes.

 Chi-square coefficient of correlation
 This method is applicable for finding the correlation coefficient between two categorical attributes.
Pearson’s Correlation Coefficient

Karl Pearson’s Correlation Coefficient
 This is also called Pearson’s Product Moment Correlation.

Definition 7.1: Karl Pearson’s correlation coefficient

Let us consider two attributes X and Y with n paired observations (xi, yi).
Karl Pearson’s coefficient of correlation is denoted by r and is defined as

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

where x̄ and ȳ are the means of X and Y, respectively.
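The following is a minimal Python sketch of Definition 7.1; the data are hypothetical, purely to exercise the formula.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Karl Pearson's r, computed directly from the definition
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(r)                        # close to +1: strong positive correlation
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in estimate agrees
```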
Karl Pearson’s Coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
 A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

[Table: gestational age and birth weight for the 17 infants]
Karl Pearson’s Coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
 We wish to estimate the association between gestational age and infant birth weight.
 In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus Y = birth weight and X = gestational age.
 The data are displayed in a scatter diagram in the figure below.

[Figure: scatter plot of birth weight versus gestational age]
Karl Pearson’s Coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
 For the given data, it can be shown that

r = 0.82

Conclusion: The sample correlation coefficient indicates a strong positive correlation between gestational age and birth weight.
Karl Pearson’s Coefficient of Correlation
Example 7.1: Correlation of Gestational Age and Birth Weight
 Significance Test
 To test whether the association is merely apparent and might have arisen by chance, use the t-test with the following calculation:

$$ t = r\sqrt{\frac{n-2}{1-r^2}} $$

 The number of pairs of observations is 17. Hence,

$$ t = 0.82\sqrt{\frac{17-2}{1-0.82^2}} \approx 5.55 $$

 Consulting the t-table at 15 degrees of freedom for α = 0.05, we find t = 1.753. Since the computed value far exceeds this, the value of Pearson’s correlation coefficient in this case may be regarded as highly significant.
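A short Python sketch of this significance test, using r and n from Example 7.1 (SciPy supplies the critical value in place of the printed t-table):

```python
import numpy as np
from scipy import stats

r, n = 0.82, 17
t = r * np.sqrt((n - 2) / (1 - r ** 2))
print(t)  # about 5.55

# One-sided critical value at alpha = 0.05 with n - 2 degrees of freedom
t_crit = stats.t.ppf(0.95, df=n - 2)
print(t_crit)      # about 1.753
print(t > t_crit)  # True: the correlation is significant
```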
Rank Correlation Coefficient

Charles Spearman’s Correlation Coefficient
 This correlation measurement is also called rank correlation.

 This technique is applicable for determining the degree of correlation between two variables in the case of ordinal data.

 We can assign ranks to the different values of a variable with an ordinal data type.

Example:

[Table: ordinal values and the ranks assigned to them]
Charles Spearman’s Correlation Coefficient

Definition 7.2: Charles Spearman’s correlation coefficient

The rank correlation coefficient can be defined as

$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

where di is the difference between the two ranks of the i-th observation and n is the number of observations.

 Spearman’s coefficient is often used as a statistical method to aid in either proving or disproving a hypothesis.
Charles Spearman’s Coefficient of Correlation
Example 7.2: The hypothesis is that the depth of a river does not progressively increase with the width of the river.

A sample of size 10 is collected to test the hypothesis, using Spearman’s correlation coefficient.

[Table: width and depth measurements at 10 points along the river]
Charles Spearman’s Coefficient of Correlation
Step 1: Assign a rank to each observation. It is customary to assign rank 1 to the largest value, 2 to the next largest, and so on.
Note: If there are two or more samples with the same value, the mean rank should be used.
Charles Spearman’s Coefficient of Correlation
Step 2: The contingency table will look like:

[Table: ranks of width and depth, their differences di, and di²]

$$ r_s = 0.9757 $$
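A small Python sketch of Steps 1-2; the width/depth values below are hypothetical stand-ins for the slide's 10 measurements, which are not reproduced in the text.

```python
import numpy as np
from scipy import stats

width = np.array([0.5, 1.1, 1.8, 2.4, 3.0, 3.6, 4.1, 4.8, 5.3, 6.0])
depth = np.array([0.20, 0.35, 0.30, 0.45, 0.50, 0.70, 0.65, 0.90, 0.85, 1.10])

# scipy ranks the data (using mean ranks for ties) and applies Definition 7.2
r_s, p_value = stats.spearmanr(width, depth)
print(r_s, p_value)
```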
Charles Spearman’s Coefficient of Correlation
Step 3: To see if this value is significant, Spearman’s rank significance table (or graph) must be consulted.

[Figure: Spearman’s rank correlation coefficient (0.1 to 1.0) plotted against sample size (2 to 10), with significance curves at the 5%, 1%, and 0.1% levels]
Charles Spearman’s Coefficient of Correlation
Step 4: Final conclusion
From the graph, we see that rs = 0.9757 lies above the 0.1% significance line. Hence, there is a greater than 99.9% chance that the relationship is significant (i.e., not random), and so the hypothesis should be rejected.

Thus, we can reject the hypothesis and conclude that, in this case, the depth of a river progressively increases with the width of the river.
χ²-Correlation Analysis
Chi-Squared Test of Correlation
 This method is also alternatively termed Pearson’s χ²-test, or simply the χ²-test.
 This method is applicable to categorical (discrete) data only.

 Suppose two attributes A and B have categorical values

A = a1, a2, …, am and
B = b1, b2, …, bn

having m and n distinct values, respectively, between which we are to find the correlation relationship.
χ²-Test Methodology
Contingency Table
Given a data set, it is customary to draw a contingency table, whose structure is given below.

[Table: structure of an m × n contingency table of attribute A’s values against attribute B’s values]
χ²-Test Methodology
Entry into Contingency Table: Observed Frequency
In the contingency table, an entry oij denotes the observed frequency of the event that attribute A takes the value ai and attribute B takes the value bj (i.e., A = ai, B = bj).
χ²-Test Methodology
Entry into Contingency Table: Expected Frequency
In the contingency table, an entry eij denotes the expected frequency, which can be calculated as

$$ e_{ij} = \frac{\mathit{Count}(A = a_i) \times \mathit{Count}(B = b_j)}{\mathit{Grand\ Total}} = \frac{A_i \times B_j}{N} $$
χ²-Test

Definition 7.3: χ²-Value

The χ²-value (also known as Pearson’s χ² statistic) can be computed as

$$ \chi^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} $$

where oij is the observed frequency and eij is the expected frequency.
χ²-Test
 The cells that contribute the most to the χ² value are those whose observed count is very different from the expected count.

 The χ² statistic tests the hypothesis that A and B are independent. The test is based on a significance level, with (n-1) × (m-1) degrees of freedom for a contingency table of size n × m.

 If the hypothesis can be rejected, then we say that A and B are statistically related or associated.
χ²-Test
Example 7.3: Survey on Gender versus Hobby
 Suppose a survey was conducted among a population of size 1500. In this survey, the gender of each person and their hobby, either “book” or “computer”, was noted. The survey results were recorded in a table like the following.

[Table: survey records of gender and hobby]

 We have to find whether there is any association between the gender and hobby of a person; that is, we are to test whether “gender” and “hobby” are correlated.
χ²-Test
Example 7.3: Survey on Gender versus Hobby
 From the survey table, the observed frequencies are counted and entered into the contingency table, which is shown below.

[Table: observed frequencies of Hobby (Book, Computer) versus Gender (Male, Female), with row, column, and grand totals]
χ²-Test
Example 7.3: Survey on Gender versus Hobby
 From the observed frequencies, the expected frequencies are calculated and entered into the contingency table, which is shown below.

[Table: expected frequencies of Hobby (Book, Computer) versus Gender (Male, Female), with row, column, and grand totals]
χ²-Test
 Using the equation for the χ² computation, we sum the term (oij - eij)²/eij over all four cells to obtain the χ² value.
 This value needs to be compared with the tabulated value of χ² (available in any standard book on statistics) with 1 degree of freedom (for a table of m × n, the degrees of freedom is (m-1) × (n-1); here m = 2, n = 2).
 For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828. Since our computed value is above this, we reject the hypothesis that “Gender” and “Hobby” are independent and hence conclude that the two attributes are strongly correlated for the given group of people.
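A Python sketch of the whole test; the observed counts below are hypothetical, since the slide's actual contingency table is not reproduced in the text.

```python
import numpy as np
from scipy import stats

#                       Male  Female
observed = np.array([[ 250,   200],    # Book
                     [  50,  1000]])   # Computer

# correction=False matches Definition 7.3 (no Yates continuity correction)
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, dof)      # chi-square statistic and 1 degree of freedom
print(expected)       # e_ij = (row total * column total) / N
print(chi2 > 10.828)  # compare with the 0.001-level critical value
```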
χ²-Test
Example 7.4: Hypothesis on “accident proneness” versus “driver’s handedness”
 Consider the following data on car accidents among left- and right-handed drivers, for a sample of size 175.
 The hypothesis is that “fatality of accidents is independent of the driver’s handedness”.

[Table: counts of Fatality (Non-Fatal, Fatal) versus Handedness (Left-Handed, Right-Handed), with totals]

 Find the correlation between fatality and handedness, and test the significance of the correlation at the 0.1% significance level.
Regression Analysis

Regression Analysis
 Regression analysis is a statistical method for formulating a mathematical model depicting the relationship among variables, which can be used to predict the values of the dependent variable given the values of the independent variables.
 Classification of regression analysis models:
 Linear regression models
1. Simple linear regression
2. Multiple linear regression
 Non-linear regression models

[Figure: three sketches: simple linear regression (Y versus X), multiple linear regression (Y versus X and Z), and non-linear regression (Y versus X)]
Simple Linear Regression Model
In simple linear regression, we have only two variables:
 The dependent variable (also called the response), usually denoted by Y.
 The independent variable (alternatively called the regressor), usually denoted by x.
 A reasonable form of relationship between the response and the regressor is the linear relationship, that is,

Y = α + βx

where α is the intercept and β = tan(θ) is the slope of the line.

Note:
 There are an infinite number of such lines (and hence values of α and β).
 Regression analysis deals with finding the best relationship between Y and x (and hence the best-fitted values of α and β), quantifying the strength of that relationship.
Regression Analysis

Given the set of data involving n pairs of (x, y) values, our objective is to find the “true” or population regression line such that

Y = α + βx + ε

Here, ε is a random variable with E(ε) = 0 and Var(ε) = σ². The quantity σ² is often called the error variance.

Note:
 E(ε) = 0 implies that at a specific x, the y values are distributed around the “true” regression line (i.e., positive and negative errors around the true line are equally reasonable).
 α and β are called the regression coefficients.
 The values of α and β are to be estimated from the data.


True versus Fitted Regression Line
 The task in regression analysis is to estimate the regression coefficients α and β.
 Suppose we denote the estimates a for α and b for β. Then the fitted regression line is

Ŷ = a + bx

where Ŷ is the predicted or fitted value, approximating the true line Y = α + βx.
Least Squares Method to Estimate α and β
This method uses the concept of a residual. A residual is essentially the error in the fit of the model Ŷ = a + bx at the i-th observation. Thus, the i-th residual is

ei = yi - ŷi

[Figure: the fitted line Ŷ = a + bx and the true line Y = α + βx, with the residual ei and the model error εi at a data point]
Least Squares Method
 The residual sum of squares is often called the sum of squares of the errors about the fitted line and is denoted by SSE:

$$ SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 $$

 We are to minimize the value of SSE and hence determine the parameters a and b.

 Differentiating SSE with respect to a and b, and setting each derivative to zero for the minimum:

$$ \frac{\partial SSE}{\partial a} = 0, \qquad \frac{\partial SSE}{\partial b} = 0 $$
Least Squares Method to Estimate a and b
Thus we obtain the normal equations

$$ na + b\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i $$

$$ a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i $$

These two equations can be solved to determine the values of a and b, and it can be shown that

$$ b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x} $$
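A minimal Python sketch of these estimates on hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8])

# Closed-form least-squares estimates from the normal equations
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)                      # intercept and slope

y_hat = a + b * x                # fitted values
print(np.sum((y - y_hat) ** 2))  # SSE, the quantity being minimized
```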
R²: Measure of Quality of Fit
 A quantity R², called the coefficient of determination, is used to measure the proportion of variability explained by the fitted model.
 We have the residual sum of squares, which signifies the variability due to error:

$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

 Now, let us define the total corrected sum of squares, which represents the total variation in the response values:

$$ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$

 The coefficient of determination is then

$$ R^2 = 1 - \frac{SSE}{SST} $$

Note:
 If the fit is perfect, all residuals are zero and thus R² = 1.0 (a very good fit).
 If SSE is only slightly smaller than SST, then R² ≈ 0 (a very poor fit).
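Continuing the earlier hypothetical fit, a short sketch of computing R²:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x
sse = np.sum((y - y_hat) ** 2)     # variability due to error
sst = np.sum((y - y.mean()) ** 2)  # total variation in the response
print(1 - sse / sst)               # R^2 near 1.0: a very good fit
```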
R²: Measure of Quality of Fit

[Figure: two scatter plots with fitted lines, one with R² ≈ 1.0 (a very good fit) and one with R² ≈ 0 (a very poor fit)]
Multiple Linear Regression
 When more than one variable is an independent variable, the regression can be estimated as a multiple regression model.
 When this model is linear in the coefficients, it is called a multiple linear regression model.
 If k independent variables x1, x2, …, xk are associated, the multiple linear regression model is given by

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon $$

 And the estimated response is obtained as

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k $$
Multiple Linear Regression
Estimating the coefficients
Let the data points given to us be

(x1i, x2i, …, xki, yi), i = 1, 2, …, n

where yi is the observed response to the values of the k independent variables.

Thus,

$$ y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} + \varepsilon_i \qquad \text{and} \qquad y_i = b_0 + b_1 x_{1i} + \dots + b_k x_{ki} + e_i $$

where εi and ei are the random error and residual error, respectively, associated with the true response and the fitted response.

Using the concept of the least squares method to estimate b0, b1, …, bk, we minimize the expression

$$ SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \Big( y_i - b_0 - \sum_{j=1}^{k} b_j x_{ji} \Big)^2 $$
Multiple Linear Regression
 Differentiating SSE in turn with respect to b0, b1, …, bk and equating to zero, we generate the set of (k+1) normal equations for multiple linear regression:

$$ n b_0 + b_1 \sum_i x_{1i} + \dots + b_k \sum_i x_{ki} = \sum_i y_i $$

$$ b_0 \sum_i x_{1i} + b_1 \sum_i x_{1i}^2 + \dots + b_k \sum_i x_{1i} x_{ki} = \sum_i x_{1i} y_i $$

…

$$ b_0 \sum_i x_{ki} + b_1 \sum_i x_{ki} x_{1i} + \dots + b_k \sum_i x_{ki}^2 = \sum_i x_{ki} y_i $$

 This system of linear equations can be solved for b0, b1, …, bk by any appropriate method for solving systems of linear equations.
 Hence, the multiple linear regression model can be built.
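A Python sketch with k = 2 hypothetical regressors; solving the normal equations is equivalent to the least-squares solve below.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([6.1, 6.9, 12.2, 13.1, 18.9, 19.8])

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)      # [b0, b1, b2]
print(X @ b)  # fitted responses
```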
Non-Linear Regression Model
 When the regression equation is of degree r, r > 1, it is called a non-linear regression model. When more than one independent variable is present, it is called a multiple non-linear regression model. It is also alternatively termed a polynomial regression model. In general, it takes the form

$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_r x^r + \varepsilon $$

 The estimated response is obtained as

$$ \hat{y} = b_0 + b_1 x + b_2 x^2 + \dots + b_r x^r $$
Solving the Polynomial Regression Model
Given that (xi, yi), i = 1, 2, …, n are n pairs of observations, each observation satisfies the equations

$$ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_r x_i^r + \varepsilon_i $$

$$ y_i = b_0 + b_1 x_i + b_2 x_i^2 + \dots + b_r x_i^r + e_i $$

where r is the degree of the polynomial, εi is the random error, and ei is the residual error.

Note: The number of observations, n, must be at least as large as r+1, the number of parameters to be estimated.

The polynomial model can be transformed into a general linear regression model by setting x1 = x, x2 = x², …, xr = xʳ. Thus, the equation assumes the form

$$ y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_r x_{ri} + e_i $$

This model can then be solved using the procedure followed for the multiple linear regression model.
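A sketch of this transformation for degree r = 2, on hypothetical data:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0, 51.2])

# Transform to a linear model: columns [1, x, x^2]
X = np.column_stack([np.ones_like(x), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # [b0, b1, b2]

# np.polyfit fits the same polynomial (coefficients highest degree first)
print(np.polyfit(x, y, deg=2))
```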
Auto-Regression Analysis

Auto-Regression Analysis
 Regression analysis for time-ordered data is known as auto-regression analysis.
 Time series data are data collected on the same observational unit at multiple time periods.

Example: the Indian rate of price inflation.
Auto-Regression Analysis
 Examples: Which of the following are time-series data?
 Aggregate consumption and GDP for a country (for example, 20 years of quarterly observations = 80 observations)
 Yen/$, pound/$ and Euro/$ exchange rates (daily data for 1 year = 365 observations)
 Cigarette consumption per capita in a state, by year
 Rainfall data over a year
 Sales of tea from a tea shop in a season
Auto-Regression Analysis
 Example: Which of the following graphs shows time-series data?

[Figure: candidate graphs]
Uses of Time Series Data
 To develop forecast models
 What will the rate of inflation be next year?

 To estimate dynamic causal effects
 If the central bank increases the interest rate now, what will be the effect on the rates of inflation and unemployment in 3 months? In 12 months?
 What is the effect over time on electronic goods consumption of a hike in the excise duty?

 Time-dependent analysis
 Rates of inflation and unemployment in the country can be observed only over time!
Modeling with Time Series Data
 Correlation over time
 Serial correlation, also called autocorrelation
 Calculating standard errors

 Estimating dynamic causal effects
 Under which conditions can dynamic effects be estimated?
 How to estimate them?

 Forecasting models
 Forecasting models built on regression models
Auto-Regression Model for Forecasting

 Can we predict the trend at a given time, say 2017?

[Figure: a time series whose value at 2017 is to be forecast]
Some Notations and Concepts
 Yt = value of Y in period t

 Data set [Y1, Y2, …, YT-1, YT]: T observations on the time series random variable Y
 Assumptions
 We consider only consecutive, evenly spaced observations
 For example, monthly data for 2000-2015 with no missing months

 A time series Yt is stationary if its probability distribution does not change over time, that is, if the joint distribution of (Yi+1, Yi+2, …, Yi+T) does not depend on i.
 The stationarity property implies that history is relevant. In other words, stationarity requires the future to be like the past (in a probabilistic sense).
 Auto-regression analysis assumes that Yt is stationary.
Some Notations and Concepts
 There are four ways to transform time series data for auto-regression analysis:

 Lag: The first lag of Yt is Yt-1; its j-th lag is Yt-j.

 Difference: The first difference of a series Yt is its change between periods t-1 and t, that is, yt = Yt - Yt-1.

 Log difference: yt = log(Yt) - log(Yt-1).

 Percentage change: yt = 100 × (Yt - Yt-1) / Yt-1.
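A quick pandas sketch of the four transformations on a short hypothetical series:

```python
import numpy as np
import pandas as pd

y = pd.Series([100.0, 102.0, 105.0, 104.0, 108.0])

lag1     = y.shift(1)            # first lag: Y_{t-1}
diff     = y.diff()              # first difference: Y_t - Y_{t-1}
log_diff = np.log(y).diff()      # log difference
pct      = 100 * y.pct_change()  # percentage change

print(pd.DataFrame({"Y": y, "lag": lag1, "diff": diff,
                    "log diff": log_diff, "pct": pct}))
```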
Some Notations and Concepts
 Autocorrelation
 The correlation of a series with its own lagged values is called autocorrelation (also called serial correlation).

Definition 7.4: j-th Autocorrelation

The j-th autocorrelation, denoted by ρj, is defined as

$$ \rho_j = \frac{\mathrm{Cov}(Y_t, Y_{t-j})}{\mathrm{Var}(Y_t)} $$
Some Notations and Concepts
 For the given data, say ρ1 = 0.84.

 This implies that the Dollars-per-Pound series is highly serially correlated.

 Similarly, we can determine ρ2, ρ3, etc., and hence carry out different regression analyses.
Auto-Regression Model for Forecasting
 A natural starting point for a forecasting model is to use past values of Y, that is, Yt-1, Yt-2, …, to predict Yt.

 An autoregression is a regression model in which Yt is regressed against its own lagged values.
 The number of lags used as regressors is called the order of the autoregression.
 In a first-order autoregression (denoted AR(1)), Yt is regressed against Yt-1.
 In a p-th order autoregression (denoted AR(p)), Yt is regressed against Yt-1, Yt-2, …, Yt-p.
p-th Order Auto-Regression Model

Definition 7.5: p-th Auto-Regression Model

In general, the p-th order autoregression model is defined as

$$ Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \varepsilon_t $$

 β0, β1, …, βp are called the autoregression coefficients, and εt is the noise term (or residual), which in practice is assumed to be Gaussian white noise.

 For example, AR(1) is

$$ Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t $$

 The task in AR analysis is to derive the “best” values of βi, i = 0, 1, …, p, given a time series Yt.
Computing AR Coefficients
 A number of techniques are known for computing the AR coefficients.
 The most common method is the Least Squares Method (LSM).
 The LSM is based upon the Yule-Walker equations:

$$ r_j = \beta_1 r_{j-1} + \beta_2 r_{j-2} + \dots + \beta_p r_{j-p}, \qquad j = 1, 2, \dots, p $$

with r0 = 1 and r-i = ri.

 Here, ri denotes the i-th autocorrelation coefficient.

 β0 can be chosen empirically, and is usually taken as zero.
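A compact Python sketch of solving the Yule-Walker equations under the assumptions stated above (zero-mean series, β0 = 0); the AR(2) series used to test it is simulated, not from the lecture.

```python
import numpy as np
from scipy.linalg import toeplitz

def fit_ar_yule_walker(y, p):
    """Estimate AR(p) coefficients from the Yule-Walker equations."""
    y = y - y.mean()                       # work with a zero-mean series
    n = len(y)
    # Sample autocorrelations r_0 .. r_p
    r = np.array([np.sum(y[: n - j] * y[j:]) / np.sum(y * y)
                  for j in range(p + 1)])
    R = toeplitz(r[:p])                    # Toeplitz matrix of r_0 .. r_{p-1}
    return np.linalg.solve(R, r[1 : p + 1])

# Simulated AR(2) data: Y_t = 0.6*Y_{t-1} - 0.3*Y_{t-2} + noise
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

print(fit_ar_yule_walker(y, p=2))  # close to [0.6, -0.3]
```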