DA Unit-3
The Gauss Markov theorem tells us that if a certain set of assumptions are met, the
ordinary least squares estimate for regression coefficients gives you the Best Linear
Unbiased Estimate (BLUE) possible.
Linearity:
o The parameters we are estimating using the OLS method must themselves be linear.
Random:
o Our data must have been randomly sampled from the population.
Non-Collinearity:
o The regressors being calculated aren’t perfectly correlated with each other.
Exogeneity:
o The regressors aren’t correlated with the error term.
Homoscedasticity:
o No matter what the values of our regressors might be, the variance of the error term is constant.
Checking how well our data matches these assumptions is an important part of estimating
regression coefficients.
When you know where these conditions are violated, you may be able to plan ways to
change your experiment setup to help your situation fit the ideal Gauss Markov situation
more closely.
In practice, the Gauss Markov assumptions are rarely all met perfectly, but they are still
useful as a benchmark, and because they show us what ‘ideal’ conditions would be.
They also allow us to pinpoint problem areas that might cause our estimated regression
coefficients to be inaccurate or even unusable.
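As a minimal sketch of what such a check might look like in practice (assuming NumPy and Matplotlib are available; the data here is synthetic and purely illustrative), a common diagnostic is to fit the model and plot residuals against fitted values: a roughly constant spread around zero is consistent with homoscedasticity, while a funnel shape or a clear trend is a warning sign.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, purely illustrative data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=200)

# Fit a simple linear model by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
w, *_ = np.linalg.lstsq(X, y, rcond=None)      # OLS coefficients
fitted = X @ w
residuals = y - fitted

# Residual-vs-fitted plot: look for constant spread around zero.
plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted values")
plt.show()
```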
The Gauss-Markov Assumptions in Algebra
We can summarize the Gauss-Markov assumptions succinctly in algebra by saying that a linear regression model represented by

y = Xβ + ε

and estimated by ordinary least squares gives the best linear unbiased estimate (BLUE) possible if

1. E[ε] = 0
2. the regressors are not perfectly collinear (X has full column rank)
3. Cov(X, ε) = 0
4. Var(ε) = σ²I

The first of these assumptions can be read as "the expected value of the error term is zero." The second is non-collinearity, the third is exogeneity, and the fourth is homoscedasticity.
Regression Concepts
Regression
Each xi corresponds to the set of attributes of the ith observation (known as explanatory
variables) and yi corresponds to the target (or response) variable.
The explanatory attributes of a regression task can be either discrete or continuous.
Regression (Definition)
Regression is the task of learning a target function f that maps each attribute set x into a
continuous-valued output y.
The goal of regression is to find a target function that fits the input data with minimum error.
The error function for a regression task can be expressed in terms of the sum of absolute or squared errors:

E = Σ_i |y_i − f(x_i)|    or    E = Σ_i (y_i − f(x_i))^2
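As a small illustration (a sketch only, using made-up numbers and a hypothetical candidate function f), both error measures can be computed directly:

```python
import numpy as np

# Hypothetical observations (x_i, y_i) and a candidate target function f.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.9, 4.1, 6.2, 7.8])
f = lambda x: 2.0 * x                      # candidate model f(x) = 2x

abs_error = np.sum(np.abs(y - f(x)))       # sum of absolute errors
sq_error = np.sum((y - f(x)) ** 2)         # sum of squared errors (SSE)
print(abs_error, sq_error)
```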
Simple Linear Regression
Suppose we wish to fit the following linear model to the observed data:

f(x) = w1 x + w0

where w0 and w1 are parameters of the model and are called the regression coefficients.
A standard approach for doing this is to apply the method of least squares, which attempts to find the parameters (w0, w1) that minimize the sum of squared errors

SSE = Σ_i [y_i − f(x_i)]^2 = Σ_i [y_i − w1 x_i − w0]^2

Setting the partial derivatives of the SSE with respect to w0 and w1 to zero yields two linear equations in the unknowns. These equations can be summarized by the following matrix equation, which is also known as the normal equation:

[ N          Σ_i x_i    ] [w0]   =   [ Σ_i y_i      ]
[ Σ_i x_i    Σ_i x_i^2  ] [w1]       [ Σ_i x_i y_i  ]
Since the quantities N, Σ_i x_i, Σ_i y_i, Σ_i x_i^2, and Σ_i x_i y_i can all be computed from the observed data, the normal equations can be solved to obtain the following estimates for the parameters:

w1 = (N Σ_i x_i y_i − Σ_i x_i Σ_i y_i) / (N Σ_i x_i^2 − (Σ_i x_i)^2)
w0 = (Σ_i y_i − w1 Σ_i x_i) / N

Thus, the linear model that best fits the data in terms of minimizing the SSE is f(x) = w1 x + w0 with these estimated coefficients.
More generally, we can show that the solution to the normal equations can be expressed as follows:

w = (X^T X)^(-1) X^T y

where X is the matrix of explanatory variables (with a column of ones appended for the intercept) and y is the vector of observed responses. Thus, the linear model that results in the minimum squared error is given by f(x) = x^T w.
In summary, the least squares method is a systematic approach to fitting a linear model to the response variable y by minimizing the squared error between the true and estimated values of y.
Although the model is relatively simple, it seems to provide a reasonably accurate
approximation because a linear model is the first-order Taylor series approximation for
any function with continuous derivatives.
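The closed-form estimates above translate directly into code. The following sketch (assuming NumPy; the data is again synthetic and for illustration only) computes w1 and w0 from the summary statistics and cross-checks them against numpy.polyfit:

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=50)
y = 1.5 + 0.8 * x + rng.normal(0, 0.3, size=50)

n = len(x)
# Least squares estimates obtained from the normal equations.
w1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
w0 = (np.sum(y) - w1 * np.sum(x)) / n

# Cross-check with NumPy's degree-1 polynomial fit (returns [slope, intercept]).
w1_np, w0_np = np.polyfit(x, y, deg=1)
print(w0, w1)        # estimates from the closed-form solution
print(w0_np, w1_np)  # should agree up to floating-point error
```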
Logistic Regression
Consider a procedure in which individuals are selected on the basis of their scores in a
battery of tests.
After five years the candidates are classified as "good" or "poor".
We are interested in examining the ability of the tests to predict the job performance of
the candidates.
Here the response variable, performance, is dichotomous.
We can code "good" as 1 and "poor" as 0, for example.
The predictor variables are the scores in the tests.
In a study to determine the risk factors for cancer, health records of several people were
studied.
Data were collected on several variables, such as age, gender, smoking, diet, and the
family's medical history.
The response variable was whether the person had cancer (Y = 1) or did not have cancer (Y = 0).
The relationship between the probability π and X can often be represented by a logistic
response function.
It resembles an S-shaped curve.
The probability π initially increases slowly as X increases, then the increase accelerates, and finally it stabilizes without ever exceeding 1.
Intuitively this makes sense.
Consider the probability of a questionnaire being returned as a function of cash reward,
or the probability of passing a test as a function of the time put in studying for it.
The shape of the S-curve can be reproduced if we model the probabilities as follows:

π(x) = e^(β0 + β1 x) / (1 + e^(β0 + β1 x))
A sigmoid function is a bounded differentiable real function that is defined for all real
input values and has a positive derivative at each point.
Modeling the response probabilities by the logistic distribution and estimating the parameters β0 and β1 of the model given above constitutes fitting a logistic regression.
In logistic regression the fitting is carried out by working with the logits. The logit transformation,

logit(π) = log(π / (1 − π)) = β0 + β1 x

produces a model that is linear in the parameters.
The method of estimation used is the maximum likelihood method.
The maximum likelihood estimates are obtained numerically, using an iterative
procedure.
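As a hedged sketch of what this fitting looks like in practice (assuming NumPy and statsmodels are installed; the test scores and good/poor labels below are synthetic, not real data), the iterative maximum likelihood estimation is handled by the library:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic example: one test score per candidate, label 1 = "good", 0 = "poor".
rng = np.random.default_rng(2)
scores = rng.uniform(40, 100, size=300)
# Hypothetical relationship used only to generate illustrative labels.
p_good = 1.0 / (1.0 + np.exp(-(-12.0 + 0.18 * scores)))
labels = rng.binomial(1, p_good)

# Fit pi(x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)) by maximum likelihood.
X = sm.add_constant(scores)          # adds the intercept column
result = sm.Logit(labels, X).fit()   # iterative maximum likelihood fitting
print(result.params)                 # estimated [b0, b1]
```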
OLS:
Ordinary least squares, or OLS, is also called linear least squares.
It is a method for estimating the unknown parameters of a linear regression model.
The OLS estimates are obtained by minimizing the sum of the squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation.
When there is a single regressor on the right-hand side of the linear regression model, the resulting estimator can be expressed by a simple formula.
For example, suppose you have a system with more equations than unknown parameters.
You may use the ordinary least squares method because it is the standard approach for finding an approximate solution to such an overdetermined system.
In other words, it is the solution that minimizes the sum of the squared errors of the equations.
Data fitting is the most common application. The best fit in the ordinary least squares sense is the one that minimizes the sum of squared residuals.
A "residual" is the difference between an observed value and the fitted value provided by the model.
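As a brief sketch of the overdetermined-system view (assuming NumPy; the numbers are illustrative), np.linalg.lstsq returns the parameter vector that minimizes the sum of squared residuals:

```python
import numpy as np

# An overdetermined system: 5 equations, 2 unknown parameters (w0, w1).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])                  # design matrix [1, x]
b = np.array([2.1, 2.9, 4.2, 4.8, 6.1])     # observed responses

# Least squares solution: minimizes ||A w - b||^2.
w, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
residuals = b - A @ w                       # observed minus fitted values
print("parameters:", w)
print("sum of squared residuals:", np.sum(residuals ** 2))
```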
MLE:
Maximum likelihood estimation, or MLE, is a method for estimating the parameters of a statistical model and for fitting a statistical model to data.
If you want to find the height measurement of every basketball player in a specific
location, you can use the maximum likelihood estimation.
Normally, you would encounter problems such as cost and time constraints.
If you could not afford to measure all of the basketball players’ heights, the maximum
likelihood estimation would be very handy.
Using the maximum likelihood estimation, you can estimate the mean and variance of the
height of your subjects.
The MLE treats the mean and variance as parameters of the model and finds the specific parameter values that make the observed data most likely under the assumed model.
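A minimal sketch of that idea (assuming NumPy; the height values below are hypothetical): for normally distributed data, the maximum likelihood estimates of the mean and variance have closed forms, namely the sample mean and the average squared deviation (note the divisor n rather than n − 1):

```python
import numpy as np

# Hypothetical sample of player heights in centimetres.
heights = np.array([198.0, 201.5, 185.0, 210.2, 192.3, 205.1, 188.7, 199.4])

n = len(heights)
mu_mle = heights.mean()                          # MLE of the mean
var_mle = np.sum((heights - mu_mle) ** 2) / n    # MLE of the variance (divides by n, not n-1)

# These estimates maximise the Gaussian log-likelihood of the sample.
log_lik = -0.5 * n * np.log(2 * np.pi * var_mle) - np.sum((heights - mu_mle) ** 2) / (2 * var_mle)
print(mu_mle, var_mle, log_lik)
```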
When the response has k categories, a separate logit equation is modeled for j = 1, 2, ..., (k − 1), each relative to a baseline category. The model parameters are estimated by the method of maximum likelihood, and statistical software is available to do this fitting.