DA Unit 3 Trio
Regression – Concepts, BLUE property assumptions, Least Squares Estimation, Variable Rationalization,
and Model Building. Logistic Regression: Model Theory, Model Fit Statistics, Model Construction, and
analytics applications to various business domains.
The simple linear regression model is

y = β₀ + β₁X + ε,

where y is termed the dependent or study variable and X is termed the independent or explanatory
variable. The terms β₀ and β₁ are the parameters of the model: β₀ is termed the intercept
term and β₁ is termed the slope parameter. These parameters are usually called regression
coefficients. The unobservable error component ε accounts for the failure of the data to lie on the straight
line and represents the difference between the true and observed realizations of y. There can be several
reasons for such a difference, e.g., the effect of all variables deleted from the model, variables that may be
qualitative, inherent randomness in the observations, etc. We assume that ε is an independent and
identically distributed random variable with mean zero and constant variance σ². Later, we will
additionally assume that ε is normally distributed.
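As a quick illustration of this model, the following sketch (with hypothetical parameter values chosen only for demonstration) simulates observations from y = β₀ + β₁X + ε with i.i.d. zero-mean normal errors:

```python
import random

# Hypothetical parameter values, chosen for illustration only.
beta0, beta1, sigma = 2.0, 0.5, 1.0  # intercept, slope, error std. deviation

random.seed(42)

# Generate n observations from y = beta0 + beta1*x + epsilon,
# where epsilon ~ N(0, sigma^2), i.i.d. with mean zero.
n = 1000
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]

# Because the errors have mean zero, the observed y values scatter
# symmetrically around the true line.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
avg_err = sum(residuals) / n
print(abs(avg_err) < 0.2)  # the average error is close to zero
```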
Var(y | x) = σ².

When the values of β₀, β₁ and σ² are known, the model is completely described. In practice, the parameters
β₀, β₁ and σ² are generally unknown, and ε is unobserved. The determination of the statistical model
y = β₀ + β₁X + ε therefore depends on the determination (i.e., estimation) of β₀, β₁ and σ². In order to know the
values of these parameters, n pairs of observations (xᵢ, yᵢ), i = 1, ..., n, on (X, y) are observed/collected
and are used to estimate these unknown parameters.
BLUE: Best Linear Unbiased Estimator
B – BEST
L – LINEAR
U – UNBIASED
E – ESTIMATOR
Linearity
An estimator is said to be a linear estimator of β if it is a linear function of the sample observations.
The sample mean

X̄ = (X₁ + X₂ + ... + Xₙ) / n

is a linear estimator because it is a linear function of the X values.
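The linearity of the sample mean can be made concrete: it is the linear combination a₁X₁ + ... + aₙXₙ with every fixed weight aᵢ = 1/n. A minimal sketch (with made-up sample values):

```python
# A linear estimator has the form a1*X1 + a2*X2 + ... + an*Xn,
# where the weights a_i are fixed constants, not functions of the data.
# The sample mean is the special case with every weight a_i = 1/n.
xs = [4.0, 7.0, 1.0, 8.0]  # hypothetical sample observations
n = len(xs)

weights = [1.0 / n] * n
linear_form = sum(w * x for w, x in zip(weights, xs))
sample_mean = sum(xs) / n

# The two computations agree: the mean is a linear function of the X values.
print(abs(linear_form - sample_mean) < 1e-12)
```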
UNBIASEDNESS
A desirable property of a distribution of estimates is that its mean equals the true mean of the
variable being estimated. Formally, an estimator is an unbiased estimator if the expected value of its
sampling distribution equals the true value of the population parameter.
If this is not the case, we say that the estimator is biased.
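Unbiasedness can be checked by simulation: draw many samples, compute the estimator for each, and verify that the average of the estimates sits on the true parameter. A sketch, with an assumed population mean of 5.0:

```python
import random

random.seed(0)
mu = 5.0  # true population mean (assumed for this illustration)

# Draw many samples and compute the sample mean of each. For an unbiased
# estimator the sampling distribution is centred on the true parameter.
estimates = []
for _ in range(5000):
    sample = [random.gauss(mu, 2.0) for _ in range(10)]
    estimates.append(sum(sample) / len(sample))

mean_of_estimates = sum(estimates) / len(estimates)
print(abs(mean_of_estimates - mu) < 0.05)  # centred on mu, as expected
```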
POINT ESTIMATORS
A point estimate is a single value used to estimate a population parameter. For example, the sample
mean x̄ is a point estimate of the population mean μ. Similarly, the sample proportion p is a point
estimate of the population proportion P.
Interval Estimators
An interval estimate is defined by two numbers, between which a population parameter is said to
lie. For example, a < μ < b is an interval estimate of the population mean μ. It indicates that the
population mean is greater than a but less than b.
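A common way to produce such an interval is the approximate 95% confidence interval x̄ ± 1.96·s/√n. A minimal sketch, using a simulated sample (the true mean 50 and standard deviation 8 are assumptions of this example):

```python
import math
import random

random.seed(1)

# Simulated sample; in practice these would be the observed data.
sample = [random.gauss(50.0, 8.0) for _ in range(40)]
n = len(sample)

xbar = sum(sample) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # sample sd

# Approximate 95% interval estimate (a, b) for the population mean,
# using the normal critical value 1.96 since n is moderately large.
half_width = 1.96 * s / math.sqrt(n)
a, b = xbar - half_width, xbar + half_width
print(f"{a:.2f} < mu < {b:.2f}")
```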
The method of least squares estimates the parameters β₀ and β₁ by minimizing the sum of squares of the
differences between the observations and the line in the scatter diagram. This idea can be viewed from
different perspectives. When the vertical difference between the observations and the line in the scatter
diagram is considered and its sum of squares is minimized to obtain the estimates of β₀ and β₁, the method
is known as direct regression.
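In the direct regression method, minimizing the sum of squared vertical differences S(β₀, β₁) = Σᵢ (yᵢ − β₀ − β₁xᵢ)² with respect to β₀ and β₁ gives the standard closed-form least squares estimates:

```latex
\hat{\beta}_1 \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}} \;=\; \frac{s_{xy}}{s_{xx}}, \qquad
\hat{\beta}_0 \;=\; \bar{y}-\hat{\beta}_1\bar{x}.
```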
[Figure: Direct regression – vertical distances between the observed points (xᵢ, yᵢ) and the line Y = β₀ + β₁X.]
Alternatively, the sum of squares of the differences between the observations and the line in the horizontal
direction in the scatter diagram can be minimized to obtain the estimates of β₀ and β₁. This is known as the
reverse (or inverse) regression method.
[Figure: Reverse regression method – horizontal distances between the observed points (xᵢ, yᵢ) and the line Y = β₀ + β₁X.]
Instead of horizontal or vertical errors, if the sum of squares of the perpendicular distances between the
observations and the line in the scatter diagram is minimized to obtain the estimates of β₀ and β₁, the
method is known as major axis regression.

[Figure: Major axis regression method – perpendicular distances between the observed points (xᵢ, yᵢ) and the line.]
Instead of minimizing the distance, the area can also be minimized. The reduced major axis regression
method minimizes the sum of the areas of rectangles defined between the observed data points and the
nearest point on the line in the scatter diagram to obtain the estimates of regression coefficients. This is
shown in the following figure:
[Figure: Reduced major axis method – areas of the rectangles between the observed points (xᵢ, yᵢ) and the line Y = β₀ + β₁X.]
The method of least absolute deviation regression considers the sum of the absolute deviations of the
observations from the line in the vertical direction in the scatter diagram, as in the case of direct
regression, to obtain the estimates of β₀ and β₁.
No assumption is required about the form of the probability distribution of εᵢ in deriving the least squares
estimates. For the purpose of deriving statistical inferences only, we assume that the εᵢ's are random
variables with E(εᵢ) = 0, Var(εᵢ) = σ² and Cov(εᵢ, εⱼ) = 0 for all i ≠ j (i, j = 1, 2, ..., n). This assumption is
needed to find the mean, variance and other properties of the least squares estimates. The assumption that
the εᵢ's are normally distributed is utilized while constructing the tests of hypotheses and confidence intervals
of the parameters.
Data Analytics UNIT - III
Based on these approaches, different estimates of β₀ and β₁ are obtained which have different
statistical properties. Among them, the direct regression approach is the most popular. Generally, the
direct regression estimates are referred to as the least squares estimates or ordinary least squares
(OLS) estimates.
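The direct (ordinary) least squares estimates can be computed in a few lines. The data below are made up purely for illustration:

```python
# Direct (ordinary) least squares for y = beta0 + beta1*x + error,
# implementing the textbook formulas beta1 = s_xy / s_xx and
# beta0 = ybar - beta1 * xbar. Data values are hypothetical.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 3.6, 4.4, 5.2]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)

beta1_hat = sxy / sxx                 # estimated slope
beta0_hat = ybar - beta1_hat * xbar   # estimated intercept

print(round(beta0_hat, 3), round(beta1_hat, 3))
```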
LOGISTIC REGRESSION
Logistic regression, also called the logit model, is a regression model where the
dependent variable (DV) is categorical. Logistic regression was developed by statistician David Cox in
1958.
> Ordinary Least Square
> Maximum Likelihood Estimation
The ordinary least squares, or OLS, can also be called the linear least squares
approximation. Through a simple formula, you can express the resulting estimation of the linear
regression model.
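For logistic regression itself, however, the coefficients are normally fitted by maximum likelihood rather than OLS. The following is a minimal sketch of that idea using plain gradient ascent on the log-likelihood; the tiny dataset is hypothetical (x = hours studied, y = 1 if the student passed, 0 otherwise):

```python
import math

# Hypothetical training data for illustration only.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

b0, b1 = 0.0, 0.0   # coefficients, initialised at zero
lr = 0.1            # learning rate
for _ in range(5000):
    # Gradient of the log-likelihood with respect to b0 and b1.
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in data)
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in data)
    b0 += lr * g0
    b1 += lr * g1

# The fitted probability of y = 1 increases with x, crossing 0.5
# between the failing and passing groups.
p_low, p_high = sigmoid(b0 + b1 * 1.0), sigmoid(b0 + b1 * 6.0)
print(p_low < 0.5 < p_high)
```

In practice a library implementation (e.g. a dedicated statistics package) with a proper convergence criterion would be used instead of this fixed-iteration loop.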
Report
A statistical model embodies a set of assumptions concerning the generation of the observed data,
and similar data from a larger population. A model represents, often in considerably idealized form, the
data-generating process. Signal processing is an enabling technology that encompasses the fundamental
theory, applications, algorithms, and implementations of processing or transferring information contained
in many different physical, symbolic, or abstract formats broadly designated as signals. It uses
mathematical, statistical, computational, heuristic, and linguistic representations, formalisms, and
techniques for representation, modelling, analysis, synthesis, discovery, recovery, sensing, acquisition,
extraction, learning, security, or forensics. In manufacturing, statistical models are used to define
warranty policies, solve various conveyor-related issues, implement statistical process control, etc.
Databases & Type of data and variables:
A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a
"centralized repository of information about data such as meaning, relationships to other data, origin,
usage, and format".
Data can be categorized on various parameters, such as category and type.
Data is of two types: numeric and character. Numeric data can be further divided into the
subgroups discrete and continuous.
Again, data can be divided into two categories: nominal and ordinal.
Also, based on usage, data is divided into two categories: quantitative and qualitative.
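These categories can be summarised with a small example. The variable names and classifications below are hypothetical, chosen only to illustrate the taxonomy:

```python
# Mapping example variables onto the data-type taxonomy above.
# All names and values are illustrative, not from a real dataset.
variables = {
    "num_children":    {"kind": "numeric",   "subtype": "discrete"},    # counts
    "height_cm":       {"kind": "numeric",   "subtype": "continuous"},  # measurements
    "blood_group":     {"kind": "character", "subtype": "nominal"},     # no natural order
    "education_level": {"kind": "character", "subtype": "ordinal"},     # ordered levels
}

for name, meta in variables.items():
    print(f"{name}: {meta['kind']} / {meta['subtype']}")
```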