DA Unit 3 Trio

The document covers regression concepts, including simple and multiple linear regression models, estimation methods like least squares and maximum likelihood, and the BLUE property assumptions. It also discusses logistic regression, its applications in various business domains, and the importance of predictive analytics in decision-making. Additionally, it highlights data modeling techniques, types of data, and the role of statistical models in business applications.


UNIT – III

Regression – Concepts, Blue property assumptions, Least Square Estimation, Variable Rationalization,
and Model Building etc. Logistic Regression: Model Theory, Model fit Statistics, Model Construction,
Analytics applications to various Business Domains etc.

THE SIMPLE LINEAR REGRESSION MODEL


We consider modeling the relationship between a dependent variable and one independent variable. When there is only one independent variable in the linear regression model, the model is termed a simple linear regression model. When there is more than one independent variable, the linear model is termed a multiple linear regression model.

The linear model
Consider a simple linear regression model

y = β0 + β1 X + ε

where y is termed the dependent or study variable and X is termed the independent or explanatory variable. The terms β0 and β1 are the parameters of the model: β0 is the intercept term and β1 is the slope parameter. These parameters are usually called regression coefficients. The unobservable error component ε accounts for the failure of the data to lie exactly on a straight line and represents the difference between the true and observed realizations of y. There can be several reasons for such a difference, e.g., the effect of variables omitted from the model, qualitative variables, inherent randomness in the observations, etc. We assume that the errors ε are independent and identically distributed random variables with mean zero and constant variance σ². Later, we will additionally assume that ε is normally distributed.

The independent variable is viewed as controlled by the experimenter, so it is considered non-stochastic, whereas y is viewed as a random variable with

E(y) = β0 + β1 X
and
Var(y) = σ², or conditionally, Var(y | x) = σ².

When the values of β0, β1 and σ² are known, the model is completely described. In practice, the parameters β0, β1 and σ² are generally unknown and ε is unobserved. The determination of the statistical model y = β0 + β1 X + ε therefore depends on the determination (i.e., estimation) of β0, β1 and σ². In order to estimate these parameters, n pairs of observations (xi, yi), i = 1, ..., n, on (X, y) are observed/collected and used to determine the unknown parameters.

Various methods of estimation can be used to determine the estimates of the parameters. Among them, the methods of least squares and maximum likelihood are the most popular methods of estimation.

BLUE PROPERTY ASSUMPTIONS


BLUE stands for:

B - Best
L - Linear
U - Unbiased
E - Estimator

An estimator is BLUE if the following hold:

It is linear (a linear function of the sample observations)
It is unbiased
It is efficient (it has the least variance among all linear unbiased estimators)
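The efficiency criterion can be illustrated with a small simulation. The sketch below (plain Python, hypothetical normally distributed data) compares two linear unbiased estimators of a population mean: the sample mean and the first observation alone. Both are unbiased, but the sample mean has the smaller sampling variance, so it is the more efficient of the two.

```python
import random
import statistics

random.seed(42)

# Population: normal with true mean 10. We repeatedly draw samples and
# record two linear unbiased estimators of the mean for each sample.
TRUE_MEAN, SIGMA, N, TRIALS = 10.0, 2.0, 25, 2000

sample_means = []
first_obs = []
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    sample_means.append(statistics.mean(sample))  # linear in the X's, unbiased
    first_obs.append(sample[0])                   # also linear and unbiased

var_mean = statistics.variance(sample_means)   # about sigma^2 / n
var_first = statistics.variance(first_obs)     # about sigma^2
print(var_mean < var_first)
```

Both estimators average out to roughly the true mean, but the sample mean's variance is close to σ²/n while the single-observation estimator's is close to σ², which is why the sample mean is the "best" of the two linear unbiased estimators.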

Linearity

An estimator is said to be a linear estimator of β if it is a linear function of the sample observations. The sample mean

x̄ = (X1 + X2 + ... + Xn) / n

is a linear estimator because it is a linear function of the X values.
UNBIASEDNESS

A desirable property of a distribution of estimates is that its mean equals the true value of the parameter being estimated. Formally, an estimator is an unbiased estimator if the expected value of its sampling distribution equals the true value of the population parameter. We also write this as:

E(β̂) = β

If this is not the case, we say that the estimator is biased.

TWO TYPES OF ESTIMATORS

Point Estimators

A point estimate of a population parameter is a single value of a statistic. For example, the sample mean x̄ is a point estimate of the population mean μ. Similarly, the sample proportion p is a point estimate of the population proportion P.

Interval Estimators

An interval estimate is defined by two numbers between which a population parameter is said to lie. For example, a < x̄ < b is an interval estimate of the population mean μ. It indicates that the population mean is greater than a but less than b.
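As a minimal sketch (plain Python, made-up sample data), a point estimate and a 95% interval estimate of a population mean can be computed with the usual normal approximation x̄ ± 1.96 · s/√n:

```python
import math
import statistics

# A small hypothetical sample; x_bar is a point estimate of the population
# mean, and (a, b) is a 95% interval estimate via the normal approximation.
data = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
n = len(data)
x_bar = statistics.mean(data)          # point estimate of mu
s = statistics.stdev(data)             # sample standard deviation
margin = 1.96 * s / math.sqrt(n)
a, b = x_bar - margin, x_bar + margin  # interval estimate: a < mu < b
print(round(x_bar, 4), round(a, 4), round(b, 4))
```

For such a small n, a t-based multiplier would be more appropriate than 1.96; the z value is used here only to keep the sketch simple.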

LEAST SQUARES ESTIMATION


Suppose a sample of n pairs of observations (xi, yi), i = 1, 2, ..., n, is available. These observations are assumed to satisfy the simple linear regression model, so we can write

yi = β0 + β1 xi + εi, i = 1, 2, ..., n.

The method of least squares estimates the parameters β0 and β1 by minimizing the sum of squared differences between the observations and the line in the scatter diagram. This idea can be viewed from different perspectives. When the vertical differences between the observations and the line in the scatter diagram are considered and their sum of squares is minimized to obtain the estimates of β0 and β1, the method is known as direct regression.

[Figure: Direct regression — vertical deviations of the observed points (xi, yi) from the line Y = β0 + β1 X are minimized.]
Alternatively, the sum of squares of the differences between the observations and the line in the horizontal direction in the scatter diagram can be minimized to obtain the estimates of β0 and β1. This is known as the reverse (or inverse) regression method.

[Figure: Reverse regression — horizontal deviations of the observed points (xi, yi) from the line Y = β0 + β1 X are minimized.]

Instead of horizontal or vertical errors, if the sum of squares of the perpendicular distances between the observations and the line in the scatter diagram is minimized to obtain the estimates of β0 and β1, the method is known as orthogonal regression or the major axis regression method.

[Figure: Major axis regression — perpendicular distances of the observed points from the line are minimized.]
Instead of minimizing the distance, the area can also be minimized. The reduced major axis regression
method minimizes the sum of the areas of rectangles defined between the observed data points and the
nearest point on the line in the scatter diagram to obtain the estimates of regression coefficients. This is
shown in the following figure:

[Figure: Reduced major axis regression — the areas of the rectangles between the observed points (xi, yi) and the line Y = β0 + β1 X are minimized.]

The method of least absolute deviations regression considers the sum of the absolute vertical deviations of the observations from the line in the scatter diagram, as in direct regression, to obtain the estimates of β0 and β1.

No assumption about the form of the probability distribution of the εi is required in deriving the least squares estimates. For the purpose of deriving statistical inferences only, we assume that the εi are random variables with E(εi) = 0, Var(εi) = σ², and Cov(εi, εj) = 0 for all i ≠ j (i, j = 1, 2, ..., n). These assumptions are needed to find the mean, variance and other properties of the least squares estimates. The assumption that the εi are normally distributed is used when constructing tests of hypotheses and confidence intervals for the parameters.

Data Analytics UNIT - III

Based on these approaches, different estimates of β0 and β1 are obtained which have different statistical properties. Among them, the direct regression approach is the most popular. Generally, the direct regression estimates are referred to as the least squares estimates or ordinary least squares estimates.
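The direct (ordinary) least squares estimates have a closed form: b1 = Sxy / Sxx and b0 = ȳ − b1 · x̄, where Sxy = Σ(xi − x̄)(yi − ȳ) and Sxx = Σ(xi − x̄)². The sketch below (plain Python, illustrative made-up data) implements these formulas:

```python
import statistics

def least_squares_fit(xs, ys):
    """Direct regression: minimize the sum of squared vertical deviations.
    Returns (b0, b1) for the fitted line y = b0 + b1 * x."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1

# Hypothetical data scattered around the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
b0, b1 = least_squares_fit(xs, ys)
print(round(b0, 3), round(b1, 3))
```

The fitted slope and intercept land close to the generating values 2 and 1, as expected for data with small errors.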


LOGISTIC REGRESSION
Logistic regression, or the logit model, is a regression model where the dependent variable (DV) is categorical. Logistic regression was developed by statistician David Cox in 1958.

Two estimation methods are commonly contrasted:
> Ordinary Least Squares (OLS)
> Maximum Likelihood Estimation (MLE)

OLS and MLE:

OLS -> Ordinary Least Squares
MLE -> Maximum Likelihood Estimation

Ordinary least squares, or OLS, can also be called linear least squares. It is a method for approximately determining the unknown parameters of a linear regression model. According to books on statistics and other online sources, ordinary least squares obtains its estimates by minimizing the total of squared vertical distances between the observed responses within the dataset and the responses predicted by the linear approximation. Through a simple formula, you can express the resulting estimate of the linear regression model.
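Logistic regression itself is normally fitted by maximum likelihood rather than OLS. The sketch below (plain Python, hypothetical "hours studied vs pass" data) maximizes the log-likelihood of p(y = 1 | x) = 1 / (1 + e^-(b0 + b1·x)) with plain gradient ascent; statistical packages use faster iterative schemes such as Newton-Raphson, so this is only an illustration of the idea:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=10000):
    """Fit p(y=1|x) = sigmoid(b0 + b1*x) by gradient ascent on the
    average log-likelihood (a simple stand-in for Newton-Raphson)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the average log-likelihood w.r.t. b0 and b1.
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)) / n
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys)) / n
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Hypothetical data: hours studied and a pass/fail (1/0) outcome.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(hours, passed)
preds = [1 if sigmoid(b0 + b1 * x) >= 0.5 else 0 for x in hours]
print(b1 > 0, preds)
```

The fitted slope is positive (more hours raise the pass probability), and the 0.5-probability decision boundary lands between the mostly-fail and mostly-pass regions of the data.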

ANALYTICS APPLICATIONS TO VARIOUS BUSINESS DOMAINS

Predictive Analytics is the art of predicting the future on the basis of past trends.

It is a branch of Statistics which comprises Modelling Techniques, Machine Learning, and Data Mining.

Predictive Analytics is primarily used in Decision Making.

What and Why analytics:

Analytics is a journey that involves a combination of skills, advanced technologies, applications, and processes used by firms to gain business insights from data and statistics. This is done to support business planning.
Reporting Vs Analytics:
Reporting is the presentation of the results of data analysis, whereas Analytics is the process or system involved in analyzing data to obtain a desired output.
Introduction to tools and Environment:
Analytics is nowadays used in fields ranging from Medical Science to Aerospace to Government activities. Data Science and Analytics are also used by manufacturing companies to develop their business and solve various issues with the help of historical databases. Tools are the software that can be used for Analytics, such as SAS or R, while techniques are the procedures to be followed to reach a solution.
Various steps involved in Analytics:
Access
Manage
Analyze
Report

Various Analytics techniques are:


Data Preparation
Reporting, Dashboards & Visualization
Segmentation
Forecasting
Descriptive Modelling
Predictive Modelling

APPLICATION OF MODELLING IN BUSINESS:

A statistical model embodies a set of assumptions concerning the generation of the observed data,
and similar data from a larger population. A model represents, often in considerably idealized form, the
data-generating process. Signal processing is an enabling technology that encompasses the fundamental
theory, applications, algorithms, and implementations of processing or transferring information contained
in many different physical, symbolic, or abstract formats broadly designated as signals. It uses
mathematical, statistical, computational, heuristic, and linguistic representations, formalisms, and
techniques for representation, modelling, analysis, synthesis, discovery, recovery, sensing, acquisition,
extraction, learning, security, or forensics. In manufacturing, statistical models are used to define warranty policies, solve various conveyor-related issues, implement Statistical Process Control, etc.
Databases & Type of data and variables:
A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a
"centralized repository of information about data such as meaning, relationships to other data, origin,
usage, and format”.
Data can be categorized on various parameters like Categorical, Type etc.
Data is of 2 types – Numeric and Character. Numeric data can be further divided into subgroups – Discrete and Continuous.
Data can also be divided into 2 categories – Nominal and Ordinal.
Also, based on usage, data is divided into 2 categories – Quantitative and Qualitative.


Data Modelling Techniques Overview:


Regression analysis mainly focuses on finding a relationship between a dependent variable and one or more independent variables. It is used to predict the value of a dependent variable based on the value of at least one independent variable, and it explains the impact of changes in an independent variable on the dependent variable:

Y = f(X, β)

where Y is the dependent variable, X denotes the independent variables, and β denotes the unknown coefficients.
Missing Imputations:
In R, missing values are represented by the symbol NA (not available). Impossible values (e.g.,
dividing by zero) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same
symbol for character and numeric data.
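The same idea carries over outside R. As a sketch in plain Python (made-up values, with float('nan') playing a role similar to R's NA/NaN), a simple mean imputation of missing values looks like:

```python
import math
import statistics

# Some observations are missing, represented here by float('nan');
# math.isnan() detects them, much like is.na()/is.nan() in R.
values = [4.0, float('nan'), 7.0, 5.0, float('nan'), 8.0]

observed = [v for v in values if not math.isnan(v)]
fill = statistics.mean(observed)          # mean of the observed values: 6.0
imputed = [fill if math.isnan(v) else v for v in values]
print(imputed)
```

Mean imputation is the simplest scheme; it preserves the sample mean but shrinks the variance, so model-based imputation is often preferred in practice.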
