
UNIT - III

Regression

A regression problem is one where the output variable is a real or continuous value, such as
“salary” or “weight”. Many different models can be used; the simplest is linear regression,
which tries to fit the data with the best hyperplane passing through the points.

Regression analysis is a statistical process for estimating the relationships between a
dependent (criterion) variable and one or more independent variables (predictors). It explains
how the criterion changes in relation to changes in selected predictors by estimating the
conditional expectation of the criterion given the predictors, i.e., the average value of the
dependent variable when the independent variables are varied. Three major uses for regression
analysis are determining the strength of predictors, forecasting an effect, and trend forecasting.

Types of Regression –

 Linear regression
 Logistic regression
 Polynomial regression
 Stepwise regression
 Ridge regression
 Lasso regression
 ElasticNet regression

Linear regression is used for predictive analysis. Linear regression is a linear approach for
modeling the relationship between the criterion or the scalar response and the multiple
predictors or explanatory variables. Linear regression focuses on the conditional probability
distribution of the response given the values of the predictors. For linear regression, there is a
danger of overfitting. The formula for simple linear regression is: Y’ = bX + A

where Y’ is the estimated dependent variable score, A is the constant (intercept), b is the
regression coefficient (slope), and X is the score on the independent variable.
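As a minimal sketch (assuming scikit-learn and a small made-up dataset), the same Y’ = bX + A fit can be obtained in code:

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: X is the independent variable, y the dependent variable.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

model = LinearRegression().fit(X, y)
b, A = model.coef_[0], model.intercept_      # regression coefficient b and constant A
print(f"Y' = {b:.3f}X + {A:.3f}")
print(model.predict([[6]]))                  # estimated Y' for a new X value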

Logistic regression is used when the dependent variable is dichotomous. Logistic regression
estimates the parameters of a logistic model and is a form of binomial regression. It deals with
data in which the criterion has two possible values and models the relationship between that
criterion and the predictors. The linear predictor for logistic regression is:

z = b0 + b1X1 + b2X2 + ... + bkXk

where b0 is the constant (intercept) and X1, ..., Xk are the k independent variables with
coefficients b1, ..., bk; the predicted probability is obtained by applying the logistic (sigmoid)
function to z. In ordinal logistic regression, the threshold coefficient differs for each level of
the dependent variable, and the coefficients give the cumulative probability of each level.
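As a hedged illustration (the coefficient values below are made up), the linear predictor z is passed through the sigmoid function to produce a probability:

import numpy as np

def predict_proba(x, b0, b):
    z = b0 + np.dot(b, x)              # z = b0 + b1*X1 + ... + bk*Xk
    return 1.0 / (1.0 + np.exp(-z))    # logistic (sigmoid) transform gives P(Y = 1 | X)

# Two predictors with illustrative coefficients; classify as 1 if p >= 0.5.
p = predict_proba(x=np.array([2.0, 0.5]), b0=-1.0, b=np.array([0.8, 1.2]))
print(p)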

Polynomial regression is used for curvilinear data and is fit with the method of least squares.
The goal of regression analysis is to model the expected value of a dependent variable y in
terms of the independent variable x. The equation for polynomial regression of degree n is:

y = β0 + β1x + β2x² + ... + βnxⁿ + ε

where ε is an unobserved random error with mean zero conditioned on the scalar variable x. In
the simple linear case (degree one), for each unit increase in the value of x, the conditional
expectation of y increases by β1 units.
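A minimal sketch of a least-squares polynomial fit, assuming NumPy and roughly quadratic illustrative data:

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.1, 1.9, 4.2, 9.1, 16.3, 25.2])   # curvilinear (roughly quadratic) data

coeffs = np.polyfit(x, y, deg=2)   # least-squares fit; returns [β2, β1, β0]
y_hat = np.polyval(coeffs, x)      # fitted values on the polynomial curve
print(coeffs)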

Stepwise regression is an automatic procedure for fitting regression models in which the choice
of predictive variables is carried out step by step. At each step, a variable is added to or removed
from the set of explanatory variables. The approaches for stepwise regression are forward
selection, backward elimination, and bidirectional elimination. The standardized coefficient
compared at each step of stepwise regression is

bj(std) = bj · (Sxj / Sy)

where Sy and Sxj are the standard deviations of the dependent variable and the
corresponding jth independent variable.
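One way to sketch forward stepwise selection (the data, the R² criterion, and the 0.01 stopping threshold are all illustrative assumptions, not a fixed recipe):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                             # five candidate predictors
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(size=100)  # only two really matter

selected, remaining, best_score = [], list(range(X.shape[1])), -np.inf
while remaining:
    # Score each remaining predictor when added to the current set.
    scores = {j: LinearRegression().fit(X[:, selected + [j]], y)
                                   .score(X[:, selected + [j]], y)
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] - best_score < 0.01:                # stop when the gain is negligible
        break
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)
print("selected predictors:", selected)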

Ridge regression is a technique for analyzing multiple regression data. When multicollinearity
occurs, least squares estimates are unbiased, but their variances are large, so they may be far
from the true value. By adding a degree of bias to the regression estimates, ridge regression
reduces the standard errors. Ridge regression minimizes

Σ(Y − Xβ)² + λ · Σβ²

i.e. the residual sum of squares plus a penalty on the squared magnitude of the coefficients, where:

β = coefficient

X = independent variable = feature = attribute = predictor

λ = the regularization penalty

Y = response variable
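A minimal ridge regression sketch with scikit-learn; the near-collinear data is fabricated for illustration, and alpha plays the role of the penalty λ:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=50)       # deliberately near-collinear columns
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=50)

ridge = Ridge(alpha=1.0).fit(X, y)                   # larger alpha -> more shrinkage
print(ridge.coef_, ridge.intercept_)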

Lasso regression is a regression analysis method that performs both variable selection and
regularization. Lasso regression uses soft thresholding and selects only a subset of the provided
covariates for use in the final model. The lasso objective is:

Objective = RSS + α * (sum of absolute values of the coefficients)

Here, α (alpha) works similarly to ridge and provides a trade-off between balancing RSS and
the magnitude of the coefficients. Like ridge, α can take various values. Briefly:

1. α = 0: same coefficients as simple linear regression
2. α = ∞: all coefficients are zero (same logic as before)
3. 0 < α < ∞: coefficients between 0 and those of simple linear regression
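A short lasso sketch (illustrative data, assuming scikit-learn) showing how larger α values drive more coefficients exactly to zero:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=100)   # only two informative predictors

for alpha in (0.01, 0.1, 1.0):
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs, 2))   # more coefficients hit exactly 0 as alpha grows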

ElasticNet regression is a regularized regression method that linearly combines the penalties
of the lasso and ridge methods. ElasticNet regression is used for support vector machines,
metric learning, and portfolio optimization. Its penalty function is:

λ1 * (sum of absolute values of the coefficients) + λ2 * (sum of squared coefficients)

Use of the lasso penalty alone has several limitations. For example, in the "large p, small
n" case (high-dimensional data with few examples), the lasso selects at most n variables
before it saturates; the elastic net overcomes this limitation.
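A brief elastic net sketch in the "large p, small n" setting (data and parameter values are illustrative; in scikit-learn, l1_ratio mixes the lasso and ridge penalties):

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))                    # 100 predictors, only 40 observations
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=40)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)   # l1_ratio=1 is pure lasso, 0 is pure ridge
print(np.count_nonzero(enet.coef_), "non-zero coefficients out of", X.shape[1])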
BLUE property assumptions
• B - Best
• L - Linear
• U - Unbiased
• E - Estimator
An estimator is BLUE if the following hold:
1. It is linear (regression model)
2. It is unbiased
3. It is an efficient estimator (the unbiased estimator with the least variance)

LINEARITY
• An estimator is said to be a linear estimator of (β) if it is a linear function of the
sample observations

• Sample mean is a linear estimator because it is a linear function of the X values.

UNBIASEDNESS
• A desirable property of a distribution of estimates is that its mean equals the true
mean of the variable being estimated.
• Formally, an estimator is an unbiased estimator if its sampling distribution has its
expected value equal to the true value of the population parameter.
• We also write this as follows:

E(β̂) = β

Similarly, if this is not the case, we say that the estimator is biased:

Bias = E(β̂) − β
MINIMUM VARIANCE
• Just as we want the mean of the sampling distribution to be centered around the true
population value, it is desirable for the sampling distribution to be as narrow (or precise)
as possible.
– Centering around “the truth” but with high variability might be of very little use.
• One way of narrowing the sampling distribution is to increase the sample size.
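These properties can be illustrated with a small simulation (the population mean and standard deviation below are arbitrary assumptions): the sample mean is a linear estimator, its average over many samples sits on the true mean (unbiasedness), and its sampling variance shrinks as the sample size grows.

import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0

for n in (10, 100, 1000):
    # 2000 repeated samples of size n; each sample mean is one estimate.
    estimates = [rng.normal(true_mean, 5.0, size=n).mean() for _ in range(2000)]
    bias = np.mean(estimates) - true_mean                   # Bias = E(estimator) - truth, ~0
    print(n, round(bias, 3), round(np.var(estimates), 3))   # variance narrows as n grows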
What is the Least Squares Regression Method?
The least-squares regression method is a technique commonly used in Regression Analysis. It
is a mathematical method used to find the best fit line that represents the relationship
between an independent and dependent variable.

To understand the least-squares regression method, let's get familiar with the concepts
involved in formulating the line of best fit.

What is the Line of Best Fit?


Line of best fit is drawn to represent the relationship between 2 or more variables. To be
more specific, the best fit line is drawn across a scatter plot of data points in order to
represent a relationship between those data points.
Regression analysis makes use of mathematical methods such as least squares to obtain a
definite relationship between the predictor variable(s) and the target variable. The least-
squares method is one of the most effective ways used to draw the line of best fit. It is based
on the idea that the square of the errors obtained must be minimized to the most possible
extent and hence the name least squares method.

If we were to plot the best fit line depicting a company's sales over a period of time, the line
would run as close as possible to all of the scattered data points. This is what an ideal best fit
line looks like.

Let’s see how to calculate the line using the Least Squares Regression.

Steps to calculate the Line of Best Fit


To start constructing the line that best depicts the relationship between variables in the data,
the equation used is:

y = mx + c

It is a simple equation that represents a straight line in two-dimensional data, i.e. along the
x-axis and y-axis. To better understand this, let’s break down the equation:

 y: dependent variable
 m: the slope of the line
 x: independent variable
 c: y-intercept

So the aim is to calculate the values of slope, y-intercept and substitute the corresponding ‘x’
values in the equation in order to derive the value of the dependent variable.

Let’s see how this can be done.

As an assumption, let’s consider that there are ‘n’ data points.

Step 1: Calculate the slope ‘m’ by using the following formula:

m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)

Step 2: Compute the y-intercept (the value of y at the point where the line crosses the y-axis):

c = (Σy − m Σx) / n

Step 3: Substitute the values in the final equation:

y = mx + c

Simple, isn’t it?

Now let’s look at an example and see how you can use the least-squares regression method to
compute the line of best fit.

Least Squares Regression Example


Consider an example. Tom, the owner of a retail shop, recorded the price of different T-shirts
and the number of T-shirts sold at his shop over a period of one week.

He tabulated this as shown below:

Price of a T-shirt in $ (x): 2, 3, 5, 7, 9
Number of T-shirts sold (y): 4, 5, 7, 10, 15
Let us use the concept of least squares regression to find the line of best fit for the above data.
Step 1: Calculate the slope ‘m’ using the formula given earlier:

m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)

After you substitute the respective values, m = 1.518 approximately.

Step 2: Compute the y-intercept value:

c = (Σy − m Σx) / n

After you substitute the respective values, c = 0.305 approximately.

Step 3: Substitute the values in the final equation:

y = 1.518x + 0.305

Substituting the x values into this equation gives the predicted number of T-shirts sold at each
price, and plotting y = 1.518x + 0.305 over the scatter of data points gives the line of best fit.

Now Tom can use the above equation to estimate how many T-shirts priced at $8 he can sell at
the retail shop.

y = 1.518 × 8 + 0.305 = 12.45 T-shirts

This rounds to about 13 T-shirts! That's how simple it is to make predictions using linear
regression.
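A short NumPy sketch that reproduces the slope and intercept above from the tabulated prices and sales (values consistent with m = 1.518 and c = 0.305):

import numpy as np

x = np.array([2, 3, 5, 7, 9], dtype=float)    # price of a T-shirt ($)
y = np.array([4, 5, 7, 10, 15], dtype=float)  # number of T-shirts sold
n = len(x)

m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
c = (np.sum(y) - m * np.sum(x)) / n
print(round(m, 3), round(c, 3))   # ~1.518 and ~0.305
print(round(m * 8 + c, 2))        # ~12.45, i.e. about 13 T-shirts at a price of $8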

Now let’s try to understand on what basis we can confirm that the above line is the line of best fit.

The least squares regression method works by making the sum of the squared errors as small
as possible, hence the name least squares. Basically, the distance between the line of best fit
and each data point (the error) must be minimized as much as possible. This is the basic idea
behind the least squares regression method.
A few things to keep in mind before implementing the least squares regression method are:

 The data must be free of outliers because they might lead to a biased and wrongful
line of best fit.
 The line of best fit can be drawn iteratively until you get a line with the minimum
possible squares of errors.
 This method works well even with non-linear data.
 Technically, the difference between the actual value of ‘y’ and the predicted value of
‘y’ is called the Residual (denotes the error).

Variable Rationalization

Data Rationalization is an enabler of effective Data Governance. How can you govern
information assets if you don’t know where they are or what they mean? Similarly, Data
Rationalization can aid in the development of Master Data Management solutions. By
identifying common data entities, and how these relate to other pieces of data (again, across
many systems), MDM solutions will be able to better accommodate the needs of all the
systems which require the master/reference data.

How does it work?

In order to be able to rationalize your data, meta relationships between model objects (across
model levels) must be established. Of course, we are not talking about supplanting the normal
types of relationships between model objects in the same model. Meta relationships can be
established in multiple ways:

1. Use automated modeling tool functionality (e.g. ERStudio Where Used, PowerDesigner Link and Sync)
2. Use manual modeling tool functionality (e.g. ERStudio User Defined Mapping)
3. Use modeling tool meta data fields (e.g. ERwin User Defined Properties (UDP))
4. Use a Meta Data Repository tool (e.g. Rochade, Adaptive, Advantage Repository, etc.)
to manually establish links using a GUI or other interface.
5. Use a spreadsheet (Excel sheet)
Model Building

Model Building Lifecycle

Model building is the focus of our algorithms of interest. Each technique has its own
strengths and weaknesses.

Model validation is very important to develop a sense of trust in the models prior to their
usage. Classically, this is achieved by using a data subset which has not been used in the
model development as an independent assessment of the model quality. As a further quality
control measure, models which perform well against the training and test sets can also be
evaluated against a third (validation) set. If there is a paucity of data (a fat array), then other
cross-validation strategies can be applied. Alternately, in this case, the symbolic regression
modeling can be done using all of the available data, and the Pareto front trading off expression
accuracy against complexity may be used to select models at an appropriate point in that
trade-off — which implicitly minimizes the risk of selecting
an over-trained model. Symbolic regression models have an additional advantage in terms of
transparency — i.e., a user can look at the expression structure and constituent variables (and
variable combinations) and agree that the expression is reasonable and intuitively correct.
Model behavior at limit conditions should also be examined since, for example, a model
which recommends that the optimal planting depth for corn is four inches above the surface
of the ground (true story) is not likely to get much in terms of user acceptance. Thus, our
definition of a good model typically includes robustness as well as absolute accuracy.

Additionally, we may want to identify an ensemble of diverse models (of similar modeling
capacity) whose agreement can be used as a trust metric. Assembling this ensemble is easiest
for stacked analytic nets and relatively easy for symbolic regression models. The use of
ensembles and the associated trust metric has several very significant advantages:
 divergence in the predictions of the individual models within the ensemble provides
an immediate warning that the model is either operating in uncharted territory or
the underlying system has undergone fundamental changes. In either event, the model
predictions should not be trusted;
 all of the data may be used in the model building rather than requiring a partitioning
into training, testing, validation, etc. subsets. The resulting models will be less
myopic. Additionally, this may be a very significant advantage if data is sparse.

Empirical modeling is simply an academic exercise unless its value is extracted. Model
deployment, therefore, is a key aspect of the modeling lifecycle. Associated with this are
system building aspects of user interfaces and data entry. Trust metrics are also important to
identify, if possible, when the model is being used inappropriately — e.g., in parameter space
outside of the domain used to train the model.

Logistic Regression
Model Theory

Logistic regression makes use of a binary classifier. It utilizes the Logistic function or
Sigmoid function to predict a probability that the answer to some question is 1 or 0, yes or no,
true or false, good or bad, etc. It's this function that will drive the algorithm, and it is also
interesting in that it can be used as an "activation function" for Neural Networks. Logistic
Regression will be a good algorithm to dig into for understanding Machine Learning.

Logistic Regression is an algorithm that is relatively simple and powerful for deciding
between two classes, i.e. it's a binary classifier. It basically gives a function that is a boundary
between two different classes. It can be extended to handle more than two classes by a
method referred to as "one-vs-all" (multinomial logistic regression or softmax regression)
which is really a collection of binary classifiers that just picks out the most likely class by
looking at each class individually versus everything else and then picks the class that has the
highest probability.

Examples of problems that could be addressed with Logistic Regression are,

 Spam filtering -- spam or not spam
 Cell image -- cancer or normal
 Production line part scan -- good or defective
 Epidemiological study for illness, "symptoms" -- has it or doesn't
 is-a-(fill in the blank) or not
 COVID-19: +ve or -ve

It's a simple yes-or-no type of classifier. Logistic regression can make use of large numbers
of features including continuous and discrete variables and non-linear features. It can be used
for many kinds of problems.

Building a Logistic Regression Model for Numerical Data


Now that we understand the basics of logistic regression for numerical data, let’s walk
through the process of building a logistic regression model step-by-step.
Step 1: Prepare the Data

The first step is to prepare the data. This involves cleaning the data, handling missing values,
and converting numerical variables into categorical variables. We’ll also split the data into
training and testing sets to evaluate the model’s performance.

Step 2: Select the Independent Variables

The next step is to select the independent variables to include in the model. We’ll want to
include all relevant independent variables, including any categorical variables that were
already present in the dataset.

Step 3: Fit the Model

Now it’s time to fit the model. We’ll use a logistic regression algorithm to estimate the
coefficients for each independent variable. These coefficients represent the strength and
direction of the relationship between each independent variable and the dependent variable.

Step 4: Evaluate the Model

Once the model is fit, we need to evaluate its performance. We’ll use the testing dataset to
evaluate the model’s accuracy, precision, recall, and F1 score. These metrics will help us
understand how well the model is performing and identify any areas where it needs
improvement.

Step 5: Fine-Tune the Model

Based on the evaluation results, we may need to fine-tune the model. This could involve adjusting
the parameters of the logistic regression algorithm, selecting different independent variables, or
changing the method used to convert numerical variables into categorical variables. We’ll repeat
Steps 3 and 4 until we have a model that meets our performance goals.
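A minimal end-to-end sketch of steps 1 to 4 with scikit-learn; the synthetic dataset, the 75/25 split, and the use of a scaling step are illustrative assumptions:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                                         # numerical features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)  # binary outcome

# Step 1: prepare the data and split it into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 2-3: select the predictors and fit the logistic regression model.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# Step 4: evaluate accuracy, precision, recall and F1 on the held-out test set.
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred),
      recall_score(y_test, y_pred), f1_score(y_test, y_pred))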

Model fit Statistics

When selecting the model for the logistic regression analysis, another important consideration
is the model fit. In logistic regression the fitting is carried out by working with the logits. The
Logit transformation produces a model that is linear in the parameters. The method of
estimation used is the maximum likelihood method. The maximum likelihood estimates are
obtained numerically, using an iterative procedure.

OLS and MLE:

OLS --> Ordinary Least Squares

MLE --> Maximum Likelihood Estimation

The ordinary least squares, or OLS, can also be called the linear least squares. This is a
method for approximately determining the unknown parameters located in a linear regression
model. Ordinary least squares is obtained by minimizing the total of squared vertical
distances between the observed responses within the dataset and the responses predicted by
the linear approximation. The resulting estimator can be expressed by a simple formula,
especially in the case of a single regressor on the right-hand side of the linear regression model.

For example, suppose you have a system of several equations with unknown parameters. You
may use the ordinary least squares method because this is the most standard approach to
finding an approximate solution. In other words, it is your overall
solution in minimizing the sum of the squares of errors in your equation. Data fitting can be
your most suited application. Online sources have stated that the data that best fits the
ordinary least squares minimizes the sum of squared residuals. “Residual” is “the difference
between an observed value and the fitted value provided by a model”.

Maximum likelihood estimation, or MLE, is a method used in estimating the parameters of
a statistical model, and for fitting a statistical model to data. If you want to find the height
measurement of every basketball player in a specific location, you can use the maximum
likelihood estimation. Normally, you would encounter problems such as cost and time
constraints. If you could not afford to measure all of the basketball players’ heights, the
maximum likelihood estimation would be very handy. Using the maximum likelihood
estimation, you can estimate the mean and variance of the height of your subjects.
The MLE would set the mean and variance as parameters in determining the specific
parametric values in a given model.
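A brief sketch contrasting the two estimation routes, assuming statsmodels and simulated data: OLS minimizes squared residuals for a linear model, while Logit fits a logistic model by iterative maximum likelihood.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = sm.add_constant(x)                                  # adds the intercept column

# OLS: linear model fit by minimizing the sum of squared residuals.
y_linear = 2.0 + 1.5 * x + rng.normal(size=200)
print(sm.OLS(y_linear, X).fit().params)

# MLE: logistic model fit numerically by maximum likelihood.
y_binary = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 2.0 * x)))).astype(int)
print(sm.Logit(y_binary, X).fit(disp=0).params)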

Hosmer Lemeshow Test:

The Hosmer–Lemeshow test is a statistical test for goodness of fit for logistic regression
models.
It is used frequently in risk prediction models.
The test assesses whether or not the observed event rates match expected event rates in
subgroups of the model population.
The Hosmer–Lemeshow test specifically identifies subgroups as the deciles of fitted risk
values.
Models for which expected and observed event rates in subgroups are similar are called
well calibrated.
The Hosmer–Lemeshow test statistic is given by:

H = Σ (g = 1 to G) [ (Og − Eg)² / (Ng · πg · (1 − πg)) ]

where Og, Eg, Ng, and πg denote the observed events, expected events, number of observations,
and mean predicted risk for the gth risk decile group, and G is the number of groups.
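A sketch of computing the statistic directly from observed outcomes and predicted risks (function and variable names are illustrative; y_true and p_hat would come from a fitted logistic regression):

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, p_hat, G=10):
    order = np.argsort(p_hat)
    groups = np.array_split(order, G)               # G decile groups of fitted risk
    H = 0.0
    for g in groups:
        O_g = y_true[g].sum()                       # observed events in the group
        pi_g = p_hat[g].mean()                      # mean predicted risk
        N_g = len(g)
        E_g = N_g * pi_g                            # expected events in the group
        H += (O_g - E_g) ** 2 / (N_g * pi_g * (1 - pi_g))
    return H, chi2.sf(H, df=G - 2)                  # p-value with G - 2 degrees of freedom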

Error Matrix:
A confusion matrix, also known as a contingency table or an error matrix , is a specific table
layout that allows visualization of the performance of an algorithm, typically a supervised
learning one (in unsupervised learning it is usually called a matching matrix). Each column of
the matrix represents the instances in a predicted class while each row represents the
instances in an actual class (or vice-versa).

The name stems from the fact that it makes it easy to see if the system is confusing two
classes (i.e. commonly mislabeling one as another).

A table of confusion (sometimes also called a confusion matrix) is a table with two rows
and two columns that reports the number of false positives, false negatives, true positives, and
true negatives. This allows more detailed analysis than mere proportion of correct guesses
(accuracy). Accuracy is not a reliable metric for the real performance of a classifier, because
it will yield misleading results if the data set is unbalanced (that is, when the number of
samples in different classes vary greatly).
For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily
be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in
practice the classifier would have a 100% recognition rate for the cat class but a 0%
recognition rate for the dog class.

For the cat/dog example above, the corresponding table of confusion for the cat class would be:

                  Predicted cat    Predicted dog
Actual cat (95)   95 (TP)          0 (FN)
Actual dog (5)    5 (FP)           0 (TN)
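The same numbers fall out of a quick scikit-learn sketch of the cat/dog example (a classifier that labels everything "cat"):

import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = np.array(["cat"] * 95 + ["dog"] * 5)
y_pred = np.array(["cat"] * 100)                      # predicts "cat" for every sample

print(confusion_matrix(y_true, y_pred, labels=["cat", "dog"]))   # [[95, 0], [5, 0]]
print(accuracy_score(y_true, y_pred))                 # 0.95 despite missing every dog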

Receiver Operating Characteristics:


A receiver operating characteristic (ROC), or ROC curve, is a graphical plot that
illustrates the performance of a binary classifier system as its discrimination threshold is
varied. The curve is created by plotting the true positive rate (TPR) against the false positive
rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity or
the sensitivity index d', known as "d-prime" in signal detection and biomedical informatics,
or recall in machine learning. The false-positive rate is also known as the fall-out and can be
calculated as (1 - specificity).
The ROC curve is thus the sensitivity as a function of fall-out. In general, if the probability
distributions for both detection and false alarm are known, the ROC curve can be generated
by plotting the cumulative distribution function (area under the probability distribution from
−∞ to the discrimination threshold) of the detection probability on the y-axis versus the
cumulative distribution function of the false-alarm probability on the x-axis. ROC analysis
provides tools to select possibly optimal
models and to discard suboptimal ones independently from (and prior to specifying) the cost
context or the class distribution. ROC analysis is related in a direct and natural way to
cost/benefit analysis of diagnostic decision making.

The ROC curve was first developed by electrical engineers and radar engineers during World
War II for detecting enemy objects in battlefields and was soon introduced to psychology to
account for perceptual detection of stimuli. ROC analysis since then has been used in
medicine, radiology, biometrics, and other areas for many decades and is increasingly used in
machine learning and data mining research.

The ROC is also known as a relative operating characteristic curve, because it is a
comparison of two operating characteristics (TPR and FPR) as the criterion changes.
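As a minimal sketch (assuming scikit-learn and synthetic classifier scores), the ROC curve is just TPR against FPR computed at every threshold:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                   # binary labels
scores = 0.6 * y_true + 0.7 * rng.random(200)           # noisy but informative scores

fpr, tpr, thresholds = roc_curve(y_true, scores)        # fall-out vs. sensitivity per threshold
print(roc_auc_score(y_true, scores))                    # area under the ROC curve
# Plotting tpr against fpr would draw sensitivity as a function of fall-out (1 - specificity).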

Model Construction:
Step 1: Use univariate analysis to identify important covariates – the ones that are at least
moderately associated with response.
• One covariate at a time.
• Analyze contingency tables for each categorical covariate. Pay particular attention to cells
with low counts. May need to collapse categories in a sensible fashion.
• Use nonparametric smoothing for each continuous covariate. Can also categorize the
covariate and look at the plot of mean response (estimate of π) in each group against the
group mid-point. To get a plot on logit scale, plot the logit transformation of this mean
response. This plot also suggests the appropriate scale of the variable.
• Can also fit logistic regression models with one covariate at a time and analyze the fits. In
particular, look at the estimated coefficients, their standard errors and the likelihood ratio test
for the significance of the coefficient.
• Rule of thumb: select all the variables whose p-value < 0.25 along with the variables of
known clinical importance.

Step 2: Fit a multiple logistic regression model using the variables selected in step 1.
• Verify the importance of each variable in this multiple model using Wald statistic.
• Compare the coefficients of each variable with the coefficient from the model
containing only that variable.
• Eliminate any variable that doesn’t appear to be important and fit a new model. Check if
the new model is significantly different from the old model. If it is, then the deleted variable
was important.
• Repeat this process of deleting, refitting and verifying until it appears that all the important
variables are included in the model.
• At this point, add to the model the variables that were not selected in the original multiple
model.
Assess the joint significance of the variables that were not selected. This step is important as
it helps to identify the confounding variables. Make changes in the model, if necessary.
At the end, we have the preliminary main effects model – it contains the important variables.

Step 3: Check the assumption of linearity in logit for each continuous covariate.
• Look at the smoothed plot of logit in step 1 against the covariate.
• If not linear, find a suitable transformation of the covariate so that the logit is roughly
linear in the new variable.
• Try simple transformations such as power, log, etc.

At the end, we have the main effects model.

Step 4: Check for interactions.


• Create a list of possible pairs of variables in the main effects model that have some
scientific basis to interact with each other. This list may or may not consist of all possible
pairs.
• Add the interaction term, one at a time, in the model containing all the main effects and
assess its significance using the likelihood ratio test.
• Identify the significant interaction terms.

Step 5: Add the interactions found significant in step 4 to the main effects model and
evaluate its fit
• Look at the Wald tests and LR tests for the interaction terms.
• Drop any non-significant interaction.

At the end, we get our preliminary final model. We should now assess the overall goodness-
of-fit of this model and perform model diagnostics.
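A compact sketch of steps 4 and 5 (assessing one interaction with a likelihood ratio test), assuming statsmodels and simulated data; llf is the fitted log-likelihood reported by the package:

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 300))
p = 1 / (1 + np.exp(-(0.5 + x1 - x2 + 0.8 * x1 * x2)))   # true model has an interaction
y = (rng.random(300) < p).astype(int)

X_main = sm.add_constant(np.column_stack([x1, x2]))          # main effects only
X_int = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))  # main effects + interaction

fit_main = sm.Logit(y, X_main).fit(disp=0)
fit_int = sm.Logit(y, X_int).fit(disp=0)

lr_stat = 2 * (fit_int.llf - fit_main.llf)       # likelihood ratio test statistic
print(lr_stat, chi2.sf(lr_stat, df=1))           # keep the interaction if the p-value is small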

Analytics applications to various Business Domains

 Financial Service Analytics
 Marketing Analytics
 Pricing Analytics
 Retail Sales Analytics
 Risk and Credit Analytics
 Supply Chain Analytics
 Transportation Analytics
 Cyber Analytics
 Enterprise Optimization
