Unit III
Regression
A regression problem is one in which the output variable is a real or continuous value, such as
“salary” or “weight”. Many different models can be used; the simplest is linear regression,
which tries to fit the data with the best hyperplane passing through the points.
Regression analysis is a statistical process for estimating the relationships between the
dependent (criterion) variables and one or more independent variables (predictors). It
explains how the criterion variables change in relation to changes in selected predictors: it
estimates the conditional expectation of the criterion variables given the predictors, i.e. the
average value of the dependent variables when the independent variables are varied. Three
major uses of regression analysis are determining the strength of predictors, forecasting an
effect, and trend forecasting.
Types of Regression –
Linear regression
Logistic regression
Polynomial regression
Stepwise regression
Ridge regression
Lasso regression
ElasticNet regression
Linear regression is used for predictive analysis. It is a linear approach to modeling the
relationship between the criterion (scalar response) and one or more predictors (explanatory
variables). Linear regression focuses on the conditional probability distribution of the
response given the values of the predictors. With many predictors, there is a danger of
overfitting. The formula for simple linear regression is: Y’ = bX + A, where Y’ is the
predicted value, b is the slope, and A is the intercept.
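As a quick illustration, here is a minimal sketch of estimating the slope b and intercept A of Y’ = bX + A; the data values and the use of NumPy are assumptions made for illustration, not part of these notes:

```python
import numpy as np

# Made-up example data, for illustration only
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

# Degree-1 least squares fit: np.polyfit returns the slope b and intercept A of Y' = bX + A
b, A = np.polyfit(X, Y, 1)
Y_pred = b * X + A

print(f"slope b = {b:.3f}, intercept A = {A:.3f}")
print("predicted values:", np.round(Y_pred, 2))
```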
Logistic regression is used when the dependent variable is dichotomous. It estimates the
parameters of a logistic model and is a form of binomial regression. Logistic regression is
used for data with two possible criterion values, modeling the relationship between the
criterion and the predictors. The equation for logistic regression is:
p = 1 / (1 + e^-(b0 + b1X1 + b2X2 + … + bkXk))
where b0 is the constant term and X1, …, Xk are the k independent (X) variables. In ordinal
logistic regression, the threshold coefficient is different for every level of the ordered
dependent variable, and the coefficients give the cumulative probability of each level of the
dependent variable.
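A minimal sketch of evaluating the logistic equation above; the coefficient values and the single observation below are hypothetical:

```python
import numpy as np

def logistic_probability(x, b0, b):
    """p = 1 / (1 + e^-(b0 + b1*X1 + ... + bk*Xk)) for a single observation x."""
    z = b0 + np.dot(b, x)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one observation with k = 2 predictors
b0 = -1.5
b = np.array([0.8, 2.0])
x = np.array([1.2, 0.5])

print(f"predicted probability = {logistic_probability(x, b0, b):.3f}")
```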
Polynomial regression is used for curvilinear data and is fit with the method of least squares.
The goal of regression analysis is to model the expected value of a dependent variable y in
terms of the independent variable x. The equation for polynomial regression of degree n is:
y = β0 + β1x + β2x² + … + βnxⁿ + ε
where ε is an unobserved random error with mean zero conditioned on the scalar variable x.
In the first-degree (linear) case of this model, each unit increase in the value of x increases
the conditional expectation of y by β1 units.
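A minimal sketch of a least squares polynomial fit (degree 2, on made-up curvilinear data; NumPy assumed):

```python
import numpy as np

# Made-up, roughly quadratic data for illustration only
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.1, 2.9, 7.2, 13.1, 21.0, 31.2])

# Fit y = b0 + b1*x + b2*x^2 by least squares; polyfit returns the highest-degree coefficient first
b2, b1, b0 = np.polyfit(x, y, 2)
y_hat = b0 + b1 * x + b2 * x ** 2

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
print("fitted values:", np.round(y_hat, 2))
```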
Stepwise regression is used for fitting regression models in which the choice of predictive
variables is carried out automatically. At each step, a variable is added to or removed from
the set of explanatory variables. The approaches to stepwise regression are forward selection,
backward elimination, and bidirectional elimination. The formula for the standardized
coefficient used in stepwise regression is:
bj.std = bj × (Sxj / Sy)
where Sy and Sxj are the standard deviations of the dependent variable and the corresponding
jth independent variable.
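As a sketch of forward selection, one of the stepwise approaches listed above, here is scikit-learn's SequentialFeatureSelector on a synthetic dataset; the dataset and the choice of three retained features are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 candidate predictors, of which only 3 are informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3, noise=10.0, random_state=0)

# Forward selection: variables are added one at a time until 3 are selected
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward"
)
selector.fit(X, y)

print("selected feature indices:", np.flatnonzero(selector.get_support()))
```

Backward elimination corresponds to direction="backward" in the same selector.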
Ridge regression is a technique for analyzing multiple regression data. When multi-
collinearity occurs, least squares estimates are unbiased, but their variances are large. By
adding a degree of bias to the regression estimates, ridge regression reduces the standard
errors. The objective minimized in ridge regression is:
Objective = RSS + α × (sum of squares of the coefficients β)
where β are the coefficients, Y is the response variable, and RSS = Σ(Y − Ŷ)².
Lasso regression is a regression analysis method that performs both variable selection and
regularization. Lasso regression uses soft thresholding and selects only a subset of the
provided covariates for use in the final model. The objective minimized in lasso regression is:
Objective = RSS + α × (sum of the absolute values of the coefficients β)
Here, α (alpha) works similarly to the penalty in ridge regression and provides a trade-off
between the RSS and the magnitude of the coefficients. As with ridge regression, α can take
various values.
ElasticNet regression is a regularized regression method that linearly combines the penalties
of the lasso and ridge methods. ElasticNet regression is used in support vector machines,
metric learning, and portfolio optimization. The penalty function is given by:
λ1 × (sum of the absolute values of the coefficients) + λ2 × (sum of squares of the coefficients)
Use of the lasso penalty alone has several limitations. For example, in the "large p, small
n" case (high-dimensional data with few examples), the lasso selects at most n variables
before it saturates; combining it with the ridge penalty, as the elastic net does, removes this
limitation.
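A short sketch comparing the three penalized fits described above with scikit-learn; the synthetic data and the penalty strengths alpha and l1_ratio are arbitrary assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data with several partly irrelevant predictors
X, y = make_regression(n_samples=100, n_features=10, n_informative=4, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1 penalty: sets some coefficients to zero
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # combined L1 and L2 penalties

print("ridge nonzero coefficients:", (ridge.coef_ != 0).sum())
print("lasso nonzero coefficients:", (lasso.coef_ != 0).sum())
print("elastic net nonzero coefficients:", (enet.coef_ != 0).sum())
```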
BLUE property assumptions
• B-BEST
• L-LINEAR
• U-UNBIASED
• E-ESTIMATOR
An estimator is BLUE if the following hold:
1. It is linear (Regression model)
2. It is unbiased
3. It is an efficient estimator (an unbiased estimator with the least variance)
LINEARITY
• An estimator is said to be a linear estimator of (β) if it is a linear function of the
sample observations
UNBIASEDNESS
• A desirable property of a distribution of estimates is that its mean equals the true
mean of the variables being estimated
• Formally, an estimator is an unbiased estimator if the expected value of its sampling
distribution equals the true value of the population parameter.
• We also write this as follows: E(β̂) = β
Similarly, if this is not the case, we say that the estimator is biased, and
Bias = E(β̂) − β
MINIMUM VARIANCE
• Just as we want the mean of the sampling distribution to be centered around the true
population value, it is desirable for the sampling distribution to be as narrow (or precise) as
possible (see the simulation sketch below).
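The simulation below is a minimal sketch of these two properties: the true intercept, slope and noise level are assumed values, and the OLS slope is re-estimated over many repeated samples so that the mean and spread of the estimates can be inspected.

```python
import numpy as np

rng = np.random.default_rng(42)
true_intercept, true_slope = 2.0, 3.0   # assumed "population" values
slopes = []

# Draw many samples from the same population and refit OLS each time
for _ in range(2000):
    x = rng.uniform(0, 10, size=50)
    y = true_intercept + true_slope * x + rng.normal(0, 2.0, size=50)
    slope, _ = np.polyfit(x, y, 1)
    slopes.append(slope)

# Unbiasedness: the mean of the sampling distribution is close to the true slope.
# Minimum variance is about how narrow this sampling distribution is.
print(f"mean estimated slope = {np.mean(slopes):.3f} (true slope = {true_slope})")
print(f"variance of estimated slopes = {np.var(slopes):.4f}")
```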
To understand the least-squares regression method, let's get familiar with the concepts
involved in formulating the line of best fit.
If we were to plot the best-fit line that depicts the sales of a company over a period of time,
the line would lie as close as possible to all the scattered data points. That is what an ideal
best-fit line looks like.
Let's see how to calculate the line using Least Squares Regression. The line of best fit is
given by the simple equation y = mx + c, which represents a straight line in two-dimensional
data, i.e. along the x-axis and y-axis. To better understand this, let's break down the equation:
y: dependent variable
m: the slope of the line
x: independent variable
c: y-intercept
So the aim is to calculate the values of the slope and the y-intercept and then substitute the
corresponding ‘x’ values into the equation in order to derive the value of the dependent
variable.
Step 1: Compute the slope: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², where x̄ and ȳ are the means of x
and y.
Step 2: Compute the y-intercept (the value of y at the point where the line crosses the y-axis):
c = ȳ − m·x̄
Now let's look at an example and see how you can use the least-squares regression method to
compute the line of best fit. In the worked example, Tom records the number of T-shirts sold
at different prices at his retail shop. Once the computed values of m and c are substituted into
y = mx + c, the line of best fit can be plotted against the data. Tom can then use this equation
to estimate how many T-shirts priced at $8 he can sell at the retail shop. This comes down to
13 T-shirts! That's how simple it is to make predictions using Linear Regression.
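Here is a minimal Python sketch of Steps 1 and 2. The price/sales numbers below are illustrative assumptions (chosen so that the estimate at a price of $8 comes out to roughly the 13 T-shirts quoted above), not data taken from these notes:

```python
import numpy as np

# Illustrative (price, units sold) data
x = np.array([2, 3, 5, 7, 9], dtype=float)    # T-shirt price
y = np.array([4, 5, 7, 10, 15], dtype=float)  # T-shirts sold

# Step 1: slope m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Step 2: intercept c = y_mean - m * x_mean
c = y.mean() - m * x.mean()

print(f"line of best fit: y = {m:.3f}x + {c:.3f}")
print(f"estimated sales at price $8: {m * 8 + c:.1f}")   # about 12.5, i.e. roughly 13 T-shirts
```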
Now let's try to understand on what basis we can confirm that the above line is the line of
best fit. The least squares regression method works by making the sum of the squared errors
as small as possible, hence the name least squares. Basically, the distance between each data
point and the line of best fit (the error) must be minimized as much as possible. This is the
basic idea behind the least squares regression method.
A few things to keep in mind before implementing the least squares regression method are:
The data must be free of outliers, because outliers might lead to a biased and incorrect
line of best fit.
The line of best fit can be drawn iteratively until you get a line with the minimum
possible sum of squared errors.
This method works well even with non-linear data.
Technically, the difference between the actual value of ‘y’ and the predicted value of
‘y’ is called the residual (it denotes the error).
Variable Rationalization
Data Rationalization is an enabler of effective Data Governance. How can you govern
information assets if you don’t know where they are or what they mean? Similarly, Data
Rationalization can aid in the development of Master Data Management solutions. By
identifying common data entities, and how these relate to other pieces of data (again, across
many systems), MDM solutions will be able to better accommodate the needs of all the
systems which require the master/reference data.
In order to be able to rationalize your data, meta relationships between model objects (across
model levels) must be established. Of course, we are not talking about supplanting the normal
types of relationships between model objects in the same model. Meta relationships can be
established in multiple ways.
Model building is the focus of our algorithms of interest. Each technique has its own
strengths and weaknesses.
Model validation is very important to develop a sense of trust in the models prior to their
usage. Classically, this is achieved by using a data subset which has not been used in the
model development as an independent assessment of the model quality. As a further quality
control measure, models which perform well against the training and test sets can also be
evaluated against a third (validation) set. If there is a paucity of data (a fat array), then other
cross-validation strategies can be applied. Alternately, in this case, the symbolic regression
modeling can be done using all of the available data and the Pareto front trading expression
accuracy and complexity may be used to select models at an appropriate point in the trade-off
between model accuracy and complexity — which implicitly minimizes the risk of selecting
an over-trained model. Symbolic regression models have an additional advantage in terms of
transparency — i.e., a user can look at the expression structure and constituent variables (and
variable combinations) and agree that the expression is reasonable and intuitively correct.
Model behavior at limit conditions should also be examined since, for example, a model
which recommends that the optimal planting depth for corn is four inches above the surface
of the ground (true story) is not likely to get much in terms of user acceptance. Thus, our
definition of a good model typically includes robustness as well as absolute accuracy.
Additionally, we may want to identify an ensemble of diverse models (of similar modeling
capacity) whose agreement can be used as a trust metric. Assembling this ensemble is easiest
for stacked analytic nets and relatively easy for symbolic regression models. The use of
ensembles and the associated trust metric has several very significant advantages:
divergence in the prediction of the individual models within the ensemble provides
an immediate warning that the model is either operating in uncharted territory or
the underlying system has undergone fundamental changes. In either event, the model
predictions should not be trusted;
all of the data may be used in the model building rather than requiring a partitioning
into training, testing, validation, etc. subsets. The resulting models will be less
myopic. Additionally, this may be a very significant advantage if data is sparse.
Empirical models are simply an academic exercise unless their value is extracted. Model
deployment, therefore, is a key aspect of the modeling lifecycle. Associated with this are the
system-building aspects of user interfaces and data entry. Trust metrics are also important in
order to identify, if possible, when the model is being used inappropriately, e.g. in a region of
parameter space outside the domain used to train the model.
Logistic Regression
Model Theory
Logistic regression is a binary classifier. It utilizes the Logistic function, or Sigmoid function,
to predict the probability that the answer to some question is 1 or 0, yes or no, true or false,
good or bad, etc. This function drives the algorithm and is also interesting in that it can be
used as an "activation function" for Neural Networks. Logistic Regression is a good
algorithm to dig into for understanding Machine Learning.
Logistic Regression is a relatively simple and powerful algorithm for deciding between two
classes, i.e. it is a binary classifier. It basically gives a function that is a boundary between
two different classes. It can be extended to handle more than two classes, either by
multinomial logistic regression (softmax regression) or by a method referred to as
"one-vs-all", which is really a collection of binary classifiers: each class is evaluated against
everything else, and the class with the highest probability is picked.
It's a simple yes-or-no type of classifier. Logistic regression can make use of large numbers
of features including continuous and discrete variables and non-linear features. It can be used
for many kinds of problems.
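A sketch of the one-vs-all extension using scikit-learn; the three-class iris dataset and the OneVsRestClassifier wrapper are assumptions used only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# A three-class toy dataset
X, y = load_iris(return_X_y=True)

# One binary logistic classifier per class; the class with the highest probability wins
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print("predicted classes for the first five samples:", ovr.predict(X[:5]))
print("per-class probabilities for the first sample:", ovr.predict_proba(X[:1]).round(3))
```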
The first step is to prepare the data. This involves cleaning the data, handling missing values,
and converting numerical variables into categorical variables. We’ll also split the data into
training and testing sets to evaluate the model’s performance.
The next step is to select the independent variables to include in the model. We’ll want to
include all relevant independent variables, including any categorical variables that were
already present in the dataset.
Now it’s time to fit the model. We’ll use a logistic regression algorithm to estimate the
coefficients for each independent variable. These coefficients represent the strength and
direction of the relationship between each independent variable and the dependent variable.
Once the model is fit, we need to evaluate its performance. We’ll use the testing dataset to
evaluate the model’s accuracy, precision, recall, and F1 score. These metrics will help us
understand how well the model is performing and identify any areas where it needs
improvement.
Based on the evaluation results, we may need to fine-tune the model. This could involve adjusting
the parameters of the logistic regression algorithm, selecting different independent variables, or
changing the method used to convert numerical variables into categorical variables. We’ll repeat
the fitting and evaluation steps until we have a model that meets our performance goals.
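A condensed sketch of this prepare/split/fit/evaluate loop with scikit-learn; the breast-cancer dataset and the scaling step are stand-in assumptions, not part of the text above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Prepare the data and split it into training and testing sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale the features, then fit the logistic regression model on the training set
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# Evaluate on the held-out test set
y_pred = model.predict(scaler.transform(X_test))
print(f"accuracy  = {accuracy_score(y_test, y_pred):.3f}")
print(f"precision = {precision_score(y_test, y_pred):.3f}")
print(f"recall    = {recall_score(y_test, y_pred):.3f}")
print(f"F1 score  = {f1_score(y_test, y_pred):.3f}")
```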
When selecting the model for the logistic regression analysis, another important consideration
is the model fit. In logistic regression the fitting is carried out by working with the logits. The
Logit transformation produces a model that is linear in the parameters. The method of
estimation used is the maximum likelihood method. The maximum likelihood estimates are
obtained numerically, using an iterative procedure.
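As a small sketch of this maximum likelihood fit on the logit scale (statsmodels is an assumed choice of library, and the data is simulated):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: P(y = 1) follows a logistic model in one predictor
rng = np.random.default_rng(1)
x = rng.normal(size=300)
p = 1 / (1 + np.exp(-(-0.5 + 1.5 * x)))
y = rng.binomial(1, p)

# Logit model fitted by maximum likelihood with an iterative (Newton-type) optimizer
X = sm.add_constant(x)
result = sm.Logit(y, X).fit()   # reports the iterations used by the optimizer

print(result.params)   # estimated intercept and slope on the logit scale
```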
Ordinary least squares, or OLS, is also called linear least squares. It is a method for
approximately determining the unknown parameters of a linear regression model. Ordinary
least squares works by minimizing the total of the squared vertical distances between the
observed responses in the dataset and the responses predicted by the linear approximation.
The resulting estimator can be expressed by a simple formula, especially in the case of a
single regressor on the right-hand side of the linear regression model.
For example, suppose you have a system of several equations with unknown parameters.
You may use the ordinary least squares method because it is the most standard approach to
finding an approximate solution: it is the overall solution that minimizes the sum of the
squared errors in the equations. Data fitting is its best-suited application: the fit that is best in
the ordinary least squares sense minimizes the sum of squared residuals, a “residual” being
“the difference between an observed value and the fitted value provided by a model”.
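A sketch of OLS via the normal equations, beta_hat = (X'X)^-1 X'Y, on made-up single-regressor data (NumPy assumed):

```python
import numpy as np

# Made-up data with a single regressor plus an intercept column
rng = np.random.default_rng(7)
x = rng.uniform(0, 5, size=40)
y = 1.0 + 2.5 * x + rng.normal(0, 0.5, size=40)
X = np.column_stack([np.ones_like(x), x])

# OLS estimator from the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals: observed value minus fitted value; OLS minimizes their sum of squares
residuals = y - X @ beta_hat
print("estimated intercept and slope:", np.round(beta_hat, 3))
print("sum of squared residuals:", round(float(residuals @ residuals), 3))
```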
The Hosmer–Lemeshow test is a statistical test for goodness of fit for logistic regression
models.
It is used frequently in risk prediction models.
The test assesses whether or not the observed event rates match expected event rates in
subgroups of the model population.
The Hosmer–Lemeshow test specifically identifies subgroups as the deciles of fitted risk
values.
Models for which expected and observed event rates in subgroups are similar are called
well calibrated.
The Hosmer–Lemeshow test statistic is given by:
H = Σ (g = 1..G) (Og − Eg)² / (Ng πg (1 − πg))
Here Og, Eg, Ng, and πg denote the observed events, expected events, number of
observations, and mean predicted risk for the gth risk decile group, and G is the number of
groups.
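A minimal sketch of computing this statistic over deciles of fitted risk (NumPy/SciPy; the simulated predictions at the bottom are only an assumption used to exercise the function):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Hosmer-Lemeshow statistic over groups (deciles) of fitted risk."""
    order = np.argsort(y_prob)
    groups = np.array_split(order, n_groups)   # decile groups of fitted risk
    H = 0.0
    for g in groups:
        O = y_true[g].sum()        # observed events O_g
        pi = y_prob[g].mean()      # mean predicted risk pi_g
        N = len(g)                 # observations N_g
        E = N * pi                 # expected events E_g
        H += (O - E) ** 2 / (N * pi * (1 - pi))
    # Commonly compared against a chi-square distribution with G - 2 degrees of freedom
    return H, chi2.sf(H, df=n_groups - 2)

# Hypothetical usage with simulated, well-calibrated predictions
rng = np.random.default_rng(3)
p = rng.uniform(0.05, 0.95, size=500)
y = rng.binomial(1, p)
H, p_value = hosmer_lemeshow(y, p)
print(f"H = {H:.2f}, p-value = {p_value:.3f}")
```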
Error Matrix:
A confusion matrix, also known as a contingency table or an error matrix, is a specific table
layout that allows visualization of the performance of an algorithm, typically a supervised
learning one (in unsupervised learning it is usually called a matching matrix). Each column of
the matrix represents the instances in a predicted class while each row represents the
instances in an actual class (or vice-versa).
The name stems from the fact that it makes it easy to see if the system is confusing two
classes (i.e. commonly mislabeling one as another).
A table of confusion (sometimes also called a confusion matrix), is a table with two rows
and two columns that reports the number of false positives, false negatives, true positives, and
true negatives. This allows more detailed analysis than mere proportion of correct guesses
(accuracy). Accuracy is not a reliable metric for the real performance of a classifier, because
it will yield misleading results if the data set is unbalanced (that is, when the number of
samples in the different classes varies greatly).
For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily
be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in
practice the classifier would have a 100% recognition rate for the cat class but a 0%
recognition rate for the dog class.
For the cat/dog example above (every sample classified as a cat), the corresponding table of
confusion for the cat class would be: 95 true positives (cats correctly labeled as cats), 0 false
negatives, 5 false positives (dogs labeled as cats), and 0 true negatives.
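A short sketch reproducing the cat/dog imbalance example with scikit-learn (the labels are constructed to match the 95-cat/5-dog scenario above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# 95 cats and 5 dogs; the classifier labels every sample as "cat"
y_true = ["cat"] * 95 + ["dog"] * 5
y_pred = ["cat"] * 100

print(confusion_matrix(y_true, y_pred, labels=["cat", "dog"]))
# [[95  0]
#  [ 5  0]]   rows = actual class, columns = predicted class

print("accuracy:", accuracy_score(y_true, y_pred))                              # 0.95
print("cat recognition rate:", recall_score(y_true, y_pred, pos_label="cat"))   # 1.0
print("dog recognition rate:", recall_score(y_true, y_pred, pos_label="dog"))   # 0.0
```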
The ROC curve was first developed by electrical engineers and radar engineers during World
War II for detecting enemy objects in battlefields and was soon introduced to psychology to
account for perceptual detection of stimuli. ROC analysis since then has been used in
medicine, radiology, biometrics, and other areas for many decades and is increasingly used in
machine learning and data mining research.
Model Construction:
Step 1: Use univariate analysis to identify important covariates – the ones that are at least
moderately associated with response.
• One covariate at a time.
• Analyze contingency tables for each categorical covariate. Pay particular attention to cells
with low counts. May need to collapse categories in a sensible fashion.
• Use nonparametric smoothing for each continuous covariate. Can also categorize the
covariate and look at the plot of mean response (estimate of π) in each group against the
group mid-point. To get a plot on logit scale, plot the logit transformation of this mean
response. This plot also suggests the appropriate scale of the variable.
• Can also fit logistic regression models with one covariate at a time and analyze the fits (see
the sketch after this list). In particular, look at the estimated coefficients, their standard errors
and the likelihood ratio test for the significance of the coefficient.
• Rule of thumb: select all the variables whose p-value < 0.25 along with the variables of
known clinical importance.
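Below is a sketch of the single-covariate screening described above, using statsmodels on a hypothetical dataset; the variable names, effect sizes and the p < 0.25 cut-off applied in code are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset: a binary response plus a few candidate covariates
rng = np.random.default_rng(5)
n = 400
df = pd.DataFrame({
    "age": rng.normal(50, 10, n),
    "bmi": rng.normal(27, 4, n),
    "smoker": rng.binomial(1, 0.3, n),
})
logit_p = -6 + 0.08 * df.age + 0.05 * df.bmi + 0.9 * df.smoker
df["response"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# One covariate at a time: look at the coefficient, its standard error and p-value,
# and keep covariates with p < 0.25 for the multiple model in Step 2
for covariate in ["age", "bmi", "smoker"]:
    fit = smf.logit(f"response ~ {covariate}", data=df).fit(disp=0)
    coef, se, p = fit.params[covariate], fit.bse[covariate], fit.pvalues[covariate]
    decision = "keep" if p < 0.25 else "drop"
    print(f"{covariate}: coef={coef:.3f}, se={se:.3f}, p={p:.3f} -> {decision}")
```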
Step 2: Fit a multiple logistic regression model using the variables selected in step 1.
• Verify the importance of each variable in this multiple model using Wald statistic.
• Compare the coefficient of each variable with the coefficient from the model
containing only that variable.
• Eliminate any variable that doesn’t appear to be important and fit a new model. Check if
the new model is significantly different from the old model. If it is, then the deleted variable
was important.
• Repeat this process of deleting, refitting and verifying until it appears that all the important
variables are included in the model.
• At this point, add into the model the variables that were not selected for the original
multiple model.
Assess the joint significance of the variables that were not selected. This step is important as
it helps to identify the confounding variables. Make changes in the model, if necessary.
At the end, we have the preliminary main effects model – it contains the important variables.
Step 3: Check the assumption of linearity in logit for each continuous covariate.
• Look at the smoothed plot of logit in step 1 against the covariate.
• If not linear, find a suitable transformation of the covariate so that the logit is roughly
linear in the new variable.
• Try simple transformations such as power, log, etc.
Step 4: Check for interactions among the variables in the main effects model. Add plausible
interaction terms one at a time and assess their significance.
Step 5: Add the interactions found significant in step 4 to the main effects model and
evaluate its fit
• Look at the Wald tests and LR tests for the interaction terms.
• Drop any non-significant interaction.
At the end, we get our preliminary final model. We should now assess the overall goodness-
of-fit of this model and perform model diagnostics.
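As a closing sketch of Steps 4 and 5, the snippet below fits a main effects logistic model and a model with one interaction term on simulated data, then looks at the Wald test (in the summary) and a likelihood ratio test for the interaction; the data, variable names and effect sizes are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Simulated data with a genuine x1:x2 interaction
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
logit_p = -0.5 + 1.0 * df.x1 + 0.8 * df.x2 + 0.6 * df.x1 * df.x2
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

main = smf.logit("y ~ x1 + x2", data=df).fit(disp=0)            # main effects model
inter = smf.logit("y ~ x1 + x2 + x1:x2", data=df).fit(disp=0)   # with the interaction term

# Wald test for the interaction term: z = coefficient / standard error (shown in the summary)
print(inter.summary())

# Likelihood ratio test: 2 * (llf_full - llf_reduced) ~ chi-square with 1 degree of freedom
lr = 2 * (inter.llf - main.llf)
print(f"LR statistic = {lr:.2f}, p-value = {stats.chi2.sf(lr, df=1):.4f}")
```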