
Exploratory Data Analysis MODULE 3

Linear Regression and Variable Selection: Meaning - Review of Expectation, Variance, Frequentist Basics, Parameter Estimation, Linear Methods, Point Estimate, Example Results, Theoretical Justification, R Scripts.
Variable Selection - Variable Selection for the Linear Model, R Scripts.

What is Linear Regression?


Linear regression is a type of supervised machine learning algorithm that models the
linear relationship between a dependent variable and one or more independent features.
When there is only one independent feature, it is known as simple (univariate) linear
regression; when there is more than one feature, it is known as multiple (multivariate)
linear regression.

Why is Linear Regression Important?


The interpretability of linear regression is a notable strength. The model’s equation
provides clear coefficients that elucidate the impact of each independent variable on the
dependent variable, facilitating a deeper understanding of the underlying dynamics. Its
simplicity is a virtue, as linear regression is transparent, easy to implement, and serves as a
foundational concept for more complex algorithms.
Linear regression is not merely a predictive tool; it forms the basis for various advanced
models. Techniques like regularization and support vector machines draw inspiration from
linear regression, expanding its utility. Additionally, linear regression is a cornerstone in
assumption testing, enabling researchers to validate key assumptions about the data.
Types of Linear Regression
There are two main types of linear regression:
Simple Linear Regression
This is the simplest form of linear regression, and it involves only one independent variable
and one dependent variable. The equation for simple linear regression is:

Y = β0 + β1X

where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope


Multiple Linear Regression


This involves more than one independent variable and one dependent variable. The
equation for multiple linear regression is:

Y = β0 + β1X1 + β2X2 + … + βpXp

where:
Y is the dependent variable
X1, X2, …, Xp are the independent variables
β0 is the intercept
β1, β2, …, βp are the slopes
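As a brief, hedged illustration (using R's built-in mtcars data rather than the salary data introduced later in this module), a multiple linear regression with two predictors can be fitted with lm():

# Multiple linear regression: predict fuel efficiency (mpg) from
# weight (wt) and horsepower (hp) using the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)      # beta0 (intercept) and the slopes beta1, beta2
summary(fit)   # standard errors, t values, p-values, R-squared
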
The goal of the algorithm is to find the best-fit line equation that can predict the values
based on the independent variables.
In regression, a set of records with X and Y values is used to learn a function, so that Y can
be predicted for a new, unseen X. Since the response Y is continuous, regression requires a
function that predicts a continuous Y given the independent features X.
What is the best Fit Line?
Our primary objective while using linear regression is to locate the best-fit line, which
implies that the error between the predicted and actual values should be kept to a minimum.
There will be the least error in the best-fit line.
The best Fit Line equation provides a straight line that represents the relationship between
the dependent and independent variables. The slope of the line indicates how much the
dependent variable changes for a unit change in the independent variable(s).


Here Y is called the dependent or target variable and X is called the independent variable,
also known as the predictor of Y. There are many types of functions or models that can be
used for regression; a linear function is the simplest. Here, X may be a single feature or
multiple features representing the problem.
Linear regression performs the task of predicting a dependent variable value (y) based on a
given independent variable (x); hence the name linear regression. In the example illustrated
above, X (input) is the work experience and Y (output) is the salary of a person. The
regression line is the best-fit line for our model.
Since different values for the weights or coefficients of the line result in different regression
lines, we use a cost function to compute the best values and obtain the best-fit line.
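As a small illustrative sketch (toy numbers, not part of the salary example used later), the cost of a candidate line y = a + b·x can be computed in R as the mean squared error:

# Mean squared error of a candidate line y = a + b*x
mse_cost <- function(a, b, x, y) {
  y_hat <- a + b * x        # predictions for this choice of a and b
  mean((y - y_hat)^2)       # average squared residual (the cost)
}

# Toy data: different (a, b) pairs give different costs;
# the best-fit line is the pair with the smallest cost
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
mse_cost(0, 2.0, x, y)
mse_cost(1, 1.5, x, y)
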
Linear Regression Line
The linear regression line provides valuable insights into the relationship between the two
variables. It represents the best-fitting line that captures the overall trend of how a
dependent variable (Y) changes in response to variations in an independent variable (X).
 Positive Linear Regression Line: A positive linear regression line indicates a
direct relationship between the independent variable (X) and the dependent
variable (Y). This means that as the value of X increases, the value of Y also
increases. The slope of a positive linear regression line is positive, meaning that
the line slants upward from left to right.
 Negative Linear Regression Line: A negative linear regression line indicates an
inverse relationship between the independent variable (X) and the dependent
variable (Y). This means that as the value of X increases, the value of Y
decreases. The slope of a negative linear regression line is negative, meaning
that the line slants downward from left to right.
Applications of Linear Regression
Linear regression is used in many different fields, including finance, economics, and
psychology, to understand and predict the behavior of a particular variable. For example, in
finance, linear regression might be used to understand the relationship between a
company’s stock price and its earnings or to predict the future value of a currency based on
its past performance.
Advantages of Linear Regression
i) Linear regression is a relatively simple algorithm, making it easy to understand
and implement. The coefficients of the linear regression model can be
interpreted as the change in the dependent variable for a one-unit change in the
independent variable, providing insights into the relationships between variables.
ii) Linear regression is computationally efficient and can handle large datasets
effectively. It can be trained quickly on large datasets, making it suitable for
real-time applications.
iii) Linear regression is a low-variance method and is less prone to fitting noise than
many more flexible machine learning algorithms; note, however, that because it
minimizes squared errors, individual outliers can still noticeably influence the fit.
iv) Linear regression often serves as a good baseline model for comparison with
more complex machine learning algorithms.
v) Linear regression is a well-established algorithm with a rich history and is
widely available in various machine learning libraries and software packages.
Disadvantages of Linear Regression
 Linear regression assumes a linear relationship between the dependent and
independent variables. If the relationship is not linear, the model may not
perform well.
 Linear regression is sensitive to multicollinearity, which occurs when there is a
high correlation between independent variables. Multicollinearity can inflate the
variance of the coefficients and lead to unstable model predictions.
 Linear regression assumes that the features are already in a suitable form for the
model. Feature engineering may be required to transform features into a format
that can be effectively used by the model.
 Linear regression is susceptible to both overfitting and underfitting. Overfitting
occurs when the model learns the training data too well and fails to generalize to
unseen data. Underfitting occurs when the model is too simple to capture the
underlying relationships in the data.
 Linear regression provides limited explanatory power for complex relationships
between variables. More advanced machine learning techniques may be
necessary for deeper insights.
Variance
Variance here refers to the variability or spread of the observed values around the fitted
regression line, that is, the spread of the residuals. It is one of the components in the
decomposition of the mean squared error (MSE), which is a measure of the overall model
performance. The MSE is often used as a criterion for evaluating the quality of a linear
regression model. The breakdown below shows how variance is related to linear regression:
1. Regression Equation: In linear regression, you are trying to fit a line (or hyperplane
in higher dimensions) to the data. The equation of a simple linear regression line is
typically represented as:

Y = β0 + β1X + ε, where ε is the random error term.

2. Residuals: The residuals are the differences between the observed values (actual
values in the dataset) and the predicted values (Ŷ) from the regression equation.
Mathematically, the residual for each data point i is given by:

ei = Yi − Ŷi

Variance in the Context of Linear Regression:


Residual Variance: The variance in linear regression specifically refers to the variance of the
residuals. It measures how much individual predicted values deviate from the actual values.
The formula for the variance of the residuals is given by:

s² = Σ (Yi − Ŷi)² / (n − 2)

Here, n is the number of data points. The division by (n − 2) is used to obtain an unbiased
estimate of the variance, adjusting for the two degrees of freedom consumed by estimating
the intercept and slope.
Mean Squared Error (MSE): The MSE is the average of the squared residuals and is often
used as an overall measure of the model's performance. It is the sum of the squared residuals
divided by the number of observations:

MSE = (1/n) Σ (Yi − Ŷi)²

The MSE can be decomposed into two components: the variance of the residuals and the
squared bias of the model. This is known as the bias-variance tradeoff. A good model should
have a balance between low bias and low variance.
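These quantities can be computed directly from a fitted model in R; the sketch below uses toy data, so the numbers themselves are only illustrative:

# Toy data and a simple linear fit
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.3, 4.1, 5.8, 8.2, 9.9, 12.1)
fit <- lm(y ~ x)

res <- residuals(fit)                 # e_i = y_i - y_hat_i
n   <- length(res)

mse          <- mean(res^2)           # MSE = (1/n) * sum of squared residuals
residual_var <- sum(res^2) / (n - 2)  # unbiased residual variance (n - 2 df)

c(MSE = mse,
  residual_variance = residual_var,
  residual_std_error = sqrt(residual_var))   # matches summary(fit)$sigma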

Frequentist Basics

Frequentist statistics is a branch of statistics that focuses on the interpretation of
probabilities and statistical inferences. Here are some key concepts and principles
associated with frequentist statistics:
1. Probability: In frequentist statistics, probability is defined as the long-term relative
frequency of an event occurring in repeated, independent trials. For example, if you
were to flip a fair coin many times, the probability of getting heads would be the
ratio of the number of times heads appears to the total number of flips.
2. Random Variables: A random variable is a variable whose values are outcomes of a
random phenomenon. It could be discrete, taking on distinct values, or continuous,
taking on any value within a range. The probability distribution of a random
variable describes the likelihood of different outcomes.
3. Sampling: Frequentist statistics often involves drawing conclusions about a
population based on a sample from that population. The properties of the sample,
such as the sample mean or sample proportion, are used to make inferences about
the corresponding population parameters.
4. Point Estimation: In frequentist statistics, point estimation involves using a single
value (a point estimate) to approximate an unknown parameter of a population. For
example, the sample mean is often used as a point estimate of the population mean.
5. Confidence Intervals: Instead of providing just a point estimate, frequentist
statistics often uses confidence intervals to provide a range of values within which
the true parameter is likely to fall with a certain level of confidence. A 95%
confidence interval, for instance, suggests that in repeated sampling, the interval
would contain the true parameter in 95% of the cases (see the R sketch after this list).
6. Hypothesis Testing: Frequentist statistics employs hypothesis testing to make
decisions about population parameters based on sample data. A hypothesis test
involves formulating a null hypothesis and an alternative hypothesis, collecting data,
and using statistical methods to assess the evidence against the null hypothesis.
7. P-values: In hypothesis testing, the p-value is the probability of observing a test
statistic as extreme as, or more extreme than, the one calculated from the sample
data, assuming the null hypothesis is true. A smaller p-value suggests stronger
evidence against the null hypothesis.
8. Statistical Significance: If the p-value is below a pre-determined significance level
(commonly 0.05), the null hypothesis is rejected in favor of the alternative
hypothesis. This is an indication that the observed result is statistically significant.

9. Type I and Type II Errors: In hypothesis testing, a Type I error occurs when the
null hypothesis is wrongly rejected, and a Type II error occurs when the null
hypothesis is wrongly accepted. The significance level (α) and power of the test are
related to these errors.
10. Law of Large Numbers and Central Limit Theorem: These are fundamental
principles in frequentist statistics. The Law of Large Numbers states that as the
sample size increases, the sample mean approaches the population mean. The
Central Limit Theorem states that the distribution of the sum (or average) of a large
number of independent, identically distributed random variables approaches a
normal distribution, regardless of the original distribution of the variables.
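Several of these ideas (point estimates, confidence intervals, p-values) appear together in the output of a fitted linear model. A minimal sketch on simulated data (the exact numbers depend on the simulation) is:

set.seed(42)
x <- rnorm(50)
y <- 3 + 2 * x + rnorm(50)       # true intercept 3, true slope 2

fit <- lm(y ~ x)

coef(fit)                         # point estimates of the intercept and slope
confint(fit, level = 0.95)        # 95% confidence intervals for the parameters
summary(fit)$coefficients         # estimates, std. errors, t values, p-values
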

Parameter Estimation

In linear regression, the goal is to estimate the parameters of a linear relationship between
two variables. A common approach estimates the parameters by minimizing the sum of
squared errors, where each error is the vertical distance of an observed response from the
regression line. The parameters of a linear regression model can be estimated using a least
squares procedure or by a maximum likelihood estimation procedure.

1. The least squares method finds the best-fitting curve or line of best fit for a set of data
points by minimizing the sum of the squares of the offsets (the residuals) of the points from
the curve. When the relationship between two variables is being determined, the trend of the
outcomes is estimated quantitatively; this process is termed regression analysis, and curve
fitting is one approach to it. Least squares fits an equation that approximates the given raw
data as closely as possible; as a statistical method, it is used to find the line of best fit of the
form y = mx + b for the given data.
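For simple linear regression the least-squares estimates have a closed form, b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and a = ȳ − b·x̄. The sketch below applies it to the first few rows of the salary data used later in this module and checks the result against lm():

x <- c(1.1, 1.3, 1.5, 2.0, 2.2)
y <- c(39343, 46205, 37731, 43525, 39891)

b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a <- mean(y) - b * mean(x)                                      # intercept

c(intercept = a, slope = b)
coef(lm(y ~ x))    # lm() produces the same least-squares estimates
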


2. Maximum likelihood estimation is a probabilistic framework for automatically
finding the probability distribution and parameters that best describe the observed
data. Supervised learning can be framed as a conditional probability problem, and
maximum likelihood estimation can be used to fit the parameters of a model that best
summarizes the conditional probability distribution, so-called conditional maximum
likelihood estimation.
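A hedged sketch of the maximum-likelihood view: maximising the Gaussian log-likelihood numerically with optim() recovers essentially the same intercept and slope as least squares (the data here are simulated purely for illustration):

set.seed(1)
x <- runif(40, 0, 10)
y <- 5 + 1.5 * x + rnorm(40, sd = 2)

# Negative Gaussian log-likelihood in (intercept a, slope b, log sigma)
negloglik <- function(par) {
  a <- par[1]; b <- par[2]; sigma <- exp(par[3])
  -sum(dnorm(y, mean = a + b * x, sd = sigma, log = TRUE))
}

mle <- optim(c(0, 0, 0), negloglik)
mle$par[1:2]      # maximum-likelihood intercept and slope
coef(lm(y ~ x))   # essentially the same values from least squares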

Linear Methods

Linear methods, in the context of statistics and machine learning, refer to techniques that
involve linear relationships between variables. These methods assume that the
relationship between the input features and the output can be adequately described using a
linear model. Here are some common linear methods:

1. Linear Regression: Predict a continuous target variable based on one or more
independent variables.
2. Ridge Regression: Similar to linear regression but includes a regularization term to
prevent overfitting.
3. Lasso Regression: Similar to linear regression but includes a regularization term with
an L1 penalty.
4. Linear Discriminant Analysis: Find the linear combination of features that
characterizes or separates two or more classes.
5. Principal Component Analysis: Reduce dimensionality by transforming the data into a
lower-dimensional space while retaining most of the original variance.
6. Linear Support Vector Machines: Find a hyperplane that best separates data points
from different classes in a high-dimensional space.
7. Logistic Regression: Predict the probability of an instance belonging to a particular
class (binary classification).

Linear methods are often preferred due to their simplicity, interpretability, and efficiency.
Regularization techniques like Ridge and Lasso regression are useful for handling
multicollinearity and preventing overfitting, while methods like PCA aid in dimensionality
reduction. Linear methods are widely used in various fields, including statistics, machine
learning, and econometrics.
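As one possible illustration (assuming the glmnet package is installed, which is a common but not the only choice), ridge, lasso and elastic-net fits differ only in the alpha argument:

# install.packages("glmnet")   # if not already installed
library(glmnet)

set.seed(7)
X <- matrix(rnorm(100 * 5), ncol = 5)          # five candidate predictors
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(100)    # only two are truly relevant

ridge <- glmnet(X, y, alpha = 0)     # ridge: L2 penalty
lasso <- glmnet(X, y, alpha = 1)     # lasso: L1 penalty, can zero out coefficients
enet  <- glmnet(X, y, alpha = 0.5)   # elastic net: mixture of L1 and L2

coef(lasso, s = 0.1)                 # coefficients at one penalty strength lambda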

Point Estimate

A point estimate in linear regression is the value predicted by the regression model for a new
observation. When a regression model is used to make predictions on new observations, the
predicted value is known as a point estimate: it represents our best single guess for the value
of the new observation, but it is unlikely to match that value exactly.
For example, instead of predicting only that a new individual will be 66.8 inches tall, we may
create the following confidence interval:

95% Confidence Interval = [64.8 inches, 68.8 inches]

We would interpret this interval to mean that we are 95% confident that the true height of this
individual is between 64.8 inches and 68.8 inches.
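In R, predict() returns the point estimate, and the interval argument adds a confidence or prediction interval around it. A minimal sketch with made-up height data (the numbers are purely illustrative, not the 66.8-inch example above):

# Illustrative (made-up) data: height in inches against a single predictor
x_age  <- c(10, 11, 12, 13, 14, 15, 16, 17)
height <- c(55, 57, 59, 61, 63, 65, 66, 67)
fit <- lm(height ~ x_age)

new_obs <- data.frame(x_age = 14.5)

predict(fit, new_obs)                                         # point estimate only
predict(fit, new_obs, interval = "confidence", level = 0.95)  # CI for the mean response
predict(fit, new_obs, interval = "prediction", level = 0.95)  # interval for a new individual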

Variable selection

Variable selection is an important step in linear regression modeling. The goal of variable
selection is to choose a reduced number of explanatory variables that can describe the
response variable in a regression model. Adding too many variables can lead to overfitting,
which means that the model describes random error or noise instead of any underlying
relationship. Overfitted models generally have poor predictive performance on test data.

There are several methods for variable selection in linear regression, including Best Subset
Selection (BSS), Least Absolute Shrinkage and Selection Operator (Lasso), and Elastic Net
(Enet).

 In BSS, we select a subset of predictor variables to perform regression or
classification by choosing the k predictor variables (out of the total of p variables) that
yield the minimum residual sum of squares, RSS(β̂).
 Lasso and Enet are alternatives to BSS. Lasso shrinks the regression coefficients of
some predictors to zero, while Enet combines the properties of both Lasso and Ridge
regression.

The choice of variable selection method depends on the specific problem and the goals of the
analysis.
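A hedged sketch of two of these approaches in R, assuming the leaps (for best subset selection) and glmnet (for the Lasso) packages are available:

# install.packages(c("leaps", "glmnet"))   # if needed
library(leaps)
library(glmnet)

set.seed(3)
X <- matrix(rnorm(80 * 6), ncol = 6)
colnames(X) <- paste0("x", 1:6)
y <- 1.5 * X[, "x1"] - 2 * X[, "x3"] + rnorm(80)
dat <- data.frame(y = y, X)

# Best subset selection: best model of each size by residual sum of squares
bss <- regsubsets(y ~ ., data = dat, nvmax = 6)
summary(bss)$which            # which predictors enter the best model of each size

# Lasso: cross-validation chooses the penalty; some coefficients shrink to zero
cv_lasso <- cv.glmnet(X, y, alpha = 1)
coef(cv_lasso, s = "lambda.min")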

Simple Linear Regression in R

Salary dataset:

Years experienced    Salary
1.1                  39343.00
1.3                  46205.00
1.5                  37731.00
2.0                  43525.00
2.2                  39891.00
2.9                  56642.00
3.0                  60150.00
3.2                  54445.00
3.2                  64445.00
3.7                  57189.00

For general purposes, we define:

 x as a feature vector, i.e. x = [x_1, x_2, …, x_n],
 y as a response vector, i.e. y = [y_1, y_2, …, y_n],
 for n observations (in the above example, n = 10).
Step 1: First we convert these data values into an R data frame

# Create the data frame
data <- data.frame(
  Years_Exp = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7),
  Salary = c(39343.00, 46205.00, 37731.00, 43525.00, 39891.00,
             56642.00, 60150.00, 54445.00, 64445.00, 57189.00)
)

Scatter plot of the given dataset

Output: (scatter plot of Salary against Years_Exp)
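The figure itself is not reproduced here, but a plot along these lines can be recreated in base R (a minimal sketch, not the original plotting code):

# Scatter plot of Salary against Years_Exp
plot(data$Years_Exp, data$Salary,
     col = "red", pch = 19,
     xlab = "Years of experience", ylab = "Salary",
     main = "Salary vs Experience")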

Now, we have to find a line that fits the above scatter plot through which we can predict
any value of y or response for any value of x
The line which best fits is called the Regression line.

The equation of the regression line is given by:

y = a + bx

where y is the predicted response value, a is the y-intercept, x is the feature value and b is
the slope.
To create the model, let’s evaluate the values of regression coefficients a and b. And as
soon as the estimation of these coefficients is done, the response model can be predicted.
Here we are going to use the Least Square Technique.
The following R code is used to implement simple linear regression:

install.packages('caTools')   # provides sample.split()
library(caTools)

# Randomly split the data into a training set (70%) and a test set (30%);
# without set.seed(), the split (and hence the output below) varies between runs
split = sample.split(data$Salary, SplitRatio = 0.7)
trainingset = subset(data, split == TRUE)
testset = subset(data, split == FALSE)

# Fitting Simple Linear Regression to the Training set
lm.r = lm(formula = Salary ~ Years_Exp,
          data = trainingset)

# Summary of the model
summary(lm.r)

Output:

Call:
lm(formula = Salary ~ Years_Exp, data = trainingset)
Residuals:
1 2 3 5 6 8 10
463.1 5879.1 -4041.0 -6942.0 4748.0 381.9 -489.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30927 4877 6.341 0.00144 **
Years_Exp 7230 1983 3.645 0.01482 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4944 on 5 degrees of freedom
Multiple R-squared: 0.7266, Adjusted R-squared: 0.6719
F-statistic: 13.29 on 1 and 5 DF, p-value: 0.01482

i) Call: Using the “lm” function, we will be performing a regression analysis of “Salary”
against “Years_Exp” according to the formula displayed on this line.
ii)Residuals: Each residual in the “Residuals” section denotes the difference between the
actual salaries and predicted values. These values are unique to each observation in the data
set. For instance, observation 1 has a residual of 463.1.
iii)Coefficients: Linear regression coefficients are revealed within the contents of this
section.
iv)(Intercept): The estimated salary when Years_Exp is zero is 30927, which represents
the intercept for this case.
v)Years_Exp: For every year of experience gained, the expected salary is estimated to
increase by 7230 units according to the coefficient for “Years_Exp”. This coefficient value
suggests that each year of experience has a significant impact on the estimated salary.
vi)Estimate:The model’s estimated coefficients can be found in this column.
vii)Std. Error: The standard error is a measure of the uncertainty attached to each
coefficient estimate; smaller standard errors indicate more precise estimates.
viii)t value: The t-value measures how many standard errors the coefficient estimate is
away from zero and is used to test the null hypothesis that the coefficient is zero. A larger
absolute t-value indicates stronger evidence that the coefficient is statistically significant.
ix)Pr(>|t|): This column provides the p-value associated with the t-value. The p-value
indicates the probability of observing the t-statistic (or more extreme) under the null
hypothesis that the coefficient is zero. In this case, the p-value for the intercept is 0.00144,
and for “Years_Exp,” it is 0.01482.
x)Signif. codes: These codes indicate the level of significance of the coefficients.
xi)Residual standard error: This is a measure of the variability of the residuals. In this
case, it’s 4944, which represents the typical difference between the actual salaries and the
predicted salaries.
xii)Multiple R-squared: R-squared (R²) is a measure of the goodness of fit of the model. It
represents the proportion of the variance in the dependent variable that is explained by the
independent variable(s). In this case, the R-squared is 0.7266, which means that
approximately 72.66% of the variation in salaries can be explained by years of experience.
xiii)Adjusted R-squared: The adjusted R-squared adjusts the R-squared value based on the
number of predictors in the model. It accounts for the complexity of the model. In this case,
the adjusted R-squared is 0.6719.

xiv)F-statistic: The F-statistic is used to test the overall significance of the model. In this
case, the F-statistic is 13.29 with 1 and 5 degrees of freedom, and the associated p-value is
0.01482. This p-value suggests that the model as a whole is statistically significant.
In summary, this linear regression analysis suggests that there is a significant relationship
between years of experience (Years_Exp) and salary (Salary). The model explains
approximately 72.66% of the variance in salaries, and both the intercept and the coefficient
for “Years_Exp” are statistically significant at the 0.01 and 0.05 significance levels,
respectively.
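The significance statements above can be complemented with confidence intervals for the coefficients, obtained directly from the fitted model (the exact limits depend on the random training split that produced the summary shown):

# 95% confidence intervals for the intercept and the Years_Exp coefficient
confint(lm.r, level = 0.95)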

Predict values using predict function

# Create a data frame with new input values
new_data <- data.frame(Years_Exp = c(4.0, 4.5, 5.0))

# Predict using the linear regression model
predicted_salaries <- predict(lm.r, newdata = new_data)

# Display the predicted salaries
print(predicted_salaries)

Output:
1 2 3
65673.14 70227.40 74781.66

Visualizing the Training set results:

# Visualising the Training set results
library(ggplot2)   # needed for ggplot(), geom_point(), geom_line()

ggplot() +
  geom_point(aes(x = trainingset$Years_Exp,
                 y = trainingset$Salary), colour = 'red') +
  geom_line(aes(x = trainingset$Years_Exp,
                y = predict(lm.r, newdata = trainingset)), colour = 'blue') +
  ggtitle('Salary vs Experience (Training set)') +
  xlab('Years of experience') +
  ylab('Salary')

Output: (scatter plot of the training data with the fitted regression line)

Visualizing the Testing set results:

# Visualising the Test set results
ggplot() +
  geom_point(aes(x = testset$Years_Exp, y = testset$Salary),
             colour = 'red') +
  geom_line(aes(x = trainingset$Years_Exp,
                y = predict(lm.r, newdata = trainingset)),
            colour = 'blue') +
  ggtitle('Salary vs Experience (Test set)') +
  xlab('Years of experience') +
  ylab('Salary')

Output: (scatter plot of the test data with the regression line fitted on the training set)

Dr K. MeenaDevi, Dept of MBA, RNSIT