The simple linear regression equation is:
Y = β0 + β1X
where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
For multiple linear regression:
Y = β0 + β1X1 + β2X2 + … + βpXp
where:
Y is the dependent variable
X1, X2, …, Xp are the independent variables
β0 is the intercept
β1, β2, …, βp are the slopes (coefficients)
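In R (the language used for the worked example later in this module), a regression of this form can be fitted with lm(); the data frame and variable names below are purely illustrative and not part of the original notes:
# illustrative data with two predictors
set.seed(0)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y <- 5 + 2 * df$x1 - 3 * df$x2 + rnorm(50)
fit <- lm(y ~ x1 + x2, data = df)  # fits Y = b0 + b1*X1 + b2*X2
coef(fit)                          # estimated intercept and slopes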
The goal of the algorithm is to find the best-fit line equation that can predict the values of the dependent variable from the independent variables.
In regression, a set of records with X and Y values is available, and these values are used to learn a function; if you then want to predict Y for a new, unseen X, this learned function can be used. Because regression must output a value of Y, the learned function must predict a continuous Y given X as the independent feature(s).
What is the best Fit Line?
Our primary objective in linear regression is to find the best-fit line, meaning the line for which the error between the predicted and actual values is as small as possible; the best-fit line has the least error.
The best Fit Line equation provides a straight line that represents the relationship between
the dependent and independent variables. The slope of the line indicates how much the
dependent variable changes for a unit change in the independent variable(s).
Here Y is called the dependent or target variable and X is called the independent variable, also known as the predictor of Y. There are many types of functions or models that can be used for regression; a linear function is the simplest. Here, X may be a single feature or multiple features representing the problem.
Linear regression performs the task of predicting a dependent variable value (Y) based on a given independent variable (X); hence the name linear regression. In the salary example used later in this module, X (input) is the years of work experience and Y (output) is the salary of a person. The regression line is the best-fit line for our model.
Since different values for the weights (the coefficients of the line) produce different regression lines, we use a cost function to compute the best values and so obtain the best-fit line.
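To make this concrete, the short R sketch below (with illustrative values, not the salary data used later) computes the cost, here the mean squared error, for two candidate lines, showing that different coefficient choices give different costs:
# toy data: x = years of experience, y = salary in thousands (illustrative)
x <- c(1, 2, 3, 4, 5)
y <- c(30, 35, 45, 50, 60)
# cost (mean squared error) of a candidate line y = b0 + b1 * x
cost <- function(b0, b1) mean((y - (b0 + b1 * x))^2)
cost(20, 5)    # one candidate line
cost(22, 7.5)  # a better candidate line with a much lower cost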
Linear Regression Line
The linear regression line provides valuable insights into the relationship between the two
variables. It represents the best-fitting line that captures the overall trend of how a
dependent variable (Y) changes in response to variations in an independent variable (X).
Positive Linear Regression Line: A positive linear regression line indicates a
direct relationship between the independent variable (X) and the dependent
variable (Y). This means that as the value of X increases, the value of Y also
increases. The slope of a positive linear regression line is positive, meaning that
the line slants upward from left to right.
Negative Linear Regression Line: A negative linear regression line indicates an
inverse relationship between the independent variable (X) and the dependent
variable (Y). This means that as the value of X increases, the value of Y
decreases. The slope of a negative linear regression line is negative, meaning
that the line slants downward from left to right.
Applications of Linear Regression
Linear regression is used in many different fields, including finance, economics, and
psychology, to understand and predict the behavior of a particular variable. For example, in
finance, linear regression might be used to understand the relationship between a
company’s stock price and its earnings or to predict the future value of a currency based on
its past performance.
Advantages of Linear Regression
i) Linear regression is a relatively simple algorithm, making it easy to understand and implement. The coefficients of the linear regression model can be interpreted as the change in the dependent variable for a one-unit change in the corresponding independent variable.
Variance also plays an important role in evaluating the quality of a linear regression model. The breakdown below shows how variance is related to linear regression:
1. Regression Equation: In linear regression, you are trying to fit a line (or hyperplane in higher dimensions) to the data. The equation of a simple linear regression line is typically represented as:
Ŷ = β0 + β1X
2. Residuals: The residuals are the differences between the observed values (actual values in the dataset) and the predicted values (Ŷ) from the regression equation. Mathematically, the residual for each data point i is given by:
eᵢ = Yᵢ − Ŷᵢ
The variance of the residuals is then estimated as:
s² = Σ eᵢ² / (n − 2)
Here, n is the number of data points. The division by (n − 2) is used to obtain an unbiased estimate of the variance, adjusting for the two degrees of freedom consumed by estimating the intercept and slope.
Mean Squared Error (MSE): The MSE is the average of the squared residuals and is often used as an overall measure of the model's performance. It is the sum of the squared residuals divided by the number of observations:
MSE = (1/n) Σ (Yᵢ − Ŷᵢ)²
The MSE can be decomposed into two components: the variance of the residuals and the
squared bias of the model. This is known as the bias-variance tradeoff. A good model should
have a balance between low bias and low variance.
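As an illustration, the following R sketch (using a small made-up data set, not the salary data used later) extracts the residuals from a fitted model and computes the residual variance and MSE defined above:
# small illustrative dataset
df <- data.frame(x = c(1, 2, 3, 4, 5, 6), y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2))
fit <- lm(y ~ x, data = df)
res <- residuals(fit)   # observed minus predicted values
n <- nrow(df)
sum(res^2) / (n - 2)    # unbiased residual variance (two parameters estimated)
mean(res^2)             # mean squared error (MSE)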
Frequentist Basics
9. Type I and Type II Errors: In hypothesis testing, a Type I error occurs when a true null hypothesis is wrongly rejected, and a Type II error occurs when a false null hypothesis is wrongly retained (not rejected). The significance level (α) and the power of the test are related to these errors (a short simulation after item 10 illustrates this).
10. Law of Large Numbers and Central Limit Theorem: These are fundamental
principles in frequentist statistics. The Law of Large Numbers states that as the
sample size increases, the sample mean approaches the population mean. The
Central Limit Theorem states that the distribution of the sum (or average) of a large
number of independent, identically distributed random variables approaches a
normal distribution, regardless of the original distribution of the variables.
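Both principles, along with the Type I error rate from item 9, can be illustrated with a short R simulation; the sample sizes and replication counts below are arbitrary choices for the sketch:
set.seed(2)
# Central Limit Theorem: means of samples from a skewed (exponential) distribution
sample_means <- replicate(5000, mean(rexp(50, rate = 1)))
hist(sample_means, breaks = 40, main = 'Sample means', xlab = 'Mean')
# the histogram is roughly normal and centred near the population mean of 1
# Type I error: testing a true null hypothesis (mean = 0) at alpha = 0.05
p_values <- replicate(5000, t.test(rnorm(30, mean = 0))$p.value)
mean(p_values < 0.05)  # proportion of wrongful rejections, close to 0.05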
Parameter Estimation
In linear regression, the goal is to estimate the parameters of a linear relationship between two variables. The most common approach, least squares, estimates the parameters by minimizing the sum of squared errors, i.e. the vertical distances of each observed response from the regression line. The parameters of a linear regression model can be estimated using a least squares procedure or by a maximum likelihood estimation procedure.
1. The least squares method is the process of finding the best-fitting curve or line of best fit for a set of data points by minimizing the sum of the squares of the offsets (residuals) of the points from the curve. In the process of finding the relation between two variables, the trend of outcomes is estimated quantitatively; this process is termed regression analysis. The method of curve fitting is an approach to regression analysis, and least squares is the method of fitting equations that best approximate the given raw data. The least squares method is a statistical method used to find the line of best fit, in the form of an equation such as y = mx + b, for the given data.
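For a simple line y = mx + b, the least squares estimates have a closed form; the R sketch below (illustrative data) computes them directly and confirms that lm() gives the same answer:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
# closed-form least squares slope and intercept for y = m*x + b
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- mean(y) - m * mean(x)
c(slope = m, intercept = b)
coef(lm(y ~ x))  # matches the closed-form estimates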
Linear Methods
Linear methods, in the context of statistics and machine learning, refer to techniques that
involve linear relationships between variables. These methods assume that the
relationship between the input features and the output can be adequately described using a
linear model. Here are some common linear methods:
3. Lasso Regression: Similar to linear regression but includes a regularization term with
an L1 penalty.
6. Linear Support Vector Machines: Find a hyperplane that best separates data points
from different classes in a high-dimensional space.
Linear methods are often preferred due to their simplicity, interpretability, and efficiency. Regularization techniques like Ridge and Lasso regression are useful for handling multicollinearity and reducing overfitting when there are many predictors.
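As a brief sketch of how a Lasso fit might look in R, the glmnet package can be used; the package, simulated data, and penalty choice below are assumptions for illustration and are not part of the original notes:
library(glmnet)  # assumed to be installed
set.seed(3)
X <- matrix(rnorm(100 * 5), ncol = 5)        # five illustrative predictors
y <- 2 * X[, 1] - 1.5 * X[, 3] + rnorm(100)  # only two of them actually matter
lasso_fit <- cv.glmnet(X, y, alpha = 1)      # alpha = 1 applies the L1 (Lasso) penalty
coef(lasso_fit, s = 'lambda.min')            # several coefficients are shrunk to exactly zero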
Point Estimate
A point estimate in linear regression is the value predicted by the regression model for a new observation. It represents our best guess for the value of that observation, but it is unlikely to match it exactly. For this reason, when a regression model is used to make predictions on new observations, the point estimate is often accompanied by an interval. For example, instead of predicting that a new individual will be exactly 66.8 inches tall, we may create the following confidence interval:
95% Confidence Interval = [64.8 inches, 68.8 inches]
We would interpret this interval to mean that we’re 95% confident that the true height of this
individual is between 64.8 inches and 68.8 inches.
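In R, a point estimate together with such an interval can be obtained from a fitted model with predict(); the heights data frame below is a hypothetical stand-in for whatever data produced the interval above:
# hypothetical data: parent height predicting an individual's height (inches)
heights <- data.frame(parent = c(64, 66, 68, 70, 72, 65, 69, 71),
                      child = c(65, 66, 67, 70, 71, 64, 69, 72))
fit <- lm(child ~ parent, data = heights)
# point estimate plus a 95% interval for one new individual
predict(fit, newdata = data.frame(parent = 68), interval = 'prediction', level = 0.95)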
Variable selection
Variable selection is an important step in linear regression modeling. The goal of variable
selection is to choose a reduced number of explanatory variables that can describe the
response variable in a regression model. Adding too many variables can lead to overfitting,
which means that the model describes random error or noise instead of any underlying
relationship. Overfitted models generally have poor predictive performance on test data.
There are several methods for variable selection in linear regression, including Best Subset
Selection (BSS), Least Absolute Shrinkage and Selection Operator (Lasso), and Elastic Net
(Enet).
The choice of variable selection method depends on the specific problem and the goals of the
analysis.
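As one illustration, R's built-in step() function performs stepwise selection by AIC; the data frame below is made up purely for this sketch:
set.seed(4)
# illustrative data: only x1 and x2 truly influence y
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), x4 = rnorm(100))
dat$y <- 3 * dat$x1 - 2 * dat$x2 + rnorm(100)
full_model <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
step(full_model, direction = 'backward')  # drops predictors that do not improve AIC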
Salary dataset:
Years experienced    Salary
1.1                  39343.00
1.3                  46205.00
1.5                  37731.00
2.0                  43525.00
2.2                  39891.00
2.9                  56642.00
3.0                  60150.00
3.2                  54445.00
3.2                  64445.00
3.7                  57189.00
# Load the dataset into R (values taken from the table above)
dataset = data.frame(
  Years_Exp = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189))
Output: a scatter plot of Salary against Years_Exp.
Now, we have to find a line that fits the above scatter plot, through which we can predict any value of y (the response) for any value of x.
The line which best fits is called the Regression line.
install.packages('caTools')
library(caTools)
# split the data into training and test sets (the split ratio here is assumed)
split = sample.split(dataset$Salary, SplitRatio = 0.7)
trainingset = subset(dataset, split == TRUE)
testset = subset(dataset, split == FALSE)
lm.r = lm(formula = Salary ~ Years_Exp, data = trainingset)
summary(lm.r)
Output:
Call:
lm(formula = Salary ~ Years_Exp, data = trainingset)
Residuals:
1 2 3 5 6 8 10
463.1 5879.1 -4041.0 -6942.0 4748.0 381.9 -489.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30927 4877 6.341 0.00144 **
Years_Exp 7230 1983 3.645 0.01482 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4944 on 5 degrees of freedom
Multiple R-squared: 0.7266, Adjusted R-squared: 0.6719
F-statistic: 13.29 on 1 and 5 DF, p-value: 0.01482
i) Call: Using the “lm” function, we will be performing a regression analysis of “Salary”
against “Years_Exp” according to the formula displayed on this line.
ii)Residuals: Each residual in the “Residuals” section denotes the difference between the
actual salaries and predicted values. These values are unique to each observation in the data
set. For instance, observation 1 has a residual of 463.1.
iii)Coefficients: Linear regression coefficients are revealed within the contents of this
section.
iv)(Intercept): The estimated salary when Years_Exp is zero is 30927, which represents
the intercept for this case.
v)Years_Exp: For every year of experience gained, the expected salary is estimated to
increase by 7230 units according to the coefficient for “Years_Exp”. This coefficient value
suggests that each year of experience has a significant impact on the estimated salary.
vi) Estimate: The model's estimated coefficients can be found in this column.
vii) Std. Error: The standard error gauges the uncertainty associated with each coefficient estimate; smaller standard errors indicate more precise estimates.
viii) t value: The t-value measures how many standard errors the coefficient estimate is away from zero. It is used to test the null hypothesis that the coefficient is zero; a larger absolute t-value indicates stronger evidence that the coefficient is statistically significant.
ix)Pr(>|t|): This column provides the p-value associated with the t-value. The p-value
indicates the probability of observing the t-statistic (or more extreme) under the null
hypothesis that the coefficient is zero. In this case, the p-value for the intercept is 0.00144,
and for “Years_Exp,” it is 0.01482.
x)Signif. codes: These codes indicate the level of significance of the coefficients.
xi)Residual standard error: This is a measure of the variability of the residuals. In this
case, it’s 4944, which represents the typical difference between the actual salaries and the
predicted salaries.
xii)Multiple R-squared: R-squared (R²) is a measure of the goodness of fit of the model. It
represents the proportion of the variance in the dependent variable that is explained by the
independent variable(s). In this case, the R-squared is 0.7266, which means that
approximately 72.66% of the variation in salaries can be explained by years of experience.
xiii)Adjusted R-squared: The adjusted R-squared adjusts the R-squared value based on the
number of predictors in the model. It accounts for the complexity of the model. In this case,
the adjusted R-squared is 0.6719.
xiv)F-statistic: The F-statistic is used to test the overall significance of the model. In this
case, the F-statistic is 13.29 with 1 and 5 degrees of freedom, and the associated p-value is
0.01482. This p-value suggests that the model as a whole is statistically significant.
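These quantities can also be recomputed directly from the fitted object; a quick check in R, assuming the lm.r model and trainingset from the code above:
y <- trainingset$Salary
rss <- sum(residuals(lm.r)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares
r2 <- 1 - rss / tss             # multiple R-squared
n <- length(y); p <- 1          # observations and number of predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared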
In summary, this linear regression analysis suggests that there is a significant relationship
between years of experience (Years_Exp) and salary (Salary). The model explains
approximately 72.66% of the variance in salaries, and both the intercept and the coefficient
for “Years_Exp” are statistically significant at the 0.01 and 0.05 significance levels,
respectively.
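The predicted salaries printed below can be produced by applying the fitted model to the held-out observations; a minimal sketch, assuming the test set created by the earlier split is named testset:
# predict salaries for the test-set observations using the fitted model
predicted_salaries = predict(lm.r, newdata = testset)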
print(predicted_salaries)
Output:
1 2 3
65673.14 70227.40 74781.66
# Visualising the training set results (reconstructed; ggplot2 assumed loaded)
library(ggplot2)
ggplot() +
  geom_point(aes(x = trainingset$Years_Exp, y = trainingset$Salary), colour = 'red') +
  geom_line(aes(x = trainingset$Years_Exp, y = predict(lm.r, newdata = trainingset)), colour = 'blue') +
  xlab('Years of experience') +
  ylab('Salary')
Output: a scatter plot of the training data (red) with the fitted regression line (blue).
# Visualising the test set results (reconstructed; point and line aesthetics assumed)
ggplot() +
  geom_point(aes(x = testset$Years_Exp, y = testset$Salary),
             colour = 'red') +
  geom_line(aes(x = trainingset$Years_Exp, y = predict(lm.r, newdata = trainingset)),
            colour = 'blue') +
  xlab('Years of experience') +
  ylab('Salary')
Output: a scatter plot of the test data (red) with the regression line fitted on the training data (blue).