Linear Regression
In this session, you understood the importance of linear models and learnt about the most commonly used
linear model: linear regression. You began by understanding the internals of the algorithm and looking at a
basic implementation of linear regression using the NumPy library. Furthermore, you learnt how to build
regression models using the most commonly used libraries: statsmodels and scikit-learn. Here, you learnt
about the different steps involved in model building and understood how to fine-tune your model. In this
session, you covered the following topics:
● Introduction to Linear Models
● Linear Models in Practice
● Mathematics behind Linear Regression
● Linear Regression with NumPy
● Assumptions of Linear Regression
● Evaluation Metrics
● Linear Models as Benchmarks
Linearity is a property of the relationship between a pair of variables: the relationship is linear if it can be represented by a straight line,
y = mx + c
In this segment, you learnt how to represent a linear relationship among more than two variables. Consider the following relationship:
y = β0 + β1x1 + β2x2 + ... + βnxn
Here, y is the dependent variable (output variable) and x1, x2, ..., xn are the independent variables (predictor variables). In this segment, you also learnt about overfitting, a common phenomenon in machine learning that occurs when a model becomes too specific to the data on which it is trained and fails to generalise to other unseen data points. A model that has become too specific to a training data set has ‘learnt’ not only the hidden patterns in the data but also the noise and inconsistencies in it. Note that noise is something that a model cannot truly learn, as it is random and does not follow a pattern. In a typical case of overfitting, the model performs quite well on the training data set but quite poorly on the test data set.
Unfortunately, in the real world, the data points may not fit any straight line perfectly, but they can be approximated by a line, as shown in the plot given below.
Two-point form: If the two points are (x1, y1) and (x2, y2), then the equation of the line is as follows:
y − y1 = ((y2 − y1) / (x2 − x1)) (x − x1)
In machine learning, you will mostly be dealing with the third criterion, wherein the number of equations, or,
in other words, the number of data points, is greater than the number of unknowns. There would be no
unique solution in this case. All you need to do is try to find the best line that would take the values x1, x2,
..., xn and give a value of y that is closest to the actual value.
An error, or a residual, is the difference between the actual value and the predicted value. For example, if y = mx is the final assumed line, then for a given point (x1, y1), the associated error would be y1 − mx1.
This overall error is also known as the sum of squared errors (SSE) or the residual sum of squares (RSS). It is given by Σ(yi − mxi)², and the objective is to minimise it. Assuming the line y = mx best represents the set of points (x1, y1), (x2, y2), ..., (xn, yn), you solve for the value of m that minimises the SSE. This yields an analytical solution: the line with slope
m = (xᵀy) / (xᵀx)
is the ideal line in this case, where x and y are the vectors of input and output values.
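As a quick illustration, the following NumPy sketch (with made-up sample values) computes this slope and the corresponding SSE:

# Minimal sketch: least-squares slope for a line through the origin, m = (x^T y) / (x^T x)
# The sample values below are purely illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

m = (x @ y) / (x @ x)               # analytical slope that minimises the SSE
sse = np.sum((y - m * x) ** 2)      # residual sum of squares for this slope
print(m, sse)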
For the generalised case, wherein you have n features (number of rooms, size, floor, etc.), you can represent the relationship in matrix form as
y = Xβ
The objective is to find β. Note that β in the equation y = Xβ is a vector of coefficients, β = [β0, β1, ..., βn]ᵀ. Here, you consider the cost function, which is nothing but the sum of squared errors, and minimise it to find the vector β. The solution for β is analytical and is given by
β = (XᵀX)⁻¹Xᵀy
After finding β using the given values of xi, you can compute the predicted values as ŷ = Xβ.
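A minimal NumPy sketch of this analytical solution, using made-up data, is given below; np.linalg.inv implements the matrix inverse in the formula.

# Sketch of beta = (X^T X)^(-1) X^T y on illustrative data
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 7.0]])                      # first column of ones plays the role of the intercept
y = np.array([3.1, 4.9, 9.2, 12.8])

beta = np.linalg.inv(X.T @ X) @ (X.T @ y)       # analytical least-squares coefficients
y_hat = X @ beta                                # predicted values
print(beta, y_hat)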
The assumptions that are made while fitting a linear model between X and Y are as follows:
1. A linear relationship between X and Y: X and Y should display some form of linear relationship (as shown in the plots given below); otherwise, it is fruitless to fit a linear model between them.
2. Independent error terms: The error terms should not be dependent on one another (as shown in the plots given below).
3. Normally distributed error terms: The error terms should follow a normal distribution centred at zero.
4. Constant variance of error terms (homoscedasticity): The spread of the error terms should not change with the value of X.
5. Linearly independent predictor columns: All the columns of the predictor matrix X should be linearly independent. This is essential because if any dependency exists between the features (columns of X), then XᵀX becomes singular (its determinant is 0), which makes its inverse undefined. Hence, there should be no dependency among the features. The existence of such dependencies is known as multicollinearity. Multicollinearity occurs when the input data set contains predictor (independent) variables that are related to one another. In simple terms, in a model that has been built using several independent variables, some of the variables might be interrelated and, therefore, redundant in the model.
To understand how well the model represents the data and how it performs on unseen data, you need to understand the plot given below. Here, (xi, yi) represents a data point, and ŷi represents the value predicted for xi. The sum of squared errors (SSE), or the residual sum of squares (RSS), quantifies the variance of the residuals:
RSS = Σ(yi − ŷi)²
The strength of the fitted line is judged with reference to this RSS because it is the value/objective function that you need to minimise. Now, the most simplistic, or trivial, linear regression model that you can build on a set of data points is the horizontal line that passes through the mean of all the target values, i.e., ŷ = ȳ.
Hence, the total sum of squares (TSS), i.e., the maximum variation in the data, equals Σ(yi − ȳ)².
Therefore, one of the most commonly used evaluation metrics, R², is defined as follows:
R² = 1 − RSS / TSS
R² is a reasonable go-to metric for evaluating a regression model, although there are certain issues with using it: the number of predictor variables associated with y should also be taken into account. This is why the adjusted R-squared value is defined as follows:
Adjusted R² = 1 − (1 − R²)(N − 1) / (N − p − 1)
where
N = number of points in the data and
p = number of independent predictor variables, excluding the constant.
Important: Note that the r2 score, R² and R-squared are the same and can be used interchangeably.
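As a hedged sketch, the two metrics can be computed from a vector of actual values y and predictions y_hat as follows (p is the number of predictors, excluding the constant; the function name is my own):

# Sketch: R-squared and adjusted R-squared from predictions
import numpy as np

def r2_and_adjusted_r2(y, y_hat, p):
    rss = np.sum((y - y_hat) ** 2)                 # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)              # total sum of squares
    n = len(y)
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj_r2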
Linear models are the simplest models to build, and they serve as a useful benchmark: any non-linear model that you build should perform better than a linear model to justify its added complexity. Linearity is a simple, naïve approximation of the data.
In this segment, you learnt how to build a linear model using the scikit-learn API. The first step in building a
linear regression model is to create an instance of the linear regression class by defining an object as
shown below.
lr = LinearRegression()
The next step is to build the model using the .fit() method and specifying the independent and dependent
variables as follows:
lr.fit(x, y)
Here, x is the input/independent variable and y is the output/dependent variable. The .fit() method is used
for building a linear regression model. Now, the lr variable has all the information pertaining to the model.
Using the lr.predict() method, you can predict the values of y for given values of x. lr.coef_ and lr.intercept_
return the coefficients and the intercept of the regression model, respectively.
help(lr) is used to learn more about the other parameters that are defined in the function.
Note that the .fit() method expects the X parameter to be a 2D array, not a 1D array. In order to convert a 1D array into a 2D array, you need to use .reshape(-1, 1). Calling lr.fit(x, y) with a 1D x, without reshaping it, raises an error asking you to reshape your data.
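A small sketch of the fix, assuming x is a 1D NumPy array holding a single feature:

# reshape(-1, 1) turns the 1D feature array into a single-column 2D array
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0])      # illustrative values
y = np.array([2.0, 4.1, 6.2, 7.9])

lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y)             # x.reshape(-1, 1) has shape (4, 1)
print(lr.coef_, lr.intercept_)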
You can read more about this in the link provided below.
Additional Link
Stack Overflow thread explaining the significance of -1 in numpy reshape
Note: The data set used in the demonstration comes loaded with the scikit-learn library, and to use it, you
need to import the sklearn datasets first. Using .load_boston(), you can directly load the data set necessary
for the demonstration.
The steps to create a dataframe have been shown in the code below.
# Importing pandas
import pandas as pd
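A possible completion of this step (assuming an older scikit-learn version that still ships load_boston, and using 'value' as the target column name):

# Building a pandas dataframe from the Boston Housing data set
import pandas as pd
from sklearn import datasets

boston = datasets.load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['value'] = boston.target             # target column name assumed here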
As in the earlier case of simple linear regression, to build a multiple linear regression model, you need to
define the X and Y parameters of the model.
Note: X and Y are not data frames but arrays. It is important to remember this whenever you are building a
model.
To determine the performance of this model, you first need to import the necessary metrics from the sklearn library as shown below:
from sklearn.metrics import mean_squared_error, r2_score
After importing the metrics, you need to call the mean_squared_error and r2_score functions by passing the necessary parameters. The mean squared error is the sum of the squares of the residuals divided by the total number of data points, or, in other words, the number of y values:
MSE = (1/n) Σ(yi − ŷi)²
r2_score(y, yhat)
Here, y holds the actual values and yhat (ŷ) holds the predictions made by the model. r2_score, or the R-squared metric, explains how much of the variance of y is explained by the model. On executing the code, you found that the model explained 74% of the variance of y, which means it is performing well.
It is extremely important to rescale variables so that they have comparable scales. If you do not have
comparable scales, then some of the coefficients obtained by fitting the regression model might be very
large or very small as compared with the others.
So, if you observe the scales of the values in the Boston Housing data set, then you will observe that the
scale of TAX is in the range of 100s, whereas those of DIS and NOX are in the range of 10s, as shown in
the table given below.
You can rescale this data using the min-max scaler feature transformer of the sklearn library. First, you need to import MinMaxScaler from the sklearn preprocessing module. Next, using the .fit_transform() method, you can transform the input data set into a scaled data set as follows:
# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables
num_vars = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'value']
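A hedged sketch of the scaling step, assuming df is the dataframe built earlier and num_vars is the list of columns defined above:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[num_vars] = scaler.fit_transform(df[num_vars])   # rescales each listed column to the [0, 1] range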
In the unscaled data set, you will notice that some features are associated with high-value coefficients,
whereas some others have coefficients of almost negligible values. A coefficient explains the effect of a
particular independent variable on the dependent variable when all other variables remain constant.
In this segment, you learnt about cumulative explained variance. Consider the following code, where the explained_variance_score is used to measure the proportion of variance explained as the input columns are added one at a time. Note that the data used belongs to the Boston Housing data set.
# Assumes 'boston' has already been loaded via datasets.load_boston()
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score

y = boston.target
X = boston.data
lr = LinearRegression()
variances = []
for i in range(X.shape[1]):
    xx = X[:, :(i + 1)]                 # use only the first (i + 1) features
    lr.fit(xx, y)
    variances.append(explained_variance_score(y, lr.predict(xx)))
plt.plot(variances, 'go-')
plt.xlabel('Number of features')
plt.ylabel('Explained Variance')
plt.grid()
plt.show()
On executing the code, you will get the plot shown below. In this plot, the explained variance gradually increases as the features are added one by one.
One major observation is that as you add to the list of features, the R2 score increases. In order to maintain
a balance between keeping the model simple and explaining the highest variance (which means that
you would want to keep as many variables as possible), you need to penalise a model for keeping a large
number of predictor variables.
Hence, you should measure the following two evaluation metrics in this scenario: the R-squared and the adjusted R-squared. In the adjusted R-squared formula, n is the sample size, which represents the number of rows in the data set, and p is the number of predictor variables. Unfortunately, the scikit-learn library does not provide a direct implementation of the adjusted R-squared; however, the statsmodels package provides these metrics as part of its summary statistics.
In this segment, you learnt how to perform hypothesis testing on the estimated values of the coefficients,
i.e., the betas. This, in turn, determines which independent variables are significant, so you can use them in
your model.
Suppose you have a data set whose scatter plot looks like this:
You always fit a line through the data by applying linear regression using the least-square method.
However, as you can see, the fitted line in the diagram given above is of no use in this case. Hence,
hypothesis testing can be used to test whether or not the fitted line is a significant one.
Consider the simple model y = β1x + β0. In order to conduct hypothesis testing, you need to propose the null hypothesis that β1 is 0; the alternative hypothesis, thus, becomes that β1 is not 0. The hypotheses are as follows:
H0: β1 = 0
Ha: β1 ≠ 0
Now, if you reject the null hypothesis, then it would mean that β1 is not 0 and the fitted line is significant.
● You need to compute the t-score (which is similar to the Z-score), which is given by
t = (X̄ − μ) / (s / √n)
where μ is the population mean and s is the sample standard deviation; s divided by √n is also known as the standard error.
● Since the null hypothesis is that β1 is equal to 0, the t-score for the estimated coefficient β̂1 is
t = β̂1 / SE(β̂1)
where SE(β̂1) is the standard error of β̂1.
Considering a significance level of 0.05, if the p-value turns out to be less than 0.05, then you can reject the
null hypothesis and state that β1 is indeed significant.
Remember to use the command 'add_constant' so that statsmodels also fits an intercept. If you do not use
this command, then it will fit a line passing through the origin by default.
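A minimal statsmodels sketch of this workflow, assuming X is a DataFrame of predictors and y is the target series:

import statsmodels.api as sm

X_sm = sm.add_constant(X)            # adds the intercept column; without it, the line passes through the origin
model = sm.OLS(y, X_sm).fit()
print(model.summary())               # reports coefficients, p-values, R-squared, adjusted R-squared and the F-statistic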
Summary Statistics
Let’s understand summary statistics with the help of the table given below.
F-statistic
The heuristic for the F-statistic is similar to what you learnt in the normal p-value calculation. If 'Prob (F-statistic)' is less than 0.05, then you can conclude that the overall model fit is significant. If it is greater than 0.05, then you will need to review your model, as the fit might have occurred by chance, i.e., the line may have fit the data by luck. In the image given above, you can see that the p-value of the F-statistic is 1.52e-52, which is practically zero. This means that the model for which it was calculated is definitely significant, since the p-value of the F-statistic is less than 0.05.
R-squared
In the previous diagram, the R-squared value is about 0.741, indicating the model is able to explain 74% of
the variance, which is quite good.
The image given below shows the warnings that appear along with the summary statistics.
The second point in the image given above highlights the existence of multicollinearity. Multicollinearity
occurs when the input data set contains predictor (independent) variables that are related to one another.
The variance inflation factor (VIF) quantifies this effect:
VIF_i = 1 / (1 − R_i²)
Here, i refers to the i-th independent variable, which is expressed as a linear combination of the rest of the independent variables, and R_i² is the R² score obtained when a linear regression model is fit for variable i against the other independent variables.
Using the VIF values, you can identify the features that best define the model and remove the features that are redundant in the data set. The VIF can be calculated using the variance_inflation_factor function of the statsmodels library.
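A hedged sketch of this calculation, assuming X is a pandas DataFrame containing only the predictor columns:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif.sort_values('VIF', ascending=False))      # features with the highest VIFs are the most collinear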
To find an optimal model, you could try all possible combinations of the independent variables and check which model fits best. However, this method is time-consuming and infeasible. Therefore, you need another method to find a decent model. This is where the manual feature elimination method comes into the picture, wherein you build a model with all the features, inspect the p-values and the VIFs, drop the least useful feature (insignificant or highly collinear), and rebuild the model, repeating this until all the remaining features are acceptable.
Note that you should never drop all the insignificant variables and the variables with high VIFs in a single
step. You must do it sequentially, as there is interdependence among the variables and you would not want
to drop an important variable on the basis of the initial result only.
As per the general norm, you use automated approaches such as RFE (recursive feature elimination) or
regularisation. When you are down to a few features, you can start looking at the p-values and the VIF, and
proceed with a manual improvement by considering the business requirements and other validity checks.
In this segment, you learnt about the most basic method of feature selection: variance thresholding.
Variance thresholding involves removing all the features that show extremely low variance, or, in other
words, columns whose values remain more or less the same for the data points in the data set. You should
be careful while using a thresholding value, as it may eliminate important features in the model.
Consider the features and their individual variances given in the image below.
# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables
num_vars = ['CRIM', 'ZN', 'INDUS', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'price']
df[num_vars] = scaler.fit_transform(df[num_vars])
As you can observe in the image given below, the variances in NOX and RM are not as insignificant as
shown in the earlier plot. This is because you have scaled your data and brought all the features into a
comparable range. It is essential to do this when your data involves features of different scales.
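A hedged sketch of variance thresholding on the scaled columns; the threshold value of 0.01 is illustrative, not prescribed by the text:

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)              # drop columns whose variance is below 0.01
reduced = selector.fit_transform(df[num_vars])
kept_columns = [c for c, keep in zip(num_vars, selector.get_support()) if keep]
print(kept_columns)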
In this segment, you learnt about the SelectKBest feature estimator of the sklearn library, which is used for feature selection. SelectKBest scores each input feature against the output feature (for instance, using a correlation-based score) and selects the k features with the highest scores. You can directly use the SelectKBest class of the sklearn library.
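A hedged sketch of its usage, assuming X and y are already defined and choosing k = 5 purely for illustration:

from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=5)      # f_regression scores features for a regression target
X_new = selector.fit_transform(X, y)
print(selector.get_support())                             # boolean mask of the selected features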
Recursive feature elimination (RFE) is one of the most widely used approaches for feature selection. It is a backward stepwise approach, wherein all the features are initially considered for model building and, after each iteration, the least significant feature is identified and dropped. You can use the RFE class from the sklearn library to apply this technique.
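A hedged sketch of RFE, again assuming X and y are defined; retaining 5 features is an illustrative choice:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_, rfe.ranking_)        # selected features and the order in which the rest were eliminated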
In this session, you first explored the Spark ML library API and then learnt how to perform basic exploratory
data analysis on a data set. You also learnt about certain components such as feature transformers and
feature estimator pipelines, and their usage while writing your code in PySpark. Finally, you learnt how to
build regression and classification models.
In the previous course, you learnt that, in the case of linear regression, the entire algorithm comes down to a simple matrix computation of β, where
β = (XᵀX)⁻¹XᵀY
You can see that the matrix given above is composed of the following four operations:
1. Multiplication X.T and X,
2. Inverse of X.T * X,
3. Multiplication X.T and Y, and
4. Multiplication of inv(X.T * X) and (X.T * Y).
All the operations above are multiplications, which can be easily parallelized in Spark, and, hence, most of
the machine learning algorithms are parallelizable in Spark, as they are usually a series of matrix
operations.
The read() method is available through the SparkSession class and it also supports various optional
methods for indicating the header and schema. By setting the header to ‘true’, Spark treats the first row in
the DataFrame as the header and the rest of the rows as samples. Similarly, by setting ‘inferschema’ to
‘true’, Spark automatically infers the schema of the data set.
The code for creating a Spark session and reading data from a csv file is given below.
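A minimal sketch of this step; the application name and file path below are placeholders, not taken from the original demonstration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('LinearRegressionDemo').getOrCreate()
df1 = spark.read.csv('path/to/data.csv', header=True, inferSchema=True)   # first row treated as the header, schema inferred
df1.printSchema()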
As the next step in data cleaning, you can either remove records containing incomplete or garbage values,
or replace missing values with approximate values. Generally, the median or the mean of the complete
column variable serves as a good approximate value. Removing records often leads to loss of some
valuable information and so, you may want to impute those values, instead. In this segment, you learnt
about the following two methods for handling missing values:
1. In the first method, you remove records with missing values by using the dropna() method (equivalently, df.na.drop()) available in Spark. This method drops all the rows that contain a missing value.
df2 = df1.dropna()
2. In the second method, you can replace missing values with the mean of their respective features
using the Imputer() transformer that is present in the Spark ML library. It is an extension of the
transformer class.
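A hedged sketch of the Imputer approach; the column names are placeholders:

from pyspark.ml.feature import Imputer

imputer = Imputer(inputCols=['col_a', 'col_b'],
                  outputCols=['col_a_imputed', 'col_b_imputed'],
                  strategy='mean')                   # replace missing values with the column mean
df2 = imputer.fit(df1).transform(df1)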
A feature transformer transforms the data stored in a data frame and stores the data back as a new data
frame. This transformation generally takes place by appending one or more columns to the existing data
frame. It can be broken down into the following sequence: DataFrame = [transform] = > DataFrame.
The VectorAssembler() transformer is a feature transformer that takes a set of individual column features
as input and returns a vector that contains all the column features. It is an extension of the transformer
class and supports the .transform() method.
It is a common practice to scale all the data variables within the range [0, 1]. You can perform scaling using transformers such as MaxAbsScaler(), MinMaxScaler(), etc. To perform scaling, you need to create a scaler object and then fit it to obtain a scaler model. This model will then transform any input DataFrame into a scaled DataFrame.
assembler = VectorAssembler(
    inputCols=[...],           # list of the input feature column names
    outputCol='features')      # name of the new column holding the assembled feature vector
output = assembler.transform(df)
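Following on from the assembled DataFrame, a hedged sketch of the scaling step (the column names are assumptions):

from pyspark.ml.feature import MinMaxScaler

scaler = MinMaxScaler(inputCol='features', outputCol='scaled_features')
scaler_model = scaler.fit(output)            # 'output' is the assembled DataFrame from above
scaled_df = scaler_model.transform(output)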
Instead of executing a transform method at each data preparation step, a pipeline clubs all the data preparation steps, such as cleansing, scaling, normalising, etc., in the desired sequence. By creating a pipeline, you avoid invoking every step separately and make your code cleaner and more efficient. Also, once you have designed a pipeline with all the required steps, it can be reused for various data sets without the need to severely alter the code for each data set.
A pipeline can be built by declaring the ‘Pipeline’ object available in the Spark ML library. Furthermore, you need to build a PipelineModel by fitting the pipeline object on the data. Unlike the steps involved in the previous segments, the PipelineModel takes only one DataFrame as input and outputs the final prepared DataFrame.
In short, a Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in the specified order, and the input DataFrame is transformed as it passes through each stage. As you learnt previously, you do not have to create and hold intermediate DataFrames at each stage yourself.
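A hedged sketch of a pipeline chaining the stages discussed above (the stage objects are assumed to exist already):

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[imputer, assembler, scaler])   # stages run in the given order
pipeline_model = pipeline.fit(df1)                         # fitting the pipeline produces a PipelineModel
prepared_df = pipeline_model.transform(df1)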
Machine learning models require input in the form of a vector. If this criterion is not satisfied, then you will
get an error saying ‘A linear regression object expects an input features column’.
Using VectorAssembler, a feature transformer, you can assemble the features in the form of a vector as
shown below.
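A hedged sketch of this assembly step; the input column names below are an assumed subset of the Boston Housing features:

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['CRIM', 'ZN', 'INDUS', 'RM', 'AGE', 'DIS', 'TAX'],
                            outputCol='features')          # the vector column expected by LinearRegression
dataset = assembler.transform(df)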
After assembling the features to form a new features column, you can build a linear regression model
by specifying the two columns: featuresCol and labelCol as shown in the code snippet below.
lr = LinearRegression(featuresCol='features', labelCol='price')
model = lr.fit(dataset)
You can find the r2 score associated with the model by using model.evaluate(). You will find that the model gives the same r2 score as the one given by the statsmodels API and the scikit-learn library.
You can convert a column of string values in your data frame to a column of numeric values using the StringIndexer transformer. It assigns a numeric index to each distinct string value based on the frequency of that string, with the most frequent string receiving index 0. For example, consider an input column with the following strings:
String
High
Low
High
High
Low
Medium
You can use StringIndexer to convert these strings into indices. You can refer to the table given below to understand this conversion.
String | Frequency | Index
High   | 3 | 0
Low    | 2 | 1
High   | 3 | 0
High   | 3 | 0
Low    | 2 | 1
Medium | 1 | 2
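A hedged sketch of this conversion, assuming the column is named 'String':

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='String', outputCol='String_indexed')
indexed_df = indexer.fit(df).transform(df)   # the most frequent value ('High') receives index 0.0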
Finally, by using the .transform() method, you can transform the input data frame to add the new indexed column. Once you have applied the StringIndexer feature transformer to the output column, you can build a logistic regression model on top of it by specifying the input (features) column and the output (label) column.
In order to evaluate a model, you can perform a train–test split and check how the model performs on
unseen data. Using the .evaluate() method on the test data set, you can print the summary statistics of the
model.
Note: The implementation of the Naive Bayes classifier is similar to that of the logistic regression classifier; the only difference is the use of the MulticlassClassificationEvaluator.
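A hedged sketch of fitting and evaluating a classifier on a train–test split; the column names 'features' and 'label' are assumptions:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train_df, test_df = indexed_df.randomSplit([0.7, 0.3], seed=42)
log_reg = LogisticRegression(featuresCol='features', labelCol='label')
log_reg_model = log_reg.fit(train_df)

predictions = log_reg_model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(labelCol='label', metricName='accuracy')
print(evaluator.evaluate(predictions))       # accuracy on the unseen test data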
You can perform the following steps to fit a linear regression model (the placeholders marked None in the skeleton below are to be filled in; a possible completion is sketched after the skeleton):
# Step 1
diabetes = datasets.load_diabetes()
df = None
assembler = None
dataset = None
# Step 2
lr = None
model = None
summary = None
print(summary.r2, summary.explainedVariance)
Step 3: Find a subset of features with the highest absolute coefficients (by plotting)
Step 4: Train a new model on this subset, and find the R2 score and explained variance
subset = None
assembler = None
small_dataset = None
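One possible completion of the skeleton above; every name in it is an assumption, not the course's reference solution:

import pandas as pd
from sklearn import datasets
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Step 1: load the diabetes data and assemble the features into a single vector column
diabetes = datasets.load_diabetes()
pdf = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
pdf['target'] = diabetes.target
df = spark.createDataFrame(pdf)
assembler = VectorAssembler(inputCols=list(diabetes.feature_names), outputCol='features')
dataset = assembler.transform(df)

# Step 2: fit a linear regression model and inspect its training summary
lr = LinearRegression(featuresCol='features', labelCol='target')
model = lr.fit(dataset)
summary = model.summary
print(summary.r2, summary.explainedVariance)

# Steps 3 and 4 would then pick the features with the largest absolute values in
# model.coefficients, assemble only that subset, refit, and compare the two summaries.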
In cross-validation, you begin by splitting the input data set into training and testing data sets. This is done to understand the performance of your model on an unseen data set. The heuristic is to do either a 70–30% or an 80–20% split of the input data set into training and test data sets. Once you have the split, you can build models on the training data set and check their performance on the testing data set.
Cross-validation helps you identify issues that are present in your model or in the data set, such as overfitting. Setting the seed value makes the split reproducible and reduces the random behaviour; even so, a particular split may show enough variability that the training R2 score turns out lower than the test R2 score, or vice versa. This can happen when the data set is ill-conditioned, meaning that it has no underlying pattern that can be modelled.
As you have learnt, any data set that you come across in real life will have some noise. Also, any model that you create is an attempt to approximate the true underlying function. Now, assuming that the true function that fits the data set is denoted by f(x) and the model that you built is denoted by f′(x), the relationship between the response values, Y (which are, essentially, the data points that are available for modelling), and the predictor variables, X, is given by:
Y = f(X) + ϵ,
where ϵ is the irreducible error (noise).
Hence, the mean square error (MSE) on an unknown data set is the expected value of (Y − f′(X))². It decomposes into three terms: the first term is the (squared) bias of the model, the second term is the variance of f′(x), the model used for prediction, and the third term is the variance of the irreducible error. Thus, we can write the MSE as follows:
MSE = E[(Y − f′(X))²] = [Bias(f′(X))]² + Var(f′(X)) + Var(ϵ)
As you can see in the equations given above, for the same MSE, if the bias decreases, the variance
increases, and vice versa. Hence, there is a trade-off between bias and variance in the process of building
a machine learning model.
Bias
A bias error is the difference between the predicted value and the true value of the data. It is important to
understand that the predicted value depends on the assumptions made by the model. Hence, if the model
makes a large number of assumptions about the data (like it does in the case of linear regression), then the
bias error will be high, and vice versa.
Note that due to the presence of the irreducible error, the true value may differ from the actual value
available to you. However, for the purpose of evaluation of your model, you can consider the training error
as a representative of the bias error. Hence, if the training error is high, then there is high bias, and vice
versa.
Bias quantifies how accurate the model is likely to be on the future/test data. Extremely simple models are
likely to fail in predicting complex real-world phenomena. Hence, simplicity has its own disadvantages in
machine learning. Refer to the model given below as an example.
High bias suggests that the model has been built by making many assumptions, thus simplifying and
making the algorithm less complex. Linear regression is an example of a high-bias algorithm.
Variance
A variance error is an error that is generated when you pay too much attention to the training data.
The ‘variance’ of a model is the variance in its output on the same test data with respect to the changes to
the training data. In other words, variance refers to the degree of changes to the model itself with respect to
the changes in the training data.
For example, consider the model (given below) that memorises the entire training data set. Even if you
made a small change to the data set, the model would change drastically. The model is, therefore, unstable
and is sensitive to the changes to the training data, and this is called high variance. Also, this has
happened because the model has also modelled the irreducible error, which cannot be modelled.
As you can see, the model has been fitted perfectly on the training data; however, it poorly fits the test data.
Hence, the testing error is often considered a representation of the model variance.
Since this model is found to capture the unwanted noise within the training data, and is overly sensitive to it,
such a model is said to be overfitting. Overfitting is a phenomenon where a model becomes too specific to
the data on which it is trained, and it fails to generalise to other unseen data points in the larger domain. An
example of a high-variance algorithm is decision trees.
In this case study, you learnt how to solve a practical problem using the concepts that you have learnt so
far. For this, you used the data set of New York City taxi fares. The objective of this case study is to build a
pricing model wherein the fare for a particular ride is predicted based on a set of attributes, such as
date-time, and the coordinates for pickup and drop-off.
1. Data Exploration
The first part of this case study was data exploration, wherein you performed the following steps:
● First, you invoked the Spark session context and read the data from S3.
● Then you imported the VectorAssembler and LinearRegression functions from the MLlib
library. Next, you created a linear regression model by providing the input and the output
features.
● On fitting the model on the data set, you created a summary instance using the evaluate()
function.
Note: At this point, the r2 value works out to be 0.0002, meaning there is very little correlation
between the target variable and the independent features. Hence, you need to explore and
transform the data to get a better fit.
2. Outlier Treatment
You identified many outliers in the data set that might have led to a low r2 score, as linear
regression is quite sensitive to outliers. Here, you performed the following steps:
● You performed a simple Google search and found that New York City lies between 73 and
75 degrees West, and 40 and 42 degrees North. Using this information, you filtered out the
values in the pickup and drop off locations that lie outside the given coordinates of latitude
and longitude.
● On exploring the data further, you found that certain rides had zero passengers. Since these
rides must not be considered for building the pricing model, you filtered them out and
considered only those rides that had at least one passenger.
● You also found that the pricing data consisted of some rows with negative fare values.
Hence, you removed these rows and considered only those rows that had a fare value
greater than 0.
● Once all the data was cleaned, you saved the new data frame and fit the linear regression model.
● You computed the new r2 value as 0.25. However, this value is still quite low, and the data set needs to be explored further.
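A hedged sketch of the filtering described above; the column names are assumptions based on the public New York City taxi fares data, not taken from the original notebook, and df denotes the raw data frame:

from pyspark.sql.functions import col

clean_df = (df.filter((col('pickup_longitude') >= -75) & (col('pickup_longitude') <= -73))
              .filter((col('dropoff_longitude') >= -75) & (col('dropoff_longitude') <= -73))
              .filter((col('pickup_latitude') >= 40) & (col('pickup_latitude') <= 42))
              .filter((col('dropoff_latitude') >= 40) & (col('dropoff_latitude') <= 42))
              .filter(col('passenger_count') >= 1)        # keep rides with at least one passenger
              .filter(col('fare_amount') > 0))            # drop rows with non-positive fares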
3. Feature Engineering
Here, you extracted some additional features from the data set to improve the r2 score of the model. These features are as follows (a PySpark sketch of these transformations is given after this list):
● The date-time feature is modified by removing the ‘UTC’ substring and converting it to the
timestamp data type for further analysis.
● An interesting point to note is that all the timestamps correspond to the UTC timestamps,
which differ from the New York timestamps by 5 hours. Hence, you used the Spark API to
convert the timestamp to the EST time and saved it in a new column.
● Using the pyspark.sql.functions, you extracted the year, month, day and hour, and added
these columns to the data frame.
● The abs function is imported from the PySpark API to compute the L1-norm distance between two latitudes or two longitudes. Using this API, the horizontal and vertical distances are computed from the coordinates of the pickup and drop-off locations.
● Using the ‘l1’ column as an additional feature, you fit the regression model and calculated
the r2 score to be 0.679.
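A hedged sketch of these transformations; the column names are assumptions based on the public taxi fares data:

from pyspark.sql.functions import (col, regexp_replace, to_timestamp, from_utc_timestamp,
                                   year, month, dayofmonth, hour, abs as sql_abs)

fe_df = (clean_df
         .withColumn('pickup_ts', to_timestamp(regexp_replace('pickup_datetime', ' UTC', '')))
         .withColumn('pickup_est', from_utc_timestamp('pickup_ts', 'EST'))      # shift UTC to New York time
         .withColumn('year', year('pickup_est'))
         .withColumn('month', month('pickup_est'))
         .withColumn('day', dayofmonth('pickup_est'))
         .withColumn('hour', hour('pickup_est'))
         .withColumn('l1', sql_abs(col('dropoff_longitude') - col('pickup_longitude')) +
                           sql_abs(col('dropoff_latitude') - col('pickup_latitude'))))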
4. Model Building and Evaluation
In this part, you split the data set into a training and a test data set. You trained the model on the training data and evaluated it on the test data. The steps that you performed are as follows:
● You began by creating a train–test split of the data set. For this, you used the randomSplit function and performed a 2:1 split. Then you stored the resulting data frames as ‘trainDataset’ and ‘testDataset’.
● Using the features that you created after cleaning and modifying the data, you fit the training
data using the LinearRegression library.
● You computed the summary by using the model's .evaluate() function on the test data set.
● You computed the r2 value as 0.681, which is similar to the r2 value that was computed on
the entire data set. This indicates that the model was not overfitting the train data.