7. Machine Learning - Develop machine learning model - Regression

The document outlines the process of developing a machine learning model, focusing on regression techniques to predict continuous target variables like vehicle prices. It discusses various regression methods, including Linear Regression, Random Forest Regressor, Decision Tree Regressor, and Support Vector Regressor, along with evaluation metrics such as RMSE and R-squared. Additionally, it covers the importance of data preparation, model training, and testing, as well as the advantages and applications of different regression models.

Develop machine learning model

 Before developing a machine learning model:


 Define the target variable

 You split your dataset into training and testing sets using the train_test_split function from Scikit-
learn.
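A minimal sketch of this split, assuming a pandas DataFrame df that already contains the cleaned data and a selling_price column (the variable names are assumptions for illustration):

from sklearn.model_selection import train_test_split

X = df.drop(columns=["selling_price"])   # features (independent variables)
y = df["selling_price"]                  # target variable (dependent variable)

# 80% of the rows go to training, 20% to testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)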
Develop machine learning model

 Why Do We Use random_state?


 Consistent reproducibility
 Collaboration
 Etc.
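A small illustration of the reproducibility point: with the same random_state the split is identical on every run (a sketch with toy data):

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10).reshape(-1, 1)

# Same random_state -> the two splits contain exactly the same rows.
a_train, a_test = train_test_split(data, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.3, random_state=42)
print((a_test == b_test).all())   # True: reproducible split

# Omitting random_state (or changing it) generally gives a different split each run.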
Develop machine learning model

 Model Building
 Since our target variable (selling_price) is continuous, we will use regression
techniques, i.e.:
 Linear Regression,
 Random Forest Regressor,
 Decision Tree Regressor, and
 Support Vector Regressor (SVR).
 We train each model using the training data and then test how well they could predict vehicle
prices using both the training and testing data.
 We used metrics like Root Mean Squared Error (RMSE) and R-squared values to see how
accurate each model is.
 This helps us understand which method works best for predicting vehicle prices accurately.
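A hedged sketch of this workflow, assuming the X_train/X_test/y_train/y_test split from the earlier sketch and scikit-learn 1.4+ for root_mean_squared_error (older versions use mean_squared_error(..., squared=False)):

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import root_mean_squared_error, r2_score

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Support Vector Regressor": SVR(),
}

for name, model in models.items():
    model.fit(X_train, y_train)                      # train on the training data
    for split, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_)
        rmse = root_mean_squared_error(y_, pred)     # error in the target's units
        r2 = r2_score(y_, pred)                      # closer to 1 = better fit
        print(f"{name} ({split}): RMSE={rmse:.2f}, R2={r2:.3f}")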
Develop machine learning model

Why do we use metrics like Root Mean Squared Error (RMSE) and R-squared to see how accurate each model is?
Develop machine learning model

Why is RMSE used?


 Measures Prediction Error.
 RMSE represents the average deviation of the model's predictions from the actual values.
 Smaller RMSE values indicate better model performance.
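For reference, RMSE is the square root of the mean squared difference between predictions and actual values. A tiny check with made-up numbers (the values are hypothetical; root_mean_squared_error assumes scikit-learn 1.4+):

import numpy as np
from sklearn.metrics import root_mean_squared_error

y_true = np.array([250, 300, 410])    # actual prices (hypothetical)
y_pred = np.array([260, 290, 400])    # model predictions (hypothetical)

rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse_manual)                                  # 10.0
print(root_mean_squared_error(y_true, y_pred))      # same value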
Develop machine learning model

Why is RMSE used?


 Punishes Larger Errors
 RMSE squares the differences between predicted and actual values, making large errors more
significant than smaller ones.
 This is useful if large deviations are especially undesirable in your problem.
Develop machine learning model

Why is RMSE used?


 Interpretable in Original Units
 RMSE has the same units as the target variable, making it easier to interpret in a real-world context.
 For example, if the target variable is in dollars, RMSE tells you the average error in dollars.
Develop machine learning model

RMSE Limitations
 RMSE is sensitive to outliers since errors are squared.
 A single large error can disproportionately affect the RMSE.
Develop machine learning model

Why R-squared (R²)?

 R-squared indicates how well the independent variables explain the variability of the
dependent variable, with values closer to 1 suggesting a better fit.
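R-squared can be read as 1 minus the ratio of the unexplained (residual) variation to the total variation. A minimal check with the same kind of toy numbers (hypothetical values):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([250, 300, 410])   # actual values (hypothetical)
y_pred = np.array([260, 290, 400])   # predictions (hypothetical)

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
print(1 - ss_res / ss_tot)                        # manual R-squared
print(r2_score(y_true, y_pred))                   # same value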
Develop machine learning model

Regression Analysis In ML
 Regression analysis is a statistical technique that predicts continuous numeric values based on the
relationship between independent and dependent variables.
 The main goal of regression analysis is to plot a line or curve that best fits the data and to estimate
how one variable affects another.
 Regression analysis is a fundamental concept in machine learning and it is used in many
applications such as forecasting, predictive analytics, etc.
 Regression models use the input data features (independent variables) and their corresponding
continuous numeric output values (dependent or outcome variables) to learn specific associations
between inputs and corresponding outputs.
Develop machine learning model

Terminologies in Regression Analysis


 Independent Variables: Predictors or features used to estimate the dependent variable.
 Dependent Variables: Target variables whose values are predicted.
 Regression Line: A line or curve that best fits the data points.
 Overfitting: Occurs when a model performs well on training data but poorly on test data (high
variance).
 Underfitting: Happens when the model fails to capture patterns in training data (high bias).
 Outliers: Extreme values that deviate significantly from the rest of the data.
 Multicollinearity: When independent variables are highly correlated with each other.
Develop machine learning model

Types of Regression in Machine Learning


Generally, the classification of regression methods is done based on the three metrics:
1. the number of independent variables,
2. type of dependent variables, and
3. shape of the regression line.
Develop machine learning model

Types of Regression in Machine Learning


Based on these criteria, commonly used regression methods in machine learning include:
 Linear Regression
 Logistic Regression
 Polynomial Regression
 Lasso Regression
 Ridge Regression
 Decision Tree Regression
 Random Forest Regression
 Support Vector Regression
Develop machine learning model

Linear Regression
 Linear Regression is a supervised learning algorithm used for predicting a continuous target
variable based on one or more input variables (features).
 It assumes a linear relationship between the dependent and independent variables and uses a linear
equation to model this relationship.
Develop machine learning model

What is Linear Regression?


 Linear regression is a statistical technique that estimates the linear relationship
between a dependent and one or more independent variables.
 In machine learning, linear regression is implemented as a supervised learning
approach.
 In machine learning, labeled datasets contain input data (features) and output labels
(target values).
 For linear regression in machine learning, we represent features as independent
variables and target values as the dependent variable.
Develop machine learning model

Linear Regression
 Linear regression is the most commonly used regression model in machine learning.
 It may be defined as the statistical model that analyses the linear relationship between a dependent
variable and a given set of independent variables.
 A linear relationship between variables means that when the value of one or more independent variables
changes (increase or decrease), the value of the dependent variable will also change accordingly (increase
or decrease).
 Linear regression is further divided into two subcategories:
1. simple linear regression and
2. multiple linear regression (also known as multivariate linear regression).
Develop machine learning model

Simple Linear Regression


 In simple linear regression, a single independent variable (or predictor) is used to predict the dependent
variable.
 Mathematically, simple linear regression can be represented as follows: Y = mX + b
Where,
 Y: is the dependent variable we are trying to predict.
 X: is the independent variable we are using to make predictions.
 m: is the slope of the regression line, which represents the effect X has on Y
 b: is a constant known as the Y-intercept. If X=0, Y would be equal to b
Develop machine learning model

Simple Linear Regression (Single feature and single target)

Square Feet (X) House Price (Y)


1300 240
1500 320
1700 330
1830 295
1550 256
2350 409
1450 319
Develop machine learning model
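A minimal sketch of fitting a simple linear regression to the square-feet / price pairs in the table above (values copied from the slide; the 2000 sq ft query is an illustrative example):

import numpy as np
from sklearn.linear_model import LinearRegression

# Square footage (X) and house price (Y) from the table above.
X = np.array([[1300], [1500], [1700], [1830], [1550], [2350], [1450]])
y = np.array([240, 320, 330, 295, 256, 409, 319])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # slope m and intercept b of Y = mX + b
print(model.predict([[2000]]))            # predicted price for a 2000 sq ft house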

Simple Linear Regression

Y = w0 + w1X + ε, where:

• Y is the dependent variable (target).
• X is the independent variable (feature).
• w0 is the y-intercept of the line.
• w1 is the slope of the line, representing the effect of X on Y.
• ε is the error term, capturing the variability in Y not explained by X.
Develop machine learning model

Simple Linear Regression


Develop machine learning model
Exercises (Simple Linear Regression)

1. Perform Data Preparation
2. Check the correlation in data
3. Check if the dataset is linear or not (check data dispersion)
4. Split the dataset into training and testing sets
5. Perform Model Training (fitting the Simple Linear Regression to the Training Set)
6. Perform Model Testing
7. Perform Model Evaluation (root_mean_squared_error, mean_absolute_error and r2_score)
8. Visualize the Training Set Results (with Regression Line)
9. Visualize the Test Set Results (with Regression Line)
10. Predict for new values
11. Find the intercept and slope

Dataset:

Kilometers_driven   Selling_price
1.1                 39343
1.3                 46205
1.5                 37731
2                   43525
2.2                 39891
2.9                 56642
3                   60150
3.2                 54445
3.2                 64445
3.7                 57189
3.9                 63218
4                   55794
4                   56957
4.1                 57081
4.5                 61111
4.9                 67938
5.1                 66029
5.3                 83088
5.9                 81363
6                   93940
6.8                 91738
7.1                 98273
7.9                 101302
8.2                 113812
8.7                 109431
9                   105582
9.5                 116969
9.6                 112635
10.3                122391
10.5                121872
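A hedged sketch of the exercise workflow (steps 4–7, 10 and 11), assuming the table above has been saved to a CSV file; the file name and column names are assumptions:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (root_mean_squared_error,
                             mean_absolute_error, r2_score)

df = pd.read_csv("vehicles.csv")          # assumed file name
X = df[["Kilometers_driven"]]
y = df["Selling_price"]

# 4. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5. Model training (fit simple linear regression to the training set)
model = LinearRegression().fit(X_train, y_train)

# 6.-7. Model testing and evaluation
y_pred = model.predict(X_test)
print("RMSE:", root_mean_squared_error(y_test, y_pred))
print("MAE :", mean_absolute_error(y_test, y_pred))
print("R2  :", r2_score(y_test, y_pred))

# 10. Predict for a new value
print(model.predict(pd.DataFrame({"Kilometers_driven": [5.0]})))

# 11. Intercept and slope
print("intercept:", model.intercept_, "slope:", model.coef_[0])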
Develop machine learning model

Multiple Linear Regression


Multiple linear regression is basically the extension of simple linear regression that predicts a
response using two or more features.
Develop machine learning model

Exercises (Multiple Linear Regression )


1. Perform Data Preparation
2. Check the correlation in data
3. Check if the dataset is linear or not(check data dispersion)
4. Split the dataset into training and testing sets
5. Perform Model Training (Fitting the Multiple Linear Regression to Training Set)
6. Perform Model Testing
7. Perform Model Evaluation(root_mean_squared_error, mean_absolute_error and
r2_score)
8. Predict for new values
9. Find the intercept and slope
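A minimal multiple linear regression sketch with more than one feature; the file name and feature columns below are assumptions for illustration:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error, r2_score

df = pd.read_csv("vehicles.csv")                            # assumed file name
features = ["Kilometers_driven", "year", "engine_size"]     # assumed feature columns
X = df[features]
y = df["Selling_price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("RMSE:", root_mean_squared_error(y_test, y_pred))
print("R2  :", r2_score(y_test, y_pred))
print("intercept:", model.intercept_)
print("slopes   :", dict(zip(features, model.coef_)))   # one coefficient per feature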
Develop machine learning model

Random Forest Regressor


 A random forest is an ensemble learning method that combines the predictions from multiple
decision trees to produce a more accurate and stable prediction.
 It is used for predicting numerical values.
 It is a type of supervised learning algorithm that can be used for both classification and regression
tasks.
 It predicts continuous values by averaging the results of multiple decision trees.
Develop machine learning model

Working of Random Forest Regression


 Random Forest Regression works by creating multiple decision trees, each trained on a random subset of the
data.
 After the trees are trained, each tree makes a prediction, and the final prediction for regression tasks is the
average of all the individual tree predictions; this process is called Aggregation.
 This approach is beneficial because individual decision trees may have high variance and are prone to
overfitting, especially with complex data.
 However, by averaging the predictions from multiple decision trees, Random Forest minimizes this variance,
leading to more accurate and stable predictions and hence improving the generalization of the model.
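A brief sketch of this in practice with scikit-learn's RandomForestRegressor, reusing the train/test split from the earlier sketches; n_estimators is the number of trees whose predictions are averaged:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, r2_score

# 100 trees, each trained on a bootstrap sample of the training data;
# the forest's prediction is the average of the individual tree predictions.
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
print("RMSE:", root_mean_squared_error(y_test, y_pred))
print("R2  :", r2_score(y_test, y_pred))

# The individual fitted trees are available in forest.estimators_;
# averaging their outputs reproduces the forest's prediction.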
Develop machine learning model

Applications of Random Forest Regression


 Predicting continuous numerical values: Predicting house prices, stock prices or customer lifetime value.
 Identifying risk factors: Detecting risk factors for diseases, financial crises or other negative events.
 Handling high-dimensional data: Analysing datasets with a large number of
input features.
 Capturing complex relationships: Modeling complex relationships between input features and the target
variable.
Develop machine learning model

Advantages of Random Forest Regression


 Handles Non-Linearity: It can capture complex, non-linear relationships in the data
that other models might miss.
 Reduces Overfitting: By combining multiple decision trees and averaging predictions
it reduces the risk of overfitting compared to a single decision tree.
 Robust to Outliers: Random Forest is less sensitive to outliers as it aggregates the
predictions from multiple trees.
 Works Well with Large Datasets: It can efficiently handle large datasets and high-
dimensional data without a significant loss in performance.
 Handles Missing Data: Random Forest can handle missing values by using surrogate
splits and maintaining high accuracy even with incomplete data.
 No Need for Feature Scaling: Unlike many other algorithms Random Forest does not
require normalization or scaling of the data.
Develop machine learning model

Quiz
1. What is the difference between random forest and regression?
2. Why is random forest better than regression?
Develop machine learning model

Decision Tree Regressor


 Unlike traditional linear regression, which assumes a straight-line relationship between input
features and the target variable, Decision Tree Regression is a non-linear regression method that can
handle complex datasets with intricate patterns.
 It uses a tree-like model to make predictions, making it both flexible and easy to interpret.
 Decision Tree Regression predicts continuous values.
 It does this by splitting the data into smaller subsets based on decision rules derived from the input
features.
 At each leaf node of the tree, the model predicts a continuous value, which is typically the average of the
target values in that node.
Develop machine learning model

How It Works (Step-by-Step)


 Choose the Best Feature to Split
 The algorithm selects a feature that best splits the data into two or more subsets.
 It minimizes the variance within each subset to ensure better predictions.
 Split the Data Recursively
 The process repeats at each node, creating smaller subgroups.
 Each split aims to reduce the prediction error.
 Stop Splitting (Stopping Criteria)

 The tree stops growing when:

✅ A maximum depth is reached

✅ A minimum number of samples per leaf is met

✅ Further splitting does not improve predictions


 Make Predictions
 Each leaf node contains a numerical value (the average of training samples in that node).
 Given a new input, the model follows the decision path and returns the leaf value.
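A short sketch with scikit-learn's DecisionTreeRegressor, reusing the earlier train/test split; max_depth and min_samples_leaf correspond to the stopping criteria listed above (the values here are illustrative):

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import root_mean_squared_error, r2_score

# Stopping criteria: stop at depth 5, and require at least 10 samples per leaf.
tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=42)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)   # each prediction is the mean target value of a leaf
print("RMSE:", root_mean_squared_error(y_test, y_pred))
print("R2  :", r2_score(y_test, y_pred))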
Develop machine learning model

Support Vector Regressor (SVR)


 Support vector regression (SVR) is a type of support vector machine (SVM) that is used for
regression tasks.
 It tries to find a function that best predicts the continuous output value for a given input value.
 SVR can use both linear and non-linear kernels.
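A minimal SVR sketch, reusing the earlier train/test split. SVR is sensitive to feature scale, so the features are standardized in a pipeline first; the kernel choice below is illustrative:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import root_mean_squared_error, r2_score

# kernel="linear" fits a linear function; kernel="rbf" allows a non-linear fit.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
print("RMSE:", root_mean_squared_error(y_test, y_pred))
print("R2  :", r2_score(y_test, y_pred))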
End!
