
ICE 516: REGRESSION ANALYSIS

MODERN APPROACHES TO PREDICTIVE MODELING


Learning Objectives
By the end of this lecture, you will:
 Understand the fundamentals of regression analysis.
 Differentiate between linear and logistic regression.
 Learn how to implement both regression types.
 Apply regression concepts to real-world problems.

What is Predictive Modelling?


 The process of using data and statistical algorithms to predict future outcomes.
 Combines statistics, machine learning, and data mining techniques.
 Forms the backbone of modern data-driven decision-making.

Types of Predictive Models


Supervised Learning Models
1. Regression Models
o Linear Regression
o Polynomial Regression
o Ridge/Lasso Regression
o Elastic Net
2. Classification Models
o Logistic Regression
o Decision Trees
o Random Forests
o Support Vector Machines
o Neural Networks
Unsupervised Learning Models
1. Clustering
o K-means
o Hierarchical Clustering
o DBSCAN
2. Dimensionality Reduction
o Principal Component Analysis (PCA)
o t-SNE
o UMAP

The Modelling Process


1. Problem Definition: This is the crucial first step where you establish what you're
trying to achieve
 Define objectives: This involves clearly articulating your model's goals. For
example, are you trying to predict customer churn, classify images, or forecast
sales? You must translate the business problem into a specific modelling task
(classification, regression, clustering, etc.).
 Identify key metrics: Here you determine how success will be measured. For a
classification problem, this might include accuracy, precision, recall, or F1-score.
For regression, you might use MAE, MSE, or RMSE. The choice depends on your
specific use case and business impact
 Establish success criteria: This means setting concrete thresholds for your
metrics that define when the model is suitable for deployment. For example,
"We need 95% accuracy" or "The model must have less than 5% false positives.
2. Data Preparation: This phase focuses on getting your data ready for modelling
 Data collection: Gathering relevant data from various sources (databases,
APIs, files, etc.). This includes understanding data availability, quality, and
accessibility. You might need to set up data pipelines or work with stakeholders
to get access to necessary information.
 Cleaning and preprocessing: This involves handling missing values,
removing duplicates, dealing with outliers, and correcting inconsistencies. You
might need to standardise formats, handle encoding of categorical variables,
and ensure data quality.
 Feature engineering: Creating new features from existing ones to better
capture the underlying patterns.
 Data splitting (train/validation/test): Dividing your data into:
i. Training set (typically 60-80% of data) for model learning
ii. Validation set (10-20%) for hyperparameter tuning.
iii. Test set (10-20%) for final evaluation
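As an illustration, below is a minimal sketch of a 60/20/20 split using scikit-learn's train_test_split; the feature matrix X and target y are placeholders:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # placeholder features
y = np.arange(100)                 # placeholder target

# First carve out 20% of the data for the test set...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ...then 25% of the remaining 80% (i.e. 20% of the total) for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
# Result: 60% train, 20% validation, 20% test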
3. Model Development: This is where you build and refine your model
 Algorithm selection: Choosing appropriate algorithms based on
i. Problem type (classification, regression, etc.)
ii. Data size and characteristics
iii. Interpretability requirements
iv. Computational constraints
 Hyperparameter tuning: Finding the optimal configuration for your model
using techniques such as grid search, random search, or Bayesian optimisation
(see the sketch after this list).
 Cross-validation: Implementing techniques to ensure model robustness.
 Ensemble creation: Combining multiple models to improve performance.
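As referenced above, here is a minimal sketch of grid-search hyperparameter tuning with built-in 5-fold cross-validation, using a Ridge model on synthetic data; the parameter grid is illustrative, not prescriptive:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Try each regularisation strength with 5-fold cross-validation
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)
print(search.best_params_, search.best_score_)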
4. Model Evaluation: This phase ensures your model performs as expected
 Performance metrics: Calculating the metrics relevant to your problem type
(classification, regression, or clustering).
 Error analysis: Understanding where and why your model makes mistakes.
 Model interpretation: Making sense of how your model works.
 Validation strategies: Ensuring model generalisation.
5. Deployment: The final phase where your model goes into production
 Model Serving: Setting up the infrastructure to serve predictions.
 Monitoring: Tracking model performance in production
 Maintenance: Keeping the model healthy
 Updates: Continuous improvement process

Model Evaluation and Selection


Evaluation Metrics
1. Regression Metrics: These metrics help evaluate models that predict
continuous values
o Mean Squared Error (MSE): Calculates the average of squared
differences between predicted and actual values
 Formula: MSE = (1/n) Σ(y_true − y_pred)²
 Penalizes larger errors more heavily due to squaring
 Always positive, with 0 indicating perfect predictions
 Units are the square of the target variable's units
o Root Mean Squared Error (RMSE): Square root of MSE
 Formula: RMSE = √MSE
 Returns error in same unit as target variable
 More interpretable than MSE
 Commonly used in practice
 Like MSE, penalizes larger errors more heavily
o Mean Absolute Error (MAE): Average of absolute differences between
predicted and actual values
 Formula: MAE = (1/n) Σ|y_true − y_pred|
 More robust to outliers than MSE/RMSE
 Easier to interpret as average error magnitude
 Treats all errors linearly
o R-squared (R²): Proportion of variance in the dependent variable explained
by model
 Formula: R² = 1 − (SS_res / SS_tot)
 Typically ranges from 0 to 1 (1 being a perfect fit)
 Can be negative if the model performs worse than a horizontal line
 Useful for comparing models on same dataset

2. Classification Metrics
o Accuracy
o Precision
o Recall
o F1-Score
o ROC-AUC: Area under Receiver Operating Characteristic curve
 Plots true positive rate vs false positive rate at various thresholds.
 Range is 0 to 1 (0.5 is random, 1 is perfect)
 Threshold-independent metric
 Good for imbalanced classes
o Precision-Recall Curves: Plots precision vs recall at various thresholds
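A minimal sketch computing the metrics above with scikit-learn; the predictions and labels are made up purely for illustration:

import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             r2_score, accuracy_score, roc_auc_score)

# Regression metrics on hypothetical predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5])
mse = mean_squared_error(y_true, y_pred)
print('MSE:', mse, 'RMSE:', np.sqrt(mse))
print('MAE:', mean_absolute_error(y_true, y_pred))
print('R²:', r2_score(y_true, y_pred))

# Classification metrics on hypothetical labels and predicted probabilities
labels = np.array([0, 1, 1, 0, 1])
scores = np.array([0.2, 0.8, 0.6, 0.4, 0.9])
print('Accuracy:', accuracy_score(labels, (scores >= 0.5).astype(int)))
print('ROC-AUC:', roc_auc_score(labels, scores))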
Model Selection Techniques
1. Cross-validation
o K-fold
o Stratified K-fold
o Time-series cross-validation
2. Hyperparameter Optimization
o Grid Search
o Random Search
o Bayesian Optimization
o Neural Architecture Search
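To illustrate, a minimal sketch of stratified K-fold cross-validation on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Stratified 5-fold CV preserves the class proportions in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())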

Introduction to Regression Analysis


What is Regression Analysis?
 A statistical technique for modelling relationships between variables
 Used to predict outcomes based on one or more input variables
 Fundamental tool in statistics, data science, and machine learning
 Essential for prediction, forecasting, and understanding variable relationships

Primary Uses
1. Prediction and forecasting (machine learning applications)
2. Measuring variable relationships and influences
3. Data-driven decision making
Linear Regression
Linear regression predicts continuous values by modelling linear relationships
between variables. It's one of the most widely used statistical techniques in data
science.

Types of Linear Regression


1. Simple Linear Regression
 One independent variable
 One dependent variable
 Uses straight-line relationship
2. Multiple Linear Regression
 Multiple independent variables
 One dependent variable
 Creates multidimensional relationship plane

Mathematical Representation of Linear Regression


Basic Formula: Y_i = f(X_i, β) + e_i

Where:
Y_i = the dependent variable
f = the function
X_i = the independent variable
β = the unknown parameters
e_i = the error term

Steps Involved when performing Linear Regression


As the name suggests, the idea behind performing Linear Regression (simple linear
regression) is to come up with a linear equation that describes the relationship
between the dependent and independent variables.
Step 1
Let’s assume that we have a dataset where x is the independent variable and Y is a
function of x (Y=f(x)). Thus, by using Linear Regression we can form the following
equation (equation for the best-fitted line):
Y = mx + c
Y denotes the response (dependent) variable
x denotes the predictor (independent) variable
This is an equation of a straight line where m is the slope of the line and c is the
intercept.
Step 2
Now, to derive the best-fitted line, we first assign random values to m and c and
calculate the corresponding predicted value of Y for each x in the training data.
This predicted Y value is the output value.
Step 3
Now, as we have our calculated output value (let’s represent it as ŷ), we can verify
whether our prediction is accurate or not. In the case of Linear Regression, we
calculate this error (residual) using the MSE method (mean squared error), and we
call it the loss function:
L = (1/n) Σ(y − ŷ)²

Step 4

To achieve the best-fitted line, we have to minimise the value of the loss function. To
minimise the loss function, we use a technique called gradient descent.

Gradient Descent

A cost function is a mathematical formula used to calculate the error: the
difference between the predicted value and the actual value. Our loss function is
the mean squared error, so the error appears in second-order terms. If we plot the
loss function against the weights (in our equation the weights are m and c), we get
a parabolic curve. Since our goal is to minimise the loss function, we have to
reach the bottom of the curve.
To achieve this, we take the first-order derivative of the loss function with
respect to the weights (m and c). We then subtract the derivative, multiplied by a
learning rate (α), from the current weight. We keep repeating this step until we
reach the minimum value (the global minimum). In practice we fix a very small
threshold (for example, 0.0001) as the stopping criterion; if we don't set the
threshold, it may take forever to reach an exact zero value.
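Below is a minimal sketch of Steps 2-4 in NumPy, assuming zero initial weights and a fixed learning rate; the data points are made up for illustration:

import numpy as np

def gradient_descent(x, y, lr=0.01, tol=1e-4, max_iters=10000):
    # Fit Y = m*x + c by minimising the MSE loss L = (1/n) Σ(y − ŷ)²
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(max_iters):
        y_hat = m * x + c                        # Step 2: predicted outputs
        dm = (-2 / n) * np.sum(x * (y - y_hat))  # dL/dm
        dc = (-2 / n) * np.sum(y - y_hat)        # dL/dc
        m -= lr * dm                             # Step 4: update weights
        c -= lr * dc
        if abs(dm) < tol and abs(dc) < tol:      # stop near the minimum
            break
    return m, c

# Example: points scattered around y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
m, c = gradient_descent(x, y)
print(f'm = {m:.2f}, c = {c:.2f}')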

Step 5

Once the loss function is minimized, we get the final equation for the best-fitted line
and we can predict the value of Y for any given X.

Requirements for Linear Regression


1. Continuous variables
2. Linear relationship between variables
3. Independent observations
4. No significant outliers
5. Homoscedasticity
6. Normal distribution of residuals
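The residual-based requirements (homoscedasticity, normality of residuals) are usually checked after fitting. As a small illustration, with made-up residual values, the Shapiro-Wilk test from SciPy can screen residuals for normality:

import numpy as np
from scipy import stats

# Hypothetical residuals from a fitted linear model
residuals = np.array([1.2, -0.8, 0.3, -1.1, 0.5, -0.2, 0.9, -0.7])

# Shapiro-Wilk test: a large p-value (> 0.05) is consistent with normality
stat, p_value = stats.shapiro(residuals)
print(f'Shapiro-Wilk p-value: {p_value:.3f}')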

Logistic Regression
Overview
Logistic regression is used for classification problems, particularly binary outcomes. It
predicts categorical variables by calculating probabilities.
Types of Logistic Regression
1. Binary Logistic Regression
 Two possible outcomes (Yes/No, 0/1)
 Most common form

2. Multinomial Logistic Regression


 Multiple unordered outcomes
 Example: Transportation type prediction

3. Ordinal Logistic Regression


 Multiple ordered outcomes
 Example: Rating scales (1-5 stars)

Steps involved when performing Logistic Regression

In a logistic regression model, we choose a probability threshold. If the predicted
probability for an observation is higher than the threshold, we classify it into
one class; otherwise, into the other.

Step 1

To calculate the binary separation, first, we determine the best-fitted line by following
the Linear Regression steps.

Step 2

The regression line we get from Linear Regression is highly susceptible to outliers.
Thus it will not do a good job of separating the two classes.

Thus, the predicted value gets converted into probability by feeding it to the sigmoid
function.

The logistic regression hypothesis generalises the linear regression hypothesis by
passing its output through the logistic function, also known as the sigmoid
function (an activation function).

The equation of the sigmoid: S(x) = 1 / (1 + e^(−x))

Thus, if we feed the output value ŷ to the sigmoid function, it returns a
probability value between 0 and 1.
Step 3

Finally, the output value of the sigmoid function gets converted into 0 or 1
(discrete values) based on the threshold value. We usually set the threshold value
to 0.5. In this way, we get the binary classification.
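A minimal sketch of Steps 2-3: feed linear-model outputs through the sigmoid, then threshold at 0.5; the linear outputs here are made up for illustration:

import numpy as np

def sigmoid(z):
    # Maps any real value into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear-model outputs (ŷ = m*x + c)
y_hat = np.array([-2.0, -0.3, 0.1, 1.5])
probs = sigmoid(y_hat)               # probabilities in (0, 1)
labels = (probs >= 0.5).astype(int)  # 0/1 classes at the 0.5 threshold
print(probs, labels)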

Requirements
1. Binary/categorical dependent variable
2. Independent predictor variables
3. Low/no multicollinearity
4. Large sample size
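As an illustration of the multicollinearity requirement, variance inflation factors (VIFs) can be computed with statsmodels; the predictor values here are made up, and a VIF above roughly 5-10 is commonly read as a warning sign:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictor matrix
X = pd.DataFrame({
    'age': [45, 52, 38, 60, 42, 55, 49, 61],
    'blood_pressure': [130, 145, 125, 150, 135, 148, 140, 152],
    'cholesterol': [200, 250, 180, 260, 220, 255, 240, 262],
})

# Add a constant so VIFs are computed against an intercept model
X_const = sm.add_constant(X)
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns) if col != 'const'}
print(vifs)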

Comparison: Linear vs. Logistic Regression


Similarities
 Both are supervised learning algorithms
 Use parametric regression approaches
 Require training data
 Based on linear relationships

Key Differences
Aspect           | Linear Regression      | Logistic Regression
Output           | Continuous values      | Categorical values
Purpose          | Prediction             | Classification
Function         | Best-fit line          | Sigmoid curve
Loss calculation | Mean squared error     | Maximum likelihood
Application      | Quantitative response  | Binary/categorical response

Modern Applications
Linear Regression
 Price prediction
 Sales forecasting
 Resource allocation
 Performance analysis

Logistic Regression
 Spam detection
 Medical diagnosis
 Credit risk assessment
 Customer behaviour prediction

Practical Applications of Regression (Examples)


1. Real Estate Price Prediction (Linear Regression)
Problem Statement
Predicting house prices based on location, square footage, number of bedrooms, etc.
Implementation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Sample real estate data
real_estate_data = {
    'sqft': [1500, 2000, 1800, 2200, 1600],
    'bedrooms': [3, 4, 3, 4, 3],
    'age': [10, 5, 15, 2, 8],
    'location_score': [8, 9, 7, 9, 8],
    'price': [300000, 450000, 320000, 500000, 330000]
}
df = pd.DataFrame(real_estate_data)

# Feature preparation
X = df[['sqft', 'bedrooms', 'age', 'location_score']]
y = df['price']

# Split and scale data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Example prediction (a DataFrame keeps feature names consistent with the scaler)
new_house = pd.DataFrame([[1900, 3, 5, 8]], columns=X.columns)  # sqft, bedrooms, age, location_score
new_house_scaled = scaler.transform(new_house)
predicted_price = model.predict(new_house_scaled)
print(f'Predicted price: {predicted_price[0]:,.0f}')
2. Medical Diagnosis (Logistic Regression)
Problem Statement
Predicting the likelihood of a disease based on patient symptoms and characteristics.
Implementation
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Sample patient data
patient_data = {
    'age': [45, 52, 38, 60, 42],
    'blood_pressure': [130, 145, 125, 150, 135],
    'cholesterol': [200, 250, 180, 260, 220],
    'has_disease': [0, 1, 0, 1, 0]  # 0: No, 1: Yes
}
df = pd.DataFrame(patient_data)

# Prepare features
X = df[['age', 'blood_pressure', 'cholesterol']]
y = df['has_disease']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train model
model = LogisticRegression()
model.fit(X_scaled, y)

# Predict for new patient (a DataFrame keeps feature names consistent with the scaler)
new_patient = pd.DataFrame([[50, 140, 230]], columns=X.columns)  # age, blood_pressure, cholesterol
new_patient_scaled = scaler.transform(new_patient)
risk_probability = model.predict_proba(new_patient_scaled)[:, 1]
print(f'Disease risk probability: {risk_probability[0]:.2f}')
Recommended Tools
- Python (sklearn, statsmodels)
- R (stats package)
- MATLAB
- Excel (basic analysis)

Further Reading
- Statistical Learning Theory
- Advanced Regression Techniques
- Machine Learning Applications
- Model Optimization Methods
