Module 2
Supervised Learning: Linear and Non-Linear Examples - Regression - Stochastic Gradient Descent - Regularization -
Multiclass Regression - Generalized Linear Models - Decision Trees - Support Vector Machine (SVM) - K-Nearest
Neighbors (kNN) - Time Series Forecasting - Evaluating a Classification Model's Performance - ROC Curve
Supervised learning is a type of machine learning where the model is trained using labeled data. In this method, each
input in the training dataset has a corresponding output, and the model is trained to learn the relationship between
the inputs and outputs. The goal is to make predictions on unseen data based on this learned relationship.
Regression is a technique in supervised learning used to predict continuous (quantitative) values based on input
variables. The model learns the relationship between the independent variable(s) and the dependent variable and
then makes predictions based on that relationship. Unlike classification, where the output is discrete, regression
focuses on predicting continuous values.
The following table outlines common business use cases across different domains where regression can be applied:
Retail: What will the daily, monthly, and yearly sales be for a given store over the next three years?
Energy / Environmental: What will the temperature be for the next five days?
The common thread in these questions is that they ask for a quantitative or continuous number, such as a sales figure,
labor cost, or temperature. Regression analysis helps answer such questions by examining the relationship between
variables.
Data Analysis
The correlation between the two variables (hours studied and test grade) is calculated, which reveals a strong positive relationship (98%). This
suggests that as the number of hours studied increases, the test grade also increases. However, correlation only shows
the strength of the relationship, not necessarily a causal one.
Correlation means that two variables are related, but it does not imply one causes the other.
The correlation between firemen’s presence and the size of a fire is strong, but firemen don’t cause the fire.
Similarly, sleeping with shoes on might correlate with a headache, but it could be due to alcohol intoxication
rather than the shoes themselves.
Correlation doesn’t imply causation; however, the existence of causation will always imply correlation.
To predict test scores based on study hours, we fit a linear regression model using the formula:
Y = mX + c
Where:
Y is the predicted test grade (dependent variable)
X is the number of hours studied (independent variable)
m is the slope (coefficient) of the line
c is the intercept
The objective of linear regression is to minimize the residuals, which are the differences between the actual values and
the predicted values. The best fit line is the one that minimizes the sum of the squared residuals, which is why it's
called the least squares method.
For a student studying for 6 hours, the predicted test grade is 80.74. The slope of the fitted line indicates that for every
additional hour studied, the grade increases by about 4.74 points.
There are three primary metrics for evaluating linear model performance:
R-squared
R-squared indicates the proportion of variance in the dependent variable explained by the independent variable. It
ranges from 0 to 1, with values closer to 1 signifying a better fit. An example calculation shows that 97% of the variability
in the dependent variable (test score) is explained by the independent variable (hours studied).
Root Mean Squared Error (RMSE)
RMSE measures how close predicted values are to actual values. Lower RMSE values indicate better model
performance. It has the same units as the target variable.
Mean Absolute Error (MAE)
MAE is the average of the absolute differences between the predicted and actual values. Smaller values of MAE signify
a better model.
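As a hedged illustration (the arrays below are placeholder values, not data from this module), these three metrics can be computed with scikit-learn once a model has produced predictions:
import numpy as np
from sklearn import metrics

# Placeholder observed grades and model predictions (illustrative values only)
y_actual = np.array([56, 62, 68, 73, 80])
y_predicted = np.array([55, 63, 69, 72, 81])

print("R-squared:", metrics.r2_score(y_actual, y_predicted))
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_actual, y_predicted)))
print("MAE:", metrics.mean_absolute_error(y_actual, y_predicted))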
Polynomial Regression
Polynomial regression is a type of regression that models the relationship between a dependent variable Y and an
independent variable X as an nth-degree polynomial. Unlike linear regression, which fits a straight line, polynomial
regression can model curves, making it more flexible and capable of fitting complex data patterns.
In polynomial regression, we introduce higher-order degree variables of the same independent variable in the
equation.
Example:
For a dataset containing the number of hours studied and corresponding test grades, a polynomial regression can be
used to fit a curve to the data. The relationship between hours studied and test grades may not be perfectly linear. For
example, after a certain number of study hours, the test grades may plateau, and polynomial regression can model
that behavior better than linear regression.
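A minimal sketch of such a fit with scikit-learn; the hours/grade values and the polynomial degree here are assumptions chosen for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data: hours studied vs. test grade (values assumed)
x = np.array([2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y = np.array([55, 62, 70, 76, 81, 84, 85])

# Expand the single feature into polynomial terms (x, x^2) and fit a linear model on them
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
model = LinearRegression().fit(x_poly, y)

# Predict the grade for 6.5 hours of study
print(model.predict(poly.transform([[6.5]])))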
Multivariate Regression
Multivariate regression is an extension of simple linear regression, where there is more than one independent variable
(predictor). This type of regression is used when the dependent variable is influenced by multiple independent
variables.
Multivariate regression is used when you want to predict an outcome based on multiple factors. For example,
predicting house prices based on multiple features such as lot size, number of bedrooms, number of bathrooms, and
the presence of a garage.
Example:
For predicting house prices, the dependent variable Y is the price of the house. The independent variables could
include:
Lot Size
Number of Bedrooms
Number of Bathrooms
Presence of a Garage
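A hedged sketch of a multivariate fit for this house-price example; the column names, feature values, and prices below are illustrative assumptions:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative house-price data (values assumed)
df = pd.DataFrame({
    'LotSize':   [5000, 6000, 7000, 5500, 8000],
    'Bedrooms':  [3, 3, 4, 2, 4],
    'Bathrooms': [2, 2, 3, 1, 3],
    'Garage':    [1, 0, 1, 0, 1],          # 1 = has a garage, 0 = no garage
    'Price':     [250000, 265000, 320000, 210000, 340000]
})

# Fit a regression with several independent variables predicting Price
X = df[['LotSize', 'Bedrooms', 'Bathrooms', 'Garage']]
y = df['Price']
model = LinearRegression().fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)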
Under-fitting: Occurs when the model fails to capture the underlying trend in the data, resulting in low
accuracy on both training and test datasets. The model is too simple to capture the complexity of the data.
Over-fitting: Occurs when the model fits the training data too well, capturing all the noise and outliers. This
results in high accuracy on the training set but low accuracy on the test set, as the model fails to generalize to
unseen data.
Importance of Choosing the Right Polynomial Degree: The degree of the polynomial plays a crucial role in
avoiding over-fitting and under-fitting. A balance must be struck to ensure the model generalizes well.
Nonlinear Regression
Nonlinear Models: Unlike linear models, nonlinear regression allows the fitted line (or curve) to take any
shape, which is often required when the underlying process is complex or based on physical or biological
phenomena.
Interpretation of Nonlinear Models: Nonlinear regression models have a direct physical or biological
interpretation, making them useful for applications like enzyme kinetics, distribution modeling, and more.
Scipy's curve_fit Function: The curve_fit function from the Scipy library is used to fit nonlinear models to data.
Given a model function, often derived from scientific theory, it estimates that function's parameters. Common use
cases include Michaelis-Menten enzyme kinetics, the Weibull distribution, and power-law distributions.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# Fit the model to the data (x, y and func(x, p1, p2) are assumed to be defined as in the original listing)
popt, pcov = curve_fit(func, x, y, p0=(1.0, 0.2))
p1 = popt[0]
p2 = popt[1]

# Plot the data and the fitted curve
plt.plot(x, y, 'bo', x, func(x, p1, p2), 'r-')
plt.title('Non-linear fitting')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['Data', 'Fit'], loc='best')
plt.show()
Output
The output will show the original data points (in blue) and the nonlinear fitted curve (in red) that best represents
the relationship between the variables based on the chosen model.
Stochastic Gradient Descent (SGD)
Overview: Stochastic Gradient Descent (SGD) is an optimization algorithm used to minimize the error (cost function)
when fitting a model to a large dataset. It iteratively updates the model's parameters (weights) by calculating gradients
and adjusting the parameters in the direction that reduces the error.
Learning Rate (α): The learning rate controls the size of the steps taken during optimization. A smaller learning rate
ensures the algorithm does not overshoot the global minimum of the cost function, ensuring a more gradual
convergence.
Logistic Regression Solver: In older versions of scikit-learn, the default solver for logistic regression was 'liblinear'
(newer versions default to 'lbfgs'), which works well for smaller datasets. For large datasets with numerous
independent variables, the 'sag' (Stochastic Average Gradient) solver is recommended, as it converges faster on such data.
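As a rough sketch (the dataset choice, scaling step, and max_iter value are assumptions), switching to the 'sag' solver only requires changing the solver argument:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load and scale a small example dataset ('sag' converges faster on standardized features)
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Stochastic Average Gradient solver; max_iter raised to give it room to converge (value assumed)
clf = LogisticRegression(solver='sag', max_iter=1000)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))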
Regularization
Regularization: A technique to prevent over-fitting by penalizing overly complex models. It helps to reduce the
effect of noisy variables, especially when the number of variables increases, thereby improving the model's ability
to generalize.
Ridge and LASSO Regression: Two regularization techniques provided by Statsmodels and scikit-learn:
o LASSO (L1 Regularization): Adds a penalty to the absolute value of the coefficients, driving some
coefficients to zero, effectively performing variable selection.
o Ridge Regression (L2 Regularization): Adds a penalty to the square of the magnitude of coefficients,
shrinking them towards zero but not exactly to zero. It helps improve model accuracy when many variables
contribute slightly to the prediction.
Alpha Parameter: In both LASSO and Ridge regression, the alpha parameter controls the strength of the
regularization. A higher alpha increases the penalty, leading to simpler models.
Impact on Model Complexity: Regularization techniques like Ridge and LASSO reduce the complexity of the model
by limiting the size of the coefficients, helping to avoid over-fitting while maintaining a model that generalizes
better.
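A minimal sketch comparing the two penalties on a built-in scikit-learn dataset; the dataset choice and the alpha values are assumptions for illustration:
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import load_diabetes

# Example dataset from scikit-learn, used here only for illustration
X, y = load_diabetes(return_X_y=True)

# alpha controls the regularization strength; a larger alpha means a stronger penalty (values assumed)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients:", ridge.coef_)   # shrunk toward zero, rarely exactly zero
print("LASSO coefficients:", lasso.coef_)   # some coefficients driven exactly to zero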
Multiclass Regression
A common multiclass example uses the Iris dataset, whose three species are encoded as integer class labels:
0 = Iris-Setosa
1 = Iris-Versicolor
2 = Iris-Virginica
The Iris dataset includes features such as petal length and petal width, which are used for classification.
Code Example:
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
print('Class labels:', np.unique(y))
Output:
Class labels: [0 1 2]
This code loads the Iris dataset, where X represents the features (petal length, petal width, etc.) and y represents the
target variable (the iris species). The unique class labels in the target variable are [0, 1, 2].
Since the features might have different units of measurement, it is important to normalize the data to ensure that all
features are on the same scale. This helps in the effective training of the model.
Code Example:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X)
X = sc.transform(X)
This code applies standard scaling to the features, ensuring that each feature has a mean of 0 and a standard deviation
of 1, which is useful for model performance.
Code Example:
The code below splits the data into training (70%) and testing (30%) sets, ensuring reproducibility by setting a
random_state.
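A minimal sketch of this split; the exact random_state value here is an assumption:
from sklearn.model_selection import train_test_split

# 70% training, 30% testing; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)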
For multiclass classification, Logistic Regression is trained using the LogisticRegression model from scikit-learn.
The model is evaluated on both training and test sets using accuracy, the confusion matrix, and the classification report.
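A hedged sketch of this training and evaluation step, assuming the train/test split above and default hyperparameters:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Fit a multiclass logistic regression model on the training data
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Evaluate on the training and test sets
print("Train - Accuracy:", metrics.accuracy_score(y_train, lr.predict(X_train)))
print("Test - Accuracy:", metrics.accuracy_score(y_test, lr.predict(X_test)))
print("Confusion matrix:\n", metrics.confusion_matrix(y_test, lr.predict(X_test)))
print(metrics.classification_report(y_test, lr.predict(X_test)))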
Generalized Linear Models (GLM)
Different GLM Distribution Families: The GLM framework includes several distribution families, each appropriate for
different types of data and applications. Here are some examples:
Poisson: Applied when the target variable represents counts (e.g., number of occurrences).
Gaussian: Used when the target variable is continuous (e.g., normally distributed data).
Gamma: Used for modeling waiting times or the time between Poisson events.
Inverse Gaussian: Suitable for situations where there is an inverse relationship between time and distance.
Negative Binomial: Models the number of successes before a failure in a sequence of Bernoulli trials.
Example: In Listing 3-35, a dataset is loaded to apply both a Linear Regression model and a Generalized Linear Model
(GLM) to the data. The dataset used (Grade_Set_1.csv) contains the number of hours studied (independent variable)
and the test grades (dependent variable).
A simple linear regression is applied to predict test grades based on hours studied.
The model is trained using the .fit() method, and the intercept and coefficient are printed out. In this case,
the intercept is 49.68 and the coefficient is 5.02, meaning for each additional hour studied, the test grade
increases by approximately 5.02 points.
from sklearn import linear_model as lm
# x holds the hours studied as a 2-D feature array, y the test grades
lr = lm.LinearRegression()
lr.fit(x, y)
print("Intercept:", lr.intercept_)
print("Coefficient:", lr.coef_)
GLM is then applied with the Gaussian family and identity link function.
The sm.GLM() function is used with the target variable y and the feature x. The model is then fitted using
.fit(), and the results (coefficients, p-values, and confidence intervals) are displayed.
The output shows the same coefficient of 5.02 and intercept of 49.68, indicating that the GLM model gives
the same result as linear regression in this case but can be extended to more complex distributions.
import statsmodels.api as sm
# Gaussian family (its default link is the identity); x is assumed to already include a constant term for the intercept
model = sm.GLM(y, x, family=sm.families.Gaussian())
model = model.fit()
print(model.summary())
1. Training:
o Historical data is used to train the model, and validation techniques (like cross-validation) are
employed to evaluate its performance.
2. Prediction:
o Once trained and validated, the model can be applied to new, unseen data to make predictions about
future or unknown outcomes.
Process Flow Diagram: Figure 3-13 summarizes the supervised learning process flow. Among the general steps:
4. Model Evaluation: Validate the model using appropriate metrics (such as accuracy, precision, and recall).
This process helps build models that can generalize well to new, unseen data and make accurate predictions about
real-world phenomena.
Decision Trees
A decision tree is built from three kinds of nodes:
1. Root Node: The starting point of the tree where all data is considered.
2. Branch Node: Represents decision points where data is split based on an attribute.
3. Leaf Node: The end of a decision path where the class label is assigned.
Key Example: For the decision "Should you play outside on Saturday morning?", the tree can be translated into a set of
if-then rules. These rules, which are derived from the decision tree, are often more useful in business contexts than the
final decision itself.
How the Tree Splits and Grows
The decision tree construction follows a greedy algorithm, meaning it splits the data recursively from the root to the
leaf nodes based on the best split criterion at each step. Here’s how the process works:
The decision to split depends on a heuristic or statistical impurity measure, such as information gain (entropy) or Gini impurity.
To prevent the decision tree from growing excessively (overfitting), splitting stops once a stopping criterion is met, for
example when a node becomes pure, a maximum depth is reached, or too few samples remain to justify a further split.
If no further split is possible, majority voting is used to assign the leaf's class.
To manage the size and complexity of a decision tree, several parameters are adjusted during training:
max_features: Limits the number of features considered for each split. By default, all features are considered.
min_samples_split: Specifies the minimum number of samples required to split a node. If the number is not
met, the node will not split further.
min_samples_leaf: Ensures that leaf nodes have a minimum number of samples, preventing overfitting.
max_depth: Limits the depth of the tree, stopping further splits once this limit is reached.
These parameters help control the growth of the tree and ensure that the model does not overfit.
import numpy as np
import pandas as pd
from sklearn import datasets, tree, metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from io import StringIO
import pydot

# Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Standardize features
sc = StandardScaler()
sc.fit(X)
X = sc.transform(X)

# Split the data and fit a decision tree classifier (split size and random_state assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Evaluation metrics
print("Train - Accuracy:", metrics.accuracy_score(y_train, clf.predict(X_train)))
print("Test - Accuracy:", metrics.accuracy_score(y_test, clf.predict(X_test)))

# Generate visualization: export the tree in Graphviz .dot format and build a graph object
out_data = StringIO()
tree.export_graphviz(clf, out_file=out_data)
graph = pydot.graph_from_dot_data(out_data.getvalue())
Support Vector Machine (SVM)
The objective is to find the hyperplane with the maximum margin that separates the two classes optimally. SVM is
robust to outliers because it focuses only on the support vectors (the data points closest to the decision boundary),
unlike algorithms such as logistic regression, which are affected by outliers.
Key Parameters
1. C (Penalty Parameter): It controls the trade-off between achieving a high margin and classifying training
points correctly. A high value of C tries to classify all training data correctly, while a lower value allows
some misclassification in exchange for a larger margin.
2. Kernel: A kernel is a function used to compute the similarity between data points. The kernel can be:
Linear: Produces a linear (straight) hyperplane, suitable when the classes are linearly separable.
RBF (Radial Basis Function): Produces a non-linear hyperplane that can capture more complex relationships.
The following code demonstrates an SVM classifier using the Iris dataset:
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Standardize features (X and y are the Iris features and target loaded earlier)
sc = StandardScaler()
X = sc.fit_transform(X)

# Split data into training and testing sets (70/30 split and random_state assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train an SVM classifier with a linear kernel (C value assumed)
clf = SVC(kernel='linear', C=1.0).fit(X_train, y_train)

print("Train - Accuracy:", metrics.accuracy_score(y_train, clf.predict(X_train)))
print("Test - Accuracy:", metrics.accuracy_score(y_test, clf.predict(X_test)))
In this example:
We use a linear kernel and split the dataset into training and test sets.
The accuracy of the model on the training and test sets is calculated.
We can also visualize the decision boundary of the SVM, especially when working with two features. The code below
creates a synthetic dataset and plots the decision boundary:
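A minimal sketch of such a plot; the make_blobs settings, C value, and grid resolution here are assumptions chosen for illustration:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Create a simple two-class synthetic dataset with two features (parameters assumed)
X_syn, y_syn = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.8)

# Fit an SVM with a linear kernel
clf = SVC(kernel='linear', C=1.0)
clf.fit(X_syn, y_syn)

# Plot the data points
plt.scatter(X_syn[:, 0], X_syn[:, 1], c=y_syn, cmap=plt.cm.Paired)

# Plot the decision boundary and margins on a grid over the feature space
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50), np.linspace(ylim[0], ylim[1], 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=['--', '-', '--'], colors='k')

# Highlight the support vectors
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100, facecolors='none', edgecolors='k')
plt.title('SVM decision boundary with support vectors')
plt.show()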
Output: a plot showing the two classes, the linear decision boundary, and the support vectors highlighted.
This code:
Creates an SVM model with a linear kernel and fits it to the data.
Plots the decision boundary, support vectors, and the data points.
k-Nearest Neighbors (kNN)
k-Nearest Neighbors (kNN) is a non-parametric classification algorithm developed by Fix and Hodges in 1951.
The method is widely used for pattern classification when reliable parametric estimates of probability
densities are either unknown or difficult to determine.
The core idea is to classify an unknown data point based on the majority vote from its k nearest neighbors,
where k is a positive integer representing the number of neighbors to consider.
How kNN Works:
The algorithm calculates the distance between the unknown data point and all other data points in the
dataset.
It then selects the k closest data points and assigns the class that is most common among these neighbors.
The distance metric commonly used is Minkowski distance, though other metrics like Euclidean or
Manhattan distance can be used as well.
In Figure 3-16, the process is demonstrated where k = 5 for the nearest neighbors. The class of the unknown
data point is decided by the majority class of these 5 neighbors.
Key Considerations in kNN:
1. Choice of k:
For a two-class problem, it is recommended to choose an odd value for k to avoid ties.
k should not be a multiple of the number of classes, as this could lead to incorrect predictions.
2. Drawback of kNN:
The main drawback of kNN is the high computational complexity. Since the algorithm calculates the
distance between each point and every other point in the dataset, it can be slow, especially for large
datasets.
Example:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
clf.fit(X_train, y_train)
In this code snippet, the KNeighborsClassifier is initialized with k=5, using the Minkowski distance metric (p=2
refers to the Euclidean distance). It is then trained on the training data (X_train, y_train).
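As a hedged sketch (reusing the earlier train/test variables and an illustrative range of k values), the choice of k can be checked by comparing test accuracy across a few odd values:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Compare test accuracy for several odd values of k (range chosen for illustration)
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    knn.fit(X_train, y_train)
    print("k =", k, "Test accuracy:", metrics.accuracy_score(y_test, knn.predict(X_test)))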
Evaluation Metrics:
Confusion Matrix: This matrix shows the number of correct and incorrect classifications for each class.
Classification Report: This provides precision, recall, f1-score, and support for each class.
Sample Output:
Train Accuracy: 97.1%, showing that the model performs well on the training set.
Test Accuracy: 97.8%, demonstrating that the model generalizes well to new, unseen data.
Confusion Matrix: The confusion matrix provides a detailed breakdown of predictions for each class.
Classification Report: The report provides additional metrics, such as precision (the proportion of positive
predictions that are actually correct) and recall (the proportion of actual positives that are correctly identified).
Notes:
Decision Trees, SVM, and kNN algorithms can be used not only for classification tasks but also for regression.
For continuous dependent variables, Scikit-learn offers DecisionTreeRegressor, SVR (Support Vector Regressor),
and KNeighborsRegressor as alternatives to these classification algorithms.
Time-series forecasting
Time-series forecasting is a method used to predict future data points based on historical data that is collected
sequentially over time. The data is typically collected at regular intervals and can display a variety of patterns. Here are
the key patterns and concepts:
1. Trend: A long-term increase or decrease in the data.
2. Seasonality: Patterns that repeat at regular intervals, such as higher sales during the holiday season.
3. Cycle: Longer-term fluctuations that are not fixed in period, usually influenced by external factors.
ARIMA Model:
The Autoregressive Integrated Moving Average (ARIMA) model is one of the most popular models for time-series
forecasting. The ARIMA model has three key components:
Autoregressive (AR): This refers to a model where past values of the variable are used to predict future
values.
Integrated (I): This is the differencing of the data to make it stationary (removing trends or seasonality).
Moving Average (MA): This involves using past forecast errors to predict future values.
The model's order is specified by three parameters:
p: The order of the autoregressive part (the number of past values used).
d: The degree of differencing (the number of times the data is differenced to make it stationary).
q: The order of the moving average part (the number of past forecast errors used).
1. Stationarity Check: Before applying ARIMA, you need to check if the data is stationary. This can be done
using tests like the Dickey-Fuller test and by visualizing the data for trends or seasonality.
2. Plot ACF and PACF: These plots help to determine the optimal values for p and q by showing the
autocorrelations at various lags.
3. Model Building: Once the data is stationary, an ARIMA model can be built using the parameters p, d, and q.
The model is then trained on historical data.
4. Model Evaluation: After training, the model is evaluated using metrics such as AIC, BIC, Mean Absolute Error
(MAE), and Root Mean Squared Error (RMSE) to assess its performance.
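A minimal sketch with statsmodels, assuming an illustrative univariate monthly series and assumed (p, d, q) orders; in practice the orders come from the ACF/PACF step above:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Illustrative monthly series (values assumed)
sales = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
     115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140],
    index=pd.date_range('2020-01-01', periods=24, freq='MS'))

# Fit an ARIMA(p=1, d=1, q=1) model (orders assumed)
model = ARIMA(sales, order=(1, 1, 1))
results = model.fit()
print(results.summary())        # reports AIC/BIC for model comparison

# Forecast the next 6 periods
print(results.forecast(steps=6))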
Evaluating a Classification Model's Performance
Definitions of Terms:
True Negatives (TN): Correctly predicted as FALSE (the model correctly identified a negative class).
False Positives (FP): Incorrectly predicted as TRUE when it was actually FALSE (also known as Type I error).
False Negatives (FN): Incorrectly predicted as FALSE when it was actually TRUE (also known as Type II error).
True Positives (TP): Correctly predicted as TRUE (the model correctly identified a positive class).
A good classification model has low False Positives (FP) and False Negatives (FN), meaning fewer Type I and Type II errors.
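As a hedged illustration (the labels, predictions, and scores below are placeholder values), these quantities and the ROC curve covered in this module can be obtained from scikit-learn:
from sklearn import metrics

# Illustrative binary labels, predictions, and predicted probabilities (values assumed)
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0]
y_score = [0.2, 0.6, 0.8, 0.9, 0.1, 0.4, 0.3, 0.7, 0.95, 0.05]

# The confusion matrix for a binary problem unpacks into TN, FP, FN, TP
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

# The ROC curve plots the true positive rate against the false positive rate; AUC summarizes it
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
print("AUC:", metrics.auc(fpr, tpr))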