
Module 2

Supervised Learning: Linear and Non-Linear Examples - Regression - Stochastic Gradient Descent - Regularization -
Multiclass Regression - Generalized Linear Models - Decision Trees - Support Vector Machine (SVM) - k-Nearest
Neighbors (kNN) - Time Series Forecasting - Evaluating a Classification Model's Performance - ROC Curve

Supervised Learning – Regression


What is Supervised Learning?

Supervised learning is a type of machine learning where the model is trained using labeled data. In this method, each
input in the training dataset has a corresponding output, and the model is trained to learn the relationship between
the inputs and outputs. The goal is to make predictions on unseen data based on this learned relationship.

Regression in Supervised Learning

Regression is a technique in supervised learning used to predict continuous (quantitative) values based on input
variables. The model learns the relationship between the independent variable(s) and the dependent variable and
then makes predictions based on that relationship. Unlike classification, where the output is discrete, regression
focuses on predicting continuous values.

Use Cases of Supervised Learning – Regression

The following examples outline common business use cases across different domains where regression can be applied:

 Retail: How much will be the daily, monthly, and yearly sales for a given store for the next three years?

 Manufacturing: How much will be the product-wise manufacturing labor cost?

 Banking: What is the credit score of a customer?

 Insurance: How many customers will claim insurance this year?

 Energy / Environmental: What will be the temperature for the next five days?

The common thread in these questions is that they ask for a quantitative or continuous number, such as a sales figure,
labor cost, or temperature. Regression analysis helps answer such questions by examining the relationship between
variables.

Example of Regression – Students' Scores vs. Hours Studied


Let’s explore a practical example where we predict students' test scores based on the number of hours studied. In this
scenario, the hours studied serve as the independent variable, and the test grade serves as the dependent variable.

Data Analysis

A dataset is collected with the following columns:

 Hours_Studied: Number of hours a student has studied.

 Test_Grade: The student’s test grade.

The correlation between the two variables is calculated, which reveals a strong positive relationship (98%). This
suggests that as the number of hours studied increases, the test grade also increases. However, correlation only shows
the strength of the relationship and not necessarily a causal one.

Correlation and Causation

 Correlation means that two variables are related, but it does not imply one causes the other.

 Causation means that one variable directly affects the other.


For instance:

 The correlation between firemen’s presence and the size of a fire is strong, but firemen don’t cause the fire.

 Similarly, sleeping with shoes on might correlate with a headache, but it could be due to alcohol intoxication
rather than the shoes themselves.

Correlation doesn’t imply causation; however, the existence of causation will always imply correlation.

Fitting a Linear Regression Line

To predict test scores based on study hours, we fit a linear regression model using the formula:

Y = mX + c

Where:

 Y is the predicted value (test grade).

 X is the independent variable (hours studied).

 m is the slope (rate of change).

 c is the intercept (the starting point on the y-axis).

Least Squares Method

The objective of linear regression is to minimize the residuals, which are the differences between the actual values and
the predicted values. The best fit line is the one that minimizes the sum of the squared residuals, which is why it's
called the least squares method.
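In symbols, with predictions given by Y = mX + c, the least squares criterion chooses m and c to minimize the residual sum of squares:

RSS = Σ (Y − (mX + c))^2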

Linear Regression Example Using Python


Using the scikit-learn library, we can implement linear regression to predict a student's grade based on study hours.
Here is an example of how to implement this:

import pandas as pd
import numpy as np
import sklearn.linear_model as lm
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('Data/Grade_Set_1.csv')

# Create linear regression object
lr = lm.LinearRegression()

# Independent variable
x = df.Hours_Studied.values[:, np.newaxis]
# Dependent variable
y = df.Test_Grade.values

# Train the model
lr.fit(x, y)

# Print the intercept and coefficient (slope)
print("Intercept: ", lr.intercept_)
print("Coefficient: ", lr.coef_)

# Make a manual prediction
print("Manual prediction: ", 52.2928994083 + 4.74260355 * 6)

# Predict using the built-in function (expects a 2D array)
print("Using predict function: ", lr.predict([[6]]))

# Plot the fitted line
plt.scatter(x, y, color='black')
plt.plot(x, lr.predict(x), color='blue', linewidth=3)
plt.title('Grade vs Hours Studied')
plt.xlabel('Hours Studied')
plt.ylabel('Test Grade')
plt.show()
Output:

 Intercept (c): 52.29

 Coefficient (m): 4.74

 Prediction for 6 hours of study: 80.74

For a student studying for 6 hours, the predicted test grade is 80.74. This means that for every additional hour studied,
the grade increases by 4.74 points.

How Good Is Your Model?

There are three primary metrics for evaluating linear model performance:

 R-squared

 RMSE (Root Mean Squared Error)

 MAE (Mean Absolute Error)

R-Squared for Goodness of Fit

R-squared indicates the proportion of variance in the dependent variable explained by the independent variable. It
ranges from 0 to 1, with values closer to 1 signifying a better fit. An example calculation shows that 97% of the variability
in the dependent variable (test score) is explained by the independent variable (hours studied).
Root Mean Squared Error (RMSE)

RMSE measures how close predicted values are to actual values. Lower RMSE values indicate better model
performance. It has the same units as the target variable.

Mean Absolute Error (MAE)

MAE is the average of the absolute differences between the predicted and actual values. Smaller values of MAE signify
a better model.
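As a minimal sketch (assuming x, y, and the fitted lr object from the example above), all three metrics can be computed with scikit-learn:

from sklearn import metrics
import numpy as np

# Predictions from the fitted model (x, y, lr assumed from the earlier example)
y_pred = lr.predict(x)

print("R-squared:", metrics.r2_score(y, y_pred))
print("RMSE:", np.sqrt(metrics.mean_squared_error(y, y_pred)))
print("MAE:", metrics.mean_absolute_error(y, y_pred))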

Polynomial Regression
Polynomial regression is a type of regression that models the relationship between a dependent variable Y and an
independent variable X as an nth-degree polynomial. Unlike linear regression, which fits a straight line, polynomial
regression can model curves, making it more flexible and capable of fitting complex data patterns.

In polynomial regression, we introduce higher-order degree variables of the same independent variable in the
equation.

The general form of a polynomial regression equation is:

Y = b0 + b1X + b2X^2 + ... + bnX^n

Example:

For a dataset containing the number of hours studied and corresponding test grades, a polynomial regression can be
used to fit a curve to the data. The relationship between hours studied and test grades may not be perfectly linear. For
example, after a certain number of study hours, the test grades may plateau, and polynomial regression can model
that behavior better than linear regression.
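A minimal sketch with scikit-learn's PolynomialFeatures; the degree of 2 and the reuse of x and y from the grade example above are assumptions made for illustration:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Expand the single feature into polynomial terms (degree 2 is an illustrative choice)
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)   # x, y assumed from the grade example above

# Fit an ordinary linear regression on the expanded feature matrix
model = LinearRegression()
model.fit(x_poly, y)
print("R-squared (degree 2):", model.score(x_poly, y))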
Multivariate Regression

Multivariate regression is an extension of simple linear regression, where there is more than one independent variable
(predictor). This type of regression is used when the dependent variable is influenced by multiple independent
variables.

The general form of the equation for multivariate regression is:

Y = b0 + b1X1 + b2X2 + ... + bnXn

Multivariate regression is used when you want to predict an outcome based on multiple factors. For example,
predicting house prices based on multiple features such as lot size, number of bedrooms, number of bathrooms, and
the presence of a garage.

Example:

For predicting house prices, the dependent variable Y is the Price of the house. The independent variables could
include:

 Lot Size (in square feet)

 Number of Bedrooms

 Number of Bathrooms

 Whether the house has a garage

 Whether the house has air conditioning
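A minimal sketch of fitting such a model; the data below is synthetic and the column names are illustrative only, not taken from any dataset in this module:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, illustrative data (not a real housing dataset)
rng = np.random.default_rng(0)
n = 100
X_house = pd.DataFrame({
    'Lot_Size': rng.uniform(3000, 9000, n),
    'Bedrooms': rng.integers(1, 6, n),
    'Bathrooms': rng.integers(1, 4, n),
    'Garage': rng.integers(0, 2, n),
})
# Price constructed only so the sketch runs end to end
price = (30 * X_house.Lot_Size + 20000 * X_house.Bedrooms
         + 15000 * X_house.Bathrooms + 10000 * X_house.Garage
         + rng.normal(0, 20000, n))

model = LinearRegression().fit(X_house, price)
print("Intercept:", model.intercept_)
print("Coefficients:", dict(zip(X_house.columns, model.coef_)))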

Over-fitting and Under-fitting

 Under-fitting: Occurs when the model fails to capture the underlying trend in the data, resulting in low
accuracy on both training and test datasets. The model is too simple to capture the complexity of the data.

 Over-fitting: Occurs when the model fits the training data too well, capturing all the noise and outliers. This
results in high accuracy on the training set but low accuracy on the test set, as the model fails to generalize to
unseen data.

 Importance of Choosing the Right Polynomial Degree: The degree of the polynomial plays a crucial role in
avoiding over-fitting and under-fitting. A balance must be struck to ensure the model generalizes well (see the sketch below).
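A minimal sketch of this trade-off, assuming x and y from the grade example and an illustrative set of degrees: a very high degree typically scores well on the training split but poorly on the test split.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# x, y assumed from the grade example above
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (1, 2, 10):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    train_r2 = r2_score(y_train, model.predict(poly.transform(x_train)))
    test_r2 = r2_score(y_test, model.predict(poly.transform(x_test)))
    print(f"degree={degree}  train R2={train_r2:.2f}  test R2={test_r2:.2f}")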
Nonlinear Regression
 Nonlinear Models: Unlike linear models, nonlinear regression allows the fitted line (or curve) to take any
shape, which is often required when the underlying process is complex or based on physical or biological
phenomena.

 Interpretation of Nonlinear Models: Nonlinear regression models have a direct physical or biological
interpretation, making them useful for applications like enzyme kinetics, distribution modeling, and more.

 Scipy's curve_fit Function: The curve_fit function from the Scipy library is used to fit nonlinear models to data.
It estimates the parameters of a given model based on scientific theories. Common use cases include
Michaelis–Menten enzyme kinetics, Weibull distribution, and power law distribution.

Example Code for Nonlinear Regression

import numpy as np # Calculate residuals and sum of squared residuals

import matplotlib.pyplot as plt residuals = y - func(x, p1, p2)

from scipy.optimize import curve_fit fres = sum(residuals**2)

# Data

x = np.array([-2,-1.64,- # Generate fitted curve for plotting


0.7,0,0.45,1.2,1.64,2.32,2.9])
curvex = np.linspace(-2, 3, 100)
y = np.array([1.0, 1.5, 2.4, 2, 1.49, 1.2, 1.3, 1.2,
curvey = func(curvex, p1, p2)
0.5])

# Defining the nonlinear function (e.g., a


combination of sine and cosine) # Plotting the results
def func(x, p1, p2): plt.plot(x, y, 'bo') # Original data points
return p1 * np.sin(p2 * x) + p2 * np.cos(p1 * x) plt.plot(curvex, curvey, 'r') # Fitted curve

plt.title('Non-linear fitting')
# Fit the model to the data plt.xlabel('x')
popt, pcov = curve_fit(func, x, y, p0=(1.0, 0.2)) plt.ylabel('y')
p1 = popt[0] plt.legend(['Data', 'Fit'], loc='best')
p2 = popt[1] plt.show()

Output

The output will show the original data points (in blue) and the nonlinear fitted curve (in red) that best represents
the relationship between the variables based on the chosen model.
Stochastic Gradient Descent (SGD)
Overview: Stochastic Gradient Descent (SGD) is an optimization algorithm used to minimize the error (cost function)
when fitting a model to a large dataset. It iteratively updates the model's parameters (weights) by calculating gradients
and adjusting the parameters in the direction that reduces the error.

Learning Rate (α): The learning rate controls the size of the steps taken during optimization. A smaller learning rate
reduces the risk of overshooting the minimum of the cost function, at the cost of slower, more gradual convergence.

Logistic Regression Solver: In scikit-learn, the 'liblinear' solver works well for smaller datasets (it was the default in
older releases; newer releases default to 'lbfgs'). For large datasets with many independent variables, the 'sag'
(Stochastic Average Gradient) solver is recommended because it converges faster on such data.
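A minimal sketch of both points; the reuse of the grade data x, y and the parameter values are assumptions made for illustration:

from sklearn.linear_model import SGDRegressor, LogisticRegression
from sklearn.preprocessing import StandardScaler

# SGD-based linear regression; scaling the input first helps SGD converge
x_scaled = StandardScaler().fit_transform(x)   # x, y assumed from the grade example
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
sgd.fit(x_scaled, y)
print("SGD coefficient:", sgd.coef_, "intercept:", sgd.intercept_)

# For large classification datasets, switch the logistic regression solver to 'sag'
log_reg = LogisticRegression(solver='sag', max_iter=1000)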

Regularization
 Regularization: A technique to prevent over-fitting by penalizing overly complex models. It helps to reduce the
effect of noisy variables, especially when the number of variables increases, thereby improving the model's ability
to generalize.

 Ridge and LASSO Regression: Two regularization techniques provided by Statsmodels and scikit-learn:

o LASSO (L1 Regularization): Adds a penalty to the absolute value of the coefficients, driving some
coefficients to zero, effectively performing variable selection.

o Ridge Regression (L2 Regularization): Adds a penalty to the square of the magnitude of coefficients,
shrinking them towards zero but not exactly to zero. It helps improve model accuracy when many variables
contribute slightly to the prediction.
 Alpha Parameter: In both LASSO and Ridge regression, the alpha parameter controls the strength of the
regularization. A higher alpha increases the penalty, leading to simpler models (see the sketch after this list).

 Impact on Model Complexity: Regularization techniques like Ridge and LASSO reduce the complexity of the model
by limiting the size of the coefficients, helping to avoid over-fitting while maintaining a model that generalizes
better.
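A minimal sketch; X and y stand for any numeric feature matrix and continuous target, and alpha=1.0 is an illustrative value:

from sklearn.linear_model import Ridge, Lasso

# alpha controls the regularization strength
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge coefficients:", ridge.coef_)   # shrunk towards zero
print("LASSO coefficients:", lasso.coef_)   # some may be driven exactly to zero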

Multiclass Logistic Regression


Logistic regression can be extended to multiclass problems, where the goal is to predict a categorical dependent variable
with more than two classes. One of the well-known datasets used for demonstrating this concept is the Iris dataset,
which contains 3 classes of Iris plants (Iris-Setosa, Iris-Versicolor, and Iris-Virginica), each having 50 instances. These
classes are already converted to integer labels:

 0 = Iris-Setosa
 1 = Iris-Versicolor
 2 = Iris-Virginica

The Iris dataset includes features such as petal length and petal width, which are used for classification.

Loading the Data

Code Example:

from sklearn import datasets


import numpy as np
import pandas as pd

iris = datasets.load_iris()
X = iris.data
y = iris.target

print('Class labels:', np.unique(y))

Output:

Class labels: [0 1 2]

This code loads the Iris dataset, where X represents the features (petal length, petal width, etc.) and y represents the
target variable (the iris species). The unique class labels in the target variable are [0, 1, 2].

Normalizing the Data

Since the features might have different units of measurement, it is important to normalize the data to ensure that all
features are on the same scale. This helps in the effective training of the model.

Code Example:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X)
X = sc.transform(X)

This code applies standard scaling to the features, ensuring that each feature has a mean of 0 and a standard deviation
of 1, which is useful for model performance.

Splitting the Data into Train and Test Sets


It is a good practice to split the dataset into training and testing sets. This helps in evaluating the performance of the
model on unseen data. The train_test_split function from scikit-learn is used to achieve this.

Code Example:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

This code splits the data into training (70%) and testing (30%) sets, ensuring reproducibility by setting a
random_state.

Training the Logistic Regression Model and Evaluating the Model

For multiclass classification, Logistic Regression is trained using the LogisticRegression model from scikit-learn. The
model is evaluated on both training and test sets using accuracy, a confusion matrix, and a classification report, as in the sketch below.
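A minimal sketch, continuing from the train/test split above (solver and other parameters are left at their defaults):

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

clf = LogisticRegression(random_state=0)
clf.fit(X_train, y_train)

print("Train - Accuracy:", metrics.accuracy_score(y_train, clf.predict(X_train)))
print("Test - Accuracy:", metrics.accuracy_score(y_test, clf.predict(X_test)))
print("Test - Confusion Matrix:\n", metrics.confusion_matrix(y_test, clf.predict(X_test)))
print("Test - Classification Report:\n", metrics.classification_report(y_test, clf.predict(X_test)))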

Generalized Linear Models (GLM)


Generalized Linear Models (GLM) were introduced by John Nelder and Robert Wedderburn to unify different statistical
models such as linear regression, logistic regression, and Poisson regression. GLMs allow for a wide range of models
depending on the distribution of the target variable. These models use a linear predictor to estimate the relationship
between the input variables and the output, but they differ based on the family of distributions they assume.

Different GLM Distribution Families: The GLM framework includes several distribution families, each appropriate for
different types of data and applications. Here are some examples:

 Binomial: Used for binary response variables (e.g., success/failure).

 Poisson: Applied when the target variable represents counts (e.g., number of occurrences).

 Gaussian: Used when the target variable is continuous (e.g., normally distributed data).

 Gamma: Used for modeling waiting times or time between Poisson events.

 Inverse Gaussian: Suitable for situations where there is an inverse relationship between time and distance.

 Negative Binomial: Models the number of successes before a failure in a sequence of Bernoulli trials.

Example: In Listing 3-35, a dataset is loaded to apply both a Linear Regression model and a Generalized Linear Model
(GLM) to the data. The dataset used (Grade_Set_1.csv) contains the number of hours studied (independent variable)
and the test grades (dependent variable).

1. Linear Regression Model:

 A simple linear regression is applied to predict test grades based on hours studied.

 The model is trained using the .fit() method, and the intercept and coefficient are printed out. In this case,
the intercept is 49.68 and the coefficient is 5.02, meaning for each additional hour studied, the test grade
increases by approximately 5.02 points.

lr = lm.LinearRegression()

x = df.Hours_Studied.values[:, np.newaxis]  # Independent variable
y = df.Test_Grade.values                    # Dependent variable

lr.fit(x, y)
print("Intercept: ", lr.intercept_)
print("Coefficient: ", lr.coef_)

2. Generalized Linear Model (GLM):

 GLM is then applied with the Gaussian family and identity link function.

 The sm.GLM() function is used with the target variable y and the feature x. The model is then fitted using
.fit(), and the results (coefficients, p-values, and confidence intervals) are displayed.

 The output shows the same coefficient of 5.02 and intercept of 49.68, indicating that the GLM model gives
the same result as linear regression in this case but can be extended to more complex distributions.

import statsmodels.api as sm

x = sm.add_constant(x, prepend=False) # Add intercept to x

model = sm.GLM(y, x, family=sm.families.Gaussian())

model = model.fit()

print(model.summary())

Supervised Learning – Process Flow


Overview: Supervised learning involves training a machine learning model on a labeled dataset, where the model
learns to map input features to a target variable. The process flow for supervised learning typically involves:

1. Training and Validation:

o Historical data is used to train the model, and validation techniques (like cross-validation) are
employed to evaluate its performance.

2. Prediction:

o Once trained and validated, the model can be applied to new, unseen data to make predictions about
future or unknown outcomes.

Process Flow Diagram: Figure 3-13 summarizes the supervised learning process flow. The general steps include:

1. Data Collection: Gather a labeled dataset.

2. Data Preprocessing: Clean, normalize, and prepare the data.


3. Model Training: Apply machine learning techniques (such as linear regression, logistic regression, or GLM) to
train the model.

4. Model Evaluation: Validate the model using appropriate metrics (like accuracy, precision, recall).

5. Prediction: Use the trained model to predict outcomes on new data.

This process helps build models that can generalize well to new, unseen data and make accurate predictions about
real-world phenomena.

Decision Trees in Machine Learning


Overview A decision tree is a machine learning model introduced by J. R. Quinlan in 1986, designed to make decisions
based on a series of tests on attributes. It is structured as a tree where:

 Internal nodes represent tests on attributes.

 Branches represent outcomes of these tests.

 Leaf nodes represent class labels, determining the final decision.

Types of Nodes in a Decision Tree:

1. Root Node: The starting point of the tree where all data is considered.

2. Branch Node: Represents decision points where data is split based on an attribute.

3. Leaf Node: The end of a decision path where the class label is assigned.

Key Example: For the given decision, "Should you play outside on Saturday morning?", rules could be created:

 Rule 1: If it's sunny and temperature > 30°C → Do not play.

 Rule 2: If it's rainy and windy → Do not play.

These rules, which are derived from the decision tree, are often more useful in business contexts than the final decision
itself.
How the Tree Splits and Grows

The decision tree construction follows a greedy algorithm, meaning it splits the data recursively from the root to the
leaf nodes based on the best split criterion at each step. Here’s how the process works:

 Initially, all the training examples are at the root.

 Data is partitioned at each step based on selected attributes (features).

 The decision to split depends on a heuristic or statistical impurity measure, such as information gain (entropy) or the Gini index.

Stopping Conditions for Tree Growth

To prevent the decision tree from growing excessively (overfitting), splitting stops when:

1. All samples at a node belong to the same class.

2. There are no remaining attributes to further partition.

3. There are no samples left at the node.

If no further split is possible, majority voting is used for the leaf class classification.

Key Parameters in Decision Trees

To manage the size and complexity of a decision tree, several parameters are adjusted during training:

 max_features: Limits the number of features considered for each split. By default, all features are considered.

 min_samples_split: Specifies the minimum number of samples required to split a node. If the number is not
met, the node will not split further.

 min_samples_leaf: Ensures that leaf nodes have a minimum number of samples, preventing overfitting.

 max_depth: Limits the depth of the tree, stopping further splits once this limit is reached.

These parameters help control the growth of the tree and ensure that the model does not overfit.

Code Example for Decision Tree Model

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics

import pydot
from io import StringIO

# Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Standardize features
sc = StandardScaler()
sc.fit(X)
X = sc.transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Initialize and fit the model
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train)

# Evaluation metrics
print("Train - Accuracy:", metrics.accuracy_score(y_train, clf.predict(X_train)))
print("Test - Accuracy:", metrics.accuracy_score(y_test, clf.predict(X_test)))

# Generate visualization
tree.export_graphviz(clf, out_file='tree.dot')

out_data = StringIO()
tree.export_graphviz(clf, out_file=out_data, feature_names=iris.feature_names,
                     class_names=clf.classes_.astype(int).astype(str),
                     filled=True, rounded=True, special_characters=True)

graph = pydot.graph_from_dot_data(out_data.getvalue())
graph[0].write_pdf("iris.pdf")  # Save the tree visualization as a PDF


OUTPUT:

Support Vector Machine (SVM)


Support Vector Machine (SVM) is a supervised machine learning algorithm proposed by Vladimir N. Vapnik and Alexey
Ya. Chervonenkis in 1963. The primary goal of SVM is to identify a hyperplane that best separates two classes with the
maximum margin. The margin is the distance between the closest data points from each class and the hyperplane.

The objective is to find the hyperplane with the maximum margin that separates the two classes optimally. SVM is
robust to outliers because it only focuses on the support vectors (the data points closest to the decision boundary),
unlike algorithms like logistic regression, which are affected by outliers.
Key Parameters

1. C (Penalty Parameter): It controls the trade-off between achieving a high margin and classifying training
points correctly. A high value of C tries to classify all training data correctly, while a lower value allows
some misclassification in exchange for a larger margin.

2. Kernel: A kernel is a function used to compute the similarity between data points. The kernel can be:

 Linear: A linear hyperplane to separate the classes.

 RBF (Radial Basis Function): A non-linear hyperplane that can capture more complex relationships.

 Polynomial: A polynomial function for more flexible separation.

 Sigmoid: A sigmoid function.

 Precomputed: A kernel computed externally.

Example with Iris Dataset

The following code demonstrates an SVM classifier using the Iris dataset:

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load iris dataset
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]  # Only use two features for simplicity
y = iris.target

# Standardize features
sc = StandardScaler()
X = sc.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create and train SVM classifier
clf = SVC(kernel='linear', C=1.0, random_state=0)
clf.fit(X_train, y_train)

# Evaluate model
print("Train - Accuracy:", metrics.accuracy_score(y_train, clf.predict(X_train)))
print("Test - Accuracy:", metrics.accuracy_score(y_test, clf.predict(X_test)))

In this example:

 We use a linear kernel and split the dataset into training and test sets.

 The accuracy of the model on the training and test sets is calculated.

Key Evaluation Metrics

For classification problems, we often use metrics like:

 Accuracy: The ratio of correctly predicted instances to the total instances.

 Confusion Matrix: A table showing actual vs predicted classifications.

 Classification Report: Precision, recall, and F1-score for each class.


Visualizing Decision Boundaries

We can also visualize the decision boundary of the SVM, especially when working with two features. The code below
creates a synthetic dataset and plots the decision boundary:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Create a synthetic dataset
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, weights=[.5, .5], random_state=0)

# Create and train SVM classifier
clf = SVC(kernel='linear', random_state=0)
clf.fit(X, y)

# Get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]

# Plotting the decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='autumn')
plt.plot(xx, yy, 'k-', label='Decision Boundary')

# Plot the support vectors
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80,
            facecolors='none', edgecolors='k', label='Support Vectors')
plt.xlabel('X1')
plt.ylabel('X2')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

OUTPUT:

This code:

 Generates a synthetic dataset using make_classification.

 Creates an SVM model with a linear kernel and fits it to the data.

 Plots the decision boundary, support vectors, and the data points.

k Nearest Neighbors (kNN)


Introduction to kNN:

 k-Nearest Neighbors (kNN) is a non-parametric classification algorithm developed by Fix and Hodges in 1951.

 The method is widely used for pattern classification when reliable parametric estimates of probability
densities are either unknown or difficult to determine.

 The core idea is to classify an unknown data point based on the majority vote from its k nearest neighbors,
where k is a positive integer representing the number of neighbors to consider.
How kNN Works:

 The algorithm calculates the distance between the unknown data point and all other data points in the
dataset.

 It then selects the k closest data points and assigns the class that is most common among these neighbors.

 The distance metric commonly used is Minkowski distance, though other metrics like Euclidean or
Manhattan distance can be used as well.

 In Figure 3-16, the process is demonstrated where k = 5 for the nearest neighbors. The class of the unknown
data point is decided by the majority class of these 5 neighbors.


Key Considerations in kNN:

1. Choosing the right k:

 For a two-class problem, it is recommended to choose an odd value for k to avoid ties.
 k should not be a multiple of the number of classes, as this could lead to incorrect predictions.

2. Drawback of kNN:

 The main drawback of kNN is the high computational complexity. Since the algorithm calculates the
distance between each point and every other point in the dataset, it can be slow, especially for large
datasets.

Example:

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')

clf.fit(X_train, y_train)
 In this code snippet, the KNeighborsClassifier is initialized with k=5, using the Minkowski distance metric (p=2
refers to the Euclidean distance). It is then trained on the training data (X_train, y_train).

Evaluation Metrics:

 Accuracy: This is the proportion of correct predictions out of all predictions.

 Confusion Matrix: This matrix shows the number of correct and incorrect classifications for each class.

 Classification Report: This provides precision, recall, f1-score, and support for each class (computed in the sketch below).
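A minimal sketch computing these metrics, continuing from the kNN fit above and the earlier train/test split:

from sklearn import metrics

y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

print("Train - Accuracy:", metrics.accuracy_score(y_train, y_train_pred))
print("Train - Confusion Matrix:\n", metrics.confusion_matrix(y_train, y_train_pred))
print("Test - Accuracy:", metrics.accuracy_score(y_test, y_test_pred))
print("Test - Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_test_pred))
print("Test - Classification Report:\n", metrics.classification_report(y_test, y_test_pred))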
Sample Output:

Train - Accuracy: 0.971

Train - Confusion Matrix: [[34 0 0] [0 31 1] [0 2 37]]

Test - Accuracy: 0.978

Test - Confusion Matrix: [[16 0 0] [0 17 1] [0 0 11]]

 Train Accuracy: 97.1%, showing that the model performs well on the training set.

 Test Accuracy: 97.8%, demonstrating that the model generalizes well to new, unseen data.

 Confusion Matrix: The confusion matrix provides a detailed breakdown of predictions for each class.

 Classification Report: The report provides additional metrics, such as precision (the proportion of positive
predictions that are actually correct) and recall (the proportion of actual positives that are correctly identified).

Notes:

 Decision Trees, SVM, and kNN algorithms can be used not only for classification tasks but also for regression.
For continuous dependent variables, Scikit-learn offers DecisionTreeRegressor, SVR (Support Vector Regressor),
and KNeighborsRegressor as alternatives to these classification algorithms.

Time-series forecasting
Time-series forecasting is a method used to predict future data points based on historical data collected
sequentially over time. The data is typically recorded at regular intervals and can display a variety of patterns. The
key concepts are:

Key Components of Time Series:

1. Trend: A long-term increase or decrease in the data.

2. Seasonality: Patterns that repeat at regular intervals, such as higher sales during the holiday season.

3. Cycle: Longer-term fluctuations that are not fixed in period, usually influenced by external factors.

ARIMA Model:

The Autoregressive Integrated Moving Average (ARIMA) model is one of the most popular models for time-series
forecasting. The ARIMA model has three key components:

 Autoregressive (AR): This refers to a model where past values of the variable are used to predict future
values.

 Integrated (I): This is the differencing of the data to make it stationary (removing trends or seasonality).

 Moving Average (MA): This involves using past forecast errors to predict future values.

The model is represented as ARIMA(p, d, q):


 p: The order of the autoregressive part (the number of past values used).

 d: The degree of differencing (the number of times the data is differenced to make it stationary).

 q: The order of the moving average part (the number of past forecast errors used).

Steps to Build an ARIMA Model:

1. Stationarity Check: Before applying ARIMA, you need to check if the data is stationary. This can be done
using tests like the Dickey-Fuller test and by visualizing the data for trends or seasonality.

2. Plot ACF and PACF: These plots help to determine the optimal values for p and q by showing the
autocorrelations at various lags.

3. Model Building: Once the data is stationary, an ARIMA model can be built using the parameters p, d, and q.
The model is then trained on historical data.

4. Model Evaluation: After training, the model is evaluated using metrics such as AIC, BIC, Mean Absolute Error
(MAE), and Root Mean Squared Error (RMSE) to assess its performance (a minimal code sketch follows).
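A minimal sketch with statsmodels; the file name 'sales.csv', its column names, and the order (1, 1, 1) are placeholders chosen for illustration:

import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

# Any univariate series indexed by date; file and column names are placeholders
series = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)['Sales']

# 1. Stationarity check with the Augmented Dickey-Fuller test
adf_stat, p_value, *rest = adfuller(series)
print("ADF p-value:", p_value)          # a large p-value suggests differencing is needed

# 2. ACF / PACF of the differenced series help choose q and p
plot_acf(series.diff().dropna())
plot_pacf(series.diff().dropna())

# 3. Build and fit the model; order=(p, d, q)
model = ARIMA(series, order=(1, 1, 1))
results = model.fit()

# 4. Evaluate and forecast
print("AIC:", results.aic)
print(results.forecast(steps=5))        # forecast the next 5 periods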

Evaluating Classification Model Performance


A confusion matrix is a crucial tool for evaluating the performance of classification models. It provides insight into
how well the model distinguishes between classes. A typical confusion matrix for a binary classification problem is
shown below:

Predicted: FALSE Predicted: TRUE

Actual: FALSE True Negatives (TN) False Positives (FP)

Actual: TRUE False Negatives (FN) True Positives (TP)

Definitions of Terms:

 True Negatives (TN): Correctly predicted as FALSE (the model correctly identified a negative class).

 False Positives (FP): Incorrectly predicted as TRUE when it was actually FALSE (also known as Type I error).

 False Negatives (FN): Incorrectly predicted as FALSE when it was actually TRUE (also known as Type II error).

 True Positives (TP): Correctly predicted as TRUE (the model correctly identified a positive class).

Ideal Model Characteristics

A good classification model should ideally have:

 High True Negatives (TN) and True Positives (TP).

 Low False Positives (FP) and False Negatives (FN), meaning fewer Type I and Type II errors.

Key Metrics Derived from Confusion Matrix


To understand the performance of a classification model, several key metrics are calculated using values from the
confusion matrix. Below is a detailed breakdown of these metrics:
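The standard metrics computed from these four counts are:

 Accuracy = (TP + TN) / (TP + TN + FP + FN)

 Precision = TP / (TP + FP)

 Recall (Sensitivity, True Positive Rate) = TP / (TP + FN)

 Specificity (True Negative Rate) = TN / (TN + FP)

 False Positive Rate = FP / (FP + TN) = 1 − Specificity

 F1-score = 2 × (Precision × Recall) / (Precision + Recall)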
ROC Curve
The Receiver Operating Characteristic (ROC) curve is a crucial tool for evaluating the performance of a binary classifier.
It visually represents the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) at various
threshold settings. The Area Under the Curve (AUC) is a summary statistic that provides a single number to represent
the overall performance of the model.

ROC Curve Interpretation:


 The ROC curve plots the TPR (y-axis) against the FPR (x-axis). A perfect classifier will have an ROC curve that
passes through the top left corner (0, 1), indicating a TPR of 1 and FPR of 0.

 The AUC is the area under this curve:

o An AUC of 1 indicates a perfect model.

o An AUC of 0.5 indicates a model no better than random guessing.

o A higher AUC indicates better model performance.

 Example Code to Calculate and Plot ROC Curve:

from sklearn import metrics
import matplotlib.pyplot as plt

# Calculate false positive rate and true positive rate
fpr, tpr, _ = metrics.roc_curve(y, model.predict_proba(x)[:, 1])

# Calculate the AUC
roc_auc = metrics.auc(fpr, tpr)

# Print AUC
print('ROC AUC: %0.2f' % roc_auc)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # diagonal line (random guessing)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
