Module-3
Nearest-Neighbor Learning
Definition:
o k-Nearest Neighbors (k-NN) is a supervised, similarity-based learning algorithm that predicts the output for a test instance from the ‘K’ training instances closest to it in the feature space.
Working:
o Classification:
The algorithm determines the class of a test instance by considering the ‘K’
nearest neighbors and selecting the class with the majority vote.
o Regression:
The output is the mean of the target variable values of the ‘K’ nearest
neighbors.
Assumption:
o k-NN relies on the assumption that similar objects are closer to each other in the feature
space.
Instance-Based Learning:
o Memory-Based: The algorithm does not build a prediction model ahead of time; it stores the training data and makes predictions only when a test instance is presented.
o Lazy Learning: No model is constructed during training; the learning process
happens only during testing when predictions are required.
Distance Metric:
o The most common distance metric used is Euclidean distance to measure the
closeness of training data instances to the test instance.
Choosing ‘K’:
o The value of ‘K’ determines how many neighbors should be considered for the prediction.
It is typically selected by experimenting with different values of K to find the optimal one
that produces the most accurate predictions.
Classification Process:
o For a discrete target variable (classification): The class of the test instance is
determined by the majority vote of the 'K' nearest neighbors.
o For a continuous target variable (regression): The output is the mean of the output variable values of the ‘K’ nearest neighbors (a code sketch follows this list).
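To make this concrete, here is a minimal Python/NumPy sketch of the procedure just described (not from the source notes; the function name knn_predict and the toy data are illustrative assumptions): Euclidean distance, majority vote for classification, mean for regression.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, task="classification"):
    """Predict the label/value for one test instance using k-NN."""
    # Euclidean distance from the test instance to every training instance
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the 'k' closest training instances
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # Majority vote among the k nearest neighbors
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Regression: mean of the target values of the k nearest neighbors
    return y_train[nearest].mean()

# Toy example (illustrative data only)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0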
Advantages:
Disadvantages:
Weighted k-Nearest Neighbor Algorithm
Motivation:
o Traditional k-NN assigns equal importance to all the ‘k’ nearest neighbors, which can lead to
poor performance when:
Neighbors are at varying distances.
The nearest instances are more relevant than the farther ones.
Working Principle:
Closer neighbors get higher weights, while farther neighbors get lower
weights.
o The final prediction is based on the weighted majority vote (classification) or the
weighted average (regression) of the k nearest neighbors.
Weight Assignment:
o Uniform Weighting: All neighbors are given the same weight (as in standard k-NN).
o Distance-Based Weighting: Weights are computed from the inverse of the distance, giving closer neighbors more influence (see the sketch below).
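A minimal sketch of inverse-distance weighting for classification, assuming numeric features (not from the source notes; the function name and the small epsilon added to avoid division by zero are illustrative choices):

import numpy as np

def weighted_knn_predict(X_train, y_train, x_test, k=3, eps=1e-9):
    """Distance-weighted k-NN classification (weighted majority vote)."""
    distances = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(distances)[:k]
    # Inverse-distance weights: closer neighbors get larger weights
    weights = 1.0 / (distances[nearest] + eps)
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# For regression, the weighted mean would be:
#   np.sum(weights * y_train[nearest]) / np.sum(weights)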
Advantages:
Applications:
o Classification: Predict the class of the test instance by weighted voting of the k nearest
neighbors.
o Regression: Predict the output value by computing the weighted mean of the k nearest
neighbors.
Limitations:
o Computational cost increases as distance calculations and weight assignments are performed
for each query.
o Sensitive to the choice of the distance metric (e.g., Euclidean, Manhattan, etc.).
Nearest Centroid Classifier
The Nearest Centroid Classifier (also known as the Mean Difference Classifier) is a simple alternative to k-Nearest Neighbors (k-NN) for similarity-based classification.
The idea of this classifier is to assign a test instance to the class whose centroid/mean is closest to that instance.
Algorithm
1. Compute the mean/centroid of each class from its training instances.
2. Compute the distance between the test instance and the mean/centroid of each class (Euclidean distance).
3. Predict the class by choosing the class with the smallest distance, as sketched below.
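A minimal Python/NumPy sketch of these three steps (not from the source notes; the function name is an illustrative assumption):

import numpy as np

def nearest_centroid_predict(X_train, y_train, x_test):
    """Assign x_test to the class whose centroid (mean) is closest."""
    classes = np.unique(y_train)
    # Step 1: centroid (mean vector) of each class
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in classes}
    # Steps 2-3: Euclidean distance to each centroid, pick the smallest
    return min(classes, key=lambda c: np.linalg.norm(x_test - centroids[c]))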
Locally Weighted Regression
Using the nearest-neighbors algorithm, we find the instances that are closest to a test instance and fit a linear function to those ‘K’ nearest instances; this is the local regression model.
The key idea is that a separate linear function is fitted, for each query point, to its ‘K’ nearest neighbors so as to minimize the error; because a new local fit is made for every query, the overall prediction is no longer a single straight line but a curve.
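The following sketch illustrates one simple form of this idea, fitting an ordinary least-squares line to only the K nearest neighbors of each query (not from the source notes; the function name and the choice of an unweighted local fit are illustrative assumptions):

import numpy as np

def local_linear_predict(X_train, y_train, x_query, k=5):
    """Fit a linear function to the k nearest neighbors of x_query and predict."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    # Design matrix with an intercept column, built only from the neighborhood
    A = np.hstack([np.ones((len(nearest), 1)), X_train[nearest]])
    coef, *_ = np.linalg.lstsq(A, y_train[nearest], rcond=None)
    # Evaluate the locally fitted linear function at the query point
    return np.concatenate(([1.0], x_query)) @ coef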
Chapter – 02
Regression Analysis
Introduction to Regression
Definition:
Regression analysis is a supervised learning technique used to model the relationship between
one or more independent variables (x) and a dependent variable (y).
Objective:
The goal is to predict or forecast the dependent variable (y) based on the independent variables
(x), which are also called explanatory, predictor, or independent variables.
Mathematical Representation:
The relationship is represented by a function of the form y = f(x), where x denotes the independent variable(s) and y the dependent variable.
Purpose:
Regression analysis helps to determine how the dependent variable changes when an independent
variable is varied while others remain constant.
Applications:
Sales forecasting
Bond values in portfolio management
Insurance premiums
Agricultural yield predictions
Real estate pricing
Prediction Focus:
Regression is primarily used for predicting continuous or quantitative variables, such as price,
revenue, and other measurable factors.
Definition:
Linear Regression is a fundamental supervised learning algorithm used to model the
relationship between one or more independent variables (predictors) and a dependent variable
(target).
Objective:
The primary goal of linear regression is to find a linear equation that best fits the data points.
This equation is used to predict the dependent variable based on the values of the independent
variables.
Mathematical Representation:
The relationship is represented as:
y = a0 + a1·x + e
where a0 is the intercept, a1 is the slope (regression coefficient), and e is the error (residual) term.
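A minimal sketch of fitting this line by ordinary least squares in Python/NumPy (not from the source notes; the function name and the small data set are illustrative assumptions):

import numpy as np

def fit_simple_linear_regression(x, y):
    """Ordinary least squares for y = a0 + a1 * x."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance(x, y) divided by variance(x)
    a1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    a0 = y_mean - a1 * x_mean
    return a0, a1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
a0, a1 = fit_simple_linear_regression(x, y)
print(f"y = {a0:.2f} + {a1:.2f} * x")   # roughly y = 0.06 + 1.00 * x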
Assumptions:
Applications:
Advantages:
Limitations:
Multiple regression model involves multiple predictors or independent variables and one
dependent variable.
This is an extension of the linear regression problem. A basic assumption of multiple linear regression is that the independent variables are not highly correlated, so that the multicollinearity problem does not arise.
Definition:
Multiple Linear Regression (MLR) is an extension of simple linear regression, where multiple
independent variables (predictors) are used to model the relationship with a single dependent
variable (target).
Mathematical Representation:
The relationship is represented as:
y = a0 + a1x1 + a2x2 + … + anxn + e
where a0 is the intercept, a1 … an are the regression coefficients of the predictors x1 … xn, and e is the error term.
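A minimal sketch of estimating these coefficients by least squares (not from the source notes; the function name is an illustrative assumption):

import numpy as np

def fit_multiple_linear_regression(X, y):
    """Least-squares estimate of the coefficients in y = a0 + a1*x1 + ... + an*xn."""
    # Prepend a column of ones so the intercept a0 is estimated together with the slopes
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef  # [a0, a1, ..., an]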
Assumptions:
o No Multicollinearity: The independent variables should not be highly correlated with each other. Multicollinearity can cause issues in estimating the coefficients accurately.
o Normality of Residuals: The residuals (errors) should be normally distributed for valid inference and hypothesis testing.
o Linearity: The relationship between each independent variable and the dependent variable should be linear.
o Independence of Errors: Observations should be independent of each other.
o Homoscedasticity: The variance of residuals should be constant across all levels of the independent variables.
Applications:
o Predicting house prices based on multiple features (size, location, number of rooms, etc.).
o Estimating the sales of a product based on various factors (price, advertising budget,
competition, etc.).
o Modeling health outcomes based on multiple risk factors (age, BMI, physical activity, etc.).
Advantages:
o Can model the relationship between multiple predictors and a single outcome.
o Provides insights into how different predictors influence the dependent variable.
Limitations:
Polynomial Regression
Definition:
Polynomial Regression is a form of regression analysis that models the relationship between the
independent variable(s) and the dependent variable as a polynomial function.
It is used when the relationship between variables is non-linear and cannot be effectively modeled
using linear regression.
Purpose:
When the data exhibits a non-linear trend, linear regression may result in large errors.
Polynomial regression overcomes this limitation by fitting a curved line to the data.
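As an illustration of fitting such a curve (not from the source notes; the data values are invented for the example), a degree-2 polynomial can be fitted by least squares with NumPy:

import numpy as np

# Data with a clearly non-linear (roughly quadratic) trend -- illustrative values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 4.2, 8.8, 16.1, 24.9])

# Fit a degree-2 polynomial y = a0 + a1*x + a2*x^2 by least squares
coeffs = np.polyfit(x, y, deg=2)          # coefficients, highest power first
y_hat = np.polyval(coeffs, x)             # predictions from the fitted curve
print(np.round(coeffs, 2))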
Applications:
Advantages:
Limitations:
Increasing the polynomial degree can lead to overfitting the training data.
Sensitive to outliers, which can significantly distort the fitted curve.
May require careful tuning of the degree n to balance bias and variance.
Logistic Regression
Definition:
Logistic Regression is a supervised learning algorithm used for classification problems,
particularly binary classification, where the output is a categorical variable with two possible
outcomes (e.g., yes/no, pass/fail, spam/not spam).
Purpose:
Logistic Regression predicts the probability of a categorical outcome and maps the
prediction to a value between 0 and 1. It works well when the dependent variable is binary.
Applications:
Core Concept:
o Logistic Regression outputs the probability that an instance belongs to the positive class. For instance, if the predicted probability of an email being spam is 0.7, there is a 70% chance the email is spam.
o Linear regression can predict values outside the range of 0 to 1, which is unsuitable for
probabilities.
o Logistic Regression overcomes this by using a sigmoid function to map values to the range [0,
1].
Sigmoid Function:
The sigmoid function (also called the logistic function) is used to map any real number to the range [0, 1]. It is mathematically represented as:
sigmoid(z) = 1 / (1 + e^(−z))
For example:
If the probability of an event is 0.75, the odds are: 0.75 / (1 − 0.75) = 3, i.e., odds of 3 to 1.
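A short Python sketch tying the sigmoid and the odds together (not from the source notes; it simply verifies that the sigmoid of the log-odds recovers the original probability):

import numpy as np

def sigmoid(z):
    """Map any real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

p = 0.75
odds = p / (1 - p)          # 3.0, i.e. odds of 3 to 1
log_odds = np.log(odds)     # ~1.0986; sigmoid(log_odds) recovers 0.75
print(odds, sigmoid(log_odds))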
Advantages:
Limitations:
Struggles with non-linear decision boundaries (can be addressed with extensions like
polynomial logistic regression).
Sensitive to outliers in the dataset.
Chapter – 03
Overview:
Decision tree learning is a popular supervised predictive model for classification tasks.
It performs inductive inference, generalizing from observed examples.
It can classify both categorical and continuous target variables.
The model is often used for solving complex classification problems with high
accuracy.
Root Node: The topmost node that represents the entire dataset.
Internal/Decision Nodes: These are nodes that perform tests on input attributes and split
the dataset based on test outcomes.
Branches: Represent the outcomes of a test condition at a decision node.
Leaf Nodes/Terminal Nodes: Represent the target labels or output of the decision process.
Path: A path from root to leaf node represents a logical rule for classification.
Goal: Construct a decision tree from the given training dataset.
Tree Construction:
o Start from the root and recursively find the best attribute for splitting.
o This process continues until the tree reaches leaf nodes that cannot be further
split.
o The tree represents all possible hypotheses about the data.
Output: A fully constructed decision tree that represents the learned model.
Inference or Classification:
Goal: For a given test instance, classify it into the correct target class.
Classification:
o Start at the root node and traverse the tree based on the test conditions for each
attribute.
o Continue evaluating test conditions until reaching a leaf node, which provides the target class label for the instance (a traversal sketch follows this list).
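A minimal sketch of this root-to-leaf traversal in Python (not from the source notes; the Node class, the weather attributes, and the example tree are illustrative assumptions):

class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute       # attribute tested at this decision node
        self.branches = branches or {}   # maps each test outcome to a child node
        self.label = label               # target class, set only at leaf nodes

def classify(node, instance):
    """Traverse from the root, following test outcomes until a leaf is reached."""
    while node.label is None:
        node = node.branches[instance[node.attribute]]
    return node.label

# Illustrative tree: root tests 'Outlook'; one branch tests 'Wind'
tree = Node("Outlook", {
    "Sunny": Node(label="No"),
    "Overcast": Node(label="Yes"),
    "Rain": Node("Wind", {"Weak": Node(label="Yes"), "Strong": Node(label="No")}),
})
print(classify(tree, {"Outlook": "Rain", "Wind": "Weak"}))   # -> Yes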
Advantages:
5. Fast to train.
Disadvantages:
1. It is difficult to determine how deep the tree should grow and when to stop.
2. Sensitive to errors and missing attribute values in training data.
3. Computational complexity in handling continuous attributes, requiring
discretization.
4. Risk of overfitting with complex trees.
5. Not suitable for classifying multiple output classes.
6. Learning an optimal decision tree is an NP-complete problem.
Several decision tree algorithms are widely used in classification tasks, including ID3, C4.5, and
CART, among others.
These algorithms differ in their splitting criteria, handling of attributes, and robustness to data
characteristics.
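For example, ID3 chooses the split attribute by information gain (reduction in entropy). A minimal sketch of that criterion (not from the source notes; the function names are illustrative assumptions):

import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy of a collection of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute_values, labels):
    """Reduction in entropy obtained by splitting the data on one attribute."""
    total = entropy(labels)
    n = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == v]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder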
C4.5:
Algorithm