Week 5 Slides
Week 5 - Classification
Neural Network
● The goal of artificial neural networks (ANNs) is to model
complex relationships between input and output
variables by mimicking the structure of biological
neurons.
● Neural networks are based on simple mathematical
models, such as linear models where inputs (X1, X2,
X3) are scaled and combined to predict an output (Y).
● Y = 1 + 2X1 + 3X2 + 4X3
Neural Network
Neurons are the primary cells that make up the nervous system and are
essential for transmitting information throughout the body. Each neuron
consists of three main parts:
Cell Body (Soma): This is the central part of the neuron where the
nucleus is located. It’s responsible for processing information received
from other neurons.
Dendrites: These are branch-like extensions that come off the cell body.
Dendrites receive signals from other neurons and transmit them to the cell
body. They act like antennas, picking up information.
Axon: A long, thin structure that extends from the cell body. Carries
electrical signals away from the cell body to other neurons or to muscles
and glands. The axon can vary in length, and at its end, it branches out to
connect with other neurons.
For this example, imagine a dataset with three numeric input attributes (X1, X2, X3)
and one numeric output (Y).
Neural Network - Simple Example
Step 2: Initiation
Assume the initial weights for the four links are 1, 2, 3, and 4. Take an example model and a
training record with all inputs equal to 1 and the known output equal to 15. So X1 = X2 = X3 = 1 and
Y = 15.
Neural Network - Simple Example
Step 3: Calculating Error
This is a simple feed-forward process in which the input data passes through the nodes. The
predicted output Y according to the current model is
1 + (1 × 2) + (1 × 3) + (1 × 4) = 10
The difference between the actual output from the training record and the predicted output is the
model error:
15 - 10 = 5.
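A minimal sketch of this feed-forward pass and error calculation in Python, using the weights and training record from the example above:

```python
# Feed-forward pass for the simple example: a bias of 1 plus three weighted inputs.
weights = [1, 2, 3, 4]   # bias and the weights on the three input links
inputs = [1, 1, 1]       # X1, X2, X3 from the training record
actual_y = 15            # known output of the training record

# Predicted output: bias + sum of (weight * input) over the three links
predicted_y = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
error = actual_y - predicted_y

print(predicted_y)  # 10
print(error)        # 5
```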
Neural Network - Simple Example
Step 4: Weight Adjustment
● The neural network first makes a prediction, and then the error (difference between the predicted
output and the actual output) is calculated.
● The error is sent back through the network from the output layer to the input layer, adjusting the
weights of the connections between neurons to improve future predictions. This process is called
backpropagation.
● Each connection in the neural network has a weight that determines how much influence the input
has on the output.
The learning rate controls how quickly or slowly the weights are adjusted. A high learning rate (close to
1) means large changes in the weights, which could cause the network to overshoot the optimal solution.
A low learning rate (close to 0) makes the learning process slower and more gradual, but more stable.
Some models start with a high learning rate and gradually reduce it to fine-tune the network.
Neural Network - Simple Example
Step 4: Weight Adjustment
Example Calculation:
This means that after one iteration, the weight of this link is adjusted from 2 to 2.83.
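The exact formula and learning rate behind the 2 → 2.83 adjustment are not reproduced here, so the sketch below only illustrates a generic perceptron-style update rule, w_new = w_old + λ × error × input, with an assumed learning rate of 0.1; with these assumed values the result differs from 2.83.

```python
# Generic weight update sketch (assumed learning rate; the slide's exact
# formula and learning rate for the 2 -> 2.83 adjustment are not given here).
learning_rate = 0.1   # assumed value for illustration
error = 5             # model error from Step 3
x2 = 1                # input value on this link (X2)
old_weight = 2        # current weight of the link

new_weight = old_weight + learning_rate * error * x2
print(new_weight)     # 2.5 with these assumed values
```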
Neural Network - Simple Example
Step 4: Weight Adjustment
During training, the weights are adjusted repeatedly through many iterations (or epochs) of
backpropagation, using different records from the training data. The goal is to minimize the overall error
across all training examples.
The training process continues until the error is reduced to a certain threshold, or the model has learned
enough to make accurate predictions.
If the input contains nominal (categorical) data, it must be converted into numeric form, often using
techniques like one-hot encoding or introducing dummy variables (see the sketch below).
This increases the complexity of the model because more input links are required, thus increasing
the computing resources needed.
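A small sketch of one-hot encoding with pandas; the attribute name and values are hypothetical:

```python
import pandas as pd

# Hypothetical dataset with one nominal attribute.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "y":     [15, 12, 9, 11],
})

# One-hot encoding: each category becomes its own 0/1 column, so a nominal
# attribute with k categories adds k input links to the network.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```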
Support Vector Machines
● SVM was first formally introduced for optical character recognition at AT&T Bell Labs.
● Widely used in various fields like pattern recognition and text mining.
● Draws upon knowledge from three areas: computer science, statistics, and
mathematical optimization theory.
● For simple datasets, SVM can classify data points by finding the optimal separating
hyperplane, which maximizes the margin between two classes.
● SVM can handle more complex, nonlinear datasets by using kernel functions that
transform the data into higher dimensions where it becomes linearly separable.
● SVM relies on solving optimization problems to find the hyperplane with the largest
margin of separation between classes.
● SVM is highly effective in high-dimensional spaces, is memory efficient, and performs
well even when the number of features is greater than the number of samples.
● SVMs can be computationally intensive for large datasets, require tuning of the kernel
and regularization parameters, and may struggle with overlapping classes.
Support Vector Machines
● SVM is primarily used for classification, separating data into different classes.
● SVM works by fitting a boundary (also called a hyperplane) between regions of
data points that belong to different classes.
● Once the boundary is established using the training sample, it can classify
new data points by checking whether they lie inside or outside the boundary.
● After the boundary is established, most of the training data becomes
redundant, and only a core set of points is needed to define the boundary.
● These core points are called support vectors because they help define and
support the boundary between classes.
● Each data point is called a vector because it represents a row of data
containing values for different attributes (features).
● SVM aims to find a boundary that maximizes the margin (distance) between
the closest points of the two classes.
● SVM is efficient because, after training, only the support vectors are needed to
classify new data points, making most of the other training data unnecessary.
● SVM can handle both linear and nonlinear classification tasks, making it
flexible for different types of data.
● In the case of nonlinear data, SVM uses the kernel trick to transform the data
into a higher dimension where it becomes separable by a linear boundary.
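A minimal sketch of these ideas using scikit-learn's SVC; the two-cluster data and the C value are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: two well-separated clusters in 2-D (illustrative).
rng = np.random.default_rng(0)
class_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))
class_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(20, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 20 + [1] * 20)

# A linear kernel fits a maximum-margin hyperplane; C controls the penalty
# for points that fall inside the margin.
model = SVC(kernel="linear", C=1.0)
model.fit(X, y)

# Only the support vectors are needed to classify new points.
print(model.support_vectors_.shape)
print(model.predict([[1.5, 1.5], [-1.0, 0.0]]))
```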
Support Vector Machines
● A hyperplane is the boundary that separates data points into
different classes in an SVM.
● In two dimensions, the boundary is a line or curve; in three
dimensions, it can be a plane or surface. For higher dimensions,
it's called a hyperplane.
● For the same dataset, many possible hyperplanes can be
drawn, but the best one is chosen based on minimal
misclassification.
● The correct hyperplane is the one that maximizes the
geometric distance between classes, called the margin.
● The support vectors are the data points that lie closest to the
boundary, which directly influence the position of the hyperplane.
● When data is not linearly separable, points may fall within the
margin, and the best hyperplane minimizes these
misclassifications.
● A penalty (ξ) is applied for each data point within the margin,
and the best hyperplane is the one that minimizes the total penalty.
Support Vector Machines
● In cases like concentric rings, the data cannot be separated by a straight line (it is not linearly separable), but it may be separated
using more complex boundaries.
● Complex data can be made linearly separable by transforming it into a higher-dimensional feature space.
● Transforming the two variables x and y into a new feature z = √(x² + y²) (the distance from the origin, i.e., the equation of a circle) can turn
non-linearly separable data into linearly separable data, as sketched after this list.
● After transforming the feature space, a linear SVM can classify the data perfectly (e.g., 100% accuracy).
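A small sketch of this transformation: two synthetic concentric rings become linearly separable after adding the feature z = √(x² + y²). The data and settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

# Synthetic concentric-ring data: inner ring = one class, outer ring = the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# New feature z = sqrt(x^2 + y^2): the distance of each point from the origin.
z = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)

# In the transformed space a single threshold on z separates the classes,
# so a linear SVM classifies the data (near) perfectly.
model = LinearSVC().fit(z, y)
print(model.score(z, y))   # close to 1.0
```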
Support Vector Machines
● Kernel functions automate the process of transforming nonlinear
data into linear form by mapping it into higher-dimensional spaces.
● SVM packages include several nonlinear kernels like polynomial,
radial basis function (RBF), and sigmoid functions.
● The most commonly used kernels are polynomial and radial basis
function (RBF) kernels.
● Starting with a simple quadratic polynomial kernel is a good
strategy, and then trying more complex ones if needed for better
accuracy.
● SVMs, especially with complex kernels, can be computationally
expensive, requiring significant resources for large datasets.
● SVMs are flexible and can handle both linear and nonlinear
classification problems using different kernel functions.
● SVMs run an optimization algorithm to find the hyperplane that
maximizes the margin while minimizing the penalty for
misclassifications.
● While an intuitive understanding of SVMs is useful, the algorithm's
workings can also be understood formally through mathematical
optimization techniques.
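A minimal sketch of comparing kernels with scikit-learn on the same kind of ring data; the parameter values are illustrative, not tuned:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic concentric-ring data (illustrative).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Compare kernels; degree applies only to the polynomial kernel,
# gamma to the rbf, poly, and sigmoid kernels.
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel, degree=2, gamma="scale", C=1.0)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:8s} mean accuracy = {scores.mean():.2f}")
```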
Support Vector Machines
● SVMs can be computationally expensive, especially in
higher dimensions or when dealing with large numbers
of attributes
● Once the SVM model is built, small changes in the
training data do not require significant re-training or
adjustment, as long as the support vectors remain
unchanged.
● SVMs are resistant to overfitting because the model's
boundary is typically defined by only a few support
vectors, rather than all the training data points.
● SVMs are flexible and have been applied in a wide
range of tasks, including image processing, fraud
detection, and text mining.
● Once the SVM model is trained, the prediction process
is typically fast and efficient, with most of the
computational cost incurred during the training phase.
Ensemble Learners
Ensemble Learners - Board Room Example
You have a corporate board with three members who need to make a decision
(say, to approve or reject a project). Each member is making their decision based
on their understanding of the project.
● Individually, each board member makes the wrong decision 20% of the
time.
● The board wants to make the decision by majority rule. This means the
board makes a mistake (error) only if at least two of the three members
make a wrong decision at the same time.
● If all three board members make their decision unanimously every time
(either all say "yes" or all say "no"), the board’s error rate is simply 20%,
because the entire board is as accurate (or inaccurate) as one individual.
● Now, let's say each board member makes their decision independently—
meaning they don't influence each other.
● In this case, the board as a whole makes an error only when two or more
members make a wrong decision at the same time. This reduces the
board's overall error rate because the chance that all three make a mistake
at the same time is much smaller than just one person making a mistake.
● The math behind this is calculated using the binomial distribution, which
considers all the possible combinations of successes (right decisions) and
failures (wrong decisions) among the three members.
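A short check of the math using the binomial distribution: with three independent members who are each wrong 20% of the time, the board errs only when at least two are wrong at once.

```python
from math import comb

p = 0.2   # probability a single member decides wrongly
n = 3     # number of board members

# The board is wrong when 2 or 3 members are wrong simultaneously.
board_error = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in (2, 3))
print(board_error)   # 3*0.04*0.8 + 0.008 = 0.104, i.e. about 10.4%
```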
Ensemble Learners - Board Room Example
Ensemble Learners - Bagging
Bagging is an ensemble learning technique that aims to improve the accuracy and
robustness of machine learning models by training multiple models on different subsets of
the training data and then combining their predictions.
How it Works?
Step 1: Randomly create multiple subsets of the original training data by sampling with
replacement (this is called bootstrapping). This means some data points can appear
multiple times in a subset, while others might not appear at all.
Step 2: Train a separate model (often a decision tree) on each of these subsets.
Step 3: For classification problems, the final prediction is made by majority voting across
all models. For regression problems, the prediction is the average of all models'
predictions.
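A minimal sketch of bagging with scikit-learn's BaggingClassifier; the synthetic dataset and settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (illustrative).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The base model defaults to a decision tree; each of the 100 trees is trained
# on a bootstrap sample, and the final class is decided by majority vote.
bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))
```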
Ensemble Learners - Bagging
Example:
If you're predicting whether a person likes a product (yes/no), and 100 models were
trained using different data subsets, you take the majority vote of the models' predictions
to decide the final outcome.
Ensemble Learners - Boosting
Boosting is another ensemble technique that sequentially trains models, with each new
model focusing on the errors made by the previous models.
How it Works?
Step 1: An initial model is trained on the full training dataset.
Step 2: The model's errors (misclassifications or inaccurate predictions) are identified, and
the data points that were misclassified are given more weight so that the next model
focuses on these hard-to-classify examples.
Step 3: A new model is trained to correct the mistakes of the previous one, and this
process continues, each time adding a new model that focuses on the difficult cases.
Step 4: The final prediction is made by combining all the models’ predictions (in weighted
form).
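A minimal sketch of boosting with scikit-learn's AdaBoostClassifier; the synthetic data and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (illustrative).
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Models are added sequentially; each one upweights the training points the
# previous models misclassified, and the final prediction is a weighted vote.
boost = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=1)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```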
Ensemble Learners - Boosting
● Boosting improves the performance by combining weak models, each correcting the
mistakes of the previous one.
● It builds models sequentially, not independently, so errors from the earlier models are
addressed in later ones.
Example:
● If the first model misclassified people who are older as not liking the product, the next
model will focus more on those older people, trying to get that prediction right. This
continues until the overall model is accurate.
● AdaBoost and Gradient Boosting are popular boosting algorithms.
Ensemble Learners - Random Forest
Random Forest:
Random Forest is a specific type of bagging technique that uses decision trees as the base model and adds an
additional layer of randomness to make the model even more robust.
How it works?
● Step 1: Just like in bagging, Random Forest creates multiple subsets of the data (bootstrapped samples).
● Step 2: For each subset, a decision tree is trained. However, unlike traditional bagging, when training each tree,
only a random subset of features is considered for splitting at each node of the tree. This adds randomness to
the model.
● Step 3: The final prediction for classification is made by majority vote across all decision trees, and for
regression, it's the average of the predictions.
● Reduces variance (due to the averaging of many decision trees) and adds more randomness to reduce the
likelihood of overfitting.
● By using different subsets of features at each tree split, Random Forest ensures that the trees are not too
correlated, making the ensemble more powerful.
Example:
If you have a dataset with 10 features (e.g., age, income, education, etc.), Random Forest will randomly select a subset
of those features (say 5) at each split while growing the decision trees, making each tree more diverse.
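A minimal sketch of a Random Forest in scikit-learn; max_features controls how many features are considered at each split, and the synthetic data and values shown are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 10 features, matching the example above (illustrative).
X, y = make_classification(n_samples=500, n_features=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# 100 bootstrapped trees; at each split only a random subset of 5 of the
# 10 features is considered, which decorrelates the individual trees.
forest = RandomForestClassifier(n_estimators=100, max_features=5, random_state=2)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```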