
UNIT-V

Multilayer Perceptron Networks:

A multilayer perceptron (MLP) is a type of artificial neural network that consists of multiple layers of interconnected perceptron units. It is one of the most basic and widely used neural network architectures.

In an MLP, the perceptron units are organized into layers, typically including an
input layer, one or more hidden layers, and an output layer. Each layer is
composed of multiple perceptron units, also called neurons. Neurons in one
layer are connected to neurons in the next layer, forming a directed graph-like
structure.

The input layer receives the input data, which can be in the form of feature
vectors or raw data. Each input neuron represents a feature, and the values of
these neurons are passed to the next layer. The hidden layers perform
computations on the input data by applying an activation function to the
weighted sum of the inputs. The output layer produces the final result or
prediction based on the computations performed in the hidden layers.

MLPs are known as feedforward neural networks because the information flows
only in one direction, from the input layer through the hidden layers to the
output layer. The weights and biases associated with the connections between
neurons are adjusted during the training process using algorithms such as
backpropagation, which involves calculating the gradients of the error with
respect to the network's parameters and updating them accordingly to minimize
the error.
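As an illustration of this training setup, the short sketch below fits a small MLP using scikit-learn. The dataset, layer size, activation, and solver are arbitrary choices made for the example, not values prescribed here.

# Minimal MLP example with scikit-learn (illustrative settings only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data with 20 input features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer with 32 neurons; the weights are fitted by
# backpropagation-based gradient descent (the "adam" variant here).
mlp = MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                    solver="adam", max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))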

One key advantage of MLPs is their ability to approximate complex nonlinear functions, making them suitable for a wide range of tasks, including classification, regression, and pattern recognition. However, they can be prone to overfitting, especially when the network has a large number of parameters relative to the available training data. Regularization techniques, such as weight decay or dropout, are often used to mitigate overfitting in MLPs.

MLPs have been widely used in various domains, including image and speech recognition, natural language processing, and financial modeling. While they have been successful in many applications, more advanced architectures, such as convolutional neural networks (CNNs) for image processing and recurrent neural networks (RNNs) for sequence modeling, have been developed to address specific challenges in those domains.

Error back propagation algorithm

The error backpropagation algorithm, often referred to simply as backpropagation, is a widely used algorithm for training neural networks, including multilayer perceptron (MLP) networks. It is an iterative optimization method that adjusts the weights and biases of the network based on the gradient of an error function with respect to these parameters.

Here is a step-by-step overview of the error backpropagation algorithm:

1. Initialization: Initialize the weights and biases of the network randomly or using some predetermined values.
2. Forward Propagation: Pass an input sample through the network,
calculating the activations of each neuron in each layer. Start with the input
layer and propagate forward through the hidden layers to the output layer. The
activations are computed by applying an activation function to the weighted
sum of the inputs.
3. Error Calculation: Compare the output of the network with the desired
output (target) for the given input sample. Calculate the error between the
network's output and the target using an appropriate error function, such as
mean squared error (MSE) or cross-entropy loss.
4. Backward Propagation: Starting from the output layer, propagate the error backward through the network. Calculate the gradient of the error with respect to the weights and biases of each neuron by applying the chain rule of calculus. The gradient points in the direction of steepest ascent in the error landscape, so moving against it reduces the error.
5. Weight Update: Adjust the weights and biases of each neuron using the
calculated gradients. The most common update rule is the gradient descent
algorithm, which updates the weights and biases in the opposite direction of the
gradient to minimize the error. The learning rate determines the step size of the
updates.
6. Repeat: Repeat steps 2-5 for each input sample in the training dataset,
iteratively updating the weights and biases based on the gradients of the errors.
This process is known as an epoch. Multiple epochs may be performed until the
network converges or a predefined stopping criterion is met.
7. Evaluation: After training, evaluate the performance of the network on
unseen data by passing it through the trained network and measuring the error
or accuracy.
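The following numpy sketch makes steps 1-6 concrete for a tiny network with one hidden layer. The XOR data, sigmoid activations, learning rate, and number of epochs are illustrative assumptions, not part of the algorithm itself.

# Backpropagation sketch for a one-hidden-layer network on toy XOR data.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # Step 1: initialization
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5                                        # learning rate (step size)

for epoch in range(5000):
    h = sigmoid(X @ W1 + b1)                    # Step 2: forward propagation
    out = sigmoid(h @ W2 + b2)
    err = out - y                               # Step 3: error (squared-error loss)
    d_out = err * out * (1 - out)               # Step 4: backward propagation
    d_h = (d_out @ W2.T) * h * (1 - h)          #         via the chain rule
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)   # Step 5: weight update
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

print(out.round(2))   # predictions should approach [0, 1, 1, 0]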

It's important to note that backpropagation assumes differentiable activation
functions and requires the use of optimization techniques to overcome issues
such as local minima and overfitting. Regularization techniques like weight
decay or dropout can be employed to mitigate overfitting during the training
process.

Backpropagation has been a key algorithm in training neural networks and has
played a significant role in the success of deep learning.

Radial Basis Function Networks

Radial Basis Function (RBF) networks are a type of neural network that use
radial basis functions as activation functions. They are known for their ability to
approximate complex functions and are particularly useful in applications such
as function approximation, classification, and pattern recognition.

Here's an overview of how RBF networks work:

1. Architecture: An RBF network typically consists of three layers: an input layer, a hidden layer, and an output layer. Unlike MLPs, which may have several hidden layers, RBF networks use a single hidden layer.
2. Centers: The hidden layer of an RBF network contains a set of radial
basis functions, also known as RBF neurons. Each RBF neuron is associated
with a center, which represents a point in the input space. The centers can be
determined using clustering algorithms or other techniques.
3. Activation: The activation of an RBF neuron is computed based on the distance between the input sample and the center of the neuron. The most commonly used radial basis function is the Gaussian, which computes the activation as φ(x) = exp(−‖x − c‖² / (2σ²)), i.e., the exponential of the negative squared distance between the input x and the center c, scaled by a width parameter σ called the spread. Other radial basis functions, such as the Multiquadric or Inverse Multiquadric, can also be used.
4. Weights: Each RBF neuron in the hidden layer is associated with a
weight that determines its contribution to the output of the network. These
weights are typically learned through a process called "linear regression" or
"least squares estimation," where the outputs of the hidden layer neurons are
used to approximate the desired output.
5. Output: The output layer of the RBF network performs a linear
combination of the activations of the hidden layer neurons, weighted by the
learned weights. The output can be a continuous value for regression tasks or a
binary/multi-class probability distribution for classification tasks.

6. Training: The training of an RBF network involves two main steps. First,
the centers of the RBF neurons are determined, often using clustering
algorithms like k-means. Then, the weights associated with the hidden layer
neurons are learned using techniques like least squares estimation or gradient
descent. The spread parameter of the radial basis functions can also be
optimized during training to improve the network's performance.
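To make the two training stages concrete, here is a small numpy/scikit-learn sketch. The ten Gaussian centers, the fixed spread of 0.5, and the toy sine-curve regression problem are assumptions chosen only for illustration.

# RBF-network sketch: k-means centers + least-squares output weights.
import numpy as np
from sklearn.cluster import KMeans

def rbf_activations(X, centers, spread):
    # Gaussian activations: exp(-||x - c||^2 / (2 * spread^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * spread ** 2))

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)      # noisy sine curve

# Stage 1: place the centers with k-means clustering.
centers = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_
spread = 0.5

# Stage 2: solve for the output weights by linear least squares.
Phi = rbf_activations(X, centers, spread)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = Phi @ w
print("Training MSE:", np.mean((y_hat - y) ** 2))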

RBF networks have several advantages. They can approximate complex nonlinear functions with fewer neurons compared to MLP networks, which can lead to faster training and better generalization. RBF networks also have a solid mathematical foundation and provide a clear interpretation of the hidden layer as feature detectors.

However, RBF networks may suffer from issues such as overfitting and the
choice of the number and positions of the centers. Regularization techniques
and careful selection of the centers can help mitigate these challenges.

Overall, RBF networks offer an alternative approach to neural network modeling, particularly suited for function approximation tasks and applications where interpretability and simplicity are desired.

Decision Tree Learning

Decision tree learning is a popular machine learning technique used for both
classification and regression tasks. It builds a predictive model in the form of a
tree structure, where internal nodes represent features or attributes, branches
represent decisions or rules, and leaf nodes represent the output or predicted
values.

Here's a step-by-step overview of the decision tree learning process:

1. Data Preparation: Prepare a labeled dataset consisting of input features and corresponding output labels. Each data point should have a set of features and the corresponding class or value to be predicted.
2. Tree Construction: The decision tree learning algorithm starts by
selecting the best feature from the available features to split the dataset. Various
criteria can be used to measure the "best" feature, such as Gini impurity or
information gain. The selected feature becomes the root node of the tree.
3. Splitting: Once a feature is chosen, the dataset is partitioned into subsets
based on the possible values of that feature. Each subset represents a branch or
path from the root node. The process of splitting continues recursively for each
subset until a stopping criterion is met.

4. Stopping Criterion: The decision tree algorithm stops splitting when one
of the predefined stopping criteria is satisfied. Common stopping criteria
include reaching a maximum depth, reaching a minimum number of samples in
a leaf node, or when further splitting does not improve the predictive
performance significantly.
5. Leaf Node Assignment: At each leaf node, the majority class or the
average value of the samples in that subset is assigned as the predicted value.
For regression tasks, this can be the mean or median value, while for
classification tasks, it can be the most frequent class.
6. Pruning (Optional): After the initial construction of the decision tree,
pruning can be applied to reduce overfitting. Pruning involves removing or
collapsing nodes that do not contribute significantly to improving the predictive
performance on unseen data.
7. Prediction: Once the decision tree is constructed, it can be used to make
predictions on new, unseen data. Starting from the root node, the features of the
input data are compared with the decision rules at each node, and the prediction
is made by following the appropriate path down the tree until a leaf node is
reached.
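For illustration, the sketch below builds and inspects a small decision tree with scikit-learn; the iris dataset and the depth and leaf-size settings are arbitrary choices for the example.

# Decision-tree classification example with scikit-learn (illustrative settings).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    random_state=0)

# max_depth and min_samples_leaf act as stopping (pre-pruning) criteria.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=iris.feature_names))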

Decision trees have several advantages, including their interpretability, as the resulting tree structure can be easily visualized and understood. They can handle both categorical and numerical features, handle missing values, and are relatively fast to train and make predictions. Decision trees can also capture non-linear relationships between features and the output.

However, decision trees are prone to overfitting, especially when the tree
becomes too complex or the dataset has noisy or irrelevant features. Techniques
like pruning, setting proper stopping criteria, or using ensemble methods like
random forests can help mitigate overfitting.

In summary, decision tree learning is a versatile and widely used machine learning technique that provides an interpretable and efficient method for classification and regression tasks.

Measures of impurity for evaluating splits in decision trees:

In decision tree algorithms, impurity measures are used to evaluate the quality
of a split at each node. The impurity measure helps determine which feature to
use for splitting and where to place the resulting branches. Here are some
commonly used impurity measures for evaluating splits in decision trees:

1. Gini impurity: The Gini impurity is a measure of how often a randomly
chosen element from the set would be incorrectly labeled if it were randomly
labeled according to the distribution of labels in the subset. It is computed as the
sum of the probabilities of each class being chosen times the probability of a
misclassification for that class. The Gini impurity is given by the formula:
Gini impurity = 1 - Σ (p(i)²)
where p(i) represents the probability of an item belonging to class i.
2. Entropy: Entropy is a measure of impurity based on information theory. It
calculates the average amount of information required to identify the class of a
randomly chosen element from the set. The entropy impurity is given by the
formula:
Entropy = - Σ (p(i) * log₂(p(i)))
where p(i) represents the probability of an item belonging to class i.
3. Misclassification error: This impurity measure calculates the error rate of
misclassifying an item to the most frequent class in a subset. It is given by the
formula:
Misclassification error = 1 - max(p(i))
where p(i) represents the probability of an item belonging to class i.
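The three measures above can be computed directly from a list of class labels; the short sketch below does exactly that for a hypothetical six-sample subset.

# Gini impurity, entropy, and misclassification error for a set of labels.
import numpy as np

def class_probabilities(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_impurity(labels):
    p = class_probabilities(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = class_probabilities(labels)
    return -np.sum(p * np.log2(p))

def misclassification_error(labels):
    return 1.0 - class_probabilities(labels).max()

labels = ["A", "A", "A", "B", "B", "C"]   # hypothetical subset
print(gini_impurity(labels))              # 1 - (1/2)^2 - (1/3)^2 - (1/6)^2 ≈ 0.611
print(entropy(labels))                    # ≈ 1.459 bits
print(misclassification_error(labels))    # 1 - 1/2 = 0.5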

These impurity measures are used in decision tree algorithms to evaluate potential splits and choose the split that minimizes impurity or maximizes information gain. The split that yields the highest information gain, or equivalently the lowest impurity after splitting, is chosen as the best splitting criterion.

ID3:

ID3 (Iterative Dichotomiser 3) is a classic algorithm for constructing decision trees. It was developed by Ross Quinlan in 1986 and is based on the concept of information gain.

The ID3 algorithm follows a top-down, greedy approach to construct a decision tree. It recursively selects the best attribute (feature) to split the data based on the information gain measure. Information gain is a measure of the reduction in entropy or impurity achieved by splitting the data on a particular attribute.

Here is a step-by-step overview of the ID3 algorithm:

1. Start with the entire training dataset and calculate the entropy (or
impurity) of the target variable.
2. For each attribute, calculate the information gain by splitting the data
based on that attribute. Information gain is calculated as the difference between
the entropy of the target variable before and after the split.

3. Select the attribute with the highest information gain as the splitting
criterion.
4. Create a decision tree node using the selected attribute.
5. Split the data into subsets based on the possible values of the selected
attribute.
6. Recursively apply the above steps to each subset by considering only the
remaining attributes (excluding the selected attribute).
7. If all instances in a subset belong to the same class, create a leaf node
with the corresponding class label.
8. Repeat steps 2-7 until all attributes are used or a stopping condition (e.g.,
reaching a maximum depth or minimum number of instances per leaf) is met.
9. The resulting tree represents the learned model, which can be used for
classification of new instances.
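The core quantity in steps 1-3 is information gain. The sketch below computes it for a single categorical attribute; the "outlook"/"play" values are hypothetical and only serve to show the calculation.

# Information gain of one categorical attribute, as used by ID3 for splitting.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute_values, labels):
    attribute_values = np.asarray(attribute_values)
    labels = np.asarray(labels)
    before = entropy(labels)
    after = 0.0
    for value in np.unique(attribute_values):
        subset = labels[attribute_values == value]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after   # entropy before the split minus weighted entropy after

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]   # hypothetical
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, play))   # the attribute with the highest gain is chosen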

It's worth noting that the ID3 algorithm has some limitations, such as its
tendency to overfit on training data and its inability to handle missing values.
Various extensions and improvements, such as C4.5 and CART, have been
developed to address these limitations and build upon the concepts introduced
by ID3.

C4.5:

C4.5 is an extension of the ID3 algorithm for constructing decision trees, developed by Ross Quinlan and introduced in 1993. It addresses some limitations of ID3, including its inability to handle continuous attributes and missing values.

C4.5 retains the top-down, greedy approach of ID3 but incorporates several
enhancements. Here are the key features and improvements of C4.5:

1. Handling Continuous Attributes: Unlike ID3, which can only handle categorical attributes, C4.5 can handle continuous attributes. It does this by sorting the values of a continuous attribute, evaluating candidate thresholds, and selecting the best split point based on information gain or gain ratio.
2. Handling Missing Values: C4.5 can handle missing attribute values by
estimating the most probable value based on the available data. Instances with
missing values are appropriately weighted during the calculation of information
gain or gain ratio.
3. Gain Ratio: Instead of using information gain as the sole criterion for attribute selection, C4.5 introduces the concept of gain ratio. Gain ratio takes into account the intrinsic information of an attribute and aims to overcome the bias towards attributes with a large number of distinct values. It helps prevent the algorithm from favoring attributes with many outcomes.
4. Pruning: C4.5 includes a pruning step to address overfitting. After the decision tree is constructed, it evaluates the effect of pruning subtrees using pessimistic estimates of their error rates. If replacing a subtree with a leaf node does not significantly increase the estimated error, the subtree is pruned.
5. Handling Nominal and Numeric Attributes: While ID3 is designed for purely categorical data, C4.5 can handle both nominal and numeric attributes; the class label itself remains categorical.

C4.5 has become widely adopted due to its improved handling of various data
types and ability to handle missing values. It has had a significant impact on
decision tree learning and has paved the way for further enhancements, such as
the C5.0 algorithm.
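Gain ratio divides the information gain of an attribute by its "split information", the entropy of the attribute's own value distribution. The self-contained sketch below shows the calculation on the same hypothetical "outlook"/"play" data used in the ID3 example.

# Gain ratio = information gain / split information (C4.5's attribute criterion).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute_values, labels):
    attribute_values, labels = np.asarray(attribute_values), np.asarray(labels)
    after = sum(len(labels[attribute_values == v]) / len(labels)
                * entropy(labels[attribute_values == v])
                for v in np.unique(attribute_values))
    return entropy(labels) - after

def split_information(attribute_values):
    _, counts = np.unique(attribute_values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(attribute_values, labels):
    si = split_information(attribute_values)
    return 0.0 if si == 0.0 else information_gain(attribute_values, labels) / si

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]   # hypothetical
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(gain_ratio(outlook, play))   # many-valued attributes are penalized by split info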

CART decision trees:

CART (Classification and Regression Trees) is a decision tree algorithm introduced by Breiman, Friedman, Olshen, and Stone in 1984. As its name suggests, it can build trees for both classification and regression tasks, and it forms the basis of many modern implementations, including scikit-learn's decision trees.

Here are the key characteristics of CART:

1. Binary Splits: CART always produces binary trees. At each internal node, the data is split into exactly two branches, even for categorical attributes with many values.
2. Splitting Criteria: For classification trees, CART typically uses the Gini impurity to select the best split; for regression trees, it uses the reduction in squared error (variance) within the resulting subsets.
3. Handling Numerical and Categorical Attributes: CART can split on both numerical attributes (using threshold-based tests such as x ≤ t) and categorical attributes (by grouping categories into two subsets).
4. Regression Trees: For regression, each leaf node predicts a constant value, usually the mean of the target values of the training samples that reach that leaf.
5. Cost-Complexity Pruning: CART first grows a large tree and then applies cost-complexity pruning, which trades off tree size against training error. The pruning level is typically selected using cross-validation or a separate validation set.
6. Surrogate Splits: CART can handle missing attribute values by using surrogate splits, alternative splitting rules that closely mimic the primary split.

Compared with ID3 and C4.5, CART's main distinguishing features are its strictly binary splits, its native support for regression, and its principled cost-complexity pruning procedure.
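scikit-learn's tree module is documented as an optimized implementation of CART, so it can serve as a quick illustration; the diabetes dataset and depth limit below are arbitrary choices for the example (the "squared_error" criterion name assumes a reasonably recent scikit-learn version).

# CART-style regression tree using scikit-learn (illustrative settings).
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "squared_error" is the CART regression criterion; max_depth limits tree growth.
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=4, random_state=0)
reg.fit(X_train, y_train)
print("Test R^2:", reg.score(X_test, y_test))
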
Pruning the tree:

Pruning is a technique used to prevent decision trees from overfitting, where the
model becomes too complex and overly specialized to the training data. Pruning
involves removing or collapsing nodes in the decision tree to simplify it, leading
to improved generalization and better performance on unseen data. Here are two
common approaches to pruning decision trees:

1. Pre-Pruning: Pre-pruning is performed during the construction of the decision tree. It involves setting conditions to stop further splitting of nodes based on certain criteria. Some common pre-pruning strategies include:
- Maximum Depth: Limiting the maximum depth of the tree by specifying a threshold. Once the tree reaches the maximum depth, no further splits are allowed.
- Minimum Number of Instances: Specifying a minimum number of instances required at a node to allow further splitting. If the number of instances falls below the threshold, the node becomes a leaf node without further splits.
- Minimum Impurity Decrease: Requiring a minimum decrease in impurity (e.g., information gain or Gini impurity) for a split to occur. If the impurity decrease is below the threshold, the split is not performed.
By applying pre-pruning, the decision tree is restricted in its growth, preventing it from capturing noise or irrelevant patterns in the training data.
2. Post-Pruning: Post-pruning, also known as backward pruning or error-
based pruning, is performed after the decision tree has been constructed. It
involves iteratively removing or collapsing nodes based on their estimated error
rate or other evaluation measures. The basic idea is to evaluate the impact of
removing a subtree and determine if it improves the overall accuracy or
performance of the tree on a validation dataset.

Both pre-pruning and post-pruning techniques aim to strike a balance between model complexity and generalization performance, resulting in a more robust decision tree that performs well on unseen data. The specific pruning strategy to use depends on the dataset, algorithm, and available validation or test data for evaluation.
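As one concrete example of post-pruning, the sketch below uses scikit-learn's cost-complexity pruning (the pruning style associated with CART). The breast-cancer dataset and the simple hold-out validation loop are illustrative assumptions.

# Post-pruning via cost-complexity pruning in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Compute the sequence of effective alphas along the pruning path...
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# ...then keep the alpha whose pruned tree does best on held-out data.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = pruned.fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print("Best ccp_alpha:", best_alpha, "validation accuracy:", best_score)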

Strengths and weaknesses of the decision tree approach

The decision tree approach has several strengths and weaknesses that should be
considered when applying this algorithm to a given problem. Let's explore
them:

Strengths of the decision tree approach:

1. Interpretability: Decision trees are highly interpretable models, as they can be visualized and easily understood by humans. The tree structure with nodes and branches represents intuitive decision rules, making it easier to explain the reasoning behind predictions or classifications.
2. Feature Importance: Decision trees provide a measure of feature
importance or attribute relevance. By examining the tree structure, you can
identify the most significant features that contribute to the decision-making
process. This can be valuable for feature selection and gaining insights into the
problem domain.
3. Nonlinear Relationships: Decision trees can handle nonlinear
relationships between features and the target variable. They are capable of
capturing complex interactions and patterns in the data without requiring
explicit transformations or assumptions about the data distribution.
4. Handling Missing Values and Outliers: Decision trees can handle missing
values and outliers in the dataset. They do not rely on imputation methods or
require data preprocessing techniques to handle missing values. Additionally,
the tree structure is robust to outliers, as the splitting process can accommodate
extreme values.
5. Easy Handling of Categorical and Numerical Data: Decision trees can
handle both categorical and numerical features without the need for extensive
data preprocessing. They automatically select appropriate splitting strategies for
different data types, making them versatile for various types of datasets.

Weaknesses of the decision tree approach:

1. Overfitting: Decision trees are prone to overfitting, especially when the tree becomes too deep and complex. They may capture noise or specific instances in the training data, leading to poor generalization and reduced performance on unseen data. Proper pruning techniques and regularization methods are necessary to mitigate overfitting.
2. Instability: Decision trees are sensitive to small changes in the training
data. A slight variation in the dataset may result in a different tree structure or
different decisions at the nodes. This instability can make decision trees less
reliable compared to other models that are more robust to data fluctuations.

3. Bias towards Features with High Cardinality: Decision trees tend to favor
features with high cardinality (a large number of distinct values) during the
splitting process. This can lead to an uneven representation of features in the
resulting tree and potentially overlook important features with lower cardinality.
4. Difficulty in Capturing Linear Relationships: Decision trees are not well-
suited for capturing linear relationships between features and the target variable.
They tend to model relationships using a series of threshold-based splits, which
may not effectively represent linear patterns.
5. Limited Expressiveness: Decision trees have a limited expressive power
compared to more complex models like neural networks or ensemble methods.
They may struggle with capturing intricate relationships and fine-grained
patterns in the data, particularly in high-dimensional datasets.

Understanding the strengths and weaknesses of the decision tree approach is essential for selecting appropriate algorithms and employing strategies to address its limitations, such as pruning, ensemble methods, or combining decision trees with other techniques.
