
UNIT-IV: Classification

BASIC CONCEPTS- Classification, Prediction, Issues regarding Classification and Prediction,


Classification by Decision Tree Induction, Bayesian Classification-Baye’s Theorem, Naive
Bayesian Classification, Rule-Based Classification-Using IF-THEN Rules for Classification,
Rule Extraction from a Decision Tree, Classification by Back propagation.

Classification in Data Mining


Classification is a supervised learning technique in data mining where data is categorized into
predefined classes based on attributes. The goal is to build a model that can correctly assign new
data to one of the existing categories.
Key Issues:
• Feature Selection: Selecting the right features (attributes) is crucial for building an
effective classification model. Irrelevant or redundant features can degrade the model's
performance.
• Class Imbalance: Many real-world datasets have an imbalanced distribution of classes
(e.g., fraud detection, where fraudulent cases are much rarer than non-fraudulent ones).
This can lead to biased models that favor the majority class.
• Overfitting: Overfitting occurs when a model is too complex and fits the training data too
closely, capturing noise instead of the underlying pattern. This leads to poor
generalization on new data.
• Underfitting: On the other hand, underfitting occurs when a model is too simple and fails
to capture the underlying pattern of the data, resulting in poor performance on both
training and new data.
• Model Interpretability: In some cases, it is not enough for a model to be accurate; it also
needs to be interpretable. Highly accurate models like deep neural networks can be
challenging to interpret, leading to issues in understanding how decisions are made.

• Scalability: With large datasets, classification algorithms need to be scalable to handle
vast amounts of data efficiently. Some algorithms may struggle with performance or
memory issues as data grows.
Prediction in Data Mining
Prediction in data mining involves forecasting future outcomes based on
historical data. This can be seen in regression analysis, time-series forecasting, and other
predictive modeling techniques.
Key Issues:
• Data Quality: The quality of the input data (e.g., missing values, noise, and outliers) can
significantly affect the performance of predictive models. Poor-quality data leads to
inaccurate predictions.
• Generalization: A key challenge is building models that generalize well to unseen data, as
opposed to performing well only on the training dataset. Ensuring generalization requires
careful validation and testing procedures.
• Dynamic Data: In some applications (e.g., stock market prediction, weather forecasting),
the underlying data distribution may change over time. Models need to be continuously
updated or retrained to adapt to these changes.
• Evaluation Metrics: Choosing appropriate evaluation metrics is essential for prediction
tasks. Different metrics (e.g., accuracy, precision, recall, F1-score) may provide different
perspectives on the performance of a model, and the choice depends on the specific
application and problem.
• Bias and Variance Trade-off: Balancing bias (error due to simplifying assumptions in the
model) and variance (error due to sensitivity to small fluctuations in the training set) is a
key challenge in predictive modeling. High bias can lead to underfitting, while high
variance can lead to overfitting.
• Interpretability vs. Accuracy: Similar to classification, there is often a trade-off between
the accuracy of a predictive model and its interpretability. More complex models may
yield better predictions but be harder to understand.

BASIC CONCEPTS:
Classification, which is the task of assigning objects to one of several predefined categories, is a
pervasive problem that encompasses many diverse applications. Examples include detecting
spam email messages based upon the message header and content.

Classification is the task of learning a target function f that maps each attribute set x to one of the
predefined class labels y. The target function is also known informally as a classification model.

A classification model is useful for the following purposes.

Aim: to predict categorical class labels for new tuples/samples


Input: a training set of tuples/samples, each with a class label
Output: a model (a classifier) based on the training set and the class labels

Examples of Classification tasks:

• Descriptive Modeling: A classification model can serve as an explanatory tool to
distinguish between objects of different classes. For example, it would be useful, for both
biologists and others, to have a descriptive model that summarizes the data shown in
Table 4.1 and explains what features define a vertebrate as a mammal, reptile, bird, fish,
or amphibian.
• Predictive Modeling: A classification model can also be used to predict the class label of
unknown records.
Classification techniques are most suited for predicting or describing data sets with binary or
nominal categories. They are less effective for ordinal categories.

General Approach to Solving a Classification Problem:


A classification technique (or classifier) is a systematic approach to building classification models
from an input data set. Examples include decision tree classifiers, rule-based classifiers, neural
networks, support vector machines, and naive Bayes classifiers.

Each technique employs a learning algorithm to identify a model that best fits the relationship
between the attribute set and class label of the input data. The model generated by a learning
algorithm should both fit the input data well and correctly predict the class labels of records it
has never seen before.

Therefore, a key objective of the learning algorithm is to build models with good generalization
capability; i.e., models that accurately predict the class labels of previously unknown records.

Decision Tree Induction
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
• A decision tree is a flowchart-like tree structure, where each internal node (nonleaf
node) denotes a test on an attribute, each branch represents an outcome of the test, and
each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.
• A decision tree performs classification in the form of a tree structure. It breaks the
dataset down into smaller and smaller subsets while the corresponding tree is built
incrementally.
• The final result is a tree with decision nodes and leaf nodes.
• Decision tree induction is the learning of decision trees from class-labeled training
tuples.
• A tree has three types of nodes:
• A root node, which has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing
edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing
edges.

Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition,
D.
Input:
• Data partition, D, which is a set of training tuples and their associated class labels;
• attribute list, the set of candidate attributes;
• Attribute selection method, a procedure to determine the splitting criterion that “best” partitions
the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly,
either a split-point or splitting subset.
Output: A decision tree.

• Method:
• (1) create a node N;
• (2) if the tuples in D are all of the same class, C, then
• (3) return N as a leaf node labeled with the class C;
• (4) if attribute list is empty then
• (5) return N as a leaf node labeled with the majority class in D; // majority voting
• (6) apply Attribute selection method(D, attribute list) to find the "best" splitting criterion;
• (7) label node N with the splitting criterion;
• (8) if the splitting attribute is discrete-valued and multi-way splits are allowed then // not restricted to binary trees
• (9) attribute list ← attribute list − splitting attribute; // remove the splitting attribute
• (10) for each outcome j of the splitting criterion
• // partition the tuples and grow sub-trees for each partition
• (11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
• (12) if Dj is empty then
• (13) attach a leaf labeled with the majority class in D to node N;
• (14) else attach the node returned by Generate decision tree(Dj, attribute list) to node N;
• endfor
• (15) return N;
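The recursion above can be summarized in a short Python sketch. This is an illustrative outline only, not the textbook's exact procedure; the helper names attribute_selection_method, majority_class and partition_by_outcome are hypothetical stand-ins for the components named in the algorithm, and D is assumed to be a list of (attribute-dict, class-label) pairs.

# Illustrative sketch of the Generate_decision_tree recursion described above.
from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    classes = {label for _, label in D}
    if len(classes) == 1:                                   # (2)-(3) all tuples in one class
        return {"leaf": classes.pop()}
    if not attribute_list:                                  # (4)-(5) no attributes left
        return {"leaf": majority_class(D)}                  # majority voting
    split_attr = attribute_selection_method(D, attribute_list)   # (6)-(7) best split
    node = {"split_on": split_attr, "branches": {}}         # (1) node N, labeled
    remaining = [a for a in attribute_list if a != split_attr]   # (8)-(9)
    for outcome, Dj in partition_by_outcome(D, split_attr).items():   # (10)-(11)
        if not Dj:                                          # (12)-(13) empty partition
            node["branches"][outcome] = {"leaf": majority_class(D)}
        else:                                               # (14) recurse on the partition
            node["branches"][outcome] = generate_decision_tree(
                Dj, remaining, attribute_selection_method)
    return node                                             # (15)

def majority_class(D):
    # Most frequent class label among the (attributes, label) pairs in D.
    return Counter(label for _, label in D).most_common(1)[0][0]

def partition_by_outcome(D, attr):
    # Group tuples by the value (outcome) of the splitting attribute.
    parts = {}
    for x, label in D:
        parts.setdefault(x[attr], []).append((x, label))
    return parts

The structure mirrors steps (1) to (15): the two base cases return leaf nodes, and each outcome of the chosen split produces a recursive call on its own partition.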

Attribute Selection measures:


An attribute selection measure is a heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples into individual classes.

Attribute selection measures are also known as splitting rules because they determine how the
tuples at a given node are to be split.

ID3 uses information gain as its attribute selection measure

ID3 (Iterative Dichotomiser 3)
It is a classification algorithm that follows a greedy approach: at each step it selects the attribute
that yields the maximum Information Gain (IG), equivalently the minimum weighted entropy (H) of
the resulting partitions.
What are the steps in the ID3 algorithm?
The steps in the ID3 algorithm are as follows:
1. Calculate the entropy of the whole dataset.
2. For each attribute/feature:
2.1. Calculate the entropy for each of its categorical values.
2.2. Calculate the information gain for the feature.
3. Find the feature with maximum information gain and split on it.
4. Repeat until the desired tree is obtained.
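The two quantities used in these steps can be computed with a couple of short helper functions. This is a minimal Python sketch of my own (the names entropy and information_gain are not from the notes); it works from class counts, e.g. [9, 5] for 9 "yes" and 5 "no" tuples.

import math

def entropy(counts):
    # H = -sum(p_i * log2(p_i)) over the class counts of a partition.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, partitions):
    # Gain = H(parent) - weighted average entropy of the child partitions.
    total = sum(parent_counts)
    weighted = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - weighted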

• Complete entropy of the dataset:
H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
     = - (9/14) * log2(9/14) - (5/14) * log2(5/14)
     = 0.41 + 0.53 = 0.94

First Attribute - Outlook
Categorical values - sunny, overcast and rain
H(Outlook=sunny)    = -(2/5)*log2(2/5) - (3/5)*log2(3/5) = 0.971
H(Outlook=rain)     = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.971
H(Outlook=overcast) = -(4/4)*log2(4/4) - 0 = 0
Average Entropy Information for Outlook:
I(Outlook) = p(sunny)*H(Outlook=sunny) + p(rain)*H(Outlook=rain) + p(overcast)*H(Outlook=overcast)
           = (5/14)*0.971 + (5/14)*0.971 + (4/14)*0 = 0.693
Information Gain = H(S) - I(Outlook) = 0.94 - 0.693 = 0.247
Second Attribute - Temperature
Categorical values - hot, mild, cool
H(Temperature=hot)  = -(2/4)*log2(2/4) - (2/4)*log2(2/4) = 1
H(Temperature=cool) = -(3/4)*log2(3/4) - (1/4)*log2(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log2(4/6) - (2/6)*log2(2/6) = 0.9179
Average Entropy Information for Temperature:
I(Temperature) = p(hot)*H(Temperature=hot) + p(mild)*H(Temperature=mild) + p(cool)*H(Temperature=cool)
               = (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811 = 0.9108
Information Gain = H(S) - I(Temperature) = 0.94 - 0.9108 = 0.0292

Third Attribute - Humidity
Categorical values - high, normal
H(Humidity=high)   = -(3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.983
H(Humidity=normal) = -(6/7)*log2(6/7) - (1/7)*log2(1/7) = 0.591
Average Entropy Information for Humidity:
I(Humidity) = p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
            = (7/14)*0.983 + (7/14)*0.591 = 0.787
Information Gain = H(S) - I(Humidity) = 0.94 - 0.787 = 0.153
Fourth Attribute - Wind
Categorical values - weak, strong
H(Wind=weak)   = -(6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
H(Wind=strong) = -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1
Average Entropy Information for Wind:
I(Wind) = p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong)
        = (8/14)*0.811 + (6/14)*1 = 0.892
Information Gain = H(S) - I(Wind) = 0.94 - 0.892 = 0.048
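Assuming the helper functions sketched earlier are in scope, these figures can be reproduced directly from the class counts stated in the text (9 yes / 5 no overall, and the per-value splits listed for each attribute):

print(round(entropy([9, 5]), 3))                                      # 0.94  -> H(S)
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247 -> Gain(Outlook)
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))           # 0.048 -> Gain(Wind)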

Here, the attribute with maximum information gain is Outlook, so it becomes the root of the decision
tree built so far.

When Outlook = overcast, the partition is a pure class (Yes).

Now we repeat the same procedure for the rows with Outlook = Sunny, and then for the rows with
Outlook = Rain.

Finding the best attribute for splitting the data with Outlook = Sunny (dataset rows = [1, 2, 8, 9, 11]):

Complete entropy of the Sunny partition:
H(Sunny) = - p(yes)*log2(p(yes)) - p(no)*log2(p(no))
         = - (2/5)*log2(2/5) - (3/5)*log2(3/5) = 0.971

First Attribute - Temperature
Categorical values - hot, mild, cool
H(Sunny, Temperature=hot)  = -0 - (2/2)*log2(2/2) = 0
H(Sunny, Temperature=cool) = -(1)*log2(1) - 0 = 0
H(Sunny, Temperature=mild) = -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1
Average Entropy Information for Temperature:
I(Sunny, Temperature) = p(Sunny, hot)*H(Sunny, Temperature=hot) + p(Sunny, mild)*H(Sunny, Temperature=mild) + p(Sunny, cool)*H(Sunny, Temperature=cool)
                      = (2/5)*0 + (2/5)*1 + (1/5)*0 = 0.4
Information Gain = H(Sunny) - I(Sunny, Temperature) = 0.971 - 0.4 = 0.571

Second Attribute - Humidity
Categorical values - high, normal
H(Sunny, Humidity=high)   = -0 - (3/3)*log2(3/3) = 0
H(Sunny, Humidity=normal) = -(2/2)*log2(2/2) - 0 = 0
Average Entropy Information for Humidity:
I(Sunny, Humidity) = p(Sunny, high)*H(Sunny, Humidity=high) + p(Sunny, normal)*H(Sunny, Humidity=normal)
                   = (3/5)*0 + (2/5)*0 = 0
Information Gain = H(Sunny) - I(Sunny, Humidity) = 0.971 - 0 = 0.971

Third Attribute - Wind
Categorical values - weak, strong
H(Sunny, Wind=weak)   = -(1/3)*log2(1/3) - (2/3)*log2(2/3) = 0.918
H(Sunny, Wind=strong) = -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1
Average Entropy Information for Wind:
I(Sunny, Wind) = p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny, strong)*H(Sunny, Wind=strong)
               = (3/5)*0.918 + (2/5)*1 = 0.9508
Information Gain = H(Sunny) - I(Sunny, Wind) = 0.971 - 0.9508 = 0.0202

Here, the attribute with maximum information gain is Humidity, so it is chosen for the Outlook = Sunny branch.

Here, when Outlook = Sunny and Humidity = High, it is a pure class of category "no". And When
Outlook = Sunny and Humidity = Normal, it is again a pure class of category "yes". Therefore, we
don't need to do further calculations.

Now, finding the best attribute for splitting the data with Outlook = Rain (dataset rows = [4, 5, 6, 10, 14]):

Complete entropy of the Rain partition:
H(Rain) = - p(yes)*log2(p(yes)) - p(no)*log2(p(no))
        = - (3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.971

• First Attribute - Temperature
Categorical values - mild, cool
H(Rain, Temperature=cool) = -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1
H(Rain, Temperature=mild) = -(2/3)*log2(2/3) - (1/3)*log2(1/3) = 0.918
Average Entropy Information for Temperature:
I(Rain, Temperature) = p(Rain, mild)*H(Rain, Temperature=mild) + p(Rain, cool)*H(Rain, Temperature=cool)
                     = (3/5)*0.918 + (2/5)*1 = 0.9508
Information Gain = H(Rain) - I(Rain, Temperature) = 0.971 - 0.9508 = 0.0202

• Second Attribute - Wind
Categorical values - weak, strong
H(Rain, Wind=weak)   = -(3/3)*log2(3/3) - 0 = 0
H(Rain, Wind=strong) = -0 - (2/2)*log2(2/2) = 0
Average Entropy Information for Wind:
I(Rain, Wind) = p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain, Wind=strong)
              = (3/5)*0 + (2/5)*0 = 0
Information Gain = H(Rain) - I(Rain, Wind) = 0.971 - 0 = 0.971

• Here, the attribute with maximum information gain is Wind. So, the decision tree built so far -

Tree Pruning
Decision trees are built to classify a set of records; while building them, two problems can arise:
➢ Underfitting
➢ Overfitting
Underfitting
This problem arises when both training errors and test errors are large, which happens when the
model that has been developed is too simple.
Overfitting
This problem arises when training errors are small but test errors are large.
In that case the training error no longer provides a good estimate of how well the tree will perform
on previously unseen records, and other ways of estimating the error are needed. Tree pruning
addresses this problem.
The process of adjusting a decision tree to minimize the misclassification error is called pruning.
• Tree pruning is performed in order to remove anomalies in the training data due to noise
or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
• There are two approaches to prune a tree −

• Pre-pruning − The tree is pruned by halting its construction early.
Ex: ID3 algorithm
• Post-pruning - This approach removes a sub-tree from a fully grown tree.
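As an illustration outside the notes, scikit-learn's DecisionTreeClassifier (assuming scikit-learn is available) supports both ideas: parameters such as max_depth halt construction early (pre-pruning), while ccp_alpha applies cost-complexity pruning to a fully grown tree (post-pruning).

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing once the tree reaches depth 3.
pre_pruned = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Post-pruning: grow fully, then prune with the cost-complexity parameter alpha.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())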

Bayesian Classification:
Bayesian Classification is a probabilistic approach to classify data points based on Bayes'
Theorem. It's widely used in machine learning and artificial intelligence, especially for tasks
involving uncertainty and decision-making.

The key principle of Bayesian classification is to calculate the posterior probability of each class
given the data, then assign the class with the highest posterior probability.

Bayes' Theorem
Bayes' Theorem is the foundation of Bayesian classification. It can be stated as:
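P(C | X) = P(X | C) · P(C) / P(X)

where P(C | X) is the posterior probability of class C given the data X, P(X | C) is the likelihood of
the data given the class, P(C) is the prior probability of the class, and P(X) is the evidence (the
probability of the data).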

Naive Bayes Classifier:
The most common form of Bayesian classification is the Naive Bayes Classifier, which assumes that the
features are conditionally independent given the class.
This simplifies the computation of the likelihood as:
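P(X | C) = P(x1 | C) · P(x2 | C) · … · P(xn | C)

where x1, x2, …, xn are the individual feature values of X.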

Steps for Bayesian Classification


1. Calculate Prior Probability: Estimate the probability of each class based on the training
data.
2. Calculate Likelihood: For each class, compute the likelihood for each feature (assuming
independence).
3. Calculate Posterior Probability: Use Bayes' Theorem to calculate the posterior probability
for each class.
4. Make a Prediction: Classify the instance into the class with the highest posterior
probability.
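A minimal Python sketch of these four steps for categorical attributes (the function and variable names are my own illustration, and no smoothing is applied):

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    # rows: list of tuples of categorical feature values; labels: class labels.
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}      # Step 1: P(C)
    likelihoods = defaultdict(float)                              # Step 2: P(x_i = v | C)
    for row, c in zip(rows, labels):
        for i, value in enumerate(row):
            likelihoods[(i, value, c)] += 1.0 / class_counts[c]
    return priors, likelihoods

def predict(priors, likelihoods, row):
    scores = {}
    for c, prior in priors.items():
        score = prior                                             # Step 3: P(C) * prod P(x_i | C)
        for i, value in enumerate(row):
            score *= likelihoods[(i, value, c)]
        scores[c] = score
    return max(scores, key=scores.get)                            # Step 4: highest posterior

# Tiny toy example with two categorical features.
priors, likelihoods = train_naive_bayes(
    [("youth", "yes"), ("youth", "no"), ("senior", "yes")],
    ["yes", "no", "yes"])
print(predict(priors, likelihoods, ("youth", "yes")))             # -> "yes"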
Advantages
• Works well with small datasets.
• Easy to implement and understand.
• Handles both continuous and discrete data.
Disadvantages
• The assumption of independence between features is often violated in real-world data.
• Can perform poorly if the data has strongly correlated features.
Bayesian classifiers, particularly Naive Bayes, are fundamental tools in probabilistic applications
like:
• Spam filtering
• Document classification
• Sentiment analysis
• Medical diagnosis

Predicting a class label using Naive Bayesian classification.

We wish to predict the class label of a tuple using Naive Bayesian classification, given the same
training data as in Example 8.3 for decision tree induction. The training data were shown earlier
in Table 8.1. The data tuples are described by the attributes age, income, student, and credit
rating. The class label attribute, buys computer, has two distinct values (namely, yes, no). Let C1
correspond to the class buys computer = yes and C2 correspond to buys computer = no. The tuple
we wish to classify is
X = (age= youth, income= medium, student =yes, credit rating = fair)

Rule-Based Classification

Using IF-THEN Rules for Classification

Rule-based classification is a technique in machine learning and artificial intelligence where rules are
used to classify data points into categories. The classification is based on "if-then" rules that are typically
created manually by domain experts or automatically through algorithms. Each rule defines a condition
that, if satisfied, leads to a specific classification.

Key Concepts of Rule-Based Classification

1. Rule Structure: A rule in a rule-based classifier generally has the form:

IF condition THEN class

o Condition: This part defines a set of criteria that the input data must meet. Conditions are
usually based on feature values.

o Class: The class or label to be assigned to the data point if the condition is satisfied.

Example:

IF "age > 18" AND "income > 50k" THEN "Approve Loan"

Rule Set: A rule-based classifier is composed of a set of rules. During classification, the system evaluates
the input data against each rule in the set. If the data satisfies the condition of a rule, it is assigned the
corresponding class (a small code sketch after this list illustrates this evaluation).

2. Mutually Exclusive Rules: In some systems, rules are mutually exclusive, meaning a data point
can only satisfy one rule. However, in other systems, a data point may match multiple rules, and
mechanisms such as rule priority or voting are used to determine the final classification.

3. Rules Can Be Hierarchical: Some rule-based systems allow rules to be organized hierarchically,
where rules at higher levels are broader and rules at lower levels are more specific. This creates a
decision tree-like structure.
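A small Python sketch of this first-match, IF-THEN evaluation (the rules, thresholds, and helper names here are illustrative assumptions, not taken from the notes):

# Each rule is a (condition, class) pair; conditions are predicates over a record dict.
rules = [
    (lambda r: r["age"] > 18 and r["income"] > 50_000, "Approve Loan"),
    (lambda r: r["age"] <= 18, "Reject Loan"),
]

def classify(record, rules, default="Manual Review"):
    # Evaluate rules in order; the first rule whose condition is satisfied fires.
    for condition, label in rules:
        if condition(record):
            return label
    return default          # fallback when no rule covers the record

print(classify({"age": 30, "income": 60_000}, rules))   # Approve Loan
print(classify({"age": 25, "income": 20_000}, rules))   # Manual Review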

Advantages of Rule-Based Classification

• Interpretability: The rules are typically easy to understand, which makes rule-based classifiers
highly interpretable and transparent compared to many other machine learning techniques.

• Flexibility: Experts can manually define rules based on domain knowledge, making this method
adaptable to specific needs or situations.

• Speed: Once the rules are defined, the classification process is usually fast, as it only involves
evaluating the conditions in each rule.

Disadvantages of Rule-Based Classification

• Scalability: It can become challenging to define an extensive set of rules, especially for complex
problems with many variables.

• Incomplete Knowledge: If the rules don't cover all possible cases, certain input data may be left
unclassified, or the system might struggle with ambiguous cases.

• Overfitting: If rules are created based on specific examples from training data, they might not
generalize well to unseen data.

Applications of Rule-Based Classification

1. Expert Systems: Rule-based systems were the foundation of early expert systems, such as
medical diagnosis systems (e.g., MYCIN for diagnosing bacterial infections).

2. Business Decision Making: In banking and finance, rule-based systems are often used for
decisions like loan approvals (e.g., "If credit score > 700, approve loan").

3. Spam Filtering: Email filtering systems can use rules like "If the email contains the word
'money' AND is from an unknown sender, classify as spam."

4. Fraud Detection: Rule-based systems are used to detect fraud by applying rules like "If
transaction amount > $5000 AND location is outside the country, flag as suspicious."

Classification by Backpropagation in Data Mining:

Backpropagation is an algorithm that propagates the error from the output nodes back to the input
nodes; in other words, it propagates the errors backwards through the network. It is applied in
various data mining applications, such as character recognition, signature verification, etc.

Neural Network

A neural network is an information-processing paradigm inspired by the human nervous system.
Like the human nervous system, it is built from artificial neurons. The human brain has about 10
billion neurons, each connected to an average of 10,000 other neurons. Each neuron receives a signal
through a synapse, which controls the effect of the signal on the neuron.

Backpropagation

Backpropagation is an algorithm widely used for training neural networks. It computes the gradient
of the loss function with respect to the weights of the network, and it does so efficiently, computing
the gradient for each weight directly. This makes it possible to use gradient methods to train
multi-layer networks and to update the weights so as to minimize the loss; variants such as gradient
descent or stochastic gradient descent are commonly used.

The core of the backpropagation algorithm is to compute the gradient of the loss function with
respect to each weight via the chain rule, computing the gradient layer by layer and iterating
backwards from the last layer so as to avoid redundant computation of intermediate terms in the
chain rule.
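This layer-by-layer chain-rule computation can be illustrated with a small NumPy sketch of my own: a tiny network with one hidden layer, sigmoid activations, and a squared-error loss, trained by gradient descent. The naming (W1, b1 for the hidden layer; W2, b2 for the output layer) is just an illustrative choice.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((4, 1))                 # input vector, shape (4, 1)
y = np.array([[1.0]])                  # target output

W1, b1 = rng.random((3, 4)), np.zeros((3, 1))   # input -> hidden (3 neurons)
W2, b2 = rng.random((1, 3)), np.zeros((1, 1))   # hidden -> output (1 neuron)

for step in range(1000):
    # Forward pass: weighted inputs z and activations a, layer by layer.
    z2 = W1 @ x + b1
    a2 = sigmoid(z2)
    z3 = W2 @ a2 + b2
    a3 = sigmoid(z3)

    # Backward pass: apply the chain rule from the last layer backwards.
    delta3 = (a3 - y) * a3 * (1 - a3)           # dLoss/dz3 for squared error
    grad_W2, grad_b2 = delta3 @ a2.T, delta3
    delta2 = (W2.T @ delta3) * a2 * (1 - a2)    # propagate the error to the hidden layer
    grad_W1, grad_b1 = delta2 @ x.T, delta2

    # Gradient-descent weight update.
    lr = 0.5
    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

print(a3.item())    # close to the target 1.0 after training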

Features of Backpropagation:

There are several notable features of backpropagation. These features are as follows.

1. It is a gradient method used to train simple perceptron-like networks with differentiable
units.

2. It differs from other networks in how the weights are calculated during the learning phase
of the network.

3. Training proceeds in three stages:

o the feed-forward of the input training pattern,

o the calculation and backpropagation of the error, and

o the updating of the weights.

Working of Backpropagation:

A neural network can be trained with supervised learning: the network produces output vectors from
the input vectors it operates on. The produced output is compared with the desired output, and an
error report is generated if they do not match. The weights are then adjusted according to this error
report so that the network moves towards the desired output.

Backpropagation Algorithm:

It is one of the most fundamental building blocks of neural network training. It was introduced in the
1960s and popularized in 1986 by Rumelhart, Hinton and Williams in a paper called "Learning
representations by back-propagating errors".

Consider a 4-layer neural network with 4 neurons in the input layer, 4 in each hidden layer, and 1 in
the output layer.

In the accompanying illustration, the input data are shown in purple. These inputs may be simple
scalar values or, more generally, vectors.

The first set of activations (a¹) equals the input values. An "activation" is the value of a neuron after
the activation function has been applied.

The green colour indicates the final values of the neurons. These are calculated using z^l, the
weighted inputs in layer l, and a^l, the activations in layer l. For layer 2 and layer 3 the equations
become:

For l = 2:  z² = W²·a¹ + b²,  a² = f(z²)

For l = 3:  z³ = W³·a² + b³,  a³ = f(z³)

Here, W² and W³ are the weight matrices for layer 2 and layer 3, respectively, and b² and b³ are the
biases for layer 2 and layer 3.

We can compute a² and a³ using an activation function f. Here f is non-linear, and this non-linearity
allows the network to learn complex patterns in the data.
Looking carefully, you will see that x, z², a², z³, a³, W¹, W², b¹, and b² are written without subscripts
in the 4-layer network illustration above. The reason is that all the parameter values have been
combined into matrices, grouped by layer. This is the standard way of working with neural networks,
and one should be comfortable with the calculations. Let us understand this by taking an example.

Let's take two layers and their parameters as an example. We can apply the same operation in any
layer of any network.

o Here, W¹ is the weight matrix of shape (n, m), where n is the number of output neurons (neurons in
the next layer) and m is the number of input neurons (neurons in the previous layer). Let's take n = 2
and m = 4 in our example.

The first number in any weight's subscript matches the index of the neuron in the next layer (in our
case, this is the Hidden_2 layer), and the second number matches the index of the neuron in the
previous layer (in our case, the Input layer).

o Here, x is the input vector of shape (m, 1), where m is the number of input neurons. Let's take
m = 4.

o Here, b¹ is the bias vector of shape (n, 1), where n is the number of neurons in the current layer.
Here, we take n = 2.
o Following the equation for z², we can use the above definitions of W¹, x and b¹ to derive the
"Equation for z²":  z² = W¹x + b¹

Now, observe the neural network picture in the figure below.

In the figure, z² can be expressed using (z_1)² and (z_2)², where (z_1)² and (z_2)² are the sums of the
products of every input x_i with the corresponding weight (W_ij)¹. This leads to the "Equation for
z²" above and shows that the matrix representations for z², a², z³ and a³ are correct. The final part of
a neural network is the output layer, which produces the predicted value. In our simple example, it is
represented by a single neuron, coloured in blue, and is evaluated as follows:
Need for Backpropagation:

Backpropagation is very useful for training neural networks. It is easy to implement and simple in
nature. Apart from the number of inputs, it requires no parameters to be tuned. It is also a flexible
method because it does not require prior knowledge about the network.

Types of Backpropagation

There are two types of backpropagation in neural networks, as follows.

1. Static backpropagation:
Static backpropagation is used in networks designed to map static inputs to static outputs. Such
networks can solve static classification problems such as OCR (Optical Character Recognition).

2. Recurrent backpropagation:
Recurrent backpropagation is used for fixed-point learning. Activation in recurrent backpropagation
is fed forward until a fixed value is reached. Static backpropagation provides an instant mapping,
while recurrent backpropagation does not.
