DWDM Unit IV Note
• Scalability: With large datasets, classification algorithms need to be scalable to handle
vast amounts of data efficiently. Some algorithms may struggle with performance or
memory issues as data grows.
Prediction in Data Mining
Prediction in data mining involves forecasting future outcomes based on historical data. It is carried out through regression analysis, time-series forecasting, and other predictive modeling techniques.
Key Issues:
• Data Quality: The quality of the input data (e.g., missing values, noise, and outliers) can
significantly affect the performance of predictive models. Poor-quality data leads to
inaccurate predictions.
• Generalization: A key challenge is building models that generalize well to unseen data, as
opposed to performing well only on the training dataset. Ensuring generalization requires
careful validation and testing procedures.
• Dynamic Data: In some applications (e.g., stock market prediction, weather forecasting),
the underlying data distribution may change over time. Models need to be continuously
updated or retrained to adapt to these changes.
• Evaluation Metrics: Choosing appropriate evaluation metrics is essential for prediction
tasks. Different metrics (e.g., accuracy, precision, recall, F1-score) may provide different
perspectives on the performance of a model, and the choice depends on the specific
application and problem (see the short example after this list).
• Bias and Variance Trade-off: Balancing bias (error due to simplifying assumptions in the
model) and variance (error due to sensitivity to small fluctuations in the training set) is a
key challenge in predictive modeling. High bias can lead to underfitting, while high
variance can lead to overfitting.
• Interpretability vs. Accuracy: Similar to classification, there is often a trade-off between
the accuracy of a predictive model and its interpretability. More complex models may
yield better predictions but be harder to understand.
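For illustration, here is a minimal sketch (assuming scikit-learn is available) of how the metrics mentioned above can give different views of the same predictions; the labels below are hypothetical:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and model predictions for a binary task
# (1 = positive class, 0 = negative class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # overall fraction of correct predictions
print("Precision:", precision_score(y_true, y_pred))  # of the predicted positives, how many are correct
print("Recall   :", recall_score(y_true, y_pred))     # of the actual positives, how many were found
print("F1-score :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall

Here accuracy is 0.8 while precision, recall and F1 are all 0.75, which shows why a single metric rarely tells the whole story.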
BASIC CONCEPTS:
Classification, which is the task of assigning objects to one of several predefined categories, is a
pervasive problem that encompasses many diverse applications. Examples include detecting
spam email messages based upon the message header and content.
Classification is the task of learning a target function f that maps each attribute set x to one of the
predefined class labels y. The target function is also known informally as a classification model.
A classification model is useful both as a descriptive tool, summarizing the data and explaining which attribute values characterize each class, and as a predictive tool, assigning a class label to previously unseen records.
Each technique employs a learning algorithm to identify a model that best fits the relationship
between the attribute set and class label of the input data. The model generated by a learning
algorithm should both fit the input data well and correctly predict the class labels of records it
has never seen before.
Therefore, a key objective of the learning algorithm is to build models with good generalization
capability; i.e., models that accurately predict the class labels of previously unknown records.
Decision Tree Induction
A decision tree is a flowchart-like tree structure, where each internal (nonleaf) node denotes a test
on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds
a class label. The topmost node in the tree is the root node.
• A decision tree performs classification in the form of a tree structure. It breaks the dataset
down into smaller and smaller subsets while the corresponding tree is built incrementally.
• The final result is a tree with decision nodes and leaf nodes.
• Decision tree induction is the learning of decision trees from class-labeled training
tuples.
• A tree has three types of nodes:
• Root node, which has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing
edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing
edges.
Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition,
D.
Input:
• Data partition, D, which is a set of training tuples and their associated class labels;
• attribute list, the set of candidate attributes;
• Attribute selection method, a procedure to determine the splitting criterion that “best” partitions
the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly,
either a split-point or splitting subset.
Output: A decision tree.
• Method:
• (1) create a node N;
• (2) if tuples in D are all of the same class, C, then
• (3) return N as a leaf node labeled with the class C;
• (4) if attribute list is empty then
• (5) return N as a leaf node labeled with the majority class in D; // majority voting
• (6) apply Attribute selection method(D, attribute list) to find the “best” splitting
criterion;
• (7) label node N with splitting criterion;
• (8) if splitting attribute is discrete-valued and multi-way splits allowed then // not
restricted to binary trees
• (9) attribute list <-attribute list -splitting attribute; // remove splitting attribute
• (10) for each outcome j of splitting criterion
• // partition the tuples and grow sub-trees for each partition
• (11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
• (12) if Dj is empty then
• (13) attach a leaf labeled with the majority class in D to node N;
• (14) else attach the node returned by Generate decision tree(Dj , attribute list) to node
N;
• End for
• (15) return N;
Attribute selection measures are also known as splitting rules because they determine how the
tuples at a given node are to be split.
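As a rough illustration of the procedure above, the following Python sketch mirrors its structure. The data representation (tuples as dictionaries with a "class" key) and the helper select_best_attribute, a stand-in for Attribute_selection_method, are assumptions made for this sketch.

from collections import Counter

def majority_class(D):
    # Majority voting: return the most common class label in partition D.
    return Counter(t["class"] for t in D).most_common(1)[0][0]

def select_best_attribute(D, attribute_list):
    # Placeholder splitting criterion; a real version would use information
    # gain, gain ratio, or the Gini index.
    return attribute_list[0]

def generate_decision_tree(D, attribute_list):
    classes = {t["class"] for t in D}
    # Steps (2)-(3): all tuples belong to the same class -> leaf node.
    if len(classes) == 1:
        return {"leaf": classes.pop()}
    # Steps (4)-(5): no attributes left -> leaf labeled with the majority class.
    if not attribute_list:
        return {"leaf": majority_class(D)}
    # Steps (6)-(7): choose the "best" splitting attribute for this node.
    split_attr = select_best_attribute(D, attribute_list)
    node = {"attribute": split_attr, "branches": {}}
    remaining = [a for a in attribute_list if a != split_attr]  # step (9)
    # Steps (10)-(14): partition the tuples and grow a sub-tree per outcome.
    for outcome in {t[split_attr] for t in D}:
        Dj = [t for t in D if t[split_attr] == outcome]
        if not Dj:
            node["branches"][outcome] = {"leaf": majority_class(D)}
        else:
            node["branches"][outcome] = generate_decision_tree(Dj, remaining)
    return node  # step (15)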
ID3 (Iterative Dichotomiser 3)
ID3 is a classification algorithm that follows a greedy approach, at each step selecting the
attribute that yields the maximum information gain (IG), i.e., the minimum remaining entropy (H).
What are the steps in the ID3 algorithm?
The steps in the ID3 algorithm are as follows:
1. Calculate the entropy of the dataset.
2. For each attribute/feature:
2.1. Calculate the entropy for all of its categorical values.
2.2. Calculate the information gain for the feature.
3. Find the feature with the maximum information gain.
4. Repeat the process on each resulting subset until the desired tree is obtained.
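The two quantities the algorithm relies on can be sketched in Python as follows; the data representation (a list of dictionaries with a "class" key) is an assumption made for this sketch.

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum over classes of p_i * log2(p_i).
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(D, attribute):
    # IG(S, A) = H(S) - sum over values v of (|S_v| / |S|) * H(S_v).
    labels = [t["class"] for t in D]
    total = len(D)
    remainder = 0.0
    for value in {t[attribute] for t in D}:
        subset = [t["class"] for t in D if t[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder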
• The complete entropy of the dataset is:
H(S) = -p(yes)*log2(p(yes)) - p(no)*log2(p(no))
     = -(9/14)*log2(9/14) - (5/14)*log2(5/14)
     = 0.41 + 0.53 = 0.94
First Attribute - Outlook
Categorical values - sunny, overcast and rain
H(Outlook=sunny) = -(2/5)*log2(2/5) - (3/5)*log2(3/5) = 0.971
H(Outlook=rain) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.971
H(Outlook=overcast) = -(4/4)*log2(4/4) - 0 = 0
Average entropy information for Outlook:
I(Outlook) = p(sunny)*H(Outlook=sunny) + p(rain)*H(Outlook=rain) + p(overcast)*H(Outlook=overcast)
           = (5/14)*0.971 + (5/14)*0.971 + (4/14)*0 = 0.693
Information Gain = H(S) - I(Outlook) = 0.94 - 0.693 = 0.247
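The arithmetic above can be double-checked with a few lines of Python:

from math import log2

def H(pos, neg):
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c > 0)

H_S = H(9, 5)                                                        # entropy of the full dataset, about 0.940
I_outlook = (5/14) * H(2, 3) + (5/14) * H(3, 2) + (4/14) * H(4, 0)   # about 0.693
print(round(H_S, 3), round(H_S - I_outlook, 3))                      # prints 0.94 0.247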
The information gains for the remaining attributes (Temperature, Humidity and Wind) are computed in the same way, and each of them is smaller than the gain for Outlook.
Here, the attribute with the maximum information gain is Outlook, so Outlook becomes the root
node of the decision tree built so far.
Now we find the best attribute for splitting the data where Outlook = Sunny (dataset rows
[1, 2, 8, 9, 11]).
• Categorical values - weak, strong
H(Sunny, Wind=weak) = -(1/3)*log2(1/3) - (2/3)*log2(2/3) = 0.918
H(Sunny, Wind=strong) = -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1
Average entropy information for Wind:
I(Sunny, Wind) = p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny, strong)*H(Sunny, Wind=strong)
              = (3/5)*0.918 + (2/5)*1 = 0.9508
Information Gain = H(Sunny) - I(Sunny, Wind) = 0.971 - 0.9508 = 0.0202
Among the attributes of the Sunny subset, Humidity gives the maximum information gain, so it
becomes the splitting node under the Sunny branch. When Outlook = Sunny and Humidity = High, all
tuples belong to the pure class "no", and when Outlook = Sunny and Humidity = Normal, all tuples
belong to the pure class "yes". Therefore, we don't need to do any further calculations for this branch.
Now we find the best attribute for splitting the data where Outlook = Rain (dataset rows
[4, 5, 6, 10, 14]).
The complete entropy of the Rain subset is:
H(Rain) = -p(yes)*log2(p(yes)) - p(no)*log2(p(no)) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.971
• Categorical values - weak, strong
H(Rain, Wind=weak) = -(3/3)*log2(3/3) - 0 = 0
H(Rain, Wind=strong) = 0 - (2/2)*log2(2/2) = 0
Average entropy information for Wind:
I(Rain, Wind) = p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain, Wind=strong)
             = (3/5)*0 + (2/5)*0 = 0
Information Gain = H(Rain) - I(Rain, Wind) = 0.971 - 0 = 0.971
• Here, the attribute with the maximum information gain is Wind, so Wind becomes the splitting node under the Rain branch.
Tree Pruning
Decision trees are built to classify the dataset; while doing so we run into two problems:
➢ Underfitting
➢ Overfitting
Underfitting
This problem arises when both the training error and the test error are large, which happens when
the model is made too simple.
Overfitting
This problem arises when the training error is small but the test error is large.
In that case the training error no longer provides a good estimate of how well the tree will perform
on previously unseen records, so new ways of estimating the error are needed. Tree pruning
addresses this problem.
The process of adjusting a decision tree to minimize the misclassification error is called pruning.
• Tree pruning is performed in order to remove anomalies in the training data due to noise
or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
• There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
Ex: ID3 algorithm
• Post-pruning - This approach removes a sub-tree from a fully grown tree.
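As a hedged sketch (assuming scikit-learn and its built-in Iris dataset), the two pruning styles can be tried as follows: pre-pruning by limiting the growth of the tree, and post-pruning via cost-complexity pruning of a fully grown tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: halt construction early with limits such as max_depth and min_samples_leaf.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then remove sub-trees using the
# cost-complexity parameter ccp_alpha.
path = DecisionTreeClassifier().cost_complexity_pruning_path(X_train, y_train)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2])
post_pruned.fit(X_train, y_train)

print("pre-pruned test accuracy :", pre_pruned.score(X_test, y_test))
print("post-pruned test accuracy:", post_pruned.score(X_test, y_test))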
Bayesian Classification:
Bayesian Classification is a probabilistic approach to classify data points based on Bayes'
Theorem. It's widely used in machine learning and artificial intelligence, especially for tasks
involving uncertainty and decision-making.
The key principle of Bayesian classification is to calculate the posterior probability of each class
given the data, then assign the class with the highest posterior probability.
Bayes' Theorem
Bayes' Theorem is the foundation of Bayesian classification. For a class (hypothesis) C and a data
tuple X, it can be stated as:
P(C | X) = P(X | C) * P(C) / P(X)
where P(C | X) is the posterior probability of C given X, P(X | C) is the likelihood of X given C,
P(C) is the prior probability of the class, and P(X) is the probability of the data (the evidence).
Naive Bayes Classifier:
The most common form of Bayesian classification is the Naive Bayes Classifier, which assumes that the
features are conditionally independent given the class.
This simplifies the computation of the likelihood to a product of per-attribute probabilities:
P(X | C) = P(x1 | C) * P(x2 | C) * ... * P(xn | C)
where x1, x2, ..., xn are the values of the attributes of the tuple X.
Predicting a class label using Naive Bayesian classification.
We wish to predict the class label of a tuple using Naive Bayesian classification, given the same
training data as in Example 8.3 for decision tree induction. The training data were shown earlier
in Table 8.1. The data tuples are described by the attributes age, income, student, and credit
rating. The class label attribute, buys computer, has two distinct values (namely, yes, no). Let C1
correspond to the class buys computer = yes and C2 correspond to buys computer = no. The tuple
we wish to classify is
X = (age = youth, income = medium, student = yes, credit rating = fair)
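The posterior computation itself is only counting and multiplying. Below is a minimal count-based sketch; the four training tuples are a hypothetical stand-in (Table 8.1 itself is not reproduced in these notes), so the point is the structure of the calculation P(Ci | X) ∝ P(X | Ci) * P(Ci), not the numbers.

from collections import Counter, defaultdict

# Hypothetical mini training set: (attribute values, class label).
train = [
    ({"age": "youth",  "income": "high",   "student": "no",  "credit": "fair"},      "no"),
    ({"age": "youth",  "income": "medium", "student": "yes", "credit": "fair"},      "yes"),
    ({"age": "senior", "income": "medium", "student": "yes", "credit": "fair"},      "yes"),
    ({"age": "senior", "income": "low",    "student": "no",  "credit": "excellent"}, "no"),
]
X = {"age": "youth", "income": "medium", "student": "yes", "credit": "fair"}

class_counts = Counter(label for _, label in train)
cond_counts = defaultdict(Counter)              # (class, attribute) -> value counts
for tup, label in train:
    for attr, value in tup.items():
        cond_counts[(label, attr)][value] += 1

posteriors = {}
for c, n_c in class_counts.items():
    p = n_c / len(train)                        # prior P(Ci)
    for attr, value in X.items():               # likelihood: product of P(xk | Ci)
        p *= cond_counts[(c, attr)][value] / n_c
    posteriors[c] = p

print(max(posteriors, key=posteriors.get), posteriors)

In practice a Laplacian correction (adding one to each count) is applied so that a single zero count does not force an entire posterior probability to zero.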
Rule-Based Classification
Rule-based classification is a technique in machine learning and artificial intelligence where rules are
used to classify data points into categories. The classification is based on "if-then" rules that are typically
created manually by domain experts or automatically through algorithms. Each rule defines a condition
that, if satisfied, leads to a specific classification.
1. IF-THEN Rules: Each rule consists of two parts:
o Condition: This part defines a set of criteria that the input data must meet. Conditions are
usually based on feature values.
o Class: The class or label to be assigned to the data point if the condition is satisfied.
Example:
IF "age > 18" AND "income > 50k" THEN "Approve Loan"
Rule Set: A rule-based classifier is composed of a set of rules. During classification, the system evaluates
the input data against each rule in the set. If the data satisfies the condition of a rule, it is assigned the
corresponding class (a small code sketch of such a rule set is shown after this list).
2. Mutually Exclusive Rules: In some systems, rules are mutually exclusive, meaning a data point
can only satisfy one rule. However, in other systems, a data point may match multiple rules, and
mechanisms such as rule priority or voting are used to determine the final classification.
3. Rules Can Be Hierarchical: Some rule-based systems allow rules to be organized hierarchically,
where rules at higher levels are broader and rules at lower levels are more specific. This creates a
decision tree-like structure.
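A minimal sketch of such a rule set, evaluated in order with a default class for uncovered records, might look as follows; the rules, thresholds and default class are illustrative assumptions.

# Each rule is a (condition, class) pair; the first rule whose condition holds fires.
rules = [
    (lambda rec: rec["age"] > 18 and rec["income"] > 50_000, "Approve Loan"),
    (lambda rec: rec["age"] <= 18,                           "Reject Loan"),
]
DEFAULT_CLASS = "Manual Review"      # used when no rule covers the record

def classify(record):
    for condition, label in rules:
        if condition(record):
            return label
    return DEFAULT_CLASS

print(classify({"age": 30, "income": 60_000}))   # Approve Loan
print(classify({"age": 25, "income": 20_000}))   # Manual Review (no rule covers this case)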
Advantages of Rule-Based Classification
• Interpretability: The rules are typically easy to understand, which makes rule-based classifiers
highly interpretable and transparent compared to many other machine learning techniques.
• Flexibility: Experts can manually define rules based on domain knowledge, making this method
adaptable to specific needs or situations.
• Speed: Once the rules are defined, the classification process is usually fast, as it only involves
evaluating the conditions in each rule.
Limitations of Rule-Based Classification
• Scalability: It can become challenging to define an extensive set of rules, especially for complex
problems with many variables.
• Incomplete Knowledge: If the rules don't cover all possible cases, certain input data may be left
unclassified, or the system might struggle with ambiguous cases.
• Overfitting: If rules are created based on specific examples from training data, they might not
generalize well to unseen data.
Applications of Rule-Based Classification
1. Expert Systems: Rule-based systems were the foundation of early expert systems, such as
medical diagnosis systems (e.g., MYCIN for diagnosing bacterial infections).
2. Business Decision Making: In banking and finance, rule-based systems are often used for
decisions like loan approvals (e.g., "If credit score > 700, approve loan").
3. Spam Filtering: Email filtering systems can use rules like "If the email contains the word
'money' AND is from an unknown sender, classify as spam."
4. Fraud Detection: Rule-based systems are used to detect fraud by applying rules like "If
transaction amount > $5000 AND location is outside the country, flag as suspicious."
Classification by Backpropagation in Data Mining:
Backpropagation is an algorithm that propagates the error from the output nodes back to the input
nodes; in other words, it propagates all the errors backwards through the network. It is used in many
data mining applications, such as character recognition and signature verification.
Neural Network
A neural network is an information-processing paradigm inspired by the human nervous system.
Like the human nervous system, it is made up of artificial neurons. The human brain has about
10 billion neurons, each connected to an average of 10,000 other neurons. Each neuron receives a
signal through a synapse, which controls the effect of the signal on the neuron.
Backpropagation
Backpropagation is an algorithm widely used for training neural networks. It computes the gradient
of the loss function with respect to each weight of the network, and it does so efficiently, directly
computing the gradient for every weight. This makes it possible to use gradient methods, such as
gradient descent or stochastic gradient descent, to train multi-layer networks and update the weights
so as to minimize the loss.
The main work of the backpropagation algorithm is to compute the gradient of the loss function with
respect to each weight via the chain rule, computing the gradient layer by layer and iterating backwards
from the last layer to avoid redundant computation of intermediate terms in the chain rule.
Features of Backpropagation:
1. It differs from other networks in the way the weights are calculated during the learning
phase of the network.
2. Training is carried out in three stages: the feed-forward of the input training pattern, the
calculation and backpropagation of the error, and the updating of the weights.
Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from the input vectors that the
network operates on. The generated output is compared with the desired output, and an error is
computed if the two do not match. The weights are then adjusted according to this error until the
desired output is obtained.
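The whole cycle can be sketched compactly with NumPy. The 2-4-1 architecture, the sigmoid activation, the squared-error loss, the XOR targets and the learning rate below are illustrative assumptions, not part of the notes above.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # input vectors
Y = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs (XOR)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(5000):
    # Forward pass: generate the output vector from the input vectors.
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    # Backward pass: propagate the error from the output layer towards the input layer.
    delta2 = (a2 - Y) * a2 * (1 - a2)            # gradient at the output layer (squared error)
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)     # chain rule, one layer back
    # Adjust the weights and biases by gradient descent to reduce the error.
    W2 -= lr * (a1.T @ delta2); b2 -= lr * delta2.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ delta1);  b1 -= lr * delta1.sum(axis=0, keepdims=True)

print(np.round(a2.ravel(), 2))   # outputs should move towards [0, 1, 1, 0]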
Backpropagation Algorithm:
It is the most fundamental building block of neural network training. It originated in the 1960s and
was popularized in 1986 by Rumelhart, Hinton and Williams in the paper "Learning representations
by back-propagating errors".
The example 4-layer neural network consists of 4 neurons in the input layer, 4 in each hidden layer
and 1 in the output layer.
In the accompanying illustration (not reproduced here), purple represents the input data, which can
be as simple as scalars or more complex objects such as vectors or multidimensional matrices.
The first set of activations (a) is equal to the input values. The term activation refers to a neuron's
value, which is obtained after applying the activation function.
The green colour indicates the final values of the neurons, which are calculated using z^l, the
weighted input in layer l, and a^l, the activation in layer l. For layer 2 and layer 3 the equations
become:
z² = W¹·x + b¹, a² = f(z²)
z³ = W²·a² + b², a³ = f(z³)
Here W¹ and W² are the weight matrices used to compute layers 2 and 3, respectively, and b¹ and b²
are the corresponding bias vectors.
We can compute a² and a³ by applying an activation function f. Here the function f is non-linear,
which is what allows the network to learn complex patterns in the data.
Looking carefully, we can see that x, z², a², z³, a³, W¹, W², b¹, and b² are written without subscripts
in the 4-layer network description above. The reason is that all the parameter values have been
combined into matrices, grouped by layer. This is the standard way of working with neural networks,
and one should be comfortable with the corresponding calculations. Let us look at an example.
Let's take two layers and their parameters as an example. We can apply the same operation in any
layer of any network.
o Here, W1 is known as the weight matrix of shape (n, m), where n is known as the number of
output neurons (neurons in the next layer), and m is the number of input neurons (neurons in
the previous layer). Let's take n = 2 and m = 4 in our example.
Here, the first number in any weight's subscript matches the index of the neuron in the next layer (in
our case, this is the Hidden_2 layer), and the second number matches the index of the neuron in the
previous layer (in our case this is the Input layer).
o Here, x is known as the input vector of shape (m, 1), where m is the number of input neurons.
Let's take m = 4.
o Here, b1 is known as the bias vector of shape (n, 1), in which n is known as the number of
neurons in the current layer. Here, we take n = 2.
o Following the equation for z², we can use the above definitions of W¹, x and b¹ to derive the
"Equation for z²":
z² = W¹·x + b¹
In the corresponding network diagram (not reproduced here), z² can be expressed using (z_1)² and
(z_2)², where (z_1)² and (z_2)² are the sums of the products of every input x_i with the corresponding
weight (W_ij)¹. This leads to the same "Equation for z²" and shows that the matrix representations for
z², a², z³ and a³ are correct. The final part of the neural network is the output layer, which produces
the predicted value. In our simple example it is presented as a single neuron whose value is a weighted
combination of the activations of the preceding layer.
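The shapes used above can be checked with a short NumPy snippet; the concrete numbers are illustrative assumptions, and a sigmoid is used as an example of the non-linear function f.

import numpy as np

n, m = 2, 4                                       # n output neurons, m input neurons
W1 = np.arange(1, n * m + 1).reshape(n, m) * 0.1  # weight matrix of shape (n, m)
x  = np.array([[1.0], [0.0], [2.0], [1.0]])       # input vector of shape (m, 1)
b1 = np.array([[0.5], [-0.5]])                    # bias vector of shape (n, 1)

z2 = W1 @ x + b1                                  # "Equation for z²"
a2 = 1.0 / (1.0 + np.exp(-z2))                    # a² = f(z²)
print(z2.shape, a2.shape)                         # (2, 1) (2, 1)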
Need for Backpropagation:
Backpropagation is very useful for training neural networks. It is easy to implement and simple in
nature. It does not require any parameters to be tuned apart from the number of inputs, and it is a
flexible method because no prior knowledge of the network is required.
Types of Backpropagation
There are two types of backpropagation networks. These are as follows.
1. Static backpropagation:
Static backpropagation is designed to map static inputs to static outputs. Such networks can
solve static classification problems such as OCR (Optical Character Recognition).
2. Recurrent backpropagation:
Recurrent backpropagation is used for fixed-point learning. In recurrent backpropagation,
activations are fed forward until a fixed value is reached. Static backpropagation provides an
instant mapping, while recurrent backpropagation does not provide an instant mapping.