Classification Unit 3
Classification:
Classification is the process of finding a good model that describes the data classes or
concepts, and the purpose of classification is to predict the class of objects whose class
label is unknown. In simple terms, we can think of classification as categorizing incoming new
data based on the assumptions we have already made and the data we already have with us.
Prediction:
We can think of prediction as estimating something that may happen in the future. In
prediction, we identify or estimate the missing or unavailable value for a new observation,
based on the data we already have and on assumptions about the future. In prediction, the
output is a continuous value.
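To make the distinction concrete, here is a minimal sketch (assuming scikit-learn is available; the hours-studied feature, the labels, and the scores are made-up toy data). The classifier outputs a discrete class label, while the regressor outputs a continuous value, which is what prediction refers to here.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

hours = [[1], [2], [3], [8], [9], [10]]        # single feature: hours studied (hypothetical)
passed = [0, 0, 0, 1, 1, 1]                    # class labels -> classification
score = [35.0, 40.0, 48.0, 82.0, 88.0, 95.0]   # continuous target -> prediction

# Classification: the output is one of the known class labels.
clf = DecisionTreeClassifier().fit(hours, passed)
print(clf.predict([[7]]))   # e.g. [1], a discrete class

# Prediction: the output is a continuous, numeric value.
reg = DecisionTreeRegressor().fit(hours, score)
print(reg.predict([[7]]))   # e.g. [82.], an estimated continuous value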
Difference between Prediction and Classification:
Prediction: e.g., predicting the correct treatment for a particular disease for an individual
person can be considered prediction.
Classification: e.g., the grouping of patients based on their medical records can be considered
classification.
The model used to predict the unknown value is called a predictor, whereas the model used to
classify the unknown value is called a classifier.
Accuracy: Accuracy of the classifier refers to the ability of the classifier to predict the class
label correctly, and the accuracy of the predictor refers to how well a given predictor can
estimate the unknown value.
Speed: The speed of the method depends on the computational cost of generating and using
the classifier/predictor.
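As a small illustration of the accuracy criterion, the following sketch (assuming scikit-learn is available; the labels are made up) compares predicted class labels with the true labels:

from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]   # actual class labels (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1]   # labels produced by a classifier

# Accuracy = fraction of tuples whose class label is predicted correctly.
print(accuracy_score(y_true, y_pred))   # 5 of 6 correct -> about 0.833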
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
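The minimal sketch below (a made-up weather-style dataset; scikit-learn is assumed to be available) shows this structure: the printed tree consists of internal nodes that test features, branches for the test outcomes, and leaf nodes holding the predicted class.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [outlook_is_sunny, humidity]; target: play (1 = yes, 0 = no)
X = [[1, 85], [1, 90], [0, 78], [0, 65], [1, 70], [0, 96]]
y = [0, 0, 1, 1, 1, 0]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Internal nodes test a feature, branches are outcomes, leaves are class labels.
print(export_text(tree, feature_names=["outlook_is_sunny", "humidity"]))
print(tree.predict([[1, 60]]))   # classify a new, unlabeled tuple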
Attribute Selection Measures
An attribute selection measure is a heuristic for choosing the splitting test that “best”
separates a given data partition, D, of class-labeled training tuples into individual classes.
If we split D into smaller partitions according to the outcomes of the splitting criterion, ideally
each partition will be pure (i.e., all of the tuples that fall into a given partition belong to the
same class).
Conceptually, the "best" splitting criterion is the one that most closely results in such a
scenario.
Attribute selection measures are also known as splitting rules because they determine how the
tuples at a given node are to be split.
The attribute selection measure provides a ranking for each attribute describing the given
training tuples. The attribute having the best score for the measure is chosen as the splitting
attribute for the given tuples.
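As a rough sketch of one such measure, the code below computes information gain (an entropy-based criterion) for two hypothetical attributes of a made-up play/don't-play dataset and selects the attribute with the best score as the splitting attribute:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    # Information gain obtained by splitting `rows` on attribute `attr`.
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

data = [  # hypothetical class-labeled training tuples
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "rain",     "windy": False, "play": "yes"},
    {"outlook": "rain",     "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "overcast", "windy": True,  "play": "yes"},
]

gains = {a: info_gain(data, a, "play") for a in ("outlook", "windy")}
print(gains)                       # ranking of the candidate attributes
print(max(gains, key=gains.get))   # best-scoring attribute becomes the splitting attribute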
Tree Pruning
Pruning is a procedure that reduces the size of a decision tree. It lowers the risk of overfitting
by limiting the size of the tree or removing sections of the tree that provide little classification
power. Pruning helps by trimming branches that reflect anomalies in the training data caused
by noise or outliers, and it adjusts the original tree in a way that improves the generalization
performance of the tree.
Pruning methods generally use statistical measures to remove the least reliable branches,
frequently resulting in faster classification and an improvement in the ability of the tree to
correctly classify independent test data.
Pruning means changing the model by deleting the child nodes of a branch node; the pruned
node is then regarded as a leaf node. Leaf nodes themselves cannot be pruned.
A decision tree consists of a root node, several branch nodes, and several leaf nodes. The root
node represents the top of the tree. It does not have a parent node; however, it has one or
more child nodes.
Branch nodes are in the middle of the tree. A branch node has a parent node and several child
nodes.
Leaf nodes represent the bottom of the tree. A leaf node has a parent node but does not have
any child nodes.
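As a rough, practical sketch of pruning (using scikit-learn's cost-complexity pruning via the ccp_alpha parameter on generated toy data; the alpha value is an arbitrary assumption), increasing alpha removes more of the weak branches and yields a smaller tree:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data, so the fully grown tree tends to overfit.
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.2, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# Pruning removes unreliable branches, so the pruned tree has fewer leaf nodes.
print(unpruned.get_n_leaves(), pruned.get_n_leaves())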
Scalability in data mining
Scalability in data mining refers to the ability of a data mining algorithm to handle large
amounts of data efficiently and effectively. This means that the algorithm should be able to
process the data in a timely manner, without sacrificing the quality of the results. In other
words, a scalable data mining algorithm should be able to handle an increasing amount of data
without requiring a significant increase in computational resources. This is important because
the amount of data available for analysis is growing rapidly, and the ability to process that data
quickly and accurately is essential for making informed decisions.
Vertical Scalability
Vertical scalability, also known as scale-up, refers to the ability of a system or algorithm to handle an increase in
workload by adding more computational resources, such as faster processors or more memory. This is in contrast
to horizontal scalability, which involves adding more machines to a distributed computing system to handle an
increase in workload.
Vertical scalability can be an effective way to improve the performance of a system or algorithm, particularly for
applications that are limited by the computational resources available to them. By adding more resources, a
system can often handle more data or perform more complex calculations, which can improve the speed and
accuracy of the results. However, there are limitations to vertical scalability, and at some point, adding more
resources may not result in a significant improvement in performance.
Horizontal Scalability
Horizontal scalability, also known as scale-out, refers to the ability of a system or algorithm to handle an increase
in workload by adding more machines to a distributed computing system. This is in contrast to vertical scalability,
which involves adding more computational resources, such as faster processors or more memory, to a single
machine.
Horizontal scalability can be an effective way to improve the performance of a system or algorithm, particularly
for applications that require a lot of computational power. By adding more machines to the system, the workload
can be distributed across multiple machines, which can improve the speed and accuracy of the results. However,
there are limitations to horizontal scalability, and at some point, adding more machines may not result in a
significant improvement in performance. Additionally, horizontal scalability can be more complex to implement
and manage than vertical scalability.
Decision Tree Induction in Data Mining
Decision tree induction is a common technique in data mining that is used to generate a predictive
model from a dataset. This technique involves constructing a tree-like structure, where each internal
node represents a test on an attribute, each branch represents the outcome of the test, and each leaf
node represents a prediction. The goal of decision tree induction is to build a model that can accurately
predict the outcome of a given event, based on the values of the attributes in the dataset.
To build a decision tree, the algorithm first selects the attribute that best splits the data into distinct
classes. This is typically done using a measure of impurity, such as entropy or the Gini index, which
measures the degree of disorder in the data. The algorithm then repeats this process for each branch of
the tree, splitting the data into smaller and smaller subsets until all of the data is classified.
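The following is a simplified, self-contained sketch of that recursive process, using the Gini index as the impurity measure on made-up categorical data (real implementations add stopping criteria, handling of numeric attributes, and pruning):

from collections import Counter

def gini(labels):
    # Gini index: impurity of a set of class labels.
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    # Stop when the partition is pure or no attributes remain: create a leaf.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute whose split gives the lowest weighted impurity.
    def weighted_gini(a):
        return sum(
            len(sub) / len(rows) * gini([r[target] for r in sub])
            for v in {r[a] for r in rows}
            for sub in [[r for r in rows if r[a] == v]]
        )
    best = min(attrs, key=weighted_gini)
    # Recurse: one branch per value of the chosen splitting attribute.
    return {best: {v: build_tree([r for r in rows if r[best] == v],
                                 [a for a in attrs if a != best], target)
                   for v in {r[best] for r in rows}}}

data = [  # hypothetical class-labeled training tuples
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "rain",     "windy": False, "play": "yes"},
    {"outlook": "rain",     "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
]

print(build_tree(data, ["outlook", "windy"], "play"))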
Decision tree induction is a popular technique in data mining because it is easy to understand and
interpret, and it can handle both numerical and categorical data. Additionally, decision trees can handle
large amounts of data, and they can be updated with new data as it becomes available. However,
decision trees can be prone to overfitting, where the model becomes too complex and does not
generalize well to new data. As a result, data scientists often use techniques such as pruning to simplify
the tree and improve its performance.
Decision Tree Induction in Data Mining example