Classification Unit 3

There are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels, while prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.
Classification and prediction are the two main methods used to mine data; we use these techniques to analyze data and to learn more about unknown or future observations.

Classification:
Classification is the process of finding a model that describes data classes or concepts; the purpose of classification is to predict the class of objects whose class label is unknown. In simple terms, classification categorizes incoming new data based on the data we already have and the assumptions we have drawn from it.

Prediction:
Prediction concerns values that are not yet known, often something that may happen in the future. In prediction, we identify or estimate the missing or unavailable value for a new observation, based on the data we already have and on assumptions about the future. In prediction, the output is a continuous value.
Difference between Prediction and Classification:

Sr.No.  Prediction                                            Classification

1.      Prediction is about predicting a missing or           Classification is about determining a
        unknown element (a continuous value) of a dataset.    (categorical) class or label for an
                                                              element in a dataset.

2.      E.g., predicting the correct treatment for a          E.g., grouping patients based on their
        particular disease for an individual person can      medical records can be considered
        be considered prediction.                             classification.

3.      The model used to predict the unknown value is        The model used to classify the unknown
        called a predictor.                                   value is called a classifier.

4.      The predictor is constructed from a training set,     A classifier is also constructed from a
        and its accuracy refers to how well it can            training set, composed of database
        estimate the value of new data.                       records and their corresponding class
                                                              names.
Examples of Classification and Prediction
What is classification?
Following are examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze the profile of a customer to predict whether that customer will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are "risky" or "safe" for the loan application data and "yes" or "no" for the marketing data.
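To make this concrete, here is a minimal sketch of such a loan classifier in Python using scikit-learn. The features and toy values below are invented purely for illustration; they are not taken from any real loan data.

# A minimal sketch of a loan-risk classifier (toy, invented data).
from sklearn.tree import DecisionTreeClassifier

# Each row: [income in thousands, years employed, existing debt in thousands]
X_train = [[45, 1, 30], [90, 10, 5], [60, 4, 20], [120, 15, 2], [30, 0, 40]]
y_train = ["risky", "safe", "risky", "safe", "risky"]  # categorical class labels

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Predict the categorical label ("safe" or "risky") for a new applicant.
print(clf.predict([[75, 6, 10]]))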
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example, we are interested in predicting a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or ordered value.
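As a hedged sketch of numeric prediction, the following uses a simple linear regression (one possible predictor among many) on invented income and spend figures:

# A minimal sketch of a numeric predictor (toy, invented data).
from sklearn.linear_model import LinearRegression

# Each row: [annual income in thousands]; target: spend during the sale, in dollars.
X_train = [[30], [50], [70], [90], [110]]
y_train = [300, 520, 700, 950, 1100]

predictor = LinearRegression().fit(X_train, y_train)

# Predict a continuous dollar value for a new customer.
print(predictor.predict([[80]]))  # a continuous value, roughly 815 here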
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The Data Classification process includes two
steps −
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model
This step is the learning step, or the learning phase, in which the classification algorithm builds the classifier.
The classifier is built from a training set made up of database tuples and their associated class labels.
Each tuple in the training set is assumed to belong to a predefined class; the tuples themselves are also referred to as samples, objects, or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here, test data is used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to new data tuples.
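The two steps can be sketched as follows in Python with scikit-learn; the bundled iris dataset stands in for the training and test tuples purely for illustration:

# Sketch of the two-step classification process.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1 (learning phase): build the classifier from the training tuples.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: use test data to estimate accuracy before classifying new tuples.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"estimated accuracy: {accuracy:.2f}")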
Comparison of Classification and Prediction Methods:

Accuracy: The accuracy of a classifier refers to its ability to predict the class label correctly, and the accuracy of a predictor refers to how well it can estimate the unknown value.

Speed: The speed of the method depends on the computational cost of generating and using
the classifier/predictor.

Robustness: Robustness is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.

Scalability: Scalability refers to the ability to construct the classifier or predictor efficiently as the amount of given data grows.
Interpretability: Interpretability refers to how readily we can understand the reasoning behind the predictions or classifications made by the predictor or classifier.
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities −
Data Cleaning − Data cleaning refers to the preprocessing of the data: removing noise and treating missing or unknown values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
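As a small illustrative sketch (using pandas, with an invented column name), replacing missing values with the attribute's most common value might look like this:

# Fill missing values with the most commonly occurring value (the mode).
import pandas as pd

df = pd.DataFrame({"occupation": ["teacher", None, "engineer", "teacher", None]})

most_common = df["occupation"].mode()[0]          # "teacher" in this toy data
df["occupation"] = df["occupation"].fillna(most_common)
print(df)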
Relevance Analysis − After cleaning the data, we analyze it to find the attributes relevant to the problem; for example, correlation analysis can be used to identify redundant or irrelevant attributes before classification. After cleaning and analyzing the data, we may also need to normalize it, because normalized data gives better accuracy when predicting an unknown value. Normalization is achieved by scaling all the values of an attribute so that they fall within a small specified range, such as 0 to 1.
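A minimal sketch of min-max normalization, which performs exactly this scaling to the 0-to-1 range:

# Min-max normalization: rescale an attribute's values to the [0, 1] range.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi != lo

incomes = [30000, 45000, 60000, 120000]
print(min_max_normalize(incomes))  # [0.0, 0.166..., 0.333..., 1.0]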
Data Transformation and Reduction − The data can be transformed by methods such as normalization (scaling attribute values so that they fall within a small specified range) and generalization (replacing low-level data with higher-level concepts using concept hierarchies).
Decision trees are a popular machine learning algorithm that can be used for both regression
and classification tasks. They are easy to understand, interpret, and implement, making them
an ideal choice for beginners in the field of machine learning. In this comprehensive guide, we
will cover all aspects of the decision tree algorithm, including the working principles, different
types of decision trees, the process of building decision trees, and how to evaluate and
optimize decision trees.

What is a Decision Tree?
A decision tree is a non-parametric supervised learning algorithm that can be used for both classification and regression tasks, though it is mostly preferred for solving classification problems. It is a tree-structured model with a hierarchical structure consisting of a root node, branches, internal nodes, and leaf nodes: the internal nodes represent tests on the features of a dataset, the branches represent the decision rules, and each leaf node represents an outcome. This structure makes decision trees easy-to-understand models.
Attribute Selection Measures
An attribute selection measure is a heuristic for choosing the splitting test that "best" separates a given data partition, D, of class-labeled training tuples into individual classes.
If we split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure (i.e., all of the tuples that fall into a given partition would belong to the same class). Conceptually, the "best" splitting criterion is the one that most closely results in such a scenario.
Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.
The attribute selection measure provides a ranking for each attribute describing the given training tuples; the attribute with the best score for the measure is chosen as the splitting attribute for those tuples.
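One widely used attribute selection measure is information gain, based on entropy (as in the ID3/C4.5 family of algorithms). A minimal sketch, with invented toy labels:

# Information gain: entropy before a split minus weighted entropy after it.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(partitions, parent_labels):
    total = len(parent_labels)
    after = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(parent_labels) - after

parent = ["yes", "yes", "yes", "no", "no", "no"]
# One candidate split of the parent tuples into two partitions.
split = [["yes", "yes", "yes"], ["no", "no", "no"]]
print(information_gain(split, parent))  # 1.0: the split yields pure partitions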
Tree Pruning
Pruning is the procedure that reduces the size of a decision tree. It can reduce the risk of overfitting by limiting the size of the tree or eliminating sections of the tree that provide little predictive power. Pruning helps by trimming branches that reflect anomalies in the training data caused by noise or outliers, and it modifies the original tree in a way that improves its generalization performance.
Pruning methods typically use statistical measures to remove the least reliable branches, often resulting in faster classification and an improvement in the tree's ability to correctly classify independent test data.
To prune means to modify the model by deleting the child nodes of a branch node; the pruned node is then regarded as a leaf node. Leaf nodes themselves cannot be pruned.
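One common statistical pruning approach (not necessarily the one intended here) is cost-complexity pruning, which scikit-learn exposes through the ccp_alpha parameter. A hedged sketch on the bundled iris data:

# Post-pruning via cost-complexity pruning (scikit-learn's ccp_alpha).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# Pruning trims unreliable branches, yielding a smaller, more general tree.
print("unpruned leaves:", unpruned.get_n_leaves())
print("pruned leaves:  ", pruned.get_n_leaves())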
A decision tree consists of a root node, several branch nodes, and several leaf nodes. The root node represents the top of the tree; it has no parent node but has one or more child nodes.
Branch nodes are in the middle of the tree. A branch node has a parent node and one or more child nodes.
Leaf nodes represent the bottom of the tree. A leaf node has a parent node but no child nodes.
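This node structure can be sketched as a small Python class (purely illustrative; the field names are invented):

# Minimal sketch of the root/branch/leaf node structure described above.
class Node:
    def __init__(self, label=None, parent=None):
        self.label = label       # class prediction if this node is a leaf
        self.parent = parent     # None only for the root node
        self.children = []       # empty only for leaf nodes

    def is_root(self):
        return self.parent is None

    def is_leaf(self):
        return not self.children

root = Node()
branch = Node(parent=root); root.children.append(branch)
leaf = Node(label="safe", parent=branch); branch.children.append(leaf)
print(root.is_root(), branch.is_leaf(), leaf.is_leaf())  # True False True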
Scalability in data mining

Scalability in data mining refers to the ability of a data mining algorithm to handle large
amounts of data efficiently and effectively. This means that the algorithm should be able to
process the data in a timely manner, without sacrificing the quality of the results. In other
words, a scalable data mining algorithm should be able to handle an increasing amount of data
without requiring a significant increase in computational resources. This is important because
the amount of data available for analysis is growing rapidly, and the ability to process that data
quickly and accurately is essential for making informed decisions.
Vertical Scalability
Vertical scalability, also known as scale-up, refers to the ability of a system or algorithm to handle an increase in workload by adding more computational resources, such as faster processors or more memory, to a single machine. This is in contrast to horizontal scalability, which involves adding more machines to a distributed computing system to handle an increase in workload.
Vertical scalability can be an effective way to improve the performance of a system or algorithm, particularly for
applications that are limited by the computational resources available to them. By adding more resources, a
system can often handle more data or perform more complex calculations, which can improve the speed and
accuracy of the results. However, there are limitations to vertical scalability, and at some point, adding more
resources may not result in a significant improvement in performance.
Horizontal Scalability
Horizontal scalability, also known as scale-out, refers to the ability of a system or algorithm to handle an increase
in workload by adding more machines to a distributed computing system. This is in contrast to vertical scalability,
which involves adding more computational resources, such as faster processors or more memory, to a single
machine.
Horizontal scalability can be an effective way to improve the performance of a system or algorithm, particularly
for applications that require a lot of computational power. By adding more machines to the system, the workload
can be distributed across multiple machines, which can improve the speed and accuracy of the results. However,
there are limitations to horizontal scalability, and at some point, adding more machines may not result in a
significant improvement in performance. Additionally, horizontal scalability can be more complex to implement
and manage than vertical scalability.
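One way a data mining algorithm can stay scalable even on a single machine, as a hedged sketch, is incremental (out-of-core) learning, where the data is processed in chunks rather than loaded all at once; scikit-learn exposes this through partial_fit. The synthetic chunks below simply stand in for batches read from disk:

# Out-of-core learning: process the data chunk by chunk with partial_fit.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared up front

rng = np.random.default_rng(0)
for _ in range(10):  # each iteration stands in for one chunk read from disk
    X_chunk = rng.normal(size=(100, 5))
    y_chunk = (X_chunk[:, 0] > 0).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

print(clf.predict(rng.normal(size=(3, 5))))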
Decision Tree Induction in Data Mining
Decision tree induction is a common technique in data mining that is used to generate a predictive
model from a dataset. This technique involves constructing a tree-like structure, where each internal
node represents a test on an attribute, each branch represents the outcome of the test, and each leaf
node represents a prediction. The goal of decision tree induction is to build a model that can accurately
predict the outcome of a given event, based on the values of the attributes in the dataset.
To build a decision tree, the algorithm first selects the attribute that best splits the data into distinct
classes. This is typically done using a measure of impurity, such as entropy or the Gini index, which
measures the degree of disorder in the data. The algorithm then repeats this process for each branch of
the tree, splitting the data into smaller and smaller subsets until all of the data is classified.
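The Gini index mentioned above can be sketched in a few lines (toy labels, for illustration):

# Gini index: 1 minus the sum of squared class proportions in a partition.
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))  # 0.5: maximum disorder for two classes
print(gini(["yes", "yes", "yes"]))       # 0.0: a pure partition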

Decision tree induction is a popular technique in data mining because it is easy to understand and
interpret, and it can handle both numerical and categorical data. Additionally, decision trees can handle
large amounts of data, and they can be updated with new data as it becomes available. However,
decision trees can be prone to overfitting, where the model becomes too complex and does not
generalize well to new data. As a result, data scientists often use techniques such as pruning to simplify
the tree and improve its performance.
Decision Tree Induction in Data Mining: Example
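As a hedged, end-to-end sketch (the "buys_computer"-style records below are invented toy data, not drawn from a real dataset):

# Worked decision tree induction example on an invented toy dataset.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy records: [age_group, income, student]; target: buys_computer (yes/no).
records = [
    ["youth", "high", "no"], ["youth", "high", "yes"],
    ["middle", "high", "no"], ["senior", "medium", "no"],
    ["senior", "low", "yes"], ["middle", "low", "yes"],
    ["youth", "medium", "no"], ["senior", "medium", "yes"],
]
labels = ["no", "yes", "yes", "yes", "yes", "yes", "no", "yes"]

# Encode the categorical attributes as numbers, then induce the tree
# using entropy (information gain) as the impurity measure.
X = OrdinalEncoder().fit_transform(records)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, labels)

# Print the induced tree: internal nodes are attribute tests, leaves are labels.
print(export_text(tree, feature_names=["age_group", "income", "student"]))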
