Classification
• What is Classification in Data Mining?
• Classification is a technique in data mining that involves categorizing or
classifying data objects into predefined classes, categories, or groups based
on their features or attributes.
• It is a supervised learning technique that uses labelled data to build a model
that can predict the class of new, unseen data.
• It is an important task in data mining because it enables organizations to
make informed decisions based on their data.
• There are two main types of classification: binary classification and
multi-class classification.
• Binary classification involves classifying instances into two classes, such as
“spam” or “not spam”.
• Multi-class classification involves classifying instances into more than two
classes.
• Steps to Build a Classification Model
• Building a classification model involves the following steps:
• Data preparation - The first step in building a classification model is to
prepare the data. This involves collecting, cleaning, and transforming the
data into a suitable format for further analysis.
• Feature selection - The next step is to select the most important and
relevant features that will be used to build the classification model. This can
be done using various techniques, such as correlation, feature importance
analysis, or domain knowledge.
• Prepare training and test data - Once the data is prepared and relevant
features are selected, the dataset is divided into two parts: a training set
and a test set. The training set is used to build the model, while the test
set is used to evaluate the model's performance.
• Model selection - Many algorithms can be used to build a classification
model, such as decision trees, logistic regression, k-nearest neighbors, and
neural networks. The choice of algorithm depends on the type of data, the
number of features, and the desired accuracy.
• Model training - Once the algorithm is selected, the model is trained on the
training dataset. This involves adjusting the model parameters to minimize
the error between the predicted and actual class labels.
• Model evaluation - The model's performance is evaluated using the test
dataset. The accuracy, precision, recall, and F1 score are commonly used
metrics to evaluate the model performance.
• Model tuning - If the model's performance is not satisfactory, the model can
be tuned by adjusting the parameters or selecting a different algorithm. This
process is repeated until the desired performance is achieved.
• Model deployment - Once the model is built and evaluated, it can be
deployed in production to classify new data. The model should be monitored
regularly to ensure its accuracy and effectiveness over time.
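
The split and evaluation steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full pipeline: the helper names `train_test_split` and `binary_metrics`, the 25% test fraction, and the small "spam" label lists are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def binary_metrics(actual, predicted, positive):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: what the test set says vs. what a model predicted.
actual = ["spam", "spam", "not spam", "not spam", "spam"]
predicted = ["spam", "not spam", "not spam", "spam", "spam"]
print(binary_metrics(actual, predicted, positive="spam"))
```

In this sketch the metrics are computed for a single positive class ("spam"); for multi-class problems they are usually computed per class and then averaged.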
• Classification Vs. Regression in Data Mining
• Simple Approach
• Euclidean Distance Formula
• The formula for Euclidean distance in two dimensions is
$D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
• where $D$ is the Euclidean distance, and $(x_1, y_1)$ and $(x_2, y_2)$ are the
Cartesian coordinates of the two points.
• Assign a new data point to the class whose centroid (the mean of that class's
training points) is closest to it.
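
The simple approach above can be sketched as a nearest-centroid classifier. This is an illustrative sketch under stated assumptions: the function names (`euclidean`, `centroids`, `nearest_centroid`), the class labels "A" and "B", and the sample points are hypothetical and chosen only to show the idea in two dimensions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two 2-D points (x1, y1) and (x2, y2)."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def centroids(points_by_class):
    """Return the mean point (centroid) of each class."""
    result = {}
    for label, points in points_by_class.items():
        n = len(points)
        result[label] = (sum(x for x, _ in points) / n,
                         sum(y for _, y in points) / n)
    return result


def nearest_centroid(point, class_centroids):
    """Return the label whose centroid is closest to the given point."""
    return min(class_centroids,
               key=lambda label: euclidean(point, class_centroids[label]))


# Hypothetical training points for two classes.
training = {
    "A": [(1, 1), (1, 2), (2, 1)],
    "B": [(6, 6), (7, 6), (6, 7)],
}
class_centroids = centroids(training)
print(nearest_centroid((2, 2), class_centroids))  # prints "A"
```

Only one distance per class is computed at prediction time, which is what makes this approach simple compared with methods that compare against every training point.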
• K-Nearest Neighbor (KNN) Algorithm for Machine Learning
• K-Nearest Neighbor is one of the simplest machine learning algorithms
based on the supervised learning technique.
• The KNN algorithm assumes that similar cases belong to the same category. It
compares the new case with the available cases and puts the new case into
the category it is most similar to, in practice the majority class among its
K nearest neighbors.
• Example: Suppose we have an image of a creature that looks similar to both a
cat and a dog, and we want to know whether it is a cat or a dog. We can use
the KNN algorithm for this identification, as it works on a similarity
measure. The KNN model compares the features of the new image with those of
the cat and dog images and, based on the most similar features, puts it in
either the cat or the dog category.
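
The cat-or-dog example can be sketched with a small KNN function. This is a minimal sketch, not a production implementation: the name `knn_predict`, the choice of K = 3, and the two-number feature vectors standing in for image features are all assumptions made for illustration. It uses Euclidean distance (`math.dist`, Python 3.8+) as the similarity measure.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbors.

    training is a list of (features, label) pairs; features and query are
    equal-length tuples of numbers.
    """
    # Sort the stored cases from nearest to farthest from the query.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k nearest cases and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature descriptions of labelled images.
training = [
    ((1.0, 1.0), "cat"), ((1.5, 2.0), "cat"), ((2.0, 1.0), "cat"),
    ((6.0, 5.0), "dog"), ((7.0, 7.0), "dog"), ((6.5, 6.0), "dog"),
]
print(knn_predict(training, (2.0, 2.0), k=3))  # prints "cat"
```

For two classes, an odd K avoids tied votes. KNN stores the training data and defers all computation to prediction time, so each prediction compares the query against every stored case.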
• Decision Tree Classification Algorithm