Unit-4 AML (1. Basics and K-NN)

Unit-4 discusses classification and regression in data mining. Classification involves using labeled training data to build a model that can predict the class or category of new data. Examples include predicting if a loan is safe or risky, or categorizing emails as spam. Regression predicts continuous numeric values like housing prices. Supervised learning uses labeled input and output data to train models for tasks like classification and regression.

Unit-4

Classification and Regression

Compiled By: Dr. Darshana H. Patel


Information Technology Department
V.V.P. Engineering College, Rajkot.
Different Attributes
• Data mining deals with certain data types that tell us the format of the data (for example, whether it is textual or numerical).
Attributes – Represents different features of an object. Different types of attributes are:
• Binary: Possesses only two values, i.e. True or False.
Example: Suppose a survey evaluates some product and we need to check whether it is useful or not, so the customer has to answer Yes or No.
Product usefulness: Yes / No
• Symmetric: Both values are equally important in all respects.
• Asymmetric: The two values are not equally important (e.g. a positive result matters more than a negative one).



• Nominal: More than two outcomes are possible, and the values are names or labels rather than numbers.
Example: One needs to choose some material but of different colors. So, the color might be Yellow,
Green, Black, Red.
Different Colors: Red, Green, Black, Yellow
• Ordinal: Values that must have some meaningful order.
Example: Suppose there are grade sheets of few students which might contain different grades
as per their performance such as A, B, C, D
Grades: A, B, C, D
• Continuous: May take any value within a range, i.e. an unlimited number of possible values.
• Example: Measuring weight of few Students in a sequence or orderly manner i.e. 50, 51, 52, 53
Weight: 50, 51, 52, 53
• Discrete: Takes a finite number of values or values from a certain range.
Example: Result of a Student in a few subjects: Good, Average, Poor



Supervised Learning vs Unsupervised Learning

• Process: In a supervised learning model, both input and output variables are given; in an unsupervised learning model, only input data is given.
• Input Data: Supervised algorithms are trained using labelled data; unsupervised algorithms are used on data that is not labelled.
• Algorithms Used: Supervised — support vector machines, neural networks, linear and logistic regression, random forests, and classification trees. Unsupervised — clustering algorithms such as k-means and hierarchical clustering.
• Computational Complexity: Supervised learning is a simpler method; unsupervised learning is computationally more complex.
• Use of Data: A supervised learning model uses training data to learn a link between the inputs and the outputs; unsupervised learning does not use output data.
• Accuracy of Results: Supervised methods are highly accurate and trustworthy; unsupervised methods are less accurate and trustworthy.
• Real-Time Learning: Supervised learning typically takes place offline; unsupervised learning can take place in real time.
• Number of Classes: Known in supervised learning; not known in unsupervised learning.
• Main Drawback: Classifying big data can be a real challenge in supervised learning; in unsupervised learning, because the data is not labelled, you cannot get precise information about how the data is sorted or what the output represents.

Supervised learning

• Labelled training data containing past information comes as an input. Based on the
training data, the machine builds a predictive model that can be used on test data to
assign a label for each record in the test data.
• Some examples of supervised learning are
• Predicting the results of a game
• Predicting whether a tumour is malignant or benign
• Predicting the price of domains like real estate, stocks, etc.
• Classifying texts such as classifying a set of emails as
spam or non-spam

Figure: Supervised learning



Supervised learning

• When we are trying to predict a categorical or nominal variable, the problem is known as a classification problem, whereas when we are trying to predict a real-valued variable, the problem falls under the category of regression.
• Some typical classification problems include: Image classification, Prediction of disease,
Recognition of handwriting etc.
• Typical applications of regression can be seen in demand forecasting in retail, weather forecasting, etc.
• Note: Supervised machine learning is as good as the data used to train it. If the training
data is of poor quality, the prediction will also be far from being precise.



Introduction
• There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −
Classification
Regression
• Classification models predict categorical (discrete-valued and unordered) class labels, whereas regression models predict continuous-valued (ordered) functions.
• For example, we can build a classification model to categorize bank loan applications as
either safe or risky, or a regression model to predict the amount in dollars that would be
safe for the bank to loan an applicant.
Classification : Definition
• Classification is a data mining task of assigning a data instance to one of the
predefined classes or groups based upon the knowledge gained from previously seen or
classified data.
• Classification is the task of learning a target function f that maps attribute set x to
one of the predefined class labels y.
• Example:
Classifying bank loan application as safe or risky
Classifying or categorizing news as sports, weather etc.
Classifying e-mail into different categories like primary, forums, spam etc.
Classification Model
• Usually, the given data set is divided into training and test sets, with
training set used to build the model and test set used to validate it.
• Thus, classification is a two-step process:
1) Learning: Training data are analyzed by a classification algorithm.
2) Classification: Test data are used to estimate the accuracy of the
classification rules. If the accuracy is considered acceptable, the rules
can be applied to the classification of new data tuples.
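As an illustrative sketch of this two-step process in plain Python, the following uses a made-up one-feature data set and a deliberately trivial threshold "classifier" (all names and numbers here are invented for illustration, not a real classification algorithm):

```python
import random

# Hypothetical labelled records: one numeric feature plus a class label
records = [(x, 'high' if x >= 50 else 'low') for x in range(100)]
random.seed(0)
random.shuffle(records)
train, test = records[:70], records[70:]  # training set and test set

# Step 1 (learning): "train" a trivial classifier by learning a threshold
# as the midpoint between the two class means in the training data
low = [x for x, label in train if label == 'low']
high = [x for x, label in train if label == 'high']
threshold = (sum(low) / len(low) + sum(high) / len(high)) / 2

# Step 2 (classification): estimate accuracy on the held-out test set
predictions = ['high' if x >= threshold else 'low' for x, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(f"Estimated accuracy: {accuracy:.2f}")
```

If the estimated accuracy on the test set is acceptable, the learned threshold would then be applied to new, unlabelled records in the same way.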



Classification Model

The figure shown here depicts the typical process of classification, where a classification model is obtained from the labelled training data by a classifier algorithm. On the basis of the model, a class label (e.g. ‘Intel’ in the case of the test data referred to in the figure) is assigned to the test data.



Classification Model
• A critical classification problem in the context of the banking domain is identifying
potentially fraudulent transactions. Because there are millions of transactions
which have to be scrutinized to identify whether a particular transaction might be a
fraud transaction, it is not possible for any human being to carry out this task.
• Machine learning is leveraged efficiently to do this task, and this is a classic case of
classification. On the basis of the past transaction data, especially the ones labelled
as fraudulent, all new incoming transactions are marked or labelled as usual or
suspicious. The suspicious transactions are subsequently segregated for a closer
review.
Classification Model
• Some typical classification problems include the following:
• Image classification
• Disease prediction
• Win–loss prediction of games
• Prediction of natural calamity such as earthquake, flood, etc.
• Handwriting recognition



CLASSIFICATION LEARNING STEPS

First, there is a problem which is to be solved, and then the required data (related to the problem, which is already stored in the system) is evaluated and pre-processed based on the algorithm. Algorithm selection is a critical point in supervised learning. The result after iterative training rounds is a classifier for the problem at hand.



COMMON CLASSIFICATION ALGORITHMS

• Following are the most common classification algorithms:


1) k-Nearest Neighbour (kNN)
2) Decision tree
3) Random forest
4) Support Vector Machine (SVM)
5) Naïve Bayes classifier



k-Nearest Neighbour (kNN)

• The kNN algorithm is a simple but extremely powerful classification algorithm.


• The name of the algorithm originates from the underlying philosophy of kNN –
i.e. people having similar background or mindset tend to stay close to each other.
In other words, neighbours in a locality have a similar background.
• In the same way, as a part of the kNN algorithm, the unknown and unlabelled data
which comes for a prediction problem is judged on the basis of the training data
set elements which are similar to the unknown element. So, the class label of the
unknown element is assigned on the basis of the class labels of the similar training
data set elements.
k-Nearest Neighbour (kNN)

• In the kNN algorithm, the class label of the test data elements is decided
by the class label of the training data elements which are neighbouring,
i.e. similar in nature. But there are two challenges:
1. What is the basis of this similarity or when can we say that two data
elements are similar?
2. How many similar elements should be considered for deciding the class
label of each test data element?



k-Nearest Neighbour (kNN)

• To answer the first question, though there are many measures of similarity, the
most common approach adopted by kNN to measure similarity between two data
elements is Euclidean distance. Considering a very simple data set having two
features (say f1 and f2), Euclidean distance between two data
elements d1 and d2 can be measured by

dist(d1, d2) = sqrt((f11 - f12)^2 + (f21 - f22)^2)

where f11 = value of feature f1 for data element d1
f12 = value of feature f1 for data element d2
f21 = value of feature f2 for data element d1
f22 = value of feature f2 for data element d2
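As a quick sketch, this distance measure can be computed in a few lines of Python (the feature values below are made up for illustration):

```python
import math

def euclidean_distance(d1, d2):
    # d1 and d2 are feature tuples, e.g. (f1, f2); the formula
    # generalizes naturally to any number of features
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

# d1 has f11 = 1.0, f21 = 2.0; d2 has f12 = 4.0, f22 = 6.0
print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))  # 5.0
```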



k-Nearest Neighbour (kNN)

• The answer to the second question, i.e. how many similar elements should be considered, lies in the value of ‘k’, which is a user-defined parameter given as an input to the algorithm.
• In the kNN algorithm, the value of ‘k’ indicates the number of neighbours that need to be
considered.
• For example, if the value of k is 3, only three nearest neighbours or three training data
elements closest to the test data element are considered. Out of the three data elements, the
class which is predominant is considered as the class label to be assigned to the test data.
• In case the value of k is 1, only the closest training data element is considered. The class
label of that data element is directly assigned to the test data element



k-Nearest Neighbour (kNN)

• But it is often a tricky decision to decide the value of k. The reasons are as follows:
• If the value of k is very large (in the extreme case equal to the total number of records in the training
data), the class label of the majority class of the training data set will be assigned to the test data
regardless of the class labels of the neighbours nearest to the test data.
• If the value of k is very small (in the extreme case equal to 1), the class value of a noisy data or
outlier in the training data set which is the nearest neighbour to the test data will be assigned to the
test data.
• The best k value is somewhere between these two extremes.
• A few strategies are adopted by machine learning practitioners to arrive at a value for k.
• One common practice is to set k equal to the square root of the number of training records.
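This square-root rule of thumb can be sketched in Python as follows; note that forcing k to be odd is an extra assumption added here (not from the slides), commonly used so that a two-class majority vote cannot tie:

```python
import math

def suggested_k(n_training_records):
    # Rule of thumb: k is roughly the square root of the number of
    # training records; round to the nearest odd integer (assumption)
    # so a two-class vote cannot end in a tie
    k = round(math.sqrt(n_training_records))
    return k if k % 2 == 1 else k + 1

print(suggested_k(100))  # 11
```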



k-Nearest Neighbour (kNN)

• Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest neighbours to be considered)
• Steps:
• Do for all test data points
• Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
• Find the closest ‘k’ training data points, i.e. training data points whose distances are least from the test data point.
• If k = 1
• Then assign the class label of the single closest training data point to the test data point
• Else
• Assign the class label that is predominantly present among the ‘k’ closest training data points to the test data point
• End do
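The steps above can be turned into a short, self-contained Python sketch (the training data at the bottom is made up for illustration; `math.dist` requires Python 3.8+):

```python
import math
from collections import Counter

def knn_classify(train, test_point, k):
    """Assign a class label to test_point based on its k nearest
    training points; train is a list of (features, label) pairs."""
    # Distance of the test point from every training data point
    neighbours = sorted(
        (math.dist(features, test_point), label) for features, label in train
    )
    # Class labels of the k closest training data points
    k_labels = [label for _, label in neighbours[:k]]
    # Predominant class among the k neighbours; with k = 1 this
    # reduces to the label of the single closest training point
    return Counter(k_labels).most_common(1)[0][0]

# Made-up two-feature training data with class labels 'A' and 'B'
train = [((1, 1), 'A'), ((2, 1), 'A'), ((8, 9), 'B'), ((9, 8), 'B')]
print(knn_classify(train, (1.5, 1.0), 3))  # 'A'
```

Note that all the work happens at prediction time: there is no separate training step, which is exactly the "lazy learner" behaviour discussed next.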



k-Nearest Neighbour (kNN)

• Why is the kNN algorithm called a lazy learner?


• Eager learners follow the general steps of machine learning, i.e. they perform an abstraction of the information obtained from the input data and then follow it through with a generalization step.
• However, as we have seen in the case of the kNN algorithm, these steps are completely
skipped. It stores the training data and directly applies the philosophy of nearest
neighbourhood finding to arrive at the classification. So, for kNN, there is no learning
happening in the real sense. Therefore, kNN falls under the category of lazy learner.



k-Nearest Neighbour (kNN)

• Strengths of the kNN algorithm


1. Extremely simple algorithm – easy to understand
2. Very effective in certain situations, e.g. for recommender system design
3. Very fast or almost no time required for the training phase
• Weaknesses of the kNN algorithm
1. Does not learn anything in the real sense. Classification is done completely on the basis of the training data. So, it has a
heavy reliance on the training data. If the training data does not represent the problem domain comprehensively, the
algorithm fails to make an effective classification.
2. Because no model is trained in the real sense and the classification is done completely on the basis of the training data,
the classification process is very slow.
3. Also, a large amount of computational space is required to load the training data for classification.



k-Nearest Neighbour (kNN)

• Application of the kNN algorithm


• One of the most popular areas in machine learning where the kNN algorithm is widely
adopted is recommender systems. As we know, recommender systems recommend users
different items which are similar to a particular item that the user seems to like. The
liking pattern may be revealed from past purchases or browsing history and the similar
items are identified using the kNN algorithm.
• Another area where there is widespread adoption of kNN is searching documents/
contents similar to a given document/content. This is a core area under information
retrieval and is known as concept search.

