Data Mining Module 3


Classification and Prediction in Data Mining

There are two forms of data analysis that can be used to extract models describing important classes or
predict future data trends. These two forms are as follows:
• Classification
• Prediction
We use classification and prediction to extract a model that represents the data classes and can forecast future
data trends. Classification models predict categorical (discrete, unordered) class labels, while prediction models
estimate continuous-valued functions. For example, we can build a classification model to categorize bank loan
applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential
customers on computer equipment, given their income and occupation.

Classification

The data classification process can be divided into the following steps:

Data collection: The first step is to collect the data that will be used for classification. This data can be
in a variety of formats, such as text, images, or audio.
Data pre-processing: The next step is to pre-process the data. This involves cleaning the data,
transforming it into a suitable format, and selecting the relevant features.
Model training: Once the data has been pre-processed, the next step is to train a model. This is done
by feeding the data to a machine learning algorithm, which will learn to identify the patterns that
distinguish different classes of data.
Model evaluation: Once the model has been trained, it is important to evaluate its performance. This
is done by feeding the model new data and seeing how well it can classify it.
Model deployment: If the model performs well, it can be deployed in a production environment where
it can be used to classify new data.
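
The steps above can be illustrated with a small, hedged sketch using scikit-learn. The synthetic dataset and the choice of a LogisticRegression model are assumptions made only for illustration; the point is to show how pre-processing, training, evaluation, and deployment-style prediction fit together.

```python
# A minimal sketch of the classification workflow; dataset and model are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: here we simply generate a synthetic dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Data pre-processing: hold out a test set and scale the features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Model training: fit a classifier on the training data.
model = LogisticRegression().fit(X_train, y_train)

# Model evaluation: check how well it classifies unseen data.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Model deployment: in production, the trained model classifies new (pre-processed) records.
print("predicted class:", model.predict(X_test[:1])[0])
```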

Classification is to identify the category or the class label of a new observation. First, a set of data is
used as training data. The set of input data and the corresponding outputs are given to the algorithm.
So, the training data set includes the input data and their associated class labels. Using the training
dataset, the algorithm derives a model or the classifier. The derived model can be a decision tree,
mathematical formula, or a neural network. In classification, when unlabelled data is given to the model,
it should find the class to which it belongs. The new data provided to the model is the test data set.
Classification is the process of classifying a record. One simple example of classification is to check
whether it is raining or not. The answer can either be yes or no. So, there is a particular number of
choices. Sometimes there can be more than two classes to classify. That is called multiclass
classification.
The bank needs to analyze whether giving a loan to a particular customer is risky or not. For example,
based on observable data for multiple loan borrowers, a classification model may be established that
forecasts credit risk. The data could track job records, homeownership or leasing, years of residency,
number, type of deposits, historical credit ranking, etc. The goal would be credit ranking, the predictors
would be the other characteristics, and the data would represent a case for each consumer. In this
example, a model is constructed to find the categorical label. The labels are risky or safe.

How does Classification Work?

1 - Developing the classifier (model creation): This level is the learning stage or the learning process.
The classification algorithm constructs the classifier in this stage. The classifier is built from a training
set composed of database records and their corresponding class labels. We may also refer to these records
as samples, objects, or data points.
2 - Applying the classifier for classification: The classifier is used for classification at this level. The test
data are used here to estimate the accuracy of the classification rules. If the accuracy is considered
acceptable, the rules can be applied to new data records. Applications include:
• Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring. We can
use it to extract social media insights. With advanced machine learning algorithms, we can build
sentiment analysis models that read and analyze even misspelled words. Accurately trained
models provide consistent outcomes in a fraction of the time.
• Document Classification: We can use document classification to organize the documents into
sections according to the content. Document classification refers to text classification; we can
classify the words in the entire document. And with the help of machine learning classification
algorithms, we can execute it automatically.
• Image Classification: Image classification assigns an image to one of a set of trained categories,
which could be based on the image's caption, a statistical value, or a theme. You can tag images
to train your model for relevant categories by applying supervised learning algorithms.
• Machine Learning Classification: It uses statistically demonstrable algorithmic rules to
execute analytical tasks that would take humans hundreds of hours to perform.
3 - Data Classification Process: The data classification process can be categorized into five steps:
• Create the goals, strategy, workflows, and architecture of data classification.
• Classify the confidential details that we store.
• Apply labels to the data by tagging it.
• Use the results to improve protection and compliance.
• Data classification is complex, so treat it as a continuous process.

Data Classification Lifecycle


The data classification life cycle provides an excellent structure for controlling the flow of data within an
enterprise. Businesses need to account for data security and compliance at each level. With the help of
data classification, we can apply these controls at every stage, from origin to deletion. The data life cycle has the
following stages:
Origin: Sensitive data is produced in various formats, such as emails, Excel, Word and Google documents,
social media, and websites.
Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it based on
in-house protection policies and compliance rules.
Storage: The collected data is stored with appropriate access controls and encryption.
Sharing: Data is continually distributed among agents, consumers, and co-workers from various
devices and platforms.
Archive: Here, data is eventually archived within an industry's storage systems.
Publication: Through the publication of data, it can reach customers. They can then view and download
it in the form of dashboards.

Prediction
Another process of data analysis is prediction. It is used to find a numerical output. As in
classification, the training dataset contains the inputs and the corresponding numerical output values. The
algorithm derives a model, or predictor, from the training dataset. The model should produce a
numerical output when new data is given. Unlike classification, this method does not have a class
label; the model predicts a continuous-valued function or an ordered value.
Regression is generally used for prediction. Predicting the value of a house based on facts such
as the number of rooms, the total area, etc., is an example of prediction.
For example, suppose a marketing manager needs to predict how much a particular customer will
spend at his company during a sale. In this case, we are asked to forecast a numerical value, so this
data processing activity is an example of numeric prediction. A model, or predictor, is developed that
forecasts a continuous-valued or ordered function.

Classification and Prediction Issues

The major issue is preparing the data for Classification and Prediction. Preparing the data involves the
following activities, such as:
Data Cleaning: Data cleaning involves removing the noise and treatment of missing values. The noise
is removed by applying smoothing techniques, and the problem of missing values is solved by replacing
a missing value with the most commonly occurring value for that attribute.
Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is used to
know whether any two given attributes are related.
Data Transformation and Reduction: The data can be transformed by any of the following methods (a
small sketch follows this list).
• Normalization: The data is transformed using normalization. Normalization involves scaling
all values for a given attribute so that they fall within a small specified range. Normalization
is used when neural networks or methods involving distance measurements are used in the
learning step.
• Generalization: The data can also be transformed by generalizing it to a higher-level concept. For
this purpose, we can use concept hierarchies.
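
Below is a minimal sketch of min-max normalization, one common way to scale attribute values into a small range such as [0, 1]. The income and age values are invented for illustration; scikit-learn's MinMaxScaler is assumed to be available.

```python
# Min-max normalization sketch: rescale each attribute to [0, 1].
# The income/age values below are illustrative only.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[48000.0, 25],
                 [72000.0, 41],
                 [30000.0, 33],
                 [95000.0, 52]])

# v' = (v - min) / (max - min), applied column by column.
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit_transform(data))
```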

Difference between Classification and Prediction

• Classification is the process of identifying which category a new observation belongs to, based on a
training data set containing observations whose category membership is known. Prediction is the process
of identifying the missing or unavailable numerical data for a new observation.
• In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy
depends on how well a given predictor can guess the value of a predicted attribute for new data.
• In classification, the model is known as the classifier. In prediction, the model is known as the predictor.
• In classification, a model or classifier is constructed to find the categorical labels. In prediction, a model
or predictor is constructed that predicts a continuous-valued function or ordered value.
• For example, grouping patients based on their medical records can be considered classification, while
predicting the correct treatment for a particular disease for a person is prediction.

Decision Tree Induction:


A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node
denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a
class label. The topmost node in the tree is the root node.
The benefits of having a decision tree are as follows −
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction


Decision tree induction is the method of learning the decision trees from the training set. The training
set consists of attributes and class labels. Applications of decision tree induction include astronomy,
financial analysis, medical diagnosis, manufacturing, and production.

A decision tree is a flowchart-like tree structure that is built from the training set tuples. The dataset is
broken down into smaller subsets, which are represented as the nodes of a tree. The tree structure has a
root node, internal nodes (decision nodes), leaf nodes, and branches.

The root node is the topmost node. It represents the best attribute selected for classification. Internal
nodes, or decision nodes, represent a test on an attribute of the dataset, while each leaf node (terminal
node) represents the class or decision label. The branches show the outcomes of the tests performed.

Some decision trees only have binary nodes, which means exactly two branches per node, while other
decision trees are non-binary.

How To Select Attributes For Creating A Tree?

The most popular methods of selecting the attribute are information gain and the Gini index.

Attribute selection measures are also called splitting rules because they decide how the tuples are going
to be split. The splitting criteria are used to find the best partition of the dataset. These measures provide
a ranking of the attributes for partitioning the training tuples.

1) Information Gain

This is the main method used to build decision trees. It reduces the information that is required to
classify the tuples and therefore the number of tests needed to classify a given tuple. The attribute with
the highest information gain is selected.

The original information (entropy) needed to classify a tuple in dataset D is given by:

Info(D) = - \sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability that a tuple belongs to class C_i. The information is encoded in bits, therefore
log to the base 2 is used. Info(D), also written E(S), represents the average amount of information required
to find the class label of a tuple in dataset D; this measure is also called entropy.

The information still required for exact classification after partitioning on an attribute X is given by:

Info_X(D) = \sum_{j=1}^{n} (|D_j| / |D|) * Info(D_j)

where |D_j| / |D| is the weight of the j-th partition and n is the number of partitions. This represents the
information needed to classify dataset D after partitioning by X.

Information gain is the difference between the original and the expected information required to classify
the tuples of dataset D:

Gain(X) = Info(D) - Info_X(D)

2) Gini Index

The Gini index measures the impurity of a dataset. The reduction in impurity is given by the difference
between the Gini index of the original dataset D and the Gini index after partitioning by attribute A. The
attribute that gives the maximum reduction in impurity (equivalently, the minimum Gini index after the
split) is selected as the best attribute for splitting.

Day Outlook Temperature Humidity Wind Play cricket


1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

Step1: The first step will be to create a root node.

Step2: If all results are yes, then the leaf node “yes” will be returned else the leaf node “no” will be
returned.

Step3: Find out the Entropy of all observations and entropy with attribute “x” that is E(S) and E(S, x).

Step4: Find out the information gain and select the attribute with high information gain.

Step5: Repeat the above steps until all attributes are covered.
Calculation of Entropy:

The dataset has 9 "Yes" and 5 "No" tuples, so:

E(S) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.94

If the entropy is zero, it means that all members belong to the same class; if the entropy is one, it means
that half of the tuples belong to one class and the other half belong to the other class. A value of 0.94
indicates a fairly even distribution.

Find the attribute that gives the maximum information gain.

For example, take "Wind"; it has two values, Strong and Weak, therefore x = {Strong, Weak}.

Find H(x) and P(x) for x = Weak and x = Strong. H(S) is already calculated above.
Weak = 8 observations
Strong = 6 observations

For "Weak" wind, 6 of them say "Yes" to play cricket and 2 of them say "No", so the entropy is:

E(Weak) = -(6/8) \log_2(6/8) - (2/8) \log_2(2/8) = 0.811

For "Strong" wind, 3 said "No" to play cricket and 3 said "Yes":

E(Strong) = -(3/6) \log_2(3/6) - (3/6) \log_2(3/6) = 1.0

This shows perfect randomness, as half of the items belong to one class and the remaining half belong to the other.

Calculate the information gain for Wind:

Gain(S, Wind) = E(S) - (8/14) E(Weak) - (6/14) E(Strong) = 0.94 - (8/14)(0.811) - (6/14)(1.0) = 0.048

Similarly, the information gain for the other attributes is:
Gain(S, Outlook) = 0.246, Gain(S, Temperature) = 0.029, Gain(S, Humidity) = 0.151

The attribute Outlook has the highest information gain of 0.246, thus it is chosen as the root.

Outlook has 3 values: Sunny, Overcast, and Rain. For Overcast, "Play cricket" is always "Yes", so it
ends in a leaf node, "Yes". The other two values, "Sunny" and "Rain", need further splitting.
The table for Outlook = "Sunny" is:
Temperature Humidity Wind Play cricket
Hot High Weak No
Hot High Strong No
Mild High Weak No
Cool Normal Weak Yes
Mild Normal Strong Yes

The entropy for Outlook = "Sunny" (2 "Yes", 3 "No") is:

E(Sunny) = -(2/5) \log_2(2/5) - (3/5) \log_2(3/5) = 0.971

Computing the information gain of the remaining attributes with respect to the Sunny subset, Humidity has
the highest information gain (its values High and Normal split the subset perfectly), so it is chosen as the
next node. Similarly, the entropy is calculated for the Rain subset, where Wind gives the highest information gain.
The resulting decision tree therefore has Outlook at the root, Humidity under the Sunny branch, Wind under
the Rain branch, and a "Yes" leaf under the Overcast branch.
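
The calculations above can be reproduced with a short Python sketch. This is a minimal, illustrative implementation of entropy and information gain for the play-cricket table, not part of the original material; only the Outlook and Wind columns are encoded to keep it short.

```python
# Sketch: entropy and information gain for the play-cricket example.
from math import log2
from collections import Counter

play    = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]
outlook = ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
           "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"]
wind    = ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
           "Weak","Weak","Weak","Strong","Strong","Weak","Strong"]

def entropy(labels):
    """E(S) = -sum p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attribute, labels):
    """Gain = E(S) minus the weighted average entropy of the partitions."""
    n = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [lab for a, lab in zip(attribute, labels) if a == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

print("E(S)          =", round(entropy(play), 3))            # ~0.940
print("Gain(Wind)    =", round(info_gain(wind, play), 3))    # ~0.048
print("Gain(Outlook) =", round(info_gain(outlook, play), 3)) # ~0.246
```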
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)


• Tree is constructed in a top-down (from general to specific) recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, discretization in advance)
• Examples are partitioned recursively based on selected attributes
• Attributes are selected based on heuristic or statistical measure (e.g., information gain)
When to stop
• All examples for a given node belong to the same class (pure), or
• There are no remaining attributes to select from (in that case, majority voting determines the class
label for the node), or
• There are no examples left.

Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers.
The pruned trees are smaller and less complex.
Pruning is a technique that removes the parts of the Decision Tree which prevent it from growing to its
full depth. The parts that it removes from the tree are the parts that do not provide the power to classify
instances. A Decision tree that is trained to its full depth will highly likely lead to overfitting the training
data - therefore Pruning is important.

In simpler terms, the aim of decision tree pruning is to obtain a tree that may perform slightly worse
on the training data but generalizes better to test data. Tuning the hyperparameters of your decision
tree model can do your model a lot of justice and save you a lot of time and money.
Tree Pruning Approaches
There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning - This approach removes a sub-tree from a fully grown tree.
Pre-pruning
The pre-pruning technique of Decision Trees is tuning the hyperparameters prior to the training
pipeline. It involves the heuristic known as ‘early stopping’ which stops the growth of the decision tree
- preventing it from reaching its full depth.

It stops the tree-building process early to avoid producing leaves with very few samples. During each stage
of splitting, the cross-validation error can be monitored; if the error no longer decreases, the growth of the
decision tree is stopped.
The same hyperparameters can also be tuned to obtain a robust model, as in the sketch below. However, you
should be cautious, as early stopping can also lead to underfitting.
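
A minimal sketch of pre-pruning with scikit-learn, assuming the iris dataset and a DecisionTreeClassifier; the specific hyperparameter values are illustrative, not recommendations.

```python
# Pre-pruning sketch: limit tree growth with hyperparameters before training.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth and min_samples_leaf act as early-stopping constraints that keep
# the tree from growing to its full depth.
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pruned.fit(X_train, y_train)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("full tree test accuracy:  ", full.score(X_test, y_test))
print("pruned tree test accuracy:", pruned.score(X_test, y_test))
```

For post-pruning, recent scikit-learn versions also expose a ccp_alpha parameter on DecisionTreeClassifier for cost-complexity pruning, which removes branches from a fully grown tree.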
Post-pruning
Post-pruning does the opposite of pre-pruning and allows the Decision Tree model to grow to its full
depth. Once the model grows to its full depth, tree branches are removed to prevent the model from
overfitting.

The algorithm will continue to partition the data into smaller subsets until the final subsets are
homogeneous in terms of the outcome variable. The final subsets of the tree may consist of only a few data
points, which means the tree has learned the training data to a T. However, when a new data point that
differs from the learned data is introduced, it may not be predicted well.
Naïve Bayes Classifier Algorithm

• Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem
and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training dataset.
• Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, helping
to build fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis,
and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be described as:

Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of
the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste,
then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually
contributes to identifying it as an apple, without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability
of a hypothesis with prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis
is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:


Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day, according to the
weather conditions. To solve this problem, we need to follow the steps below:

• Convert the given dataset into frequency tables.


• Generate Likelihood table by finding the probabilities of given features.
• Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

Sl No Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes
Frequency table for the Weather Conditions:

Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 4

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.35
P(Yes) = 10/14 = 0.71

So,
P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 5/14 = 0.35

So,
P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
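
The same computation can be reproduced with a small Python sketch. This is an illustrative, from-scratch calculation of the conditional probabilities from the outlook/play table above, not a reference implementation.

```python
# Naive Bayes sketch: compute P(Yes|Sunny) and P(No|Sunny) from the table above.
from collections import Counter

outlook = ["Rainy","Sunny","Overcast","Overcast","Sunny","Rainy","Sunny",
           "Overcast","Rainy","Sunny","Sunny","Rainy","Overcast","Overcast"]
play    = ["Yes","Yes","Yes","Yes","No","Yes","Yes",
           "Yes","No","No","Yes","No","Yes","Yes"]

n = len(play)
class_counts = Counter(play)                                  # {'Yes': 10, 'No': 4}
sunny_given = Counter(p for o, p in zip(outlook, play) if o == "Sunny")
p_sunny = outlook.count("Sunny") / n                          # 5/14

for cls in ("Yes", "No"):
    prior      = class_counts[cls] / n                        # P(cls)
    likelihood = sunny_given[cls] / class_counts[cls]         # P(Sunny|cls)
    posterior  = likelihood * prior / p_sunny                 # Bayes' theorem
    print(f"P({cls}|Sunny) = {posterior:.2f}")
```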

Rule-based Classification in Data Mining

Rule-based classification in data mining is a technique in which class decisions are taken based on
various “if...then… else” rules. Thus, we define it as a classification type governed by a set of IF-THEN
rules. We write an IF-THEN rule as:
“IF condition THEN conclusion.”
IF-THEN Rule
To define the IF-THEN rule, we can split it into two parts:
Rule Antecedent: This is the "if condition" part of the rule. It is present on the LHS (Left-Hand
Side). The antecedent can have one or more attributes as conditions, combined with the logical AND operator.
Rule Consequent: This is present in the rule's RHS (Right Hand Side). The rule consequent consists of
the class prediction.
We want to classify records by using a collection of simpler “if…then…” rules

Rule notation: (Condition) → y

where

Condition is a conjunction of attribute tests


y is the class label
LHS is the rule antecedent or condition
RHS is the rule consequent

Examples of classification rules:


(Blood Type=Warm) & (Lay Eggs=Yes) → Birds
(Taxable Income < 50K) & (Refund=Yes) → Evade=No

R1: (Give Birth = no) & (Can Fly = yes) → Birds

R2: (Give Birth = no) & (Live in Water = yes) → Fishes

R3: (Give Birth = yes) & (Blood Type = warm) → Mammals

R4: (Give Birth = no) & (Can Fly = no) → Reptiles

R5: (Live in Water = sometimes) → Amphibians

Application of Rule-Based Classifier

A rule r covers an instance x if the attributes of the instance satisfy the condition (antecedent) of the rule.

Rule R1 above covers a hawk => Bird

Rule R3 covers the grizzly bear => Mammal
A lemur triggers rule R3, so it is classified => Mammal
A turtle triggers both R4 and R5
A dogfish shark matches none of the rules

Assessment of Rules
Measures of Coverage and Accuracy (illustrated in the sketch below)
Coverage of a rule:
• Fraction of all records that satisfy the antecedent of the rule
• Count(instances with antecedent) / Count(training set)
• Example: for the rule (Status = 'Single') -> No on a training set of 10 records, of which 4 are Single, coverage = 4/10 = 40%
Accuracy of a rule:

• Fraction of records that satisfy the antecedent and also satisfy the consequent of the rule
• Count(instances with antecedent AND consequent) / Count(instances with antecedent)
• Example: for the same rule, if 2 of the 4 Single records have class No, accuracy = 2/4 = 50%
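
A minimal sketch of how coverage and accuracy could be computed; the ten-record dataset below is invented for illustration (it merely reproduces the 4/10 and 2/4 figures above) and is not from the original material.

```python
# Sketch: coverage and accuracy of the rule (Status = 'Single') -> No.
# The ten records are invented so that coverage = 4/10 and accuracy = 2/4.
records = [
    {"Status": "Single",   "Class": "No"},
    {"Status": "Single",   "Class": "No"},
    {"Status": "Single",   "Class": "Yes"},
    {"Status": "Single",   "Class": "Yes"},
    {"Status": "Married",  "Class": "No"},
    {"Status": "Married",  "Class": "No"},
    {"Status": "Married",  "Class": "Yes"},
    {"Status": "Divorced", "Class": "No"},
    {"Status": "Divorced", "Class": "Yes"},
    {"Status": "Divorced", "Class": "Yes"},
]

antecedent = lambda r: r["Status"] == "Single"   # IF part
consequent = lambda r: r["Class"] == "No"        # THEN part

covered = [r for r in records if antecedent(r)]
correct = [r for r in covered if consequent(r)]

print("coverage =", len(covered) / len(records))   # 0.4
print("accuracy =", len(correct) / len(covered))   # 0.5
```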

Properties of Rule-Based Classifiers


There are two significant properties of rule-based classification in data mining. They are:

• Rules may not be mutually exclusive


• Rules may not be exhaustive

Rules may not be mutually exclusive in nature


Many different rules are generated for the dataset, so it is possible and likely that many of them satisfy
the same data record. This condition makes the rules not mutually exclusive. Since the rules are not
mutually exclusive, we cannot decide on classes that cover different parts of data on different rules. But
this was our main objective. So, to solve this problem, we have two ways:

The first way is using an ordered set of rules.


• By ordering the rules, we set priority orders. Thus, this ordered rule set is called a decision list.
• So the class with the highest priority rule is taken as the final class.
The second solution can be assigning votes for each class depending on their weights. So, in this, the
set of rules remains unordered.

Rules may not be exhaustive in nature

It is not a guarantee that the rule will cover all the data entries. Any of the rules may leave some data
entries. This case, on its occurrence, will make the rules non-exhaustive. So, we have to solve this
problem too. So, to solve this problem, we can make use of a default class. Using a default class, we
can assign all the data entries not covered by any rules to the default class. Thus, using the default class
will solve the problem of non-exhaustivity.
Ordered Rule Set
Rules are rank-ordered according to their priority. An ordered rule set is known as a decision list.
When a test record is presented to the classifier:
• It is assigned to the class label of the highest-ranked rule it triggers (the first rule encountered that covers it).
• If none of the rules fire, it is assigned to the default class.

Example from above

R1: (Give Birth = no) & (Can Fly = yes) → Birds


R2: (Give Birth = no) & (Live in Water = yes) → Fishes
R3: (Give Birth = yes) & (Blood Type = warm) → Mammals
R4: (Give Birth = no) & (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

A turtle triggers both R4 and R5, but by the order, the conclusion is =>Reptile

We did not define a default rule.


Backpropagation in Data Mining

Backpropagation is an algorithm that propagates the errors from the output nodes back to the input
nodes. Therefore, it is simply referred to as the backward propagation of errors. It is used in many data
mining applications of neural networks, such as character recognition and signature verification.
Neural Network:
Neural networks are an information processing paradigm inspired by the human nervous system. Just
as the human nervous system has biological neurons, neural networks have artificial neurons, which are
mathematical functions modelled on biological neurons.
The human brain is estimated to have about 10 billion neurons, each connected to an average of 10,000
other neurons. Each neuron receives a signal through a synapse, which controls the effect of the signal
on the neuron.
Backpropagation:

Backpropagation is a widely used algorithm for training feedforward neural networks. It computes the
gradient of the loss function with respect to the network weights. It is much more efficient than naively
computing the gradient with respect to each weight separately. This efficiency makes it possible to use
gradient methods, such as gradient descent or stochastic gradient descent, to train multi-layer networks
and update the weights to minimize the loss.

The backpropagation algorithm works by computing the gradient of the loss function with respect to
each weight via the chain rule, computing the gradient layer by layer, and iterating backward from the
last layer to avoid redundant computation of intermediate terms in the chain rule.

Features of Backpropagation:

1. It is a gradient descent method, as used in the case of a simple perceptron network with
differentiable units.
2. It differs from other networks in the way the weights are calculated during the learning period
of the network.
3. Training is done in three stages:
• the feed-forward of the input training pattern
• the calculation and backpropagation of the error
• the updating of the weights

Working of Backpropagation:

Neural networks use supervised learning to generate output vectors from the input vectors the network
operates on. The network compares the generated output with the desired output and produces an error
if they do not match. The weights are then adjusted according to this error so that the output moves
towards the desired output.

Backpropagation Algorithm:

Step 1: Inputs X, arrive through the preconnected path.


Step 2: The input is modelled using actual weights W, which are usually chosen randomly.
Step 3: Calculate the output of each neuron from the input layer to the hidden layer to the output layer.
Step 4: Calculate the error in the outputs
Backpropagation Error= Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the weights to reduce the error.
Step 6: Repeat the process until the desired output is achieved.
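
Below is a minimal NumPy sketch of the backpropagation loop for a tiny 2-layer network trained on XOR. It is an illustrative toy, not the algorithm as presented in any particular textbook; the layer sizes, learning rate, sigmoid activation, and number of epochs are all assumptions, and convergence on XOR is not guaranteed for every random seed.

```python
# Toy backpropagation sketch: one hidden layer, sigmoid activations, XOR data.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # hidden layer weights and biases
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer weights and bias
lr = 0.5

for epoch in range(20000):
    # Feed-forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error via the chain rule, layer by layer.
    d_out = (out - y) * out * (1 - out)
    d_hid = (d_out @ W2.T) * h * (1 - h)

    # Gradient-descent weight updates.
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid;  b1 -= lr * d_hid.sum(axis=0)

print(np.round(out.ravel(), 2))   # typically approaches [0, 1, 1, 0]
```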

Support vector machines

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms
which are used both for classification and regression. But generally, they are used in classification
problems. SVMs were first introduced in the 1960s and later refined in the 1990s. SVMs have their own
unique way of implementation compared to other machine learning algorithms. Lately, they have become
extremely popular because of their ability to handle multiple continuous and categorical variables.

Working of SVM

An SVM model is basically a representation of different classes in a hyperplane in multidimensional


space. The hyperplane will be generated in an iterative manner by SVM so that the error can be
minimized. The goal of SVM is to divide the datasets into classes to find a maximum marginal
hyperplane (MMH).

The followings are important concepts in SVM −

• Support Vectors − Data points that are closest to the hyperplane are called support vectors.
The separating line is defined with the help of these data points.
• Hyperplane − A decision plane or boundary that separates a set of objects belonging to
different classes.
• Margin − The gap between the two lines at the closest data points of different classes. It can
be calculated as the perpendicular distance from the line to the support vectors. A large margin
is considered a good margin, and a small margin is considered a bad margin.

The main goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane
(MMH) and it can be done in the following two steps −

• First, SVM will generate hyperplanes iteratively that segregate the classes in the best way.
• Then, it will choose the hyperplane that separates the classes correctly.

Non-Linear SVM:

If the data is linearly separable, then we can separate it by using a straight line, but for non-linear data,
we cannot draw a single straight line.

When we can easily separate the data with a hyperplane by drawing a straight line, it is a Linear SVM.
When we cannot separate the data with a straight line, we use a Non-Linear SVM.
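
A brief, hedged sketch of linear versus non-linear SVMs using scikit-learn's SVC. The two-moons dataset and the RBF kernel choice are illustrative assumptions used only to contrast the two cases.

```python
# SVM sketch: a linear kernel vs. an RBF kernel on a non-linearly separable dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm    = SVC(kernel="rbf").fit(X_train, y_train)   # non-linear decision boundary

print("linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:   ", rbf_svm.score(X_test, y_test))
print("support vectors per class (RBF):", rbf_svm.n_support_)
```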
Association Rule learning in Data Mining:
Data mining is the process of discovering and extracting hidden patterns from different types of data to
help decision-makers make decisions. Associative classification is a common classification learning
method in data mining, which applies association rule detection methods and classification to create
classification models.

Association Rule learning in Data Mining:

Association rule learning is a machine learning method for discovering interesting relationships
between variables in large databases. It is designed to detect strong rules in the database based on some
interesting metrics. For any given multi-item transaction, association rules aim to obtain rules that
determine how or why certain items are linked.
Association rules are created by searching for common if-then patterns and using the criteria of support
and confidence to identify the key relationships. Support indicates how frequently an itemset appears in
the data, while confidence indicates how often the if-then statement is found to be true. A third criterion,
called lift, compares the observed confidence with the confidence that would be expected if the items were
independent, i.e., how many times more often the if-then statement holds than chance would predict.
Association rules are computed from itemsets, which consist of two or more items, and usually consist of
rules that are well supported by the data.
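
The three metrics can be illustrated with a short sketch over a handful of made-up transactions; the items and the rule below are invented purely for illustration.

```python
# Sketch: support, confidence and lift for the rule {bread} -> {butter}.
# The transactions are invented for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

antecedent, consequent = {"bread"}, {"butter"}

support_both = sum(antecedent | consequent <= t for t in transactions) / n
support_ante = sum(antecedent <= t for t in transactions) / n
support_cons = sum(consequent <= t for t in transactions) / n

confidence = support_both / support_ante
lift = confidence / support_cons   # lift > 1 means the items co-occur more often than by chance

print(f"support={support_both:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```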

There are different types of data mining techniques that can be used for specific kinds of analysis, such as
classification analysis, clustering analysis, and regression analysis. Association rules are mainly used to
analyse and predict customer behaviour.

• In classification analysis, they are mostly used to question, make decisions, and predict behaviour.
• In clustering analysis, they are mainly used when no assumptions are made about possible
relationships in the data.
• In regression analysis, they are used when we want to predict a continuous dependent value from
a set of independent variables.

Associative Classification in Data Mining:

Bing Liu et al. were the first to propose associative classification, defining a model in which the
right-hand side of each rule is constrained to be the class attribute. An associative classifier is a
supervised learning model that uses association rules to assign a target value.

The model generated by the association classifier and used to label new records consists of association
rules that produce class labels. Therefore, they can also be thought of as a list of “if-then” clauses: if a
record meets certain criteria (specified on the left side of the rule, also known as antecedents), it is
marked (or scored) according to the rule’s category on the right. Most associative classifiers read the
list of rules sequentially and apply the first matching rule to mark new records. Association classifier
rules inherit some metrics from association rules, such as Support or Confidence, which can be used to
rank or filter the rules in the model and evaluate their quality.

Types of Associative Classification:

There are different types of Associative Classification Methods, Some of them are given below.

1. CBA (Classification Based on Associations): It uses association rule techniques to classify data,
which often proves more accurate than traditional classification techniques. However, it is sensitive to
the minimum support threshold: when a lower minimum support threshold is specified, a large
number of rules are generated.
2. CMAR (Classification based on Multiple Association Rules): It uses an efficient FP-tree, which
consumes less memory and space compared to Classification Based on Associations. The FP-tree will
not always fit in the main memory, especially when the number of attributes is large.

3. CPAR (Classification based on Predictive Association Rules): Classification based on predictive


association rules combines the advantages of association classification and traditional rule-based
classification. Classification based on predictive association rules uses a greedy algorithm to generate
rules directly from training data. Furthermore, classification based on predictive association rules
generates and tests more rules than traditional rule-based classifiers to avoid missing important rules.

LAZY LEARNERS

Lazy learners store the training data and wait until testing data appears. When it does, classification is
conducted based on the most related stored training data. Compared to eager learners, lazy learners
spend less training time but more time in predicting.

Examples: K-nearest neighbour and case-based reasoning.

K-Nearest Neighbour (KNN) Algorithm

• K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised
learning technique. The K-NN algorithm assumes similarity between the new case/data and the
available cases and puts the new case into the category that is most similar to the available
categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a well-suited
category using the K-NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly it is used
for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an action
on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets new data, then it
classifies that data into a category that is much similar to the new data.

Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want
to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on
a similarity measure. Our KNN model will find the features of the new image that are most similar to the
cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which
of these categories will the data point fall into? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

• Step-1: Select the number K of the neighbours


• Step-2: Calculate the Euclidean distance of K number of neighbours
• Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
• Step-4: Among these k neighbours, count the number of the data points in each category.
• Step-5: Assign the new data points to that category for which the number of the neighbour is
maximum.
• Step-6: Our model is ready.

Suppose we have a new data point that we need to put into the required category. The steps are:

• Firstly, we will choose the number of neighbours; here we choose k = 5.
• Next, we calculate the Euclidean distance between the new point and the existing data points. The
Euclidean distance between two points (x1, y1) and (x2, y2) is:

d = \sqrt{(x2 - x1)^2 + (y2 - y1)^2}

• By calculating the Euclidean distances, we find the nearest neighbours; suppose three of the
nearest neighbours are in Category A and two are in Category B.
• Since the majority of the 5 nearest neighbours (3 of 5) are from Category A, the new data point
must belong to Category A.
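
A small sketch of the same procedure with scikit-learn's KNeighborsClassifier; the two-dimensional sample points and the query point are invented for illustration.

```python
# KNN sketch: classify a new point by majority vote among its 5 nearest neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented 2-D training points: class 'A' clusters near (1,1), class 'B' near (5,5).
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [1.5, 1.5],
              [5, 5], [5, 6], [6, 5], [6, 6], [5.5, 5.5]])
y = np.array(["A"] * 5 + ["B"] * 5)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5, Euclidean distance by default
knn.fit(X, y)

new_point = np.array([[2.5, 2.0]])
print("predicted category:", knn.predict(new_point)[0])            # expected: 'A'
print("neighbour distances:", knn.kneighbors(new_point)[0].round(2))
```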

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

• There is no particular way to determine the best value for "K", so we need to try some values to
find the best out of them. The most preferred value for K is 5.
• A very low value of K, such as K=1 or K=2, can be noisy and is sensitive to outliers in the
model.
• Large values of K reduce the effect of noise, but they can make the class boundaries less distinct.

Advantages of KNN Algorithm:

• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

• It always needs a value of K to be determined, which may sometimes be complex.
• The computation cost is high because of calculating the distance between the data points for all
the training samples.

Case Based Reasoning (CBR) Classifier

Case-Based Reasoning classifiers (CBR) use a database of problem solutions to solve new problems. It
stores the tuples or cases for problem-solving as complex symbolic descriptions. How does CBR work?
When a new case arises to classify, a case-based reasoner will first check whether an identical training
case exists. If one is found, then the accompanying solution to that case is returned. If no identical case
is found, then the CBR will search for training cases having components that are similar to those of the
new case. Conceptually, these training cases may be considered as neighbours of the new case. If cases
are represented as graphs, this involves searching for subgraphs that are similar to subgraphs within the
new case. The CBR tries to combine the solutions of the neighbouring training cases to propose a
solution for the new case. If incompatibilities arise among the individual solutions, then backtracking to
search for other solutions may be necessary. The CBR may employ background knowledge and
problem-solving strategies to propose a feasible solution. Applications of CBR include:
• Problem resolution for customer service help desks, where cases describe product-related
diagnostic problems.
• It is also applied to areas such as engineering and law, where cases are either technical designs
or legal rulings, respectively.
• Medical education, where patient case histories and treatments are used to help diagnose and
treat new patients.

Challenges with CBR

• Finding a good similarity metric (eg for matching subgraphs) and suitable methods for
combining solutions.
• Selecting salient features for indexing training cases and the development of efficient indexing
techniques.
• There is a trade-off between accuracy and efficiency that evolves as the number of stored cases
becomes very large. As this number increases, the CBR becomes more intelligent, but after a
certain point, the system's efficiency will suffer, as the time required to search for and process
relevant cases increases.

Prediction methods
LINEAR AND NONLINEAR REGRESSION

• It is the simplest form of regression. Linear regression attempts to model the relationship between
two variables by fitting a linear equation to the observed data.
• Linear regression attempts to find the mathematical relationship between variables.
• If the outcome is a straight line, then it is considered a linear model; if it is a curved line, then it
is a non-linear model.
• The relationship between the dependent variable and a single independent variable is given by
a straight line:

Y = α + β X

• The model says 'Y' is a linear function of 'X'.
• The value of 'Y' increases or decreases in a linear manner as the value of 'X' changes.

MULTIPLE LINEAR REGRESSION

• Multiple linear regression is an extension of linear regression analysis.


• It uses two or more independent variables to predict an outcome and a single continuous
dependent variable.

Y = a0 + a1 X1 + a2 X2 + ... + ak Xk + e

where 'Y' is the response variable, X1, X2, ..., Xk are the independent predictors, 'e' is the random error,
and a0, a1, a2, ..., ak are the regression coefficients.
LOGISTIC REGRESSION

Logistic regression was used in the biological sciences in the early twentieth century. It was then used in
many social science applications. Logistic regression is used when the dependent variable (target) is
categorical.

For example,

• To predict whether an email is spam (1) or (0)


• Whether the tumor is malignant (1) or not (0)
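
A compact sketch contrasting linear and logistic regression with scikit-learn; the synthetic data and the chosen features are assumptions made purely for illustration.

```python
# Sketch: linear regression predicts a continuous value,
# logistic regression predicts a categorical (0/1) outcome.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Linear regression: predict house price from area (invented data).
area = rng.uniform(50, 200, size=(100, 1))                  # square metres
price = 1000 * area[:, 0] + rng.normal(0, 5000, size=100)   # price with noise
lin = LinearRegression().fit(area, price)
print("estimated price per square metre:", round(lin.coef_[0], 1))

# Logistic regression: predict a binary label (e.g. spam = 1 / not spam = 0).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)                     # invented labelling rule
log = LogisticRegression().fit(X, y)
print("P(class=1) for [1.0, 0.5]:", round(log.predict_proba([[1.0, 0.5]])[0, 1], 2))
```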

Techniques To Evaluate Accuracy of Classifier or predictor in Data Mining

HoldOut

• In the holdout method, the dataset is randomly divided into three subsets:
• A training set is the subset of the dataset used to build predictive models.
• The validation set is the subset of the dataset used to assess the performance of the model
built in the training phase. It provides a test platform for fine-tuning the model's parameters
and selecting the best-performing model. Not all modelling algorithms need a validation set.
• The test set, or unseen examples, is the subset of the dataset used to assess the likely future
performance of the model. If a model fits the training set much better than it fits the test set,
overfitting is probably the cause.
• Typically, two-thirds of the data are allocated to the training set and the remaining one-third
to the test set.
Random Subsampling

• Random subsampling is a variation of the holdout method in which the holdout method is
repeated K times.
• Each repetition involves randomly splitting the data into a training set and a test set.
• The model is trained on the training set, and the mean squared error (MSE) is obtained from
the predictions on the test set.
• Because the MSE depends on the particular split, a new split can give a different MSE, so a
single holdout estimate is not reliable on its own.
• The overall accuracy is calculated as E = (1/K) \sum_{i=1}^{K} E_i

Cross-Validation

• K-fold cross-validation is used when only a limited amount of data is available, to achieve an
unbiased estimate of the model's performance.
• Here, we divide the data into K subsets of equal size.
• We build the model K times, each time leaving out one of the subsets from training and using
it as the test set.
• If K equals the sample size, this is called "Leave-One-Out" cross-validation.
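
A minimal sketch of K-fold cross-validation with scikit-learn's cross_val_score; the iris dataset and the choice of a decision tree are assumptions made for illustration.

```python
# K-fold cross-validation sketch: average accuracy over K held-out folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)   # K = 5 folds
scores = cross_val_score(model, X, y, cv=kfold)           # one score per fold

print("per-fold accuracy:", scores.round(3))
print("estimated accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```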

Bootstrapping
• Bootstrapping is a technique used to make estimations from the data by taking the average of
estimates from smaller data samples.
• The bootstrapping method involves iteratively resampling a dataset with replacement.
• With resampling, instead of estimating a statistic only once on the complete data, we can
estimate it many times.
• Repeating this multiple times yields a vector of estimates.
• From these estimates, bootstrapping can compute the variance, expected value, and other
relevant statistics.

Ensemble classifier

Ensemble learning helps improve machine learning results by combining several models. This approach
allows better predictive performance than a single model. The basic idea is to
learn a set of classifiers (experts) and to allow them to vote.

• Advantage: Improvement in predictive accuracy.


• Disadvantage: It is difficult to understand an ensemble of classifiers.

Dietterich (2002) showed that ensembles overcome three problems –

• Statistical Problem –The Statistical Problem arises when the hypothesis space is too large for
the amount of available data. Hence, there are many hypotheses with the same accuracy on the
data and the learning algorithm chooses only one of them! There is a risk that the accuracy of
the chosen hypothesis is low on unseen data!
• Computational Problem – The Computational Problem arises when the learning algorithm
cannot guarantee finding the best hypothesis.
• Representational Problem –The Representational Problem arises when the hypothesis space
does not contain any good approximation of the target class(es).

Main Challenge for Developing Ensemble Models?

The main challenge is not to obtain highly accurate base models, but rather to obtain base models which
make different kinds of errors. For example, if ensembles are used for classification, high accuracies
can be accomplished if different base models misclassify different training examples, even if the base
classifier accuracy is low.

Methods for Independently Constructing Ensembles –

• Majority Vote
• Bagging and Random Forest
• Randomness Injection
• Feature-Selection Ensembles
• Error-Correcting Output Coding

Methods for Coordinated Construction of Ensembles –


• Boosting
• Stacking

Types of Ensemble Classifier –


Bagging:

Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree. Given a set D of
d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a
bootstrap sample). A classifier model Mi is then learned for each training set Di. Each classifier Mi
returns its class prediction, and the bagged classifier M* counts the votes and assigns the class with the
most votes to X (the unknown sample).
Implementation steps of Bagging –
1. Multiple subsets are created from the original data set with equal tuples, selecting
observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each other.
4. The final predictions are determined by combining the predictions from all the models.
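
A small sketch of bagging with scikit-learn's BaggingClassifier, which by default bootstraps the data and trains a decision tree on each sample; the synthetic dataset is purely illustrative.

```python
# Bagging sketch: train many trees on bootstrap samples and combine their votes.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# By default, each base model is a decision tree trained on a bootstrap sample.
bagging = BaggingClassifier(n_estimators=50, random_state=1)
bagging.fit(X_train, y_train)

print("bagged test accuracy:", bagging.score(X_test, y_test))
```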

Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a decision tree
classifier and is generated using a random selection of attributes at each node to determine the split.
During classification, each tree votes and the most popular class is returned.
Implementation steps of Random Forest –
• Multiple subsets are created from the original dataset, selecting observations with replacement.
• At each node, a subset of features is selected randomly, and whichever of these features gives
the best split is used to split the node; this is repeated iteratively.
• Each tree is grown to its largest possible size.
• The above steps are repeated, and the final prediction is given by aggregating the predictions
from the n trees.
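
Finally, a brief sketch with scikit-learn's RandomForestClassifier; the dataset and hyperparameter values are illustrative assumptions, not tuned settings.

```python
# Random forest sketch: bagging plus a random subset of features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# max_features controls how many randomly chosen features each split may consider.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
forest.fit(X_train, y_train)

print("random forest test accuracy:", forest.score(X_test, y_test))
print("most important feature index:", forest.feature_importances_.argmax())
```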
