U4 Classification and Prediction
There are two forms of data analysis that can be used to extract models describing important classes or predict
future data trends. These two forms are as follows:
1. Classification
2. Prediction
We use classification and prediction to extract a model representing the data classes and to predict future data
trends. These forms of analysis help us understand the data at a large scale.
Classification models predict categorical class labels, and prediction models predict continuous-valued
functions. For example, we can build a classification model to categorize bank loan applications as either safe or
risky or a prediction model to predict the expenditures in dollars of potential customers on computer equipment
given their income and occupation.
What is Classification?
Classification is the task of identifying the category or class label of a new observation. First, a set of data is used as
training data. The set of input data and the corresponding outputs are given to the algorithm. So, the training
data set includes the input data and their associated class labels. Using the training dataset, the algorithm derives
a model or the classifier. The derived model can be a decision tree, mathematical formula, or a neural network.
In classification, when unlabeled data is given to the model, it should find the class to which it belongs. The new
data provided to the model is the test data set.
Classification is the process of assigning a record to a class. One simple example of classification is checking
whether it is raining or not; the answer can be either yes or no, so there is a fixed number of choices. Sometimes
there are more than two classes to choose from. That is called multiclass classification.
The bank needs to analyze whether giving a loan to a particular customer is risky or not. For example, based on
observable data for multiple loan borrowers, a classification model may be established that forecasts credit risk.
The data could track job records, homeownership or leasing, years of residency, number and type of deposits,
historical credit ranking, and so on. The target would be the credit ranking, the predictors would be the other
attributes, and the data would represent a case for each customer. In this example, a model is constructed to find the
categorical label. The labels are risky or safe.
The functioning of classification has been illustrated above with the bank loan application. There are two stages
in the data classification process: developing the classifier (model creation) and applying the classifier for
classification.
1. Developing the Classifier or model creation: This level is the learning stage or the learning process.
The classification algorithms construct the classifier in this stage. A classifier is constructed from a
training set composed of database records and their corresponding class labels. Each record that makes up
the training set is associated with a category or class. We may also refer to these records as
samples, objects, or data points.
2. Applying classifier for classification: The classifier is used for classification at this level. The test
data are used here to estimate the accuracy of the classification algorithm. If the accuracy is deemed
sufficient, the classification rules can be applied to new data records. Applications include:
o Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring. We can
use it to extract social media insights. We can build sentiment analysis models to read and
analyze even misspelled words with advanced machine learning algorithms. Well-trained
models provide consistently accurate outcomes in a fraction of the time.
o Document Classification: We can use document classification to organize the documents into
sections according to the content. Document classification refers to text classification; we can
classify the words in the entire document. And with the help of machine learning classification
algorithms, we can execute it automatically.
o Image Classification: Image classification assigns an image to one of a set of trained categories.
These could be the caption of the image, a statistical value, or a theme. You can tag images to
train your model for relevant categories by applying supervised learning algorithms.
3. Data Classification Process: The data classification process includes steps such as:
o Defining the goals, strategy, workflows, and architecture of data
classification.
The data classification life cycle provides an excellent structure for controlling the flow of data in an enterprise.
Businesses need to account for data security and compliance at each level. With the help of data classification,
we can enforce this at every stage, from origin to deletion. The data life cycle has the following stages:
1. Origin: Sensitive data is produced in various formats, such as emails, Excel, Word, Google documents,
social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it based on in-
house protection policies and agreement rules.
3. Storage: The collected data is stored with access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from various
devices and platforms.
5. Publication: Through publication, data can reach customers, who can then view and
download it in the form of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. As in classification, the
training dataset contains the inputs and corresponding numerical output values. The algorithm derives the model
or a predictor according to the training dataset. The model should find a numerical output when the new data is
given. Unlike in classification, this method does not have a class label. The model predicts a continuous-valued
function or ordered value.
Regression is generally used for prediction. Predicting the value of a house based on facts such as the
number of rooms, the total area, and so on, is an example of prediction; a minimal sketch follows.
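A sketch of such a regression-based predictor using scikit-learn's LinearRegression; the feature values and prices below are invented purely for illustration:

# Predicting house value from simple features with linear regression.
from sklearn.linear_model import LinearRegression

# Features: [number of rooms, total area in square meters]; invented data.
X = [[3, 80], [4, 120], [2, 55], [5, 150], [3, 95]]
y = [150_000, 230_000, 110_000, 300_000, 180_000]  # sale prices in dollars

model = LinearRegression().fit(X, y)  # derive the predictor from training data

# The model outputs a continuous value for unseen data, not a class label.
print(model.predict([[4, 100]]))  # estimated price for a 4-room, 100 m^2 house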
For example, suppose the marketing manager needs to predict how much a particular customer will spend at his
company during a sale. In this case, we are interested in forecasting a numerical value, so this data analysis task
is an example of numeric prediction. Here, a model or predictor is developed that forecasts a continuous-valued
or ordered function.
A major issue is preparing the data for classification and prediction. Preparing the data involves the following
activities (a short data-cleaning sketch follows the list):
1. Data Cleaning: Data cleaning involves removing noise and handling missing values. Noise
is removed by applying smoothing techniques, and the problem of missing values is solved by
replacing a missing value with the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is used to
determine whether any two given attributes are related.
3. Data Transformation and Reduction: The data can be transformed by methods such as the following.
o Generalization: The data can be transformed by generalizing it to higher-level concepts.
For this purpose, we can use concept hierarchies.
NOTE: Data can also be reduced by some other methods such as wavelet transformation, binning, histogram
analysis, and clustering.
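A minimal sketch of the cleaning step in pandas, assuming a toy table whose column names are invented for illustration; a missing categorical value is filled with the attribute's most common value, and a numeric attribute is smoothed by binning:

# Data cleaning: mode imputation for a missing value, then binning-based smoothing.
import pandas as pd

df = pd.DataFrame({
    "occupation": ["teacher", "engineer", None, "engineer", "teacher"],
    "income": [28_000, 52_000, 31_000, 49_000, 90_000],
})

# Replace the missing value with the most commonly occurring value of the attribute.
df["occupation"] = df["occupation"].fillna(df["occupation"].mode()[0])

# Smooth the numeric attribute: equal-width bins, each value replaced by its bin mean.
df["income_bin"] = pd.cut(df["income"], bins=3)
df["income_smooth"] = df.groupby("income_bin", observed=True)["income"].transform("mean")
print(df)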
Here are the criteria for comparing classification and prediction methods:
o Accuracy: The accuracy of the classifier can be referred to as the ability of the classifier to predict the
class label correctly, and the accuracy of the predictor can be referred to as how well a given predictor
can estimate the unknown value.
o Speed: The speed of the method depends on the computational cost of generating and using the
classifier or predictor.
o Robustness: Robustness is the ability to make correct predictions or classifications. In the context of
data mining, robustness is the ability of the classifier or predictor to make correct predictions given
noisy data or data with missing values.
o Interpretability: Interpretability is how readily we can understand the reasoning behind predictions or
classification made by the predictor or classifier.
Classification vs. Prediction:

o Classification is the process of identifying which category a new observation belongs to, based on a
training data set containing observations whose category membership is known. Prediction is the
process of identifying the missing or unavailable numerical data for a new observation.
o In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy
depends on how well a given predictor can guess the value of a predicted attribute for new data.
o In classification, the model is known as the classifier. In prediction, the model is known as the predictor.
o In classification, a model or classifier is constructed to find categorical labels. In prediction, a model or
predictor is constructed that predicts a continuous-valued function or ordered value.
o For example, grouping patients based on their medical records can be considered classification, while
predicting the correct treatment for a particular disease for a person can be thought of as prediction.
Decision Tree
A decision tree is a supervised learning method used in data mining for classification and regression tasks. It is
a tree that helps us in decision-making. A decision tree creates classification or regression models in the form of
a tree structure. It separates a data set into smaller subsets, and at the same time, the tree is steadily
developed. The final tree contains decision nodes and leaf nodes. A decision node has at least two branches,
while leaf nodes show a classification or decision and cannot be split any further. The uppermost decision node
in a tree, which corresponds to the best predictor, is called the root node. Decision trees can deal
with both categorical and numerical data.
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the randomness or
impurity in data sets.
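For a dataset $S$ with $c$ classes, where $p_i$ is the proportion of records belonging to class $i$, the standard formula is:

$$\text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

Entropy is 0 for a pure set (all records in one class) and maximal when the classes are evenly mixed.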
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split on an attribute. It is also called Entropy
Reduction. Building a decision tree is all about discovering attributes that return the highest information gain,
computed as shown below.
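For a split of $S$ on attribute $A$ into subsets $S_v$, one per value $v$ of $A$, the standard formula is:

$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$$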
In short, a decision tree is just like a flowchart diagram, with the terminal nodes showing decisions. Starting
with the whole dataset, we measure the entropy to find a way to segment the set, and repeat until the data in
each segment belongs to the same class.
It provides us with a framework to measure the values of outcomes and the probabilities of achieving them, and
it helps us make the best decisions based on existing data and informed assumptions.
In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an
extensive collection of records into smaller sets of the class by implementing a sequence of simple decision
rules. A decision tree model comprises a set of rules for partitioning a large heterogeneous population into
smaller, more homogeneous, mutually exclusive classes. The attributes of the classes can be any variables
with nominal, ordinal, binary, or quantitative values; in contrast, the classes must be of a qualitative type, such as
categorical, ordinal, or binary. In brief, given data of attributes together with their classes, a decision tree
creates a set of rules that can be used to identify the class. One rule is implemented after another, resulting in a
hierarchy of segments within a segment. The hierarchy is known as the tree, and each segment is called a node.
With each progressive division, the members from the subsequent sets become more and more similar to each
other. Hence, the algorithm used to build a decision tree is referred to as recursive partitioning. A well-known
such algorithm is CART (Classification and Regression Trees).
Consider an example: expanding the factory costs $3 million; the probability of a good economy is 0.6 (60%),
which leads to $8 million profit, and the probability of a bad economy is 0.4 (40%), which leads to $6 million
profit. Not expanding the factory costs $0; the probability of a good economy is 0.6 (60%), which leads to $4
million profit, and the probability of a bad economy is 0.4 (40%), which leads to $2 million profit.
The management team needs to make a data-driven decision on whether to expand, based on the given data.
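Working this out as expected values (assuming the stated profits are gross, so the $3 million expansion cost must be subtracted):

$$E[\text{expand}] = 0.6 \times 8 + 0.4 \times 6 - 3 = 4.8 + 2.4 - 3 = 4.2 \text{ (\$ million)}$$

$$E[\text{don't expand}] = 0.6 \times 4 + 0.4 \times 2 = 2.4 + 0.8 = 3.2 \text{ (\$ million)}$$

Under these assumptions, the expected profit from expanding ($4.2 million) exceeds that from not expanding ($3.2 million), so the data favors expansion.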
The decision tree algorithm may appear long, but it is quite simple. The basic algorithm technique is as follows:
Initially, D is the entire set of training tuples and their associated class labels (the input training data).
Attribute_selection_method specifies a heuristic procedure for choosing the attribute that "best" discriminates
the given tuples according to class.
Missing values in the data do not influence the process of building a decision tree to any considerable extent.
A decision tree model is automatic and simple to explain to the technical team as well as stakeholders.
Compared to other algorithms, decision trees need less effort for data preparation during pre-processing; a
minimal sketch follows.
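A sketch of decision tree classification with scikit-learn, echoing the bank-loan example from earlier; the features and labels are invented purely for illustration:

# Classifying loan applications as safe/risky with a CART-style decision tree.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [income in $1000s, years at current job, number of existing debts]
X = [[30, 1, 2], [80, 6, 0], [45, 3, 1], [25, 0, 3], [95, 10, 1], [60, 4, 0]]
y = ["risky", "safe", "safe", "risky", "safe", "safe"]

# criterion="entropy" selects splits by information gain, as described above.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# The derived model is an explicit, explainable set of if-then rules.
print(export_text(clf, feature_names=["income", "years_job", "debts"]))
print(clf.predict([[50, 2, 1]]))  # classify a new, unlabeled applicant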
Naïve Bayes Classifier
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used
for solving classification problems.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms which helps in
building the fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis, and
classifying articles.
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of
the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste,
then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually
contributes to identifying it as an apple, without depending on the others.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability
of a hypothesis with prior knowledge. It depends on the conditional probability:

P(A|B) = P(B|A) * P(A) / P(B)

Where,
P(A|B) is Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: the probability of the evidence.
The working of the Naïve Bayes Classifier can be understood with the help of the example below.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this
dataset, we need to decide whether we should play on a particular day according to the weather
conditions. To solve this problem, we follow these steps:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table of the weather conditions:

Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4
Likelihood table of the weather conditions:

Weather     No            Yes
Overcast    0             5             5/14 = 0.36
Rainy       2             2             4/14 = 0.29
Sunny       2             3             5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41

Since P(Yes|Sunny) > P(No|Sunny), on a sunny day the player can play.
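A minimal sketch that reproduces this hand calculation in plain Python; the two lists encode the 14-row weather dataset above:

# Computing P(label | weather) directly from the dataset via Bayes' theorem.
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

def posterior(weather, label):
    # Bayes' theorem: P(label | weather) = P(weather | label) * P(label) / P(weather)
    rows = [w for w, p in zip(outlook, play) if p == label]
    likelihood = rows.count(weather) / len(rows)      # P(weather | label)
    prior = len(rows) / len(play)                     # P(label)
    evidence = outlook.count(weather) / len(outlook)  # P(weather)
    return likelihood * prior / evidence

print(round(posterior("Sunny", "Yes"), 2))  # 0.6
print(round(posterior("Sunny", "No"), 2))   # 0.4 (0.41 above comes from rounding)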
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship
between features.
o It can be used for real-time predictions because the Naïve Bayes Classifier is an eager learner.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed.
It is primarily used for document classification problems, i.e., determining which category a particular
document belongs to, such as Sports, Politics, or Education.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the predictor
variables are the independent Booleans variables. Such as if a particular word is present or not in a
document. This model is also famous for document classification tasks.
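A minimal sketch of the Gaussian variant with scikit-learn, assuming two continuous features; the numbers and class names are invented for illustration:

# Gaussian Naive Bayes: continuous features modeled as per-class normal distributions.
from sklearn.naive_bayes import GaussianNB

# Features: [height in cm, weight in kg]; invented two-class data.
X = [[180, 80], [175, 77], [160, 55], [158, 52], [182, 85], [163, 58]]
y = ["adult", "adult", "teen", "teen", "adult", "teen"]

model = GaussianNB().fit(X, y)   # estimates a mean and variance per feature, per class
print(model.predict([[170, 65]]))         # most probable class for a new observation
print(model.predict_proba([[170, 65]]))   # posterior probability of each class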
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one
cluster, and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of grouping abstract objects into classes of similar objects.
Points to Remember
Clustering analysis is broadly used in many applications such as market research, pattern recognition,
data analysis, and image processing.
Clustering can also help marketers discover distinct groups in their customer base, and they can
characterize their customer groups based on purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionalities and gain insight into structures inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth observation database. It
also helps in the identification of groups of houses in a city according to house type, value, and
geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to
observe characteristics of each cluster.
The following points throw light on why clustering is required in data mining −
Scalability − We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to
any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting
clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only
spherical clusters of small size.
High dimensionality − The clustering algorithm should be able to handle not only low-dimensional
data but also high-dimensional data.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and usable.
Clustering Methods
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects, and the partitioning method constructs ‘k’ partitions of the data.
Each partition will represent a cluster, and k ≤ n. This means that it will classify the data into k groups, which
satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects from one
group to another, as in the k-means sketch below.
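A minimal sketch of a partitioning method using scikit-learn's KMeans; the 2-D points are invented for illustration:

# Partitioning n objects into k clusters with k-means (iterative relocation).
from sklearn.cluster import KMeans

# Six 2-D points forming two loose groups; k = 2 <= n = 6.
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each object
print(km.cluster_centers_)  # the k cluster centroids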
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical
methods on the basis of how the hierarchical decomposition is formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate
group. It keeps on merging the objects or groups that are close to one another, and it keeps doing so until all of
the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same
cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in
its own cluster or until the termination condition holds. Hierarchical methods are rigid, i.e., once a merging or
splitting is done, it can never be undone.
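A minimal sketch of the bottom-up (agglomerative) approach with scikit-learn; the points are invented for illustration:

# Agglomerative (bottom-up) hierarchical clustering: start with singleton groups
# and repeatedly merge the closest pair of groups.
from sklearn.cluster import AgglomerativeClustering

X = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 9.1], [8.9, 9.0]]

# Stop merging once 3 clusters remain (the termination condition).
agg = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print(agg.labels_)  # e.g. [0, 0, 1, 1, 2, 2] up to a permutation of the labels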
Here are the two approaches that are used to improve the quality of hierarchical clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group
objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given cluster as long as
the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of points.
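A minimal sketch of this idea using scikit-learn's DBSCAN, a well-known density-based algorithm; eps is the neighborhood radius, min_samples the minimum point count, and the data is invented for illustration:

# Density-based clustering: grow clusters while the eps-neighborhood of each point
# contains at least min_samples points; isolated points are labeled noise (-1).
from sklearn.cluster import DBSCAN

X = [[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # dense region
     [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],               # another dense region
     [9.0, 1.0]]                                        # isolated noise point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # e.g. [0 0 0 0 1 1 1 -1]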
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of cells that form a
grid structure.
Advantages
The major advantage of this method is fast processing time, which depends only on the number of cells in each
dimension of the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given model. This
method locates the clusters by clustering the density function. It reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on standard statistics,
taking outlier or noise into account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented constraints. A
constraint refers to the user expectation or the properties of desired clustering results. Constraints provide us
with an interactive way of communication with the clustering process. Constraints can be specified by the user
or the application requirement.