
Unit 3: Data Mining Techniques & Classification and Prediction

Name: Nihar Ranjan Prusty


Roll No: 22PG101201
MBA(2022-24)
Data Mining Techniques

Data mining uses algorithms and various other techniques to convert
large collections of data into useful output.
The most popular types of data mining techniques include:
• Association rules
• Classification
• Clustering
• Neural networks
• Predictive analysis
• Association rules
Association rule mining finds interesting associations and relationships among large
sets of data items. A rule shows how frequently an itemset occurs in a transaction.
A typical example is Market Basket Analysis.
Example:- Let's say we have a dataset representing transactions at a grocery store.
Each transaction consists of items that a customer purchased together. Here's a small
dataset:
Transaction 1: {Bread, Milk}
Transaction 2: {Bread, Diapers, Beer}
Transaction 3: {Milk, Diapers, Beer, Eggs}
Transaction 4: {Bread, Milk, Diapers, Beer}
Transaction 5: {Bread, Milk, Diapers, Eggs}
Now, let's use association rule mining to find patterns in this data.
1. Find frequent item sets:
Calculate support for each item:
• Support(Bread) = 4/5 = 0.8
• Support(Milk) = 4/5 = 0.8
• Support(Diapers) = 4/5 = 0.8
• Support(Beer) = 3/5 = 0.6
• Support(Eggs) = 2/5 = 0.4
With a minimum support threshold of 0.5, Bread, Milk, Diapers, and Beer are
frequent items; Eggs (support 0.4) is not.
2. Generate association rules:
• Generate rules with confidence greater than or equal to 0.5:
• {Bread} -> {Milk} (Confidence = Support(Bread ∪ Milk) / Support(Bread) = 3/4 = 0.75)
• {Milk} -> {Bread} (Confidence = Support(Bread ∪ Milk) / Support(Milk) = 3/4 = 0.75)
• {Bread} -> {Diapers} (Confidence = Support(Bread ∪ Diapers) /
Support(Bread) = 3/4 = 0.75)
• {Diapers} -> {Bread} (Confidence = Support(Bread ∪ Diapers) /
Support(Diapers) = 3/4 = 0.75)
• {Bread} -> {Beer} (Confidence = Support(Bread ∪ Beer) /
Support(Bread) = 2/4 = 0.5)
• {Beer} -> {Bread} (Confidence = Support(Bread ∪ Beer) / Support(Beer)
= 2/3 ≈ 0.67)
• {Milk} -> {Diapers} (Confidence = Support(Milk ∪ Diapers) /
Support(Milk) = 3/4 = 0.75)
These are the association rules generated from the dataset. They indicate the
likelihood of one item being bought if another item is bought. For example,
if a customer buys Bread, there's a 75% chance they will also buy Milk.
Similarly, if a customer buys Milk, there's a 50% chance they will buy Beer,
and so on.
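The support and confidence calculations above can be reproduced with a short
script. A minimal sketch in plain Python (no external libraries), using the
five transactions and the 0.5 thresholds from the example:

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer"},
    {"Milk", "Diapers", "Beer", "Eggs"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Eggs"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / n

# Step 1: frequent single items (minimum support = 0.5)
items = set().union(*transactions)
frequent = {i for i in items if support({i}) >= 0.5}
print("Frequent items:", frequent)   # Bread, Milk, Diapers, Beer

# Step 2: rules A -> B between frequent items (minimum confidence = 0.5)
for a, b in combinations(sorted(frequent), 2):
    for lhs, rhs in [({a}, {b}), ({b}, {a})]:
        conf = support(lhs | rhs) / support(lhs)
        if conf >= 0.5:
            print(lhs, "->", rhs, "confidence =", round(conf, 2))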
• Applications of Association Rules

• Analyzing shopping transaction data (Market Basket Analysis)
• Predicting web page click streams
• Mining software bugs
• Finding DNA or protein sequences in medical data
• Finding quality phrases and entities in text
• Identifying objects or sub-structures in images, video, and social media
Advantages and disadvantages of Association Rule Mining

Advantages:
1. Easy to Understand
2. Identifying Hidden Patterns
3. Scalability
4. Decision Support
5. Applicability to Diverse Domains

Disadvantages:
1. High Dimensionality
2. Quality of Rules
3. Handling Continuous Variables
4. Sensitive to Data Skewness
5. Influence of Support and Confidence Thresholds

• Classification
Classification in data mining is a common technique that separates data
points into different classes. It allows you to organize data sets of all
sorts, including complex and large datasets as well as small and simple
ones. It primarily involves using algorithms that you can easily tune
to improve classification accuracy.
Types of Classification Techniques in Data Mining
There are two major types of classification in data mining:
1. Generative Classification: This type of classification models the
distribution of individual classes. It tries to learn the model that
generates the data behind the scenes by estimating assumptions and
distributions of the model. This approach is useful for predicting unseen
data.
2. Discriminative Classification: This is a simpler approach that assigns
exactly one class to each row of data. It models the class boundary directly
from the observed data, so it depends heavily on the quality of the data
rather than on assumed distributions.
Common classification algorithms include:
• Support Vector Machines (SVMs)
• Decision Trees
• K-Nearest Neighbours (KNN)
• Random Forests
• Artificial Neural Networks (ANNs)
• Support Vector Machines (SVMs): These are kernel-based methods that
are used to classify data points into two categories. They work by finding a
hyperplane that separates the data points of one category from the data
points of another category.
• Decision Trees: These are tree-like structures that are used to classify data
points. They work by asking a series of questions about the data point, and
then following the branch of the tree that corresponds to the answer to the
question
• K-Nearest Neighbors (KNN): This is a type of instance-based learning,
where the classification of a data point is based on the classification of its
nearest neighbors.
• Random Forests: These are ensembles of decision trees that are trained
on different random subsets of the data. They can be used for both
classification and regression tasks.
Example of Decision Tree
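As a concrete illustration of a decision tree classifier, here is a minimal
sketch; it assumes scikit-learn is installed and uses its built-in Iris
dataset as a stand-in for real business data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A shallow tree keeps the "series of questions" easy to read
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
# Print the learned split rules as text
print(export_text(tree, feature_names=load_iris().feature_names))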
Advantages and disadvantages of different classification models:

Logistic Regression
  Advantage: Probabilistic approach; gives information about the statistical significance of features.
  Disadvantage: Relies on the assumptions of logistic regression.

K-Nearest Neighbours
  Advantage: Simple to understand, fast and efficient.
  Disadvantage: Need to manually choose the number of neighbours 'k'.

Support Vector Machine (SVM)
  Advantage: Performant, not biased by outliers, not sensitive to overfitting.
  Disadvantage: Not appropriate for non-linear problems; not the best choice for a large number of features.

Kernel SVM
  Advantage: High performance on non-linear problems, not biased by outliers, not sensitive to overfitting.
  Disadvantage: Not the best choice for a large number of features; more complex.

Naive Bayes
  Advantage: Efficient, not biased by outliers, works on non-linear problems, probabilistic approach.
  Disadvantage: Based on the assumption that the features have the same statistical relevance.

Decision Tree Classification
  Advantage: Interpretability, no need for feature scaling, works on both linear and non-linear problems.
  Disadvantage: Poor results on very small datasets; overfitting can easily occur.

Random Forest Classification
  Advantage: Powerful and accurate, good performance on many problems, including non-linear ones.
  Disadvantage: No interpretability; overfitting can easily occur; need to choose the number of trees manually.
The process of building a classification model:
• Data Collection: The first step in building a classification model is data
collection. In this step, the data relevant to the problem at hand is collected.
The data should be representative of the problem and should contain all the
necessary attributes and labels needed for classification.
• Data Preprocessing: The second step in building a classification model is
data preprocessing. The collected data needs to be preprocessed to ensure
its quality. This involves handling missing values, dealing with outliers,
and transforming the data into a format suitable for analysis.
• Feature Selection: The third step in building a classification model is
feature selection. Feature selection involves identifying the most relevant
attributes in the dataset for classification. This can be done using various
techniques, such as correlation analysis, information gain, and principal
component analysis.
• Principal Component Analysis: Principal Component Analysis (PCA) is a
technique used to reduce the dimensionality of the dataset. PCA identifies
the most important features in the dataset and removes the redundant ones.
• Model Selection: The fourth step in building a classification model is
model selection. Model selection involves selecting the appropriate
classification algorithm for the problem at hand. There are several
algorithms available, such as decision trees, support vector machines, and
neural networks.
• Neural Networks: Neural Networks are a powerful classification
algorithm that can learn complex patterns in the data. They are inspired by
the structure of the human brain and consist of multiple layers of
interconnected nodes.
• Model Training: The fifth step in building a classification model is
model training. Model training involves using the selected classification
algorithm to learn the patterns in the data. The data is divided into a
training set and a validation set. The model is trained using the training
set, and its performance is evaluated on the validation set.
• Model Evaluation: The sixth step in building a classification model is
model evaluation. Model evaluation involves assessing the performance
of the trained model on a test set. This is done to ensure that the model
generalizes well to unseen data.
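A minimal end-to-end sketch of these steps, assuming scikit-learn is
available; its built-in breast-cancer dataset stands in for collected data,
and the specific choices (PCA to 10 components, an RBF-kernel SVM) are
illustrative, not prescriptive.

# Data collection: load a ready-made labelled dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(   # hold out a test set
    X, y, test_size=0.2, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),    # preprocessing: normalize the features
    ("pca", PCA(n_components=10)),  # feature selection / dimensionality reduction
    ("clf", SVC(kernel="rbf")),     # model selection: an SVM classifier
])
model.fit(X_train, y_train)                                   # model training
print(classification_report(y_test, model.predict(X_test)))  # model evaluation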
Classification and Prediction Issues
• Data Cleaning − Data cleaning involves removing the noise and treating
missing values. The noise is removed by applying smoothing techniques, and the
problem of missing values is solved by replacing a missing value with the most
commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes.
Correlation analysis is used to determine whether any two given attributes are related.
• Data Transformation and reduction − The data can be transformed by any of the
following methods:
i. Normalization − The data is transformed using normalization. Normalization
involves scaling all values of a given attribute so that they fall within
a small specified range. Normalization is used when the learning step involves
neural networks or methods based on distance measurements (a minimal
min-max scaling sketch follows this list).
ii. Generalization − The data can also be transformed by generalizing it to the
higher concept. For this purpose we can use the concept hierarchies.
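The min-max scaling mentioned above, as a tiny plain-Python sketch; the
income values are hypothetical and serve only to show the scaling step.

# Scale one attribute's values into the range [0, 1]
incomes = [12000, 35000, 47000, 98000, 54000]

lo, hi = min(incomes), max(incomes)
normalized = [(v - lo) / (hi - lo) for v in incomes]
print(normalized)   # every value now falls between 0.0 and 1.0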
Comparison of Classification and Prediction Methods
• Accuracy − Accuracy of a classifier refers to its ability to predict
the class label correctly; accuracy of a predictor refers to how well
a given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the
classifier or predictor.
• Robustness − It refers to the ability of classifier or predictor to make correct
predictions from given noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or
predictor efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or predictor
can be understood, i.e., the level of insight it provides.
Bayesian Classification
• Bayesian classification is based on Bayes' Theorem. Bayesian
classifiers are the statistical classifiers. Bayesian classifiers can predict
class membership probabilities such as the probability that a given
tuple belongs to a particular class.
• Bayes' Theorem: Bayes' Theorem is named after Thomas Bayes. There
are two types of probabilities −
i. Posterior Probability [P(H|X)]
ii. Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
• According to Bayes' Theorem, P(H|X) = P(X|H) P(H) / P(X)
i. Posterior Probability [P(H|X)]: When new data or information is
collected, the Prior Probability of an event is revised to
produce a more accurate measure of a possible outcome. This revised
probability is the Posterior Probability and is calculated using
Bayes' theorem. So, the Posterior Probability is the probability that
hypothesis H holds given that the data X has been observed.
ii. Prior Probability [P(H)]: Prior Probability is the probability of an
event before new data is collected. It is the best logical evaluation of
the probability of an outcome based on present knowledge of the event,
before new evidence is examined.
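A small numerical sketch of Bayes' theorem in plain Python; the prior, the
likelihoods, and the churn scenario are made up purely for illustration.

# Revising a prior P(H) after observing data X; P(X) comes from the
# law of total probability over H and not-H.
p_h = 0.01           # prior P(H), e.g. probability a customer churns
p_x_given_h = 0.90   # likelihood P(X|H): data X is likely if H is true
p_x_given_not_h = 0.10

p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # P(X)
p_h_given_x = p_x_given_h * p_h / p_x                   # posterior P(H|X)
print(round(p_h_given_x, 3))   # ~0.083: the prior of 0.01 is revised upward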
Applications of Bayes’ Theorem
Some applications are given below :

• It can also be used as a building block and starting point for more
complex methodologies, for example, the popular Bayesian networks.
• Used in classification problems and other probability-related questions.
• Bayesian inference, a particular approach to statistical inference.
• In genetics, Bayes’ theorem can be used to calculate the probability of
an individual having a specific genotype.
Clustering
• The process of making a group of abstract objects into classes of similar
objects is known as clustering. One group is treated as a cluster of data
objects.
• In the process of cluster analysis, the first step is to partition the set of data
into groups with the help of data similarity, and then groups are assigned
to their respective labels.
• The biggest advantage of clustering over classification is that it can adapt to
changes in the data and helps single out useful features that differentiate
different groups.
Applications of cluster analysis :

• It is widely used in many applications such as image processing, data


analysis, and pattern recognition.
• It helps marketers to find the distinct groups in their customer base
and they can characterize their customer groups by using purchasing
patterns.
• It can be used in the field of biology, by deriving animal and plant
taxonomies and identifying genes with the same capabilities.
• It also helps in information discovery by classifying documents on the
web.
Clustering Methods:
1. Model-Based Method: Model-based clustering methods assume that the
data is generated from a mixture of probability distributions and aim to
fit a statistical model to the data. These methods seek to identify
underlying patterns in the data by estimating parameters of the
probabilistic model, such as cluster centers, covariances, and mixing
proportions.
2. Hierarchical Method: Hierarchical clustering is a method of cluster
analysis in data mining that creates a hierarchical representation of the
clusters in a dataset. The method starts by treating each data point as a
separate cluster and then iteratively combines the closest clusters until a
stopping criterion is reached. A Hierarchical clustering method works via
grouping data into a tree of clusters.
3. Constraint-Based Method: Constraint-based clustering methods
incorporate prior knowledge or constraints into the clustering process to
guide the formation of clusters. These methods allow users to specify
constraints such as pairwise similarities, dissimilarities, or
must-link/cannot-link constraints between data points, which are then
used to guide the clustering process.

4. Grid-Based Method: It Divides the data space into a finite number of


cells, forming a grid structure. Clusters are formed by grouping cells that
contain a sufficient number of data points. Example methods include
STING (Statistical Information Grid) and CLIQUE (Clustering In
Quest).
5. Partitioning Method: Partitioning methods aim to partition the data points
into a specified number of clusters such that each data point belongs to
exactly one cluster. These methods iteratively refine the partitioning to
optimize a chosen criterion, such as minimizing intra-cluster distance or
maximizing inter-cluster distance. One of the most popular partitioning
methods is K-Means clustering (a minimal sketch follows this list).
6. Density-Based Method: Density-based methods aim to discover clusters
of arbitrary shape based on the density of data points in the feature space.
These methods typically identify regions of high density separated by
regions of low density. One of the well-known density-based clustering
algorithms is DBSCAN.
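The K-Means sketch referenced above, assuming scikit-learn; the synthetic
2-D points generated here are a stand-in for real data.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # each point is assigned to exactly one cluster

print("Cluster centers:")
print(kmeans.cluster_centers_)
print("First ten labels:", labels[:10])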
Requirements of clustering in data mining:
• Scalability – we require highly scalable clustering algorithms to work
with large databases.
• Ability to deal with different kinds of attributes – Algorithms should
be able to work with the type of data such as categorical, numerical, and
binary data.
• Discovery of clusters with arbitrary shape – The algorithm should be
able to detect clusters of arbitrary shape and should not be bounded to
distance measures alone.
• Interpretability – The clustering results should be comprehensible, usable, and
interpretable.
• High dimensionality – The algorithm should be able to handle high
dimensional space instead of only handling low dimensional data.
Advantages of Cluster Analysis:
• It can help identify patterns and relationships within a dataset that may not be
immediately obvious.

• It can be used for exploratory data analysis and can help with feature selection.

• It can be used to reduce the dimensionality of the data.

• It can be used for anomaly detection and outlier identification.

• It can be used for market segmentation and customer profiling.


Disadvantages of Cluster Analysis
• It can be sensitive to the choice of initial conditions and the number
of clusters.
• It can be sensitive to the presence of noise or outliers in the data.
• It can be difficult to interpret the results of the analysis if the clusters
are not well-defined.
• It can be computationally expensive for large datasets.
• The results of the analysis can be affected by the choice of clustering
algorithm used.
• It is important to note that the success of cluster analysis depends on
the data, the goals of the analysis, and the ability of the analyst to
interpret the results.
Neural Networks
Neural networks extract identifying features from data without any
pre-programmed understanding. Network components include neurons,
connections, weights, biases, propagation functions, and a learning rule.
Neurons receive inputs and produce outputs governed by thresholds and
activation functions.
Importance of Neural Networks:
The ability of neural networks to identify patterns, solve intricate puzzles,
and adjust to changing surroundings is essential. Their capacity to learn
from data has far-reaching effects, ranging from revolutionizing
technology like natural language processing and self-driving automobiles
to automating decision-making processes and increasing efficiency in
numerous industries.
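A minimal sketch of a small feed-forward neural network classifier, assuming
scikit-learn; its built-in digits dataset stands in for real inputs, and the
single 64-neuron hidden layer is an arbitrary illustrative choice.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Weights and biases of the hidden layer are learned from the training data
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))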
Thank You
