
Unit 3: Data Mining Techniques & Classification and Prediction

Name: Nihar Ranjan Prusty


Roll No: 22PG101201
MBA(2022-24)
Data Mining Techniques

Data mining uses algorithms and various other techniques to convert
large collections of data into useful output.
The most popular types of data mining techniques include:
• Association rules
• Classification
• Clustering
• Neural networks
• Predictive analysis
• Association rules
Association rule mining finds interesting associations and relationships among large
sets of data items. A rule shows how frequently an itemset occurs in a transaction.
A typical example is Market Basket Analysis.
Example:- Let's say we have a dataset representing transactions at a grocery store.
Each transaction consists of items that a customer purchased together. Here's a small
dataset:
Transaction 1: {Bread, Milk}
Transaction 2: {Bread, Diapers, Beer}
Transaction 3: {Milk, Diapers, Beer, Eggs}
Transaction 4: {Bread, Milk, Diapers, Beer}
Transaction 5: {Bread, Milk, Diapers, Eggs}
Now, let's use association rule mining to find patterns in this data.
1. Find frequent item sets:
Calculate support for each item:
• Support(Bread) = 4/5 = 0.8
• Support(Milk) = 4/5 = 0.8
• Support(Diapers) = 4/5 = 0.8
• Support(Beer) = 3/5 = 0.6
• Support(Eggs) = 2/5 = 0.4
With a minimum support threshold of 0.5, Bread, Milk, Diapers, and Beer are
frequent items; Eggs (support 0.4) is not.
2. Generate association rules:
• Generate rules with confidence greater than or equal to 0.5:
• {Bread} -> {Milk} (Confidence = Support(Bread ∪ Milk) / Support(Bread) = 3/4 = 0.75)
• {Milk} -> {Bread} (Confidence = Support(Bread ∪ Milk) / Support(Milk) = 3/4 = 0.75)
• {Bread} -> {Diapers} (Confidence = Support(Bread ∪ Diapers) /
Support(Bread) = 3/4 = 0.75)
• {Diapers} -> {Bread} (Confidence = Support(Bread ∪ Diapers) /
Support(Diapers) = 3/4 = 0.75)
• {Bread} -> {Beer} (Confidence = Support(Bread ∪ Beer) /
Support(Bread) = 2/4 = 0.5)
• {Beer} -> {Bread} (Confidence = Support(Bread ∪ Beer) / Support(Beer)
= 2/3 ≈ 0.67)
• {Milk} -> {Diapers} (Confidence = Support(Milk ∪ Diapers) /
Support(Milk) = 3/4 = 0.75)
These are the association rules generated from the dataset. They indicate the
likelihood of one item being bought if another item is bought. For example,
if a customer buys Bread, there's a 75% chance they will also buy Milk.
Similarly, if a customer buys Milk, there's a 50% chance they will buy Beer,
and so on.
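The support and confidence calculations above can be reproduced with a short
script. A minimal sketch in plain Python (no external libraries), using the
five transactions and the 0.5 thresholds from the example:

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer"},
    {"Milk", "Diapers", "Beer", "Eggs"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Eggs"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / n

# Step 1: frequent single items (minimum support = 0.5)
items = set().union(*transactions)
frequent = {i for i in items if support({i}) >= 0.5}
print("Frequent items:", frequent)   # Bread, Milk, Diapers, Beer

# Step 2: rules A -> B between frequent items (minimum confidence = 0.5)
for a, b in combinations(sorted(frequent), 2):
    for lhs, rhs in [({a}, {b}), ({b}, {a})]:
        conf = support(lhs | rhs) / support(lhs)
        if conf >= 0.5:
            print(lhs, "->", rhs, "confidence =", round(conf, 2))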
• Applications of Association Rules

• Analyzing shopping transaction data (Market Basket Analysis)
• Predicting web page click streams
• Mining software bugs
• Finding DNA or protein sequences in medical data
• Finding quality phrases and entities in text
• Identifying objects or sub-structures in images, video, and social media
Advantages and disadvantages of Association Rule Mining

Advantages:
1. Easy to Understand
2. Identifying Hidden Patterns
3. Scalability
4. Decision Support
5. Applicability to Diverse Domains

Disadvantages:
1. High Dimensionality
2. Quality of Rules
3. Handling Continuous Variables
4. Sensitive to Data Skewness
5. Influence of Support and Confidence Thresholds

• Classification
Classification in data mining is a common technique that separates data
points into different classes. It allows you to organize data sets of all
sorts, including complex and large datasets as well as small and simple
ones. It primarily involves using algorithms that you can easily tune
to improve classification accuracy.
Types of Classification Techniques in Data Mining
There are two major types of classification in data mining:
1. Generative Classification: This type of classification models the
distribution of individual classes. It tries to learn the model that
generates the data behind the scenes by estimating assumptions and
distributions of the model. This approach is useful for predicting unseen
data.
2. Discriminative Classification: This is a simpler approach that assigns
exactly one class to each row of data. It models the class boundary directly
from the observed data, so it depends heavily on the quality of the data
rather than on assumed distributions.
Common classification algorithms include:
• Support Vector Machines (SVMs)
• Decision Trees
• K-Nearest Neighbours (KNN)
• Random Forests
• Artificial Neural Networks (ANNs)
• Support Vector Machines (SVMs): These are kernel-based methods that
are used to classify data points into two categories. They work by finding a
hyperplane that separates the data points of one category from the data
points of another category.
• Decision Trees: These are tree-like structures that are used to classify data
points. They work by asking a series of questions about the data point, and
then following the branch of the tree that corresponds to the answer to the
question
• K-Nearest Neighbors (KNN): This is a type of instance-based learning,
where the classification of a data point is based on the classification of its
nearest neighbors.
• Random Forests: These are ensembles of decision trees that are trained
on different random subsets of the data. They can be used for both
classification and regression tasks.
Example of Decision Tree
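As a concrete illustration of a decision tree classifier, here is a minimal
sketch; it assumes scikit-learn is installed and uses its built-in Iris
dataset as a stand-in for real business data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A shallow tree keeps the "series of questions" easy to read
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
# Print the learned split rules as text
print(export_text(tree, feature_names=load_iris().feature_names))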
Advantages and disadvantages of different classification models:

Logistic Regression
  Advantage: Probabilistic approach; gives information about the statistical significance of features.
  Disadvantage: Relies on the assumptions of logistic regression.

K-Nearest Neighbours
  Advantage: Simple to understand, fast and efficient.
  Disadvantage: Need to manually choose the number of neighbours 'k'.

Support Vector Machine (SVM)
  Advantage: Performant, not biased by outliers, not sensitive to overfitting.
  Disadvantage: Not appropriate for non-linear problems; not the best choice for a large number of features.

Kernel SVM
  Advantage: High performance on non-linear problems, not biased by outliers, not sensitive to overfitting.
  Disadvantage: Not the best choice for a large number of features; more complex.

Naive Bayes
  Advantage: Efficient, not biased by outliers, works on non-linear problems, probabilistic approach.
  Disadvantage: Based on the assumption that the features have the same statistical relevance.

Decision Tree Classification
  Advantage: Interpretability, no need for feature scaling, works on both linear and non-linear problems.
  Disadvantage: Poor results on very small datasets; overfitting can easily occur.

Random Forest Classification
  Advantage: Powerful and accurate, good performance on many problems, including non-linear ones.
  Disadvantage: No interpretability; overfitting can easily occur; need to choose the number of trees manually.
The process of building a classification model:
• Data Collection: The first step in building a classification model is data
collection. In this step, the data relevant to the problem at hand is collected.
The data should be representative of the problem and should contain all the
necessary attributes and labels needed for classification.
• Data Preprocessing: The second step in building a classification model is
data preprocessing. The collected data needs to be preprocessed to ensure
its quality. This involves handling missing values, dealing with outliers,
and transforming the data into a format suitable for analysis.
• Feature Selection: The third step in building a classification model is
feature selection. Feature selection involves identifying the most relevant
attributes in the dataset for classification. This can be done using various
techniques, such as correlation analysis, information gain, and principal
component analysis.
• Principal Component Analysis: Principal Component Analysis (PCA) is a
technique used to reduce the dimensionality of the dataset. PCA identifies
the most important features in the dataset and removes the redundant ones.
• Model Selection: The fourth step in building a classification model is
model selection. Model selection involves selecting the appropriate
classification algorithm for the problem at hand. There are several
algorithms available, such as decision trees, support vector machines, and
neural networks.
• Neural Networks: Neural Networks are a powerful classification
algorithm that can learn complex patterns in the data. They are inspired by
the structure of the human brain and consist of multiple layers of
interconnected nodes.
• Model Training: The fifth step in building a classification model is
model training. Model training involves using the selected classification
algorithm to learn the patterns in the data. The data is divided into a
training set and a validation set. The model is trained using the training
set, and its performance is evaluated on the validation set.
• Model Evaluation: The sixth step in building a classification model is
model evaluation. Model evaluation involves assessing the performance
of the trained model on a test set. This is done to ensure that the model
generalizes well to unseen data.
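A minimal end-to-end sketch of these steps, assuming scikit-learn is
available; its built-in breast-cancer dataset stands in for collected data,
and the specific choices (PCA to 10 components, an RBF-kernel SVM) are
illustrative, not prescriptive.

# Data collection: load a ready-made labelled dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(   # hold out a test set
    X, y, test_size=0.2, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),    # preprocessing: normalize the features
    ("pca", PCA(n_components=10)),  # feature selection / dimensionality reduction
    ("clf", SVC(kernel="rbf")),     # model selection: an SVM classifier
])
model.fit(X_train, y_train)                                   # model training
print(classification_report(y_test, model.predict(X_test)))  # model evaluation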
Classification and Prediction Issues
• Data Cleaning − Data cleaning involves removing the noise and treating
missing values. The noise is removed by applying smoothing techniques, and the
problem of missing values is solved by replacing a missing value with the most
commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes.
Correlation analysis is used to determine whether any two given attributes are related.
• Data Transformation and reduction − The data can be transformed by any of the
following methods:
i. Normalization − The data is transformed using normalization. Normalization
involves scaling all values of a given attribute so that they fall within
a small specified range. Normalization is used when the learning step involves
neural networks or methods based on distance measurements (a minimal
min-max scaling sketch follows this list).
ii. Generalization − The data can also be transformed by generalizing it to the
higher concept. For this purpose we can use the concept hierarchies.
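The min-max scaling mentioned above, as a tiny plain-Python sketch; the
income values are hypothetical and serve only to show the scaling step.

# Scale one attribute's values into the range [0, 1]
incomes = [12000, 35000, 47000, 98000, 54000]

lo, hi = min(incomes), max(incomes)
normalized = [(v - lo) / (hi - lo) for v in incomes]
print(normalized)   # every value now falls between 0.0 and 1.0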
Comparison of Classification and Prediction Methods
• Accuracy − Accuracy of a classifier refers to its ability to predict
the class label correctly; accuracy of a predictor refers to how well
a given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the
classifier or predictor.
• Robustness − It refers to the ability of classifier or predictor to make correct
predictions from given noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or
predictor efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or predictor
can be understood, i.e., the level of insight it provides.
Bayesian Classification
• Bayesian classification is based on Bayes' Theorem. Bayesian
classifiers are the statistical classifiers. Bayesian classifiers can predict
class membership probabilities such as the probability that a given
tuple belongs to a particular class.
• Bayes' Theorem: Bayes' Theorem is named after Thomas Bayes. There
are two types of probabilities −
i. Posterior Probability [P(H|X)]
ii. Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
• According to Bayes' Theorem, P(H|X) = P(X|H) P(H) / P(X)
i. Posterior Probability [P(H|X)]: When new data or information is
collected, the Prior Probability of an event is revised to
produce a more accurate measure of a possible outcome. This revised
probability is the Posterior Probability and is calculated using
Bayes' theorem. So, the Posterior Probability is the probability that
hypothesis H holds given that the data X has been observed.
ii. Prior Probability [P(H)]: Prior Probability is the probability of an
event before new data is collected. It is the best logical evaluation of
the probability of an outcome based on present knowledge of the event,
before new evidence is examined.
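A small numerical sketch of Bayes' theorem in plain Python; the prior, the
likelihoods, and the churn scenario are made up purely for illustration.

# Revising a prior P(H) after observing data X; P(X) comes from the
# law of total probability over H and not-H.
p_h = 0.01           # prior P(H), e.g. probability a customer churns
p_x_given_h = 0.90   # likelihood P(X|H): data X is likely if H is true
p_x_given_not_h = 0.10

p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # P(X)
p_h_given_x = p_x_given_h * p_h / p_x                   # posterior P(H|X)
print(round(p_h_given_x, 3))   # ~0.083: the prior of 0.01 is revised upward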
Applications of Bayes’ Theorem
Some applications are given below :

• It can also be used as a building block and starting point for more
complex methodologies, for example, the popular Bayesian networks.
• Used in classification problems and other probability-related questions.
• Bayesian inference, a particular approach to statistical inference.
• In genetics, Bayes’ theorem can be used to calculate the probability of
an individual having a specific genotype.
Clustering
• The process of making a group of abstract objects into classes of similar
objects is known as clustering. One group is treated as a cluster of data
objects.
• In the process of cluster analysis, the first step is to partition the set of data
into groups with the help of data similarity, and then groups are assigned
to their respective labels.
• The biggest advantage of clustering over classification is that it can adapt to
changes in the data and helps single out useful features that differentiate
different groups.
Applications of cluster analysis :

• It is widely used in many applications such as image processing, data


analysis, and pattern recognition.
• It helps marketers to find the distinct groups in their customer base
and they can characterize their customer groups by using purchasing
patterns.
• It can be used in the field of biology, by deriving animal and plant
taxonomies and identifying genes with the same capabilities.
• It also helps in information discovery by classifying documents on the
web.
Clustering Methods:
1. Model-Based Method: Model-based clustering methods assume that the
data is generated from a mixture of probability distributions and aim to
fit a statistical model to the data. These methods seek to identify
underlying patterns in the data by estimating parameters of the
probabilistic model, such as cluster centers, covariances, and mixing
proportions.
2. Hierarchical Method: Hierarchical clustering is a method of cluster
analysis in data mining that creates a hierarchical representation of the
clusters in a dataset. The method starts by treating each data point as a
separate cluster and then iteratively combines the closest clusters until a
stopping criterion is reached. A Hierarchical clustering method works via
grouping data into a tree of clusters.
3. Constraint-Based Method: Constraint-based clustering methods
incorporate prior knowledge or constraints into the clustering process to
guide the formation of clusters. These methods allow users to specify
constraints such as pairwise similarities, dissimilarities, or
must-link/cannot-link constraints between data points, which are then
used to guide the clustering process.

4. Grid-Based Method: It Divides the data space into a finite number of


cells, forming a grid structure. Clusters are formed by grouping cells that
contain a sufficient number of data points. Example methods include
STING (Statistical Information Grid) and CLIQUE (Clustering In
Quest).
5. Partitioning Method: Partitioning methods aim to partition the data points
into a specified number of clusters such that each data point belongs to
exactly one cluster. These methods iteratively refine the partitioning to
optimize a chosen criterion, such as minimizing intra-cluster distance or
maximizing inter-cluster distance. One of the most popular partitioning
methods is K-Means clustering (a minimal sketch follows this list).
6. Density-Based Method: Density-based methods aim to discover clusters
of arbitrary shape based on the density of data points in the feature space.
These methods typically identify regions of high density separated by
regions of low density. One of the well-known density-based clustering
algorithms is DBSCAN.
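The K-Means sketch referenced above, assuming scikit-learn; the synthetic
2-D points generated here are a stand-in for real data.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # each point is assigned to exactly one cluster

print("Cluster centers:")
print(kmeans.cluster_centers_)
print("First ten labels:", labels[:10])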
Requirements of clustering in data mining:
• Scalability – we require highly scalable clustering algorithms to work
with large databases.
• Ability to deal with different kinds of attributes – Algorithms should
be able to work with the type of data such as categorical, numerical, and
binary data.
• Discovery of clusters with arbitrary shape – The algorithm should be
able to detect clusters of arbitrary shape and should not be bounded to
distance measures alone.
• Interpretability – The clustering results should be comprehensible, usable, and
interpretable.
• High dimensionality – The algorithm should be able to handle high
dimensional space instead of only handling low dimensional data.
Advantages of Cluster Analysis:
• It can help identify patterns and relationships within a dataset that may not be
immediately obvious.

• It can be used for exploratory data analysis and can help with feature selection.

• It can be used to reduce the dimensionality of the data.

• It can be used for anomaly detection and outlier identification.

• It can be used for market segmentation and customer profiling.


Disadvantages of Cluster Analysis
• It can be sensitive to the choice of initial conditions and the number
of clusters.
• It can be sensitive to the presence of noise or outliers in the data.
• It can be difficult to interpret the results of the analysis if the clusters
are not well-defined.
• It can be computationally expensive for large datasets.
• The results of the analysis can be affected by the choice of clustering
algorithm used.
• It is important to note that the success of cluster analysis depends on
the data, the goals of the analysis, and the ability of the analyst to
interpret the results.
Neural Networks
Neural networks extract identifying features from data without any
pre-programmed understanding. Network components include neurons,
connections, weights, biases, propagation functions, and a learning rule.
Neurons receive inputs and produce outputs governed by thresholds and
activation functions.
Importance of Neural Networks:
The ability of neural networks to identify patterns, solve intricate puzzles,
and adjust to changing surroundings is essential. Their capacity to learn
from data has far-reaching effects, ranging from revolutionizing
technology like natural language processing and self-driving automobiles
to automating decision-making processes and increasing efficiency in
numerous industries.
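A minimal sketch of a small feed-forward neural network classifier, assuming
scikit-learn; its built-in digits dataset stands in for real inputs, and the
single 64-neuron hidden layer is an arbitrary illustrative choice.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Weights and biases of the hidden layer are learned from the training data
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))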
Thank You
