Knowledge Discovery in Databases


Knowledge Discovery in Databases (KDD)

What is KDD?

KDD stands for Knowledge Discovery in Databases and is defined as the process of finding, transforming, and refining meaningful data and patterns from a raw database so that they can be used in different domains or applications.

It's a process of extracting useful information or patterns from large datasets.

The process typically involves steps such as data cleaning, data preprocessing, data transformation,
pattern discovery, and interpretation of the discovered patterns.

KDD is commonly associated with data mining, machine learning, and artificial intelligence
techniques and is used across various fields including business, healthcare, finance, and science for
tasks such as predictive modeling, classification, clustering, and anomaly detection.

KDD in data mining is a programmed and analytical approach to model data from a database to
extract useful and applicable ‘knowledge’.

KDD in data mining is necessary for businesses and organizations since it enables them to gain new knowledge and insights from their data. Knowledge discovery in databases can assist in improving customer experience, enhancing the decision-making process, optimizing operations, supporting strategic planning, and driving business growth.

Steps in the KDD Process :

1. Goal-Setting and Application Understanding :

This is the first step in the process and requires prior understanding and knowledge of the application domain. This is where we decide how the transformed data and the patterns arrived at by data mining will be used to extract knowledge. This premise is extremely important; if it is set wrong, it can lead to false interpretations and negative impacts on the end user.

2. Data Selection and Integration :

After setting the goals and objectives, the data collected needs to be selected and segregated into meaningful sets based on availability, accessibility, importance, and quality. These parameters are critical for data mining because they form its foundation and will affect what kinds of data models are formed.

3. Data Cleaning and Preprocessing :

This step involves searching for missing data and removing noisy, redundant and low-quality data
from the data set in order to improve the reliability of the data and its effectiveness. Certain
algorithms are used for searching and eliminating unwanted data based on attributes specific to the
application.



4. Data Transformation

This step prepares the data to be fed to the data mining algorithms. Hence, the data needs to be in
consolidated and aggregate forms. The data is consolidated on the basis of functions, attributes,
features etc.

5. Data Mining

This is the core of the whole KDD process. This is where algorithms are used to extract meaningful patterns from the transformed data, which help in building prediction models. It is an analytical step that discovers trends in a data set using techniques such as artificial intelligence, advanced numerical and statistical methods, and specialised algorithms.

6. Pattern Evaluation/Interpretation

Once the trends and patterns have been obtained from various data mining methods and iterations, these patterns need to be represented in discrete forms such as bar graphs, pie charts, and histograms to study the impact of the data collected and transformed during the previous steps. This also helps in evaluating the effectiveness of a particular data model in view of the domain.

7. Knowledge Discovery and Use

This is the final step in the KDD process and requires the ‘knowledge’ extracted from the previous step to be applied to the specific application or domain in a visualised format such as tables, reports, etc. This step drives the decision-making process for the said application.
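To make these steps concrete, the short Python sketch below walks through steps 2 to 7 on a hypothetical customer file. It is only a minimal illustration, assuming pandas and scikit-learn are installed; the file name sales.csv, the column names, and the choice of k-means clustering as the mining algorithm are assumptions for the example, not part of the notes.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Step 2 - selection and integration: load the raw data, keep relevant columns
data = pd.read_csv("sales.csv")[["age", "income", "purchases"]]

# Step 3 - cleaning and preprocessing: remove duplicates, fill missing values
data = data.drop_duplicates()
data = data.fillna(data.mean(numeric_only=True))

# Step 4 - transformation: bring the attributes onto a common scale
scaled = StandardScaler().fit_transform(data)

# Step 5 - data mining: discover customer segments with clustering
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Steps 6-7 - evaluation and use: inspect cluster sizes and centres
print(pd.Series(model.labels_).value_counts())
print(model.cluster_centers_)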



What is Data Preprocessing?
Data preprocessing is a crucial step in data mining. It involves transforming raw data into a clean,
structured, and suitable format for mining. Proper data preprocessing helps improve the quality of
the data, enhances the performance of algorithms, and ensures more accurate and reliable results.

Data preprocessing is used to improve the quality of data and mining results. The goal of data preprocessing is to enhance the accuracy, efficiency, and reliability of data mining algorithms.

Data preprocessing is an essential step in the knowledge discovery process, because quality decisions must be based on quality data. Data preprocessing involves data cleaning, data integration, data reduction, and data transformation.

Why Preprocess the Data?

In the real world, many databases and data warehouses contain noisy, missing, and inconsistent data due to their huge size. Low-quality data leads to low-quality data mining results.

Noisy: containing errors or outliers, e.g., Salary = “-10”. Noisy data may come from

• Human or computer error at data entry.

• Errors in data transmission.

Missing (incomplete): lacking certain attribute values or containing only aggregate data, e.g., Occupation = “”. Missing data may come from

• “Not applicable” data values at collection time.

• Human/hardware/software problems.

Inconsistent: different versions of the same data appear in different places. For example, the ZIP code is saved in one table in the format 1234-567, while in another table it may be represented as 1234567. Inconsistent data may come from

• Errors in data entry.

• Merging data from different sources with varying formats.



Steps in Data Preprocessing :

1. Data Cleaning :

Data cleaning is a process that "cleans" the data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies in the data. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied. Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
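As an illustration, the small pandas sketch below fills in missing values, treats an impossible salary as noise, and removes duplicate rows; the DataFrame and its column names are made up for the example.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [35000, -10, 42000, np.nan],       # -10 is a noisy, impossible value
    "occupation": ["clerk", "", "engineer", "clerk"],
})

df["salary"] = df["salary"].where(df["salary"] > 0)         # treat impossible values as missing
df["salary"] = df["salary"].fillna(df["salary"].median())   # fill in missing values
df["occupation"] = df["occupation"].replace("", "unknown")  # handle empty attribute values
df = df.drop_duplicates()                                   # remove redundant records
print(df)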

2. Data Integration :

Data integration is the process of combining data from multiple sources into a single, unified view. This process involves identifying and accessing the different data sources and mapping the data to a common format. Different data sources may include multiple data cubes, databases, or flat files.

The goal of data integration is to make it easier to access and analyze data that is spread across
multiple systems or platforms, in order to gain a more complete and accurate understanding of
the data.

Data integration strategy is typically described using a triple (G, S, M) approach, where G
denotes the global schema, S denotes the schema of the heterogeneous data sources, and M
represents the mapping between the queries of the source and global schema.
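The sketch below shows the idea with pandas: two hypothetical sources store the same ZIP code in different formats, so both are mapped to a common schema and format before being merged into one unified view.

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "zip": ["1234-567", "9876-543"]})
orders    = pd.DataFrame({"customer": [1, 2], "zipcode": [1234567, 9876543]})

# Map both sources to a common schema and ZIP format
customers["zip"] = customers["zip"].str.replace("-", "")
orders = orders.rename(columns={"customer": "cust_id", "zipcode": "zip"})
orders["zip"] = orders["zip"].astype(str)

unified = customers.merge(orders, on=["cust_id", "zip"], how="outer")
print(unified)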

3. Data Reduction :



Imagine that you have selected data from the AllElectronics data warehouse for analysis. The data set will likely be huge! Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible.

Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That is,
mining on the reduced data set should be more efficient yet produce the same (or almost the
same) analytical results.

In simple words, data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the most important information. This can be beneficial in situations where the dataset is too large to be processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.
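One common way to obtain such a reduced representation is dimensionality reduction. The sketch below uses PCA from scikit-learn on random stand-in data; the sizes and the number of components kept are illustrative.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 20)             # 1000 records with 20 attributes
pca = PCA(n_components=5)                # keep a 5-dimensional representation
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (1000, 5)
print(pca.explained_variance_ratio_.sum())    # share of variance retained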

4. Data Transformation :

Data transformation in data mining refers to the process of converting raw data into a format
that is suitable for analysis and modelling. The goal of data transformation is to prepare the
data for data mining so that it can be used to extract useful insights and knowledge.
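For example, min-max normalization rescales an attribute into a fixed range before mining. A tiny scikit-learn sketch, with illustrative age values:

from sklearn.preprocessing import MinMaxScaler

ages = [[18], [25], [40], [60]]
scaler = MinMaxScaler()                  # rescales each attribute to [0, 1]
print(scaler.fit_transform(ages))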

What is the Apriori Algorithm? How does the Apriori Algorithm work in Data Mining?
The Apriori algorithm is used to calculate association rules between objects, i.e., how two or more objects are related to one another. In other words, the Apriori algorithm is an association rule learning technique that analyzes whether people who bought product A also bought product B.

The Apriori algorithm is used for mining frequent item sets and relevant association rules. Generally, the Apriori algorithm operates on a database containing a huge number of transactions, for example, the items customers buy at a Big Bazar.

The following three components comprise the Apriori algorithm.

1. Support

2. Confidence

3. Lift

Let's take an example to understand this concept.

As discussed above, you need a huge database containing a large number of transactions. Suppose you have 4000 customer transactions in a Big Bazar. You have to calculate the support, confidence, and lift for two products, say Biscuits and Chocolates, because customers frequently buy these two items together.



Out of 4000 transactions, 400 contain Biscuits and 600 contain Chocolates, and 200 transactions contain both Biscuits and Chocolates. Using this data, we will find the support, confidence, and lift.

Support :

Support refers to the default popularity of any product. You find the support by dividing the number of transactions containing that product by the total number of transactions. Hence, we get

Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions)

= 400/4000 = 10 percent.

Confidence :

Confidence refers to the likelihood that customers who bought biscuits also bought chocolates. So, you need to divide the number of transactions that contain both biscuits and chocolates by the total number of transactions involving biscuits to get the confidence.

Hence,

Confidence = (Transactions containing both Biscuits and Chocolates) / (Total transactions involving Biscuits)

= 200/400 = 50 percent.

Lift :

Consider the above example; lift refers to the increase in the ratio of the sale of chocolates when you sell biscuits. The mathematical equation of lift is given below.

Lift = Confidence (Biscuits → Chocolates) / Support (Chocolates)
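Plugging in the numbers from the worked example, the three measures can be checked with a few lines of Python; this assumes the lift formula above (confidence divided by the support of Chocolates).

total = 4000
biscuits = 400
chocolates = 600
both = 200

support_biscuits = biscuits / total            # 400/4000 = 0.10 -> 10 percent
confidence = both / biscuits                   # 200/400  = 0.50 -> 50 percent
lift = confidence / (chocolates / total)       # 0.50 / 0.15 ≈ 3.33

print(support_biscuits, confidence, lift)

A lift greater than 1 indicates that biscuits and chocolates are bought together more often than would be expected if the two products were independent.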

How Does the Apriori Algorithm Work?

The Apriori algorithm operates on a straightforward premise: when the support value of an item set exceeds a certain threshold, it is considered a frequent item set.

 Step 1: Create a list of all the elements that appear in every transaction and create a
frequency table.

 Step 2: Set the minimum level of support. Only those elements whose support exceeds or
equals the threshold support are significant.

 Step 3: All potential pairings of important elements must be made, bearing in mind that AB
and BA are interchangeable.

 Step 4: Tally the number of times each pair appears in a transaction.



 Step 5: Only those sets of data that meet the criterion of support are significant.

 Step 6: Now, suppose you want to find a set of three things that may be bought together. A rule, known as self-join, is needed to build a three-item set: from the item pairings OP, OB, PB, and PM, combinations sharing the same first letter are joined.

1. OPB is the result of OP and OB.

2. PBM is the result of PB and PM.

 Step 7: When the threshold criterion is applied again, you'll get the significant itemset.
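The sketch below is a minimal pure-Python illustration of steps 1 to 5 for single items and pairs. The four transactions and the absolute support threshold are made up, and a full implementation (for example, the apriori function in the mlxtend library) would extend this with the self-join described in step 6.

from collections import Counter
from itertools import combinations

transactions = [
    {"O", "P", "B"},
    {"O", "P", "M"},
    {"O", "B"},
    {"P", "B", "M"},
]
min_support = 2    # minimum number of transactions (absolute support)

# Steps 1-2: frequency table of single items, pruned by the support threshold
item_counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in item_counts.items() if c >= min_support}

# Steps 3-5: candidate pairs of frequent items, counted and pruned
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t & frequent_items), 2):
        pair_counts[pair] += 1
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}

print(frequent_items)
print(frequent_pairs)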

What is Text Mining?


Text mining is a component of data mining that deals specifically with unstructured text data. It
involves the use of natural language processing (NLP) techniques to extract useful information and
insights from large amounts of unstructured text data. Text mining can be used as a preprocessing
step for data mining or as a standalone process for specific tasks.

Text Mining in Data Mining?

In data mining, text mining is mostly used to transform unstructured text data into structured data that can be used for data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.

Text mining is widely used in various fields, such as natural language processing, information
retrieval, and social media analysis. It has become an essential tool for organizations to extract
insights from unstructured text data and make data-driven decisions.

Text mining is a process of extracting useful information and nontrivial patterns from a large volume of text databases. There exist various strategies and tools to mine text and find important data for the prediction and decision-making process. Selecting the right text mining procedure helps to improve speed and reduce time complexity.
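As a small illustration of turning unstructured text into a structured form that mining algorithms can use, the sketch below builds a term-frequency matrix with scikit-learn's CountVectorizer; the two example sentences are invented.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Customers loved the fast delivery",
    "Delivery was slow and the support was poor",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)        # documents x terms count matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())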

Text Mining Applications :

 Digital Library

 Academic and Research Field

 Life Science

 Social-Media

 Business Intelligence



Advantages of Text Mining

 Large Amounts of Data: Text mining allows organizations to extract insights from large
amounts of unstructured text data.

 Variety of Applications: Text mining has a wide range of applications, including sentiment
analysis, named entity recognition, and topic modeling.

 Improved Decision Making

 Cost-effective: Text mining can be a cost-effective way to analyze large volumes of text, as it eliminates the need for manual data entry.

Disadvantages of Text Mining

 Complexity: Text mining can be a complex process requiring advanced skills in natural
language processing and machine learning.

 Quality of Data: The quality of text data can vary, affecting the accuracy of the insights
extracted from text mining.

 High Computational Cost: Text mining requires high computational resources, and it may be
difficult for smaller organizations to afford the technology.

 Limited to Text Data: Text mining is limited to extracting insights from unstructured text
data and cannot be used with other data types.

Web Mining :
Web Mining is the application of data mining techniques to automatically discover and extract information from Web documents and services. The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns.

What is Web Mining?

Web mining is a practice for sifting through the vast amount of data available on the World Wide Web to find and extract pertinent information as per requirements. One unique feature of web mining is its ability to deliver a wide range of required data types in the actual process.

Applications of Web Mining

Web mining is the process of discovering patterns, structures, and relationships in web data. It
involves using data mining techniques to analyze web data and extract valuable insights. The
applications of web mining are wide-ranging and include:



 Personalized marketing

 Fraud detection

 Customer service

 Healthcare

Advantages of Web Mining :

 Improve Business
 Increased revenue

Disadvantages of Web Mining :

 Cost
 Information Misuse

Major issues in Data Mining :

Mining Methodology and User Interaction Issues

 Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
 Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
 Handling noisy or incomplete data − If data cleaning methods are not applied, the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are of little use.

Performance Issues :

 Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.



Diverse Data Types Issues :

 Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

Naïve Bayes Classification :

Naive Bayes classifier is a probabilistic machine learning model based on Bayes’ theorem. It
assumes independence between features and calculates the probability of a given input belonging
to a particular class. It’s widely used in text classification, spam filtering, and recommendation
systems.
It is a classification technique based on Bayes’ Theorem with an independence assumption among
predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.

Naive Bayes classification is a probabilistic machine learning algorithm based on Bayes' theorem
with a strong (naive) assumption of independence between the features. Despite its simplicity, it is
surprisingly effective in many real-world classification tasks, particularly in text classification and
spam filtering. Here's a detailed explanation of how naive Bayesian classification works:
1. Understanding Bayes' Theorem:
Bayes' theorem provides a way to calculate the probability of a hypothesis (class label) given the
evidence (features), denoted as:

P(C∣X) = [P(X∣C) × P(C)] / P(X)


Where:
 P(C∣X) is the posterior probability of class C given the evidence X.
 P(X∣C) is the likelihood of observing evidence X given that class C is true.
 P(C) is the prior probability of class C.
 P(X) is the probability of observing evidence X (also called the marginal likelihood).

How does Naïve Bayes work :

1. Convert the data set into a frequency table


2. Create Likelihood table by finding the probabilities
3. Use Naive Bayesian equation to calculate the posterior probability
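A compact sketch of these steps using scikit-learn's MultinomialNB on a made-up spam-filtering example; the messages, labels, and test sentence are all illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting at noon tomorrow",
            "free offer, claim your prize", "lunch at noon?"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(messages)           # step 1: frequency table of words
clf = MultinomialNB().fit(X, labels)      # step 2: likelihoods and class priors
print(clf.predict(vec.transform(["claim your free prize"])))   # step 3: posterior via Bayes' rule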

Advantages of Naive Bayes:


 Simple and easy to implement.
 Requires a small amount of training data to estimate parameters.
 Efficient in terms of computation.



Limitations of Naive Bayes:
 Relies on the strong assumption of feature independence, which may not hold true in all
datasets.
 Often outperformed by more complex models, especially in tasks where feature interactions
are important.

Decision Tree Induction in Data Mining

 Decision tree induction is a common technique in data mining that is used to generate a
predictive model from a dataset. This technique involves constructing a tree-like structure,
where each internal node represents a test on an attribute, each branch represents the
outcome of the test, and each leaf node represents a prediction. The goal of decision tree
induction is to build a model that can accurately predict the outcome of a given event,
based on the values of the attributes in the dataset.
 To build a decision tree, the algorithm first selects the attribute that best splits the data into
distinct classes. This is typically done using a measure of impurity, such as entropy or the
Gini index, which measures the degree of disorder in the data. The algorithm then repeats
this process for each branch of the tree, splitting the data into smaller and smaller subsets
until all of the data is classified.
 Decision tree induction is a popular technique in data mining because it is easy to
understand and interpret, and it can handle both numerical and categorical data.
Additionally, decision trees can handle large amounts of data, and they can be updated with
new data as it becomes available. However, decision trees can be prone to overfitting,
where the model becomes too complex and does not generalize well to new data. As a
result, data scientists often use techniques such as pruning to simplify the tree and improve
its performance.
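A brief scikit-learn sketch of decision tree induction on the well-known Iris data set follows; limiting max_depth is one simple way to counter the overfitting mentioned above, and the parameter values are illustrative.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)          # splits chosen to minimise Gini impurity

print(export_text(tree, feature_names=list(iris.feature_names)))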



Advantages of Decision Tree Induction
1. Easy to understand and interpret: Decision trees are a visual and intuitive model that can be
easily understood by both experts and non-experts.
2. Handle both numerical and categorical data: Decision trees can handle a mix of numerical
and categorical data, which makes them suitable for many different types of datasets.
3. Can handle large amounts of data: Decision trees can handle large amounts of data and can
be updated with new data as it becomes available.
4. Can be used for both classification and regression tasks: Decision trees can be used for both
classification, where the goal is to predict a discrete outcome, and regression, where the
goal is to predict a continuous outcome.
Disadvantages of Decision Tree Induction
1. Prone to overfitting: Decision trees can become too complex and may not generalize well to
new data. This can lead to poor performance on unseen data.
2. Sensitive to small changes in the data: Decision trees can be sensitive to small changes in
the data, and a small change in the data can result in a significantly different tree.
3. Biased towards attributes with many levels: Decision trees can be biased towards
attributes with many levels, and may not perform well on attributes with a small number of
levels.

Attribute selection measures :

Attribute selection measures, also known as feature selection criteria, are methods used in
machine learning and data mining to identify the most relevant attributes (features) for predictive
modeling or classification tasks. These measures help in reducing the dimensionality of the dataset
by selecting a subset of features that contribute the most to the predictive accuracy of the model
while minimizing overfitting and computational complexity. Here's a note on attribute selection
measures:

Importance of Attribute Selection:


1. Dimensionality Reduction: High-dimensional datasets with many features can lead to
increased computational complexity and the risk of overfitting. Selecting a subset of
relevant features can help mitigate these issues.
2. Improved Model Performance: Including only the most informative features can improve
the predictive accuracy and generalization performance of the model by reducing noise and
irrelevant information.
3. Interpretability: Models built using fewer features are often easier to interpret and
understand, which is important for decision-making and gaining insights from the model.

Common Attribute Selection Measures:


1. Information Gain:
 Information gain measures the reduction in entropy or uncertainty achieved by
splitting the dataset based on a particular attribute.
 It is commonly used in decision tree induction algorithms to select the best attribute
for splitting.
2. Gain Ratio:



 Gain ratio is a variation of information gain that takes into account the intrinsic
information of an attribute.
 It penalizes attributes with many distinct values, favoring attributes that result in
homogeneous subsets.
3. Gini Index:
 Gini index measures the impurity of a dataset by calculating the probability of
misclassifying a randomly chosen element.
 It is often used in decision tree algorithms, particularly CART (Classification and
Regression Trees).
4. Chi-Square Test:
 Chi-square test measures the independence between the attributes and the class
labels by comparing observed and expected frequencies.
 It is commonly used in feature selection for categorical data.
5. Correlation-based Feature Selection:
 Correlation-based measures evaluate the linear relationship between features and
the target variable.
 Features with high correlation to the target variable are selected while minimizing
redundancy among selected features.
6. Wrapper Methods:
 Wrapper methods evaluate feature subsets by training and evaluating a model on
different subsets of features.
 Examples include forward selection, backward elimination, and recursive feature
elimination.
7. Embedded Methods:
 Embedded methods incorporate feature selection as part of the model training
process.
 Examples include Lasso (L1 regularization), which encourages sparse solutions by
penalizing the absolute magnitude of feature coefficients.
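To ground one of these measures, here is a hand-rolled computation of information gain for a single binary split; the class counts are invented for illustration.

import math

def entropy(labels):
    total = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

parent = ["yes"] * 9 + ["no"] * 5     # class distribution before the split
left   = ["yes"] * 6 + ["no"] * 1     # subset where the attribute takes value A
right  = ["yes"] * 3 + ["no"] * 4     # subset where the attribute takes value B

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted
print(round(info_gain, 3))            # reduction in entropy achieved by the split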

Attribute selection measures play a crucial role in feature engineering and model building by
identifying the most informative features for predictive modeling tasks. Choosing the appropriate
attribute selection measure depends on the nature of the data, the modeling algorithm used, and
the specific objectives of the analysis. Effective attribute selection can lead to simpler, more
interpretable models with improved performance and generalization capabilities.

Classification by Backpropagation :

Classification by backpropagation is a type of supervised learning algorithm that is used to train a neural network to classify data into different classes.
The backpropagation algorithm is based on the idea of adjusting the weights and biases of a
network in order to minimize the error between the predicted output and the actual output.
The backpropagation algorithm works by taking a set of training examples and feeding them
through the neural network.
The backpropagation algorithm is an iterative process that continues until the error is minimized or
until a predetermined number of iterations is reached.
● Backpropagation can be used to train neural networks with multiple layers, known as deep
learning networks.



● Backpropagation requires a large amount of training data and may not perform well on small
datasets. It also requires careful tuning of hyperparameters such as learning rate and batch size to
achieve good performance.
● Backpropagation is a powerful and flexible algorithm that has been used for a wide range of
classification tasks, including image recognition, natural language processing, and speech
recognition.
● Backpropagation is typically used to train feedforward neural networks, in which data flows through the network in one direction, from the input layer to the output layer.
● The input layer of a neural network typically consists of one neuron for each input feature, while
the output layer consists of one neuron for each output class.
● The number and size of hidden layers in the network can be adjusted depending on the
complexity of the data and the task at hand. However, adding more layers can increase the risk of
overfitting and may require more training data.
● Backpropagation has been applied to a wide range of classification tasks, including image
classification, speech recognition, fraud detection, and medical diagnosis. It is a powerful tool for
solving complex classification problems, but requires careful tuning and a deep understanding of
neural networks to achieve good performance.
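A compact sketch of classification by a backpropagation-trained network, using scikit-learn's MLPClassifier on the built-in digits data set; the hidden layer size, learning rate, and iteration count are illustrative choices that would normally be tuned.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=0.001,
                    max_iter=300, random_state=0)
net.fit(X_train, y_train)              # weights and biases adjusted by backpropagation
print(net.score(X_test, y_test))       # accuracy on held-out data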



Association Rule Mining :

Association rule mining is used to discover relationships between items in a dataset. An association rule is a statement of the form "If A, then B," where A and B are sets of items. The strength of an association rule is measured using two measures: support and confidence. Support measures the frequency of the occurrence of the items in the rule, and confidence measures the reliability of the rule.

Apriori algorithm is a popular algorithm for mining association rules. It is an iterative algorithm that
works by generating candidate itemsets and pruning those that do not meet the support and
confidence thresholds.

Multilevel Association Rule in data mining


Multilevel Association Rule mining is a technique that extends Association Rule mining to discover
relationships between items at different levels of granularity. Multilevel Association Rule mining
can be classified into two types: multi-dimensional Association Rule and multi-level Association
Rule.

Multi-dimensional Association Rule mining : This is used to find relationships between items in
different dimensions of a dataset. For example, in a sales dataset, multi-dimensional Association
Rule mining can be used to find relationships between products, regions, and time.

Multi-level Association Rule mining : This is used to find relationships between items at different
levels of granularity. For example, in a retail dataset, multi-level Association Rule mining can be
used to find relationships between individual items and categories of items.



Support Vector Machine(SVM) :

Support Vector Machines (SVM) are powerful supervised learning algorithms used in data mining
and machine learning for classification, regression, and outlier detection tasks. SVMs are
particularly effective in high-dimensional spaces and cases where the number of dimensions
exceeds the number of samples. Here's an overview of SVM in data mining:

Basic Concept:

SVM aims to find the hyperplane that best separates data points into different classes. In a binary
classification scenario, this hyperplane is chosen to maximize the margin, which is the distance
between the hyperplane and the nearest data points from each class, known as support vectors.

Linear SVM:

For linearly separable data, SVM finds the optimal hyperplane that separates the classes with the
maximum margin. Mathematically, this optimization problem can be formulated as a quadratic
programming problem.

Non-linear SVM:

For non-linearly separable data, SVM can employ the kernel trick, where the input data is mapped
into a higher-dimensional feature space where it becomes linearly separable. Common kernel
functions include polynomial, radial basis function (RBF), and sigmoid kernels.

Training SVM:

1. Selecting Kernel Function: Choose an appropriate kernel function based on the characteristics of the data.

2. Optimizing Parameters: Tune parameters such as the regularization parameter (C), kernel
parameters (gamma for RBF kernel, degree for polynomial kernel), and kernel coefficients.

3. Training: Solve the optimization problem to find the optimal hyperplane or decision
boundary that maximizes the margin while minimizing classification error.
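The three training steps above look roughly like this with scikit-learn's SVC; the two-moons data set stands in for non-linearly separable data, and the C and gamma values are only illustrative starting points.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # steps 1-2: kernel choice and parameters
clf.fit(X_train, y_train)                       # step 3: solve the margin-maximisation problem
print(clf.score(X_test, y_test))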

Advantages of SVM:

1. Effective in High-dimensional Spaces: SVM performs well even in cases where the number
of dimensions is greater than the number of samples.

2. Robust to Overfitting: SVM uses regularization parameters to prevent overfitting, making it robust to noisy data.

3. Works well with Non-linear Data: With the kernel trick, SVM can handle non-linearly
separable data effectively.



4. Global Optimal Solution: SVM finds the global optimal solution, ensuring better
generalization performance.

Applications of SVM in Data Mining:

1. Classification: SVM is widely used for classification tasks in various domains such as text
classification, image recognition, and bioinformatics.

2. Regression: SVM can be adapted for regression tasks to predict continuous outcomes.

3. Anomaly Detection: SVM can be used for anomaly detection by identifying data points that
deviate significantly from the majority of the data.

4. Data Preprocessing: SVM can be used for data preprocessing tasks such as feature selection
and dimensionality reduction.

