Knowledge Discovery in Databases
What is KDD?
Knowledge Discovery in Databases (KDD) is the overall process of discovering useful knowledge from large collections of data. The process typically involves steps such as data cleaning, data preprocessing, data transformation, pattern discovery, and interpretation of the discovered patterns.
KDD is commonly associated with data mining, machine learning, and artificial intelligence
techniques and is used across various fields including business, healthcare, finance, and science for
tasks such as predictive modeling, classification, clustering, and anomaly detection.
KDD in data mining is a structured, analytical approach to modelling data in a database in order to extract useful and applicable ‘knowledge’.
KDD in data mining is important for businesses and organizations because it enables them to derive new knowledge and insights from their data. Knowledge discovery in databases can help improve the customer experience, enhance decision-making, optimize operations, support strategic planning, and drive business growth.
1. Goal Setting and Domain Understanding
This is the first step in the process and requires prior understanding of the domain in which the results will be applied. This is where we decide how the transformed data and the patterns arrived at by data mining will be used to extract knowledge. This premise is extremely important: if it is set wrong, it can lead to false interpretations and negative impacts on the end user.
2. Data Selection
After setting the goals and objectives, the collected data needs to be selected and segregated into meaningful sets based on availability, accessibility, importance, and quality. These parameters are critical for data mining because they form its base and will affect what kinds of data models are built.
3. Data Cleaning and Preprocessing
This step involves searching for missing data and removing noisy, redundant, and low-quality data from the data set in order to improve its reliability and effectiveness. Specific algorithms are used to search for and eliminate unwanted data, based on attributes relevant to the application.
4. Data Transformation
This step prepares the data to be fed to the data mining algorithms. Hence, the data needs to be in consolidated and aggregated forms. The data is consolidated on the basis of functions, attributes, features, etc.
5. Data Mining
This is the root or backbone process of the whole KDD. This is where algorithms are used to extract meaningful patterns from the transformed data, which help in building prediction models. It is an analytical step that discovers trends in a data set using techniques such as artificial intelligence, advanced numerical and statistical methods, and specialised algorithms.
6. Pattern Evaluation/Interpretation
Once the trends and patterns have been obtained from the various data mining methods and iterations, these patterns need to be represented in visual forms such as bar graphs, pie charts, histograms, etc., to study the impact of the data collected and transformed during the previous steps. This also helps in evaluating the effectiveness of a particular data model for the domain.
7. Knowledge Representation and Use
This is the final step in the KDD process and requires the ‘knowledge’ extracted in the previous step to be applied to the specific application or domain in a visualised format such as tables, reports, etc. This step drives the decision-making process for the said application.
Data preprocessing is used to improve the quality of the data and of the mining results. The goal of data preprocessing is to enhance the accuracy, efficiency, and reliability of data mining algorithms.
Data preprocessing is an essential step in the knowledge discovery process, because quality decisions must be based on quality data. Data preprocessing involves Data Cleaning, Data Integration, Data Reduction, and Data Transformation.
In the real world, many databases and data warehouses contain noisy, missing, and inconsistent data due to their huge size. Low-quality data leads to low-quality data mining results.
Noisy: containing errors or outliers, e.g., Salary = “-10”. Noisy data may come from:
• Human or computer error at data entry.
Missing (incomplete): lacking certain attribute values or containing only aggregate data, e.g., Occupation = “”. Missing data may come from:
• Human/hardware/software problems.
Inconsistent: different versions of the same data appear in different places. For example, the ZIP code is saved in one table in the format 1234-567, while in another table it may be represented as 1234567. Inconsistent data may come from:
• Different data sources.
1. Data Cleaning :
Data cleaning is a process that "cleans" the data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies in the data. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied. Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
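A minimal sketch of such cleaning routines, assuming pandas; the column names (occupation, salary) and the fill strategies are hypothetical choices, not a prescribed method:

    import pandas as pd

    # Hypothetical dirty data: a missing occupation and a noisy (negative) salary.
    df = pd.DataFrame({
        "occupation": ["engineer", None, "teacher", "teacher"],
        "salary": [52000, 48000, -10, 51000],
    })

    # Fill missing values with a placeholder (or a statistic such as the mode).
    df["occupation"] = df["occupation"].fillna("unknown")

    # Treat impossible values as noise: replace negative salaries with the
    # median of the valid entries.
    valid_median = df.loc[df["salary"] > 0, "salary"].median()
    df.loc[df["salary"] <= 0, "salary"] = valid_median

    # Remove exact duplicate records.
    df = df.drop_duplicates()

    print(df)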
2. Data Integration :
Data integration is the process of combining data from multiple sources into a single, unified view. This process involves identifying and accessing the different data sources and mapping the data to a common format. The different data sources may include multiple data cubes, databases, or flat files.
The goal of data integration is to make it easier to access and analyze data that is spread across
multiple systems or platforms, in order to gain a more complete and accurate understanding of
the data.
Data integration strategy is typically described using a triple (G, S, M) approach, where G
denotes the global schema, S denotes the schema of the heterogeneous data sources, and M
represents the mapping between the queries of the source and global schema.
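As a rough sketch of the idea (not the formal (G, S, M) machinery), the example below maps two hypothetical sources onto a common key and joins them into one unified view using pandas; all table and column names are invented for illustration:

    import pandas as pd

    # Source S1: customer master data from one database.
    customers = pd.DataFrame({
        "cust_id": [1, 2, 3],
        "name": ["Asha", "Ben", "Chen"],
    })

    # Source S2: transactions from another system, using a different key name.
    orders = pd.DataFrame({
        "customer": [1, 1, 3],
        "amount": [250.0, 90.5, 430.0],
    })

    # rename() plays the role of the mapping M from a source schema S onto the
    # global schema G; merge() then produces the unified view.
    orders = orders.rename(columns={"customer": "cust_id"})
    unified = customers.merge(orders, on="cust_id", how="left")

    print(unified)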
3. Data Reduction :
Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That is,
mining on the reduced data set should be more efficient yet produce the same (or almost the
same) analytical results.
In simple words, data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the most important information. This can be beneficial in situations where the dataset is too large to be processed efficiently, or where it contains a large amount of irrelevant or redundant information.
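The text does not prescribe a particular reduction technique; as one hedged example, the sketch below uses principal component analysis (PCA) from scikit-learn to shrink a synthetic 50-attribute dataset down to 5 components while keeping most of its variance:

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical high-dimensional data: 200 records with 50 attributes.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))

    # Project onto 5 principal components.
    pca = PCA(n_components=5)
    X_reduced = pca.fit_transform(X)

    print(X.shape, "->", X_reduced.shape)        # (200, 50) -> (200, 5)
    print(pca.explained_variance_ratio_.sum())   # fraction of variance retained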
4. Data Transformation :
Data transformation in data mining refers to the process of converting raw data into a format
that is suitable for analysis and modelling. The goal of data transformation is to prepare the
data for data mining so that it can be used to extract useful insights and knowledge.
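One common transformation is normalization. The sketch below is a minimal example of min-max scaling with scikit-learn; the attribute values (an age-like and a salary-like column) are hypothetical:

    from sklearn.preprocessing import MinMaxScaler

    # Hypothetical raw attribute values on very different scales.
    X = [[20, 30000], [35, 58000], [50, 91000]]

    # Min-max normalization rescales each attribute to [0, 1]:
    # x' = (x - min) / (max - min)
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    print(X_scaled)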
What is the Apriori Algorithm? How does the Apriori Algorithm Work in Data Mining?
The Apriori algorithm is used to calculate association rules between objects, i.e., how two or more objects are related to one another. In other words, the Apriori algorithm is an association rule learning method that finds patterns such as: people who bought product A also bought product B.
The Apriori algorithm mines frequent item sets and the relevant association rules. Generally, it operates on a database containing a huge number of transactions, for example, the items customers buy at a Big Bazar.
The algorithm relies on three measures:
1. Support
2. Confidence
3. Lift
As discussed above, you need a huge database containing a large number of transactions. Suppose you have 4,000 customer transactions in a Big Bazar, and you have to calculate the Support, Confidence, and Lift for two products, say Biscuits and Chocolate, because customers frequently buy these two items together.
Support :
Support refers to the default popularity of any product. It is found by dividing the number of transactions containing that product by the total number of transactions. Hence, if 400 of the 4,000 transactions contain Biscuits, we get
Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions) = 400/4000 = 10 percent.
Confidence :
Confidence refers to the likelihood that customers who bought Biscuits also bought Chocolate. You find it by dividing the number of transactions containing both Biscuits and Chocolate by the number of transactions containing Biscuits.
Hence,
Confidence = (Transactions containing both Biscuits and Chocolate) / (Transactions containing Biscuits)
Lift :
Continuing the above example, lift refers to the increase in the ratio of the sale of Chocolate when you sell Biscuits. It is given by
Lift (Biscuits → Chocolate) = Confidence (Biscuits → Chocolate) / Support (Chocolate).
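A small sketch that computes the three measures for the rule Biscuits → Chocolate; the handful of transactions is purely illustrative, not real Big Bazar data:

    # Hypothetical transaction list.
    transactions = [
        {"biscuits", "chocolate", "milk"},
        {"biscuits", "bread"},
        {"chocolate", "bread"},
        {"biscuits", "chocolate"},
        {"milk"},
    ]

    n = len(transactions)
    with_biscuits = sum(1 for t in transactions if "biscuits" in t)
    with_both = sum(1 for t in transactions if {"biscuits", "chocolate"} <= t)
    with_chocolate = sum(1 for t in transactions if "chocolate" in t)

    support_biscuits = with_biscuits / n        # popularity of Biscuits
    confidence = with_both / with_biscuits      # P(Chocolate | Biscuits)
    lift = confidence / (with_chocolate / n)    # > 1 means a positive association

    print(support_biscuits, confidence, lift)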
The Apriori algorithm operates on a straightforward premise: when the support value of an item set exceeds a certain threshold, it is considered a frequent item set.
Step 1: Create a list of all the elements that appear in every transaction and create a
frequency table.
Step 2: Set the minimum level of support. Only those elements whose support exceeds or
equals the threshold support are significant.
Step 3: All potential pairings of important elements must be made, bearing in mind that AB
and BA are interchangeable.
Step 6: Now, suppose you want to find a set of three items that may be bought together. A rule known as self-join is used to build the three-item sets: from the frequent pairs OP, OB, PB, and PM, two pairs that share a common item are joined.
Step 7: When the threshold criterion is applied again, you'll get the significant itemset.
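The sketch below is a compact, illustrative implementation of this generate-and-prune loop. The transactions (onion, potato, burger, milk) and the minimum support count of 2 are assumptions chosen so that the frequent pairs match the OP, OB, PB, PM example above:

    # Hypothetical transactions and minimum support count (threshold).
    transactions = [
        {"onion", "potato", "burger"},
        {"potato", "burger", "milk"},
        {"onion", "potato", "milk"},
        {"onion", "burger"},
    ]
    min_support = 2

    def support_count(itemset):
        """Number of transactions containing every item of the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    # Steps 1-2: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items if support_count({i}) >= min_support}]

    # Steps 3-7: self-join the frequent k-itemsets into (k+1)-candidates, then prune.
    k = 1
    while frequent[-1]:
        candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
        frequent.append({c for c in candidates if support_count(c) >= min_support})
        k += 1

    for level, sets in enumerate(frequent, start=1):
        if sets:
            print(f"frequent {level}-itemsets:", [sorted(s) for s in sets])

Association rules are then generated from these frequent item sets and filtered by confidence (that step is not shown here).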
Text mining in data mining is mostly used to transform unstructured text data into structured data that can be used for data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.
Text mining is widely used in various fields, such as natural language processing, information
retrieval, and social media analysis. It has become an essential tool for organizations to extract
insights from unstructured text data and make data-driven decisions.
Text mining is the process of extracting useful information and nontrivial patterns from a large volume of text databases. Various strategies and tools exist to mine the text and find important data for prediction and decision-making. Selecting the right text mining procedure helps to improve speed and reduce time complexity.
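As a hedged illustration of turning unstructured text into structured data, the sketch below builds a TF-IDF document-term matrix with scikit-learn and feeds it to a clustering step; the documents and the choice of two clusters are assumptions made only for the example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Hypothetical unstructured documents (e.g., customer feedback).
    docs = [
        "delivery was late and the package was damaged",
        "great product, fast delivery",
        "support team resolved my billing issue quickly",
        "billing was wrong twice, poor support",
    ]

    # Turn free text into a structured document-term matrix.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # The structured matrix can now feed ordinary mining tasks, e.g. clustering.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)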
Applications of Text Mining:
Digital Library
Social Media
Business Intelligence
Advantages of Text Mining:
Large Amounts of Data: Text mining allows organizations to extract insights from large amounts of unstructured text data.
Variety of Applications: Text mining has a wide range of applications, including sentiment
analysis, named entity recognition, and topic modeling.
Cost-effective: Text mining can be a cost-effective way to analyze text at scale, as it eliminates the need for manual data entry.
Limitations of Text Mining:
Complexity: Text mining can be a complex process requiring advanced skills in natural language processing and machine learning.
Quality of Data: The quality of text data can vary, affecting the accuracy of the insights
extracted from text mining.
High Computational Cost: Text mining requires high computational resources, and it may be
difficult for smaller organizations to afford the technology.
Limited to Text Data: Text mining is limited to extracting insights from unstructured text
data and cannot be used with other data types.
Web Mining :
Web Mining is the application of data mining techniques to automatically discover and extract information from Web documents and services. The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns.
Web mining is well suited to sifting through the vast amount of data available on the World Wide Web to find and extract pertinent information as per requirements. One notable feature of web mining is its ability to deliver a wide range of required data types within the same process.
Web mining is the process of discovering patterns, structures, and relationships in web data. It
involves using data mining techniques to analyze web data and extract valuable insights. The
applications of web mining are wide-ranging and include:
Fraud detection
Customer service
Healthcare
Improve Business
Increased revenue
Challenges of web mining include:
Cost
Information Misuse
Performance Issues :
Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.
Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
Naive Bayes classifier is a probabilistic machine learning model based on Bayes’ theorem. It
assumes independence between features and calculates the probability of a given input belonging
to a particular class. It’s widely used in text classification, spam filtering, and recommendation
systems.
It is a classification technique based on Bayes’ Theorem with an independence assumption among
predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.
Naive Bayes classification is a probabilistic machine learning algorithm based on Bayes' theorem
with a strong (naive) assumption of independence between the features. Despite its simplicity, it is
surprisingly effective in many real-world classification tasks, particularly in text classification and
spam filtering. Here's a detailed explanation of how naive Bayesian classification works:
1. Understanding Bayes' Theorem:
Bayes' theorem provides a way to calculate the probability of a hypothesis (class label) given the evidence (features), denoted as:
P(C | X) = P(X | C) × P(C) / P(X)
where C is the class label, X is the feature vector, P(C) is the prior probability of the class, P(X | C) is the likelihood, and P(X) is the probability of the evidence.
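A minimal from-scratch sketch of this idea for a toy spam/ham task; the training documents, the add-one smoothing, and the bag-of-words representation are illustrative assumptions, not a prescribed implementation:

    from collections import Counter, defaultdict

    # Hypothetical training data: word lists labelled spam / ham.
    train = [
        (["win", "money", "now"], "spam"),
        (["limited", "offer", "money"], "spam"),
        (["meeting", "schedule", "monday"], "ham"),
        (["project", "meeting", "notes"], "ham"),
    ]

    # Estimate the prior P(C) and the likelihoods P(word | C) with add-one smoothing.
    class_counts = Counter(label for _, label in train)
    word_counts = defaultdict(Counter)
    for words, label in train:
        word_counts[label].update(words)
    vocab = {w for words, _ in train for w in words}

    def posterior(words, label):
        # P(C | X) is proportional to P(C) * product of P(word | C)  (naive independence).
        p = class_counts[label] / len(train)
        total = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            p *= (word_counts[label][w] + 1) / total
        return p

    test = ["money", "offer"]
    scores = {c: posterior(test, c) for c in class_counts}
    print(max(scores, key=scores.get), scores)   # expected: "spam"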
Decision tree induction is a common technique in data mining that is used to generate a
predictive model from a dataset. This technique involves constructing a tree-like structure,
where each internal node represents a test on an attribute, each branch represents the
outcome of the test, and each leaf node represents a prediction. The goal of decision tree
induction is to build a model that can accurately predict the outcome of a given event,
based on the values of the attributes in the dataset.
To build a decision tree, the algorithm first selects the attribute that best splits the data into
distinct classes. This is typically done using a measure of impurity, such as entropy or the
Gini index, which measures the degree of disorder in the data. The algorithm then repeats
this process for each branch of the tree, splitting the data into smaller and smaller subsets
until all of the data is classified.
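To make the impurity idea concrete, here is a small sketch that computes entropy and the information gain of one candidate split; the class labels and the split itself are invented for illustration:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(parent_labels, split_subsets):
        """Entropy reduction achieved by splitting the parent node."""
        n = len(parent_labels)
        weighted = sum(len(s) / n * entropy(s) for s in split_subsets)
        return entropy(parent_labels) - weighted

    # Hypothetical node: 5 "yes" and 5 "no" examples, split by a candidate attribute.
    parent = ["yes"] * 5 + ["no"] * 5
    left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4

    print(entropy(parent))                          # 1.0 (maximum disorder)
    print(information_gain(parent, [left, right]))  # about 0.278

The attribute with the highest gain (or the lowest Gini index) is chosen for the split, and the procedure recurses on each branch.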
Decision tree induction is a popular technique in data mining because it is easy to
understand and interpret, and it can handle both numerical and categorical data.
Additionally, decision trees can handle large amounts of data, and they can be updated with
new data as it becomes available. However, decision trees can be prone to overfitting,
where the model becomes too complex and does not generalize well to new data. As a
result, data scientists often use techniques such as pruning to simplify the tree and improve
its performance.
Attribute selection measures, also known as feature selection criteria, are methods used in
machine learning and data mining to identify the most relevant attributes (features) for predictive
modeling or classification tasks. These measures help in reducing the dimensionality of the dataset
by selecting a subset of features that contribute the most to the predictive accuracy of the model
while minimizing overfitting and computational complexity. Here's a note on attribute selection
measures:
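As one concrete example of such a measure, the sketch below scores attributes by mutual information with the class and keeps the best ones, using scikit-learn; the synthetic dataset and the choice of k=3 are assumptions made only for illustration:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    # Synthetic data: 10 attributes, only 3 of which are informative.
    X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                               n_redundant=0, random_state=0)

    # Score every attribute and keep the 3 with the highest mutual information.
    selector = SelectKBest(score_func=mutual_info_classif, k=3)
    X_selected = selector.fit_transform(X, y)

    print(np.round(selector.scores_, 3))       # relevance score per attribute
    print(selector.get_support(indices=True))  # indices of the selected attributes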
Apriori algorithm is a popular algorithm for mining association rules. It is an iterative algorithm that
works by generating candidate itemsets and pruning those that do not meet the support and
confidence thresholds.
Multi-dimensional Association Rule mining : This is used to find relationships between items in
different dimensions of a dataset. For example, in a sales dataset, multi-dimensional Association
Rule mining can be used to find relationships between products, regions, and time.
Multi-level Association Rule mining : This is used to find relationships between items at different
levels of granularity. For example, in a retail dataset, multi-level Association Rule mining can be
used to find relationships between individual items and categories of items.
Support Vector Machines (SVM) are powerful supervised learning algorithms used in data mining
and machine learning for classification, regression, and outlier detection tasks. SVMs are
particularly effective in high-dimensional spaces and cases where the number of dimensions
exceeds the number of samples. Here's an overview of SVM in data mining:
Basic Concept:
SVM aims to find the hyperplane that best separates data points into different classes. In a binary
classification scenario, this hyperplane is chosen to maximize the margin, which is the distance
between the hyperplane and the nearest data points from each class, known as support vectors.
Linear SVM:
For linearly separable data, SVM finds the optimal hyperplane that separates the classes with the
maximum margin. Mathematically, this optimization problem can be formulated as a quadratic
programming problem.
Non-linear SVM:
For non-linearly separable data, SVM can employ the kernel trick, where the input data is mapped
into a higher-dimensional feature space where it becomes linearly separable. Common kernel
functions include polynomial, radial basis function (RBF), and sigmoid kernels.
Training SVM:
2. Optimizing Parameters: Tune parameters such as the regularization parameter (C), kernel
parameters (gamma for RBF kernel, degree for polynomial kernel), and kernel coefficients.
3. Training: Solve the optimization problem to find the optimal hyperplane or decision
boundary that maximizes the margin while minimizing classification error.
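As a hedged illustration of these steps, the sketch below trains an RBF-kernel SVM with scikit-learn on synthetic, non-linearly separable data; the fixed values of C and gamma stand in for the tuning step described above and are not recommended settings:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Synthetic, non-linearly separable data (two interleaving half-moons).
    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # RBF-kernel SVM; C is the regularization parameter and gamma the kernel
    # parameter (values chosen only for illustration).
    clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
    clf.fit(X_train, y_train)

    print("support vectors per class:", clf.n_support_)
    print("test accuracy:", clf.score(X_test, y_test))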
Advantages of SVM:
1. Effective in High-dimensional Spaces: SVM performs well even in cases where the number
of dimensions is greater than the number of samples.
3. Works well with Non-linear Data: With the kernel trick, SVM can handle non-linearly
separable data effectively.
Applications of SVM:
1. Classification: SVM is widely used for classification tasks in various domains such as text classification, image recognition, and bioinformatics.
2. Regression: SVM can be adapted for regression tasks to predict continuous outcomes.
3. Anomaly Detection: SVM can be used for anomaly detection by identifying data points that
deviate significantly from the majority of the data.
4. Data Preprocessing: SVM can be used for data preprocessing tasks such as feature selection
and dimensionality reduction.