Module 2 Introduction To Data Mining
1. The set of task-relevant data to be mined: This specifies the portions of the database
or the set of data in which the user is interested. This includes the database attributes
or data warehouse dimensions of interest (referred to as the relevant attributes or
dimensions).
2. The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
3. The background knowledge to be used in the discovery process: This knowledge
about the domain to be mined is useful for guiding the knowledge discovery process
and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. An example of a concept hierarchy for the attribute (or dimension) age is shown in Figure 1.2. User beliefs regarding relationships in the data are another form of background knowledge.
4. The interestingness measures and thresholds for pattern evaluation: These may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting (a brief code sketch after this list illustrates support and confidence).
5. The expected representation for visualizing the discovered patterns: This refers to the
form in which discovered patterns are to be displayed, which may include rules,
tables, charts, graphs, decision trees, and cubes.
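As a rough illustration of primitive 4 above, the following Python sketch uses a small, made-up list of market-basket transactions and made-up thresholds to compute the support and confidence of one candidate association rule and check it against user-specified thresholds.

# Illustrative only: tiny, made-up transaction data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimated P(consequent | antecedent) over the transactions.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Candidate rule: {bread} -> {milk}, with user-specified thresholds.
min_support, min_confidence = 0.4, 0.6
s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
print(f"support={s:.2f}, confidence={c:.2f}, interesting={s >= min_support and c >= min_confidence}")

A rule that falls below either threshold would be discarded as uninteresting, exactly as described in primitive 4.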
2.1.2 Data Mining Architecture
What is Data Mining Architecture?
• Data mining is the process of selecting, exploring, and modelling large amounts of data to discover previously unknown regularities or relationships that yield clear and valuable findings for the data owner; it involves exploring and analysing large amounts of data, using automated or semi-automated processes, to identify useful patterns. The data mining architecture describes how the components of a data mining system are organised to support this process.
• The primary components of any data mining system are the Data source, data
warehouse server, data mining engine, pattern assessment module, graphical user
interface, and knowledge base.
Basic Working:
• When a user requests data mining queries, these requests are sent to data mining
engines for pattern analysis.
• These software applications use the existing database to try to discover a solution to
the query.
• The retrieved data and metadata are then passed to the data mining engine for suitable processing, which may interact with the pattern evaluation modules to determine the outcome.
• The result is finally delivered to the front end in a user-friendly format via an
appropriate interface.
Components Of Data Mining Architecture
• Data Sources
• Database Server
• Data Mining Engine
• Pattern Evaluation Modules
• Graphic User Interface
• Knowledge Base
Data Sources
• These sources provide the data in plain text, spreadsheets, or other media such as images or videos. Data sources include databases, the World Wide Web (WWW), and data warehouses.
Database Server
• The real data is stored on the database server and is ready to be processed. Its job is to
handle data retrieval in response to the user's request.
Data Mining Engine:
• It is one of the most important parts of the data mining architecture since it conducts
many data mining techniques such as association, classification, characterisation,
clustering, prediction, and so on.
Pattern Evaluation Modules:
• They are responsible for identifying intriguing patterns in data and, on occasion,
interacting with database servers to provide the results of user queries.
Graphic User Interface:
• Because the user cannot completely comprehend the complexities of the data mining
process, a graphical user interface assists the user in efficiently communicating with
the data mining system.
Knowledge Base:
• The Knowledge Base is an essential component of the data mining engine that aids in the search for outcome patterns. Occasionally, the knowledge base may also provide input to the data mining engine. This knowledge base might include information gleaned from user interactions. The knowledge base's goal is to improve the accuracy and reliability of the outcome.
Or
The significant components of data mining systems are a data source, data mining engine, data
warehouse server, the pattern evaluation module, graphical user interface, and knowledge base.
Data Source:
The actual source of data is the Database, data warehouse, World Wide Web (WWW), text
files, and other documents. You need a huge amount of historical data for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data
warehouses may comprise one or more databases, text files, spreadsheets, or other repositories
of data. Sometimes, even plain text files or spreadsheets may contain information. Another
primary source of data is the World Wide Web or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different
formats, it cannot be used directly for the data mining procedure because it may be incomplete or inaccurate. So, the data first needs to be cleaned and integrated. More data than needed is usually collected from the various sources, and only the data of interest has to be selected and passed to the server. These procedures are not as simple as they may seem; several methods may be applied to the data as part of selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server contains the original data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data based on the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several
modules for operating data mining tasks, including association, characterization, classification,
clustering, prediction, time-series analysis, etc.
In other words, we can say the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, typically by using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.
This module commonly employs interestingness measures that cooperate with the data mining modules to focus the search towards interesting patterns. It might use an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on how the data mining techniques are implemented. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure so as to confine the search to only the interesting patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and
the user. This module helps the user to easily and efficiently use the system without knowing
the complexity of the process. This module cooperates with the data mining system when the
user specifies a query or a task and displays the results.
Knowledge Base:
The knowledge base is helpful throughout the data mining process. It may be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.
2.1.3 KDD process
The KDD (Knowledge Discovery in Databases) process in data mining is a multi-step process that involves various stages to extract useful knowledge from large datasets. The following are the main steps involved in the KDD process -
• Data Selection - The first step in the KDD process is identifying and selecting the
relevant data for analysis. This involves choosing the relevant data sources, such as
databases, data warehouses, and data streams, and determining which data is required
for the analysis.
• Data Preprocessing - After selecting the data, the next step is data preprocessing.
This step involves cleaning the data, removing outliers, and removing missing,
inconsistent, or irrelevant data. This step is critical, as the data quality can
significantly impact the accuracy and effectiveness of the analysis.
• Data Transformation - Once the data is preprocessed, the next step is to transform it
into a format that data mining techniques can analyze. This step involves reducing the
data dimensionality, aggregating the data, normalizing it, and discretizing it to prepare
it for further analysis.
• Data Mining - This is the heart of the KDD process and involves applying various
data mining techniques to the transformed data to discover hidden patterns, trends,
relationships, and insights. A few of the most common data mining techniques include
clustering, classification, association rule mining, and anomaly detection.
• Pattern Evaluation - After the data mining, the next step is to evaluate the
discovered patterns to determine their usefulness and relevance. This involves
assessing the quality of the patterns, evaluating their significance, and selecting the
most promising patterns for further analysis.
• Knowledge Representation - This step involves representing the knowledge
extracted from the data in a way humans can easily understand and use. This can be
done through visualizations, reports, or other forms of communication that provide
meaningful insights into the data.
• Deployment - The final step in the KDD process is to deploy the knowledge and
insights gained from the data mining process to practical applications. This involves
integrating the knowledge into decision-making processes or other applications to
improve organizational efficiency and effectiveness.
In summary, the KDD process in data mining involves several steps to extract useful
knowledge from large datasets. It is a comprehensive and iterative process that requires
careful consideration of each step to ensure the accuracy and effectiveness of the analysis.
The various steps involved in the KDD process in data mining are illustrated in the diagram below; a rough code sketch of the same pipeline is also given below.
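As a rough, end-to-end illustration of the KDD pipeline described above, the sketch below assumes pandas and scikit-learn are available; the file name, column names, and the choice of k-means as the mining step are purely illustrative and not part of the KDD definition.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1. Data selection: load only the attributes of interest (file and columns are hypothetical).
data = pd.read_csv("customers.csv", usecols=["age", "income", "spending_score"])

# 2. Data preprocessing: drop rows with missing or clearly invalid values.
data = data.dropna()
data = data[data["age"].between(0, 120)]

# 3. Data transformation: normalize the features so they are comparable.
X = StandardScaler().fit_transform(data)

# 4. Data mining: discover groups of similar records with k-means clustering.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 5. Pattern evaluation: measure how well separated the discovered clusters are.
print("silhouette score:", silhouette_score(X, labels))

# 6. Knowledge representation / deployment: summarise each cluster for reporting.
print(data.assign(cluster=labels).groupby("cluster").mean())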
2.1.4 Issues in Data Mining
Data mining is not an easy task, as the algorithms used can get very complex and the data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also give rise to several issues, the major ones of which are discussed below.
Diverse Data Types Issues
• Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
A short summary of the major issue categories is given below.
1. Methodology Issues
Methodology-related data mining issues encompass challenges related to the choice and application of mining algorithms and techniques. Selecting the right method for a specific dataset and problem can be daunting. Moreover, issues like overfitting, bias, and the need for interpretability often arise, making it crucial to strike a balance between model complexity and accuracy.
2. Performance Issues
Performance-related data mining issues revolve around scalability, efficiency, and handling
large datasets. As data volumes continue to grow exponentially, it becomes essential to
develop algorithms and infrastructure capable of processing and analyzing data promptly.
Performance bottlenecks can hinder the practical application of data mining techniques.
3. Diverse Data Types Issues
The diverse data types issues highlight the complexity of dealing with heterogeneous data sources. Data mining often involves integrating data from various formats, such as text, images, and structured databases. Each data type presents unique challenges in terms of preprocessing, feature extraction, and modelling, requiring specialized approaches and tools to tackle these complexities effectively.
2.1.5 Types of Attributes
Attribute:
An attribute can be seen as a data field that represents the characteristics or features of a data object. For a customer object, attributes can be customer ID, address, etc. A set of attributes used to describe a given object is known as an attribute vector or feature vector.
Types of attributes:
Identifying attribute types is the first step of data preprocessing: we differentiate between the different types of attributes and then preprocess the data accordingly. The attribute types are:
• Qualitative (Nominal (N), Ordinal (O), Binary (B))
• Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes – related to names: The values of a nominal attribute are names of things or some kind of symbols. Values of nominal attributes represent some category or state, which is why nominal attributes are also referred to as categorical attributes; there is no order (rank, position) among the values of a nominal attribute.
Example: hair colour (black, brown, grey), occupation (teacher, doctor, farmer), marital status (single, married, divorced).
2. Binary Attributes: Binary data has only 2 values/states, for example yes or no, affected or unaffected, true or false.
Symmetric: Both values are equally important (e.g., gender).
Asymmetric: The two values are not equally important (e.g., a test result, where the positive outcome carries more weight).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude of the difference between values is not actually known; the order shows what is more important but does not indicate by how much.
Example: grade (A, B, C, D), size (small, medium, large).
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative; it is measurable and is represented by integer or real values.
2. Discrete: Discrete data have a finite or countably infinite set of values; they can be numerical or categorical.
Example: the number of courses a student registers for (0, 1, 2, ...), or zip codes.
3. Continuous: Continuous data can take any value within a range and are typically represented as real numbers.
Example: height, weight, temperature.
Data Visualization
Data visualizations are common in your everyday life, and they typically appear in the form of graphs and charts. A combination of multiple visualizations and pieces of information is referred to as an infographic.
Data visualizations are used to discover unknown facts and trends. You can see visualizations
in the form of line charts to display change over time. Bar and column charts are useful for
observing relationships and making comparisons. A pie chart is a great way to show parts-of-
a-whole. And maps are the best way to share geographical data visually.
Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel spreadsheets, displaying the data in more sophisticated forms such as dials and gauges, geographic maps, heat maps, pie charts, and fever charts.
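The short sketch below, which assumes matplotlib is available and uses small made-up numbers, shows the chart types mentioned above: a line chart for change over time, a bar chart for comparisons, and a pie chart for parts of a whole.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 160]        # made-up monthly sales
regions = ["North", "South", "East"]
revenue = [300, 220, 180]           # made-up revenue per region

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Line chart: change over time.
axes[0].plot(months, sales, marker="o")
axes[0].set_title("Sales over time")

# Bar chart: comparison across categories.
axes[1].bar(regions, revenue)
axes[1].set_title("Revenue by region")

# Pie chart: parts of a whole.
axes[2].pie(revenue, labels=regions, autopct="%1.0f%%")
axes[2].set_title("Revenue share")

plt.tight_layout()
plt.show()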
2.1.9 Data Preprocessing:
Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, it is not always the case that we come across clean and formatted data. While doing any operation with data, it is essential to clean it and put it in a formatted way. For this, we use data preprocessing tasks.
Data Cleaning
Data Cleaning uses methods to handle incorrect, incomplete, inconsistent, or missing values.
Some of the techniques for Data Cleaning include -
• Handling Missing Values
o Input data can contain missing or NULL values, which must be handled
before applying any Machine Learning or Data Mining techniques.
o Missing values can be handled by many techniques, such as removing
rows/columns containing NULL values and imputing NULL values using
mean, mode, regression, etc.
• De-noising
o De-noising is a process of removing noise from the data. Noisy data is
meaningless data that is not interpretable or understandable by machines or
humans. It can occur due to data entry errors, faulty data collection, etc.
o De-noising can be performed by applying many techniques, such as binning the features, using regression to smooth the features and reduce noise, clustering to detect outliers, etc. (a short sketch after this list illustrates missing-value handling and binning).
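A minimal sketch of the two cleaning steps above, assuming pandas is available; the column names and values are made up. A missing age is imputed with the column mean, and a noisy income column is smoothed by equal-width binning, replacing each value with its bin mean.

import pandas as pd

# Made-up data with one missing value and one noisy reading.
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 38, 29],
    "income": [2100, 2500, 2300, 9000, 2600, 2200],   # 9000 looks noisy
})

# Handling missing values: impute the NULL age with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# De-noising by binning: split income into 3 equal-width bins and
# replace each value with the mean of its bin (smoothing by bin means).
bins = pd.cut(df["income"], bins=3)
df["income_smoothed"] = df.groupby(bins)["income"].transform("mean")

print(df)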
Data Integration
Data Integration can be defined as combining data from multiple sources. A few of the issues
to be considered during Data Integration include the following -
• Entity Identification Problem - It can be defined as identifying objects/features from multiple databases that correspond to the same entity. For example, customer_id in database A and customer_number in database B may refer to the same entity.
• Schema Integration - It is used to merge two or more database schemas/metadata into a single schema. It essentially takes two or more schemas as input and determines a mapping between them. For example, the entity type CUSTOMER in one schema may be named CLIENT in another schema.
• Detecting and Resolving Data Value Conflicts - The data can be stored in various ways in different databases, and this needs to be taken care of while integrating them into a single dataset. For example, dates can be stored in various formats such as DD/MM/YYYY, YYYY/MM/DD, or MM/DD/YYYY (a small sketch after this list shows how such formats can be unified).
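A minimal sketch of the integration issues above, assuming pandas is available; the table and column names are made up. customer_number is mapped onto customer_id (entity identification), and two different date formats are converted into one consistent representation (resolving data value conflicts).

import pandas as pd

# Two made-up sources describing the same kind of entity.
source_a = pd.DataFrame({"customer_id": [1, 2],
                         "signup_date": ["31/01/2023", "15/02/2023"]})   # DD/MM/YYYY
source_b = pd.DataFrame({"customer_number": [3, 4],
                         "signup_date": ["2023-03-01", "2023-04-20"]})   # YYYY-MM-DD

# Entity identification: the two columns refer to the same entity.
source_b = source_b.rename(columns={"customer_number": "customer_id"})

# Resolving data value conflicts: parse each source's date format explicitly
# so both end up in one consistent representation.
source_a["signup_date"] = pd.to_datetime(source_a["signup_date"], format="%d/%m/%Y")
source_b["signup_date"] = pd.to_datetime(source_b["signup_date"], format="%Y-%m-%d")

# Data integration: combine into a single dataset.
print(pd.concat([source_a, source_b], ignore_index=True))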
Data Reduction
Data Reduction is used to reduce the volume or size of the input data. Its main objective is to
reduce storage and analysis costs and improve storage efficiency. A few of the popular
techniques to perform Data Reduction include -
• Dimensionality Reduction - It is the process of reducing the number of features in
the input dataset. It can be performed in various ways, such as selecting features with
the highest importance, Principal Component Analysis (PCA), etc.
• Numerosity Reduction - In this method, various techniques can be applied to reduce
the volume of data by choosing alternative smaller representations of the data. For
example, a variable can be approximated by a regression model, and instead of storing
the entire variable, we can store the regression model to approximate it.
• Data Compression - In this method, data is compressed. Data compression can be lossless or lossy, depending on whether any information is lost during compression (a dimensionality-reduction sketch follows this list).
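A minimal dimensionality-reduction sketch, assuming scikit-learn and NumPy are available; the dataset is randomly generated so that its four features are correlated, which is when PCA can represent them with fewer components.

import numpy as np
from sklearn.decomposition import PCA

# Made-up dataset: 6 records with 4 correlated features.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base + rng.normal(scale=0.05, size=(6, 2))])

# Dimensionality reduction: keep 2 principal components instead of 4 features.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("original shape:", X.shape)                 # (6, 4)
print("reduced shape:", X_reduced.shape)          # (6, 2)
print("variance explained:", pca.explained_variance_ratio_.sum())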
Data Transformation
Data Transformation is a process of converting data into a format that helps in building
efficient ML models and deriving better insights. A few of the most common methods for
Data Transformation include -
• Smoothing - Data Smoothing is used to remove noise in the dataset, and it helps
identify important features and detect patterns. Therefore, it can help in predicting
trends or future events.
• Aggregation - Data Aggregation is the process of transforming large volumes of
data into an organized and summarized format that is more understandable and
comprehensive. For example, a company may look at monthly sales data of a product
instead of raw sales data to understand its performance better and forecast future
sales.
• Discretization - Data Discretization is a process of converting numerical or continuous variables into a set of intervals/bins. This makes data easier to analyze. For example, the age feature can be converted into intervals such as (0-10, 11-20, ...) or (child, young, ...).
• Normalization - Data Normalization is a process of converting a numeric variable into a specified range such as [-1, 1] or [0, 1]. A few of the most common approaches to performing normalization are Min-Max Normalization, Data Standardization or Data Scaling, etc. (a short normalization sketch follows this list).
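A minimal normalization sketch using only NumPy; the income values are made up. It shows min-max normalization into [0, 1] and z-score standardization, two of the approaches named above.

import numpy as np

# Made-up numeric feature (e.g., annual income in thousands).
income = np.array([15.0, 22.0, 48.0, 31.0, 95.0])

# Min-Max normalization: rescale values into the range [0, 1].
income_minmax = (income - income.min()) / (income.max() - income.min())

# Data standardization (z-score scaling): zero mean, unit variance.
income_zscore = (income - income.mean()) / income.std()

print(income_minmax)
print(income_zscore)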
Data reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount
of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining,
including:
1. Data Sampling: This technique involves selecting a subset of the data to work with,
rather than using the entire dataset. This can be useful for reducing the size of a
dataset while still preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features
in the dataset, either by removing features that are not relevant or by combining
multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or
lossless compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into
discrete data by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the
dataset that are most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between accuracy and the size of the data: the more the data is reduced, the less accurate and the less generalizable the resulting model tends to be. A short sampling sketch is given below.
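A minimal data-sampling sketch, assuming pandas and NumPy are available; the 10,000-row dataset is randomly generated. A 10% random sample is much cheaper to process while roughly preserving overall statistics such as the mean.

import numpy as np
import pandas as pd

# Made-up "large" dataset of 10,000 rows.
df = pd.DataFrame({"value": np.random.default_rng(0).normal(size=10_000)})

# Data sampling: keep a 10% random sample of the rows.
sample = df.sample(frac=0.1, random_state=0)

print(len(sample))                                   # about 1,000 rows
print(df["value"].mean(), sample["value"].mean())    # similar means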
Data Discretization
Top-down Discretization -
• If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting.
Bottom-up Discretization -
• If the process starts by considering all of the continuous values as potential split points, removes some by merging neighbouring values to form intervals, and then applies this recursively to the resulting intervals, it is called bottom-up discretization or merging.
Concept Hierarchies
Discretization and concept hierarchy generation for numerical attributes can be performed using methods such as the following (a short binning sketch follows this list):
1] Binning
2] Histogram Analysis
3] Cluster Analysis
4] Entropy-Based Discretization
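A minimal binning sketch, assuming pandas is available; the ages are made up. Equal-width binning splits the attribute range into intervals of equal width (a simple top-down style split), while equal-frequency binning puts roughly the same number of values in each interval.

import pandas as pd

# Made-up ages to discretize.
ages = pd.Series([3, 7, 12, 19, 24, 31, 38, 45, 52, 67, 80])

# Equal-width binning: split the full range into 4 intervals of equal width.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: 4 bins with roughly the same number of values each.
equal_freq = pd.qcut(ages, q=4)

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))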