DWM - Module 2


Module 2
Introduction to Data Mining, Data Exploration and Data Preprocessing
Content
• Data Mining Task Primitives, Architecture, KDD process, Issues in Data Mining, Types of Attributes; Statistical Description of Data; Data Visualization; Measuring similarity and dissimilarity; Why Pre-processing? Data Cleaning; Data Integration
• Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data Transformation & Data Discretization: Normalization, Binning, Histogram Analysis and Concept hierarchy generation. (08 hours)
Data Mining
• Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques.
The data can be structured, semi-structured or unstructured, and can be
stored in various forms such as databases, data warehouses, and data lakes.
• The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions or
predictions. This involves exploring the data using various techniques such
as clustering, classification, regression analysis, association rule mining,
and anomaly detection.
• However, data mining also raises ethical and privacy concerns, particularly
when it involves personal or sensitive data. It’s important to ensure that
data mining is conducted ethically and with appropriate safeguards in
place to protect the privacy of individuals and prevent misuse of their data.
Data Mining Task primitives
• A data mining task can be specified in the form of a data mining query, which is
input to the data mining system.
• A data mining query is defined in terms of data mining task primitives.
• These primitives allow the user to interactively communicate with the data mining
system during discovery to direct the mining process or examine the findings from
different angles or depths.
• The data mining primitives specify the following:
1. Set of task-relevant data to be mined.
2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.
1. Set of task-relevant data to be mined.
• This specifies the portions of the database or the set of data in which the user is
interested.
• This includes the database attributes or data warehouse dimensions of interest
(the relevant attributes or dimensions).
• Example: extracting the database name, the relevant tables, and the required attributes from the provided input database.
2. Kind of knowledge to be mined.
• This specifies the data mining functions to be performed, such as
characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
• For example, it determines the task to be performed on the relevant data in order to mine useful information, such as classification, clustering, prediction, discrimination, outlier detection, or correlation analysis.
3. The background knowledge to be used in the discovery process
• It refers to any prior information or understanding that is used to guide the
data mining process.
• This can include domain-specific knowledge, such as industry-specific
terminology, trends, or best practices, as well as knowledge about the data
itself.
• The use of background knowledge can help to improve the accuracy and
relevance of the insights obtained from the data mining process.
• This knowledge about the domain to be mined is useful for guiding the
knowledge discovery process and evaluating the patterns found.
• Concept hierarchies are a popular form of background knowledge, which
allows data to be mined at multiple levels of abstraction.
4. The interestingness measures and thresholds for pattern evaluation
• It refers to the methods and criteria used to evaluate the quality and
relevance of the patterns or insights discovered through data mining.
• Interestingness measures are used to quantify the degree to which a pattern is
considered to be interesting or relevant based on certain criteria, such as its
frequency, confidence, or lift.
• These measures are used to identify patterns that are meaningful or relevant
to the task.
• Thresholds for pattern evaluation, on the other hand, are used to set a
minimum level of interestingness that a pattern must meet in order to be
considered for further analysis or action.
• For example: evaluating patterns with interestingness measures such as utility, certainty, and novelty, and setting an appropriate threshold value for pattern evaluation.
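As a small illustration of these measures, the sketch below computes support, confidence, and lift for a hypothetical association rule {milk} => {bread} over a made-up transaction list; the data, the rule, and the threshold values in the final comment are illustrative assumptions, not taken from the text.

```python
# Hypothetical transactions; each is a set of purchased items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "jam"},
]

def measures(antecedent, consequent, transactions):
    n = len(transactions)
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_b = sum(1 for t in transactions if consequent <= t)
    count_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = count_ab / n               # how frequently the rule occurs
    confidence = count_ab / count_a      # P(consequent | antecedent)
    lift = confidence / (count_b / n)    # correlation between the two item sets
    return support, confidence, lift

s, c, lft = measures({"milk"}, {"bread"}, transactions)
print(f"support={s:.2f} confidence={c:.2f} lift={lft:.2f}")
# A rule would be kept only if these measures exceed user-defined thresholds,
# e.g. support >= 0.4 and confidence >= 0.6 (threshold values are hypothetical).
```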
5. The expected representation for visualizing the discovered patterns
• It refers to the methods used to represent the patterns or insights discovered
through data mining in a way that is easy to understand and interpret.
• Visualization techniques such as charts, graphs, and maps are commonly
used to represent the data and can help to highlight important trends,
patterns, or relationships within the data.
• Visualizing the discovered pattern helps to make the insights obtained from
the data mining process more accessible and understandable to a wider
audience, including non-technical stakeholders.
• For example, Presentation and visualization of discovered pattern data using
various visualization techniques such as bar plot, charts, graphs, tables, etc.
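For instance, a minimal matplotlib sketch (with made-up pattern counts per product category) that presents discovered results as a bar plot:

```python
import matplotlib.pyplot as plt

# Hypothetical number of discovered patterns for each product category.
categories = ["Electronics", "Clothing", "Groceries", "Books"]
pattern_counts = [42, 27, 35, 12]

plt.bar(categories, pattern_counts, color="steelblue")
plt.xlabel("Product category")
plt.ylabel("Number of discovered patterns")
plt.title("Discovered patterns per category")
plt.tight_layout()
plt.show()
```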
Advantages of Data Mining Task Primitives
The use of data mining task primitives has several advantages, including:
1.Modularity: Data mining task primitives provide a modular approach to data mining, which
allows for flexibility and the ability to easily modify or replace specific steps in the process.
2.Reusability: Data mining task primitives can be reused across different data mining projects,
which can save time and effort.
3.Standardization: Data mining task primitives provide a standardized approach to data mining,
which can improve the consistency and quality of the data mining process.
4.Understandability: Data mining task primitives are easy to understand and communicate, which
can improve collaboration and communication among team members.
5.Improved Performance: Data mining task primitives can improve the performance of the data
mining process by reducing the amount of data that needs to be processed, and by optimizing the
data for specific data mining algorithms.
6.Flexibility: Data mining task primitives can be combined and repeated in various ways to achieve
the goals of the data mining process, making it more adaptable to the specific needs of the project.
7.Efficient use of resources: Data mining task primitives can help to make more efficient use of resources, as they allow specific tasks to be performed with the right tools, avoiding unnecessary steps and reducing the time and computational power needed.
Data Mining Architecture
• The significant components of data mining systems are a data source, data
mining engine, data warehouse server, the pattern evaluation module,
graphical user interface, and knowledge base.
Data Source:
• The actual source of data is the Database, data warehouse, World Wide Web
(WWW), text files, and other documents.
• A huge amount of historical data is required for data mining to be successful.
• Organizations typically store data in databases or data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data. Sometimes, even plain text files or spreadsheets may contain useful information.
• Another primary source of data is the World Wide Web or the internet.
Different processes:
• Before passing the data to the database or data warehouse server, the data must
be cleaned, integrated, and selected.
• As the information comes from various sources and in different formats, it cannot be used directly for the data mining procedure because the data may not be complete and accurate. So, the data first needs to be cleaned and unified.
• More information than needed will be collected from various data sources,
and only the data of interest will have to be selected and passed to the server.
Several methods may be performed on the data as part of selection,
integration, and cleaning.
Database or Data Warehouse Server:
• The database or data warehouse server consists of the original data that is
ready to be processed.
• Hence, the server is responsible for retrieving the relevant data, based on the user's data mining request.
Data Mining Engine:
• The data mining engine is a major component of any data mining system. It
contains several modules for operating data mining tasks, including
association, characterization, classification, clustering, prediction, time-series
analysis, etc.
• In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.
Pattern Evaluation Module:
• The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, typically by comparing it against a threshold value.
• It collaborates with the data mining engine to focus the search on interesting patterns.
• This segment commonly employs interestingness measures that cooperate with the data mining modules to steer the search towards interesting patterns, and it may use an interestingness threshold to filter out discovered patterns.
• Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining techniques used.
• For efficient data mining, it is strongly recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure, so as to confine the search to only the interesting patterns.
Graphical User Interface:
• The graphical user interface (GUI) module communicates between the data mining
system and the user.
• This module helps the user to easily and efficiently use the system without knowing
the complexity of the process.
• This module cooperates with the data mining system when the user specifies a query
or a task and displays the results.
Knowledge Base:
• The knowledge base is helpful in the entire data mining process. It can be used to guide the search or to evaluate the interestingness of the resulting patterns.
• The knowledge base may even contain user views and data from user
experiences that might be helpful in the data mining process.
• The data mining engine may receive inputs from the knowledge base to make
the result more accurate and reliable.
• The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.
KDD process in data mining
• KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets.
• The KDD process is iterative: it typically requires multiple passes through the following steps to extract accurate knowledge from the data.
• The following steps are included in KDD process:
Data Cleaning:
Data cleaning is defined as removal of noisy and irrelevant data from collection.
• Cleaning in case of Missing values.
• Cleaning noisy data, where noise is a random or variance error.
• Cleaning with Data discrepancy detection and Data transformation tools.
• KDD is an iterative process where evaluation measures can be enhanced, mining
can be refined, new data can be integrated and transformed in order to get different
and more appropriate results.
• Preprocessing of databases consists of Data cleaning and Data Integration.
Data Integration:
• Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse).
• Data integration uses data migration tools, data synchronization tools and the ETL (Extract, Transform, Load) process.
Data Selection:
• Data selection is defined as the process where data relevant to the analysis is decided
and retrieved from the data collection.
• For this we can use neural networks, decision trees, Naive Bayes, clustering, and regression methods.
Data Transformation
• Data Transformation is defined as the process of transforming data into appropriate
form required by mining procedure.
Data Transformation is a two step process:
1. Data Mapping: Assigning elements from source base to destination to capture
transformations.
2. Code generation: Creation of the actual transformation program.
Data Mining:
• Data mining is defined as the set of techniques that are applied to extract potentially useful patterns.
• It transforms the task-relevant data into patterns, and decides the purpose of the model using classification or characterization.
Pattern Evaluation:
• Pattern evaluation is defined as identifying interesting patterns representing knowledge based on given interestingness measures.
• It finds the interestingness score of each pattern, and uses summarization and visualization to make the data understandable to the user.
Knowledge Representation:
• This involves presenting the results in a way that is meaningful and can be used
to make decisions.
Issues in data Mining
• Data mining is not an easy task, as the algorithms used can get very complex and data
is not always available at one place.
• It needs to be integrated from various heterogeneous data sources. These factors also
create some issues.
The major issues are regarding −
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases −
• Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction −
• The data mining process needs to be interactive because it allows users to focus
the search for patterns, providing and refining data mining requests based on the
returned results.
• Incorporation of background knowledge − To guide discovery process and to
express the discovered patterns, the background knowledge can be used.
Background knowledge may be used to express the discovered patterns not only
in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining −
• Data Mining Query language that allows the user to describe ad hoc mining tasks,
should be integrated with a data warehouse query language and optimized for
efficient and flexible data mining.
Presentation and visualization of data mining results −
• Once the patterns are discovered it needs to be expressed in high level languages, and
visual representations.
• These representations should be easily understandable.
Handling noisy or incomplete data −
• The data cleaning methods are required to handle the noise and incomplete objects
while mining the data regularities.
• If the data cleaning methods are not there then the accuracy of the discovered patterns
will be poor.
Pattern evaluation −
• The patterns discovered should be interesting; some patterns may be uninteresting because they either represent common knowledge or lack novelty, so evaluating interestingness remains a challenge.
Performance Issues
There can be performance-related issues such as follows −
Efficiency and scalability of data mining algorithms −
• In order to effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms −
• The factors such as huge size of databases, wide distribution of data, and
complexity of data mining methods motivate the development of parallel and
distributed data mining algorithms.
• These algorithms divide the data into partitions, which are processed in parallel. The results from the partitions are then merged.
• Incremental algorithms update the existing mining results to incorporate database updates, without mining the entire data again from scratch.
Diverse Data Types Issues
Handling of relational and complex types of data −
• The database may contain complex data objects, multimedia data objects, spatial
data, temporal data etc.
• It is not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information
systems −
• The data is available at different data sources on a LAN or WAN.
• These data sources may be structured, semi-structured or unstructured.
• Therefore mining the knowledge from them adds challenges to data mining.
Types of Attributes
• In data mining, we usually discuss knowledge discovery from data. Mining data involves knowing about the data and finding relations within it, and for this we need to discuss data objects and attributes.
• Data objects are the essential part of a database. A data object represents an entity and is described by a group of attributes. For example, in sales data, objects may represent customers, sales, or purchases. When data objects are stored in a database, they are called data tuples.
Attributes:
• An attribute is a data field that represents a characteristic or feature of a data object. For a customer object, attributes can be customer ID, address, etc.
Types of Attributes
• Identifying attribute types is the first step of data preprocessing: we differentiate between the different types of attributes and then preprocess the data accordingly. The attribute types are described below.
Qualitative (Nominal (N), Ordinal (O), Binary(B)).
Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes – related to names:
• The values of a nominal attribute are names of things, i.e., some kind of symbols.
• The values are in alphabetical (categorical) form rather than integers.
• Values of a nominal attribute represent some category or state, which is why nominal attributes are also referred to as categorical attributes.
• There is no order (rank, position) among the values of a nominal attribute.
Example :
Attribute: Values
Categorical data: Lecturer, Assistant Professor, Professor
States: New, Pending, Working, Complete, Finish
Colors: Black, Brown, White, Red
2. Binary Attributes: Binary data has only 2 values/states. For Example yes or no,
affected or unaffected, true or false.
Attribute: Values
HIV detected: Yes, No
Result: Pass, Fail
• Symmetric: Both values are equally important (e.g., gender). For example, if a university has open admission, it does not matter whether an applicant is male or female.
Attribute: Values
Gender: Male, Female
• Asymmetric: The two values are not equally important (e.g., a test result). For example, HIV detected is more important than HIV not detected: if a patient has HIV and we ignore it, it can lead to death, but if a person does not have HIV and we ignore it, there is no special issue or risk.
Attribute: Values
HIV detected: Yes, No
Result: Pass, Fail
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between values is not actually known; the order shows what is important but does not indicate how important it is.
Attribute: Values
Grade: A, B, C, D, F
BPS (Basic Pay Scale): 16, 17, 18
1. Numeric:
• A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values. Numeric attributes are of two types: interval-scaled and ratio-scaled.
Interval-scaled attribute
• An interval-scaled attribute has values whose differences are interpretable, but the attribute does not have a true reference point, i.e., a true zero point.
• Data on an interval scale can be added and subtracted but cannot be multiplied or divided.
• Consider temperature in degrees Centigrade: if the temperature of one day is twice that of another day, we cannot say that the first day is twice as hot as the other.
Ratio-scaled attribute
• A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
• The values are ordered, we can compute the difference between values, and the mean, median, mode, quantile range, and five-number summary can be given (a short sketch appears at the end of this subsection).
2. Discrete: Discrete data have finite values; they can be numerical or categorical. These attributes have a finite or countably infinite set of values.

3. Continuous: Continuous data have an infinite number of possible values. Continuous data are typically of float type; there can be many values between 2 and 3.
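As a small sketch of the statistical descriptions mentioned for ratio-scaled attributes, the following pandas snippet computes the mean, median, mode, and five-number summary of a hypothetical price attribute; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical ratio-scaled attribute (e.g., item prices).
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

print("mean  :", prices.mean())
print("median:", prices.median())
print("mode  :", prices.mode().tolist())

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = prices.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_number_summary)
```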
Data Cleaning (Han, Kamber)
• Real-world data tend to be incomplete, noisy, and inconsistent.
• Data cleaning (or data cleansing) routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct inconsistencies in the
data.
• In this section, you will study basic methods for data cleaning.
Missing Values
• Imagine that you need to analyze All Electronics sales and customer data. You note
that many tuples have no recorded value for several attributes such as customer
income. How can you go about filling in the missing values for this attribute? Let’s
look at the following methods.
1. Ignore the tuple:
• This is usually done when the class label is missing (assuming the mining task
involves classification).
• This method is not very effective, unless the tuple contains several attributes with
missing values.
• It is especially poor when the percentage of missing values per attribute varies
considerably.
• By ignoring the tuple, we do not make use of the remaining attributes’ values in the
tuple. Such data could have been useful to the task at hand.
2.Fill in the missing value manually:
• In general, this approach is time consuming and may not be feasible given a large data
set with many missing values.
3. Use a global constant to fill in the missing value:
• Replace all missing attribute values by the same constant such as a label like
“Unknown” or −∞.
• If missing values are replaced by, say, “Unknown,” then the mining program may
mistakenly think that they form an interesting concept, since they all have a value in
common—that of “Unknown.”
• Hence, although this method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to
fill in the missing value:
We have discussed measures of central tendency, which indicate the “middle” value of a data distribution. For normal (symmetric) data distributions the mean can be used, while skewed data distributions should employ the median. For example, if the customers' average income is Rs. 25,000, then we can use this value to replace the missing values for income.
• For example, suppose that the data distribution regarding the income of All
Electronics customers is symmetric and that the mean income is $56,000. Use this
value to replace the missing value for income.
5. Use the attribute mean or median for all samples belonging to the same class as
the given tuple:
• For example, if classifying customers according to credit risk, we may replace the
missing value with the mean income value for customers in the same credit risk
category as that of the given tuple.
• If the data distribution for a given class is skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value:
• This may be determined with regression, inference-based tools using a Bayesian
classification or decision tree induction.
• For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
• Decision trees and Bayesian classification are described afterwards
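The filling strategies above can be sketched with pandas; the DataFrame, the income values, and the credit_risk classes below are hypothetical examples, not data from the text.

```python
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000, None, 30000, None, 61000],
})

# 1. Ignore (drop) tuples with a missing value.
dropped = df.dropna(subset=["income"])

# 3. Fill with a global constant (here -1 stands in for "Unknown").
const_filled = df["income"].fillna(-1)

# 4. Fill with a measure of central tendency (mean or median).
mean_filled = df["income"].fillna(df["income"].mean())

# 5. Fill with the class-wise mean (same credit-risk category as the tuple).
class_mean_filled = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(class_mean_filled)
```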
Noisy Data
• “What is noise?” Noise is a random error or variance in a measured variable.
• In the previous section, we saw how basic statistical description techniques (e.g., box plots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise.
• Given a numeric attribute such as, say, price, how can we “smooth” out the data to
remove the noise?
• Let’s look at the following data smoothing techniques to handle noisy data.
1.Binning:
• Binning methods smooth a sorted data value by consulting its “neighborhood,” that is,
the values around it.
• The sorted values are distributed into a number of “buckets,” or bins.
• Because binning methods consult the neighborhood of values, they perform local
smoothing.
• Following Figure illustrates some binning techniques. In this example, the data for price
are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin
contains three values).
•In smoothing by bin means, each value in a bin is replaced by the mean value of the
bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9.
• Therefore, each original value in this bin is replaced by the value 9.
• Similarly, smoothing by bin medians can be employed, in which each bin value is
replaced by the bin median.
• In smoothing by bin boundaries, the minimum and maximum values in a given bin
are identified as the bin boundaries.
• Each bin value is then replaced by the closest boundary value. In general, the larger
the width, the greater the effect of the smoothing.
• Alternatively, bins may be equal width, where the interval range of values in each bin
is constant. Binning is also used as a discretization technique.
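A minimal sketch of smoothing by bin means and by bin boundaries, assuming the standard textbook price list whose first equal-frequency bin (4, 8, 15) matches the example above; the remaining values are that textbook example, not given in this text.

```python
# Sorted prices; equal-frequency bins of size 3, as in the example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: replace every value with its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closest boundary.
by_boundaries = [
    [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins
]

print(by_means)        # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```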
2.Outlier analysis by clustering :
• Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall outside of the set of
clusters may be considered outliers (as in Figure).
• Many data smoothing methods are also used for data discretization (a form of data
transformation) and data reduction.
• For example, the binning techniques described before reduce the number of distinct
values per attribute.
• This acts as a form of data reduction for logic-based data mining methods, such as
decision tree induction, which repeatedly makes value comparisons on sorted data.
• Concept hierarchies are a form of data discretization that can also be used for data
smoothing.
• A concept hierarchy for price, for example, may map real price values into inexpensive,
moderately priced, and expensive, thereby reducing the number of data values to be
handled by the mining process.
3.Regression:
• Data smoothing can also be done by regression, a technique that conforms data
values to a function.
• Linear regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
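A minimal sketch of smoothing by linear regression with scikit-learn; the predictor/response attributes and the noise level are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical attributes: predictor x and a noisy response y.
x = np.arange(1, 11).reshape(-1, 1)                        # e.g., years of service
rng = np.random.default_rng(0)
y = 3.0 * x.ravel() + 5.0 + rng.normal(0, 2.0, size=10)    # noisy salary values

model = LinearRegression().fit(x, y)
smoothed = model.predict(x)        # values conformed to the fitted line

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("smoothed values:", np.round(smoothed, 2))
```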
Data Integration

• Data integration is one of the steps of data pre-processing that involves combining
data residing in different sources and providing users with a unified view of these data.
• It merges the data from multiple data stores (data sources)
• It includes multiple databases, data cubes or flat files.
• Formally, data integration is described as a triple (G, S, M), where G is the global schema, S is the set of heterogeneous sources, and M is the mapping between queries over the source schemas and the global schema.
What are the Different Approaches to Data Integration in
Data Mining?
Data Integration in Data Mining is mainly categorized into two types of approaches.
Tight Coupling
• This approach involves the creation of a centralized database that integrates data from
different sources. The data is loaded into the centralized database using extract,
transform, and load (ETL) processes.
• In this approach, the integration is tightly coupled, meaning that the data is physically
stored in the central database, and any updates or changes made to the data sources
are immediately reflected in the central database.
• Tight coupling is suitable for situations where real-time access to the data is required,
and data consistency is critical. However, this approach can be costly and complex,
especially when dealing with large volumes of data.
Loose Coupling
• This approach involves the integration of data from different sources without
physically storing it in a centralized database.
• In this approach, data is accessed from the source systems as needed and combined in
real-time to provide a unified view. This approach uses middleware, such as application
programming interfaces (APIs) and web services, to connect the source systems and
access the data.
• In loose coupling, the data is kept in the actual source databases. The user is given an interface to send a query; the query is transformed into a format suitable for each data source, and each source processes the query and sends the data back to the user.
• Loose coupling is suitable for situations where real-time access to the data is not
critical, and the data sources are highly distributed. This approach is more cost-
effective and flexible than tight coupling but can be more complex to set up and
maintain.
Data Integration Techniques

1.Manual Integration
• This technique avoids the use of automation during data integration.
• The data analyst collects, cleans and integrates the data manually to produce useful information.
• This technique can be implemented for a small organization with a small data set.
• But it is tedious for large, complex and recurring integrations, because the entire process has to be done manually and is therefore time-consuming.
2.Middleware Integration
• Middleware software is employed to collect the information from different sources, normalize the data and store it in the resultant data set.
• This technique is adopted when an enterprise wants to integrate data from legacy systems into modern systems.
• The middleware software acts as an interpreter between the legacy systems and the advanced systems.
• You can think of it as an adapter that connects two systems with different interfaces. However, it can be applied only to certain systems.
3. Application-Based Integration
• This technique makes use of software applications to extract, transform and load the data from heterogeneous sources.
• It also makes the data from disparate sources compatible in order to ease the transfer of data from one system to another.
• This technique saves time and effort but is a little complicated, as designing such an application requires technical knowledge.
4. Uniform Access Integration
• This technique integrates data from even more disparate sources. Here the location of the data is not changed; the data stays in its original location.
• This technique only creates a unified view that represents the integrated data. No separate storage is required to store the integrated data, as only the integrated view is created for the end user.
5. Data Warehousing
• This technique loosely relates to the uniform access integration technique. But the
difference is that the unified view is stored in certain storage.
• This allows the data analyst to handle more complex queries.
• Though this is a promising technique, it has increased storage cost, as the view or copy of the unified data needs separate storage, and it also has higher maintenance costs.
Issues in Data Integration
While integrating the data we have to deal with several issues which are discussed below.
1. Entity Identification Problem
• Since the data is unified from heterogeneous sources, how can we ‘match up the real-world entities from the data’?
• For example, we have customer data from two different data sources. An entity from one data source has customer_id and the entity from the other data source has customer_number.
• Now, how would the data analyst or the system understand that these two entities refer to the same attribute?
• Here, schema integration can be achieved using the metadata of each attribute.
• The metadata of an attribute includes its name, what it means in the particular scenario, its data type, the range of values it can accept, and the rules the attribute follows for null, blank, or zero values. Analyzing this metadata information helps prevent errors in schema integration.
• Structural integration can be achieved by ensuring that the functional dependency
of an attribute in the source system and its referential constraints matches the functional
dependency and referential constraint of the same attribute in the target system.
• This can be understood with the help of an example: suppose in one system a discount is applied to an entire order, but in another system the discount is applied to every single item in the order. This difference must be caught before the data from these two sources is integrated into the target system.
2. Redundancy and Correlation Analysis
• Redundancy is one of the big issues during data integration.
• Redundant data is unimportant data, or data that is no longer needed.
• Redundancy can also arise from attributes that could be derived from another attribute in the data set.
• For example, if one data set has the customer's age and another data set has the customer's date of birth, then age is a redundant attribute, as it can be derived from the date of birth.
• Inconsistencies in the attribute also raise the level of redundancy. The redundancy can
be discovered using correlation analysis.
The attributes are analyzed to detect their interdependency on each other thereby
detecting the correlation between them.
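A short pandas sketch of redundancy detection by correlation analysis; the customer data (age derivable from year of birth) is hypothetical.

```python
import pandas as pd

# Hypothetical customer data: 'age' is derivable from 'year_of_birth',
# so one of the two attributes is redundant.
df = pd.DataFrame({
    "year_of_birth": [1990, 1985, 2000, 1978, 1995],
    "age":           [  35,   40,   25,   47,   30],
    "income":        [42000, 58000, 31000, 73000, 45000],
})

corr = df.corr()
print(corr)

# |r| close to 1 between age and year_of_birth flags a redundant attribute
# that can be dropped before mining.
print("r(age, year_of_birth) =", corr.loc["age", "year_of_birth"])
```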
3. Tuple Duplication
• Along with redundancy, data integration also has to deal with duplicate tuples.
• Duplicate tuples may appear in the resulting data if a denormalized table has been used as a source for data integration.
4. Data Conflict Detection and Resolution
• Data conflict means the data merged from the different sources do not match. Like the
attribute values may differ in different data sets.
• The difference may be because the values are represented differently in the different data sets. For example, the price of a hotel room may be represented in different currencies in different cities.
• These kinds of issues are detected and resolved during data integration.
So far, we have discussed the issues that a data analyst or system has to deal with during data integration.
Data Reduction
• Data reduction is a process that reduces the volume of original data and
represents it in a much smaller volume.
• Data reduction techniques are used to obtain a reduced representation of the
dataset that is much smaller in volume by maintaining the integrity of the
original data.
• By reducing the data, the efficiency of the data mining process is improved while producing essentially the same analytical results.
• Data reduction is a technique used in data mining to reduce the size of a dataset
while still preserving the most important information.
• This can be beneficial in situations where the dataset is too large to be processed
efficiently, or where the dataset contains a large amount of irrelevant or
redundant information.
Data Reduction Techniques
1. Data Cube Aggregation
• This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent
the original data set, thus achieving data reduction.
• For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to
the year 2022.
• If you want to get the annual sale per year, you just have to aggregate the sales per quarter for each
year.
• In this way, aggregation provides you with the required data, which is much smaller in size, and
thereby we achieve data reduction even without losing any data.
• Data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The data cube presents precomputed and summarized data, which gives data mining fast access to the summarized information.
2.Dimensionality Reduction
• Dimensionality reduction eliminates the attributes from the data set under consideration
thereby reducing the volume of original data.
• In the section below, we will discuss three methods of dimensionality reduction.
a. Wavelet Transform
• In the wavelet transform, a data vector X is transformed into a numerically different data vector X’ such that both X and X’ are of the same length. How, then, is it useful in reducing data?
• The data obtained from the wavelet transform can be truncated: the compressed data is obtained by retaining only a small fraction of the strongest wavelet coefficients.
• The wavelet transform can be applied to data cubes, sparse data or skewed data.
b. Principal Component Analysis
• Suppose the data set to be analyzed has tuples with n attributes. Principal component analysis identifies k orthogonal vectors (the principal components), with k ≤ n, that can best be used to represent the data.
• In this way, the original data is projected onto a much smaller space and dimensionality reduction is achieved.
• Principal component analysis can be applied to sparse and skewed data.
c. Attribute Subset Selection
• A large data set has many attributes, some of which are irrelevant to data mining and some of which are redundant.
• Attribute subset selection reduces the volume of data by eliminating the redundant and irrelevant attributes.
• It ensures that, even after eliminating the unwanted attributes, we get a good subset of the original attributes such that the resulting probability distribution of the data is as close as possible to the original distribution obtained using all the attributes.
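As a sketch, attribute subset selection can be approximated with scikit-learn's feature-selection utilities; the toy iris data set and the chi-square scoring function are assumptions for illustration, not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Toy data set with 4 attributes; keep the 2 most relevant to the class label.
X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)

print("original shape:", X.shape)          # (150, 4)
print("reduced shape :", X_reduced.shape)  # (150, 2)
print("kept attribute indices:", selector.get_support(indices=True))
```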
3. Numerosity Reduction
Numerosity reduction reduces the volume of the original data and represents it in a much smaller form. This technique includes two types: parametric and non-parametric numerosity reduction.
Parametric
• Parametric numerosity reduction incorporates ‘storing only data parameters instead of the
original data’.
• One method of parametric numerosity reduction is ‘regression and log-linear’ method.
Regression and Log-Linear
Linear regression models a relationship between two attributes by fitting a linear equation to the data set. Suppose we need to model a linear function between two attributes:
y = wx + b
Here, y is the response attribute and x is the predictor attribute. In data mining terms, x and y are numeric database attributes, whereas w and b are the regression coefficients.
Multiple linear regression lets the response variable y be modeled as a linear function of two or more predictor variables.
• Log-linear model discovers the relation between two or more discrete attributes in the
database.
• Suppose, we have a set of tuples presented in n-dimensional space. Then the log-linear
model is used to study the probability of each tuple in a multidimensional space.
• Regression and log-linear method can be used for sparse data and skewed data.
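A minimal sketch of parametric reduction: fit y = wx + b once and keep only the two parameters (w, b) instead of the original tuples; the synthetic data below is purely illustrative.

```python
import numpy as np

# Synthetic numeric attributes: predictor x and response y.
x = np.linspace(0, 100, 1_000)
y = 2.5 * x + 10 + np.random.default_rng(1).normal(0, 5, x.size)

# Parametric reduction: store only w and b, discard the 1,000 original tuples.
w, b = np.polyfit(x, y, deg=1)
print(f"stored parameters: w={w:.3f}, b={b:.3f}")

# Any y value can later be approximated from the stored parameters alone.
x_query = 42.0
print("approximate y at x = 42:", w * x_query + b)
```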
Non-Parametric
1.Histogram
A histogram is a graph that represents a frequency distribution, describing how often each value appears in the data. A histogram uses the binning method to represent the data distribution of an attribute: it uses disjoint subsets which we call bins or buckets.
We have data for All Electronics data set, which contains prices for regularly sold items.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18,
18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28,
30,30,30.
The diagram below shows a histogram of equal width that shows the frequency of price
distribution.
A histogram is capable of representing dense, sparse, uniform or skewed data. Instead of only one attribute, a histogram can be implemented for multiple attributes; it can effectively represent up to five attributes.
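The equal-width histogram described above can be reproduced with numpy, using the price list given in the text and assuming three buckets of width 10.

```python
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width buckets of width 10: [0, 10), [10, 20), [20, 30].
counts, edges = np.histogram(prices, bins=[0, 10, 20, 30])
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"price {lo:>2}-{hi:<2}: {c} items")
```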
2.Clustering
• Clustering techniques group similar objects from the data in such a way that the objects in a cluster are similar to each other but dissimilar to objects in other clusters.
• How similar the objects inside a cluster are can be calculated using a distance function: the more similar the objects in a cluster are, the closer they appear in the cluster.
• The quality of a cluster depends on the diameter of the cluster, i.e., the maximum distance between any two objects in the cluster.
• The original data is replaced by the cluster representation. This technique is more effective if the data can be classified into distinct clusters.
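A minimal scikit-learn sketch that replaces each value with its cluster representative (centroid); the one-dimensional price values and the choice of three clusters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(prices)

# Data reduction: store only the 3 centroids plus each point's cluster label
# instead of the original values.
centroids = kmeans.cluster_centers_.ravel()
reduced = centroids[kmeans.labels_]

print("centroids:", np.round(centroids, 2))
print("original :", prices.ravel())
print("replaced :", np.round(reduced, 2))
```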
3. Sampling
• One of the methods used for data reduction is sampling, as it is capable of reducing a large data set to a much smaller data sample.
• Below we discuss the different methods by which we can sample a large data set D containing N tuples:
Simple random sample without replacement (SRSWOR) of size s:
• In this method, s tuples are drawn from the N tuples in the data set D (s < N).
• The probability of drawing any tuple from the data set D is 1/N, which means all tuples have an equal probability of being sampled.
Simple random sample with replacement (SRSWR) of size s:
• It is similar to the SRSWOR but the tuple is drawn from data set D, is recorded and
then replaced back into the data set D so that it can be drawn again.
Cluster sample:
• The tuples in data set D are clustered into M mutually disjoint subsets.
• From these clusters, a simple random sample of size s can be generated, where s < M.
• Data reduction can be applied by implementing SRSWOR on these clusters.
Stratified sample:
• The large data set D is partitioned into mutually disjoint sets called ‘strata’.
• Now a simple random sample is taken from each stratum to get stratified data.
• This method is effective for skewed data.
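The sampling schemes can be sketched with pandas; the data set D, its size, and the 'segment' column used as strata are hypothetical.

```python
import pandas as pd

# Hypothetical data set D with N = 10 tuples and a 'segment' column as strata.
D = pd.DataFrame({
    "customer_id": range(1, 11),
    "segment": ["retail"] * 7 + ["corporate"] * 3,
})

# SRSWOR: simple random sample of s = 4 tuples without replacement.
srswor = D.sample(n=4, replace=False, random_state=0)

# SRSWR: simple random sample of s = 4 tuples with replacement.
srswr = D.sample(n=4, replace=True, random_state=0)

# Stratified sample: the same fraction is drawn from every stratum.
stratified = D.groupby("segment", group_keys=False).sample(frac=0.5, random_state=0)

# (A cluster sample would first partition D into disjoint clusters and then
#  draw a simple random sample of whole clusters.)
print(srswor, srswr, stratified, sep="\n\n")
```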
4.Data Compression
• The data compression technique reduces the size of files using different encoding mechanisms (e.g., Huffman encoding and run-length encoding).
• We can divide it into two types based on the compression technique.
Lossless Compression –
• Encoding techniques (Run Length Encoding) allow a simple and minimal data size
reduction.
• Lossless data compression uses algorithms to restore the precise original data from the
compressed data.
Lossy Compression –
• Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this type of compression. For example, the JPEG image format uses lossy compression, but we can still recover an image whose meaning is equivalent to the original.
• In lossy-data compression, the decompressed data may differ from the original data but
are useful enough to retrieve information from them.
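A tiny sketch of lossless compression using Python's built-in zlib module (DEFLATE, which combines LZ77 with Huffman coding), showing that the exact original data is restored; the sample byte string is made up.

```python
import zlib

original = b"AAAAABBBBBCCCCC" * 100          # highly repetitive data
compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

print("original size  :", len(original))
print("compressed size:", len(compressed))
print("lossless?      :", restored == original)   # True: exact reconstruction
```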
5.Discretization and concept hierarchy generation −
• Techniques of data discretization are used to divide attributes of a continuous nature into data with intervals.
• We replace the many constant values of an attribute with labels of small intervals. This means that mining results are presented in a concise and easily understandable way.
Top-down discretization –
• If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of the attribute and repeat this method recursively on the resulting intervals, the process is known as top-down discretization, also known as splitting.
Bottom-up discretization –
• If you first consider all the constant values as split points and then discard some of them by merging neighbourhood values into intervals, the process is called bottom-up discretization, also known as merging.
Concept Hierarchies:
• It reduces the data size by collecting and then replacing the low-level concepts (such as
43 for age) with high-level concepts (categorical variables such as middle age or Senior).
• For numeric data following techniques can be followed:
Binning
Binning is the process of changing numerical variables into categorical counterparts. The
number of categorical counterparts depends on the number of bins specified by the user.
Histogram analysis –
Like binning, histogram analysis partitions the values of an attribute X into disjoint ranges called buckets (or brackets). There are several partitioning rules:
Equal-frequency partitioning: partitioning the values so that each bucket contains (roughly) the same number of occurrences from the data set.
Equal-width partitioning: partitioning the values into buckets of a fixed width based on the number of bins, e.g., over a set of values ranging from 0-20.
Clustering: Grouping similar data together.
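A short pandas sketch contrasting equal-width and equal-frequency partitioning, and a simple concept hierarchy for age; the age values and the hierarchy labels are hypothetical.

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

# Equal-width partitioning: 4 bins of fixed width over the value range.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency partitioning: 4 bins with (roughly) the same number of values.
equal_freq = pd.qcut(ages, q=4)

# Concept hierarchy: map numeric ages to higher-level labels.
labels = pd.cut(ages, bins=[0, 20, 40, 120], labels=["young", "middle_aged", "senior"])

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(labels.tolist())
```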
Data Transformation & Data Discretization
• Data transformation in data mining is done by combining unstructured data with
structured data to analyze it later. It is also important when the data is transferred
to a new cloud data warehouse.
• When the data is homogeneous and well-structured, it is easier to analyze and
look for patterns.
• For example, a company has acquired another firm and now has to consolidate
all the business data. The Smaller company may be using a different database
than the parent firm. Also, the data in these databases may have unique IDs,
keys, and values. All this needs to be formatted so that all the records are similar
and can be evaluated.
• This is why data transformation methods are applied. And, they are described
below:
Data Smoothing
This method is used for removing the noise from a dataset. Noise is referred to as the
distorted and meaningless data within a dataset.
Smoothing uses algorithms to highlight the special features in the data.
After removing noise, the process can detect any small changes to the data to detect
special patterns.
Any data modification or trend can be identified by this method.
Data Aggregation
• Aggregation is the process of collecting data from a variety of sources and storing it in
a single format. Here, data is collected, stored, analyzed, and presented in a report or
summary format.
• It helps in gathering more information about a particular data cluster. The method helps
in collecting vast amounts of data.
• This is a crucial step as accuracy and quantity of data is important for proper analysis.
• Companies collect data about their website visitors. This gives them an idea about
customer demographics and behavior metrics. This aggregated data assists them in
designing personalized messages, offers, and discounts.
Discretization
• This is a process of converting continuous data into a set of data intervals.
Continuous attribute values are substituted by small interval labels. This makes the
data easier to study and analyze.
• If a continuous attribute is handled by a data mining task, then its discrete values can
be replaced by constant quality attributes.
• This improves the efficiency of the task.
• This method is also called a data reduction mechanism as it transforms a large
dataset into a set of categorical data.
• Discretization can be done by Binning, Histogram Analysis, and Correlation
Analyses.
• Discretization also uses decision tree-based algorithms to produce short, compact, and accurate results when using discrete values.
Generalization
• In this process, low-level data attributes are transformed into high-level data
attributes using concept hierarchies.
• This conversion from a lower level to a higher conceptual level is useful to get a
clearer picture of the data.
• For example, age data can be in the form of (20, 30) in a dataset. It is transformed
into a higher conceptual level into a categorical value (young, old).
• Data generalization can be divided into two approaches: the data cube (OLAP) approach and the attribute-oriented induction (AOI) approach.
Attribute construction
In the attribute construction method, new attributes are
created from an existing set of attributes.
For example, in a dataset of employee information, the
attributes can be employee name, employee ID, and address.
These attributes can be used to construct another dataset that
contains information about the employees who have joined in
the year 2019 only.
This method of construction makes mining more efficient and helps in creating new datasets quickly.
Normalization
Normalization is one of the crucial techniques for data transformation in data mining, and it is usually carried out as part of data pre-processing.
Here, the data is transformed so that it falls under a given range. When attributes are on different ranges or scales, data
modeling and mining can be difficult.
Normalization helps in applying data mining algorithms and extracting data faster.
The popular normalization methods are:
Min-max normalization
In this technique, a linear transformation is performed on the original data. The minimum and maximum values of the attribute are fetched, and each value is replaced according to the following formula:

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

where A is the attribute, min_A and max_A are the minimum and maximum values of A, v is the old value of an entry, v' is its new value, and new_min_A and new_max_A are the minimum and maximum of the required target range (i.e., its boundary values).
Example: Suppose incomes ranging from 10,000 to 95,000 are normalized to [0.0, 1.0]. By min-max normalization, an income value of 64,300 is transformed to ((64,300 - 10,000) / (95,000 - 10,000)) * (1.0 - 0.0) + 0.0 = 0.6388.
Z-score normalization
In this technique, values are normalized based on the mean and standard deviation of the attribute A:

v' = (v - mean_A) / std_A

where mean_A and std_A are the mean and standard deviation of attribute A.
Decimal scaling
It normalizes by moving the decimal point of the values of the data: each value v of attribute A is divided by a power of ten so that the result falls below 1 in absolute value. The data value v is normalized to v' by using the formula below:

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
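A minimal numpy sketch of the three normalization methods, reusing the income example from the text (range 10,000 to 95,000 mapped to [0.0, 1.0]); the intermediate income values are made up.

```python
import numpy as np

incomes = np.array([10_000, 25_000, 64_300, 80_000, 95_000], dtype=float)

# Min-max normalization to [new_min, new_max] = [0.0, 1.0].
mn, mx = incomes.min(), incomes.max()
min_max = (incomes - mn) / (mx - mn) * (1.0 - 0.0) + 0.0
print("min-max:", np.round(min_max, 4))        # 64,300 -> 0.6388

# Z-score normalization: (v - mean) / standard deviation.
z_score = (incomes - incomes.mean()) / incomes.std()
print("z-score:", np.round(z_score, 4))

# Decimal scaling: divide by 10**j, with j the smallest integer
# such that max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(incomes).max() + 1)))
decimal_scaled = incomes / 10**j
print("decimal scaling (j=%d):" % j, decimal_scaled)
```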
