Advanced Databases and Mining Unit 3

Data mining involves using advanced analytical tools to uncover patterns and relationships in large datasets through techniques like classification, clustering, regression, and association rules. It also includes knowledge representation methods such as logical representation, semantic networks, frames, and production rules. Additionally, OLAP technology facilitates multidimensional analysis for better decision-making in various organizational functions.

UNIT III:
Data Mining
Data Mining Techniques:

Data mining involves the use of sophisticated data analysis tools to find previously unknown, valid patterns and relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms such as neural networks or decision trees. Thus, data mining incorporates both analysis and prediction.

Drawing on methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to understanding how to process and draw conclusions from huge amounts of data. But what methods do they use to make that happen?

In recent data mining projects, various major data mining techniques have been developed and
used, including association, classification, clustering, prediction, sequential patterns, and
regression.

1. Classification:

This technique is used to obtain important and relevant information about data and metadata. It helps classify data into different classes.

Data mining techniques can be classified by different criteria, as follows:


i. Classification of data mining frameworks as per the type of data sources mined: This classification is based on the type of data handled, for example multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved: This classification is based on the data model involved, for example object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered: This classification depends on the types of knowledge discovered or data mining functionalities, for example discrimination, classification, clustering, characterization, etc. Some frameworks are extensive frameworks offering several data mining functionalities together.
iv. Classification of data mining frameworks according to the data mining techniques used: This classification is based on the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, or data warehouse-oriented or database-oriented approaches.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
2. Clustering:

Clustering is the division of information into groups of connected objects. Describing the data by a few clusters loses certain fine details but achieves simplification; the data is modeled by its clusters. From a historical point of view, data modeling by clusters is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an extraordinary role in data mining applications, for example scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, medical diagnostics, and much more.

In other words, clustering analysis is a data mining technique for identifying similar data. This technique helps to recognize the differences and similarities between the data. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
3. Regression:

Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the probability of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.

4. Association Rules:

This data mining technique helps to discover a link between two or more items. It finds hidden patterns in the data set.

Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to help find sales correlations in transactional data or in medical data sets.

The way the algorithm works is that you have various data, for example a list of grocery items that you have been buying for the last six months. It calculates the percentage of items being purchased together.

These are the three major measurement techniques:

o Lift:
This measure compares the confidence of the rule with how often item B is purchased overall.

Lift = (Confidence) / ((Item B) / (Entire dataset))

o Support:
This measure captures how often the items are purchased together, compared to the overall dataset.

Support = (Item A + Item B) / (Entire dataset)

o Confidence:
This measure captures how often item B is purchased when item A is purchased as well.

Confidence = (Item A + Item B) / (Item A)
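As a rough illustration (not part of the original text), the following Python sketch computes these three measures for a hypothetical rule {bread} -> {butter} over a small, made-up list of grocery transactions; the item names and data are purely illustrative.

    # Hypothetical grocery transactions (illustrative data only)
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "butter"},
        {"bread", "butter", "jam"},
    ]

    def rule_measures(antecedent, consequent, transactions):
        n = len(transactions)
        both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
        only_a = sum(1 for t in transactions if antecedent <= t)
        only_b = sum(1 for t in transactions if consequent <= t)
        support = both / n                 # (Item A + Item B) / (Entire dataset)
        confidence = both / only_a         # (Item A + Item B) / (Item A)
        lift = confidence / (only_b / n)   # Confidence / ((Item B) / (Entire dataset))
        return support, confidence, lift

    s, c, l = rule_measures({"bread"}, {"butter"}, transactions)
    print(f"support={s:.2f}, confidence={c:.2f}, lift={l:.2f}")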



5. Outlier detection:

This data mining technique relates to observing data items in the data set that do not match an expected pattern or expected behavior. It may be used in various domains such as intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas like network intrusion identification, credit or debit card fraud detection, detecting outlying readings in wireless sensor network data, etc.

6. Sequential Patterns:

The sequential pattern is a data mining technique specialized for evaluating sequential data to
discover sequential patterns. It comprises of finding interesting subsequences in a set of
sequences, where the stake of a sequence can be measured in terms of different criteria like
length, occurrence frequency, etc.

In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over some time.

7. Prediction:

Prediction uses a combination of other data mining techniques such as trends, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.

Knowledge Representation Methods:

There are mainly four ways of knowledge representation which are given as follows:

1. Logical Representation
2. Semantic Network Representation
3. Frame Representation
4. Production Rules

Logical Representation:

Logical representation is a language with some concrete rules which deals with propositions and has no ambiguity in representation. Logical representation means drawing conclusions based on various conditions. This representation lays down some important communication rules. It consists of precisely defined syntax and semantics which support sound inference. Each sentence can be translated into logic using its syntax and semantics.

Syntax:
o Syntax rules decide how we can construct legal sentences in the logic.
o They determine which symbols we can use in knowledge representation.
o They specify how to write those symbols.

Semantics:
o Semantics are the rules by which we can interpret the sentences in the logic.
o Semantics also involves assigning a meaning to each sentence.

Logical representation can be categorised into mainly two logics:

a. Propositional logic
b. Predicate logic

Advantages of logical representation:


1. Logical representation enables us to do logical reasoning.
2. Logical representation is the basis for programming languages.

Disadvantages of logical Representation:



1. Logical representations have some restrictions and are challenging to work with.
2. The logical representation technique may not be very natural, and inference may not be very efficient.

2. Semantic Network Representation:

Semantic networks are an alternative to predicate logic for knowledge representation. In semantic networks, we can represent our knowledge in the form of graphical networks. The network consists of nodes representing objects and arcs which describe the relationships between those objects. Semantic networks can categorize objects in different forms and can also link those objects. Semantic networks are easy to understand and can be easily extended.

This representation consists of mainly two types of relations:

a. IS-A relation (Inheritance)
b. Kind-of relation

Example: Following are some statements which we need to represent in the form of nodes and
arcs.

Statements:
a. Jerry is a cat.
b. Jerry is a mammal.
c. Jerry is owned by Priya.
d. Jerry is brown colored.
e. All mammals are animals.

In the semantic network for these statements, the different types of knowledge are represented in the form of nodes and arcs. Each object is connected with another object by some relation.
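As a rough sketch (not from the original text), the statements above could be encoded as (node, relation, node) triples in Python; the relation names used here are illustrative assumptions.

    # Illustrative encoding of the example semantic network as triples
    semantic_network = [
        ("Jerry", "is-a", "Cat"),
        ("Jerry", "is-a", "Mammal"),
        ("Jerry", "owned-by", "Priya"),
        ("Jerry", "has-color", "Brown"),
        ("Mammal", "is-a", "Animal"),
    ]

    def related(node, relation):
        # Follow arcs of a given relation starting from a node
        return [o for (s, r, o) in semantic_network if s == node and r == relation]

    print(related("Jerry", "is-a"))   # ['Cat', 'Mammal']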

Drawbacks in Semantic representation:


1. Semantic networks take more computational time at runtime, as we need to traverse the complete network tree to answer some questions. In the worst case, after traversing the entire tree we may find that the solution does not exist in the network.
2. Semantic networks try to model human-like memory (which has on the order of 10^15 neurons and links) to store information, but in practice it is not possible to build such a vast semantic network.
3. These types of representations are inadequate as they do not have any equivalent quantifiers, e.g., for all, for some, none, etc.
4. Semantic networks do not have any standard definition for the link names.
5. These networks are not intelligent and depend on the creator of the system.

Advantages of Semantic network:


1. Semantic networks are a natural representation of knowledge.
2. Semantic networks convey meaning in a transparent manner.
3. These networks are simple and easily understandable.

3. Frame Representation:

A frame is a record-like structure which consists of a collection of attributes and their values to describe an entity in the world. Frames are AI data structures which divide knowledge into substructures by representing stereotyped situations. A frame consists of a collection of slots and slot values. These slots may be of any type and size. Slots have names and values, and these values are described by facets.

Facets: The various aspects of a slot are known as facets. Facets are features of frames which enable us to put constraints on the frames. For example, an IF-NEEDED facet is invoked when the data of a particular slot is needed. A frame may consist of any number of slots, a slot may include any number of facets, and a facet may have any number of values. Frame representation is also known as slot-filler knowledge representation in artificial intelligence.

Frames are derived from semantic networks and later evolved into our modern-day classes and objects. A single frame is not very useful on its own; a frame system consists of a collection of connected frames. In a frame, knowledge about an object or event can be stored together in the knowledge base. Frames are widely used in various applications, including natural language processing and machine vision.

Example 1:

Let's take an example of a frame for a book:

Slot        Filler
Title       Artificial Intelligence
Genre       Computer Science
Author      Peter Norvig
Edition     Third Edition
Year        1996
Pages       1152
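A minimal sketch (not from the original text) of how this book frame might be represented as a Python data structure; the parent frame, its contents, and the inheritance lookup shown here are illustrative assumptions.

    # Illustrative frame for the book example: slots map to fillers
    book_frame = {
        "Title": "Artificial Intelligence",
        "Genre": "Computer Science",
        "Author": "Peter Norvig",
        "Edition": "Third Edition",
        "Year": 1996,
        "Pages": 1152,
    }

    # A frame system links frames together, e.g. via an "is-a" slot (assumed parent frame)
    publication_frame = {"is-a": None, "Publisher": "Unknown"}
    book_frame["is-a"] = publication_frame

    def get_slot(frame, slot):
        # Look up a slot locally, then fall back to the parent frame (inheritance)
        if slot in frame:
            return frame[slot]
        parent = frame.get("is-a")
        return get_slot(parent, slot) if parent else None

    print(get_slot(book_frame, "Publisher"))  # inherited from the parent frame -> 'Unknown'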
Advantages of frame representation:

1. Frame knowledge representation makes programming easier by grouping related data.
2. Frame representation is comparatively flexible and is used by many applications in AI.
3. It is very easy to add slots for new attributes and relations.
4. It is easy to include default data and to search for missing values.
5. Frame representation is easy to understand and visualize.

Disadvantages of frame representation:

1. In a frame system, the inference mechanism is not easily processed.
2. The inference mechanism cannot proceed smoothly in frame representation.
3. Frame representation is a very generalized approach.

4. Production Rules:

A production rules system consists of (condition, action) pairs, which mean "If condition then action". It has mainly three parts:

o The set of production rules
o Working memory
o The recognize-act cycle

In a production rules system, the agent checks for the condition, and if the condition holds then the production rule fires and the corresponding action is carried out. The condition part of a rule determines which rule may be applied to a problem, and the action part carries out the associated problem-solving steps. This complete process is called a recognize-act cycle.

The working memory contains the description of the current state of problem solving, and rules can write knowledge to the working memory. This knowledge may then match and fire other rules.

If a new situation (state) is generated, multiple production rules may be triggered together; this set of rules is called the conflict set. In this situation, the agent needs to select one rule from the set, which is called conflict resolution.

Example:

o IF (at bus stop AND bus arrives) THEN action (get into the bus)
o IF (on the bus AND paid AND empty seat) THEN action (sit down)
o IF (on the bus AND unpaid) THEN action (pay charges)
o IF (bus arrives at destination) THEN action (get down from the bus)
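A minimal sketch (not from the original text) of a recognize-act cycle in Python using the bus example above; the rule encodings, the working-memory facts, and the "first match wins" conflict-resolution strategy are illustrative assumptions.

    # Working memory holds the current facts describing the state of the problem
    working_memory = {"at bus stop", "bus arrives", "unpaid", "empty seat"}

    def board_bus(wm):
        wm.discard("at bus stop")
        wm.add("on the bus")

    def pay_charges(wm):
        wm.discard("unpaid")
        wm.add("paid")

    def sit_down(wm):
        wm.add("sitting")

    # Each production rule is a (condition, action) pair
    rules = [
        (lambda wm: {"at bus stop", "bus arrives"} <= wm, board_bus),
        (lambda wm: {"on the bus", "unpaid"} <= wm, pay_charges),
        (lambda wm: {"on the bus", "paid", "empty seat"} <= wm and "sitting" not in wm, sit_down),
    ]

    # Recognize-act cycle: build the conflict set, resolve it (here: first match),
    # fire the chosen rule, and repeat until no rule can fire.
    while True:
        conflict_set = [(cond, act) for cond, act in rules if cond(working_memory)]
        if not conflict_set:
            break
        conflict_set[0][1](working_memory)   # fire the selected rule

    print(working_memory)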

Advantages of Production rule:


1. The production rules are expressed in natural language.
2. The production rules are highly modular, so we can easily remove, add, or modify an individual rule.

Disadvantages of Production rule:

1. A production rule system does not exhibit any learning capability, as it does not store the results of problems for future use.
2. During the execution of the program, many rules may be active, hence rule-based production systems can be inefficient.

Data mining Approaches (OLAP):

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology which enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the clients.

OLAP implements multidimensional analysis of business information and supports the capability for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential foundation for intelligent solutions, including business performance management, planning, budgeting, forecasting, financial reporting, analysis, simulation models, knowledge discovery, and data warehouse reporting. OLAP enables end users to perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding they require for better decision making.
Who uses OLAP and Why?:
OLAP applications are used by a variety of the functions of an organization.

Finance and accounting:
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling

Sales and marketing:
o Sales analysis and forecasting
o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation

Production:
o Production planning
o Defect analysis

OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.

The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.
How OLAP Works?

Fundamentally, OLAP has a very simple concept. It pre-calculates most of the queries that are typically very hard to execute over tabular databases, namely aggregation, joining, and grouping. These queries are calculated during a process that is usually called 'building' or 'processing' the OLAP cube. This process usually happens overnight, so by the time end users get to work, the data will have been updated.
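As a rough illustration (not from the original text), the following Python/pandas sketch mimics the 'build' step by pre-aggregating a small, made-up sales table over its dimensions so that later queries become simple lookups; the table contents and column names are assumptions.

    import pandas as pd

    # Made-up fact table: one row per sale
    sales = pd.DataFrame({
        "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "Delhi"],
        "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1"],
        "item":    ["Car", "Bus", "Car", "Car", "Bus"],
        "amount":  [120, 80, 200, 150, 60],
    })

    # "Building the cube": pre-compute the aggregate for every (city, quarter, item) cell
    cube = sales.groupby(["city", "quarter", "item"])["amount"].sum()

    # Later, an OLAP-style query is just a lookup instead of a full scan
    print(cube.loc[("Delhi", "Q1")])          # all items sold in Delhi in Q1
    print(cube.loc[("Mumbai", "Q2", "Car")])  # a single cell of the cube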

OLAP Guidelines (Dr. E.F. Codd's Rules):

Dr. E.F. Codd, the "father" of the relational model, formulated a list of 12 guidelines and requirements as the basis for selecting OLAP systems.

Statistics and ML:


Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and practitioners are continually seeking improved techniques to make the process more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in data mining:
• Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to identify patterns and trends. Alternatively, it is referred to as quantitative analysis.
• Non-statistical Analysis: This analysis provides generalized information and includes sound, still images, and moving images.
In statistics, there are two main categories:
• Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the main characteristics of that data. Graphs or numbers summarize the data. Average, mode, standard deviation (SD), and correlation are some of the commonly used descriptive statistical methods.

• Inferential Statistics: The process of drawing conclusions based on probability theory and
generalizing the data. By analyzing sample statistics, you can infer parameters about
populations and make models of relationships within data.
There are various statistical terms that one should be aware of while dealing with statistics. Some
of these are:
• Population
• Sample
• Variable
• Quantitative Variable
• Qualitative Variable
• Discrete Variable
• Continuous Variable
Now, let’s start discussing statistical methods. This is the analysis of raw data using mathematical
formulas, models, and techniques. Through the use of statistical methods, information is
extracted from research data, and different ways are available to judge the robustness of research
outputs.
As a matter of fact, today's statistical methods used in the data mining field are typically derived from the vast statistical toolkit developed to answer problems arising in other fields, and these techniques are taught in science curriculums. It is necessary to check and test several hypotheses; such hypothesis tests help us assess the validity of our data mining endeavor when attempting to draw inferences from the data under study. When using more complex and sophisticated statistical estimators and tests, these issues become more pronounced.
For extracting knowledge from databases containing different types of observations, a variety of statistical methods are available in data mining, some of which are:
• Logistic regression analysis
• Correlation analysis
• Regression analysis
• Discriminant analysis
• Linear discriminant analysis (LDA)
• Classification
• Clustering
• Outlier detection
• Classification and regression trees
• Correspondence analysis
• Nonparametric regression
• Statistical pattern recognition
• Categorical data analysis
• Time-series methods for trends and periodicity
• Artificial neural networks
Now, let’s try to understand some of the important statistical methods which are used in data
mining:
• Linear Regression: The linear regression method uses the best linear relationship between the independent and dependent variables to predict the target variable. To achieve the best fit, the distances between the fitted line and the actual observations at each point should be as small as possible; a good fit can be confirmed by checking that no other position of the line would produce fewer errors. Simple linear regression and multiple linear regression are the two major types. Simple linear regression predicts the dependent variable by fitting a linear relationship to a single independent variable, while multiple linear regression fits the best linear relationship with the dependent variable using multiple independent variables. For more details, refer to linear regression (a minimal code sketch is also given after this list).
• Classification: This is a method of data mining in which a collection of data is categorized so that it can be predicted and analyzed with a greater degree of accuracy. Classifying very large datasets is an effective way to analyze them, and classification is one of several methods aimed at improving the efficiency of the analysis process. Logistic regression and discriminant analysis stand out as two major classification techniques.
• Logistic Regression: It can also be applied to machine learning applications and predictive analytics. In this approach, the dependent variable is either binary (binary logistic regression) or multinomial (multinomial logistic regression), i.e., it takes one of two options or one option from a small set. With a logistic regression equation, one can estimate probabilities regarding the relationship between the independent variables and the dependent variable. For understanding logistic regression analysis in detail, you can refer to logistic regression.
• Discriminant Analysis: Discriminant analysis is a statistical method of analyzing data based on the measurements of categories or clusters and categorizing new observations into one or more populations that were identified a priori. Discriminant analysis models each response class separately and then uses Bayes' theorem to flip these projections around to estimate the likelihood of each response category given the value of X. These models can be either linear or quadratic.
• Linear Discriminant Analysis: In Linear Discriminant Analysis (LDA), each observation is assigned a discriminant score that is used to classify it into a response variable class. These scores are obtained by combining the independent variables in a linear fashion. The model assumes that observations are drawn from a Gaussian distribution and that the predictor variables share a common covariance across all k levels of the response variable Y. For further details, refer to linear discriminant analysis.
• Quadratic Discriminant Analysis: An alternative approach is provided by Quadratic
Discriminant Analysis. LDA and QDA both assume Gaussian distributions for the observations
of the Y classes. Unlike LDA, QDA considers each class to have its own covariance matrix. As a
result, the predictor variables have different variances across the k levels in Y.
• Correlation Analysis: In statistical terms, correlation analysis captures the relationship
between variables in a pair. The value of such variables is usually stored in a column or rows
of a database table and represents a property of an object.
• Regression Analysis: Based on a set of numeric data, regression is a data mining method that predicts a range of numerical values (also known as continuous values). For instance, you could use regression to predict the cost of goods and services based on other variables. Regression models are used across numerous industries for forecasting financial data, modeling environmental conditions, and analyzing trends.
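A minimal sketch (not from the original text) of simple linear regression fitted by ordinary least squares with NumPy; the data values are made up for illustration.

    import numpy as np

    # Made-up data: one independent variable x and a dependent variable y
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

    # Fit y ~ slope * x + intercept by least squares
    slope, intercept = np.polyfit(x, y, deg=1)
    print(f"y = {slope:.2f} * x + {intercept:.2f}")

    # Predict the dependent variable for a new value of x
    print(slope * 6.0 + intercept)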
Data Warehouse and DBMS: Background

A Database Management System (DBMS) stores data in the form of tables, uses the ER model, and its goal is to maintain the ACID properties. For example, the DBMS of a college has tables for students, faculty, etc.

A Data Warehouse is separate from a DBMS; it stores a huge amount of data, which is typically collected from multiple heterogeneous sources like files, DBMSs, etc. The goal is to produce statistical results that may help in decision making. For example, a college might want to quickly see different results, such as how the placement of CS students has improved over the last 10 years in terms of salaries, counts, etc.
Need for Data Warehouse

An ordinary database can store MBs to GBs of data, and that too for a specific purpose. For storing data of TB size, the storage shifted to the Data Warehouse. Besides this, a transactional database doesn't lend itself to analytics. To effectively perform analytics, an organization keeps a central Data Warehouse to closely study its business by organizing, understanding, and using its historical data for taking strategic decisions and analyzing trends.

Data Warehouse vs Database System (DBMS)

o DBMS: It supports operational processes. | Data Warehouse: It supports analysis and performance reporting.
o DBMS: Capture and maintain the data. | Data Warehouse: Explore the data.
o DBMS: Current data. | Data Warehouse: Multiple years of history.
o DBMS: Data is balanced within the scope of this one system. | Data Warehouse: Data must be integrated and balanced from multiple systems.
o DBMS: Data is updated when a transaction occurs. | Data Warehouse: Data is updated on scheduled processes.
o DBMS: Data verification occurs when entry is done. | Data Warehouse: Data verification occurs after the fact.
o DBMS: 100 MB to GB. | Data Warehouse: 100 GB to TB.
o DBMS: ER based. | Data Warehouse: Star/Snowflake based.
o DBMS: Application oriented. | Data Warehouse: Subject oriented.
o DBMS: Primitive and highly detailed. | Data Warehouse: Summarized and consolidated.


Example Applications of Data Warehousing
Data Warehousing can be applied anywhere where we have a huge amount of data and we want
to see statistical results that help in decision making.

• Social Media Websites: Social networking websites like Facebook, Twitter, LinkedIn, etc. are based on analyzing large data sets. These sites gather data related to members, groups, locations, etc., and store it in a single central repository. Because of the large amount of data, a Data Warehouse is needed to implement this.
• Banking: Most banks these days use warehouses to see the spending patterns of account/card holders. They use this to provide them with special offers, deals, etc.
• Government: Governments use a data warehouse to store and analyze tax payments, which are used to detect tax theft.

Multidimensional Data Model:

A multidimensional model views data in the form of a data cube. A data cube enables data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities concerning which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These dimensions allow the store to keep track of things such as monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further. For example, a dimension table for an item may contain the attributes item_name, brand, and type.

A multidimensional data model is organized around a central theme, for example sales. This theme is represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the types of items sold). The fact or measure displayed is rupees_sold (in thousands).
Now, suppose we want to view the sales data with a third dimension: the data according to time and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table, where the 3D data are represented as a series of 2D tables.

Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in the figure.
OLAP operations:

OLAP stands for Online Analytical Processing. It is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on a multidimensional data model and allows the user to query multidimensional data (e.g. Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes, and these cubes are known as hyper-cubes.

There are five basic analytical operations that can be performed on an OLAP cube:

1. Drill down: In drill-down operation, the less detailed data is converted into highly detailed
data. It can be done by:
• Moving down in the concept hierarchy
• Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving down in the concept hierarchy of the Time dimension (Quarter -> Month).

2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the OLAP
cube. It can be done by:
• Climbing up in the concept hierarchy
• Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up in
the concept hierarchy of Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the
cube given in the overview section, a sub-cube is selected by selecting following dimensions
with criteria:
• Location = “Delhi” or “Kolkata”
• Time = “Q1” or “Q2”
• Item = “Car” or “Bus”

4. Slice: It selects a single dimension from the OLAP cube which results in a new sub-cube
creation. In the cube given in the overview section, Slice is performed on the dimension Time
= “Q1”.

5. Pivot: It is also known as rotation operation as it rotates the current view to get a new view
of the representation. In the sub-cube obtained after the slice operation, performing pivot
operation gives a new view of it.
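As a rough illustration (not from the original text), the following pandas sketch mimics slice, dice, and roll-up on the kind of cube described above; the data values and column names are assumptions.

    import pandas as pd

    sales = pd.DataFrame({
        "Location": ["Delhi", "Delhi", "Kolkata", "Kolkata", "Mumbai"],
        "Time":     ["Q1", "Q2", "Q1", "Q2", "Q1"],
        "Item":     ["Car", "Bus", "Car", "Bus", "Car"],
        "Sales":    [100, 80, 90, 70, 120],
    })

    # Slice: fix a single dimension value, Time = "Q1"
    slice_q1 = sales[sales["Time"] == "Q1"]

    # Dice: select a sub-cube over two or more dimensions
    dice = sales[sales["Location"].isin(["Delhi", "Kolkata"]) & sales["Time"].isin(["Q1", "Q2"])]

    # Roll-up: aggregate away dimensions (here, sum over Item and Time per Location)
    rollup = sales.groupby("Location")["Sales"].sum()

    print(slice_q1)
    print(rollup)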

Data Cleaning in Data Mining:

Data cleaning is a crucial process in data mining. It plays an important part in the building of a model. Data cleaning is a necessary process, but it is often neglected. Data quality is the main issue in quality information management. Data quality problems can occur anywhere in information systems, and these problems are addressed by data cleaning.

Data cleaning is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled.

Generally, data cleaning reduces errors and improves data quality. Correcting errors in data and
eliminating bad records can be a time-consuming and tedious process, but it cannot be ignored.
Data mining is a key technique for data cleaning. Data mining is a technique for discovering interesting information in data. Data quality mining is a recent approach that applies data mining techniques to identify and recover from data quality problems in large databases. Data mining automatically extracts hidden and intrinsic information from collections of data, and it has various techniques that are suitable for data cleaning.

Steps of Data Cleaning

While the techniques used for data cleaning may vary according to the types of data your company stores, you can follow these basic steps to clean your data:

1. Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or irrelevant observations. Duplicate observations happen most often during data collection. When you combine data sets from multiple places, scrape data, or receive data from clients or multiple departments, there are opportunities to create duplicate data. De-duplication is one of the largest areas to be considered in this process. Irrelevant observations are observations that do not fit into the specific problem you are trying to analyze.

For example, if you want to analyze data regarding millennial customers, but your dataset includes older generations, you might remove those irrelevant observations. This can make analysis more efficient, minimize distraction from your primary target, and create a more manageable and performant dataset.

2. Fix structural errors

Structural errors are when you measure or transfer data and notice strange naming conventions,
typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or
classes. For example, you may find "N/A" and "Not Applicable" in any sheet, but they should be
analyzed in the same category.

3. Filter unwanted outliers

Often, there will be one-off observations where, at a glance, they do not appear to fit within the
data you are analyzing. If you have a legitimate reason to remove an outlier, like improper data
entry, doing so will help the performance of the data you are working with.

However, sometimes the appearance of an outlier will prove a theory you are working on, and just because an outlier exists doesn't mean it is incorrect. This step is needed to determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider removing it.
4. Handle missing data

You can't ignore missing data because many algorithms will not accept missing values. There are a couple of ways to deal with missing data. None of them is optimal, but they can be considered, such as:

o You can drop observations with missing values, but this will drop or lose information, so be careful before removing them.
o You can impute missing values based on other observations; again, there is an opportunity to lose the integrity of the data because you may be operating from assumptions and not actual observations.
o You might alter how the data is used to navigate null values effectively.
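A minimal pandas sketch (not from the original text) of the first two options, dropping rows with missing values versus imputing them with a column mean; the data is made up.

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 40, 35], "income": [50000, 60000, None, 45000]})

    dropped = df.dropna()                            # option 1: drop observations with missing values
    imputed = df.fillna(df.mean(numeric_only=True))  # option 2: impute with the column mean

    print(dropped)
    print(imputed)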

Methods of Data Cleaning:

There are many data cleaning methods through which the data should be run. The methods are
described below:

1. Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has several attributes with missing values.
2. Fill in the missing value: This approach is also not very effective or feasible. Moreover, it can be a time-consuming method. In this approach, one has to fill in the missing value. This is usually done manually, but it can also be done using the attribute mean or the most probable value.

3. Binning method: This approach is very simple to understand. The sorted data is divided into several segments (bins) of equal size, and the data is then smoothed using the values around it, for example by replacing values with the bin mean or bin boundaries (see the sketch after this list).
4. Regression: The data is made smooth with the help of a regression function. The regression can be linear or multiple. Linear regression has only one independent variable, and multiple regression has more than one independent variable.
5. Clustering: This method operates on groups. Similar values are arranged into groups or clusters, and the values that fall outside the clusters are detected as outliers.
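A rough sketch (not from the original text) of smoothing by bin means: the sorted values are split into equal-size bins and each value is replaced by its bin's mean; the data and bin size are illustrative.

    # Smoothing sorted data by bin means (equal-size bins)
    data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    bin_size = 4

    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_values = data[i:i + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(mean, 2)] * len(bin_values))

    print(smoothed)  # each value replaced by the mean of its bin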

Data Transformation in Data Mining:

Raw data is difficult to trace or understand. That's why it needs to be preprocessed before
retrieving any information from it. Data transformation is a technique used to convert the raw
data into a suitable format that efficiently eases data mining and retrieves strategic information.
Data transformation includes data cleaning techniques and a data reduction technique to
convert the data into the appropriate form.

Data transformation is an essential data preprocessing technique that must be performed on the
data before data mining to provide patterns that are easier to understand.

Data integration, migration, data warehousing, and data wrangling may all involve data transformation. Data transformation increases the efficiency of business and analytic processes,
and it enables businesses to make better data-driven decisions. During the data transformation
process, an analyst will determine the structure of the data. This could mean that data
transformation may be:

o Constructive: The data transformation process adds, copies, or replicates data.
o Destructive: The system deletes fields or records.
o Aesthetic: The transformation standardizes the data to meet requirements or parameters.
o Structural: The database is reorganized by renaming, moving, or combining columns.

Data Transformation Techniques:

There are several data transformation techniques that can help structure and clean up the data
before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.

Data Smoothing

Data smoothing is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting the patterns. When collecting data, it can be manipulated to eliminate or reduce any
variance or any other noise form.

The concept behind data smoothing is that it will be able to identify simple changes to help predict different trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not otherwise see.
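A minimal sketch (not from the original text) of one common smoothing approach, a simple moving average; the window size and series values are illustrative.

    # Simple moving-average smoothing of a noisy series
    series = [5, 7, 6, 9, 12, 11, 14, 13, 16]
    window = 3

    smoothed = [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]
    print(smoothed)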

Attribute Construction

In the attribute construction method, new attributes are constructed from the existing attributes to produce a data set that eases data mining. New attributes are created from the given attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.

For example, suppose we have a data set referring to measurements of different plots, i.e., we may have the height and width of each plot. Here we can construct a new attribute 'area' from the attributes 'height' and 'width'. This also helps in understanding the relations among the attributes in a data set.
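A small pandas sketch (not from the original text) of the plot example, constructing an 'area' attribute from 'height' and 'width'; the values are made up.

    import pandas as pd

    plots = pd.DataFrame({"height": [10, 12, 8], "width": [5, 6, 7]})
    plots["area"] = plots["height"] * plots["width"]   # constructed attribute
    print(plots)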

Data Aggregation

Data collection or aggregation is the method of storing and presenting data in a summary format.
The data may be obtained from multiple data sources to integrate these data sources into a data
analysis description. This is a crucial step since the accuracy of data analysis insights is highly
dependent on the quantity and quality of the data used.

Gathering accurate data of high quality and in large enough quantity is necessary to produce relevant results. The collected data is useful for everything from decisions concerning the financing or business strategy of a product to pricing, operations, and marketing strategies.

For example, we have a data set of sales reports of an enterprise that has quarterly sales of each
year. We can aggregate the data to get the enterprise's annual sales report.

Data Normalization

Normalizing the data refers to scaling the data values to a much smaller range, such as [-1, 1] or [0.0, 1.0]. There are different methods to normalize the data, as discussed below.

Consider that we have a numeric attribute A and n observed values for attribute A, namely V1, V2, V3, ..., Vn.
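As a hedged sketch (the specific methods are not reproduced in this text), two commonly used normalization methods, min-max normalization and z-score normalization, can be expressed in Python as follows; the sample values are illustrative.

    # Min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
    def min_max(values, new_min=0.0, new_max=1.0):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

    # Z-score normalization: v' = (v - mean_A) / std_A
    def z_score(values):
        mean = sum(values) / len(values)
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        return [(v - mean) / std for v in values]

    A = [200, 300, 400, 600, 1000]
    print(min_max(A))
    print(z_score(A))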

Data Generalization

Data generalization converts low-level data attributes to high-level data attributes using a concept hierarchy. This conversion from a lower conceptual level to a higher one is useful for getting a clearer picture of the data. Data generalization can be divided into two approaches:

o Data cube process (OLAP) approach
o Attribute-oriented induction (AOI) approach

For example, age data in a dataset may be in the form of numeric values such as (20, 30). It can be transformed into a higher conceptual level, a categorical value such as young or old.

Data Reduction in Data Mining:

The method of data reduction may achieve a condensed description of the original data which is much smaller in quantity but keeps the quality of the original data.

Methods of data reduction:

These are explained below.

1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For example, imagine that the information you gathered for your analysis for the years 2012 to 2014 includes the revenue of your company every three months. If you are interested in the annual sales rather than the quarterly figures, you can summarize the data so that the resulting data reports the total sales per year instead of per quarter.
2. Dimension Reduction:
Whenever we come across data that is weakly relevant, we keep only the attributes required for our analysis. Dimension reduction reduces data size by eliminating outdated or redundant features.
• Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes is added to the set based on its relevance (assessed, for example, with a p-value in statistics).
Suppose there are the following attributes in the data set, of which a few are redundant.
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
• Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each step, eliminates the worst remaining attribute from the set.
Suppose there are the following attributes in the data set, of which a few are redundant.
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6}
Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
• Combination of Forward and Backward Selection –
This allows us to remove the worst attributes and select the best ones, saving time and making the process faster. A rough code sketch of forward selection is given below.
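A rough sketch (not from the original text) of step-wise forward selection using scikit-learn's cross-validation score as the relevance criterion; the estimator, made-up dataset, and fixed stopping rule are illustrative assumptions rather than the exact procedure described above.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Made-up data: 6 candidate attributes X1..X6, one target y
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 6))
    y = 2 * X[:, 0] - 3 * X[:, 1] + X[:, 4] + rng.normal(scale=0.1, size=100)

    selected, remaining = [], list(range(6))
    for _ in range(3):                                  # keep 3 attributes (assumed stopping rule)
        scores = {
            j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)              # attribute that improves the model most
        selected.append(best)
        remaining.remove(best)

    print("Final reduced attribute set:", [f"X{j + 1}" for j in sorted(selected)])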
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms (e.g. Huffman encoding and run-length encoding). We can divide it into two types based on the compression technique used.
• Lossless Compression –
Encoding techniques such as run-length encoding allow a simple and minimal reduction in data size. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
• Lossy Compression –
Methods such as the Discrete Wavelet Transform and PCA (principal component analysis) are examples of this type of compression. For example, the JPEG image format is a lossy compression, but the result remains meaningfully equivalent to the original image. In lossy data compression, the decompressed data may differ from the original data but is useful enough to retrieve information from.
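A minimal sketch (not from the original text) of lossless run-length encoding, one of the encoding mechanisms mentioned above; the input string is illustrative.

    from itertools import groupby

    def rle_encode(s):
        # Replace each run of identical characters with (character, run length)
        return [(ch, len(list(run))) for ch, run in groupby(s)]

    def rle_decode(pairs):
        return "".join(ch * n for ch, n in pairs)

    encoded = rle_encode("AAAABBBCCD")
    print(encoded)                              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
    print(rle_decode(encoded) == "AAAABBBCCD")  # lossless: the original is restored exactly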
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller representation of the data, so it is sufficient to store only the model parameters. Alternatively, non-parametric methods such as clustering, histograms, and sampling can be used.
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide continuous attributes into data with intervals. We replace many constant values of the attributes with labels of small intervals. This means that mining results are shown in a concise and easily understandable way.
• Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of the attribute and then repeat this method on the resulting intervals, the process is known as top-down discretization, also known as splitting.
• Bottom-up discretization –
If you first consider all the constant values as split points and then discard some by merging neighbouring values into intervals, the process is called bottom-up discretization.
Concept Hierarchies: A concept hierarchy reduces the data size by collecting and then replacing low-level concepts (such as 43 for age) with high-level concepts (categorical variables such as middle age or senior). For numeric data, the following techniques can be used:
• Binning –
Binning is the process of changing numerical variables into categorical counterparts. The number of categorical counterparts depends on the number of bins specified by the user.
• Histogram analysis –
Like the process of binning, a histogram is used to partition the values of an attribute X into disjoint ranges called brackets. There are several partitioning rules (a small code sketch follows this list):
1. Equal Frequency partitioning: Partitioning the values based on their number of occurrences in the data set, so each bracket holds roughly the same number of values.
2. Equal Width Partitioning: Partitioning the values into brackets of fixed width based on the number of bins, e.g. a set of values split into ranges such as 0-20.
3. Clustering: Grouping similar data together.
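A minimal pandas sketch (not from the original text) contrasting equal-width and equal-frequency partitioning using pd.cut and pd.qcut; the age values and bin counts are illustrative.

    import pandas as pd

    ages = pd.Series([4, 8, 15, 21, 21, 24, 25, 26, 28, 29, 34, 43, 52, 61, 70])

    equal_width = pd.cut(ages, bins=3)   # equal-width partitioning: bins of equal range
    equal_freq = pd.qcut(ages, q=3)      # equal-frequency partitioning: bins with ~equal counts

    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())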
