3. Data Pre-Processing Concepts

The document discusses concepts related to data pre-processing including data cleaning, integration, transformation, and reduction. It describes handling missing or noisy data, schema integration, data normalization, attribute selection, and other techniques. Data discretization and concept hierarchy generation are also explained as important preprocessing steps.


Unit-3 Data Pre-processing Concepts

3.1 Data Pre-processing concepts


Data preprocessing is the process of transforming raw data into an understandable
format. It is an important step in data mining, since we cannot work with raw data
directly: the quality of the data should be checked before applying machine learning
or data mining algorithms.
When we talk about data, we usually picture large datasets with a huge number of
rows and columns. While that is a likely scenario, it is not always the case: data can
come in many different forms, such as structured tables, images, audio files, videos,
etc.
Machines do not understand free text, image, or video data as it is; they understand
1s and 0s. So it is not enough to put on a slideshow of all our images and expect a
machine learning model to be trained just by that!
3.2 Major Tasks in Data Preprocessing

 Data Cleaning

The data can have many irrelevant and missing parts; data cleaning is done to
handle them. It involves the handling of missing data, noisy data, etc.
i. Missing Data
This situation arises when some values are absent from the dataset. It can
be handled in various ways. Some of them are:
 Ignore the tuples: This approach is suitable only when the
dataset we have is quite large and multiple values are missing
within a tuple.
 Fill in the missing value: There are various ways to do this task.
You can choose to fill the missing values manually, with the
attribute mean, or with the most probable value.
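Filling with the attribute mean can be sketched as follows; the ages are hypothetical, and None marks a missing value:

```python
# Filling missing values with the attribute mean: a minimal sketch.
ages = [23, None, 31, 27, None, 35]   # hypothetical attribute values

observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)   # (23+31+27+35)/4 = 29.0

# Replace each missing value with the mean of the observed values.
filled = [a if a is not None else mean_age for a in ages]
print(filled)  # → [23, 29.0, 31, 27, 29.0, 35]
```

The same idea applies to filling with the most probable value; one would substitute the mode (or a value predicted by a model) for the mean.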

ii. Noisy Data

Noisy data is meaningless data that cannot be interpreted by machines.
It can be generated by faulty data collection, data entry errors, etc. It
can be handled in the following ways:
 Binning method
This method works on sorted data in order to smooth it. The whole
data is divided into segments (bins) of equal size, and each segment
is handled separately. All values in a segment can be replaced by the
segment's mean, or the boundary values of the segment can be used
instead.
 Regression
Here data can be smoothed by fitting it to a regression function.
The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).
 Clustering
This approach groups similar data into clusters. Values that fall
outside the clusters can be detected as outliers.
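Smoothing by bin means, from the binning method above, can be sketched as follows (the price values are hypothetical):

```python
# Smoothing noisy data by bin means: a minimal sketch.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])  # binning works on sorted data

def smooth_by_bin_means(values, n_bins):
    """Split sorted values into equal-frequency bins and replace
    every value in a bin with that bin's mean."""
    size = len(values) // n_bins
    smoothed = []
    for i in range(0, len(values), size):
        bin_ = values[i:i + size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

print(smooth_by_bin_means(prices, 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin boundaries would instead replace each value with whichever of the bin's minimum or maximum is closer.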

 Data Integration

This is the process of combining data from multiple sources into a single
dataset. The data integration process is one of the main components of data
management. Some problems must be considered during data integration:
 Schema integration: Integrating metadata (data that
describes other data) from different sources.
 Entity identification problem: Identifying the same real-world
entity across multiple databases. For example, the system or the
user should know that student_id in one database and
student_name in another database belong to the same entity.
 Detecting and resolving data value conflicts: Data taken
from different databases may differ when merged; the attribute
values for the same entity in one database may differ from those
in another. For example, the date format may differ, such as
"MM/DD/YYYY" versus "DD/MM/YYYY".
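Resolving the date-format conflict mentioned above can be sketched as follows; the two records and their formats are hypothetical:

```python
from datetime import datetime

# Resolving a data value conflict during integration: two hypothetical
# sources store the same date in different formats, so both are
# converted to a common ISO representation before merging.
us_record = "12/25/2023"   # source A uses MM/DD/YYYY
eu_record = "25/12/2023"   # source B uses DD/MM/YYYY

iso_us = datetime.strptime(us_record, "%m/%d/%Y").date().isoformat()
iso_eu = datetime.strptime(eu_record, "%d/%m/%Y").date().isoformat()

print(iso_us, iso_eu)  # → 2023-12-25 2023-12-25
```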
 Data Transformation

This step is taken in order to transform the data into forms appropriate for
the mining process. It involves the following:
 Normalization: Scaling the data values into a specified range,
such as -1.0 to 1.0 or 0.0 to 1.0.
 Attribute selection: In this strategy, new attributes are constructed
from the given set of attributes to help the mining process.
 Discretization: Replacing the raw values of a numeric attribute
with interval labels or conceptual labels.
 Concept hierarchy generation: Attributes are converted from a
lower level to a higher level in a hierarchy. For example, the attribute
"city" can be generalized to "country".
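Min-max normalization, one common way to scale values into a specified range, can be sketched as follows (the income values are hypothetical):

```python
# Min-max normalization into [new_min, new_max]: a minimal sketch.
def min_max(values, new_min=0.0, new_max=1.0):
    """v' = (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min"""
    old_min, old_max = min(values), max(values)
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

incomes = [12000, 73600, 98000]   # hypothetical attribute values
print(min_max(incomes))           # 73600 maps to roughly 0.716
```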
 Data Reduction

Data mining is a technique used to handle huge amounts of data, and analysis
becomes harder as the volume of data grows. To deal with this, we use data
reduction techniques, which aim to increase storage efficiency and reduce data
storage and analysis costs.
The main approaches to data reduction are:
 Data cube aggregation: Aggregation operations are applied to the data
to construct a data cube.
 Attribute subset selection: Only the highly relevant attributes are
kept; the rest can be discarded. To perform attribute selection, one can
use the significance level and the p-value of each attribute: an attribute
whose p-value is greater than the significance level can be discarded.
 Numerosity reduction: This enables storing a model of the data instead
of the whole data, for example regression models.
 Dimensionality reduction: This reduces the size of the data by encoding
mechanisms, which can be lossy or lossless. If the original data can be
retrieved after reconstruction from the compressed data, the reduction is
called lossless; otherwise it is lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal
Component Analysis).
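As a sketch of PCA-based dimensionality reduction, the following projects a small hypothetical 2-D dataset onto its top principal component using NumPy's SVD:

```python
import numpy as np

# PCA via SVD: a minimal sketch on a hypothetical 2-D dataset,
# reduced to 1 dimension.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                    # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:1].T                  # project onto the top component

print(X_reduced.shape)  # → (6, 1): 6 records, now 1 attribute each
```

This is lossy: reconstructing with `X_reduced @ Vt[:1] + X.mean(axis=0)` recovers only an approximation of the original data.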
3.3 Data Discretization and Concept Hierarchy Generation
 Data Discretization

Data discretization is a method of converting a huge number of data values
into a smaller set so that the evaluation and management of the data become
easier. In other words, data discretization converts the values of a
continuous attribute into a finite set of intervals with minimum loss of
information. There are two forms of data discretization: supervised
discretization, in which the class information is used, and unsupervised
discretization, which does not use class information and is characterized
instead by the direction in which the operation proceeds, either a top-down
splitting strategy or a bottom-up merging strategy.

Example:
Suppose we have an attribute Age with the given values:

Age: 1, 2, 6, 9, 11, 15, 17, 18, 19, 31, 35, 45, 58, 59, 61, 65, 71, 75

After data discretization the table becomes:

Age     1, 2, 6, 9    11, 15, 17, 18, 19    31, 35, 45, 58, 59    61, 65, 71, 75
Label   Child         Young                 Mature                Old
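The mapping in the table above can be sketched as follows; the interval boundaries are assumptions read off the table:

```python
# Discretizing a continuous Age attribute into concept labels:
# a minimal sketch with boundaries inferred from the table above.
ages = [1, 2, 6, 9, 11, 15, 17, 18, 19, 31, 35, 45, 58, 59, 61, 65, 71, 75]

def age_label(age):
    if age <= 9:
        return "Child"
    if age <= 19:
        return "Young"
    if age <= 59:
        return "Mature"
    return "Old"

labels = [age_label(a) for a in ages]
print(labels[0], labels[4], labels[9], labels[-1])
# → Child Young Mature Old
```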

Techniques of Data Discretization

 Histogram analysis: A histogram is a plot representing the underlying
frequency distribution of a continuous data set. It assists data inspection
by showing the data distribution, for example outliers, skewness, or an
approximately normal distribution.

 Binning: Binning is a data smoothing technique that groups a huge
number of continuous values into a smaller number of bins. It can also
be used for data discretization and for developing concept hierarchies.

 Cluster analysis: This is a form of data discretization in which a
clustering algorithm partitions the values of a numeric attribute into
clusters, each cluster becoming one interval.

 Decision tree analysis: Discretization by decision tree analysis uses a
top-down splitting technique and is a supervised procedure. To
discretize a numeric attribute, the split point that yields the least
entropy is selected, and the procedure is then applied recursively: each
step divides the values into discretized disjoint intervals, from top to
bottom, using the same splitting criterion.

 Correlation analysis: Correlation-based discretization (for example,
ChiMerge) is a supervised, bottom-up procedure: the best neighboring
intervals are found, and adjacent intervals with the most similar class
distributions are merged recursively, so that small intervals are
combined into larger ones to form the final set of intervals.
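The entropy-based split selection used in decision tree analysis can be sketched as follows; the values and class labels are hypothetical:

```python
import math

# Entropy-based (decision-tree style) split selection: a minimal sketch.
# Pick the boundary that minimizes the weighted class entropy of the
# two resulting intervals.
data = [(5, "no"), (12, "no"), (20, "yes"), (25, "yes"), (40, "yes")]

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_split(data):
    values = sorted(v for v, _ in data)
    # candidate split points: midpoints between consecutive values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    def cost(t):
        left = [c for v, c in data if v <= t]
        right = [c for v, c in data if v > t]
        n = len(data)
        return len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return min(candidates, key=cost)

print(best_split(data))  # → 16.0, which separates the two classes exactly
```

Recursively applying `best_split` to each resulting interval would yield the full set of disjoint intervals.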

 Concept Hierarchy Generation:

The term hierarchy represents an organizational structure or mapping in which
items are ranked according to their level of generality. In other words, a
concept hierarchy is a sequence of mappings from a set of low-level, specific
concepts to more general, high-level concepts. Computer science has many
hierarchical systems; for example, the way a document is placed in a folder at
a specific position in the Windows directory tree is a hierarchical tree model.
There are two types of hierarchy mapping: top-down mapping and bottom-up
mapping.

Example:
A particular city can be mapped to the country it belongs to. For example,
New Delhi can be mapped to India, and India can be mapped to Asia.

 Top-down mapping: Generally starts at the top with some general
information and ends at the bottom with the specialized information.
 Bottom-up mapping: Generally starts at the bottom with some
specialized information and ends at the top with the generalized
information.
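Climbing such a hierarchy can be sketched as follows; the city-to-country-to-continent mapping is a hypothetical example:

```python
# Climbing a concept hierarchy from a specific concept to the most
# general one: a minimal sketch with a hypothetical mapping.
hierarchy = {
    "New Delhi": "India",
    "India": "Asia",
    "Ottawa": "Canada",
    "Canada": "North America",
}

def generalize(concept, hierarchy):
    """Follow the mapping upward until no more general concept exists."""
    path = [concept]
    while path[-1] in hierarchy:
        path.append(hierarchy[path[-1]])
    return path

print(generalize("New Delhi", hierarchy))
# → ['New Delhi', 'India', 'Asia']
```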

3.4 DMQL
The Data Mining Query Language (DMQL) is based on the Structured Query
Language (SQL). Data mining query languages can be designed to support ad hoc
and interactive data mining. DMQL provides commands for specifying data mining
primitives, and it can work with databases and data warehouses as well. DMQL can
be used to define data mining tasks; in particular, we examine how to define data
warehouses and data marts in DMQL.
DMQL was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system.
 Syntax of DMQL:

Syntax of DMQL for specifying task-relevant data.

use database database_name

OR
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list

 Syntax – Specifying the Kind of Knowledge

a) Characterization

mine characteristics [as pattern_name]
analyze {measure(s)}

b) Discrimination

mine comparison [as pattern_name]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}

c) Association

mine associations [as pattern_name]
{matching {metapattern}}

d) Classification

mine classification [as pattern_name]
analyze classifying_attribute_or_dimension

e) Prediction

mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i = value_i}}

 Full Specification of DMQL:

Suppose that, as a marketing manager of a company, you would like to
characterize the buying habits of customers who purchase items priced at no
less than $100, with respect to the customer's age, the type of item purchased,
and the place where the item was purchased. You would like to know the
percentage of customers having those characteristics. In particular, you are
only interested in purchases made in Canada and paid for with an American
Express credit card, and you would like to view the resulting descriptions in
the form of a table.

use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table
