3. Data Pre-Processing Concepts

The document discusses concepts related to data pre-processing including data cleaning, integration, transformation, and reduction. It describes handling missing or noisy data, schema integration, data normalization, attribute selection, and other techniques. Data discretization and concept hierarchy generation are also explained as important preprocessing steps.


Unit-3 Data Pre-processing Concepts

3.1 Data Pre-processing concepts


Data preprocessing is the process of transforming raw data into an understandable
format. It is an important step in data mining, since we cannot work with raw data
directly: the quality of the data should be checked before applying machine learning
or data mining algorithms.
When we talk about data, we usually picture large datasets with a huge number of
rows and columns. While that is a likely scenario, it is not always the case: data can
come in many different forms, such as structured tables, images, audio files, videos,
etc.
Machines do not understand free text, image, or video data as it is; they understand
1s and 0s. So it is not enough to put on a slideshow of all our images and expect a
machine learning model to be trained just by that!
3.2 Major Tasks in Data Preprocessing

 Data Cleaning

The data can have many irrelevant and missing parts; data cleaning is done to
handle them. It involves the handling of missing data, noisy data, etc.
i. Missing Data
This situation arises when some values are absent from the dataset. It can
be handled in various ways. Some of them are:
 Ignore the tuples: This approach is suitable only when the
dataset we have is quite large and multiple values are missing
within a tuple.
 Fill in the missing value: There are various ways to do this task.
You can choose to fill the missing values manually, with the
attribute mean, or with the most probable value.
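Filling with the attribute mean can be sketched as follows; the ages are hypothetical, and None marks a missing value:

```python
# Filling missing values with the attribute mean: a minimal sketch.
ages = [23, None, 31, 27, None, 35]   # hypothetical attribute values

observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)   # (23+31+27+35)/4 = 29.0

# Replace each missing value with the mean of the observed values.
filled = [a if a is not None else mean_age for a in ages]
print(filled)  # → [23, 29.0, 31, 27, 29.0, 35]
```

The same idea applies to filling with the most probable value; one would substitute the mode (or a value predicted by a model) for the mean.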

ii. Noisy Data

Noisy data is meaningless data that cannot be interpreted by machines.
It can be generated by faulty data collection, data entry errors, etc. It
can be handled in the following ways:
 Binning method
This method works on sorted data in order to smooth it. The whole
data is divided into segments (bins) of equal size, and each segment
is handled separately. All values in a segment can be replaced by the
segment's mean, or the boundary values of the segment can be used
instead.
 Regression
Here data can be smoothed by fitting it to a regression function.
The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).
 Clustering
This approach groups similar data into clusters. Values that fall
outside the clusters can be detected as outliers.
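Smoothing by bin means, from the binning method above, can be sketched as follows (the price values are hypothetical):

```python
# Smoothing noisy data by bin means: a minimal sketch.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])  # binning works on sorted data

def smooth_by_bin_means(values, n_bins):
    """Split sorted values into equal-frequency bins and replace
    every value in a bin with that bin's mean."""
    size = len(values) // n_bins
    smoothed = []
    for i in range(0, len(values), size):
        bin_ = values[i:i + size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

print(smooth_by_bin_means(prices, 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin boundaries would instead replace each value with whichever of the bin's minimum or maximum is closer.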

 Data Integration

This is the process of combining data from multiple sources into a single
dataset. The data integration process is one of the main components of data
management. Some problems must be considered during data integration:
 Schema integration: Integrating metadata (data that
describes other data) from different sources.
 Entity identification problem: Identifying the same real-world
entity across multiple databases. For example, the system or the
user should know that student_id in one database and
student_name in another database belong to the same entity.
 Detecting and resolving data value conflicts: Data taken
from different databases may differ when merged; the attribute
values for the same entity in one database may differ from those
in another. For example, the date format may differ, such as
"MM/DD/YYYY" versus "DD/MM/YYYY".
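Resolving the date-format conflict mentioned above can be sketched as follows; the two records and their formats are hypothetical:

```python
from datetime import datetime

# Resolving a data value conflict during integration: two hypothetical
# sources store the same date in different formats, so both are
# converted to a common ISO representation before merging.
us_record = "12/25/2023"   # source A uses MM/DD/YYYY
eu_record = "25/12/2023"   # source B uses DD/MM/YYYY

iso_us = datetime.strptime(us_record, "%m/%d/%Y").date().isoformat()
iso_eu = datetime.strptime(eu_record, "%d/%m/%Y").date().isoformat()

print(iso_us, iso_eu)  # → 2023-12-25 2023-12-25
```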
 Data Transformation

This step is taken in order to transform the data into forms appropriate for
the mining process. It involves the following:
 Normalization: Scaling the data values into a specified range,
such as -1.0 to 1.0 or 0.0 to 1.0.
 Attribute selection: In this strategy, new attributes are constructed
from the given set of attributes to help the mining process.
 Discretization: Replacing the raw values of a numeric attribute
with interval labels or conceptual labels.
 Concept hierarchy generation: Attributes are converted from a
lower level to a higher level in a hierarchy. For example, the attribute
"city" can be generalized to "country".
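Min-max normalization, one common way to scale values into a specified range, can be sketched as follows (the income values are hypothetical):

```python
# Min-max normalization into [new_min, new_max]: a minimal sketch.
def min_max(values, new_min=0.0, new_max=1.0):
    """v' = (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min"""
    old_min, old_max = min(values), max(values)
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

incomes = [12000, 73600, 98000]   # hypothetical attribute values
print(min_max(incomes))           # 73600 maps to roughly 0.716
```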
 Data Reduction

Data mining is a technique used to handle huge amounts of data, and analysis
becomes harder as the volume of data grows. To deal with this, we use data
reduction techniques, which aim to increase storage efficiency and reduce data
storage and analysis costs.
The main approaches to data reduction are:
 Data cube aggregation: Aggregation operations are applied to the data
to construct a data cube.
 Attribute subset selection: Only the highly relevant attributes are
kept; the rest can be discarded. To perform attribute selection, one can
use the significance level and the p-value of each attribute: an attribute
whose p-value is greater than the significance level can be discarded.
 Numerosity reduction: This enables storing a model of the data instead
of the whole data, for example regression models.
 Dimensionality reduction: This reduces the size of the data by encoding
mechanisms, which can be lossy or lossless. If the original data can be
retrieved after reconstruction from the compressed data, the reduction is
called lossless; otherwise it is lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal
Component Analysis).
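As a sketch of PCA-based dimensionality reduction, the following projects a small hypothetical 2-D dataset onto its top principal component using NumPy's SVD:

```python
import numpy as np

# PCA via SVD: a minimal sketch on a hypothetical 2-D dataset,
# reduced to 1 dimension.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                    # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:1].T                  # project onto the top component

print(X_reduced.shape)  # → (6, 1): 6 records, now 1 attribute each
```

This is lossy: reconstructing with `X_reduced @ Vt[:1] + X.mean(axis=0)` recovers only an approximation of the original data.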
3.3 Data Discretization and Concept Hierarchy Generation
 Data Discretization

Data discretization is a method of converting a huge number of data values
into a smaller set so that the evaluation and management of the data become
easier. In other words, data discretization converts the values of a
continuous attribute into a finite set of intervals with minimum loss of
information. There are two forms of data discretization: supervised
discretization, in which the class information is used, and unsupervised
discretization, which does not use class information and is characterized
instead by the direction in which the operation proceeds, either a top-down
splitting strategy or a bottom-up merging strategy.

Example:
Suppose we have an attribute Age with the given values:

Age: 1, 2, 6, 9, 11, 15, 17, 18, 19, 31, 35, 45, 58, 59, 61, 65, 71, 75

After data discretization the table becomes:

Age     1, 2, 6, 9    11, 15, 17, 18, 19    31, 35, 45, 58, 59    61, 65, 71, 75
Label   Child         Young                 Mature                Old
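The mapping in the table above can be sketched as follows; the interval boundaries are assumptions read off the table:

```python
# Discretizing a continuous Age attribute into concept labels:
# a minimal sketch with boundaries inferred from the table above.
ages = [1, 2, 6, 9, 11, 15, 17, 18, 19, 31, 35, 45, 58, 59, 61, 65, 71, 75]

def age_label(age):
    if age <= 9:
        return "Child"
    if age <= 19:
        return "Young"
    if age <= 59:
        return "Mature"
    return "Old"

labels = [age_label(a) for a in ages]
print(labels[0], labels[4], labels[9], labels[-1])
# → Child Young Mature Old
```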

Techniques of Data Discretization

 Histogram analysis: A histogram is a plot representing the underlying
frequency distribution of a continuous data set. It assists data inspection
by showing the data distribution, for example outliers, skewness, or an
approximately normal distribution.

 Binning: Binning is a data smoothing technique that groups a huge
number of continuous values into a smaller number of bins. It can also
be used for data discretization and for developing concept hierarchies.

 Cluster analysis: This is a form of data discretization in which a
clustering algorithm partitions the values of a numeric attribute into
clusters, each cluster becoming one interval.

 Decision tree analysis: Discretization by decision tree analysis uses a
top-down splitting technique and is a supervised procedure. To
discretize a numeric attribute, the split point that yields the least
entropy is selected, and the procedure is then applied recursively: each
step divides the values into discretized disjoint intervals, from top to
bottom, using the same splitting criterion.

 Correlation analysis: Correlation-based discretization (for example,
ChiMerge) is a supervised, bottom-up procedure: the best neighboring
intervals are found, and adjacent intervals with the most similar class
distributions are merged recursively, so that small intervals are
combined into larger ones to form the final set of intervals.
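The entropy-based split selection used in decision tree analysis can be sketched as follows; the values and class labels are hypothetical:

```python
import math

# Entropy-based (decision-tree style) split selection: a minimal sketch.
# Pick the boundary that minimizes the weighted class entropy of the
# two resulting intervals.
data = [(5, "no"), (12, "no"), (20, "yes"), (25, "yes"), (40, "yes")]

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_split(data):
    values = sorted(v for v, _ in data)
    # candidate split points: midpoints between consecutive values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    def cost(t):
        left = [c for v, c in data if v <= t]
        right = [c for v, c in data if v > t]
        n = len(data)
        return len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return min(candidates, key=cost)

print(best_split(data))  # → 16.0, which separates the two classes exactly
```

Recursively applying `best_split` to each resulting interval would yield the full set of disjoint intervals.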

 Concept Hierarchy Generation:

The term hierarchy represents an organizational structure or mapping in which
items are ranked according to their level of generality. In other words, a
concept hierarchy is a sequence of mappings from a set of low-level, specific
concepts to more general, high-level concepts. Computer science has many
hierarchical systems; for example, the way a document is placed in a folder at
a specific position in the Windows directory tree is a hierarchical tree model.
There are two types of hierarchy mapping: top-down mapping and bottom-up
mapping.

Example:
A particular city can be mapped to the country it belongs to. For example,
New Delhi can be mapped to India, and India can be mapped to Asia.

 Top-down mapping: Generally starts at the top with some general
information and ends at the bottom with the specialized information.
 Bottom-up mapping: Generally starts at the bottom with some
specialized information and ends at the top with the generalized
information.
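Climbing such a hierarchy can be sketched as follows; the city-to-country-to-continent mapping is a hypothetical example:

```python
# Climbing a concept hierarchy from a specific concept to the most
# general one: a minimal sketch with a hypothetical mapping.
hierarchy = {
    "New Delhi": "India",
    "India": "Asia",
    "Ottawa": "Canada",
    "Canada": "North America",
}

def generalize(concept, hierarchy):
    """Follow the mapping upward until no more general concept exists."""
    path = [concept]
    while path[-1] in hierarchy:
        path.append(hierarchy[path[-1]])
    return path

print(generalize("New Delhi", hierarchy))
# → ['New Delhi', 'India', 'Asia']
```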

3.4 DMQL
The Data Mining Query Language (DMQL) is based on the Structured Query
Language (SQL). Data mining query languages can be designed to support ad hoc
and interactive data mining. DMQL provides commands for specifying data mining
primitives, and it can work with databases and data warehouses as well. DMQL can
be used to define data mining tasks; in particular, we examine how to define data
warehouses and data marts in DMQL.
DMQL was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system.
 Syntax of DMQL:

Syntax of DMQL for specifying task-relevant data.

use database database_name

OR
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list

 Syntax – Specifying the Kind of Knowledge

a) Characterization

mine characteristics [as pattern_name]
analyze {measure(s)}

b) Discrimination

mine comparison [as pattern_name]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}

c) Association

mine associations [as pattern_name]
{matching {metapattern}}

d) Classification

mine classification [as pattern_name]
analyze classifying_attribute_or_dimension

e) Prediction

mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i = value_i}}

 Full Specification of DMQL:

Suppose that, as a marketing manager of a company, you would like to
characterize the buying habits of customers who purchase items priced at no
less than $100, with respect to the customer's age, the type of item purchased,
and the place where the item was purchased. You would like to know the
percentage of customers having those characteristics. In particular, you are
only interested in purchases made in Canada and paid for with an American
Express credit card, and you would like to view the resulting descriptions in
the form of a table.

use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table
