
DATA MINING

Lectures 5: Data Preprocessing III

Dr. Doaa Elzanfaly


Lecture Outline

◼ Data Preprocessing: An Overview

◼ Data Quality

◼ Major Tasks in Data Preprocessing

◼ Data Cleaning

◼ Data Integration

◼ Data Reduction

◼ Data Transformation and Data Discretization

There are several data pre-processing techniques.


Data cleaning can be applied to remove noise and correct inconsistencies in
data.
Data integration merges data from multiple sources into a coherent data store
such as a data warehouse.
Data reduction can reduce data size by, for instance, aggregating, eliminating
redundant features, or clustering.
Data transformations (e.g., normalization) may be applied, where data are
scaled to fall within a smaller range like 0.0 to 1.0. This can improve the
accuracy and efficiency of mining algorithms involving distance measurements.
These techniques are not mutually exclusive; they may work together.

Data preprocessing techniques, when applied before mining, can substantially
improve the overall quality of the patterns mined and/or the time required for
the actual mining.
Why Transformation?
◼ Data is often heterogeneous
◼ A demographic data set may contain both numeric and categorical attributes.
This is a problem because different data mining algorithms may only work with
specific data types.

◼ Possible Solutions
◼ Designing an algorithm that can handle an arbitrary combination of data types
simultaneously, processing both numerical and categorical data.
>> Time-consuming and sometimes impractical
◼ Converting between various data types >> Utilize off-the-shelf tools for
processing
Ex: Converting categorical data (like gender) to numerical format using encoding
(e.g., "Male" to 1, "Female" to 0).
Ex: Discretizing continuous numerical data into categories if needed.

Demographic data refers to data about groups of people according to certain
attributes. Examples include age, gender and interests.
Data Transformation

◼ A function that maps the entire set of values of a given attribute to a new
set of replacement values.

◼ Methods

1. Aggregation: Summarization, data cube construction

2. Normalization: Scaled to fall within a smaller, specified range


a. min-max normalization

b. z-score normalization

c. normalization by decimal scaling

3. Discretization: Concept hierarchy climbing

3. Aggregation
◼ Data aggregation is the process in which raw data is gathered and
summarized to perform statistical analysis.

◼ Aggregated data is usually presented in data warehouses

For example, finding the average age of customer buying a particular product
can help in finding out the targeted age group for that particular product.
Instead of dealing with an individual customer, the average age of the customer
is calculated.

◼ Time aggregation - It provides the data point for single resources for a
defined time period. Example: The website receives 60 visits in one hour. After
aggregating, the data will show total visits per day like 1,500 visits on Monday,
1,800 on Tuesday, etc.
◼ Spatial aggregation - It provides the data point for a group of resources for
a defined time period. Example: Suppose there are multiple weather stations in
different cities within a region. Spatial aggregation can combine the readings to
provide an average temperature for the entire region for a given day.
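
As a rough illustration (not part of the lecture notes), both kinds of aggregation
can be sketched with pandas; the visit counts, stations, temperatures, and column
names below are made up:

# A minimal sketch of time and spatial aggregation with pandas.
# All data, column names, and values are hypothetical.
import pandas as pd

# Hourly website visits -> time aggregation into daily totals
visits = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=72, freq="h"),
    "visits": 60,                      # e.g., 60 visits per hour
})
daily_visits = visits.resample("D", on="timestamp")["visits"].sum()
print(daily_visits)                    # one total per day (here 1,440 visits/day)

# Temperature readings from several stations -> spatial aggregation per region
readings = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "station": ["A", "B", "C", "D"],
    "temp_c":  [21.0, 23.0, 28.0, 30.0],
})
regional_avg = readings.groupby("region")["temp_c"].mean()
print(regional_avg)                    # average temperature per region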
4. Normalization
◼ To give all attributes an equal weight, the data should be normalized or
standardized.
◼ This helps to prevent attributes with initially large ranges from outweighing
attributes with initially smaller ranges.
◼ This is particularly useful for classification algorithms involving neural
networks or distance measurements such as nearest-neighbour classification and
clustering.
◼ The measurement unit used can affect the data analysis. Changing measurement
units from meters to inches for height, or from kilograms to pounds for weight,
may lead to very different results.

Expressing an attribute in smaller units will lead to a larger range for that
attribute, and thus tends to give such an attribute greater effect or "weight."
k-nearest neighbors (k-NN)

◼ Calculate the Euclidean distance between each house below and a house of
2500 sq ft with 3 bedrooms.

Original data:

House Size (sq ft)   Rooms
1000                 2
2000                 3
3000                 4
4000                 5

After z-score normalization:

House Size (sq ft)   Rooms
-1.16                -1.34
-0.39                -0.45
0.39                 0.45
1.16                 1.34
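
A short sketch (not from the lecture) showing how normalization changes the
nearest-neighbour distances for this example; the exact z-scores depend on
whether the sample or the population standard deviation is used:

# Euclidean distance to a query house before and after z-score normalization.
import numpy as np

houses = np.array([[1000, 2],
                   [2000, 3],
                   [3000, 4],
                   [4000, 5]], dtype=float)
query = np.array([2500, 3], dtype=float)

# Raw distances: dominated almost entirely by the size attribute
raw_dist = np.linalg.norm(houses - query, axis=1)

# Z-score normalize each column using the mean and (sample) standard deviation
mu, sigma = houses.mean(axis=0), houses.std(axis=0, ddof=1)
houses_z = (houses - mu) / sigma
query_z = (query - mu) / sigma

# Normalized distances: both attributes now contribute comparably
norm_dist = np.linalg.norm(houses_z - query_z, axis=1)

print(raw_dist)    # approx. [1500, 500, 500, 1500] -- rooms barely matter
print(norm_dist)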
Data Normalization Methods
Let A be a numeric attribute with n observed values, v1, v2, … , vn.
◼ Min-max normalization to [new_minA, new_maxA]

◼ Performs a linear transformation on the original data by mapping a value, vi,
of A to v'i in the range [new_minA, new_maxA]:

    v'i = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,600 is mapped to:

    ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

◼ Preserves the relationships among the original data values.


◼ Encounters an “out-of-bounds” error if a future input case falls
outside of the original data range for A.
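
A minimal Python sketch of min-max normalization, reproducing the income
example above (the helper function name is mine, not from the lecture):

# Min-max normalization of the income example to [0.0, 1.0].
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Map the values in v linearly from [min(v), max(v)] to [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

income = np.array([12_000, 73_600, 98_000])
print(min_max_normalize(income))   # 73,600 -> approx. 0.716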

Data Normalization Methods

◼ Z-score normalization (μ: mean, σ: standard deviation):

    v'i = (vi − Ᾱ) / σA

where Ᾱ and σA are the mean and standard deviation, respectively, of attribute A.

◼ Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to:

    (73,600 − 54,000) / 16,000 = 1.225

◼ This method of normalization is useful when:


◼ The actual minimum and maximum of attribute A are unknown,
◼ Or when there are outliers that dominate the min-max normalization

If you need normalized data with a controlled or fixed range after Z-score
normalization, you could apply an additional step:
1. Apply Z-score normalization first: normalize the data to mean 0 and
standard deviation 1.
2. Apply min-max scaling on the Z-scores: use min-max normalization to
rescale the Z-score values to a specific range, like [0, 1] or [−1, 1].
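
A minimal sketch of z-score normalization and the optional min-max rescaling of
the z-scores described above; apart from the mean ($54,000) and the value
$73,600, the income figures are made up:

# Z-score normalization, optionally followed by min-max rescaling of the z-scores.
import numpy as np

def z_score_normalize(v, mu=None, sigma=None):
    """Normalize v to mean 0 and standard deviation 1 (using mu/sigma if supplied)."""
    v = np.asarray(v, dtype=float)
    mu = v.mean() if mu is None else mu
    sigma = v.std(ddof=1) if sigma is None else sigma
    return (v - mu) / sigma

income = np.array([42_000, 54_000, 73_600, 98_000])

# Step 1: z-score normalization with the given mean and standard deviation
z = z_score_normalize(income, mu=54_000, sigma=16_000)
print(z)          # 73,600 -> (73,600 - 54,000) / 16,000 = 1.225

# Step 2 (optional): rescale the z-scores to a fixed range such as [0, 1]
z01 = (z - z.min()) / (z.max() - z.min())
print(z01)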
Data Normalization Methods
◼ Normalization by decimal scaling
◼ Normalizes by moving the decimal point of values of attribute A.
◼ The number of decimal points moved depends on the maximum
absolute value of A.
    v'i = vi / 10^j

where j is the smallest integer such that max(|v'i|) < 1.

◼ Ex. Suppose that the recorded values of A range from -986 to 917. The maximum
absolute value of A is 986. To normalize by decimal scaling, we therefore divide each
value by 1000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to
0.917.
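
The same example, sketched in Python (the helper function is mine, not from the
lecture):

# Normalization by decimal scaling: divide by the smallest power of 10
# that brings every |value| below 1.
import numpy as np

def decimal_scale(v):
    v = np.asarray(v, dtype=float)
    # smallest integer j such that max(|v|) / 10**j < 1
    j = int(np.floor(np.log10(np.max(np.abs(v))))) + 1
    return v / 10 ** j, j

values = [-986, 917]
scaled, j = decimal_scale(values)
print(j, scaled)   # j = 3, values become [-0.986, 0.917]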

Note that normalization can change the original data quite a bit, especially
when using z-score normalization or decimal scaling. It is also necessary to
save the normalization parameters (e.g., the mean and standard deviation if
using z-score normalization) so that future data can be normalized in a
uniform manner.
5. Discretization
◼ Discretization refers to the process of converting or partitioning continuous
attributes into discretized or nominal attributes.

◼ Typical methods: All the methods can be applied recursively

• Binning: top-down split, unsupervised
• Histogram analysis: top-down split, unsupervised
• Clustering analysis: unsupervised, top-down split or bottom-up merge
Binning as a discretization technique
◼ Binning can also be used as a discretization technique.

◼ Attribute values can be discretized by applying equal-width or equal-frequency
binning.

◼ The continuous values in each bin can be converted to a nominal or discretized
value by replacing them by the bin mean or median.

◼ Variations within a range are not distinguishable after discretization.

◼ For uniformly distributed data, equal width bins may be useful.

◼ For data that is not uniformly distributed, equal depth bins work
reasonably well.

For example, consider the age attribute. One could create ranges [0, 10], [11,
20], [21, 30], and so on. The symbolic value for any record in the range [11, 20]
is “2” and the symbolic value for a record in the range [21, 30] is “3”. Because
these are symbolic values, no ordering is assumed between the values “2” and
“3”. Furthermore, variations within a range are not distinguishable after
discretization. Thus, the discretization process does lose some information for
the mining process. However, for some applications, this loss of information is
not too debilitating. One challenge with discretization is that the data may be
nonuniformly distributed across the different intervals. For example, for the
case of the salary attribute, a large subset of the population may be grouped in
the [40, 000, 80, 000] range, but very few will be grouped in the [1, 040, 000, 1,
080, 000] range. Note that both ranges have the same size. Thus, the use of
ranges of equal size may not be very helpful in discriminating between different
data segments. On the other hand, many attributes, such as age, are not as
nonuniformly distributed, and therefore ranges of equal size may work
reasonably well.
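
A brief sketch (not from the lecture) of equal-width and equal-frequency binning
of an age attribute with pandas; the ages, bin boundaries, and labels are made up
for illustration:

# Equal-width vs. equal-frequency (equal-depth) binning of an age attribute.
import pandas as pd

ages = pd.Series([3, 7, 15, 22, 24, 25, 31, 38, 45, 52, 61, 78])

# Equal-width bins: (0, 10], (10, 20], (20, 30], ... labelled 1, 2, 3, ...
equal_width = pd.cut(ages, bins=list(range(0, 81, 10)), labels=list(range(1, 9)))

# Equal-frequency bins: each bin holds roughly the same number of records
equal_freq = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

# Replace the values in each equal-width bin by the bin mean (smoothing by bin means)
bin_means = ages.groupby(equal_width, observed=True).transform("mean")

print(pd.DataFrame({"age": ages, "width_bin": equal_width,
                    "freq_bin": equal_freq, "bin_mean": bin_means}))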
Histograms

◼ Histogram-based discretization often provides the most interpretable bins
because it respects natural groupings in the data, which can align with
real-world categories (e.g., income levels).

• Low Income: 10 - 30
• Lower-Middle Income: 30 - 50
• Upper-Middle Income: 50 - 70
• High Income: 70 - 90
• Very High Income: 90 - 130
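
For illustration only, the income-level boundaries above can be applied with
NumPy as follows (the income values are made up, and incomes are assumed to be
in thousands):

# Assign income-level labels using the histogram bin boundaries listed above.
import numpy as np

edges  = [10, 30, 50, 70, 90, 130]
labels = ["Low", "Lower-Middle", "Upper-Middle", "High", "Very High"]

incomes = np.array([12, 28, 44, 55, 73, 95, 120])
bin_idx = np.digitize(incomes, edges[1:-1])      # which bin each income falls into
income_level = [labels[i] for i in bin_idx]
print(list(zip(incomes, income_level)))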
Categorical to Numeric Data
◼ Direct Encoding
◼ By giving each distinct value a number.
◼ Would cause the model to misinterpret these values
◼ Ex. encoding male by 1 and female by 2 may be interpreted by the model as if
female is more important than male.

◼ One Hot Encoding – Binarization


◼ This method creates a binary vector for each value

It is often desirable to use numeric data mining algorithms on categorical data.

Because binary data is a special form of both numeric and categorical data, it is
possible to convert the categorical attributes to binary form and then use
numeric algorithms on the binarized data.

Direct encoding, by giving each distinct value a number, would cause the model
to misinterpret these values, as it will consider that there is an order
relationship between these values, which is not the case.
For example, if there is a categorical feature of nominal type (i.e., male and
female), encoding male by 1 and female by 2 may be interpreted by the model as
if female is more important than male.
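
A small sketch contrasting direct encoding with one-hot encoding in pandas
(the column names and values are illustrative):

# Direct (label) encoding vs. one-hot encoding of a nominal attribute.
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# Direct encoding: implies an order (Female > Male) that does not really exist
df["gender_direct"] = df["gender"].map({"Male": 1, "Female": 2})

# One-hot encoding (binarization): one binary column per distinct value, no implied order
one_hot = pd.get_dummies(df["gender"], prefix="gender", dtype=int)

print(pd.concat([df, one_hot], axis=1))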
