5 Data Preprocessing III Edited Notes
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Possible Solutions
◼ Designing an algorithm with an arbitrary combination of data types, i.e., one
that can handle a variety of data types simultaneously, processing both numerical
and categorical data. >> Time-consuming and sometimes impractical.
Ex: Converting categorical data (like gender) to numerical format using encoding (e.g.,
"Male" to 1, "Female" to 0).
Ex: Discretizing continuous numerical data into categories if needed.
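The two conversions mentioned above can be sketched in plain Python. The mappings and the age cut-points below are illustrative assumptions, not fixed rules:

```python
def encode_gender(value):
    # Map a categorical value to a number ("Male" -> 1, "Female" -> 0).
    return {"Male": 1, "Female": 0}[value]

def discretize_age(age):
    # Discretize a continuous value into labelled categories
    # (cut-points at 18 and 65 are illustrative assumptions).
    if age < 18:
        return "minor"
    elif age < 65:
        return "adult"
    return "senior"

print(encode_gender("Male"))   # 1
print(discretize_age(40))      # adult
```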
◼ Methods
b. z-score normalization
3. Aggregation
◼ Data aggregation is the process where raw data is gathered and
summarized to perform statistical analysis
For example, finding the average age of customers buying a particular product
can help identify the target age group for that product. Instead of dealing with
each individual customer, the average age of the customers is calculated.
◼ Time aggregation - It provides data points for a single resource over a
defined time period. Example: The website receives 60 visits in one hour. After
aggregating, the data will show total visits per day like 1,500 visits on Monday,
1,800 on Tuesday, etc.
◼ Spatial aggregation - It provides data points for a group of resources over
a defined time period. Example: Suppose there are multiple weather stations in
different cities within a region. Spatial aggregation can combine the readings to
provide an average temperature for the entire region for a given day.
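The time-aggregation example above can be sketched in plain Python, rolling hourly visit counts up to daily totals (the sample numbers are invented; spatial aggregation would group by location the same way):

```python
from collections import defaultdict

# (day, hour, visits) tuples - hourly data points for one website.
hourly_visits = [
    ("Monday", 9, 60), ("Monday", 10, 75),
    ("Tuesday", 9, 90), ("Tuesday", 10, 80),
]

daily_totals = defaultdict(int)
for day, _hour, visits in hourly_visits:
    daily_totals[day] += visits   # aggregate over the time dimension

print(dict(daily_totals))  # {'Monday': 135, 'Tuesday': 170}
```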
4. Normalization
◼ To give all attributes an equal weight, the data should be normalized or
standardized.
◼ This helps to prevent attributes with initially large ranges from outweighing
attributes with initially smaller ranges.
◼ The measurement unit used can affect the data analysis, e.g., changing
measurement units from meters to inches for height, or from kilograms to
pounds for weight.
◼ Expressing an attribute in smaller units will lead to a larger range for that
attribute, and thus tend to give such an attribute greater effect or “weight.”
◼ Normalization is particularly useful for classification algorithms involving
neural networks or distance measurements such as nearest-neighbour
classification (e.g., k-nearest neighbors, k-NN) and clustering.
◼ Min-max normalization maps a value v of attribute A onto a new range:
v' = (v − min_A) / (max_A − min_A) × (new_max − new_min) + new_min
◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,600 is mapped to:
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
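A minimal sketch of min-max normalization reproducing the income example above:

```python
def min_max_normalize(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # Map v from [old_min, old_max] onto [new_min, new_max].
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# The $73,600 income from the example, mapped into [0.0, 1.0].
print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716
```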
◼ z-score normalization maps a value v of attribute A to:
v' = (v − Ᾱ) / σ_A
where Ᾱ and σ_A are the mean and standard deviation, respectively, of attribute
A.
◼ Ex. Let Ᾱ = 54,000 and σ_A = 16,000. Then
(73,600 − 54,000) / 16,000 = 1.225
If you need normalized data with a controlled or fixed range after Z-score
normalization, you could apply an additional step:
1. Apply Z-score normalization first: normalize the data to mean 0 and
standard deviation 1.
2. Apply min-max scaling on the Z-scores: use min-max normalization to
rescale the Z-score values to a specific range, like [0, 1] or [−1, 1].
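The two-step recipe above can be sketched with the standard library (the sample values are invented; population standard deviation is assumed):

```python
from statistics import mean, pstdev

def z_scores(values):
    # Step 1: z-score normalization to mean 0, standard deviation 1.
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def rescale(values, new_min=0.0, new_max=1.0):
    # Step 2: min-max scaling of the z-scores onto a fixed range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

data = [30000, 54000, 73600, 98000]
z = z_scores(data)
bounded = rescale(z)               # z-scores squeezed into [0, 1]
print(min(bounded), max(bounded))  # 0.0 1.0
```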
Data Normalization Methods
◼ Normalization by decimal scaling
◼ Normalizes by moving the decimal point of values of attribute A.
◼ The number of decimal points moved depends on the maximum
absolute value of A.
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
◼ Ex. Suppose that the recorded values of A range from -986 to 917. The maximum
absolute value of A is 986. To normalize by decimal scaling, we therefore divide each
value by 1000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to
0.917.
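The decimal-scaling example above can be sketched as follows: divide by the smallest power of 10 that brings the maximum absolute value below 1 (j = 3 for the range −986 to 917):

```python
def decimal_scale(values):
    # Find the smallest j with max(|v|) / 10^j < 1, then divide all values.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scale([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```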
• Binning: top-down split, unsupervised
• Histogram analysis: top-down split, unsupervised
• Clustering analysis: unsupervised, top-down split or bottom-up merge
Binning as a discretization technique
◼ Binning can also be used as a discretization technique.
◼ For data that is not uniformly distributed, equal depth bins work
reasonably well.
For example, consider the age attribute. One could create ranges [0, 10], [11,
20], [21, 30], and so on. The symbolic value for any record in the range [11, 20]
is “2” and the symbolic value for a record in the range [21, 30] is “3”. Because
these are symbolic values, no ordering is assumed between the values “2” and
“3”. Furthermore, variations within a range are not distinguishable after
discretization. Thus, the discretization process does lose some information for
the mining process. However, for some applications, this loss of information is
not too debilitating.

One challenge with discretization is that the data may be nonuniformly
distributed across the different intervals. For example, for the case of the
salary attribute, a large subset of the population may be grouped in the
[40,000, 80,000] range, but very few will be grouped in the
[1,040,000, 1,080,000] range. Note that both ranges have the same size. Thus,
the use of ranges of equal size may not be very helpful in discriminating
between different data segments. On the other hand, many attributes, such as
age, are not as nonuniformly distributed, and therefore ranges of equal size
may work reasonably well.
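The contrast between equal-width and equal-depth bins described above can be sketched as follows. The salary values are invented, with one outlier to show the skew (a real implementation might use pandas.cut and pandas.qcut):

```python
salaries = [42000, 45000, 51000, 58000, 63000, 70000, 79000, 1060000]

def equal_width_bin(values, k):
    # Split the overall range into k intervals of equal size.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bin(values, k):
    # Put (roughly) the same number of points into each bin.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels

print(equal_width_bin(salaries, 4))  # almost everyone lands in bin 0
print(equal_depth_bin(salaries, 4))  # two values per bin
```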
Histograms
• Low Income: 10 - 30
• Lower-Middle Income: 30 - 50
• Upper-Middle Income: 50 - 70
• High Income: 70 - 90
• Very High Income: 90 - 130
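A small sketch mapping income values into the histogram buckets listed above (assuming the figures are in thousands of dollars and upper bounds are exclusive):

```python
def income_group(income):
    # Bucket boundaries taken from the list above (in $1000s, assumed).
    bounds = [(10, 30, "Low"), (30, 50, "Lower-Middle"),
              (50, 70, "Upper-Middle"), (70, 90, "High"),
              (90, 130, "Very High")]
    for lo, hi, label in bounds:
        if lo <= income < hi:
            return label
    return None  # outside all buckets

print(income_group(65))  # Upper-Middle
```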
Categorical to Numeric Data
◼ Direct Encoding
◼ By giving each distinct value a number.
◼ Would cause the model to misinterpret these values
◼ Ex. encoding male by 1 and female by 2 may be interpreted by the model as if
female is more important than male.
Because binary data is a special form of both numeric and categorical data, it is
possible to convert the categorical attributes to binary form and then use
numeric algorithms on the binarized data.
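The binarization described above can be sketched as one-hot encoding, which avoids the false ordering that direct encoding introduces (category names are illustrative):

```python
def one_hot(values):
    # One binary column per distinct category; exactly one 1 per row.
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Columns are sorted categories: ["female", "male"].
print(one_hot(["male", "female", "male"]))
# [[0, 1], [1, 0], [0, 1]]
```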