Module2 - Preprocessing Updated - V3-2
DATA MINING
Module 2: Data Preprocessing
What Is an Attribute?
Types of data (Attributes) in data mining
● Nominal Attribute
● Binary Attributes
● Ordinal Attributes
● Numeric Attributes
– Interval Attributes
– Discrete Attributes
– Ratio-Scaled Attributes
– Continuous Data Attributes
Types of data in data mining
● Nominal Attribute:
● Nominal means “relating to names.” The values of a
nominal attribute are symbols or names of things. Each
value represents some kind of category, code, or state,
and so nominal attributes are also referred to as
categorical. The values do not have any meaningful order.
● Examples include gender (male, female), marital status
(single, married, divorced), or types of fruits (apple,
banana, orange).
● Nominal data are often represented using categorical
variables and are suitable for techniques like frequency
analysis, mode calculation, and association rule mining.
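Since the slide points to frequency analysis and mode calculation, here is a minimal Python sketch of both for a nominal attribute; the list of fruit values is an illustrative assumption.

```python
# Frequency analysis and mode calculation for a nominal attribute.
from collections import Counter

fruits = ["apple", "banana", "apple", "orange", "apple", "banana"]
counts = Counter(fruits)

print(counts)                 # frequency of each category
print(counts.most_common(1))  # the mode: [('apple', 3)]
```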
Example – Nominal Data
● Binary Attributes
● A binary attribute is a nominal attribute with only two
categories or states: 0 or 1, where 0 typically means that
the attribute is absent, and 1 means that it is present.
Binary attributes are referred to as Boolean if the two
states correspond to true and false.
● Example: Given the attribute smoker describing a patient object, 1 indicates
that the patient smokes, while 0 indicates that the patient does not. Similarly,
suppose the patient undergoes a medical test that has two possible outcomes.
The attribute medical test is binary, where a value of 1 means the result of the
test for the patient is positive, while 0 means the result is negative.
● Ordinal Data:
● Ordinal data also represent categories, but they have a
natural order or ranking.
● While the categories have a relative order, the differences
between them may not be consistent or measurable.
● Examples include survey responses (e.g., "strongly
disagree" to "strongly agree"), education levels (e.g.,
"high school diploma" to "Ph.D."), or socioeconomic
status (e.g., "low-income" to "high-income").
● Techniques such as rank ordering, median calculation,
and non-parametric tests are commonly used for
analyzing ordinal data.
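A minimal sketch of rank ordering and median calculation for ordinal data; the survey responses and the rank mapping below are illustrative assumptions.

```python
# Map ordered categories to integer ranks, then rank-order and take the median.
from statistics import median

order = {"strongly disagree": 1, "disagree": 2, "neutral": 3,
         "agree": 4, "strongly agree": 5}
responses = ["agree", "neutral", "strongly agree", "agree", "disagree"]

ranks = sorted(order[r] for r in responses)   # rank ordering: [2, 3, 4, 4, 5]
print(ranks)
print(median(ranks))                          # median rank: 4 -> "agree"
```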
Example – Ordinal Data
Numeric Attributes
● Interval Data:
● Interval data represent numeric values where the
difference between any two values is meaningful
and consistent, but there is no true zero point.
● Interval data allow for meaningful calculations of
differences between values but not ratios.
● Examples include temperature measured in
Celsius or Fahrenheit, calendar dates, or IQ
scores.
● Techniques such as mean calculation, standard
deviation, and correlation analysis can be applied
to interval data.
● Temperature:
– 0°C, 10°C, 20°C, 30°C, ...
● Calendar dates:
– January 1, 2024
– February 13, 2024
– March 5, 2024
– ...
● Time:
– 12:00 PM, 1:00 PM, 2:00 PM, ...
● IQ scores:
– 80, 90, 100, 110, ...
● Discrete Data:
● Discrete data consist of countable values with
clear boundaries between them.
● Examples include the number of customers, the
number of products sold, or the number of
defects in a manufacturing process.
● Discrete data are typically analyzed using
techniques such as frequency analysis, mode
calculation, and Poisson regression.
● Number of children in a family:
– 0, 1, 2, 3, ...
● Number of students in a classroom:
– 20, 25, 30, 35, ...
● Number of items sold:
– 10, 20, 30, 40, ...
● Number of cars in a parking lot:
– 50, 60, 70, 80, ...
● Number of defects in a manufacturing process:
– 0, 1, 2, 3, ...
● Number of goals scored in a soccer match:
– 0, 1, 2, 3, ...
● Ratio-Scaled Attributes
● Ratio-scaled attributes, also known as ratio
variables, are a type of quantitative variable in
statistics that possess all the properties of interval
variables but with an added feature: a true zero
point.
● Examples of ratio-scaled attributes include:
● Length: Measurements such as height, width, length, or distance,
where zero indicates the absence of length.
● Weight: Mass measurements in kilograms or pounds, where zero
represents the absence of weight.
● Time: Time measurements in seconds, minutes, hours, etc., where
zero denotes the absence of time.
● Continuous Data Attributes:
● Continuous data represent values that can take
on any value within a given range, often
measured with a high degree of precision.
● Examples include height, weight, temperature,
and time.
● Continuous data are analyzed using techniques
such as mean calculation, standard deviation,
regression analysis, and density estimation.
● Height:
– 165.2 cm, 172.5 cm, 179.1 cm, ...
● Weight:
– 68.3 kg, 75.6 kg, 82.9 kg, ...
● Temperature:
– 23.7°C, 25.3°C, 28.6°C, ...
● Time:
– 10:15:20, 10:16:35, 10:17:48, ...
● Distance:
– 3.5 meters, 4.2 meters, 5.9 meters, ...
● Speed:
– 60.5 km/h, 70.2 km/h, 80.9 km/h, ...
Data Quality
Factors affecting the quality of the data
● The term measurement error refers to any problem resulting from the
measurement process.
● A common problem is that the value recorded differs from the true
value to some extent.
● For continuous attributes, the numerical difference of the measured
and true value is called the error.
● The term data collection error refers to errors such as omitting data
objects or attribute values, or inappropriately including a data object.
● Within particular domains, certain types of data errors are
commonplace, and well-developed techniques often exist for
detecting and/or correcting these errors.
● For example, keyboard errors are common when data is entered
manually, and as a result, many data entry programs have
techniques for detecting and, with human intervention, correcting
such errors.
Noise and Artifacts
● Noise:
● Noise refers to random or irrelevant fluctuations
or disturbances present in data that do not carry
any meaningful information.
● Noise can arise from various sources, including
measurement errors, data collection
inaccuracies, environmental factors, or inherent
variability in the data generation process.
Artifacts:
Example: Medical Imaging
● Noise:
● Description: In medical imaging, noise can manifest as
random fluctuations in pixel intensity values that do not
represent anatomical structures or pathological findings. It
can be caused by factors such as electronic noise in the
imaging equipment, photon noise due to low radiation
dose, or patient motion during image acquisition.
● Characteristics: Noise appears as graininess or speckle-
like patterns in the image, especially in regions with low
signal intensity. It may obscure fine details and make it
challenging to differentiate between structures of interest
and background noise.
● Artifacts:
● Description: Artifacts in medical imaging refer to
unwanted distortions or anomalies introduced into the
image due to technical factors, patient-related factors, or
errors in image acquisition or processing. They can arise
from sources such as equipment malfunction, patient
movement, metal implants, or incorrect imaging
parameters.
● Characteristics: Artifacts appear as structured or
systematic deviations from the true anatomical features in
the image. They may manifest as streaks, shadows,
blurring, geometric distortions, or intensity variations that
are not representative of the underlying anatomy.
[Slide images: examples of noise and artifacts]
Outliers
Missing values
Example
Inconsistent Values
Issues Related to Applications
● Timeliness
● Some data starts to age as soon as it has been
collected.
● In particular, if the data provides a snapshot of
some ongoing phenomenon or process, such as
the purchasing behavior of customers or web
browsing patterns, then this snapshot represents
reality for only a limited time. If the data is out of
date, then so are the models and patterns that
are based on it.
● Relevance
● The available data must contain the information
necessary for the application.
● Consider the task of building a model that
predicts the accident rate for drivers. If
information about the age and gender of the
driver is omitted, then it is likely that the model
will have limited accuracy unless this information
is indirectly available through other attributes.
● A common problem is sampling bias, which
occurs when a sample does not contain different
types of objects in proportion to their actual
occurrence in the population.
● For example, survey data describes only those
who respond to the survey.
● Because the results of a data analysis can reflect
only the data that is present, sampling bias will
typically lead to erroneous results when applied
to the broader population.
Data Preprocessing
Why Is Data Dirty?
● Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was
collected and when it is analyzed.
– Human/hardware/software problems
● Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
● Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked
data)
● Duplicate records also need data cleaning.
Why Is Data Preprocessing Important?
● Aggregation
● Sampling
● Dimensionality reduction
● Feature subset selection
● Feature creation
● Discretization and binarization
● Variable transformation
1. Aggregation
Example
Perform aggregation on this sales data:
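The sales table referenced above is not reproduced in these notes, so the sketch below uses a small hypothetical table and a pandas group-by to illustrate aggregation (summing sales per region); all column names and values are assumptions.

```python
# Aggregation: replace many detail rows with one summary row per group.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "B", "B"],
    "amount":  [100, 150, 200, 50, 75],
})

totals = sales.groupby("region", as_index=False)["amount"].sum()
print(totals)
#   region  amount
# 0   East     250
# 1   West     325
```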
2. Sampling
2.1 Sampling without replacement
● Sampling without replacement is a method of
selecting a subset of data from a larger dataset
where each selected data point is removed from
consideration for subsequent selections.
● This means that once a data point is included in
the sample, it cannot be selected again.
Sampling without replacement ensures that each
selected sample is unique and does not contain
duplicate entries.
Example:
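The slide's worked example is not reproduced here; a minimal sketch using Python's random.sample, which draws without replacement (the population values are illustrative).

```python
# Sampling without replacement: each selected point cannot be selected again,
# so the sample contains no duplicates.
import random

random.seed(0)                      # for a repeatable illustration
population = list(range(1, 21))     # 20 hypothetical data points
sample = random.sample(population, k=5)
print(sample)                       # 5 distinct values
```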
2.2 Sampling with replacement
Example
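The explanation and example for this slide are not reproduced here. In sampling with replacement, a drawn data point stays in the pool and may be drawn again, so the sample can contain duplicates; a minimal sketch using random.choices (the data are hypothetical).

```python
# Sampling with replacement: the same data point may appear more than once.
import random

random.seed(0)
population = list(range(1, 21))
sample = random.choices(population, k=5)   # draws with replacement
print(sample)                              # duplicates are possible
```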
Explanation of dimensionality reduction in preprocessing
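The slide's explanation is not reproduced here. Dimensionality reduction is often illustrated with principal component analysis (PCA); the scikit-learn sketch below is an assumption (the slides do not name a specific method at this point) and uses random data for illustration.

```python
# Reduce a 10-attribute dataset to 2 derived attributes with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # 100 objects, 10 attributes (hypothetical)

pca = PCA(n_components=2)                # keep the top 2 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```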
Approaches to Feature Subset Selection:
Filter Methods:
● Statistical Measures: These methods use statistical measures (e.g., correlation,
mutual information, chi-squared) to rank or score features based on their individual
relevance to the target variable. Features are then selected or ranked accordingly.
● Variance Thresholding: Features with low variance are considered less informative
and may be removed.
Wrapper Methods:
● Forward Selection: Start with an empty set of features and iteratively add the most
relevant feature until a stopping criterion is met.
● Backward Elimination: Start with all features and iteratively remove the least relevant
feature until a stopping criterion is met.
● Recursive Feature Elimination (RFE): Similar to backward elimination but uses a
model to identify less relevant features at each step.
Embedded Methods:
● Regularization Techniques: Techniques like LASSO (L1 regularization) encourage
sparsity in the model coefficients, effectively performing feature selection during
model training.
● Tree-based Methods: Decision trees and ensemble methods (e.g., Random Forests)
inherently perform feature selection by giving importance scores to features based on
their contribution to the model's performance.
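A minimal sketch of two of the filter-style criteria listed above, variance thresholding and correlation-based ranking; the small data matrix and the variance cutoff are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 5 candidate features
X[:, 4] = 0.001 * rng.normal(size=200)   # a nearly constant (low-variance) feature
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

# Variance thresholding: drop features whose variance falls below a cutoff
variances = X.var(axis=0)
keep = np.where(variances > 0.01)[0]
print("Variances:", np.round(variances, 4), "-> kept columns:", keep)

# Statistical measure: rank the remaining features by |Pearson correlation| with y
scores = [(j, abs(np.corrcoef(X[:, j], y)[0, 1])) for j in keep]
print("Ranked by |correlation|:", sorted(scores, key=lambda t: -t[1]))
```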
Flowchart of a feature subset selection process
Feature subset selection is a search over all possible subsets of features.
Many different types of search strategies can be used, but the search strategy
should be computationally inexpensive and should find optimal or near-optimal
sets of features.
Because the number of subsets can be enormous and it is impractical to examine
them all, some sort of stopping criterion is necessary.
This criterion is usually based on one or more conditions involving the
following: the number of iterations, whether the value of the subset evaluation
measure is optimal or exceeds a certain threshold, whether a subset of a certain
size has been obtained, and whether any improvement can be achieved by the
options available to the search strategy.
Once a subset of features has been selected, the results of the target data
mining algorithm on the selected subset should be validated. A straightforward
validation approach is to run the algorithm with the full set of features and
compare the full results to results obtained using the subset of features.
Hopefully, the subset of features will produce results that are better than or
almost as good as those produced when using all features.
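A minimal sketch of this validation idea, comparing a model's cross-validated accuracy on the full feature set against a selected subset; the dataset, the model, and the choice of 10 features are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=10000)

# Accuracy with all features
full_score = cross_val_score(model, X, y, cv=5).mean()

# Accuracy with the 10 highest-scoring features (a simple filter-style selection)
X_subset = SelectKBest(f_classif, k=10).fit_transform(X, y)
subset_score = cross_val_score(model, X_subset, y, cv=5).mean()

print(f"All {X.shape[1]} features: {full_score:.3f}")
print(f"Selected 10 features:   {subset_score:.3f}")
```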
5. Feature Creation
Popular techniques for feature creation
● Deriving Features: Use domain knowledge to
create features based on specific rules or
conditions. For example, derive a "customer
loyalty" feature based on purchase history.
● Interaction Features: Identify interactions
between features, like "product category" and
"discount offered," to capture their combined
influence on behavior.
● Embedding Features: Transform categorical
features (like colors or text) into numerical
representations suitable for models to
understand.
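A minimal pandas sketch of the three ideas above on a small hypothetical transactions table; every column name, rule, and value is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "purchases_last_year": [2, 15, 7, 30],
    "product_category": ["toys", "books", "toys", "books"],
    "discount_offered": [0.0, 0.1, 0.2, 0.1],
})

# Derived feature: a simple "loyal customer" rule based on purchase history
df["loyal_customer"] = (df["purchases_last_year"] >= 10).astype(int)

# Interaction feature: combine category and discount into a single signal
df["category_x_discount"] = df["product_category"] + "_" + df["discount_offered"].astype(str)

# Encoding: turn the categorical column into numeric indicator columns
df = pd.get_dummies(df, columns=["product_category"])
print(df)
```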
6. Discretization
Types of Discretization Techniques:
Steps for Equal Width Binning:
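The slide's step list is not reproduced here. The usual procedure is to compute the bin width as (max - min) / k and assign each value to a bin; a minimal sketch, reusing the Problem 2 data from later in this module and assuming k = 3 bins.

```python
data = [4, 7, 13, 16, 20, 24, 27, 29, 31, 33, 38, 42]
k = 3
lo, hi = min(data), max(data)
width = (hi - lo) / k                          # (42 - 4) / 3 = 12.67

bins = [[] for _ in range(k)]
for x in data:
    idx = min(int((x - lo) // width), k - 1)   # clamp the maximum into the last bin
    bins[idx].append(x)

print("Bin width:", round(width, 2))
for i, b in enumerate(bins, 1):
    print(f"Bin {i}: {b}")
# Bin 1: [4, 7, 13, 16]   Bin 2: [20, 24, 27, 29]   Bin 3: [31, 33, 38, 42]
```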
● In smoothing by bin means, each value in a bin
is replaced by the mean value of the bin. For
example, the mean of the values 4, 8, and 15 in
Bin 1 is 9. Therefore, each original value in this
bin is replaced by the value 9.
● Similarly, smoothing by bin medians can be
employed, in which each bin value is replaced
by the bin median.
● In smoothing by bin boundaries, the minimum
and maximum values in a given bin are
identified as the bin boundaries. Each bin value
is then replaced by the closest boundary value.
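A minimal sketch of the three smoothing methods. Bin 1 = [4, 8, 15] comes from the text above; the other two bins are illustrative assumptions added to make the example runnable.

```python
from statistics import mean, median

bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

by_means = [[round(mean(b), 2)] * len(b) for b in bins]
by_medians = [[median(b)] * len(b) for b in bins]
by_boundaries = [
    [min(b) if abs(x - min(b)) <= abs(x - max(b)) else max(b) for x in b]
    for b in bins
]

print("Means:     ", by_means)       # Bin 1 -> [9, 9, 9]
print("Medians:   ", by_medians)
print("Boundaries:", by_boundaries)  # each value replaced by the closest bin boundary
```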
● What is smoothing in data analysis, and why is it used?
Smoothing in data analysis refers to the process of removing noise
or irregularities from a dataset to reveal underlying patterns or
trends. It is used to simplify data while preserving important
features, making it easier to interpret and analyze.
● Explain the concept of bin smoothing. How does it work? Bin
smoothing involves dividing a dataset into consecutive bins and
replacing each bin's values with a representative statistic, such as
the mean, median, or boundaries of the bin. This helps to reduce
noise and variability in the data.
● Describe the difference between smoothing by bin means, bin
medians, and bin boundaries.
– Smoothing by bin means replaces each bin's values with the
mean of those values.
– Smoothing by bin medians replaces each bin's values with the
median of those values.
– Smoothing by bin boundaries replaces each value in a bin with
whichever of the bin's boundary values (minimum or maximum) is closest.
● Discuss real-world applications of
smoothing techniques across different
domains.
● Smoothing techniques are used in finance for
analyzing stock prices, in healthcare for
processing medical signals, in climate science
for analyzing temperature trends, and in image
processing for noise reduction, among other
applications.
Problem 1
● Problem 2
● Given data points: 4, 7, 13, 16, 20, 24, 27, 29, 31, 33, 38, 42.
● Problem 3
● Data for price in dollars: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34 (bin size = 4).
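One possible worked solution for Problem 3, assuming the intended task is equal-frequency (equal-depth) binning with 4 values per bin followed by smoothing by bin means; the slides do not spell the task out here, so this reading is an assumption.

```python
from statistics import mean

prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
prices.sort()                        # [8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34]

size = 4
bins = [prices[i:i + size] for i in range(0, len(prices), size)]
smoothed = [[round(mean(b), 2)] * len(b) for b in bins]

for b, s in zip(bins, smoothed):
    print(b, "->", s)
# [8, 9, 15, 16]   -> [12, 12, 12, 12]
# [21, 21, 24, 26] -> [23, 23, 23, 23]
# [27, 30, 30, 34] -> [30.25, 30.25, 30.25, 30.25]
```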
7. Binarization
How Binarization Works:
● Threshold Selection:
– The first step in binarization is to choose an appropriate
threshold value. This threshold divides the numerical values into
two categories: values below the threshold are assigned 0, and
values equal to or above the threshold are assigned 1.
● Applying the Threshold:
– For each data point in the numerical feature, compare its value
with the chosen threshold. If the value is less than the
threshold, assign it 0; otherwise, assign it 1.
● Resulting Binary Feature:
– After applying the threshold to all data points, the numerical
feature is transformed into a binary feature, where each data
point is represented as either 0 or 1.
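A minimal sketch of the thresholding steps above; the scores are illustrative, and the threshold of 60 matches the exam example that follows.

```python
scores = [45, 78, 60, 33, 91, 59]          # hypothetical exam scores
threshold = 60

binary = [1 if s >= threshold else 0 for s in scores]
print(list(zip(scores, binary)))
# [(45, 0), (78, 1), (60, 1), (33, 0), (91, 1), (59, 0)]
```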
Example Scenario:
● Steps for Binarization:
● Threshold Selection:
– Let's choose a passing
threshold of 60. Scores
equal to or above 60 will
be considered passing,
while scores below 60 will
be considered failing.
● Applying the Threshold:
– For each exam score in the dataset, compare it with the
threshold (60). If the score is greater than or equal to 60,
assign it 1 (pass); otherwise, assign it 0 (fail).
● Interpretation:
● In this example, we binarized the exam scores
into pass (1) or fail (0) based on a passing
threshold of 60. Exam scores equal to or above
60 were assigned a value of 1 (pass), while
scores below 60 were assigned a value of 0
(fail). This binarization allows us to simplify the
representation of exam scores and focus on the
binary outcome of pass or fail.
Data Transformation
● A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be identified
with one of the new values.
● Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
◆ New attributes constructed from the given ones
– Aggregation: Summarization, data cube construction
– Normalization: Scaled to fall within a smaller, specified range
◆ min-max normalization
◆ z-score normalization
◆ normalization by decimal scaling
– Discretization: Concept hierarchy climbing
Normalization
● Normalize the following group of data:
● 200, 300, 400, 600, 1000
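The slides do not state which normalization method this exercise targets, so the sketch below works the data through all three methods listed on the Data Transformation slide; the use of the population standard deviation is an assumption.

```python
import statistics

data = [200, 300, 400, 600, 1000]

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
mn, mx = min(data), max(data)
min_max = [round((v - mn) / (mx - mn), 3) for v in data]

# Z-score normalization: v' = (v - mean) / std
mu = statistics.mean(data)               # 500
sigma = statistics.pstdev(data)          # about 282.84 (population standard deviation)
z_score = [round((v - mu) / sigma, 3) for v in data]

# Decimal scaling: v' = v / 10^j for the smallest j with max(|v'|) < 1
j = len(str(int(mx)))                    # 4 here, since max = 1000
decimal = [v / 10 ** j for v in data]

print("Min-max:        ", min_max)       # [0.0, 0.125, 0.25, 0.5, 1.0]
print("Z-score:        ", z_score)       # [-1.061, -0.707, -0.354, 0.354, 1.768]
print("Decimal scaling:", decimal)       # [0.02, 0.03, 0.04, 0.06, 0.1]
```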
Mean and Standard Deviation
To find the standard deviation of a dataset, follow these steps:
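The step-by-step list from the slide is not reproduced here; the standard definitions it refers to are given below (dividing by n - 1 instead of n gives the sample standard deviation).

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
```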
● Normalize the following group of data:
● 200, 300, 400, 600, 1000
● Normalize the following group of data:
● 200, 300, 400, 600, 1000
Similarity and Dissimilarity Measures
● Similarity measures:
● Similarity measures quantify the likeness or resemblance
between two objects, datasets, or entities. These measures are
fundamental in various domains, including machine learning,
data mining, information retrieval, and pattern recognition.
● Dissimilarity measures
● Dissimilarity measures, also known as distance metrics, quantify
the difference or dissimilarity between two objects or data points.
Unlike similarity measures, which indicate how similar two
objects are, dissimilarity measures provide a quantitative
assessment of how different or distant they are from each other.
Dissimilarity measures are widely used in various fields such as
clustering, classification, and anomaly detection.
Similarity/Dissimilarity between Data
● Euclidean Distance
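For two points p and q with n attributes, the Euclidean distance is d(p, q) = sqrt((p1 - q1)^2 + ... + (pn - qn)^2); a minimal Python sketch (the two points are illustrative, not the slide's own data).

```python
import math

def euclidean(p, q):
    # Square the attribute-wise differences, sum them, and take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((1, 2), (4, 6)))   # 5.0
```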
Example:
Example 2:
● In this example, the Euclidean distance between Data
Point 1 and Data Point 2 is approximately 2.45. Since
this distance is relatively small, we can interpret it as
indicating a moderate similarity between the two data
points. This interpretation suggests that Data Point 1 and
Data Point 2 are somewhat similar in terms of their
features.
● In summary, when using Euclidean distance to measure
similarity, smaller distances imply greater similarity, while
larger distances imply greater dissimilarity between data
points.
Example for Dissimilarity
● In this example, the Euclidean distance between
Product A and Product B is approximately 7.071.
Since this distance is relatively large, we can
interpret it as indicating a significant dissimilarity
between the two products in terms of their price
and size.
● This illustrates how dissimilarity measures, such
as the Euclidean distance, can be used to
quantify the differences between products based
on their attributes.
● The threshold value for distance to determine similarity or
dissimilarity between data points depends on the specific context
and the nature of the data. There is no universal threshold value
that applies to all scenarios. Instead, the threshold is typically
determined based on the objectives of the analysis, the
characteristics of the data, and domain knowledge.
● Example:
● Domain Knowledge: Understanding the domain and the specific
problem at hand can help in determining an appropriate threshold
value. For example, in some applications, a small distance might be
considered similar enough, while in others, a larger distance might
be acceptable.
● Application Requirements: The choice of threshold often depends
on the requirements of the application. For instance, in clustering
algorithms, a threshold is used to decide when to stop merging
clusters, while in anomaly detection, a threshold is used to identify
outliers.
Common Properties of a Distance
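The property list from this slide is not reproduced here; the standard requirements for a distance metric d, which Euclidean distance satisfies, are summarized below.

```latex
\begin{aligned}
&\text{1. Positivity: } d(x, y) \ge 0 \text{ for all } x, y,\ \text{and } d(x, y) = 0 \text{ if and only if } x = y.\\
&\text{2. Symmetry: } d(x, y) = d(y, x) \text{ for all } x, y.\\
&\text{3. Triangle inequality: } d(x, z) \le d(x, y) + d(y, z) \text{ for all } x, y, z.
\end{aligned}
```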