Module 2
DATA PRE-PROCESSING
Contents
Types of data
Data Quality
Data Pre-processing Techniques
Similarity and Dissimilarity measures.
Data preprocessing is an important step in the data mining process. It refers to cleaning,
transforming, and integrating data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.
1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.
2. Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for
data integration.
3. Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization
is used to transform the data to have zero mean and unit variance. Discretization is used to
convert continuous values into discrete categories or intervals.
4. Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.
5. Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal width
binning, equal frequency binning, and clustering.
6. Data Normalization: This involves scaling the data to a common range, such as between 0 and
1 or -1 and 1. Normalization is often used to handle data with different units and scales.
Common normalization techniques include min-max normalization, z-score normalization, and
decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy
of the analysis results.
The specific steps involved in data preprocessing may vary depending on the nature of
the data and the analysis goals.
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle this, and it
involves the handling of missing data, noisy data, etc.
a. Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways:
Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.
Fill the missing values: There are various ways to do this task. You can choose to fill the missing
values manually, by the attribute mean, or by the most probable value.
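A minimal sketch of these options with pandas (the column name "income" and its values are made up for illustration):

import pandas as pd

# Hypothetical data with missing values in the "income" column
df = pd.DataFrame({"income": [4200, None, 5100, None, 4200, 4800]})

# Option 1: fill with the attribute mean
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Option 2: fill with the most probable (most frequent) value
df["income_mode_filled"] = df["income"].fillna(df["income"].mode()[0])

# Option 3: drop tuples with missing values (only sensible when the dataset is large)
df_dropped = df.dropna(subset=["income"])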
b. Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to
faulty data collection, data entry errors, etc. It can be handled in the following ways:
*Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of
equal size, and then various methods are performed on each segment to complete the task. Each
segment is handled separately. One can replace all data in a segment by its mean, or boundary values
can be used to complete the task.
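A small sketch of smoothing by bin means and by bin boundaries with equal-size bins (the sorted price values are illustrative):

# Sorted (illustrative) values, partitioned into equal-frequency bins of size 3
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
smoothed_by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the closest bin boundary
smoothed_by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                      for b in bins]

print(smoothed_by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smoothed_by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]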
*Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear
(having one independent variable) or multiple (having multiple independent variables).
*Clustering:
This approach groups similar data into clusters. Outliers may either go undetected or fall outside
the clusters.
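A rough sketch of flagging likely outliers with clustering, here using scikit-learn's KMeans and a distance-to-centroid cutoff (the sample points, the two-cluster choice, and the mean-plus-two-standard-deviations threshold are all illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

# Two tight groups of points plus one point far from both
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1], [5.0, 5.2],
              [9.0, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = kmeans.cluster_centers_[kmeans.labels_]

# Points much farther from their centroid than average are candidate outliers
dist = np.linalg.norm(X - centers, axis=1)
outliers = X[dist > dist.mean() + 2 * dist.std()]
print(outliers)   # the far point [9, 9] is flagged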
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It
involves the following ways:
*Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0).
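A compact sketch of min-max, z-score, and decimal-scaling normalization with numpy (the five sample values are made up):

import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the range [0.0, 1.0]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: zero mean, unit variance
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j so the largest absolute value falls below 1
j = int(np.floor(np.log10(np.abs(values).max()))) + 1
decimal_scaled = values / (10 ** j)

print(min_max)         # 0.0, 0.125, 0.25, 0.5, 1.0
print(decimal_scaled)  # 0.02, 0.03, 0.04, 0.06, 0.1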
*Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
*Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
*Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute “city” can be converted to “country”.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps involved in data reduction are:
Feature Selection:
This involves selecting a subset of relevant features from the dataset. Feature selection is often
performed to remove irrelevant or redundant features from the dataset. It can be done using various
techniques such as correlation analysis, mutual information, and principal component analysis (PCA).
Feature Extraction:
This involves transforming the data into a lower-dimensional space while preserving the important
information. Feature extraction is often used when the original features are high-dimensional and
complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-
negative matrix factorization (NMF).
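A minimal sketch of feature extraction with PCA via scikit-learn (the 4-feature toy matrix and the choice of 2 components are assumptions for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 6 objects described by 4 correlated numeric features
X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 1.2, 0.3],
              [2.2, 2.9, 0.4, 1.1],
              [1.9, 2.2, 0.6, 0.9],
              [3.1, 3.0, 0.3, 1.4],
              [2.3, 2.7, 0.5, 1.2]])

# Project onto the 2 directions that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (6, 2)
print(pca.explained_variance_ratio_)   # fraction of variance kept by each component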
Sampling:
This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the
size of the dataset while preserving the important information. It can be done using techniques such as
random sampling, stratified sampling, and systematic sampling.
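A short sketch of simple random and stratified sampling with pandas (the column name "class", the 90/10 class split, and the 10% sampling fraction are illustrative assumptions):

import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "class": ["A"] * 900 + ["B"] * 100})

# Simple random sampling without replacement: keep 10% of the rows
simple = df.sample(frac=0.10, random_state=1)

# Stratified sampling: keep 10% of each class, preserving the A/B proportions
stratified = df.groupby("class", group_keys=False).sample(frac=0.10, random_state=1)

print(simple["class"].value_counts())
print(stratified["class"].value_counts())   # 90 rows of A and 10 rows of B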
Clustering:
This involves grouping similar data points together into clusters. Clustering is often used to reduce the
size of the dataset by replacing similar data points with a representative centroid. It can be done using
techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression:
This involves compressing the dataset while preserving the important information. Compression is
often used to reduce the size of the dataset for storage and transmission purposes. It can be done using
techniques such as wavelet compression, JPEG compression, and gzip compression.
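A tiny sketch of lossless compression with Python's gzip module (the CSV content is generated on the fly purely for illustration):

import gzip

# Illustrative CSV content; real datasets would normally be read from disk
rows = "\n".join(f"{i},{20 + i % 40},{30000 + i}" for i in range(1000))
csv_bytes = ("id,age,income\n" + rows).encode()

compressed = gzip.compress(csv_bytes)     # lossless: nothing is discarded
restored = gzip.decompress(compressed)

print(len(csv_bytes), len(compressed))    # the compressed representation is much smaller
assert restored == csv_bytes              # the original data is fully recovered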
TYPES OF DATA
1. Nominal data:
This type of data is also referred to as categorical data.
Nominal data represents data that is qualitative and cannot be measured or compared with
numbers.
In nominal data, the values represent a category, and there is no inherent order or hierarchy.
Examples of nominal data include gender, race, religion, and occupation. Nominal data is
used in data mining for classification and clustering tasks.
Nominal (symbolic, categorical) attributes take values from an unordered set, e.g., {red, yellow, blue, ...}.
Examples: ID numbers, eye color, zip codes
2. Ordinal Data:
This type of data is also categorical, but with an inherent order or hierarchy.
Ordinal data represents qualitative data that can be ranked in a particular order.
For instance, education level can be ranked from primary to tertiary, and social status can be
ranked from low to high.
In ordinal data, the distance between values is not uniform.
3. Interval Data:
This type of data represents quantitative data with equal intervals between consecutive values.
Interval data has no absolute zero point, and therefore, ratios cannot be computed.
Examples of interval data include temperature, IQ scores, and time. Interval data is used in
data mining for clustering and prediction tasks.
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
4. Ratio Data:
This type of data is similar to interval data, but with an absolute zero point.
In ratio data, it is possible to compute ratios of two values, and this makes it possible to make
meaningful comparisons.
Examples of ratio data include height, weight, and income. Ratio data is used in data mining
for prediction and association rule mining tasks.
The difference between a person aged 35 and a person aged 38 is the same as the difference between
people aged 12 and 15 (35 to 38 = 3 years, 12 to 15 = 3 years), and because of the absolute zero point,
ratios of values are also meaningful.
Examples: temperature in Kelvin, length, time, counts
5. Discrete Attributes:
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Types of data sets:
Record: Data Matrix, Document Data, Transaction Data
Graph: Generic graphs, Molecular Structures
Ordered: Spatial Data, Temporal Data, Sequential Data
2. Data matrix:
If data objects have the same fixed set of numeric attributes, then the data objects can be
thought of as points in a multi-dimensional space, where each dimension represents a distinct
attribute
Such data set can be represented by an m by n matrix, where there are m rows, one for each
object, and n columns, one for each attribute
3. Document data:
Each document becomes a `term' vector,
a. each term is a component (attribute) of the vector,
b. the value of each component is the number of times the corresponding term occurs in
the document.
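A small sketch of turning documents into term-frequency vectors, here with scikit-learn's CountVectorizer (the two example sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["data mining finds patterns in data",
        "machine learning learns patterns from data"]

# Each row is a document, each column a term, each value a term count
vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(term_matrix.toarray())   # e.g. "data" occurs twice in the first document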
5. Graph data:
Examples: generic graphs and HTML links. For molecular structures, the nodes represent atoms, and
mining can search for substructures in a chemical compound.
Data preprocessing is an essential step in data mining and machine learning as it helps to ensure the
quality of data used for analysis. There are several factors that are used for data quality assessment,
including:
1. Data cleaning:
Data cleaning is the process of removing incorrect data, incomplete data, and inaccurate data
from the datasets, and it also replaces the missing values. Here are some techniques for data
cleaning:
Standard values like “Not Available” or “NA” can be used to replace the missing values.
Missing values can also be filled manually, but it is not recommended when that dataset is
big.
The attribute’s mean value can be used to replace the missing value when the data is normally
distributed, whereas in the case of a non-normal distribution the median value of the attribute can
be used.
While using regression or decision tree algorithms, the missing value can be replaced by the
most probable value.
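As a sketch of filling in the most probable value with a regression-style model, scikit-learn's IterativeImputer estimates each attribute with missing entries from the other attributes (the two-column toy matrix is an assumption for illustration):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two related attributes; one value in the second column is missing
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, np.nan],
              [4.0, 40.0]])

imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)   # the missing entry is estimated from the first column (close to 30)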
Noisy data generally means data containing random errors or unnecessary data points. Handling noisy
data is one of the most important steps, as it leads to the optimization of the model we are using.
Noisy data can be handled with the binning, regression, and clustering methods described above.
2. Data integration:
The process of combining multiple sources into a single dataset. The Data integration process is one of
the main components of data management. There are some problems to be considered during data
integration.
Schema integration: Integrates metadata(a set of data that describes other data) from different
sources.
Entity identification problem: Identifying entities from multiple databases. For example, the
system or the user should know that the student id of one database and the student name of another
database belong to the same entity.
Detecting and resolving data value conflicts: The data taken from different databases may differ
when merged. The attribute values from one database may differ from another database.
For example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY”.
3. Data reduction:
This process helps in the reduction of the volume of the data, which makes the analysis easier
yet produces the same or almost the same result. This reduction also helps to reduce storage
space. Some of the data reduction techniques are dimensionality reduction, numerosity
reduction, and data compression.
4. Data transformation:
The change made in the format or the structure of the data is called data transformation. This
step can be simple or complex based on the requirements. There are some methods for data
transformation.
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps
in knowing the important features of the dataset. By smoothing, we can find even a simple
change that helps in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. The
data set, which may come from multiple sources, is integrated into a summary description for data
analysis. This is an important step, since the accuracy of the data depends on the quantity and
quality of the data. When the quality and the quantity of the data are good, the results are more
relevant.
Discretization: The continuous data here is split into intervals. Discretization reduces the data
size. For example, rather than specifying the class time, we can set an interval like (3 pm-5 pm,
or 6 pm-8 pm).
DATA CLEANING
Data in the real world is dirty: there is a lot of potentially incorrect data, e.g., due to faulty
instruments, human or computer error, or transmission errors.
incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
How to handle missing data:
Ignore the tuple: usually done when the class label is missing (when doing classification); not
effective when the % of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with a global constant (e.g., “unknown”), the attribute mean, or the most
probable value
Noisy data:
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
2. Data integration:
1. Χ2 (chi-square) test:
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose actual count is very
different from the expected count
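For reference, the statistic is computed over all cells of the contingency table as

\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}

where o_i is the observed (actual) count of cell i and e_i is its expected count under the assumption that the two variables are independent.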
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
Both are causally linked to the third variable: population
r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A\,\sigma_B}
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are
the respective standard deviations of A and B, and Σ(aibi) is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value,
the stronger the correlation.
rA,B = 0: independent; rA,B < 0: negatively correlated.
Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected
values.
Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be
smaller than its expected value.
Independence: CovA,B = 0 but the converse is not true:
Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Co-Variance: An Example
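A small illustrative example (the stock-price values below are made up): covariance can be computed as
Cov(A, B) = E[(A − Ā)(B − B̄)] = E(A·B) − Ā·B̄.
Suppose two stocks A and B have prices over five days A = (2, 3, 5, 4, 6) and B = (5, 8, 10, 11, 14).
Then Ā = 20/5 = 4 and B̄ = 48/5 = 9.6, so
Cov(A, B) = (2·5 + 3·8 + 5·10 + 4·11 + 6·14)/5 − 4 × 9.6 = 212/5 − 38.4 = 42.4 − 38.4 = 4.
Since Cov(A, B) = 4 > 0, the two attributes tend to rise together.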
3. Data reduction:
Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
Why data reduction? — A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data set.
Data reduction strategies
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which is critical to clustering, outlier analysis,
becomes less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Attribute subset selection (heuristic methods):
1. Stepwise forward selection: The procedure starts with an empty set of attributes. At each step, the
best of the remaining attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward selection
and backward elimination methods can be combined so that, at each step, the procedure selects the
best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree induction constructs a flowchart like structure where each
internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the
test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the
“best” attribute to partition the data into individual classes.
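A rough sketch of stepwise forward selection using scikit-learn's SequentialFeatureSelector with a decision tree as the wrapped model (the synthetic data, the tree estimator, and the target of 3 selected attributes are illustrative assumptions; direction="backward" would give backward elimination instead):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 attributes, only a few of which are informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Start from the empty set and greedily add the attribute that helps most
selector = SequentialFeatureSelector(DecisionTreeClassifier(random_state=0),
                                     n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected attributes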
Numerosity Reduction:
Linear regression
Histogram
Clustering
Sampling
Linear regression
Data modeled to fit a straight line
Often uses the least-square method to fit the line
Multiple regression
Allows a response variable Y to be modeled as a linear function of a multidimensional
feature vector
Log-linear model
Approximates discrete multidimensional probability distributions
Regression Analysis:
Regression analysis: A collective name for techniques for the modeling and analysis of
numerical data consisting of values of a dependent variable (also called response variable or
measurement) and of one or more independent variables (aka. explanatory variables or
predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method, but other criteria
have also been used
[Figure: data points (x, y) with a fitted regression line y = x + 1; Y1’ is the value the line predicts at X1.]
Regression Analysis and Log-Linear Models:
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be estimated by using
the data at hand
The least squares criterion is applied to the known values Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional space for a set of
discretized attributes, based on a smaller subset of dimensional combinations
Useful for dimensionality reduction and data smoothing
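A brief sketch of fitting these models by least squares with numpy (the sample x/y values are made up and roughly follow y = x + 1; the squared term stands in for a transformed second predictor):

import numpy as np

# Illustrative observations that roughly follow y = x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.0, 5.9])

# Least squares fit of y = w*x + b
w, b = np.polyfit(x, y, deg=1)
print(w, b)                   # w close to 1, b close to 1

# Multiple regression y = b0 + b1*x1 + b2*x2 via least squares (here x2 = x^2)
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)                   # [b0, b1, b2]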
Histogram Analysis:
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
[Figure: example equal-width histogram with buckets at 10000, 30000, 50000, 70000, and 90000 on the horizontal axis and counts from 0 to 40 on the vertical axis.]
Discretization:
Supervised: entropy-based
Unsupervised: equal-width and equal-frequency
Normalization:
Min-max
Discretization/Quantization (Transformation by Discretization):
Discretization methods:
Unsupervised: independent of the class label (equal-width and equal-frequency binning).
Supervised: uses the class label (entropy-based).
Entropy-based Discretization:
Procedure:
i. Select the best split point T, the one that gives the highest information gain, as the optimum split.
ii. Repeat step i on the interval with the highest entropy until a user-specified number of intervals
is reached or some other stopping criterion is met.
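A short sketch of the two unsupervised binning schemes with pandas (the age values and the choice of 3 intervals are made up for illustration):

import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 25, 30, 35, 36, 40, 45, 46, 52, 70])

# Equal-width binning: 3 intervals of equal range
equal_width = pd.cut(ages, bins=3)

# Equal-frequency binning: 3 intervals holding roughly the same number of values
equal_freq = pd.qcut(ages, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())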
Normalization:
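For a value v of a numeric attribute A, the standard formulas for the methods listed in this module are:

Min-max (to a new range [new\_min_A, new\_max_A]):
v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A

Z-score:
v' = \frac{v - \bar{A}}{\sigma_A}

Decimal scaling:
v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1.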
Cosine Similarity:
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
Cos(x, y) = 1 indicates the two vectors point in the same direction; the closer the output is to 1,
the more similar the vectors are.
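Completing the example, the vector lengths and the resulting cosine are:
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 ≈ 6.48
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 ≈ 2.45
Cos(d1, d2) = 5 / (6.48 * 2.45) ≈ 0.31, so the two documents share some terms but are not very similar.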
• Distances, such as the Euclidean distance, have some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q (positive definiteness). A distance can never be negative.
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.