Module 2
DATA PRE-PROCESSING
Contents
Types of data
Data Quality
Data Pre-processing Techniques
Similarity and Dissimilarity measures.
Data preprocessing is an important step in the data mining process. It refers to cleaning,
transforming, and integrating data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.
1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.
2. Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for
data integration.
3. Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization
is used to transform the data to have zero mean and unit variance. Discretization is used to
convert continuous values into discrete categories or intervals.
4. Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.
5. Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal width
binning, equal frequency binning, and clustering.
6. Data Normalization: This involves scaling the data to a common range, such as between 0 and
1 or -1 and 1. Normalization is often used to handle data with different units and scales.
Common normalization techniques include min-max normalization, z-score normalization, and
decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy
of the analysis results.
The specific steps involved in data preprocessing may vary depending on the nature of
the data and the analysis goals.
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle this, and it
involves the handling of missing data, noisy data, etc.
a. Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways:
Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.
Fill the missing values: There are various ways to do this task. You can choose to fill the missing
values manually, by the attribute mean, or by the most probable value.
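A minimal sketch of these options with pandas (the column name "income" and its values are made up for illustration):

import pandas as pd

# Hypothetical data with missing values in the "income" column
df = pd.DataFrame({"income": [4200, None, 5100, None, 4200, 4800]})

# Option 1: fill with the attribute mean
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Option 2: fill with the most probable (most frequent) value
df["income_mode_filled"] = df["income"].fillna(df["income"].mode()[0])

# Option 3: drop tuples with missing values (only sensible when the dataset is large)
df_dropped = df.dropna(subset=["income"])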
b. Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to
faulty data collection, data entry errors, etc. It can be handled in the following ways:
*Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of
equal size, and then various methods are performed on each segment to complete the task. Each
segment is handled separately. One can replace all data in a segment by its mean, or boundary values
can be used to complete the task.
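A small sketch of smoothing by bin means and by bin boundaries with equal-size bins (the sorted price values are illustrative):

# Sorted (illustrative) values, partitioned into equal-frequency bins of size 3
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
smoothed_by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the closest bin boundary
smoothed_by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                      for b in bins]

print(smoothed_by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smoothed_by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]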
*Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear
(having one independent variable) or multiple (having multiple independent variables).
*Clustering:
This approach groups similar data into clusters. Outliers may either go undetected or fall outside
the clusters.
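A rough sketch of flagging likely outliers with clustering, here using scikit-learn's KMeans and a distance-to-centroid cutoff (the sample points, the two-cluster choice, and the mean-plus-two-standard-deviations threshold are all illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

# Two tight groups of points plus one point far from both
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1], [5.0, 5.2],
              [9.0, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = kmeans.cluster_centers_[kmeans.labels_]

# Points much farther from their centroid than average are candidate outliers
dist = np.linalg.norm(X - centers, axis=1)
outliers = X[dist > dist.mean() + 2 * dist.std()]
print(outliers)   # the far point [9, 9] is flagged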
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It
involves the following ways:
*Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0).
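A compact sketch of min-max, z-score, and decimal-scaling normalization with numpy (the five sample values are made up):

import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the range [0.0, 1.0]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: zero mean, unit variance
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j so the largest absolute value falls below 1
j = int(np.floor(np.log10(np.abs(values).max()))) + 1
decimal_scaled = values / (10 ** j)

print(min_max)         # 0.0, 0.125, 0.25, 0.5, 1.0
print(decimal_scaled)  # 0.02, 0.03, 0.04, 0.06, 0.1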
*Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
*Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
*Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute “city” can be converted to “country”.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps involved in data reduction are:
Feature Selection:
This involves selecting a subset of relevant features from the dataset. Feature selection is often
performed to remove irrelevant or redundant features from the dataset. It can be done using various
techniques such as correlation analysis, mutual information, and principal component analysis (PCA).
Feature Extraction:
This involves transforming the data into a lower-dimensional space while preserving the important
information. Feature extraction is often used when the original features are high-dimensional and
complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-
negative matrix factorization (NMF).
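A minimal sketch of feature extraction with PCA via scikit-learn (the 4-feature toy matrix and the choice of 2 components are assumptions for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 6 objects described by 4 correlated numeric features
X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 1.2, 0.3],
              [2.2, 2.9, 0.4, 1.1],
              [1.9, 2.2, 0.6, 0.9],
              [3.1, 3.0, 0.3, 1.4],
              [2.3, 2.7, 0.5, 1.2]])

# Project onto the 2 directions that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (6, 2)
print(pca.explained_variance_ratio_)   # fraction of variance kept by each component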
Sampling:
This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the
size of the dataset while preserving the important information. It can be done using techniques such as
random sampling, stratified sampling, and systematic sampling.
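A short sketch of simple random and stratified sampling with pandas (the column name "class", the 90/10 class split, and the 10% sampling fraction are illustrative assumptions):

import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "class": ["A"] * 900 + ["B"] * 100})

# Simple random sampling without replacement: keep 10% of the rows
simple = df.sample(frac=0.10, random_state=1)

# Stratified sampling: keep 10% of each class, preserving the A/B proportions
stratified = df.groupby("class", group_keys=False).sample(frac=0.10, random_state=1)

print(simple["class"].value_counts())
print(stratified["class"].value_counts())   # 90 rows of A and 10 rows of B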
Clustering:
This involves grouping similar data points together into clusters. Clustering is often used to reduce the
size of the dataset by replacing similar data points with a representative centroid. It can be done using
techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression:
This involves compressing the dataset while preserving the important information. Compression is
often used to reduce the size of the dataset for storage and transmission purposes. It can be done using
techniques such as wavelet compression, JPEG compression, and gzip compression.
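A tiny sketch of lossless compression with Python's gzip module (the CSV content is generated on the fly purely for illustration):

import gzip

# Illustrative CSV content; real datasets would normally be read from disk
rows = "\n".join(f"{i},{20 + i % 40},{30000 + i}" for i in range(1000))
csv_bytes = ("id,age,income\n" + rows).encode()

compressed = gzip.compress(csv_bytes)     # lossless: nothing is discarded
restored = gzip.decompress(compressed)

print(len(csv_bytes), len(compressed))    # the compressed representation is much smaller
assert restored == csv_bytes              # the original data is fully recovered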
TYPES OF DATA
1. Nominal data:
This type of data is also referred to as categorical data.
Nominal data represents data that is qualitative and cannot be measured or compared with
numbers.
In nominal data, the values represent a category, and there is no inherent order or hierarchy.
Examples of nominal data include gender, race, religion, and occupation. Nominal data is
used in data mining for classification and clustering tasks.
Nominal (symbolic, categorical) attributes take values from an unordered set, e.g., {red, yellow, blue, ...}.
Examples: ID numbers, eye color, zip codes
2. Ordinal Data:
This type of data is also categorical, but with an inherent order or hierarchy.
Ordinal data represents qualitative data that can be ranked in a particular order.
For instance, education level can be ranked from primary to tertiary, and social status can be
ranked from low to high.
In ordinal data, the distance between values is not uniform.
3. Interval Data:
This type of data represents quantitative data with equal intervals between consecutive values.
Interval data has no absolute zero point, and therefore, ratios cannot be computed.
Examples of interval data include temperature, IQ scores, and time. Interval data is used in
data mining for clustering and prediction tasks.
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
4. Ratio Data:
This type of data is similar to interval data, but with an absolute zero point.
In ratio data, it is possible to compute ratios of two values, and this makes it possible to make
meaningful comparisons.
Examples of ratio data include height, weight, and income. Ratio data is used in data mining
for prediction and association rule mining tasks.
The difference between a person aged 35 and a person aged 38 is the same as the difference between
people aged 12 and 15 (35 to 38 = 3 years, 12 to 15 = 3 years), and because of the absolute zero point,
ratios of values are also meaningful.
Examples: temperature in Kelvin, length, time, counts
5. Discrete Attributes:
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Types of data sets:
Record: Data Matrix, Document Data, Transaction Data
Graph: Generic graphs, Molecular Structures
Ordered: Spatial Data, Temporal Data, Sequential Data
2. Data matrix:
If data objects have the same fixed set of numeric attributes, then the data objects can be
thought of as points in a multi-dimensional space, where each dimension represents a distinct
attribute
Such data set can be represented by an m by n matrix, where there are m rows, one for each
object, and n columns, one for each attribute
3. Document data:
Each document becomes a `term' vector,
a. each term is a component (attribute) of the vector,
b. the value of each component is the number of times the corresponding term occurs in
the document.
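A small sketch of turning documents into term-frequency vectors, here with scikit-learn's CountVectorizer (the two example sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["data mining finds patterns in data",
        "machine learning learns patterns from data"]

# Each row is a document, each column a term, each value a term count
vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(term_matrix.toarray())   # e.g. "data" occurs twice in the first document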
5. Graph data:
Examples: generic graphs and HTML links. For molecular structures, the nodes represent atoms, and
mining can search for substructures in a chemical compound.
Data preprocessing is an essential step in data mining and machine learning as it helps to ensure the
quality of data used for analysis. There are several factors that are used for data quality assessment,
including:
1. Data cleaning:
Data cleaning is the process of removing incorrect data, incomplete data, and inaccurate data
from the datasets, and it also replaces the missing values. Here are some techniques for data
cleaning:
Standard values like “Not Available” or “NA” can be used to replace the missing values.
Missing values can also be filled manually, but it is not recommended when that dataset is
big.
The attribute’s mean value can be used to replace the missing value when the data is normally
distributed, whereas in the case of a non-normal distribution the median value of the attribute can
be used.
While using regression or decision tree algorithms, the missing value can be replaced by the
most probable value.
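As a sketch of filling in the most probable value with a regression-style model, scikit-learn's IterativeImputer estimates each attribute with missing entries from the other attributes (the two-column toy matrix is an assumption for illustration):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two related attributes; one value in the second column is missing
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, np.nan],
              [4.0, 40.0]])

imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)   # the missing entry is estimated from the first column (close to 30)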
Noisy data generally means data containing random errors or unnecessary data points. Handling noisy
data is one of the most important steps, as it leads to the optimization of the model we are using.
Noisy data can be handled with the binning, regression, and clustering methods described above.
2. Data integration:
The process of combining multiple sources into a single dataset. The Data integration process is one of
the main components of data management. There are some problems to be considered during data
integration.
Schema integration: Integrates metadata(a set of data that describes other data) from different
sources.
Entity identification problem: Identifying entities from multiple databases. For example, the
system or the user should know that the student id of one database and the student name of another
database belong to the same entity.
Detecting and resolving data value conflicts: The data taken from different databases may differ
when merged. The attribute values from one database may differ from another database.
For example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY”.
3. Data reduction:
This process helps in the reduction of the volume of the data, which makes the analysis easier
yet produces the same or almost the same result. This reduction also helps to reduce storage
space. Some of the data reduction techniques are dimensionality reduction, numerosity
reduction, and data compression.
4. Data transformation:
The change made in the format or the structure of the data is called data transformation. This
step can be simple or complex based on the requirements. There are some methods for data
transformation.
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps
in knowing the important features of the dataset. By smoothing, we can find even a simple
change that helps in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. The
data set, which may come from multiple sources, is integrated into a summary description for data
analysis. This is an important step, since the accuracy of the data depends on the quantity and
quality of the data. When the quality and the quantity of the data are good, the results are more
relevant.
Discretization: The continuous data here is split into intervals. Discretization reduces the data
size. For example, rather than specifying the class time, we can set an interval like (3 pm-5 pm,
or 6 pm-8 pm).
DATA CLEANING
Data in the real world is dirty: there is a lot of potentially incorrect data, e.g., due to faulty
instruments, human or computer error, or transmission errors.
incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
How to handle missing data:
Ignore the tuple: usually done when the class label is missing (when doing classification); not
effective when the % of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with a global constant (e.g., “unknown”), the attribute mean, or the most
probable value
Noisy data:
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
2. Data integration:
1. Χ2 (chi-square) test:
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose actual count is very
different from the expected count
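For reference, the statistic is computed over all cells of the contingency table as

\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}

where o_i is the observed (actual) count of cell i and e_i is its expected count under the assumption that the two variables are independent.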
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
Both are causally linked to the third variable: population
r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A\,\sigma_B}
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are
the respective standard deviations of A and B, and Σ(aibi) is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value,
the stronger the correlation.
rA,B = 0: independent; rA,B < 0: negatively correlated.
Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected
values.
Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be
smaller than its expected value.
Independence: CovA,B = 0 but the converse is not true:
Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Co-Variance: An Example
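A small illustrative example (the stock-price values below are made up): covariance can be computed as
Cov(A, B) = E[(A − Ā)(B − B̄)] = E(A·B) − Ā·B̄.
Suppose two stocks A and B have prices over five days A = (2, 3, 5, 4, 6) and B = (5, 8, 10, 11, 14).
Then Ā = 20/5 = 4 and B̄ = 48/5 = 9.6, so
Cov(A, B) = (2·5 + 3·8 + 5·10 + 4·11 + 6·14)/5 − 4 × 9.6 = 212/5 − 38.4 = 42.4 − 38.4 = 4.
Since Cov(A, B) = 4 > 0, the two attributes tend to rise together.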
3. Data reduction:
Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
Why data reduction? — A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data set.
Data reduction strategies
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which is critical to clustering, outlier analysis,
becomes less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Attribute subset selection (heuristic methods):
1. Stepwise forward selection: The procedure starts with an empty set of attributes. At each step, the
best of the remaining attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward selection
and backward elimination methods can be combined so that, at each step, the procedure selects the
best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree induction constructs a flowchart like structure where each
internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the
test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the
“best” attribute to partition the data into individual classes.
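A rough sketch of stepwise forward selection using scikit-learn's SequentialFeatureSelector with a decision tree as the wrapped model (the synthetic data, the tree estimator, and the target of 3 selected attributes are illustrative assumptions; direction="backward" would give backward elimination instead):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 attributes, only a few of which are informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Start from the empty set and greedily add the attribute that helps most
selector = SequentialFeatureSelector(DecisionTreeClassifier(random_state=0),
                                     n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected attributes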
Numerosity Reduction:
Linear regression
Histogram
Clustering
Sampling
Linear regression
Data modeled to fit a straight line
Often uses the least-square method to fit the line
Multiple regression
Allows a response variable Y to be modeled as a linear function of a multidimensional
feature vector
Log-linear model
Approximates discrete multidimensional probability distributions
Regression Analysis:
Regression analysis: A collective name for techniques for the modeling and analysis of
numerical data consisting of values of a dependent variable (also called response variable or
measurement) and of one or more independent variables (aka. explanatory variables or
predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method, but other criteria
have also been used
[Figure: data points (x, y) with a fitted regression line y = x + 1; Y1’ is the value the line predicts at X1.]
Regression Analysis and Log-Linear Models:
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be estimated by using
the data at hand
The least squares criterion is applied to the known values Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional space for a set of
discretized attributes, based on a smaller subset of dimensional combinations
Useful for dimensionality reduction and data smoothing
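A brief sketch of fitting these models by least squares with numpy (the sample x/y values are made up and roughly follow y = x + 1; the squared term stands in for a transformed second predictor):

import numpy as np

# Illustrative observations that roughly follow y = x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.0, 5.9])

# Least squares fit of y = w*x + b
w, b = np.polyfit(x, y, deg=1)
print(w, b)                   # w close to 1, b close to 1

# Multiple regression y = b0 + b1*x1 + b2*x2 via least squares (here x2 = x^2)
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)                   # [b0, b1, b2]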
Histogram Analysis:
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
[Figure: example equal-width histogram with buckets at 10000, 30000, 50000, 70000, and 90000 on the horizontal axis and counts from 0 to 40 on the vertical axis.]
Discretization:
Supervised: entropy-based
Unsupervised: equal-width and equal-frequency
Normalization:
Min-max
Discretization/Quantization (Transformation by Discretization):
Discretization methods:
Unsupervised: independent of the class label (equal-width and equal-frequency binning).
Supervised: uses the class label (entropy-based).
Entropy-based Discretization:
Procedure:
i. Select the best split point T, the one that gives the highest information gain, as the optimum split.
ii. Repeat step i on the interval with the highest entropy until a user-specified number of intervals
is reached or some other stopping criterion is met.
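A short sketch of the two unsupervised binning schemes with pandas (the age values and the choice of 3 intervals are made up for illustration):

import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 25, 30, 35, 36, 40, 45, 46, 52, 70])

# Equal-width binning: 3 intervals of equal range
equal_width = pd.cut(ages, bins=3)

# Equal-frequency binning: 3 intervals holding roughly the same number of values
equal_freq = pd.qcut(ages, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())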
Normalization:
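For a value v of a numeric attribute A, the standard formulas for the methods listed in this module are:

Min-max (to a new range [new\_min_A, new\_max_A]):
v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A

Z-score:
v' = \frac{v - \bar{A}}{\sigma_A}

Decimal scaling:
v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1.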
Cosine Similarity:
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
Cos(x, y) = 1 indicates the two vectors point in the same direction; the closer the output is to 1,
the more similar the vectors are.
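Completing the example, the vector lengths and the resulting cosine are:
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 ≈ 6.48
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 ≈ 2.45
Cos(d1, d2) = 5 / (6.48 * 2.45) ≈ 0.31, so the two documents share some terms but are not very similar.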
• Distances, such as the Euclidean distance, have some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q (positive definiteness). A distance can never be negative.
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.