
MODULE-2

DATA PRE-PROCESSING
Contents

Types of data
Data Quality
Data Pre-processing Techniques
Similarity and Dissimilarity measures.

Introduction to data pre-processing:

Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.

Some common steps in data preprocessing include:

1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.

2. Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for
data integration.

3. Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization
is used to transform the data to have zero mean and unit variance. Discretization is used to
convert continuous data into discrete categories.

4. Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.

5. Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal width
binning, equal frequency binning, and clustering.

6. Data Normalization: This involves scaling the data to a common range, such as between 0 and
1 or -1 and 1. Normalization is often used to handle data with different units and scales.
Common normalization techniques include min-max normalization, z-score normalization, and
decimal scaling.

 Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy
of the analysis results.
 The specific steps involved in data preprocessing may vary depending on the nature of
the data and the analysis goals.



Steps Involved in Data Preprocessing:

1. Data Cleaning:

The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data etc.

a. Missing Data:

This situation arises when some values are missing in the data. It can be handled in various ways.

Some of them are:

*Ignore the tuples:

This approach is suitable only when the dataset we have is quite large and multiple values are missing
within a tuple.

*Fill the Missing values:

There are various ways to do this task. You can choose to fill the missing values manually, by attribute
mean or the most probable value.
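
As a minimal sketch (assuming Python with pandas, and hypothetical column names), missing values can be filled with the mean, the median, or the most probable (most frequent) value:

    import pandas as pd

    # Hypothetical toy dataset with missing values (NaN)
    df = pd.DataFrame({
        "age":    [25, 30, None, 40, 35],
        "income": [50000, None, 62000, 58000, None],
        "city":   ["Delhi", "Mumbai", None, "Delhi", "Chennai"],
    })

    # Fill numeric attributes with their mean (or median for skewed data)
    df["age"] = df["age"].fillna(df["age"].mean())
    df["income"] = df["income"].fillna(df["income"].median())

    # Fill a categorical attribute with its most probable (most frequent) value
    df["city"] = df["city"].fillna(df["city"].mode()[0])

    print(df)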

b. Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by
faulty data collection, data entry errors, etc. It can be handled in the following ways:

*Binning Method:

This method works on sorted data in order to smooth it. The whole data set is divided into segments of
equal size, and each segment is handled separately. All values in a segment can be replaced by the
segment's mean, or the segment's boundary values can be used to complete the smoothing.
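
A minimal Python/NumPy sketch of smoothing by bin means and by bin boundaries, using an assumed list of price values for illustration:

    import numpy as np

    def smooth_by_bin_means(values, bin_size):
        """Equal-frequency binning: sort, split into bins, replace each value by its bin mean."""
        data = np.sort(np.asarray(values, dtype=float))
        smoothed = data.copy()
        for start in range(0, len(data), bin_size):
            smoothed[start:start + bin_size] = data[start:start + bin_size].mean()
        return smoothed

    def smooth_by_bin_boundaries(values, bin_size):
        """Replace each value by the closer of its bin's minimum or maximum boundary."""
        data = np.sort(np.asarray(values, dtype=float))
        smoothed = data.copy()
        for start in range(0, len(data), bin_size):
            b = data[start:start + bin_size]
            lo, hi = b.min(), b.max()
            smoothed[start:start + bin_size] = np.where(b - lo <= hi - b, lo, hi)
        return smoothed

    # Assumed price values, as in the commonly used textbook illustration
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    print(smooth_by_bin_means(prices, bin_size=3))       # [9. 9. 9. 22. 22. 22. 29. 29. 29.]
    print(smooth_by_bin_boundaries(prices, bin_size=3))  # [4. 4. 15. 21. 21. 24. 25. 25. 34.]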

*Regression:

Here data can be made smooth by fitting it to a regression function. The regression used may be linear
(having one independent variable) or multiple (having multiple independent variables).

*Clustering:

This approach groups similar data into clusters. Outliers either fall outside the clusters or may go
undetected.



2. Data Transformation:

This step is taken in order to transform the data into forms appropriate for the mining process. It
involves the following ways:

*Normalization:

It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0).

*Attribute Selection:

In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.

*Discretization:

This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.

Concept Hierarchy Generation:

Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute "city" can be generalized to "country".

3. Data Reduction:

Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps involved in data reduction are:

Feature Selection:

This involves selecting a subset of relevant features from the dataset. Feature selection is often
performed to remove irrelevant or redundant features from the dataset. It can be done using various
techniques such as correlation analysis, mutual information, and principal component analysis (PCA).

Feature Extraction:

This involves transforming the data into a lower-dimensional space while preserving the important
information. Feature extraction is often used when the original features are high-dimensional and
complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-
negative matrix factorization (NMF).
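
A minimal sketch of feature extraction with PCA, assuming scikit-learn and a small hypothetical numeric data matrix:

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical data matrix: 6 objects, 4 numeric attributes
    X = np.array([
        [2.5, 2.4, 0.5, 1.0],
        [0.5, 0.7, 2.1, 0.3],
        [2.2, 2.9, 0.4, 1.1],
        [1.9, 2.2, 0.6, 0.9],
        [3.1, 3.0, 0.3, 1.2],
        [2.3, 2.7, 0.5, 1.0],
    ])

    # Project onto the 2 principal components that capture the most variance
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (6, 2): lower-dimensional representation
    print(pca.explained_variance_ratio_)  # share of variance preserved by each component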



Sampling:

This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the
size of the dataset while preserving the important information. It can be done using techniques such as
random sampling, stratified sampling, and systematic sampling.
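
A minimal pandas sketch of random, stratified, and systematic sampling on a hypothetical labelled dataset:

    import pandas as pd

    # Hypothetical dataset with a class label used for stratification
    df = pd.DataFrame({
        "value": range(100),
        "label": ["A"] * 80 + ["B"] * 20,
    })

    # Simple random sampling without replacement: keep 10% of the rows
    random_sample = df.sample(frac=0.10, random_state=42)

    # Stratified sampling: keep 10% of each class so the label ratio is preserved
    stratified_sample = pd.concat(
        g.sample(frac=0.10, random_state=42) for _, g in df.groupby("label")
    )

    # Systematic sampling: take every 10th row starting from a fixed offset
    systematic_sample = df.iloc[::10]

    print(len(random_sample), len(stratified_sample), len(systematic_sample))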

Clustering:

This involves grouping similar data points together into clusters. Clustering is often used to reduce the
size of the dataset by replacing similar data points with a representative centroid. It can be done using
techniques such as k-means, hierarchical clustering, and density-based clustering.

Compression:

This involves compressing the dataset while preserving the important information. Compression is
often used to reduce the size of the dataset for storage and transmission purposes. It can be done using
techniques such as wavelet compression, JPEG compression, and gzip compression.

Types of data in data mining


What is data?

*Collection of data objects and their attributes

*An attribute is a property or characteristic of an object

 Examples: eye color of a person, temperature, etc.

*Attribute is also known as variable, field, characteristic, or feature

*A collection of attributes describes an object

*Object is also known as record, point, case, sample, entity, or instance



Different types of attributes or data types:
In data mining, understanding the different types of attributes or data types is essential as it helps to
determine the appropriate data analysis techniques to use.

The following are the different types of data:

1. Nominal data:
 This type of data is also referred to as categorical data.
 Nominal data represents data that is qualitative and cannot be measured or compared with
numbers.
 In nominal data, the values represent a category, and there is no inherent order or hierarchy.
 Examples of nominal data include gender, race, religion, and occupation. Nominal data is
used in data mining for classification and clustering tasks.
 Nominal (symbolic, categorical)
 Values from an unordered set
 Ex: {red, yellow, blue, ….}
 Examples: ID numbers, eye color, zip codes

2. Ordinal Data:

 This type of data is also categorical, but with an inherent order or hierarchy.
 Ordinal data represents qualitative data that can be ranked in a particular order.
 For instance, education level can be ranked from primary to tertiary, and social status can be
ranked from low to high.
 In ordinal data, the distance between values is not uniform.



 This means that it is not possible to say that the difference between high and medium social
status is the same as the difference between medium and low social status.
 Ordinal data is used in data mining for ranking and classification tasks.
 Values from an ordered set
 Ex: {good, better, best}
 Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall,
medium, short}

3. Interval Data:

 This type of data represents quantitative data with equal intervals between consecutive values.
 Interval data has no absolute zero point, and therefore, ratios cannot be computed.
 Examples of interval data include temperature, IQ scores, and time. Interval data is used in
data mining for clustering and prediction tasks.
 Ex:
 Calendar dates
 Temperature in celsius/fah
 Examples: calendar dates, temperatures in Celsius or Fahrenheit.

4. Ratio Data:
 This type of data is similar to interval data, but with an absolute zero point.
 In ratio data, it is possible to compute ratios of two values, and this makes it possible to make
meaningful comparisons.
 Examples of ratio data include height, weight, and income. Ratio data is used in data mining
for prediction and association rule mining tasks.
 Difference between a person of age 35 and a person of age 38 is same as difference between
people who are 12 and 15. ( 35 to 38 = 3 , 12 to 15 = 3) 3:3.
 Examples: temperature in Kelvin, length, time, counts

5. Discrete Attributes:
 Has only a finite or countably infinite set of values
 Examples: zip codes, counts, or the set of words in a collection of documents
 Often represented as integer variables.
 Note: binary attributes are a special case of discrete attributes



6. Continuous attribute:
 Has real numbers as attribute values
 Examples: temperature, height, or weight.
 Practically, real values can only be measured and represented using a finite number of digits.
 Continuous attributes are typically represented as floating-point variables.

Properties of Attribute Values:


 The type of an attribute depends on which of the following properties it possesses:
 Distinctness: =, ≠
 Order: <, >
 Addition: +, −
 Multiplication: *, /
 Nominal attribute: distinctness
 Ordinal attribute: distinctness & order
 Interval attribute: distinctness, order & addition
 Ratio attribute: all 4 properties

Types of data sets:


 Record

 Data Matrix

 Document Data

 Transaction Data

 Graph

 World Wide Web

 Molecular Structures

 Ordered

 Spatial Data

 Temporal Data

 Sequential Data

 Genetic Sequence Data



1. Record dataset:
Record data consists of a collection of records, each of which has a fixed set of attributes.
 Data stored in flat files (e.g., Excel or CSV files)
 Or in an RDBMS

2. Data matrix:
 If data objects have the same fixed set of numeric attributes, then the data objects can be
thought of as points in a multi-dimensional space, where each dimension represents a distinct
attribute
 Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each
object, and n columns, one for each attribute

3. Document data:
Each document becomes a `term' vector,
a. each term is a component (attribute) of the vector,
b. the value of each component is the number of times the corresponding term occurs in
the document.



4. Transaction data:
A special type of record data, where
a. each record (transaction) involves a set of items.
b. For example, consider a grocery store. The set of products purchased by a customer
during one shopping trip constitute a transaction, while the individual products that
were purchased are the items.

5. Graph data:
Examples: Generic graph and HTML Links

• A graph is sometimes a more convenient and powerful representation of data.
• Can be used to capture relationships between data objects.
• Data objects themselves can be graphs.
• Ex: a set of linked web pages can be represented as a graph.



Chemical Data as a Graph:

Data with objects that are graphs:-

 Objects have sub-objects that have relationships

 Ex : structure of chemical compounds

Nodes – atoms

Links – chemical bonds

 Benzene Molecule: C6H6

 Mining Substructures

 Which substructures occur frequently

in a chemical compound?

 Is the presence of any substructure associated with any other?



6. Ordered data:
Attributes have relationships that involve order in time/space
Extension of a record data
Each record has a time associated with it
Each attribute can also be given a time stamp.



7. Time series data:
 A special type of sequential data
 Each record is a time-series i.e. a series of measurements taken over time
 Ex: financial data set has objects which are the time series of the daily prices of various
stocks.
 Have temporal autocorrelation
 If two measurements are close in time, then their values are often similar.

Data Quality: Why do we preprocess the data?

Data preprocessing is an essential step in data mining and machine learning as it helps to ensure the
quality of data used for analysis. There are several factors that are used for data quality assessment,
including:

 Accuracy: correct or wrong, accurate or not

 Completeness: not recorded, unavailable, …

 Consistency: some modified but some not, dangling, …

 Timeliness: timely update?

 Believability: how much can the data be trusted to be correct?

 Interpretability: how easily the data can be understood?



Major Tasks in Data Preprocessing:

1. Data cleaning:
Data cleaning is the process of removing incorrect data, incomplete data, and inaccurate data
from the datasets, and it also replaces the missing values. Here are some techniques for data
cleaning:

*Handling Missing Values:

 Standard values like “Not Available” or “NA” can be used to replace the missing values.
 Missing values can also be filled manually, but it is not recommended when that dataset is
big.
 The attribute's mean value can be used to replace the missing value when the data is normally
distributed,
 whereas in the case of a non-normal distribution the median value of the attribute can be used.
 While using regression or decision tree algorithms, the missing value can be replaced by the
most probable value.

*Handling Noisy Data

 Noisy data generally means random error or unnecessary data points. Handling noisy data is one
of the most important steps, as it improves the quality of the model being used. The binning,
regression, and clustering methods described under Data Cleaning can be applied here.
2. Data integration:

The process of combining multiple sources into a single dataset. The Data integration process is one of
the main components of data management. There are some problems to be considered during data
integration.

 Schema integration: Integrates metadata(a set of data that describes other data) from different
sources.
 Entity identification problem: Identifying entities from multiple databases. For example, the
system or the user should recognize that the student_id of one database and the student_name of
another database refer to the same entity.
 Detecting and resolving data value conflicts: The data taken from different databases may differ
while merging. The attribute values from one database may differ from another database.
For example, the date format may differ, like "MM/DD/YYYY" or "DD/MM/YYYY" (see the sketch below).
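
A minimal pandas sketch of resolving such a date-format conflict during integration (column contents are assumed for illustration):

    import pandas as pd

    # Hypothetical date columns from two sources with different formats
    us_dates = pd.Series(["03/07/2010", "12/25/2019"])    # MM/DD/YYYY
    intl_dates = pd.Series(["07/03/2010", "25/12/2019"])  # DD/MM/YYYY

    # Parse each source with its own format, yielding one consistent representation
    parsed_us = pd.to_datetime(us_dates, format="%m/%d/%Y")
    parsed_intl = pd.to_datetime(intl_dates, format="%d/%m/%Y")

    print(parsed_us.equals(parsed_intl))  # True: both now refer to the same dates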



3. Data reduction:

This process helps in the reduction of the volume of the data, which makes the analysis easier
yet produces the same or almost the same result. This reduction also helps to reduce storage
space. Some of the data reduction techniques are dimensionality reduction, numerosity
reduction, and data compression.

 Dimensionality reduction: This process is necessary for real-world applications, as the data size
is big. In this process the number of random variables or attributes is reduced so that the
dimensionality of the data set decreases, by combining and merging attributes without losing their
original characteristics. This also reduces storage space and computation time. When the data is
highly dimensional, a problem called the "curse of dimensionality" occurs.
 Numerosity Reduction: In this method, the representation of the data is made smaller by
reducing the volume. There will not be any loss of data in this reduction.
 Data compression: Representing the data in a compressed form is called data compression. This
compression can be lossless or lossy. When there is no loss of information during compression,
it is called lossless compression, whereas lossy compression reduces the information but removes
only the unnecessary information.
4. Data Transformation:

The change made in the format or the structure of the data is called data transformation. This
step can be simple or complex based on the requirements. There are some methods for data
transformation.

Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps
in knowing the important features of the dataset. By smoothing, we can find even a simple
change that helps in prediction.

Aggregation: In this method, the data is stored and presented in the form of a summary. The
data set, which may come from multiple sources, is integrated into a summary description for analysis.
This is an important step, since the accuracy of the results depends on the quantity and quality of the
data: when both are good, the results are more relevant.

Discretization: The continuous data here is split into intervals. Discretization reduces the data
size. For example, rather than specifying the class time, we can set an interval like (3 pm-5 pm,
or 6 pm-8 pm).



Normalization: It is the method of scaling the data so that it can be represented in a smaller
range. Example ranging from -1.0 to 1.0.

DATA CLEANING

 Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty,
human or computer error, transmission error
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
 e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?

Missing data /incomplete data:

 Data is not always available


 E.g., many tuples have no recorded value for several attributes, such as customer income in
sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of entry
 not register history or changes of the data
 Missing data may need to be inferred

How to handle missing values:

 Ignore the tuple: usually done when class label is missing (when doing classification)—not
effective when the % of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill in it automatically with



 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based such as Bayesian formula or decision tree

Noisy data:

 Noise: random error or variance in a measured variable


 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data

How to Handle Noisy Data?

 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with possible outliers)

Data Cleaning as a Process:

 Data discrepancy detection


 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule



 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check)
to detect errors and make corrections
 Data auditing: by analyzing data to discover rules and relationship to detect
violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter's Wheel)

2. Data integration:

 Combines data from multiple sources into a coherent store


 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources are different
 Possible reasons: different representations, different scales, e.g., metric vs. British
units.

Handling Redundancy in Data Integration:

 Redundant data occur often when integration of multiple databases


 Object identification: The same attribute or object may have different names in
different databases
 Derivable data: One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
 Redundant attributes may be able to be detected by correlation analysis and covariance
analysis
 Careful integration of the data from multiple sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality.



Correlation Analysis (Nominal Data):

1. Χ2 (chi-square) test:

 The larger the Χ2 value, the more likely the variables are related
 The cells that contribute the most to the Χ2 value are those whose actual count is very
different from the expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population

 Χ2 (chi-square) calculation (numbers in parentheses are expected counts, calculated based on
the data distribution in the two categories).
 It shows that like_science_fiction and play_chess are correlated in the group.
 For this 2 × 2 table, the degrees of freedom are (2 - 1)(2 - 1) = 1.
 For 1 degree of freedom, the χ2 value needed to reject the hypothesis at the 0.001 significance
level is 10.828.
 Since our computed value is above this, we can reject the hypothesis and conclude that the
two attributes are (strongly) correlated for the given group of people.
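
The contingency table from the original slide is not included in this text; the counts below follow the commonly used textbook illustration and are assumed here only to show the calculation with SciPy:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Assumed 2x2 contingency table:
    # rows: like_science_fiction (yes/no), columns: play_chess (yes/no)
    observed = np.array([
        [250, 200],
        [50, 1000],
    ])

    chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

    print(round(chi2, 2))  # ~507.93, far above the 0.001-level cutoff of 10.828 for 1 dof
    print(dof)             # 1
    print(expected)        # expected counts under independence (e.g., 90 for the first cell)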



Correlation Analysis (Numeric Data):

 Correlation coefficient (also called Pearson’s product moment coefficient).

i 1 (ai  A)(bi  B) 
n n
( ai bi )  n AB
rA, B   i 1

(n  1) A B (n  1) A B
where n is the number of tuples, and are the respective means of A and B, σA and σB are
the respective standard deviation of A and B, and Σ(aibi) is the sum of the AB cross-product.

 If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the
value, the stronger the correlation.
 rA,B = 0: independent; rA,B < 0: negatively correlated.
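
A minimal NumPy sketch of the correlation coefficient, computed both directly from the formula above and with the built-in routine (attribute values are assumed for illustration):

    import numpy as np

    # Hypothetical paired numeric attributes A and B
    A = np.array([6.0, 8.0, 10.0, 14.0, 18.0])
    B = np.array([2.0, 3.0, 5.0, 6.0, 8.0])

    # Direct implementation of the formula above
    n = len(A)
    r_manual = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

    # Same result from NumPy's built-in correlation matrix
    r_numpy = np.corrcoef(A, B)[0, 1]

    print(round(r_manual, 4), round(r_numpy, 4))  # both close to 1: strong positive correlation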



Visually Evaluating Correlation:

Scatter plots showing the similarity from –1 to 1.

Correlation (viewed as linear relationship):

 Correlation measures the linear relationship between objects


 To compute correlation, we standardize data objects, A and B, and then take their dot product.

Covariance (Numeric Data):

 Covariance is similar to correlation.

Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σᵢ (aᵢ − Ā)(bᵢ − B̄) = E(A·B) − Ā·B̄

where n is the number of tuples, Ā and B̄ are the respective mean or expected values of A and B,
and σA and σB are the respective standard deviations of A and B.

 Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected
values.
 Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be
smaller than its expected value.
 Independence: CovA,B = 0 but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence

Co-Variance: An Example
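
The worked example from the original slide is not included in this text; the following sketch uses assumed stock-price values only to illustrate the computation:

    import numpy as np

    # Assumed weekly closing prices of two stocks, A and B (illustrative values only)
    A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
    B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

    # Cov(A, B) = E(A*B) - mean_A * mean_B
    cov_manual = (A * B).mean() - A.mean() * B.mean()

    # NumPy equivalent (bias=True uses the population definition with divisor n)
    cov_numpy = np.cov(A, B, bias=True)[0, 1]

    print(round(cov_manual, 2), round(cov_numpy, 2))  # 4.0 -> positive: A and B tend to rise together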

3. Data reduction:
Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
Why data reduction? — A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data set.
Data reduction strategies



 Dimensionality reduction, e.g., remove unimportant attributes
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)
 Regression and Log-Linear Models
 Histograms, clustering, sampling
 Data cube aggregation
 Data compression

Data Reduction 1: Dimensionality Reduction:

 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering, outlier analysis,
becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)

Attribute Subset Selection:

 Another way to reduce dimensionality of data


 Redundant attributes
 Duplicate much or all of the information contained in one or more other attributes
 E.g., purchase price of a product and the amount of sales tax paid
 Irrelevant attributes
 Contain no information that is useful for the data mining task at hand
 E.g., students' ID is often irrelevant to the task of predicting students' GPA



1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced
set. The best of the original attributes is determined and added to the reduced set. At each subsequent
iteration or step, the best of the remaining original attributes is added to the set.

2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.

3. Combination of forward selection and backward elimination: The stepwise forward selection
and backward elimination methods can be combined so that, at each step, the procedure selects the
best attribute and removes the worst from among the remaining attributes.

4. Decision tree induction: Decision tree induction constructs a flowchart like structure where each
internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the
test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the
“best” attribute to partition the data into individual classes.
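
A minimal sketch of stepwise forward selection, using an assumed synthetic dataset and cross-validated linear regression as the scoring criterion (one possible scoring choice, not the only one); scikit-learn is assumed:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic data: 5 candidate attributes, only some of which are informative
    X, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)

    def forward_selection(X, y, n_select):
        """Greedy stepwise forward selection using cross-validated R^2 as the score."""
        remaining = list(range(X.shape[1]))
        selected = []
        while remaining and len(selected) < n_select:
            scores = {
                f: cross_val_score(LinearRegression(), X[:, selected + [f]], y, cv=5).mean()
                for f in remaining
            }
            best = max(scores, key=scores.get)   # attribute that improves the score the most
            selected.append(best)
            remaining.remove(best)
        return selected

    print(forward_selection(X, y, n_select=2))   # indices of the two best attributes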

Data Reduction 2: Numerosity Reduction:

 Reduce data volume by choosing alternative, smaller forms of data representation


 Parametric methods (e.g., regression)
 Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
 Ex.: Log-linear models—obtain value at a point in m-D space as the product on
appropriate marginal subspaces



 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …

Numerosity Reduction:

 Linear regression
 Histogram
 Clustering
 Sampling

Parametric Data Reduction: Regression and Log-Linear Models

 Linear regression
 Data modeled to fit a straight line
 Often uses the least-square method to fit the line
 Multiple regression
 Allows a response variable Y to be modeled as a linear function of multidimensional
feature vector
 Log-linear model
 Approximates discrete multidimensional probability distributions

Regression Analysis:

 Regression analysis: A collective name for techniques for the modeling and analysis of
numerical data consisting of values of a dependent variable (also called response variable or
measurement) and of one or more independent variables (aka. explanatory variables or
predictors)
 The parameters are estimated so as to give a "best fit" of the data
 Most commonly the best fit is evaluated by using the least squares method, but other criteria
have also been used

[Figure: scatter plot of points (X1, Y1) with a fitted regression line y = x + 1]
Regress Analysis and Log-Linear Models:

 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to be estimated by using
the data at hand
 Using the least squares criterion on the known values Y1, Y2, …, X1, X2, ….
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
 Log-linear models:
 Approximate discrete multidimensional probability distributions
 Estimate the probability of each point (tuple) in a multi-dimensional space for a set of
discretized attributes, based on a smaller subset of dimensional combinations
 Useful for dimensionality reduction and data smoothing
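
A minimal NumPy sketch of estimating the two regression coefficients w and b by least squares, using assumed (X, Y) values:

    import numpy as np

    # Hypothetical (X, Y) observations roughly following a line
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

    # Least-squares estimates of w (slope) and b (intercept) for Y = w*X + b
    w = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
    b = Y.mean() - w * X.mean()

    print(round(w, 3), round(b, 3))                   # fitted coefficients
    print(np.allclose([w, b], np.polyfit(X, Y, 1)))   # matches NumPy's least-squares fit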

Histogram Analysis:

 Divide data into buckets and store average (sum) for each bucket
 Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth)
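
A minimal pandas sketch of equal-width and equal-frequency bucketing, with assumed price values, storing the average per bucket:

    import pandas as pd

    # Hypothetical price values to be summarized by buckets
    prices = pd.Series([5, 8, 12, 15, 18, 22, 25, 30, 45, 60, 75, 90])

    # Equal-width buckets: each bucket spans the same range of values
    equal_width = pd.cut(prices, bins=4)

    # Equal-frequency (equal-depth) buckets: each bucket holds roughly the same count
    equal_freq = pd.qcut(prices, q=4)

    # Store the average per bucket, as a histogram-style summary
    print(prices.groupby(equal_width).mean())
    print(prices.groupby(equal_freq).mean())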

[Figure: equal-width histogram of price values, with buckets centered at 10,000–90,000 and counts up to about 40]
Data Cube Aggregation:

 The lowest level of a data cube (base cuboid)


 The aggregated data for an individual entity of interest
 E.g., a customer in a phone calling data warehouse
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to solve the task
 Queries regarding aggregated information should be answered using data cube, when possible

4. Data transformation and discretization:

 Discretization
 Supervised
 Entropy – based
 Unsupervised
 Equal width and equal frequency
 Normalization
 Min-max



 Z-score
 Decimal scaling
 Binarization

Discretization/Quantization:

 Three types of attributes:


 Nominal – values from an unordered set
 Ordinal – values from an ordered set
 Continuous – real numbers
 Discretization :
 Divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical attributes
 Reduce data size by discretization
 Prepare for further analysis

Transformation by Discretization:

Some algorithms require nominal/discrete attributes.

Discretization methods:

 Unsupervised
 Independent of the class label



 Ex: Equal width binning, equal frequency binning
 Supervised
 Dependent on the class label
 Ex: entropy based binning

Unsupervised Discretization:



Entropy Based Discretization- Supervised:
 Uses the class info present in the data
 Entropy(info content) is calculated based on the class label
 Tries to find the best split so that bins are as pure as possible.
 Pure bin : majority of the values in a bin should correspond to the same class.
 Purity of a bin is measured using its entropy
 Entropy
 Zero – perfectly pure bin
 Max (1) – impure – equal class distribution



Entropy:

E(S) = − Σᵢ pᵢ log₂(pᵢ), where pᵢ is the fraction of values in S that belong to class i.

Procedure:

1. Sort the attribute values to be discretized, S.

2. Bisect the initial values so that the resulting two intervals have minimum entropy:

i. Consider each value T as a possible split point, where T is the midpoint of each pair of
consecutive attribute values.
ii. Compute the information gain before and after choosing T as a split point:

Gain = E(S) – E(T, S)

iii. Select the best T, which gives the highest information gain, as the optimum split.

3. Repeat step 2 with another interval (the one with the highest entropy) until a user-specified
number of intervals is reached or some stopping criterion is met.
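
A minimal Python sketch of a single entropy-based split (steps i–iii above), using assumed attribute values and class labels:

    import numpy as np

    def entropy(labels):
        """E(S) = -sum(p_i * log2(p_i)) over the class proportions in S."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def best_split(values, labels):
        """Return the midpoint T that maximizes Gain = E(S) - E(T, S)."""
        order = np.argsort(values)
        values, labels = np.asarray(values)[order], np.asarray(labels)[order]
        base = entropy(labels)
        best_t, best_gain = None, -1.0
        for i in range(1, len(values)):
            t = (values[i - 1] + values[i]) / 2.0        # candidate split point
            left, right = labels[values <= t], labels[values > t]
            # E(T, S): weighted entropy of the two resulting intervals
            e_split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - e_split
            if gain > best_gain:
                best_t, best_gain = t, gain
        return best_t, best_gain

    # Hypothetical attribute values with class labels
    temps = [60, 64, 68, 70, 75, 80, 85, 90]
    play  = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]
    print(best_split(temps, play))   # best split point (here 69.0) and its information gain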

Normalization:

 Scale attribute values to fall within a small, specified range.


 Min-max
 Z-score
 Decimal scaling
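
A minimal NumPy sketch of the three normalization methods, using assumed attribute values; the decimal-scaling exponent j is chosen as the smallest power of 10 that brings every absolute value below 1:

    import numpy as np

    # Hypothetical attribute values (e.g., incomes)
    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # Min-max normalization to the new range [0, 1]
    new_min, new_max = 0.0, 1.0
    min_max = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # Z-score normalization: zero mean, unit standard deviation
    z_score = (v - v.mean()) / v.std()

    # Decimal scaling: divide by 10^j so that max(|v'|) < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    decimal_scaled = v / (10 ** j)

    print(min_max)                 # [0.    0.125 0.25  0.5   1.   ]
    print(np.round(z_score, 3))
    print(decimal_scaled)          # [0.02 0.03 0.04 0.06 0.1 ]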

Binarization:

 Transforming a continuous or discrete attribute into one or more binary attributes


 Why?
 ARM (association rule mining) can be done only on binarized data.
 But the input data set may have numeric/discrete attributes.

Binarizing a categorical attribute:

If the categorical attribute has m distinct values,

1. assign a unique integer from 0 to m-1 to each value.


2. Represent each integer using unique bit combinations
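
A minimal pandas sketch of both steps, using an assumed categorical attribute; it also shows the related one-hot encoding (one binary column per category), which is a common alternative to packed bit combinations:

    import math
    import pandas as pd

    # Hypothetical categorical attribute with m = 3 distinct values
    colors = pd.Series(["red", "green", "blue", "green", "red"])

    # Step 1: assign a unique integer 0..m-1 to each value
    codes, categories = pd.factorize(colors)
    print(dict(zip(categories, range(len(categories)))))    # {'red': 0, 'green': 1, 'blue': 2}

    # Step 2: represent each integer using unique bit combinations (ceil(log2(m)) bits)
    n_bits = max(1, math.ceil(math.log2(len(categories))))
    print([format(int(c), f"0{n_bits}b") for c in codes])    # ['00', '01', '10', '01', '00']

    # Common alternative: asymmetric binary (one-hot) attributes, one 0/1 column per value
    print(pd.get_dummies(colors, prefix="color").astype(int))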

Similarity and Dissimilarity:

• Important in data mining – used in clustering, some classification, and anomaly detection.


• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Objects with Single Attribute:

p and q are the attribute values for two data objects.



Dissimilarities between Data Objects with multiple Numeric attributes:

1. Euclidean Distance:

d(p, q) = sqrt( Σₖ (pₖ − qₖ)² ), where k ranges over the attributes (dimensions) of the objects p and q.


2. Minkowski Distance:

The Minkowski distance is a generalization of the Euclidean distance: d(p, q) = (Σₖ |pₖ − qₖ|^r)^(1/r).
With r = 1 it gives the Manhattan (city-block) distance, with r = 2 the Euclidean distance, and as
r → ∞ the supremum (Lmax) distance.

Minkowski Distance: Examples:
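
The worked examples from the original slide are not included in this text; a minimal SciPy sketch with assumed points illustrates the r = 1, r = 2, and r → ∞ cases:

    import numpy as np
    from scipy.spatial import distance

    # Two hypothetical points with numeric attributes
    x = np.array([0.0, 2.0])
    y = np.array([3.0, 6.0])

    manhattan = distance.minkowski(x, y, p=1)   # r = 1: city-block distance
    euclidean = distance.minkowski(x, y, p=2)   # r = 2: Euclidean distance
    supremum  = distance.chebyshev(x, y)        # r -> infinity: max coordinate difference

    print(manhattan, euclidean, supremum)       # 7.0 5.0 4.0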



SMC versus Jaccard: Example:



• SMC - Counts both presences and absences equally. Used for objects with symmetric
binary attributes.
• Can be used to find students who answered similarly in a test – true/false questions
• JC is used to handle objects with asymmetric binary attributes.
• Ex: In a transaction database (TDB):
• The number of products not purchased is far greater than the number purchased.
• SMC would say all transactions are very similar.
• Use JC instead.
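
The example vectors from the original slide are not included in this text; the sketch below uses assumed binary vectors to show how SMC and the Jaccard coefficient (JC) are computed and how they can disagree:

    import numpy as np

    # Assumed binary vectors (the slide's original example values are not in the text)
    p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

    f11 = np.sum((p == 1) & (q == 1))   # attributes where both are 1
    f00 = np.sum((p == 0) & (q == 0))   # attributes where both are 0
    f10 = np.sum((p == 1) & (q == 0))
    f01 = np.sum((p == 0) & (q == 1))

    smc = (f11 + f00) / (f11 + f00 + f10 + f01)   # counts matches of both kinds
    jaccard = f11 / (f11 + f10 + f01)             # ignores 0-0 matches

    print(smc, jaccard)   # 0.7 and 0.0: SMC looks "similar", Jaccard does not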

Cosine Similarity:

• If d1 and d2 are two document vectors, then

cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

where • indicates the vector dot product and ||d|| is the length of vector d.

• Example:

d1 = 3 2 0 5 0 0 0 2 0 0

d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5



||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481

||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150

cos(x, y) = 0 indicates that the two vectors are completely dissimilar (they share no common terms).

cos(x, y) = 1 indicates that the two vectors point in the same direction; the nearer the value is to 1,
the more similar they are.

In this case the value 0.3150 is closer to 0, so d1 and d2 are relatively dissimilar.
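
A minimal NumPy sketch that reproduces the hand calculation above:

    import numpy as np

    # Document term-frequency vectors from the worked example above
    d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
    d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

    cos_sim = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
    print(round(cos_sim, 4))   # ≈ 0.315, matching the hand calculation above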

Extended Jaccard Coefficient (Tanimoto):

EJ(x, y) = (x • y) / (||x||² + ||y||² − x • y)


Common Properties of a Distance:

• Distances, such as the Euclidean distance, have some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q. (Positive definiteness.) Distances are never negative.



2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.

p & q are the data points on the graph.

• A distance that satisfies all these properties is a metric

Common Properties of a Similarity:

• Similarities also have some well-known properties.


1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects), p and q.
