Data Preprocessing


Data Mining

By :
Parul Chauhan
Assistant Prof.
Data Mining
▶ Huge amounts of data are added to our computer
networks, the World Wide Web, and various storage
devices every day from media, Facebook, science, etc.

▶ Example:
▶ Walmart handles hundreds of millions of transactions
per week at thousands of branches.



▶ Solution:

▶ Powerful and versatile tools are badly needed to
automatically uncover valuable information from the
tremendous amounts of data and to transform such
data into organized knowledge.

▶ This led to the birth of data mining.



KDD
▶ Data mining does not simply mean extracting data from
huge amounts of data.

▶ It actually means extracting knowledge from huge
amounts of data.

▶ Therefore, it is also known as Knowledge Discovery
from Data (KDD).

▶ Knowledge discovery is a bigger process, and data
mining is just one part of it.



Data mining as a step in the process of knowledge discovery
KDD
▶ 1. Data Cleaning - to remove noise and inconsistent data.
Performed on the various sources.
▶ 2. Data Integration - multiple sources may be
combined.
▶ 3. Data Selection - where data relevant to the analysis
task are retrieved from the database.
▶ 4. Data Transformation - where data are transformed
into forms appropriate for mining, e.g., by performing
summary or aggregation operations.



KDD
▶ 5. Data Mining - an essential process where intelligent
methods are applied to extract data patterns.
▶ 6. Pattern evaluation - To identify the truly interesting
patterns representing knowledge based on interestingness
measures.
▶ 7. Knowledge Representation - Where visualization and
knowledge representation techniques are used to present
mined knowledge to users.
▶ Steps 1-4 are different forms of preprocessing, where data
are prepared for mining.



What kinds of data can be mined?
1. Relational database - a collection of tables, each
of which is assigned a unique name. Each table
consists of a set of attributes (columns or fields) and
usually stores a large set of tuples (records or rows).

2. It can be accessed by database queries written in a
language such as SQL.

3. Example - How many people with an income of up to
$100K are defaulted borrowers?
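
As a rough illustration (not part of the slides), such a question can be answered with an SQL query; the in-memory table, its columns, and the rows below are hypothetical:

```python
import sqlite3

# Hypothetical borrowers table (name, columns, and rows are invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE borrowers (name TEXT, income REAL, defaulted INTEGER)")
conn.executemany(
    "INSERT INTO borrowers VALUES (?, ?, ?)",
    [("A", 45000, 1), ("B", 120000, 0), ("C", 98000, 1), ("D", 60000, 0)],
)

# "How many people with an income of up to $100K are defaulted borrowers?"
count = conn.execute(
    "SELECT COUNT(*) FROM borrowers WHERE income <= 100000 AND defaulted = 1"
).fetchone()[0]
print(count)  # -> 2
```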



Relational Database



2. Data Warehouse - a repository of information
collected from multiple sources, stored under a
unified schema, and usually residing at a single site.

The data in the data warehouse are organized around
major subjects, such as customer, item, supplier, and
activity. The data are stored to provide information
from a historical perspective (such as the past 5-10
years) and are typically summarized.



▶ Example-

◦ An international company has branches all over the
world.
◦ Each branch has its own set of databases.
◦ The owner has asked you to provide an analysis of the
company's sales per item per branch for the third
quarter.
◦ Difficult - the data is spread out over several databases.
◦ If the company had a data warehouse, this task would be easy.

▶ Rather than storing the details of each sales transaction,
the data warehouse may store a summary of the
transactions per item type for each store.



3. Transactional Databases - each record represents a
transaction.

A transaction includes a unique transaction identity
number (trans_ID) and a list of the items making up
the transaction.

So, as an analyst, we can ask - which items sold
together?

This kind of analysis is called Market Basket Analysis.
Such analysis can be used to boost sales.
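
A minimal sketch (not from the slides; the transactions are invented) of a basic market-basket style question, counting which item pairs appear in the same transaction:

```python
from collections import Counter
from itertools import combinations

# Invented transactional data: (trans_ID, list of items).
transactions = [
    ("T100", ["bread", "milk", "butter"]),
    ("T200", ["bread", "butter"]),
    ("T300", ["milk", "diapers"]),
    ("T400", ["bread", "milk", "butter", "diapers"]),
]

# Count how often each unordered pair of items is sold together.
pair_counts = Counter()
for _, items in transactions:
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
# e.g. [(('bread', 'butter'), 3), (('bread', 'milk'), 2), (('butter', 'milk'), 2)]
```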

Data Pre-processing
▶ Today’s real-world databases are highly susceptible to
noisy, missing, and inconsistent data due to their
typically huge size and their likely origin from multiple,
heterogeneous sources.

▶ Low-quality data will lead to low-quality mining results.

“How can the data be preprocessed in order to help improve
the quality of the data and, consequently, of the mining
results?”



Data Quality: Why Preprocess the Data?
▶ Imagine that you are a manager at AllElectronics and
have been charged with analyzing the company’s data
with respect to your branch’s sales.

▶ You carefully inspect the company’s database and
data warehouse, identifying and selecting the
attributes (e.g., item, price, and units sold) to be
included in your analysis.

▶ You notice that several of the attributes for various
tuples have no recorded value.

▶ For your analysis, you would like to include information as to
whether each item purchased was advertised as on sale, yet
you discover that this information has not been recorded.

▶ Furthermore, users of your database system have reported
errors, unusual values, and inconsistencies in the data
recorded for some transactions.

▶ The data you wish to analyze by data mining techniques are
incomplete (lacking attribute values or certain attributes of
interest, or containing only aggregate data); inaccurate or
noisy (containing errors, or values that deviate from the
expected); and inconsistent (e.g., containing discrepancies in
the department codes used to categorize items).



▶ This scenario illustrates three of the elements defining
data quality: accuracy, completeness, and consistency.
▶ There are many possible reasons for inaccurate data:
a) The data collection instruments used may be faulty.
b) There may have been human or computer errors
occurring at data entry.
c) Users may purposely submit incorrect data values for
mandatory fields when they do not wish to submit
personal information.



Incomplete data

a) Attributes of interest may not always be available.

b) Data may not be included simply because they were
not considered important at the time of entry.

c) Relevant data may not be recorded due to a
misunderstanding or because of equipment
malfunctions.



Inconsistent

▶ Data that were inconsistent with other recorded data
may have been deleted.

▶ The recording of the data history or modifications may
have been overlooked.



Two other factors affecting data quality are believability
and interpretability.

▶ Believability reflects how much the data are trusted
by users, while

▶ Interpretability reflects how easily the data are
understood.



Major Tasks in Data Preprocessing

▶ The major tasks are data cleaning, data integration, data
reduction, and data transformation, each discussed below.


1. DATA CLEANING

▶ Real-world data tend to be incomplete, noisy, and
inconsistent.

▶ Data cleaning (or data cleansing) routines attempt to fill in
missing values, smooth out noise while identifying outliers,
and correct inconsistencies in the data.

▶ 1. Missing Values
▶ 2. Noisy Data



A) Missing Values
▶ Imagine that you need to analyze AllElectronics sales
and customer data. You note that many tuples have
no recorded value for several attributes such as
customer income.

▶ How can you go about filling in the missing values for
this attribute?

▶ Various methods are:



1. Ignore the tuple: This method is not very effective, unless the
tuple contains several attributes with missing values.
By ignoring the tuple, we do not make use of the remaining
attributes’ values in the tuple. Such data could have been useful to
the task at hand.

2. Fill in the missing value manually: This approach is time
consuming and may not be feasible given a large data set with
many missing values.

3. Use a measure of central tendency for the attribute to fill in
the missing value:
▶ For example, suppose that the data distribution regarding the
income of AllElectronics customers is symmetric and that the
mean income is $56,000. Use this value to replace the missing
value for income.



4. Use a global constant to fill in the missing value.

5. Use the attribute mean or median for all samples
belonging to the same class as the given tuple.

6. Use the most probable value to fill in the missing
value.
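
A small pandas sketch (pandas is assumed to be available; the column names and values are invented) illustrating methods 1, 3, and 5 above:

```python
import pandas as pd

# Invented customer data with missing income values.
df = pd.DataFrame({
    "risk_class": ["low", "low", "high", "high"],
    "income":     [56000, None, 30000, None],
})

# Method 1: ignore (drop) tuples with a missing income.
dropped = df.dropna(subset=["income"])
print(len(dropped))                # -> 2

# Method 3: fill with a measure of central tendency (here, the overall mean).
filled_mean = df["income"].fillna(df["income"].mean())
print(filled_mean.tolist())        # -> [56000.0, 43000.0, 30000.0, 43000.0]

# Method 5: fill with the mean income of samples in the same class.
filled_class_mean = df["income"].fillna(
    df.groupby("risk_class")["income"].transform("mean")
)
print(filled_class_mean.tolist())  # -> [56000.0, 56000.0, 30000.0, 30000.0]
```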



Note:
▶ In some cases, a missing value may not imply an error
in the data!
For example, when applying for a credit card, candidates
may be asked to supply their driver’s license number.
Candidates who do not have a driver’s license may
naturally leave this field blank.
▶ Forms should allow respondents to specify values
such as “not applicable.”



B) Noisy data
▶ Noise is a random error or variance in a measured
variable.
▶ Let’s look at the following data smoothing techniques:

i. Binning: smooths a sorted data value by consulting
its “neighborhood,” that is, the values around it.

ii. Regression

iii. Clustering / outlier analysis



i) Binning
▶ The sorted values are distributed into a number of
“buckets,” or bins.
▶ Because binning methods consult the neighborhood of
values, they perform local smoothing.
▶ The original data values are divided into small
intervals known as bins and then they are replaced by
a general value calculated for that bin.



Steps for Binning:

1. Sort the given data set.

2. Divide the sorted data into N bins, each containing
approximately the same number of samples (equal-depth
partitioning).

3. Replace the values in each bin with the bin mean,
median, or boundaries.



Sorted data for price:
4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-depth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

A) Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29



4, 8, 15, 21, 21, 24, 25, 28, 34

B) Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

C) Smoothing by bin medians:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
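
A short Python sketch (not part of the slides) that reproduces the three bins and the smoothing results shown above:

```python
# Equal-depth binning of the sorted prices, then smoothing by means,
# medians, and bin boundaries.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]      # already sorted
depth = 3                                         # 3 values per bin
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

def by_means(b):
    m = round(sum(b) / len(b))
    return [m] * len(b)

def by_medians(b):
    return [b[len(b) // 2]] * len(b)              # middle value of a sorted, odd-sized bin

def by_boundaries(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

for b in bins:
    print(b, "->", by_means(b), by_medians(b), by_boundaries(b))
# [4, 8, 15]   -> [9, 9, 9]    [8, 8, 8]    [4, 4, 15]
# [21, 21, 24] -> [22, 22, 22] [21, 21, 21] [21, 21, 24]
# [25, 28, 34] -> [29, 29, 29] [28, 28, 28] [25, 25, 34]
```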



Sorted data: 410, 451, 492, 533, 533, 575, 615, 656, 697,
738, 779, 820

Partition into (equal-depth) bins:
Bin 1: 410, 451, 492
Bin 2: 533, 533, 575
Bin 3: 615, 656, 697
Bin 4: 738, 779, 820



410, 451, 492, 533, 533, 575, 615, 656, 697,
738, 779, 820

A) Smoothing by bin means:
Bin 1: 451, 451, 451
Bin 2: 547, 547, 547
Bin 3: 656, 656, 656
Bin 4: 779, 779, 779

CONS of data smoothing

▶ Data smoothing doesn’t always provide a clear
explanation of the patterns in the data.

▶ It is possible that certain data points are ignored
while focusing on other data points.



ii) Regression
▶ Regression is a data mining function that predicts a
number. Age, weight, distance, temperature, income, or
sales could all be predicted using regression techniques.

▶ For example, a regression model could be used to predict
children's height, given their age, weight, and other factors.

▶ Data can be smoothed by fitting the data to a function, such
as with regression. (Linear regression finds the best line to fit
two attributes.)
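
A minimal sketch (numpy is assumed; the x and y values are invented) of smoothing by fitting a straight line to two attributes and replacing each observed value with the fitted one:

```python
import numpy as np

# Invented noisy measurements: y is roughly linear in x, plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2])

# Linear regression: find the best-fitting line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Smoothed data: each y value is replaced by the value predicted by the line.
y_smoothed = slope * x + intercept
print(np.round(y_smoothed, 2))
```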



iii) Clustering

▶ Outliers may be detected by clustering, where similar
values are organized into groups, or clusters.

▶ Values that fall outside of the set of clusters may be
considered outliers.
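
A rough sketch (standard library only; the values are invented) of this idea: nearby values are grouped into clusters, and values that end up in very small clusters far from the rest are treated as outliers:

```python
# Naive one-dimensional clustering for outlier detection.
values = [22, 23, 25, 24, 26, 71, 20, 23, 25, 24, 108]
radius = 10                       # a value joins a cluster if it is this close to its mean

clusters = []                     # each cluster is a list of similar values
for v in sorted(values):
    for c in clusters:
        if abs(v - sum(c) / len(c)) <= radius:
            c.append(v)
            break
    else:
        clusters.append([v])      # no nearby cluster -> start a new one

outliers = [v for c in clusters for v in c if len(c) < 2]
print(clusters)                   # -> [[20, 22, ..., 26], [71], [108]]
print(outliers)                   # -> [71, 108]
```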



2. Data Integration
▶ Data integration - the merging of data from multiple data
stores.

▶ Careful integration can help reduce and avoid
redundancies and inconsistencies in the resulting data set.

▶ The heterogeneity and structure of data pose great
challenges in data integration.

a) Entity Identification Problem
b) Redundancy
c) Data Value Conflict Detection and Resolution



1. Entity Identification Problem

▶ For example,

▶ How can the data analyst or the computer be sure
that customer id in one database and cust number in
another refer to the same attribute?



2. Redundancy
▶ Redundancy is another important issue in data
integration.
▶ An attribute (such as annual revenue) may be redundant
if it can be “derived” from another attribute or set of
attributes.
▶ Some redundancies can be detected by correlation
analysis.
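
A small sketch (numpy assumed; the figures are invented) of correlation analysis: a correlation coefficient close to +1 or -1 between two numeric attributes suggests that one of them is redundant:

```python
import numpy as np

# Invented attributes: annual_revenue is essentially 12 x monthly_sales,
# so the two attributes carry almost the same information.
monthly_sales  = np.array([10.0, 12.0, 15.0, 9.0, 20.0, 18.0])
annual_revenue = np.array([121.0, 144.0, 178.0, 109.0, 239.0, 217.0])

r = np.corrcoef(monthly_sales, annual_revenue)[0, 1]
print(round(r, 4))   # very close to 1.0 -> one attribute can likely be dropped
```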



3. Data Value Conflict Detection and Resolution

▶ For example, for the same real-world entity,
attribute values from different sources may differ.

▶ This may be due to differences in representation,
scaling, or encoding.

▶ For instance, a weight attribute may be stored in
metric units in one system and in British imperial units in
another.
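
A tiny sketch (field names, values, and the conversion target are chosen for illustration) of resolving such a conflict by converting all values to one unit before integrating the sources:

```python
# Source A stores weight in kilograms, source B in pounds.
source_a = [("item_1", 2.5, "kg"), ("item_2", 0.8, "kg")]
source_b = [("item_3", 4.4, "lb"), ("item_4", 1.1, "lb")]

LB_TO_KG = 0.45359237

def to_kg(value, unit):
    return value if unit == "kg" else value * LB_TO_KG

# Integrated view with a single, consistent unit.
integrated = [(name, round(to_kg(v, u), 3), "kg") for name, v, u in source_a + source_b]
print(integrated)
```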



3. Data Reduction
▶ Imagine that you have selected data from the
AllElectronics data warehouse for analysis.

▶ The data set will likely be huge! Complex data analysis and
mining on huge amounts of data can take a long time,
making such analysis impractical or infeasible.

▶ Data reduction techniques can be applied to obtain a
reduced representation of the data set that is much
smaller in volume, yet closely maintains the integrity of
the original data.



Data reduction strategies include:

a) Dimensionality reduction,

b) Numerosity reduction, and

c) Data compression.



i) Dimensionality reduction

▶ Dimensionality reduction is the process of reducing
the number of random variables or attributes under
consideration.

▶ Dimensionality reduction methods include wavelet
transforms, which transform or project the original
data onto a smaller space.
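
The slide names wavelet transforms; as a different but also widely used projection technique, here is a minimal principal-components sketch (numpy assumed, data invented) that projects the records onto a smaller space:

```python
import numpy as np

# Invented data set: 6 records, 3 attributes.
X = np.array([
    [2.5, 2.4, 1.1],
    [0.5, 0.7, 0.2],
    [2.2, 2.9, 1.0],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.4],
    [2.3, 2.7, 1.2],
])

# Centre the data and take the leading right singular vectors as new axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                          # keep only 2 projected attributes instead of 3
X_reduced = Xc @ Vt[:k].T
print(X_reduced.shape)         # -> (6, 2)
```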



ii) Numerosity reduction
▶ Numerosity reduction techniques replace the original
data volume by alternative, smaller forms of data
representation. These techniques may be parametric or
nonparametric.

▶ For parametric methods, a model is used to estimate the
data, so that typically only the data parameters need to be
stored, instead of the actual data (e.g., regression).

▶ Nonparametric methods for storing reduced
representations of the data include histograms, clustering,
sampling, and data cube aggregation.



a) Data Cube Aggregation
▶ Imagine that you have collected the data for your analysis.
These data consist of the AllElectronics sales per quarter,
for the years 2008 to 2010.

▶ You are, however, interested in the annual sales (total per
year), rather than the total per quarter.

▶ Thus, the data can be aggregated so that the resulting
data summarize the total sales per year instead of per
quarter.

▶ Data cubes store multidimensional aggregated
information.
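
A brief pandas sketch (the sales figures are invented) of exactly this aggregation, from quarterly to annual totals:

```python
import pandas as pd

# Invented quarterly sales data.
quarterly = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 301, 456, 398, 612],
})

# Aggregate: total sales per year instead of per quarter.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
#    year  sales
# 0  2008   1568
# 1  2009   1767
```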



b) Histograms
▶ Histograms use binning to approximate data distributions and are a
popular form of data reduction.

▶ A histogram for an attribute, A, partitions the data distribution of A
into subsets, referred to as buckets or bins.

▶ The following data are a list of AllElectronics prices for commonly sold
items (rounded to the nearest dollar). The numbers have been sorted:

1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18,
18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28,
28, 30, 30, 30.



▶ If each bucket represents only a single
attribute–value/frequency pair, the buckets are called
singleton buckets.



▶ To further reduce the data, it is common to have each
bucket denote a continuous value range for the given
attribute.
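
A short standard-library sketch that builds both kinds of buckets for the price list above: singleton buckets (one value/frequency pair per distinct price) and equal-width buckets of width 10:

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Singleton buckets: frequency of each individual price.
singleton_buckets = Counter(prices)
print(singleton_buckets[20])      # -> 7

# Buckets covering continuous value ranges of width 10: 1-10, 11-20, 21-30.
width = 10
range_buckets = Counter((p - 1) // width for p in prices)
for b in sorted(range_buckets):
    print(f"{b * width + 1}-{(b + 1) * width}: {range_buckets[b]} items")
# 1-10: 13 items, 11-20: 25 items, 21-30: 14 items
```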



c) Clustering
▶ They partition the objects into groups, or clusters, so
that objects within a cluster are “similar” to one
another and “dissimilar” to objects in other clusters.

▶ The “quality” of a cluster may be represented by its
diameter, the maximum distance between any two
objects in the cluster.



iii) Data compression
In data compression, transformations are applied so as to obtain
a reduced or “compressed” representation of the original data.

▶ If the original data can be reconstructed from the compressed
data without any information loss, the data reduction is called
lossless.

▶ If, instead, we can reconstruct only an approximation of the
original data, then the data reduction is called lossy.

▶ Dimensionality reduction and numerosity reduction techniques
can also be considered forms of data compression.
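
A tiny sketch using Python's standard zlib module to illustrate the lossless case: the original bytes are recovered exactly after decompression.

```python
import zlib

data = ("AllElectronics sales log line, repeated many times. " * 50).encode("utf-8")

compressed = zlib.compress(data)           # reduced ("compressed") representation
restored = zlib.decompress(compressed)     # exact reconstruction -> lossless

print(len(data), len(compressed))          # the compressed form is much smaller
print(restored == data)                    # -> True
```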



4. Data Transformation
▶ In data transformation, the data are transformed or
consolidated into forms appropriate for mining.

▶ Various normalization methods include:

a) Min-max normalization

b) Z-score normalization

c) Decimal scaling



Data Transformation Strategies
1. Smoothing, which works to remove noise from the data.
Techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where
new attributes are constructed and added from the given set
of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations
are applied to the data. For example, the daily sales data
may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a
data cube for data analysis at multiple abstraction levels.



4. Normalization, where the attribute data are scaled
so as to fall within a smaller range, such as -1.0 to 1.0, or
0.0 to 1.0.
5. Discretization, where the raw values of a numeric
attribute (e.g., age) are replaced by interval labels (e.g.,
0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult,
senior). The labels, in turn, can be recursively organized
into higher-level concepts, resulting in a concept
hierarchy for the numeric attribute.



6. Concept hierarchy generation for nominal data
where attributes such as street can be generalized to
higher-level concepts, like city or country. Many
hierarchies for nominal attributes are implicit within the
database schema and can be automatically defined at
the schema definition level.
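
A short pandas sketch (pandas assumed; the cut-off points for the conceptual labels are chosen arbitrarily) of discretizing age into interval labels and into conceptual labels, as described in strategies 5 and 6:

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 22, 25, 30, 35, 40, 46, 52, 70])

# Interval labels such as 0-10, 11-20, 21-30, ...
intervals = pd.cut(
    ages, bins=[0, 10, 20, 30, 40, 50, 60, 70],
    labels=["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70"],
)

# Conceptual labels such as youth, adult, senior.
concepts = pd.cut(ages, bins=[0, 20, 55, 120], labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```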



▶ Min-max normalization maps a value v of an attribute A to
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A,
where [min_A, max_A] is the original range of A and
[new_min_A, new_max_A] is the new range.



▶ Z-score normalization maps a value v of an attribute A to
v' = (v - mean_A) / std_A,
where mean_A and std_A are the mean and standard deviation of A.



▶ Suppose that the mean and standard deviation of the
values for the attribute income are $54,000 and
$16,000, respectively.

▶ With z-score normalization, a value of $73,600 for
income is transformed to (73,600 - 54,000) / 16,000 = 1.225.



▶ Normalization by decimal scaling maps a value v of an attribute A to
v' = v / 10^j,
where j is the smallest integer such that max(|v'|) < 1.



▶ Suppose that the recorded values of A range from
-986 to 917. The maximum absolute value of A is 986.
To normalize by decimal scaling, we therefore divide
each value by 1000 (i.e., j =3) so that -986 normalizes
to -0.986 and 917 normalizes to 0.917.



Example
▶ The following data (in increasing order) are values of the
attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25,
25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
▶ (a) Use min-max normalization to transform the value
35 for age onto the range [0.0, 1.0].
▶ (b) Use z-score normalization to transform the value
35 for age, where the standard deviation of age is
12.94 years.
▶ (c) Use normalization by decimal scaling to transform
the value 35 for age.



Solution
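
The worked solution on this slide is not reproduced here; the following minimal Python sketch (not from the original slides) computes parts (a)-(c) from the data and the given standard deviation:

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33,
        35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
v = 35

# (a) Min-max normalization onto [0.0, 1.0].
v_minmax = (v - min(ages)) / (max(ages) - min(ages))
print(round(v_minmax, 3))        # (35 - 13) / (70 - 13) -> about 0.386

# (b) Z-score normalization, using the given standard deviation of 12.94 years.
mean_age = sum(ages) / len(ages)         # about 29.96
v_zscore = (v - mean_age) / 12.94
print(round(v_zscore, 3))        # -> about 0.389

# (c) Normalization by decimal scaling: max |age| is 70, so divide by 100 (j = 2).
print(v / 100)                   # -> 0.35
```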

