1. Preparing Data
1.1. Missing Data
For many real-world applications of data mining, even when there are huge amounts of data, the subset of cases with complete data may be relatively small. Available samples and also future cases may have values missing. Some of the data mining methods accept missing values and satisfactorily process data to reach a final conclusion. Other methods require that all values be available. An obvious question is whether these missing values can be filled in during data preparation, prior to the application of the data mining methods. The simplest solution for this problem is the reduction of the data set and the elimination of all samples with missing values. That is possible when large data sets are available, and missing values occur only in a small percentage of samples. If we do not drop the samples with missing values, then we have to find values for them. What are the practical solutions?
First, a data miner, together with the domain expert, can manually examine samples that have missing values and enter a reasonable, probable, or expected value, based on domain experience. The method is straightforward for small numbers of missing values and relatively small data sets. But if there is no obvious or plausible value for each case, the miner is introducing noise into the data set by manually generating a value.
The second approach gives an even simpler solution for the elimination of missing values. It is based on a formal, often automatic, replacement of missing values with some constants, such as:
a. replace all missing values with a single global constant (the selection of a global constant is highly application-dependent);
b. replace a missing value with its feature mean;
c. replace a missing value with its feature mean for the given class (this approach is possible only for classification problems where samples are classified in advance).
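To make these three replacement schemes concrete, the following minimal Python sketch applies each of them to a small feature vector, with missing values encoded as None. The data, class labels, and the global constant 0.0 are invented for the illustration and are not taken from the text.

from statistics import mean

# Toy feature with missing values (None) and a class label per sample;
# both are invented purely for illustration
feature = [3.0, None, 7.0, 5.0, None, 4.0]
labels = ['A', 'A', 'B', 'B', 'B', 'A']

present = [v for v in feature if v is not None]

# (a) replace with a single global constant (application-dependent choice)
filled_constant = [v if v is not None else 0.0 for v in feature]

# (b) replace with the feature mean over all available values
filled_mean = [v if v is not None else mean(present) for v in feature]

# (c) replace with the feature mean for the sample's class
#     (possible only when class labels are known in advance)
class_mean = {c: mean([v for v, l in zip(feature, labels)
                       if l == c and v is not None])
              for c in set(labels)}
filled_class_mean = [v if v is not None else class_mean[l]
                     for v, l in zip(feature, labels)]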
These simple solutions are tempting. Their main flaw is that the substituted value is not the correct value. By replacing the missing value with a constant, or by changing the values for a few different features, the data are biased. The replaced value (or values) will homogenize the cases with missing values into a uniform subset directed toward the class with the most missing values (an artificial class). If missing values are replaced with a single global constant for all features, an unknown value may be implicitly made into a positive factor that is not objectively justified.
One possible interpretation of missing values is that they are "don't care" values. In other words, we suppose that these values do not have any influence on the final data mining results. In that case, a sample with a missing value may be extended to a set of artificial samples, where, for each new sample, the missing value is replaced with one of the possible feature values of a given domain. Although this interpretation may look more natural, the problem with this approach is the combinatorial explosion of artificial samples. For example, if one three-dimensional sample X is given as X = {1, ?, 3}, where the second feature's value is missing, the process will generate five artificial samples for the feature domain [0, 1, 2, 3, 4]:
X1 = {1, 0, 3}, X2 = {1, 1, 3}, X3 = {1, 2, 3}, X4 = {1, 3, 3}, and X5 = {1, 4, 3}.
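The expansion itself is mechanical, and a short sketch makes it explicit. The function name and the encoding of a missing value as None are assumptions made for the example:

def expand_dont_care(sample, domain):
    # Replace every missing value (None) with each value from the
    # feature domain, producing the full set of artificial samples
    results = [[]]
    for value in sample:
        options = domain if value is None else [value]
        results = [r + [v] for r in results for v in options]
    return results

# The sample X = {1, ?, 3} over the feature domain [0, 1, 2, 3, 4]
for artificial in expand_dont_care([1, None, 3], [0, 1, 2, 3, 4]):
    print(artificial)   # [1, 0, 3], [1, 1, 3], [1, 2, 3], [1, 3, 3], [1, 4, 3]

With several missing values the number of artificial samples is the product of the domain sizes, which is exactly the combinatorial explosion described above.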
Finally, the data miner can generate a predictive model to predict each of the missing values. For example, if three features A, B, and C are given for each sample, then, based on the samples that have all three values as a training set, the data miner can generate a model of correlation between features. Different techniques, such as regression, Bayesian formalism, clustering, or decision tree induction, may be used depending on data types (all these techniques are explained later). Once you have a trained model, you can present a new sample that has a value missing and generate a predictive value. For example, if values for features A and B are given, the model generates the value for the feature C. If a missing value is highly correlated with the other known features, this process will generate the best value for that feature. Of course, if you can always predict a missing value with certainty, this means that the feature is redundant in the data set and not necessary in further data mining analyses. In real-world applications, you should expect an imperfect correlation between the feature with the missing value and the other features. Therefore, all automatic methods fill in values that may not be correct. Such automatic methods, however, are among the most popular in the data mining community. In comparison to the other methods, they use the most information from the present data to predict missing values.
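As one concrete instance of this idea, the sketch below fits a least-squares linear model of C as a function of A and B on the complete samples and uses it to predict a missing C. The training values are invented, and linear regression stands in for any of the techniques mentioned above:

import numpy as np

# Complete samples form the training set: features A, B and target C
# (the numbers are invented for illustration)
train = np.array([[1.0, 2.0, 2.9],
                  [2.0, 1.0, 4.1],
                  [3.0, 3.0, 6.0],
                  [4.0, 2.0, 7.2]])
X = np.c_[train[:, :2], np.ones(len(train))]   # A, B, and an intercept column
C = train[:, 2]

# Least-squares fit of the linear model C = w1*A + w2*B + w0
coef, *_ = np.linalg.lstsq(X, C, rcond=None)

# New sample with A and B known and C missing
a, b = 2.5, 2.0
predicted_C = np.array([a, b, 1.0]) @ coef
print(f"predicted C = {predicted_C:.2f}")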
In general, it is speculative and often misleading to replace missing values using a simple, artificial schema of data preparation. It is best to generate multiple solutions of data mining with and without the features that have missing values, and then to analyze and interpret them.
1.2. Outlier Analysis
Very often, in large data sets, there exist samples that do not comply with the general behavior of the data model. Such samples, which are significantly different from or inconsistent with the remaining set of data, are called outliers.
The following are examples of outliers:
a. A person's age in the database is -1. The value is obviously not correct, and the error could have been caused by a default setting of the field "unrecorded age" in the computer program.
b. The number of children for one person is 25. This value is unusual and has to be checked; it could be a typographical error.
Many data mining algorithms try to minimize the influence of outliers on the final model, or to eliminate them in the preprocessing phases. The data mining analyst has to be very careful in the automatic elimination of outliers because, if the data are correct, elimination could result in the loss of important hidden information.
Some data mining applications are focused on outlier detection, where it is the essential result of the data analysis. For example, when detecting fraudulent credit card transactions in a bank, the outliers are the typical examples that may indicate fraudulent activity, and the entire data mining process is concentrated on their detection. But in most other data mining applications, especially if they are supported by large data sets, outliers are not very useful.
Outlier detection and potential removal from a data set can be described as a process of selecting k out of n samples that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data.
The problem of defining outliers is nontrivial, especially in multidimensional samples. Data visualization methods that are useful in outlier detection for one to three dimensions are weaker in multidimensional data because of a lack of adequate visualization methodologies for these spaces. An example of a visualization of two-dimensional samples and a visual detection of outliers is given in Figure 1.1.
1.2.1. Simplest Approach
The simplest approach to outlier detection for one-dimensional samples is based on statistics. Assuming that the distribution of values is given, it is necessary to find basic statistical parameters such as mean value and variance. Based on these values and the expected (or predicted) number of outliers, it is possible to establish the threshold value as a function of variance. All samples out of the threshold value are candidates for outliers. The main problem with this simple methodology is the a priori assumption about the data distribution. In most real-world examples the data distribution may not be known.
For example, if the given data set represents the feature Age with twenty different values:
Age = {3, 56, 23, 39, 156, 41, 22, 9, 28, 139, 31, 55, 20, -67, 37, 11, 55, 45, 37}
then the corresponding statistical parameters are:
Mean = 39.9
Standard deviation = 45.65
If we select the threshold value for a normal distribution of data as:
Threshold = Mean ± 2 × Standard deviation
then all data that are out of the range [-51.4, 131.2] will be potential outliers. Additional knowledge of the characteristics of the feature (Age is always greater than zero) may further reduce the range to [0, 131.2]. In our example there are three values that are outliers based on the given criteria: 156, 139, and -67. With high probability we can conclude that all three of them are typing errors (data entered with additional digits or an additional "-" sign).
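The whole procedure fits in a few lines. The sketch below recomputes the statistics for the listed Age values, using the population standard deviation, and applies the Mean ± 2 × Standard deviation threshold together with the domain constraint Age > 0; the computed figures are approximate:

from statistics import mean, pstdev

age = [3, 56, 23, 39, 156, 41, 22, 9, 28, 139,
       31, 55, 20, -67, 37, 11, 55, 45, 37]

m, sd = mean(age), pstdev(age)        # mean and population standard deviation
low, high = m - 2 * sd, m + 2 * sd    # Threshold = Mean ± 2 × Standard deviation
low = max(low, 0)                     # domain knowledge: Age > 0

outliers = [v for v in age if v < low or v > high]
print(outliers)                       # -> [156, 139, -67]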
1.2.2. Distance-based outlier detection
Is a second method that eliminates some oI the limitations imposed by the statistical approach.
The most important diIIerence is that this method is applicable to multidimensional samples
while statistical descriptors analyze only a single dimension, or several dimensions, but
separately. The basic computational complexity oI this method is the evaluation oI distance
measures between all samples in an n-dimensional data set. Then, a sample si in a data set S is an
outlier iI at least Iraction p oI the samples in S lies at a distance greater than d. In other words,
distance based outliers are those samples which do not have enough neighbors, where neighbors
are deIined through the multidimensional two parameters, p and d, which may be given in
advanceusing knowledge about the data, or which may be changed during the iterations (trial and
error approach) to select the most representative outliers.
To illustrate the approach, we can analyze a set of two-dimensional samples S, where the requirements for outliers are the threshold values p ≥ 4 and d ≥ 3:
S = {s1, s2, s3, s4, s5, s6, s7} = {(2,4), (3,2), (1,1), (4,3), (1,6), (5,3), (4,2)}
The table of Euclidean distances, d = [(x1 - x2)^2 + (y1 - y2)^2]^(1/2), between these samples can then be computed.
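A direct implementation of this definition for the set S is sketched below. For each sample it counts how many of the other samples lie farther away than d, and flags those with at least p such distant samples; with these thresholds, s3 and s5 are reported as outliers.

import math

# The two-dimensional samples and thresholds from the text
S = [(2, 4), (3, 2), (1, 1), (4, 3), (1, 6), (5, 3), (4, 2)]
p, d = 4, 3

def euclidean(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

for i, s in enumerate(S, start=1):
    # number of other samples at a distance greater than d
    far = sum(1 for t in S if t is not s and euclidean(s, t) > d)
    if far >= p:
        print(f"s{i} = {s} is an outlier ({far} samples farther than d = {d})")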