Lecture 3

NOISY DATA

● MISSING DATA or WRONG DATA

● NOISE in the measurement

Missing Data
• Data is not always available

• E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data

• Missing data may be due to
– equipment malfunction
– values that were inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data not being considered important at the time of entry
– history or changes of the data not being registered

• Missing data may need to be inferred.

• Missing values may carry some information content: e.g., a credit application may carry information by noting which field the applicant did not complete
Missing Values
• There are always MVs in a real dataset

• MVs may have an impact on modelling, in fact, they can destroy it!

• Some tools ignore missing values, others use some metric to fill in
replacements

• The modeller should avoid default automated replacement techniques
– It is difficult to know their limitations, problems and introduced bias

• Replacing missing values without capturing that information elsewhere removes information from the dataset

How to Handle Missing Data?

• Ignore records (use only cases with all values)

– Usually done when the class label is missing, as most prediction methods do not handle missing data well
– Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased samples

• Ignore attributes with missing values

– Use only features (attributes) with all values (may leave out important features)

• Fill in the missing value manually

– Tedious, and possibly infeasible
How to Handle Missing Data?

• Use a global constant to fill in the missing value

– e.g., “unknown” (this may create a new class!)

• Use the attribute mean to fill in the missing value

– This does the least harm to the mean of the existing data, so it is suitable if the mean is to remain unbiased
– But what if the standard deviation is to be unbiased? Filling in the mean shrinks the spread of the attribute

• Use the attribute mean for all samples belonging to the same
class to fill in the missing value
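
The two mean-based strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names, the `income` values and the class labels are made up for the example, missing values are assumed to be encoded as `None`, and every class is assumed to have at least one observed value.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_class_mean(values, labels):
    """Replace None entries with the mean of the observed values of the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical data: income (in thousands) with two missing entries.
income = [30.0, None, 50.0, 100.0, None, 120.0]
label = ["low", "low", "low", "high", "high", "high"]

observed = [v for v in income if v is not None]
global_filled = impute_mean(income)
class_filled = impute_class_mean(income, label)

print(global_filled)   # both gaps receive the overall mean
print(class_filled)    # each gap receives the mean of its own class
print(mean(observed), mean(global_filled))      # the mean is preserved
print(pstdev(observed), pstdev(global_filled))  # the spread shrinks
```

The last two lines illustrate the question raised on the slide: mean filling keeps the mean unchanged but reduces the standard deviation, because the filled values sit exactly at the centre.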
How to Handle Missing Data?

• Use the most probable value to fill in the missing value

– Inference-based, such as a Bayesian formula or a decision tree

– Identify relationships among variables: linear regression, multiple linear regression, nonlinear regression

– Nearest-neighbour estimator: find the k neighbours nearest to the point and fill in the most frequent value or the average value. Finding neighbours in a large dataset may be slow
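
A nearest-neighbour estimator can be sketched as follows. This is only an illustration under several assumptions that are not in the slides: records are tuples, missing values are `None`, only the `target` column may be missing, all other attributes are numeric, and distance is Euclidean. The function name `knn_impute` and the sample records are hypothetical.

```python
import math
from collections import Counter

def knn_impute(records, target, k=3):
    """Fill None at index `target` using the k nearest complete records."""
    complete = [r for r in records if r[target] is not None]

    def distance(a, b):
        # Euclidean distance over every attribute except the target one.
        return math.sqrt(sum((a[i] - b[i]) ** 2
                             for i in range(len(a)) if i != target))

    filled = []
    for r in records:
        if r[target] is not None:
            filled.append(list(r))
            continue
        # Brute-force search: every complete record is scanned and sorted.
        neighbours = sorted(complete, key=lambda c: distance(r, c))[:k]
        values = [n[target] for n in neighbours]
        if all(isinstance(v, (int, float)) for v in values):
            value = sum(values) / len(values)            # average value
        else:
            value = Counter(values).most_common(1)[0][0]  # most frequent value
        new = list(r)
        new[target] = value
        filled.append(new)
    return filled

# Hypothetical records: (attribute 1, attribute 2, attribute with a gap).
records = [
    (1.0, 1.0, 10.0),
    (1.2, 0.9, 12.0),
    (0.9, 1.1, 11.0),
    (8.0, 8.0, 100.0),
    (1.1, 1.0, None),
]
filled = knn_impute(records, target=2, k=3)
print(filled[-1])
```

The brute-force scan compares each incomplete record with every complete one, which is why the slide warns that finding neighbours in a large dataset may be slow.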

[Figure: Nearest-Neighbour]

How to Handle Missing Data?

• Note that it is as important to avoid adding bias and distortion to the data as it is to make the information available.

– Bias is added when a wrong value is filled in

• No matter what technique you use to conquer the problem, it comes at a price. The more guessing you have to do, the further away from the real data the database becomes. This, in turn, can affect the accuracy and validity of the mining results.

INCORRECT DATA
• Incorrect data is inconsistent data, e.g., a negative number for age
• It can be treated as a missing value
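
As a small sketch of treating incorrect values as missing, the snippet below marks out-of-range values as `None` so that any of the missing-data techniques can then be applied. The function name and the plausible age range of 0 to 120 are assumptions chosen for illustration.

```python
def invalidate_out_of_range(values, low, high):
    """Return a copy in which values outside [low, high] become None (missing)."""
    return [v if v is not None and low <= v <= high else None for v in values]

# Hypothetical ages; -5 and 230 are inconsistent with any real age.
ages = [34, -5, 41, 230, 27]
cleaned_ages = invalidate_out_of_range(ages, low=0, high=120)
print(cleaned_ages)  # [34, None, 41, None, 27]
```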
NOISE
Noise can be
• At attribute level
– random error
– outlier
• At record level
– outlier
Noise at attribute level
• Random error added to the measurement.
• Random error will have 0 mean and some
small variance.

• If the error does not have zero mean, the offset is called the bias in the measurement.
– Also called systematic error.
• Temporal data: stock data, sensor data indexed by time
• Spatial data: images

• Model based

• Generic Data
Temporal: Average Filter
Gaussian Filter
Example: Gaussian filter in image data
Generic
Model based: Quadratic regression
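
The average filter and the Gaussian filter named above can be sketched for a one-dimensional time series as follows. The slides show the Gaussian filter on image data; applying it to a series here is an assumption made to keep the example short. The window size, `sigma`, `radius` and the edge handling (shrinking the window at the borders) are also illustrative choices.

```python
import math

def moving_average(series, window=3):
    """Centred moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def gaussian_filter(series, sigma=1.0, radius=2):
    """Weighted average with Gaussian weights, renormalised at the edges."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(d * d) / (2 * sigma * sigma)) for d in offsets]
    out = []
    for i in range(len(series)):
        num = 0.0
        den = 0.0
        for d, w in zip(offsets, weights):
            j = i + d
            if 0 <= j < len(series):
                num += w * series[j]
                den += w
        out.append(num / den)
    return out

# Hypothetical sensor readings with one noisy spike.
noisy = [10.0, 10.0, 10.0, 16.0, 10.0, 10.0, 10.0]
avg = moving_average(noisy, window=3)
gauss = gaussian_filter(noisy, sigma=1.0, radius=2)
print(avg)
print(gauss)
```

Both filters pull the spike towards its neighbours; the Gaussian filter gives nearby points more weight than distant ones, whereas the average filter weights every point in the window equally.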
OUTLIERS

Outliers
• Outliers are values thought to be out of range.
• “An outlier is an observation that deviates so much from
other observations as to arouse suspicion that it was
generated by a different mechanism”

• Can be detected by standardizing observations and labelling the standardized values outside a predetermined bound as outliers
• Outlier detection can be used for fraud detection or data
cleaning

• Approaches:
• do nothing
• enforce upper and lower bounds
• let binning handle the problem
Outlier detection
• Univariate
• Compute the mean x̄ and the standard deviation s. For k = 2 or 3, x is an outlier if it lies outside the limits (normal distribution assumed)

(x̄ − ks, x̄ + ks)
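
A minimal sketch of this rule, assuming the sample standard deviation is used for s; the function name and the data are made up for illustration.

```python
from statistics import mean, stdev

def zscore_outliers(values, k=3.0):
    """Return the values lying outside (mean - k*s, mean + k*s)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    return [x for x in values if abs(x - m) > k * s]

# Hypothetical measurements with one suspicious value.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(zscore_outliers(data, k=2))  # [50]
print(zscore_outliers(data, k=3))  # []
```

With k = 3 nothing is flagged in this small sample, because the extreme value itself inflates s. This is one reason the choice of k, and the normality assumption, matter.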

Outlier detection
• Univariate
• Boxplot: an observation is an extreme outlier if it lies outside (Q1 − 3×IQR, Q3 + 3×IQR), where IQR = Q3 − Q1 is the interquartile range, and is declared a mild outlier if it lies outside the interval (Q1 − 1.5×IQR, Q3 + 1.5×IQR).
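
The boxplot rule can be sketched as below. Quartile definitions differ between tools; this example assumes the "inclusive" method of Python's `statistics.quantiles`, and the data are hypothetical.

```python
from statistics import quantiles

def boxplot_outliers(values):
    """Split values into mild and extreme outliers using the IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
            extreme.append(x)   # beyond the outer fences
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            mild.append(x)      # between the inner and outer fences
    return mild, extreme

# Hypothetical data: Q1 = 3.5, Q3 = 8.5, IQR = 5.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 40]
mild, extreme = boxplot_outliers(data)
print(mild, extreme)  # [18] [40]
```

Unlike the mean and standard deviation rule, the quartiles are barely affected by the extreme values themselves, so the fences stay stable.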

http://www.physics.csbsju.edu/stats/box2.html

[Figure: boxplot with the regions > 1.5 L and > 3 L marked]

Outlier detection
• Multivariate

• Clustering
• Very small clusters are outliers
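
The slides do not name a clustering algorithm, so the sketch below uses a deliberately simple one as an assumption: points are linked when they lie within a distance `eps` of each other, clusters are the resulting connected components, and clusters with fewer than `min_size` members are reported as outliers. The function name, parameters and points are hypothetical.

```python
import math

def small_cluster_outliers(points, eps, min_size):
    """Return the points that belong to clusters with fewer than min_size members."""
    n = len(points)
    labels = [None] * n
    clusters = []
    for start in range(n):
        if labels[start] is not None:
            continue
        cid = len(clusters)
        labels[start] = cid
        members = [start]
        stack = [start]
        # Grow the cluster by following links of length <= eps.
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and math.dist(points[i], points[j]) <= eps:
                    labels[j] = cid
                    members.append(j)
                    stack.append(j)
        clusters.append(members)
    return [points[i] for members in clusters
            if len(members) < min_size for i in members]

# Two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 30)]
cluster_outliers = small_cluster_outliers(points, eps=1.5, min_size=2)
print(cluster_outliers)  # [(5, 30)]
```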

http://www.ibm.com/developerworks/data/library/techarticle/dm-0811wurst/
Outlier detection
• Multivariate

• Distance based
• An instance with very few neighbors within D is regarded
as an outlier

k-NN algorithm
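
A distance-based check can be sketched as follows. The slide says "very few neighbors" without giving a number, so the threshold `min_neighbours` is an assumption, as are the function name and the sample points; distance is taken to be Euclidean.

```python
import math

def distance_outliers(points, D, min_neighbours):
    """Return points with fewer than min_neighbours other points within distance D."""
    outliers = []
    for i, p in enumerate(points):
        count = sum(1 for j, q in enumerate(points)
                    if j != i and math.dist(p, q) <= D)
        if count < min_neighbours:
            outliers.append(p)
    return outliers

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 30)]
isolated = distance_outliers(points, D=2, min_neighbours=2)
print(isolated)  # [(5, 30)]
```

This brute-force version compares every pair of points, so its cost grows quadratically with the number of instances.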

Concept (model) based outlier: a two-dimensional outlier that is not an outlier in either of its projections. However, the linear relation between the attributes reveals that the red dot is an outlier.
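
One way to sketch a model-based check is to fit a straight line by least squares and flag points whose residual is large. The threshold of k = 2 residual standard deviations, the function names and the data are assumptions for illustration; the slides only state that the linear relation exposes the outlier.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def residual_outliers(xs, ys, k=2.0):
    """Return points whose residual from the fitted line exceeds k standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    s = stdev(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * s]

# Ten points on the line y = x plus one point, (3, 8), that breaks the relation.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3]
ys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 8]
model_outliers = residual_outliers(xs, ys, k=2.0)
print(model_outliers)  # [(3, 8)]
```

The point (3, 8) is unremarkable in either projection, since both 3 and 8 lie well inside the range of the other values, yet it stands out once the relation between the two attributes is modelled.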
