Lecture 3
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing values (MVs) can affect modelling; in fact, they can ruin it!
• Some tools ignore missing values, others use some metric to fill in
replacements
How to Handle Missing Data?
• Fill in the missing values manually: tedious + infeasible?
How to Handle Missing Data?
• Use the attribute mean for all samples belonging to the same
class to fill in the missing value
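A minimal sketch of class-conditional mean imputation, assuming each record is a dict and a missing value is stored as None. The function name `impute_class_mean` and the `income`/`segment` fields are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

def impute_class_mean(rows, attr, label):
    """Fill missing (None) values of `attr` with the mean of that
    attribute over the rows that share the same `label` value."""
    by_class = defaultdict(list)
    for row in rows:
        if row[attr] is not None:
            by_class[row[label]].append(row[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    # If a class has no observed value at all, the gap stays None.
    return [
        {**row, attr: class_means.get(row[label])} if row[attr] is None else dict(row)
        for row in rows
    ]

# Hypothetical sales data: income is missing for two customers.
customers = [
    {"income": 30.0, "segment": "A"},
    {"income": 50.0, "segment": "A"},
    {"income": None, "segment": "A"},
    {"income": 90.0, "segment": "B"},
    {"income": None, "segment": "B"},
]
filled = impute_class_mean(customers, "income", "segment")
```

The input list is left untouched, so the imputed copy can be compared against the original records.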
How to Handle Missing Data?
• Nearest-Neighbour estimator
• Find the k neighbours nearest to the point and fill in the
most frequent value or the average value
• Finding neighbours in a large dataset may be slow
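The nearest-neighbour estimator can be sketched as follows, assuming numeric features, Euclidean distance, and None as the missing-value marker. The function `knn_impute` and the sample fields are hypothetical. The search is brute force, which is exactly why it may be slow on a large dataset: every missing value triggers a scan over all complete records.

```python
import math
from collections import Counter

def knn_impute(records, target, features, k=3, categorical=False):
    """Fill missing `target` values (None) from the k nearest complete records."""
    complete = [r for r in records if r[target] is not None]
    result = []
    for r in records:
        if r[target] is not None:
            result.append(dict(r))
            continue
        # Brute force: rank all complete records by Euclidean distance.
        nearest = sorted(
            complete,
            key=lambda c: math.dist([r[f] for f in features],
                                    [c[f] for f in features]),
        )[:k]
        values = [n[target] for n in nearest]
        if categorical:
            fill = Counter(values).most_common(1)[0][0]  # most frequent value
        else:
            fill = sum(values) / len(values)             # average value
        result.append({**r, target: fill})
    return result

# Hypothetical records: the last one has no recorded income.
people = [
    {"age": 25, "years": 2, "income": 30.0},
    {"age": 27, "years": 3, "income": 34.0},
    {"age": 26, "years": 2, "income": 32.0},
    {"age": 55, "years": 30, "income": 90.0},
    {"age": 26, "years": 3, "income": None},
]
imputed = knn_impute(people, "income", ["age", "years"], k=3)
```

In practice the features would usually be scaled first, since an attribute with a large range otherwise dominates the distance.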
Nearest-Neighbour
INCORRECT DATA
• Incorrect data is inconsistent data, such as a negative number for age.
• It can be treated as a missing value.
NOISE
Noise can be
• At attribute level
– random error
– outlier
• At record level
– outlier
Noise at attribute level
• Random error added to the measurement.
• The random error is assumed to have zero mean and
small variance.
• Model based
• Generic data
Generic
• Temporal: average filter, Gaussian filter
• Example: Gaussian filter in image data
Model based
• Quadratic regression
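The two generic filters can be sketched for a one-dimensional temporal signal as below. This is an illustration under assumptions: the window width, `sigma`, `radius`, and the edge handling (shrinking or renormalising the window) are choices made here, not prescribed by the slides. The model-based approach is not covered by this sketch.

```python
import math

def average_filter(signal, width=3):
    """Replace each sample by the mean of a centred window (shrunk at the edges)."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def gaussian_filter(signal, sigma=1.0, radius=3):
    """Weighted average with Gaussian weights, renormalised at the edges."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(d * d) / (2 * sigma * sigma)) for d in offsets]
    out = []
    for i in range(len(signal)):
        num = den = 0.0
        for d, w in zip(offsets, weights):
            j = i + d
            if 0 <= j < len(signal):
                num += w * signal[j]
                den += w
        out.append(num / den)
    return out

# Hypothetical noisy measurements with one spike at index 3.
noisy = [1.0, 1.2, 0.9, 5.0, 1.1, 0.8, 1.0]
smooth_avg = average_filter(noisy)
smooth_gauss = gaussian_filter(noisy, sigma=1.0)
```

Both filters pull the spike towards its neighbours; the Gaussian filter gives nearby samples more weight than distant ones, whereas the average filter weights the whole window equally.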
OUTLIERS
Outliers
• Outliers are values thought to be out of range.
• “An outlier is an observation that deviates so much from
other observations as to arouse suspicion that it was
generated by a different mechanism”
• Approaches:
• do nothing
• enforce upper and lower bounds
• let binning handle the problem
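Enforcing upper and lower bounds amounts to clamping each value into an allowed interval. A minimal sketch, where the bounds 0 and 120 for age are an assumption made for the example:

```python
def enforce_bounds(values, lower, upper):
    """Clamp every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical ages containing two out-of-range values.
ages = [23, 35, -4, 41, 230]
bounded = enforce_bounds(ages, lower=0, upper=120)
```

Clamping keeps the record but changes its value, so it hides the fact that the original measurement was suspicious.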
Outlier detection
• Univariate
• Compute the mean x̄ and standard deviation s. For k = 2 or 3, x is an outlier
if it lies outside the limits (normal distribution assumed)
(x̄ − ks, x̄ + ks)
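The mean and standard deviation rule can be sketched as below, assuming the sample standard deviation is used. The function name `std_outliers` and the data are hypothetical.

```python
from statistics import mean, stdev

def std_outliers(values, k=2):
    """Return the values lying outside (mean - k*s, mean + k*s)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    lower, upper = m - k * s, m + k * s
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
flagged_k2 = std_outliers(data, k=2)
flagged_k3 = std_outliers(data, k=3)
```

With this data the value 50 is flagged for k = 2 but not for k = 3, because the outlier itself inflates both the mean and the standard deviation.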
Outlier detection
• Univariate
• Boxplot: An observation is an extreme outlier if it lies
outside (Q1-3×IQR, Q3+3×IQR), where IQR=Q3-Q1
(IQR = Inter Quartile Range)
[Figure: boxplot annotated with "> 1.5 L" and "> 3 L". Source: http://www.physics.csbsju.edu/stats/box2.html]
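The boxplot rule can be sketched with the standard library. Quartile conventions differ between tools; this sketch assumes the default "exclusive" method of `statistics.quantiles`, so other software may give slightly different limits. The function name and data are hypothetical.

```python
from statistics import quantiles

def boxplot_outliers(values, factor=3.0):
    """Return values outside (Q1 - factor*IQR, Q3 + factor*IQR)."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles; the median is not needed
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements: 17 is mildly unusual, 50 is far out.
values = [10, 12, 11, 13, 17, 12, 11, 10, 12, 11, 50]
extreme = boxplot_outliers(values, factor=3.0)
suspected = boxplot_outliers(values, factor=1.5)
```

Here Q1 = 11 and Q3 = 13, so IQR = 2: the factor 3 limits are (5, 19) and the factor 1.5 limits are (8, 16).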
Outlier detection
• Multivariate
• Clustering
• Very small clusters are outliers
http://www.ibm.com/developerworks/data/library/techarticle/dm-0811wurst/
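To illustrate the idea that very small clusters are outliers, the sketch below uses a deliberately simple clustering: points are linked when they are at most `radius` apart, and each connected group is a cluster. This is a stand-in chosen for brevity, not the method used in the linked article; `radius` and `min_size` are assumed parameters.

```python
import math

def threshold_clusters(points, radius):
    """Group point indices into connected components: two points are
    linked when their Euclidean distance is at most `radius`."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= radius]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return clusters

def small_cluster_outliers(points, radius, min_size):
    """Indices of points that belong to clusters smaller than `min_size`."""
    return sorted(i for c in threshold_clusters(points, radius)
                  if len(c) < min_size for i in c)

# Hypothetical 2-D data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 20)]
outlier_idx = small_cluster_outliers(points, radius=1.5, min_size=2)
```

The isolated point at index 7 forms a cluster of size one and is therefore reported.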
Outlier detection
• Multivariate
• Distance based
• An instance with very few neighbours within distance D is regarded
as an outlier
• k-NN algorithm
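A minimal sketch of the distance-based rule, assuming Euclidean distance and a brute-force neighbour count. The threshold `min_neighbours` (what counts as "very few") is an assumption; the slides do not give a value.

```python
import math

def distance_outliers(points, D, min_neighbours):
    """Return the points that have fewer than `min_neighbours`
    other points within distance D."""
    outliers = []
    for i, p in enumerate(points):
        count = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= D
        )
        if count < min_neighbours:
            outliers.append(p)
    return outliers

# Hypothetical 2-D instances: four close together, one far away.
instances = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (6.0, 6.0)]
isolated = distance_outliers(instances, D=1.0, min_neighbours=2)
```

The nested scan compares every pair of instances, so the cost grows quadratically with the number of instances.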
Concept (model) based outlier: a two-dimensional outlier that is not an
outlier in either of its one-dimensional projections. However, the linear
relation between the attributes shows that the red dot is an outlier.
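The model-based idea can be sketched by fitting a least-squares line and flagging points whose residual is large. The data below is hypothetical and stands in for the figure: the point (3, 8) plays the role of the red dot, and the threshold of k = 2 residual standard deviations is an assumption.

```python
from statistics import mean, stdev

def regression_outliers(xs, ys, k=2.0):
    """Fit y = slope*x + intercept by least squares and return the points
    whose residual exceeds k sample standard deviations of the residuals."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    s = stdev(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * s]

def within_k_std(values, v, k=2.0):
    """Univariate check: is v inside (mean - k*s, mean + k*s)?"""
    m, s = mean(values), stdev(values)
    return m - k * s < v < m + k * s

# Ten points on the line y = x plus one point that breaks the relation.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3]
ys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 8]
model_outliers = regression_outliers(xs, ys)
```

Each coordinate of (3, 8) passes the univariate check on its own attribute, yet the point is flagged once the linear relation between the attributes is taken into account.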