Lecture 3

Uploaded by

srinutirumanisetti

NOISY DATA

● MISSING DATA or WRONG DATA

● NOISE in the measurement

Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

• Missing data may be due to
– equipment malfunction
– values that were inconsistent with other recorded data and were therefore deleted
– data not entered because of a misunderstanding
– data not considered important at the time of entry
– failure to record the history of the data or changes to it

• Missing data may need to be inferred.

• Missing values may carry information in themselves: e.g., which fields an applicant left blank on a credit application can be informative.
Missing Values
• Real datasets almost always contain missing values (MVs)

• MVs can affect modelling; in the worst case they can invalidate the model

• Some tools ignore missing values; others fill in replacements using some metric

• The modeller should avoid default automated replacement techniques
– It is difficult to know their limitations, the problems they cause, and the bias they introduce

• Replacing missing values without recording elsewhere that they were missing removes information from the dataset
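The last point above can be illustrated with a minimal sketch that fills in a replacement while keeping a separate flag for each value that was originally missing. The function name `fill_with_indicator`, the `income` data, and the fill value are illustrative assumptions, not part of the lecture.

```python
from typing import Optional, Sequence


def fill_with_indicator(
    values: Sequence[Optional[float]], fill_value: float
) -> tuple[list[float], list[bool]]:
    """Replace None with fill_value and record which entries were missing."""
    was_missing = [v is None for v in values]
    filled = [fill_value if v is None else v for v in values]
    return filled, was_missing


# Hypothetical attribute with two missing entries.
income = [42000.0, None, 58000.0, None]
filled_income, income_missing = fill_with_indicator(income, fill_value=0.0)
```

Keeping `income_missing` as an extra attribute lets a model still use the fact that a value was absent.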
How to Handle Missing Data?

• Ignore records (use only cases with all values)
– Usually done when the class label is missing, as most prediction methods do not handle missing data well
– Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased samples

• Ignore attributes with missing values
– Use only features (attributes) with all values (may leave out important features)

• Fill in the missing value manually
– Tedious, and possibly infeasible
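The two "ignore" strategies above might look like this in a minimal sketch. It assumes records are dictionaries that share the same keys and that a missing value is stored as `None`; the helper names and data are hypothetical.

```python
from typing import Optional

Row = dict[str, Optional[float]]


def drop_incomplete_records(rows: list[Row]) -> list[Row]:
    """Keep only the records in which every attribute has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_incomplete_attributes(rows: list[Row]) -> list[Row]:
    """Keep only the attributes that have a value in every record."""
    complete = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in complete} for r in rows]


rows = [
    {"age": 34.0, "income": 42000.0},
    {"age": 51.0, "income": None},
    {"age": 29.0, "income": 58000.0},
]
complete_rows = drop_incomplete_records(rows)      # loses one record
complete_cols = drop_incomplete_attributes(rows)   # loses the income attribute
```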
How to Handle Missing Data?

• Use a global constant to fill in the missing value
– e.g., “unknown” (may create a new class!)

• Use the attribute mean to fill in the missing value
– This does the least harm to the mean of the existing data, if the mean is what must remain unbiased
– What if the standard deviation must remain unbiased? Filling in the mean reduces the spread of the attribute.

• Use the attribute mean of all samples belonging to the same class to fill in the missing value
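A small sketch of the two mean-based options, assuming missing values are stored as `None`. The names `impute_mean` and `impute_class_mean` and the data are hypothetical. It also makes it possible to check that the global mean is preserved while the standard deviation shrinks.

```python
from statistics import mean, pstdev
from typing import Optional


def impute_mean(values: list[Optional[float]]) -> list[float]:
    """Fill each None with the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]


def impute_class_mean(
    values: list[Optional[float]], labels: list[str]
) -> list[float]:
    """Fill each None with the mean of the observed values in the same class.

    Assumes every class has at least one observed value.
    """
    class_means = {
        c: mean(v for v, lab in zip(values, labels) if lab == c and v is not None)
        for c in set(labels)
    }
    return [class_means[lab] if v is None else v for v, lab in zip(values, labels)]


values = [10.0, 20.0, None, 100.0, 120.0, None]
labels = ["a", "a", "a", "b", "b", "b"]

observed = [v for v in values if v is not None]
global_fill = impute_mean(values)
class_fill = impute_class_mean(values, labels)

# The mean is unchanged by global mean imputation, but the spread is smaller.
mean_preserved = mean(global_fill) == mean(observed)
spread_reduced = pstdev(global_fill) < pstdev(observed)
```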
How to Handle Missing Data?

• Use the most probable value to fill in the missing value
– Inference-based, e.g., a Bayesian formula or a decision tree

• Identify relationships among variables
– Linear regression, multiple linear regression, nonlinear regression

• Nearest-neighbour estimator
– Find the k neighbours nearest to the point and fill in the most frequent value or the average value
– Finding neighbours in a large dataset may be slow
Nearest-Neighbour

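A minimal sketch of a nearest-neighbour estimator, assuming numeric attributes compared with Euclidean distance and a brute-force search. The function name `knn_impute`, its parameters, and the data are illustrative assumptions.

```python
import math
from collections import Counter
from statistics import mean


def knn_impute(query, complete_rows, targets, k=3, categorical=False):
    """Estimate a missing target for `query` from its k nearest complete rows.

    query         -- the known attribute values of the record to fill in
    complete_rows -- known attribute values of records whose target is present
    targets       -- the target attribute value of each complete row
    """
    # Brute-force scan: every complete row is compared with the query, which
    # is why neighbour search can be slow on large datasets.
    by_distance = sorted(
        range(len(complete_rows)),
        key=lambda i: math.dist(query, complete_rows[i]),
    )
    nearest = [targets[i] for i in by_distance[:k]]
    if categorical:
        return Counter(nearest).most_common(1)[0][0]  # most frequent value
    return mean(nearest)  # average value


rows = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9)]
income = [30.0, 32.0, 34.0, 90.0, 95.0]
segment = ["low", "low", "low", "high", "high"]

estimated_income = knn_impute((1.1, 1.0), rows, income, k=3)
estimated_segment = knn_impute((8.1, 8.0), rows, segment, k=3, categorical=True)
```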
How to Handle Missing Data?

• It is as important to avoid adding bias and distortion to the data as it is to make the information available.
– Bias is added when a wrong value is filled in

• Whatever technique you use to deal with the problem, it comes at a price. The more guessing you have to do, the further the database drifts from the real data. This, in turn, can affect the accuracy and validity of the mining results.
INCORRECT DATA
• Incorrect data is inconsistent data, e.g., a negative number for age
• It can be treated as a missing value
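As a sketch of treating incorrect data as missing, implausible values can be replaced with `None` so that the missing-data techniques apply to them. The range limits and the name `invalid_to_missing` are assumptions chosen for illustration.

```python
from typing import Optional


def invalid_to_missing(
    values: list[Optional[float]], low: float, high: float
) -> list[Optional[float]]:
    """Replace values outside the plausible range [low, high] with None."""
    return [
        v if v is not None and low <= v <= high else None
        for v in values
    ]


ages = [34.0, -5.0, 51.0, 430.0, None]
cleaned_ages = invalid_to_missing(ages, low=0.0, high=120.0)
```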
NOISE
Noise can be
• At attribute level
– random error
– outlier
• At record level
– outlier
Noise at attribute level
• A random error is added to the measurement.
• The random error has zero mean and some small variance.

• If the error does not have zero mean, its mean is called the bias in the measurement.
– Also called systematic error.
• Temporal data: stock data, sensor data indexed by time
• Spatial data: images

• Model based

• Generic data
Temporal: Average Filter
Gaussian Filter
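A rough sketch of the two filters applied to a one-dimensional time series. The window size, `sigma`, `radius`, and the edge handling (shrinking the window at the boundaries) are assumptions, not details from the slides.

```python
import math


def average_filter(series: list[float], window: int = 3) -> list[float]:
    """Replace each point with the mean of a centred window (shrunk at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed


def gaussian_filter(
    series: list[float], sigma: float = 1.0, radius: int = 2
) -> list[float]:
    """Weighted average whose weights fall off with distance as a Gaussian."""
    smoothed = []
    for i in range(len(series)):
        total = weight_sum = 0.0
        for j in range(max(0, i - radius), min(len(series), i + radius + 1)):
            w = math.exp(-((j - i) ** 2) / (2 * sigma ** 2))
            total += w * series[j]
            weight_sum += w
        smoothed.append(total / weight_sum)  # renormalise, also at the edges
    return smoothed


readings = [10.0, 10.0, 10.0, 16.0, 10.0, 10.0, 10.0]  # one noisy spike
avg = average_filter(readings, window=3)
gauss = gaussian_filter(readings, sigma=1.0, radius=2)
```

Both filters pull the spike towards its neighbours. The Gaussian filter gives nearby points more weight than distant ones, whereas the average filter weights the whole window equally.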
Example: Gaussian filter in image data
Generic
Model based
Quadratic regression
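One way to sketch model-based smoothing with quadratic regression is to fit y = a·x² + b·x + c by least squares and replace each noisy value with the fitted one. The version below solves the normal equations directly and assumes at least three distinct x values. The names `fit_quadratic` and `smooth` are hypothetical.

```python
def fit_quadratic(xs: list[float], ys: list[float]) -> tuple[float, float, float]:
    """Least-squares fit of y = a*x**2 + b*x + c via the normal equations."""
    # Power sums S[p] = sum(x**p) and moments T[p] = sum(y * x**p).
    S = [sum(x ** p for x in xs) for p in range(5)]
    T = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]
    # Augmented 3x3 system for the unknowns (a, b, c).
    m = [
        [S[4], S[3], S[2], T[2]],
        [S[3], S[2], S[1], T[1]],
        [S[2], S[1], S[0], T[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


def smooth(xs: list[float], ys: list[float]) -> list[float]:
    """Replace each noisy y with the value predicted by the fitted curve."""
    a, b, c = fit_quadratic(xs, ys)
    return [a * x * x + b * x + c for x in xs]


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 7.0, 13.0, 21.0]  # exactly y = x**2 + x + 1
a, b, c = fit_quadratic(xs, ys)
```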
OUTLIERS

Outliers
• Outliers are values thought to be out of range.
• “An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism”

• Outliers can be detected by standardizing the observations and labelling the standardized values outside a predetermined bound as outliers
• Outlier detection can be used for fraud detection or data cleaning

• Approaches:
– do nothing
– enforce upper and lower bounds
– let binning handle the problem
Outlier detection
• Univariate
– Compute the mean x̄ and the standard deviation s. For k = 2 or 3, x is an outlier if it lies outside the limits below (normal distribution assumed)

(x̄ − ks, x̄ + ks)
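The rule above might be sketched as follows, using the sample standard deviation. The function name `zscore_outliers` and the data are illustrative assumptions.

```python
from statistics import mean, stdev


def zscore_outliers(values: list[float], k: float = 3.0) -> list[float]:
    """Return the values that lie outside (mean - k*s, mean + k*s)."""
    m, s = mean(values), stdev(values)
    return [x for x in values if abs(x - m) > k * s]


data = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 10.0, 40.0]
flagged_k2 = zscore_outliers(data, k=2.0)
flagged_k3 = zscore_outliers(data, k=3.0)
```

In this small sample the value 40 is flagged with k = 2 but not with k = 3, because the outlier itself inflates both the mean and s. This is one reason the choice of k matters for small datasets.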
Outlier detection
• Univariate
– Boxplot: an observation is an extreme outlier if it lies outside the interval (Q1 − 3×IQR, Q3 + 3×IQR), where IQR = Q3 − Q1 is the interquartile range, and a mild outlier if it lies outside the interval (Q1 − 1.5×IQR, Q3 + 1.5×IQR).

http://www.physics.csbsju.edu/stats/box2.html
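A minimal sketch of the boxplot rule. Quartile definitions vary between tools; this version assumes the "inclusive" method of Python's `statistics.quantiles`, and the name `boxplot_outliers` is hypothetical.

```python
from statistics import quantiles


def boxplot_outliers(values: list[float]) -> tuple[list[float], list[float]]:
    """Split outliers into (mild, extreme) using the 1.5*IQR and 3*IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
            extreme.append(x)
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            mild.append(x)
    return mild, extreme


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 17.0, 30.0]
mild, extreme = boxplot_outliers(data)
```

Here Q1 = 3.5 and Q3 = 8.5, so IQR = 5. The mild fences are (−4, 16) and the extreme fences are (−11.5, 23.5).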
Outlier detection
• Multivariate: clustering
– Very small clusters are outliers

http://www.ibm.com/developerworks/data/library/techarticle/dm-0811wurst/
Outlier detection
• Multivariate: distance based
– An instance with very few neighbours within distance D is regarded as an outlier
– k-NN algorithm
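The distance-based rule can be sketched with a brute-force neighbour count. The parameter `radius` plays the role of D, and `min_neighbours` is an assumed threshold for "very few". Neither value comes from the slides.

```python
import math


def distance_based_outliers(points, radius, min_neighbours):
    """Flag points with fewer than min_neighbours other points within radius."""
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbours < min_neighbours:
            outliers.append(p)
    return outliers


points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (6.0, 6.0)]
flagged = distance_based_outliers(points, radius=1.0, min_neighbours=2)
```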
Concept (model) based outlier: a two-dimensional outlier that is not an outlier in either of its projections. The linear relation between the attributes, however, shows that the red dot is an outlier.
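A sketch of this idea: fit a straight line to the two attributes and flag points whose residual is unusually large. The threshold of k residual standard deviations, the name `residual_outliers`, and the data are assumptions made for illustration.

```python
from statistics import mean, pstdev


def residual_outliers(xs, ys, k=2.0):
    """Flag points whose residual from a fitted line exceeds k residual std devs."""
    mx, my = mean(xs), mean(ys)
    # Ordinary least-squares slope and intercept for y = slope * x + intercept.
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    s = pstdev(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * s]


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.1, 1.9, 3.0, 4.1, 8.0, 6.0, 7.1, 7.9, 9.0]  # (5, 8) breaks the trend
flagged = residual_outliers(xs, ys)
```

The point (5, 8) lies inside the range of both x and y, so neither projection reveals it. Only the relation between the two attributes does.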
