In the last lecture we discussed data quality issues. We will now discuss some common techniques for addressing those quality issues. After this video, you will be able to define what imputation means, illustrate three ways to handle missing values, and describe the role of domain knowledge in addressing data quality issues.

As we discussed in the last lecture, real-world data is messy. Some data quality issues that you can find in your data are missing values, duplicate data, invalid data, noise and outliers. You will need to clean your data if you want to perform any meaningful analysis on that data.

Recall that missing data occurs when you don't have a value for certain variables in some samples. A simple way to handle missing data is to simply drop any samples with missing values or NAs. All machine learning tools provide a mechanism or command for filtering out rows with any missing values. The advantage of this approach is that it is very simple. The caveat is that you are removing data when you filter out examples. If the number of samples dropped is large, then you end up losing a lot of your data.
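
To make this concrete, a minimal sketch of that filtering step in pandas might look like the following (the DataFrame and its columns are hypothetical, not from the lecture):

    import numpy as np
    import pandas as pd

    # Hypothetical samples; NaN marks a missing value (an NA).
    df = pd.DataFrame({
        "years_employed": [3.0, np.nan, 10.0],
        "income": [52000.0, 48000.0, np.nan],
    })

    # Drop any sample (row) that has at least one missing value.
    df_complete = df.dropna()

Here the two rows containing a NaN are removed, which is exactly the data loss the caveat above refers to.
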
An alternative to dropping samples with missing data is to impute the missing values. Imputing means to replace the missing values with some reasonable values. The advantage of this approach is that you're making use of all your data. Of course, imputing is more complicated than simply dropping samples.

There are several ways to impute missing values. One strategy is to replace the missing values with the mean or median value of the variable. For example, a missing value for years of employment can be replaced by the mean or median value for years of employment for all current employees. Another approach is to use the most frequent value in place of the missing value. For example, the most frequently recorded age of customers associated with the specific item can be used if that value is missing. Alternatively, a sensible value can be derived as a replacement for a missing value. For example, a missing value for income can be set to zero for customers less than 18 years old, or it can be replaced with an average value based on occupation and location. Note that this approach requires knowledge about the application and the variable with missing values in order to make reasonable choices about what values would be sensible to replace the missing values.
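
As a rough sketch of these three strategies in pandas (the column names, the sample values, and the under-18 rule are assumptions for illustration only):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "years_employed": [3.0, np.nan, 10.0, 7.0],
        "age": [25.0, 16.0, np.nan, 34.0],
        "income": [52000.0, np.nan, 61000.0, np.nan],
    })

    # 1. Replace missing values with the mean (or median) of the variable.
    df["years_employed"] = df["years_employed"].fillna(df["years_employed"].mean())

    # 2. Replace missing values with the most frequent value (the mode).
    df["age"] = df["age"].fillna(df["age"].mode()[0])

    # 3. Derive a sensible value: missing income becomes zero for customers
    #    under 18; remaining gaps fall back to the overall median income
    #    (a stand-in for an average by occupation and location).
    under_18 = df["age"] < 18
    df.loc[under_18, "income"] = df.loc[under_18, "income"].fillna(0)
    df["income"] = df["income"].fillna(df["income"].median())
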
In the case of duplicate data, one approach is to delete the older record. Another approach is to merge duplicate records. This often requires a way to determine how to resolve conflicting values. For example, in the case of multiple addresses for the same customer, some logic for determining similarities between addresses might be necessary. For example, "St." is the same as "Street."
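
A small sketch of both ideas in pandas, assuming hypothetical customer records and a very simple St./Street normalization rule:

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [101, 101, 102],
        "address": ["12 Main St.", "12 Main Street", "7 Oak Ave"],
        "last_updated": ["2015-01-10", "2016-03-02", "2016-05-20"],
    })

    # Normalize abbreviations so that "St." and "Street" compare as equal.
    df["address_norm"] = (df["address"]
                          .str.lower()
                          .str.replace("st.", "street", regex=False))

    # Delete the older record: keep only the most recent row per customer.
    df["last_updated"] = pd.to_datetime(df["last_updated"])
    deduped = (df.sort_values("last_updated")
                 .drop_duplicates(subset="customer_id", keep="last"))
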
To address invalid data, consulting another data source may be necessary. For example, an invalid zip code can be corrected by looking up the correct zip code based on city and state. A best estimate for a reasonable value can also be used as a replacement. For example, for a missing age value for an employee, a reasonable value can be estimated based on the employee's length of employment.
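
One way that zip code correction could be sketched in pandas (both tables and the validity rule are made up for illustration):

    import pandas as pd

    customers = pd.DataFrame({
        "city": ["San Diego", "Seattle"],
        "state": ["CA", "WA"],
        "zip": ["00000", "98101"],     # the first zip code is invalid
    })

    # Reference table from another, trusted data source.
    zip_lookup = pd.DataFrame({
        "city": ["San Diego", "Seattle"],
        "state": ["CA", "WA"],
        "zip_expected": ["92101", "98101"],
    })

    # Look up the expected zip by city and state, and overwrite any zip
    # that does not match it (a simplifying notion of "invalid").
    fixed = customers.merge(zip_lookup, on=["city", "state"])
    invalid = fixed["zip"] != fixed["zip_expected"]
    fixed.loc[invalid, "zip"] = fixed.loc[invalid, "zip_expected"]
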
Noise that distorts the data values can be addressed by filtering out the source of the noise. For example, filtering out the frequency of a constant background noise will remove that noise component from a recording. This filtering must be done with care, however, as it can also remove some components of the true data in the process.
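
For the audio example, frequency filtering along these lines could be sketched with SciPy; the sampling rate, the 60 Hz hum, and the signal itself are all assumptions made up for this illustration:

    import numpy as np
    from scipy.signal import filtfilt, iirnotch

    fs = 1000.0          # sampling rate in Hz (assumed)
    hum_freq = 60.0      # constant background hum to remove (assumed)

    # A made-up recording: a 5 Hz signal of interest plus a 60 Hz hum.
    t = np.arange(0, 1.0, 1 / fs)
    recording = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * hum_freq * t)

    # Notch filter centered on the hum frequency, applied with zero phase.
    b, a = iirnotch(hum_freq, Q=30.0, fs=fs)
    cleaned = filtfilt(b, a, recording)

    # The caution above applies: anything in the true signal near 60 Hz
    # would be attenuated as well.
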
Outliers can be detected through the use of summary statistics and plots of the data. Outliers can significantly skew the distribution of your data and thus the results of your analysis. In cases where outliers are not the focus of your analysis, you will want to remove these outlier samples from your data set. For example, when a thermostat malfunctions and causes values to fluctuate wildly, or to be much higher or lower than normal, these samples should be filtered out.
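
A minimal sketch of detecting and filtering those thermostat-style outliers with summary statistics (the readings and the 1.5 x IQR rule are illustrative choices, not the only option):

    import pandas as pd

    # Hypothetical temperature readings with two malfunction spikes.
    readings = pd.Series([21.0, 21.5, 20.8, 22.1, 85.0, 21.3, -40.0])

    # Summary statistics: quartiles and the interquartile range (IQR).
    q1, q3 = readings.quantile(0.25), readings.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Flag values far outside the bulk of the data as outliers...
    outliers = (readings < lower) | (readings > upper)

    # ...and, since they are not the focus of the analysis, filter them out.
    cleaned = readings[~outliers]
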
In some applications, however, outliers are exactly what you're looking for. So when you detect outliers, you don't want to throw them out. Instead, you want to examine them more closely. A classic example of this is in fraud detection, where outliers represent potential fraudulent use, and those samples should be analyzed closely.
In order to address data quality issues effectively, knowledge about the application is crucial. Things such as how the data was collected, the user population, the intended use of the application, etc., are important. This domain knowledge is essential to making informed decisions on how to best impute missing values, how to handle duplicate records and invalid data, and what to do about noise and outliers in your data.
