Unsia - Data Mining Pertemuan 9

Uploaded by

Rayhan

DATA MINING

Session 9: Data Cleaning

Riad Sahara, S.SI., MT

Ir. Henny Yulianti, M.M., M.Kom

Data Preprocessing

CRISP-DM

Why Preprocess the Data?
Measures for data quality: A multidimensional view

• Accuracy: are the values correct or wrong?
• Completeness: are values unrecorded or unavailable?
• Consistency: have some values been modified while others have not?
• Timeliness: is the data updated on time?
• Believability: how far can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
1. Data cleaning
• Fill in missing values
• Smooth noisy data
• Identify or remove outliers
• Resolve inconsistencies
2. Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
3. Data transformation and data discretization
• Normalization
• Concept hierarchy generation
4. Data integration
• Integration of multiple databases or files
Data Preparation Law (Data Mining Law 3)
Data preparation is more than half of every data mining process

• Maxim of data mining: most of the effort in a data mining project is spent on data acquisition and preparation; informal estimates range from 50 to 80 percent
• The purpose of data preparation is:
1. To put the data into a form in which the data mining question can be asked
2. To make it easier for the analytical techniques (such as data mining algorithms) to answer it
Data Cleaning

Data Cleaning
Data in the real world is dirty: much of it is potentially incorrect because of, e.g., faulty instruments, human or computer error, or transmission errors
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• Noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• Inconsistent: containing discrepancies in codes or names
• e.g., Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• Discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
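The three kinds of dirty data listed above can be checked for mechanically. The following is a minimal sketch rather than a complete validator: the field names, the reference date, and the reading of “03/07/2010” as 3 July 2010 are all assumptions made for illustration.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []
    # Incomplete: an empty or whitespace-only value counts as missing.
    if not record.get("occupation", "").strip():
        problems.append("incomplete: occupation is missing")
    # Noisy: a salary cannot be negative.
    if record["salary"] < 0:
        problems.append("noisy: negative salary")
    # Inconsistent: the stated age must match the age implied by the birthday.
    b = record["birthday"]
    implied_age = today.year - b.year - ((today.month, today.day) < (b.month, b.day))
    if implied_age != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems

# Hypothetical record combining the slide's three examples.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(2010, 7, 3)}
problems = find_problems(record, today=date(2024, 1, 1))
for p in problems:
    print(p)
```

All three checks fire for this record, because the occupation is blank, the salary is negative, and a 2010 birthday cannot give an age of 42.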
Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• data that was inconsistent with other recorded data and was therefore deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• failure to register the history or changes of the data
• Missing data may need to be inferred
Example of Missing Data
• Dataset: MissingDataSet.csv
MissingDataSet.csv
• Jerry is the marketing manager for a small Internet design and advertising firm
• Jerry’s boss asks him to develop a data set containing information about Internet users
• The company will use this data to determine what kinds of people are using the Internet and how the firm may be able to market their services to this group of users
• To accomplish his assignment, Jerry creates an online survey and places links to the survey on several popular Web sites
• Within two weeks, Jerry has collected enough data to begin analysis, but he finds that his data needs to be denormalized
• He also notes that some observations in the set are missing values or appear to contain invalid values
• Jerry realizes that some additional work on the data needs to take place before analysis begins
Relational Data

View of Data (Denormalized Data)

How to Handle Missing Data?
• Ignore the tuple:
• Usually done when the class label is missing (when doing classification). Not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually:
• Tedious and often infeasible
• Fill it in automatically with
• A global constant: e.g., “unknown” (which may act as a new class)
• The attribute mean
• The attribute mean for all samples belonging to the same class: smarter
• The most probable value: inference-based, such as Bayesian inference
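The automatic fill-in options above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure from the exercises: missing values are represented as `None`, and the `income` and `label` lists are made-up data.

```python
from statistics import mean

def fill_with_constant(values, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace every missing value with the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace a missing value with the mean of the observed values in its class.

    Raises KeyError if a class has no observed value at all.
    """
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical attribute with two missing entries, plus a class label per row.
income = [30, None, 50, 90, None, 110]
label = ["low", "low", "low", "high", "high", "high"]

print(fill_with_mean(income))               # both gaps get the overall mean, 70
print(fill_with_class_mean(income, label))  # gaps get 40 (low) and 100 (high)
```

The class-conditional version is the "smarter" option because the filled values stay close to the other rows of the same class instead of being pulled toward the overall mean.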
Exercise
• Carry out the experiments in Matthew North, Data Mining for the Masses, 2nd Edition, 2016, Chapter 3: Data Preparation
1. Handling Missing Data, pp. 34-48 (replace)
2. Data Reduction, pp. 48-51 (delete/filter)
• Dataset: MissingDataSet.csv
• Analyze which preprocessing methods are used and why they need to be applied to this dataset
Missing Value Detection

Missing Value Replace

Missing Value Filtering

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistent naming conventions
• Other data problems which require data cleaning
• Duplicate records
• Incomplete data
• Inconsistent data
How to Handle Noisy Data?
• Binning
• First sort the data and partition it into (equal-frequency) bins
• Then smooth by bin means, by bin medians, by bin boundaries, etc.
• Regression
• Smooth by fitting the data to regression functions
• Clustering
• Detect and remove outliers
• Combined computer and human inspection
• Detect suspicious values and have a human check them (e.g., to deal with possible outliers)
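Binning is easiest to follow on a small example. The sketch below assumes equal-frequency bins of a fixed size and resolves ties in boundary smoothing toward the lower boundary; the `prices` list is illustrative data, and the function names are not taken from any particular tool.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by means removes the most local variation, while smoothing by boundaries keeps the extreme values of each bin.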
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check for field overloading
• Check the uniqueness rule, consecutive rule, and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections
• Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
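The uniqueness, consecutive, and null rules mentioned under discrepancy detection translate into simple checks. The sketch below is one possible reading of those rules; the sample columns and the set of null markers are assumptions for illustration.

```python
def check_unique(values):
    """Uniqueness rule: return the values that occur more than once."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)

def check_consecutive(numbers):
    """Consecutive rule: return the numbers missing between min and max."""
    present = set(numbers)
    return [n for n in range(min(numbers), max(numbers) + 1) if n not in present]

def check_nulls(values, null_markers=("", "?", "N/A")):
    """Null rule: return the positions that hold an agreed null marker."""
    return [i for i, v in enumerate(values) if v is None or v in null_markers]

# Hypothetical columns.
invoice_ids = [1001, 1002, 1002, 1004]
postal_codes = ["12345", "", "?", "54321"]

print(check_unique(invoice_ids))       # [1002]
print(check_consecutive(invoice_ids))  # [1003]
print(check_nulls(postal_codes))       # [1, 2]
```

Each function only reports violations; deciding how to repair them is left to the scrubbing or auditing step.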
Exercise
• Carry out the experiments in Matthew North, Data Mining for the Masses, 2nd Edition, 2016, Chapter 3: Data Preparation, pp. 52-54 (Handling Inconsistent Data)
• Dataset: MissingDataSet.csv
• Analyze which preprocessing methods are used and why they need to be applied to this dataset
Setting the Regex

Testing the Regex
Exercise
• Import the data MissingDataValue-Noisy.csv
• Use a regular expression (the Replace operator) to replace all noisy data in the nominal attributes with “N”
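Outside the operator-based tool used in the exercise, the same idea can be sketched with Python's `re` module. The slides do not say what the noisy values look like, so this sketch assumes that the only legitimate values of the nominal attribute are "Y" and "N" and treats everything else as noise; the sample column is made up.

```python
import re

# Assumption: "Y" and "N" are the only valid values of the nominal attribute.
VALID = re.compile(r"[YN]")

def clean_nominal(value):
    """Keep a valid value as it is; replace anything else with "N"."""
    return value if VALID.fullmatch(value) else "N"

column = ["Y", "N", "99", "?", "Y"]  # hypothetical noisy column
cleaned = [clean_nominal(v) for v in column]
print(cleaned)  # ['Y', 'N', 'N', 'N', 'Y']
```

Using `fullmatch` means the whole cell has to be a valid value, so a cell such as "YY" or "N?" is also treated as noise.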

Exercise
1. Import the data MissingDataValue-Noisy-Multiple.csv
2. Use the Replace Missing Value operator to fill in the empty data
3. Use a regular expression (the Replace operator) to replace all noisy data in the nominal attributes with “N”
4. Use the Map operator to replace every entry of Face, FB, and Fesbuk with Facebook
[Figure: process diagram with four operators, numbered 1 to 4]
1. Import the data MissingDataValue-Noisy-Multiple.csv
2. Replace Missing Value operator: fill in the empty data
3. Replace operator: replace all noisy data in the nominal attributes with “N”
4. Map operator: replace every entry of Face, FB, and Fesbuk with Facebook
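The mapping step amounts to a lookup table from variant spellings to one canonical value. A minimal Python sketch follows; the three aliases come from the exercise, while the sample answers (including "Twitter") are made up.

```python
# Variant spellings from the exercise, all mapped to the canonical name.
ALIASES = {"Face": "Facebook", "FB": "Facebook", "Fesbuk": "Facebook"}

def map_value(value):
    """Return the canonical value for a known alias, otherwise the value itself."""
    return ALIASES.get(value, value)

answers = ["FB", "Twitter", "Fesbuk", "Facebook", "Face"]  # hypothetical column
mapped = [map_value(v) for v in answers]
print(mapped)  # ['Facebook', 'Twitter', 'Facebook', 'Facebook', 'Facebook']
```

Values that are not listed as aliases pass through unchanged, so the mapping can be applied to the whole column safely.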
CRISP-DM Case Study
• Sport Skill – Discriminant Analysis
• (Matthew North, Data Mining for the Masses, 2nd Edition, 2016, Chapter 7: Discriminant Analysis, pp. 123-143)
• Dataset: SportSkill-Training.csv
• Dataset: SportSkill-Scoring.csv
1. Business Understanding
• Motivation:
• Gill runs a sports academy designed to help high-school-aged athletes achieve their maximum athletic potential. He focuses on four major sports: Football, Basketball, Baseball, and Hockey
• He has found that while many high school athletes enjoy participating in a number of sports in high school, as they begin to consider playing a sport at the college level, they would prefer to specialize in one sport
• As he’s worked with athletes over the years, Gill has developed an extensive data set, and he is now wondering if he can use past performance from some of his previous clients to predict prime sports for up-and-coming high school athletes
• By evaluating each athlete’s performance across a battery of tests, Gill hopes we can help him figure out for which sport each athlete has the highest aptitude
• Objective:
• Ultimately, he hopes he can make a recommendation to each athlete as to the sport in which they should most likely choose to specialize
2. Data Understanding
• Every athlete who has enrolled at Gill’s academy over the past several years has taken a battery of tests, which measured a number of athletic and personal traits
• Because the academy has been operating for some time, Gill has the benefit of knowing which of his former pupils have gone on to specialize in a single sport, and which sport it was for each of them
2. Data Understanding
• Working with Gill, we gather the results of the batteries for all former clients who have gone on to specialize
• Gill adds the sport each person specialized in, and we have a data set of 493 observations containing the following attributes:
1. Age: ....
2. Strength: ....
3. Quickness: ....
4. Injury: ....
5. Vision: ....
6. Endurance: ....
7. Agility: ....
8. Decision Making: ....
9. Prime Sport: ....
3. Data Preparation
• Filter Examples: attribute value filter
• Decision_Making>=3
• Decision_Making<=100
• Deleted records = 493 - 482 = 11
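The attribute value filter keeps only the rows whose Decision_Making score lies between 3 and 100 and reports how many rows were dropped. A minimal Python sketch of that logic, using five made-up rows rather than the real 493-row data set:

```python
def filter_range(rows, attribute, low, high):
    """Keep rows whose attribute lies in [low, high]; also count the removed rows."""
    kept = [r for r in rows if low <= r[attribute] <= high]
    return kept, len(rows) - len(kept)

# Hypothetical rows; 2 and 101 fall outside the valid range.
rows = [{"Decision_Making": v} for v in (2, 3, 57, 100, 101)]
kept, deleted = filter_range(rows, "Decision_Making", 3, 100)

print([r["Decision_Making"] for r in kept])  # [3, 57, 100]
print(deleted)                               # 2
```

On the case-study data the same count is 493 - 482 = 11 removed records.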

Exercise
1. Train models on SportSkill-Training.csv using C4.5, NB, k-NN, and LDA
2. Evaluate them using 10-fold cross-validation
3. Compare them with a t-test to find the best model
4. Save the best model from the comparison above with the Write Model operator, and then use Apply Model on the dataset SportSkill-
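To make the evaluation steps concrete, the sketch below shows how 10-fold cross-validation splits row indices and how a paired t statistic compares two models fold by fold. It is a simplification under stated assumptions: the folds are interleaved rather than shuffled or stratified, no classifier is trained, and the per-fold accuracies are hypothetical numbers, not results from the SportSkill data.

```python
import math
from statistics import mean, stdev

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; each row is a test row exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def paired_t_statistic(a, b):
    """t statistic of the per-fold differences a[i] - b[i].

    Raises ZeroDivisionError if all differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# 482 is the number of rows left after the filter step in the case study.
splits = list(k_fold_indices(482, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 433 49

# Hypothetical per-fold accuracies of two models (illustrative only).
acc_a = [0.90, 0.92, 0.91, 0.93]
acc_b = [0.88, 0.91, 0.88, 0.90]
print(paired_t_statistic(acc_a, acc_b) > 0)  # True: model A is ahead on average
```

The t statistic still has to be compared against a critical value (or converted to a p-value) before one model can be called significantly better than the other.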
[Table: pairwise comparison matrix with rows and columns DT, NB, k-NN, and LDA; cell values not preserved]
Thank You
