Unsia - Data Mining Pertemuan 9

Uploaded by

Rayhan

DATA MINING

Session 9: Data Cleaning

By

Riad Sahara, S.SI., MT

Ir. Henny Yulianti, M.M., M.Kom


Data Preprocessing

CRISP-DM

Why Preprocess the Data?
Measures for data quality: A multidimensional view

• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, …
• Timeliness: timely update?
• Believability: how much can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
1. Data cleaning
• Fill in missing values
• Smooth noisy data
• Identify or remove outliers
• Resolve inconsistencies
2. Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
3. Data transformation and data discretization
• Normalization
• Concept hierarchy generation
4. Data integration
• Integration of multiple databases or files
Data Preparation Law (Data Mining Law 3)
Data preparation is more than half of every data mining process

• Maxim of data mining: most of the effort in a data mining
project is spent in data acquisition and preparation, and
informal estimates vary from 50 to 80 percent
• The purpose of data preparation is:
1. To put the data into a form in which the data mining question can be
asked
2. To make it easier for the analytical techniques (such as data mining
algorithms) to answer it
Data Cleaning

Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission errors

• Incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• Noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• Inconsistent: containing discrepancies in codes or names
• e.g., Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• Discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• data that was inconsistent with other recorded data and was thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of
entry
• history or changes of the data not being recorded
• Missing data may need to be inferred
Example of Missing Data
• Dataset: MissingDataSet.csv

MissingDataSet.csv
• Jerry is the marketing manager for a small Internet design and advertising firm
• Jerry’s boss asks him to develop a data set containing information about
Internet users
• The company will use this data to determine what kinds of people are using
the Internet and how the firm may be able to market their services to this
group of users
• To accomplish his assignment, Jerry creates an online survey and places links
to the survey on several popular Web sites
• Within two weeks, Jerry has collected enough data to begin analysis, but he
finds that his data needs to be denormalized
• He also notes that some observations in the set are missing values or they
appear to contain invalid values
• Jerry realizes that some additional work on the data needs to take place before
analysis begins.
Relational Data

View of Data (Denormalized Data)

How to Handle Missing Data?
• Ignore the tuple:
• Usually done when the class label is missing (when doing
classification); not effective when the % of missing values per
attribute varies considerably
• Fill in the missing value manually:
• Tedious + infeasible?
• Fill it in automatically with
• A global constant: e.g., “unknown”, a new class?!
• The attribute mean
• The attribute mean for all samples belonging to the same class:
smarter
• The most probable value: inference-based, such as Bayesian
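
The automatic fill-in options above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the attribute names (`income`, `cls`, `job`) and the helper function names are made up for the example, and `None` is assumed to mark a missing value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; None marks a missing value.
rows = [
    {"income": 40.0, "cls": "A", "job": "clerk"},
    {"income": None, "cls": "A", "job": None},
    {"income": 60.0, "cls": "A", "job": "nurse"},
    {"income": 100.0, "cls": "B", "job": "pilot"},
    {"income": None, "cls": "B", "job": "chef"},
]

def fill_constant(rows, attr, constant="unknown"):
    """Replace missing values of attr with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Replace missing values of attr with the mean of the observed values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]

def fill_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean of the same class.

    Assumes every class has at least one observed value of attr.
    """
    groups = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            groups[r[label]].append(r[attr])
    means = {c: mean(v) for c, v in groups.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]}
            for r in rows]

by_constant = fill_constant(rows, "job")
by_mean = fill_mean(rows, "income")
by_class_mean = fill_class_mean(rows, "income", "cls")
```

In this toy data the overall mean pulls both missing incomes toward the same value, while the class mean keeps the two classes apart, which is why the slide calls the class-based variant "smarter".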
Exercise
• Carry out the experiments following Matthew North,
Data Mining for the Masses, 2nd Edition,
2016, Chapter 3, Data Preparation
1. Handling Missing Data, pp. 34-48 (replace)
2. Data Reduction, pp. 48-51 (delete/filter)

• Dataset: MissingDataSet.csv

• Analyze which preprocessing methods are used
and why they need to be applied to
this dataset.
Missing Value Detection

Missing Value Replace

Missing Value Filtering
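
As a rough sketch of detection and filtering outside a graphical tool, the Python below counts missing values per attribute and then drops incomplete rows. The records and attribute names are hypothetical, and treating both `None` and the empty string as missing is an assumption of this example.

```python
# Hypothetical toy records; None and "" both count as missing here.
rows = [
    {"age": 34, "gender": "F", "site": "Facebook"},
    {"age": None, "gender": "M", "site": ""},
    {"age": 51, "gender": None, "site": "Twitter"},
]

def is_missing(value):
    """True for None and for empty or whitespace-only strings."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def missing_counts(rows):
    """Detection: number of missing values per attribute."""
    counts = {}
    for r in rows:
        for attr, value in r.items():
            counts[attr] = counts.get(attr, 0) + is_missing(value)
    return counts

def drop_incomplete(rows, attrs=None):
    """Filtering: keep rows with no missing value in attrs (default: all)."""
    return [r for r in rows
            if not any(is_missing(r[a]) for a in (attrs or r.keys()))]

counts = missing_counts(rows)
complete = drop_incomplete(rows)
complete_age = drop_incomplete(rows, ["age"])
```

Filtering on a single attribute (`complete_age`) discards fewer rows than filtering on all of them, which matters when the share of missing values differs a lot between attributes.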

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistency in naming conventions
• Other data problems which require data cleaning
• Duplicate records
• Incomplete data
• Inconsistent data
How to Handle Noisy Data?
• Binning
• First sort data and partition into (equal-frequency) bins
• Then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• Smooth by fitting the data to regression functions
• Clustering
• Detect and remove outliers
• Combined computer and human inspection
• Detect suspicious values and check by human (e.g., deal with
possible outliers)
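
The binning approach can be sketched as follows: sort the values, split them into equal-frequency bins, then smooth each bin by its mean or by its boundaries. The function names and the price list are illustrative assumptions, and the sketch assumes there are at least as many values as bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer bin boundary (ties go to the minimum)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by bin median would follow the same pattern, using `statistics.median` in place of the mean.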
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency,
distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and
relationships to detect violators (e.g., correlation and clustering to
find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users
to specify transformations through a graphical user
interface
• Integration of the two processes
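
The discrepancy checks listed above (a range taken from metadata, plus the uniqueness, consecutive and null rules) can be expressed as small predicates. The sketch below is an assumption-laden illustration: the records, the attribute names and the set of null markers are invented for the example.

```python
def check_unique(rows, attr):
    """Uniqueness rule: return the values of attr that occur more than once."""
    seen, dupes = set(), set()
    for r in rows:
        v = r[attr]
        (dupes if v in seen else seen).add(v)
    return dupes

def check_consecutive(rows, attr):
    """Consecutive rule: return integers missing between the lowest and highest value."""
    present = {r[attr] for r in rows}
    return set(range(min(present), max(present) + 1)) - present

def check_null(rows, attr, null_markers=(None, "", "?")):
    """Null rule: return indexes of rows whose attr holds a null marker."""
    return [i for i, r in enumerate(rows) if r[attr] in null_markers]

def check_range(rows, attr, low, high):
    """Metadata (range) check: return indexes of rows with attr outside [low, high]."""
    return [i for i, r in enumerate(rows)
            if r[attr] is not None and not low <= r[attr] <= high]

# Hypothetical records with one duplicate id, a gap in the ids,
# two null markers and one out-of-range salary.
rows = [
    {"id": 1, "salary": 3200, "occupation": "clerk"},
    {"id": 2, "salary": -10, "occupation": ""},
    {"id": 2, "salary": 4100, "occupation": "nurse"},
    {"id": 5, "salary": 2800, "occupation": "?"},
]

duplicate_ids = check_unique(rows, "id")
missing_ids = check_consecutive(rows, "id")
null_rows = check_null(rows, "occupation")
bad_salary_rows = check_range(rows, "salary", 0, 1_000_000)
```

Each check only reports suspects; deciding whether to correct, delete or keep a flagged record is still a separate step, possibly with human inspection.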
Exercise
• Carry out the experiments following Matthew North,
Data Mining for the Masses, 2nd Edition,
2016, Chapter 3, Data Preparation, pp. 52-54
(Handling Inconsistent Data)

• Dataset: MissingDataSet.csv

• Analyze which preprocessing methods are used
and why they need to be applied to
this dataset!
Regex Settings

Testing the Regex
Exercise
• Import the data MissingDataValue-Noisy.csv
• Use a regular expression (the Replace
operator) to replace all noisy
data in the nominal attributes with “N”
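
The same idea can be tried with Python's `re` module. This is a sketch under a loud assumption: the slides do not say which values count as noisy, so the pattern below treats every non-empty entry other than "Y" or "N" as noise. The sample values are invented, and regex syntax in other tools may differ in detail.

```python
import re

# Assumption: the nominal attribute should only hold "Y" or "N";
# any other non-empty entry is treated as noise.
NOISE = re.compile(r"^(?!Y$|N$).+$")

def clean_nominal(value, replacement="N"):
    """Replace a noisy entry with the replacement; leave valid entries alone."""
    return NOISE.sub(replacement, value)

raw = ["Y", "N", "99", "yes?", "N", "Yy", ""]  # hypothetical column values
cleaned = [clean_nominal(v) for v in raw]
```

The empty string is deliberately left untouched so that missing values remain visible to a separate missing-value step.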

Exercise
1. Import the data MissingDataValue-Noisy-Multiple.csv
2. Use the Replace Missing Value operator to fill in the empty data
3. Use a regular expression (the Replace operator) to replace
all noisy data in the nominal attributes with “N”
4. Use the Map operator to replace all the entries Face, FB, and
Fesbuk with Facebook

1. Import the data MissingDataValue-Noisy-Multiple.csv
2. Replace Missing Value operator: fill in the empty data
3. Replace operator: replace all noisy data
in the nominal attributes with “N”
4. Map operator: replace all the entries Face, FB,
and Fesbuk with Facebook
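
The mapping in step 4 amounts to a lookup table from variant spellings to one canonical value. A minimal Python sketch, in which the sample column and the function name are hypothetical and only the three variants and their target come from the exercise:

```python
# Spellings listed in the exercise, all mapped to one canonical value.
CANONICAL = {"Face": "Facebook", "FB": "Facebook", "Fesbuk": "Facebook"}

def map_values(values, mapping=CANONICAL):
    """Replace each value that has a mapping entry; leave the rest untouched."""
    return [mapping.get(v, v) for v in values]

answers = ["Facebook", "FB", "Fesbuk", "Twitter", "Face"]  # hypothetical column
mapped = map_values(answers)
```

An explicit table like this only fixes the variants it lists, so new misspellings have to be added as they are discovered.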
CRISP-DM Case Study
• Sport Skill – Discriminant Analysis
• (Matthew North, Data Mining for the Masses, 2nd Edition, 2016,
Chapter 7, Discriminant Analysis, pp. 123-143)
• Dataset: SportSkill-Training.csv
• Dataset: SportSkill-Scoring.csv

1. Business Understanding
• Motivation:
• Gill runs a sports academy designed to help high school aged athletes
achieve their maximum athletic potential. He focuses on four major sports:
Football, Basketball, Baseball and Hockey
• He has found that while many high school athletes enjoy participating in a
number of sports in high school, as they begin to consider playing a sport at
the college level, they would prefer to specialize in one sport
• As he’s worked with athletes over the years, Gill has developed an
extensive data set, and he now is wondering if he can use past performance
from some of his previous clients to predict prime sports for up-and-coming
high school athletes
• By evaluating each athlete’s performance across a battery of tests, Gill
hopes we can help him figure out for which sport each athlete has the
highest aptitude
• Objective:
• Ultimately, he hopes he can make a recommendation to each athlete as to
the sport in which they should most likely choose to specialize
2. Data Understanding
• Every athlete that has enrolled at Gill’s academy
over the past several years has taken a battery
of tests, which tested for a number of athletic and
personal traits
• Because the academy has been operating for
some time, Gill has the benefit of knowing which
of his former pupils have gone on to specialize in
a single sport, and which sport it was for each of
them
2. Data Understanding
• Working with Gill, we gather the results of the batteries for all
former clients who have gone on to specialize
• Gill adds the sport each person specialized in, and we have a
data set comprised of 493 observations containing the following
attributes:
1. Age: ....
2. Strength: ....
3. Quickness: ....
4. Injury: ....
5. Vision: ....
6. Endurance: ....
7. Agility: ....
8. Decision Making: ....
9. Prime Sport: ....
3. Data Preparation
• Filter Examples: attribute value filter
• Decision_Making>=3
• Decision_Making<=100
• Deleted records = 493 - 482 = 11
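
The attribute value filter amounts to keeping only the rows whose Decision_Making score lies between 3 and 100 and counting what was dropped. A minimal Python sketch, where the four sample rows and the function name are hypothetical and only the attribute name and the two bounds come from the slide:

```python
def filter_examples(rows, attr="Decision_Making", low=3, high=100):
    """Keep rows whose attr lies in [low, high]; return (kept, number deleted)."""
    kept = [r for r in rows if low <= r[attr] <= high]
    return kept, len(rows) - len(kept)

# Hypothetical rows; two of them fall outside the allowed range.
athletes = [
    {"Age": 15, "Decision_Making": 42},
    {"Age": 16, "Decision_Making": 0},
    {"Age": 17, "Decision_Making": 103},
    {"Age": 15, "Decision_Making": 3},
]
kept, deleted = filter_examples(athletes)
```

Both bounds are inclusive, matching the `>=` and `<=` conditions of the filter.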

Exercise
1. Train on the data SportSkill-Training.csv
using C4.5, NB, k-NN, and LDA
2. Test using 10-fold cross-validation
(X-Validation)
3. Test for differences with a t-Test to find
the best model
4. Save the best model from the comparison above
with the Write Model operator, and then
Apply Model on the dataset SportSkill-
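
To make the evaluation steps concrete, here is a rough, standard-library-only Python sketch of 10-fold cross-validation followed by a paired t statistic over the per-fold accuracies. Everything in it is an assumption for illustration: the one-dimensional synthetic data stands in for the real dataset, and the two stand-in learners (majority class and 1-nearest-neighbour) are not the C4.5, NB, k-NN and LDA models of the exercise.

```python
import random
from math import sqrt
from statistics import mean, stdev

def k_fold_indices(n, k=10, seed=0):
    """Shuffle range(n) and split it into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, X, y, k=10, seed=0):
    """Return the accuracy on each of the k held-out folds."""
    scores = []
    for test in k_fold_indices(len(X), k, seed):
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return scores

def paired_t(scores_a, scores_b):
    """Paired t statistic over per-fold score differences (df = k - 1)."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(d)
    if spread == 0:  # the difference is identical on every fold
        return 0.0 if mean(d) == 0 else float("inf") * (1 if mean(d) > 0 else -1)
    return mean(d) / (spread / sqrt(len(d)))

# Two stand-in learners; each returns a predict(x) function.
def fit_majority(X, y):
    top = max(set(y), key=y.count)
    return lambda x: top

def fit_1nn(X, y):
    def predict(x):
        j = min(range(len(X)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
        return y[j]
    return predict

# Synthetic, well-separated two-class data (not the SportSkill data).
X = [(float(i),) for i in range(20)] + [(100.0 + i,) for i in range(20)]
y = ["A"] * 20 + ["B"] * 20

scores_1nn = cross_validate(fit_1nn, X, y)
scores_majority = cross_validate(fit_majority, X, y)
t = paired_t(scores_1nn, scores_majority)
print(scores_1nn, scores_majority, t)
```

With 10 folds the statistic has 9 degrees of freedom, so an absolute value above about 2.262 indicates a difference at the two-sided 5% level. Because the folds share training data, this test is only an approximation and its result should be read with some caution.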
[Results table: DT, NB, k-NN, and LDA compared against each other (rows and columns: DT, NB, k-NN, LDA); cell values not recoverable]
Thank You
