Slide Data Preprocessing
Modified by
Dr. Taufik Fuadi Abidin, S.Si., M.Tech
Why Is Data Preprocessing Needed?
Data in the real world is dirty (imperfect)
incomplete: attribute values are missing, attributes
that should be present are absent, or only
aggregate data is available
e.g., Occupation = "", Gender = ""
noisy: contains errors or outliers
e.g., Salary = "-100,000"
inconsistent: contains discrepancies
in codes and values
e.g., Age = "42", Birthday = "03/07/1980"
e.g., ratings were previously "1, 2, 3", now "A, B, C"
e.g., discrepancies between duplicate records
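The three kinds of dirty data above can be checked for mechanically. The following is a minimal sketch, not a general-purpose validator: the function name `find_problems`, the field names, the reference date, and the day/month reading of "03/07/1980" are all assumptions made for illustration.

```python
from datetime import date


def find_problems(record, today):
    """Return a list of data-quality problems found in one record (hypothetical helper)."""
    problems = []

    # incomplete: an attribute value is missing or empty
    for key, value in record.items():
        if value is None or str(value).strip() == "":
            problems.append(f"incomplete: {key} is empty")

    # noisy: a salary cannot be negative
    salary = record.get("salary")
    if isinstance(salary, (int, float)) and salary < 0:
        problems.append("noisy: salary is negative")

    # inconsistent: stated age disagrees with the age implied by the birthday
    age, birthday = record.get("age"), record.get("birthday")
    if isinstance(age, int) and isinstance(birthday, date):
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        derived = today.year - birthday.year - (0 if had_birthday else 1)
        if derived != age:
            problems.append(f"inconsistent: age {age} but birthday implies {derived}")

    return problems


# Assumed example record; "03/07/1980" is read here as 3 July 1980.
record = {"occupation": "", "salary": -100000, "age": 42, "birthday": date(1980, 7, 3)}
problems = find_problems(record, today=date(2010, 1, 1))  # assumed reference date
for p in problems:
    print(p)
```

With the assumed reference date, the record is flagged once for each category: an empty occupation, a negative salary, and an age that does not match the birthday.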
Data Mining: Concepts and Techniques
Why Is Data Dirty?
Incomplete data can arise because
The value of a particular attribute was not available when the data was collected
("not applicable")
Different considerations applied when the data was collected
than when it was analyzed
Problems caused by humans, hardware, or software
Noisy data (incorrect values) can arise because of
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data can arise because of
Different data sources
Functional dependency violation
(e.g., modifying some linked data)
Duplicate records
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same
or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially
for numerical data
Mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
e.g.: 4, 36, 45, 50, 75
Median:
Middle value if odd number of values, or average of the middle two
values otherwise e.g: 1, 5, 2, 8, 7
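As a quick check of the two example lists, the mean and median can be computed with Python's standard `statistics` module. This is only a sketch; the variable names and the extra even-length list are illustrative assumptions.

```python
from statistics import mean, median

values_for_mean = [4, 36, 45, 50, 75]
values_for_median = [1, 5, 2, 8, 7]

# Mean: sum of the values divided by their count, (4 + 36 + 45 + 50 + 75) / 5
sample_mean = mean(values_for_mean)

# Median: sort to 1, 2, 5, 7, 8 and take the middle value
sample_median = median(values_for_median)

# Even number of values (assumed extra example): average of the middle two, (2 + 5) / 2
even_median = median([1, 2, 5, 7])

print(sample_mean, sample_median, even_median)  # 42 5 3.5
```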
Importance
"Data cleaning is one of the three biggest problems
in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data
warehousing" (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
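As one illustration of the first task, missing values can be filled with the mean of the known values. This is a sketch of one common strategy among several, not the only way to handle missing data; the function name `fill_missing_with_mean` and the use of `None` to mark a missing value are assumptions.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the non-missing entries (hypothetical helper)."""
    known = [v for v in values if v is not None]
    if not known:
        # Nothing to compute a mean from; return the data unchanged.
        return list(values)
    fill = mean(known)
    return [fill if v is None else v for v in values]


# Assumed example column with two missing values.
filled = fill_missing_with_mean([30, None, 50, 40, None])
print(filled)  # [30, 40, 50, 40, 40]
```

Filling with the mean keeps the attribute mean unchanged but reduces its variance, so it is best treated as a simple baseline.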
incomplete data
inconsistent data
Clustering
detect and remove outliers
[Figure: regression line y = x + 1, with the observed value Y1 at X1 and the corresponding value Y1' on the line]
$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
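The correlation coefficient between two attributes A and B can be computed directly from its definition: the sum of the products AB minus n times the product of the two means, divided by (n - 1) times the product of the two standard deviations. The sketch below assumes the standard deviations are sample standard deviations (divisor n - 1), which makes the result equal to the Pearson coefficient; the function name and the data are illustrative.

```python
from statistics import mean, stdev


def correlation(a, b):
    """Correlation coefficient r_{A,B} using the (n - 1) * sigma_A * sigma_B denominator."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    # Numerator in its second form: sum(AB) - n * mean(A) * mean(B)
    cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    # stdev() is the sample standard deviation (assumption: divisor n - 1)
    return cross / ((n - 1) * stdev(a) * stdev(b))


# Assumed example data: B rises with A in the first case and falls in the second.
r_pos = correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_neg = correlation([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print(round(r_pos, 6), round(r_neg, 6))  # 1.0 -1.0
```

A value near +1 means the attributes rise together, a value near -1 means one falls as the other rises, and a value near 0 indicates no linear relationship.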
Ex. Let μ = 54,000, σ = 16,000. Then
$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
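The example is an instance of z-score normalization, which maps a value v to (v - μ) / σ. A minimal sketch with a hypothetical function name:

```python
def z_score(v, mu, sigma):
    """Z-score normalization: distance of v from the mean in units of standard deviation."""
    return (v - mu) / sigma


# Values from the example: v = 73,600, mu = 54,000, sigma = 16,000
normalized = z_score(73_600, 54_000, 16_000)
print(round(normalized, 3))  # 1.225
```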
Normalization by decimal scaling
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that $\max(|v'|) < 1$
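Decimal scaling can be sketched as a search for the smallest j that brings every scaled value below 1 in absolute value. The function name and the input values are assumptions for illustration, and the sketch restricts j to non-negative integers.

```python
def decimal_scaling(values):
    """Return (j, scaled values) where every scaled value v / 10**j has absolute value < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0  # assumption: only non-negative j is considered
    while max_abs / 10**j >= 1:
        j += 1
    return j, [v / 10**j for v in values]


# Assumed example values: the largest absolute value is 986, so j = 3.
j, scaled = decimal_scaling([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```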