Data Mining Session
GTA Transformation
Academy
Preprocessing
Estimation Models, Classification, Evaluation
Data Preprocessing (Data Preparation)
Data cleaning
Why Preprocess the Data First?
Dataset: Missingdataset.csv
How to Handle Missing Data?
• Ignore the tuple
• Fill in the missing value manually
• Tedious and often not feasible
• Fill it in automatically with
• A global constant
• The attribute mean: the mean of the attribute that has the missing value
• The most probable value: the missing value is replaced with the value that occurs most often
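The strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the column names `age` and `gender`, and the helper functions are hypothetical and are not taken from MissingDataset.csv. `None` marks a missing value.

```python
from statistics import mean, mode

# Hypothetical survey rows (not from MissingDataset.csv); None marks a missing value.
rows = [
    {"age": 25, "gender": "M"},
    {"age": None, "gender": "F"},
    {"age": 35, "gender": None},
    {"age": 30, "gender": "F"},
]

def ignore_tuples(rows):
    """Drop every row that has at least one missing value."""
    return [r for r in rows if None not in r.values()]

def fill_constant(rows, column, constant):
    """Replace missing values in `column` with a global constant."""
    return [{**r, column: constant if r[column] is None else r[column]} for r in rows]

def fill_mean(rows, column):
    """Replace missing numeric values with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    return fill_constant(rows, column, mean(observed))

def fill_mode(rows, column):
    """Replace missing values with the most frequently observed value."""
    observed = [r[column] for r in rows if r[column] is not None]
    return fill_constant(rows, column, mode(observed))

complete_rows = ignore_tuples(rows)        # keeps only the fully observed rows
ages_filled = fill_mean(rows, "age")       # missing age becomes the mean age
genders_filled = fill_mode(rows, "gender") # missing gender becomes the mode
```

Ignoring tuples is the simplest option but discards data, so it mainly suits datasets where few rows are affected.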
MissingDataset.csv
• Jerry is the marketing manager for a small Internet design and advertising firm
• Jerry’s boss asks him to develop a data set containing information about Internet users
• The company will use this data to determine what kinds of people are using the Internet
and how the firm may be able to market their services to this group of users
• To accomplish his assignment, Jerry creates an online survey and places links to the
survey on several popular Web sites
• Within two weeks, Jerry has collected enough data to begin analysis, but he finds that
his data needs to be denormalized
• He also notes that some observations in the set are missing values or they appear to
contain invalid values
• Jerry realizes that some additional work on the data needs to take place before analysis
begins.
Preprocessing Exercise with RapidMiner
• Handle the missing values in the dataset
(using Replace Missing Values) (using Filter Examples)
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitation
• Inconsistency in naming convention
• Other data problems which require data cleaning
• Duplicate records
• Incomplete data
• Inconsistent data
How to Handle Noisy Data?
• Binning
• First sort data and partition into (equal-frequency) bins
• Then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Regression
• Smooth by fitting the data into regression functions
• Clustering
• Detect and remove outliers
• Combined computer and human inspection
• Detect suspicious values and check by human (e.g., deal with possible outliers)
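A minimal sketch of the binning approach, assuming a small list of illustrative numbers (the `prices` values and the function names are hypothetical, chosen only to show the mechanics). The data is sorted, split into equal-frequency bins, and then smoothed either by bin means or by bin boundaries.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size.

    Assumes len(values) >= n_bins so that no bin is empty.
    """
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Illustrative data only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With three bins, the first bin `[4, 8, 15]` becomes `[9, 9, 9]` under mean smoothing and `[4, 4, 15]` under boundary smoothing.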
Preprocessing Exercise with RapidMiner
• Clean the noisy data (using Replace), (using a regular expression),
(using Map)
• Import the data MissingData-Noisy.csv
• Use a regular expression (Replace operator) to replace all noisy data
in the nominal attributes with “N”
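The same idea can be sketched outside RapidMiner with Python's `re` module. This is an assumption-heavy illustration: the slides do not say which values count as noisy, so the sketch assumes that the only valid nominal values are exactly "Y" and "N" and that everything else is noise. The sample values and the helper name `clean_nominal` are hypothetical.

```python
import re

# Assumption: a nominal value is valid only if it is exactly "Y" or "N".
# The pattern matches any whole string that is NOT exactly "Y" or "N".
NOISY = re.compile(r"^(?![YN]$).*$")

def clean_nominal(value):
    """Replace a noisy nominal value with "N"; leave missing values (None) alone."""
    if value is None:
        return None
    return NOISY.sub("N", value)

# Hypothetical raw column with a few noisy entries.
raw = ["Y", "N", "99", "y", "?", None, "Y"]
cleaned = [clean_nominal(v) for v in raw]
```

Missing values are deliberately left untouched here so that noise handling and missing-value handling stay separate steps.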
Data Mining Roles (Larose, 2005)
1. Estimation
2. Forecasting
3. Classification
4. Clustering
5. Association
Data Mining Process
Evaluating Data Mining Model Performance
1. Estimation:
Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
2. Prediction/Forecasting:
Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
3. Classification:
Confusion Matrix: Accuracy
ROC Curve: Area Under Curve (AUC)
4. Clustering:
Internal Evaluation: Davies–Bouldin index, Dunn index
External Evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index,
confusion matrix
5. Association:
Lift Charts: Lift Ratio
Precision and Recall (F-measure)
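For the error measures listed under estimation and forecasting, a minimal sketch of the usual definitions follows. The `actual` and `predicted` values are made up for illustration, and the function names are hypothetical helpers rather than any library's API.

```python
from math import sqrt

def mse(actual, predicted):
    """Mean squared error: average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: square root of the MSE, in the target's units."""
    return sqrt(mse(actual, predicted))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is 0.
    """
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative values only.
actual = [100, 200, 400]
predicted = [110, 190, 380]

mse_value = mse(actual, predicted)
rmse_value = rmse(actual, predicted)
mape_value = mape(actual, predicted)
```

Lower values are better for all three. RMSE is expressed in the same unit as the target, while MAPE is scale-free.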
Data Mining Model Evaluation
• Split the dataset, with a ratio of 90:10 or 80:20:
• Training data
• Testing data
• The training data is used to build the model, and the testing data is used to
test the model
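A minimal sketch of such a split, assuming the dataset is a plain list of records. The helper `split_dataset`, its parameters, and the fixed seed are illustrative assumptions, not part of the exercise.

```python
import random

def split_dataset(records, test_ratio=0.2, seed=42):
    """Shuffle a copy of the records and split it into (training, testing) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 100 records, split 90:10.
records = list(range(100))
training, testing = split_dataset(records, test_ratio=0.1)
```

Shuffling before splitting avoids a biased test set when the original file is sorted, for example by date or by class.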
pred MACET, true MACET: number of records predicted MACET that are actually MACET (TP)
pred LANCAR, true LANCAR: number of records predicted LANCAR that are actually LANCAR (TN)
pred MACET, true LANCAR: number of records predicted MACET that are actually LANCAR (FP)
pred LANCAR, true MACET: number of records predicted LANCAR that are actually MACET (FN)
\[
\mathit{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{52 + 38}{52 + 38 + 3 + 7} = \frac{90}{100} = 90\%
\]
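The accuracy calculation can be checked with a short sketch using the same counts (TP = 52, TN = 38, FP = 3, FN = 7). Precision and recall are included only as the standard companion measures. The function names are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all records that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Share of records predicted positive that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actually positive records that were predicted positive."""
    return tp / (tp + fn)

# Counts from the confusion matrix, with MACET as the positive class.
tp, tn, fp, fn = 52, 38, 3, 7

acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)
```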
Guide for Classifying the AUC
(Gorunescu, 2011)
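The guide table itself is not reproduced in this text. The sketch below uses the bands commonly quoted from Gorunescu (2011). Treat the thresholds as an assumption and verify them against the source before relying on them. The function name `classify_auc` is hypothetical.

```python
def classify_auc(auc):
    """Map an AUC value to a qualitative label.

    Bands assumed from the commonly quoted Gorunescu (2011) guide:
    0.90-1.00 excellent, 0.80-0.90 good, 0.70-0.80 fair,
    0.60-0.70 poor, below 0.60 failure.
    """
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must be between 0 and 1")
    if auc >= 0.90:
        return "excellent classification"
    if auc >= 0.80:
        return "good classification"
    if auc >= 0.70:
        return "fair classification"
    if auc >= 0.60:
        return "poor classification"
    return "failure"

label = classify_auc(0.85)
```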
Exercise: Stock Price Prediction