Data Mining
• Start Immediately
• Minimize the Reading
• Cognitive Competencies
• Affective Performance
• Error Recognition
Textbooks
Pre-Test
1. Explain the difference between data, information, and knowledge!
2. Explain what you know about data mining!
3. Name the main roles of data mining!
4. Give examples of how data mining is used in various fields!
5. What knowledge can we extract from the data below?
NIM   | Gender | Nilai UN | Asal Sekolah | IPS1 | IPS2 | IPS3 | IPS4 | ... | Lulus Tepat Waktu
10001 | L      | 28       | SMAN 2       | 3.3  | 3.6  | 2.89 | 2.9  | ... | Ya
10002 | P      | 27       | SMAN 7       | 4.0  | 3.2  | 3.8  | 3.7  | ... | Tidak
10003 | P      | 24       | SMAN 1       | 2.7  | 3.4  | 4.0  | 3.5  | ... | Tidak
10004 | L      | 26.4     | SMAN 3       | 3.2  | 2.7  | 3.6  | 3.4  | ... | Ya
...
11000 | L      | 23.4     | SMAN 5       | 3.3  | 2.8  | 3.1  | 3.2  | ... | Ya
Course Outline
1. Introduction
   1.1 What Is Data Mining and Why?
   1.2 Main Roles and Methods of Data Mining
   1.3 History and Applications of Data Mining
2. The Data Mining Process
   2.1 Data Mining Process and Tools
   2.2 Applying the Data Mining Process
   2.3 Evaluating Data Mining Models
   2.4 The CRISP-DM-based Data Mining Process
1. Introduction to Data Mining
1.1 What Is Data Mining and Why?
1.2 Main Roles and Methods of Data Mining
1.3 History and Applications of Data Mining

1.1 What Is Data Mining and Why?
Humans Produce Data
Humans produce data of many kinds, in very large quantities and sizes:
• Astronomy
• Business
• Medicine
• Economics
• Sports
• Weather
• Finance
• …
Data Growth
Astronomy:
• Sloan Digital Sky Survey (New Mexico, 2000): 140 TB over 10 years
• Large Synoptic Survey Telescope (Chile, 2016): will acquire 140 TB every five days

Units of data: kilobyte (kB) 10^3, megabyte (MB) 10^6, gigabyte (GB) 10^9, terabyte (TB) 10^12, petabyte (PB) 10^15, exabyte (EB) 10^18, zettabyte (ZB) 10^21, yottabyte (YB) 10^24
Changes in Culture and Behavior
The Coming Data Tsunami
• Mobile electronics market: 7B smartphone subscriptions in 2015
Drowning in Data, yet Starving for Knowledge
Turning Data into Knowledge
• Data must be processed into knowledge so that it becomes useful to people
• With that knowledge, people can:
  • Estimate and predict what will happen in the future
  • Analyze associations, correlations, and groupings among data and attributes
  • Support decision making and policy making
Data – Information – Knowledge – Policy
[Employee attendance table: employee IDs (1103, 1142, 1156, 1173, 1180) with counts of Late (Terlambat), Early Leave (Pulang Cepat), Permission (Izin), and Absence (Alpa)]
Data – Information – Knowledge – Policy
(Data: employee attendance data)

Data – Information – Knowledge – Policy
What Is Data Mining?
The Concept of the Data Mining Process
Definition of Data Mining
• The extraction of implicit, previously unknown, and potentially useful information from data (Witten et al., 2011)
Example: Data on Campus
• Tens of thousands of student records on campus, taken from the academic information system
• Have we ever turned them into more useful knowledge? NO!
• What does that knowledge look like? Formulas, patterns, rules
Predicting Student Graduation
Example: Data at the General Election Commission (KPU)
• Tens of thousands of legislative candidate records at the KPU
• Have we ever turned them into more useful knowledge? NO!
Predicting Legislative Candidates in DKI Jakarta
Determining Credit Worthiness
[Bar chart: number of non-performing loans (jumlah kredit macet), 2003-2004]
Money Laundering Detection
Forest Fire Prediction
FFMC | DMC   | DC    | ISI  | temp | RH | wind | rain | ln(area+1)
93.5 | 139.4 | 594.2 | 20.3 | 17.6 | 52 | 5.8  | 0    | 0
92.4 | 124.1 | 680.7 | 8.5  | 17.2 | 58 | 1.3  | 0    | 0
90.9 | 126.5 | 686.5 | 7    | 15.6 | 66 | 3.1  | 0    | 0
85.8 | 48.3  | 313.4 | 3.9  | 18   | 42 | 2.7  | 0    | 0.307485
91   | 129.5 | 692.6 | 7    | 21.7 | 38 | 2.2  | 0    | 0.357674
90.9 | 126.5 | 686.5 | 7    | 21.9 | 39 | 1.8  | 0    | 0.385262
95.5 | 99.9  | 513.3 | 13.2 | 23.3 | 31 | 4.5  | 0    | 0.438255
[Bar chart: tuned parameters (C, Gamma, Epsilon) and RMSE for SVM vs. SVM+GA; RMSE 1.391 (SVM) vs. 1.379 (SVM+GA)]
Profiling and Predicting Corruption Suspects
[Diagram: corruption-suspect data → attribute associations → knowledge]
Profile Patterns of Corruption Suspects
Profiling and Detecting Migrant Worker (TKI) Cases
Clustering Poverty Levels
Association Rule Patterns from Transaction Data
Association Rule Patterns at Amazon.com
From Stupid Applications to Smart Applications
Industrial Revolution 4.0
EFFICIENCY: ever faster and ever cheaper
Knowledge-Processing Companies
• Uber - the world's largest taxi company, owns no vehicles
• Google - the world's largest media/advertising company, creates no content
• Alibaba - the most valuable retailer, has no inventory
• Airbnb - the world's largest accommodation provider, owns no real estate
• Gojek - a public transportation company, owns no vehicles
Data Mining Tasks and Roles in General
[Pyramid diagram: increasing potential value to support business decisions, from data sources up to end-user decision making]
Data Mining Tasks and Roles
in Product Development
Computing Service
[Illustration: roles in a computing service organization - IT governance, research, operations, infrastructure & security, software engineering, product management, and data science - framed by the question of how to turn an unsellable music-CD product into a music subscription application]
The Relationship between Data Mining and Other Fields
• Statistics
• Machine Learning
• Computing Algorithms
Challenges in Data Mining
1. Tremendous amount of data
• Algorithms must be highly scalable to handle terabytes of data
2. High dimensionality of data
• Micro-array data may have tens of thousands of dimensions
3. High complexity of data
• Data streams and sensor data
• Time-series data, temporal data, sequence data
• Structured data, graphs, social networks and multi-linked data
• Heterogeneous databases and legacy databases
• Spatial, spatiotemporal, multimedia, text and Web data
• Software programs, scientific simulations
4. New and sophisticated applications
Exercise
1. Explain in your own words what is meant by data mining.
1.2 Main Roles and Methods of Data Mining
Main Roles of Data Mining
1. Estimation
2. Forecasting
3. Classification
4. Clustering
5. Association
Dataset
• Columns: Attribute/Feature/Dimension, with one designated as the Class/Label/Target
• Rows: Record/Object/Sample/Tuple/Data point
• Value types: Nominal and Numeric
Data Types
• Numeric (Continuous)
• Nominal (Discrete)
Data Type | Description | Examples | Operations
Interval (Distance) | Data obtained by measurement, where the distance between two points on the scale is known; has no absolute zero point | Temperature 0°C-100°C; age 20-30 years | mean, standard deviation, Pearson's correlation, t and F tests (+, -)
Nominal (Label) | Data obtained by categorization or classification; values only distinguish different objects | Postal code; gender; employee ID number; city name | mode, entropy, contingency correlation, χ² test (=, ≠)
Main Roles of Data Mining
1. Estimation
2. Forecasting
3. Classification
4. Clustering
5. Association
1. Estimation: Pizza Delivery Time
Label: Waktu Tempuh (delivery time)
Customer | Orders (P) | Traffic Lights (TL) | Distance (J) | Delivery Time (T)
1        | 3          | 3                   | 3            | 16
2        | 1          | 7                   | 4            | 20
3        | 2          | 4                   | 6            | 18
4        | 4          | 6                   | 8            | 36
...
1000     | 2          | 4                   | 2            | 12

Learning with an estimation method (Linear Regression)
Output/Pattern/Model/Knowledge
1. Formula/Function (regression formula or function), e.g. (see the sketch below):
   • WAKTU TEMPUH = 0.48 + 0.6 JARAK + 0.34 LAMPU + 0.2 PESANAN
4. Rule
   • IF ips3 = 2.8 THEN lulustepatwaktu
5. Cluster
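To make applying such a model concrete, here is a minimal Python sketch that evaluates the regression formula above for a new order; the coefficients come from the slide, while the function name and example inputs are our own illustration.

```python
# Hypothetical sketch: applying the delivery-time regression formula.
# Coefficients are taken from the slide; the example order is invented.
def estimate_delivery_time(jarak, lampu, pesanan):
    # jarak = distance, lampu = traffic lights, pesanan = orders
    return 0.48 + 0.6 * jarak + 0.34 * lampu + 0.2 * pesanan

# A new order: distance 5, 4 traffic lights, 2 pizzas.
print(estimate_delivery_time(jarak=5, lampu=4, pesanan=2))  # 5.24
```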
2. Forecasting: Stock Prices
Label: time series
Learning with a forecasting method (Neural Network)
Knowledge in the Form of a Neural Network Formula
Prediction Plot
Weather Forecasting
Exchange Rate Forecasting
Inflation Rate Forecasting
3. Classification: Student Graduation
Label: Lulus Tepat Waktu (graduates on time)
NIM   | Gender | Nilai UN | Asal Sekolah | IPS1 | IPS2 | IPS3 | IPS4 | ... | Lulus Tepat Waktu
10001 | L      | 28       | SMAN 2       | 3.3  | 3.6  | 2.89 | 2.9  | ... | Ya
10002 | P      | 27       | SMA DK       | 4.0  | 3.2  | 3.8  | 3.7  | ... | Tidak
10003 | P      | 24       | SMAN 1       | 2.7  | 3.4  | 4.0  | 3.5  | ... | Tidak
10004 | L      | 26.4     | SMAN 3       | 3.2  | 2.7  | 3.6  | 3.4  | ... | Ya
...
11000 | L      | 23.4     | SMAN 5       | 3.3  | 2.8  | 3.1  | 3.2  | ... | Ya

Learning with a classification method (C4.5)
Knowledge in the Form of a Decision Tree
Example: Play Golf Recommendation
• Input:
• Output (Rules):
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
Example: Play Golf Recommendation
• Output (Tree):
Example: Contact Lens Recommendation
• Input:
Example: Contact Lens Recommendation
• Output/Model (Tree):
Sentiment Analysis Classification
Bankruptcy Prediction
4. Clustering: Iris Flowers
Dataset without labels
Learning with a clustering method (K-Means)
Knowledge (Model) in the Form of Clusters
Clustering Customer Types
Clustering Citizen Sentiment
Poverty Rate Clustering
5. Association Rules: Purchases
Learning with an association method (FP-Growth)
Knowledge in the Form of Association Rules
Example of an Association Rule
• An association rule algorithm finds attributes that "occur together"
• Example: on a Thursday night, 1,000 customers shopped at supermarket ABC, where:
  • 200 people bought bath soap
  • of those 200 buyers of bath soap, 50 also bought Fanta
• The association rule is therefore "IF buys bath soap THEN buys Fanta", with support = 200/1000 = 20% and confidence = 50/200 = 25% (see the sketch below)
• Association rule algorithms include the Apriori algorithm, the FP-Growth algorithm, and the GRI algorithm
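A minimal sketch of how support and confidence are computed from raw transactions, following the slide's definitions; the toy basket list and function name are our own invention, not from the slides.

```python
# Hypothetical sketch: support and confidence for "IF soap THEN fanta".
def support_confidence(transactions, antecedent, consequent):
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent in t)
    has_both = sum(1 for t in transactions
                   if antecedent in t and consequent in t)
    support = has_a / n            # slide counts the antecedent buyers
    confidence = has_both / has_a  # fraction of soap buyers who buy fanta
    return support, confidence

# Toy data: 2 of 4 baskets contain soap; 1 of those also contains fanta.
baskets = [{"soap", "fanta"}, {"soap"}, {"bread"}, {"milk"}]
print(support_confidence(baskets, "soap", "fanta"))  # (0.5, 0.5)
```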
Association Rules at Amazon.com
Heating Oil Consumption
Correlation between heating oil consumption and the factors below:
Correlation of 4 Variables with Heating Oil Consumption
• Number of occupants (Jumlah Penghuni Rumah): 0.381
• Average age (Rata-Rata Umur): 0.848
• Insulation thickness (Ketebalan Insulasi Rumah): 0.736
• Temperature (Temperatur): -0.774
Insight Law (Data Mining Law 6)
Data mining amplifies perception in the
business domain
• How does data mining produce insight? This law approaches the
heart of data mining – why it must be a business process and not a
technical one
• Business problems are solved by people, not by algorithms
• The data miner and the business expert "see" the solution to a problem, that is, the patterns in the domain that allow the business objective to be achieved
• Thus data mining is, or assists as part of, a perceptual process
• Data mining algorithms reveal patterns that are not normally visible to
human perception
• Within the data mining process, the human problem solver
interprets the results of data mining algorithms and integrates
them into their business understanding
Data Mining Methods
1. Estimation: Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc.
2. Forecasting: Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc.
3. Classification: Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptive Credal C4.5), Naive Bayes (NB), K-Nearest Neighbor (kNN), Linear Discriminant Analysis (LDA), Logistic Regression (LogR), etc.
4. Clustering: K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means (FCM), etc.
5. Association: FP-Growth, Apriori, Coefficient of Correlation, Chi Square, etc.
Output/Pattern/Model/Knowledge
3. Correlation level
4. Rule
   • IF ips3 = 2.8 THEN lulustepatwaktu
5. Cluster
Categories of Data Mining Algorithms
• Supervised Learning
• Semi-Supervised Learning
• Unsupervised Learning
1. Supervised Learning
Dataset with a Class
• Columns: Attribute/Feature/Dimension plus a Class/Label/Target
• Value types: Nominal and Numeric
2. Unsupervised Learning
Dataset without a Class
• Attribute/Feature/Dimension columns only
3. Semi-Supervised Learning
• Semi-supervised learning is a data mining method that uses both labeled and unlabeled data in its learning process
1.3 History and Applications of Data Mining
Evolution of Sciences
• Before 1600: Empirical science
• Something counted as science only if it was directly observable
(Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Communications of the ACM, 45(11): 50-54, Nov. 2002)
Industrial Revolution 4.0
[Diagram: Business - Knowledge - Methods - Technology stack]
Business Goals Law (Data Mining Law 1)
Business objectives are the origin of every data
mining solution
Private and Commercial Sector
• Marketing: product recommendation, market basket
analysis, product targeting, customer retention
• Finance: investment support, portfolio management, price
forecasting
• Banking and Insurance: credit and policy approval, money laundering detection
• Security: fraud detection, access control, intrusion
detection, virus detection
• Manufacturing: process modeling, quality control, resource
allocation
• Web and Internet: smart search engines, web marketing
• Software Engineering: effort estimation, fault prediction
• Telecommunication: network monitoring, customer churn
prediction, user behavior analysis
Use Case: Product Recommendation
[Scatter chart per customer: total spending (Tot.Belanja), number of pieces (Jml.Pcs), and number of items (Jml.Item)]
Use Case: Software Fault Prediction
Public and Government Sector
• Finance: exchange rate forecasting, sentiment analysis
• Taxation: adaptive monitoring, fraud detection
• Medicine and Health Care: hypothesis discovery, disease prediction and classification, medical diagnosis
• Education: student allocation, resource forecasting
• Insurance: worker's compensation analysis
• Security: bomb and iceberg detection
• Transportation: simulation and analysis, load estimation
• Law: legal patent analysis, law and rule analysis
• Politics: election prediction
Examples of Data Mining Applications
• Determining eligibility for home-ownership loans at a bank
• Determining PLN electricity supply for the Jakarta region
• Predicting the profile of corruption suspects from court data
• Forecasting stock prices and inflation rates
• Analyzing customer shopping patterns
• Separating crude oil and natural gas
• Identifying the patterns of loyal customers at a telephone operator company
• Detecting money laundering in banking transactions
• Detecting intrusions in a network
Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
• Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
Conferences:
• ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
• SIAM Data Mining Conf. (SDM)
• (IEEE) Int. Conf. on Data Mining (ICDM)
• European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
• Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
• Int. Conf. on Web Search and Data Mining (WSDM)

Journals:
• ACM Transactions on Knowledge Discovery from Data (TKDD)
• ACM Transactions on Information Systems (TOIS)
• IEEE Transactions on Knowledge and Data Engineering
• Springer Data Mining and Knowledge Discovery
• International Journal of Business Intelligence and Data Mining (IJBIDM)
2. The Data Mining Process
2.1 Data Mining Process and Tools
2.2 Applying the Data Mining Process
2.3 Evaluating Data Mining Models
2.4 The CRISP-DM-based Data Mining Process

2.1 Data Mining Process and Tools
The Data Mining Process
1. Understand and prepare the data
2. Choose a method that suits the characteristics of the data
3. Understand the resulting model and knowledge
4. Analyze the model and the method's performance
1. Dataset
• An attribute is a factor or parameter that causes the class/label/target
• There are two kinds of datasets: private and public
• Private dataset: taken from the organization that is the object of our research
  • Banks, hospitals, industry, factories, service companies, etc.
• Public dataset: taken from public repositories agreed upon by data mining researchers
  • UCI Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html)
  • ACM KDD Cup (http://www.sigkdd.org/kddcup/)
  • PredictionIO (http://docs.prediction.io/datacollection/sample/)
• The current trend in data mining research is to test newly developed methods on public datasets, so that the research is comparable, repeatable, and verifiable
Public Data Set (UCI Repository)
Dataset
• Columns: Attribute/Feature/Dimension, with one designated as the Class/Label/Target
• Rows: Record/Object/Sample/Tuple/Data point
• Value types: Nominal and Numeric
2. Data Mining Methods
1. Estimation: Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc.
2. Forecasting: Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc.
3. Classification: Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptive Credal C4.5), Naive Bayes (NB), K-Nearest Neighbor (kNN), Linear Discriminant Analysis (LDA), Logistic Regression (LogR), etc.
4. Clustering: K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means (FCM), etc.
5. Association: FP-Growth, Apriori, Coefficient of Correlation, Chi Square, etc.
3. Knowledge (Pattern/Model)
1. Formula/Function (regression formula or function)
   • WAKTU TEMPUH = 0.48 + 0.6 JARAK + 0.34 LAMPU + 0.2 PESANAN
3. Correlation level
4. Rule
   • IF ips3 = 2.8 THEN lulustepatwaktu
5. Cluster
4. Evaluation (Accuracy, Error, etc.)
1. Estimation:
Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
2. Prediction/Forecasting:
Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
3. Classification:
Confusion Matrix: Accuracy
ROC Curve: Area Under Curve (AUC)
4. Clustering:
Internal evaluation: Davies-Bouldin index, Dunn index
External evaluation: Rand measure, F-measure, Jaccard index, Fowlkes-Mallows index, Confusion matrix
5. Association:
Lift Charts: Lift Ratio
Precision and Recall (F-measure)
Criteria for Evaluating and Validating Models
1. Accuracy
• A measure of how well the model correlates the outcome with the attributes in the data provided
• There are various accuracy measures, but all of them depend on the data used
2. Reliability
• A measure of how the data mining model performs when applied to different datasets
• A data mining model is reliable if it produces the same general patterns regardless of the testing data provided
3. Usefulness
• Covers the various metrics that measure whether the model provides useful information
A balance among the three is needed, because an accurate model is not necessarily reliable, and a reliable or accurate model is not necessarily useful
Magic Quadrant for Data Science Platform (Gartner, 2017)
Magic Quadrant for Data Science Platform (Gartner, 2018)
KNIME
• KNIME (Konstanz Information Miner) is a free and open-source data mining platform for analytics, reporting, and data integration
• Development of KNIME began in 2004 with a software engineering team at the University of Konstanz, led by Michael Berthold; it was initially used for research in the pharmaceutical industry
• It came into wide use from 2006 onward, growing rapidly until, in 2017, it entered Gartner's Magic Quadrant for Data Science Platforms

KNIME
RapidMiner
• Development began in 2001 with Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the Artificial Intelligence Unit of the University of Dortmund; written in Java
Attribute Roles in RapidMiner
1. Attribute: a characteristic or feature of the data that describes a process or situation
• id, regular attributes
Attribute Value Types in RapidMiner
1. nominal: categorical values
2. binominal: nominal with exactly two values
3. polynominal: nominal with more than two values
4. numeric: numeric values in general
5. integer: whole numbers
6. real: real numbers
7. text: free unstructured text
8. date_time: date and time
9. date: date only
10. time: time only
Perspectives and Views
1. Welcome perspective
2. Design perspective
3. Result perspective
Design Perspective
• The perspective in which all processes are created and managed
• Switch to the Design perspective by clicking:
Operators View
• Process Control: to control the process flow, e.g., loops or conditional branches
• Utility: to group subprocesses, plus macros and loggers
• Repository Access: to read from and write to repositories
• Import: to read data from various external formats
• Export: to write data to various external formats
• Data Transformation: to transform data and metadata
• Modelling: for data mining processes such as classification, regression, clustering, association, etc.
• Evaluation: to measure the quality and performance of models
Process View
Operators and Processes
• A data mining process is essentially an analysis process containing a workflow of data mining components
• The components of this process are called operators; an operator has:
1. Input
2. Output
3. An action it performs
4. Required parameters
Help View and Comment View
• The Help view shows a description of the operator
• The Comment view shows editable comments attached to the operator
Designing a Process
A collection and sequence of functions (operators) that can be arranged visually (visual programming)
Running a Process
A process can be run by:
• Pressing the Play button
• Selecting the menu Process → Run
• Pressing F11
Viewing the Results
Problems View and Log View
Installing RapidMiner and Registering a License
• Install RapidMiner version 9
• Register an account at rapidminer.com and obtain an Educational Program license to process data without record limits
2.2 Applying the Data Mining Process
The Data Mining Process
1. Understand and prepare the data
2. Choose a method that suits the characteristics of the data
3. Understand the resulting model and knowledge
4. Analyze the model and the method's performance
Exercise: Play Golf Recommendation
1. Train on the golf data (taken from the RapidMiner repository) using the decision tree algorithm
Exercise: Classifying Iris Flowers
1. Train on the Iris data (from the RapidMiner repository) using the decision tree algorithm
2. Display the dataset and the knowledge (tree model) that is formed
Exercise: Clustering Iris Flowers
1. Train on the Iris data (from the RapidMiner repository) using the k-Means algorithm
2. Display the dataset and the knowledge (cluster model) that is formed
3. Display a chart of the resulting clusters
Exercise: Mine/Rock Classification
1. Train on the Sonar data (from the RapidMiner repository) using the decision tree algorithm (C4.5)
2. Display the dataset and the knowledge (tree model) that is formed
Exercise: Contact Lenses Recommendation
1. Train on the Contact Lenses data (contact-lenses.xls) using the decision tree algorithm
2. Use the Read Excel operator (on the fly) or the Import Data feature (persistent)
3. Display the dataset and the knowledge (tree model) that is formed
Read Excel Operator
Import Data Function
Exercise: Estimating CPU Performance
1. Train on the CPU data (cpu.xls) using the linear regression algorithm
2. Test the model produced in step 1 against new data (cpu-testing.xls); the new data contains 10 configuration settings whose performance is not yet known
3. Examine the estimated performance of those 10 configurations

Performance estimation: cpu.xls and cpu-testing.xls
Process: Predicting Legislative Candidate Electability
Exercise: Estimating Heating Oil Consumption
1. Train on the heating oil consumption data (HeatingOil.csv)
   • A dataset of annual heating oil consumption per home
   • Attributes:
     • Insulation: thickness of the home's insulation
     • Temperature: air temperature around the home
     • Heating Oil: annual oil consumption per home
     • Number of Occupants: number of people living in the home
     • Average Age: average age of the occupants
     • Home Size: size of the home
2. Use the Set Role operator to select the label (Heating Oil), instead of selecting it directly during Import Data
3. Choose an appropriate method to produce a model
4. Apply the resulting model to the new-customer data in HeatingOil-Scoring.csv, so that we can estimate their oil consumption needs in order to manage oil sales stock
Process: Estimating Heating Oil Consumption
Exercise: Correlation Matrix for Heating Oil Consumption
1. Train on the heating oil consumption data (HeatingOil.csv)
   • A dataset of annual heating oil consumption per home
   • Attributes:
     • Insulation: thickness of the home's insulation
     • Temperature: air temperature around the home
     • Heating Oil: annual oil consumption per home
     • Number of Occupants: number of people living in the home
     • Average Age: average age of the occupants
     • Home Size: size of the home
2. The goal is to find out which attributes have the greatest influence on oil consumption
Correlation of 4 Attributes with Heating Oil Consumption
• Number of occupants (Jumlah Penghuni Rumah): 0.381
• Average age (Rata-Rata Umur): 0.848
• Insulation thickness (Ketebalan Insulasi Rumah): 0.736
• Temperature (Temperatur): -0.774
Exercise: Association Rules on Transaction Data
1. Train on the transaction data (transaksi.xlsx)
Exercise: Classifying Student Graduation Data
Parameter dari Windowing
• Window size: Determines how many “attributes”
are created for the cross-sectional data
• Each row of the original time series within the window
width will become a new attribute
• We choose w = 6
• Step size: Determines how to advance the window
• Let us use s = 1
• Horizon: Determines how far out to make the
forecast
• If the window size is 6 and the horizon is 1, then the
seventh row of the original time series becomes the first
sample for the “label” variable
• Let us use h = 1
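To make the windowing transform concrete, here is a hedged plain-Python sketch of turning a univariate series into cross-sectional rows with window size w = 6, step s = 1, and horizon h = 1; the function name and toy series are our own, and RapidMiner's Windowing operator performs the equivalent internally.

```python
# Hypothetical sketch of the windowing transform (w=6, s=1, h=1):
# each window of 6 consecutive values becomes the attributes of one
# example, and the value h steps past the window becomes its label.
def windowing(series, w=6, s=1, h=1):
    rows = []
    for start in range(0, len(series) - w - h + 1, s):
        attributes = series[start:start + w]  # w consecutive values
        label = series[start + w + h - 1]     # value h steps beyond the window
        rows.append((attributes, label))
    return rows

prices = [10, 11, 12, 13, 14, 15, 16, 17]     # toy price series
for attrs, label in windowing(prices):
    print(attrs, "->", label)                 # e.g. [10..15] -> 16
```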
Exercise
• Train a linear regression model on the dataset hargasaham-training-uni.xls
• Use Split Data to separate the dataset above: 90% for training and 10% for testing
• A Windowing step must be applied to the dataset
• Plot a chart of the label against the predictions
Stock Price Forecasting (Historical Data)

Stock Price Forecasting (Future Data)
Exercise: Determining Credit Worthiness
Exercise: Breast Cancer Detection
1. Train on the breast cancer data (breasttissue.xls)
Exercise: Credit Risk Classification
1. Train on the credit risk data (CreditRisk.csv)
(http://romisatriawahono.net/lecture/dm/dataset/)
Exercise: Music Genre Classification
1. Train on the Music Genre data (musicgenre-small.csv)
Marketing Profile and Performance Data
Which attributes are suitable as the class, and which are not?
NIP  | Gender | Universitas | Program Studi | IPK | Usia | Hasil Penjualan | Status Keluarga | Jumlah Anak | Kota Tinggal
1001 | L      | UI          | Komunikasi    | 3.1 | 21   | 100jt           | Single          | 0           | Jakarta
1002 | P      | UNDIP       | Informatika   | 2.9 | 26   | 50jt            | menikah         | 1           | Bekasi
...  | ...    | ...         | ...           | ... | ...  | ...             | ...             | ...         | ...
1001 | L | 10 | 20 | 50 | 30 | 100jt
1002 | P | 10 | 10 | 5  | 25 | 50jt
...  | ... | ... | ... | ... | ... | ...
(second table: column headings not recoverable from the slide)
Marketing Profile and Performance Data
Which attributes are suitable as the class, and which are not?
Lecturer Profile and Performance Data
NIP  | Gender | Universitas | Program Studi | Absensi | Usia | Jumlah Penelitian | Status Keluarga | Disiplin | Kota Tinggal
1001 | L      | UI          | Komunikasi    | 98%     | 21   | 3                 | Single          | Baik     | Jakarta
1002 | P      | UNDIP       | Informatika   | 50%     | 26   | 4                 | menikah         | Buruk    | Bekasi
...  | ...    | ...         | ...           | ...     | ...  | ...               | ...             | ...      | ...
1001 | L | UI    | Komunikasi  | 5 | 3 | 8
1002 | P | UNDIP | Informatika | 2 | 1 | 3
...  | ... | ... | ... | ... | ... | ...
(second table: column headings not recoverable from the slide)
Competency Check
1. Dataset - Methods - Knowledge
   1. Main Golf dataset (classification)
   2. Iris dataset (classification)
   3. Iris dataset (clustering)
   4. CPU dataset (estimation)
   5. Election dataset (classification)
   6. Heating Oil dataset (association, estimation)
   7. Transaction dataset (association)
   8. Stock price dataset (forecasting) (univariate and multivariate)
Assignment: Finding and Processing Datasets
• Study the various datasets in the dataset folder
• Use RapidMiner to process those datasets into knowledge
• Choose algorithms that suit the type of data in each dataset
Assignment: Mastering One DM Method
1. Understand and master one data mining method from the literature:
   1. Naïve Bayes
   2. k-Nearest Neighbor
   3. k-Means
   4. C4.5
   5. Neural Network
   6. Logistic Regression
   7. FP-Growth
   8. Fuzzy C-Means
   9. Self-Organizing Map
   10. Support Vector Machine
2. Summarize it in detail in slide form, with the format:
   1. Definition (solid and systematic)
   2. Algorithm steps (complete with their formulas)
   3. Application of the algorithm steps to a case-study dataset (Main Golf, Iris, Transaksi, CPU, etc.)
      (compute manually, using Excel, not RapidMiner; it must match the algorithm steps)
3. Send the slides and the Excel file to [email protected] one day before the next lecture
4. Present it in class at the next lecture, in clear and proper language
Assignment: Develop Code for a DM Algorithm
1. Develop Java code for the chosen algorithm
2. Use only one class (file), named after the algorithm; you may create many methods inside that class
3. Create an account at Trello.com and register at https://trello.com/b/ZOwroEYg/course-assignment
4. Create a card under your own name and upload all report files (pptx, xlsx, pdf, etc.) to that card
5. Deadline: one day before the next meeting
k-Means Algorithm
Assignment Template Format
Definition
• Summary of definitions:
  • K-means is ..... (John, 2016)
  • K-means is .... (Wedus, 2020)
  • K-means is ... (Telo, 2017)
Steps of the k-Means Algorithm
1. Prepare the dataset

1. Prepare the dataset
2. Determine A
• blablabla
3. Determine B
• blablabla
4. Iteration 1
• blablabla
4. Iteration 2 ... and so on
• blablabla
2.3 Evaluating Data Mining Models
The Data Mining Process
1. Understand and prepare the data
2. Choose a method that suits the characteristics of the data
3. Understand the resulting model and knowledge
4. Analyze the model and the method's performance
Evaluating Data Mining Models
1. Estimation:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
2. Prediction/Forecasting:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
3. Classification:
• Confusion Matrix: Accuracy
• ROC Curve: Area Under Curve (AUC)
4. Clustering:
• Internal evaluation: Davies-Bouldin index, Dunn index
• External evaluation: Rand measure, F-measure, Jaccard index, Fowlkes-Mallows index, Confusion matrix
5. Association:
• Lift Charts: Lift Ratio
• Precision and Recall (F-measure)
Evaluating Data Mining Models
• Split the dataset, at a 90:10 or 80:20 ratio, into:
  • Training data
  • Testing data
2.3.1 Manual Data Splitting
Exercise: Determining Credit Worthiness
• Use the datasets below:
  • creditapproval-training.xls: to build the model
  • creditapproval-testing.xls: to test the model
• The data above are already split: testing data (10%) and training data (90%)
• Use the training data to build the model and the testing data to test it; measure its performance
Confusion Matrix: Accuracy
• pred MACET, true MACET: records predicted as defaulting that actually defaulted (TP)
• pred LANCAR, true LANCAR: records predicted as current that are actually current (TN)
• pred MACET, true LANCAR: records predicted as defaulting that are actually current (FP)
• pred LANCAR, true MACET: records predicted as current that actually defaulted (FN)

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (53 + 37) / (53 + 37 + 4 + 6) = 90/100 = 90%
Precision and Recall, and F-measures
• Precision: exactness - what % of the tuples that the classifier labeled as positive are actually positive
• Recall: completeness - what % of the positive tuples the classifier labeled as positive (see the sketch below)
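A minimal sketch, using the slide's confusion-matrix counts (TP=53, TN=37, FP=4, FN=6), of how accuracy, precision, recall, and F-measure are computed; the function is our own illustration, not a RapidMiner API.

```python
# Hypothetical sketch: metrics from confusion-matrix counts.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # exactness
    recall = tp / (tp + fn)             # completeness
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts from the credit-approval example: TP=53, TN=37, FP=4, FN=6.
acc, prec, rec, f1 = classification_metrics(53, 37, 4, 6)
print(f"accuracy={acc:.2f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# accuracy=0.90 precision=0.930 recall=0.898 f1=0.914
```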
PPV and NPV
We need to know the probability that the classifier will
give the correct diagnosis, but the sensitivity and
specificity do not give us this information
• Positive Predictive Value (PPV) is the proportion of cases
with ’positive’ test results that are correctly diagnosed
ROC Curve - AUC (Area Under Curve)
• ROC (Receiver Operating Characteristics) curves: for visual
comparison of classification models
• Originated from signal detection theory
• ROC curves are two-dimensional graphs in which the TP rate is
plotted on the Y-axis and the FP rate is plotted on the X-axis
• ROC curve depicts relative trade-offs between benefits (’true
positives’) and costs (’false positives’)
• Two types of ROC curves: discrete and continuous
ROC Curve - AUC (Area Under Curve)
Guide for Classifying the AUC
(Gorunescu, 2011)
Exercise: Breast Cancer Prediction
• Use the dataset: breasttissue.xls
• Split the data: testing data (10%) and training data (90%)
• Measure performance (Accuracy and Kappa)
Kappa Statistics
• The (Cohen's) Kappa statistic is a more vigorous measure than the 'percentage correct prediction' calculation, because Kappa considers correct predictions that occur by chance
• Kappa is essentially a measure of how well the classifier performed as compared to how well it would have performed simply by chance
• A model has a high Kappa score if there is a big difference between the accuracy and the null error rate (Markham, K., 2014)
• Kappa is an important measure of classifier performance, especially on imbalanced data sets (see the sketch below)
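A small sketch of Cohen's kappa computed from a 2x2 confusion matrix; the formula (observed vs. chance agreement) is the standard definition, the counts reuse the earlier credit example, and the function name is our own.

```python
# Hypothetical sketch: Cohen's kappa from a 2x2 confusion matrix.
def cohens_kappa(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    p_observed = (tp + tn) / n
    # Chance agreement: product of marginal probabilities per class.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_chance = p_yes + p_no
    return (p_observed - p_chance) / (1 - p_chance)

print(round(cohens_kappa(53, 37, 4, 6), 3))  # ~0.795 for the credit example
```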
Exercise: Stock Price Prediction
• Use the datasets below:
  • hargasaham-training.xls: to build the model
  • hargasaham-testing.xls: to test the model
• The data above are already split: testing data (10%) and training data (90%)
• Use the training data to build the model/pattern/knowledge and the testing data to test it
• Measure performance
Root Mean Square Error
• The square root of the mean of the squared errors (see the sketch below)
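A minimal sketch of the RMSE computation for a set of predictions; the toy numbers and function are our own illustration.

```python
import math

# Hypothetical sketch: RMSE = sqrt(mean of the squared errors).
def rmse(actual, predicted):
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([16, 20, 18], [15, 21, 18]))  # sqrt((1+1+0)/3) ~ 0.816
```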
Davies–Bouldin index (DBI)
• The Davies–Bouldin index (DBI) (introduced by David L. Davies and
Donald W. Bouldin in 1979) is a metric for evaluating clustering
algorithms
• This is an internal evaluation scheme, where the validation of how
well the clustering has been done is made using quantities and
features inherent to the dataset
• As a function of the ratio of the within cluster scatter, to the between
cluster separation, a lower value will mean that the clustering is better
• This affirms the idea that no cluster has to be similar to another, and
hence the best clustering scheme essentially minimizes the Davies–
Bouldin index
• This index thus defined is an average over all the i clusters, and hence
a good measure of deciding how many clusters actually exists in the
data is to plot it against the number of clusters it is calculated over
• The number i for which this value is the lowest is a good measure of
the number of clusters the data could be ideally classified into
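As a hedged illustration, scikit-learn ships a davies_bouldin_score function implementing this index; the sketch below scores k-means clusterings of the Iris data for several k, assuming scikit-learn is installed (lower is better).

```python
# Sketch: choosing k for k-means on Iris via the Davies-Bouldin index.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score

X = load_iris().data
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(davies_bouldin_score(X, labels), 3))  # lower DBI is better
```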
2.3.2 Automatic Data Splitting with the Split Data Operator
Split Data Otomatis
• The Split Data operator takes a dataset as its input
and delivers the subsets of that dataset through its
output ports
• The sampling type parameter decides how the
examples should be shuffled in the resultant
partitions:
1. Linear sampling: Divides the dataset into partitions
without changing the order of the examples
2. Shuffled sampling: Builds random subsets of the
dataset
3. Stratified sampling: Builds random subsets and
ensures that the class distribution in the subsets is
the same as in the whole dataset
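For intuition, the hedged sketch below reproduces shuffled vs. stratified sampling with scikit-learn's train_test_split, an analogue of the Split Data operator's sampling types rather than its actual implementation; the 90:10 ratio follows the exercises.

```python
# Sketch: shuffled vs. stratified 90:10 splits (scikit-learn analogue
# of the Split Data operator's sampling types).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Shuffled sampling: random subsets of the dataset.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

# Stratified sampling: keeps the class distribution in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y)
print(len(X_tr), len(X_te))  # 135 15
```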
Exercise: Predicting Student Graduation
1. Dataset: datakelulusanmahasiswa.xls
2. Split the data into two automatically (Split Data): testing data (10%) and training data (90%)
3. Experiment with the split parameters, using Linear Sampling, Shuffled Sampling, and Stratified Sampling
4. Use the training data to build the model/pattern/knowledge and the testing data to test it
5. Apply an appropriate algorithm and measure the performance of the resulting model
Process: Predicting Student Graduation
Exercise: Estimating Heating Oil Consumption
1. Dataset: HeatingOil.csv
2. Split the data into two automatically (Split Data): testing data (10%) and training data (90%)
3. Use the training data to build the model/pattern/knowledge and the testing data to test it
4. Apply an appropriate algorithm and measure the performance of the resulting model
2.3.3 Data Splitting and Model Evaluation
Cross-Validation
• Cross-validation is used to avoid overlap in the testing data
• Steps of cross-validation:
  1. Divide the data into k subsets of equal size
  2. Use each subset in turn as testing data and the remainder as training data
• Also called k-fold cross-validation
• The subsets are often stratified before cross-validation is performed, because stratification reduces the variance of the estimate (see the sketch below)
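A hedged scikit-learn sketch of stratified 10-fold cross-validation on the Iris data, as an analogue of RapidMiner's cross-validation operator, with a decision tree standing in for C4.5:

```python
# Sketch: stratified 10-fold cross-validation with a decision tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())  # average accuracy over the 10 folds
```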
10-Fold Cross-Validation
Experiment | Accuracy
1  | 93%
2  | 91%
3  | 90%
4  | 93%
5  | 93%
6  | 91%
7  | 94%
8  | 93%
9  | 91%
10 | 90%
Average accuracy | 92%
(Orange: the k-th subset used as testing data)
10-Fold Cross-Validation
Exercise: Predicting Candidate Electability
1. Train on the election data (datapemilukpu.xls)
2. Test using 10-fold cross-validation
3. Measure performance with a confusion matrix and an ROC curve
4. Experiment: change the algorithm to Naive Bayes, k-NN, Random Forest (RF), Logistic Regression (LogR), and analyze which algorithm produces the better model (higher accuracy)

         | C4.5   | NB     | k-NN
Accuracy | 92.87% | 79.34% | 88.7%
AUC      | 0.934  | 0.849  | 0.5
Exercise: Comparing Stock Price Predictions
• Use the stock price dataset (hargasaham-training.xls)
• Test using 10-fold cross-validation
• Experiment by switching the algorithm (GLM, LR, NN, DL, SVM) and record the resulting RMSE
2.3.4 Comparing Data Mining Algorithms
Data Mining Methods
1. Estimation: Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc.
2. Forecasting: Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc.
3. Classification: Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptive Credal C4.5), Naive Bayes (NB), K-Nearest Neighbor (kNN), Linear Discriminant Analysis (LDA), Logistic Regression (LogR), etc.
4. Clustering: K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means (FCM), etc.
5. Association: FP-Growth, Apriori, Coefficient of Correlation, Chi Square, etc.
Exercise: Predicting Candidate Electability
1. Train on the election data (datapemilukpu.xls) using the algorithms:
   1. Decision Tree (C4.5)
   2. Naïve Bayes (NB)
   3. K-Nearest Neighbor (K-NN)
2. Test using 10-fold cross-validation

         | DT     | NB     | K-NN
Accuracy | 92.45% | 77.46% | 88.72%
AUC      | 0.851  | 0.840  | 0.5
Exercise: Predicting Candidate Electability
1. Train on the election data (datapemilukpu.xls) using C4.5, NB, and K-NN
2. Test using 10-fold cross-validation
3. Measure performance with a confusion matrix and an ROC curve
4. Run a t-test to determine the best model
Candidate Electability Prediction Results
• Comparison of Accuracy and AUC:
         | C4.5   | NB     | K-NN
Accuracy | 92.45% | 77.46% | 88.72%
AUC      | 0.851  | 0.840  | 0.5
• [t-test matrix over C4.5, NB, and kNN] Values with a colored background are smaller than alpha = 0.050, which indicates a probably significant difference between the mean values
• [t-test matrix over C4.5, NB, and kNN] Values with a white background are higher than alpha = 0.050, which indicates probably NO significant difference between the mean values
Statistical Analysis
1. Descriptive statistics
• Mean, standard deviation, variance, maximum, minimum, etc.
2. Inferential statistics
• Estimation
• Hypothesis testing
Inferential Statistics (Hypothesis Testing)
Use | Parametric | Non-Parametric
Two dependent samples | T test, Z test | Sign test, Wilcoxon Signed-Rank, McNemar Change test
Parametric Methods
• Parametric methods can be used if several requirements are met:
  • The samples analyzed must come from a normally distributed population
  • The amount of data is sufficiently large
  • The data analyzed are usually interval or ratio data
Non-Parametric Methods
• These methods can be used more broadly, because they do not require the data to be normally distributed
• They can be applied to nominal and ordinal data, which makes them very useful to social researchers studying consumer behavior, human attitudes, etc.
• They tend to be simpler than parametric methods
Interpreting the Statistics
• H0 = there is no significant difference
• Ha = there is a significant difference
• alpha = 0.05
• If p < 0.05, then H0 is rejected (see the sketch below)
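A hedged sketch of comparing two classifiers' per-fold accuracies with a paired t-test using SciPy; the fold accuracies are invented for illustration.

```python
# Sketch: paired t-test on per-fold accuracies of two models.
from scipy.stats import ttest_rel

# Invented 10-fold accuracies for two classifiers.
acc_c45 = [0.93, 0.91, 0.90, 0.93, 0.93, 0.91, 0.94, 0.93, 0.91, 0.90]
acc_nb  = [0.80, 0.78, 0.79, 0.81, 0.77, 0.79, 0.80, 0.78, 0.79, 0.80]

t_stat, p_value = ttest_rel(acc_c45, acc_nb)
print(p_value < 0.05)  # True -> reject H0: the difference is significant
```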
Exercise: Predicting Student Graduation
1. Train on the student data (datakelulusanmahasiswa.xls) using C4.5, ID3, NB, K-NN, RF, and LogR
2. Test using 10-fold cross-validation
3. Run a t-test to determine the best model
Student Graduation Prediction Results
• Comparison of Accuracy and AUC:
         | C4.5   | NB     | K-NN   | LogR
Accuracy | 91.55% | 82.58% | 83.63% | 77.47%
AUC      | 0.909  | 0.894  | 0.5    | 0.721
[t-test matrix over C4.5, NB, kNN, and LogR]
Exercise: Estimating Heating Oil Consumption
1. Train on the heating oil data (HeatingOil.csv) using the linear regression, neural network, support vector machine, and deep learning algorithms
2. Test with cross-validation (numerical) and compare with a t-test
3. Measure performance using RMSE (Root Mean Square Error)

     | LR | NN | SVM | DL
RMSE |    |    |     |
Ranking of the best models:
1. NN and DL
2. LR and SVM
[t-test matrix over LR, NN, DL, and SVM]
Exercise: Predicting Candidate Electability
1. Train on the election data (datapemilukpu.xls) using the Decision Tree, Naive Bayes, K-Nearest Neighbor, Random Forest, and Logistic Regression algorithms
2. Test using cross-validation
3. Measure performance with a confusion matrix and an ROC curve
4. Enter each experimental result into an Excel file

         | DT     | NB     | K-NN   | RandFor | LogReg
Accuracy | 92.21% | 76.89% | 89.63% |         |
AUC      | 0.851  | 0.826  | 0.5    |         |
Exercise: Stock Price Prediction
1. Train on the stock price data (hargasaham-training.xls) with the neural network, linear regression, and support vector machine algorithms
2. Test using cross-validation

     | LR | NN | SVM
RMSE |    |    |
Exercise: Clustering Iris Flowers
1. Train on the iris data (from the RapidMiner repository) using the k-means clustering algorithm
2. Try several values for k: 3, 4, 5, 6, 7
3. Measure performance with Cluster Distance Performance; from the Davies-Bouldin Index (DBI) analysis, determine the most optimal value of k
Davies–Bouldin index (DBI)
• The Davies–Bouldin index (DBI) (introduced by David L. Davies
and Donald W. Bouldin in 1979) is a metric for evaluating
clustering algorithms
• This is an internal evaluation scheme, where the validation of
how well the clustering has been done is made using quantities
and features inherent to the dataset
• As a function of the ratio of the within cluster scatter, to the
between cluster separation, a lower value will mean that the
clustering is better
• This affirms the idea that no cluster has to be similar to another,
and hence the best clustering scheme essentially minimizes the
Davies–Bouldin index
• This index thus defined is an average over all the i clusters, and
hence a good measure of deciding how many clusters actually
exists in the data is to plot it against the number of clusters it is
calculated over
• The number i for which this value is the lowest is a good measure
of the number of clusters the data could be ideally classified into
Evaluating Data Mining Models
1. Estimation:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
2. Prediction/Forecasting:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
3. Classification:
• Confusion Matrix: Accuracy
• ROC Curve: Area Under Curve (AUC)
4. Clustering:
• Internal evaluation: Davies-Bouldin index, Dunn index
• External evaluation: Rand measure, F-measure, Jaccard index, Fowlkes-Mallows index, Confusion matrix
5. Association:
• Lift Charts: Lift Ratio
• Precision and Recall (F-measure)
Assignment: Processing All the Datasets
1. Experiment on every dataset in the datasets folder, using the appropriate data mining methods (estimation, forecasting, classification, clustering, association)
2. Combine testing via a training-testing split with testing via cross-validation
3. Measure the performance of the resulting models using measurement methods appropriate to the chosen data mining method
4. Describe the experimental steps in detail, then analyze and synthesize the results, and write a report in slide form
5. Present it in class
Assignment: Reviewing a Paper
• Technical paper:
  • Title: Application and Comparison of Classification Techniques in Controlling Credit Risk
  • Authors: Lan Yu, Guoqing Chen, Andy Koronios, Shiwu Zhu, and Xunhua Guo
  • Download: http://romisatriawahono.net/lecture/dm/paper/
Assignment: Reviewing a Paper
• Technical paper:
  • Title: A Comparison Framework of Classification Models for Software Defect Prediction
  • Authors: Romi Satria Wahono, Nanna Suryana Herman, Sabrina Ahmad
  • Publication: Adv. Sci. Lett. Vol. 20, No. 10-12, 2014
  • Download: http://romisatriawahono.net/lecture/dm/paper
• Contents of the paper:
  • Abstract: must contain the object, problem, method, and results
  • Introduction: research background and the structure of the paper
  • Related Work: related research
  • Theoretical Foundation: the theories used
  • Proposed Method: the proposed method
  • Experimental Results: the experimental results
  • Conclusion: conclusions and future works
Competency Check
1. Dataset - Methods - Knowledge
   1. Main Golf dataset (classification)
   2. Iris dataset (classification)
   3. Iris dataset (clustering)
   4. CPU dataset (estimation)
   5. Election dataset (classification)
   6. Heating Oil dataset (association)
   7. Transaction dataset (association)
   8. Stock price dataset (forecasting)
2. Dataset - Methods - Knowledge - Evaluation
   1. Manual split
   2. Split Data operator
   3. Cross-validation
3. Methods comparison
   • t-test
4. Paper reading
   1. Lan Yu (DeLong Pearson test)
   2. Wahono (Friedman test)
2.4 The Data Mining Process Based on the CRISP-DM Methodology
Data Mining Standard Process
• Industry, with its many different fields, needs a standard process that supports the use of data mining to solve business problems
• That process must be usable across industries (cross-industry), be neutral with respect to business, tools, and applications, and be able to handle strategies for solving business problems with data mining
• In 1996 one of the standard processes of the data mining world was born, later called the Cross-Industry Standard Process for Data Mining (CRISP-DM) (Chapman, 2000)
CRISP-DM
1. Business Understanding
• Enunciate the project objectives and
requirements clearly in terms of the business
or research unit as a whole
• Translate these goals and restrictions into
the formulation of a data mining problem
definition
• Prepare a preliminary strategy for achieving
these objectives
• Designing what you are going to build
2. Data Understanding
• Collect the data
• Use exploratory data analysis to familiarize
yourself with the data and discover initial
insights
• Evaluate the quality of the data
• If desired, select interesting subsets that may
contain actionable patterns
3. Data Preparation
• Prepare from the initial raw data the final
data set that is to be used for all subsequent
phases
• Select the cases and variables you want to
analyze and that are appropriate for your
analysis
• Perform data cleaning, integration, reduction
and transformation, so it is ready for the
modeling tools
4. Modeling
• Select and apply appropriate modeling
techniques
• Calibrate model settings to optimize results
• Remember that often, several different
techniques may be used for the same data
mining problem
• If necessary, loop back to the data
preparation phase to bring the form of the
data into line with the specific requirements
of a particular data mining technique
5. Evaluation
• Evaluate the one or more models delivered in
the modeling phase for quality and
effectiveness before deploying them for use in
the field
• Determine whether the model in fact achieves
the objectives set for it in the first phase
• Establish whether some important facet of the
business or research problem has not been
accounted for sufficiently
• Come to a decision regarding use of the data
mining results
6. Deployment
• Make use of the models created:
• model creation does not signify the completion of a
project
• Example of a simple deployment:
• Generate a report
• Example of a more complex deployment:
• Implement a parallel data mining process in another
department
• For businesses, the customer often carries
out the deployment based on your model
CRISP-DM: Detail Flow
CRISP-DM Case Study
CRISP-DM
1. Business Understanding
• Problems:
• Sarah is a regional sales manager for a nationwide supplier of
fossil fuels for home heating
• Marketing performance is very poor and decreasing, while
marketing spending is increasing
• She feels a need to understand the types of behaviors and
other factors that may influence the demand for heating oil in
the domestic market
• She recognizes that there are many factors that influence
heating oil consumption, and believes that by investigating
the relationship between a number of those factors, she will
be able to better monitor and respond to heating oil demand,
and also help her to design marketing strategy in the future
• Objective:
• To investigate the relationship between a number of factors
that influence heating oil consumption
2. Data Understanding
• In order to investigate her question, Sarah has enlisted our help in
creating a correlation matrix of six attributes
• Using employer’s data resources which are primarily drawn from
the company’s billing database, we create a data set comprised of
the following attributes:
1. Insulation: This is a density rating, ranging from one to ten, indicating
the thickness of each home’s insulation. A home with a density rating
of one is poorly insulated, while a home with a density of ten has
excellent insulation
2. Temperature: This is the average outdoor ambient temperature at
each home for the most recent year, measure in degree Fahrenheit
3. Heating_Oil: This is the total number of units of heating oil
purchased by the owner of each home in the most recent year
4. Num_Occupants: This is the total number of occupants living in each
home
5. Avg_Age: This is the average age of those occupants
6. Home_Size: This is a rating, on a scale of one to eight, of the home’s
overall size. The higher the number, the larger the home
3. Data Preparation
Data set: HeatingOil.csv
3. Data Preparation
• Data set appears to be very clean with:
• No missing values in any of the six attributes
• No inconsistent data apparent in our ranges (Min-Max)
or other descriptive statistics
4. Modeling
4. Modeling
• The result of the correlation matrix is a table
• The higher the value (the deeper the purple shading), the higher the correlation (see the sketch below)
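A hedged pandas sketch of producing such a correlation matrix from HeatingOil.csv, assuming the file and the column names listed in the data understanding step:

```python
# Sketch: correlation matrix of the heating oil attributes with pandas.
import pandas as pd

df = pd.read_csv("HeatingOil.csv")  # assumes the six attributes above
matrix = df.corr(numeric_only=True)
print(matrix["Heating_Oil"].sort_values(ascending=False))
# Expected ordering per the slides: Avg_Age, Insulation, Home_Size,
# Num_Occupants, with Temperature strongly negative.
```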
5. Evaluation
[Correlation matrix with positive and negative correlations highlighted]
5. Evaluation
• The attribute (factor) that most significantly influences heating oil consumption (positive relationship) is the Average Age of the home's occupants
• The second most influential attribute is Temperature (negative relationship)
• The third most influential attribute is Insulation (positive relationship)
• Home Size has only a very small influence, while Num_Occupants has practically no influence on heating oil consumption
5. Evaluation
1. The chart shows the relationship between temperature and insulation, with color indicating oil consumption (the redder, the higher the oil demand)
2. In general, the relationship of temperature with insulation and with oil consumption is negative: the lower the temperature, the higher the oil demand (upper part of the left column), shown by the many yellow and red points
3. Insulation is also negatively related to temperature, so the lower the temperature, the more insulation is needed
4. There are some anomalies at low insulation values, where some homes still require a lot of oil
5. Evaluation
6. Deployment
Dropping the Num_Occupants attribute
6. Deployment
Adding additional attributes to the data set
• It turned out that the number of occupants in the
home didn’t correlate much with other attributes,
but that doesn’t mean that other attributes would
be equally uninteresting
• For example, what if Sarah had access to the
number of furnaces and/or boilers in each home?
• Home_size was slightly correlated with Heating_Oil
usage, so perhaps the number of instruments that
consume heating oil in each home would tell an
interesting story, or at least add to her insight
6. Deployment
Investigating the role of home insulation
• The Insulation rating attribute was fairly strongly
correlated with a number of other attributes
• There may be some opportunity there to partner
with a company that specializes in adding insulation
to existing homes
6. Deployment
Focusing the marketing efforts on cities with low temperatures and a high average age of citizens
• The temperature attribute was fairly strongly negatively correlated with heating oil consumption
• The average age attribute was the most strongly positively correlated with heating oil consumption
6. Deployment
Adding greater granularity in the data set
• This data set has yielded some interesting results, but it’s pretty
general
• We have used average yearly temperatures and total annual
number of heating oil units in this model
• But we also know that temperatures fluctuate throughout the
year in most areas of the world, and thus monthly, or even
weekly measures would not only be likely to show more detailed
results of demand and usage over time, but the correlations
between attributes would probably be more interesting
• From our model, Sarah now knows how certain attributes
interact with one another, but in the day-to-day business of doing
her job, she’ll probably want to know about usage over time
periods shorter than one year
CRISP-DM Case Study
CRISP-DM: Detail Flow
1. Business Understanding
• Problems:
• Business is booming, her sales team is signing up thousands of new
clients, and she wants to be sure the company will be able to meet
this new level of demand
• Sarah’s new data mining objective is pretty clear: she wants to
anticipate demand for a consumable product
• We will use a linear regression model to help her with her desired
predictions. She has data, 1,218 observations that give an attribute
profile for each home, along with those homes’ annual heating oil
consumption
• She wants to use this data set as training data to predict the usage
that 42,650 new clients will bring to her company
• She knows that these new clients’ homes are similar in nature to her
existing client base, so the existing customers’ usage behavior should
serve as a solid gauge for predicting future usage by new customers
• Objective:
• to predict the usage that 42,650 new clients will bring to her company
2. Data Understanding
• Sarah has assembled separate Comma Separated Values file
containing all of these same attributes, for her 42,650 new clients
• She has provided this data set to us to use as the scoring data set
in our model
• Data set comprised of the following attributes:
• Insulation: This is a density rating, ranging from one to ten, indicating
the thickness of each home’s insulation. A home with a density rating
of one is poorly insulated, while a home with a density of ten has
excellent insulation
• Temperature: This is the average outdoor ambient temperature at
each home for the most recent year, measure in degree Fahrenheit
• Heating_Oil: This is the total number of units of heating oil purchased
by the owner of each home in the most recent year
• Num_Occupants: This is the total number of occupants living in each
home
• Avg_Age: This is the average age of those occupants
• Home_Size: This is a rating, on a scale of one to eight, of the home’s
overall size. The higher the number, the larger the home
3. Data Preparation
• Filter Examples: attribute value filter or custom filter
  • Avg_Age >= 15.1
  • Avg_Age <= 72.2
• Deleted records = 42,650 - 42,042 = 608
4. Modeling
5. Evaluation - Regression Model
5. Evaluation - Prediction Results
6. Deployment
303
304
Exercise
• Thanks to the earlier data mining work, Sarah has been promoted to VP of marketing, managing hundreds of marketers
• Sarah wants each marketer to be able to predict their own potential customers independently. The problem is that HeatingOil.csv may only be accessed at the VP level (Sarah) and must not be accessed by the marketers directly
• Sarah wants each marketer to build a process that estimates the heating oil consumption of the clients they approach, using the model Sarah produced earlier, without accessing the training data (HeatingOil.csv)
• Assume that HeatingOil-Marketing.csv contains the prospective customers successfully approached by one of her marketers
• What Sarah must do is build a process to:
1. Compare the algorithms (LR, NN, SVM) and find the model with the highest accuracy, using 10-fold cross-validation
2. Save the best model to a file (Store operator)
• What each marketer must do is build a process to:
1. Read the model Sarah produced (Retrieve operator)
2. Apply it to their own HeatingOil-Marketing.csv data
• Let us help Sarah and the marketers build these two processes
305
Algorithm Comparison Process (Sarah)
306
Data Scoring Process (Marketer)
307
CRISP-DM Case Study
308
CRISP-DM
309
1. Business Understanding
• Problems:
• Budi is the Rector of Universitas Suka Belajar
• Universitas Suka Belajar has a serious problem: the on-time graduation rate of each cohort is very low
• Budi wants to understand and extract patterns from the profiles of students who graduate on time and those who do not
• With those patterns, Budi can offer counseling, remediation, and early warnings to students at risk of not graduating on time, so that they can improve and ultimately graduate on time
• Objective:
• Discover the patterns of students who graduate on time and those who do not
310
2. Data Understanding
• To address the problem, Budi pulls data from his university's academic information system
• The data are collected from student profiles and semester grade indexes, with the attributes below
1. NAMA (name)
2. JENIS KELAMIN (gender): male or female
3. STATUS MAHASISWA (student status): full-time student or working
4. UMUR (age)
5. STATUS NIKAH (marital status): married or single
6. IPS 1: semester 1 grade index
7. IPS 2: semester 2 grade index
8. IPS 3: semester 3 grade index
9. IPS 4: semester 4 grade index
10. IPS 5: semester 5 grade index
11. IPS 6: semester 6 grade index
12. IPS 7: semester 7 grade index
13. IPS 8: semester 8 grade index
14. IPK: cumulative grade index (GPA)
311
3. Data Preparation
Data set: datakelulusanmahasiswa.xls
312
3. Data Preparation
• There are 379 student records with 15 attributes
• 10 records contain missing values, and there is no noisy data
313
3. Data Preparation
• Missing values are resolved by filling them in with the attribute's mean value
• The result is a clean data set with no missing values
314
4. Modeling
• Model the dataset with a Decision Tree
• The resulting pattern can take the form of a tree or if-then rules
315
4. Modeling
The pattern extracted from the data, in the form of a decision tree
316
5. Evaluation
The pattern extracted from the data, in the form of if-then rules
317
5. Evaluation
• The most influential attributes (factors) are Student Status, IPS2, IPS5, and IPS1
318
6. Deployment
• Budi sets up discipline-improvement and mentoring programs for students in the early semesters (1-2) and in semester 5, because the factors that most determine graduation lie in those semesters
319
Exercise
• Study and run experiments based on all of the case studies in the book Data Mining for the Masses (Matthew North)
320
Assignment: Solving an Organizational Problem
• Analyze the problems and needs of an organization in your environment
• Collect and review the available datasets, and connect those problems and needs to the available data (analyze against the 5 data mining roles)
• Where possible, apply several roles at once to process the data, e.g., perform association (factor analysis) together with estimation or clustering
• Run the CRISP-DM process to solve the organization's problem using the data obtained
• In the data preparation phase, perform data cleaning (replace missing values, replace, filter attributes) so the data is ready for modeling
• Also compare algorithms to select the best one
• Summarize everything as slides, following the example of Sarah's case study, which used data mining to:
• Analyze related factors (correlation matrix)
• Estimate heating oil stock requirements (linear regression)
321
CRISP-DM Case Study
322
Example Case: Processing LHKPN Data
323
Predicting the Profile of Corruption Suspects
324
Profile Patterns of Corruption Suspects
325
Forecasting the Number of Mandatory Wealth Reporters
326
Forecasting the Number of Mandatory Wealth Reporters
327
Recommendations from LHKPN Examination Results
328
Patterns of LHKPN Examination Recommendations
329
3. Data Preparation
3.1 Data Cleaning
3.2 Data Reduction
3.3 Data Transformation and Data Discretization
3.4 Data Integration
330
CRISP-DM
331
Why Preprocess the Data?
Measures for data quality: A multidimensional view
332
Major Tasks in Data Preprocessing
1. Data cleaning
• Fill in missing values
• Smooth noisy data
• Identify or remove outliers
• Resolve inconsistencies
2. Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
3. Data transformation and data discretization
• Normalization
• Concept hierarchy generation
4. Data integration
• Integration of multiple databases or files
333
Data Preparation Law (Data Mining Law 3)
Data preparation is more than half of every data
mining process
334
3.1 Data Cleaning
335
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect
data, e.g., instrument faulty, human or computer error,
transmission error
337
Example of Missing Data
• Dataset: MissingDataSet.csv
338
MissingDataSet.csv
• Jerry is the marketing manager for a small Internet design and
advertising firm
• Jerry’s boss asks him to develop a data set containing
information about Internet users
• The company will use this data to determine what kinds of
people are using the Internet and how the firm may be able to
market their services to this group of users
• To accomplish his assignment, Jerry creates an online survey and
places links to the survey on several popular Web sites
• Within two weeks, Jerry has collected enough data to begin
analysis, but he finds that his data needs to be denormalized
• He also notes that some observations in the set are missing
values or they appear to contain invalid values
• Jerry realizes that some additional work on the data needs to
take place before analysis begins.
339
Relational Data
340
View of Data (Denormalized Data)
341
Example of Missing Data
• Dataset: MissingDataSet.csv
342
How to Handle Missing Data?
• Ignore the tuple:
• Usually done when class label is missing (when doing
classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually:
• Tedious + infeasible?
• Fill in it automatically with
• A global constant: e.g., “unknown”, a new class?!
• The attribute mean
• The attribute mean for all samples belonging to the same
class: smarter
• The most probable value: inference-based such as
Bayesian formula or decision tree
343
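A minimal pandas sketch of the automatic fill strategies above; the column names "label", "age", and "category" are hypothetical, not taken from MissingDataSet.csv:

import pandas as pd

df = pd.read_csv("MissingDataSet.csv")                 # the chapter's dataset

# Ignore the tuple: drop rows whose (assumed) class label is missing
df = df.dropna(subset=["label"])

# Alternative fills for a numeric attribute (pick one):
df["age"] = df["age"].fillna(df["age"].mean())         # attribute mean
df["age"] = df.groupby("label")["age"].transform(
    lambda s: s.fillna(s.mean()))                      # class-conditional mean, smarter

# Global constant for a nominal attribute
df["category"] = df["category"].fillna("unknown")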
Exercise
• Run the experiments from Matthew North, Data Mining for the Masses, 2nd Edition, 2016, Chapter 3 Data Preparation
1. Handling Missing Data, pp. 34-48 (replace)
2. Data Reduction, pp. 48-51 (delete/filter)
• Dataset: MissingDataSet.csv
345
Missing Value Replace
346
Missing Value Filtering
347
Noisy Data
• Noise: random error or variance in a measured
variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitation
• Inconsistency in naming convention
• Other data problems which require data cleaning
• Duplicate records
• Incomplete data
• Inconsistent data
348
How to Handle Noisy Data?
• Binning
• First sort data and partition into (equal-frequency) bins
• Then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Regression
• Smooth by fitting the data into regression functions
• Clustering
• Detect and remove outliers
• Combined computer and human inspection
• Detect suspicious values and check by human (e.g., deal
with possible outliers)
349
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-
check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to
detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter's Wheel)
350
Exercise
• Run the experiment from Matthew North, Data Mining for the Masses, 2nd Edition, 2016, Chapter 3 Data Preparation, pp. 52-54 (Handling Inconsistent Data)
• Dataset: MissingDataSet.csv
351
352
Setting the Regex
Testing the Regex
353
Exercise
• Import the MissingDataValue-Noisy.csv data
• Use a regular expression (Replace operator) to change all noisy values in the nominal attributes to “N”
354
Exercise
1. Import the MissingDataValue-Noisy-Multiple.csv data
2. Use the Replace Missing Value operator to fill in empty values
3. Use a regular expression (Replace operator) to change all noisy values in the nominal attributes to “N”
4. Use the Map operator to change every entry of Face, FB, and Fesbuk to Facebook
355
356
360
2. Data Understanding
• Working with Gill, we gather the results of the batteries
for all former clients who have gone on to specialize
• Gill adds the sport each person specialized in, and we
have a data set comprised of 493 observations
containing the following attributes:
1. Age: ....
2. Strength: ....
3. Quickness: ....
4. Injury: ....
5. Vision: ....
6. Endurance: ....
7. Agility: ....
8. Decision Making: ....
9. Prime Sport: ....
361
3. Data Preparation
• Filter Examples: attribute value filter
• Decision_Making>=3
• Decision_Making<=100
• Deleted Records= 493-482=11
362
Exercise
1. Train on the SportSkill-Training.csv data using C4.5, NB, k-NN, and LDA
2. Test using 10-fold cross-validation
3. Run a t-Test to find the best model
4. Save the best model from the comparison above with the Write Model operator, then Apply Model on the SportSkill-Scoring.csv dataset
363
t-Test significance matrix to fill in:
       DT   NB   k-NN  LDA
DT
NB
k-NN
LDA
364
365
366
3.2 Data Reduction
367
Data Reduction Methods
• Data Reduction
• Obtain a reduced representation of the data set that is much smaller in volume but yet
produces the same analytical results
• Why Data Reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis take a very long time to run on the complete dataset
371
372
373
Data Before PCA
374
Data After PCA
375
Exercise
• Build a process that compares the models produced by k-NN and by PCA + k-NN
• Use 10-fold cross-validation
376
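The same comparison can be prototyped outside RapidMiner; a hedged scikit-learn sketch on synthetic data (the dataset and parameter choices are illustrative, not from the course material):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5)
pca_knn = make_pipeline(PCA(n_components=5), KNeighborsClassifier(n_neighbors=5))
for name, model in [("k-NN", knn), ("PCA + k-NN", pca_knn)]:
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    print(name, round(scores.mean(), 3))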
377
Exercise
• Review which operators can be used for feature extraction
380
Wrapper Approach vs Filter Approach
381
Feature Selection Approach
1. Filter Approach:
• information gain
• chi square
• log-likelihood ratio
• etc
2. Wrapper Approach:
• forward selection
• backward elimination
• randomized hill climbing
• etc
3. Embedded Approach:
• decision tree
• weighted naïve bayes
• etc 382
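A rough scikit-learn illustration of the three approaches (not the RapidMiner operators themselves); mutual information stands in for information gain, and SequentialFeatureSelector (scikit-learn 0.24 or newer) covers forward selection:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 1. Filter: score each feature independently of any model
X_filter = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

# 2. Wrapper: forward selection driven by a model's accuracy
sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=5,
                                direction="forward").fit(X, y)
X_wrapper = sfs.transform(X)

# 3. Embedded: the model itself ranks features while training
importances = DecisionTreeClassifier(random_state=0).fit(X, y).feature_importances_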
Exercise
• Run the experiment from Markus Hofmann (RapidMiner - Data Mining Use Case), Chapter 4 (k-Nearest Neighbor Classification II)
383
384
385
Exercise
• Run the experiment from Markus Hofmann (RapidMiner - Data Mining Use Case), Chapter 4 (k-Nearest Neighbor Classification II)
386
387
Feature Extraction
Feature Selection (Filter)
Feature Selection (Wrapper)
388
Feature Extraction
Feature Selection (Filter)
Feature Selection (Wrapper)
389
Comparison Results: Accuracy and t-Test Significance

          k-NN  k-NN+PCA  k-NN+ICA  k-NN+IG  k-NN+IGR  k-NN+FS  k-NN+BE
Accuracy
AUC
390
Exercise: Predicting Student Graduation
1. Train on the student data (datakelulusanmahasiswa.xls) using 3 classification algorithms (DT, NB, k-NN)
2. Analyze and compare which classification algorithm produces the most accurate model (AK)
3. Apply feature selection with Information Gain (filter), Forward Selection, and Backward Elimination (wrapper) to the most accurate model
4. Analyze and compare which feature selection algorithm produces the most accurate model
5. Test using 10-fold cross-validation

          AK     AK+IG  AK+FS  AK+BE
Accuracy  91.55         92.10  91.82
AUC       0.909         0.920  0.917
391
Exercise: Predicting Student Graduation
1. Train on the student data (datakelulusanmahasiswa.xls) using the DT classification algorithm
2. Apply feature selection with Forward Selection to DT (DT+FS)
3. Apply feature selection with Backward Elimination to DT (DT+BE)
4. Test using 10-fold cross-validation
5. Run a t-Test to find the best model (DT vs DT+FS vs DT+BE)

          DT     DT+FS  DT+BE
Accuracy  91.55  92.10  91.82
AUC       0.909  0.920  0.917
392
          DT     DT+FS  DT+BE
Accuracy  91.55  92.10  91.82
AUC       0.909  0.920  0.917
(no significant difference)
393
Exercise: Predicting Election Electability
1. Compare algorithms on the election data (datapemilukpu.xls) to find the best algorithm
2. Take the best algorithm from step 1, then apply feature selection with Forward Selection and Backward Elimination
3. Determine which combination of algorithm and feature selection performs best
4. Test using 10-fold cross-validation
5. Run a t-Test to find the best model

          DT   NB   K-NN
Accuracy
AUC

          A    A+FS  A+BE
Accuracy
AUC
394
Exercise: Predicting Student Graduation
1. Train on the student data (datakelulusanmahasiswa.xls) using DT, NB, and K-NN
2. Apply dimension reduction with Forward Selection to all three algorithms above
3. Test using 10-fold cross-validation
4. Run a t-Test to find the best model
395
No Free Lunch Theorem (Data Mining Law 4)
There is No Free Lunch for the Data Miner (NFL-DM):
the right model for a given application can only be discovered by experiment
397
Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data representation
1. Parametric methods (e.g., regression): assume the data fits some model, estimate the model parameters, and store only the parameters instead of the actual data
2. Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …
398
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
• Data modeled to fit a straight line
• Often uses the least-square method to fit the
line
• Multiple regression
• Allows a response variable Y to be modeled as a
linear function of multidimensional feature
vector
• Log-linear model
• Approximates discrete multidimensional
probability distributions
399
Regression Analysis
• Regression analysis: A collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
(Figure: a fitted line y = x + 1 through a scatter of points; the observed value Y1 at X1 and its fitted value Y1′ illustrate the least-squares fit)
400
Regression Analysis and Log-Linear Models
• Linear regression: Y = w X + b
• Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
• Using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …
• Multiple regression: Y = b0 + b1 X1 + b2 X2
• Many nonlinear functions can be transformed into the above
• Log-linear models:
• Approximate discrete multidimensional probability distributions
• Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset of
dimensional combinations
• Useful for dimensionality reduction and data smoothing
401
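A small numpy sketch of the least-squares fit for Y = wX + b; the data points are made up, chosen to lie near the line y = x + 1 from the figure above:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])   # illustrative points near y = x + 1

# closed-form least-squares estimates of the two regression coefficients
w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - w * x.mean()
print(w, b)   # roughly 0.96 and 1.14 for these points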
Histogram Analysis
402
Clustering
• Partition data set into clusters based on
similarity, and store cluster representation (e.g.,
centroid and diameter) only
• Can be very effective if data is clustered but not
if data is “smeared”
• Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
• There are many choices of clustering definitions
and clustering algorithms
403
Sampling
• Sampling: obtaining a small sample s to represent
the whole data set N
• Allow a mining algorithm to run in complexity that
is potentially sub-linear to the size of the data
• Key principle: Choose a representative subset of the
data
• Simple random sampling may have very poor performance in the
presence of skew
• Develop adaptive sampling methods, e.g., stratified sampling
404
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular
item
• Sampling without replacement
• Once an object is selected, it is removed from the
population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling
• Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
• Used in conjunction with skewed data
405
Sampling: With or without Replacement
(Figure: raw data sampled two ways: SRSWOR, simple random sampling without replacement, and SRSWR, simple random sampling with replacement)
406
Sampling: Cluster or Stratified Sampling
407
Stratified Sampling
• Stratification is the process of dividing members of the population
into homogeneous subgroups before sampling
• Suppose that in a company there are the following staff:
• Male, full-time: 90
• Male, part-time: 18
• Female, full-time: 9
• Female, part-time: 63
• Total: 180
• We are asked to take a sample of 40 staff, stratified according to
the above categories
• An easy way to calculate the percentage is to multiply each group
size by the sample size and divide by the total population:
• Male, full-time = 90 × (40 ÷ 180) = 20
• Male, part-time = 18 × (40 ÷ 180) = 4
• Female, full-time = 9 × (40 ÷ 180) = 2
• Female, part-time = 63 × (40 ÷ 180) = 14
408
Exercise
• Run the experiment from Matthew North, Data Mining for the Masses, 2nd Edition, 2016, Chapter 7 Discriminant Analysis, pp. 125-143
• Datasets:
• SportSkill-Training.csv
• SportSkill-Scoring.csv
410
Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values
• Each old value can be identified with one of the new values
• Methods:
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing
411
Normalization
• Z-score normalization (μ: mean, σ: standard deviation): v′ = (v − μ) / σ
• Ex. Let μ = 54,000 and σ = 16,000. Then v = 73,600 normalizes to (73,600 − 54,000) / 16,000 = 1.225
413
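A quick numpy sketch of the three normalization methods listed earlier, reusing the slide's μ = 54,000 and σ = 16,000 (the other values in v are made up):

import numpy as np

v = np.array([73600.0, 54000.0, 38000.0, 98000.0])   # illustrative salary values

minmax = (v - v.min()) / (v.max() - v.min())          # min-max to [0, 1]
z = (v - 54000) / 16000                               # z-score: 73,600 -> 1.225
j = int(np.ceil(np.log10(np.abs(v).max())))           # decimal scaling: max |v'| < 1
dec = v / 10 ** j
print(z[0])   # 1.225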
Data Discretization Methods
Typical methods: All the methods can be
applied recursively
• Binning: Top-down split, unsupervised
• Histogram analysis: Top-down split, unsupervised
• Clustering analysis: Unsupervised, top-down split
or bottom-up merge
• Decision-tree analysis: Supervised, top-down split
• Correlation (e.g., χ²) analysis: Unsupervised, bottom-up merge
414
Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform
grid
• if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate
presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing
approximately same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
415
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
417
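A sketch that partitions the slide's sorted price data into three equal-frequency bins and smooths by bin means and by bin boundaries (the choice of three bins is an assumption, following the usual textbook treatment of this data):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

# equal-depth (frequency) partitioning: 3 bins of 4 values each
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]
for b in bins:
    mean = sum(b) / len(b)
    by_mean = [round(mean, 2)] * len(b)                        # smooth by bin means
    lo, hi = b[0], b[-1]
    by_boundary = [lo if x - lo <= hi - x else hi for x in b]  # smooth by boundaries
    print(b, "->", by_mean, "or", by_boundary)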
Discretization by Classification & Correlation Analysis
418
Exercise
• Run the experiment from Markus Hofmann (RapidMiner - Data Mining Use Case), Chapter 5 (Naïve Bayes Classification I)
• Dataset: crx.data
• Analyze which preprocessing methods are used and why they are needed for this dataset!
• Compare the model's accuracy when the filter and discretization are not used
• Also compare when feature selection (wrapper) with Backward Elimination is used
419
420
Results
421
3.4 Data Integration
422
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema Integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity Identification Problem:
• Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
• Detecting and Resolving Data Value Conflicts
• For the same real world entity, attribute values from
different sources are different
• Possible reasons: different representations, different scales,
e.g., metric vs. British units
423
Handling Redundancy in Data Integration
• Redundant data occur often when integration of
multiple databases
• Object identification: The same attribute or object may
have different names in different databases
• Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
• Redundant attributes may be able to be detected
by correlation analysis and covariance analysis
• Careful integration of the data from multiple
sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and
quality
424
Correlation Analysis (Nominal Data)
• Χ² (chi-square) test:
  χ² = Σ (Observed − Expected)² / Expected
• The larger the Χ2 value, the more likely the variables are
related
• The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population
425
Correlation Analysis (Numeric Data)
• Pearson's product-moment correlation coefficient:
  r(A,B) = Σ (i=1..n) (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σA σB) = (Σ aᵢbᵢ − n Ā B̄) / ((n − 1) σA σB)
• where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aᵢbᵢ) is the sum of the AB cross-product
427
Visually Evaluating Correlation
(Figure: scatter plots showing correlation values ranging from –1 to 1)
428
Correlation
• Correlation measures the linear relationship
between objects
• To compute correlation, we standardize data
objects, A and B, and then take their dot product
429
Covariance (Numeric Data)
• Covariance is similar to correlation:
  Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ (i=1..n) (aᵢ − Ā)(bᵢ − B̄) / n
• Correlation coefficient: r(A,B) = Cov(A, B) / (σA σB)
• where n is the number of tuples, Ā and B̄ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
expected values
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely
to be smaller than its expected value
• Independence: CovA,B = 0 but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
430
Covariance: An Example
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
• Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14) / 5 − 4 × 9.6 = 212/5 − 38.4 = 4
• Since Cov(A, B) > 0, the prices of A and B tend to rise together
431
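The stock example can be checked with a few lines of Python; this simply replays the arithmetic above:

A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)
mean_a = sum(A) / n                                            # E(A) = 4.0
mean_b = sum(B) / n                                            # E(B) = 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b   # = 4.0
print(cov > 0)   # True: the prices tend to rise together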
Summary
1. Data quality: accuracy, completeness, consistency,
timeliness, believability, interpretability
2. Data cleaning: e.g. missing/noisy values, outliers
3. Data reduction
• Dimensionality reduction
• Numerosity reduction
4. Data transformation and data discretization
• Normalization
5. Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
432
Assignment: Writing a Scientific Paper
• Write a scientific paper from the slides (ppt) you have prepared, using the template at http://journal.ilmukomputer.org
• The paper structure follows the format below:
1. Introduction
• Background of the problem and objectives
2. Related Work
• Other research that does something similar to what we did
3. Research Method
• How we analyzed the data; explain that we used CRISP-DM
4. Results and Discussion
• 4.1 Business Understanding
• 4.2 Data Understanding
• 4.3 Data Preparation
• 4.4 Modeling
• 4.5 Evaluation
• 4.6 Deployment
5. Conclusion
• The conclusion must match the objectives
6. References
• List the references used
433
Assignment: Solving an Organizational Problem
• Analyze the problems and needs of an organization in your environment
• Collect and review the available datasets, and connect those problems and needs to the available data (analyze against the 5 data mining roles)
• Where possible, apply several roles at once to process the data, e.g., perform association (factor analysis) together with estimation or clustering
• Run the CRISP-DM process to solve the organization's problem using the data obtained
• In the data preparation phase, perform data cleaning (replace missing values, replace, filter attributes) so the data is ready for modeling
• Also compare algorithms and apply feature selection to choose the best pattern and model
• Summarize the evaluation of the resulting patterns/models/knowledge and relate the evaluation results to the deployment performed
• Summarize everything as slides, following the example of Sarah's marketing case study
434
4. Data Mining Algorithms
4.1 Classification Algorithms
4.2 Clustering Algorithms
4.3 Association Algorithms
4.4 Estimation and Forecasting Algorithms
435
4.1 Classification Algorithms
436
4.1.1 Decision Tree
437
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
1. Tree is constructed in a top-down recursive divide-and-
conquer manner
2. At start, all the training examples are at the root
3. Attributes are categorical (if continuous-valued, they are
discretized in advance)
4. Examples are partitioned recursively based on selected
attributes
5. Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain, gain ratio, gini
index)
439
Attribute Selection Measure:
Information Gain (ID3)
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = − Σ (i=1..m) pᵢ × log₂(pᵢ)
• Information still needed after using attribute A to split D into v partitions:
  Info_A(D) = Σ (j=1..v) (|Dⱼ| / |D|) × Info(Dⱼ)
• Information gained by branching on A: Gain(A) = Info(D) − Info_A(D)
Computing Information-Gain for Continuous-Valued Attributes
444
2. Choose an attribute as the root
• The root attribute is chosen based on the highest Gain value among the available attributes. To obtain the Gain, the Entropy must be computed first (a small computational sketch follows below)
• Entropy formula:
  Entropy(S) = Σ (i=1..n) −pᵢ × log₂(pᵢ)
• S = the set of cases
• n = the number of partitions of S
• pᵢ = the proportion of Sᵢ to S
• Gain formula:
  Gain(S, A) = Entropy(S) − Σ (i=1..n) (|Sᵢ| / |S|) × Entropy(Sᵢ)
• S = the set of cases
• A = an attribute
• n = the number of partitions of attribute A
• |Sᵢ| = the number of cases in partition i
• |S| = the number of cases in S
445
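A short Python sketch that reproduces the root-node numbers computed on the next slides (Entropy(Total) = 0.86312 and Gain(S, HUMIDITY) = 0.37051):

from math import log2

def entropy(yes, no):
    total = yes + no
    e = 0.0
    for c in (yes, no):
        p = c / total
        if p > 0:
            e -= p * log2(p)
    return e

e_total = entropy(10, 4)                    # 0.86312
# HUMIDITY splits the 14 cases into HIGH (4 yes, 3 no) and NORMAL (7 yes, 0 no)
gain = e_total - (7 / 14) * entropy(4, 3) - (7 / 14) * entropy(7, 0)
print(round(e_total, 5), round(gain, 5))    # 0.86312  0.37051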
Computing the Entropy and Gain of the Root
446
Computing the Root Entropy
• Total Entropy
• Entropy (Outlook)
• Entropy (Temperature)
• Entropy (Humidity)
• Entropy (Windy)
447
Computing the Root Entropy

NODE  ATTRIBUTE     CASES (S)  YES (Si)  NO (Si)  ENTROPY  GAIN
1     TOTAL         14         10        4        0.86312
      OUTLOOK
        CLOUDY      4          4         0        0
        RAINY       5          4         1        0.72193
        SUNNY       5          2         3        0.97095
      TEMPERATURE
        COOL        4          0         4        0
        HOT         4          2         2        1
        MILD        6          2         4        0.91830
      HUMIDITY
        HIGH        7          4         3        0.98523
        NORMAL      7          7         0        0
      WINDY
        FALSE       8          2         6        0.81128
        TRUE        6          4         2        0.91830
448
Computing the Root Gain
449
Computing the Root Gain

NODE  ATTRIBUTE     CASES (S)  YES (Si)  NO (Si)  ENTROPY  GAIN
1     TOTAL         14         10        4        0.86312
      OUTLOOK                                              0.25852
        CLOUDY      4          4         0        0
        RAINY       5          4         1        0.72193
        SUNNY       5          2         3        0.97095
      TEMPERATURE                                          0.18385
        COOL        4          0         4        0
        HOT         4          2         2        1
        MILD        6          2         4        0.91830
      HUMIDITY                                             0.37051
        HIGH        7          4         3        0.98523
        NORMAL      7          7         0        0
      WINDY                                                0.00598
        FALSE       8          2         6        0.81128
        TRUE        6          4         2        0.91830
450
Highest Gain Becomes the Root
• From the results at Node 1, the attribute with the highest Gain is HUMIDITY, at 0.37051
• HUMIDITY therefore becomes the root node
(Figure: root node 1. HUMIDITY; the NORMAL branch is labeled Yes, while the HIGH branch leads to node 1.1, still to be determined)
451
2. Create a branch for each attribute value
• For convenience, the dataset is filtered to the records with HUMIDITY = HIGH to build the Node 1.1 table
452
Computing the Entropy and Gain of the Branch
453
Highest Gain Becomes Node 1.1
(Figure: node 1.1 OUTLOOK under HUMIDITY = HIGH; the CLOUDY branch is Yes, the SUNNY branch is No, and the RAINY branch leads to node 1.1.2, still to be determined)
454
3. Repeat the process for each branch until all cases in the branch belong to the same class

NODE   ATTRIBUTE                       CASES (S)  YES (Si)  NO (Si)  ENTROPY  GAIN
1.1.2  HUMIDITY HIGH & OUTLOOK RAINY   2          1         1        1
       TEMPERATURE                                                            0
         COOL                          0          0         0        0
         HOT                           0          0         0        0
         MILD                          2          1         1        1
       WINDY                                                                  1
         FALSE                         1          1         0        0
         TRUE                          1          0         1        0
455
Highest Gain Becomes Node 1.1.2
• From the table, the highest Gain is WINDY, so WINDY becomes the child node of the RAINY branch
• Since every case now falls into a single class, the decision tree in the figure is the final tree that is formed
(Figure: final decision tree: 1. HUMIDITY with Normal → Yes and High → 1.1 OUTLOOK; OUTLOOK with Cloudy → Yes, Sunny → No, and Rainy → 1.1.2 WINDY; WINDY with False → Yes and True → No)
456
Decision Tree Induction: An Example
• Training data set: Buys_computer

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
457
Gain Ratio for Attribute Selection (C4.5)
• C4.5 normalizes information gain by the split information:
  SplitInfo_A(D) = − Σ (j=1..v) (|Dⱼ| / |D|) × log₂(|Dⱼ| / |D|)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
Gini Index (CART)
• If a data set D is split on A into two subsets D1 and D2, the gini index after the split is defined as
  gini_A(D) = (|D1| / |D|) × gini(D1) + (|D2| / |D|) × gini(D2)
• Reduction in impurity:
  Δgini(A) = gini(D) − gini_A(D)
• The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
459
Computation of Gini Index
• Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”:
  gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
  gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
461
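A sketch of the same computation; the per-class counts inside D1 and D2 (7 yes / 3 no and 2 yes / 2 no) are read off the buys_computer table shown earlier:

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

g_d = gini([9, 5])                                             # 0.459
# weighted gini after splitting on income into {low, medium} and {high}
g_split = (10 / 14) * gini([7, 3]) + (4 / 14) * gini([2, 2])   # about 0.443
print(round(g_d, 3), round(g_split, 3))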
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
• C-SEP: performs better than info. gain and gini index in certain cases
463
Pruning
464
Why is decision tree induction popular?
465
Exercise
• Run the experiment from Matthew North, Data Mining for the Masses, 2nd Edition, 2016, Chapter 10 (Decision Tree), pp. 195-217
• Datasets:
• eReaderAdoption-Training.csv
• eReaderAdoption-Scoring.csv
Objectives:
• To mine the customers’ consumer behaviors on the web site, in order to figure out which customers will buy the new tablet early, which ones will buy next, and which ones will buy later on
467
Exercise
• Train on the eReader Adoption data (eReader-Training.csv) using DT with 3 alternative split criteria (Gain Ratio, Information Gain, and Gini Index)
• Try each split criterion both with and without pruning
• Test using 10-fold cross-validation
• From the best model, determine which factors (attributes) influence the eReader adoption rate

          DTGR   DTIG   DTGI   DTGR+Pr  DTIG+Pr  DTGI+Pr
Accuracy  58.39  51.01  31.01
468
Exercise
• Apply feature selection with Forward Selection to the three split criteria above
• Test using 10-fold cross-validation
• From the best model, determine which factors (attributes) influence the eReader adoption rate
469
470
4.1.2 Bayesian Classification
471
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with
observed data
• Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
472
Bayes’ Theorem: Basics
• Total probability theorem:
  P(B) = Σ (i=1..M) P(B | Aᵢ) P(Aᵢ)
473
Prediction Based on Bayes’ Theorem
• Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem
475
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
  P(X | Cᵢ) = Π (k=1..n) P(xₖ | Cᵢ) = P(x₁ | Cᵢ) × P(x₂ | Cᵢ) × … × P(xₙ | Cᵢ)
• This greatly reduces the computation cost: only counts the class distribution
• If Aₖ is categorical, P(xₖ | Cᵢ) is the # of tuples in Cᵢ having value xₖ for Aₖ divided by |Cᵢ,D| (# of tuples of Cᵢ in D)
• If Aₖ is continuous-valued, P(xₖ | Cᵢ) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) σ)) × e^(−(x − μ)² / (2σ²))
  and P(xₖ | Cᵢ) = g(xₖ, μ_Cᵢ, σ_Cᵢ)
476
Naïve Bayes Classifier: Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
479
1. Read the Training Data
480
Bayes’ Theorem
  P(H | X) = P(X | H) P(H) / P(X)
481
2. Count each class/label
• Then:
• P(C1) = P(play = yes) = 9/14 = 0.642857143
• P(C2) = P(play = no) = 5/14 = 0.357142857
• Question:
• Data X = (outlook=rainy, temperature=cool, humidity=high, windy=true)
• Play golf or not?
482
3. Count the cases that match each class
• P(outlook=“overcast”|play=“yes”)=4/9=0.444444444
• P(outlook=“overcast”|play=“no”)=0/5=0
• P(outlook=“rainy”|play=“yes”)=3/9=0.333333333
• P(outlook=“rainy”|play=“no”)=2/5=0.4
483
3. Count the cases that match each class
484
4. Multiply all the resulting probabilities for the data X whose class is sought
• Question:
• Data X = (outlook=rainy, temperature=cool, humidity=high, windy=true)
• Play golf or not?
• P(X | play=“yes”) × P(C1) = 0.012345679 × 0.642857143 = 0.007936508
• P(X | play=“no”) × P(C2) = 0.0384 × 0.357142857 = 0.013714286
• The “no” value is larger than the “yes” value, so the class of data X is “No”
485
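The whole calculation fits in a few lines of Python; the rainy conditionals are the ones shown above, while the cool/high/true values (each 3/9 for yes; 1/5, 4/5, 3/5 for no) are inferred so that the products match the slide's 0.012345679 and 0.0384:

import math  # math.prod needs Python 3.8+

p_yes, p_no = 9 / 14, 5 / 14
cond_yes = [3 / 9, 3 / 9, 3 / 9, 3 / 9]   # rainy, cool, high, true | play=yes
cond_no = [2 / 5, 1 / 5, 4 / 5, 3 / 5]    # rainy, cool, high, true | play=no

score_yes = math.prod(cond_yes) * p_yes   # 0.007936508
score_no = math.prod(cond_no) * p_no      # 0.013714286
print("yes" if score_yes > score_no else "no")   # -> no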
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob.
be non-zero. Otherwise, the predicted prob. will be zero
  P(X | Cᵢ) = Π (k=1..n) P(xₖ | Cᵢ)
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their
“uncorrected” counterparts
486
Naïve Bayes Classifier: Comments
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence, therefore loss
of accuracy
• Practically, dependencies exist among variables, e.g.:
• Hospital patients' profile: age, family history, etc.
• Symptoms: fever, cough, etc.
• Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve
Bayes Classifier
• How to deal with these dependencies? Bayesian Belief
Networks
487
4.1.3 Neural Network
488
Neural Network
• A neural network is a model built to imitate the learning function of the human brain: a network of small processing units modeled on the human nervous system
489
Neural Network
• The perceptron model is a network consisting of several input units (plus a bias) and a single output unit
• The activation function is not only a binary function (0, 1) but can be bipolar (1, 0, −1)
• For a given threshold value θ:
490
Activation Functions
Kinds of activation functions used to activate the net in various types of neural networks:
491
Steps of the Perceptron Algorithm
1. Initialize all weights and biases (usually wᵢ = b = 0)
2. While there is an input vector whose output-unit response does not equal the target, do:
2.1 Set the input-unit activations xᵢ = sᵢ (i = 1,…,n)
2.2 Compute the output-unit response: net = Σᵢ xᵢwᵢ + b
      F(net) = 1 if net > θ; 0 if −θ ≤ net ≤ θ; −1 if net < −θ
493
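A minimal sketch of the full training loop (steps 1 and 2 above plus the usual update rule w ← w + t·x, b ← b + t on misclassified patterns), using the data table from the next slide with decimal commas normalized:

data = [((2.9, 1.0), 1), ((2.8, 3.0), -1), ((2.3, 5.0), -1), ((2.7, 6.0), -1)]
w, b, theta = [0.0, 0.0], 0.0, 0.0

changed = True
while changed:
    changed = False
    for x, t in data:
        net = sum(wi * xi for wi, xi in zip(w, x)) + b        # step 2.2
        f = 1 if net > theta else (-1 if net < -theta else 0)
        if f != t:                                            # wrong response:
            w = [wi + t * xi for wi, xi in zip(w, x)]         # update weights
            b += t
            changed = True
print(w, b)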
1: Initialize the Weights
• Initial weights and bias: w = 0, b = 0, with bias input = 1

t    X1   X2
 1   2.9  1
−1   2.8  3
−1   2.3  5
−1   2.7  6
494
2.1: Set the Input-Unit Activations
• Threshold θ = 0, which means:
  F(net) = 1 if net > 0; 0 if net = 0; −1 if net < 0
495
2.2 - 2.3 Compute the Response and Update the Weights
• Compute the output response for iteration 1
• Update the weights of any pattern that contains an error
496
2.4 Repeat the iterations until there is no weight change (Δwₙ = 0) (Iteration 2)
Initialization: −2.2, −1, −1
497
2.4 Repeat the iterations until there is no weight change (Δwₙ = 0) (Iteration 3)
Initialization: −2.1, −3, −1
499
1. Business Understanding
Motivation:
• Juan is a performance analyst for a major professional athletic team
• His team has been steadily improving over recent seasons, and heading into
the coming season management believes that by adding between two and four
excellent players, the team will have an outstanding shot at achieving the
league championship
• They have tasked Juan with identifying their best options from among a list of
59 players that may be available to them
• All of these players have experience; some have played professionally before
and some have years of experience as amateurs
• None are to be ruled out without being assessed for their potential ability to
add star power and productivity to the existing team
• The executives Juan works for are anxious to get going on contacting the most
promising prospects, so Juan needs to quickly evaluate these athletes’ past
performance and make recommendations based on his analysis
Objectives:
• To evaluate each of the 59 prospects’ past statistical performance in order to
help him formulate recommendations based on his analysis
500
Exercise
• Train a neural network on the TeamValue-Training.csv dataset
501
Determining the Number of Hidden Layers

Hidden Layers  Capabilities
0              Only capable of representing linearly separable functions or decisions
1              Can approximate any function that contains a continuous mapping from one finite space to another
2              Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy
502
Determining the Neuron (Hidden Unit) Count
1. Trial and Error
2. Rule of Thumb:
• Between the size of the input layer and the size of the
output layer
• 2/3 the size of the input layer, plus the size of the output
layer
• Less than twice the size of the input layer
3. Search Algorithm:
• Greedy
• Genetic Algorithm
• Particle Swarm Optimization
• etc
503
Techniques to Improve Classification
Accuracy: Ensemble Methods
504
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of
classifiers
• Boosting: weighted vote with a collection of classifiers
• Ensemble: combining a set of heterogeneous classifiers
505
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d
tuples is sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with
the most votes to X
• Prediction: can be applied to the prediction of continuous values by
taking the average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• For noise data: not considerably worse, more robust
• Proved improved accuracy in prediction
506
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How boosting works?
1. Weights are assigned to each training tuple
2. A series of k classifiers is iteratively learned
3. After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
4. The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
507
Adaboost (Freund and Schapire, 1997)
1. Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
2. Initially, all the weights of tuples are set the same (1/d)
3. Generate k classifiers in k rounds. At round i,
1. Tuples from D are sampled (with replacement) to form a training
set Di of the same size
2. Each tuple’s chance of being selected is based on its weight
3. A classification model Mi is derived from Di
4. Its error rate is calculated using Di as a test set
5. If a tuple is misclassified, its weight is increased, o.w. it is decreased
4. Error rate: err(Xⱼ) is the misclassification error of tuple Xⱼ. The error rate of classifier Mᵢ is the sum of the weights of the misclassified tuples:
  error(Mᵢ) = Σ (j=1..d) wⱼ × err(Xⱼ)
510
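A scikit-learn sketch contrasting the two ensemble methods on synthetic data; the dataset and hyperparameters are illustrative, and the base estimator is passed positionally because its keyword name differs across scikit-learn versions:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                            n_estimators=50, random_state=0)    # majority vote
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0)  # weighted vote
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, round(cross_val_score(model, X, y, cv=10).mean(), 3))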
4.2 Clustering Algorithms
4.2.1 Partitioning Methods
4.2.2 Hierarchical Methods
4.2.3 Density-Based Methods
4.2.4 Grid-Based Methods
511
What is Cluster Analysis?
• Cluster: A collection of data objects
• similar (or related) to one another within the same
group
• dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
• Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
512
Applications of Cluster Analysis
• Data reduction
• Summarization: Preprocessing for regression, PCA,
classification, and association analysis
• Compression: Image processing: vector quantization
• Hypothesis generation and testing
• Prediction based on groups
• Cluster & find characteristics/patterns for each group
• Finding K-nearest Neighbors
• Localizing search to one or a small number of clusters
• Outlier detection: Outliers are often viewed as those “far
away” from any cluster
513
Clustering: Application Examples
• Biology: taxonomy of living things: kingdom, phylum, class,
order, family, genus and species
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth
observation database
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Earth-quake studies: Observed earth quake epicenters should
be clustered along continent faults
• Climate: understanding earth climate, find patterns of
atmospheric and ocean
• Economic Science: market research
514
Basic Steps to Develop a Clustering Task
• Feature selection
• Select info concerning the task of interest
• Minimal information redundancy
• Proximity measure
• Similarity of two feature vectors
• Clustering criterion
• Expressed via a cost function or some rules
• Clustering algorithms
• Choice of algorithms
• Validation of the results
• Validation test (also, clustering tendency test)
• Interpretation of the results
• Integration with applications
515
Quality: What Is Good Clustering?
• A good clustering method will produce high quality
clusters
• high intra-class similarity: cohesive within clusters
• low inter-class similarity: distinctive between clusters
516
Measure the Quality of Clustering
• Dissimilarity/Similarity metric
• Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
• The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal
ratio, and vector variables
• Weights should be associated with different variables
based on applications and data semantics
• Quality of clustering:
• There is usually a separate “quality” function that
measures the “goodness” of a cluster.
• It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
517
Considerations for Cluster Analysis
• Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
• Separation of clusters
• Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
• Clustering space
• Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
518
Requirements and Challenges
• Scalability
• Clustering all the data instead of only on samples
• Ability to deal with different types of attributes
• Numerical, binary, categorical, ordinal, linked, and mixture of
these
• Constraint-based clustering
• User may give inputs on constraints
• Use domain knowledge to determine input parameters
• Interpretability and usability
• Others
• Discovery of clusters with arbitrary shape
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• High dimensionality
519
Major Clustering Approaches 1
• Partitioning approach:
• Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
• Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
• Create a hierarchical decomposition of the set of data (or
objects) using some criterion
• Typical methods: Diana, Agnes, BIRCH, CAMELEON
• Density-based approach:
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
• based on a multiple-level granularity structure
• Typical methods: STING, WaveCluster, CLIQUE
520
Major Clustering Approaches 2
• Model-based:
• A model is hypothesized for each of the clusters and tries to
find the best fit of that model to each other
• Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
• Based on the analysis of frequent patterns
• Typical methods: p-Cluster
• User-guided or constraint-based:
• Clustering by considering user-specified or application-specific
constraints
• Typical methods: COD (obstacles), constrained clustering
• Link-based clustering:
• Objects are often linked together in various ways
• Massive links can be used to cluster objects: SimRank, LinkClus
521
4.2.1 Partitioning Methods
522
Partitioning Algorithms: Basic Concept
523
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four
steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters
of the current partitioning (the centroid is the center,
i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed
point
4. Go back to Step 2, stop when the assignment does not
change
524
An Example of K-Means Clustering
K=2
525
Steps of the k-Means Algorithm
1. Choose the desired number of clusters k
2. Initialize the k cluster centers (centroids) at random
3. Assign each data object to the nearest cluster. Closeness between two objects is determined by their distance; k-Means uses the Euclidean distance (d):
  d_Euclidean(x, y) = √( Σ (i=1..n) (xᵢ − yᵢ)² )
• x = x1, x2, …, xn and y = y1, y2, …, yn range over the n attributes (columns) of the two records
4. Recompute the cluster centers with the current cluster memberships. The cluster center is the mean of all data objects in the cluster
5. Reassign every object using the new cluster centers. If the cluster centers no longer change, the clustering process is finished. Otherwise, go back to step 3 until the cluster centers no longer change (are stable) or there is no significant decrease in the SSE (Sum of Squared Errors); a runnable sketch follows below
526
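A plain-Python sketch of steps 1-5 (random initialization, so results vary from run to run; the four sample points are made up):

import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)                 # step 2: random centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # step 3: nearest centroid
            d = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        new = [tuple(sum(col) / len(col) for col in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]          # step 4: recompute means
        if new == centroids:                             # step 5: stable -> stop
            break
        centroids = new
    return centroids, clusters

centroids, clusters = kmeans([(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)], k=2)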
Example Case – Iteration 1
  SSE = Σ (i=1..K) Σ (p ∈ Cᵢ) d(p, mᵢ)²
527
Iteration 2
528
Iteration 3
529
Final Result
530
Exercise
• Run the experiment from Matthew North, Data Mining for the Masses, 2012, Chapter 6 k-Means Clustering, pp. 91-103 (CoronaryHeartDisease.csv)
• Draw a chart and choose Scatter 3D Color to visualize the clustering results
• Analyze what Sonia has done, and what benefits k-Means clustering brings to her work
531
Exercise
• Measure performance with Cluster Distance Performance to obtain the Davies-Bouldin Index (DBI)
• A lower DBI value means the clusters we formed are better
532
Exercise
• Cluster the IMFdata.csv data
(http://romisatriawahono.net/lecture/dm/dataset)
533
Comments on the K-Means Method
• Strength:
• Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n
• Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
534
Variations of the K-Means Method
• Most of the variants of the k-means which differ in
• Selection of the initial k means
• Dissimilarity calculations
• Strategies to calculate cluster means
535
What Is the Problem of the K-Means Method?
(Figure: two scatter plots showing how a single outlier can substantially distort the k-means cluster centers)
536
PAM: A Typical K-Medoids Algorithm
(Figure, Total Cost = 20: arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; then loop: compute the total cost of swapping a medoid with a random non-medoid O_random, and swap if quality is improved)
537
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects
(medoids) in clusters
• PAM (Partitioning Around Medoids, Kaufmann &
Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids if
it improves the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not
scale well for large data sets (due to the computational
complexity)
• Efficiency improvement on PAM
• CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
• CLARANS (Ng & Han, 1994): Randomized re-sampling
538
4.2.2 Hierarchical Methods
539
Hierarchical Clustering
• Use distance matrix as clustering criteria
• This method does not require the number of clusters k as an
input, but needs a termination condition
540
AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical packages, e.g., Splus
• Use the single-link method and the dissimilarity matrix
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
(Figure: three scatter plots showing clusters being merged step by step)
541
Dendrogram: Shows How Clusters are Merged
542
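A scipy sketch of AGNES-style agglomerative clustering with single link; dendrogram(Z) would draw the merge tree described above (the data is random, for illustration only):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(10, 2)                         # 10 random 2-D points
Z = linkage(X, method="single")                   # merge least-dissimilar clusters first
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)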
DIANA (Divisive Analysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g.,
Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
(Figure: three scatter plots showing one cluster being split step by step)
543
Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in
the other, i.e., dist(Ki, Kj) = min(tip, tjq)
• Complete link: largest distance between an element in one cluster and an element
in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
• Average: avg distance between an element in one cluster and an element in the
other, i.e., dist(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci,
Cj )
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
• Medoid: a chosen, centrally located object in the cluster
544
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
545
4.2.3 Density-Based Methods
546
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition
• Several interesting studies:
• DBSCAN: Ester, et al. (KDD’96)
• OPTICS: Ankerst, et al (SIGMOD’99).
• DENCLUE: Hinneburg & D. Keim (KDD’98)
• CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
547
Density-Based Clustering: Basic Concepts
• Two parameters:
• Eps: Maximum radius of the neighbourhood
• MinPts: Minimum number of points in an Eps-neighbourhood of that point
• NEps(q): {p belongs to D | dist(p,q) ≤ Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
• p belongs to NEps(q)
• core point condition: |NEps(q)| ≥ MinPts
(Figure: p directly density-reachable from core point q, with MinPts = 5 and Eps = 1 cm)
548
Density-Reachable and Density-Connected
• Density-reachable:
• A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
• Density-connected:
• A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
549
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
(Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5)
550
DBSCAN: The Algorithm
http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
552
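A scikit-learn sketch of DBSCAN; eps and min_samples correspond to the Eps and MinPts parameters above, and the values are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                  # random 2-D points
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
labels = db.labels_                         # cluster ids; -1 marks noise/outliers
print(set(labels))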
OPTICS: A Cluster-Ordering Method (1999)
• OPTICS: Ordering Points To Identify the Clustering Structure
• Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
• Produces a special order of the database wrt its density-
based clustering structure
• This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
• Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
• Can be represented graphically or using visualization
techniques
553
OPTICS: Some Extension from DBSCAN
555
(Figure: OPTICS reachability plot; reachability distance plotted against cluster order, undefined for the first object)
556
Density-Based Clustering: OPTICS & Applications:
http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo
557
4.2.4 Grid-Based Methods
558
Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
• STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
• WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
• A multi-resolution clustering approach using
wavelet method
• CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering
559
STING: A Statistical Information Grid Approach
(Figure: STING's hierarchical grid — each cell of the (i−1)-st layer is partitioned into smaller cells at the i-th layer)
560
The STING Clustering Method
• Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
• Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
• Parameters of higher-level cells can be easily calculated
from the parameters of lower-level cells
• count, mean, standard deviation (s), min, max
• type of distribution—normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
• Start from a pre-selected layer—typically with a small
number of cells
• For each cell in the current level compute the confidence
interval
561
STING Algorithm and Its Analysis
• Remove the irrelevant cells from further consideration
• When finished examining the current layer, proceed to the
next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
• Query-independent, easy to parallelize, incremental
update
• O(K), where K is the number of grid cells at the lowest
level
• Disadvantages:
• All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
562
CLIQUE (Clustering In QUEst)
563
CLIQUE: The Major Steps
1. Partition the data space and find the number of
points that lie inside each cell of the partition.
2. Identify the subspaces that contain clusters using
the Apriori principle
3. Identify clusters
1. Determine dense units in all subspaces of interest
2. Determine connected dense units in all subspaces of
interest.
4. Generate minimal description for the clusters
1. Determine maximal regions that cover a cluster of
connected dense units for each cluster
2. Determine the minimal cover for each cluster
564
(Figure: CLIQUE example — with density threshold τ = 3, dense units are found in the (age, salary) subspace (salary in units of $10,000) and the (age, vacation) subspace (vacation in weeks); their intersection identifies a candidate cluster in the (age, salary, vacation) space)
565
Strengths and Weaknesses of CLIQUE
• Strengths
• automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
• insensitive to the order of records in input and does not
presume some canonical data distribution
• scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
• Weaknesses
• The accuracy of the clustering result may be degraded at
the expense of simplicity of the method
566
4.3 Association Algorithms
567
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in
the context of frequent itemsets and association rule
mining
• Motivation: Finding inherent regularities in data
• What products were often purchased together?— Beer and
diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
• Applications
• Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, Web log (click stream) analysis, and DNA
sequence analysis.
568
Why Is Freq. Pattern Mining Important?
• Freq. pattern: An intrinsic and important property of
datasets
• Foundation for many essential data mining tasks
• Association, correlation, and causality analysis
• Sequential, structural (e.g., sub-graph) patterns
• Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data
• Classification: discriminative, frequent pattern analysis
• Cluster analysis: frequent pattern-based clustering
• Data warehousing: iceberg cube and cube-gradient
• Semantic data compression: fascicles
• Broad applications
569
Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• itemset: A set of one or more items
• k-itemset: X = {x1, …, xk}
• (absolute) support, or support count, of X: frequency or
number of occurrences of an itemset X
• (relative) support, s: the fraction of transactions that
contains X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a
minsup threshold
(Figure: Venn diagram of customers who buy beer, customers who buy diaper, and customers who buy both)
570
Basic Concepts: Association Rules
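For reference, the standard definitions this slide covers: an association rule A ⇒ B has
• support s = the fraction of transactions containing A ∪ B (i.e., both A and B)
• confidence c = sup(A ∪ B) / sup(A), the conditional probability that a transaction containing A also contains B
With the transaction table above, for example, the rule {Diaper} ⇒ {Beer} has support 3/5 = 60% and confidence 3/4 = 75%.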
573
Computational Complexity of Frequent Itemset Mining
575
Scalable Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
• Improving the Efficiency of Apriori
• FPGrowth: A Frequent Pattern-Growth Approach
• ECLAT: Frequent Pattern Mining with Vertical Data
Format
576
The Downward Closure Property and Scalable Mining Methods
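For reference, the downward closure (Apriori) property states that every subset of a frequent itemset must itself be frequent — equivalently, if an itemset is infrequent, all of its supersets are infrequent. This is what lets the level-wise scalable mining methods below prune the candidate space.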
577
Apriori: A Candidate Generation & Test Approach
578
The Apriori Algorithm—An Example
Supmin = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 (after 1st scan)        L1
Itemset   sup              Itemset   sup
{A}       2                {A}       2
{B}       3                {B}       3
{C}       3                {C}       3
{D}       1                {E}       3
{E}       3

C2 (generated from L1)     C2 (after 2nd scan)        L2
Itemset                    Itemset   sup              Itemset   sup
{A, B}                     {A, B}    1                {A, C}    2
{A, C}                     {A, C}    2                {B, C}    2
{A, E}                     {A, E}    1                {B, E}    3
{B, C}                     {B, C}    2                {C, E}    2
{B, E}                     {B, E}    3
{C, E}                     {C, E}    2

C3 (generated from L2)     L3 (after 3rd scan)
Itemset                    Itemset     sup
{B, C, E}                  {B, C, E}   2
579
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
   Ck+1 = candidates generated from Lk;
   for each transaction t in database do
      increment the count of all candidates in Ck+1 that are contained in t;
   Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
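A runnable sketch of the same level-wise procedure (helper code written for this copy, not from the slides; the classic algorithm is Agrawal & Srikant, VLDB'94):

# Minimal Apriori: generate-and-test with downward-closure pruning
from itertools import combinations

def apriori(transactions, min_sup=2):
    transactions = [frozenset(t) for t in transactions]

    def support_count(candidates):
        # One database scan per level: count transactions containing each candidate
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    Lk = {c: s for c, s in support_count(items).items() if s >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Self-join Lk with itself to form (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (downward closure)
        joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        candidates = {c for c in joined
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c: s for c, s in support_count(candidates).items() if s >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

tdb = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
print(apriori(tdb, min_sup=2))  # includes {B, C, E}: 2, as in the example above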
580
Implementation of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
581
How to Count Supports of Candidates?
• Why is counting supports of candidates a problem?
• The total number of candidates can be very huge
• One transaction may contain many candidates
• Method:
• Candidate itemsets are stored in a hash-tree
• Leaf node of hash-tree contains a list of itemsets and
counts
• Interior node contains a hash table
• Subset function: finds all the candidates contained in a
transaction
582
Counting Supports of Candidates Using Hash Tree
(Figure: a hash tree over the candidate 3-itemsets {145, 124, 457, 125, 458, 159, 136, 234, 567, 345, 356, 357, 689, 367, 368}, with hash branches 1,4,7 / 2,5,8 / 3,6,9 at each interior node. The subset function expands transaction 1 2 3 5 6 recursively — 1+2356, 12+356, 13+56, … — visiting only the leaves that could contain its 3-subsets)
583
Candidate Generation: An SQL Implementation
• SQL Implementation of candidate generation
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
• Use object-relational extensions like UDFs, BLOBs, and Table
functions for efficient implementation
(S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
relational database systems: Alternatives and implications. SIGMOD’98)
584
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
• The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
• Depth-first search
• Avoid explicit candidate generation
586
Partition Patterns and Databases
587
Find Patterns Having P From P-conditional
Database
• Starting at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item p
• Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base
Header Table (item : frequency, with head links into the tree):
f : 4, c : 4, a : 3, b : 3, m : 3, p : 3

(Figure: the FP-tree —
{}
├─ f:4
│  ├─ c:3 ── a:3
│  │         ├─ m:2 ── p:2
│  │         └─ b:1 ── m:1
│  └─ b:1
└─ c:1 ── b:1 ── p:1 )

Conditional pattern bases
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
588
From Conditional Pattern-bases to Conditional FP-trees
589
Recursion: Mining Each Conditional FP-tree
m-conditional FP-tree: {} ── f:3 ── c:3 ── a:3

Cond. pattern base of "am": (fc:3)   →   am-conditional FP-tree: {} ── f:3 ── c:3
Cond. pattern base of "cm": (f:3)    →   cm-conditional FP-tree: {} ── f:3
Cond. pattern base of "cam": (f:3)   →   cam-conditional FP-tree: {} ── f:3
590
A Special Case: Single Prefix Path in FP-tree
(Figure: an FP-tree whose upper part is a single prefix path {} ── a1:n1 ── a2:n2 ── a3:n3, below which the tree branches into a multipath part with nodes b1:m1, C1:k1, C2:k2, C3:k3. Such a tree is mined by splitting it into the single-prefix-path part and the multipath part rooted at r1, mining each, and concatenating the results)
591
Benefits of the FP-tree Structure
• Completeness
• Preserve complete information for frequent pattern
mining
• Never break a long pattern of any transaction
• Compactness
• Reduce irrelevant info—infrequent items are gone
• Items in frequency descending order: the more
frequently occurring, the more likely to be shared
• Never be larger than the original database (not counting
node-links and the count fields)
592
The Frequent Pattern Growth Mining Method
• Method
1. For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
2. Repeat the process on each newly created conditional
FP-tree
3. Until the resulting FP-tree is empty, or it contains only
one path—single path will generate all the
combinations of its sub-paths, each of which is a
frequent pattern
593
Scaling FP-growth by Database Projection
• What if the FP-tree cannot fit in memory? DB projection
• First partition a database into a set of projected DBs
• Then construct and mine FP-tree for each projected DB
• Parallel projection vs. partition projection techniques
• Parallel projection
• Project the DB in parallel for each frequent item
• Parallel projection is space costly
• All the partitions can be processed in parallel
• Partition projection
• Partition the DB based on the ordered frequent items
• Passing the unprocessed parts to the subsequent partitions
594
Partition-Based Projection
• Parallel projection needs a lot of disk space
• Partition projection saves it

(Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} is split into projected databases per frequent item — e.g. the am-proj DB is {fc, fc, fc} and the cm-proj DB is {f, f, f})
595
FP-Growth vs. Apriori:
Scalability With the Support Threshold
(Figure: run time (sec.), 0–70, vs. support threshold (%), 0–3, for FP-Growth and Apriori)
596
FP-Growth vs. Tree-Projection:
Scalability with the Support Threshold
(Figure: runtime (sec.), 0–100, vs. support threshold (%), 0–2, for FP-Growth and Tree-Projection)
597
Advantages of the Pattern Growth Approach
• Divide-and-conquer:
• Decompose both the mining task and DB according to
the frequent patterns obtained so far
• Lead to focused search of smaller databases
• Other factors
• No candidate generation, no candidate test
• Compressed database: FP-tree structure
• No repeated scan of entire database
• Basic ops: counting local freq items and building sub FP-
tree, no pattern search and matching
• A good open-source implementation and
refinement of FPGrowth
• FPGrowth+ (Grahne and J. Zhu, FIMI'03)
598
Further Improvements of Mining Methods
• AFOPT (Liu, et al. @ KDD’03)
• A “push-right” method for mining condensed frequent
pattern (CFP) tree
• Carpenter (Pan, et al. @ KDD’03)
• Mine data sets with small rows but numerous columns
• Construct a row-enumeration tree for efficient mining
• FPgrowth+ (Grahne and Zhu, FIMI’03)
• Efficiently Using Prefix-Trees in Mining Frequent Itemsets,
Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI'03), Melbourne, FL, Nov. 2003
• TD-Close (Liu, et al, SDM’06)
599
Extension of Pattern Growth Mining
Methodology
• Mining closed frequent itemsets and max-patterns
• CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)
• Mining sequential patterns
• PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)
• Mining graph patterns
• gSpan (ICDM’02), CloseGraph (KDD’03)
• Constraint-based mining of frequent patterns
• Convertible constraints (ICDE’01), gPrune (PAKDD’03)
• Computing iceberg data cubes with complex measures
• H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)
• Pattern-growth-based Clustering
• MaPle (Pei, et al., ICDM’03)
• Pattern-Growth-Based Classification
• Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)
600
Stages of the FP-Growth Algorithm
1. Prepare the dataset
2. Find the frequent itemsets (frequently occurring items)
3. Sort the dataset by priority
4. Build the FP-tree from the sorted items
5. Generate the conditional pattern base
6. Generate the conditional FP-tree
7. Generate the frequent patterns
8. Compute the support
9. Compute the confidence
601
1. Preparing the Dataset
602
2. Finding the Frequent Itemsets
603
3. Sorting the Dataset by Priority
604
4. Building the FP-Tree
605
5. Generating the Conditional Pattern Base
606
6. Generating the Conditional FP-Tree
607
7. Generating the Frequent Patterns
608
Frequent 2-Itemsets
609
8. Computing the Support of 2-Itemsets
610
9. Computing the Confidence of 2-Itemsets
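Steps 8 and 9 compute, for each rule A ⇒ B formed from a frequent 2-itemset, its support and confidence. A minimal sketch with hypothetical transactions (the slides' own dataset appears in the screenshots above):

# Support and confidence of a rule A => B
# support = P(A and B); confidence = P(B | A)
def rule_metrics(transactions, a, b):
    n = len(transactions)
    count_a = sum(1 for t in transactions if a in t)
    count_ab = sum(1 for t in transactions if a in t and b in t)
    return count_ab / n, count_ab / count_a

tdb = [{'beer', 'diaper'}, {'beer', 'nuts'}, {'beer', 'diaper', 'eggs'}]
print(rule_metrics(tdb, 'beer', 'diaper'))  # support 2/3, confidence 2/3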
611
4.3.2 Pattern Evaluation Methods
612
Interestingness Measure: Correlations (Lift)
• play basketball ⇒ eat cereal [40%, 66.7%] is misleading
• The overall % of students eating cereal is 75% > 66.7%
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
• Measure of dependent/correlated events: lift
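For reference, the lift of itemsets A and B is

lift(A, B) = P(A ∪ B) / ( P(A) × P(B) )

For the example above, P(basketball) = 0.4 / 0.667 = 0.6 and P(cereal) = 0.75, so lift = 0.4 / (0.6 × 0.75) ≈ 0.89 < 1: playing basketball and eating cereal are negatively correlated, which confirms that the first rule is misleading.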
613
Are lift and χ² Good Measures of Correlation?
614
Null-Invariant Measures
615
Comparison of Interestingness Measures
(Figure: comparison of measures over datasets that differ in the number of null-transactions w.r.t. m and c; the Kulczynski measure (1927) is null-invariant. Subtle cases: the measures disagree)
616
Analysis of DBLP Coauthor Relationships
618
Exercise
• Carry out the experiment following the book
Matthew North, Data Mining for the Masses,
2nd Edition, 2016, Chapter 5 (Association
Rules), pp. 85-97
619
1. Business Understanding
• Motivation:
• Roger is a city manager for a medium-sized, but steadily growing city
• The city has limited resources, and like most municipalities, there are
more needs than there are resources
• He feels like the citizens in the community are fairly active in various
community organizations, and believes that he may be able to get a
number of groups to work together to meet some of the needs in
the community
• He knows there are churches, social clubs, hobby enthusiasts and
other types of groups in the community
• What he doesn’t know is if there are connections between the
groups that might enable natural collaborations between two or
more groups that could work together on projects around town
• Objectives:
• To find out if there are any existing associations between the
different types of groups in the area
620
4.4 Estimation and Forecasting
Algorithms
4.4.1 Linear Regression
4.4.2 Time Series Forecasting
621
4.4.1 Linear Regression
622
Stages of the Linear Regression Algorithm
1. Prepare the data
2. Identify the attribute and the label
3. Compute X², Y², XY and the totals of each
4. Compute a and b using the given equations
5. Build the simple linear regression model equation
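For steps 3–4, the least-squares estimates are the standard formulas (Y² is additionally needed only if the correlation coefficient is also computed):

b = ( n ΣXY − ΣX ΣY ) / ( n ΣX² − (ΣX)² )
a = ( ΣY − b ΣX ) / n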
623
1. Data Preparation
624
2. Identify the Attribute and the Label
Y = a + bX
where Y is the label (dependent variable), X is the attribute (predictor), a is the intercept and b is the slope
626
4. Compute a and b Using the Given Equations
Y = a + bX
Y = -27.02 + 1.56X
628
Testing
1. Predict the number of production defects when the temperature
is high (variable X), for example 30°C:
Y = -27.02 + 1.56X
Y = -27.02 + 1.56(30)
Y = 19.78
629
7.1.2 CRISP-DM Case Study
630
Exercise
• Carry out the experiment following the book Matthew
North, Data Mining for the Masses, 2012,
Chapter 8 (Estimation), pp. 127-140, on Heating
Oil Consumption
631
CRISP-DM
632
Context and Perspective
• Sarah, the regional sales manager, is back for more help
• Business is booming and her sales team is signing up thousands of new
clients. She wants to be sure the company will be able to meet
this new level of demand, so she is now hoping we can help her do
some prediction as well
• She knows that there is some correlation between the attributes in
her data set (things like temperature, insulation, and occupant ages),
and she’s now wondering if she can use the previous data set to
predict heating oil usage for new customers
• You see, these new customers haven’t begun consuming heating oil
yet, there are a lot of them (42,650 to be exact), and she wants to
know how much oil she needs to expect to keep in stock in order to
meet these new customers’ demand
• Can she use data mining to examine household attributes and
known past consumption quantities to anticipate and meet her new
customers’ needs?
633
1. Business Understanding
• Sarah’s new data mining objective is pretty clear: she
wants to anticipate demand for a consumable product
• We will use a linear regression model to help her with her
desired predictions
• She has data, 1,218 observations that give an attribute
profile for each home, along with those homes’ annual
heating oil consumption
• She wants to use this data set as training data to predict
the usage that 42,650 new clients will bring to her
company
• She knows that these new clients’ homes are similar in
nature to her existing client base, so the existing
customers’ usage behavior should serve as a solid gauge
for predicting future usage by new customers.
634
2. Data Understanding
We create a data set comprised of the following attributes:
• Insulation: This is a density rating, ranging from one to ten,
indicating the thickness of each home’s insulation. A home with a
density rating of one is poorly insulated, while a home with a
density of ten has excellent insulation
• Temperature: This is the average outdoor ambient temperature
at each home for the most recent year, measured in degrees
Fahrenheit
• Heating_Oil: This is the total number of units of heating oil
purchased by the owner of each home in the most recent year
• Num_Occupants: This is the total number of occupants living in
each home
• Avg_Age: This is the average age of those occupants
• Home_Size: This is a rating, on a scale of one to eight, of the
home’s overall size. The higher the number, the larger the home
635
3. Data Preparation
• A CSV data set for this chapter’s example is available for
download at the book’s companion web site
(https://sites.google.com/site/dataminingforthemasses/)
636
3. Data Preparation
637
3. Data Preparation
638
4. Modeling
639
4. Modeling
640
5. Evaluation
641
5. Evaluation
642
6. Deployment
643
6. Deployment
644
6. Deployment
645
4.4.2 Time Series Forecasting
646
Time Series Forecasting
• Time series forecasting is one of the oldest known
predictive analytics techniques
• It has existed and been in widespread use even before the term
“predictive analytics” was ever coined
• Independent or predictor variables are not strictly
necessary for univariate time series forecasting, but are
strongly recommended for multivariate time series
• Time series forecasting methods:
1. Data Driven Method: There is no difference between a
predictor and a target. Techniques such as time series
averaging or smoothing are considered data-driven
approaches to time series forecasting
2. Model Driven Method: Similar to “conventional” predictive
models, which have independent and dependent variables,
but with a twist: the independent variable is now time
647
Data Driven Methods
• There is no difference between a predictor and a
target
• The predictor is also the target variable
• Data Driven Methods:
• Naïve Forecast
• Simple Average
• Moving Average
• Weighted Moving Average
• Exponential Smoothing
• Holt’s Two-Parameter Exponential Smoothing
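As an illustration, minimal sketches of two of the methods above (hypothetical series; in practice, pandas' rolling() and ewm() provide these):

def moving_average(series, w=3):
    # Forecast for the next step = mean of the last w observations
    return sum(series[-w:]) / w

def exponential_smoothing(series, alpha=0.5):
    # S_t = alpha * y_t + (1 - alpha) * S_{t-1}; the last S_t is the forecast
    s = series[0]
    for y in series[1:]:
        s = alpha * y + (1 - alpha) * s
    return s

y = [112, 118, 132, 129, 121, 135, 148]
print(moving_average(y), exponential_smoothing(y))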
648
Model Driven Methods
• In model-driven methods, time is the predictor or
independent variable and the time series value is the
dependent variable
• Model-based methods are generally preferable when the
time series appears to have a “global” pattern
• The idea is that the model parameters will be able to
capture these patterns
• Thus enable us to make predictions for any step ahead in the
future under the assumption that this pattern is going to repeat
• For a time series with local patterns instead of a global
pattern, using the model-driven approach requires
specifying how and when the patterns change, which is
difficult
649
Model Driven Methods
• Linear Regression
• Polynomial Regression
• Linear Regression with Seasonality
• Autoregression Models and ARIMA
650
How to Implement
• RapidMiner’s approach to time series is
based on two main data transformation
processes
• The first is windowing to transform the time
series data into a generic data set:
• This step will convert the last row of a window
within the time series into a label or target
variable
• We apply any of the “learners” or algorithms
to predict the target variable and thus
predict the next time step in the series
651
Windowing Concept
• The parameters of the Windowing operator allow
changing the size of the windows, the overlap between
consecutive windows (step size), and the prediction
horizon, which is used for forecasting
• The prediction horizon controls which row in the raw
data series ends up as the label variable in the
transformed series
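A minimal sketch of this transformation (hypothetical helper; RapidMiner's Windowing operator does this internally):

def windowing(series, w=6, s=1, h=1):
    # Each window of w consecutive values becomes one row of attributes;
    # the value h steps past the window becomes that row's label
    rows = []
    for start in range(0, len(series) - w - h + 1, s):
        rows.append((series[start:start + w], series[start + w + h - 1]))
    return rows

y = list(range(1, 15))
print(windowing(y)[0])  # ([1, 2, 3, 4, 5, 6], 7): the 7th value is the first label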
652
Rapidminer Windowing Operator
653
Windowing Operator Parameters
• Window size: Determines how many “attributes”
are created for the cross-sectional data
• Each row of the original time series within the window
width will become a new attribute
• We choose w = 6
• Step size: Determines how to advance the window
• Let us use s = 1
• Horizon: Determines how far out to make the
forecast
• If the window size is 6 and the horizon is 1, then the
seventh row of the original time series becomes the first
sample for the “label” variable
• Let us use h = 1
654
Exercise
• Perform training using linear
regression on the dataset hargasaham-training-
uni.xls
655
656
Exercise
• Perform training using linear
regression on the dataset hargasaham-training.xls
657
658
5. Text Mining
5.1 Text Mining Concepts
5.2 Text Clustering
5.3 Text Classification
5.4 Data Mining Laws
659
5.1 Text Mining Concepts
660
Data Mining vs Text Mining
1. Text Mining:
• Processes unstructured data in the form of text, web pages,
social media, etc.
• Uses text processing methods to convert unstructured
data into structured data
• The result is then processed with data mining
2. Data Mining:
• Processes structured data in the form of tables that
have attributes and a class
• Uses data mining methods, which divide into
estimation, forecasting, classification, clustering and
association
• Whose underlying reasoning rests on statistical concepts or
machine-learning-style heuristics
661
How Text Mining Works
• The fundamental step is to convert text into semi-structured data
• Then apply the data mining methods to classify, cluster, and predict
(Figure: unstructured text → text processing → structured data → data mining)
662
Text Mining: Traces of Pornography in Indonesia
663
Text Mining: AHY-AHOK-ANIES
664
The Data Mining Process
(Understand and prepare the data) (Choose a method that suits the data's
characteristics) (Understand the resulting model and knowledge) (Analyze the model
and the method's performance)
665
Word, Token and Tokenization
666
Matrix of Terms
• We can impose some form of structure on this raw
data by creating a matrix, where:
• the columns consist of all the tokens found in the two
documents
• the cells of the matrix are the counts of the number of
times a token appears
• Each token is now an attribute in standard data
mining parlance and each document is an example
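A minimal sketch of building such a matrix (toy documents; scikit-learn's CountVectorizer is one standard implementation):

# Build a document-term count matrix for two toy documents
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is a simple document",
        "this document is a second simple document"]
vec = CountVectorizer()
X = vec.fit_transform(docs)         # rows = documents (examples)
print(vec.get_feature_names_out())  # columns = tokens (attributes)
print(X.toarray())                  # cells = token counts per document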
667
Term Document Matrix (TDM)
• Basically, unstructured raw data is now transformed
into a format that is recognized, not only by the
human users as a data table, but more importantly
by all the machine learning algorithms which
require such tables for training
• This table is called a document vector or term
document matrix (TDM) and is the cornerstone of
the preprocessing required for text mining
668
TF–IDF
• We could have also chosen to use the TF–IDF scores
for each term to create the document vector
• N is the number of documents that we are trying to
mine
• Nk is the number of documents that contain the
keyword, k
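For reference, the score is computed as

TF-IDF(k) = TF(k) × log( N / Nk )

so a term that occurs in every document (Nk = N) gets weight zero, while rarer terms are weighted up.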
669
Stopwords
• The two sample text documents contain occurrences of common
words such as "a," "this," "and," and other similar terms
• Clearly in larger documents we would expect a larger number of
such terms that do not really convey specific meaning
• Most grammatical necessities such as articles, conjunctions,
prepositions, and pronouns may need to be filtered before we
perform additional analysis
• Such terms are called stopwords and usually include most articles,
conjunctions, pronouns, and prepositions
• Stopword filtering is usually the second step that follows immediately
after tokenization
• Notice that our document vector has a significantly reduced size
after applying standard English stopword filtering
670
Indonesian Stopwords
• Google the keywords:
stopwords bahasa Indonesia
• Download an Indonesian stopword list and
use it in RapidMiner
671
Stemming
• Words such as “recognized,” “recognizable,” or “recognition”
in different usages, but contextually they may all imply the
same meaning, for example:
• “Einstein is a well-recognized name in physics”
• “The physicist went by the easily recognizable name of Einstein”
• “Few other physicists have the kind of name recognition that Einstein
has”
• The so-called root of all these highlighted words is “recognize”
• By reducing terms in a document to their basic stems, we can
simplify the conversion of unstructured text to structured data
because we now only take into account the occurrence of the
root terms
• This process is called stemming. The most common stemming
technique for text mining in English is the Porter method
(Porter, 1980)
672
A Typical Sequence of Preprocessing Steps to Use in Text Mining
673
N-Grams
• There are families of words in the spoken and written
language that typically go together
• The word “Good” is usually followed by either “Morning,”
“Afternoon,” “Evening,” “Night,” or in Australia, “Day”
• Grouping such terms, called n-grams, and analyzing them
statistically can present new insights
• Search engines use word n-gram models for a variety
of applications, such as:
• Automatic translation, identifying speech patterns,
checking misspelling, entity detection, information
extraction, among many different use cases
674
Rapidminer Process of Text Mining
675
5.2 Text Clustering
676
Exercise
• Carry out the experiment following the book
Matthew North (Data Mining for the Masses),
Chapter 12 (Text Mining), 2012, pp. 189-215
677
1. Business Understanding
• Motivation:
• Gillian is a historian, and she has recently curated an exhibit on the
Federalist Papers, the essays that were written and published in the late
1700s
• The essays were published anonymously under the author name ‘Publius’,
and no one really knew at the time if ‘Publius’ was one individual or many
• After Alexander Hamilton died in 1804, some notes were discovered that
revealed that he (Hamilton), James Madison and John Jay had been the
authors of the papers
• The notes indicated specific authors for some papers, but not for others:
• John Jay was revealed to be the author for papers 3, 4 and 5
• James Madison for paper 14
• Hamilton for paper 17
• Paper 18 had no author named, but there was evidence that Hamilton and
Madison worked on that one together
• Objective:
• Gillian would like to analyze paper 18's content in the context of the other
papers with known authors, to see if she can generate some evidence that
Hamilton and Madison did indeed write it together
678
2. Data Understanding
• The Federalist Papers are available through a number
of sources:
• They have been re-published in book form and are available on
a number of different web sites
• Their text is archived in many libraries throughout the world
679
Modeling
Text Processing Extension Installation
680
(Figure: RapidMiner process — a Read Document operator feeds the text processing operators, whose output goes to a k-Means modeling operator with parameter k = 2)
681
Modelling with Annotation
682
Evaluation
• Gillian feels confident that paper 18 is a
collaboration that John Jay did not contribute to
• His vocabulary and grammatical structure were quite
different from those of Hamilton and Madison
683
Exercise
• Carry out the experiment following the book Vijay Kotu
(Predictive Analytics and Data Mining), Chapter 9 (Text
Mining), Case Study 1: Keyword Clustering, pp. 284-287
• Datasets (file pages.txt):
1. https://www.cnnindonesia.com/olahraga
2. https://www.cnnindonesia.com/ekonomi
• Use the Indonesian stopword list (available in the dataset
folder) with the Stopwords (Dictionary) operator, selecting
the file stopword-indonesia.txt
• To make things easier, copy the file
09_Text_9.3.1_keyword_clustering_webmining.rmp
into the Repository and open it in RapidMiner
• In the Read URL operator, select the pages.txt file that contains the URLs
684
685
Testing Model (Read Document)
686
Testing Model (Get Page)
687
5.3 Text Classification
688
Exercise
• Using the various concepts and techniques you
have mastered, perform text classification on the
"polarity data - small" dataset
692
Exercise
• Using the various concepts and techniques you
have mastered, perform text classification on the
"polarity data" dataset
• Apply several feature selection
methods, both filter and wrapper
• Compare various classification
algorithms, and choose the best one
693
Exercise
• Carry out the experiment following the book Vijay Kotu
(Predictive Analytics and Data Mining), Chapter 9
(Text Mining), Case Study 2: Predicting the Gender
of Blog Authors, pp. 287-301
• Dataset: blog-gender-dataset.xlsx
• Split the data: 50% for training and 50% for testing
• Use the Naïve Bayes algorithm
• Apply the resulting model to the testing data
• Measure its performance
694
695
Exercise
• Carry out the experiment following the book Vijay
Kotu (Predictive Analytics and Data Mining),
Chapter 9 (Text Mining), Case Study 2:
Predicting the Gender of Blog Authors, pp. 287-301
• Datasets:
• blog-gender-dataset.xlsx
• blog-gender-dataset-testing.xlsx
• Use 10-fold cross validation and the Write Model
(Read Model) and Store (Retrieve) operators
696
697
698
699
700
Post-Test
1. Explain the difference between data, information and knowledge!
2. Explain what you know about data mining!
3. Name the main roles of data mining!
4. Name uses of data mining in various fields!
5. What knowledge or patterns can we obtain from the data
below?

NIM     Gender  Nilai UN  Asal Sekolah  IPS1  IPS2  IPS3  IPS4  ...  Lulus Tepat Waktu
10001   L       28        SMAN 2        3.3   3.6   2.89  2.9        Ya
10002   P       27        SMAN 7        4.0   3.2   3.8   3.7        Tidak
10003   P       24        SMAN 1        2.7   3.4   4.0   3.5        Tidak
10004   L       26.4      SMAN 3        3.2   2.7   3.6   3.4        Ya
...
11000   L       23.4      SMAN 5        3.3   2.8   3.1   3.2        Ya
701
5.4 Data Mining Laws
702
Data Mining Laws
1. Business objectives are the origin of every data mining solution
2. Business knowledge is central to every step of the data mining
process
3. Data preparation is more than half of every data mining process
4. There is no free lunch for the data miner
5. There are always patterns
6. Data mining amplifies perception in the business domain
7. Prediction increases information locally by generalisation
8. The value of data mining results is not determined by the
accuracy or stability of predictive models
9. All patterns are subject to change
Tom Khabaza, Nine Laws of Data Mining, 2010
(http://khabaza.codimension.net/index_files/9laws.htm)
703
1 Business Goals Law
Business objectives are the origin of every data
mining solution
705
2 Business Knowledge Law
1. Business understanding must be based on business knowledge, and
so must the mapping of business objectives to data mining goals
2. Data understanding uses business knowledge to understand which
data is related to the business problem, and how it is related
3. Data preparation means using business knowledge to shape the
data so that the required business questions can be asked and
answered
4. Modelling means using data mining algorithms to create predictive
models and interpreting both the models and their behaviour in
business terms – that is, understanding their business relevance
5. Evaluation means understanding the business impact of using the
models
6. Deployment means putting the data mining results to work in a
business process
706
3 Data Preparation Law
Data preparation is more than half of every data
mining process
707
4 No Free Lunch Law
There is No Free Lunch for the Data Miner (NFL-DM)
The right model for a given application can only be discovered by
experiment
709
5 Watkins’ Law
There are always patterns
710
6 Insight Law
Data mining amplifies perception in the
business domain
• How does data mining produce insight? This law approaches the heart of
data mining – why it must be a business process and not a technical one
• Business problems are solved by people, not by algorithms
• The data miner and the business expert “see” the solution to a problem,
that is the patterns in the domain that allow the business objective to be
achieved
• Thus data mining is, or assists as part of, a perceptual process
• Data mining algorithms reveal patterns that are not normally visible to human
perception
• The data mining process integrates these algorithms with the normal
human perceptual process, which is active in nature
• Within the data mining process, the human problem solver interprets the
results of data mining algorithms and integrates them into their business
understanding
711
7 Prediction Law
Prediction increases information locally by
generalisation
• "Predictive models" and "predictive analytics" mean "predict the most likely
outcome"
• Other kinds of data mining models, such as clustering and association, are
also characterised as “predictive”; this is a much looser sense of the term:
• A clustering model might be described as “predicting” the group into which an
individual falls
• An association model might be described as “predicting” one or more attributes on
the basis of those that are known
• What is “prediction” in this sense? What do classification, regression,
clustering and association algorithms and their resultant models have in
common?
• The answer lies in “scoring”, that is the application of a predictive model to a new
example
• The available information about the example in question has been increased, locally,
on the basis of the patterns found by the algorithm and embodied in the model, that
is on the basis of generalisation or induction
712
8 Value Law
The value of data mining results is not determined by
the accuracy or stability of predictive models
714
Assignment: Solving an Organization's Problems
• Analyze the problems and needs of an organization in your own
environment
• Collect and review the available datasets, and connect those
problems and needs to the available data (analyze them against
the 5 data mining roles)
• Where possible, apply several roles at once to process the data,
e.g. association (factor analysis) together with
estimation or clustering
• Run the CRISP-DM process to solve the organization's problems
using the data obtained
• In the data preparation phase, perform data cleaning (replace missing
values, replace, filter attributes) so the data is ready for modeling
• Also compare algorithms and apply feature selection to choose the
best patterns and models
• Summarize the evaluation of the resulting patterns/models/knowledge and
relate the evaluation results to the deployment carried out
• Summarize everything as slides, using the Sarah case study
for marketing as an example
715
Organizational Case Studies

KPK
• Problems: hard to identify the profiles of corruption offenders; mandatory reporters are not compliant with LHKPN filing
• Objectives: classify corruption-offender profiles; find associations among corruption-offender attributes; classify LHKPN compliance; estimate the sentencing demand figure
• Datasets: LHKPN; prosecution data

BSM
• Problem: hard to identify which factors influence the quality of customer financing
• Objective: classify the quality of customer financing profiles
• Dataset: financing data

LKPP
• Problem: a large volume of consultations and questions from various institutions must be answered
• Objectives: find association patterns in institutions' questions; classify question types
• Dataset: consultation data

BPPK
• Problem: hard to handle tweets from the public — are they questions, complaints or suggestions?
• Objective: classify and cluster (text mining) complaints, questions and suggestions on social media
• Dataset: public Twitter data

Universitas Siliwangi
• Problem: the on-time graduation rate is not yet optimal (is the major a factor, or something else?)
• Objective: classify student graduation data
• Dataset: student data
Organizational Case Studies

Kemenkeu (DJPB)
• Problem: hard to determine refinement factors for performance indicators
• Objectives: 1. measure how strongly the components relate to refinement potential; 2. cluster organizational performance data
• Dataset: organizational performance data

Kemenkeu (DJPB)
• Problem: hard to determine the direction of audit opinions on ministries
• Objectives: 1. examine how several data relate to ministry opinions; 2. classify ministry profiles
• Dataset: ministry profile data

Kemenkeu (DJPB)
• Problem: a large volume of regional-office (kanwil) reports with many attributes must be analyzed
• Objectives: 1. examine how several kanwil report indicators relate to accuracy; 2. cluster kanwil reporting data; 3. classify kanwil reporting accuracy
• Dataset: kanwil reporting data

Kemenkeu (DJPB)
• Problem: hard to prioritize kanwil monitoring
• Objectives: 1. cluster kanwil profile data; 2. examine how several attributes relate to the kanwil profile clusters
• Dataset: transaction and kanwil profile data
Organizational Case Studies

Kemenkeu (SDM)
• Problem: reward-and-punishment policies for employees are often ineffective
• Objective: classify the profiles of frequently late vs. disciplined employees, for earlier detection
• Dataset: employee data

Kemenkeu (SDM)
• Problem: only 15% of echelon 4/3/2/1 positions are held by women, although the civil-service entry ratio is nearly balanced
• Objectives: classify and cluster the profiles of echelon 4/3/2/1 officials; find associations between positions and employee profile attributes
• Dataset: employee data

Bank Indonesia
• Problem: growing circulation of counterfeit money in Indonesia
• Objectives: associate the volume of counterfeit circulation with regional profiles in Indonesia; cluster the regions where counterfeit money circulates
• Dataset: counterfeit money circulation data

Adira Finance
• Problem: a rising non-performing loan ratio
• Objectives: classify debtors as performing vs. non-performing; forecast the volume of non-performing loans; measure how strongly bad credit relates to various attributes
• Dataset: debtor data
Organizational Case Studies

Kemsos
• Problem: the parameters for determining household poverty levels in Indonesia are complex
• Objective: classify poor-household profiles in Brebes regency
• Dataset: poor households in Brebes regency

Kemsos
• Problem: hard to determine which households should be prioritized for social assistance
• Objective: cluster the profiles of poor households that have not yet received assistance
• Dataset: poor households in Belitung regency

Kemsos
• Problem: hard to determine which chronic diseases should be prioritized for the health-insurance contribution assistance program (PBIJK)
• Objective: classify the chronic diseases suffered by members of poor households
• Dataset: household members in Belitung regency

Kemsos
• Problem: hard to identify poor households in Indonesia
719
Organizational Case Studies

Kemsos
• Problem: the parameters for determining household poverty levels in Indonesia are complex
• Objective: cluster poor-household profiles in Belitung regency
• Dataset: Integrated Social Welfare Data (Data Terpadu Kesejahteraan Sosial, DTKS), Belitung regency

Kemsos
• Problem: Family Hope Program (PKH) assistance is not well targeted
• Objective: classify the main attributes that influence PKH receipt
• Dataset: DTKS, Belitung regency

Kemsos
• Problem: determining recipients of the uninhabitable-house rehabilitation program
• Objective: cluster house specifications in the DTKS household data
• Dataset: DTKS, West Seram regency

Kemsos
• Problem: many social-assistance recipients in West Seram regency are mistargeted
• Objective: cluster poor-household profiles from the DTKS
• Dataset: DTKS, West Seram regency
720
Thank You
Romi Satria Wahono
[email protected]
http://romisatriawahono.net
08118228331