Data Mining Concepts – 3
Data Mining Process 2
Achmad Benny Mutiara
2021
2.5 THE STANDARD PROCESS FOR DATA MINING
(CRISP-DM)
Data Mining Standard Process
A cross-industry standard was clearly required, one that is
industry-neutral, tool-neutral, and application-neutral
The Cross-Industry Standard Process for Data Mining
(CRISP-DM) was developed in 1996 (Chapman, 2000)
CRISP-DM provides a nonproprietary and freely available
standard process for fitting data mining into the general
problem-solving strategy of a business or research unit
CRISP-DM
1. Business Understanding
Enunciate the project objectives and requirements
clearly in terms of the business or research unit as a
whole
Translate these goals and restrictions into the
formulation of a data mining problem definition
Prepare a preliminary strategy for achieving these
objectives
Designing what you are going to build
2. Data Understanding
Collect the data
Use exploratory data analysis to familiarize yourself
with the data and discover initial insights
Evaluate the quality of the data
If desired, select interesting subsets that may
contain actionable patterns
3. Data Preparation
Prepare from the initial raw data the final data set
that is to be used for all subsequent phases
Select the cases and variables you want to analyze
and that are appropriate for your analysis
Perform data cleaning, integration, reduction and
transformation, so it is ready for the modeling tools
4. Modeling
Select and apply appropriate modeling techniques
Calibrate model settings to optimize results
Remember that often, several different techniques
may be used for the same data mining problem
If necessary, loop back to the data preparation
phase to bring the form of the data into line with
the specific requirements of a particular data mining
technique
5. Evaluation
Evaluate the one or more models delivered in the
modeling phase for quality and effectiveness before
deploying them for use in the field
Determine whether the model in fact achieves the
objectives set for it in the first phase
Establish whether some important facet of the
business or research problem has not been accounted
for sufficiently
Come to a decision regarding use of the data mining
results
6. Deployment
Make use of the models created:
model creation does not signify the completion of a
project
Example of a simple deployment:
Generate a report
Example of a more complex deployment:
Implement a parallel data mining process in another
department
For businesses, the customer often carries out the deployment
CRISP-DM CASE STUDY
Heating Oil Consumption – Correlational Methods
(Matthew North, Data Mining for the Masses 2nd Edition, 2016,
Chapter 4 Correlational Methods, pp. 69-76)
Dataset: HeatingOil.csv
CRISP-DM: Detail Flow
1. Business Understanding
Motivation:
Sarah is a regional sales manager for a nationwide supplier of fossil fuels for home
heating
She feels a need to understand the types of behaviors and other factors that may
influence the demand for heating oil in the domestic market
She recognizes that there are many factors that influence heating oil
consumption, and believes that by investigating the relationship between a
number of those factors, she will be able to better monitor and respond to
heating oil demand
She has selected correlation as a way to model the relationship between the
factors she wishes to investigate. Correlation is a statistical measure of how strong
the relationships are between attributes in a data set
Objective:
To investigate the relationship between a number of factors that influence heating
oil consumption
2. Data Understanding
In order to investigate her question, Sarah has enlisted our help
in creating a correlation matrix of six attributes
Using her employer’s data resources, which are primarily drawn from
the company’s billing database, we create a data set comprising
the following attributes:
1. Insulation: This is a density rating, ranging from one to ten, indicating the thickness
of each home’s insulation. A home with a density rating of one is poorly insulated,
while a home with a density of ten has excellent insulation
2. Temperature: This is the average outdoor ambient temperature at each home for
the most recent year, measured in degrees Fahrenheit
3. Heating_Oil: This is the total number of units of heating oil purchased by the owner
of each home in the most recent year
4. Num_Occupants: This is the total number of occupants living in each home
5. Avg_Age: This is the average age of those occupants
6. Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The
higher the number, the larger the home
3. Data Preparation
Data set: HeatingOil.csv
The data set appears to be very clean, with:
No missing values in any of the six attributes
No inconsistent data apparent in our ranges (Min-Max) or other
descriptive statistics
4. Modeling
The correlation matrix is produced as a table
The larger the magnitude of a value (the darker the purple shading), the
stronger the correlation
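To make the idea of a correlation matrix concrete, the following is a minimal sketch in plain Python that computes Pearson correlation coefficients for every pair of attributes. The six sample rows are invented for illustration and are not taken from HeatingOil.csv; the helper names `pearson` and `correlation_matrix` are hypothetical, not part of any tool used on the slides.

```python
import math

# Invented sample rows for illustration only; NOT values from HeatingOil.csv.
rows = [
    {"Insulation": 4, "Temperature": 74, "Heating_Oil": 132, "Avg_Age": 23.8},
    {"Insulation": 9, "Temperature": 43, "Heating_Oil": 263, "Avg_Age": 56.7},
    {"Insulation": 3, "Temperature": 81, "Heating_Oil": 145, "Avg_Age": 28.0},
    {"Insulation": 6, "Temperature": 50, "Heating_Oil": 196, "Avg_Age": 45.5},
    {"Insulation": 10, "Temperature": 38, "Heating_Oil": 250, "Avg_Age": 59.0},
    {"Insulation": 5, "Temperature": 65, "Heating_Oil": 160, "Avg_Age": 30.1},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(rows, attributes):
    """Nested dict: matrix[a][b] is the correlation between attributes a and b."""
    columns = {a: [row[a] for row in rows] for a in attributes}
    return {
        a: {b: pearson(columns[a], columns[b]) for b in attributes}
        for a in attributes
    }


attributes = ["Insulation", "Temperature", "Heating_Oil", "Avg_Age"]
matrix = correlation_matrix(rows, attributes)

# Print one row per attribute, one column per attribute.
for a in attributes:
    print(f"{a:>12}", " ".join(f"{matrix[a][b]:6.2f}" for b in attributes))
```

Values close to +1 or -1 indicate a strong relationship, values near 0 a weak one; the sign gives the direction.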
5. Evaluation
[Figure: correlation matrix with the positive and negative correlations marked]
The attribute (factor) with the most significant influence (positive relationship) on
heating oil consumption (Heating_Oil) is the average age (Avg_Age) of the
home's occupants
The second most influential attribute (factor) is Temperature (negative
relationship)
The third most influential attribute (factor) is Insulation (positive
relationship)
Home_Size has only a very small influence, while Num_Occupants can be said
to have no influence on heating oil consumption
The chart shows that oil consumption is positively correlated with average age
Some anomalies nevertheless occur:
1. Some households have a high average age but low oil consumption (light blue, upper
left)
2. Some households have a low average age but high oil consumption (red, lower
right)
1. The chart shows the relationship between temperature and insulation, with color representing oil consumption (the redder
the point, the higher the oil consumption)
2. In general, temperature is negatively related to both insulation and oil consumption: the lower the temperature,
the higher the oil consumption (upper left), as shown by the many yellow and
red points
3. Insulation is also negatively related to temperature: the lower the temperature, the more insulation is needed
4. Some anomalies occur at low Insulation values, where a few homes still require a large amount of oil
1. The three-dimensional chart shows the relationship between temperature, average age, and
insulation
2. Color represents oil consumption: the redder the point, the higher the consumption
3. The higher the temperature, the less oil is needed (dark blue)
4. The higher the average age and insulation, the more oil is needed
6. Deployment
Dropping the Num_Occupants attribute
While the number of people living in a home might logically seem
like a variable that would influence energy usage, in our model it
did not correlate in any significant way with anything else
Sometimes there are attributes that don’t turn out to be very
interesting
Adding additional attributes to the data set
It turned out that the number of occupants in the home
didn’t correlate much with other attributes, but that
doesn’t mean that other attributes would be equally
uninteresting
For example, what if Sarah had access to the number of
furnaces and/or boilers in each home?
Home_Size was slightly correlated with Heating_Oil usage,
so perhaps the number of instruments that consume
heating oil in each home would tell an interesting story, or
at least add to her insight
Investigating the role of home insulation
The Insulation rating attribute was fairly strongly correlated
with a number of other attributes
There may be some opportunity there to partner with a
company that specializes in adding insulation to existing
homes
Focusing the marketing efforts on cities with low temperatures and a high
average resident age
The Temperature attribute was fairly strongly negatively correlated
with heating oil consumption
The Avg_Age attribute had the strongest positive correlation with
heating oil consumption
Adding greater granularity in the data set
This data set has yielded some interesting results, but it’s pretty general
We have used average yearly temperatures and total annual number of
heating oil units in this model
But we also know that temperatures fluctuate throughout the year in most
areas of the world, and thus monthly, or even weekly measures would not
only be likely to show more detailed results of demand and usage over
time, but the correlations between attributes would probably be more
interesting
From our model, Sarah now knows how certain attributes interact with one
another, but in the day-to-day business of doing her job, she’ll probably
want to know about usage over time periods shorter than one year
CRISP-DM CASE STUDY
Heating Oil Consumption – Linear Regression
(Matthew North, Data Mining for the Masses 2nd Edition, 2016,
Chapter 8 Linear Regression, pp. 159-171)
Dataset: HeatingOil.csv
Dataset: HeatingOil-scoring.csv
1. Business Understanding
Business is booming: Sarah’s sales team is signing up thousands of new clients,
and she wants to be sure the company will be able to meet this new level of
demand
Sarah’s new data mining objective is pretty clear: she wants to anticipate
demand for a consumable product
We will use a linear regression model to help her with her desired
predictions. She has data, 1,218 observations that give an attribute profile
for each home, along with those homes’ annual heating oil consumption
She wants to use this data set as training data to predict the usage that
42,650 new clients will bring to her company
She knows that these new clients’ homes are similar in nature to her existing
client base, so the existing customers’ usage behavior should serve as a solid
gauge for predicting future usage by new customers
2. Data Understanding
Sarah has assembled a separate comma-separated values (CSV) file
containing all of these same attributes for her 42,650 new clients
She has provided this data set to us to use as the scoring data set
in our model
The data set comprises the following attributes:
Insulation: This is a density rating, ranging from one to ten, indicating the thickness of
each home’s insulation. A home with a density rating of one is poorly insulated, while
a home with a density of ten has excellent insulation
Temperature: This is the average outdoor ambient temperature at each home for the
most recent year, measured in degrees Fahrenheit
Heating_Oil: This is the total number of units of heating oil purchased by the owner of
each home in the most recent year
Num_Occupants: This is the total number of occupants living in each home
Avg_Age: This is the average age of those occupants
Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The
higher the number, the larger the home
3. Data Preparation
Filter Examples: attribute value filter or custom filter
Avg_Age >= 15.1
Avg_Age <= 72.2
Deleted records = 42,650 - 42,042 = 608
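The two Avg_Age conditions amount to a simple range filter on the scoring data, presumably to keep the scoring records within the Avg_Age range the model was trained on. A minimal sketch in plain Python follows; the five scoring records are invented for illustration (they are not rows from HeatingOil-scoring.csv), and `filter_by_range` is a hypothetical helper name.

```python
# Bounds taken from the two filter conditions listed on the slide.
MIN_AVG_AGE = 15.1
MAX_AVG_AGE = 72.2

# Invented scoring records for illustration; NOT rows from HeatingOil-scoring.csv.
scoring_rows = [
    {"Avg_Age": 14.9, "Temperature": 60},
    {"Avg_Age": 15.1, "Temperature": 45},
    {"Avg_Age": 40.0, "Temperature": 52},
    {"Avg_Age": 72.2, "Temperature": 38},
    {"Avg_Age": 73.5, "Temperature": 70},
]


def filter_by_range(rows, attribute, low, high):
    """Keep only the rows whose attribute value lies in [low, high], bounds included."""
    return [row for row in rows if low <= row[attribute] <= high]


kept = filter_by_range(scoring_rows, "Avg_Age", MIN_AVG_AGE, MAX_AVG_AGE)
deleted = len(scoring_rows) - len(kept)
print(f"kept {len(kept)} rows, deleted {deleted}")
```

The number of deleted records is simply the size of the original set minus the size of the filtered set, which is the subtraction shown on the slide.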
4. Modeling
5. Evaluation
6. Deployment
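The modeling, evaluation, and deployment phases of this case study carry no text in the deck. As a rough illustration of the approach described under Business Understanding (train a linear regression on existing homes, then score new clients), here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the training rows are generated from a made-up linear relation rather than read from HeatingOil.csv, only two predictors are used, and summing the predictions to estimate total demand is one plausible way to use the scores.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_linear_regression(features, targets):
    """Least-squares fit via the normal equations; returns [intercept, coef_1, ...]."""
    design = [[1.0] + list(row) for row in features]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefficients, row):
    """Apply the fitted linear model to one feature row."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))


# Invented training data: (Temperature, Avg_Age) per home, with Heating_Oil
# generated from the made-up relation 100 - 1.5 * Temperature + 3.0 * Avg_Age.
train_features = [(40, 30), (50, 45), (60, 25), (70, 60), (45, 55), (65, 35)]
train_targets = [100 - 1.5 * t + 3.0 * a for t, a in train_features]

coefficients = fit_linear_regression(train_features, train_targets)

# Invented scoring data standing in for the new clients.
scoring_features = [(55, 40), (42, 65)]
predictions = [predict(coefficients, row) for row in scoring_features]
total_demand = sum(predictions)
print(f"Estimated total demand: {total_demand:.1f} units")
```

Because the training targets are generated without noise, the fit recovers the made-up coefficients almost exactly; real data would leave residual error that the Evaluation phase has to assess.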
Exercise
Thanks to the earlier data mining work, Sarah has been promoted to
VP of marketing, managing hundreds of marketers
Sarah wants the marketers to be able to predict their own potential customers
independently. The problem is that HeatingOil.csv may only be accessed at the VP level
(Sarah) and may not be accessed directly by marketers
Sarah wants each marketer to build a process that estimates the
heating oil consumption of the clients they approach, using
the model Sarah produced earlier, without accessing the training data
(HeatingOil.csv)
Assume that HeatingOil-Marketing.csv contains the prospective customers
successfully approached by one of her marketers
Sarah needs to build a process to:
1. Compare algorithms (LR, NN, SVM) to find the one that produces the most accurate model,
using 10-fold cross-validation
2. Save the model to a file (Write Model operator)
Each marketer needs to build a process to:
1. Read the model produced by Sarah (Read Model operator)
2. Apply it to their own HeatingOil-Marketing.csv data
Let us help Sarah and the marketers build these two processes
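The split between the two processes (compare models with 10-fold cross-validation and save the winner; later load it and apply it without the training data) can be sketched in plain Python. This is an illustration under loud assumptions, not a solution using the Write Model and Read Model operators: the data is invented, a mean baseline and a one-variable regression line stand in for LR, NN, and SVM, average RMSE stands in for "accuracy", and the model file is a JSON file in a temporary directory.

```python
import json
import math
import os
import tempfile


def fit_mean(xs, ys):
    """Stand-in baseline model: always predict the training mean."""
    return {"kind": "mean", "intercept": sum(ys) / len(ys), "slope": 0.0}


def fit_line(xs, ys):
    """Stand-in model: one-variable least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return {"kind": "line", "intercept": mean_y - slope * mean_x, "slope": slope}


def predict(model, x):
    return model["intercept"] + model["slope"] * x


def cross_validate(fit, xs, ys, folds=10):
    """Average RMSE over k folds; fold f holds out the records with index % folds == f."""
    errors = []
    for fold in range(folds):
        train = [i for i in range(len(xs)) if i % folds != fold]
        test = [i for i in range(len(xs)) if i % folds == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        squared = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test]
        errors.append(math.sqrt(sum(squared) / len(squared)))
    return sum(errors) / len(errors)


# Invented training data: temperature vs. heating oil with a small alternating offset.
xs = [30.0 + 2 * i for i in range(20)]
ys = [300 - 3 * x + (5 if i % 2 else -5) for i, x in enumerate(xs)]

# Sarah's process: compare candidates, refit the best one on all data, save it.
candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: cross_validate(fit, xs, ys) for name, fit in candidates.items()}
best_name = min(scores, key=scores.get)  # lowest average RMSE wins
best_model = candidates[best_name](xs, ys)

model_path = os.path.join(tempfile.mkdtemp(), "heating_oil_model.json")
with open(model_path, "w") as handle:
    json.dump(best_model, handle)

# Marketer's process: load the saved model and score prospects, no training data needed.
with open(model_path) as handle:
    loaded_model = json.load(handle)

prospect_temperatures = [35.0, 50.0, 62.0]  # invented prospects
estimates = [predict(loaded_model, t) for t in prospect_temperatures]
print(best_name, [round(e, 1) for e in estimates])
```

The point of the design is that only the fitted parameters cross the boundary between the two roles: the marketer's half of the script never touches `xs` or `ys`.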
Algorithm Comparison Process (Sarah)
Data Testing Process (Marketer)
Exercise
Understand how the CRISP-DM method helps us understand the
use of data mining methods that better fit the needs of
the organization
Study and run experiments based on all the case
studies in the book Data Mining for the Masses (Matthew
North)
Assignment
Analyze the problems and needs of an organization in your
surroundings
Collect and review the available data sets, and relate those
problems and needs to the available data (analyze them in terms
of the 5 roles of data mining). Where possible, choose several
roles for processing the data, for example: perform
association (factor analysis) together with estimation.
Follow the CRISP-DM process to solve the problem
in your organization using the data you
obtained
In the data preparation phase, perform data cleaning (replace missing
values, replace, filter attributes) so that the data is ready for modeling
Also compare algorithms to select the best one