M3-4 Proses 2

The document describes the CRISP-DM (Cross-Industry Standard Process for Data Mining) process used to analyze factors influencing home heating oil consumption, which included collecting data on home attributes, analyzing correlations between attributes, and finding that average age of occupants and home insulation had the strongest positive correlations with oil consumption while temperature had a negative correlation.

Data Mining Concepts – 3

Data Mining Process 2

Achmad Benny Mutiara

2021
2.5 STANDARD PROCESS FOR DATA MINING
(CRISP-DM)
Data Mining Standard Process
 A cross-industry standard was clearly required that is
industry neutral, tool-neutral, and application-neutral
 The Cross-Industry Standard Process for Data Mining
(CRISP–DM) was developed in 1996 (Chapman, 2000)
 CRISP-DM provides a nonproprietary and freely available
standard process for fitting data mining into the general
problem-solving strategy of a business or research unit

CRISP-DM

1. Business Understanding

 Enunciate the project objectives and requirements
clearly in terms of the business or research unit as a
whole
 Translate these goals and restrictions into the
formulation of a data mining problem definition
 Prepare a preliminary strategy for achieving these
objectives
 Design what you are going to build
2. Data Understanding

 Collect the data

 Use exploratory data analysis to familiarize yourself
with the data and discover initial insights
 Evaluate the quality of the data
 If desired, select interesting subsets that may
contain actionable patterns

3. Data Preparation

 Prepare from the initial raw data the final data set
that is to be used for all subsequent phases
 Select the cases and variables you want to analyze
and that are appropriate for your analysis
 Perform data cleaning, integration, reduction and
transformation, so it is ready for the modeling tools

4. Modeling

 Select and apply appropriate modeling techniques

 Calibrate model settings to optimize results
 Remember that often, several different techniques
may be used for the same data mining problem
 If necessary, loop back to the data preparation
phase to bring the form of the data into line with
the specific requirements of a particular data mining
technique
5. Evaluation

 Evaluate the one or more models delivered in the
modeling phase for quality and effectiveness before
deploying them for use in the field
 Determine whether the model in fact achieves the
objectives set for it in the first phase
 Establish whether some important facet of the
business or research problem has not been accounted
for sufficiently
 Come to a decision regarding use of the data mining
results
6. Deployment

 Make use of the models created:

 model creation does not signify the completion of a
project
 Example of a simple deployment:
 Generate a report
 Example of a more complex deployment:
 Implement a parallel data mining process in another
department
 For businesses, the customer often carries out the deployment
CRISP-DM CASE STUDY
Heating Oil Consumption – Correlational Methods
(Matthew North, Data Mining for the Masses 2nd Edition, 2016,
Chapter 4 Correlational Methods, pp. 69-76)
Dataset: HeatingOil.csv
CRISP-DM

CRISP-DM: Detail Flow

1. Business Understanding
 Motivation:
 Sarah is a regional sales manager for a nationwide supplier of fossil fuels for home
heating
 She feels a need to understand the types of behaviors and other factors that may
influence the demand for heating oil in the domestic market
 She recognizes that there are many factors that influence heating oil
consumption, and believes that by investigating the relationship between a
number of those factors, she will be able to better monitor and respond to
heating oil demand
 She has selected correlation as a way to model the relationship between the
factors she wishes to investigate. Correlation is a statistical measure of how strong
the relationships are between attributes in a data set
 Objective:
 To investigate the relationship between a number of factors that influence heating
oil consumption

2. Data Understanding
 In order to investigate her question, Sarah has enlisted our help
in creating a correlation matrix of six attributes
 Using our employer’s data resources, which are primarily drawn from
the company’s billing database, we create a data set comprising
the following attributes:
1. Insulation: This is a density rating, ranging from one to ten, indicating the thickness
of each home’s insulation. A home with a density rating of one is poorly insulated,
while a home with a density of ten has excellent insulation
2. Temperature: This is the average outdoor ambient temperature at each home for
the most recent year, measured in degrees Fahrenheit
3. Heating_Oil: This is the total number of units of heating oil purchased by the owner
of each home in the most recent year
4. Num_Occupants: This is the total number of occupants living in each home
5. Avg_Age: This is the average age of those occupants
6. Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The
higher the number, the larger the home
3. Data Preparation
Data set: HeatingOil.csv

3. Data Preparation
 Data set appears to be very clean with:
 No missing values in any of the six attributes
 No inconsistent data apparent in our ranges (Min-Max) or other
descriptive statistics
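
The two checks above (missing values and Min-Max ranges) could be reproduced outside the slide tool roughly as follows. This is a minimal sketch in standard-library Python, not the process used in the slides. The helper name `summarize` is hypothetical, and the three sample rows are made up for illustration; only the column names come from the data set description.

```python
import csv
import io

# Made-up sample in the column layout described for HeatingOil.csv.
# These values are illustrative only and are NOT taken from the real file.
SAMPLE_CSV = """Insulation,Temperature,Heating_Oil,Num_Occupants,Avg_Age,Home_Size
6,74,132,4,23.8,4
10,43,263,4,56.7,4
3,81,145,2,28.0,6
"""


def summarize(rows, columns):
    """Return {column: (missing_count, minimum, maximum)} for numeric columns."""
    summary = {}
    for col in columns:
        raw = [(row[col] or "").strip() for row in rows]
        present = [float(value) for value in raw if value != ""]
        missing = len(raw) - len(present)
        summary[col] = (missing, min(present), max(present))
    return summary


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
summary = summarize(rows, reader.fieldnames)

for col, (missing, lowest, highest) in summary.items():
    print(f"{col}: missing={missing}, min={lowest}, max={highest}")
```

To run the same check on the real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an opened file handle; a rating outside its documented scale (for example, Insulation above ten) would then show up in the Min-Max columns.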

4. Modeling

 The correlation matrix is produced as a table

 The higher the value (the darker the purple shading), the
stronger the correlation
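
A correlation matrix of this kind can also be computed directly from the attribute columns. The sketch below is a minimal standard-library Python illustration of the Pearson coefficient, not the operator used in the slides. The function names `pearson` and `correlation_matrix` are hypothetical, and the five rows of values are made up; only the column names come from the data set description.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for illustration only (not taken from HeatingOil.csv).
data = {
    "Temperature": [74, 43, 81, 50, 58],
    "Avg_Age": [23.8, 56.7, 28.0, 45.0, 38.5],
    "Heating_Oil": [132, 263, 145, 196, 180],
}

matrix = correlation_matrix(data)

# Rank the other attributes by the strength (absolute value) of their
# correlation with Heating_Oil, strongest first.
ranked = sorted(
    (name for name in data if name != "Heating_Oil"),
    key=lambda name: abs(matrix[name]["Heating_Oil"]),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {matrix[name]['Heating_Oil']:+.3f}")
```

Ranking by absolute value matters because a strongly negative coefficient indicates as strong a relationship as a strongly positive one; only the direction differs.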

5. Evaluation
[Figure: positive and negative correlations highlighted]
 The attribute (factor) most strongly related to heating oil consumption
(Heating_Oil) is the occupants' average age (Avg_Age), a positive
relationship
 The second most strongly related attribute is Temperature (a negative
relationship)
 The third most strongly related attribute is Insulation (a positive
relationship)
 Home_Size has only a very small influence, while Num_Occupants has
essentially no relationship with heating oil consumption

5. Evaluation
Factor 1

 The chart shows that heating oil consumption is positively correlated with average age
 There are nevertheless a few anomalies:
1. Some homes have a high average age but low oil consumption (light blue points in the
upper left)
2. Some homes have a low average age but high oil consumption (red points in the
lower right)
5. Evaluation
Factors 2 and 3

1. The chart plots Temperature against Insulation, with color showing heating oil consumption (the redder the point,
the higher the consumption)
2. In general, Temperature is negatively related to both Insulation and heating oil consumption: the lower the
temperature, the higher the oil consumption (upper left), as shown by the many yellow and
red points
3. Insulation is also negatively related to Temperature: the lower the temperature, the more insulation is needed
4. There are some anomalies: a few homes with low Insulation ratings still have high oil consumption
5. Evaluation

1. The three-dimensional chart shows the relationship among Temperature, average age, and
Insulation
2. Color indicates heating oil consumption: the redder the point, the higher the consumption
3. The higher the temperature, the less oil is needed (dark blue)
4. The higher the average age and the Insulation rating, the more oil is needed
6. Deployment
Dropping the Num_Occupants attribute
 While the number of people living in a home might logically seem
like a variable that would influence energy usage, in our model it
did not correlate in any significant way with anything else
 Sometimes there are attributes that don’t turn out to be very
interesting

6. Deployment
Adding additional attributes to the data set
 It turned out that the number of occupants in the home
didn’t correlate much with other attributes, but that
doesn’t mean that other attributes would be equally
uninteresting
 For example, what if Sarah had access to the number of
furnaces and/or boilers in each home?
 Home_size was slightly correlated with Heating_Oil usage,
so perhaps the number of instruments that consume
heating oil in each home would tell an interesting story, or
at least add to her insight
6. Deployment

Investigating the role of home insulation

 The Insulation rating attribute was fairly strongly correlated
with a number of other attributes
 There may be some opportunity there to partner with a
company that specializes in adding insulation to existing
homes

6. Deployment
Focusing marketing efforts on cities with low temperatures and a high
average resident age
 The Temperature attribute was fairly strongly negatively correlated
with heating oil consumption
 The Avg_Age attribute had the strongest positive correlation with
heating oil consumption

6. Deployment

Adding greater granularity in the data set

 This data set has yielded some interesting results, but it’s pretty general
 We have used average yearly temperatures and total annual number of
heating oil units in this model
 But we also know that temperatures fluctuate throughout the year in most
areas of the world, and thus monthly, or even weekly measures would not
only be likely to show more detailed results of demand and usage over
time, but the correlations between attributes would probably be more
interesting
 From our model, Sarah now knows how certain attributes interact with one
another, but in the day-to-day business of doing her job, she’ll probably
want to know about usage over time periods shorter than one year

CRISP-DM CASE STUDY
Heating Oil Consumption – Linear Regression
(Matthew North, Data Mining for the Masses 2nd Edition, 2016,
Chapter 8 Linear Regression, pp. 159-171)
Dataset: HeatingOil.csv
Dataset: HeatingOil-scoring.csv

CRISP-DM

CRISP-DM: Detail Flow

1. Business Understanding
 Business is booming, her sales team is signing up thousands of new clients,
and she wants to be sure the company will be able to meet this new level of
demand
 Sarah’s new data mining objective is pretty clear: she wants to anticipate
demand for a consumable product
 We will use a linear regression model to help her with her desired
predictions. She has data, 1,218 observations that give an attribute profile
for each home, along with those homes’ annual heating oil consumption
 She wants to use this data set as training data to predict the usage that
42,650 new clients will bring to her company
 She knows that these new clients’ homes are similar in nature to her existing
client base, so the existing customers’ usage behavior should serve as a solid
gauge for predicting future usage by new customers
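
The idea of fitting a linear regression on existing customers and then predicting usage for new ones can be sketched as follows. This is a minimal standard-library Python illustration of ordinary least squares via the normal equations, not the modeling operator used in the slides. The helper names (`solve`, `fit_linear_regression`, `predict`) are hypothetical, and the training rows are synthetic: the targets were generated from a known linear rule (200 - 1.5 * Temperature + 2 * Avg_Age) so the fit can be checked.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_regression(features, targets):
    """Least squares via (X^T X) w = X^T y; w[0] is the intercept."""
    x = [[1.0] + list(row) for row in features]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, row):
    """Apply the fitted weights to one row of attribute values."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Synthetic training rows: (Temperature, Avg_Age) -> Heating_Oil.
# Illustrative only; NOT taken from HeatingOil.csv.
train_features = [(74, 24), (43, 57), (81, 28), (50, 45), (58, 38)]
train_targets = [137.0, 249.5, 134.5, 215.0, 189.0]

weights = fit_linear_regression(train_features, train_targets)

# A hypothetical new client (scoring row) with no recorded Heating_Oil value.
new_client = (60, 40)
estimate = predict(weights, new_client)
print(f"intercept={weights[0]:.2f}, coefficients={weights[1:]}, estimate={estimate:.1f}")
```

Summing such estimates over every row of the scoring data set would give the kind of total demand figure Sarah is after, under the assumption stated above that new clients resemble existing ones.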

2. Data Understanding
 Sarah has assembled a separate comma-separated values (CSV) file
containing all of these same attributes for her 42,650 new clients
 She has provided this data set to us to use as the scoring data set
in our model
 The data set comprises the following attributes:
 Insulation: This is a density rating, ranging from one to ten, indicating the thickness of
each home’s insulation. A home with a density rating of one is poorly insulated, while
a home with a density of ten has excellent insulation
 Temperature: This is the average outdoor ambient temperature at each home for the
most recent year, measured in degrees Fahrenheit
 Heating_Oil: This is the total number of units of heating oil purchased by the owner of
each home in the most recent year
 Num_Occupants: This is the total number of occupants living in each home
 Avg_Age: This is the average age of those occupants
 Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The
higher the number, the larger the home

3. Data Preparation
 Filter Examples: attribute value filter or custom filter
 Avg_Age>=15.1
 Avg_Age<=72.2
 Deleted records = 42,650 - 42,042 = 608
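
The range filter above keeps only scoring rows whose Avg_Age falls between 15.1 and 72.2, which avoids asking a regression model to extrapolate beyond the values it was trained on. A minimal Python sketch of that filter is shown below; the function name `filter_to_range` is hypothetical and the five rows are made up for illustration.

```python
def filter_to_range(rows, attribute, lower, upper):
    """Keep rows whose attribute lies in [lower, upper]; also return the number dropped."""
    kept = [row for row in rows if lower <= row[attribute] <= upper]
    return kept, len(rows) - len(kept)


# Made-up scoring rows for illustration only (not taken from HeatingOil-scoring.csv).
scoring_rows = [
    {"Avg_Age": 14.0},
    {"Avg_Age": 15.1},
    {"Avg_Age": 40.2},
    {"Avg_Age": 72.2},
    {"Avg_Age": 80.5},
]

kept, removed = filter_to_range(scoring_rows, "Avg_Age", 15.1, 72.2)
print(f"kept={len(kept)}, removed={removed}")
```

Both bounds are inclusive, matching the `>=` and `<=` conditions on the slide.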

4. Modeling

5. Evaluation

6. Deployment

Exercise
 Thanks to the earlier data mining work, Sarah has been promoted to
VP of marketing, managing hundreds of marketers
 Sarah wants each marketer to be able to predict their own potential customers
independently. The problem is that HeatingOil.csv may be accessed only at the VP level
(Sarah) and may not be accessed directly by marketers
 Sarah wants each marketer to build a process that estimates the
heating oil consumption of the clients they approach, using
the model Sarah produced earlier, without access to the training data
(HeatingOil.csv)
 Assume that HeatingOil-Marketing.csv contains the prospective customers that
one of her marketers has successfully approached
 What Sarah must do is build a process to:
1. Compare algorithms (LR, NN, SVM) to find the one that produces the most accurate model,
using 10-fold cross-validation
2. Save the model to a file (Write Model operator)
 What each marketer must do is build a process to:
1. Read the model produced by Sarah (Read Model operator)
2. Apply it to their own HeatingOil-Marketing.csv data
 Let us help Sarah and the marketers build these two processes
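
The two processes in this exercise (compare candidates with 10-fold cross-validation and save the winner; then load the saved model and apply it without the training data) can be sketched in standard-library Python. This is only an illustration of the workflow, with several assumptions: the candidates are a mean baseline and a single-attribute least-squares line rather than the LR, NN, and SVM operators named in the exercise; model quality is measured by cross-validated RMSE; the model is stored as JSON instead of through Write Model and Read Model; and all data values and the file name are made up.

```python
import json
import tempfile
from math import sqrt
from pathlib import Path


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    return {"intercept": sum(ys) / len(ys), "slope": 0.0}


def fit_line(xs, ys):
    """Candidate: least-squares line on a single attribute."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


def predict(model, x):
    return model["intercept"] + model["slope"] * x


def cross_val_rmse(fit, xs, ys, k=10):
    """k-fold cross-validation: every row is held out exactly once."""
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        squared_errors.extend((predict(model, xs[i]) - ys[i]) ** 2 for i in held_out)
    return sqrt(sum(squared_errors) / n)


# Made-up training data (Temperature -> Heating_Oil), illustrative only.
temperature = [40 + 2 * i for i in range(20)]
heating_oil = [300 - 2 * t + (3 if i % 2 == 0 else -3) for i, t in enumerate(temperature)]

# Sarah's process: compare the candidates, refit the best on all data, save it.
candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: cross_val_rmse(fit, temperature, heating_oil) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
model = candidates[best](temperature, heating_oil)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "heating_oil_model.json"  # hypothetical file name
    model_path.write_text(json.dumps(model))

    # Marketer's process: load the saved model; the training data is not needed.
    restored = json.loads(model_path.read_text())

estimate = predict(restored, 50)
print(f"best={best}, scores={scores}, estimate={estimate:.1f}")
```

The point of the split is that only the fitted parameters cross the boundary between Sarah and the marketers, so the restricted training file never has to be shared.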
Algorithm Comparison Process (Sarah)

Data Testing Process (Marketer)

Exercise
 Understand that the CRISP-DM method helps us choose and apply
data mining methods that better fit the organization's
needs
 Study and run experiments based on all of the case
studies in the book Data Mining for the Masses (Matthew
North)

Assignment
 Analyze the problems and needs of an organization in your
surroundings
 Collect and review the available data sets, and connect those
problems and needs to the available data (analysis
from the 5 roles of data mining). Where possible, choose several
roles for processing the data, for example: perform
association (factor analysis) as well as estimation.
 Apply the CRISP-DM process to solve
the problem in your organization using the data you
obtained
 In the data preparation phase, perform data cleaning (replace missing
values, replace, filter attributes) so that the data is ready for modeling
 Also compare algorithms to select the best one