Business Intelligence DM1
Part 2:
Introduction to Data Mining
Using Weka
WEKA !
What’s Weka?
– A bird found only in New Zealand?
A data mining workbench
Waikato Environment for Knowledge Analysis
Machine learning algorithms for data mining tasks
100+ algorithms for classification
75 for data preprocessing
25 to assist with feature selection
20 for clustering, finding association rules, etc.
• Three MOOCs (massive open online courses) on Weka are available. This course
draws heavily on them.
• https://www.cs.waikato.ac.nz/ml/weka/courses.html
• Is there “noise” in the data? (Errors such as “age = -20”, or “gender = m and name = Marianne”,
etc.)
• Are there “outliers” (extreme or exceptional values) that can influence our results? If so,
you may ultimately decide to drop them.
• What do the available attributes look like? What type of data do we have? Is the field textual or numeric?
If numeric, is it a continuous variable or a discrete (categorical) one? For example, “gender”
is a discrete variable (m or f); “age” is a continuous variable but can be made discrete, for example
by subdividing it into categories such as “0-17”, “18-30”, “> 60”, …
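The step from a continuous “age” to a discrete one can be sketched as a simple binning function. This is only an illustration: the slide names the categories “0-17”, “18-30” and “> 60” and leaves the rest open, so the “31-60” bin below is a hypothetical filler, and the function name `discretize_age` is ours.

```python
def discretize_age(age):
    """Map a continuous age to a discrete category label."""
    if age < 0:
        # Noise such as "age = -20" should be caught, not binned.
        raise ValueError(f"implausible age: {age}")
    if age <= 17:
        return "0-17"
    if age <= 30:
        return "18-30"
    if age <= 60:
        return "31-60"  # hypothetical bin, not named on the slide
    return "> 60"


ages = [5, 17, 18, 45, 61]
labels = [discretize_age(a) for a in ages]
print(labels)
```

Once binned, “age” behaves like any other nominal attribute, which matters for algorithms that only accept nominal inputs.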
• How many products can an enterprise (hope to) sell next year?
• Is a company's income tax return in line with those of similar companies (possible tax
evasion)?
• What is the probability of success of a student given all the information we have about that
student and given a group of students from the past (social background, secondary education,
exam results,…)?
• Some questions can also, to some extent, be answered with “conventional” queries,
cross-tabulations, ...
Patient ID   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
11           No            No      Yes              Yes          Yes        ?
12           Yes           Yes     No               No           Yes        ?
13           No            No      No               No           Yes        ?
Source : Roiger, R. J. & M.W. Geatz (2003)
The decision tree (with the “production rules”) can now be
used to make forecasts for future patients
whose diagnosis is still unknown.
[Figure: decision tree with the nodes Swollen Glands and Fever, each with a “no” and a “yes” branch]
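Applying the tree to the three undiagnosed patients amounts to a few nested tests. The sketch below makes two assumptions that the slide text does not confirm: that Fever is tested on the “no” branch of Swollen Glands, and that its leaves are “Cold” (fever) and “Allergy” (no fever), as in the textbook version of this example. Only the rule Swollen Glands = Yes → Strep Throat appears in these slides.

```python
def diagnose(patient):
    """Walk the decision tree for one patient record."""
    if patient["swollen_glands"] == "yes":
        return "Strep Throat"
    # Assumed structure: Fever is only tested when Swollen Glands = no.
    if patient["fever"] == "yes":
        return "Cold"     # assumed leaf label
    return "Allergy"      # assumed leaf label


# The three patients with unknown diagnosis from the table.
patients = {
    11: {"sore_throat": "no", "fever": "no", "swollen_glands": "yes",
         "congestion": "yes", "headache": "yes"},
    12: {"sore_throat": "yes", "fever": "yes", "swollen_glands": "no",
         "congestion": "no", "headache": "yes"},
    13: {"sore_throat": "no", "fever": "no", "swollen_glands": "no",
         "congestion": "no", "headache": "yes"},
}

predictions = {pid: diagnose(p) for pid, p in patients.items()}
for pid, diagnosis in predictions.items():
    print(pid, diagnosis)
```

Note that the tree ignores sore throat, congestion and headache entirely: a decision tree only uses the attributes that help separate the classes.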
Remarks:
- a custodial account is an account type in which an institution or guardian acts on behalf of a “protected” individual
- a joint account is an account shared by several people
This question becomes more global and can be answered through data mining, and more specifically data
clustering.
We can use “clustering” on the above-mentioned data. The clustering technique determines clusters
(categories/groups) such that the “distance” between clusters is as great as possible, whereas the distance
between clients within a cluster is as small as possible.
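The two distances mentioned above can be made concrete on a toy example. The sketch below assumes Euclidean distance, measures “within” as the mean pairwise distance inside a cluster and “between” as the distance between cluster centroids; the (age, annual income in K) points are invented for illustration and are not the course data.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def mean_within_distance(points):
    """Average Euclidean distance over all pairs inside one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical clients as (age, annual income in K), already grouped.
clusters = {
    1: [(22, 45), (25, 50), (28, 55)],
    2: [(48, 82), (52, 85), (55, 90)],
}

within = {k: mean_within_distance(pts) for k, pts in clusters.items()}
between = dist(centroid(clusters[1]), centroid(clusters[2]))

print("within:", within)
print("between:", between)
```

A good clustering shows a `between` value that is clearly larger than every `within` value. In practice the attributes are usually rescaled first, since otherwise the attribute with the largest range dominates the distance.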
We could obtain the following “rules”:
If (margin account = yes && Age = 20-29 && Annual Income = 40-59K)
THEN cluster = 1 (accuracy = 0.80; coverage = 0.50)
If (account type = custodial && favorite recreation = skiing && annual income = 80-90K)
THEN cluster = 2 (accuracy = 0.92; coverage = 0.35)
If (account type = joint && Trades/month > 5 && transaction method = online)
THEN cluster = 3 (accuracy = 0.82; coverage = 0.65)
Accuracy and coverage tell us something about the quality of the clustering (validation); this is discussed later in the course.
A clustering will never be perfect!
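How the two figures attached to each rule could be computed is sketched below. It assumes the usual definitions: accuracy is the share of clients matching the rule's condition that really belong to the cluster, and coverage is the share of the cluster's clients that match the condition. The slides defer the definitions, so treat these as an assumption; the client records are invented.

```python
def rule_accuracy_and_coverage(instances, rule, cluster):
    """Return (accuracy, coverage) of `rule` as a descriptor of `cluster`."""
    matches = [x for x in instances if rule(x)]
    members = [x for x in instances if x["cluster"] == cluster]
    hits = [x for x in matches if x["cluster"] == cluster]
    accuracy = len(hits) / len(matches) if matches else 0.0
    coverage = len(hits) / len(members) if members else 0.0
    return accuracy, coverage


def rule_3(c):
    """Condition of the third rule above."""
    return (c["account_type"] == "joint"
            and c["trades_per_month"] > 5
            and c["method"] == "online")


# Hypothetical clients with the cluster each one was assigned to.
customers = [
    {"account_type": "joint", "trades_per_month": 8, "method": "online", "cluster": 3},
    {"account_type": "joint", "trades_per_month": 6, "method": "online", "cluster": 3},
    {"account_type": "joint", "trades_per_month": 2, "method": "broker", "cluster": 3},
    {"account_type": "individual", "trades_per_month": 9, "method": "online", "cluster": 3},
    {"account_type": "joint", "trades_per_month": 7, "method": "online", "cluster": 1},
    {"account_type": "custodial", "trades_per_month": 1, "method": "broker", "cluster": 2},
]

accuracy, coverage = rule_accuracy_and_coverage(customers, rule_3, cluster=3)
print(f"accuracy = {accuracy:.2f}, coverage = {coverage:.2f}")
```

Here three clients match the rule and two of them are in cluster 3, while cluster 3 has four members in total, so the rule is fairly accurate but describes only half of the cluster.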
• Outliers
• Remove or keep?
• E.g. Age = 400 (false observation) vs. income = 10 000 Euro (correct
observation(?))
• Missing values:
• How to deal with them? E.g. replace with average value?
• Definition of the target variable = outcome variable (if required, cf. below)
• Credit scoring: What is a bad customer/actor (e.g. 90 days payment arrears
according to the international Basel II guidelines)
• Churn management: What is a churner? (e.g. a customer without any
purchase in the last 4 months)
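The preprocessing decisions in the list above can each be written as a small function. This is a minimal sketch under stated assumptions: missing values are encoded as `None` and replaced with the mean, the age limit of 120 is an arbitrary plausibility threshold chosen for illustration, and a churner is taken to be a customer whose last purchase is at least 4 months old. All function names are ours.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing values (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def is_implausible_age(age, max_age=120):
    """Flag false observations such as Age = 400 (threshold is hypothetical)."""
    return age < 0 or age > max_age


def is_churner(months_since_last_purchase, threshold=4):
    """Target variable: no purchase in the last `threshold` months."""
    return months_since_last_purchase >= threshold


incomes = [30000, None, 50000, 10000]
imputed = impute_mean(incomes)
print(imputed)
print(is_implausible_age(400), is_implausible_age(35))
print(is_churner(5), is_churner(1))
```

Mean imputation is only one option; the median is less sensitive to outliers, and an extreme but correct observation (such as an unusual income) should not be removed just because it is extreme.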
[Figure: data mining techniques applied to Training Data, Testing Data and Unseen Data]
Unseen data: (Jeff, Professor, 4) → Tenured?

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
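The role of testing data and unseen data can be illustrated with a tiny classifier. The slide text does not show the learned model, so the rule below (tenured if rank is Professor or years > 6) is a hypothetical stand-in, and the four rows of the table are assumed to be the testing data.

```python
def predict_tenured(rank, years):
    """Hypothetical learned rule; the actual model is not given in the slides."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Rows of the table above: (name, rank, years, tenured), assumed to be test data.
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 1: estimate accuracy on data the model has not been trained on.
correct = sum(
    predict_tenured(rank, years) == tenured
    for _, rank, years, tenured in testing_data
)
accuracy = correct / len(testing_data)

# Step 2: classify the unseen instance (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)

print(f"test accuracy = {accuracy:.2f}")
print(f"Jeff tenured? {jeff}")
```

Under this rule, Merlisa is the only misclassified row, which is the point of keeping testing data separate: it reveals errors that the training data would hide.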
[Figure: data mining strategies, including Classification and Estimation]
Source: adapted from Roiger, R. J. & M.W. Geatz (2003), Data Mining: A tutorial-based primer, Addison-Wesley, 350 p.
[Figure: building an expert system (AI): Human Expert, Knowledge Engineer, Expert System Building Tool, Expert System; Data]
Knowledge Engineer: a person who is trained to work with an expert
and to capture that expert's knowledge.
If Swollen Glands = Yes
Then Diagnosis = Strep Throat
Weather.nominal.arff
• Open iris.arff
• Bring up the Visualize panel
• Bars on the right correspond to attributes: click for the x axis; right‐click for the y axis
• Jitter slider: “Jitter is a random displacement applied to X and Y values to separate points
that lie on top of one another. Without jitter, 1000 instances at the same data point would
look just the same as 1 instance.”
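The effect of the jitter slider can be reproduced in a few lines. This is only a sketch of the idea, assuming a uniform random displacement of at most `amount` in each direction; it does not claim to match how Weka computes jitter internally.

```python
import random


def jitter(points, amount, seed=None):
    """Shift each (x, y) by a random offset in [-amount, amount] per axis."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# 1000 instances sitting on exactly the same data point.
points = [(5.0, 3.0)] * 1000
jittered = jitter(points, amount=0.1, seed=42)

distinct_before = len(set(points))
distinct_after = len(set(jittered))
print(distinct_before, "distinct position(s) before jitter")
print(distinct_after, "distinct positions after jitter")
```

Jitter only changes what is drawn, not the data: the displaced coordinates are for display and should never be written back to the dataset.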
• For classification problems, the output variable must be nominal. For regression problems, the output
variable must be numeric (real-valued).