Business Intelligence

Part 2:
Introduction to Data Mining

Using Weka

Required: a computer on which Weka can be installed (Mac / Windows / Linux)


Module 2: Architectural Perspective

Stephan Poelmans– Data Mining


Background literature & Sources

WEKA !

What’s Weka?
– A bird found only in New Zealand?
A Data mining workbench
Waikato Environment for Knowledge Analysis
Machine learning algorithms for data mining tasks
100+ algorithms for classification
75 for data preprocessing
25 to assist with feature selection
20 for clustering, finding association rules, etc.

No programming experience required


High School Mathematics
Practice !



Online courses

• There are 3 MOOCs (massive open online courses) available. This course draws
heavily on them.

• https://www.cs.waikato.ac.nz/ml/weka/courses.html

• Course material of the MOOCs (slides, videos):

• https://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/



What is data mining?
• Data mining comprises various quantitative methods and techniques for finding
relations and patterns in huge amounts of data (such as a data warehouse),
preferably hidden relations and relations that contribute to better
management decisions.

• Definition: “… the non-trivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data” (source: Fayyad,
1996)

• Data mining uses statistics and artificial intelligence.

• Data mining is part of what we call Knowledge Discovery in Databases
(KDD). KDD describes the overall process in which data mining is one step.
The terms data mining and KDD are often used interchangeably.

Data Mining and Knowledge Discovery

Data mining as one step in the “Knowledge Discovery in Databases (KDD)” process:



Data Mining and Knowledge Discovery

• Data collection/Source data: for example, data moved from operational
databases to a data warehouse.
• Selection: selecting the data you need to perform data mining (for
example tables from a data warehouse)
• Preprocessing: checking the data and cleaning it (see next slide)
• Transformation: converting the data into a usable form (for example
a format you can import into a data mining tool)
• Data mining: the application of data mining techniques
• Interpretation/Evaluation: interpreting the results (which
relations/patterns/forecasts have we found?)
• We then get the “knowledge” that can be used to make better
decisions



What is data mining? KDD Preprocessing
• Preprocessing means examining, from several angles, the data you will use for data mining, and
cleaning it. This phase is very important and far too often rushed. Here are some
important activities:

• Which attributes are relevant?
For example, in “credit scoring” (evaluating the solvency of clients): what is a “bad” client? (For
example, under Basel II: 90 days past due.)

• Are there “missing values”?
For example, we have a client database (different tables), but some critical information is missing for
some of the clients (such as the birth date).

• Is there “noise” in the data? (errors such as “age = -20”, or “gender = m and name = Marianne”,
etc.)

• Are there “outliers” (extreme or exceptional values) that can influence our results? If this is the
case, you may decide to drop them…

• What do the available attributes look like? What type of data do we have? Is the field textual or numeric?
If numeric, is it a continuous variable or a discrete (categorical) one? (For example, “gender”
is a discrete variable (m or f); “age” is a continuous variable but can be made discrete, for example
by subdividing it into categories such as “0-17”, “18-30”, …, “> 60”.)
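
The checks for missing values and noise can be sketched in a few lines of Python. This is only an illustration: the client records, the field names and the plausible age range (0 to 120) are assumptions made up for the example, not data from a real database.

```python
# Hypothetical client records; None marks a missing value.
clients = [
    {"name": "Marianne", "gender": "f", "age": 34},
    {"name": "Peter", "gender": "m", "age": -20},   # noise: impossible age
    {"name": "Sofia", "gender": "f", "age": None},  # missing value
    {"name": "Lucas", "gender": "m", "age": 51},
]

def find_missing(records, field):
    """Return the names of records whose value for `field` is missing."""
    return [r["name"] for r in records if r[field] is None]

def find_noise(records, field, low, high):
    """Return the names of records whose value lies outside [low, high]."""
    return [
        r["name"]
        for r in records
        if r[field] is not None and not (low <= r[field] <= high)
    ]

missing_age = find_missing(clients, "age")
noisy_age = find_noise(clients, "age", 0, 120)  # assumed plausible range

print("missing age:", missing_age)
print("implausible age:", noisy_age)
```

Whether a flagged record is corrected, completed or dropped remains a decision for the analyst; the code only locates the candidates.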



Applications of DM (based on “Data Mining with Weka”, Futurelearn.com)
• Can data mining be applied to …
• customer relationship management?
• “To establish a productive relationship with a customer, businesses collect data and analyse it. This can help in acquiring and retaining
customers, developing customer loyalty, and implementing customer-focused strategies.”
• supermarket basket analysis?
• “If you buy certain items you are more likely to buy other items. Understanding purchasing behavior helps the retailer understand the buyer’s
needs and change the store layout accordingly.”
• financial data analysis?
• “Data mining can contribute to solving business problems in banking and finance by finding patterns, causalities, and correlations in business
information and market prices that are not immediately apparent to managers.”
• E.g. financial data from companies that went bankrupt (in order to predict or warn companies in difficulty)
• education?
• “Mining data generated by online courses can help predict students’ future learning success, assess the effect of educational support, and
advance scientific knowledge about learning. Perhaps individual learning patterns can be discovered and used to personalize the presentation
of material.”
• Learning Analytics = using online data from online tools like Blackboard to detect patterns and predict success
• healthcare?
• “Data mining can help identify best practices that improve care and reduce costs. It can be used to predict the volume of patients in different
categories, which helps hospitals develop processes to ensure that patients receive appropriate care at the right place and at the right time.”
• criminal investigation?
• Crime analysis involves exploring and detecting crimes and their relationships with criminals. Crime datasets are large, comprehensive and
complex. Textual reports – interrogation transcripts for instance - can be analyzed using text mining.
• making babies?
• “Yes! Selecting appropriate embryos for artificial insemination, amongst other possibilities that we’ll leave to your imagination.”
• … and many other fields..



Typical Business Requirements
• Typical “business requirements” for which data mining can offer an answer:
• What is the credit risk of a client?

• Can we divide clients into different groups (clusters)?

• Which products do clients often buy together?

• How many products can an enterprise (hope to) sell next year?

• Is a company's income tax return in line with those of similar companies (detecting tax
evasion)?

• What is the probability of success of a student given all the information we have about that
student and given a group of students from the past (social background, secondary education,
exam results, …)?

• Some questions can also, to some extent, be answered with “conventional” queries,
cross-tabulations, etc.



Example of data mining: a Decision Tree
Preprocessed data (data from the past):

Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1           yes          yes    yes             yes         yes       strep throat
2           no           no     no              yes         yes       allergy
3           yes          yes    no              yes         no        cold
4           yes          no     yes             no          no        strep throat
5           no           yes    no              yes         no        cold
6           no           no     no              yes         no        allergy
7           no           no     yes             no          no        strep throat
8           yes          no     no              yes         yes       allergy
9           no           yes    no              yes         yes       cold
10          yes          yes    no              yes         yes       cold

Decision tree:

Swollen Glands?
  yes: Diagnosis = Strep Throat
  no:  Fever?
         no:  Diagnosis = Allergy
         yes: Diagnosis = Cold

Source: Roiger, R. J. & M.W. Geatz (2003)

The data is “preprocessed” and converted into “discrete variables”.

Associated “production rules”:
1. If a patient has swollen glands, the diagnosis is strep throat.
2. If a patient does not have swollen glands and has fever, the diagnosis is a cold.
3. If a patient does not have swollen glands and does not have fever, the diagnosis is an allergy.
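
As a small check, the three production rules can be written as a Python function and applied to the ten patients of the table. The function name `diagnose` and the tuple layout are choices made for this sketch; only the two attributes used by the tree are kept.

```python
# Training data from the table: (patient ID, swollen glands, fever, diagnosis).
patients = [
    (1, "yes", "yes", "strep throat"),
    (2, "no", "no", "allergy"),
    (3, "no", "yes", "cold"),
    (4, "yes", "no", "strep throat"),
    (5, "no", "yes", "cold"),
    (6, "no", "no", "allergy"),
    (7, "yes", "no", "strep throat"),
    (8, "no", "no", "allergy"),
    (9, "no", "yes", "cold"),
    (10, "no", "yes", "cold"),
]

def diagnose(swollen_glands, fever):
    """Apply the three production rules to one patient."""
    if swollen_glands == "yes":
        return "strep throat"   # rule 1
    if fever == "yes":
        return "cold"           # rule 2
    return "allergy"            # rule 3

correct = sum(
    1 for _, glands, fever, actual in patients
    if diagnose(glands, fever) == actual
)
print(f"{correct} of {len(patients)} training patients reproduced")
```

On this small table the rules reproduce every recorded diagnosis, which is expected because the tree was derived from exactly these rows; it says nothing yet about new patients.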



Example of data mining: a Decision Tree

Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11          no           no     yes             yes         yes       ?
12          yes          yes    no              no          yes       ?
13          no           no     no              no          yes       ?

Source: Roiger, R. J. & M.W. Geatz (2003)

The decision tree (with the “production rules”) can now be used to make forecasts
for future patients whose diagnosis is still unknown…

Swollen Glands?
  yes: Diagnosis = Strep Throat
  no:  Fever?
         no:  Diagnosis = Allergy
         yes: Diagnosis = Cold
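
To illustrate, the tree can be applied to patients 11 to 13 with a few lines of Python. The function and variable names are choices made for this sketch; the attribute values come from the table of new patients.

```python
def diagnose(swollen_glands, fever):
    """Walk the decision tree: swollen glands first, then fever."""
    if swollen_glands == "yes":
        return "strep throat"
    return "cold" if fever == "yes" else "allergy"

# New patients: ID -> (swollen glands, fever)
new_patients = {
    11: ("yes", "no"),
    12: ("no", "yes"),
    13: ("no", "no"),
}

predictions = {
    pid: diagnose(glands, fever)
    for pid, (glands, fever) in new_patients.items()
}
for pid, diagnosis in predictions.items():
    print(pid, "->", diagnosis)
```

Note that sore throat, congestion and headache play no role in the prediction: the tree only uses the attributes it found informative in the training data.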



Example of data mining: Clustering
Credit Card Example
Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         joint         no              online              12.5          f    30-39  tennis               40-59K
1013         custodial     no              broker              0.5           f    50-59  skiing               80-99K
1245         joint         no              online              3.6           m    20-29  golf                 20-39K
2110         individual    yes             broker              22.3          m    30-39  fishing              40-59K
1001         individual    yes             online              5             m    40-49  golf                 60-79K

Source: Roiger, R. J. & M.W. Geatz (2003)

Remark:
- a custodial account is an account type where an institution or guardian represents a “protected” individual
- a joint account is an account shared by several people

We could ask the question:

“According to which attributes (fields) and attribute values can we group clients into clusters (similar
groups)?”

This is a more general question than a conventional query. It can be answered through data mining, and
more specifically through clustering.



Example of data mining: Clustering
Credit Card Example
Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         joint         no              online              12.5          f    30-39  tennis               40-59K
1013         custodial     no              broker              0.5           f    50-59  skiing               80-99K
1245         joint         no              online              3.6           m    20-29  golf                 20-39K
2110         individual    yes             broker              22.3          m    30-39  fishing              40-59K
1001         individual    yes             online              5             m    40-49  golf                 60-79K

We can apply “clustering” to the data above. The clustering technique determines clusters
(categories/groups) such that the “distance” between clusters is as great as possible, whereas the distance
between clients within a cluster is as small as possible.
We could obtain the following “rules”:

IF (margin account = yes && age = 20-29 && annual income = 40-59K)
THEN cluster = 1 (accuracy = 0.80; coverage = 0.50)
IF (account type = custodial && favorite recreation = skiing && annual income = 80-99K)
THEN cluster = 2 (accuracy = 0.92; coverage = 0.35)
IF (account type = joint && trades/month > 5 && transaction method = online)
THEN cluster = 3 (accuracy = 0.82; coverage = 0.65)

Accuracy and coverage tell us something about the quality of the clustering (validation); see further.
A clustering will never be perfect!

Other Basic Concepts:
• Discrete vs. continuous variables
• DM is frequently applied to numerical or Boolean variables
• We can divide numerical variables into the following ones:
• Discrete variables: for example gender, nationality, marital status, …
• Continuous variables: for example age, income, temperature, satisfaction level
(on a scale), …
• Note: continuous variables can be converted into discrete variables!
For example, age: “0-17”, “18-30”, “31-50”, “>50”
For example, income: “low: < 1300”, “average: 1300-1800”, etc.
• Be careful: it is not always easy to know whether, strictly speaking, we are dealing with a discrete
or a continuous variable!
For example, the use of a Likert scale:
A Likert scale in a survey can vary from 1 “very unhappy” to 5 “very happy”. Is this a discrete
variable or a continuous one? Statisticians will tend to say it is discrete (with 5 categories);
in many scientific surveys, this scale is nonetheless treated as a continuous variable (all scores
between 1 and 5 are possible if you calculate an average over all respondents).
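
Converting a continuous variable into a discrete one (often called binning) can be sketched as follows. The bin boundaries are an assumption chosen for this example (non-overlapping categories 0-17, 18-30, 31-50 and >50); in practice they depend on the business question.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three bins; larger ages fall in ">50".
AGE_LIMITS = [17, 30, 50]
AGE_LABELS = ["0-17", "18-30", "31-50", ">50"]

def discretize_age(age):
    """Map a non-negative age to its category label."""
    return AGE_LABELS[bisect_left(AGE_LIMITS, age)]

ages = [5, 17, 18, 30, 42, 50, 51, 80]
categories = [discretize_age(a) for a in ages]
for age, category in zip(ages, categories):
    print(age, "->", category)
```

`bisect_left` returns the position of the first limit that is greater than or equal to the age, which is exactly the index of the matching bin.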



Other Basic Concepts:

• Outliers
• Remove or keep?
• E.g. age = 400 (an erroneous observation) vs. income = 10 000 euro (an extreme
but possibly correct observation)
• Missing values:
• How to deal with them? E.g. replace with average value?
• Definition of the target variable = outcome variable (if required, cf. below)
• Credit scoring: What is a bad customer/actor (e.g. 90 days payment arrears
according to the international Basel II guidelines)
• Churn management: What is a churner? (e.g. a customer without any
purchase in the last 4 months)
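
One of the options mentioned for missing values, replacing them with the average, can be sketched like this. The ages are made up for the example, and mean imputation is only one possible strategy; it is shown here because the slide names it, not because it is always the best choice.

```python
from statistics import mean

ages = [34, None, 51, 27, None, 48]  # hypothetical ages, None = missing

known = [a for a in ages if a is not None]
average_age = mean(known)  # (34 + 51 + 27 + 48) / 4

# Keep known values, fill the gaps with the average of the known values.
imputed = [a if a is not None else average_age for a in ages]
print(imputed)
```

Filling in the average keeps all instances in the dataset, but it also reduces the variance of the attribute, so the share of imputed values should stay small.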



Other Basic Concepts:
• Training vs. Test Set
➢ Different from “conventional statistics”:
▪ In the case of data mining, we split a sample/dataset into 2 parts
▪ The “training set” is used as a calculation base (to build the initial model).
Schemes/relations are searched for in the training set.
▪ The schemes/relations/models are tested with (applied to) a test set to
verify whether they have a good prediction rate and to adjust them (make them
more general).
▪ For example, we use 70% of the data for the training set and 30% for the
test set
▪ This division is not problematic because data mining is usually applied to
very large datasets (most often coming from a data warehouse)
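
A 70/30 split can be sketched in plain Python as follows. The function name, the fixed random seed and the stand-in dataset of 100 numbered instances are assumptions for illustration; Weka offers its own split options in the Classify panel.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of `rows` and split it into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

dataset = list(range(100))  # stand-in for 100 instances
train, test = train_test_split(dataset)
print(len(train), "training instances,", len(test), "test instances")
```

Shuffling before cutting matters: if the data is sorted (for example by date or by class), taking the first 70% would give a training set that is not representative of the test set.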



Data Mining Model Construction

Training Data -> Data Mining Techniques -> Resulting Model (rules)

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Resulting model (rules):

IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’



Use the Model (built using the training set) with
the test set and later on new “unseen” data

Model (rules) -> applied to Testing Data and to Unseen Data

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). Tenured?

Data mining is never perfect: according to the “rules”, Merlisa
should be “tenured”…
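
The evaluation on the test set can be reproduced with a short Python sketch. The rule is the one learned from the training data (rank = professor OR years > 6); the function name and the tuple layout are choices made for this example.

```python
def predict_tenured(rank, years):
    """Rule from the training set: professor OR more than 6 years."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Test set: (name, rank, years, actual tenured status)
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

misclassified = [
    name for name, rank, years, actual in test_set
    if predict_tenured(rank, years) != actual
]
accuracy = 1 - len(misclassified) / len(test_set)

print("misclassified:", misclassified)
print("accuracy on the test set:", accuracy)
print("Jeff (Professor, 4 years):", predict_tenured("Professor", 4))
```

Merlisa is the single error, so the rule classifies three of the four test instances correctly; for the unseen instance Jeff the model predicts “yes”.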



Data Mining Strategies

• We are looking for relations between data (variables such as fields
from a table)

• Three data mining “strategies”: Supervised Learning vs. Unsupervised
Learning vs. Market Basket Analysis

Data Mining Strategies
  Unsupervised Learning
  Supervised Learning
    Classification
    Estimation
  Market Basket Analysis

Source: adapted from Roiger, R. J. & M.W. Geatz (2003), Data Mining: A tutorial-based primer, Addison-Wesley, 350 p.



Data Mining Strategies
• Supervised learning = predictive methods:
• We are looking for an explanatory or a predictive model
• There is an “output” variable (or target variable)
• For example:
• What are the differences between clients that do not come back after one purchase and clients
who do come back at least a second time?
Output variable: on the basis of a series of criteria, we determine whether a person will not return
(0) or will return (1)
• What are the differences between owners and non-owners of a house?
• How could we explain the differences in client satisfaction? What makes clients satisfied or not?
• How can we explain the fact that some projects are successful and others not?
• Within supervised learning, we further distinguish between
“classification” (not to be confused with clustering) and “estimation”
• Classification: the output variable is always discrete (for example Decision Trees,
Logistic Regression, …)
• Estimation: the output variable is continuous (for example a linear regression)



Data Mining Strategies

• Unsupervised learning = descriptive methods:
• We are not really looking for an explanatory or a predictive model
• We do not have an “output” or target variable
• For example: divide customers into similar groups (clustering)
• Market Basket Analysis:
• The aim is to find relationships / connections between retail products.
• Typically: which products do the customers buy together?
• “Association rules” is the technique par excellence (see further)
• Supervised and unsupervised learning could be described respectively as “purposeful” and
“exploratory”.
• There are several data mining techniques. Some techniques are supervised, some are unsupervised,
and some belong to “market basket analysis”. Other techniques can be used in either a supervised or an
unsupervised way.
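
To give a first feel for market basket analysis, the sketch below computes two measures commonly used with association rules, support and confidence, for the rule “bread => butter”. The five shopping baskets are invented for the example, and the two measures are the usual textbook definitions rather than something defined on these slides.

```python
# Hypothetical shopping baskets (one set of products per customer visit).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def count(itemset):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in transactions if itemset <= basket)

def support(itemset):
    """Share of all baskets that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print("support(bread, butter):", rule_support)
print("confidence(bread => butter):", rule_confidence)
```

Here bread and butter occur together in 3 of the 5 baskets, and 3 of the 4 baskets containing bread also contain butter.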



Data Mining vs. Expert Systems

Data mining:
Data -> Data Mining Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat

Expert system:
Human Expert -> Knowledge Engineer -> Expert System Building Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat

Expert System
A computer program that has the problem-solving capabilities of one or more human experts
(= Artificial Intelligence (AI)).

Knowledge Engineer
A person who is trained to work with an expert and to capture his knowledge.

Data mining vs. expert systems



Installing Weka
• Go to https://www.cs.waikato.ac.nz/ml/weka/
• Click the Download button and choose:
• the latest version, currently Weka-3-8-3
• the appropriate link for your computer: Windows, Mac, or Linux
• Weka is developed in Java, so you’ll need a JVM (Java Virtual Machine)
• if you don’t know what Java is, you probably want the file that includes a Java VM
• Once it’s downloaded, open it to get a standard setup “wizard”.
• Just keep clicking “Next”! Install it in the default place, and remember the name of that place!
• After installation, uncheck the box that says “Start Weka” and click “Finish”.
• Then go to where Weka was installed and
• make a copy of the data folder (within the Weka folder) and put it in a convenient place for future
use
• rename it “Weka datasets”!
• Having installed Weka, note its different interfaces: the Explorer, the Experimenter, the
KnowledgeFlow interface, and the command-line interface. We will be using the Explorer and the
Experimenter.
• In the Explorer there are several panels: Preprocess; Classify; Clustering; Association rules;
Attribute selection; and Visualization.



Exploring Weka

Weather.nominal.arff

Weka: Visualizing your Data
• Using the Visualize panel

• Open iris.arff
• Bring up the Visualize panel

• Click one of the plots; examine some instances

• Set x axis to petalwidth and y axis to petallength

• Click on Class colour (bottom) to change the colour

• The bars on the right correspond to attributes: click to set the x axis; right-click to set the y axis

• Jitter slider (“Jitter is a random displacement applied to X and Y values to separate points
that lie on top of one another. Without jitter, 1000 instances at the same data point would
look just the same as 1 instance.”)

• Show Select Instance: Rectangle option

• Submit, Reset, Clear and Save



Weka: Use a filter to remove an attribute
• Open weather.nominal.arff (again)
• Check the filters
• supervised vs unsupervised
• attribute vs instance
• Choose the unsupervised attribute filter Remove, then click on the Remove text box
• Click the “More” button in the pop-up window; look at the options
• Set attributeIndices to 3 and click OK (so the 3rd attribute is removed from the analysis)
• Apply the filter
• Recall that you can Save the result
• Press Undo
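
What the Remove filter does can be pictured outside the GUI with plain Python: it drops one column from every instance. This is not Weka's API, only a sketch of the effect. The five rows are meant to mirror the first instances of weather.nominal.arff; check them against your own copy.

```python
attributes = ["outlook", "temperature", "humidity", "windy", "play"]
instances = [
    ["sunny", "hot", "high", "FALSE", "no"],
    ["sunny", "hot", "high", "TRUE", "no"],
    ["overcast", "hot", "high", "FALSE", "yes"],
    ["rainy", "mild", "high", "FALSE", "yes"],
    ["rainy", "cool", "normal", "FALSE", "yes"],
]

def remove_attribute(names, rows, index):
    """Drop the attribute at 1-based `index` (the slide's attributeIndices = 3)."""
    i = index - 1
    new_names = names[:i] + names[i + 1:]
    new_rows = [row[:i] + row[i + 1:] for row in rows]
    return new_names, new_rows

new_attributes, new_instances = remove_attribute(attributes, instances, 3)
print(new_attributes)
print(new_instances[0])
```

The original lists are left untouched, which corresponds to the Undo button in the Explorer: the filtered data is a new copy, not a change to the file on disk.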



Weka: Use a filter to remove an attribute (value)

• Remove instances where humidity is high
• Supervised or unsupervised?
• Attribute or instance?
• Look at the available instance filters
• Select RemoveWithValues
• Set attributeIndex (humidity = 3)
• Set nominalIndices (the “high” instances need to be removed…)
• Apply & check the result: has the count of high-humidity instances dropped to 0?
• Undo (!)
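
The effect of RemoveWithValues can be sketched in plain Python as a filter on rows rather than on columns. Again this is not Weka's API. The 14 rows are intended to mirror weather.nominal.arff as shipped with Weka; verify them against your own copy.

```python
# Columns: outlook, temperature, humidity, windy, play
weather = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]

def remove_with_value(rows, attribute_index, value):
    """Drop every instance whose attribute (1-based index) equals `value`."""
    return [row for row in rows if row[attribute_index - 1] != value]

kept = remove_with_value(weather, 3, "high")  # humidity is the 3rd attribute
high_left = sum(1 for row in kept if row[2] == "high")
print(len(kept), "instances kept;", high_left, "with high humidity")
```

The number of attributes stays the same; only the number of instances shrinks, which is the difference between an instance filter and an attribute filter.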



Understanding .ARFF Files
• ARFF is an acronym that stands for Attribute-Relation File Format.
• Weka uses a specific, computer-science-centric vocabulary when describing data:
• Instance: A row of data is called an instance, as in an instance or observation from the problem
domain.
• Attribute: A column of data is called a feature or attribute, as in feature of the observation.
• Each attribute can have a different type, for example:
• Real for numeric values like 1.2.
• Integer for numeric values without a fractional part like 5.
• Nominal for categorical data like “dog” and “cat”.
• String for lists of words, like this sentence.

• For classification problems, the output variable must be nominal. For regression problems, the output
variable must be numeric (real or integer).



Understanding .ARFF Files
• For example, iris.arff
• Directives start with the “at” symbol (@). There is one for the name of the dataset (e.g.
@RELATION iris), one to define the name and datatype of each attribute (e.g.
@ATTRIBUTE sepallength REAL), and one to indicate the start of the raw data
(e.g. @DATA).
• Lines in an ARFF file that start with a percentage symbol (%) indicate a comment.
• The CSV format is also recognized by Weka and easily exported from MS Excel, so once you
can get your data into Excel, you can easily convert it to the CSV format. (Note: a CSV file
created in Excel needs commas to separate values, not semicolons! The first line in the .csv
file contains the names of the attributes.)
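
The structure of an ARFF file can be made concrete with a very small reader written in Python. It is a minimal sketch for illustration: it handles only the directives named above and simple comma-separated data lines, not quoted names, sparse data or other features of the format. The embedded file is a shortened example in the style of iris.arff, and the function name `parse_arff` is a choice made for this sketch.

```python
ARFF_TEXT = """\
% A shortened, illustrative file in the style of iris.arff
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

def parse_arff(text):
    """Return (relation name, [(attribute, datatype)], data rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        if in_data:                            # everything after @DATA is data
            rows.append(line.split(","))
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.upper()
        if keyword == "@RELATION":
            relation = rest.strip()
        elif keyword == "@ATTRIBUTE":
            name, _, datatype = rest.strip().partition(" ")
            attributes.append((name, datatype.strip()))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, rows

relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation)
for name, datatype in attributes:
    print(name, ":", datatype)
print(len(rows), "instances")
```

Note how the nominal class attribute lists its allowed values between braces, whereas the four measurements are simply declared as REAL.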
