Business Intelligence DM1

Uploaded by

Rihab

Business Intelligence

Part 2:
Introduction to Data Mining

Using Weka

Computer to install Weka (Mac / Windows / Linux)

Module 2: Architectural Perspective

Stephan Poelmans– Data Mining

Background literature & Sources

WEKA !

What’s Weka?
• A bird found only in New Zealand?
• A data mining workbench: the Waikato Environment for Knowledge Analysis
• Machine learning algorithms for data mining tasks:
• 100+ algorithms for classification
• 75 for data preprocessing
• 25 to assist with feature selection
• 20 for clustering, finding association rules, etc.

No programming experience required

High School Mathematics
Practice !

Online courses

• There are 3 MOOCs (massive open online courses) available. This course draws
heavily on them.

• https://www.cs.waikato.ac.nz/ml/weka/courses.html

• Course material of the MOOCs (slides, videos):

• https://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

What is data mining?
• Data mining comprises various quantitative methods and techniques for finding
relations and patterns in huge amounts of data (such as a data warehouse),
preferably hidden relations and relations that contribute to better management
decisions.

• Definition: “… the non-trivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data” (source: Fayyad,
1996)

• Data mining uses statistics and artificial intelligence.

• Data mining is part of what we call Knowledge Discovery in Databases
(KDD). KDD describes the overall process in which data mining is one step.
The terms data mining and KDD are often used interchangeably.

Data Mining and Knowledge Discovery

Data mining in the “Knowledge Discovery in Databases (KDD)” process:

• Data collection/Source data: for example, data moved from operational
databases to a data warehouse.
• Selection: selecting the data you need to perform data mining (for
example, tables from a data warehouse)
• Preprocessing: checking the data and cleaning it (see next slide)
• Transformation: converting the data into a usable form (for example,
a format you can import into a data mining tool)
• Data mining: the application of data mining techniques
• Interpretation/Evaluation: interpretation of the results (which
relations/patterns/forecasts have we found?)
• We then get the “knowledge” that can be used for making better
decisions

What is data mining? KDD Preprocessing
• Preprocessing means examining the data you will use for data mining from different perspectives and
cleaning it. This phase is very important and is far too often rushed. Here are some
important activities:

• Which attributes are relevant?

For example, in “credit scoring” (evaluating the solvency of clients): what is a “bad” client? (For
example, the Basel II guidelines: 90 days past due.)

• Are there “missing values”?

For example, we have a client database (different tables), but some critical information is missing for
some of the clients (such as the birth date).

• Is there “noise” in the data? (Errors such as “age = -20”, or “gender = m and name = Marianne”,
etc.)

• Are there “outliers” (extreme or exceptional values) that can influence our results? If this is the
case, you can ultimately drop them.

• What do the available attributes look like? What type of data do we have? Is the field textual or numeric?
If numeric, is it a continuous variable or a discrete (categorical) one? (For example, “gender”
is a discrete variable (m or f); “age” is a continuous variable but can be made discrete, for example
by subdividing it into categories such as “0-17”, “18-30”, “> 60”, …)
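The first checks in this list (missing values and noise) can be sketched in a few lines of Python. This is only an illustration and not part of Weka: the client records, the field names and the plausible age range of 0 to 120 are assumptions made for the example.

```python
# Hypothetical client records; None marks a missing value.
clients = [
    {"id": 1, "gender": "f", "birth_date": "1990-05-01", "age": 34},
    {"id": 2, "gender": "m", "birth_date": None, "age": 41},          # missing birth date
    {"id": 3, "gender": "f", "birth_date": "1985-02-11", "age": -20}, # noise: impossible age
]


def missing_fields(record):
    """Return the names of the fields that have no value."""
    return [key for key, value in record.items() if value is None]


def is_noisy_age(age, low=0, high=120):
    """Flag ages outside an assumed plausible range."""
    return age is not None and not (low <= age <= high)


ids_with_missing = [c["id"] for c in clients if missing_fields(c)]
ids_with_noise = [c["id"] for c in clients if is_noisy_age(c["age"])]

print("missing values in clients:", ids_with_missing)
print("noisy ages in clients:", ids_with_noise)
```

Whether a flagged record is corrected, completed or dropped remains a judgment call that depends on the business question.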

Applications of DM (based on “Data Mining with Weka”, Futurelearn.com)
• Can data mining be applied to …
• customer relationship management?
• “To establish a productive relationship with a customer, businesses collect data and analyse it. This can help in acquiring and retaining
customers, developing customer loyalty, and implementing customer focused strategies.”
• supermarket basket analysis?
• “If you buy certain items you are more likely to buy other items. Understanding purchasing behavior helps the retailer understand the buyer’s
needs and change the store layout accordingly.”
• financial data analysis?
• “Data mining can contribute to solving business problems in banking and finance by finding patterns, causalities, and correlations in business
information and market prices that are not immediately apparent to managers.”
• E.g. financial data from companies that went bankrupt (in order to identify and warn companies in difficulty)
• education?
• “Mining data generated by online courses can help predict students’ future learning success, assess the effect of educational support, and
advance scientific knowledge about learning. Perhaps individual learning patterns can be discovered and used to personalize the presentation
of material.”
• Learning Analytics = using data from online tools like Blackboard to detect patterns and predict success
• healthcare?
• “Data mining can help identify best practices that improve care and reduce costs. It can be used to predict the volume of patients in different
categories, which helps hospitals develop processes to ensure that patients receive appropriate care at the right place and at the right time.”
• criminal investigation?
• “Crime analysis involves exploring and detecting crimes and their relationships with criminals. Crime datasets are large, comprehensive and
complex. Textual reports (interrogation transcripts, for instance) can be analyzed using text mining.”
• making babies?
• “Yes! Selecting appropriate embryos for artificial insemination, amongst other possibilities that we’ll leave to your imagination.”
• … and many other fields.

Typical Business Requirements
• Typical “business requirements” where data mining can offer an answer :
• What is the credit risk of a client?

• Can we divide clients into different groups (clusters)?

• Which products do clients often buy together?

• How many products can an enterprise (hope to) sell next year?

• Is a company’s income tax return in line with those of similar companies (to detect tax
evasion)?

• What is the probability of success of a student given all the information we have about that
student and given a group of students from the past (social background, secondary education,
exam results, …)?

• Some questions can also, to some extent, be answered with “conventional” queries,
cross-tabulations, etc.

Example of data mining: a Decision Tree
Preprocessed Data: (data from the past)

Patient ID   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
1            yes           yes     yes              yes          yes        strep throat
2            no            no      no               yes          yes        allergy
3            yes           yes     no               yes          no         cold
4            yes           no      yes              no           no         strep throat
5            no            yes     no               yes          no         cold
6            no            no      no               yes          no         allergy
7            no            no      yes              no           no         strep throat
8            yes           no      no               yes          yes        allergy
9            no            yes     no               yes          yes        cold
10           yes           yes     no               yes          yes        cold

Decision Tree:

Swollen Glands?
• yes: Diagnosis = Strep Throat
• no: Fever?
• yes: Diagnosis = Cold
• no: Diagnosis = Allergy

Source: Roiger, R. J. & M.W. Geatz (2003)

The data is “preprocessed” and converted into “discrete variables”.

Associated “Production rules”:
1. If a patient has swollen glands, the diagnosis is strep throat.
2. If a patient does not have swollen glands and has fever, the diagnosis is a cold.
3. If a patient does not have swollen glands and does not have fever, the diagnosis is an allergy.
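As a quick check, the three production rules can be written as a small Python function and compared with the ten preprocessed patients. This is a hand-coded sketch of the rules, not a tree learned by Weka; the function and variable names are chosen for the example.

```python
# Training data from the slide:
# (sore throat, fever, swollen glands, congestion, headache, diagnosis)
training = [
    ("yes", "yes", "yes", "yes", "yes", "strep throat"),
    ("no", "no", "no", "yes", "yes", "allergy"),
    ("yes", "yes", "no", "yes", "no", "cold"),
    ("yes", "no", "yes", "no", "no", "strep throat"),
    ("no", "yes", "no", "yes", "no", "cold"),
    ("no", "no", "no", "yes", "no", "allergy"),
    ("no", "no", "yes", "no", "no", "strep throat"),
    ("yes", "no", "no", "yes", "yes", "allergy"),
    ("no", "yes", "no", "yes", "yes", "cold"),
    ("yes", "yes", "no", "yes", "yes", "cold"),
]


def diagnose(swollen_glands, fever):
    """Apply the three production rules in order."""
    if swollen_glands == "yes":
        return "strep throat"   # rule 1
    if fever == "yes":
        return "cold"           # rule 2
    return "allergy"            # rule 3


# Count how many past patients the rules classify correctly.
correct = sum(diagnose(row[2], row[1]) == row[5] for row in training)
print(f"{correct} of {len(training)} training patients classified correctly")
```

Note that the rules only use two of the five attributes: sore throat, congestion and headache are not needed to separate the three diagnoses in this dataset.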

Example of data mining: a Decision Tree

Patient ID   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
11           no            no      yes              yes          yes        ?
12           yes           yes     no               no           yes        ?
13           no            no      no               no           yes        ?

Source: Roiger, R. J. & M.W. Geatz (2003)

The decision tree (with the “production rules”) can now be used to make forecasts for future patients
whose diagnosis is still unknown.

Swollen Glands?
• yes: Diagnosis = Strep Throat
• no: Fever?
• yes: Diagnosis = Cold
• no: Diagnosis = Allergy
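Applied to the three new patients, the same rules give a forecast for each unknown diagnosis. The sketch below hand-codes the tree; the dictionary layout is an assumption made for the example.

```python
def diagnose(swollen_glands, fever):
    """Decision tree from the slide, written as nested rules."""
    if swollen_glands == "yes":
        return "strep throat"
    return "cold" if fever == "yes" else "allergy"


# New patients 11-13: only the two attributes used by the tree are needed.
new_patients = {
    11: {"fever": "no", "swollen_glands": "yes"},
    12: {"fever": "yes", "swollen_glands": "no"},
    13: {"fever": "no", "swollen_glands": "no"},
}

forecasts = {
    patient_id: diagnose(p["swollen_glands"], p["fever"])
    for patient_id, p in new_patients.items()
}

for patient_id, diagnosis in forecasts.items():
    print(patient_id, "->", diagnosis)
```

These are forecasts, not certainties: a tree that fits ten past patients perfectly can still be wrong for a new one.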

Example of data mining: Clustering
Credit Card Example
Customer ID   Account Type   Margin Account   Transaction Method   Trades/Month   Sex   Age     Favorite Recreation   Annual Income
1005          joint          no               online               12,5           f     30-39   tennis                40-59K
1013          custodial      no               broker               0,5            f     50-59   skiing                80-99K
1245          joint          no               online               3,6            m     20-29   golf                  20-39K
2110          individual     yes              broker               22,3           m     30-39   fishing               40-59K
1001          individual     yes              online               5              m     40-49   golf                  60-79K

Source: Roiger, R. J. & M.W. Geatz (2003)

Remark:
- a custodial account is an account type where an institution or guardian represents a “protected” individual
- a joint account is an account shared by two or more people

We could ask the question:

“According to which attributes (fields) and attribute values can we group clients into clusters (similar
groups)?”

This is a more open-ended question than a conventional query. It can be answered through data mining, and more
specifically through clustering.

Example of data mining: Clustering
We can use “clustering” on the above-mentioned data. The clustering technique determines clusters
(categories/groups) such that the “distance” between the clusters is as great as possible, whereas the distance
between clients within a cluster is as small as possible.
We could obtain the following “rules”:

If (margin account = yes && Age = 20-29 && Annual Income = 40-59K)
THEN cluster = 1 (accuracy = 0.80; coverage = 0.50)
If (account type = custodial && favorite recreation = skiing && annual income = 80-99K)
THEN cluster = 2 (accuracy = 0.92; coverage = 0.35)
If (account type = joint && Trades/month > 5 && transaction method = online)
THEN cluster = 3 (accuracy = 0.82; coverage = 0.65)

Accuracy and coverage tell us something about the quality of the clustering (validation); see further.
A clustering will never be perfect!
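To see what such rules do in practice, the sketch below applies them to the five customers shown earlier. It is only an illustration: the field names are invented for the example, the decimal commas are written as decimal points, and the income band in rule 2 is read as 80-99K so that it matches the bands used in the table.

```python
# The five customers from the table (only the fields used by the rules).
customers = [
    {"id": 1005, "account": "joint", "margin": "no", "method": "online",
     "trades": 12.5, "age": "30-39", "recreation": "tennis", "income": "40-59K"},
    {"id": 1013, "account": "custodial", "margin": "no", "method": "broker",
     "trades": 0.5, "age": "50-59", "recreation": "skiing", "income": "80-99K"},
    {"id": 1245, "account": "joint", "margin": "no", "method": "online",
     "trades": 3.6, "age": "20-29", "recreation": "golf", "income": "20-39K"},
    {"id": 2110, "account": "individual", "margin": "yes", "method": "broker",
     "trades": 22.3, "age": "30-39", "recreation": "fishing", "income": "40-59K"},
    {"id": 1001, "account": "individual", "margin": "yes", "method": "online",
     "trades": 5.0, "age": "40-49", "recreation": "golf", "income": "60-79K"},
]


def assign_cluster(c):
    """Return the cluster of the first matching rule, or None if no rule fires."""
    if c["margin"] == "yes" and c["age"] == "20-29" and c["income"] == "40-59K":
        return 1
    if (c["account"] == "custodial" and c["recreation"] == "skiing"
            and c["income"] == "80-99K"):
        return 2
    if c["account"] == "joint" and c["trades"] > 5 and c["method"] == "online":
        return 3
    return None


assignments = {c["id"]: assign_cluster(c) for c in customers}
print(assignments)
```

Three of the five customers are not described by any rule, which illustrates why a rule's coverage is below 1 and why the rules summarize the clusters rather than define them exactly.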

Other Basic Concepts:
• Discrete vs. continuous variables
• DM is frequently applied to numerical or Boolean variables
• We can divide variables into the following types:
• Discrete variables: for example gender, nationality, marital status, …
• Continuous variables: for example age, income, temperature, satisfaction level
(on a scale), …
• Note: continuous variables can be converted into discrete variables!
For example, age: “0-18”, “18-30”, “30-50”, “>50”
For example, income: “low: < 1300”, “average: 1300-1800”, etc.
• Be careful: it is not always easy to know whether, strictly speaking, we are dealing with a discrete
or a continuous variable!
For example, the use of a Likert scale:
A Likert scale in a survey can vary from 1 “very unhappy” to 5 “very happy”. Is this a discrete
variable or a continuous one? Statisticians will rather say it is a discrete one (with 5 categories);
in many scientific surveys, this scale is nonetheless treated as a continuous variable (all scores
between 1 and 5 are possible if you calculate an average over all respondents).
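Both points can be tried out in a few lines of Python. The sketch is illustrative only: the slide's age categories share their boundary values, so the code assumes that each upper bound belongs to the lower category (18 falls in “0-18”), and the Likert answers are made up.

```python
from bisect import bisect_left
from statistics import mean

# Upper bounds of the first three age categories; everything above 50 is ">50".
AGE_BOUNDS = [18, 30, 50]
AGE_LABELS = ["0-18", "18-30", "30-50", ">50"]


def age_category(age):
    """Map a continuous age to one of the discrete categories."""
    return AGE_LABELS[bisect_left(AGE_BOUNDS, age)]


ages = [12, 18, 19, 30, 45, 50, 73]
categories = [age_category(a) for a in ages]
print(list(zip(ages, categories)))

# A Likert item only takes the values 1..5, but its average can be any
# value in between, which is why it is often treated as continuous.
likert_answers = [1, 2, 4, 5, 5]
average_score = mean(likert_answers)
print(average_score)
```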

Other Basic Concepts:

• Outliers
• Remove or keep?
• E.g. age = 400 (false observation) vs. income = 10 000 Euro (correct
observation?)
• Missing values:
• How to deal with them? E.g. replace with average value?
• Definition of the target variable = outcome variable (if required, cf. below)
• Credit scoring: What is a bad customer/actor (e.g. 90 days payment arrears
according to the international Basel II guidelines)
• Churn management: What is a churner? (e.g. a customer without any
purchase in the last 4 months)
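The sketch below illustrates two of these steps: replacing a missing value with the average, and deriving target variables from a definition. The records and field names are hypothetical, and the exact cut-offs (90 or more days past due, 4 or more months since the last purchase) are one possible reading of the definitions above.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing income.
customers = [
    {"id": "A", "income": 2000, "days_past_due": 0, "months_since_purchase": 1},
    {"id": "B", "income": None, "days_past_due": 120, "months_since_purchase": 6},
    {"id": "C", "income": 3000, "days_past_due": 30, "months_since_purchase": 4},
    {"id": "D", "income": 2500, "days_past_due": 95, "months_since_purchase": 0},
]

# Missing values: replace with the average of the known values.
known_incomes = [c["income"] for c in customers if c["income"] is not None]
average_income = mean(known_incomes)

for c in customers:
    if c["income"] is None:
        c["income"] = average_income
    # Target variables, derived from the (assumed) business definitions.
    c["bad_customer"] = c["days_past_due"] >= 90
    c["churner"] = c["months_since_purchase"] >= 4

bad_customers = [c["id"] for c in customers if c["bad_customer"]]
churners = [c["id"] for c in customers if c["churner"]]
print(bad_customers, churners)
```

Mean imputation is simple but flattens the distribution of the attribute; whether it is acceptable depends on how many values are missing.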

Other Basic Concepts:
• Training vs. Test Set
➢ Different from “conventional statistics”:
▪ In data mining, we split a sample/dataset into 2 parts.
▪ The “training set” is used as the calculation base (to build the initial model).
Patterns/relations are searched for in the training set.
▪ The patterns/relations/models are then tested with (applied to) a test set to
verify that they predict well and to adjust them (make them
more general).
▪ For example, we use 70% of the data for the training set and 30% for the
test set.
▪ This division is not problematic because data mining is typically applied to
very large datasets (most often coming from a data warehouse).
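A 70/30 split can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the seed and the dummy dataset are assumptions, and Weka offers its own percentage-split option for the same purpose.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly and cut them into a training and a test set."""
    shuffled = list(rows)                      # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed gives a repeatable split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


data = list(range(100))                        # stand-in for 100 instances
train, test = train_test_split(data)
print(len(train), len(test))
```

Shuffling before cutting matters: if the data is sorted (for example by date or by class), taking the first 70% would give a training set that does not resemble the test set.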

Data Mining Model Construction

Training Data → Data Mining Techniques → Resulting Model (rules)

Training Data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Resulting Model (rules):

IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’

Use the Model - built using the training set - with
the test set and later in new “unseen” data

Model (rules) → applied to the Testing Data and to Unseen Data

Testing Data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4). Tenured?

Data Mining is never perfect.
According to the “rules” Merlisa should be “tenured”, but the test data lists her as not tenured.
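The tenure example can be replayed in Python: the rule is checked on the training data, evaluated on the test data, and then applied to the unseen case. The rule is hand-coded from the slide rather than learned, and the function names are chosen for the example.

```python
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def predict_tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


def accuracy(rows):
    """Share of rows for which the rule matches the recorded outcome."""
    hits = sum(predict_tenured(rank, years) == tenured
               for _, rank, years, tenured in rows)
    return hits / len(rows)


train_accuracy = accuracy(training)
test_accuracy = accuracy(testing)
misclassified = [name for name, rank, years, tenured in testing
                 if predict_tenured(rank, years) != tenured]
jeff = predict_tenured("Professor", 4)

print(train_accuracy, test_accuracy, misclassified, jeff)
```

The rule fits the training data perfectly but misses one of the four test cases, which is exactly why a model is judged on data it has not seen during training.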

Data Mining Strategies

• We are looking for relations between data (variables such as fields
from a table)

• Three data mining “strategies”: Supervised Learning vs. Unsupervised
Learning vs. Market Basket Analysis

Data Mining Strategies:
• Unsupervised Learning
• Supervised Learning
• Classification
• Estimation
• Market Basket Analysis

Source: adapted from Roiger, R. J. & M.W. Geatz (2003), Data Mining: A tutorial-based primer, Addison-Wesley, 350 p.

Data Mining Strategies
• Supervised learning = predictive methods:
• We are looking for an explanatory or a predictive model
• There is an “output” variable (or target variable)
• For example:
• What are the differences between clients that do not come back after one purchase and clients
who come back at least a second time?
Output variable: on the basis of a series of criteria, we determine whether a person will not return
(0) or will return (1)
• What are the differences between owners and non-owners of a house?
• How can we explain the differences in client satisfaction? What makes clients satisfied or not?
• How can we explain the fact that some projects are successful and others are not?
• Within supervised learning, we further distinguish between
“classification” (not to be confused with clustering) and “estimation”
• Classification: the output variable is always discrete (for example Decision Trees,
Logistic Regression, …)
• Estimation: the output variable is continuous (for example a linear regression)

Data Mining Strategies

• Unsupervised learning = descriptive methods:

• We are not really looking for an explanatory or a predictive model
• We do not have an “output” or target variable
• For example: divide customers into similar groups (clustering)
• Market Basket Analysis:
• The aim is to find relationships / connections between retail products.
• Typically: which products do customers buy together?
• “Association rules” is the technique par excellence (see further)
• Supervised and unsupervised learning could be described respectively as “purposeful” and
“exploratory”.
• There are several data mining techniques. Some techniques are supervised, some are unsupervised,
and some belong to “market basket analysis”. Other techniques can be used in both a supervised and an
unsupervised way.
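The market basket question (“which products do customers buy together?”) can be made concrete by counting how often each pair of products appears in the same basket. The baskets below are invented, and simple pair counting is only a first step towards the association rules discussed later.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (one set of products per customer visit).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

# Count every unordered pair of products that occurs in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, times_bought_together = pair_counts.most_common(1)[0]
print(most_common_pair, times_bought_together)
```

There is no output variable here: the counts describe the data rather than predict anything, which is what makes this an exploratory technique.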

Data Mining vs. Expert Systems

Data mining:
Data → Data Mining Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat

Expert system:
Human Expert → Knowledge Engineer → Expert System Building Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat

• Expert System: a computer program that has the problem-solving
capabilities of one or more human experts (= Artificial Intelligence (AI)).
• Knowledge Engineer: a person who is trained to work with an expert
and to capture his or her knowledge.

Figure: Data mining vs. expert systems

Installing Weka
• Go to https://www.cs.waikato.ac.nz/ml/weka/
• Click the Download button and choose:
• the latest version, currently Weka-3-8-3
• the appropriate link for your computer: Windows, Mac, or Linux
• Weka is developed in Java, so you’ll need a JVM (Java Virtual Machine)
• if you don’t know what Java is, you probably want the file that includes a Java VM
• Once it’s downloaded, open it to get a standard setup “wizard”.
• Just keep clicking “Next”! Install it in the default place, and remember the name of that place!
• After installation, uncheck the box that says “Start Weka” and click “Finish”.
• Then go to where Weka was installed and
• make a copy of the data folder (within the Weka folder) and put it in a convenient place for future
use
• rename it “Weka datasets”!
• Having installed Weka, note the different interfaces in Weka: the Explorer; the Experimenter; the
KnowledgeFlow interface; and the Command-line interface. We will be using the Explorer and the
Experimenter.
• In the Explorer there are several panels: Preprocess; Classify; Cluster (clustering); Associate (association rules);
Select attributes (attribute selection); and Visualize.

Exploring Weka

Weather.nominal.arff

Weka: Visualizing your Data
• Using the Visualize panel

• Open iris.arff
• Bring up the Visualize panel

• Click one of the plots; examine some instances

• Set x axis to petalwidth and y axis to petallength

• Click on Class colour (bottom) to change the colour

• Bars on the right correspond to attributes: click for x axis; right‐click for y axis

• Jitter slider (“Jitter is a random displacement applied to X and Y values to separate points
that lie on top of one another. Without jitter, 1000 instances at the same data point would
look just the same as 1 instance.”)

• Show Select Instance: Rectangle option

• Submit, Reset, Clear and Save

Weka: Use a filter to remove an attribute
• Open weather.nominal.arff (again)
• Check the filters
• supervised vs unsupervised
• attribute vs instance
• Choose the unsupervised attribute filter Remove, then click on the Remove text box
• Click the “More” button in the pop-up window and look at the options
• Set attributeIndices to 3 and click OK (so the 3rd attribute, humidity, is removed from the analysis)
• Apply the filter
• Recall that you can Save the result
• Press Undo
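What the Remove filter does to the data can be mimicked in plain Python. This is an analogy, not Weka code: the function name is invented, and the three rows are a small illustrative subset in the shape of the weather data. As in Weka, the attribute index is 1-based.

```python
header = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [
    ["sunny", "hot", "high", "FALSE", "no"],
    ["overcast", "hot", "high", "FALSE", "yes"],
    ["rainy", "cool", "normal", "FALSE", "yes"],
]


def remove_attribute(header, rows, attribute_index):
    """Drop one column; attribute_index is 1-based, like attributeIndices in Weka."""
    i = attribute_index - 1
    new_header = header[:i] + header[i + 1:]
    new_rows = [row[:i] + row[i + 1:] for row in rows]
    return new_header, new_rows


new_header, new_rows = remove_attribute(header, rows, 3)
print(new_header)
print(new_rows[0])
```

The function returns new lists and leaves the originals untouched, which plays the same role as the Undo button in the Explorer.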

Weka: Use a filter to remove an attribute (value)

• Remove instances where humidity is high

• Supervised or unsupervised?
• Attribute or instance?
• Look at them
• Select RemoveWithValues
• Set attributeIndex (humidity = 3)
• Set nominalIndices (“high” instances need to be removed…)
• Apply and check the result: has the count of “high” humidity instances dropped to 0?
• Undo (!)
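The effect of RemoveWithValues, an instance filter, can be mimicked in the same way. Again this is a plain-Python analogy with an invented function name and a small illustrative subset of rows in the shape of the weather data.

```python
header = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [
    ["sunny", "hot", "high", "FALSE", "no"],
    ["overcast", "hot", "high", "FALSE", "yes"],
    ["rainy", "cool", "normal", "FALSE", "yes"],
    ["sunny", "cool", "normal", "FALSE", "yes"],
]


def remove_with_value(header, rows, attribute, value):
    """Keep only the instances whose attribute does NOT have the given value."""
    i = header.index(attribute)
    return [row for row in rows if row[i] != value]


filtered = remove_with_value(header, rows, "humidity", "high")
high_left = sum(row[2] == "high" for row in filtered)
print(len(filtered), "instances left;", high_left, "with high humidity")
```

Unlike the attribute filter, this removes rows rather than a column: the humidity attribute is still there, but no remaining instance has the value “high”.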

Understanding .ARFF Files
• ARFF is an acronym that stands for Attribute-Relation File Format.
• Weka uses a specific, computer-science-centric vocabulary when describing data:
• Instance: A row of data is called an instance, as in an instance or observation from the problem
domain.
• Attribute: A column of data is called a feature or attribute, as in feature of the observation.
• Each attribute can have a different type, for example:
• Real for numeric values like 1.2.
• Integer for numeric values without a fractional part like 5.
• Nominal for categorical data like “dog” and “cat”.
• String for lists of words, like this sentence.

• For classification problems, the output variable must be nominal. For regression problems, the output
variable must be numeric (real or integer).

Understanding .ARFF Files
• For example, iris.arff
• Directives start with the “at” symbol (@): one gives the name of the dataset (e.g.
@RELATION iris), one defines the name and datatype of each attribute (e.g.
@ATTRIBUTE sepallength REAL), and one indicates the start of the raw data
(@DATA).
• Lines in an ARFF file that start with a percentage symbol (%) are comments.
• The CSV format is also recognized by Weka and easily exported from MS Excel, so once
your data is in Excel, you can easily convert it to the CSV format. (Note: a CSV file
created in Excel needs commas to separate values, not semicolons! The first line in the .csv
file contains the names of the attributes.)
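To make the structure of an ARFF file concrete, the sketch below parses a tiny file in the style of iris.arff. It is a deliberately minimal reader written for illustration (it ignores quoted names, sparse data and other features of the format), and the two data rows and the reduced attribute list are illustrative rather than the real file.

```python
ARFF_TEXT = """\
% A tiny file in the style of iris.arff (illustrative values only)
@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor}

@DATA
5.1,3.5,Iris-setosa
7.0,3.2,Iris-versicolor
"""


def parse_arff(text):
    """Return (relation name, [(attribute, datatype)], [instances])."""
    relation, attributes, data = None, [], []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        if in_data:                            # after @DATA every line is an instance
            data.append(line.split(","))
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.upper()              # directives are case-insensitive
        if keyword == "@RELATION":
            relation = rest.strip()
        elif keyword == "@ATTRIBUTE":
            name, _, datatype = rest.strip().partition(" ")
            attributes.append((name, datatype.strip()))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, data


relation, attributes, data = parse_arff(ARFF_TEXT)
print(relation)
print(attributes)
print(data)
```

The nominal attribute lists its allowed values between braces, which is how an ARFF file distinguishes a categorical class from a numeric one.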
