SIMS 422: Knowledge Inference Systems & Applications
Slides by H. T. Bao
Outline of the presentation
Objectives
This course provides:
• programming skills
Content of the course
• Overview of KDD
• Mining association rules
• Mining action rules
• Decision tree induction
• Distributed knowledge systems and distributed query answering
• Cluster analysis
Brief introduction to lectures
Overview of KDD
Lecture 1: Overview of KDD
1. What is KDD and why?
3. KDD Applications
KDD: A Definition
KDD is the automatic extraction of non-obvious,
hidden knowledge from large volumes of data.
Data, Information, Knowledge
We often see data as a string of bits, or numbers and
symbols, or “objects” which we collect daily.
From Data to Knowledge
Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS
...
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15
THEN Prediction = VIRUS [87,5%]
[confidence, predictive accuracy]
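To make the step from records to a rule concrete, the sketch below checks an IF-THEN rule of the form shown above against a handful of records and reports how many records it covers and how often its conclusion holds. The records are hypothetical toy values, not Dr. Tsumoto's data, and the field name `Diagnosis` is an assumption introduced for illustration.

```python
def rule_matches(record):
    """Condition part of the rule: all four tests must hold."""
    return (record["cell_poly"] <= 220 and record["Risk"] == "n"
            and record["Loc_dat"] == "+" and record["Nausea"] > 15)

def rule_confidence(records, target="VIRUS"):
    """Return (number of covered records, fraction of those with the target class)."""
    covered = [r for r in records if rule_matches(r)]
    if not covered:
        return 0, None
    correct = sum(1 for r in covered if r["Diagnosis"] == target)
    return len(covered), correct / len(covered)

# Hypothetical toy records (assumed values, for illustration only).
records = [
    {"cell_poly": 100, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "Diagnosis": "VIRUS"},
    {"cell_poly": 200, "Risk": "n", "Loc_dat": "+", "Nausea": 16, "Diagnosis": "VIRUS"},
    {"cell_poly": 150, "Risk": "n", "Loc_dat": "+", "Nausea": 30, "Diagnosis": "BACTERIA"},
    {"cell_poly": 500, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "Diagnosis": "BACTERIA"},
    {"cell_poly": 100, "Risk": "p", "Loc_dat": "-", "Nausea": 0, "Diagnosis": "VIRUS"},
]

covered, confidence = rule_confidence(records)
print(covered, confidence)
```

On these toy records the rule covers three records, two of which carry the predicted class.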
Data Rich Knowledge Poor
Acquiring knowledge for knowledge-based systems remains the main difficult and crucial problem.
People have gathered and stored so much data because they think some valuable assets are implicitly coded within it.
Raw data is rarely of direct benefit.
[Figure labels: knowledge base, inference engine]
Benefits of Knowledge Discovery
[Figure: Value versus Volume for EDP, MIS, and DSS; other labels: Generate, Disseminate, Rapid Response]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems
The KDD process
The non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data - Fayyad, Piatetsky-Shapiro, Smyth (1996)
• non-trivial process: multiple process
• valid: justified patterns/models
• novel: previously unknown
[Figure: the KDD process; recoverable labels]
• 1: Collect and preprocess data
• 2: Data mining: extract patterns/models
• Activities shown: create/select target database; data warehousing; select sampling technique and sample data; select DM task(s); select DM method(s); extract knowledge; test knowledge; refine knowledge
• Databases: store, access, search, update data (deduction)
• Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)
Potential Applications
• Business information
• Manufacturing information
KDD: Opportunity and Challenges
[Figure: factors shown around KDD]
• Competitive pressure
• Data rich, knowledge poor (the resource)
• Data mining technology mature
• Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)
KDD: A New and Fast Growing Area
KDD workshops: since 1989.
International conferences: KDD (USA), first in 1995;
PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997.
ECML’04/PKDD’04 (in Pisa, Italy)
Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classify a data item into one of them.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Regression: maps a data item to a real-valued prediction variable.
• Dependency Modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data
• Summarization: finding a compact description for a subset of data
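Two of these tasks differ mainly in what they output: regression maps an item to a real value, while classification assigns it to one of several predefined classes. The sketch below contrasts the two on one-dimensional toy data. The least-squares line and the nearest-centroid rule are stand-in techniques chosen for brevity, and all data values and class names are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Regression: least-squares fit of y = a*x + b, returning (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def nearest_class(x, centroids):
    """Classification: pick the predefined class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Hypothetical toy data (assumed values, for illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
a, b = fit_line(xs, ys)

centroids = {"low": 1.5, "high": 3.5}
print(a, b, nearest_class(1.0, centroids), nearest_class(4.0, centroids))
```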
Classification
“What factors determine cancerous cells?”
Examples
Classification: Rule Induction
“What factors determine whether a cell is cancerous?”
If Color = light
and Tails = 1
and Nuclei = 2
Then Healthy Cell (certainty = 92%)
If Color = dark
and Tails = 2
and Nuclei = 2
Then Cancerous Cell (certainty = 87%)
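The two induced rules above can be applied mechanically to a new cell. The following is a minimal sketch that stores each rule as a set of attribute tests plus a conclusion and certainty, and returns the first rule that fires. The `RULES` layout and the `classify_cell` helper are assumptions made for illustration; cells matched by no rule are left unclassified.

```python
# Each rule: (conditions, class label, certainty), taken from the two rules above.
RULES = [
    ({"Color": "light", "Tails": 1, "Nuclei": 2}, "Healthy", 0.92),
    ({"Color": "dark", "Tails": 2, "Nuclei": 2}, "Cancerous", 0.87),
]

def classify_cell(cell):
    """Return (label, certainty) of the first rule whose conditions all hold."""
    for conditions, label, certainty in RULES:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None, None  # no rule covers this cell

print(classify_cell({"Color": "dark", "Tails": 2, "Nuclei": 2}))
print(classify_cell({"Color": "light", "Tails": 1, "Nuclei": 2}))
print(classify_cell({"Color": "dark", "Tails": 1, "Nuclei": 1}))
```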
Classification: Decision Trees
[Figure: decision tree with branches #tails=1 and #tails=2 and leaves cancerous and healthy]
Classification: Neural Networks
“What factors determine whether a cell is cancerous?”
[Figure: neural network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy, Cancerous]