
UNESCO Courses: Module On Knowledge Discovery and Data Mining

This document provides an overview and outline of a course on knowledge discovery and data mining (KDD). The course objectives are to provide fundamental KDD techniques, practical applications of KDD, and case studies. The prerequisites are basic computer, database, statistics and programming skills. The course consists of 7 lectures covering topics like data preparation, decision trees, association rules, cluster detection, and evaluating discovered knowledge. The presentation summarizes the content and organization of the course lectures.

Uploaded by

heocon232
Copyright
© Attribution Non-Commercial (BY-NC)

UNESCO courses: Module on Knowledge Discovery and Data Mining

Prof. Ho Tu Bao Prof. Bach Hung Khang

Institute of Information Technology


Japan Advanced Institute of Science and Technology

Outline of the presentation


- Objectives, Prerequisite and Content
- Brief Introduction to Lectures
- Discussion and Conclusion

This presentation summarizes the content and organization of the lectures in the module Knowledge Discovery and Data Mining.

Objectives
This course provides:

- fundamental techniques of knowledge discovery and data mining (KDD)
- issues in KDD practical use and tools
- case-studies of KDD application

Prerequisite for the course


Nothing special, but the following are expected:

- experience of computer use
- basics of databases and statistics
- programming skill for advanced levels

Content of the course


- Lecture 1: Overview of KDD
- Lecture 2: Preparing data
- Lecture 3: Decision tree induction
- Lecture 4: Mining association rules
- Lecture 5: Automatic cluster detection
- Lecture 6: Artificial neural networks
- Lecture 7: Evaluation of discovered knowledge

Outline of the presentation


- Objectives, Prerequisite and Content
- Brief Introduction to Lectures
- Discussion and Conclusion

This presentation summarizes the content and organization of the lectures in the module Knowledge Discovery and Data Mining.

Brief introduction to lectures


- Lecture 1: Overview of KDD
- Lecture 2: Preparing data
- Lecture 3: Decision tree induction
- Lecture 4: Mining association rules
- Lecture 5: Automatic cluster detection
- Lecture 6: Artificial neural networks
- Lecture 7: Evaluation of discovered knowledge

Lecture 1: Overview of KDD


1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

KDD: A Definition
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.

10^6 to 10^12 bytes: we never see the whole data set or put it in the memory of computers

Data mining algorithms?

What knowledge? How to represent and use it?

Data, Information, Knowledge


We often see data as a string of bits, or numbers and symbols, or objects which we collect daily.

Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data.

Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our mental pictures. Knowledge can be considered data at a high level of abstraction and generalization.

From Data to Knowledge


Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, ,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS
...

Annotations on the data: numerical attribute, categorical attribute, missing values, class labels

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87,5%] [confidence, predictive accuracy]
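
A rule of this form can be checked mechanically against records. The sketch below is illustrative only: the attribute names come from the rule above, but the record values are hypothetical, and "confidence" is assumed here to mean the fraction of records covered by the rule that carry the predicted class.

```python
# Hypothetical records: attribute names follow the rule on the slide,
# the values and labels are made up for illustration.
records = [
    {"cell_poly": 150, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "label": "VIRUS"},
    {"cell_poly": 200, "Risk": "n", "Loc_dat": "+", "Nausea": 30, "label": "VIRUS"},
    {"cell_poly": 100, "Risk": "n", "Loc_dat": "+", "Nausea": 16, "label": "BACTERIA"},
    {"cell_poly": 900, "Risk": "n", "Loc_dat": "-", "Nausea": 0, "label": "BACTERIA"},
]


def rule_fires(record):
    """Return True when every condition of the IF part holds for the record."""
    return (
        record["cell_poly"] <= 220
        and record["Risk"] == "n"
        and record["Loc_dat"] == "+"
        and record["Nausea"] > 15
    )


def rule_confidence(records, predicted="VIRUS"):
    """Fraction of covered records whose label equals the predicted class.

    This definition of confidence is an assumption, not taken from the slide.
    Returns None when the rule covers no record.
    """
    covered = [r for r in records if rule_fires(r)]
    if not covered:
        return None
    correct = sum(1 for r in covered if r["label"] == predicted)
    return correct / len(covered)


confidence = rule_confidence(records)
```

On these made-up records the rule covers the first three and is right for two of them.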

Data Rich Knowledge Poor


How to acquire knowledge for knowledge-based systems remains the main difficult and crucial problem.

[Figure: "?", knowledge base, inference engine]

- Tradition: via knowledge engineers
- New trend: via automatic programs

People have gathered and stored so much data because they think some valuable assets are implicitly coded within it. Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support.

Manual data analysis is impractical.

Benefits of Knowledge Discovery


[Figure: axes Value and Volume; labels EDP, MIS, DSS, Generate, Disseminate, Rapid Response]

- EDP: Electronic Data Processing
- MIS: Management Information Systems
- DSS: Decision Support Systems

Lecture 1: Overview of KDD


1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

The KDD process


"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad, Piatetsky-Shapiro, Smyth, 1996)

- non-trivial process: Multiple process
- valid: Justified patterns/models
- novel: Previously unknown
- useful: Can be used
- understandable: by human and machine

The Knowledge Discovery Process


[Figure: the five steps of the knowledge discovery process]

1. Understand the domain and Define problems
2. Collect and Preprocess Data
3. Data Mining: Extract Patterns/Models
4. Interpret and Evaluate discovered knowledge
5. Putting the results in practical use

Data Mining: a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.

KDD is inherently interactive and iterative.

The KDD Process

[Figure: flow diagram of the KDD process in five numbered stages]

1. Data organized by function; Create/select target database; Data warehousing; Select sampling technique and sample data; Supply missing values; Eliminate noisy data
2. Find important attributes & value ranges; Normalize values; Transform values; Create derived attributes
3. Select DM task(s); Select DM method(s); Extract knowledge; Test knowledge
4. Refine knowledge; Transform to different representation
5. Query & report generation; Aggregation & sequences; Advanced methods
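
Two of the preparation operations named in this process, "Supply missing values" and "Normalize values", can be sketched as follows. This is a minimal illustration under stated assumptions: mean imputation and min-max scaling are only one possible choice for each operation, the function names are hypothetical, and the sample values are made up.

```python
def fill_missing(values):
    """Supply missing values: replace None with the mean of the known values.

    Mean imputation is an assumption; the slides do not name a method.
    """
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Normalize values: rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numerical attribute with one missing entry.
temperature = [37.0, 38.5, None, 38.0]
filled = fill_missing(temperature)
normalized = min_max_normalize(filled)
```

After these two steps every value is present and lies between 0 and 1, which keeps attributes with large ranges from dominating methods that compare values numerically.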

Main Contributing Areas of KDD


[Figure: KDD and its three main contributing areas]

- Statistics: Infer info from data (deduction & induction, mainly numeric data)
- Databases: Store, access, search, update data (deduction) [data warehouses: integrated data] [OLAP: On-Line Analytical Processing]
- Machine Learning: Computer algorithms that improve automatically through experience (mainly induction, symbolic data)

Lecture 1: Overview of KDD


1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

Potential Applications
Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.

Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.

Scientific information
- Sky survey cataloging
- Biosequence Databases
- Geosciences: Quakefinder
- etc.

Personal information

KDD: Opportunity and Challenges


[Figure: KDD surrounded by the four factors below]

- Competitive Pressure
- Data Rich, Knowledge Poor (the resource)
- Data Mining Technology Mature
- Enabling Technology (Interactive MIS, OLAP, parallel computing, Web, etc.)

KDD: A New and Fast Growing Area


- KDD workshops: 1989, 1991, 1993, 1994.
- International conferences: KDD95, 96, 97, 98, 99 (USA); PAKDD97, 98, 99 (Asia); PKDD97, 98, 99 (Europe); PAKDD00 (Kyoto, 2000.4.18-20, deadline 99.10.10).
- Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS. 80% of the Fortune 500 companies are currently involved in data mining pilot projects or using data mining systems.
- JAPAN: FGCS Project (logic programming and reasoning, recently more attention on knowledge acquisition and machine learning). Interests in KDD: Special Issue on KDD of JSAI, July 1997.
- "Knowledge Discovery is the most desirable end-product of computing." (Wiederhold, Stanford Univ.)

Lecture 1: Overview of KDD


1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

Primary Tasks of Data Mining


- Classification: finding the description of several predefined classes and classifying a data item into one of them.
- Regression: mapping a data item to a real-valued prediction variable.
- Clustering: identifying a finite set of categories or clusters to describe the data.
- Dependency Modeling: finding a model which describes significant dependencies between variables.
- Deviation and change detection: discovering the most significant changes in the data.
- Summarization: finding a compact description for a subset of data.

Classification
What factors determine cancerous cells?

[Figure: Examples (Cancerous Cell Data) -> Data Mining Algorithm (Classification Algorithm) -> General patterns]

General patterns:
- Rule Induction
- Decision tree
- Neural Network

Classification: Rule Induction


What factors determine whether a cell is cancerous?

If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
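
The two induced rules on this slide can be applied to a new cell as a simple rule list. The sketch below is only an illustration: the dictionary keys, the `classify` function, and the first-match strategy are assumptions, while the conditions and certainty values are the ones given on the slide.

```python
# Each rule: (conditions, predicted class, certainty from the slide).
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]


def classify(cell):
    """Return (label, certainty) of the first rule whose conditions all match.

    Returns None when no rule covers the cell; first-match ordering is an
    assumption made for this sketch.
    """
    for conditions, label, certainty in RULES:
        if all(cell.get(name) == value for name, value in conditions.items()):
            return label, certainty
    return None


dark_cell = {"color": "dark", "tails": 2, "nuclei": 2}
result = classify(dark_cell)
```

A cell that matches neither rule, such as a dark cell with one tail and one nucleus, is left unclassified, which shows that a small rule set need not cover every case.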

Classification: Decision Trees


[Figure: decision tree. The root splits on Color (dark / light); each branch then splits on #nuclei (1 / 2); some branches split further on #tails (1 / 2); the leaves are labeled healthy or cancerous.]
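
Classifying with a decision tree means following one branch per attribute test until a leaf is reached. The sketch below shows that traversal on a hypothetical tree: it tests the same attributes as the slide (color, #nuclei, #tails), but its exact shape and leaf labels are assumptions, since the tree in the figure cannot be recovered from the text.

```python
# Hypothetical tree: inner nodes are dicts, leaves are class labels.
# The shape and the leaf labels are illustrative, not taken from the slide.
TREE = {
    "attribute": "color",
    "branches": {
        "dark": {
            "attribute": "nuclei",
            "branches": {
                1: "healthy",
                2: {
                    "attribute": "tails",
                    "branches": {1: "healthy", 2: "cancerous"},
                },
            },
        },
        "light": {
            "attribute": "nuclei",
            "branches": {1: "cancerous", 2: "healthy"},
        },
    },
}


def predict(node, cell):
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    while isinstance(node, dict):
        value = cell[node["attribute"]]
        node = node["branches"][value]
    return node


label = predict(TREE, {"color": "dark", "nuclei": 2, "tails": 2})
```

Each root-to-leaf path reads as one IF-THEN rule, which is why trees and rule sets are often presented side by side.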

Classification: Neural Networks


What factors determine whether a cell is cancerous?

[Figure: neural network with inputs Color = dark, #nuclei = 1, #tails = 2 and outputs Healthy and Cancerous]
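
The forward pass of a small network on the slide's example input can be sketched as below. Everything numeric here is an assumption: the encoding of Color (dark = 1, light = 0), the layer sizes, and all weights are hypothetical values chosen for illustration, not learned from data, and a single output unit stands in for the two outputs in the figure.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights, NOT learned from data.
W_HIDDEN = [[1.0, -0.5, 0.5], [-1.0, 0.5, 0.25]]  # two hidden units, three inputs
B_HIDDEN = [0.0, 0.0]
W_OUT = [2.0, -1.0]
B_OUT = -0.5


def forward(x):
    """Return the network output in (0, 1), read here as a score for 'cancerous'."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
        for weights, b in zip(W_HIDDEN, B_HIDDEN)
    ]
    return sigmoid(sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT)


# Input from the slide: Color = dark (encoded as 1), #nuclei = 1, #tails = 2.
x = [1.0, 1.0, 2.0]
p = forward(x)
label = "cancerous" if p > 0.5 else "healthy"
```

Unlike the rules and the tree, the network's decision is spread across numeric weights, so the learned pattern is harder to read off directly.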
