LectureSlide 1
Information
Information is interpreted (processed) data so that it has meaning for the user.
"The price of petrol has risen from Rs. 43 to Rs. 48 per liter" is information for a person who tracks petrol prices.
Data becomes information when it is processed for some purpose and adds value for the recipient.
A set of raw sales figures: Data
Sales report (chart plotting, trend analysis): Information

Knowledge
Knowledge is a fluid mix of information, experience and insight that may benefit the individual or the organization.
"When petrol prices go up by Rs. 5 per liter, it is likely that bus fare will rise by 10%" is knowledge.
The boundaries between data, information, and knowledge are fuzzy.
What is data to one person is information to someone else.
7/22/2010
Data Mining
Main Objectives
Identification of data as a source of useful information
Use of discovered information for competitive advantage when working in a business environment

Why Data Mining?
Data explosion problem
The explosive growth of data: from terabytes to petabytes
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
medium    10^-9 seconds    10^-6 seconds    6x10^-6 seconds    .001 seconds    1 second
large     10^-8 seconds    10^-4 seconds    8x10^-4 seconds    1 second        2.8 hours
huge      10^-7 seconds    .01 seconds      .1 seconds         16.7 minutes    3.2 years

medium    Personal Computer    Personal Computer    Personal Computer    Super Computer       Teraflop Computer
large     Personal Computer    Workstation          Super Computer       Teraflop Computer    ---
huge      Personal Computer    Super Computer       Teraflop Computer    ---                  ---
Evolution of Database Technology (cont.)
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s-2000s:
Data mining and data warehousing, multimedia databases, and Web databases

Short History of Data Mining
1989 - the term KDD (Knowledge Discovery in Databases) appears at an IJCAI workshop
1991 - a collection of research papers edited by Piatetsky-Shapiro and Frawley
1993 - association rule mining is proposed by Agrawal, Imielinski and Swami; the APRIORI algorithm follows in 1994 (Agrawal and Srikant)
1996 - present: KDD evolves as a conjunction of different knowledge areas (databases, machine learning, statistics, artificial intelligence) and the term Data Mining becomes popular
[Figure: KDD process diagram; surviving labels: Data, SELECTION, Target data, Processed Data]
Interpretation: discovered patterns are presented in a proper format, and the user decides whether it is necessary to re-run the algorithms.
Characterization
The process whose aim is to find rules that describe properties of a concept. The rules take the form
If concept then characteristics
C=1 -> A=1 & B=3    25% (support: the rule is true for 25% of the records)
C=1 -> A=1 & B=4    17%
C=1 -> A=0 & B=2    16%

Discrimination
The process whose aim is to find rules that allow us to discriminate the objects (records) belonging to a given concept (one class) from the rest of the records (the other classes). The rules take the form
If characteristics then concept
A=0 & B=1 -> C=1    33%  83% (support, confidence: the conditional probability of the concept given the characteristics)
A=2 & B=0 -> C=1    27%  80%
A=1 & B=1 -> C=1    12%  76%
A discriminant rule can be good even if it has low support, provided its confidence is high.
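
The support and confidence figures attached to the rules above can be computed directly from a table of records. The sketch below is a minimal illustration, not the method used to produce the slide's numbers: it assumes support means the fraction of all records for which both sides of the rule hold, and confidence means the fraction of records matching the characteristics that also belong to the concept. The records and the helper names (support, confidence, has_a0_b1, is_c1) are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, int]
Predicate = Callable[[Record], bool]


def support(records: List[Record], antecedent: Predicate, consequent: Predicate) -> float:
    """Fraction of all records for which both sides of the rule hold."""
    both = sum(1 for r in records if antecedent(r) and consequent(r))
    return both / len(records)


def confidence(records: List[Record], antecedent: Predicate, consequent: Predicate) -> float:
    """Conditional probability of the consequent given the antecedent."""
    matched = [r for r in records if antecedent(r)]
    if not matched:
        return 0.0  # rule never fires; treat confidence as 0 by convention
    return sum(1 for r in matched if consequent(r)) / len(matched)


# Hypothetical toy data set (not taken from the slides).
records: List[Record] = [
    {"A": 0, "B": 1, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 0, "B": 1, "C": 0},
    {"A": 1, "B": 3, "C": 1},
    {"A": 2, "B": 0, "C": 0},
]


def has_a0_b1(r: Record) -> bool:
    """Characteristics: A=0 & B=1."""
    return r["A"] == 0 and r["B"] == 1


def is_c1(r: Record) -> bool:
    """Concept: C=1."""
    return r["C"] == 1


# Discriminant rule: A=0 & B=1 -> C=1
rule_support = support(records, has_a0_b1, is_c1)
rule_confidence = confidence(records, has_a0_b1, is_c1)
print(f"support={rule_support:.0%} confidence={rule_confidence:.0%}")
```

On this toy data the rule holds for 2 of 5 records (support) and for 2 of the 3 records that match A=0 & B=1 (confidence), which shows how a rule can discriminate well while covering only part of the data.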
Major Issues in Data Mining (2)
Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Performance and scalability
  Efficiency and scalability of data mining algorithms
  Parallel, distributed and incremental

Major Issues in Data Mining (3)
Issues relating to the diversity of data types
  Handling relational and complex types of data
  Mining information from heterogeneous databases and global information systems (WWW)
Issues related to applications and social impacts
  Application of discovered knowledge
    Domain-specific data mining tools
    Intelligent query answering
    Process control and decision making
Approaches (I)
Approaches (II)
Artificial Intelligence:
Classification trees (ID3, C4.5..)
Clustering
Neural Networks
Genetic algorithms
Visualization techniques
...