Data Mining: Graduate Program, Faculty of Engineering, JTETI - UGM

The document discusses data mining topics including: 1. The lecture plan introduces the general and specific instructional objectives of the data mining course, which are to introduce data mining concepts, develop skills in data mining software, and gain research experience. 2. The topics to be covered are defined, including what is data mining, input/output, algorithms, credibility evaluation, advanced techniques, data transformation, and WEKA implementation. 3. An example lecture on "What is data mining" is outlined, addressing the relationships between data mining and machine learning, providing simple examples, discussing machine learning and statistics, and examining generalization as a search process.
Data Mining

Graduate Program, Faculty of Engineering


JTETI - UGM

Indriana Hidayah
References
1. Witten, Ian H., and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition. Morgan Kaufmann Publishers. 2005.
2. Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann Publishers. 2012.
3. Liu, Bing. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer. 2007.
Lecture plan
RPKPS (Rencana Program Kegiatan Pembelajaran Semester, the semester learning activity program plan)

• General Instructional Objectives
– To introduce students to the basic concepts and techniques of data mining.
– To develop skills in using recent data mining software to solve practical problems.
– To gain experience in independent study and research.
Lecture plan
RPKPS (Rencana Program Kegiatan Pembelajaran Semester, the semester learning activity program plan)

• Specific Instructional Objectives for Each Topic (Subject Matter)

Understand the elements detailed as follows.
– What is data mining
– Input: Concepts, instances, attributes
– Output: Knowledge representation
– Algorithms: The basic method
– Credibility: Evaluating what has been learned
– Advanced Data Mining: Implementation
– Data Transformation
– WEKA Data Mining Implementation.
Today’s topic
• What is data mining:
(1) data mining and machine learning;
(2) simple examples;
(3) machine learning and statistics;
(4) generalization as search.
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
• Alternative names:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Source: Jiawei Han's slide

9/13/2013 Data Mining: Concepts and Techniques 7


Why data mining?
• The motivation:
– Data explosion problem: automated data collection tools and mature database technology lead to big data in databases and other repositories
– Data rich but information poor!

• Solution: Data warehousing and data mining
– Data warehousing and on-line analytical processing (OLAP)
– Data mining: extraction of interesting knowledge (rules, patterns, constraints) from data in large databases


Evolution of Database Technology
• 1960s:
– Data collection, change from primitive file processing to database systems
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s–2000s:
– Data mining and data warehousing, multimedia databases, and Web databases


How about machine learning?
• Data mining is defined as the process of discovering useful patterns, automatically or semi-automatically, in large quantities of data.
• Whereas machine learning is…
– Learning (noun): cognitive process of acquiring skill or knowledge (WordWeb 6.6)
– Thus, machine learning can be thought of as the machine (i.e., the computer) going through a process of acquiring skill or knowledge.
• So…
– What is the relationship between data mining and machine learning?
Potential Applications (1)
• Data mining can be applied across multiple disciplines, involving:
– machine learning,
– statistics,
– databases,
– artificial intelligence, and
– pattern recognition
• Web usage mining
• Text mining
Simple example
Contact lens prescription

• The patterns can be:
– Classification
– Presented in a decision tree
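
A decision tree of this kind can be read as a chain of nested tests. The sketch below is only illustrative: the attribute names and branch order are assumed from the well-known contact lens example in Witten and Frank (reference 1), not recovered from this slide, and the function name `recommend_lens` is hypothetical.

```python
def recommend_lens(tear_rate: str, astigmatism: str, prescription: str) -> str:
    """Walk a small decision tree and return 'none', 'soft', or 'hard'."""
    # Root test: a reduced tear production rate rules out contact lenses.
    if tear_rate == "reduced":
        return "none"
    # Second test: without astigmatism, soft lenses are recommended.
    if astigmatism == "no":
        return "soft"
    # Third test: with astigmatism, the spectacle prescription decides.
    if prescription == "myope":
        return "hard"
    return "none"


print(recommend_lens("normal", "no", "myope"))  # soft
```

Each path from the root to a leaf corresponds to one classification rule, which is why the same pattern can be presented either as a tree or as a rule set.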


More realistic example: vertebral column

pelvic_incidence  pelvic_tilt  lumbar_lordosis_angle  sacral_slope  pelvic_radius  degree_spondylolisthesis  Class attribute
63.0278175        22.55258597  39.60911701            40.47523153   98.67291675    -0.254399986              Hernia
39.05695098       10.06099147  25.01537822            28.99595951   114.4054254    4.564258645               Hernia
68.83202098       22.21848205  50.09219357            46.61353893   105.9851355    -3.530317314              Hernia
69.29700807       24.65287791  44.31123813            44.64413017   101.8684951    11.21152344               Hernia
49.71285934       9.652074879  28.317406              40.06078446   108.1687249    7.918500615               Hernia
40.25019968       13.92190658  25.1249496             26.32829311   130.3278713    2.230651729               Hernia
53.43292815       15.86433612  37.16593387            37.56859203   120.5675233    5.988550702               Hernia
45.36675362       10.75561143  29.03834896            34.61114218   117.2700675    -10.67587083              Hernia
43.79019026       13.5337531   42.69081398            30.25643716   125.0028927    13.28901817               Hernia
36.68635286       5.010884121  41.9487509             31.67546874   84.24141517    0.664437117               Hernia
49.70660953       13.04097405  31.33450009            36.66563548   108.6482654    -7.825985755              Hernia


Data mining process
• As a process, data mining encompasses three main steps:
– Pre-processing → dealing with unsuitable raw data
– Data mining → applying a data mining method
– Post-processing → interpreting mined patterns
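
The three steps can be sketched as three small functions chained together. This is a toy illustration under stated assumptions: the records, the attribute names (`outlook`, `play`), and the choice of a majority-class rule as the "data mining method" are all hypothetical and not taken from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical raw data; one record has a missing value.
raw = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": None, "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "sunny", "play": "yes"},
]


def preprocess(rows):
    """Pre-processing: drop records that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mine(rows, attribute, target):
    """Data mining: find the majority class for each attribute value."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[attribute]][r[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def postprocess(rules, attribute, target):
    """Post-processing: render the mined patterns as readable rules."""
    return [
        f"if {attribute} = {value} then {target} = {cls}"
        for value, cls in sorted(rules.items())
    ]


clean = preprocess(raw)
rules = mine(clean, "outlook", "play")
for line in postprocess(rules, "outlook", "play"):
    print(line)
```

Keeping the steps as separate functions mirrors the slide: each stage can be swapped out (a different cleaning policy, a different mining method) without touching the others.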
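Architecture of a Typical Data Mining System
[Figure: layered architecture, listed from top to bottom]
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Database or data warehouse server
• Data cleaning, data integration, and filtering
• Databases and data warehouse
• Knowledge base (drawn beside the layers above)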
Another example: Directed marketing
(S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology.)

• Problem:
– A vast and increasing number of marketing campaigns
– A globally competitive world
– Mass campaigns are ineffective
• Solution:
– Directed campaigns with a strict and rigorous selection of contacts.
• Focus on targets that presumably will be keener on that specific product/service
• More efficient, with reduced costs and time
• The dataset:
– A Portuguese marketing campaign related to bank deposit subscription.
– The collected dataset covers 17 campaigns that occurred between May 2008 and November 2010, corresponding to a total of 79354 contacts.
– For each contact, the following were recorded:
• a large number of attributes
• the target variable (class attribute)
– There were 6499 successes (8% success rate).
Steps
1. Goal definition
– To predict whether a client will subscribe to the deposit
– Classification task
2. Simple data pre-processing (Data Preparation phase)
– Non-conclusive instances were discarded, leaving a total of 55817 contacts.
– Attribute reduction, leaving 29 attributes and 1 class attribute
– Instances containing missing values were discarded, leaving 45211 instances (5289 of which were successful, an 11.7% success rate).
3. Data mining step (Modeling phase), using naive Bayes (NB), decision trees (DT), and support vector machines (SVM)
– The dataset was randomly divided into training (2/3) and test (1/3) sets
4. Evaluation of the model
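
The figures quoted for this case study can be checked with a few lines of arithmetic, and the 2/3 to 1/3 split can be sketched alongside. The slides do not say how the random split was drawn, so the seed, the helper names, and the shuffle-by-index approach below are assumptions for illustration only.

```python
import random


def success_rate(successes: int, contacts: int) -> float:
    """Return the success rate as a percentage."""
    return 100.0 * successes / contacts


def split_train_test(n_instances: int, train_fraction: float = 2 / 3, seed: int = 0):
    """Shuffle instance indices and cut them into training and test parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    cut = round(n_instances * train_fraction)
    return indices[:cut], indices[cut:]


raw_rate = success_rate(6499, 79354)    # before pre-processing
clean_rate = success_rate(5289, 45211)  # after pre-processing
train_idx, test_idx = split_train_test(45211)

print(f"{raw_rate:.1f}% before, {clean_rate:.1f}% after pre-processing")
print(len(train_idx), "training instances,", len(test_idx), "test instances")
```

The first rate comes out near 8.2%, which the slide rounds to 8%. The second matches the quoted 11.7%. Both confirm that the positive class is a small minority, which matters when the models are evaluated.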
Conclusion
• Call duration is the most relevant feature, meaning that longer calls tend to increase successes.
• In second place comes the month of contact.
• Success is most likely to occur in the last month of each quarter (March, June, September and December).
• Such knowledge can be used to shift campaigns to occur in those months.
Data Mining: On What Kind of Data?

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Functionality
Knowledge produced by data mining
• In data mining terms, knowledge means a useful pattern
• The pattern should be
– Useful
– Valid
– Understandable
• Pattern types that can be produced by data mining methods:
– Frequent patterns, associations, correlations
– Data characterization and discrimination
– Classification and prediction
– Clusters
Frequent pattern, association, correlation

• Patterns that occur frequently in data
– Frequent itemsets
– Frequent subsequences
– Frequent substructures
• These lead to associations and correlations within data
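
A frequent itemset is simply a set of items that appears together in at least a given fraction of the transactions. The sketch below counts itemsets by brute force on a hypothetical shopping-basket dataset. The data, the `min_support` threshold, and the function name are assumptions for illustration, and practical miners use smarter candidate pruning than this.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (each one is a set of purchased items).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            # Sorting gives every itemset one canonical tuple form.
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


result = frequent_itemsets(transactions, min_support=0.6)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

With a threshold of 0.6, every single item is frequent but only one pair, bread and milk, survives. Such co-occurring pairs are the raw material for association rules.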
Classification and prediction
Cluster analysis
Are All the “Discovered” Patterns Interesting?
• A data mining system may generate thousands of patterns; not all of them are interesting.
– Suggested approach: human-centered, query-based, focused mining
• Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
– Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
– Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
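
Support and confidence, the two objective measures named above, can be computed directly from transaction counts. The sketch uses the standard association-rule definitions (support of an itemset is the fraction of transactions containing it; confidence of A → B is support(A ∪ B) / support(A)). The transactions and the rule are hypothetical.

```python
# Hypothetical transactions (each one is a set of purchased items).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


rule_support = support(transactions, {"diapers", "beer"})
rule_confidence = confidence(transactions, {"diapers"}, {"beer"})
print(f"diapers -> beer: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule would typically be reported only if both values exceed user-chosen thresholds. Subjective measures such as novelty cannot be computed this way because they depend on what the user already believes.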


Can We Find All and Only Interesting Patterns?

• Search for only interesting patterns: optimization
– Can a data mining system find only the interesting patterns?
– Approaches:
• First generate all the patterns and then filter out the uninteresting ones.
• Generate only the interesting patterns: mining query optimization

Machine learning and statistics
• Both are in the continuum of data analysis techniques
– Some derive from the skills taught in standard statistics courses,
– others are more closely associated with algorithms that have arisen out of computer science.
Generalization as search
• One way of visualizing the problem of learning—
and one that distinguishes it from statistical
approaches—is to imagine a search through a
space of possible concept descriptions for one
that fits the data.
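
The search view can be made concrete on a tiny problem. In the sketch below a concept description is a tuple with one entry per attribute, either a specific value or the wildcard "?", and learning is an exhaustive search for descriptions that cover every positive example and no negative one. The attributes, the examples, and the description language are hypothetical and chosen only to keep the space small enough to enumerate.

```python
from itertools import product

# Hypothetical labelled examples: (outlook, windy) -> is it a positive example?
examples = [
    (("sunny", "false"), True),
    (("sunny", "true"), True),
    (("rainy", "true"), False),
    (("rainy", "false"), False),
]
domains = [("sunny", "rainy"), ("true", "false")]


def covers(description, instance):
    """A description covers an instance if every non-wildcard entry matches."""
    return all(d == "?" or d == v for d, v in zip(description, instance))


def consistent(description, examples):
    """Consistent means: covers all positives and none of the negatives."""
    return all(covers(description, x) == label for x, label in examples)


def search(domains, examples):
    """Enumerate the whole description space and keep the consistent ones."""
    space = product(*[values + ("?",) for values in domains])
    return [d for d in space if consistent(d, examples)]


print(search(domains, examples))  # [('sunny', '?')]
```

Here the space holds only 3 × 3 = 9 descriptions, so brute force works. With realistic numbers of attributes the space grows far too large to enumerate, which is why practical learners search it heuristically.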
