CH 1
1
What Is Data Mining
• Data mining (knowledge discovery from data)
• Refers to extracting or mining knowledge from large amounts of data
• Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
• What is not Data Mining?
• Simple search and query processing
• (Deductive) expert systems
2
Why Data Mining
3
Motivation
• Data rich but information poor!
• Data explosion problem
• Automated data collection tools and mature database
technology lead to tremendous amounts of data stored
in databases, data warehouses and other information
repositories
• Solution: Data Mining
• Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
4
Evolution of Database
Technology
• 1960s:
• Data collection, database creation, IMS and network DBMS
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems
5
WHY DATA MINING? -
POTENTIAL APPLICATIONS
• Database analysis and decision support
• Market analysis and management
• Target marketing, market basket analysis, ...
• Risk analysis and management
• Forecasting, quality control, competitive analysis, ...
• Fraud detection and management
• Other Applications
• Text mining (newsgroups, email, documents) and Web
analysis
• Spatial data (e.g., map data) mining
• Intelligent query answering
6
MARKET ANALYSIS AND MANAGEMENT
9
RISK ANALYSIS AND MANAGEMENT
• Finance planning and asset evaluation
• cash flow analysis and prediction
• cross-sectional and time series analysis (see slide-12)
• Resource planning
• summarize and compare the resources and spending
• Competition
• monitor competitors and market directions
• group customers into classes and apply a class-based pricing
procedure
• set pricing strategy in a highly competitive market
10
CROSS-SECTIONAL
ANALYSIS
• A type of analysis that an investor, analyst, or
portfolio manager may conduct on a company in
relation to its industry or industry peers.
• The analysis compares one company against the
industry it operates within, or directly against
certain competitors within the same industry, in an
attempt to discover the best business
methods.
• Time series analysis comprises methods for
analyzing time series data in order to extract
meaningful statistics and other characteristics of
the data.
11
FRAUD DETECTION AND MANAGEMENT
• Applications
• Health care, retail, credit card services, telecommunications, etc.
• Approach
• Use historical data to build models of normal and fraudulent
behavior and use data mining to help identify fraudulent
instances
• Examples
• auto insurance: detect groups who stage accidents to collect
insurance
• money laundering: detect suspicious money transactions
• medical insurance: detect professional patients, rings of
doctors, and inappropriate medical treatment
• detecting telephone fraud: Telephone call model: destination of
the call, duration, time of day/week. Analyze patterns that
deviate from an expected norm.
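The telephone-fraud idea above (flag calls that deviate from an expected norm) can be sketched in a few lines of Python. This is only an illustrative sketch: the z-score test, the threshold of 3 standard deviations, the function name flag_unusual_calls, and the call durations are all assumptions, not part of the slide.

```python
from statistics import mean, stdev


def flag_unusual_calls(history_minutes, new_calls_minutes, threshold=3.0):
    """Return the new call durations that deviate strongly from the norm.

    The "norm" is modeled (as an assumption) by the mean and standard
    deviation of a customer's past call durations.
    """
    mu = mean(history_minutes)
    sigma = stdev(history_minutes)
    if sigma == 0:
        # All past calls had the same length: anything different is unusual.
        return [m for m in new_calls_minutes if m != mu]
    flagged = []
    for minutes in new_calls_minutes:
        z_score = abs(minutes - mu) / sigma
        if z_score > threshold:
            flagged.append(minutes)
    return flagged


# Hypothetical call durations in minutes.
history = [3, 5, 4, 6, 5, 4, 3, 5, 6, 4]
new_calls = [5, 4, 95, 6]
suspicious = flag_unusual_calls(history, new_calls)
print(suspicious)
```

A real call model would also use the destination and the time of day/week mentioned above; duration alone is used here to keep the sketch short.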
12
DISCOVERY OF MEDICAL/BIOLOGICAL KNOWLEDGE
15
KDD Steps
17
MULTIDIMENSIONAL
ANALYSIS
• In statistics, econometrics, and related fields,
multidimensional analysis is
• a data analysis process that groups data into two
categories: data dimensions and measurements.
• For example:
• A data set consisting of the number of football
teams in each department over several years is a
two-dimensional data set.
18
DBMS
• A database system, also called a database
management system (DBMS), consists of a collection
of interrelated data, known as a database, and a set
of software programs to manage and access the data.
19
ARCHITECTURE OF A
TYPICAL DATA MINING
SYSTEM
20
Data Mining System
Components
• Database, data warehouse, WWW, or other information
repository
• All of these act as data sources for data mining
• Data cleaning and integration techniques may be
applied.
• Database or data warehouse server
• Responsible for fetching the data requested by the user
• Knowledge base
• Used to guide the search or evaluate the interestingness
of resulting patterns
21
Data Mining System
Components
• Data mining engine
• Consists of a set of modules which are responsible for characterization,
association and correlation analysis, classification, prediction, cluster
analysis, outlier analysis and evolution analysis.
• Pattern evaluation module
• typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns.
• User Interface
• communicates between users and the data mining system,
• allows the user to interact with the system by specifying a data mining
query or task,
• provides information to help focus the search, and
• performs exploratory (=discover more) data mining based on the
intermediate data mining results
22
DATA MINING: ON WHAT
KIND OF DATA
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
• Object-oriented (OO) and object-relational (OR) databases
• Spatial databases (medical, satellite image DBS, GIS)
• Temporal databases
• Text databases
• Multimedia databases (Image, Video, etc)
• Heterogeneous and legacy databases
• WWW
23
RELATIONAL DATABASE
• A relational database is a collection of tables, each of
which is assigned a unique name.
• Each table consists of a set of attributes (columns or
fields) and usually stores a large set of tuples (records
or rows).
• Each tuple in a relational table represents an object
identified by a unique key and described by a set of
attribute values.
• It can be described by an E-R (entity-relationship) data model
that represents the database as a set of entities and their
relationships.
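As a small illustration of the terms above (table, attributes, tuples, unique key), the following sketch uses Python's built-in sqlite3 module with an in-memory database. The table name customer, its columns, and the two rows are hypothetical and chosen only for the example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A table with a unique name and a set of attributes (columns).
# cust_id is the unique key that identifies each tuple (row).
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)

# Each inserted tuple describes one object by its attribute values.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", 34), (2, "Bob", 45)],
)

# Retrieve the tuple identified by key 2.
rows = conn.execute(
    "SELECT name, age FROM customer WHERE cust_id = ?", (2,)
).fetchall()
print(rows)
conn.close()
```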
24
DATA WAREHOUSE
• A repository of information collected from multiple
sources, stored under a unified schema, and usually
residing at a single site.
• Constructed via a process of data cleaning, data
integration, data transformation, data loading, and
periodic data refreshing.
25
DATA CUBE
• Provides a multidimensional view of data and allows the
pre-computation and fast accessing of summarized data.
• Example: the cube has three dimensions:
• address (with city values Chicago, New York, Toronto,
Vancouver),
• time (with quarter values Q1, Q2, Q3, Q4), and
• item (with item type values home entertainment, computer,
phone, security).
• The aggregate value stored in each cell of the cube is
sales amount (in thousands). For example, the total sales for the
first quarter, Q1, for items relating to security systems in
Vancouver is $400,000, as stored in cell (Vancouver, Q1,
security).
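A minimal sketch of how such a cube could be pre-computed is shown below. It assumes the cube is kept as a Python dictionary keyed by (city, quarter, item type). Apart from the (Vancouver, Q1, security) cell totalling 400 (in thousands), the individual sales records are made up for illustration.

```python
from collections import defaultdict

# Hypothetical detailed sales records:
# (city, quarter, item type, sales amount in thousands).
sales_facts = [
    ("Vancouver", "Q1", "security", 250),
    ("Vancouver", "Q1", "security", 150),
    ("Vancouver", "Q2", "security", 300),
    ("Toronto", "Q1", "computer", 500),
    ("Chicago", "Q1", "phone", 80),
]

# Pre-compute one aggregate value per cell of the cube.
cube = defaultdict(int)
for city, quarter, item, amount in sales_facts:
    cube[(city, quarter, item)] += amount


def roll_up_by_city(cube):
    """Summarize the cube along the address dimension only."""
    totals = defaultdict(int)
    for (city, _quarter, _item), amount in cube.items():
        totals[city] += amount
    return dict(totals)


print(cube[("Vancouver", "Q1", "security")])
print(roll_up_by_city(cube))
```

Because the aggregates are stored per cell, looking up a summarized value is a single dictionary access rather than a scan of the detailed records.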
26
27
TRANSACTIONAL
DATABASES
• A transactional database consists of a file where
each record represents a transaction.
• A transaction typically includes a unique
transaction identity number (trans ID) and a list of
the items making up the transaction (such as items
purchased in a store).
28
OBJECT RELATIONAL DATA
MODEL
• Inherits the essential concepts of object-oriented
databases, where each entity is considered as an
object.
• For example, for the XYZ Electronics company, objects can be
individual employees, customers, or product items.
29
OBJECT RELATIONAL
DATABASE
• Constructed based on the object-relational data model.
• Here, data and code relating to an object are encapsulated
into a single unit.
• Each object has the following features:
• A set of variables that describe the object. These correspond to
the attributes in the entity.
• A set of messages that the object can use to communicate with
other objects or the rest of the database system.
• A set of methods, where each method holds the code to
implement a message. Upon receiving a message, the method
returns a value in response.
• For example, the method get_national_id(employee) will retrieve
and return the national ID of the given employee object.
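The encapsulation of variables and methods can be illustrated with an ordinary Python class. This is only an analogy under stated assumptions: the Employee class, its fields, and the sample values are hypothetical, and a real object-relational database would define such objects in its own schema language rather than in Python.

```python
class Employee:
    """An object that bundles its data (variables) with its code (methods)."""

    def __init__(self, name, national_id):
        # Variables: correspond to the attributes of the entity.
        self.name = name
        self._national_id = national_id

    def get_national_id(self):
        # Method: the code that implements the "get_national_id" message
        # and returns a value in response.
        return self._national_id


# Hypothetical employee object.
employee = Employee("A. Example", "ID-0001")
print(employee.get_national_id())
```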
30
TEMPORAL DATABASE
• Typically stores data that include time-related
attributes.
• These attributes may involve several
timestamps, each having different semantics.
31
SEQUENCE DATABASE
• Stores a sequence of ordered events, with or
without a concrete notion of time.
• Examples: customer shopping sequences, Web click
streams, and biological sequences
32
TIME SERIES DATABASE
• Stores sequences of values or events obtained over
repeated measurements of time (e.g., hourly, daily,
weekly).
• For example, data collected from the stock exchange,
inventory control, and the observation of natural
phenomena such as temperature and wind.
33
SPATIAL DATABASE
• Contains spatial-related information.
• For example, geographic map databases, VLSI or
CAD databases, and medical and satellite image databases
34
SPATIOTEMPORAL
DATABASE
• Stores spatial objects that change with time
• From spatiotemporal database interesting
information can be mined.
• For example, a database of moving objects, such as a
moving car to which a GPS device is attached.
35
TEXT DATABASE
• Text databases are databases that contain word descriptions for
objects.
• These word descriptions are usually not simple keywords but
rather long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages, summary
reports, notes, or other documents. Text databases may be highly
unstructured (such as some Web pages on the World Wide Web).
• Some text databases may be somewhat structured, that is, semi-
structured (such as e-mail messages and many HTML/XML Web
pages), whereas others are relatively well structured (such as
library catalogue databases).
• Text databases with highly regular structures typically can be
implemented using relational database systems.
36
MULTIMEDIA DATABASE
• Multimedia databases store image, audio, and video data.
• They are used in applications such as picture content-based
retrieval, voice-mail systems, video-on-demand systems, the
World Wide Web, and speech-based user interfaces that
recognize spoken commands.
• Multimedia databases must support large objects, because
data objects such as video can require gigabytes of storage.
• Specialized storage and search techniques are also required.
• Because video and audio data require real-time retrieval at a
steady and predetermined rate in order to avoid picture or
sound gaps and system buffer overflows, such data are
referred to as continuous-media data.
37
HETEROGENEOUS
DATABASE
• A heterogeneous database consists of a set of
interconnected, autonomous component
databases.
• The components communicate in order to
exchange information and answer queries.
• Objects in one component database may differ
greatly from objects in other component databases,
making it difficult to incorporate their semantics
into the overall heterogeneous database.
38
LEGACY DATABASE
• A legacy database is a group of heterogeneous
databases
• It combines different kinds of data systems, such as
relational or object-oriented databases, hierarchical
databases, network databases, spreadsheets,
multimedia databases, or file systems.
39
DATA MINING TASK
(CLASSIFICATION)
• Descriptive
• Here mining tasks characterize the general properties of
the data in the database.
• Predictive
• Here mining tasks perform inference (=deduction) on
the current data in order to make predictions.
40
CLASS AND CONCEPTS OF
DATA
• Data can be associated with classes or concepts.
• For example, in the AllElectronics store,
• classes of items for sale include computers and printers, and
• concepts of customers include bigSpenders and budgetSpenders.
• Descriptions of individual classes and concepts in
summarized, concise, and yet precise terms are
called class/concept descriptions.
41
HOW CLASS/CONCEPT
DESCRIPTION DERIVED?
Class/concept descriptions can be derived via:
1) data characterization, by summarizing the data of
the class under study (often called the target
class) in general terms, or
2) data discrimination, by comparison of the target
class with one or a set of comparative classes
(often called the contrasting classes), or
3) both data characterization and discrimination.
42
DATA CHARACTERIZATION
• Data characterization is a summarization of the
general characteristics or features of a target class
of data.
• The data corresponding to the user-specified class
are typically collected by a database query.
43
EXAMPLE OF DATA
CHARACTERIZATION (CONT.)
• A data mining system should be able to produce a
description summarizing the characteristics of
customers who spend more than $1,000 a year at
XYZElectronics.
• The result could be a general profile of the
customers, for example that they are 40 to 50 years old,
employed, and have excellent credit ratings.
• The system should allow users to drill down on any
dimension, such as on occupation in order to view
these customers according to their type of
employment.
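The characterization example above can be sketched as a query followed by a summary. The customer records, the attribute names (age, occupation, credit, spent), and the choice of summary statistics are hypothetical; an actual system would read the target class from a database.

```python
from collections import Counter

# Hypothetical customer records.
customers = [
    {"age": 45, "occupation": "engineer", "credit": "excellent", "spent": 1500},
    {"age": 48, "occupation": "teacher", "credit": "excellent", "spent": 2200},
    {"age": 42, "occupation": "engineer", "credit": "excellent", "spent": 1200},
    {"age": 23, "occupation": "student", "credit": "fair", "spent": 300},
    {"age": 35, "occupation": "clerk", "credit": "good", "spent": 800},
]

# Step 1: collect the target class (customers who spend more than $1,000 a year).
target = [c for c in customers if c["spent"] > 1000]

# Step 2: summarize the general features of the target class.
ages = [c["age"] for c in target]
profile = {
    "age_range": (min(ages), max(ages)),
    "credit": Counter(c["credit"] for c in target).most_common(1)[0][0],
}

# Step 3: drill down on one dimension, here occupation.
by_occupation = Counter(c["occupation"] for c in target)

print(profile)
print(by_occupation)
```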
44
DATA DISCRIMINATION
• Data discrimination is a comparison of the general
features of target class data objects with the general
features of objects from one or a set of contrasting
classes.
• The target and contrasting classes can be specified by
the user, and the corresponding data objects retrieved
through database queries.
For example, the user may like to compare the
general features of software products whose sales
increased by 10% in the last year with those whose
sales decreased by at least 30% during the same period.
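A minimal sketch of this comparison is given below. The product records, the price attribute used as the compared feature, and the reading of "increased by 10%" as "at least 10%" are assumptions made for the illustration.

```python
# Hypothetical software products with their yearly sales change and price.
products = [
    {"name": "A", "sales_change": 0.15, "price": 40},
    {"name": "B", "sales_change": 0.12, "price": 60},
    {"name": "C", "sales_change": -0.35, "price": 200},
    {"name": "D", "sales_change": -0.40, "price": 300},
    {"name": "E", "sales_change": 0.02, "price": 100},
]

# Target class: sales increased by at least 10% (assumed reading).
target = [p for p in products if p["sales_change"] >= 0.10]
# Contrasting class: sales decreased by at least 30%.
contrast = [p for p in products if p["sales_change"] <= -0.30]


def average(records, attribute):
    """Average value of one attribute over a list of records."""
    return sum(r[attribute] for r in records) / len(records)


# Compare one general feature (price) across the two classes.
comparison = {
    "target_avg_price": average(target, "price"),
    "contrast_avg_price": average(contrast, "price"),
}
print(comparison)
```

In this made-up data the products with growing sales are cheaper on average than those with falling sales, which is the kind of contrast a discrimination description would report.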
45
STRUCTURED VS.
UNSTRUCTURED DATA
• Structured Data
• Databases
• XML data
• Data warehouses
• Enterprise systems (CRM, ERP, etc)
• Unstructured Data
• Excel spreadsheets
• Word documents
• Email messages
• RSS feeds
• Audio files
• Video files
46
MINING FREQUENT
PATTERNS
• Frequent patterns are patterns that occur frequently in data.
• Example: itemsets, subsequences, and substructures.
• A frequent itemset typically refers to a set of items that frequently
appear together in a transactional data set, such as milk and bread
• A frequently occurring subsequence, such as the pattern that
customers tend to purchase first a PC, followed by a digital camera,
and then a memory card, is a (frequent) sequential pattern
• A substructure can refer to different structural forms, such as
graphs, trees, or lattices, which may be combined with itemsets or
subsequences.
• Mining frequent patterns leads to the discovery of interesting
associations and correlations within data
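To make the idea of a frequent itemset concrete, the sketch below counts how often each pair of items appears together and keeps the pairs that meet a minimum count. The transactions and the threshold are hypothetical, and the brute-force pair counting is a simplification for illustration, not an efficient mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional data set: one set of items per transaction.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]


def frequent_pairs(transactions, min_count):
    """Return item pairs that occur together in at least min_count transactions."""
    counts = Counter()
    for items in transactions:
        # Sorting gives each pair one canonical order, e.g. ("bread", "milk").
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


result = frequent_pairs(transactions, min_count=2)
print(result)
```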
47
ASSOCIATION RULE
• Association rules are if/then statements that help uncover
relationships between seemingly unrelated data in a relational
database or other information repository. For example: "If a
customer buys a dozen eggs, he is 80% likely to also purchase milk."
• An association rule has two parts: an antecedent (if) and a
consequent (then).
• An antecedent is an item found in the data.
• A consequent is an item that is found in combination with the antecedent.
• Association rules are created by analyzing data for frequent if/then
patterns and using the criteria support and confidence to identify
the most important relationships.
• Support is an indication of how frequently the items appear in the
database.
• Confidence indicates how often the if/then statement has been
found to be true.
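Support and confidence can be computed directly from transaction counts. In the sketch below, support is the fraction of all transactions that contain every item of the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The baskets are hypothetical and were chosen so that the eggs-and-milk rule reaches the 80% confidence used in the example above.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, items):
    """Fraction of all transactions that contain the given items."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of antecedent transactions that also contain the consequent."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / count_containing(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"eggs", "milk"},
    {"eggs", "milk", "bread"},
    {"eggs", "milk", "butter"},
    {"eggs", "milk"},
    {"eggs", "bread"},
    {"bread"},
    {"milk"},
    {"butter"},
    {"bread", "butter"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"eggs", "milk"})
rule_confidence = confidence(baskets, {"eggs"}, {"milk"})
print(rule_support, rule_confidence)
```

Here 4 of the 10 baskets contain both eggs and milk, and 4 of the 5 baskets that contain eggs also contain milk.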
48
49
SINGLE-DIMENSIONAL ASSOCIATION
• buys(X, "computer") → buys(X, "software")
[support = 1%, confidence = 50%]
• where X is a variable representing a customer.
• A confidence, or certainty, of 50% means that if a
customer buys a computer, there is a 50% chance that
she will buy software as well.
• A 1% support means that 1% of all of the transactions
under analysis showed that computer and software
were purchased together.
• This association rule involves a single attribute or
predicate (i.e., buys) that repeats.
• Association rules that contain a single predicate are
referred to as single-dimensional association rules.
50
MULTIDIMENSIONAL ASSOCIATION
RULE
• age(X, "20...29") ∧ income(X, "20K...29K") → buys(X, "CD player")
[2%, 60%]
• The rule indicates that of the XYZ Computers' customers
under study, 2% are 20 to 29 years of age with an
income of 20,000 to 29,000 and have purchased a CD
player at XYZ Computers.
• There is a 60% probability that a customer in this age
and income group will purchase a CD player.
• Note that this is an association between more than one
attribute, or predicate (i.e., age, income, and buys).
• Because each attribute is referred to as a dimension, the
above rule can be referred to as a multidimensional
association rule.
51
CLASSIFICATION
• Classification is a form of data analysis that can be used to extract
models describing important data classes.
• Classification predicts categorical labels
• Finding models (e.g., if-then rules, decision trees, mathematical
formulae, neural networks, classification rules) that describe
and distinguish classes or concepts for future prediction, e.g.,
classify cars based on gas mileage
• Example:
• A bank loans officer needs analysis of his/her data in order to learn
which loan applicants are "safe" and which are "risky" for the bank
• In this example, the data analysis task is classification, where a
model or classifier is constructed to predict categorical labels, such as
"safe" or "risky" for the loan application data.
52
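The loan example from the classification slide can be sketched with a very small classifier: a single if-then rule whose threshold is learned from labeled records. The income attribute, the training records, and the one-threshold model are assumptions made for illustration; practical classifiers such as decision trees or neural networks use many attributes.

```python
def train_stump(records, attribute, label_key="label"):
    """Pick the threshold on one numeric attribute with the fewest errors.

    The learned rule is: if value >= threshold then "safe" else "risky".
    """
    values = sorted({r[attribute] for r in records})
    best_threshold, best_errors = None, len(records) + 1
    for threshold in values:
        errors = sum(
            1
            for r in records
            if ("safe" if r[attribute] >= threshold else "risky") != r[label_key]
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(threshold, value):
    """Apply the learned if-then rule to a new applicant."""
    return "safe" if value >= threshold else "risky"


# Hypothetical labeled loan applications (income in thousands).
training = [
    {"income": 20, "label": "risky"},
    {"income": 30, "label": "risky"},
    {"income": 45, "label": "safe"},
    {"income": 60, "label": "safe"},
    {"income": 80, "label": "safe"},
]

threshold = train_stump(training, "income")
print(threshold, predict(threshold, 50), predict(threshold, 25))
```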
CLASSIFICATION (NN = INPUT + HIDDEN +
OUTPUT LAYER)
53
PREDICTION
• Prediction models continuous-valued functions
• Predicts unknown or missing numerical values
• Suppose that the marketing manager would like to
predict how much a given customer will spend during a
sale at ‘XYZ Electronics’
• This data analysis task is an example of numeric
prediction.
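Numeric prediction can be illustrated with a least-squares line fitted to past observations. This is a sketch under stated assumptions: the single predictor (number of past visits), the spending figures, and the choice of simple linear regression are all hypothetical, and other regression models could be used instead.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: number of past visits and amount spent at a sale.
past_visits = [1, 2, 3, 4, 5]
sale_spend = [30, 50, 70, 90, 110]

slope, intercept = fit_line(past_visits, sale_spend)

# Predict the spending of a customer with 6 past visits.
predicted_spend = slope * 6 + intercept
print(slope, intercept, predicted_spend)
```

Unlike the classification case, the output is a continuous value rather than a categorical label.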
54
DATA EXPLORATION
• is a common process in data warehouses which are
characterized by large bulks of data coming from
disparate systems.
58
WHEN IS A "DISCOVERED"
PATTERN INTERESTING?
• A data mining system/query may generate thousands of
patterns; not all of them are interesting.
• Suggested approach: Human-centered, query-based, focused
mining
• Interestingness measures: A pattern is interesting if it is
easily understood by humans, valid on new or test data with
some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
• Objective: based on statistics and structures of patterns; e.g.,
support, confidence, etc.
• Subjective: based on the user's beliefs about the data; e.g.,
unexpectedness, novelty, actionability, etc.
59
60
DATA MINING:
CONFLUENCE OF MULTIPLE
DISCIPLINES
61
MAJOR ISSUES IN DATA
MINING
• Issues relating to mining methodology and user
interaction
• Issues relating to performance
• Issues relating to the diversity of database types
62
MAJOR ISSUES IN DATA MINING
• Issues relating to mining methodology and user interaction
• Mining different kinds of knowledge from databases
• Knowledge discovery tasks include data characterization,
discrimination, association and correlation analysis, classification, prediction,
clustering, outlier analysis, and evolution analysis
• Pattern evaluation: the interestingness problem
• Incorporation of background knowledge:
• integrity constraints and deduction rules
• Handling noise and incomplete data
• Data mining query languages and ad-hoc mining
• Data mining query languages, such as DMQL, can be designed to support ad
hoc and interactive data mining
• Presentation and visualization of data mining results
• knowledge representation techniques, such as trees, tables, rules, graphs,
charts, crosstabs, matrices, or curves
• Interactive mining of knowledge
• Interactive mining allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.
63
MAJOR ISSUES IN DATA MINING
• Issues relating to performance
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining algorithms
• Issues relating to the diversity of database types
• Handling of relational and complex types of data
• complex data objects, hypertext and multimedia data, spatial
data, temporal data, or transaction data
• Mining information from heterogeneous databases and
global information systems
64
DATA MINING
CLASSIFICATION
• Databases to be mined
• Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
• Knowledge to be mined
• Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
• Techniques
• Database-oriented, data warehouse, machine learning, statistics,
visualization, pattern recognition, neural network, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
65
CONFERENCES AND
JOURNALS ON DATA
MINING
• KDD Conferences
• ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
• SIAM Data Mining Conf. (SDM).
• (IEEE) Int. Conf. on Data Mining (ICDM)
• Conf. on Principles and practices of Knowledge Discovery and Data Mining (PKDD)
• Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
• Other related conferences
• ACM SIGMOD
• VLDB
• (IEEE) ICDE
• WWW, SIGIR
• ICML, CVPR, NIPS
• Journals
• Data Mining and Knowledge Discovery (DAMI or DMKD)
• IEEE Trans. On Knowledge and Data Eng. (TKDE)
• KDD Explorations
• ACM Trans. on KDD
66