Datamining 1
DATA MINING
1
Why Data Mining?
Necessity, who is the mother of invention. – Plato
2
Why Data Mining?
Data mining turns a large collection of data into
knowledge
3
Data Mining
4
What Is Data Mining?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”? No; the following are not:
Simple search and query processing
(Deductive) expert systems
5
Data Mining Applications
6
Data Mining for Financial Data Analysis
7
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing
communities.
Data mining plays an essential role in the knowledge discovery process.
[Figure: the KDD process. Stages shown: Databases, Data Cleaning,
Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation.]
8
Knowledge Discovery (KDD) Process
Data cleaning (to remove noise and inconsistent data)
Data integration (where multiple data sources may be
combined)
Data selection (where data relevant to the analysis task are
retrieved from the database)
Data transformation (where data are transformed and
consolidated into forms appropriate for mining by performing
summary or aggregation operations)
Data mining (an essential process where intelligent methods
are applied to extract data patterns)
Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on interestingness measures)
Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined
knowledge to users)
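The seven steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two sources, the record fields, the function names, and the choice of "item pairs bought together" as the pattern are all hypothetical and not part of the original material.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values are made up).
store_a = [
    {"customer": "c1", "item": "laptop", "amount": 900.0},
    {"customer": "c2", "item": "mouse", "amount": None},  # missing value
    {"customer": "c1", "item": "mouse", "amount": 20.0},
]
store_b = [
    {"customer": "c3", "item": "laptop", "amount": 1100.0},
    {"customer": "c3", "item": "mouse", "amount": 25.0},
    {"customer": "c4", "item": "desk", "amount": 300.0},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Data transformation: aggregate into one item set per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Data mining: count how many baskets contain each item pair."""
    pairs = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs[(a, b)] += 1
    return pairs

def evaluate(pairs, min_count):
    """Pattern evaluation: keep pairs that meet a minimum count."""
    return {p: n for p, n in pairs.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: render patterns as readable lines."""
    return [f"{a} + {b}: bought together by {n} customers"
            for (a, b), n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
relevant = select(records, {"laptop", "mouse"})
patterns = evaluate(mine(transform(relevant)), min_count=2)
report = present(patterns)
print("\n".join(report))
```

In a real system each step is far more involved; the point here is only the order in which the steps feed one another.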
9
Data Warehouses
A data warehouse is a repository of information
collected from multiple sources, stored under a unified
schema, and usually residing at a single site.
It is usually modeled by a multidimensional data
structure, called a data cube.
In a data cube, each dimension corresponds to an
attribute or a set of attributes in the schema, and
each cell stores the value of some aggregate measure,
such as a count.
A data cube provides a multidimensional view of data
and allows the pre-computation and fast access of
summarized data
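A minimal sketch of the idea, assuming three hypothetical dimensions (city, item, quarter) and count as the aggregate measure. The function name build_cube and the use of "*" to mean "all values" are illustrative conventions, not something the slides prescribe.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales facts; the three dimensions are assumptions.
DIMENSIONS = ("city", "item", "quarter")
sales = [
    ("Vancouver", "laptop", "Q1"),
    ("Vancouver", "laptop", "Q2"),
    ("Vancouver", "phone", "Q1"),
    ("Toronto", "laptop", "Q1"),
    ("Toronto", "phone", "Q1"),
]

def build_cube(facts, n_dims):
    """Precompute the count for every combination of dimension values.

    "*" in a position means "all values" of that dimension.
    """
    cube = Counter()
    for fact in facts:
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                cell = tuple(
                    fact[i] if i in kept else "*" for i in range(n_dims)
                )
                cube[cell] += 1
    return cube

cube = build_cube(sales, len(DIMENSIONS))

# Summaries are now lookups instead of scans over the raw facts.
print(cube[("*", "*", "*")])          # all sales
print(cube[("Vancouver", "*", "*")])  # sales in one city
print(cube[("*", "laptop", "Q1")])    # laptops sold in Q1, any city
```

Precomputing every cell trades storage for lookup speed, which is the "pre-computation and fast access" the slide refers to.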
10
Data Warehouses
11
Data Mining: On What Kinds of Data?
12
Data Mining Functionalities
Data mining functionalities are used to specify the
kinds of patterns to be found in data mining tasks
13
Generalization
14
Example: Data Characterization
A customer relationship manager at
“ABCElectronics” may order the following data
mining task: Summarize the characteristics of
customers who spend more than $5000 a year at
“ABCElectronics”.
The result is a general profile of these customers,
such as that they are 40 to 50 years old, employed,
and have excellent credit ratings.
The data mining system should allow the customer
relationship manager to drill down on any
dimension, such as occupation, to view these
customers according to their type of employment.
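The characterization task above might look like this in code. The customer records, field names, and the characterize function are invented for illustration; only the $5000 threshold and the drill-down on occupation come from the example.

```python
from collections import Counter

# Hypothetical customer records; every value here is invented.
customers = [
    {"age": 45, "occupation": "engineer", "credit": "excellent", "spend": 7200},
    {"age": 48, "occupation": "teacher", "credit": "excellent", "spend": 5600},
    {"age": 42, "occupation": "engineer", "credit": "good", "spend": 6100},
    {"age": 23, "occupation": "student", "credit": "fair", "spend": 800},
    {"age": 35, "occupation": "engineer", "credit": "good", "spend": 3000},
]

def characterize(records, threshold):
    """Summarize customers whose yearly spend exceeds the threshold.

    Assumes at least one record exceeds the threshold.
    """
    target = [r for r in records if r["spend"] > threshold]
    ages = [r["age"] for r in target]
    return {
        "count": len(target),
        "age_range": (min(ages), max(ages)),
        "credit": Counter(r["credit"] for r in target),
        # Drill-down: the same target group, split by occupation.
        "by_occupation": Counter(r["occupation"] for r in target),
    }

profile = characterize(customers, threshold=5000)
print(profile["age_range"])
print(profile["by_occupation"].most_common())
```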
15
Example: Data Discrimination
A customer relationship manager at “ABCElectronics” may want
to compare two groups of customers—those who shop for
computer products regularly (e.g., more than twice a month) and
those who rarely shop for such products (e.g., fewer than three
times a year).
The resulting description provides a general comparative profile
of these customers, such as that 80% of the customers who
frequently purchase computer products are between 20 and 40
years old and have a university education
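A small sketch of such a comparison, assuming made-up customer records. The thresholds follow the example (more than twice a month is taken as more than 24 purchases a year); the field names and helper functions are hypothetical.

```python
# Hypothetical customers; purchases_per_year counts computer products only.
customers = [
    {"age": 25, "university": True, "purchases_per_year": 30},
    {"age": 34, "university": True, "purchases_per_year": 40},
    {"age": 38, "university": True, "purchases_per_year": 26},
    {"age": 29, "university": True, "purchases_per_year": 36},
    {"age": 55, "university": True, "purchases_per_year": 28},
    {"age": 61, "university": False, "purchases_per_year": 1},
    {"age": 47, "university": False, "purchases_per_year": 2},
    {"age": 30, "university": True, "purchases_per_year": 0},
]

def share_matching(group, predicate):
    """Fraction of a group that satisfies the predicate."""
    return sum(1 for c in group if predicate(c)) / len(group)

def young_and_educated(c):
    """The feature both groups are compared on."""
    return 20 <= c["age"] <= 40 and c["university"]

# Target class and contrasting class.
frequent = [c for c in customers if c["purchases_per_year"] > 24]
rare = [c for c in customers if c["purchases_per_year"] < 3]

frequent_share = share_matching(frequent, young_and_educated)
rare_share = share_matching(rare, young_and_educated)
print(f"frequent buyers: {frequent_share:.0%}")
print(f"rare buyers:     {rare_share:.0%}")
```

The discriminating description is whatever feature shows a large gap between the two shares.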
19
Answer
The rule indicates that of all the customers under
study, 2% are 20 to 29 years old with an income of
$40,000 to $49,000 and have purchased a laptop
(computer)
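The rule this answer refers to does not appear in this section. Assuming it is an association rule that links the age and income brackets to a laptop purchase, the 2% figure is the rule's support: the fraction of all customers who satisfy every condition at once. The sketch below computes that fraction on invented data; the field names and the two functions are hypothetical.

```python
def support(records, predicate):
    """Fraction of all records that satisfy the predicate."""
    return sum(1 for r in records if predicate(r)) / len(records)

def matches_rule(r):
    """Age 20 to 29, income $40,000 to $49,000, and bought a laptop."""
    return (
        20 <= r["age"] <= 29
        and 40_000 <= r["income"] <= 49_000
        and r["bought_laptop"]
    )

# 100 invented customers: 2 satisfy everything, 3 are in the right
# brackets but bought no laptop, 95 fall outside the brackets.
customers = (
    [{"age": 25, "income": 45_000, "bought_laptop": True}] * 2
    + [{"age": 25, "income": 45_000, "bought_laptop": False}] * 3
    + [{"age": 52, "income": 80_000, "bought_laptop": True}] * 95
)

rule_support = support(customers, matches_rule)
print(f"support = {rule_support:.0%}")
```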
20
Classification
Classification and label prediction
Construct models (functions) based on some training
examples
Describe and distinguish classes or concepts for future
prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
Typical applications: Credit card fraud detection, direct
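As a toy illustration of constructing a model from training examples, here is a one-level decision tree (a "decision stump") that classifies cars by gas mileage. The training data, labels, and function names are assumptions made for this sketch; real decision-tree learners handle many features and use different split criteria.

```python
def train_stump(examples):
    """Learn one threshold on a numeric feature.

    examples: list of (value, label) pairs with exactly two distinct labels.
    Returns (threshold, label_at_or_below, label_above).
    """
    values = sorted({v for v, _ in examples})
    labels = sorted({lab for _, lab in examples})
    best = None
    # Candidate thresholds are midpoints between adjacent distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        # Try both ways of assigning the two labels to the two sides.
        for below, above in (labels, labels[::-1]):
            errors = sum(
                1 for v, lab in examples
                if (below if v <= threshold else above) != lab
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    return best[1:]

def predict(stump, value):
    """Assign a class label to an unseen value."""
    threshold, below, above = stump
    return below if value <= threshold else above

# Hypothetical training set: (miles per gallon, class label).
training = [
    (15, "inefficient"), (18, "inefficient"), (22, "inefficient"),
    (30, "efficient"), (35, "efficient"), (40, "efficient"),
]

stump = train_stump(training)
print(stump)
print(predict(stump, 20), predict(stump, 33))
```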
21
Some Classification Tools
22
Classification and Regression
23
Cluster Analysis
Unsupervised learning (i.e., the class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster
houses to find distribution patterns
Principle: Maximize intra-class similarity and minimize
inter-class similarity
Many methods and applications
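One of the many clustering methods is k-means, sketched here on a single numeric attribute. The house prices are invented, and k-means is chosen only as a familiar example; the slides do not single out a method. The evenly spread starting centers are a simplification to keep the run deterministic.

```python
def kmeans_1d(values, k, iterations=20):
    """Group numbers into k clusters by repeatedly moving centers.

    No class labels are used: groups emerge from the data alone.
    """
    ordered = sorted(values)
    # Deterministic start: pick k points spread over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)]
               for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical house prices in thousands of dollars.
prices = [200, 210, 220, 800, 820, 850]
centers, clusters = kmeans_1d(prices, k=2)
print(clusters)
```

Values inside a cluster end up close to each other and far from the other cluster, which is the similarity principle stated above.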
24
Outlier Analysis
Outlier: A data object that does not comply with the general behavior of
the data
Noise or exception? ― One person’s garbage could be another person’s
treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
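A sketch of outlier detection as a by-product of regression analysis: fit a straight line, then flag points that lie unusually far from it. The data, the function names, and the cutoff of two standard deviations are assumptions for illustration, not a recommended setting.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def regression_outliers(xs, ys, k=2.0):
    """Indices of points whose residual exceeds k residual std devs."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [
        i for i, r in enumerate(residuals)
        if spread > 0 and abs(r) > k * spread
    ]

# Hypothetical data: y grows by 10 per step, except one value.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [10, 20, 30, 40, 150, 60, 70, 80, 90]

flagged = regression_outliers(xs, ys)
print(flagged)
```

Whether a flagged point is noise to discard or an exception worth studying (a fraud case, a rare event) is a judgment the method itself cannot make.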
25
Technologies Used
26
Technologies Used
Statistics
27
Technologies Used
Machine Learning
28
Technologies Used
Information Retrieval
Information retrieval is the science of searching for
documents or for information in documents
29
Major Issues
Mining various and new kinds of knowledge