001Lecture_1 Introduction-1
Text Book:
Assessment Methods
• Class Work: Quizzes + assignments during the term
Item             Degree
Quiz 1           5
Quiz 2           5
Assignment 1     2
Assignment 2     2
Assignment 3     2
Assignment 4     2
Assignment 5     2
Total            20
• Oral and Lab.: 20
The knowledge discovery process is shown in Figure 1.4 as an iterative sequence of the
following steps:
1) Data cleaning (to remove noise and inconsistent data)
2) Data integration (where multiple data sources may be combined)
3) Data selection (where data relevant to the analysis task are retrieved from the database)
4) Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5) Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6) Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7) Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)
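The seven steps above can be sketched as a chain of small functions. This is only an illustrative sketch, not the method of any particular tool: the two toy sources (`store_a`, `store_b`), their records, the revenue-share "pattern", and the 50% threshold are all hypothetical stand-ins chosen to keep the example short.

```python
from collections import defaultdict

# Hypothetical toy sources: (customer, item, amount); None marks a missing value.
store_a = [("ann", "printer", 120.0), ("bob", "computer", None), ("ann", "computer", 900.0)]
store_b = [("cat", "printer", 110.0), ("bob", "printer", 130.0), ("cat", "phone", 300.0)]

def clean(records):
    """Step 1, data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r)]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, items):
    """Step 3, data selection: keep only records relevant to the task."""
    return [r for r in records if r[1] in items]

def transform(records):
    """Step 4, data transformation: aggregate the total amount per item."""
    totals = defaultdict(float)
    for _, item, amount in records:
        totals[item] += amount
    return dict(totals)

def mine(totals):
    """Step 5, data mining: a trivial stand-in 'pattern', each item's revenue share."""
    grand_total = sum(totals.values())
    return {item: total / grand_total for item, total in totals.items()}

def evaluate(patterns, min_share):
    """Step 6, pattern evaluation: keep patterns that pass an interestingness threshold."""
    return {item: share for item, share in patterns.items() if share >= min_share}

def present(patterns):
    """Step 7, knowledge presentation: format the surviving patterns for a user."""
    return [f"{item}: {share:.0%} of revenue" for item, share in sorted(patterns.items())]

integrated = integrate(clean(store_a), clean(store_b))
totals = transform(select(integrated, {"printer", "computer"}))
interesting = evaluate(mine(totals), min_share=0.5)
report = present(interesting)
print(report)
```

In practice the process is iterative: the result of pattern evaluation often sends the analyst back to an earlier step, for example to select different data.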
Data warehouses are constructed via
a process of data cleaning, data
integration, data transformation, data
loading, and periodic data refreshing.
Examples of Large Datasets
Scientific
NASA, EOS project: 50 GB per hour
Environmental datasets
Examples of Data Mining Applications
Database Data.
Data Warehouses.
Transactional Data.
Other Kinds of Data.
Database Data
Data entries can be associated with classes or concepts. For example, in the
AllElectronics store, classes of items for sale include computers and printers,
and concepts of customers include big spenders and budget spenders.
Data characterization is a summarization of the general characteristics or
features of a target class of data.
Data discrimination is a comparison of the general features of the target
class data objects against the general features of objects from one or
multiple contrasting classes.
Mining Frequent Patterns, Associations, and Correlations
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
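To see how strongly the five transactions back each discovered rule, one can compute the usual support and confidence measures. The sketch below assumes the standard definitions (support = fraction of transactions containing every item in the rule; confidence = fraction of transactions containing the left-hand side that also contain the right-hand side); the function names are illustrative.

```python
# The five market-basket transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(lhs | rhs) / len(transactions)

def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs, rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For {Milk} --> {Coke}, Milk appears in four transactions and Milk with Coke in three, so the support is 3/5 and the confidence is 3/4. For {Diaper, Milk} --> {Beer}, the support is 2/5 and the confidence is 2/3.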
Association Rule Discovery: Application 1
[Figure: Training Set -> Learn Classifier -> Model]
Example of a Decision Tree
Tid  Home Owner  Marital Status  Taxable Income  Default
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes

Decision tree:
MarSt
  Married: NO
  Single, Divorced: HO
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
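A quick way to check that the example tree really fits the training data is to encode it as nested conditions and run it over the ten records. In this sketch the tree is read as splitting first on marital status (MarSt), then on home ownership (HO), then on taxable income (TaxInc); the function name `predict_default` is illustrative, and sending an income of exactly 80K down the "> 80K" branch is an assumption, since the slide leaves that case unspecified.

```python
# Training records from the table: (Tid, home owner, marital status, income in K, default).
records = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def predict_default(home_owner, marital_status, income_k):
    """Walk the tree: MarSt at the root, then HO, then TaxInc."""
    if marital_status == "Married":
        return "No"
    # Single or Divorced: test home ownership next.
    if home_owner == "Yes":
        return "No"
    # Not a home owner: split on taxable income.
    # Assumption: exactly 80K is sent down the "> 80K" branch.
    return "No" if income_k < 80 else "Yes"

mismatches = [tid for tid, ho, ms, inc, label in records
              if predict_default(ho, ms, inc) != label]
print("Records the tree misclassifies:", mismatches)
```

An empty mismatch list means the tree reproduces the Default label of every training record. A tree that tests the same attributes in a different order could do so as well, which is the point of the remark that more than one tree can fit the same data.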
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers
likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and information about the account
holder as attributes.
When the customer buys, what they buy, how often they
pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the
class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card
transactions on an account.
Clustering
[Figure: raw data]
Sampling
[Figure: raw data vs. cluster/stratified sample]