Chap1 - Introduction To Machine Learning
New mantra: Gather whatever data you can whenever and wherever possible.
Expectations: Gathered data will have value either for the purpose collected or for a purpose not envisioned.
Figure panels: Traffic Patterns; Social Networking: Twitter
Figure panels: Improving health care and reducing costs; Predicting the impact of climate change
Prediction Methods
– Predict the value of a particular attribute based on the values of other attributes.
– The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Figure: the data table below, surrounded by four tasks: Clustering, Predictive Modeling, Anomaly Detection, Association Rules.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Figure: decision tree for the class Credit Worthy. The root tests Employed; Employed = No predicts No, and Employed = Yes leads to an Education test with branches { High school, Undergrad } and Graduate, each followed by a Number of years test with leaves Yes and No.

Training Set --> Learn Classifier --> Model
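The Training Set --> Learn Classifier --> Model step can be sketched on the four visible rows of the table. This is a minimal illustration, not the decision tree from the figure: it assumes a much simpler learner (a one-level "decision stump" on a single numeric attribute), and the function names `learn_stump` and `predict` are hypothetical. Credit Worthy is the target variable and # years at present address is the only explanatory variable used.

```python
# Training set: the four visible rows of the Credit Worthy table.
training_set = [
    {"employed": "Yes", "education": "Graduate",    "years": 5,  "credit_worthy": "Yes"},
    {"employed": "Yes", "education": "High School", "years": 2,  "credit_worthy": "No"},
    {"employed": "No",  "education": "Undergrad",   "years": 1,  "credit_worthy": "No"},
    {"employed": "Yes", "education": "High School", "years": 10, "credit_worthy": "Yes"},
]


def learn_stump(rows, feature, target, positive="Yes"):
    """Pick the threshold on `feature` that misclassifies the fewest rows.

    The learned rule is: predict `positive` when feature > threshold.
    Candidate thresholds are midpoints between consecutive distinct values.
    """
    values = sorted({row[feature] for row in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(rows) + 1
    for t in candidates:
        errors = sum((row[feature] > t) != (row[target] == positive) for row in rows)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


def predict(threshold, value, positive="Yes", negative="No"):
    """Apply the learned model to a new value of the explanatory variable."""
    return positive if value > threshold else negative


threshold, training_errors = learn_stump(training_set, "years", "credit_worthy")
print("threshold:", threshold, "training errors:", training_errors)
print("8 years at present address ->", predict(threshold, 8))
```

With only four rows the learned threshold is purely illustrative; a real model would also use the other attributes and would be evaluated on data held out from training.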
Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
    Use credit card transactions and the information on the account holder as attributes.
        – When does a customer buy, what do they buy, how often do they pay on time, etc.
    Label past transactions as fraud or fair transactions. This forms the class attribute.
    Learn a model for the class of the transactions.
    Use this model to detect fraud by observing credit card transactions on an account.
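The approach above (label past transactions, learn a model, apply it to new transactions) can be sketched as follows. Everything specific here is an assumption made for illustration: the two attributes (amount and hour of day), the five labeled transactions, the choice of a nearest-centroid model, and the function names `learn_centroids` and `classify`. The slides do not prescribe a particular model.

```python
import math
from collections import defaultdict

# Hypothetical labeled history: ((amount, hour_of_day), class attribute).
past_transactions = [
    ((20.0, 14), "fair"),
    ((35.0, 10), "fair"),
    ((25.0, 18), "fair"),
    ((900.0, 3), "fraud"),
    ((1200.0, 2), "fraud"),
]


def learn_centroids(labeled):
    """Learn one mean attribute vector (centroid) per class label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in labeled:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: tuple(s / counts[label] for s in sums[label]) for label in sums}


def classify(centroids, features):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


model = learn_centroids(past_transactions)
for new_transaction in [(1000.0, 4), (30.0, 12)]:
    print(new_transaction, "->", classify(model, new_transaction))
```

Because the attributes are not rescaled, the amount dominates the distance in this sketch; a practical system would normalize attributes and use far more history per account.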
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Intra-cluster distances are minimized
Inter-cluster distances are maximized
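To make the two quantities concrete, the sketch below measures them for a toy clustering. The six 2-D points and the specific definitions are assumptions for illustration: intra-cluster distance is taken as the mean pairwise distance inside one cluster, and inter-cluster distance as the distance between cluster centroids. Other definitions are also in common use.

```python
import math
from itertools import combinations

# Two hypothetical clusters of 2-D points.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]


def mean_intra_cluster_distance(cluster):
    """Mean Euclidean distance over all pairs of points in one cluster.

    Requires at least two points in the cluster.
    """
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def inter_cluster_distance(a, b):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(a), centroid(b))


intra_a = mean_intra_cluster_distance(cluster_a)
intra_b = mean_intra_cluster_distance(cluster_b)
inter_ab = inter_cluster_distance(cluster_a, cluster_b)
print("intra A:", intra_a, "intra B:", intra_b, "inter A-B:", inter_ab)
```

A good clustering of these points keeps both intra-cluster values small relative to the inter-cluster value, which is the case here.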
Figure: cluster map with axes longitude and latitude; legend: Land Cluster 1, Land Cluster 2, Sea Cluster 1, Sea Cluster 2, Ice or No NPP.
Caption (beginning lost): ... (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.
Introduction to Machine Learning By Mr. Gebreyes G.
Clustering: Application 1
Market Segmentation:
– Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
    Collect different attributes of customers based on their geographical and lifestyle-related information.
    Find clusters of similar customers.
Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
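Document similarity "based on the terms appearing in them" can be sketched with a set-overlap measure. The three documents below are hypothetical, the Jaccard coefficient is one assumed choice of similarity (the slides do not name one), and every word is treated as a term rather than selecting only the important ones.

```python
# Hypothetical documents, identified by short names.
documents = {
    "d1": "stock market trading shares",
    "d2": "market shares fall stock",
    "d3": "football match goal",
}


def terms(text):
    """Split a document into a set of lowercase terms."""
    return set(text.lower().split())


def jaccard(a, b):
    """Shared terms divided by all distinct terms in either document."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sim_d1_d2 = jaccard(terms(documents["d1"]), terms(documents["d2"]))
sim_d1_d3 = jaccard(terms(documents["d1"]), terms(documents["d3"]))
print("d1 vs d2:", sim_d1_d2)
print("d1 vs d3:", sim_d1_d3)
```

A clustering algorithm would then group documents whose pairwise similarity is high, so d1 and d2 would fall in the same group and d3 in another.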
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
    {Milk} --> {Coke}
    {Diaper, Milk} --> {Beer}
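The two discovered rules can be checked against the five transactions. The sketch below uses support and confidence, the standard measures for association rules; the slides do not define them here, so treat the definitions in the comments as an assumption, and the function names as hypothetical.

```python
# The five transactions from the table, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for antecedent, consequent in rules:
    print(sorted(antecedent), "-->", sorted(consequent),
          "support:", support(antecedent, consequent, transactions),
          "confidence:", confidence(antecedent, consequent, transactions))
```

For {Milk} --> {Coke}, three of the five transactions contain both items and four contain Milk, giving support 3/5 and confidence 3/4. For {Diaper, Milk} --> {Beer} the corresponding values are 2/5 and 2/3.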
Market-basket analysis
– Rules are used for sales promotion, shelf management, and inventory management.
Medical Informatics
– Rules are used to find combinations of patient symptoms and test results associated with certain diseases.