Module 4.1 - Data Science
– Purchases at department/grocery stores, e-commerce (Amazon handles millions of visits/day)
– Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Data Science? Scientific Viewpoint
– Improving health care and reducing costs
– Predicting the impact of climate change
Predictive Modeling: Classification
Find a model for the class attribute as a function of the values of the other attributes.
– Example: a model for predicting credit worthiness
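The idea of modeling the class attribute as a function of the other attributes can be sketched with a very small learner. The example below is only an illustration: the records, the attribute meanings (income, years employed) and the choice of a one-split "decision stump" are assumptions, not something taken from the slides.

```python
from collections import Counter

# Hypothetical records: (income in thousands, years employed, class label).
records = [
    (25, 1, "no"), (40, 3, "no"), (60, 2, "yes"),
    (80, 10, "yes"), (30, 8, "no"), (95, 4, "yes"),
]

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def learn_stump(rows):
    """Pick the (attribute, threshold) split whose two sides, each labeled
    with its majority class, misclassify the fewest training rows."""
    best = None
    for attr in range(len(rows[0]) - 1):
        for threshold in sorted({row[attr] for row in rows}):
            left = [row[-1] for row in rows if row[attr] <= threshold]
            right = [row[-1] for row in rows if row[attr] > threshold]
            if not left or not right:
                continue
            low, high = majority(left), majority(right)
            errors = (sum(lab != low for lab in left)
                      + sum(lab != high for lab in right))
            if best is None or errors < best[0]:
                best = (errors, attr, threshold, low, high)
    return best[1:]

def predict(model, row):
    """Apply the learned split to a row of attribute values."""
    attr, threshold, low_label, high_label = model
    return low_label if row[attr] <= threshold else high_label

model = learn_stump(records)
print(model, predict(model, (70, 5)))
```

On this toy data the learned model splits on the first attribute; a realistic credit model would use many more attributes and a stronger learner.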
Classification Example
[Figure: a learning algorithm learns a model (classifier) from the Training Set; the model is then applied to the Test Set.]
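The training set / test set workflow can be sketched as a hold-out evaluation. This is a minimal sketch under stated assumptions: the data are synthetic, the 70/30 split ratio is arbitrary, and a 1-nearest-neighbor rule stands in for whatever classifier is actually learned.

```python
import random

def nearest_neighbor(train, point):
    """Predict the label of the training example closest to `point`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda example: sq_dist(example[0], point))
    return label

# Hypothetical labeled examples: ((feature1, feature2), class).
data = ([((x, y), "A") for x in range(0, 5) for y in range(0, 5)]
        + [((x, y), "B") for x in range(10, 15) for y in range(10, 15)])

# Shuffle with a fixed seed, then hold out 30% of the examples for testing.
rng = random.Random(0)
rng.shuffle(data)
split = len(data) * 7 // 10
training_set, test_set = data[:split], data[split:]

# The model only ever sees the training set; accuracy is measured on the test set.
correct = sum(nearest_neighbor(training_set, x) == label for x, label in test_set)
accuracy = correct / len(test_set)
print(len(training_set), len(test_set), accuracy)
```

Measuring accuracy on examples the model never saw gives an estimate of how it would behave on new data, which is the purpose of keeping the test set separate.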
09/09/2020
Classification: Application 1
Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
  Use credit card transactions and the information on the account holder as attributes.
  – When does a customer buy, what do they buy, how often do they pay on time, etc.
  Label past transactions as fraud or fair transactions. This forms the class attribute.
  Learn a model for the class of the transactions.
  Use this model to detect fraud by observing credit card transactions on an account.
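The approach above (label past transactions, learn a model, apply it to new transactions) can be sketched as follows. Everything specific here is an assumption made for illustration: the two attributes, the toy values, and the use of a nearest-centroid classifier. A real system would also need to scale the attributes and deal with the fact that fraud is rare.

```python
def centroid(rows):
    """Mean attribute vector of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def learn(transactions):
    """Compute one centroid per class label from labeled past transactions."""
    by_class = {}
    for attributes, label in transactions:
        by_class.setdefault(label, []).append(attributes)
    return {label: centroid(rows) for label, rows in by_class.items()}

def classify(model, attributes):
    """Return the label whose centroid is closest to the new transaction."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: sq_dist(model[label], attributes))

# Hypothetical past transactions: ([amount, purchases in the last hour], label).
past = [
    ([20.0, 1], "fair"), ([35.0, 1], "fair"), ([50.0, 2], "fair"),
    ([900.0, 6], "fraud"), ([1200.0, 8], "fraud"),
]

model = learn(past)
print(classify(model, [1000.0, 7]), classify(model, [40.0, 1]))
```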
Classification: Application 2
Classification: Application 3
Sky Survey Cataloging
– Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescopic survey images (from Palomar Observatory).
  – 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
  Segment the image.
  Measure image attributes (features) - 40 of them per object.
  Model the class based on these features.
– Success Story: Found 16 new high red-shift quasars, some of the farthest objects, which are difficult to find!
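The first two steps of the approach (segment the image, then measure attributes per object) can be sketched on a tiny grid of brightness values. This is a toy illustration, not the survey's actual pipeline: the threshold segmentation, the 4-connected flood fill and the three features are assumptions chosen to keep the example short, whereas the slides mention 40 features per object.

```python
from collections import deque

def segment(image, threshold):
    """Return a list of objects; each object is a list of (row, col) pixels
    brighter than `threshold` that are 4-connected to each other."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited bright pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(pixels)
    return objects

def measure(image, pixels):
    """Compute a few simple features for one segmented object."""
    values = [image[y][x] for y, x in pixels]
    return {"area": len(pixels), "peak": max(values),
            "mean": sum(values) / len(values)}

# Hypothetical 4 x 6 image: one compact bright object and one fainter, larger one.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 0, 0, 2, 3],
    [0, 0, 0, 0, 3, 2],
    [0, 0, 0, 0, 0, 0],
]
features = [measure(image, obj) for obj in segment(image, threshold=1)]
print(features)
```

The resulting feature vectors, one per object, are what a classifier would then be trained on in the third step.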
Classifying Galaxies
Courtesy: http://aps.umn.edu
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Regression
Clustering
Applications of Cluster Analysis
Understanding
– Customer profiling for targeted marketing
– Group related documents for browsing
– Group genes and proteins that have similar functionality
– Group stocks with similar price fluctuations
Summarization
– Reduce the size of large data sets
Courtesy: Michael Eisen
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.
[Figure: map of clusters, with longitude on the horizontal axis and latitude on the vertical axis. Legend: Land Cluster 1, Land Cluster 2, Sea Cluster 1, Sea Cluster 2, Ice or No NPP.]
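K-means itself can be sketched in a few lines. The version below is a minimal illustration of the standard Lloyd iteration, assuming two clusters, fixed initial centers and made-up (SST, NPP) pairs; none of the numbers are real measurements, and the figure described above uses more clusters.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest center and moving each center to the mean of its points."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def nearest(p):
        return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))

    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p)].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    labels = [nearest(p) for p in points]
    return centers, labels

# Hypothetical (SST in degrees C, NPP) pairs, not real measurements.
points = [(2.0, 0.1), (3.0, 0.2), (2.5, 0.15),
          (25.0, 0.8), (26.0, 0.9), (24.0, 0.85)]
centers, labels = kmeans(points, centers=[points[0], points[3]])
print(centers, labels)
```

In practice the result depends on the initial centers and on how the attributes are scaled, so several restarts and some normalization are common.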
Clustering: Application 1
Market Segmentation:
– Goal: Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
  Collect different attributes of customers based on their geographical and lifestyle-related information.
  Find clusters of similar customers.
  Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
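The quality check in the last step can be sketched by comparing how far apart buying patterns are within a cluster versus across clusters. The customers, cluster ids and spending vectors below are hypothetical, and average pairwise Euclidean distance is just one reasonable way to make the comparison.

```python
from itertools import combinations
from math import dist

# Hypothetical customers: (cluster id from the segmentation,
# monthly spend per product category).
customers = [
    (0, (100.0, 5.0)), (0, (110.0, 8.0)), (0, (95.0, 4.0)),
    (1, (10.0, 80.0)), (1, (15.0, 90.0)), (1, (12.0, 85.0)),
]

def average(values):
    values = list(values)
    return sum(values) / len(values)

# Compare every pair of customers once.
pairs = list(combinations(customers, 2))
within = average(dist(a[1], b[1]) for a, b in pairs if a[0] == b[0])
between = average(dist(a[1], b[1]) for a, b in pairs if a[0] != b[0])
print(within, between)
```

A good segmentation should give a small within-cluster distance relative to the between-cluster distance, meaning customers in the same segment buy alike.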
Clustering: Application 2
Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
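Similarity between documents based on their terms can be sketched with term-count vectors and cosine similarity. This is a simplified illustration: the three documents are made up, every word is treated as a term, and a real system would usually weight terms (for example by removing stop words) to focus on the important ones.

```python
from collections import Counter
from math import sqrt

def term_vector(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical documents.
docs = {
    "d1": "stock market prices fall",
    "d2": "market prices rise as stock rally continues",
    "d3": "team wins football match",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

print(cosine(vectors["d1"], vectors["d2"]), cosine(vectors["d1"], vectors["d3"]))
```

Documents d1 and d2 share several terms and score higher than d1 and d3, which share none; a clustering algorithm would group documents using similarities like these.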
Deviation/Anomaly/Change Detection
Motivating Challenges
Scalability
High Dimensionality
Non-traditional Analysis
DS Career path