TTDS Lecture 1
TECHNIQUES FOR
DATA SCIENCE
LECTURE 1
Introduction
KDD Process: A Typical View from ML and Statistics
[Figure: KDD process diagram. Recovered labels: Databases, Data Integration, Data Cleaning, Task-relevant Data]
Online news portals: a steady stream of hundreds of new articles every day
http://www.1000genomes.org/page.php
http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
Mobile phones today record a large amount of information about user behavior
GPS records position
Camera produces images
Communication via phone and SMS
Text via Facebook updates
Association with entities via check-ins
Amazon records every item that you browsed, placed into your basket, read reviews about, or purchased.
Google and Bing record all your browsing activity via toolbar plugins. They also record the
queries you issued, the pages you viewed, and the clicks you made.
Suppose that you are the owner of a supermarket and you have
collected billions of market basket records. What information would you
extract from them and how would you use it?
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Product placement
Catalog creation
What if this was an online store?
Recommendations
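One simple way to begin answering the supermarket question is to count how often pairs of items appear in the same basket. The sketch below does this for the five transactions in the table; the `pair_support` helper and the `min_support` threshold of 0.6 are illustrative assumptions, not something the lecture prescribes.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the table (TID -> set of items).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def pair_support(baskets):
    """Map each item pair to the fraction of baskets containing both items."""
    counts = Counter()
    for items in baskets.values():
        # Sort so that each pair has one canonical (alphabetical) form.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


support = pair_support(transactions)

# Hypothetical threshold: keep pairs found in at least 60% of baskets.
min_support = 0.6
frequent = sorted(pair for pair, s in support.items() if s >= min_support)

for pair in frequent:
    print(pair, support[pair])
```

Pairs that co-occur often are natural candidates for product placement in a physical store or for recommendations in an online one. With billions of baskets, enumerating all pairs this way would be too slow, which is part of what motivates dedicated frequent-itemset algorithms.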
What can you do with the data?
Suppose you are a search engine and you have a toolbar log
consisting of
pages browsed,
queries,
pages clicked,
ads clicked,
each with a user id and a timestamp. What information would you like
to get out of the data?
Ad click prediction
Query reformulations
What can you do with the data?
Suppose you are a biologist who has microarray expression data:
thousands of genes, and their expression values over thousands of
different settings (e.g. tissues). What information would you like to
get out of your data?
Suppose you are a stock broker and you observe the fluctuations of
multiple stocks over time. What information would you like to get out
of your data?
Clustering of stocks
Correlation of stocks
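As a minimal sketch of what "correlation of stocks" could mean in practice, the code below computes the Pearson correlation between the period-to-period returns of two price series. The tickers and prices are invented purely for illustration, and the helper names `returns` and `pearson` are assumptions of this example.

```python
import math


def returns(prices):
    """Relative change between consecutive prices."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Invented toy prices (hypothetical tickers), not real market data.
prices = {
    "AAA": [100, 102, 101, 105, 107],
    "BBB": [50, 51, 50.5, 52.5, 53.5],  # moves in step with AAA
    "CCC": [80, 79, 80, 77, 76],        # tends to move against AAA
}

rets = {ticker: returns(series) for ticker, series in prices.items()}

corr_ab = pearson(rets["AAA"], rets["BBB"])
corr_ac = pearson(rets["AAA"], rets["CCC"])
print("AAA vs BBB:", corr_ab)
print("AAA vs CCC:", corr_ac)
```

A matrix of such pairwise correlations is also one possible input for clustering stocks, for example by treating one minus the correlation as a distance.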
[Figure: pyramid of layers; potential to support business decisions increases toward the top. Layers from top to bottom, with the associated role in parentheses:]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting