
TTDS Lecture 1

The document provides an introduction to the Knowledge Discovery in Databases (KDD) process, highlighting the importance of data mining in extracting valuable insights from complex data. It outlines the evolution of sciences from empirical to data science, emphasizing the role of computational methods in handling vast amounts of data. Various examples of data types, including transaction, document, network, genomic, environmental, and behavioral data, are presented to illustrate the diverse applications of data mining in different fields.


TOOLS & TECHNIQUES FOR DATA SCIENCE
LECTURE 1
Introduction

Prepared by: Dr. Danish Jamil


Knowledge Discovery (KDD) Process
 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process

[Figure: KDD pipeline, from bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

 This is a view from typical machine learning and statistics communities
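The three stages above can be sketched on a toy example. This is only an illustration, not the method of any particular library: the readings are hypothetical, min-max scaling stands in for "normalization", a z-score rule stands in for "outlier analysis", and the threshold of 2.0 is an arbitrary assumption.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Pre-processing: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Data mining (outlier analysis): distance from the mean in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def select_outliers(values, threshold=2.0):
    """Post-processing (pattern selection): keep indices whose |z| exceeds the threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]


# Hypothetical input data: seven sensor readings, one of them suspicious.
readings = [10.0, 11.0, 9.5, 10.5, 10.2, 30.0, 9.8]

normalized = min_max_normalize(readings)                # pre-processing
outliers = select_outliers(normalized, threshold=2.0)   # mining + post-processing
print(outliers)
```

Only the reading at index 5 (the value 30.0) lies more than two standard deviations from the mean, so it is the only one reported.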


The data is also very complex

 Multiple types of data: tables, time series, images, graphs, etc.

 Spatial and temporal aspects

 Interconnected data of different types:

 From a mobile phone we can collect the location of the user, friendship information, check-ins to venues, opinions through Twitter, images through cameras, and queries to search engines
Evolution of Sciences
 Before 1600, empirical science
 1600-1950s, theoretical science
 Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
 1950s-1990s, computational science
 Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
 Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
 1990-now, data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that makes all these archives universally accessible
 Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
 Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web databases
 2000s:
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems
Example: transaction data

 Billions of real-life customers:

 Walmart: 20M transactions per day
 AT&T: 300M calls per day
 Credit card companies: billions of transactions per day

 Point cards allow companies to collect information about specific users
Example: document data

 Web as a document repository: an estimated 50 billion web pages

 Wikipedia: 4 million articles (and counting)

 Online news portals: a steady stream of hundreds of new articles every day

 Twitter: ~300 million tweets every day


Example: network data

 Web: 50 billion pages linked via hyperlinks

 Facebook: 500 million users

 Twitter: 300 million users

 Instant messenger: ~1 billion users

 Blogs: 250 million blogs worldwide; even presidential candidates run blogs


Example: genomic sequences

 http://www.1000genomes.org/page.php

 Full sequence of 1000 individuals

 3*10^9 nucleotides per person → 3*10^12 nucleotides

 Lots more data in fact: medical history of the persons, gene expression data
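The arithmetic behind the nucleotide count can be checked directly. The storage estimate that follows is not from the slides: it assumes a naive encoding of 2 bits per nucleotide (four symbols A, C, G, T) with no compression, quality scores, or metadata.

```python
# Figures taken from the slide.
NUCLEOTIDES_PER_PERSON = 3 * 10**9
INDIVIDUALS = 1000

total_nucleotides = NUCLEOTIDES_PER_PERSON * INDIVIDUALS  # 3 * 10**12

# Assumption (not in the slides): 2 bits per nucleotide, nothing else stored.
BITS_PER_NUCLEOTIDE = 2
total_bytes = total_nucleotides * BITS_PER_NUCLEOTIDE // 8
total_gigabytes = total_bytes / 10**9

print(f"{total_nucleotides:.1e} nucleotides, about {total_gigabytes:.0f} GB at 2 bits each")
```

Under that assumption the raw sequences alone come to roughly 750 GB, before any of the additional medical or gene expression data is counted.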
Example: environmental data

 Climate data (just an example): http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php

 “a database of temperature, precipitation and pressure records managed by the National Climatic Data Center, Arizona State University and the Carbon Dioxide Information Analysis Center”

 “6000 temperature stations, 7500 precipitation stations, 2000 pressure stations”
 Spatiotemporal data
Behavioral data

 Mobile phones today record a large amount of information about user behavior
 GPS records position
 Camera produces images
 Communication via phone and SMS
 Text via Facebook updates
 Association with entities via check-ins

 Amazon collects all the items that you browsed, placed into your basket, read reviews about, or purchased.

 Google and Bing record all your browsing activity via toolbar plugins. They also record the queries you issued, the pages you saw, and the clicks you made.

 Data collected for millions of users on a daily basis


So, what is Data?
 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
 Examples: eye color of a person, temperature, etc.
 Attribute is also known as variable, field, characteristic, or feature
 A collection of attributes describes an object
 Object is also known as record, point, case, sample, entity, or instance

Each row is an object; each column is an attribute.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Size: Number of objects
Dimensionality: Number of attributes
Sparsity: Number of populated object-attribute pairs
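As a small illustration, the three quantities can be computed for the tax table above. This sketch assumes that Tid is only an identifier and not an attribute, stores income in thousands, and would use None to mark a missing value (none occur here).

```python
# Attribute names from the table; Tid is treated as an identifier (assumption).
ATTRIBUTES = ["Refund", "Marital Status", "Taxable Income", "Cheat"]

# One tuple per object: (Tid, Refund, Marital Status, Taxable Income in K, Cheat).
records = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

size = len(records)               # number of objects
dimensionality = len(ATTRIBUTES)  # number of attributes

# Populated object-attribute pairs: every non-missing value, skipping the Tid.
populated_pairs = sum(
    1 for _tid, *values in records for value in values if value is not None
)

print(size, dimensionality, populated_pairs)
```

Because every cell in this table is filled, the number of populated pairs equals size times dimensionality; in sparse data such as market baskets or documents it is usually far smaller.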
Types of Attributes

 There are different types of attributes

 Categorical
 Examples: eye color, zip codes, words, rankings (e.g., good, fair, bad), height in {tall, medium, short}
 Nominal (no order or comparison) vs Ordinal (values can be ordered, but differences between them are not meaningful)
 Numeric
 Examples: dates, temperature, time, length, value, count.
 Discrete (counts) vs Continuous (temperature)
 Special case: Binary attributes (yes/no, exists/not exists)
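The difference between nominal and ordinal attributes shows up in which operations make sense. In the sketch below, the rank mapping and the sample values are assumptions chosen for illustration, reusing the "good, fair, bad" ranking from the examples above.

```python
# Nominal attribute: only equality is meaningful.
eye_colors = ["brown", "blue", "brown"]
same_color = eye_colors[0] == eye_colors[2]

# Ordinal attribute: an explicit rank gives the order (assumed mapping).
ORDINAL_RANK = {"bad": 0, "fair": 1, "good": 2}


def ordinal_less_than(a, b):
    """Compare two ordinal values by rank; the size of the gap carries no meaning."""
    return ORDINAL_RANK[a] < ORDINAL_RANK[b]


ratings_sorted = sorted(["good", "bad", "fair"], key=ORDINAL_RANK.get)

# Binary attribute: a special case with exactly two values.
refund = {"Yes": True, "No": False}["Yes"]

print(same_color, ratings_sorted, ordinal_less_than("fair", "good"), refund)
```

Sorting the ratings is valid because they are ordinal, but subtracting their ranks would not be: nothing says that the step from "bad" to "fair" equals the step from "fair" to "good".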
Numeric Record Data
 If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

 Such a data set can be represented by an n-by-d data matrix, where there are n rows, one for each object, and d columns, one for each attribute

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
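Treating each row as a point makes geometric operations available. The sketch below stores the two rows of the table as a 2-by-5 matrix and measures how far apart they are. Euclidean distance is used only as an example, and no rescaling of the columns is applied, which a real analysis would likely need.

```python
import math

COLUMNS = ["Projection of x load", "Projection of y load", "Distance", "Load", "Thickness"]

# n-by-d data matrix: one row per object, one column per attribute.
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

n = len(data_matrix)     # number of objects (rows)
d = len(data_matrix[0])  # number of attributes (columns)

# Each object is a point in d-dimensional space, so distances are defined.
distance = math.dist(data_matrix[0], data_matrix[1])

print(f"n={n}, d={d}, Euclidean distance={distance:.3f}")
```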
Categorical Data

 Data that consists of a collection of records, each of which consists of a fixed set of categorical attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          High            No
2    No      Married         Medium          No
3    No      Single          Low             No
4    Yes     Married         High            No
5    No      Divorced        Medium          Yes
6    No      Married         Low             No
7    Yes     Divorced        High            No
8    No      Single          Medium          Yes
9    No      Married         Medium          No
10   No      Single          Medium          Yes
What can you do with the data?

 Suppose that you are the owner of a supermarket and you have collected billions of market basket records. What information would you extract from them and how would you use it?

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Possible uses: product placement, catalog creation, recommendations

 What if this was an online store?
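One simple thing to extract from the five baskets above is which pairs of items are bought together most often. The sketch below counts co-occurring pairs by brute force. The minimum count of 3 is an arbitrary assumption, and this is plain counting rather than a scalable frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the slide, keyed by TID.
baskets = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count how many baskets contain each unordered pair of items.
pair_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

MIN_COUNT = 3  # assumed threshold: the pair appears in at least 3 of 5 baskets
frequent_pairs = sorted(pair for pair, count in pair_counts.items() if count >= MIN_COUNT)

for pair in frequent_pairs:
    support = pair_counts[pair] / len(baskets)
    print(pair, f"support={support:.1f}")
```

Pairs like these are the raw material for the uses listed above: items that are frequently bought together can be shelved near each other or recommended jointly.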
What can you do with the data?

 Suppose you are a search engine and you have a toolbar log consisting of
 pages browsed,
 queries,
 pages clicked,
 ads clicked
each with a user id and a timestamp. What information would you like to get out of the data?

Possible uses: ad click prediction, query reformulations
What can you do with the data?
 Suppose you are a biologist who has microarray expression data: thousands of genes, and their expression values over thousands of different settings (e.g. tissues). What information would you like to get out of your data?

Possible uses: groups of genes and tissues


What can you do with the data?

 Suppose you are a stock broker and you observe the fluctuations of multiple stocks over time. What information would you like to get out of your data?

Clustering of stocks

Correlation of stocks

Stock Value prediction
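Correlation of stocks, for instance, can be measured with the Pearson coefficient between two price series. The tickers and prices below are made up for illustration, and the function is a minimal hand-written sketch rather than a library routine.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical daily closing prices for three made-up tickers.
prices = {
    "AAA": [10.0, 11.0, 12.0, 13.0, 14.0],
    "BBB": [20.0, 22.0, 24.0, 26.0, 28.0],
    "CCC": [5.0, 4.0, 3.0, 2.0, 1.0],
}

same_direction = pearson(prices["AAA"], prices["BBB"])
opposite_direction = pearson(prices["AAA"], prices["CCC"])
print(round(same_direction, 3), round(opposite_direction, 3))
```

Values near +1 indicate stocks that move together and values near -1 indicate stocks that move in opposite directions; a matrix of such coefficients is also a natural input for clustering stocks.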


Data Mining in Business Intelligence

Layers from top to bottom; the potential to support business decisions increases toward the top:

Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Evaluation of Knowledge
 Is all mined knowledge interesting?
 One can mine a tremendous amount of “patterns” and knowledge
 Some may fit only a certain dimension space (time, location, …)
 Some may not be representative, may be transient, …
 Evaluation of mined knowledge → directly mine only interesting knowledge?
 Descriptive vs. predictive
 Coverage
 Typicality vs. novelty
 Accuracy
 Timeliness
 …
