TTDS Lecture 1

The document provides an introduction to the Knowledge Discovery in Databases (KDD) process, highlighting the importance of data mining in extracting valuable insights from complex data. It outlines the evolution of sciences from empirical to data science, emphasizing the role of computational methods in handling vast amounts of data. Various examples of data types, including transaction, document, network, genomic, environmental, and behavioral data, are presented to illustrate the diverse applications of data mining in different fields.


TOOLS & TECHNIQUES FOR DATA SCIENCE
LECTURE 1
Introduction

Prepared by: Dr. Danish Jamil

Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process

[Figure: KDD process, from bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

• This is a view from typical machine learning and statistics communities
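The three stages can be sketched in a few lines of Python. This is only an illustration, not a prescribed implementation: the function names (`preprocess`, `mine`, `postprocess`), the toy readings, and the z-score threshold of 2 are all assumptions made for the example. It uses normalization as the pre-processing step, outlier analysis as the mining step, and pattern ranking as the post-processing step.

```python
from statistics import mean, pstdev

def preprocess(values):
    """Pre-processing: z-score normalization (zero mean, unit variance)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def mine(z_scores, threshold=2.0):
    """Data mining: outlier analysis, flag points far from the mean."""
    return [(i, z) for i, z in enumerate(z_scores) if abs(z) > threshold]

def postprocess(patterns):
    """Post-processing: rank the discovered patterns by strength."""
    return sorted(patterns, key=lambda p: abs(p[1]), reverse=True)

# Hypothetical input data: nine ordinary readings and one extreme value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 25.0]
outliers = postprocess(mine(preprocess(readings)))
print(outliers)  # (index, z-score) pairs for the flagged readings
```

Each stage takes the previous stage's output, which mirrors the left-to-right flow of the diagram.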

The data is also very complex

• Multiple types of data: tables, time series, images, graphs, etc.
• Spatial and temporal aspects
• Interconnected data of different types:
  • From a mobile phone we can collect the user's location, friendship information, check-ins to venues, opinions through Twitter, images through cameras, and queries to search engines
Evolution of Sciences
• Before 1600: empirical science
• 1600-1950s: theoretical science
  • Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
• 1950s-1990s: computational science
  • Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
  • Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
• 1990-now: data science
  • The flood of data from new scientific instruments and simulations
  • The ability to economically store and manage petabytes of data online
  • The Internet and computing Grid that make all these archives universally accessible
  • Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
• Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
• 1960s:
  • Data collection, database creation, IMS and network DBMS
• 1970s:
  • Relational data model, relational DBMS implementation
• 1980s:
  • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
  • Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global information systems
Example: transaction data

• Billions of real-life customers:
  • Walmart: 20M transactions per day
  • AT&T: 300M calls per day
  • Credit card companies: billions of transactions per day
• Point (loyalty) cards allow companies to collect information about specific users
Example: document data

• Web as a document repository: an estimated 50 billion web pages
• Wikipedia: 4 million articles (and counting)
• Online news portals: a steady stream of hundreds of new articles every day
• Twitter: ~300 million tweets every day

Example: network data

• Web: 50 billion pages linked via hyperlinks
• Facebook: 500 million users
• Twitter: 300 million users
• Instant messenger: ~1 billion users
• Blogs: 250 million blogs worldwide; even presidential candidates run blogs

Example: genomic sequences

• http://www.1000genomes.org/page.php
• Full sequence of 1000 individuals
• 3 × 10^9 nucleotides per person → 3 × 10^12 nucleotides
• Lots more data in fact: medical history of the persons, gene expression data
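The arithmetic behind the total can be checked directly. The sketch below also gives a rough storage figure under an assumption that is not in the slides: each nucleotide (A, C, G, or T) is stored in 2 bits, with no compression or metadata.

```python
individuals = 1000
nucleotides_per_person = 3 * 10**9

# 1000 people times 3 * 10**9 nucleotides each gives 3 * 10**12 in total.
total_nucleotides = individuals * nucleotides_per_person

# Assumption for illustration: 4 possible bases, so 2 bits per nucleotide.
bits_per_nucleotide = 2
total_bytes = total_nucleotides * bits_per_nucleotide // 8

print(f"{total_nucleotides:.0e} nucleotides")
print(f"{total_bytes / 10**9:.0f} GB at 2 bits per nucleotide")
```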
Example: environmental data

• Climate data (just an example): http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
• “a database of temperature, precipitation and pressure records managed by the National Climatic Data Center, Arizona State University and the Carbon Dioxide Information Analysis Center”
• “6000 temperature stations, 7500 precipitation stations, 2000 pressure stations”
• Spatiotemporal data
Behavioral data

• Mobile phones today record a large amount of information about user behavior
  • GPS records position
  • Camera produces images
  • Communication via phone and SMS
  • Text via Facebook updates
  • Association with entities via check-ins
• Amazon collects all the items that you browsed, placed into your basket, read reviews about, or purchased.
• Google and Bing record all your browsing activity via toolbar plugins. They also record the queries you issued, the pages you viewed, and the links you clicked.
• Data collected for millions of users on a daily basis

So, what is Data?

• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  • Examples: eye color of a person, temperature, etc.
  • Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
  • Object is also known as record, point, case, sample, entity, or instance

Columns are attributes; rows are objects:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• Size: Number of objects
• Dimensionality: Number of attributes
• Sparsity: Number of populated object-attribute pairs
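The three quantities can be computed for the ten-record table on this slide. This is a minimal sketch that assumes `Tid` is a record identifier rather than an attribute, and that a missing value would be stored as `None`; the slide does not specify either convention.

```python
# Attribute names; Tid is treated as an identifier (an assumption).
columns = ["Refund", "Marital Status", "Taxable Income", "Cheat"]

# One entry per object, keyed by Tid.
records = {
    1: ("Yes", "Single", "125K", "No"),
    2: ("No", "Married", "100K", "No"),
    3: ("No", "Single", "70K", "No"),
    4: ("Yes", "Married", "120K", "No"),
    5: ("No", "Divorced", "95K", "Yes"),
    6: ("No", "Married", "60K", "No"),
    7: ("Yes", "Divorced", "220K", "No"),
    8: ("No", "Single", "85K", "Yes"),
    9: ("No", "Married", "75K", "No"),
    10: ("No", "Single", "90K", "Yes"),
}

size = len(records)              # number of objects
dimensionality = len(columns)    # number of attributes
# Populated object-attribute pairs; None would mark a missing value.
populated = sum(v is not None for row in records.values() for v in row)

print(size, dimensionality, populated)
```

Here every cell is filled, so the number of populated pairs equals size times dimensionality; in sparse data such as market baskets it is far smaller.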
Types of Attributes

• There are different types of attributes
• Categorical
  • Examples: eye color, zip codes, words, rankings (e.g., good, fair, bad), height in {tall, medium, short}
  • Nominal (no order or comparison) vs. Ordinal (ordered, but differences between values are not meaningful)
• Numeric
  • Examples: dates, temperature, time, length, value, count
  • Discrete (counts) vs. Continuous (temperature)
  • Special case: Binary attributes (yes/no, exists/not exists)
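The nominal/ordinal distinction matters when attributes are turned into numbers. The sketch below shows one common convention, offered as an illustration rather than the only option: indicator (one-hot) columns for nominal values and rank codes for ordinal values. The helper names and the eye-color category list are hypothetical.

```python
def one_hot(value, categories):
    """Nominal: no order, so use one 0/1 indicator per category."""
    return [1 if value == c else 0 for c in categories]

def ordinal_rank(value, ordered_categories):
    """Ordinal: order matters, so map each category to its rank."""
    return ordered_categories.index(value)

eye_colors = ["blue", "brown", "green"]   # nominal (hypothetical list)
heights = ["short", "medium", "tall"]     # ordinal, from the height example

print(one_hot("brown", eye_colors))
print(ordinal_rank("tall", heights))
# Ranks can be compared, but their differences carry no meaning.
print(ordinal_rank("short", heights) < ordinal_rank("tall", heights))
```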
Numeric Record Data
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an n-by-d data matrix, where there are n rows, one for each object, and d columns, one for each attribute

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
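As a small illustration, the two records from this slide can be stored as an n-by-d matrix (here a plain list of rows) and treated as points in 5-dimensional space. Euclidean distance is used only as an example of what the geometric view allows; the slide does not prescribe a distance measure.

```python
import math

# n-by-d data matrix: n = 2 objects (rows), d = 5 attributes (columns).
attributes = ["Projection of x load", "Projection of y load",
              "Distance", "Load", "Thickness"]
X = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

n, d = len(X), len(X[0])

# Each row is a point in d-dimensional space, so distances are well defined.
distance = math.dist(X[0], X[1])

print(n, d, round(distance, 3))
```

In practice the attributes would usually be normalized first, since they are on different scales.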
Categorical Data

• Data that consists of a collection of records, each of which consists of a fixed set of categorical attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          High            No
2    No      Married         Medium          No
3    No      Single          Low             No
4    Yes     Married         High            No
5    No      Divorced        Medium          Yes
6    No      Married         Low             No
7    Yes     Divorced        High            No
8    No      Single          Medium          Yes
9    No      Married         Medium          No
10   No      Single          Medium          Yes
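The Taxable Income column here is a discretized version of the numeric incomes in the earlier "So, what is Data?" table (125K, 100K, 70K, and so on). A minimal sketch of such a discretization follows. The cut points of 75K and 110K are assumptions picked so that the output matches this table; the slides do not state how the levels were chosen.

```python
def income_level(income_k, low_cut=75, high_cut=110):
    """Map taxable income (in thousands) to Low / Medium / High.

    The cut points are hypothetical, chosen to reproduce the table.
    """
    if income_k < low_cut:
        return "Low"
    if income_k < high_cut:
        return "Medium"
    return "High"

# Numeric incomes for Tid 1..10, in thousands.
incomes_k = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
levels = [income_level(x) for x in incomes_k]
print(levels)
```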
What can you do with the data?

• Suppose that you are the owner of a supermarket and you have collected billions of market basket data. What information would you extract from it and how would you use it?

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Possible uses: Product placement, Catalog creation, Recommendations

• What if this was an online store?
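One thing that could be extracted from the five baskets on this slide is which pairs of items are often bought together. The sketch below counts co-occurring pairs by brute force; the minimum support of 3 baskets is a hypothetical threshold, and this simple counting is only an illustration, not a scalable frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# The five market baskets from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# Count how many baskets contain each pair of items.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 3  # hypothetical threshold: pair appears in at least 3 baskets
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)
```

Pairs that pass the threshold are candidates for shelf placement in a physical store or for recommendations in an online one.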
What can you do with the data?

• Suppose you are a search engine and you have a toolbar log consisting of
  • pages browsed,
  • queries,
  • pages clicked,
  • ads clicked
  each with a user id and a timestamp. What information would you like to get out of the data?

Possible uses: Ad click prediction, Query reformulations
What can you do with the data?
• Suppose you are a biologist who has microarray expression data: thousands of genes, and their expression values over thousands of different settings (e.g., tissues). What information would you like to get out of your data?

Possible uses: Groups of genes and tissues

What can you do with the data?

• Suppose you are a stock broker and you observe the fluctuations of multiple stocks over time. What information would you like to get out of your data?

Possible uses: Clustering of stocks, Correlation of stocks, Stock value prediction
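Correlation of stocks can be made concrete with the Pearson correlation coefficient. The sketch below is an illustration only: the price series are invented toy values, not real market data, and `pearson` is a hypothetical helper written out by hand.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical daily closing prices, invented for the example.
stock_a = [100, 102, 101, 105, 107]
stock_b = [50, 51, 50.5, 52.5, 53.5]   # always half of A: moves with A
stock_c = [80, 78, 79, 75, 73]         # 180 minus A: moves against A

print(pearson(stock_a, stock_b))
print(pearson(stock_a, stock_c))
```

Values near +1 or -1 indicate stocks that move together or in opposite directions, which is also a natural similarity measure for clustering them.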

Data Mining in Business Intelligence

[Figure: pyramid of layers, listed here from top to bottom. The potential to support business decisions increases toward the top.]

• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Evaluation of Knowledge
• Is all mined knowledge interesting?
  • One can mine a tremendous number of “patterns” and a large amount of knowledge
  • Some may fit only a certain dimension space (time, location, …)
  • Some may not be representative, may be transient, …
• Evaluation of mined knowledge → directly mine only interesting knowledge?
  • Descriptive vs. predictive
  • Coverage
  • Typicality vs. novelty
  • Accuracy
  • Timeliness
  • …
