TT02 Data, Methods, and Scenarios

Data, Methods, and Scenarios

Mining Massive Datasets

Prof. Carlos "ChaTo" Castillo (they/them)
https://github.com/chatox/data-mining-course/
Main Sources
● Data Mining, The Textbook (2015) by Charu Aggarwal
(Chapter 1) + slides by Lijun Zhang
● Mining of Massive Datasets, 2nd edition (2014) by
Leskovec et al. (Chapter 1)
● Data Mining Concepts and Techniques, 3rd edition (2011)
by Han et al. (Chapters 1-2)
Contents
●Types of data
●Types of problem
●Example scenarios
●Major challenges
Data types
Nondependency / Dependency
●Nondependency oriented data can be
structured so items are separate
− Relational data, text data
●Dependency oriented data includes
relationships between items
− Graphs, time series
Mixed attribute data
● Most attributes we will deal with are numerical: they quantify something
● Sometimes attributes are categorical
− Example: elephant, tiger, moose, ...
− Binary (two categories)
  ● Example: present, absent
− Ordinal (two or more categories that can be naturally sorted)
  ● Example: low, medium, high
● Real-world datasets include a mixture of types
Binary attributes, sets, dummy vars.
● Every binary attribute can be used as a marker of belonging to a set and vice versa
● One-hot encoding: every categorical attribute taking one of k values can be encoded as k “dummy” binary attributes

Name      Zip code  ParentCo    Capacity
Moog      08001     NULL        Small
Macarena  08002     NULL        Small
Input     08038     NULL        Medium
Loft      08018     Razzmatazz  Large
Nitsa     08004     Apolo       Large
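As a rough illustration of one-hot encoding, the sketch below encodes the Capacity attribute of the venue table in plain Python. The function name `one_hot` and the `Capacity=<value>` column naming are assumptions made for this example; they do not come from the slides.

```python
# Venue table from the slide; NULL is represented as None.
rows = [
    {"Name": "Moog", "Zip code": "08001", "ParentCo": None, "Capacity": "Small"},
    {"Name": "Macarena", "Zip code": "08002", "ParentCo": None, "Capacity": "Small"},
    {"Name": "Input", "Zip code": "08038", "ParentCo": None, "Capacity": "Medium"},
    {"Name": "Loft", "Zip code": "08018", "ParentCo": "Razzmatazz", "Capacity": "Large"},
    {"Name": "Nitsa", "Zip code": "08004", "ParentCo": "Apolo", "Capacity": "Large"},
]


def one_hot(rows, attribute):
    """Replace a categorical attribute by one binary column per distinct value."""
    # Distinct values, in order of first appearance.
    values = list(dict.fromkeys(row[attribute] for row in rows))
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {k: v for k, v in row.items() if k != attribute}
        # Add one "dummy" binary column per value.
        for value in values:
            new_row[f"{attribute}={value}"] = int(row[attribute] == value)
        encoded.append(new_row)
    return encoded


encoded = one_hot(rows, "Capacity")
for row in encoded:
    print(row)
```

Exactly one of the dummy columns is 1 in each row, which is where the name "one-hot" comes from.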
Question
Suppose you encode capacity using one-hot encoding.
How many columns will your new dataset have?

Name      Zip code  ParentCo    Capacity
Moog      08001     NULL        Small
Macarena  08002     NULL        Small
Input     08038     NULL        Medium
Loft      08018     Razzmatazz  Large
Nitsa     08004     Apolo       Large
Textual data
● Text can be represented as:
− A string
− “Bag of words”: a set of binary
variables, one for each word in
the dictionary, with value True iff
the word belongs to the text
− “Vector space”: a set of
numerical variables indicating
number of occurrences (often
normalized by collection
frequency)
http://uc-r.github.io/creating-text-features
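The two vector representations can be sketched in a few lines of Python. The two example sentences, the whitespace tokenizer, and the function names are assumptions for illustration. The normalization by collection frequency mentioned above is left out to keep the sketch short.

```python
from collections import Counter

# Hypothetical mini-collection of two documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


# The dictionary: every distinct word in the collection, in a fixed order.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})


def bag_of_words(text):
    """One binary variable per dictionary word: True iff the word occurs in the text."""
    present = set(tokenize(text))
    return [word in present for word in vocabulary]


def term_counts(text):
    """One numerical variable per dictionary word: its number of occurrences."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


print(vocabulary)
print(bag_of_words(docs[0]))
print(term_counts(docs[1]))
```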
Time series data
●Contextual attributes
− Timestamps, sequence number, …
●Behavioral attributes
− Readings of a sensor, value of the variable, …

Multivariate time series data has multiple behavioral attributes
Spatial data
Girona elevation map
●Two (lat/long) or three
(lat/long/elevation) spatial
attributes
●This is remote sensing data,
including satellite and aerial
photos
Spatiotemporal data
Two main representations:
●Spatial and temporal attributes are contextual
− Example: sea surface temperature
●Temporal attribute is contextual, spatial attribute
is behavioral
− Example: trajectories
Example: trajectory data aggregation

Bonchi, F., Castillo, C., Donato, D., & Gionis, A. (2009). Taxonomy-driven lumping for sequence mining.
Data Mining and Knowledge Discovery, 19(2), 227-244.
Example: Functional regions in cities

H. Assem, B. Caglayan, T.S. Buda, D. O’Sullivan. ST-DenNetFus: Deep Spatio-Temporal Dense Networks for Network Demand Prediction.
The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2018
Problem types
Data mining methods try to find
relationships
●Between columns
− Find associations, correlations, …
− If there is one key column: classification, prediction, ...
●Between rows
− Find clusters
− Detect outliers
Example: Association pattern mining
● Sparse binary databases representing, e.g., items a person is interested in
● The relative frequency of a pattern is its support
https://cs.nju.edu.cn/zlj/Course/DM_15.html
Association pattern mining (cont.)
●Given a binary n × d data matrix D,
− determine all subsets of columns such that all the
values in these columns take on the value True for at
least a fraction min_support of the rows in the matrix.
●The relative frequency of a pattern is referred to
as its support
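The definition above can be turned into a brute-force sketch: enumerate every subset of columns and keep those whose support reaches min_support. The toy transactions and function names are assumptions for illustration. This enumeration is exponential in the number of columns, so it only works for tiny examples; practical algorithms such as Apriori prune the search instead.

```python
from itertools import combinations

# Hypothetical toy data: each row is the set of columns whose value is True.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Eggs", "Milk"},
    {"Eggs", "Milk"},
    {"Bread", "Eggs"},
    {"Bread", "Eggs", "Milk"},
]


def support(itemset, transactions):
    """Fraction of rows in which every item of `itemset` is True."""
    hits = sum(1 for row in transactions if itemset <= row)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(set(candidate), transactions)
            if s >= min_support:
                result[candidate] = s
    return result


frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in frequent.items():
    print(itemset, s)
```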
Association pattern mining (cont.)
●The confidence of a rule A→B is
− support(A ∪ B) / support(A)
●Example:
− { Chips, Olives } → { Beer }
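A minimal sketch of the confidence formula, applied to a small set of made-up baskets. The data and function names are assumptions for illustration, and the rule's antecedent is assumed to occur in at least one basket, since the division fails otherwise.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """confidence(A -> B) = support(A ∪ B) / support(A).

    Assumes support(A) > 0.
    """
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


# Hypothetical baskets.
baskets = [
    {"Chips", "Olives", "Beer"},
    {"Chips", "Olives"},
    {"Chips", "Beer"},
    {"Olives", "Beer"},
    {"Chips", "Olives", "Beer"},
    {"Chips", "Olives", "Beer"},
]

conf = confidence({"Chips", "Olives"}, {"Beer"}, baskets)
print(conf)
```

Here four baskets contain both Chips and Olives, and three of those also contain Beer, so the rule holds in three out of four cases where it applies.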
Exercise
The confidence of a rule A→B is
support(A ∪ B) / support(A)
Suppose
10 people buy only Chips and Beer
20 people buy only Chips and Olives
30 people buy only Olives and Beer
40 people buy all three: Chips, Olives, and Beer.
What is the confidence of the rule
{ Chips, Olives } → { Beer } ?
Clustering
●Partition records/rows in a way that
− elements in the same partition are similar
− elements in different partitions are different
●Applications:
− Segmentation, summarization, …
− Sometimes a step in a larger DM algorithm
Image credit: http://www.sthda.com/english/articles/tag/pam-clustering/
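The slides do not commit to a particular clustering algorithm. As one possible illustration, the sketch below uses k-means (Lloyd's algorithm) on two-dimensional points, with Euclidean distance as the notion of similarity. The data, the hand-picked initial centroids, and the fixed number of iterations are all assumptions made for this example.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visually separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Note that the number of clusters and the distance function are inputs here: the algorithm does not decide them for you.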
Clustering is not easy
●What does it mean to be similar?
●How many sets?
●Can a record/row belong to more than one set?
●Can a record/row belong to no set at all? ...

Image credit: sthda.com

Outlier detection
●Given a database, find
records/rows that are different
from the rest of the database
●Applications:
− Intrusion detection, credit card fraud, interesting
sensor events, medical diagnosis, ...

Image credits: https://www.kdnuggets.com/2017/01/3-methods-deal-outliers.html
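One very simple way to make "different from the rest" concrete for a single numerical attribute is a z-score rule: flag values that lie more than a few standard deviations from the mean. This is only one of many possible definitions. The readings and the threshold of 2.0 are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance to the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
outliers = zscore_outliers(readings)
print(outliers)
```

The outlier itself inflates the mean and the standard deviation, which is one reason the choice of threshold is not obvious.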

Outlier detection is not easy
●How different should they be?
●How many can be different?
●What does it mean to be different?
●What should we do with outliers?

Image credits: https://www.kdnuggets.com/2017/01/3-methods-deal-outliers.html

One of my favorite outliers: August Landmesser

August Landmesser in 1936

Data classification
●Sometimes data has a feature known as a
class label
●A model can learn from previous data to
associate a record/row to a class label
●One of the most useful tools in your belt!
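To make "learning from previous data" concrete, here is a minimal sketch of a nearest-centroid classifier: it learns one mean vector per class label and assigns a new record to the label of the closest one. This method is chosen only for brevity and is not one the slides prescribe. The training data and function names are assumptions for illustration.

```python
import math
from collections import defaultdict


def fit_centroids(records, labels):
    """Learn one centroid (mean vector) per class label from labeled records."""
    groups = defaultdict(list)
    for record, label in zip(records, labels):
        groups[label].append(record)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in groups.items()
    }


def predict(centroids, record):
    """Assign the record to the class label with the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(record, centroids[label]))


# Hypothetical training data: (weight in kg, height in cm) with a class label.
records = [(4.0, 25.0), (5.0, 23.0), (30.0, 60.0), (28.0, 55.0)]
labels = ["cat", "cat", "dog", "dog"]

model = fit_centroids(records, labels)
prediction = predict(model, (6.0, 30.0))
print(prediction)
```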
Tasks with complex data types
●Frequent temporal patterns
●Time series motifs
●Graph motifs
●Trajectory clusters
●Collective classification
●...
Data types x Prototypical problems

Data Mining, The Textbook (2015) by Charu Aggarwal

Example scenarios
Example scenario 1
●Organize products to maximize co-purchases of items frequently bought together
− Input data: baskets
− Output: similar pairs
− Algorithm: frequent pattern mining
Example scenario 2
●Recommend media to users (movies,
series, music, books, podcasts, …)
− Input data: viewing history
− Output: recommendations
− Simple algorithm: k nearest neighbors
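A minimal sketch of a k-nearest-neighbors recommender for this scenario: find the k users whose viewing history is most similar to the target user's, then suggest what they watched that the target user has not. The users, the histories, the Jaccard similarity, and k=2 are all assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical viewing history: user -> set of titles watched.
history = {
    "ana": {"Alien", "Blade Runner", "Dune"},
    "ben": {"Alien", "Blade Runner", "Arrival"},
    "carla": {"Amelie", "Notting Hill"},
    "dan": {"Alien", "Dune", "Arrival", "Solaris"},
}


def jaccard(a, b):
    """Similarity of two sets: size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0


def recommend(user, history, k=2):
    """Recommend unseen items, ranked by how many of the k nearest neighbors watched them."""
    seen = history[user]
    neighbors = sorted(
        (other for other in history if other != user),
        key=lambda other: jaccard(seen, history[other]),
        reverse=True,
    )[:k]
    votes = Counter(item for n in neighbors for item in history[n] - seen)
    return [item for item, _ in votes.most_common()]


recs = recommend("ana", history)
print(recs)
```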
Example scenario 3
●Help diagnose if an electrocardiogram is associated with a health problem
− Input data: time series, possibly multi-dimensional
− Output: binary label or risk score
− Algorithms: outlier detection or classification
Example scenario 4
●Help a sysadmin determine if an intruder is trying to access or has accessed the network
− Input data: time series of event records
− Output: binary label or risk score
− Algorithms: event detection
Exercise
Which ones are data mining tasks?
A) Dividing the customers of a company by postal code
B) Finding credit card scammers among customers of a
company
C) Computing the total sales of a company
D) Sorting a student database by student identification
number
E) Predicting the future stock price of a company using past
records
F) Determining when a complex machine needs to be repaired
G) Extracting the frequencies of a sound wave
Major challenges
Methodological challenges
●Mining high-dimensional data
●Handling uncertainty, noise, incompleteness, ...
●Mining data from a domain in which you do not
have expertise, or worse, in which you believe
you have expertise
− Conclusions are often worthless if you do not talk
with domain experts
User interaction challenges
●Users should ask questions that matter to them
●Performing interactive mining
●Presenting and visualizing data mining results
Efficiency and scalability
●Even with polynomial-time algorithms, a process can
○become unreasonably slow, or
○require an unreasonable amount of space
●Streaming and/or distributed mining algorithms
can help to some extent
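As a small example of the streaming idea, the sketch below computes the mean and variance of a sequence in a single pass with constant memory, using Welford's online algorithm. It illustrates the style of computation only; it is not an algorithm covered in these slides. The class name and the sample values are assumptions.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0       # number of values seen so far
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Incorporate one new value without storing the previous ones."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical stream
    stats.update(x)
print(stats.mean, stats.variance)
```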
Diversity of database types
●Real databases
○are high dimensional, and
○involve a mixture of various data types
●Sometimes you need to integrate data from
○dynamic, or
○distributed data sources
Data mining can be harmful
●Social impacts of data mining
− Who wins? And more importantly, who loses?
●Privacy-preserving data mining
− Avoid invisible, pervasive, invasive data mining
Summary
Things to remember
●Types of data
●Types of data mining methods
●Prototypical data mining scenarios
●Typical challenges of data mining
Exercises for this topic
●Section 1.9 of Data Mining, The Textbook
(2015) by Charu Aggarwal
● Exercises 1.7 of Introduction to Data Mining,
Second Edition (2019) by Tan et al.
