Class 1a-DataCollection

The document provides an overview of data mining and knowledge discovery, highlighting its purpose of extracting useful knowledge from large datasets. It discusses the multidisciplinary nature of data mining, key definitions, and the life-cycle of data mining projects, including motivations and critical dilemmas. Additionally, it outlines various tasks and methods in data mining, as well as examples of discovered rules and open-source software tools for data mining.

Prof. Heitor Silvério Lopes
Prof. Thiago H. Silva

Data Mining & Knowledge Discovery
Class 1a – Introduction & Overview
2025
Data mining → Knowledge discovery
The purpose of data mining (D.M.) is to find new, useful, and relevant knowledge hidden in large amounts of data.
The Multidisciplinarity of Data Mining
● Data mining uses concepts and methods from many areas:
○ Machine Learning
○ Databases
○ Computational Intelligence (evolutionary computation, neural networks, fuzzy systems)
○ Mathematics / Statistics
○ Programming languages
Data x Information x Knowledge
● Data:
○ Instances (objects, people, timestamps, etc.)
○ Describe individual, not collective, properties, and they are:
■ Easy to collect
■ Available in large amounts and many forms
■ Of little use for predictions or decision-making
● Information:
○ Classes (groups) of instances
○ Describes generic patterns, structures, principles, etc.
■ Hard to obtain
■ Not abundant
■ Allows generalizations and predictions
● Knowledge:
○ Regards the comprehension of something (including facts, abilities, and information)
○ Obtained by means of human perception or learning

"We are drowning in information, but starving for knowledge." John Naisbitt (1982)
Data x Information x Knowledge
[Figure: Data, Information, and Knowledge arranged by increasing complexity]
Some important definitions of Data Mining
● Automatic/semi-automatic discovery of structural patterns in data (Witten et al., 2000)

● Extraction of structured knowledge which is useful, previously unknown, non-trivial, humanly comprehensible, from large amounts of data (Fayyad et al., 1996)

● Desirable features of discovered knowledge:

○ Correctness
○ Generality
○ Utility
○ Comprehensibility
○ Novelty
Examples of rules discovered using data mining
● Case 1: consider a dataset of patient records from a maternity hospital.
A data-mining procedure found this rule:
IF (patient.age > 15) AND (patient.age < 50) AND (sector = “surgical clinic”) AND (surgery.type = “cesarean”) THEN (patient.sex = “female”)
Correctness ☺ | Generality ☺ | Utility ☹ | Comprehensibility ☺ | Novelty ☹

● Case 2: consider a dataset of pediatric oncological medical records*.
A data-mining procedure found this rule:
IF (histology.type = carcinoma) AND (patient.age < 3) AND (oncological.stage = 1) AND (metastasis = “no”) THEN (years.survival > 5)
Correctness ☺ | Generality ☺ | Utility ☺ ☺ | Comprehensibility ☺ | Novelty ☺ ☺ ☺

* Bojarczuk, C.C., Lopes, H.S., Freitas, A.A. A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets. Artificial Intelligence in Medicine, v. 30, n. 1, p. 27-48, 2004.
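To make the Case 1 rule concrete, the sketch below checks an IF-THEN rule against a handful of records. The records are invented for illustration, and support and confidence are common rule-quality measures that the slides do not name; treat them here as one possible way to quantify "Correctness". The names `Patient` and `evaluate_rule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    age: int
    sector: str
    surgery_type: str
    sex: str


def antecedent(p):
    # IF part of the Case 1 rule
    return 15 < p.age < 50 and p.sector == "surgical clinic" and p.surgery_type == "cesarean"


def consequent(p):
    # THEN part of the Case 1 rule
    return p.sex == "female"


def evaluate_rule(records, antecedent, consequent):
    """Return (support, confidence) of the rule on the given records."""
    covered = [r for r in records if antecedent(r)]
    correct = [r for r in covered if consequent(r)]
    support = len(correct) / len(records) if records else 0.0
    confidence = len(correct) / len(covered) if covered else 0.0
    return support, confidence


# Invented toy records, not taken from any real dataset
records = [
    Patient(28, "surgical clinic", "cesarean", "female"),
    Patient(34, "surgical clinic", "cesarean", "female"),
    Patient(41, "surgical clinic", "appendectomy", "male"),
    Patient(22, "obstetrics", "none", "female"),
]

support, confidence = evaluate_rule(records, antecedent, consequent)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A confidence of 1.0 on this toy data only says the rule is never wrong when it fires. It says nothing about utility or novelty, which is exactly why the Case 1 rule scores poorly on those two features.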
Life-cycle of Data Mining projects (hard work!)
● Raw data
● Pre-processing: collection, formatting, selection, data cleaning, data integration, reduction
● Data warehouse (filtered/cleaned data)
● Pattern discovery: data mining methods
● Pattern analysis and interpretation
● Knowledge!
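The life-cycle stages can be sketched as three small functions chained together. This is a minimal illustration under stated assumptions, not the course's method: the shopping-basket data is invented, pattern discovery is reduced to counting co-occurring item pairs, and the function names and the `min_support` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def preprocess(raw_rows):
    """Pre-processing: normalise formatting, drop empty values and empty rows."""
    cleaned = []
    for row in raw_rows:
        items = {item.strip().lower() for item in row if item and item.strip()}
        if items:
            cleaned.append(frozenset(items))
    return cleaned


def discover_patterns(transactions):
    """Pattern discovery: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    return counts


def interpret(pair_counts, n_transactions, min_support=0.5):
    """Pattern analysis: keep only pairs frequent enough to deserve a human look."""
    return {
        pair: count / n_transactions
        for pair, count in pair_counts.items()
        if count / n_transactions >= min_support
    }


# Invented raw data with typical defects: mixed case, stray spaces, blanks, duplicates
raw = [
    ["Bread", " milk ", ""],
    ["bread", "Milk", "eggs"],
    [None, ""],
    ["eggs", "bread"],
    ["milk", "bread", "bread"],
]

clean = preprocess(raw)
patterns = interpret(discover_patterns(clean), len(clean))
print(patterns)
```

Even in this toy version, most of the code deals with cleaning the input rather than mining it, which matches the "hard work" remark on the slide.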
Motivations for Data Mining
1) VERY LARGE amount of data freely available on the internet
○ E-mails and social networks
○ Business and bank transactions
○ Web page searches (web scraping!)
○ Medical and biological data
○ Scientific and astronomical data
Motivations for Data Mining
2) Business/commercial interest ($$$)
Critical Dilemma in Data Mining
● The amount of data generated, created, stored, etc., grows exponentially
● The ability to mine, understand, and effectively use these data grows linearly (best case!)
● Data mining may help us to understand large amounts of data by extracting useful knowledge
* https://explodingtopics.com/blog/data-generated-per-day
Tasks x Methods in Data Mining
Tasks | Methods
Classification | Decision trees (C4.5), classification rules, k-nearest neighbors, random forest, support vector machine, Bayesian classifier, neural network, AdaBoost
Association rules | Apriori, FP-growth, Eclat, Zigzag
Regression | Linear regression, polynomial regression, logistic regression
Feature selection & dimensionality reduction | Principal component analysis (PCA), chi-square, entropy, information gain
Clustering | K-means, Kohonen’s self-organizing map, density-based clustering (DBSCAN), hierarchical grouping, t-SNE
Data visualization * | Silhouette plot, scatter plot, heatmap, box plot, clusters, t-SNE
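As one concrete instance of the classification row, here is a minimal k-nearest-neighbors sketch in plain Python. The two-dimensional points and the labels "A" and "B" are invented, `knn_predict` is a hypothetical helper, and Euclidean distance with majority voting is one common choice among several.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    # Sort labelled points by Euclidean distance to the query and keep the k nearest
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Invented toy training set: (features, label)
train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.3), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
    ((4.8, 5.3), "B"),
]

pred_a = knn_predict(train, (1.1, 1.0))
pred_b = knn_predict(train, (5.1, 5.1))
print(pred_a, pred_b)
```

The method has no training phase: all the work happens at query time, so prediction cost grows with the size of the stored dataset.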
Tasks x Methods in Data Mining
● Types of data:
○ Numerical
○ Categorical
○ Text
○ Image/video
○ Time-series/signals

● Some data types require different tasks, for instance:

○ Image, time-series/signals can be clustered or classified
○ Text can be classified, but may require other specific tasks (e.g. sentiment analysis)
Some open-source software for Data Mining
● Orange (Python): developed and maintained by the University of Ljubljana (Slovenia)
https://orangedatamining.com/
○ Easy-to-use windows interface (visual programming), add-ons for specific tasks, allows integration with Python code.

● Weka (Java): created and maintained by the University of Waikato (New Zealand)
https://www.cs.waikato.ac.nz/ml/weka
○ Very large library of methods, community support
○ Not-so-user-friendly interface, poor documentation

● KNIME (Java): developed and maintained by the University of Konstanz (Germany)
https://www.knime.com/

● Further information: https://www.datamation.com/big-data/open-source-data-mining-tools/
