Class 1a-DataCollection

The document provides an overview of data mining and knowledge discovery, highlighting its purpose of extracting useful knowledge from large datasets. It discusses the multidisciplinary nature of data mining, key definitions, and the life-cycle of data mining projects, including motivations and critical dilemmas. Additionally, it outlines various tasks and methods in data mining, as well as examples of discovered rules and open-source software tools for data mining.

Uploaded by

eltcarva
Copyright: © All Rights Reserved
Prof. Heitor Silvério Lopes
Prof. Thiago H. Silva

Data Mining & Knowledge Discovery
Class 1a – Introduction & Overview
2025
Data mining → Knowledge discovery
The purpose of data mining (D.M.) is to find new, useful, and relevant knowledge hidden in large amounts of data.
The Multidisciplinarity of Data Mining
● Data mining uses concepts and methods from many areas:
○ Machine Learning
○ Databases
○ Computational Intelligence (evolutionary computation, neural networks, fuzzy systems)
○ Mathematics / Statistics
○ Programming languages
Data x Information x Knowledge
● Data:
○ Instances (objects, people, timestamps, etc.)
○ Describe individual, not collective, properties, and they are:
■ Easy to collect
■ Available in large amounts and forms
■ Of little use for predictions or decision-making
● Information:
○ Classes (groups) of instances
○ Describe generic patterns, structures, principles, etc., and they are:
■ Hard to obtain
■ Scarce
■ Useful for generalizations and predictions
● Knowledge:
○ Concerns the comprehension of something (including facts, abilities, and information)
○ Obtained by means of human perception or learning

“We are drowning in information, but starving for knowledge.” John Naisbitt (1982)
Data x Information x Knowledge
[Figure: hierarchy with Data at the bottom, Information in the middle, and Knowledge at the top, along an axis labeled "complexity"]
Some important definitions of Data Mining
● Automatic/semi-automatic discovery of structural patterns in data (Witten et al., 2000)
● Extraction of structured knowledge which is useful, previously unknown, non-trivial, and humanly comprehensible, from large amounts of data (Fayyad et al., 1996)
● Desirable features of discovered knowledge:
○ Correctness
○ Generality
○ Utility
○ Comprehensibility
○ Novelty
Examples of rules discovered using data mining
● Case 1: consider a dataset of patient records from a maternity hospital. A data-mining procedure found this rule:
IF (patient.age > 15) AND (patient.age < 50) AND (sector = “surgical clinic”) AND (surgery.type = “cesarean”) THEN (patient.sex = “female”)
○ Correctness: ☺
○ Generality: ☺
○ Utility: ☹
○ Comprehensibility: ☺
○ Novelty: ☹

● Case 2: consider a dataset of pediatric oncological medical records*. A data-mining procedure found this rule:
IF (histology.type = “carcinoma”) AND (patient.age < 3) AND (oncological.stage = 1) AND (metastasis = “no”) THEN (years.survival > 5)
○ Correctness: ☺
○ Generality: ☺
○ Utility: ☺ ☺
○ Comprehensibility: ☺
○ Novelty: ☺ ☺ ☺

* Bojarczuk, C.C., Lopes, H.S., Freitas, A.A. A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets. Artificial Intelligence in Medicine, v. 30, n. 1, p. 27-48, 2004.
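An IF-THEN rule such as the one in Case 1 can be checked against a dataset by counting how many records it covers and how often its conclusion holds. The sketch below is only an illustration: the `Record` fields, the toy records, and the use of coverage and confidence as rough stand-ins for generality and correctness are assumptions, not part of the slides or of the cited paper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical fields mirroring the attributes named in the Case 1 rule.
    age: int
    sector: str
    surgery_type: str
    sex: str


def antecedent(r: Record) -> bool:
    """IF part of the Case 1 rule."""
    return (15 < r.age < 50
            and r.sector == "surgical clinic"
            and r.surgery_type == "cesarean")


def consequent(r: Record) -> bool:
    """THEN part of the Case 1 rule."""
    return r.sex == "female"


def rule_quality(records):
    """Return (coverage, confidence) of the rule on the given records."""
    covered = [r for r in records if antecedent(r)]
    correct = [r for r in covered if consequent(r)]
    coverage = len(covered) / len(records) if records else 0.0
    confidence = len(correct) / len(covered) if covered else 0.0
    return coverage, confidence


# Made-up toy records, for illustration only.
records = [
    Record(28, "surgical clinic", "cesarean", "female"),
    Record(34, "surgical clinic", "cesarean", "female"),
    Record(41, "surgical clinic", "appendectomy", "female"),
    Record(0, "neonatal", "none", "male"),
]

coverage, confidence = rule_quality(records)
print(f"coverage={coverage:.2f} confidence={confidence:.2f}")
```

A rule can score well on both numbers and still be useless: the Case 1 rule is always right, but it tells the hospital nothing it did not already know.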
Life-cycle of Data Mining projects
[Figure: project pipeline]
Raw data / data warehouse → Pre-processing (hard work!) → Filtered/cleaned data → Pattern discovery → Pattern analysis and interpretation → Knowledge!
● Pre-processing: collection, formatting, selection, data cleaning, data integration, reduction
● Pattern discovery: data mining methods
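The stages of the life-cycle can be sketched as three small functions chained together. This is a minimal illustration under stated assumptions: the function names, the record layout (`customer`, `item`), the toy rows, and the choice of simple frequency counting as the "pattern discovery" step are all hypothetical and stand in for whatever methods a real project would use.

```python
from collections import Counter


def preprocess(raw_rows):
    """Pre-processing: drop incomplete rows, normalize text, remove duplicates."""
    cleaned, seen = [], set()
    for row in raw_rows:
        if row.get("customer") is None or row.get("item") is None:
            continue  # data cleaning: skip rows with missing values
        key = (row["customer"].strip().lower(), row["item"].strip().lower())
        if key in seen:
            continue  # reduction: skip duplicates after normalization
        seen.add(key)
        cleaned.append({"customer": key[0], "item": key[1]})
    return cleaned


def discover_patterns(cleaned_rows):
    """Pattern discovery: here, simply count how often each item occurs."""
    return Counter(row["item"] for row in cleaned_rows)


def interpret(patterns, min_count=2):
    """Pattern analysis: keep only items seen at least min_count times."""
    return [item for item, count in patterns.most_common() if count >= min_count]


# Made-up raw data, for illustration only.
raw = [
    {"customer": "Ana", "item": "Bread"},
    {"customer": "ana ", "item": "bread"},  # duplicate once normalized
    {"customer": "Bo", "item": "bread"},
    {"customer": "Bo", "item": None},       # incomplete row
    {"customer": "Cy", "item": "Milk"},
]

cleaned = preprocess(raw)
patterns = discover_patterns(cleaned)
frequent_items = interpret(patterns)
print(cleaned, patterns, frequent_items)
```

Even in this toy version most of the code sits in `preprocess`, which matches the "hard work" label on the pre-processing stage.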
Motivations for Data Mining
1) VERY LARGE amount of data freely available on the internet
○ E-mails and social networks
○ Business and bank transactions
○ Web page searches (web scraping!)
○ Medical and biological data
○ Scientific and astronomical data
2) Business/commercial interest ($$$)
Critical Dilemma in Data Mining
● The amount of data generated, created, stored, etc., grows exponentially
● The ability to mine, understand, and effectively use these data grows linearly (best case!)
● Data mining may help us to understand large amounts of data by extracting useful knowledge
* https://explodingtopics.com/blog/data-generated-per-day
Tasks x Methods in Data Mining
Tasks | Methods
Classification | Decision trees (C4.5), classification rules, k-nearest neighbors, random forest, support vector machine, Bayesian classifier, neural network, AdaBoost
Association rules | Apriori, FP-growth, Eclat, Zigzag
Regression | Linear regression, polynomial regression, logistic regression
Feature selection & dimensionality reduction | Principal component analysis (PCA), chi-square, entropy, information gain
Clustering | K-means, Kohonen’s self-organizing map, density-based scan (DBSCAN), hierarchical grouping, t-SNE
Data visualization * | Silhouette plot, scatter plot, heatmap, box plot, clusters, t-SNE
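To make the task/method distinction concrete, the sketch below applies one method from the table, k-nearest neighbors, to the classification task. It is a minimal plain-Python illustration, not a reference implementation: the function name `knn_predict`, the Euclidean distance, the majority vote, and the toy points are assumptions chosen for brevity.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up two-dimensional points in two well-separated classes.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.3), "B"),
]

prediction_a = knn_predict(train, (1.1, 1.0))
prediction_b = knn_predict(train, (5.1, 5.1))
print(prediction_a, prediction_b)
```

The same task could be handed to any other method in the Classification row; only `knn_predict` would change, not the data or the question being asked.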
Tasks x Methods in Data Mining
● Types of data:
○ Numerical
○ Categorical
○ Text
○ Image/video
○ Time-series/signals

● Some data types require different tasks, for instance:
○ Images and time-series/signals can be clustered or classified
○ Text can be classified, but may require other specific tasks (e.g., sentiment analysis)
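As a rough idea of what a text-specific task looks like, the sketch below scores sentiment by counting words from two small word lists. Everything here is an assumption made for illustration: the word lists, the function name `sentiment`, and the counting scheme are hypothetical, and practical sentiment analysis typically relies on trained models rather than fixed lexicons.

```python
import re

# Tiny hypothetical lexicons; a real system would use far larger resources.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


label_1 = sentiment("Great course, I love the examples!")
label_2 = sentiment("Poor documentation and a terrible interface.")
label_3 = sentiment("The class starts at nine.")
print(label_1, label_2, label_3)
```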
Some open-source software for Data Mining
● Orange (Python): developed and maintained by the University of Ljubljana (Slovenia)
https://orangedatamining.com/
○ Easy-to-use graphical interface (visual programming), add-ons for specific tasks, allows integration with Python code.

● Weka (Java): created and maintained by the University of Waikato (New Zealand)
https://www.cs.waikato.ac.nz/ml/weka
○ Very large library of methods, community support
○ Not-so-user-friendly interface, poor documentation

● KNIME (Java): originally developed at the University of Konstanz (Germany) and now maintained by KNIME AG
https://www.knime.com/

● Further information: https://www.datamation.com/big-data/open-source-data-mining-tools/
