Class 1a-DataCollection
[Figure: Data and Information levels arranged along a complexity axis]
Some important definitions of Data Mining
● Automatic or semi-automatic discovery of structural patterns in data (Witten et al., 2000)
* Bojarczuk, C.C., Lopes, H.S., Freitas, A.A. A constrained-syntax genetic programming system for discovering
classification rules: application to medical data sets. Artificial Intelligence in Medicine, v. 30, n. 1, p. 27-48, 2004.
Life-cycle of Data Mining projects (hard work!)
1) Raw data
2) Pre-processing: collection, formatting, selection, data cleaning, data integration, reduction
3) Data warehouse (filtered/cleaned data)
4) Pattern discovery: data mining methods
5) Pattern analysis and interpretation
6) Knowledge!
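The life-cycle stages above (pre-processing, pattern discovery, pattern analysis and interpretation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy purchase records in `RAW_DATA`, the function names, and the choice of counting co-purchased item pairs as the "data mining method". Real projects would use far larger data and more elaborate methods.

```python
from collections import Counter

# Hypothetical raw records: (customer, item) pairs with typical defects.
RAW_DATA = [
    ("ana", "bread"), ("ana", "milk"),
    ("Bob ", "bread"), ("bob", "milk"),   # inconsistent formatting
    ("carl", None),                        # missing value
    ("ana", "bread"),                      # exact duplicate
    ("dora", "bread"), ("dora", "milk"), ("dora", "eggs"),
]


def preprocess(records):
    """Pre-processing: drop missing values, normalize text, remove duplicates."""
    cleaned = set()
    for customer, item in records:
        if customer is None or item is None:
            continue
        cleaned.add((customer.strip().lower(), item.strip().lower()))
    return sorted(cleaned)


def discover_patterns(cleaned):
    """Pattern discovery: count item pairs bought by the same customer."""
    baskets = {}
    for customer, item in cleaned:
        baskets.setdefault(customer, set()).add(item)
    pair_counts = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pair_counts[(first, second)] += 1
    return pair_counts, len(baskets)


def interpret(pair_counts, n_baskets, min_support=0.5):
    """Analysis: keep pairs whose support (fraction of customers) is high enough."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }


cleaned = preprocess(RAW_DATA)
pair_counts, n_baskets = discover_patterns(cleaned)
knowledge = interpret(pair_counts, n_baskets)
print(knowledge)
```

Even in this toy case most of the code deals with cleaning the raw records, which reflects why pre-processing tends to be the laborious part of the life-cycle.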
Motivations for Data Mining
1) VERY LARGE amount of data freely available on the internet
o E-mails and social networks
o Business and bank transactions
o Web page searches (web scraping!)
o Medical and biological data
o Scientific and astronomical data
2) Business/commercial interest ($$$)
Critical Dilemma in Data Mining
● The amount of data generated, created, stored, etc., grows exponentially
● The ability to mine, understand, and effectively use these data grows linearly (in the best case!)
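A small numerical sketch can make the gap between exponential and linear growth concrete. The numbers are purely hypothetical: data volume is assumed to double every year and mining capacity to gain one unit per year, both starting at 1. These rates are illustrative assumptions, not measurements.

```python
def data_volume(year, initial=1.0, growth_rate=2.0):
    """Exponential growth: volume is multiplied by growth_rate each year (assumed)."""
    return initial * growth_rate ** year


def mining_capacity(year, initial=1.0, yearly_gain=1.0):
    """Linear growth: capacity increases by a fixed amount each year (assumed)."""
    return initial + yearly_gain * year


# Compare the two curves every two years over a decade.
for year in range(0, 11, 2):
    volume = data_volume(year)
    capacity = mining_capacity(year)
    print(
        f"year {year:2d}: data={volume:7.0f}  "
        f"capacity={capacity:4.0f}  usable fraction={capacity / volume:.3f}"
    )
```

Under these assumptions the fraction of data that can actually be analysed shrinks every year, whatever the starting point, which is the core of the dilemma.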