
Data Mining Fundamentals

Data mining involves the non-trivial extraction of implicit, previously unknown, and potentially useful information from large datasets. It uses algorithms to automatically analyze operational data, documents, experiment results, and more stored in databases. As more data is collected but analysts have limited time to examine it all, data mining aims to reveal hidden patterns and insights within existing information.

Uploaded by

Shah Saima

Data mining fundamentals
Elena Baralis
DataBase and Data Mining Group of Politecnico di Torino

Data analysis

- Most companies own huge databases containing
  - operational data
  - textual documents
  - experiment results
- These databases are a potential source of useful information

Data analysis

- Information is hidden in huge datasets
  - not immediately evident
  - human analysts need a large amount of time for the analysis
  - most data is never analyzed at all

[Figure: The Data Gap. Disk space (TB) since 1995 compared with the number of analysts, 1995 to 1999. From R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications]

Data mining

- Non-trivial extraction of implicit, previously unknown, and potentially useful information from available data
- Extraction is automatic
  - performed by appropriate algorithms
- Extracted information is represented by means of abstract models
  - denoted as patterns
Example: profiling

- Consumer behavior in e-commerce sites
  - Selected products, requested information,
- Search engines and portals
  - Query keywords, searched topics and objects
- Social network data
  - Facebook, google+ profiles
  - Dynamic data: posts on blogs, FB, tweets
- Maps and georeferenced data
  - Localization, interesting locations for users

Example: profiling

- User/service profiling
  - Recommendation systems
  - Advertisements
- Market basket analysis
  - Correlated objects for cross selling
  - User registration, fidelity cards
- Context-aware data analysis
  - Integration of different dimensions
  - E.g., location, time of the day, user interest
- Text mining
  - Brand reputation, sentiment analysis, topic trends


Example: biological data

- Microarray
  - expression level of genes in a cellular tissue
  - various types (mRNA, DNA)
- Patient clinical records
  - personal and demographic data
  - exam results
- Textual data in public collections
  - heterogeneous formats, different objectives
  - scientific literature (PubMed)
  - ontologies (Gene Ontology)

[Figure: excerpt of a microarray expression table. Rows are genes (ISG20, TNFSF13, LOC93343, ITGA4), columns are patient samples identified by CLID and PATIENT ID, and cells hold expression values.]

Biological analysis objectives

- Clinical analysis
  - detecting the causes of a pathology
  - monitoring the effect of a therapy
  - diagnosis improvement and definition of new specific therapies
- Bio-discovery
  - gene network discovery
  - analysis of multifactorial genetic pathologies
- Pharmacogenesis
  - lab design of new drugs for gene therapies

Knowledge Discovery Process

[Figure: the KDD process. data → selection → selected data → preprocessing → preprocessed data → transformation → transformed data → data mining → pattern → interpretation → knowledge]

KDD = Knowledge Discovery from Data

Preprocessing

- data cleaning
  - reduces the effect of noise
  - identifies or removes outliers
  - solves inconsistencies
- data integration
  - reconciles data extracted from different sources
  - integrates metadata
  - identifies and solves data value conflicts
  - manages redundancy
- Real world data is dirty
- Without good quality data, no good quality pattern
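The cleaning steps listed under Preprocessing can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the slides: the function `clean`, the record fields `sensor` and `temp`, the sample readings, and the choice of a median/MAD rule with threshold `k` for flagging outliers. Other outlier criteria would serve equally well.

```python
from statistics import median


def clean(records, key, k=3.0):
    """Remove exact duplicate records, then split the rest into
    (kept, outliers) using a median / MAD rule on the field `key`.

    The MAD rule and the default k=3.0 are illustrative assumptions.
    """
    # Redundancy management: drop exact duplicates, preserving order.
    seen, unique = set(), []
    for record in records:
        fingerprint = tuple(sorted(record.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)

    # Outlier identification: distance from the median, measured in
    # units of the median absolute deviation (MAD).
    values = [record[key] for record in unique]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values (or most of them) coincide: flag nothing.
        return unique, []

    kept = [r for r in unique if abs(r[key] - med) <= k * mad]
    outliers = [r for r in unique if abs(r[key] - med) > k * mad]
    return kept, outliers


# Hypothetical sensor readings: one duplicate and one implausible value.
readings = [
    {"sensor": "a", "temp": 20.1},
    {"sensor": "b", "temp": 19.8},
    {"sensor": "a", "temp": 20.1},
    {"sensor": "c", "temp": 20.4},
    {"sensor": "d", "temp": 98.6},
    {"sensor": "e", "temp": 19.9},
]
kept, outliers = clean(readings, "temp")
print(len(kept), [r["sensor"] for r in outliers])
```

Whether a flagged record is removed or only marked for review is a design choice; the sketch returns both lists so that the decision stays with the analyst.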

Data mining origins

- Draws from
  - statistics, artificial intelligence (AI)
  - pattern recognition, machine learning
  - database systems
- Traditional techniques are not appropriate because of
  - significant data volume
  - large data dimensionality
  - heterogeneous and distributed nature of data

[Figure: Data Mining at the intersection of Statistics and AI, Machine Learning and Pattern Recognition, and Database systems. From: P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining]

Analysis techniques

- Descriptive methods
  - Extract interpretable models describing data
  - Example: client segmentation
- Predictive methods
  - Exploit some known variables to predict unknown or future values of (other) variables
  - Example: spam email detection


Classification

- Objectives
  - prediction of a class label
  - definition of an interpretable model of a given phenomenon

[Figure: training data → model; unclassified data → model → classified data]

Classification

- Approaches
  - decision trees
  - bayesian classification
  - classification rules
  - neural networks
  - k-nearest neighbours
  - SVM

Classification

- Requirements
  - accuracy
  - interpretability
  - scalability
  - noise and outlier management

Classification

- Applications
  - detection of customer propensity to leave a company (churn or attrition)
  - fraud detection
  - classification of different pathology types
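As an illustration of the flow from training data to classified data, the following is a minimal sketch of k-nearest neighbours, one of the approaches listed above, applied to the churn scenario. The function name `knn_predict`, the two features (support calls per month, months since last purchase), and the toy training set are hypothetical and not taken from the slides.

```python
from collections import Counter
from math import dist


def knn_predict(training, query, k=3):
    """Return the majority class label among the k training examples
    closest to `query` (Euclidean distance).

    `training` is a list of (feature_tuple, label) pairs.
    """
    neighbours = sorted(training, key=lambda example: dist(example[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled customers:
# (support calls per month, months since last purchase) -> label
training = [
    ((1, 1), "stay"),
    ((0, 2), "stay"),
    ((2, 1), "stay"),
    ((6, 8), "churn"),
    ((7, 9), "churn"),
    ((5, 7), "churn"),
]

print(knn_predict(training, (6, 7)))  # an unclassified customer
print(knn_predict(training, (1, 2)))
```

k-nearest neighbours keeps the whole training set as its "model", so it scores poorly on the scalability requirement; a decision tree would trade some of that cost for a more interpretable model.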

Clustering

- Objectives
  - detecting groups of similar data objects
  - identifying exceptions and outliers

Clustering

- Approaches
  - partitional (K-means)
  - hierarchical
  - density-based (DBSCAN)
  - SOM
- Requirements
  - scalability
  - management of
    - noise and outliers
    - large dimensionality
  - interpretability
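The partitional approach can be sketched with a small K-means implementation (Lloyd's algorithm). This is a simplified illustration under stated assumptions: the function name `kmeans` and the sample points are hypothetical, and the initial centroids are simply the first k points, whereas practical implementations usually pick them at random or with a seeding heuristic.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Partition `points` into k groups with Lloyd's algorithm.

    Assumption for reproducibility: the first k points serve as the
    initial centroids. Returns (centroids, labels), where labels[i]
    is the index of the cluster assigned to points[i].
    """
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place

        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids

    labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return centroids, labels


# Hypothetical two-dimensional data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

K-means assigns every point to some cluster, so it does not identify outliers by itself; density-based methods such as DBSCAN address that requirement more directly.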


Clustering

- Applications
  - customer segmentation
  - clustering of documents containing similar information
  - grouping genes with similar expression pattern

Association rules

- Objective
  - extraction of frequent correlations or patterns from a transactional database

- Tickets at a supermarket counter

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diapers, Milk
  4    Beer, Bread, Diapers, Milk
  5    Coke, Diapers, Milk

- Association rule: diapers → beer
  - 2% of transactions contain both items
  - 30% of transactions containing diapers also contain beer
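The two percentages attached to the rule diapers → beer are commonly called support and confidence. The sketch below computes both for the five supermarket tickets shown above. The function names `support` and `confidence` are illustrative choices. The 2% and 30% quoted on the slide presumably refer to a larger database, since the five tickets alone give different values.

```python
# The five tickets from the supermarket example, one set of items each.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Coke", "Diapers", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(body, head, transactions):
    """Among transactions containing `body`, the fraction that also
    contain `head`. Assumes `body` occurs in at least one transaction."""
    both = set(body) | set(head)
    return support(both, transactions) / support(body, transactions)


# Rule: diapers -> beer
print(support({"Diapers", "Beer"}, transactions))       # 2 of 5 tickets
print(confidence({"Diapers"}, {"Beer"}, transactions))  # 2 of the 3 tickets with diapers
```

Extracting all rules above given support and confidence thresholds requires enumerating candidate itemsets efficiently, which this sketch does not attempt.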

Association rules

- Applications
  - market basket analysis
  - cross-selling
  - shop layout or catalogue design

Other data mining techniques

- Sequence mining
  - ordering criteria on analyzed data are taken into account
  - example: motif detection in proteins
- Time series and geospatial data
  - temporal and spatial information are considered
  - example: sensor network data
- Regression
  - prediction of a continuous value
  - example: prediction of stock quotes
- Outlier detection
  - example: intrusion detection in network traffic analysis
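Regression, the prediction of a continuous value, can be sketched with an ordinary least-squares fit of a straight line. The function name `fit_line` and the sample values are hypothetical and chosen only to keep the arithmetic easy to follow. They are not real stock quotes, and a single straight line is a deliberately simple model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Assumes at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: day index and observed value.
days = [1, 2, 3, 4]
values = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(days, values)
prediction = slope * 5 + intercept  # predicted value for day 5
print(slope, intercept, prediction)
```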

Open issues

- Scalability to huge data volumes
- Data dimensionality
- Complex data structures, heterogeneous data formats
- Data quality
- Privacy preservation
- Streaming data