Introchimiometrie
Data Mining, Machine Learning, Deep Learning, Chemometrics
Definitions, Common Points and Trends
Introduction
• Concepts like Machine Learning, Data Mining or Artificial Intelligence
have become part of our daily life.
• This is mostly due to the advances made in computation (hardware
and software), the increasing capacity to generate and store all types
of data and, especially, the societal and economic benefits that the
analysis of such data generates.
• Simultaneously, Chemometrics has played an important role since
the late 1970s, analyzing data within natural science (and especially
in Analytical Chemistry).
• It is still difficult to clearly define or differentiate the meaning of Machine
Learning, Data Mining, Artificial Intelligence, Deep Learning and
Chemometrics
Model
THE DATA (X), GI-GO
“IF THE DATA DO NOT CONTAIN ANY INFORMATION RELATED TO WHAT YOU
WANT TO MEASURE AND/OR IF THE REFERENCES ARE NOT RELATED TO WHAT
YOU WANT TO MEASURE, YOU WILL NOT OBTAIN GOOD RESULTS REGARDLESS
OF THE MODEL USED”
• The GI-GO (garbage in, garbage out) truism is, by far, the major cause of
frustration in data analysis.
• We blame the algorithms most of the time, forgetting that the algorithm will not
find the information if it is not in the data
• One of the biggest mistakes we can make is assuming that the model
will give the solution we are looking for.
• On the contrary, the solution must be in the data and its correlation with the
reference values.
• The model is, and will always be, the tool that helps us to find data patterns or
the correlation between the data and the reference (if it exists)
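The GI-GO point can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for illustration (the variable names, the simulated values, the choice of a straight-line least-squares model): the same model is fitted once with a variable that is related to the reference values and once with a variable that is not, and both fits are scored on held-out samples.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line on (x, y)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


rng = random.Random(0)
n = 200  # samples used for fitting; another n are held out for testing

# Simulated (hypothetical) data: `related` carries information about y,
# `unrelated` is pure noise with no link to y.
related = [rng.gauss(0, 1) for _ in range(2 * n)]
y = [2.0 * value + rng.gauss(0, 0.1) for value in related]
unrelated = [rng.gauss(0, 1) for _ in range(2 * n)]


def evaluate(x):
    """Fit on the first n samples, score on the remaining n samples."""
    slope, intercept = fit_line(x[:n], y[:n])
    return r_squared(x[n:], y[n:], slope, intercept)


r2_informative = evaluate(related)
r2_uninformative = evaluate(unrelated)
print(f"R2 with related data:   {r2_informative:.3f}")
print(f"R2 with unrelated data: {r2_uninformative:.3f}")
```

The model is identical in both cases; only the information content of the data changes, and so does the quality of the result.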
The data
• The analytical information that we measure is (or can be seen as)
multivariate.
• Many variables/observations are normally collected.
• We usually want to compare samples assuming that the differences
or similarities between them will be found in the variables (or
groups of variables) that we measure.
• We want to obtain useful information and get rid of the noise.
DATA = INFORMATION + NOISE
“The nature (structure), amount and quality”.
Examples: Spectroscopy
Spectra can be stacked row-wise to form a multivariate dataset, so we obtain a
matrix.
Pre-processing/Normalization: there is DEPENDENCY in the columns.
It can be argued that the data structure is the same (or, at least, very similar).
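As a minimal sketch of the row-wise arrangement described above, the snippet below stacks a few simulated spectra into a matrix and applies one common row-wise normalization, the standard normal variate (SNV). The choice of SNV, the wavelength axis and the simulated band are assumptions for illustration; the slides only mention pre-processing/normalization in general.

```python
import numpy as np

# Hypothetical wavelength axis and a single simulated absorption band.
wavelengths = np.linspace(1000, 2500, 50)
band = np.exp(-0.5 * ((wavelengths - 1700) / 120) ** 2)

# Three simulated spectra that differ only by a scale factor and an offset.
spectra = [scale * band + offset
           for scale, offset in [(1.0, 0.0), (1.5, 0.2), (0.8, -0.1)]]

# Stack the spectra row-wise: one row per sample, one column per wavelength.
X = np.vstack(spectra)


def snv(X):
    """Standard normal variate: centre and scale each row (spectrum)."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std


X_snv = snv(X)
print(X.shape)      # rows = samples, columns = wavelengths
print(X_snv[:, :3])  # first columns after SNV
```

Because neighbouring columns of a spectrum are strongly dependent, the normalization is applied along each row rather than to each column independently.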
Data types in chemistry
THE REFERENCE (Y)
The reference values
• We must be aware that the reference values contain an analytical error and have a
calibration range, together with limits of detection and quantitation, all of which
are closely linked to the data X.
• Sometimes, the reference value error can be neglected if the error of the data X is
larger.
• However, assuming this without verifying it might lead to a
wrong interpretation of the result.
• The normal procedure to verify the error of the reference values is to make repeated
measurements of the same sample and ascertain that the spread between the replicates
(their standard deviation, or the standard deviation of the mean) is within certain
confidence limits.
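The replicate check can be sketched as follows. The replicate values are invented for illustration, and the 95 % confidence level is an assumption; the critical value 2.776 is the two-sided Student t value for 4 degrees of freedom.

```python
import math
import statistics

# Hypothetical replicate reference measurements of the same sample.
replicates = [5.02, 4.98, 5.05, 4.97, 5.01]

n = len(replicates)
mean = statistics.mean(replicates)
sd = statistics.stdev(replicates)   # sample standard deviation (n - 1)
sem = sd / math.sqrt(n)             # standard deviation of the mean

# Two-sided 95 % Student t critical value for n - 1 = 4 degrees of freedom.
t_crit = 2.776
ci_half_width = t_crit * sem

print(f"mean = {mean:.3f}, sd = {sd:.4f}, sem = {sem:.4f}")
print(f"95 % confidence interval: {mean:.3f} +/- {ci_half_width:.3f}")
```

Whether the resulting interval is acceptable depends on the tolerance required by the application, which has to be set by the analyst.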
Use: The information can be used in the data matrix X and/or the
reference matrix Y
Data types in chemistry
Data for which each sample yields one piece of information whose value comes
from a limited number of independent possibilities (levels)
Use: The information can be used in the data matrix X and/or the
reference matrix Y
Coding the variables: independence must be ensured between the different levels
!!! Use as many columns as levels (dummy coding)
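A minimal sketch of dummy coding with as many columns as levels is given below. The function name `dummy_code` and the example variable `origin` are hypothetical and chosen only for illustration.

```python
def dummy_code(values):
    """Return the sorted levels and a 0/1 matrix with one column per level."""
    levels = sorted(set(values))
    matrix = [[1 if value == level else 0 for level in levels]
              for value in values]
    return levels, matrix


# Hypothetical categorical variable measured on four samples.
origin = ["B", "A", "C", "A"]

levels, D = dummy_code(origin)
print(levels)
for row in D:
    print(row)
```

Each row contains exactly one 1, so no ordering or distance between the levels is implied by the coding.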
Data types in chemistry
y = Xb + e
where b is what we normally call the regression vector and e is the vector
(M x 1) containing the residuals.
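Assuming the linear model y = Xb + e implied by these definitions, the regression vector and the residuals can be sketched with an ordinary least-squares fit. The data are simulated, the dimensions M and N and the vector `b_true` are hypothetical, and no intercept term is included to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 30, 3                      # M samples, N variables (hypothetical sizes)
X = rng.normal(size=(M, N))       # simulated data matrix
b_true = np.array([1.0, -2.0, 0.5])

# Simulated reference values as an (M x 1) column vector with small noise.
y = (X @ b_true + rng.normal(scale=0.05, size=M))[:, None]

# Least-squares estimate of the regression vector b (shape N x 1).
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual vector e (shape M x 1): the part of y not explained by Xb.
e = y - X @ b

print(b.ravel())
print(e.shape)
```

For a least-squares solution the residuals are orthogonal to the columns of X, which is a quick way to check the fit.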
Pattern recognition models
We just want to study the patterns (points in common and
trends) in the data X
X = information + noise
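One way to make the split X = information + noise concrete is a principal component decomposition, sketched below. Principal component analysis is not named in the text above and is used here as an assumption; the data are simulated with one underlying pattern, and the number of components k is chosen by hand.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: one underlying pattern (information) plus random noise.
scores_true = rng.normal(size=(40, 1))
loading_true = np.sin(np.linspace(0, np.pi, 25))[None, :]
information = scores_true @ loading_true
noise = rng.normal(scale=0.05, size=(40, 25))
X = information + noise

# Column-centre the data and decompose it with the singular value decomposition.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1                       # number of components kept (chosen by hand here)
T = U[:, :k] * s[:k]        # scores
P = Vt[:k].T                # loadings
X_model = T @ P.T           # structured part described by the model
E = Xc - X_model            # residual part, treated as noise

explained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"variance explained by {k} component(s): {explained:.3f}")
```

The scores T show how the samples relate to each other and the loadings P show which variables carry the pattern.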
SM = Supervised Methods
Defined as a series of methods that learn from data to construct a model
that can make informed decisions based on what is learned.
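As a minimal sketch of learning from data with known references and then making a decision on new samples, the snippet below uses a nearest-centroid classifier. This particular method, the function names and the tiny data set are assumptions chosen for brevity, not a method prescribed by the text.

```python
import math


def fit_centroids(X, y):
    """Learn one centroid (mean vector) per class from labelled samples."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(x_new, centroids):
    """Assign a new sample to the class with the closest centroid."""
    return min(centroids, key=lambda label: math.dist(x_new, centroids[label]))


# Hypothetical training data: two measured variables and a known class.
X_train = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
           [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
y_train = ["A", "A", "A", "B", "B", "B"]

centroids = fit_centroids(X_train, y_train)
predicted = [predict(x, centroids) for x in [[0.9, 1.1], [3.2, 3.1]]]
print(predicted)
```

The known class labels play the role of the reference Y: the model is built from them and then applied to samples whose class is unknown.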