
Introchimiometrie

This document discusses different types of data in chemistry and their implications for data analysis. It defines key terms like machine learning, data mining, and chemometrics. It emphasizes the importance of understanding the nature, amount, and quality of data before applying any analysis methods.

Uploaded by

imane benda


Introduction to Chemometrics

Handling of multivariate data

Pr Abdelaziz BOUKLOUZE
Laboratoire de Pharmacologie et Toxicologie, Unité d’Analyse
Instrumentale et de Traitement de données
Equipe de Recherche des Analyses Bio pharmaceutiques et Toxicologiques

Data Mining, Machine Learning, Deep Learning, Chemometrics
Definitions, Common Points and Trends
Introduction
• Concepts like Machine Learning, Data Mining or Artificial Intelligence
have become part of our daily life.
• This is mostly due to the incredible advances made in computation
(hardware and software), the increasing capability to generate and
store all types of data and, especially, the societal and economic
benefits that the analysis of such data generates.
• Simultaneously, Chemometrics has played an important role since
the late 1970s, analyzing data within natural science (and especially
in Analytical Chemistry).
Introduction
• It is still difficult to clearly define or differentiate the meaning of Machine
Learning, Data Mining, Artificial Intelligence, Deep Learning and
Chemometrics.
• Curiously, in recent years we have been asked more and more often whether
what we do is Machine Learning, Data Mining, or even Deep Learning.
• Chemometrics
• Bioinformatics
• Econometrics
Introduction
• We apply “the mathematical procedure that we need
to solve the problem we are dealing with, if we need
one at all”.
• What we do is Chemometrics.
• The literature is now filled with the terms Machine Learning, Data Mining, Deep
Learning and Artificial Intelligence.
• There is a dangerous trend in the scientific literature in their usage:
it is often assumed that every analytical data issue can be solved by just
constructing more complex algorithms with more non-linear
parameters, without paying more attention to the quality of
the data being used.
Objectives
1) Differentiate the terms mentioned beforehand.
2) Highlight the most important facts of building a multivariate model.

The three most important aspects of data analysis:

“The data, the reference values and the model”
Objectives
• The second: define the terms Data Mining, Machine Learning, Deep
Learning and Artificial Intelligence, and
put them into the framework of Chemometrics.
BEST ADVICE THAT I CAN GIVE YOU IN THIS COURSE
The most important facts that we NEED to clarify before any analysis:

1) What is the scientific question we need to answer?

2) Which analytical tools do we need to use?

3) Spend time to make a proper Design of the Experiments

Model
THE DATA (X), GI-GO

• The data are merely a collection of relevant information and noise
(i.e. non-relevant information).
• The most important aspect for success when applying any data analysis strategy
is to have GOOD data X and, if needed, GOOD reference values Y.

“IF THE DATA DO NOT CONTAIN ANY INFORMATION RELATED TO WHAT YOU
WANT TO MEASURE AND/OR IF THE REFERENCES ARE NOT RELATED TO WHAT
YOU WANT TO MEASURE, YOU WILL NOT OBTAIN GOOD RESULTS REGARDLESS
OF THE MODEL USED”

GI-GO (Garbage In – Garbage Out) truism

THE DATA (X), GI-GO

• The GI-GO truism is, by far, the major cause of frustration in data
analysis.
• We blame the algorithms most of the time, forgetting that the algorithm will not
find the information if it is not in the data
• One of the biggest mistakes that we can commit is hypothesizing that the model
will give the solution we are looking for.
• On the contrary, the solution must be in the data and its correlation with the
reference values.
• The model is, and will always be, the tool that helps us to find data patterns or
the correlation between the data and the reference (if it exists)
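The GI-GO idea above can be sketched with simulated numbers. This is only an illustration under stated assumptions: the variable x, both reference vectors and the noise levels are hypothetical, and the Pearson correlation stands in for "the model".

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical measured variable for 500 samples
x = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Reference that really depends on x (plus some analytical error)
y_related = [2.0 * v + rng.gauss(0.0, 0.5) for v in x]

# Reference that has nothing to do with x: no information to find
y_unrelated = [rng.gauss(0.0, 1.0) for _ in range(500)]

r_related = pearson_r(x, y_related)
r_unrelated = pearson_r(x, y_unrelated)
```

With the related reference the correlation is strong; with the unrelated one it stays close to zero, and no choice of algorithm can change that.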
THE DATA (X), GI-GO

• The major issue here is that sometimes those
patterns/correlations are difficult to find because they
represent a small amount of the variance/covariance in the data.
• This is where different algorithmic approaches can be
tested, but only after being completely sure that we fully
understand the nature (structure) and origin of our data.

KNOW YOUR DATA AND, IF ANY, YOUR REFERENCE

THE DATA (X), GI-GO

The data
• The analytical information that we measure is (or can be seen as)
multivariate.
• Many variables/observations are normally collected
• We usually want to compare samples assuming that the differences
or similarities between them will be found in the variables (or
groups of variables) that we measure
• We want to obtain useful information and get rid of the noise.

DATA = INFORMATION + NOISE

TYPES of DATA
Data Mining
A set of methods used to extract usable information from a larger set of raw data.
It implies analyzing patterns in already existing data using one or more algorithms.
Premises:
- We have existing data in the shape of a matrix with samples and variables
- The variables can be of very different natures
TYPES of DATA
Machine Learning
Methods that parse data, learn from that data, and make informed decisions based
on what they have learned.
Premises:
- The model now needs to learn from the data through a process of training (learning)
- This is because we will predict using the constructed model
THE DATA (X), GI-GO

The data
DATA = INFORMATION + NOISE
“The nature (structure), amount and quality”

Knowing the nature (structure) of the data will help us choose:

1) The appropriate pre-processing methods
2) The subsequent models.
THE DATA (X), GI-GO

“The nature (structure), amount and quality”

A normal arrangement of the data is in the shape of a matrix.

There are many ways in which X can be constructed;
even with the same apparent dimensions,
different scientific instruments might
provide data with a completely different
structure.
THE DATA (X), GI-GO

The data

• Matrix X is normally composed of samples in rows and
variables in columns.
• Figure 1a represents the typical data coming from any spectroscopic device, where a
spectrum of N variables has been collected for each of the M samples.

Variables and their nature are the key!
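A minimal sketch of this layout in plain Python (a numerical library would normally be used). The absorbance values, the wavelength axis and the helper names are hypothetical.

```python
# Hypothetical data: M = 3 samples (rows), N = 4 wavelengths (columns)
wavelengths_nm = [1100, 1102, 1104, 1106]  # assumed variable axis

X = [
    [0.12, 0.15, 0.19, 0.17],  # spectrum of sample 1
    [0.10, 0.14, 0.21, 0.18],  # spectrum of sample 2
    [0.13, 0.16, 0.18, 0.16],  # spectrum of sample 3
]

M = len(X)     # number of samples (rows)
N = len(X[0])  # number of variables (columns)

def spectrum(X, i):
    """Row i of X: all variables measured for one sample."""
    return X[i]

def variable(X, j):
    """Column j of X: one variable across all samples."""
    return [row[j] for row in X]
```

Reading a row gives one sample's spectrum; reading a column gives the behaviour of one wavelength across the samples.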

THE DATA (X), GI-GO
Data coming from the same instrument can be arranged/handled in different ways.
When the chromatogram of a sample is measured, we can either study the variation of the
chromatogram between samples (Figure 1c) or construct a table in which the integrated
peak areas are the variables.
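The second arrangement (a table of integrated peak areas) can be sketched as follows. The retention-time windows, the signals and the function names are hypothetical, and the trapezoidal rule is just one simple way to integrate a peak.

```python
def trapezoid_area(times, signal, t_start, t_end):
    """Trapezoidal integration of the signal over the sampled
    intervals that lie fully inside [t_start, t_end]."""
    area = 0.0
    for k in range(len(times) - 1):
        if times[k] >= t_start and times[k + 1] <= t_end:
            width = times[k + 1] - times[k]
            area += 0.5 * (signal[k] + signal[k + 1]) * width
    return area

def peak_table(chromatograms, times, windows):
    """One row per sample, one column per peak window."""
    return [
        [trapezoid_area(times, chrom, start, end) for (start, end) in windows]
        for chrom in chromatograms
    ]

# Hypothetical chromatograms of two samples on a common time axis
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chromatograms = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0],  # sample 1
    [0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # sample 2
]
windows = [(0.0, 2.0), (4.0, 6.0)]  # assumed windows for peak 1 and peak 2

areas = peak_table(chromatograms, times, windows)
```

The full chromatograms form a matrix of dependent, ordered variables, whereas the resulting peak table has one independent column per peak.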
THE DATA (X), GI-GO

This difference in the nature of the signal has a strong impact,
especially on the pre-processing and variable normalization steps
that need to be applied before data analysis, where the
relationship between the variables plays a fundamental role.
Data types in chemistry
Continuous univariate data
Data for which each sample yields a single point of information that can take an infinite (so to
speak) number of values.
Examples: Temperature, pH, concentrations, length
Can be grouped column-wise to form a multivariate dataset, so we obtain a matrix.

Pre-processing/Normalization: There is INDEPENDENCE between the columns.

The columns are, in principle, independent. Therefore, it makes no difference whether we construct a

[pH Temperature] matrix

or a [Temperature pH] matrix.
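A small sketch of why column order does not matter here. It assumes autoscaling (mean-centering and scaling to unit standard deviation) as the column-wise pre-processing; the pH and temperature values are hypothetical.

```python
from statistics import mean, stdev

def autoscale(X):
    """Mean-center each column and divide it by its standard deviation."""
    scaled_columns = []
    for column in zip(*X):
        m, s = mean(column), stdev(column)
        scaled_columns.append([(value - m) / s for value in column])
    # Transpose back so that samples are rows again
    return [list(row) for row in zip(*scaled_columns)]

def swap_columns(X):
    """Exchange the two columns of a two-column matrix."""
    return [[row[1], row[0]] for row in X]

# Hypothetical [pH Temperature] matrix for three samples
X_ph_temp = [[7.0, 20.0], [6.5, 25.0], [7.5, 30.0]]

scaled = autoscale(X_ph_temp)
scaled_after_swap = autoscale(swap_columns(X_ph_temp))
```

Each column is treated on its own, so scaling [Temperature pH] gives exactly the same numbers as scaling [pH Temperature], just in the other column order.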
Data types in chemistry
Continuous multivariate data

Examples: Spectroscopy
Can be grouped row-wise to form a multivariate dataset, so we obtain a
matrix.
Pre-processing/Normalization: There is DEPENDENCE between the columns.

When a spectrum is measured, for instance in the Near Infrared,
there is a correlation between wavelength1, wavelength2 and
wavelength3.
Constructing a matrix such as [wavelength1 wavelength3
wavelength2] will have an impact on the pre-processing applied
to that matrix, because the correlation/continuity between
variables has been broken.
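A minimal sketch of this effect. It assumes a first difference between neighbouring variables as a simple stand-in for derivative pre-processing; the intensity values are hypothetical.

```python
def first_difference(spectrum):
    """Difference between each variable and its left-hand neighbour."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

# Hypothetical intensities at wavelength1, wavelength2, wavelength3
spectrum = [1.0, 2.0, 4.0]

# Same values arranged as [wavelength1 wavelength3 wavelength2]
shuffled = [spectrum[0], spectrum[2], spectrum[1]]

d_original = first_difference(spectrum)
d_shuffled = first_difference(shuffled)
```

In the original order the differences describe a smoothly rising band; after reordering, the same numbers produce a spurious sign change, because the pre-processing relies on neighbouring columns being neighbouring wavelengths.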
Data types in chemistry

Continuous multiway data

Data for which each sample yields a matrix in which, theoretically, each point at each (M, N)
position can take an infinite (so to speak) number of values.
Examples: Excitation-emission spectroscopy, single-channel images, hyphenated chromatography
(GC/MS, UPLC/MS/MS)

Pre-processing/Normalization: There is normally DEPENDENCE in the rows and columns, and
independence in the third dimension.

Treatment: Multiway analysis methods based on tensor
algebra can be used (PARAFAC, Tucker, PARAFAC2, etc.)
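A sketch of the data structure only, not of PARAFAC or Tucker themselves. It assumes two hypothetical excitation-emission matrices stacked along a sample mode, and shows the sample-wise unfolding that turns the cube back into a two-way matrix.

```python
# Each sample is an (excitation x emission) matrix; values are hypothetical
eem_sample_1 = [[1, 2, 3], [4, 5, 6]]
eem_sample_2 = [[7, 8, 9], [10, 11, 12]]

# Stacking the samples gives a three-way array:
# samples x excitation wavelengths x emission wavelengths
cube = [eem_sample_1, eem_sample_2]

def shape(cube):
    """Dimensions of the three-way array."""
    return (len(cube), len(cube[0]), len(cube[0][0]))

def unfold_samples(cube):
    """Unfold to a two-way matrix with one row per sample,
    concatenating the emission spectra of each excitation."""
    return [[value for row in sample for value in row] for sample in cube]

unfolded = unfold_samples(cube)
```

Multiway methods work on the cube directly and keep the excitation and emission modes separate; unfolding is shown here only to make the extra dimension visible.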
Data types in chemistry
• Considering the spectral and chromatographic profile matrices,
one could argue that the data structure is the same (or, at least, very similar).
• However, the data source is completely different and, therefore, the issues
coming from the measurement must be addressed properly.
• The issues normally found in chromatograms (baseline drifts, peak
misalignment, normalization by standard peaks, etc.) are not the
same as the issues normally found in spectroscopy.
• Consequently, the two scenarios need different pre-processing methods.
Data types in chemistry
• A further situation is that of 3D data cubes (a.k.a. tensors).
• Here there is a third direction in the data that expresses either another
analytical variable or another measurement condition.
• Data coming from chromatographic measurements with a
multichannel detector normally lead to data cubes
with an extra spectral dimension.
Data types in chemistry
• In a hyperspectral measurement, one sample is normally
visualized as a data cube.
• In multiway fluorescence, the measurements of one sample give a matrix
composed of the emission spectra at different excitation
wavelengths.
Data types in chemistry

THE DATA (X)

The amount and quality of data

YOUR MODEL NEEDS AS MANY SAMPLES AS NECESSARY TO CERTIFY
THE RELIABILITY AND REPRESENTATIVENESS OF YOUR MODEL AND
WHATEVER YOU WANT TO EXPRESS WITH YOUR MODEL
Data types in chemistry

The quality of the data can be assessed by controlling different
parameters of the measurement scenario in the measured signal:
• The instrumental noise: Data will contain noise. It is a fact. Although
different methodologies can minimize that noise, it must not be
higher than the signal being measured.
• The composition of the sample and the plausible interferences in the signal
that we are looking for:
• Measuring multivariate data means that not all of the variables are useful for answering
the analytical question.
• Moreover, the signals that will give us the answer (spectral bands, chromatographic
peaks, etc.) may be strongly influenced by another chemical (or physical) compound that
could be in the sample matrix.
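One possible way to check the instrumental noise point above is a signal-to-noise ratio. This sketch assumes one common convention (net signal divided by the standard deviation of repeated blank readings); the readings are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(sample_readings, blank_readings):
    """Net signal (sample mean minus blank mean) divided by the
    standard deviation of the blank readings."""
    net_signal = mean(sample_readings) - mean(blank_readings)
    return net_signal / stdev(blank_readings)

# Hypothetical repeated readings at one variable
blank_readings = [0.9, 1.0, 1.1]
sample_readings = [10.9, 11.0, 11.1]

snr = signal_to_noise(sample_readings, blank_readings)
noise_below_signal = snr > 1.0
```

A ratio above 1 means the noise is smaller than the net signal; how much larger it must be for a given application is not stated in these slides.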
Data types in chemistry
• The quality of the data can be assessed by controlling different
parameters of the measurement scenario in the measured signal:

• The influence of the environmental conditions:

• Measuring in laboratory conditions is sometimes completely different from measuring in
more uncontrolled conditions.
• The correlation with the property that we want to measure:
• In regression and classification, the close connection of the data with the reference value
is essential.
Data types in chemistry
The amount and quality of data

Many of the previously mentioned issues can be partially minimized by
developing a proper protocol for the calibration of the instrument and
by including reference compounds to normalize the data.
Also, having a good Design of Experiments and a perfectly
optimized analytical method will help us to understand the plausible
confounding factors that we might have when designing our
experiment.
Data types in chemistry
THE REFERENCE (Y)
The reference values
• When regression or classification is needed, the role of the reference values is
essential to identifying reliable correlations between them and the data X
through the model, since having good data X is not sufficient to obtain a good
model.
• The reference values Y can be obtained in many different ways, depending on
the aim of the experiment.
• However, they can be summarized into two major blocks:
• Regression and classification
Data types in chemistry
THE REFERENCE (Y)
The reference values
• Regression
• When a regression model is made, the reference value normally comes from a
standardized analytical procedure that the proper regulatory agencies have
approved;

• sometimes, it is simply the procedure most accepted by the analytical
community.

• We must be aware that the reference values contain an analytical error and have a
calibration range, together with a limit of detection and quantitation, all of which are closely
linked to the data X.

• Sometimes, the reference value error can be neglected if the error of the data X is
larger.
Data types in chemistry
THE REFERENCE (Y)
The reference values
• However, assuming this statement without verifying it might lead to a
wrong interpretation of the result.

• The normal procedure to verify the error of the reference values is to make repeated
measurements of the same sample and ascertain that the variability (standard deviation
of the mean) between the replicates is within certain confidence levels.

• These concepts come from classical analytical chemistry procedures.

• Nevertheless, developing such protocols is sometimes time-consuming, or there is not
enough budget to perform as many replicates as we would like.
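The replicate check described above can be sketched as follows. The replicate values are hypothetical, and the acceptance limit on the relative standard deviation is an assumed stand-in for the "certain confidence levels" that the slides do not specify.

```python
from math import sqrt
from statistics import mean, stdev

def replicate_summary(replicates):
    """Basic statistics for repeated reference measurements of one sample."""
    n = len(replicates)
    m = mean(replicates)
    sd = stdev(replicates)       # sample standard deviation
    sem = sd / sqrt(n)           # standard deviation of the mean
    rsd_percent = 100.0 * sd / m # relative standard deviation
    return {"mean": m, "sd": sd, "sem": sem, "rsd_percent": rsd_percent}

def within_tolerance(replicates, max_rsd_percent):
    """True if the replicate spread is below the chosen (assumed) limit."""
    return replicate_summary(replicates)["rsd_percent"] <= max_rsd_percent

# Hypothetical reference values from five replicate determinations
replicates = [10.1, 9.9, 10.0, 10.2, 9.8]
summary = replicate_summary(replicates)
acceptable = within_tolerance(replicates, max_rsd_percent=2.0)
```

The limit itself should come from the analytical method or the applicable guideline, not from this sketch.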
Data types in chemistry
Categorical discrete data
The reference values
Classification
• Classification is directly linked to assigning a
sample to one class, several classes, or none, depending on the classification
strategy.
• Therefore, the reference values are normally given by a categorical
indexing of Y, where an inter-correlation between the different columns in
Y is expected or, at least, assumed.
• In many scenarios, the Y class is imposed by applying certain thresholds to
continuous variables
(Temperature < 20 is cold, 20 ≤ Temperature ≤ 30 is mild, Temperature > 30 is
hot).
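A minimal sketch of building a categorical Y from a continuous variable. Assigning temperatures of exactly 20 and 30 to "mild" is an assumption made here so that every value gets a class; the function names and temperatures are hypothetical.

```python
CLASSES = ["cold", "mild", "hot"]

def temperature_class(temperature):
    """Threshold a continuous temperature into one of three classes.
    Values of exactly 20 and 30 are assigned to 'mild' (assumption)."""
    if temperature < 20:
        return "cold"
    if temperature <= 30:
        return "mild"
    return "hot"

def dummy_matrix(labels, classes=CLASSES):
    """Categorical Y: one column per class, 1 where the sample belongs."""
    return [[1 if label == c else 0 for c in classes] for label in labels]

# Hypothetical temperatures for four samples
temperatures = [15.0, 25.0, 35.0, 20.0]
labels = [temperature_class(t) for t in temperatures]
Y = dummy_matrix(labels)
```

Each row of Y sums to 1, so the columns are not independent of each other, which is the inter-correlation between the columns of Y mentioned above.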
Data types in chemistry