• Departemen Kimia FMIPA
• Pusat Studi Biofarmaka Tropika
Data preprocessing in metabolomics analysis
Mohamad Rafi
Public Lecture on Metabolomics
Forum Metabolomik Indonesia, 31 August 2024
Metabolomics analysis
Analytical methods
Biological knowledge
Data analysis, chemometrics
Data analysis: chemometrics
Original spectral data from sample X (original spectra)
Signal preprocessing, data pretreatment
Model selection
Calibration set, validation set, prediction set
IDAKK model
IDAKK: identification, discrimination, authentication, classification, calibration (Identifikasi, diskriminasi, autentikasi, klasifikasi, kalibrasi)
The pretreatments and the logical flow of different calibration, validation, and prediction sets
I. C. Yang et al., J. Food Drug Anal. 21 (2013) 268
Identification & authentication of Phyllanthus niruri
UV-Vis spectra of P. niruri (a) and L. leucocephala (b)
Sains Malaysiana 50(4)(2021): 997-1006
http://doi.org/10.17576/jsm-2021-5004-10
Identification & authentication of Phyllanthus niruri
• Instrumentation: UV-Vis spectrophotometer
• Variable: absorbance from 250-700 nm
• Signal preprocessing: standard normal variate
• Chemometrics method: principal component analysis (PCA)
PCA plot (groups labelled P. niruri and L. leucocephala)
Sains Malaysiana 50(4)(2021): 997-1006
http://doi.org/10.17576/jsm-2021-5004-10
Identification & authentication of Phyllanthus niruri
• Instrumentation: UV-Vis spectrophotometer
• Variable: absorbance from 250-700 nm
• Signal preprocessing: standard normal variate
• Chemometrics method: canonical variate analysis (CVA)
CVA plot of PN (∆), 5% LL in PN (○), 25% LL in PN (□), 50% LL in PN (-), and LL (◊)
Sains Malaysiana 50(4)(2021): 997-1006
http://doi.org/10.17576/jsm-2021-5004-10
Authentication of taro flour
Taro flour
Price
Cassava flour
Wheat flour
M Rafi, SA Fikriah, RH Khuluk, UD Syafitri. 2020. Discrimination of cassava, taro, and wheat flour using
near-infrared spectroscopy and chemometrics. Jurnal Kimia Sains dan Aplikasi 23(10): 360-364.
Authentication of taro flour
Representative NIR spectra of cassava (a), taro (b), and wheat (c) flour from several locations.
Authentication of taro flour
• Instrumentation: NIR spectrophotometer
• Variables: reflectance
• Signal preprocessing: normalization
• Chemometrics method: discriminant analysis
DA plot of cassava (■), taro (♦), and wheat (•) flour.
DA validation plot of cassava (■), validation sample of cassava (□), taro (♦), validation sample
of taro (◊), wheat (•), and validation sample of wheat (○) flour.
Grouping of S. rhombifolia extracts
Base peak chromatogram of S. rhombifolia using (a) water, (b) ethanol 30%, (c) ethanol 50%, (d) ethanol 70%, and (e) ethanol p.a. as extracting solvents
AH Karomah, A Ilmiawati, UD Syafitri, DA Septaningsih, M Adfa, M Rafi. 2023. South African
Journal of Botany Volume 161: 418-427
Grouping of S. rhombifolia extracts
Principal component analysis
PCA score plot (a) before preprocessing and (b) after preprocessing using correlation optimized warping. The variables were peak intensities from the LC-HRMS chromatogram.
AH Karomah, A Ilmiawati, UD Syafitri, DA Septaningsih, M Adfa, M Rafi. 2023. South African
Journal of Botany Volume 161: 418-427
IDAKK method development workflow
Signal collection → raw data → data preprocessing → clean data → data pretreatment → data fit for analysis → IDAKK model
Data preprocessing and pretreatment
• The reliability of analytical results is vitally dependent on the quality
of the measurements leading to their determination. Signal
processing refers to a variety of operations that can be carried out
on a continuous or discrete sequence of measurements in order to
enhance the quality of information they are intended to convey.
• Data preprocessing steps are applied in order to generate 'clean'
data. These clean data can be used as the input for data analysis.
• Sometimes it is important to use an appropriate data pretreatment
method before starting data analysis. Data pretreatment methods
convert the clean data to a different scale (for instance, relative or
logarithmic scale).
Data preprocessing and pretreatment
• They aim to focus on the relevant (biological) information and to reduce the influence of
disturbing factors such as measurement noise. Procedures that can be used for data
pretreatment are scaling, centering and transformations.
• The choice for a data pretreatment method does not only depend on the biological
information to be obtained, but also on the data analysis method chosen since different data
analysis methods focus on different aspects of the data.
• For example, a clustering method focuses on the analysis of (dis)similarities, whereas
principal component analysis (PCA) attempts to explain as much variation as possible in as
few components as possible. Changing data properties using data pretreatment may
therefore enhance the results of a clustering method, while obscuring the results of a PCA
analysis.
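The pretreatment options named above (centering, scaling, transformation) can be sketched for a single variable as follows. This is a minimal illustration in plain Python, not the procedure used in any of the studies cited here; the function names `pretreat` and `log_transform` are hypothetical, and the use of the sample standard deviation (n - 1) is an assumption.

```python
import math
from statistics import mean, stdev


def pretreat(column, method="auto"):
    """Pretreat one variable (one column of the clean data table).

    method="center": mean centering
    method="auto":   autoscaling (center, then divide by the standard deviation)
    method="pareto": Pareto scaling (center, then divide by sqrt of the standard deviation)
    """
    m = mean(column)
    centered = [v - m for v in column]
    if method == "center":
        return centered
    s = stdev(column)  # assumption: sample standard deviation (n - 1)
    if method == "auto":
        return [c / s for c in centered]
    if method == "pareto":
        return [c / math.sqrt(s) for c in centered]
    raise ValueError(f"unknown method: {method}")


def log_transform(column):
    """Log10 transformation; only valid for strictly positive values."""
    return [math.log10(v) for v in column]


# Hypothetical peak intensities of one metabolite in three samples
intensities = [10.0, 20.0, 30.0]
print(pretreat(intensities, "center"))
print(pretreat(intensities, "auto"))
print(pretreat(intensities, "pareto"))
print(log_transform(intensities))
```

Autoscaling gives every variable equal weight, while Pareto scaling keeps part of the original intensity structure. Which one suits a dataset depends on the analysis method that follows, as noted above.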
Data preprocessing and pretreatment
Preprocessing
– Purpose:
• To improve the robustness and accuracy of subsequent multivariate analyses
• To increase the interpretability of the data by correcting issues associated with spectral data acquisition
– Types:
• De-noising: boxcar, Savitzky-Golay
• Spectral correction: baseline, derivatives
• Normalization: vector, SNV
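Two of the simpler operations in this list, boxcar de-noising and vector normalization, can be sketched in a few lines. This is an illustrative plain-Python sketch; the function names and the edge handling (a shrinking window at both ends) are assumptions, not taken from the lecture.

```python
import math


def boxcar_smooth(y, width=3):
    """Moving-average (boxcar) smoothing.

    The window shrinks at the edges, so the output has the same
    number of points as the input.
    """
    half = width // 2
    out = []
    for i in range(len(y)):
        window = y[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def vector_normalize(y):
    """Scale a spectrum so that its Euclidean norm equals 1."""
    norm = math.sqrt(sum(v * v for v in y))
    return [v / norm for v in y]


# Hypothetical absorbance values
spectrum = [1.0, 2.0, 3.0, 4.0, 5.0]
print(boxcar_smooth(spectrum, width=3))
print(vector_normalize([3.0, 4.0]))
```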
Data preprocessing: spectra
Visual effect of different pre-processing steps on a set of FTIR spectra
Data preprocessing
Representative second derivative FTIR spectra of C. longa
(A), C. xanthorrhiza (B), and Z. cassumunar (C)
Data preprocessing
Original spectra
Preprocessing: linear baseline correction and area normalization
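As a rough sketch of what this preprocessing involves: the slide does not say how the linear baseline was anchored, so the version below assumes the straight line joining the first and last points of the spectrum. Area normalization divides by the trapezoidal area of the absolute signal, and maximum normalization (compared later in the cross-validation results) divides by the largest absolute value. All function names are hypothetical.

```python
def linear_baseline_correct(x, y):
    """Subtract the straight line joining the first and last points (assumed anchors)."""
    slope = (y[-1] - y[0]) / (x[-1] - x[0])
    return [yi - (y[0] + slope * (xi - x[0])) for xi, yi in zip(x, y)]


def trapezoid_area(x, y):
    """Trapezoidal area under the absolute signal."""
    return sum(
        abs(x[i + 1] - x[i]) * (abs(y[i]) + abs(y[i + 1])) / 2
        for i in range(len(y) - 1)
    )


def area_normalize(x, y):
    """Scale the spectrum so that the area under it equals 1."""
    area = trapezoid_area(x, y)
    return [v / area for v in y]


def max_normalize(y):
    """Scale the spectrum so that its largest absolute value equals 1."""
    peak = max(abs(v) for v in y)
    return [v / peak for v in y]


# Hypothetical spectrum with a sloping baseline
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 5.0, 4.0, 3.0]
corrected = linear_baseline_correct(x, y)
print(corrected)
print(area_normalize(x, corrected))
print(max_normalize(corrected))
```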
Data preprocessing
Original spectra
Preprocessing: standard normal variate
Data preprocessing
Model cross-validation results
Linear baseline correction and area normalization
from \ to   CL   CX   ZC   Total   % correct
CL           0    7   28      35       0.00%
CX           9    2   25      36       5.56%
ZC           7    1   22      30      73.33%
Total       16   10   75     101      23.76%
Linear baseline correction and maximum normalization
from \ to   CL   CX   ZC   Total   % correct
CL          32    3    0      35      91.43%
CX           1   35    0      36      97.22%
ZC           0    1   29      30      96.67%
Total       33   39   29     101      95.05%
Standard normal variate
from \ to   CL   CX   ZC   Total   % correct
CL          33    2    0      35      94.29%
CX           0   36    0      36     100.00%
ZC           0    0   30      30     100.00%
Total       33   38   30     101      98.02%
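The "% correct" values follow directly from each confusion matrix: the diagonal count divided by the row total for each class, and the sum of the diagonal divided by the number of samples for the overall figure. A short sketch using the class rows of the standard normal variate table (the function name `percent_correct` is hypothetical):

```python
def percent_correct(matrix):
    """Return (per-class % correct, overall % correct) for a square confusion matrix.

    Rows are the true classes ("from"), columns the predicted classes ("to").
    """
    per_class = [100 * row[i] / sum(row) for i, row in enumerate(matrix)]
    n_samples = sum(sum(row) for row in matrix)
    n_correct = sum(row[i] for i, row in enumerate(matrix))
    return per_class, 100 * n_correct / n_samples


# Rows CL, CX, ZC of the standard normal variate table
snv_matrix = [
    [33, 2, 0],
    [0, 36, 0],
    [0, 0, 30],
]
per_class, overall = percent_correct(snv_matrix)
print([round(p, 2) for p in per_class])
print(round(overall, 2))
```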
Data preprocessing
Grouping of tempuyung (Sonchus arvensis) leaf extracts by extraction solvent
The spectra are highly complex and carry many kinds of information. Not all of the information that comes with the data is important, and some of it can degrade data quality.
3200-2800 + 1800-400 cm⁻¹
Combining several spectral regions at selected wavenumber ranges can bring out new features and information.
Data preprocessing
PCA score plots (a) without preprocessing and (b) with Savitzky-Golay smoothing: water extract (■), 10% ethanol extract (●), 30% ethanol extract (▲), 50% ethanol ( ), 70% ethanol (+), and ethanol p.a. extract (o)
Data preprocessing
• Savitzky-Golay smoothing removes noise without reducing the number of variables.
• Savitzky-Golay smoothing does not change the spectrum fundamentally; it mainly smooths the resulting spectrum.
• Experimental features become less visible as the number of points used to fit the data increases, because the spectrum is smoothed more strongly.
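A minimal sketch of Savitzky-Golay smoothing, assuming a 5-point window and a quadratic fit, for which the convolution coefficients are the well-known (-3, 12, 17, 12, -3)/35. The window size and polynomial order used in the study are not stated on the slide, so these are illustrative choices. The two points at each end are left unsmoothed, so the number of variables stays the same.

```python
# Savitzky-Golay coefficients for a 5-point window and a quadratic (or cubic) fit
SG5_COEFFS = (-3, 12, 17, 12, -3)
SG5_NORM = 35


def savgol5(y):
    """Smooth y with a 5-point quadratic Savitzky-Golay filter.

    The first and last two points are copied unchanged (an assumed,
    simple edge treatment), so len(output) == len(input).
    """
    out = list(y)
    for i in range(2, len(y) - 2):
        out[i] = sum(c * y[i + k - 2] for k, c in enumerate(SG5_COEFFS)) / SG5_NORM
    return out


# A single noise spike is damped and spread over its neighbours...
print(savgol5([0.0, 0.0, 0.0, 35.0, 0.0, 0.0, 0.0]))
# ...while a smooth quadratic signal passes through unchanged.
print(savgol5([float(x * x) for x in range(7)]))
```

A wider window would average over more points and smooth more strongly, which is the trade-off described in the last bullet above.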
Data preprocessing
PCA score plot with SNV (axes: PC-1 (79%), PC-2 (16%))
SNV works by subtracting the mean of all data points in a given spectrum and dividing by their standard deviation, so that each spectrum has zero mean and unit standard deviation. SNV reduces slope variation and scatter effects.
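The standard SNV transform centres each spectrum on its own mean and divides by its own standard deviation. A minimal sketch follows; using the sample standard deviation (n - 1) is an assumption, since conventions differ between software packages.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: (value - spectrum mean) / spectrum standard deviation."""
    m = mean(spectrum)
    s = stdev(spectrum)  # assumption: sample standard deviation (n - 1)
    return [(v - m) / s for v in spectrum]


# Two hypothetical spectra of the same material that differ only by an
# additive offset and a multiplicative (scatter-like) factor
a = [1.0, 2.0, 4.0, 3.0]
b = [2.0 * v + 5.0 for v in a]
print(snv(a))
print(snv(b))  # numerically the same as snv(a)
```

Because SNV is applied to each spectrum separately, it needs no reference spectrum and can be applied to new samples without refitting.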
Data preprocessing: chromatograms
Alignment
Reference peak
Algorithm: Correlation Optimized Warping
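To illustrate the idea of aligning a chromatogram to a reference by maximizing correlation, the sketch below uses a much simpler technique than COW: a single rigid shift of the whole chromatogram. COW itself splits the chromatogram into segments and stretches or compresses each one within a slack limit, which is not implemented here. All function names are hypothetical.

```python
from statistics import mean


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length signals."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0


def shift(signal, lag):
    """Shift a signal right by `lag` points (left if negative), padding with edge values."""
    n = len(signal)
    return [signal[min(max(i - lag, 0), n - 1)] for i in range(n)]


def align_by_shift(reference, sample, max_lag=5):
    """Return (best lag, shifted sample) maximizing correlation with the reference."""
    best = max(
        range(-max_lag, max_lag + 1),
        key=lambda lag: correlation(reference, shift(sample, lag)),
    )
    return best, shift(sample, best)


# Hypothetical chromatograms: the sample peak elutes two points later
reference = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
sample = [0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0]
lag, aligned = align_by_shift(reference, sample)
print(lag, aligned)
```

A rigid shift cannot correct retention-time drift that varies along the run, which is why segment-wise methods such as COW are used for real chromatograms.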
Data preprocessing
Identification of Sonchus arvensis (plant part and geographical origin)
RH Khuluk, A Yunita, E Rohaeti, UD Syafitri, M Rafi,
R Linda, LW Lim, T Takeuchi. Separations 8 (2): 12.
Identification of Sonchus arvensis (plant part and geographical origin)
Column: Zorbax Eclipse Plus C18, 5 µm, 4.6 x 150 mm
Eluent: methanol (A) and 0.2% formic acid (B): 0-10 min (15-30% A), 10-45 min (30-50% A), 45-50 min (50-80% A), 50-55 min (80-95% A), 55-60 min (100% A)
Analytes: (1) orientin, (2) hyperoside, (3) rutin, (4) myricetin, (5) luteolin, (6) quercetin, (7) kaempferol, (8) apigenin
Detection: 340 nm
Chromatogram preprocessing
Correlation Optimized Warping (COW)
Chromatogram preprocessing
Optimization of COW parameters
No   Segment length   Slack size   Similarity index
1    -                -            0.004132
2    5                1            0.004637
3    20               4            0.004267
4    50               10           0.00337
5    60               12           0.002331
6    70               14           0.005361 (optimum COW parameters)
7    75               15           0.004603
8    100              20           0.000518
9    500              100          0.001791
The resulting chromatograms are highly correlated with one another, so misalignment along the x-axis can be corrected.
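The parameter search in this table amounts to a small grid search: each combination of segment length and slack size is tried, and the one with the highest similarity index is kept. A sketch using the tabulated values (the variable names are hypothetical, and the similarity index is taken as reported rather than recomputed):

```python
# (segment length, slack size, similarity index) for rows 2 to 9 of the table;
# row 1 (no COW applied) has a similarity index of 0.004132
cow_grid = [
    (5, 1, 0.004637),
    (20, 4, 0.004267),
    (50, 10, 0.00337),
    (60, 12, 0.002331),
    (70, 14, 0.005361),
    (75, 15, 0.004603),
    (100, 20, 0.000518),
    (500, 100, 0.001791),
]

# Keep the combination with the highest similarity index
best_segment, best_slack, best_index = max(cow_grid, key=lambda row: row[2])
print(best_segment, best_slack, best_index)

# In every tested combination the slack is one fifth of the segment length
print(all(segment == 5 * slack for segment, slack, _ in cow_grid))
```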
Chromatogram preprocessing
Chromatograms of tempuyung extracts, enlarged from minute 32 to minute 34.5
Chromatograms of 30 tempuyung extract samples after applying COW with a segment length of 70 and a slack size of 14
Chromatogram preprocessing
Chromatograms of 30 tempuyung extract samples (A) before and (B) after applying COW
Grouping of tempuyung extracts
Figure 4 PCA score plots of tempuyung extracts (A) before and (B) after applying COW; (F) Bogor and (D) Bandung Barat; ( ) F. root, ( ) F. stem, ( ) F. leaf, ( ) D. root, ( ) D. stem, ( ) D. leaf
12th Intl Conf of the Indonesian Chemical Society 2024
https://hkibali.org/icics2024/
THANK YOU