Data Preprocessing

Uploaded by

Yanuar Pramana

• Department of Chemistry, FMIPA

• Tropical Biopharmaca Research Center (Pusat Studi Biofarmaka Tropika)

Data preprocessing in metabolomic analysis
Mohamad Rafi

Metabolomics Public Lecture (Kuliah Umum Metabolomik)
Forum Metabolomik Indonesia, 31 August 2024
Metabolomic analysis
Analytical methods, biological knowledge, and data analysis (chemometrics)
Data analysis: chemometrics
Workflow: original spectral data (original spectrum of sample X) → signal preprocessing, data pretreatment → model selection (calibration set, validation set, prediction set) → IDAKK model
IDAKK: identification, discrimination, authentication, classification, calibration

Figure: the pretreatments and the logical flow of different calibration, validation, and prediction sets.
I. C. Yang et al., J. Food Drug Anal. 21 (2013) 268
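The division of samples into calibration, validation, and prediction sets can be sketched as follows. This is only an illustration: the function name `split_samples`, the 60/20/20 proportions, and the random assignment are assumptions, not the scheme used in the cited work.

```python
import numpy as np

def split_samples(n_samples, frac_cal=0.6, frac_val=0.2, seed=0):
    """Return index arrays for calibration, validation, and prediction sets.

    The fractions are illustrative; whatever is left after the calibration
    and validation sets goes to the prediction set.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)      # shuffle sample indices
    n_cal = int(round(frac_cal * n_samples))
    n_val = int(round(frac_val * n_samples))
    return order[:n_cal], order[n_cal:n_cal + n_val], order[n_cal + n_val:]

# Example: 50 hypothetical spectra
cal_idx, val_idx, pred_idx = split_samples(50)
print(len(cal_idx), len(val_idx), len(pred_idx))
```

Any preprocessing parameter that is learned from the data (for example, column means used for centering) is usually estimated on the calibration set and then reused unchanged on the other two sets.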


Identification & authentication of Phyllanthus niruri

UV-Vis spectra of P. niruri (a) and L. leucocephala (b)

Sains Malaysiana 50(4)(2021): 997-1006
http://doi.org/10.17576/jsm-2021-5004-10
Identification & authentication of Phyllanthus niruri

• Instrumentation: UV-Vis spectrophotometer
• Variable: absorbance from 250-700 nm
• Signal preprocessing: standard normal variate
• Chemometrics method: principal component analysis (PCA)

PCA plot (groups labeled on the plot: L. leucocephala and P. niruri)
Sains Malaysiana 50(4)(2021): 997-1006
http://doi.org/10.17576/jsm-2021-5004-10
Identification & authentication of Phyllanthus niruri

• Instrumentation: UV-Vis spectrophotometer
• Variable: absorbance from 250-700 nm
• Signal preprocessing: standard normal variate
• Chemometrics method: canonical variate analysis (CVA)

CVA plot of PN (∆), 5% LL in PN (○), 25% LL in PN (□), 50% LL in PN (-), and LL (◊). Group labels on the plot: P, 5% P, 25% P, 50% P, and M.
Sains Malaysiana 50(4)(2021): 997-1006
http://doi.org/10.17576/jsm-2021-5004-10
Authentication of taro flour

Figure labels: taro flour, cassava flour, wheat flour; price

M Rafi, SA Fikriah, RH Khuluk, UD Syafitri. 2020. Discrimination of cassava, taro, and wheat flour using
near-infrared spectroscopy and chemometrics. Jurnal Kimia Sains dan Aplikasi 23(10): 360-364.
Authentication of taro flour

Representative NIR spectra of cassava (a), taro (b), and wheat (c) flour from several locations.
Authentication of taro flour
• Instrumentation: NIR spectrophotometer
• Variables: reflectance
• Signal preprocessing: normalization
• Chemometrics method: discriminant analysis

DA plot of cassava (■), taro (♦), and wheat (•) flour.

DA validation plot of cassava (■), validation sample of cassava (□), taro (♦), validation sample of taro (◊), wheat (•), and validation sample of wheat (○) flour.

Grouping of S. rhombifolia extracts

Base peak chromatogram of S. rhombifolia using
(a) water,
(b) ethanol 30%,
(c) ethanol 50%,
(d) ethanol 70%, and
(e) ethanol p.a. as extracting solvents
AH Karomah, A Ilmiawati, UD Syafitri, DA Septaningsih, M Adfa, M Rafi. 2023. South African Journal of Botany Volume 161: 418-427
Grouping of S. rhombifolia extracts
Principal component analysis

PCA score plot (a) before preprocessing and (b) after preprocessing using correlation optimized warping. The variables used were peak intensities from the LC-HRMS chromatogram.

AH Karomah, A Ilmiawati, UD Syafitri, DA Septaningsih, M Adfa, M Rafi. 2023. South African Journal of Botany Volume 161: 418-427
IDAKK method development workflow
Signal collection → raw data → data preprocessing → clean data → data pretreatment → data fit for analysis → IDAKK model
Data preprocessing and pretreatment

• The reliability of analytical results is vitally dependent on the quality of the measurements leading to their determination. Signal processing refers to a variety of operations that can be carried out on a continuous or discrete sequence of measurements in order to enhance the quality of information they are intended to convey.

• Data preprocessing steps are applied in order to generate 'clean' data. These clean data can be used as the input for data analysis.

• Sometimes it is important to use an appropriate data pretreatment method before starting data analysis. Data pretreatment methods convert the clean data to a different scale (for instance, relative or logarithmic scale).
Data preprocessing and pretreatment

• Pretreatment methods aim to focus on the relevant (biological) information and to reduce the influence of disturbing factors such as measurement noise. Procedures that can be used for data pretreatment are scaling, centering, and transformations.

• The choice of a data pretreatment method depends not only on the biological information to be obtained, but also on the data analysis method chosen, since different data analysis methods focus on different aspects of the data.

• For example, a clustering method focuses on the analysis of (dis)similarities, whereas principal component analysis (PCA) attempts to explain as much variation as possible in as few components as possible. Changing data properties using data pretreatment may therefore enhance the results of a clustering method while obscuring the results of a PCA.
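The three pretreatment families named above (centering, scaling, transformation) can be illustrated with a few lines of NumPy. This is a minimal sketch: autoscaling and Pareto scaling are chosen here as two common examples of scaling, and the function names and the small data matrix are made up for illustration.

```python
import numpy as np

def center(X):
    """Mean-center each variable (column)."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Center, then divide each variable by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Center, then divide by the square root of the standard deviation."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

def log_transform(X, offset=1.0):
    """Log10 transformation; the offset (an assumption) avoids log(0)."""
    return np.log10(X + offset)

# Hypothetical data: 3 samples x 2 variables on very different scales
X = np.array([[100.0, 1.0],
              [200.0, 2.0],
              [300.0, 6.0]])

print(autoscale(X))      # both variables now have unit standard deviation
print(pareto_scale(X))   # the large variable keeps more weight than with autoscaling
print(log_transform(X))  # compresses the range of the large values
```

After autoscaling every variable contributes equally to a distance-based method, whereas centering alone leaves the high-intensity variable dominating the variance that PCA tries to explain.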
Data preprocessing and pretreatment

Preprocessing
– Purpose:
• To improve the robustness and accuracy of subsequent multivariate analyses
• To increase the interpretability of the data by correcting issues associated with spectral data acquisition
– Types:
• De-noising: boxcar, Savitzky-Golay
• Spectral correction: baseline, derivatives
• Normalization: vector, SNV
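Two of the simplest entries in this list, boxcar de-noising and vector normalization, can be sketched as below. The synthetic band, the noise level, and the window width of 9 points are assumptions chosen only to make the effect visible.

```python
import numpy as np

def boxcar_smooth(y, width=5):
    """Moving-average (boxcar) filter; output has the same length as y."""
    kernel = np.ones(width) / width
    return np.convolve(y, kernel, mode="same")

def vector_normalize(y):
    """Scale a spectrum to unit Euclidean length."""
    return y / np.linalg.norm(y)

# Synthetic spectrum: one Gaussian band plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
clean = np.exp(-0.5 * ((x - 0.5) / 0.08) ** 2)
noisy = clean + rng.normal(0.0, 0.05, x.size)

smoothed = boxcar_smooth(noisy, width=9)
unit = vector_normalize(smoothed)
print(np.linalg.norm(unit))
```

Note that `mode="same"` pads the ends with zeros, so the first and last few points of the smoothed spectrum are less reliable than the interior.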
Data preprocessing: spectra

Visual effect of different pre-processing steps on a set of FTIR spectra

Data preprocessing

Representative second derivative FTIR spectra of C. longa (A), C. xanthorrhiza (B), and Z. cassumunar (C)
Data preprocessing

Original spectrum

Preprocessing: linear baseline correction and area normalization
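A minimal sketch of linear baseline correction followed by area or maximum normalization is given below. It assumes the baseline is the straight line joining the first and last points of the spectrum, which is one common convention; the software used for the slides may define it differently. All function names and the synthetic spectrum are illustrative.

```python
import numpy as np

def linear_baseline_correct(x, y):
    """Subtract the straight line through the first and last points."""
    slope = (y[-1] - y[0]) / (x[-1] - x[0])
    return y - (y[0] + slope * (x - x[0]))

def area(x, y):
    """Trapezoidal area under |y|, valid for ascending or descending x."""
    return float(np.sum(0.5 * (np.abs(y[1:]) + np.abs(y[:-1])) * np.abs(np.diff(x))))

def area_normalize(x, y):
    """Scale the spectrum so that the area under |y| equals 1."""
    return y / area(x, y)

def max_normalize(y):
    """Scale the spectrum so that its largest absolute value equals 1."""
    return y / np.max(np.abs(y))

# Synthetic spectrum: one band on a sloping, offset baseline
x = np.linspace(400.0, 1800.0, 300)
peak = np.exp(-0.5 * ((x - 1100.0) / 50.0) ** 2)
raw = peak + 0.0002 * x + 0.1

corrected = linear_baseline_correct(x, raw)
by_area = area_normalize(x, corrected)
by_max = max_normalize(corrected)
print(area(x, by_area), by_max.max())
```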
Data preprocessing

Original spectrum

Preprocessing: standard normal variate
Data preprocessing
Model cross-validation results (CL: C. longa, CX: C. xanthorrhiza, ZC: Z. cassumunar)

Linear baseline correction and area normalization
from \ to   CL   CX   ZC   Total   % correct
CL           0    7   28      35       0.00%
CX           9    2   25      36       5.56%
ZC           7    1   22      30      73.33%
Total       16   10   75     101      23.76%

Linear baseline correction and maximum normalization
from \ to   CL   CX   ZC   Total   % correct
CL          32    3    0      35      91.43%
CX           1   35    0      36      97.22%
ZC           0    1   29      30      96.67%
Total       33   39   29     101      95.05%

Standard normal variate
from \ to   CL   CX   ZC   Total   % correct
CL          33    2    0      35      94.29%
CX           0   36    0      36     100.00%
ZC           0    0   30      30     100.00%
Total       33   38   30     101      98.02%
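The "% correct" column can be recomputed from the cell counts of a confusion matrix. The sketch below uses the counts of the standard normal variate table; the dictionary layout and the function name `percent_correct` are illustrative choices.

```python
# Rows are the true class ("from"), columns the assigned class ("to").
confusion = {
    "CL": {"CL": 33, "CX": 2, "ZC": 0},
    "CX": {"CL": 0, "CX": 36, "ZC": 0},
    "ZC": {"CL": 0, "CX": 0, "ZC": 30},
}

def percent_correct(confusion):
    """Return per-class and overall percentages of correctly assigned samples."""
    per_class = {}
    correct = 0
    total = 0
    for actual, row in confusion.items():
        n = sum(row.values())                 # samples that truly belong to this class
        per_class[actual] = 100.0 * row[actual] / n
        correct += row[actual]                # diagonal element
        total += n
    return per_class, 100.0 * correct / total

per_class, overall = percent_correct(confusion)
print(per_class, round(overall, 2))
```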
Data preprocessing
Grouping of tempuyung (Sonchus arvensis) leaf extracts by extraction solvent

The spectra are highly complex and carry many kinds of information. Not all of the information accompanying the data is important, and some of it can degrade data quality.

3200-2800 + 1800-400 cm⁻¹

Combining several spectra at selected wavelength ranges can form and reveal new features and information.
Data preprocessing

PCA score plot (a) without preprocessing and (b) with Savitzky-Golay smoothing

water extract (■), 10% ethanol extract (●), 30% ethanol extract (▲), 50% ethanol ( ), 70% ethanol (+), and ethanol p.a. extract (o)
Data preprocessing

• Savitzky-Golay smoothing eliminates noise without reducing the number of variables

• Savitzky-Golay smoothing does not fundamentally change the spectrum; it smooths the spectrum that is produced

• Experimental detail becomes less visible as the number of points used for fitting grows, because a wider window smooths the spectrum more strongly
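The behaviour described in these bullets can be checked with a small NumPy implementation of Savitzky-Golay smoothing: a polynomial is fitted by least squares inside a moving window and evaluated at the window center. This is a sketch under assumptions (edge padding by repeating the end values, a second-order polynomial, illustrative function names), not the routine used to produce the slides.

```python
import numpy as np

def savgol_coeffs(window, polyorder):
    """Filter weights that give the fitted polynomial value at the window center."""
    half = window // 2
    positions = np.arange(-half, half + 1, dtype=float)
    A = np.vander(positions, polyorder + 1, increasing=True)
    return np.linalg.pinv(A)[0]       # row 0 corresponds to the constant term

def savgol_smooth(y, window=11, polyorder=2):
    """Savitzky-Golay smoothing; the output has as many points as the input."""
    if window % 2 == 0 or window <= polyorder:
        raise ValueError("window must be odd and larger than polyorder")
    half = window // 2
    padded = np.pad(np.asarray(y, dtype=float), half, mode="edge")
    coeffs = savgol_coeffs(window, polyorder)
    return np.convolve(padded, coeffs[::-1], mode="valid")

# Synthetic noisy signal
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 300)
noisy = np.sin(t) + rng.normal(0.0, 0.1, t.size)

narrow = savgol_smooth(noisy, window=5)
wide = savgol_smooth(noisy, window=21)
print(len(noisy), len(narrow), len(wide))   # number of variables is unchanged
```

The wider window gives a visibly smoother curve, at the cost of flattening narrow features, which is the trade-off noted in the last bullet.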
Data preprocessing

PCA score plot with SNV (axes: PC-1 (79%) and PC-2 (16%))

SNV centers each spectrum on its own mean and then divides the entire spectrum by the standard deviation of all its data points, thus giving every spectrum a unit standard deviation. SNV removes slope variation and also the scatter effects.
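SNV followed by PCA can be sketched as below. The spectra are synthetic: two hypothetical classes with different band positions, each distorted by a random gain and offset that mimic scatter. PCA is computed through a singular value decomposition of the column-centered data; the function names are illustrative.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) separately."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def pca_scores(X, n_components=2):
    """PCA scores and explained-variance fractions via SVD."""
    Xc = X - X.mean(axis=0)                       # column centering
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]

# Two synthetic classes, five spectra each, with random gain and offset
rng = np.random.default_rng(1)
wl = np.linspace(250.0, 700.0, 120)
band_a = np.exp(-0.5 * ((wl - 400.0) / 40.0) ** 2)
band_b = np.exp(-0.5 * ((wl - 450.0) / 40.0) ** 2)
shapes = np.vstack([np.tile(band_a, (5, 1)), np.tile(band_b, (5, 1))])
gain = rng.uniform(0.5, 1.5, size=(10, 1))
offset = rng.uniform(-0.2, 0.2, size=(10, 1))
raw = gain * shapes + offset

corrected = snv(raw)
scores, explained = pca_scores(corrected)
print(scores[:, 0].round(2), explained.round(3))
```

Because gain and offset are removed exactly in this idealized case, spectra of the same class coincide after SNV and the first principal component separates the two classes.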
Data preprocessing: chromatograms

Alignment
Reference peak

Algorithm: correlation optimized warping (COW)
Data preprocessing
Identification of Sonchus arvensis (plant part and geographical origin)

RH Khuluk, A Yunita, E Rohaeti, UD Syafitri, M Rafi, R Linda, LW Lim, T Takeuchi. Separations 8 (2): 12.
Identification of Sonchus arvensis (plant part and geographical origin)

Column: Zorbax Eclipse Plus C18, 5 µm, 4.6 x 150 mm

Eluent: methanol (A) and formic acid 0.2% (B): 0-10 min (15-30% A), 10-45 min (30-50% A), 45-50 min (50-80% A), 50-55 min (80-95% A), 55-60 min (100% A)

Analytes: (1) orientin, (2) hyperoside, (3) rutin, (4) myricetin, (5) luteolin, (6) quercetin, (7) kaempferol, (8) apigenin

Detection: 340 nm
Preprocessing chromatogram

Correlation optimized warping (COW)
Preprocessing chromatogram

Optimization of COW parameters

No   Segment length   Slack size   Similarity index
1    -                -            0.004132
2    5                1            0.004637
3    20               4            0.004267
4    50               10           0.00337
5    60               12           0.002331
6    70               14           0.005361   (optimum COW parameters)
7    75               15           0.004603
8    100              20           0.000518
9    500              100          0.001791

The resulting chromatograms are highly correlated with the other chromatograms, so misalignment along the x-axis can be overcome.
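The roles of the two parameters, segment length and slack size, can be seen in a simplified sketch of correlation optimized warping. The sample chromatogram is cut into segments of `seg_len` points; each segment boundary may move by up to `slack` points; dynamic programming picks the boundary positions that maximize the summed Pearson correlation between each stretched segment and the reference. Everything here is an assumption made for illustration (equal-length signals, linear interpolation, the names `cow_align`, `seg_len`, and `slack`, the synthetic peaks), and it does not reproduce the similarity index reported in the table.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation; returns 0 for a constant segment."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else 0.0

def stretch(segment, n_points):
    """Linearly interpolate a segment onto n_points samples."""
    old = np.linspace(0.0, 1.0, len(segment))
    new = np.linspace(0.0, 1.0, n_points)
    return np.interp(new, old, segment)

def cow_align(sample, target, seg_len, slack):
    """Warp `sample` onto `target` (same length) with a simplified COW."""
    sample = np.asarray(sample, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(target)
    n_seg = (n - 1) // seg_len
    if len(sample) != n or n_seg < 1:
        raise ValueError("signals must have equal length >= seg_len + 1")
    bounds = np.arange(n_seg + 1) * seg_len
    bounds[-1] = n - 1                        # last segment absorbs the remainder
    best = np.full((n_seg + 1, n), -np.inf)   # best[i, p]: boundary i placed at p
    back = np.zeros((n_seg + 1, n), dtype=int)
    best[0, 0] = 0.0
    for i in range(1, n_seg + 1):
        src = sample[bounds[i - 1]:bounds[i] + 1]
        nominal = bounds[i] - bounds[i - 1]
        for start in np.flatnonzero(np.isfinite(best[i - 1])):
            for u in range(-slack, slack + 1):
                end = start + nominal + u
                if end - start < 2 or end > n - 1:
                    continue
                score = best[i - 1, start] + pearson(
                    stretch(src, end - start + 1), target[start:end + 1])
                if score > best[i, end]:
                    best[i, end] = score
                    back[i, end] = start
    ends = [n - 1]                            # the last boundary is fixed
    for i in range(n_seg, 0, -1):
        ends.append(back[i, ends[-1]])
    ends = ends[::-1]
    aligned = np.empty(n)
    for i in range(1, n_seg + 1):
        src = sample[bounds[i - 1]:bounds[i] + 1]
        aligned[ends[i - 1]:ends[i] + 1] = stretch(src, ends[i] - ends[i - 1] + 1)
    return aligned

# Synthetic chromatograms: the sample is the target delayed by 3 points
x = np.arange(201.0)
def chrom(shift):
    return sum(np.exp(-0.5 * ((x - c - shift) / 4.0) ** 2) for c in (40, 90, 150))

target, sample = chrom(0), chrom(3)
aligned = cow_align(sample, target, seg_len=20, slack=4)
before, after = pearson(sample, target), pearson(aligned, target)
print(round(before, 3), round(after, 3))
```

A parameter search like the one in the table would call such a function for each (segment length, slack) pair and keep the pair that gives the best similarity between the aligned chromatograms.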
Preprocessing chromatogram

Chromatogram of tempuyung extract enlarged from minute 32 to minute 34.5
Chromatograms of 30 tempuyung extract samples after applying COW with a segment length of 70 and a slack size of 14

Preprocessing chromatogram

Chromatograms of 30 tempuyung extract samples (A) before and (B) after applying COW
Grouping of tempuyung extracts

Figure 4. PCA score plots of tempuyung extracts (A) before and (B) after applying COW; (F) Bogor and (D) Bandung Barat; ( ) F. root, ( ) F. stem, ( ) F. leaf, ( ) D. root, ( ) D. stem, ( ) D. leaf. Regions labeled on the plots: Bandung Barat and Bogor.
12th Intl Conf of the Indonesian Chemical Society 2024

https://hkibali.org/icics2024/
THANK YOU
