Remote Photoplethysmography for Heart Rate Measurement a Review

Download as pdf or txt
Download as pdf or txt
You are on page 1of 28

Biomedical Signal Processing and Control 88 (2024) 105608

Contents lists available at ScienceDirect

Biomedical Signal Processing and Control


journal homepage: www.elsevier.com/locate/bspc

Remote photoplethysmography for heart rate measurement: A review


Hanguang Xiao a ,∗, Tianqi Liu a ,∗,1 , Yisha Sun b , Yulin Li a , Shiyi Zhao a , Alberto Avolio c
a
School of Artificial Intelligence, Chongqing University of Technology, Chongqing 401135, China
b
School of Computer & Information Science, Chongqing Normal University, Chongqing 401331, China
c
Macquarie Medical School, Faculty of Medicine, Health and Human Sciences, Macquarie University, Sydney 2019, Australia

ARTICLE INFO ABSTRACT

Keywords: Heart rate (HR) ranks among the most critical physiological indicators in the human body, significantly
Heart rate illuminating an individual’s state of physical health. Distinguished from traditional contact-based heart rate
Remote photoplethysmography measurement, the utilization of Remote Photoplethysmography (rPPG) for remote heart rate monitoring
Non-contact
eliminates the need for skin contact, relying solely on a camera for detection. This non-contact measurement
Deep learning
method has emerged as an increasingly noteworthy research area. With the rapid development of deep learning,
new technologies in this area have spurred the emergence of many new rPPG methods for HR measurement.
However, comprehensive review papers in this field are scarce. Consequently, this paper aims to provide a
comprehensive overview centered around rPPG methods employed for the purpose of heart rate measurement.
We systematically organized the existing rPPG methods, with a specific focus on those based on deep learning,
and described and analyzed the structures and key aspects of these methods. Additionally, we summarized the
datasets and tools used for related research and compiled the performance of different methods on prominent
datasets. Finally, this paper discusses the current research barriers in rPPG methods, as well as the latest
practical applications and potential future directions for development. We hope that this review will help
researchers quickly understand this field and promote the exploration of more unknown challenges.

1. Introduction are captured by the photodetector, generating the PPG signal [1]. PPG
is effective because light absorption follows Beer–Lambert’s law, which
Physiological indicators, such as HR, heart rate variability (HRV), states that the amount of light absorbed by blood is proportional to the
respiratory rate (RR), blood oxygen saturation (SpO2), and blood pres- concentration of hemoglobin in the skin and blood. Therefore, during
sure (BP), are commonly used to assess a person’s physical health the cardiac cycle, small changes in hemoglobin concentration cause
status, detect potential diseases, and monitor recovery during clinical fluctuations in the amount of light absorbed by the blood vessels, result-
treatment [1–6]. Among these indicators, HR is the most widely used ing in changes in the skin intensity value [7]. Contact devices such as
and can detect certain cardiovascular problems, including atherosclero- pulse oximeters and fitness watches use PPG to non-invasively measure
sis, myocardial infarction, and arrhythmia [2]. Photoplethysmography these small changes in the skin based on this principle. However, these
(PPG) is a non-invasive and cost-effective method of measuring these traditional contact devices have many disadvantages, such as being
physiological parameters [2–6]. Medical devices based on PPG have unsuitable for detecting skin conditions in vulnerable populations such
been widely used in clinical settings to detect and monitor various phys- as infants and patients with skin diseases [8], causing discomfort or
iological indicators. PPG is also used in daily devices, such as sports
even skin infections with long-term use [9], and being affected by skin
watches and finger pulse oximeters. The use of PPG is beneficial in both
humidity, temperature, color, and patient movement, which can affect
clinical and non-clinical settings, as it provides real-time monitoring of
their accuracy [10]. To avoid these disadvantages, researchers have
physiological indicators, facilitates early detection of health problems,
begun to explore non-contact methods of remote HR monitoring, and
and helps maintain a healthy lifestyle.
rPPG has become a powerful alternative. rPPG can use a camera (such
The basic principle of PPG is to use a light source and a photodetec-
as a web camera, infrared camera, or RGB camera) to record video
tor to measure changes in the volume of blood vessels under the skin.
of the subject’s face, and extract subtle color changes in the skin to
When the tissue is illuminated by the light source, small changes in the
reflection or transmission intensity of the light caused by the blood flow generate the remote PPG signal [11]. The principle of the rPPG method

∗ Corresponding authors.
E-mail addresses: [email protected] (H. Xiao), [email protected] (T. Liu).
1
Co-first author.

https://doi.org/10.1016/j.bspc.2023.105608
Received 8 June 2023; Received in revised form 27 September 2023; Accepted 15 October 2023
Available online 21 October 2023
1746-8094/© 2023 Elsevier Ltd. All rights reserved.
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

To fill this gap, this paper timely and systematically reviews the
latest progress in rPPG methods for HR measurement. This study aims
to provide a systematic review and introduction of rPPG methods for
researchers. We categorize the rPPG methods used for HR measure-
ment into traditional and deep learning methods, further dividing the
deep learning methods into supervised and unsupervised methods, and
critically analyze their advantages, limitations, and performance based
on model architectures and methodologies. Additionally, we have also
provided an introduction to other aspects pertinent to the research of
rPPG methods. In summary, this paper has three main contributions:
(1) This work systematically review and analyze rPPG methods used
for remote HR measurement, covering all representative methods since
the first method appeared, with a particular focus on deep learning
methods.
Fig. 1. Schematic diagram of rPPG signal generation. The camera captures the specular
reflection and diffuse reflection produced by the skin under environmental light. The (2) We introduce the latest commonly used resources for rPPG meth-
specular reflection contains meaningless surface information, while the diffuse reflection ods and summarize the performance of various methods on datasets.
indicates changes in the volume of blood vessels, from which the rPPG signal can be (3) The primary challenges and difficulties faced in current research
further extracted. on rPPG methods are discussed in this paper, along with an outlook on
the latest application domains and potential future research directions
of rPPG methods.
is similar to that of the conventional PPG method, in which pulsating The remainder of this paper is organized as follows: Section 2
blood propagated in the cardiovascular system changes the blood vol- analyzes the main conventional methods. Section 3 provides a detailed
ume in the microvascular tissue bed under the skin with each heartbeat, description of supervised methods in deep learning approaches. Sec-
generating periodic waves. However, the main difference between the tion 4 elucidates unsupervised methods in deep learning approaches.
two methods lies in the way the PPG signal is captured: rPPG methods Section 5 summarizes the datasets and tools currently utilized in rPPG
capture the signal from video recordings of the subject’s face, while methods, as well as the performance of the proposed methods on these
conventional PPG methods require a physical sensor to be in contact datasets. Section 6 primarily analyzes the challenges currently faced
with the skin. As shown in Fig. 1, the principle of rPPG can be further in rPPG research. Section 7 introduces the latest application areas of
explained by the dichromatic reflection model (DRM) [12]. When rPPG. Finally, in Section 8, we provide a conclusion and an outlook
ambient light shines on the skin, it produces specular reflection and for possible future research directions.
diffuse reflection. Specular reflection occurs above the incident light
and the skin surface and does not contain meaningful physiological 2. Conventional methods
signals, while diffuse reflection occurs on the blood vessels and contains
meaningful physiological signals. The signal captured using the camera In this section, we will introduce some representative rPPG con-
is a combination of specular and diffuse reflections. Therefore, the rPPG ventional methods for remote HR measurement. Before the prevalence
method needs to separate specular and diffuse reflections and extract of deep learning methods, conventional rPPG methods were the main
meaningful diffuse reflections to generate the rPPG signal. Currently, methods for remote HR measurement. These conventional methods
rPPG has been proven to be superior because not only do subjects not often relied mainly on mathematics and algorithms, and their main
need to wear contact devices to avoid the various drawbacks of contact purpose was to eliminate motion artifacts and noise generated in facial
devices, but it is also suitable for long-term continuous monitoring and videos, thereby obtaining better quality rPPG signals. In addition to
is friendly to various patients. Furthermore, the camera required for the the conventional methods initially proposed by Verkruysse et al. [13],
rPPG method is low-cost and easy to obtain, making it highly suitable we divide conventional methods into blind source separation (BSS)
for wide promotion and application [13]. However, rPPG methods are based methods and model based methods. BSS based methods may
more challenging to use in real-world scenarios due to various factors be ideal for separating pulses without prior information, while model
such as lighting conditions, facial hair, and skin tone, which can affect based methods can use color vector knowledge of different components
the accuracy of the extracted rPPG signal. The rPPG signal is also to control separation. We summarize these conventional methods in
weaker than that extracted using the conventional contact method due Table 1.
to the differences in principle, requiring careful and precise processing.
In previous studies, Verkruysse et al. first proposed the use of 2.1. Conventional methods based on BSS
consumer-grade cameras to extract rPPG signals for HR measure-
ment [13]. In their work, it was found that different channels of the Conventional BSS: BSS refers to the recovery of unobserved sig-
RGB signal had varying relative strengths of PPG signals, with the green nals or sources from a set of observed mixtures without any prior
channel containing the strongest pulsatile signal. This observation is information about the mixing process. Typically, the observations are
consistent with the fact that hemoglobin is most sensitive to changes outputs of sensors, each of which is a combination of the sources [29].
in oxygenation of green light absorption, successfully demonstrating Independent component analysis (ICA) is a typical method for BSS and
the feasibility of using rPPG methods to measure HR from ordinary has been shown to be effective in many fields [30]. Poh et al. [19]
consumer-grade camera footage. Since then, various rPPG methods proposed a ICA algorithm base on joint approximate diagonalization of
for remote HR measurement have emerged, with a large number of eigenmatrices to remove the correlations and high-order dependencies
researchers still actively engaged in this field. The development of among the RGB channels and extract the HR components in sit-still
rPPG methods has gone through two stages: conventional methods and and sit-move-naturally scenarios. The root mean square error (RMSE)
deep learning methods. Although there are many review articles on corresponding to the motion scenario decreased from 19.36 bpm to
conventional rPPG methods [11,14–16] and some on deep learning- 4.63 bpm, demonstrating the feasibility of ICA for HR estimation. It
based rPPG methods [17,18], with the rapid development of deep is noteworthy that they employed the Viola–Jones face detector [31]
learning, new technologies in this area have spurred the emergence to automatically generate regions of interest (ROI) for the first time.
of many new rPPG methods and applications, rendering current review Lewandowska et al. [20] proposed using principal component anal-
articles insufficient to match the pace of deep learning advancements. ysis (PCA) to define three independent linear combinations of color

2
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 1
Analysis of conventional methods.
Category Ref. Year Methods Description
Original [13] 2008 band-pass It was the first time that the feasibility of rPPG method was
Filter + FFT demonstrated, and it was found that the green channel
contained the strongest pulsatile signal.

BSS-based [19] 2010 ICA Applying ICA in signal processing technology to remote HR
estimation.

[20] 2011 PCA The BSS-based method involves using PCA, which has been
proven to save computational costs.

[21] 2011 SCICA A new method for reducing artifacts composed of planar
motion compensation and BSS.

[22] 2014 JBSS + IVA Applying IVA to jointly analyze color signals from multiple
facial sub-regions.
[23] 2017 JBSS Improving PPG signal by combining facial sub regional
landmark localization and JBSS method to extract
physiological signals.

[24] 2017 CEEMDAN + Using CEEMDAN and CCA to eliminate noise artifacts.
CCA

Model-based [12] 2013 CHROM A robust technique for extracting HR from CCD camera
videos based on CHROM during motion.

[25] 2014 PBV It is proposed that PBV is a signal in skin reflectance spectra
that can be used to distinguish physiological signals from
motion noise.

[26] 2014 NLMS A illumination rectification method based on the NLMS


adaptive filter is proposed to reduce noise caused by changes
in lighting and rigid head movements.

[27] 2016 2SR A completely data-driven method utilizing 2SR is proposed


to improve motion robustness.

[28] 2017 POS Using the POS imaging to measure HR, combining
normalized RGB channels into two new channels, and
weighting them to merge into the desired signal.

channels and demonstrated that PCA is as effective as ICA but can introduced the JBSS method into the field of rPPG, mainly applying
greatly reduce the computational complexity. Sun et al. [21] introduced independent vector analysis (IVA) to jointly analyze color signals from
a new artifact reduction method composed of planar motion compen- multiple facial subregions. Preliminary experimental results show that
sation and BSS, in which their BSS mainly refers to single-channel ICA the measurement of HR is more accurate compared with the ICA-based
(SCICA). The performance evaluation based on facial videos captured BSS method. Later, Qi et al. [23] proposed a new non-contact HR
from a repeatedly exercising volunteer suggests that the proposed measurement method by exploring the correlation between facial sub-
method can track HR. BSS-based methods had somewhat the ability region datasets through JBSS. Test results on large public databases
of tolerating motions but still showed limited improvement, especially also show that the proposed JBSS method outperforms previous ICA-
in dealing with severe movements. Since the orders of the extracted based methods. But current HR estimation using the JBSS method is still
components via BSS are random, usually fast Fourier transform (FFT) preliminary. In the future, in addition to color signals and multimodal
is utilized to determine the most probable HR frequency. Therefore, data collections from facial subregions, other types of data collections
BSS-based methods cannot handle the case where the frequency of can be used by JBSS for more accurate and robust telemetric HR
periodic motion artifacts falls within the normal HR frequency range. measurements.
Subsequently, Al-Naji et al. [24] proposed to estimate HR from video
sequences captured by hovering unmanned aerial vehicle by combin- 2.2. Conventional methods based on models
ing complete ensemble empirical mode decomposition with adaptive
noise (CEEMDAN) and canonical correlation analysis (CCA). The com- Owing to the capacity of model-based methods to leverage the
bined method of CEEMDAN and CCA outperforms the use of ICA or data provided by color vectors for managing component separation, a
PCA methods, particularly in the presence of noise caused by lighting prominent attribute shared by these methods is the ability to eradicate
changes, subject motion, and camera motion. the reliance of RGB signals on the mean skin reflection chromatic chan-
Joint BSS: Conventional BSS techniques were originally designed nel [28]. Model-based methods generally allude to approaches based
for processing a single data set, e.g., decomposing multiple color chan- on the chrominance model (CHROM) [20], which exploit the blood
nel signals from a single facial ROI region into independent compo- volume pulse signature (PBV) feature to discriminate pulse signals from
nents [32]. But color channel signals from multiple facial ROI sub- motion distortions [25], and approaches based on the plane orthogonal
regions can be used for more accurate HR measurement [33]. With to the skin (POS) [28].
the increasing availability of multiple data sets, various joint BSS De Haan et al. [12] developed a CHROM to consider diffuse reflec-
(JBSS) methods have been proposed to accommodate multiple data sets tion components and specular reflection contributions, which together
simultaneously. From a multi-set and multimodal perspective, several made the observed color varied depending on the distance (angle) from
realistic neurophysiological applications highlight the benefits of the the camera to the skin and to the light sources. Therefore, following the
JBSS approach as an efficient and promising tool for neurophysiolog- CHROM approach, the influence of motion artifacts can be eliminated
ical data analysis. The goal of JBSS is to extract underlying sources by utilizing linear combinations of individual R, G, and B channels.
within each data set and meanwhile keep a consistent ordering of the Experimental results demonstrated that CHROM outperformed previous
extracted sources across multiple data sets [30]. Guo et al. [22] first ICA and PCA-based methods during motion. To further address the

3
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

impact of motion artifacts, Li et al. [26] proposed an illumination nor-


malization method based on the normalized least-mean-square (NLMS)
adaptive filter. They assumed that both facial ROI and background
follow the Lambertian model and share the same light source. The dis-
criminative response map fitting [34] and Kanade–Lucas–Tomasi [35]
algorithms were used for face detection and tracking to address rigid
Fig. 2. The architecture of HR-CNN.
head motion. Subsequently, De Haan et al. [25] proposed a PBV-based
method to improve the robustness of motion artifacts. The PBV-based
method uses the characteristic of blood volume change to differenti-
the first rPPG deep learning method, HR-CNN, for remote HR mea-
ate color changes caused by pulse and motion artifacts in the RGB
surement. It is a two-stage CNN architecture that consists of a feature
trace over time. Here, PBV is considered as a signal of various skin
extractor and an HR estimator, as shown in Fig. 2. By training the
reflectance spectra to distinguish physiological signals from motion-
2D CNN feature extractor to maximize the signal-to-noise ratio (SNR),
induced noise. Experimental results showed significant improvement
higher quality rPPG signals can be extracted from a sequence of video
compared to CHROM-based methods when subjects were in motion.
frames. These signals are then input into the HR estimator to output
Wang et al. [27] proposed a conceptually novel data-driven rPPG
predicted HR values, which is the basic workflow of most deep learning
algorithm, called spatial subspace rotation (2SR), to enhance motion
methods. HR-CNN was mainly trained and validated on the PURE [38]
robustness. They used 2SR to estimate the spatial subspace of skin
physiological dataset. The results showed that the model achieved
pixels in the RGB image and evaluated its temporal rotation to measure
an MAE of 1.84 and an RMSE of 2.37, while the best-performing
HR, i.e., obtaining the pulse by estimating the temporal rotation of skin
traditional method, POS [28], achieved an MAE of 3.14 and an RMSE
pixel subspace in RGB. Experimental results showed that the proposed
of 10.57. As we can see, HR-CNN significantly improved the results
2SR method outperformed ICA, CHROM, and PBV-based methods in
compared to traditional methods, greatly inspiring the development
different skin tones and body motions under a clear skin mask. Addi-
of deep learning methods. Furthermore, HR-CNN demonstrates strong
tionally, the proposed 2SR method has the advantages of simplicity and
robustness in dealing with video compression issues that traditional
scalability. Building upon most model-based traditional methods, Wang
methods find challenging.
et al. [28] proposed another model-based rPPG algorithm, called POS,
Inspired by HR-CNN [37], Chen et al. [39] proposed a novel model
which defines a POS tone in the time-normalized RGB space for pulse
named DeepPhys (also known as CAN). Similar to HR-CNN, rPPG
extraction. They compared all classic traditional methods, including
signals are extracted using a 2D CNN network, as shown in Fig. 3. How-
Green [13], ICA [19], PCA [20], CHROM [12], PBV [25], 2SR [27], and
ever, they designed a motion model and an appearance model based
POS [36], on their privately collected dataset involving different skin
on the DRM [12]. The appearance model guides the motion model to
tones, lighting changes, and motion challenges. Overall, POS exhibited
learn motion representation through an attention mechanism, and the
the best performance, mainly due to the physiological plausibility of
motion model uses the normalized difference between adjacent frames
the defined POS tone, which made POS particularly advantageous in
as input motion representation to simulate motion and color changes,
fitness challenges with high skin mask noise. They also demonstrated
which enhances DeepPhys’s motion robustness. Additionally, DeepPhys
that POS and CHROM performed well both in stationary and motion
introduced an attention mechanism to learn soft attention masks from
scenarios, although they may face challenges in distinguishing the pulse
raw video frames and assigned higher weights to skin regions with
from near-amplitude distortion.
stronger signals. This mechanism also visualizes the spatial–temporal
distribution of physiological signals. While DeepPhys outperforms HR-
3. Supervised deep learning methods CNN under various influencing factors such as lighting changes and
motion artifacts, it cannot capture the temporal information of rPPG
Supervised methods are a type of deep learning method. If a deep signals due to the lack of temporal information in 2D CNN. To address
learning method requires the use of real label values (ground truth) dur- this issue, Liu et al. [32] proposed MTTS-CAN, which builds upon
ing training, then it is classified as a supervised method. Currently, the DeepPhys by introducing the Temporal Shift Module (TSM) [41] to
majority of deep learning methods used for remote HR measurement capture temporal information. The TSM allows information exchange
are supervised methods, which often require a large amount of train- between adjacent frames and avoids complex convolution operations by
ing data containing ground truth, but exhibit excellent performance. moving blocks in tensors along the time axis, which enables the model
As various techniques of deep learning continue to develop, we will to capture temporal information to some extent. Unlike DeepPhys,
classify and introduce relevant supervised methods according to their the input of their appearance model is a frame obtained by averag-
respective techniques, and we will compare the performance of these ing adjacent multiple frames, instead of raw video frames, ensuring
methods in Section 5.2. the acquisition of temporal information. Results on the challenging
physiological dataset UBFC-rPPG [42] show that MTTS-CAN achieves
3.1. 2D convolutional neural network (2D CNN) methods significantly better results (RMSE is 2.72) than DeepPhys (RMSE is
10.82) and has a faster processing speed. Antecedent 2D CNN methods
Prior to being applied in the rPPG field, 2D CNN have been ex- have primarily focused on studying models and networks, without con-
tensively utilized in numerous computer vision methods, exhibiting sidering the use of prior knowledge for image preprocessing. However,
outstanding performance and successfully showcasing that these deep for HR estimation, variations in skin color reflect changes in blood
learning methods are often superior to traditional multi-stage methods flow and consequently reveal the heartbeat period. Thus, amplifying
that require manual feature engineering. Therefore, using a 2D CNN the changes in skin color will enhances the display of the heartbeat
framework for rPPG signal recovery is a viable option. Throughout the period. Taking into account this crucial prior knowledge, a novel 2D
entire development process of rPPG deep learning methods, 2D CNN CNN-based method, named EVM-CNN, combines the Eulerian Video
were first introduced into the rPPG field and successfully implemented Magnification (EVM) algorithm [43] with 2D CNN to extract facial
remote HR measurement, which has significant implications. We sum- color variations and estimate HR. The EVM module utilizes spatial
marize in Table 2 the rPPG methods that utilize a 2D CNN framework decomposition and temporal filtering to extract facial color variations,
for remote HR measurement. and broader bandpass filters to capture signals within the typical
Under the influence of the efficient performance of various 2D CNN range of human HRs. This generates feature maps, corresponding to
methods in the field of computer vision, Petlík et al. [37] proposed a specialized preprocessing technique. Extensive experiments on the

4
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 2
Analysis of 2D CNN methods.
Name Year Network Methods Description
HR-CNN 2018 2D CNN – A two-stage 2D CNN composed of an extractor and a
[37] HR estimator was developed to measure HR, which is
the first deep learning rPPG method.

DeepPhys 2018 2D CNN Attention A VGG-style 2D CNN was employed to jointly train
[39] motion and appearance models.

EVM-CNN 2018 2D CNN EVM EVM is employed to extract facial color variations,
[40] while a 2D CNN is utilized to estimate HR.

MTTS-CAN 2020 2D CNN Attention + TSM Using TSM to capture temporal information and
[32] utilizing attention mechanism to guide the motion
model.

Fig. 4. An illustration of spatial–temporal map generation from face video. A T-frame


RGB face video is converted into a three-channel image of size n × T.

Fig. 3. The architecture of DeepPhys.

MMSE-HR [44] dataset demonstrate the effectiveness of the proposed Fig. 5. The architecture of SynRhythm.
method, with an RMSE of 6.95 and a Pearson correlation coefficient
of 0.98. These results underscore the irreplaceable importance of prior
knowledge, even in the domain of deep learning methods.
The overall architecture is shown in Fig. 5. In addition to applying
spatial–temporal maps, they also employed transfer learning to train
3.2. Spatial–temporal map methods
the HR estimator by transferring the pre-trained model to the real
HR estimation task. The RMSE achieved on the MAHNOB-HCI dataset
Notwithstanding the successful utilization of 2D CNN for imple-
was 4.49, which is comparable to benchmark methods, suggesting
menting deep learning-based rPPG methods for remote HR measure-
that spatial–temporal maps effectively emphasize HR information while
ment, a prominent limitation of these methods is the absence of tem-
attenuating irrelevant signals. Niu et al. [46] proposed a new method
poral information. As rPPG signals exhibit periodicity, the temporal
information plays a crucial role in accurately estimating rPPG signals, called ST-Attention, which introduces an attention mechanism based on
rendering the lack of temporal information as one of the foremost spatial–temporal maps. They utilized spatial–temporal maps to obtain
constraints of 2D CNN methods. In order to mitigate this limitation, effective representations of rPPG signals from facial videos and used the
Niu et al. [45] introduced the notion of spatial–temporal maps. As attention mechanism to remove noise. The attention mechanism filters
illustrated in Fig. 4, for a video consisting of T frames, the detected out irrelevant features from video sequences and learns rich representa-
facial region is partitioned into an M × N matrix, which is further tions, thereby enhancing the effectiveness of spatial–temporal maps and
subdivided into n ROI blocks, with the assumption of alignment among improving remote HR measurement to some extent. They utilized all
distinct blocks. The utilization of average pooling operation aids in generated spatial–temporal maps to train the HR estimator and estimate
mitigating sensor noise in the HR signal. Specifically, let C(x, y, t) rPPG signals.
denote the value of the RGB channel at position (x, y) in the t-th frame. Complex spatial–temporalnap: The spatial–temporal map meth-
The average pooling value of the 𝑖th ROI block in each channel at the ods are typically constructed directly from RGB color channels, which
t-th frame 𝐶𝑖 (𝑡) can be expressed as: may result in generated spatial–temporal maps lacking sufficient pulse
∑ information. To address this limitation, Niu et al. [47] proposed a
𝑥,𝑦∈𝑅𝑂𝐼𝑖 𝐶(𝑥, 𝑦, 𝑡) new benchmark for spatial–temporal map methods called RhythmNet,
𝐶𝑖 (𝑡) = (1)
|𝑅𝑂𝐼𝑖 | as shown in Fig. 6. In RhythmNet, they convert facial images to
Where |𝑅𝑂𝐼𝑖 | represents the area of ROI block, i.e., the number of YUV color channels instead of traditional RGB channels to generate
pixels. Therefore, for each facial video, a 3 × n time series of length 𝑇 spatial–temporal maps, effectively separating the visual feature signals
can be obtained in the RGB channels, e.g., 𝐶𝑖 = {𝐶𝑖 (1), 𝐶𝑖 (2), … , 𝐶𝑖 (𝑡)}, of HR from a large amount of background signals. Furthermore, to
where represents one of the RGB channels and i represents the index of account for the temporal correlations in HR measurements in video
ROI. To fully utilize the information, min–max normalization is applied sequences, they utilized gated recurrent units (GRU) [50]. In addition
to each time series signal to scale the values to [0, 255]. Finally, n time to the color channel transformation approach proposed by [47], Song
series are arranged in rows to form a spatial–temporal map from the et al. [36] considered directly using rPPG signals to construct spatial–
original video sequence of size n × 𝑇 × 3, which serves as the input to temporal maps. They chose to extract rPPG signals from ROIs using the
the subsequent network. Table 3 summarizes the rPPG methods that CHROM method [12], and generated spatial–temporal maps based on
utilize spatial–temporal map. the preliminary estimated rPPG signals, resulting in spatial–temporal
RGB spatial–temporal map: Based on the proposed concept of maps with stronger motion robustness and clearer structures for sub-
spatial–temporal map, Niu et al. [45] introduced the first rPPG method, sequent CNN to learn from. Similar to the idea of Song et al. [36],
SynRhythm, which utilizes spatial–temporal maps for HR measurement. Hao et al. [49] chose to use the POS method [28] for initial rPPG

5
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 3
Analysis of spatial–temporal map methods.
Name Year Network Methods Description
SynRhythm 2018 2D CNN Spatial–Temporal Map The first method for remote HR measurement
[45] using spatio-temporal map and addressing data
scarcity through transfer learning.

ST-Attention 2019 2D CNN Spatial–Temporal Map + HR estimation using spatio-temporal Map with
[46] Attention Noise removal using attention mechanism.

RhythmNet 2020 2D CNN Spatial–Temporal Map + GRU are used to consider the relationship between
[47] GRU adjacent HR measurements in video sequences,
and a combined approach of 2D CNN and GRU is
employed for HR estimation.

CVD [48] 2020 2D CNN Spatial-Temporal Map + Removing noise in spatial–temporal map via
Disentangled Feature Cross-Validated feature disentanglement, supervised
Learning simultaneously using rPPG signal and HR.

NAS-HR [49] 2021 2D CNN Spatial–Temporal Map + Using NAS to find a lightweight optimal 2D CNN
NAS to estimate HR from spatial–temporal map.

Fig. 6. The architecture of RhythmNet.

Fig. 7. The architecture of CVD.


signal extraction and spatial–temporal map construction, and proposed
a new method called NAS-HR. NAS-HR defined ROIs based on facial
landmarks, extracted raw temporal pulse signals from the RGB channels methods make up for this shortcoming to a certain extent, but this
of each ROI, and used the POS algorithm to extract rPPG signals, still cannot solve the problem from the root. 3D CNN can analyze
which were combined with the raw temporal pulse signals of the R the spatial and temporal characteristics of the video at the same time,
and G channels to create spatial–temporal maps. Notably, they em- which is very suitable for the characteristics of the rPPG signal, which
ployed neural architecture search (NAS) [51] to discover an optimal is beneficial to remote HR measurement. Therefore, researchers in-
lightweight 2D CNN for HR estimation from spatial–temporal maps, troduced 3D CNN and proposed a series of 3D CNN-based methods
instead of designing a new 2D CNN. The test results on the PURE for remote HR measurement. In 3D CNN methods, researchers often
dataset showed that NAS-HR achieved an RMSE of 2.02, outperforming focus on rPPG signal extraction, and excellent rPPG signal quality will
the initial spatial–temporal map method SynRhythm [45], validating produce excellent rPPG signal quality. The HR measurement effect of
the feasibility of utilizing NAS for backbone network search. the 3D CNN method introduced in this paper is summarized in Table 4.
Despite the significant improvement in the estimation quality of
rPPG signals, spatial–temporal maps are also accompanied by a sub- 3.3.1. Basic 3D CNN methods
stantial amount of noise, which limits the information obtained from The basic 3D CNN methods refers to the use of 3D CNN as back-
heavily noisy spatial–temporal maps. In this regard, Niu et al. [48] pro- bone networks, with some preprocessing and post-processing steps
posed a novel method, referred to as CVD, based on spatial–temporal to enhance the effectiveness of the backbone network, without in-
maps, which employs cross-validated feature demixing to eliminate troducing other mechanisms. These methods often excel in terms of
noise. As illustrated in Fig. 7, CVD converts the input facial videos computational complexity and have lower computational costs, but
into multiscale spatial–temporal map, referred to as MSTmap, which there is still significant room for improvement in terms of performance.
distinctively differ from traditional spatial–temporal maps as they re- Bousefsaf et al. [53] proposed the first rPPG method based on 3D
tain the predominant temporal features of periodic physiological signals CNN, abbreviated as 3D CNN, which is a typical 3D CNN Method, as
while suppressing irrelevant background and noise features. Paired shown in Fig. 8. In their method, videos are treated as consistent
MSTmaps are then utilized as the input to an autoencoder architec- collections of frames, and the raw video streams are directly input
ture equipped with two encoders, one for physiological information into the 3D CNN backbone network without prior image processing
and the other for non-physiological information, followed by a cross- steps, such as automatic face detection and tracking. The 3D CNN can
validation scheme to obtain demixed physiological features separated extract features from the unprocessed video streams and input them
from non-physiological features. These demixing features are subse- into a multilayer perceptron (MLP) for HR regression. Interestingly,
quently employed for the joint prediction of multiple physiological the results of training and testing on the UBFC-rPPG dataset showed
signals, such as average HR values and rPPG signals. Importantly, CVD that the RMSE of the 3D CNN was only 8.64, lagging behind some
is the first to incorporate real PPG signals and average HR values for traditional 2D CNN methods. This indicates that even though the
supervised training simultaneously. Results from training and testing characteristics of the 3D CNN align with the characteristics of rPPG
on their proprietary dataset OBF [52] demonstrate an RMSE of 1.26 signals, fine processing and operations are still needed to achieve good
for CVD, which validates the feasibility of this approach. results. In addition, they proposed a data augmentation method that
synthesizes rPPG signals along with randomly generated noise using
3.3. 3D convolutional neural network (3D CNN) methods vector repetition, in order to compensate for the lack of data. However,
this kind of data may not be conducive to network training. The method
Conventional 2D CNN methods lack the ability to learn temporal proposed by Bousefsaf et al. [53] did not fully exploit the potential
contextual features of facial sequences, although spatio-temporal map of 3D CNN, and the results were not as convincing as compared to

6
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 4
Analysis of 3D CNN methods.
Name Year Network Methods Description
3D CNN [53] 2019 3D CNN Data Augmentation The pioneering use of 3D CNN for signal extraction,
data augmentation methods for generating videos
with synthetic rPPG signals, and a multi-layer
perceptron for HR estimation.
PhysNet [54] 2019 3D CNN – Comparing the spatiotemporal network of 2D
CNN+RNN and 3D CNN, indicating that 3D CNN is
more suitable for rPPG methods.

rPPGNet [55] 2019 3D CNN Attention A two-stage 3D CNN method that can not only
estimate rPPG signals, but also overcome the
challenges of highly compressed facial videos.

HeartTrack 2020 3D CNN Attention 3D CNN combined with soft attention mechanism and
[56] hard attention mechanism for signal extraction.

AutoHR [57] 2020 3D CNN NAS Using NAS to automatically find suitable backbone
3D CNN for rPPG signal extraction.

DeeprPPG 2020 3D CNN – Using different skin regions as input for rPPG signal
[58] estimation, allowing for customizable ROI selection
and broader applications.

Siamese-rPPG 2020 Siamese Spatiotemporal Aggregation Using Siamese network with two different facial
[59] 3D CNN regions, cheek region and forehead region, as ROIs,
each corresponding to a 3D CNN for rPPG signal
estimation.

ETA-rPPGNet 2021 3D CNN Attention Proposed an ETA module that utilizes effective
[60] temporal domain attention to improve the accuracy
and stability of HR estimation, using 3D CNN for
rPPG signal estimation.

SAM-rPPGNet 2021 2D CNN + Attention Proposed a SAM for learning salient features to
[61] 3D CNN reduce head motion noise, used in conjunction with
3D CNN for signal estimation.

select the most relevant facial regions based on spatial and temporal in-
formation, similar to the characteristics of attention mechanisms. There
are two types of attention mechanisms: hard attention mechanisms
and soft attention mechanisms. Soft attention mechanisms generally
show better performance, but hard attention mechanisms have lower
computational costs. HeartTrack [56] is a 3D CNN approach that
Fig. 8. The architecture of 3D CNN. combines both types of attention mechanisms. In HeartTrack, attention
mechanisms are used to enhance the denoising capability of 3D CNN.
The hard attention mechanism helps HeartTrack to ignore irrelevant
previous 2D CNN methods. However, the new benchmark-level 3D CNN background information, while the soft attention mechanism helps to
method PhysNet proposed by Yu et al. [54] subsequently demonstrated filter out occluded regions. In extensive experiments on the UBFC-rPPG
the advantages of 3D CNN methods. PhysNet also does not perform dataset, HeartTrack achieves the best RMSE of 3.37, outperforming
preprocessing operations, and directly inputs the raw RGB video frames the initial 3D CNN method [53]. Heavily compressed videos can pose
into the 3D CNN backbone network. However, their backbone network challenges for the backbone network of 3D CNN to capture salient fea-
can effectively learn temporal and spatial contextual features of facial tures from facial videos, resulting in degraded quality of extracted rPPG
sequences, and directly outputs the rPPG signal without the need for signals. To overcome this issue, Yu et al. [55] proposed a two-stage 3D
post-processing operations. In order to compare the performance of 3D CNN approach, consisting of two 3D CNNs for different tasks. One is
CNN and 2D CNN, they proposed a 2D CNN version of PhysNet for called Spatio-Temporal Video Enhancement Network (STVEN), which
comparison. Experimental results on the private OBF dataset showed is responsible for video enhancement, and the other is called rPPGNet,
that the RMSE corresponding to the 2D CNN version of PhysNet was which serves as the backbone network for rPPG signal estimation.
This two-stage approach can effectively handle heavily compressed
2.94, while the RMSE corresponding to the 3D CNN version of PhysNet
facial videos, as shown in Fig. 9. STVEN enhances video quality
was 1.81, indicating a significant performance difference, establishing
and retains as much information as possible from compressed facial
the important position of 3D CNN in rPPG signal estimation. Interest-
video inputs. Within the backbone network rPPGNet, an attention
ingly, PhysNet also considered for the first time the application of rPPG
mechanism is applied to extract dominant rPPG features from the skin
signals to emotion recognition.
region. rPPGNet can extract rPPG signals independently or be jointly
trained with STVEN for better performance. Experimental results show
3.3.2. Attention mechanism methods that rPPGNet performs excellently and demonstrates strong robustness
Facial videos often contain redundant information, and motion arti- in handling compressed videos. The RMSE obtained on the heavily
facts introduced by body movements can result in significant biases in compressed dataset MAHNOB-HCI [62] is 5.93 for rPPGNet, while the
estimating rPPG signals. To address these limitations and obtain more baseline method PhysNet [53] achieves an RMSE of 8.76, validating
stable rPPG signals, researchers have introduced attention mechanisms the effectiveness of STVEN and rPPGNet.
in video-based rPPG estimation. These attention mechanisms help to Attention in other modules: Different from the use of atten-
learn salient features related to facial information in videos, allowing tion mechanism in the backbone 3D CNN, Hu et al. [61] proposed
the model to focus on relevant information and reduce motion artifacts. a spatial–temporal attention module (SAM) based on 3D CNN for
Attention in backbone: The effectiveness of 3D CNN in denoising learning salient features. They further proposed a 3D CNN approach
can be attributed to attention mechanisms, which enable the model to called SAM-rPPGNet, which incorporates the attention mechanism.

7
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Fig. 9. The architecture of rPPGNet.

Fig. 12. The architecture of Siamese-rPPG.

gradient-based NAS methods [51,64], resulting in the identification of


Fig. 10. The architecture of SAM-rPPGNet. a backbone 3D CNN network for estimating rPPG signals. It is worth
mentioning that they designed a hybrid loss function with time and
frequency constraints, and adopted two data augmentation strategies
to assist the backbone 3D CNN network identified by NAS. Experi-
mental results on the large-scale dataset VIPL-HR [65] demonstrated
that AutoHR achieved MAE of 5.68 and RMSE of 8.86, indicating the
satisfactory performance of the backbone network identified by NAS.
The basic idea of Siamese network [66] is to simultaneously operate
Fig. 11. The architecture of ETA-rPPGNet.
on two samples. In previous rPPG methods, usually only one ROI was
used for single-line operation. However, Tsou et al. [59] proposed a
Siamese network-based method called Siamese-rPPG, where they se-
SAM-rPPGNet comprises three modules: the facial feature extraction lected two different facial regions with more rPPG information, namely
module, the rPPG signal extraction module, and the rPPG signal pro- the forehead and cheek regions, as ROIs for simultaneous dual-line
cessing module, as illustrated in Fig. 10. The facial feature extraction operation, as illustrated in Fig. 12. In Siamese-rPPG, the two ROIs
module is designed to extract facial features from input video frames correspond to the forehead branch and the cheek branch, respectively,
and apply aggregation functions to merge long-range spatiotemporal both of which are 3D CNN with the same structure for signal extraction.
feature maps into short-segment spatiotemporal feature maps. It is a Weight mechanisms are also applied to these two branches, so that
spatiotemporal facial feature extraction module based on short-segment Siamese-rPPG can still extract signals using other regions even if the
modules, which reduces redundant spatial information and enhances cheek or forehead region is contaminated with noise. The outputs
the ability to integrate information from long-range facial videos. The of the two branches are fused by addition operation, followed by
signal extractor module utilizes a specialized signal extractor with two one-dimensional convolutions and an average pooling to generate
multiple spatiotemporal convolutions [63]. Additionally, it applies the the predicted rPPG signal. Results from training and testing on the
spatiotemporal strip pooling method and SAM to the extracted rPPG COHFACE dataset [67], which contains severe compression challenges,
signal to adapt to head motion and avoid ignoring important local showed that Siamese-rPPG achieved an RMSE of 1.29, demonstrat-
information. Testing results on the PURE dataset with challenging head ing its superior performance on the COHFACE dataset. Similarly to
motion challenges demonstrate that the RMSE of SAM-rPPGNet is 1.21. Siamese-rPPG [59], Liu et al. [58] incorporated the concept of utilizing
To further mitigate head motion noise, Hu et al. [60] proposed a multiple different ROIs in their proposed method called DeeprPPG.
novel approach called ETA-rPPGNet, which focuses on the problem of However, in contrast to Siamese-rPPG, which utilized only two pre-set
redundant video information extraction, as shown in Fig. 11. They ROIs, DeeprPPG offers the flexibility of customizing multiple different
constructed a Time-Domain Segment Subnet using attention mecha- skin regions as ROIs. Each different skin region ROI is used as input
to the 3D CNN backbone network for rPPG signal extraction. Liu
nism to divide the video into several segments and fed them into the
et al. [58] also designed a spatio-temporal rPPG aggregation strategy
temporal domain subnet to extract important spatial facial features
to adaptively aggregate rPPG signals from multiple skin regions into
and aggregate temporal information separately. At the same time, a
the last skin region. However, as different regions may introduce
temporal attention mechanism was designed in the backbone 3D CNN
different noises, they employed a spatio-temporal aggregation function
network, where one-dimensional convolutions were used to effectively
to mitigate the effects of noise-contaminated regions and improve the
model information correlations in the local temporal domain, thereby
robustness of DeeprPPG. Experimental results on the PURE dataset
enhancing the learning of temporal information and reducing the bias
demonstrated that DeeprPPG achieved an impressively low RMSE of
in rPPG signal extraction caused by noise. Similarly, when trained and
0.43, with a Pearson correlation coefficient approaching 1, indicating
tested on the PURE dataset, ETA-rPPGNet achieved an RMSE of 0.77,
exceptional performance.
surpassing SAM-rPPGNet [61] significantly.
3.4. Recurrent neural network (RNN) methods
3.3.3. Hybrid methods
In order to enhance the capability of 3D CNN methods for extracting Compared to CNN, RNN is commonly used for tasks involving
rPPG signals, researchers have started to combine techniques from temporal information. In the field of rPPG, researchers have leveraged
other fields with 3D CNN, referred to as hybrid methods. Similar to RNN to better utilize the temporal context of rPPG signals. RNN can
NAS-HR, which utilizes NAS to search for 2D CNN backbone net- be used independently or in combination with CNN. Integrating RNN
works [49], Yu et al. [57] proposed a new method called AutoHR with 2D CNN allows the 2D CNN to capture temporal information,
based on 3D CNN, in which they used NAS to search for the most while combining RNN with 3D CNN further enhances information
suitable 3D CNN backbone network for rPPG signal extraction. How- extraction. Currently, two key variants of RNN, namely LSTM [68] and
ever, unlike NAS-HR, AutoHR incorporates a special 3D convolution GRU [50], are commonly used with varying suitability for various tasks.
operation called temporal difference convolution (TDC) to assist in In the field of rPPG, researchers often employ LSTM to explore novel
tracking the ROI and enhance robustness in the presence of motion and directions for improving the quality estimation of rPPG signals. Table 5
low-light conditions. Moreover, the NAS analysis in AutoHR employed summarizes the rPPG methods that utilize RNN.

8
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 5
Analysis of RNN methods.
Name Year Network Methods Description
Bian et al. [69] 2019 LSTM – The first method of rPPG using LSTM, traditional
methods extract rough signals, and two-layer LSTM is
used to filter signals.

TWO-STREAM 2019 TWO-STREAM Spatial–Temporal TWO-STREAM network is employed to estimate HR


[70] +LSTM Map by connecting two networks for different tasks.

Botina et al. [71] 2020 LSTM – A Long Short-Term Deep-Filter is proposed for
filtering rPPG signals.

Huang et al. [72] 2020 2D CNN + – 2D CNN for spatial feature extraction, LSTM for
LSTM temporal information capturing, utilizing fully
connected layers for HR estimation.

RhythmNet [47] 2020 2D CNN + Spatial–Temporal GRU is used to consider the relationship between
GRU Map adjacent HR measurements in video sequences, and a
combined approach of 2D CNN and GRU is employed
for HR estimation.

Meta-rPPG [73] 2020 2D CNN + Meta Learning Utilizing a transfer meta-learner to acquire unlabeled
LSTM data for rapid adaptation to diverse sample
distributions, employing a 2D CNN with BiLSTM
spatiotemporal architecture for signal extraction.

PRNet [74] 2021 3D CNN + – A one-stage Remote HR Measurement Framework by


LSTM Combining 3D CNN with LSTM.

Fig. 14. The architecture of Bian et al.’s method.

Deep Filter (LSTM-DF) for rPPG signal filtering. The LSTM-based LSTM-
DF can learn the feature shapes of rPPG signals, especially the temporal
structure of rPPG signals, thereby reducing noise in the rPPG signals
and improving their quality. Experimental results showed that using
Fig. 13. The memory function of LSTM.
traditional methods to extract rough signals and then filtering them
with LSTM improved the performance to some extent, but the signals
extracted by traditional methods were still too rough, making it difficult
3.4.1. LSTM methods to achieve excellent signal quality even with filtering.
A typical LSTM unit undergoes several basic operations to retain or LSTM for estimating: Compared with traditional methods, CNN-
forget certain information. The retained information can be interpreted based methods have been shown to yield better quality rPPG signals,
as the cell state, while the forgotten information can be interpreted as thus the combination of LSTM and CNN has been proven to significantly
the hidden state, which is a key concept of LSTM. The cell state can improve the estimation quality of rPPG signals. Huang et al. [72]
retain relevant information throughout the entire process of input time proposed a new approach that combines 2D CNN with LSTM. In their
series, while the hidden state contains information from previous data. method, 2D CNN is used to extract spatial features and local temporal
These two states can effectively utilize the contextual information of information from each frame’s ROI in the input video, while LSTM is
rPPG signals. The memory function of LSTM can be illustrated by the
used to capture global temporal information from consecutive frames.
following diagram, where [𝑋0 , 𝑋1 , 𝑋2 ] represents the input sequence,
The output of LSTM is then directly fed into a fully connected layer
[𝐻0 , 𝐻1 , 𝐻2 ] represents the corresponding hidden states (cell states),
for HR estimation, bypassing the step of rPPG signal extraction to save
and [𝑌0 , 𝑌1 , 𝑌2 ] represents the outputs. In Fig. 13, the color of the
computation time. Experiments on the UBFC-rPPG dataset showed that
matrix blocks (green, red, blue) represents the different information
the RMSE reached 2.84, and the HR can be updated in about one
contained in the input time series at t = 0, 1, 2. When t = 2, the
second using this method. However, Huang et al. [72] chose to bypass
previous input information can also flow to the last hidden state or
the step of signal extraction using 2D CNN and directly estimated
directly output.
LSTM for denoising: In summary, the use of LSTM can effectively HR using a fully connected layer, which may introduce some errors.
filter out noisy signals and retain useful signals in a data-driven man- Wang et al. [70] used 2D CNN as the backbone network for feature
ner. Bian et al. [69] proposed the first rPPG method that utilizes LSTM. extraction and designed a two-stream network with separate streams
They proposed training a two-layer LSTM to filter out rough rPPG for feature extraction and rPPG signal extraction, corresponding to two
signals, as illustrated in Fig. 14. Instead of directly estimating the rPPG different tasks, as shown in Fig. 15. In the TWO-STREAM approach,
signal using LSTM, they chose to first estimate the rough rPPG signal spatiotemporal maps are used as inputs, which are generated from
from facial videos using traditional methods, and then input it into the the input video and fed into the feature extraction stream and rPPG
trained two-layer LSTM for filtering, resulting in a refined version of signal extraction stream. The feature extraction stream is based on 2D
the original rough signal. During the training of the two-layer LSTM, CNN, used to extract synchronized spatial features from the spatio-
they generated a large number of synthetic signals with significant temporal maps, thereby improving the robustness of face detection
noise using an algorithm to enhance the model’s generalization ability. and reducing ROI alignment errors. The rPPG signal extraction stream
Similarly, considering the use of LSTM for signal filtering, Botina- consists of a combination of 2D CNN and LSTM, where 2D CNN is
Monsalve et al. [71] specifically designed a Long Short-Term Memory used for initial rPPG signal extraction, and a two-layer LSTM is used

9
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Fig. 15. The architecture of TWO-STREAM.

Fig. 17. The basic architecture of GAN.

3.4.2. GRU methods


In fact, the GRU can be considered a simplified version of the LSTM.
While LSTM utilizes both the cell state and hidden state to retain
all past and current information, GRU faces a binary choice between
current and past information, discarding most of the past information
if the current information is chosen. GRU introduces two information
Fig. 16. The architecture of Meta-rPPG. control gates: the update gate and the reset gate, where the reset gate
determines the amount of information to be forgotten, and the update
gate determines the amount of information to be continued. Due to
for further refinement of the initial signals. The outputs of the two the characteristics of the rPPG signal, LSTM is more appropriate for
streams are concatenated and used as input for the HR estimator. estimating the rPPG signal. To our understanding, among all the rPPG
Experimental results on the COHFACE dataset demonstrated that the methods used for remote HR measurement, only RhythmNet [47] has
RMSE of TWO-STREAM was 9.96. utilized GRU, which combines 2D CNN, spatio-temporal map, and GRU.
In this section, we will mainly focus on the GRU part of RhythmNet.
Hybrid LSTM: In order to enable models to adapt to unforeseen
For further information on other parts of RhythmNet, please refer to
variations during testing, meta-learning [75] has been proposed for fast
Section 3.2. Considering the prior knowledge that the changes between
adaptation, and LSTM is often applied in meta learning to leverage
adjacent measurements are very small, GRU is used to model the rela-
its temporal characteristics. Lee et al. [73] proposed a new approach
tionship between adjacent HR measurements in the video. The features
called Meta-rPPG that combines meta learning with LSTM to enhance
extracted by the backbone 2D CNN are fed into GRU, which regresses
its ability to handle random variations. They proposed a transformation the HR values of individual video frames based on the relationship
meta-learner that collects unlabeled samples during the testing process between adjacent HR, constrained by a loss function that enforces
for self-supervised weight adjustment, as shown in Fig. 16. The Meta- smoothness in HR measurements. The final HR measurement result of
rPPG network consists of two parts: a 2D-based feature extractor and the video is generated based on the average of all the measured HR.
an LSTM-based rPPG estimator. The feature extractor is used to extract
latent features from two streams of facial images, and the extracted 3.5. Generative adversarial network(GAN) methods
features are passed to a Bidirectional LSTM (BiLSTM) network to
model temporal context, while the rPPG signal estimation is done Since their introduction in 2014, GAN [79] have emerged as a
by an MLP. Meanwhile, a synthetic gradient generator based on the prominent approach in the fields of image processing and computer
Shallow Hourglass network [76] is used for transductive learning to vision, renowned for their exceptional performance and effectiveness.
generate gradients for unlabeled data [77], in order to handle random GAN comprise of two neural networks, the generator G and the dis-
variations during the testing process. Experimental results on the DEAP criminator D, which are trained in an adversarial manner. As depicted
dataset [78], which contains various challenging factors, show that in Fig. 17, the generator G produces fake target signals to confuse
Meta-rPPG achieves an RMSE of 6.00, demonstrating good robustness. the discriminator, while the discriminator D assesses the generated
Due to the lack of temporal information in 2D CNN, LSTM is often com- signals and real signals to incentivize G to generate results that re-
bined with 2D CNN. However, Huang et al. [74] proposed a different semble real data. As one of the most commonly used and effective
approach by considering the combination of 3D CNN and LSTM, and generative models, GAN aim to learn the underlying data distribution
introduced a single-stage remote HR measurement framework called from a limited dataset. From a generative model perspective, estimating
rPPG signal can be considered as a generative problem. As a result,
PRNet. PRNet defines HR estimation as a regression task based on deep
researchers have explored the integration of GAN into the field of rPPG
neural networks, directly mapping videos to HR values. 3D CNN itself
to enhance the quality of rPPG signal estimation through adversarial
has the ability to capture both spatial and temporal features, so they use
training and achieve improved results for remote HR measurement.
3D CNN to extract spatial features from the defined ROI and capture
Table 6 summarizes the existing GAN-based methods for rPPG.
local temporal features to generate feature maps. The LSTM extrac-
Under the rapid development of GAN, Sabokrou et al. [80] intro-
tor further extracts global temporal features from the feature maps
duced GAN into the field of rPPG for the first time and proposed a
generated by 3D CNN to enhance the temporal feature information. novel GAN-based method called Deep-HR, which consists of a front-end
Interestingly, PRNet does not rely on power spectral density (PSD) module (FE) and a back-end module (BE), as illustrated in Fig. 18. The
and FFT algorithms to compute HR from rPPG signals, but directly FE utilizes Receptive Field Blockade (RFB) [81] to detect the ROI of the
designs a HR estimator to perform this operation, thus achieving good subject, and designs a GAN-based module to further enhance the ROI
computational speed. However, comparative experiments on the UBFC- and extract the rPPG signal. This module consists of two refined net-
rPPG dataset show that the RMSE of PRNet is 7.24, which is not works: a deep encoder–decoder network based on 2D CNN acts as the
superior to the combination of 2D CNN and LSTM as in the method generator to regenerate the detected ROI and generate the estimated
proposed in [72] (RMSE is 2.84), which may indicate the limited rPPG signal, and a CNN that understands the distribution of high-
applicability of 3D CNN and LSTM combination. quality ROI acts as the discriminator to supervise the generator. The

10
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 6
Analysis of GAN methods.
Name Year Network Methods Description
Deep-HR [80] 2021 2D CNN + RFB Utilizing GAN-based modules for enhancing
DNN + GAN detected ROI and eliminating noise, employing 2D
CNN for signal extraction.

PulseGAN [82] 2021 GAN CHROM Employing GAN for signal filtering to generate
high-fidelity rPPG signal from coarse rPPG signal.

Dual-GAN [84] 2021 2D CNN + Spatial-Temporal BVP-GAN learns the denoising mapping from input
Dual GAN Map +Disentangled to real BVP, while Noise-GAN learns the noise
Feature Learning distribution, and they mutually promote each
other to enhance feature disentanglement.

Fig. 18. The architecture of Deep-HR.

estimation quality of the rPPG signal is enhanced through adversarial


learning between these two networks. The BE is a lightweight deep
regression autoencoder module based on Deep Neural Network (DNN)
used for HR estimation. The experimental results indicate that Deep-HR
achieves an RMSE of 3.41 on the well-established physiological dataset
MAHNOB-HCI, outperforming the contemporaneous spatio-temporal
map method RhythmNet (with an RMSE of 3.99) and the 3D CNN-
based method rPPGNet (with an RMSE of 5.93). In contrast to Deep-HR,
which directly uses GAN for rPPG signal extraction, Song et al. [82]
propose another approach using GAN for filtering rough rPPG signals,
and introduce a GAN-based rPPG method called PulseGAN. They obtain Fig. 19. The architecture of Dual-GAN.
a rough rPPG signal using the CHROM algorithm [12] on the delineated
ROI, and then use PulseGAN to generate realistic and high-quality rPPG
signals based on this rough signal as input. The structure of PulseGAN
is based on the conditional GAN (CGAN) approach [83], where the Fig. 20. Building upon the transformer, Swin transformer researchers
condition in PulseGAN is the rough signal generated by CHROM. Exper- have made improvements for computer vision tasks [86], such as
iments on the UBFC-rPPG dataset demonstrate that PulseGAN achieves ViT [87,88]. While the transformer structure includes both an encoder
an RMSE of 2.10 (see Fig. 18). and a decoder, ViT and Swin transformer only employ the transformer
Unlike typical GAN-based methods that employ a single GAN net- encoder for computer vision tasks. TransRPPG [89] utilizes ViT to
work, Dual-GAN [84] introduces two GAN networks, namely BVP-GAN extract rPPG features from preprocessed signal maps for face 3D mask
and Noise-GAN, for the estimation of rPPG signals, as illustrated in presentation attack detection, successfully showcasing the potential of
Fig. 19. Dual-GAN jointly models the blood volume pulse (BVP) pre- transformer in the field of rPPG. Subsequently, researchers have further
dictor and noise distribution, enabling physiological measurements investigated the suitability of transformer in rPPG methods for remote
based on rPPG. The objective of BVP-GAN is to learn a noise-resistant HR measurement, and we provide a summary of all rPPG methods
mapping from input to real BVP, while the objective of Noise-GAN is to utilizing transformer in Table 7.
learn the distribution of noise. In addition to modeling the BVP predic-
tor, this architecture explicitly models the noise distribution through 3.6.1. ViT methods
adversarial learning, enabling robust representation of pulse signals In the field of NLP, the transformer architecture is composed of an
even in videos with low visibility and stronger noise, enhancing the fea- encoder and a decoder, as depicted in Fig. 20. However, for computer
ture disentanglement of pulse signals and improving HR measurement vision tasks, the ViT [87] and Swin transformer [88] models exclu-
accuracy. Additionally, they propose a ready-to-use ROI alignment and sively employ the transformer encoder. The ViT model was specifically
fusion (ROl-AF) block to mitigate inconsistencies between different designed by Dosovitskiy et al. [87] for image recognition utilizing
ROIs and extract informative features from a wider acceptance field the transformer architecture. The visual transformer architecture is
of ROIs. It is worth mentioning that Dual-GAN has achieved highly illustrated in Fig. 21. As the transformer only accepts one-dimensional
competitive performance on multiple datasets and is currently one of sequences, the images are initially divided into equally sized patches
the top-performing methods, achieving an RMSE of 0.67 on UBFC-rPPG, and then flattened into a 2D patch sequence. Patch embeddings are
surpassing the previously best-performing method Siamese-rPPG [59] obtained through linear projection, and positional embeddings are
(RMSE of 0.97), and an RMSE of 1.31 on PURE, which significantly added prior to feeding these patch embeddings into the transformer
outperforms another GAN-based method PulseGAN [82] (RMSE of encoder. The transformer encoder retains the multi-head self-attention
4.29) and most deep learning-based methods. (MSA) mechanism of the original transformer, but also incorporates a
MLP. Finally, an MLP head module is employed for image classification.
3.6. Transformer methods ViT, as a prominent variant of the transformer architecture in the field
of computer vision, has emerged as the primary choice for researchers
The transformer, initially proposed by Vaswani et al. in [85] for in remote HR measurement tasks, distinguishing it from the original
sequence data modeling in natural language processing (NLP), con- transformer.
sists primarily of Multi-Head Attention (MHA) and Positionwise Feed- Pure ViT: Under the influence of the wide application of trans-
Forward Networks. The original transformer architecture is depicted in former in the field of computer vision, Ambareesh et al. [90] first

11
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Fig. 23. The architecture of PhysFormer.

Although Instantaneous_transformer [90] successfully applied trans-


former to the field of remote HR measurement, its performance is not
satisfactory and it did not fully utilize the advantages and capabilities
of transformer. Yu et al. [91] made full use of the advantages of
transformer and proposed a new method called Physformer for HR
measurement, reaching a new height in the performance of rPPG meth-
ods, as shown in Fig. 23. Physformer mainly uses a time-difference
transformer to explore long-range spatiotemporal relationships in rPPG
signal estimation, with good spatiotemporal representation ability for
local time-difference description and global spatiotemporal modeling.
The designed time-difference transformer utilizes time differences to
guide global attention-enhanced quasi-periodic rPPG features, and re-
fines local spatiotemporal representations for interference situations,
with the aim of obtaining globally refined local rPPG features. Phys-
former uses a face detector based on MTCNN [92] to perform initial ROI
selection on the given RGB facial video input and generates spatiotem-
poral tags, i.e., rPPG information blocks, as input to the time-difference
transformer. The time-difference transformer does not directly gener-
Fig. 20. The transformer-model architecture. The left half of the picture is an encoder
ate rPPG signals but feeds the obtained rPPG features into an rPPG
and the right half of the picture is a decoder.
prediction head for temporal upsampling and spatial averaging, thus
enhancing the rPPG features and projecting them into a 1D signal
to generate rPPG signals. Through extensive experiments, Physformer
has achieved excellent results on multiple datasets, with RMSE of
0.71 on the standard physiological dataset UBFC-rPPG, RMSE of 7.79
on the large-scale dataset VIPL-HR, and RMSE of 1.75 on the PURE
dataset, demonstrating good robustness and being one of the state-of-
the-art methods along with Dual-GAN [84]. To further improve the
performance of Physformer and achieve better results, Yu et al. [93]
proposed an enhanced version of Physformer called Physformer++.
PhysFormer++ is based on Physformer and incorporates a dual-channel
SlowFast architecture with complex cross-velocity interactions. Unlike
Physformer, which only uses the slow channel, Physformer++ extracts
Fig. 21. The ViT architecture. On the right is the structure of transformer encoder. and fuses attention features from both the slow and fast channels
and designs time-difference periodicity and cross-attention transformer
for the slow and fast channels, respectively, to enhance the dynamic
representation ability and robustness of periodic rPPG signals. Results
of training and testing on the large-scale VIPL-HR dataset show that
the RMSE of PhysFormer++ is 7.62, slightly better than Physformer
(RMSE of 7.79), but PhysFormer++ greatly increases the computational
complexity and cost.
Fig. 22. The architecture of Instantaneous_transformer. The excellent compatibility of ViT allows for easy integration with
other methods and strategies. Recently, Gupta et al. [94] proposed a
new method called RADIANT by combining signal embedding with
applied transformer to remote HR measurement and proposed a new transformer, as shown in Fig. 24. The purpose of introducing signal
rPPG method called Instantaneous_transformer, which has the ability embedding is to enhance the representation of rPPG features and
to estimate HR instantaneously. The architecture is shown in Fig. 22. suppress noise. They used traditional methods to extract time signals,
Instantaneous_transformer takes video input and uses a video trans- i.e., rough rPPG signals, from ROI regions, and used MLP layers to
former based on ViT to generate predicted rPPG signals. The video embed the time signals to preserve more rPPG information. The embed-
transformer consists of a spatial backbone and a frame-level temporal ded signals are only related to the corresponding time signals, without
aggregation module to help learn the temporal correlation of the bio- being influenced by other signals. Signal embedding provides better
signal waveform. The spatial backbone uses an architecture based on feature representation for the main transformer network, thus extract-
DeepPhys [39], and the temporal module uses a transformer-encoder ing higher-quality rPPG signals. Moreover, the generated embedded
architecture. Results of training and testing on the UBFC-rPPG dataset signals can be used for pre-training the transformer network to mitigate
show that the RMSE of Instantaneous_transformer is 13.94, which is the shortage of training data. Experimental results on the UBFC-rPPG
not superior to other learning-based methods in terms of performance. dataset showed that RADIANT achieved an RMSE of 4.52.
However, it has the advantage of fast computation, as it can perform MaxViT: Previous methods that did not use transformer may also
HR measurement 13 times within one minute. achieve good results, but they have certain limitations. For example, 3D

12
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 7
Analysis of transformer methods.
Name Year Network Methods Description
Instantaneous_ 2022 2D CNN + ViT – The first method utilizing transformer for
transformer [90] real-Time physiological estimation.
PhysFormer [91] 2022 ViT Temporal-Difference A temporal difference transformer for exploring
Learning long-range temporal–spatial relationships in rPPG
measurements.

PhysFormer++ [93] 2023 ViT Temporal-Difference The dual-channel slowFast architecture design with
Learning + SlowFast complex cross-speed interaction is added on the
basis of PhysFormer for more robust head motion.

APNET [95] 2022 MaxViT Axis Projection The concept of APNET is proposed, which obtains
information from each direction by projecting
videos onto different axes.

RADIANT [94] 2023 ViT Signal Embedding The domain-generalized rPPG network based on
decoupling feature learning is the first method
addressing domain generalization issue in rPPG
methods.

EfficientPhys [97] 2023 2D CNN + Swin TSM Eliminating preprocessing completely and
Transformer comparing 2D CNN-based and transformer-based
backbone networks.

With the remarkable computational efficiency of Swin transformer,


Liu et al. [97] proposed a novel approach called EfficientPhys, which
eliminates all preprocessing steps such as face detection, segmentation,
normalization, color space conversion, etc, for video inputs, and di-
Fig. 24. The architecture of RADIANT.
rectly utilizes the raw video frames as input. EfficientPhys is built upon
the 2D Swin transformer, which learns spatial features to map the origi-
nal RGB values to latent representations of individual frames and target
signals. However, 2D Swin transformer lacks the capability to model
temporal relationships beyond consecutive frames. Therefore, Liu et al.
added a TSM [41], similar to MTTS-CAN [32], before each 2D Swin
Fig. 25. The architecture of Swin transformer. transformer block to facilitate information exchange across the time
axis. It is worth noting that their TSM does not introduce any trainable
parameters, thus the proposed transformer architecture has the same
number of parameters as the original Swin transformer. Experimental
CNN methods require a large amount of memory, spatiotemporal graph
results on the UBFC-rPPG dataset showed that EfficientPhys achieved
methods require pre-processing, and methods using DRM cannot learn
a RMSE of 1.81, demonstrating good performance even without any
long-term temporal characteristics, etc. Kim et al. [95] proposed the
preprocessing.
concept of Axis Projection Network (APNET) using the latest variant of
ViT called MaxViT [96], which addresses these limitations. By project-
3.7. Data augmentation methods
ing videos onto different axes, data information from each direction
is obtained, which is the new time series feature analysis method of
Due to the high cost of collecting high-quality medical video
APNET. APNET consists of axis feature extractors, feature mixers, and
datasets, the data quality of most datasets is limited. To alleviate the
PPG decoders, which are feature extractors based on MaxViT with
problem of insufficient well-annotated training data, data augmen-
the same shape. Compared with ViT, MaxViT has the advantage of
tation has been widely adopted. In addition to traditional augmen-
learning both global and local features simultaneously. The axis feature
tation strategies such as horizontal flipping, rotation, and cropping,
extractors are designed to extract specialized features along each axis,
learning-based automated data generation processes have been shown
and their role is to extract features from each axis of the video. The
to significantly improve object detection and image classification tasks.
feature mixers combine the outputs of feature extractors and calculate
Distinct from data augmentation methods aimed at expanding data,
the optimal features. The PPG decoders process the optimal features to the data augmentation methods introduced in this section refer to
generate rPPG signals. Experimental results on the UBFC-rPPG dataset approaches aimed at enhancing the estimation performance of rPPG
showed that APNET achieved an RMSE of 0.77, which is close to the signals, thus improving the effectiveness of remote HR measurement.
excellent performance of PhysFormer (RMSE of 0.71), validating the Table 8 summarizes the data augmentation methods introduced in this
powerful capability of MaxViT. study.
Composite video: The Multi-task [98] method is the first approach
3.6.2. Swin transformer methods that focuses on generating synthetic videos with specific rPPG signals.
In the realm of computer vision, Liu et al. [88] proposed a visual To achieve this goal, they designed a multi-task framework consisting
transformer variant called Swin Transformer, which employs a sliding of three networks. The first network is a signal extractor, which is a 3D
window mechanism to enhance computational efficiency. They de- CNN that directly extracts rPPG signals from input facial videos. The
vised hierarchical feature maps to obtain multi-resolution feature maps, second network is a reconstruction network that generates synthetic
thereby making Swin transformer a versatile backbone for computer videos from real images. The third network is a synthesis network that
vision applications. The structure of Swin transformer is illustrated in generates videos from real videos. The reconstruction and synthesis
Fig. 25, where the Swin transformer Block is composed of two sub- networks are enhanced with specified signals from another video to
modules: window multi-head self-attention (W-MSA) and shifted win- generate synthetic videos with enriched rPPG information, which is
dow multi-head self-attention (SW-MSA), which replace the multi-head then input to the signal extractor for signal extraction, thus benefiting
attention mechanism in ViT. from the data augmentation in extracting rich rPPG information. This

13
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Fig. 26. The architecture of RErPPGNet. Fig. 27. The architecture of PRN augmented.

data augmentation approach not only facilitates signal extraction but


also mitigates the issue of limited training data. Experimental results
on the heavily compressed dataset COHFACE show that the RMSE of
Multi-task reaches 1.65, which outperforms most methods tested on
COHFACE, demonstrating the effectiveness of this data augmentation
approach in the face of the challenge of video compression. Inspired
Fig. 28. The architecture of AND-rPPG.
by Multi-task [98], Hsieh et al. [99] proposed a new method called
RErPPGNet, which considers using a novel double cycle consistent
constraint [100] for video synthesis, as shown in Fig. 26. RErPPGNet
consists of two networks: RemovalNet and EmbeddingNet. For a given 3.8. Other methods
video, RemovalNet is used to remove all rPPG signal information from
the original video, and then EmbeddingNet is used to embed specific Researchers have proposed various sub-categorical methods, which
rPPG signals into the video, resulting in a video with strong rPPG are not conveniently classified, in addition to the broad categories of
information, which leads to excellent estimation of rPPG signals in methods mentioned earlier, in order to improve the performance and
the enhanced video. Different double cycle consistent constraints are effectiveness of remote HR measurement. This section will provide an
employed for these two networks to ensure the well-preserved rPPG- overview of these sub-categorical methods, and Table 9 summarizes
related information. It is worth noting that extensive experiments on these methods for easy reference.
the UBFC-rPPG and PURE datasets show that the data augmenta- Enhanced denoising: Head motion and illumination changes have
tion approach of RErPPGNet significantly improves the accuracy, with always posed challenges for researchers in the field. Unlike most ap-
an RMSE of 0.56 on UBFC-rPPG and 0.54 on PURE, making it the proaches that use attention mechanisms to focus on regions containing
best-performing method on the UBFC-rPPG dataset. rich rPPG information, Nowara et al. [103] proposed a novel reverse
Enhanced video: In order to enhance the rPPG feature information attention mechanism that utilizes regions with minimal rPPG informa-
contained in facial videos, Yue et al. [101] proposed a data aug-
tion, such as the face, hair, and background, to reduce noise caused by
mentation approach that jointly utilizes an rPPG information restora-
head motion and illumination changes. They used the inverse of the
tion network, rPPGRNet, to enhance the facial video resolution, and
attention mask learned by LSTM to generate noise estimation, which
a feature-enhanced network, THRNet, to improve the discriminative
was then used for denoising the temporal signal. The denoised signal
power of facial images. The method consists of two stages. In the first
stage, rPPGRNet captures subtle and significant differences between the was input into a convolutional attention network to learn which regions
original low-resolution images and their temporally adjacent frames of the video contain physiological signals and generate preliminary
and designs an rPPG loss to guide the restoration of color changes estimates. The noise estimation was obtained by taking the pixel in-
related to HR, generating high-resolution (SR) facial images with en- tensities of the inverse portion of the attention mask learned by LSTM,
hanced rPPG information. In the second stage, the SR images are which was then used to improve the estimation of physiological signals.
integrated and fed into THRNet for HR measurement. Here, THRNet Experiments on the classical physiological dataset MMSE-HR showed
utilizes multiple Time-Weighted Attention (TWA) blocks designed to that their method achieved an RMSE of 4.90 . Similarly, for denoising,
optimize the weight allocation across different channels in the temporal Lokendra et al. [104] introduced action units (AU) as a technique to
domain to extract distinguishable spatio-temporal features. TWA blocks denoise the temporal signal, and proposed a new method called AND-
can automatically learn and allocate weights to different channels of rPPG for remote HR measurement, as shown in Fig. 28, by combining
the feature maps in the temporal domain, enhancing the discriminative AU with a temporal convolutional network (TCN) architecture. They
power of facial images and suppressing minor noise in the facial images, designed a new network based on the TCN architecture called denoising
enabling THRNet to focus on indicating pulsation features and thus TCN, which consists of AUs. Each denoising TCN automatically learns
improving the HR measurement performance. On the DEAP dataset, the how to denoise, primarily targeting the temporal signals extracted
RMSE of rPPGRNet + THRNet reached 5.47. For facial video skin color, from specific facial regions that are disrupted by facial expressions.
previous remote HR measurement methods often did not require a They clipped the input video into several fixed-size non-overlapping
specific skin color, which could be yellow, white, or black. However, Ba segments, detected the entire face in each clip, and divided it into
et al. [102] found that dark skin color is more favorable for rPPG signal different regions to extract temporal signals, which were then denoised
extraction and proposed a new data augmentation approach, PRN aug-
using 𝑁 denoising TCNs. In addition, AND-rPPG also first utilized
mented, specifically from the perspective of skin color. The architecture
Delaunay triangle regions as ROI, allowing for more ROIs to be used
is shown in Fig. 27. In order to convert facial videos into a uniform
for manipulation. The results of training and testing on the UBFC-rPPG
dark style, they utilized style transfer technology and designed a skin
dataset showed that the RMSE of AND-rPPG reached 4.07, which is at
color generator based on 3D CNN to transform facial images of people
with different skin colors into a uniform dark skin video. The skin color a moderate level of performance.
generator is trained to learn visual appearance and subtle color changes Fusion multi-channel: Most existing methods for remote HR mea-
associated with potential blood volume changes, enabling high-fidelity surement from facial videos solely employ RGB channels, while rPPG-
dark skin color enhancement to benefit subsequent rPPG estimation FuseNet [105] is the first to fuse images from both RGB and Multispec-
networks. Experimental results on the UBFC-rPPG dataset with various tral Remote Sensing (MSR) channels for this purpose. The feasibility
skin colors demonstrate that PRN augmented has an RMSE of 1.31, of the approach of rPPG-FuseNet can be explained by prior knowledge
indicating strong robustness. It is worth noting that converting facial that RGB is sensitive to illumination changes but encodes information
skin color to a uniform dark style can also reduce bias between different about subtle facial features, whereas MSR images may be less sensitive
races to some extent. to subtle information but are robust to lighting changes. Based on this

14
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 8
Analysis of data augmentation methods.
Name Year Network Methods Description
Multi-task [98] 2021 3D CNN Data Augmentation A multi-task framework that proposes simultaneous
learning of rPPG signal extraction model and data
augmentation model has been introduced.

rPPGRNet + THRNet 2021 3D CNN Data Augmentation rPPGRNet is used to recover rPPG information, while
[101] THRNet is used to enhance discriminative features of
facial images and suppress small noise in facial
images.

RErPPGNet [99] 2022 3D CNN Data Augmentation + The utilization of a Double Cycle Consistent Learning
Double Cycle for data augmentation significantly enhances the
Consistent Learning estimation quality of signals.

PRN augmented 2022 3D CNN Data Augmentation + A 3D CNN-based skin tone generator for converting
[102] Style Transfer facial images with different skin tones into a
consistent dark-toned style.

Fig. 29. The architecture of Hu et al.’s method.

Fig. 30. The architecture of Arbitrary_Resolution_rPPG.


prior knowledge, rPPG-FuseNet preprocesses the input video frames to
separate RGB and MSR images, performs facial detection using 68 facial
landmarks, and fuses the selected ROI. The fused ROIs are partitioned two proposed modules are designed for processing temporal and spatial
into k blocks, and the RGB values of each block are connected to form flows, respectively. In the temporal flow, the TFA interpolates frames to
a spatio-temporal map. The generated spatio-temporal map contains the same shape to generate temporally aligned features containing rich
information from both RGB and MSR images and demonstrates good temporal information. In the spatial flow, the PFE adaptively encodes
robustness to different lighting conditions and low-light scenarios. face frames of arbitrary resolutions into fixed-resolution face structural
Experimental results on the VIPL-HR dataset with challenging illumi- features. The temporal features and facial structural features generated
nation variations showed that rPPG-FuseNet achieved an impressive by TFA and PFE can be fed into the backbone network of most rPPG
RMSE of 8.03. signal estimation methods to estimate higher-quality rPPG signals. The
Embedded modules: Embedded modules refer to modules that RMSE of TFA and PFE combined with their own designed backbone
can be easily integrated into most networks to improve network per- network on UBFC-rPPG is 1.62. For PhysFormer, which combines TFA
formance and effectiveness. Hu et al. [106] proposed two embedded and PFE, the RMSE reaches 1.72 when the facial video resolution is set
modules, namely the Attention Module (AM) and the Temporal Fu- to 128 × 128, outperforming the original PhysFormer (RMSE is 2.41).
sion Module (TFM), as foundational modules of their network, and Domain generalization: In the spatio-temporal map method CVD
utilized them to propose a new method that fully exploits temporal [48], researchers have previously employed disentangled feature learn-
information, as illustrated in Fig. 29. The TFM module consists of an ing to denoise spatiotemporal maps. Inspired by CVD, Chung et al.
aggregation branch and a temporal mask branch (TMB), designed to [108] propose a novel Domain Generalized rPPG Network, DG-rPPGNet,
fully utilize temporal information, reduce redundant video information, which is the first method to address the domain generalization problem
and enhance the correlation of long-distance videos. The TFM mod- in rPPG signal estimation. They view the cross-dataset testing problem
ule addresses the redundancy of information caused by slow content as a domain generalization problem, assuming that different ‘‘domains’’
refer to different features in rPPG benchmarks (e.g., lighting conditions
changes in long-distance videos and fuses information in the temporal
or camera devices), and any domain shifts (e.g., video-to-video transi-
dimension. The AM module, composed of residual attention mechanism
tions, lighting modifications, and noise perturbations) can significantly
and key module stacking, utilizes soft attention mechanism to learn
degrade the estimation quality of rPPG signals. To achieve the goal of
attention weights, enabling the module to focus more on regions with
domain generalization, they devised a feature disentanglement learning
strong physiological amplitude. This feature effectively avoids further
framework that separates rPPG, identity (ID), and domain features from
selection of ROI in facial videos and enhances the robustness of the
input facial video data to address variations across different domains.
backbone network to actions and spatial context information of moni-
To further bolster disentangled feature learning, they devised a novel
toring targets. It is worth noting that to facilitate the learning of weights domain permutation strategy to ensure that the disentangled rPPG
between channels, the POS algorithm [28] was used to project RGB features remain invariant across different source domains. Furthermore,
images and add motion representations to complement the extraction they proposed an adversarial domain augmentation strategy that ex-
of physiological signals. Experimental results on the PURE dataset show pands the domain scope during model training, extending the model to
that their method achieved an RMSE of 0.48, making it one of the domains with low visibility. The proposed disentangled feature learn-
best-performing methods on the PURE dataset, fully demonstrating the ing framework in conjunction with domain permutation and domain
strong capabilities of the AM and TFM modules. augmentation shows potential for addressing the challenge of domain
Building on the embedded modules AM and TFM proposed by generalization to some extent. In cross-dataset experiments trained on
Hu et al. [106], Li et al. [107] recently proposed two additional the PURE and COHFACE datasets and tested on the UBFC-rPPG dataset,
convenient plug-and-play embedded modules: the Physiological Signal DG-rPPGNet achieves an RMSE of 0.63, which is better than Dual-GAN
Feature Extraction Block (PFE) and the Temporal Face Alignment Block (RMSE of 1.02), one of the currently best-performing methods.
(TFA), to further mitigate the effects of distance changes and head Lightweight models: Model complexity and weight have always
movements. These two modules were used to propose a new rPPG been an inevitable issue in deep learning methods, and it is no ex-
method called Arbitrary_Resolution_rPPG, as shown in Fig. 30. The ception in rPPG methods. Different from most heavy models, Coma

15
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 9
Analysis of other methods.
Name Year Network Methods Description
Nowara et al. [103] 2021 Attention Network Attention Proposing an anti-attention mechanism utilizing
+ LSTM facial, hair, and background regions to reduce
noise caused by head movement and illumination.

Hu et al. [106] 2021 DNN Attention Attention module and temporal fusion module
were utilized as fundamental modules in the
network.

AND-rPPG [104] 2022 TCN AU Applying AU for denoising time signals improves
the simulation of time signals.

rPPG-FuseNet [105] 2022 DCNN Spatial–Temporal The fusion of RGB and MSR signals was employed,
Map utilizing two DCNNs for estimating rPPG signals.

DG-rPPGNet [108] 2022 Domain Disentangled A Domain Generalized rPPG Network based on
Generalization Feature Learning disentangled feature learning was proposed,
Network highlighting the issue of domain generalization in
rPPG methods for the first time.

TDMTALOS [109] 2022 2D CNN DTC A lightweight model was proposed, utilizing a
TDM module to estimate rPPG signals, and
employing the TALOS loss function to handle bias.

Arbitrary_ 2022 3D CNN + MT CNN Data Augmentation Two plug-and-play modules, PFE and TFA, were
Resolution_rPPG employed to alleviate the degradation caused by
[107] changes in distance and head movements.

et al. [109] propose a lightweight model. A lightweight model usually


refers to a model with fewer parameters, but this often leads to a
performance drop. To improve the performance on the basis of the
lightweight model, they propose a new objective loss function called
time-adaptive location shift (TALOS). TALOS is a new time loss function
for training learning-based models designed to learn the potential
movement probability of label signals during training, which allows Fig. 31. The Fundamentals of contrastive learning. Positive example is pulled closer
minimizing the error between the predicted signal and the label signal and negative example is pushed farther away.
under the estimated relative shift. In their model, a time derivative
module (TDM) composed of differential time convolutions (DTC) is
used to model the rPPG signal, which is built by aggregating multi-
ple convolution derivatives incrementally, simulating a Taylor series
expansion to the required order. TALOS is introduced to handle the
robustness of predicted signal to label signal shifts. Experimental results
on UBFC-rPPG demonstrate that TDM + TALOS achieves an RMSE of
3.08, which shows good competitiveness in terms of parameter number
and computational cost.

4. Unsupervised deep learning methods Fig. 32. The architecture of Gideon et al.’s method.

Supervised methods for rPPG require large-scale datasets that in-


clude both facial videos and synchronized authentic physiological sig- positive and negative examples need to be further constructed, as
nals, but obtaining such physiological signals through contact sensors illustrated in Fig. 31, which explains the process of contrastive learn-
poses challenges. To overcome this complexity, researchers have started ing. The core idea of contrastive learning is to minimize the distance
considering unsupervised methods that can learn feature representa- between positive examples and anchor examples, while maximizing
tions from unlabeled data without relying on real labels. Contrastive the distance between negative examples and anchor examples, thereby
learning [110], a widely-used unsupervised learning approach for video achieving clustering effects. In short, the goal of contrastive learning
and image feature embedding, has been introduced into the rPPG field is to train a feature extractor in an unsupervised manner to obtain the
to achieve unsupervised methods. This has led to successful imple- desired features from given images without the need for labels.
mentations of unsupervised remote HR measurement methods, with Without prior knowledge: Guided by the concept of contrastive
most current unsupervised methods being based on contrastive learn- learning, Gideon et al. [111] proposed a fully unsupervised rPPG
ing. Unsupervised methods represent the latest development in rPPG method, which opened the door for unsupervised approaches in re-
for remote HR measurement and hold promising research prospects. mote HR measurement, as shown in Fig. 32. The method relies on
Table 10 provides a comprehensive list of all the unsupervised methods. contrastive learning with triplet loss, where negative samples are rPPG
signals extracted from downsampled videos, positive samples are rPPG
4.1. Contrastive learning methods signals generated by upsampling the negative samples, and anchor
samples are rPPG signals extracted from the original videos. The triplet
Contrastive learning is a discriminative representation learning loss is used to push positive samples closer to anchor samples and
framework based on the concept of contrast, primarily used in unsu- push negative samples away from anchor samples, with the predicted
pervised methods. Contrastive learning typically involves using three rPPG signal determined by maximizing cross-correlation as the loss
samples for comparison: positive examples, anchor examples, and function. Although the method has been validated on various phys-
negative examples. The input sample serves as the anchor point, while iological datasets with a reported RMSE of 4.28 on the UBFC-rPPG

16
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

dataset and 2.90 on the PURE dataset, indicating inferior performance


compared to supervised methods, it demonstrates the potential of un-
supervised approaches. However, further optimization and fine-tuning
may be needed to improve accuracy and make it more comparable to
supervised methods.
Given that contrastive learning requires comparing three samples,
which may introduce more noise and degrade the effectiveness of
contrastive learning, Wang et al. [112] proposed a spatial enhancement Fig. 33. The architecture of Contrast-Phys.
method and a data augmentation method to enhance the effectiveness
of contrastive learning in their unsupervised method called SLF-RPM.
Their proposed spatial enhancement method is based on the DRM [12], dataset showed that Contrast-Phys achieved an RMSE of 1.00, which
which involves segmenting the face into multiple informative parts to is comparable to the performance of the state-of-the-art supervised
represent subtle skin color fluctuations. In addition to spatial enhance- method Physformer (RMSE of 0.71), and outperformed most supervised
ment, they also utilized the Nyquist-Shannon sampling theorem [113] methods. This successfully demonstrates the promising prospects of
to devise a time-based enhancement method based on sparsity, aim- unsupervised methods.
ing to effectively capture periodic temporal variations by modeling To advance the performance of unsupervised methods and minimize
physiological signal features. The original video is transformed through the performance discrepancy with supervised methods, recent research
sparsity-based time enhancement and label-based spatial enhancement by Yue et al. [116] has proposed incorporating the Frequency Augmen-
to generate different views, i.e., different samples, for contrastive learn- tation Module (LFA) and rPPG Expert Aggregation Module (REA) into
ing in SLF-RPM. Meanwhile, SLF-RPM also underwent a series of data contrastive learning. LFA, as a trainable data augmentation module,
augmentation to generate pseudo-labels for constraining the learning aims to generate diverse and abundant negative samples, as the quality
process. The training and testing results on the UBFC dataset showed of constructed negative samples significantly impacts the effectiveness
that the RMSE of SLF-RPM reached 9.70. The powerful capabilities of of contrastive learning. The core of the LFA module is the Frequency
the transformer have given rise to many high-performance supervised Modulation Block (FMB), which modulates the frequency of the rPPG
methods. To explore the potential of leveraging the transformer to signal to generate negative sample videos. On the other hand, REA is
enhance unsupervised methods, Park et al. [114] combined contrastive designed as an rPPG signal extraction module that takes into consider-
learning with ViT to propose a novel unsupervised approach, named ation the distinct distributions of blood vessels and noise in different
Fusion ViViT. Similar to the idea of rPPG-FuseNet [105], they consid- facial regions. REA employs an attention mechanism to selectively
ered using RGB and near-infrared (NIR) images for jointly representing focus on relevant facial regions, amplifies the pulse-sensitive areas,
features, using a video encoder to take RGB and NIR facial video frames suppresses the background and pulse-insensitive areas, and estimates
as inputs and generate a fused RGB-NIR rPPG representation vector. the signal using 3D CNN. The positive sample videos are obtained
The proposed Fusion ViViT leverages the self-attention mechanism of by spatial enhancement of the original videos, and the rPPG signals
transformer to effectively unify the global context of both modalities generated by REA from positive and negative sample videos are utilized
and fully represent the spatio-temporal information in RGB and NIR as positive and negative sample signals for contrastive learning. The
videos. The fused RGB-NIR rPPG representation vector is then used for authors have also devised three types of loss functions to effectively
extracting the rPPG signal. Meanwhile, they used data augmentation constrain the contrastive learning process and optimize the rPPG signal.
techniques to generate sufficient samples for contrastive learning and Experimental results on widely-used datasets UBFC-rPPG and PURE
designed a forced contrastive loss function to constrain the results demonstrate that the proposed approach achieves low RMSE values of
of contrastive learning. Experiments on the VIPL-HR dataset, which 0.94 and 2.01, respectively, which are in close proximity to the state-
contains rich challenges, demonstrated that the RMSE of Fusion ViViT of-the-art unsupervised method Contrast-Phys, and further narrows the
reached 14.86, outperforming SLF-RPM (RMSE of 16.59). Interestingly, performance gap with supervised methods.
the use of contrastive learning with transformer did not result in However, these contrastive learning methods that leverage prior
overfitting. knowledge have not taken into consideration the inherent periodicity
With prior knowledge: Unsupervised methods often have infe- in the data and have been unable to learn representations capturing
rior performance compared to supervised methods due to the lack of periodicity or frequency attributes. Consequently, Yang et al. [117]
strong constraints from real labels. However, Sun et al. [115] pro- incorporated previously overlooked resampling factors and introduced
posed a novel method called Contrast-Phys, which, for the first time, a soft modification to the InfoNCE loss [120], thereby devising a
achieves performance on par with supervised methods, as demonstrated generalized contrastive loss. This approach utilizes relative sampling
in Fig. 33. Contrast-Phys is based on four key observations of rPPG rates and the generalized contrastive loss to effectively and robustly
signals: (1) spatial similarity of rPPG signals, (2) temporal similarity learn periodic representations. Unfortunately, the performance of their
of rPPG signals, (3) dissimilarity of rPPG signals across videos, and proposed method, SimPer, is not outstanding, which may suggest that
(4) limitations of human HR range. These four key observations serve even with prior knowledge, some preprocessing of the original videos
as important prior knowledge constraints on rPPG signals, eliminating is necessary. Furthermore, their framework is not entirely unsupervised
the need for real labels. Additionally, they introduce a spatiotemporal and requires fine-tuning using real PPG signal labels.
rPPG (ST-rPPG) block to capture more spatio-temporal dimensions of
rPPG information, and use 3D CNN to extract rPPG signals from these 4.2. Non-contrastive learning methods
ST-rPPG blocks. Unlike previous methods that use the rPPG signals as
samples for contrastive learning, Sun et al. used the PSD corresponding Contrastive learning has been demonstrated as an excellent solution
to the rPPG signals as samples, with the PSD of the target video for unsupervised methods. However, contrastive learning also has cer-
as the anchor and positive samples, while the PSD of another facial tain limitations, such as high computational complexity and significant
video as the negative samples. The contrastive loss function, generated computational costs associated with a large number of samples, which
based on these four important prior knowledge constraints, is used may hinder further applications. In contrast to previous unsupervised
to guide the contrastive learning process. Interestingly, the four key methods based on contrastive learning, Speth et al. [118] successfully
prior knowledge constraints summarized by Sun et al. can be applied implemented a non-contrastive unsupervised method called SiNC, as
to any rPPG-based HR measurement method, demonstrating a certain shown in Fig. 34. While contrastive learning generates results based
degree of generalizability. Extensive experiments on the UBFC-rPPG on the comparison between a large number of samples, their proposed

17
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 10
Analysis of unsupervised deep learning methods.
Name Year Network Methods Description
Gideon et al. [111] 2021 3D CNN Contrastive Learning The first unsupervised rPPG method utilized
contrastive learning for unsupervised learning to
estimate rPPG signals.

SLF-RPM [112] 2022 3D CNN Data Augmentation + A landmark-based spatial enhancement method is
Contrastive Learning proposed to improve the effectiveness of
contrastive learning.

Fusion ViViT [114] 2022 ViT Contrastive Learning Utilizing RGB and NIR for joint feature
representation with transformer-based contrastive
learning.

Contrast-Phys [115] 2022 3D CNN Data Augmentation + Proposing ST-rPPG blocks based on four
Contrastive Learning observations of rPPG signals for contrastive
learning of spatio-temporal rPPG signals.

Yue et al. [116] 2022 3D CNN Data Augmentation + LFA for data augmentation and positive/negative
Contrastive Learning sample generation, and REA for estimating rPPG
signals.

SimPer [117] 2023 2D CNN Data Augmentation + Learning Efficient and Robust Periodic
Contrastive Learning Representations through Relative Sampling Rates
and Generalized Contrastive Loss.

SiNC [118] 2023 3D CNN Data Augmentation + Penalized regression was utilized in the design of
Penalized Regression the loss function, as the first unsupervised method
without contrastive learning.

rPPG-MAE [119] 2023 ViT Spatial-Temporal Map The inaugural rPPG method to incorporate MAE,
+ MAE alongside the design of a novel PC-STMap, has
achieved the best unsupervised performance.

referred to as PC-STMap, augmenting its capability to handle complex


video content. In contrast to conventional MAE models that solely rely
on pixel reconstruction loss, they ingeniously designed a loss function
tailored to the rPPG domain. This specialized loss function serves as a
constraint during MAE pretraining, enabling ViT to effectively acquire
the periodic information embedded within rPPG signals. Astonishingly,
through the amalgamation of various sophisticated techniques, rPPG-
Fig. 34. The architecture of SiNC.
MAE achieved a remarkable RMSE of 0.21 on the highly reputable
UBFC-rPPG dataset. This achievement not only establishes rPPG-MAE
as the leading unsupervised method but also surpasses the performance
non-contrastive learning method primarily relies on prior knowledge of of the most robust supervised methods on the same dataset. These
periodic signals. Utilizing prior knowledge in unsupervised learning can results underscore the significant potential of MAE in the rPPG do-
impose significant constraints on the solution space. For physiological main and suggest that unsupervised methodologies may become the
signals, the upper and lower limits of frequency can be understood, and mainstream approach in future rPPG research.
the desired signal to be extracted is sparse in the frequency domain,
while the model can filter out noise signals present in the video. With 5. Research resources
these constraints in place, the unsupervised method can be simplified
into a signal feature extraction problem. Therefore, they perform data
In this section, we will present the latest research resources, includ-
augmentation on input images and estimate rough rPPG signals using
ing datasets and open-source tools, that are currently used in rPPG
a 3D CNN-based signal estimator. The key to achieving unsupervised
methods, to assist researchers, particularly those new to the field,
learning without contrastive learning lies in the loss functions. SiNC
in accelerating their research progress. Furthermore, in Section 5.2,
combines three loss functions: bandwidth loss, sparsity loss, and vari-
we will perform performance comparisons of the methods introduced
ance loss. They consider strong priors on the bandwidth and periodic
in this paper, for the convenience of researchers for reference and
placement of the rough rPPG signals. Signals observed outside the
comparison purposes.
expected frequency range are considered contaminants, and the model
is penalized to perform forward propagation, thereby discarding such
noisy visual features and optimizing the rPPG signals more effectively. 5.1. Datasets
Test results on the UBFC and PURE datasets show that SiNC achieves
RMSE values of 1.83 and 1.84, respectively, exhibiting slight inferiority The dataset plays a crucial role in evaluating a method, as testing
compared to the current best contrastive learning unsupervised method on different datasets allows for comparison of the performance of
Contrast-Phys (RMSE of 1.00 and 1.4), demonstrating the potential of various methods and reflects their generalization ability under different
non-contrastive learning methods. circumstances. Moreover, for supervised methods mentioned earlier,
Lately, Liu et al. [119] introduced the novel technology of Masked the quality of the dataset and the labels it contains are crucial for
Autoencoder (MAE) [121] into the field of rPPG for the first time, their training. In the following section, we will introduce a majority
presenting a new unsupervised approach termed rPPG-MAE. In this of the physiological datasets used for rPPG methods, and their main
method, they employed MAE for unsupervised pretraining of ViT, characteristics will be listed in Table 11.
with the primary objective of uncovering self-similarity patterns within DEAP [78] was initially designed as a dataset for emotion analysis,
physiological signals. Additionally, to further enhance noise reduction but it can also be used for evaluating rPPG methods due to its inclusion
in video data, they introduced an innovative spatiotemporal framework of authentic PPG signals. DEAP consists of data from 32 participants,

18
H. Xiao et al. Biomedical Signal Processing and Control 88 (2024) 105608

Table 11
Summary of public camera physiological measurement datasets.

Dataset | Subjects | Videos | Imaging | Gold standard | Free access
DEAP [78] | 32 | 874 | 720 × 576, 50 fps | PPG | Yes
MAHNOB-HCI [62] | 27 | 527 | 780 × 580, 61 fps | ECG | Yes
AFRL [122] | 25 | 300 | 658 × 492, 120 fps | PPG, ECG, RR | No
PURE [38] | 10 | 60 | 640 × 480, 30 fps | PPG, SpO2 | Yes
MMSE-HR [44] | 40 | 102 | 1040 × 1392, 25 fps | HR, BP | Yes
COHFACE [67] | 40 | 160 | 640 × 480, 20 fps | PPG | Yes
ECG-Fitness [37] | 17 | 204 | 1920 × 1080, 30 fps | PPG, ECG | Yes
OBF [52] | 100 | 200 | 1920 × 1080, 60 fps | PPG, ECG, RR | No
VIPL-HR [65] | 107 | 3130 | 960 × 720, 1920 × 1080, 640 × 480; 25 or 30 fps | PPG, HR, SpO2 | Yes
MR-NIRP [78] | 19 | 190 | 640 × 640, 60 fps | PPG | Yes
UBFC-rPPG [42] | 50 | 50 | 640 × 480, 30 fps | PPG, HR | Yes
VicarPPG-2 [123] | 10 | 40 | 1280 × 720, 30 fps | PPG, HR | Yes
MMVS [101] | 129 | 762 | 1920 × 1080, 25 fps | PPG | No
V4V [124] | 179 | 1358 | 1280 × 720, 25 fps | PPG, HR, BP | Yes
UBFC-Phys [125] | 56 | 168 | 1024 × 1024, 35 fps | PPG, HR | Yes
SCAMPS [126] | 2800 | 2800 | 320 × 240, 30 fps | PPG, PR, RR | Yes
MMPD [127] | 33 | 660 | 1280 × 720, 30 fps | PPG, HR | Yes
DEAP [78] was initially designed as a dataset for emotion analysis, but it can also be used for evaluating rPPG methods due to its inclusion of authentic PPG signals. DEAP consists of data from 32 participants, with a total of 874 videos recorded at a resolution of 720 × 576 and a frame rate of 50 fps. Each participant was asked to watch 1-minute music videos to induce varying emotional states, leading to changes in HR. DEAP collected authentic PPG signals, and the real HR values can be calculated from them.
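Turning such a contact PPG trace into an HR label is a small, standard computation: pick the dominant spectral peak inside a plausible HR band. A minimal sketch follows; the sampling rate and band limits are illustrative assumptions, not DEAP specifications.

```python
# Hedged sketch: ground-truth HR (bpm) from a contact PPG trace.
import numpy as np
from scipy.signal import welch

def ppg_to_hr(ppg, fs=128.0, low_hz=0.7, high_hz=4.0):
    """Return HR in bpm as the dominant in-band peak of the Welch PSD."""
    freqs, psd = welch(ppg - np.mean(ppg), fs=fs, nperseg=min(len(ppg), 1024))
    band = (freqs >= low_hz) & (freqs <= high_hz)
    return 60.0 * freqs[band][np.argmax(psd[band])]
```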
MAHNOB-HCI [62] is a multimodal database involving 27 participants, each of whom recorded 20 videos, resulting in a total of 527 videos. The videos were recorded at a resolution of 780 × 580 and a frame rate of 61 fps. While the original purpose of MAHNOB-HCI was emotion recognition and implicit tagging research, it is also suitable for evaluating rPPG-based remote HR measurement methods due to its inclusion of real physiological signals such as the electrocardiogram (ECG). All participants took part in emotion induction and implicit tagging experiments, during which HR fluctuated due to changes in the participants' emotions. Additionally, six cameras were used to capture different views of the participants (frontal view, profile view, wide-angle view, close-up view), making this dataset suitable for evaluating method performance under pose and angle variations.

AFRL [122] was proposed by the U.S. Air Force Research Laboratory and includes recordings from 25 participants (17 males and 8 females), consisting of 300 videos. Each video was recorded at a resolution of 658 × 492 and a frame rate of 120 fps. Six recordings were made per participant, with head motion increasing across tasks: in the first two tasks, participants sat still; in the next three, they rotated their heads around the vertical axis at angular velocities of 10°/s, 20°/s, and 30°/s; and in the last task, participants were asked to randomly reposition their heads to one of nine pre-defined locations every second. The background was either a solid black fabric or a patterned colored fabric. Additionally, real physiological signals including PPG, ECG, and respiration were collected as part of the recordings.

PURE [38] consists of recordings from 10 participants, including 8 males and 2 females. Each participant recorded 6 videos, resulting in a total of 60 videos, each recorded at a resolution of 640 × 480 and a frame rate of 30 fps, with a duration of one minute. Each participant performed six different tasks: (1) sitting still, (2) talking, (3) slow head movement, (4) fast head movement, (5) rotating the head at a 20° angle, and (6) rotating the head at a 35° angle, in order to introduce variations in head movements. The PURE dataset also considered changes in illumination by using natural sunlight and cloud cover through a large window. Real PPG signals were collected using a CMS50E finger pulse oximeter with a sampling rate of 60 Hz. It is worth mentioning that the images in PURE are stored in lossless PNG format, which benefits the estimation performance of rPPG signals.

MMSE-HR [44] involves 40 participants from diverse racial backgrounds, including Asian, White, Black, and Hispanic/Latino. A total of 102 videos were recorded, each at a resolution of 1040 × 1392 and a frame rate of 25 fps. The original purpose of MMSE-HR was facial expression analysis, but MMSE-HR also recorded the true values of physiological signs such as HR. Because it includes participants with different skin tones, MMSE-HR is well-suited for evaluating the performance of methods across skin tones.

COHFACE [67] is a publicly available dataset proposed by the Idiap Research Institute, designed so that researchers can evaluate their rPPG methods on COHFACE with standardized and fair criteria. COHFACE consists of 40 participants, including 28 males and 12 females, each of whom recorded four video segments, resulting in a total of 160 videos. Each video was recorded at a resolution of 640 × 480 and a frame rate of 20 fps. Additionally, each participant wore a contact-based PPG sensor to obtain real PPG signals and other related data. Lighting conditions were taken into consideration during video recording: two video segments were recorded for each participant under each of two conditions, (1) studio lighting, with windows closed to avoid natural light and sufficient artificial light to stably illuminate the participant's face, and (2) natural light, with windows open and all artificial lights turned off. The main limitation of COHFACE is that the videos are heavily compressed, resulting in significant noise that can greatly affect the estimation of rPPG signals.

ECG-Fitness [37] comprises 17 participants, consisting of 14 males and 3 females, engaged in four different activities (speaking, rowing, exercising on a stationary bicycle, and exercising on an elliptical trainer). The videos were recorded using two Logitech C920 web cameras and a FLIR thermal imager under three distinct lighting conditions: natural light from nearby windows, 400 W halogen lamps, and 30 W LED lamps. Each participant generated 12 videos across the three lighting conditions and four activity states, resulting in a total of 204 videos. Each video was recorded at a resolution of 1920 × 1080 pixels and 30 frames per second, with a duration of 1 min. Remarkably, ECG-Fitness is unique in containing data for the rowing activity.
OBF [52] is a large dataset proposed by the University of Oulu in Finland, specifically designed for remote physiological signal measurement. The OBF dataset comprises 100 subjects, with a total of 200 high-quality RGB facial videos, each lasting 5 min, recorded at a resolution of 1920 × 1080 and 60 fps. The subjects consist of two groups: healthy participants and atrial fibrillation (AF) patients. Healthy participants were recorded in a resting state and a post-exercise state (after 5 min of exercise), and AF patients were recorded in pre- and post-cardioversion states. In addition, OBF includes contact-based recordings of real PPG signals and other information. Owing to its high video quality, OBF can enhance the performance of rPPG methods to a certain extent.

VIPL-HR [65] is a challenging large-scale multimodal dataset that includes data from 107 subjects. Three different types of videos, namely RGB videos, NIR videos, and smartphone camera videos, were recorded using RGB cameras, RGB-D cameras, and smartphone cameras, respectively; in total, 3130 visible-light facial videos were recorded. RGB videos were recorded with both RGB cameras and RGB-D cameras, at a resolution of 960 × 720 and a frame rate of 25 fps for the RGB camera, and at a resolution of 1920 × 1080 and a frame rate of 30 fps for the RGB-D camera. NIR videos were recorded with the RGB-D cameras, which can capture both RGB and NIR, at a resolution of 640 × 480 and a frame rate of 30 fps. Smartphone videos were recorded at a resolution of 1920 × 1080 and a frame rate of 30 fps. The purpose of using multiple video types is to enable researchers to test the robustness of their methods across different video modalities. Furthermore, the dataset introduces two influencing factors, namely head motion (stable, large motion, speaking) and illumination changes (lab, dark, bright), for evaluating the overall robustness of proposed methods. Additionally, VIPL-HR includes various real labels, such as HR, SpO2, and BVP, for comprehensive analysis.

MR-NIRP [78] is the first physiological video dataset that includes driving scenarios. It consists of 190 videos recorded from 19 subjects while driving and while sitting inside a parked car; each subject also performed actions such as speaking and randomly moving their head during the recordings. The videos were captured at a resolution of 640 × 640 and a frame rate of 60 fps. MR-NIRP is designed to evaluate the applicability of rPPG methods in driving scenarios, beyond conventional laboratory environments. The dataset records real PPG signals synchronized with the video using a finger pulse oximeter. RGB and NIR data are collected simultaneously, although researchers often use the NIR data for training and testing in practice. It is worth mentioning that this dataset has some imperfections, such as many zero values in the PPG signals, which pose challenges for evaluating rPPG methods.

UBFC-rPPG [42] is a dataset specifically designed for evaluating rPPG methods. It comprises 50 videos, each recorded from a different subject, at a resolution of 640 × 480 and a frame rate of 30 fps; the recordings take variations in sunlight and indoor lighting into consideration. UBFC-rPPG consists of two sub-datasets. Sub-dataset 1 is a simplified version with 8 videos, in which subjects were asked to sit still, although some videos may involve movement. Sub-dataset 2 is a more practical dataset with 42 videos, in which subjects played a time-sensitive mathematical game to increase their HR. UBFC-rPPG is currently one of the most widely used datasets: its videos are uncompressed and of good quality, and real data such as HR and PPG signals are recorded, which is beneficial for researchers. Although UBFC-rPPG includes two sub-datasets, in practice researchers often use only sub-dataset 2, owing to its more thorough recording preparation and good video quality.

VicarPPG-2 [123] consists of 10 participants with an average age of 29 years. A total of 40 videos were recorded, each lasting 5 min, captured at a resolution of 1280 × 720 pixels with a frame rate of 60 frames per second. Each participant contributed four videos. The first video depicts a static state. In the second video, each participant executed five pre-planned body/head movements: tilting the head left and right (shaking), moving the head up and down (nodding), a combination of shaking and nodding (rotation), moving the eyes while keeping the head still, and natural head movements while listening to music (dancing). In the third video, participants engaged in a stress-inducing game, and in the fourth video, participants sat unrestrained after undergoing fatigue-inducing physical exercise. VicarPPG-2 employed CMS50E pulse oximeters connected to the participants' fingertips to record authentic PPG waveforms. This dataset is well-suited for evaluating the robustness of rPPG methods in extreme scenarios, such as stress and intense physical activity.

MMVS [101] is a private dataset containing multimodal, multisubject physiological signals. It includes data from 129 healthy subjects ranging in age from 16 to 83 years. A total of 762 videos were recorded, each at a resolution of 1920 × 1080 and a frame rate of 25 fps, lasting approximately one minute. Uniform indoor ambient lighting was used, without specific pre-set backgrounds. MMVS uses finger-based pulse oximeters to record real PPG signals and employs software calibration to align the PPG signals with the video frames.

V4V [124] is a physiological dataset introduced specifically for the ICCV 2021 Vision for Vitals Challenge. It comprises 179 participants, including African Americans, Caucasians, and Asians, each of whom engaged in up to 10 experimental tasks. Each task was meticulously designed to elicit specific emotions, resulting in a total of 1358 videos. The videos vary in length from 5 s to 206 s, recorded at a resolution of 1280 × 720 pixels and a frame rate of 25 fps. V4V leverages the BIOPAC MP150 data acquisition system to collect authentic labels, including PPG signals, heart rate, blood pressure, and other physiological measurements. It is worth noting that, despite its substantial scale and diverse challenges, V4V maintains consistent lighting conditions throughout.

UBFC-Phys [125] is a dataset primarily designed for emotion recognition and consists of 56 participants, including 46 females and 10 males. Participants took part in an experiment inspired by the Trier Social Stress Test (TSST). Each participant completed three tasks (resting, speaking, and arithmetic), resulting in a total of 168 videos, each recorded at a resolution of 1024 × 1024 pixels and a frame rate of 35 frames per second. UBFC-Phys uses the Empatica E4 wristband to collect PPG signals and skin conductance (EDA) measurements. Additionally, participants filled out a questionnaire before and after the experiment to compute self-reported anxiety scores. In the future, UBFC-Phys may become an important publicly available dataset for research on rPPG-based emotion recognition.

SCAMPS [126] is a large-scale synthetic physiological dataset that includes 2800 videos, with a resolution of 320 × 240 and a frame rate of 30 fps. SCAMPS provides frame-level ground-truth labels, including PPG, pulse interval, respiratory waveform, respiratory interval, and 10 facial actions, and also offers video-level ground-truth labels for multiple physiological indicators. These parameters are used to generate 20-second PPG waveforms at 300 Hz together with action unit intensities, and each video is rendered using the corresponding waveform, action unit intensities, and randomly sampled appearance attributes such as skin texture, hair, clothing, lighting, and environment. The extensive synthetic data in SCAMPS has demonstrated its potential in various applications, since collecting comparable data in the real world can be challenging. However, SCAMPS is typically used for training rather than testing.
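Whichever of the above datasets is used, most pipelines share the same front end: detect a facial ROI in each frame and spatially average its pixels into one RGB sample per frame. A minimal sketch with OpenCV's stock Haar face detector is shown below; taking the first detected face and averaging the whole ROI are simplifying assumptions, and published methods typically use finer ROI selection and tracking.

```python
# Hedged sketch of the common rPPG front end: face ROI -> RGB traces.
import cv2
import numpy as np

def video_to_rgb_traces(path):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap, traces = cv2.VideoCapture(path), []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, 1.3, 5)
        if len(faces):
            x, y, w, h = faces[0]                     # first face only
            roi = frame[y:y + h, x:x + w]
            traces.append(roi.reshape(-1, 3).mean(axis=0))
    cap.release()
    return np.asarray(traces)  # (frames, 3), OpenCV's BGR channel order
```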
Table 12
A summary of the performance of the conventional methods. MAE and RMSE in bpm. The best results are in bold.
Name Year Deap MAHNOB-HCI PURE MMSE-HR COHFACE VIPL-HR UBFC-rPPG MMPD
MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R
GREEN [13] 2008 8.10 11.17 0.80 – – – 4.39 11.60 0.99 11.53 21.77 – 10.94 16.72 – – – – 7.50 14.14 0.62 11.73 15.75 0.24
ICA [19] 2010 – – – −8.95 25.9 0.08 15.23 21.25 – 5.28 – 0.70 8.89 14.55 0.42 – – – 5.17 11.76 0.65 7.94 11.67 0.28
PCA [20] 2011 – – – – – – 22.25 30.20 – – – – – – – – – – – 12.08 0.54 – – –
CHROM [12] 2013 7.47 10.31 0.82 −2.89 10.70 0.82 2.073 2.50 0.99 9.41 13.97 0.55 7.80 12.45 0.26 11.40 16.99 0.28 2.37 4.91 0.89 5.89 9.72 0.39
PBV [25] 2014 – – – – – – 23.31 30.73 0.51 – – – – – – – – – 13.63 24.12 0.32 6.46 9.66 0.50
LiCVPR [26] 2014 – – – −3.30 7.62 0.81 28.22 30.96 −0.38 – – – – – – – – – – – – – – –
2SR [27] 2016 – – – – – – 2.44 3.06 0.98 – – – 20.97 25.98 −0.32 11.50 17.20 0.30 15.95 11.65 – – – –
POS [28] 2017 7.93 10.25 0.82 – – – 3.14 10.57 0.95 5.77 – 0.82 – – – 5.79 8.94 0.73 4.05 8.75 0.78 5.22 9.74 0.46
MMPD [127] is the first dataset recorded entirely with smartphone cameras. MMPD includes 33 subjects and a total of 660 one-minute videos, recorded at a resolution of 1280 × 720 and a frame rate of 30 fps; however, for ease of sharing, the researchers compressed the videos to a resolution of 320 × 240. MMPD considers four different skin tones, four different lighting conditions (LED high, LED low, incandescent, natural), and four different activities (resting, head rotation, conversation, and walking), providing researchers with diverse environmental conditions for testing the robustness of their methods. Additionally, MMPD conducted four further experiments to investigate the impact of motion on static scenes, requiring subjects to perform high knee raises or other vigorous exercise to raise their HR before recording; after completing all the exercises, subjects were given sufficient rest time to calm down before participating in the next experiment. MMPD also records real labels such as HR and actual PPG signals.

5.2. Evaluation metrics and performance comparison

When evaluating rPPG methods for remote HR measurement, researchers typically use three metrics in combination: mean absolute error (MAE), RMSE, and the Pearson correlation coefficient (R). MAE and RMSE are measured in beats per minute (bpm), with smaller values indicating lower error. R ranges from −1 to 1, with values closer to 1 indicating stronger agreement with the reference. These three evaluation metrics are also used for the performance comparison in this paper. Tables 12 to 14 present the performance of traditional, supervised, and unsupervised methods, respectively, on the most commonly used public datasets, including DEAP [78], MAHNOB-HCI [62], PURE [38], MMSE-HR [44], COHFACE [67], VIPL-HR [65], MR-NIRP [78], UBFC-rPPG [42], SCAMPS [126], and MMPD [127]. Although SCAMPS [126] is also publicly available, it is commonly used for training rather than testing, so it is not included as an experimental object. All experimental data are obtained from our own experiments and from publicly available experimental data reported by researchers. Due to the lack of experimental data on some datasets, Table 12 does not include results on MR-NIRP, Table 13 does not include results on MMPD and MR-NIRP, and Table 14 does not include results on MMSE-HR and MMPD.
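For reference, all three metrics are direct computations over paired per-video HR estimates; the minimal sketch below reproduces the definitions used in Tables 12 to 14 (the function name and interface are ours).

```python
# MAE, RMSE (both in bpm) and Pearson R for paired HR estimates.
import numpy as np

def evaluate(hr_pred, hr_true):
    hr_pred, hr_true = np.asarray(hr_pred), np.asarray(hr_true)
    err = hr_pred - hr_true
    mae = np.mean(np.abs(err))                  # bpm
    rmse = np.sqrt(np.mean(err ** 2))           # bpm
    r = np.corrcoef(hr_pred, hr_true)[0, 1]     # Pearson R, in [-1, 1]
    return mae, rmse, r
```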
5.3. Toolboxes

With the increasing attention from researchers, open-source toolboxes have been proposed to facilitate the study of rPPG. These toolboxes assist researchers in completing essential steps of rPPG methods, such as ROI selection and PPG signal conversion to HR, thereby greatly facilitating research in this field. McDuff et al. [128] introduced the first open-source toolbox, iPhys, a MATLAB toolbox capable of implementing various methods, including classical traditional methods such as GREEN [13], POS [28], CHROM [12], and ICA [19]. It also provides functionality for common steps in rPPG methods, such as face detection, ROI definition, and skin segmentation, and offers functions for plotting and signal quality assessment. Similarly, Pilz et al. [129] developed a new open-source MATLAB toolbox, PPGI-Toolbox, whose primary purpose is to implement their proposed methods, namely Local Group Invariance (LGI) [130] and Riemannian-PPGI (SPH) [129], while also incorporating classic traditional methods such as 2SR [27] and POS [28] for benchmarking. Furthermore, Boccignone et al. [131] proposed a MATLAB open-source toolbox covering a wide range of traditional methods. These earlier MATLAB toolboxes were only capable of implementing traditional methods; recently, some researchers have proposed Python toolboxes that can also handle deep learning methods. PyVHR [132] is the first toolbox to support a deep learning method, and it is an installable Python package that is easy to set up and use. With PyVHR, researchers can implement and evaluate eight traditional methods and one deep learning method, MTTS-CAN [32], on 10 datasets, which facilitates benchmarking of rPPG methods. PyVHR also provides other commonly used preprocessing and postprocessing utilities, such as ROI selection, signal conversion, PSD calculation, and plotting. Moreover, deep learning methods proposed by researchers can be tested, but not trained, with PyVHR, and it can also be readily used for applications such as anti-spoofing, activity detection, affective computing, and biometrics. rPPG Toolbox [133] is the most recently proposed and currently the most comprehensive rPPG toolbox; it can be used for both training and testing of deep learning methods. rPPG Toolbox includes code for preprocessing multiple public datasets, implementations of supervised and unsupervised deep learning methods (including training code), as well as postprocessing and evaluation tools. The toolbox supports four public datasets, namely SCAMPS [126], UBFC-rPPG [42], PURE [38], and MMPD [127], and provides a parameter file through which researchers can modify training and testing settings, allowing them to freely customize it to the requirements of their methods. By fully utilizing rPPG Toolbox, researchers can reduce the time required for deploying their methods and facilitate fair evaluation of various methods.
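The classical algorithms these toolboxes implement are compact enough to sketch directly. Below is a condensed, unofficial rendering of POS [28] applied to ROI-averaged RGB traces; the 1.6 s window follows the original paper, while the helper name and the small numerical guards are our own additions.

```python
# Condensed sketch of POS [28] over (frames, 3) RGB traces.
# Assumes RGB channel order (convert from BGR if using the OpenCV front end).
import numpy as np

def pos(rgb, fps=30.0):
    n = rgb.shape[0]
    win = int(1.6 * fps)                        # ~1.6 s window, as in [28]
    proj = np.array([[0.0, 1.0, -1.0], [-2.0, 1.0, 1.0]])
    h = np.zeros(n)
    for t in range(n - win + 1):
        c = rgb[t:t + win]
        cn = c / (c.mean(axis=0) + 1e-9)        # temporal normalization
        s = cn @ proj.T                          # two projected signals
        p = s[:, 0] + (s[:, 0].std() / (s[:, 1].std() + 1e-9)) * s[:, 1]
        h[t:t + win] += p - p.mean()             # overlap-add
    return h                                     # rPPG estimate
```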
Table 13
A summary of the performance of the supervised methods. MAE and RMSE in bpm. The best results are in bold.
Name Year Deap MAHNOB-HCI PURE MMSE-HR COHFACE VIPL-HR UBFC-rPPG
MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R
HR-CNN [37] 2018 – – – 7.25 9.24 0.51 1.84 2.37 0.98 – – – 8.10 10.80 0.29 – – – 4.90 5.89 0.64
DeepPhys [39] 2018 – – – 4.57 – – 0.83 1.54 0.99 – – – 8.25 14.71 0.28 11.00 13.80 0.11 6.27 10.82 0.65
EVM-CNN [40] 2018 6.96 8.81 0.84 – 3.26 0.95 – – – – 6.95 0.98 – – – – – – – – –
SynRhythm [45] 2018 4.48 6.52 0.89 0.30 4.49 – 2.71 4.86 0.98 −0.85 5.03 0.86 – 4.49 – – – – 5.59 6.82 0.72
3D CNN [53] 2019 – – – – – – – – – – – – – – – – – – 5.45 8.64 –
PhysNet [54] 2019 – – – 6.85 8.76 0.69 2.10 2.60 0.99 – 13.25 0.44 8.63 9.36 0.54 10.80 14.80 0.20 2.95 3.67 0.97
rPPGNet [55] 2019 6.21 7.73 0.83 4.03 5.93 0.88 0.74 1.21 1.00 – – – – – – – – – 0.56 0.73 0.99
ST-Attention 2019 – – – – – – – – – – – – – – – 5.40 7.99 0.66 – – –
[46]
TWO-STREAM 2019 – – – – – – 9.81 11.81 0.42 – – – 8.09 9.96 0.40 – – – – – –
[70]
Bian et al. [69] 2019 – – – – – – – – – 4.35 10.15 0.83 – – – – – – – – –
Meta-rPPG [73] 2020 5.16 6.00 0.87 – – – 2.52 4.63 0.98 – – – 9.31 12.27 0.19 – – – 5.97 7.42 0.53
MTTS-CAN [32] 2020 – – – – – – 2.48 9.01 0.92 3.85 7.21 0.86 – – – – – – 1.70 2.72 0.99
CVD [48] 2020 – – – – – – – – – – – – – – – 5.02 7.97 0.79 – – –
Song et al. [36] 2020 5.65 7.17 0.85 5.98 7.45 0.75 – – – – – – – – – – – – – – –
RhythmNet [47] 2020 7.47 8.96 0.82 – 3.99 0.87 – – – – 7.33 0.78 – – – 5.30 8.14 0.76 – – –
Siamese-rPPG 2020 – – – – – – 0.51 1.56 0.83 – – – 0.70 1.29 0.73 – – – 0.48 0.97 –
[59]
AutoHR [57] 2020 – – – – – – – – - – 5.87 0.89 – – – 5.68 8.68 0.72 – – –
DeeprPPG [58] 2020 – – – – – – 0.28 0.43 0.99 – – – 3.07 7.06 0.86 – – – – – –
HeartTrack [56] 2020 – – – – – – – – - – – – – – – – – – 2.41 3.37 0.98
Huang et al. 2020 – – – – – – – – - – – – – – – – – – 2.08 2.84 –
[72]
Deep-HR [80] 2021 – – – 2.08 3.41 0.92 – – - – – – – – – – – – – – –
PulseGAN [82] 2021 4.86 5.70 0.88 – – – 2.28 4.29 0.99 – – – – – – – – – 1.19 2.10 0.98
Dual-GAN [84] 2021 3.25 4.11 0.91 – – – 0.82 1.31 0.99 – – – – – – 4.93 7.68 0.81 0.44 0.67 0.99
Multi-task [98] 2021 – – – – – – 0.40 1.07 0.92 – – – 0.68 1.65 0.72 – – – 0.47 2.09 –
NAS-HR [49] 2021 – – – – – – 1.65 2.02 0.99 – – – – – – 5.12 8.01 0.79 – – –
Nowara et al. 2021 – – – – – – – – - 2.27 4.90 0.94 – – – – – – – – –
[103]
rPPGRNet + 2021 4.23 5.45 0.89 – – – – – - – – – – – - – – – – – –
THRNet [101]
SAM-rPPGNet 2021 – – – – – – 0.74 1.21 1.00 – – – 5.19 7.52 0.68 – – – – – –
[61]
PRNet [74] 2021 – – – 5.01 6.42 0.84 – – - – – – – – - – – – 5.29 7.24 0.73
Hu et al. [106] 2021 – – – – – – 0.23 0.48 0.99 0.43 1.16 0.99 – – – – – – 1.43 3.13 0.97
Instantaneous_ 2022 – – – – – – – – - – – – 19.66 22.65 - – – – 11.28 13.94 –
transformer [90]
Physformer [91] 2022 3.03 3.96 0.92 3.25 3.97 0.87 1.10 1.75 0.99 2.84 5.36 0.92 – – – 4.97 7.79 0.78 0.40 0.71 0.99
RErPPGNet [99] 2022 – – – – – – 0.38 0.54 0.96 – – – – – – – – – 0.41 0.56 0.99
AND-rPPG [104] 2022 – – – – – – – – – – – – 6.81 8.06 0.63 – – – 2.67 4.07 0.92
rPPG-FuseNet 2022 – – – 2.08 3.41 0.92 – – – −0.65 4.57 0.87 – – – 4.32 8.03 0.81 1.52 2.86 0.92
[105]
DG-rPPGNet 2022 – – – – – – 3.02 4.69 – – – – 7.19 8.99 – – – – 0.63 1.35 –
[108]
PRN augmented 2022 – – – – – – – – – – – – – – – – – – 0.68 1.31 0.86
[102]
APNET [95] 2022 – – – – – – – – – – – – – – – – – – 0.53 0.77 0.97
TDM + TALOS 2022 – – – – – – 1.83 2.30 0.99 – – – – – – – – – 2.32 3.08 0.99
[109]
EfficientPhys 2023 – – – – – – – – – – – – – – – – – – 1.14 1.81 0.99
[97]
Arbi- 2023 – – – – – – 1.44 2.50 – – – – 1.31 3.92 – – – – 0.76 1.62 –
trary_Resolution_
rPPG [107]
PhysFormer++ 2023 – – – 3.23 3.88 0.87 – – – 2.71 5.15 0.93 – – – 4.88 7.62 0.80 – – –
[93]
RADIANT [94] 2023 – – – – – – – – – – – – 8.01 10.12 – – – – 2.91 4.52 –
Table 14
A summary of the performance of the unsupervised methods. MAE and RMSE in bpm. The best results are in bold.
Name Year Deap MAHNOB-HCI PURE COHFACE VIPL-HR MR-NIRP UBFC-rPPG
MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R
Gideon et al. [111] 2021 5.13 6.16 0.86 – – – 2.3 2.9 0.99 1.50 4.60 0.90 9.80 15.48 0.38 4.75 9.14 0.61 1.85 4.28 0.93
SLF-RPM [112] 2022 – – – 3.60 4.67 0.92 – – – – – – 12.56 16.59 0.32 – – – 8.39 9.70 0.70
Fusion viViT [114] 2022 – – – – – – – – – – – – 11.70 14.86 −0.09 12.90 16.94 0.51 – – –
Contrast-Phys [115] 2022 – – – – – – 1.00 1.40 0.99 – – – 7.49 14.40 0.49 2.68 4.77 0.85 0.64 1.00 0.99
Yue et al. [116] 2022 4.20 5.18 0.90 – – – 1.23 2.01 0.99 – – – – – – – – – 0.58 0.94 0.99
SimPer [117] 2023 – – – – – – 3.98 – – – – – – – – – – – 4.24 – –
SiNC [118] 2023 – – – – – – 0.61 1.84 1.00 – – – – – – – – – 0.59 1.83 0.99
rPPG-MAE [119] 2023 – – – – – – 0.40 0.90 0.99 – – – 4.52 7.49 0.81 – – – 0.17 0.21 0.99
6. Research gaps

Despite the significant achievements and advancements in rPPG methods for HR measurement, there are still many areas that have not been fully addressed or explored. In this section, we summarize the key influencing factors and open problems of current research, in order to guide researchers in exploring new directions from these challenges.

6.1. Influencing factors

The performance of rPPG methods can be influenced by various interfering factors; in fact, most methods mentioned in this paper aim to overcome these adverse factors in order to achieve better performance. The main influencing factors currently are motion artifacts, lighting changes, video compression, and skin color variations. Motion artifacts refer to ghosting effects caused by head or body movements of the subject during facial video recording, which can significantly impact the performance of rPPG. To address this issue, several methods [12,39,54,55,57] have been proposed; for example, a spatio-temporal attention module was designed in [61] to learn salient features and reduce the impact of motion artifacts. Lighting changes can cause color variations in the face and affect the reflection of light, posing challenges for rPPG methods when lighting changes continuously within a video. In [134], researchers conducted a detailed evaluation of rPPG performance under different lighting conditions, and different approaches have been proposed to address lighting changes [14,47,103,105,107]; for instance, rPPG-FuseNet [105] combines MSR signals for remote HR estimation, which mitigates the effects of different lighting factors. Video compression can be a challenging factor in real-world applications: although many videos in existing datasets are uncompressed and of high quality, rPPG methods may need to be applied to compressed videos in practical scenarios. Solutions have been proposed specifically for video compression [37,55,59,98]; for example, a spatio-temporal video enhancement network was proposed in [37] to improve video quality while retaining as much information as possible, thereby addressing the issues caused by compression to some extent. Finally, the color variations caused by changes in blood volume are subtle, and skin color can affect the measurement results: in remote HR measurement, subjects with lighter skin tones often yield better results, as the blood volume pulse is more visible through lighter skin [102]. To address the issue of skin color variations, various methods have been proposed [8,102,112], such as the skin color transformation generator proposed in [102], which converts the skin color of all videos to dark skin while preserving the underlying blood volume changes. This largely addresses the problem of skin color variation and mitigates biases against particular skin color populations to some extent.
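Independent of model choice, a common first line of defense against the motion and illumination noise discussed above is to band-limit the recovered signal to plausible HR frequencies. A minimal sketch follows; the 0.7 to 4.0 Hz band and the filter order are conventional choices, not prescriptions from the works cited.

```python
# Band-limit an rPPG estimate to plausible HR frequencies (assumed band).
from scipy.signal import butter, filtfilt

def bandpass(signal, fs=30.0, low_hz=0.7, high_hz=4.0, order=3):
    b, a = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs)
    return filtfilt(b, a, signal)               # zero-phase filtering
```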
6.2. Complex models

Currently, researchers in the field of rPPG are predominantly directing their efforts towards deep learning. These methods diverge from traditional approaches, where the emphasis is on algorithms; deep learning methods typically focus on models and networks. However, the rapid advancement of deep learning has resulted in increasingly large and complex models. Despite the notable achievements of many methods [91,99,108] that have utilized such complex models and networks to attain excellent performance, the size of these architectures presents challenges for practical implementation. Consequently, some researchers are shifting their attention towards lightweight methods, developing lightweight models [49,109] that aim to reduce computational cost and time complexity, shorten HR measurement time, and increase processing speed. Nevertheless, these lightweight models often exhibit inferior performance compared to state-of-the-art methods. Therefore, finding a balance between lightweight design and optimal performance is likely to be a critical research direction for future researchers.

6.3. Open resources

Open-source resources play a crucial role in supporting researchers in their investigations, and for rPPG methods, datasets are an important resource. Despite the availability of several datasets for evaluating rPPG methods, the number of high-quality open datasets is still limited. Currently, the three most commonly used datasets are UBFC-rPPG [42], PURE [38], and COHFACE [67], which primarily focus on two influencing factors: motion artifacts and illumination changes. However, these datasets lack consideration of certain specific factors, such as changes in body state, emotional fluctuations, and environmental variations, making it challenging to comprehensively evaluate the merits of a method. Moreover, the factors emphasized in the current datasets do not fully encompass potential future developments, such as multi-person measurement and long-distance estimation, which may require new datasets for supplementation. In the context of deep learning methods, open sourcing of code is of paramount importance for researchers and newcomers to the field. However, accessing code for various learning-based methods is currently difficult, and improving this situation requires collective effort from researchers. Additionally, there is a need for further improvement and updates to the open-source toolboxes that facilitate researchers in training and testing their models efficiently; most existing toolboxes only provide integrated methods and datasets, which restricts researchers from flexibly deploying their own proposed networks and models.

6.4. Research on unsupervised deep learning methods

Notwithstanding the rapid advancement of deep learning methodologies for rPPG, the prevailing paradigm at present is supervised learning. Supervised approaches, however, necessitate authentic physiological labels, thereby amplifying the intricacy of training and testing and impeding their practical deployability. In 2021, Gideon et al. [111] were at the vanguard of employing contrastive learning to achieve unsupervised deep learning for rPPG, culminating in the emergence of several nascent unsupervised methodologies [112,114–116,118]. Nevertheless, research on unsupervised approaches has been relatively sluggish, with the majority of researchers still fixated on supervised learning. Furthermore, the performance of existing unsupervised methods still lags significantly behind supervised methods and fails to meet the requisite benchmark for practical applications. Consequently, further incisive investigation and exploration are warranted in the burgeoning realm of unsupervised methodologies, as they may hold the promise of becoming the mainstream direction for the future advancement of rPPG methods for remote HR monitoring.

6.5. Near-infrared videos

Presently, rPPG methods rely heavily on common RGB videos, exhibiting good performance in well-lit conditions. However, RGB videos suffer from reduced visibility in low-light situations, rendering rPPG methods potentially inaccurate or even completely ineffective in special real-world scenarios such as nighttime conditions [135]. NIR cameras, which augment the amount of light reflected from the face, enable NIR videos to maintain higher visibility in dark environments and are commonly employed in nocturnal settings. Consequently, some researchers have proposed dedicated rPPG methods tailored to NIR videos [136,137] to measure heart rates in dark conditions. Additionally, certain approaches [114,138] consider the joint utilization of RGB and NIR videos as a multimodal input strategy to mitigate the impact of lighting variations, thereby enhancing the quality of rPPG signal estimation and, consequently, improving remote heart rate measurement. Nevertheless, these methods on the whole still exhibit suboptimal performance, warranting further research, which holds significant implications for extending the applicability of rPPG methods to more complex scenarios.

7. Applications

With the rapid advancement of research and technology, rPPG methods have found applications in diverse domains beyond remote HR measurement, providing compelling evidence of their research potential and application prospects. In this section, we introduce some of the latest applications that have been achieved using rPPG methods, as well as potential future applications, aiming to provide researchers with insights and inspiration for further exploration in this exciting field.
7.1. Measuring multiple vital signs

In addition to HR measurement, rPPG methods have been utilized to measure a wide range of other physiological parameters. Blood pressure, a critical indicator of cardiovascular health, is commonly used for detecting conditions like hypertension, and numerous studies have employed rPPG methods for remote blood pressure monitoring, with promising measurement outcomes that highlight the potential of rPPG methods in this application [6,139–142]. Blood oxygen saturation (SpO2), which measures the capacity of blood to carry and transport oxygen and indicates the saturation level of oxygen in the blood, is crucial for assessing oxygenation status; lower SpO2 values suggest hypoxia and can be indicative of health risks. While some rPPG methods have been used for SpO2 measurement [143–146], their performance is still moderate, and further research is needed to improve their accuracy and reliability. RR and HRV, which are often measured alongside HR, have also been measured successfully using rPPG methods, with excellent results. Recently, Kossack et al. [147] pioneered the application of rPPG methods to assessing tissue perfusion, the amount of blood flowing through tissue per unit of time; insufficient tissue perfusion indicates inadequate blood supply to local tissues or organs. This successful application further underscores the potential of rPPG methods for measuring other physiological parameters. In future research, researchers can consider expanding rPPG methods to additional physiological parameters, such as arterial stiffness and transcutaneous oxygen saturation, to further explore their capabilities in remote physiological monitoring.

7.2. Affective computing

rPPG methods have shown promising potential in affective computing due to their combination of image processing and physiological sensing. Researchers have already demonstrated applications of rPPG in this field, particularly in stress estimation and emotion recognition. McDuff et al. [148] first utilized rPPG methods to measure HRV and then estimated subjects' stress levels from HRV with an accuracy of 85%, showcasing the potential of rPPG methods in stress estimation. Subsequently, in [125], researchers further explored this potential and proposed a multimodal dataset, UBFC-Phys [125], for emotion and stress estimation. Emotion recognition is currently a hot research topic: in [149], Gupta et al. first considered the use of rPPG methods for micro-expression recognition; PhysNet [54] proposed a new method for remote HR measurement and also considered using rPPG for emotion recognition; and Yu et al. [150] combined knowledge graphs with remote HR measurement for emotion recognition, achieving promising results. In addition, researchers have proposed rPPG methods for pain recognition [151], demonstrating their potential in that task as well. In the foreseeable future, researchers can further explore the application of rPPG methods in affective computing domains such as human–computer interaction and psychological testing.

7.3. Deepfake detection

Deepfake, a portmanteau of deep learning and fake, refers to the use of deep learning algorithms to simulate and fabricate audio and video content. Deepfake has become a highly popular field, with the most common applications being AI-based face swapping, as well as voice synthesis, facial synthesis, and video generation. Its emergence has made it possible to manipulate or generate highly realistic and difficult-to-detect audio and video content, making it challenging for observers to discern truth from falsity with the naked eye. Therefore, researchers have been paying attention to how to distinguish such high-tech falsified content. The study by Ciftci et al. [152] successfully demonstrated that HR measured from facial videos can be used to determine whether a video is real or fake, and remote HR measurement is precisely one of the main applications of rPPG methods. As a result, researchers have begun widely employing rPPG methods for Deepfake detection, proposing various novel methods [153–157] and achieving promising performance, which effectively demonstrates the potential of rPPG methods in this field. The application of rPPG methods to Deepfake detection remains one of the most valuable research directions currently.

7.4. Face anti-spoofing

Since the onset of the information age, the utilization of individual biometric characteristics, such as fingerprints and facial features, for identity verification has gained immense popularity. Currently, facial recognition and fingerprint recognition are the most prevalent methods of identity authentication. Facial recognition, which relies on facial feature analysis [158], is vulnerable to biometric spoofing attacks. For instance, malicious actors can obtain facial photos or videos of the target from alternative sources and employ them in photo attacks or replay attacks, successfully deceiving facial recognition systems and exposing the target to significant risks and vulnerabilities [159]. Consequently, there has been growing interest among researchers in developing anti-spoofing techniques for facial recognition, commonly known as face anti-spoofing (FAS). With the rapid advancement of remote rPPG methods, researchers have recognized the potential of leveraging rPPG techniques to harden facial recognition systems [160], and the utilization of rPPG methods for face anti-spoofing has emerged as a prominent research area. Kossack et al. [161] conducted a localized analysis of rPPG signals aimed at thwarting facial spoofing by assessing the blood flow information in the subject's facial region. Simultaneously, numerous other researchers have introduced novel anti-spoofing methods based on rPPG signals [162–166], showcasing the substantial growth and research potential of rPPG methods in the field of FAS.

8. Conclusion

In recent years, rPPG methods for HR measurement have gained increasing attention from researchers and have shown remarkable potential for development. In this paper, we provide a comprehensive review of this promising technology, encompassing traditional methods and deep learning approaches, with a particular focus on deep learning. We further categorize deep learning methods into supervised and unsupervised approaches, providing a classification and overview of their principles and mechanisms, with special emphasis on the emerging and promising field of unsupervised methods. We also introduce research resources for rPPG methods, including datasets and toolboxes, and systematically summarize the performance of existing methods on these datasets to assist researchers in accelerating their research. Additionally, we discuss current research challenges and gaps in rPPG methods and propose potential future research directions. Finally, we highlight the broad applications of rPPG methods in various fields, demonstrating their wide-ranging potential and future directions. Based on the thriving development of rPPG methods in remote HR measurement, we suggest the following recommendations: (1) more effort should be focused on measuring different physiological indicators and applying them in diverse scenarios, to further deepen the practical significance of rPPG methods; (2) the focus of rPPG method research should remain on addressing the various influencing factors, to bring the performance of rPPG methods to real-world application levels; and (3) unsupervised deep learning methods should be investigated further, as they can overcome the reliance on real labels in supervised methods and facilitate practical applications. We believe that this paper provides researchers with a more comprehensive understanding of rPPG methods for HR measurement, guides researchers to focus on real challenges, promotes further exploration in this field, and inspires more applications of rPPG methods in medical and other domains.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Hanguang Xiao reports financial support was provided by Chongqing Natural Science Foundation. Hanguang Xiao reports a relationship with Chongqing Natural Science Foundation that includes: funding grants.
Data availability

No data was used for the research described in the article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61971078) and the Chongqing Natural Science Foundation (Grant No. CSTB2022NSCQ-MSX0923). This study does not involve any ethical issue.

References

[1] A. Challoner, C. Ramsay, A photoelectric plethysmograph for the measurement of cutaneous blood flow, Phys. Med. Biol. 19 (3) (1974) 317.
[2] L. Scalise, Non contact heart monitoring, Adv. Electrocardiogr.-Methods Anal. 4 (2012) 81–106.
[3] A. Gudi, M. Bittner, J. van Gemert, Real-time webcam heart-rate and variability estimation with clean ground truth for evaluation, Appl. Sci. 10 (23) (2020) 8630.
[4] C. Massaroni, A. Nicolo, M. Sacchetti, E. Schena, Contactless methods for measuring respiratory rate: A review, IEEE Sens. J. 21 (11) (2020) 12821–12839.
[5] R. Yousefi, M. Nourani, Separating arterial and venous-related components of photoplethysmographic signals for accurate extraction of oxygen saturation and respiratory rate, IEEE J. Biomed. Health Inf. 19 (3) (2014) 848–857.
[6] F. Schrumpf, P. Frenzel, C. Aust, G. Osterhoff, M. Fuchs, Assessment of deep learning based blood pressure prediction from PPG and rPPG signals, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3820–3830.
[7] D.F. Swinehart, The beer-lambert law, J. Chem. Educ. 39 (7) (1962) 333.
[8] L.A. Aarts, V. Jeanne, J.P. Cleary, C. Lieber, J.S. Nelson, S.B. Oetomo, W. Verkruysse, Non-contact heart rate monitoring utilizing camera photoplethysmography in the neonatal intensive care unit—A pilot study, Early Hum. Dev. 89 (12) (2013) 943–948.
[9] L.A. Aarts, V. Jeanne, J.P. Cleary, C. Lieber, J.S. Nelson, S.B. Oetomo, W. Verkruysse, Non-contact heart rate monitoring utilizing camera photoplethysmography in the neonatal intensive care unit—A pilot study, Early Hum. Dev. 89 (12) (2013) 943–948.
[10] A. Al-Naji, K. Gibson, S.-H. Lee, J. Chahl, Monitoring of cardiorespiratory signal: Principles of remote measurements and review of methods, IEEE Access 5 (2017) 15776–15790.
[11] D.J. McDuff, J.R. Estepp, A.M. Piasecki, E.B. Blackford, A survey of remote optical photoplethysmographic imaging methods, in: 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC, IEEE, 2015, pp. 6398–6404.
[12] G. De Haan, V. Jeanne, Robust pulse rate from chrominance-based rPPG, IEEE Trans. Biomed. Eng. 60 (10) (2013) 2878–2886.
[13] W. Verkruysse, L.O. Svaasand, J.S. Nelson, Remote plethysmographic imaging using ambient light, Opt. Express 16 (26) (2008) 21434–21445.
[14] P.V. Rouast, M.T. Adam, R. Chiong, D. Cornforth, E. Lux, Remote heart rate measurement using low-cost RGB face video: a technical literature review, Front. Comput. Sci. 12 (2018) 858–872.
[15] F.-T.-Z. Khanam, A. Al-Naji, J. Chahl, Remote monitoring of vital signs in diverse non-clinical and clinical scenarios using computer vision systems: A review, Appl. Sci. 9 (20) (2019) 4474.
[16] X. Chen, J. Cheng, R. Song, Y. Liu, R. Ward, Z.J. Wang, Video-based heart rate measurement: Recent advances and future prospects, IEEE Trans. Instrum. Meas. 68 (10) (2018) 3600–3615.
[17] A. Ni, A. Azarang, N. Kehtarnavaz, A review of deep learning-based contactless heart rate measurement methods, Sensors 21 (11) (2021) 3719.
[18] C.-H. Cheng, K.-L. Wong, J.-W. Chin, T.-T. Chan, R.H. So, Deep learning methods for remote heart rate measurement: A review and future research agenda, Sensors 21 (18) (2021) 6296.
[19] M.-Z. Poh, D.J. McDuff, R.W. Picard, Non-contact, automated cardiac pulse measurements using video imaging and blind source separation, Opt. Express 18 (10) (2010) 10762–10774.
[20] M. Lewandowska, J. Rumiński, T. Kocejko, J. Nowak, Measuring pulse rate with a webcam—a non-contact method for evaluating cardiac activity, in: 2011 Federated Conference on Computer Science and Information Systems, FedCSIS, IEEE, 2011, pp. 405–410.
[21] Y. Sun, S. Hu, V. Azorin-Peris, S. Greenwald, J. Chambers, Y. Zhu, Motion-compensated noncontact imaging photoplethysmography to monitor cardiorespiratory status during exercise, J. Biomed. Opt. 16 (7) (2011) 077010.
[22] Z. Guo, Z.J. Wang, Z. Shen, Physiological parameter monitoring of drivers based on video data and independent vector analysis, in: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, IEEE, 2014, pp. 4374–4378.
[23] H. Qi, Z. Guo, X. Chen, Z. Shen, Z.J. Wang, Video-based human heart rate measurement using joint blind source separation, Biomed. Signal Process. Control 31 (2017) 309–320.
[24] A. Al-Naji, A.G. Perera, J. Chahl, Remote monitoring of cardiorespiratory signals from a hovering unmanned aerial vehicle, Biomed. Eng. Online 16 (2017) 1–20.
[25] G. De Haan, A. Van Leest, Improved motion robustness of remote-PPG by using the blood volume pulse signature, Physiol. Meas. 35 (9) (2014) 1913.
[26] X. Li, J. Chen, G. Zhao, M. Pietikainen, Remote heart rate measurement from face videos under realistic situations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4264–4271.
[27] W. Wang, S. Stuijk, G. De Haan, A novel algorithm for remote photoplethysmography: Spatial subspace rotation, IEEE Trans. Biomed. Eng. 63 (9) (2016) 1974–1984.
[28] W. Wang, A.C. Den Brinker, S. Stuijk, G. De Haan, Algorithmic principles of remote PPG, IEEE Trans. Biomed. Eng. 64 (7) (2017) 1479–1491.
[29] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, E. Moulines, A blind source separation technique using second-order statistics, IEEE Trans. Signal Process. 45 (2) (1997) 434–444.
[30] X. Chen, Z.J. Wang, M. McKeown, Joint blind source separation for neurophysiological data analysis: Multiset and multimodal methods, IEEE Signal Process. Mag. 33 (3) (2016) 86–107.
[31] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, CVPR 2001, IEEE, 2001, p. I.
[32] X. Liu, J. Fromm, S. Patel, D. McDuff, Multi-task temporal shift attention networks for on-device contactless vitals measurement, Adv. Neural Inf. Process. Syst. 33 (2020) 19400–19411.
[33] B. Wei, X. He, C. Zhang, X. Wu, Non-contact, synchronous dynamic measurement of respiratory rate and heart rate based on dual sensitive regions, Biomed. Eng. Online 16 (2017) 1–21.
[34] A. Asthana, S. Zafeiriou, S. Cheng, M. Pantic, Robust discriminative response map fitting with constrained local models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3444–3451.
[35] C. Tomasi, T. Kanade, Detection and tracking of point features, Int. J. Comput. Vis. 9 (1991) 137–154.
[36] R. Song, S. Zhang, C. Li, Y. Zhang, J. Cheng, X. Chen, Heart rate estimation from facial videos using a spatiotemporal representation with convolutional neural networks, IEEE Trans. Instrum. Meas. 69 (10) (2020) 7411–7421.
[37] R. Špetlík, V. Franc, J. Matas, Visual heart rate estimation with convolutional neural network, in: Proceedings of the British Machine Vision Conference, Newcastle, UK, 2018, pp. 3–6.
[38] R. Stricker, S. Müller, H.-M. Gross, Non-contact video-based pulse rate measurement on a mobile service robot, in: The 23rd IEEE International Symposium on Robot and Human Interactive Communication, IEEE, 2014, pp. 1056–1062.
[39] W. Chen, D. McDuff, DeepPhys: Video-based physiological measurement using convolutional attention networks, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 349–365.
[40] Y. Qiu, Y. Liu, J. Arteaga-Falconi, H. Dong, A. El Saddik, EVM-CNN: Real-time contactless heart rate estimation from facial video, IEEE Trans. Multimed. 21 (7) (2018) 1778–1787.
[41] J. Lin, C. Gan, S. Han, TSM: Temporal shift module for efficient video understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7083–7093.
[42] S. Bobbia, R. Macwan, Y. Benezeth, A. Mansouri, J. Dubois, Unsupervised skin tissue segmentation for remote photoplethysmography, Pattern Recognit. Lett. 124 (2019) 82–90.
[43] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, W. Freeman, Eulerian video magnification for revealing subtle changes in the world, ACM Trans. Graph. 31 (4) (2012) 1–8.
[44] Z. Zhang, J.M. Girard, Y. Wu, X. Zhang, P. Liu, U. Ciftci, S. Canavan, M. Reale, A. Horowitz, H. Yang, et al., Multimodal spontaneous emotion corpus for human behavior analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3438–3446.
[45] X. Niu, H. Han, S. Shan, X. Chen, SynRhythm: Learning a deep heart rate estimator from general to specific, in: 2018 24th International Conference on Pattern Recognition, ICPR, IEEE, 2018, pp. 3580–3585.
[46] X. Niu, X. Zhao, H. Han, A. Das, A. Dantcheva, S. Shan, X. Chen, Robust remote heart rate estimation from face utilizing spatial-temporal attention, in: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2019, IEEE, 2019, pp. 1–8.
[47] X. Niu, S. Shan, H. Han, X. Chen, RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation, IEEE Trans. Image Process. 29 (2020) 2409–2423.
[48] X. Niu, Z. Yu, H. Han, X. Li, S. Shan, G. Zhao, Video-based remote physiological measurement via cross-verified feature disentangling, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, Springer, 2020, pp. 295–310.
[49] H. Lu, H. Han, NAS-HR: Neural architecture search for heart rate estimation from face videos, Virtual Real. Intell. Hardw. 3 (1) (2021) 33–42.




