Remote Photoplethysmography for Heart Rate Measurement: A Review
Keywords: Heart rate; Remote photoplethysmography; Non-contact; Deep learning

Heart rate (HR) ranks among the most critical physiological indicators of the human body, significantly illuminating an individual's state of physical health. Distinguished from traditional contact-based heart rate measurement, the use of remote photoplethysmography (rPPG) for remote heart rate monitoring eliminates the need for skin contact, relying solely on a camera for detection. This non-contact measurement method has emerged as an increasingly noteworthy research area. With the rapid development of deep learning, new technologies in this area have spurred the emergence of many new rPPG methods for HR measurement. However, comprehensive review papers in this field are scarce. Consequently, this paper aims to provide a comprehensive overview centered on rPPG methods employed for heart rate measurement. We systematically organize the existing rPPG methods, with a specific focus on those based on deep learning, and describe and analyze the structures and key aspects of these methods. Additionally, we summarize the datasets and tools used for related research and compile the performance of different methods on prominent datasets. Finally, this paper discusses the current research barriers in rPPG methods, as well as the latest practical applications and potential future directions for development. We hope that this review will help researchers quickly understand this field and promote the exploration of more unknown challenges.
1. Introduction

Physiological indicators, such as HR, heart rate variability (HRV), respiratory rate (RR), blood oxygen saturation (SpO2), and blood pressure (BP), are commonly used to assess a person's physical health status, detect potential diseases, and monitor recovery during clinical treatment [1-6]. Among these indicators, HR is the most widely used and can reveal certain cardiovascular problems, including atherosclerosis, myocardial infarction, and arrhythmia [2]. Photoplethysmography (PPG) is a non-invasive and cost-effective method of measuring these physiological parameters [2-6]. Medical devices based on PPG have been widely used in clinical settings to detect and monitor various physiological indicators, and PPG is also used in everyday devices such as sports watches and finger pulse oximeters. The use of PPG is beneficial in both clinical and non-clinical settings, as it provides real-time monitoring of physiological indicators, facilitates early detection of health problems, and helps maintain a healthy lifestyle.

The basic principle of PPG is to use a light source and a photodetector to measure changes in the volume of blood vessels under the skin. When the tissue is illuminated by the light source, small changes in the reflection or transmission intensity of the light caused by the blood flow are captured by the photodetector, generating the PPG signal [1]. PPG is effective because light absorption follows the Beer-Lambert law, which states that the amount of light absorbed by blood is proportional to the concentration of hemoglobin in the skin and blood. Therefore, during the cardiac cycle, small changes in hemoglobin concentration cause fluctuations in the amount of light absorbed by the blood vessels, resulting in changes in the skin intensity value [7]. Contact devices such as pulse oximeters and fitness watches use PPG to non-invasively measure these small changes in the skin based on this principle. However, these traditional contact devices have many disadvantages: they are unsuitable for vulnerable populations such as infants and patients with skin diseases [8], they can cause discomfort or even skin infections with long-term use [9], and their accuracy is affected by skin humidity, temperature, color, and patient movement [10]. To avoid these disadvantages, researchers have begun to explore non-contact methods of remote HR monitoring, and rPPG has become a powerful alternative. rPPG can use a camera (such as a web camera, infrared camera, or RGB camera) to record video of the subject's face and extract subtle color changes in the skin to generate the remote PPG signal [11].

The principle of the rPPG method is similar to that of the conventional PPG method: pulsating blood propagating through the cardiovascular system changes the blood volume in the microvascular tissue bed under the skin with each heartbeat, generating periodic waves. The main difference between the two methods lies in the way the PPG signal is captured: rPPG methods capture the signal from video recordings of the subject's face, while conventional PPG methods require a physical sensor in contact with the skin. As shown in Fig. 1, the principle of rPPG can be further explained by the dichromatic reflection model (DRM) [12]. When ambient light shines on the skin, it produces specular reflection and diffuse reflection. Specular reflection occurs at the skin surface and does not contain meaningful physiological signals, while diffuse reflection arises from the blood vessels and does. The signal captured by the camera is a combination of specular and diffuse reflections; the rPPG method therefore needs to separate the two and extract the meaningful diffuse component to generate the rPPG signal. rPPG has proven advantageous because subjects do not need to wear contact devices, avoiding their various drawbacks, and it is suitable for long-term continuous monitoring and friendly to a wide range of patients. Furthermore, the camera required for the rPPG method is low-cost and easy to obtain, making it highly suitable for wide adoption [13]. However, rPPG methods are more challenging to use in real-world scenarios because factors such as lighting conditions, facial hair, and skin tone can degrade the accuracy of the extracted rPPG signal. The rPPG signal is also weaker than that obtained with conventional contact methods, owing to the difference in principle, and therefore requires careful and precise processing.

Fig. 1. Schematic diagram of rPPG signal generation. The camera captures the specular reflection and diffuse reflection produced by the skin under environmental light. The specular reflection contains meaningless surface information, while the diffuse reflection indicates changes in the volume of blood vessels, from which the rPPG signal can be further extracted.

In previous studies, Verkruysse et al. first proposed using consumer-grade cameras to extract rPPG signals for HR measurement [13]. They found that the different channels of the RGB signal carry PPG components of varying relative strength, with the green channel containing the strongest pulsatile signal. This observation is consistent with the fact that hemoglobin absorption of green light is most sensitive to changes in oxygenation, and it successfully demonstrated the feasibility of measuring HR from ordinary consumer-grade camera footage. Since then, various rPPG methods for remote HR measurement have emerged, and a large number of researchers remain actively engaged in this field. The development of rPPG methods has gone through two stages: conventional methods and deep learning methods. Although there are many review articles on conventional rPPG methods [11,14-16] and some on deep learning-based rPPG methods [17,18], the rapid development of deep learning has spurred many new rPPG methods and applications, rendering current review articles insufficient to match the pace of these advancements.

To fill this gap, this paper timely and systematically reviews the latest progress in rPPG methods for HR measurement, aiming to provide researchers with a systematic review and introduction. We categorize the rPPG methods used for HR measurement into traditional and deep learning methods, further divide the deep learning methods into supervised and unsupervised methods, and critically analyze their advantages, limitations, and performance based on model architectures and methodologies. We also introduce other aspects pertinent to rPPG research. In summary, this paper has three main contributions:

(1) We systematically review and analyze rPPG methods used for remote HR measurement, covering all representative methods since the first method appeared, with a particular focus on deep learning methods.

(2) We introduce the latest commonly used resources for rPPG methods and summarize the performance of various methods on datasets.

(3) We discuss the primary challenges and difficulties faced in current research on rPPG methods, along with an outlook on the latest application domains and potential future research directions.

The remainder of this paper is organized as follows: Section 2 analyzes the main conventional methods. Section 3 provides a detailed description of supervised deep learning methods. Section 4 elucidates unsupervised deep learning methods. Section 5 summarizes the datasets and tools currently utilized in rPPG research, as well as the performance of the reviewed methods on these datasets. Section 6 analyzes the challenges currently faced in rPPG research. Section 7 introduces the latest application areas of rPPG. Finally, Section 8 provides a conclusion and an outlook on possible future research directions.

2. Conventional methods

In this section, we introduce representative conventional rPPG methods for remote HR measurement. Before the prevalence of deep learning, conventional rPPG methods were the mainstay of remote HR measurement. These methods rely mainly on mathematical signal processing, and their main purpose is to eliminate motion artifacts and noise in facial videos, thereby obtaining better-quality rPPG signals. Apart from the original method proposed by Verkruysse et al. [13], we divide conventional methods into blind source separation (BSS) based methods and model-based methods. BSS-based methods are well suited to separating the pulse without prior information, while model-based methods use knowledge of the color vectors of different components to control the separation. We summarize these conventional methods in Table 1.
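Nearly all the pipelines summarized in Table 1 share the same final step: band-pass filter a raw color trace to the plausible HR band and take the dominant FFT peak as the heart rate. As a grounding illustration, the following is a minimal NumPy/SciPy sketch in the spirit of the original green-channel approach of Verkruysse et al. [13]; it is not their exact implementation, and the function name and filter settings are our own choices.

import numpy as np
from scipy.signal import butter, filtfilt

def estimate_hr_green(frames, fps, lo=0.7, hi=4.0):
    """Estimate HR (bpm) from a stack of skin-ROI frames (T, H, W, 3), RGB order."""
    trace = frames[..., 1].mean(axis=(1, 2))          # mean green value per frame
    trace = trace - trace.mean()                      # remove the DC component
    b, a = butter(3, [lo, hi], btype="band", fs=fps)  # 0.7-4 Hz ~ 42-240 bpm
    filtered = filtfilt(b, a, trace)
    power = np.abs(np.fft.rfft(filtered)) ** 2
    freqs = np.fft.rfftfreq(len(filtered), d=1.0 / fps)
    band = (freqs >= lo) & (freqs <= hi)
    return 60.0 * freqs[band][np.argmax(power[band])] # dominant in-band frequency

The same band-pass-then-spectral-peak step reappears, in one form or another, in almost every method discussed below.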
Table 1
Analysis of conventional methods.
Category | Ref. | Year | Methods | Description
Original | [13] | 2008 | Band-pass filter + FFT | First demonstration of the feasibility of the rPPG method; found that the green channel contains the strongest pulsatile signal.
BSS-based | [19] | 2010 | ICA | Applies ICA from signal processing to remote HR estimation.
BSS-based | [20] | 2011 | PCA | BSS-based method using PCA, shown to save computational cost.
BSS-based | [21] | 2011 | SCICA | A new artifact-reduction method composed of planar motion compensation and BSS.
BSS-based | [22] | 2014 | JBSS + IVA | Applies IVA to jointly analyze color signals from multiple facial sub-regions.
BSS-based | [23] | 2017 | JBSS | Improves the PPG signal by combining facial sub-region landmark localization with JBSS to extract physiological signals.
BSS-based | [24] | 2017 | CEEMDAN + CCA | Uses CEEMDAN and CCA to eliminate noise artifacts.
Model-based | [12] | 2013 | CHROM | A robust chrominance-based technique for extracting HR from CCD camera videos during motion.
Model-based | [25] | 2014 | PBV | Proposes that the blood volume pulse signature (PBV) in skin reflectance spectra can be used to distinguish physiological signals from motion noise.
Model-based | [28] | 2017 | POS | Uses the plane orthogonal to the skin (POS) to measure HR, combining normalized RGB channels into two new channels and weighting them into the desired signal.
2.1. Conventional methods based on BSS

Conventional BSS: BSS refers to the recovery of unobserved signals or sources from a set of observed mixtures without any prior information about the mixing process. Typically, the observations are outputs of sensors, each of which is a combination of the sources [29]. Independent component analysis (ICA) is a typical BSS method and has been shown to be effective in many fields [30]. Poh et al. [19] proposed an ICA algorithm based on joint approximate diagonalization of eigenmatrices to remove the correlations and higher-order dependencies among the RGB channels and extract the HR component in sit-still and sit-move-naturally scenarios. The root mean square error (RMSE) in the motion scenario decreased from 19.36 bpm to 4.63 bpm, demonstrating the feasibility of ICA for HR estimation. Notably, they were the first to employ the Viola-Jones face detector [31] to automatically generate regions of interest (ROI). Lewandowska et al. [20] proposed using principal component analysis (PCA) to define three independent linear combinations of color channels and demonstrated that PCA is as effective as ICA while greatly reducing computational complexity. Sun et al. [21] introduced a new artifact-reduction method composed of planar motion compensation and BSS, where BSS mainly refers to single-channel ICA (SCICA). A performance evaluation on facial videos of a repeatedly exercising volunteer suggests that the proposed method can track HR. BSS-based methods tolerate motion to some degree but still show limited improvement, especially under severe movements. Since the order of the components extracted via BSS is random, the fast Fourier transform (FFT) is usually used to determine the most probable HR frequency; consequently, BSS-based methods cannot handle cases where the frequency of periodic motion artifacts falls within the normal HR range. Subsequently, Al-Naji et al. [24] proposed estimating HR from video sequences captured by a hovering unmanned aerial vehicle by combining complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and canonical correlation analysis (CCA). The combined CEEMDAN-CCA method outperforms ICA or PCA, particularly in the presence of noise caused by lighting changes, subject motion, and camera motion.

Joint BSS: Conventional BSS techniques were originally designed for processing a single data set, e.g., decomposing multiple color-channel signals from a single facial ROI into independent components [32]. But color-channel signals from multiple facial ROI sub-regions can be used for more accurate HR measurement [33]. With the increasing availability of multiple data sets, various joint BSS (JBSS) methods have been proposed to accommodate multiple data sets simultaneously. From a multi-set and multimodal perspective, several realistic neurophysiological applications highlight the benefits of JBSS as an efficient and promising tool for neurophysiological data analysis. The goal of JBSS is to extract the underlying sources within each data set while keeping a consistent ordering of the extracted sources across data sets [30]. Guo et al. [22] first introduced JBSS into the field of rPPG, mainly applying independent vector analysis (IVA) to jointly analyze color signals from multiple facial sub-regions. Preliminary experimental results show that HR measurement is more accurate than with the ICA-based BSS method. Later, Qi et al. [23] proposed a new non-contact HR measurement method that explores the correlation between facial sub-region data sets through JBSS. Test results on large public databases also show that the proposed JBSS method outperforms previous ICA-based methods. However, HR estimation using JBSS is still preliminary. In the future, beyond color signals and multimodal data collected from facial sub-regions, other types of data could be used by JBSS for more accurate and robust telemetric HR measurement.

2.2. Conventional methods based on models

Owing to their capacity to leverage the information carried by color vectors to manage component separation, a prominent shared attribute of model-based methods is the ability to remove the dependence of the RGB signals on the mean skin reflection color [28]. Model-based methods generally refer to approaches based on the chrominance model (CHROM) [12], approaches that exploit the blood volume pulse signature (PBV) to discriminate pulse signals from motion distortions [25], and approaches based on the plane orthogonal to the skin (POS) [28].

De Haan et al. [12] developed CHROM to account for both the diffuse and specular reflection contributions, which together make the observed color vary with the distance (and angle) from the camera to the skin and to the light sources. Following the CHROM approach, the influence of motion artifacts can be eliminated by utilizing linear combinations of the individual R, G, and B channels. Experimental results demonstrated that CHROM outperformed previous ICA- and PCA-based methods during motion.
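To make the chrominance idea concrete, below is a compact sketch of the CHROM projection as it is commonly reproduced. The fixed coefficients follow [12], but the whole-sequence normalization here replaces the overlapping-window, overlap-add scheme of the original paper, so this is an illustrative simplification rather than the published implementation.

import numpy as np

def chrom_pulse(rgb, eps=1e-9):
    """CHROM projection of an RGB trace (T, 3) -> pulse signal (T,).
    Simplified: whole-sequence normalization instead of the overlapping
    windows used in the original paper [12]."""
    norm = rgb / (rgb.mean(axis=0) + eps)      # temporal normalization per channel
    r, g, b = norm[:, 0], norm[:, 1], norm[:, 2]
    x = 3.0 * r - 2.0 * g                      # chrominance signal X
    y = 1.5 * r + g - 1.5 * b                  # chrominance signal Y
    alpha = x.std() / (y.std() + eps)          # motion-robust mixing weight
    s = x - alpha * y
    return s - s.mean()

The linear combinations are chosen so that, for standardized skin tone, specular and motion-induced intensity variations largely cancel in the final mixture.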
Table 2
Analysis of 2D CNN methods.
Name | Year | Network | Methods | Description
HR-CNN [37] | 2018 | 2D CNN | - | A two-stage 2D CNN composed of an extractor and an HR estimator; the first deep learning rPPG method.
DeepPhys [39] | 2018 | 2D CNN | Attention | A VGG-style 2D CNN that jointly trains motion and appearance models.
EVM-CNN [40] | 2018 | 2D CNN | EVM | EVM extracts facial color variations, while a 2D CNN estimates HR.
MTTS-CAN [32] | 2020 | 2D CNN | Attention + TSM | Uses TSM to capture temporal information and an attention mechanism to guide the motion model.
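As an illustration of the motion/appearance split used by DeepPhys and MTTS-CAN in Table 2, the sketch below computes normalized frame differences for the motion branch and normalized raw frames for the appearance branch. The exact normalization details in [39] differ slightly (they also standardize over the whole clip), so treat this as a hedged approximation with our own function name.

import numpy as np

def motion_appearance_inputs(frames, eps=1e-9):
    """DeepPhys-style inputs from a clip (T, H, W, 3): normalized frame
    differences for the motion branch, normalized frames for appearance."""
    f = frames.astype(np.float64)
    motion = (f[1:] - f[:-1]) / (f[1:] + f[:-1] + eps)  # per-pixel normalized diff
    motion = motion / (motion.std() + eps)              # scale to unit variance
    appearance = f[:-1] / (f[:-1].mean(axis=(1, 2, 3), keepdims=True) + eps)
    return motion, appearance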
Results on the MMSE-HR [44] dataset demonstrate the effectiveness of the proposed method, with an RMSE of 6.95 and a Pearson correlation coefficient of 0.98. These results underscore the irreplaceable importance of prior knowledge, even in the domain of deep learning methods.

3.2. Spatial-temporal map methods

Notwithstanding the successful utilization of 2D CNNs in deep learning-based rPPG methods for remote HR measurement, a prominent limitation of these methods is the absence of temporal information. As rPPG signals exhibit periodicity, temporal information plays a crucial role in accurately estimating them, rendering its absence one of the foremost constraints of 2D CNN methods. To mitigate this limitation, Niu et al. [45] introduced the notion of spatial-temporal maps. As illustrated in Fig. 4, for a video consisting of $T$ frames, the detected facial region is partitioned into an $M \times N$ grid, which is further subdivided into $n$ ROI blocks, under the assumption that the blocks remain aligned across frames. An average pooling operation helps mitigate sensor noise in the HR signal. Specifically, let $C(x, y, t)$ denote the value of an RGB channel at position $(x, y)$ in the $t$-th frame. The average pooled value of the $i$-th ROI block in each channel at the $t$-th frame, $C_i(t)$, can be expressed as:

$C_i(t) = \dfrac{\sum_{x, y \in ROI_i} C(x, y, t)}{|ROI_i|}$  (1)

where $|ROI_i|$ denotes the area of the ROI block, i.e., its number of pixels. Therefore, for each facial video, $n$ time series of length $T$ are obtained in each of the three RGB channels, e.g., $C_i = \{C_i(1), C_i(2), \ldots, C_i(T)\}$, where $C$ denotes one of the RGB channels and $i$ the index of the ROI. To fully utilize the information, min-max normalization is applied to each time series to scale its values to [0, 255]. Finally, the $n$ time series are arranged in rows to form a spatial-temporal map of size $n \times T \times 3$ from the original video sequence, which serves as the input to the subsequent network. Table 3 summarizes the rPPG methods that utilize spatial-temporal maps.

Fig. 5. The architecture of SynRhythm.

RGB spatial-temporal map: Based on this concept, Niu et al. [45] introduced SynRhythm, the first rPPG method that utilizes spatial-temporal maps for HR measurement. The overall architecture is shown in Fig. 5. In addition to applying spatial-temporal maps, they employed transfer learning to train the HR estimator by transferring a pre-trained model to the real HR estimation task. The RMSE achieved on the MAHNOB-HCI dataset was 4.49, comparable to benchmark methods, suggesting that spatial-temporal maps effectively emphasize HR information while attenuating irrelevant signals. Niu et al. [46] then proposed ST-Attention, which introduces an attention mechanism on top of spatial-temporal maps. They utilized spatial-temporal maps to obtain effective representations of rPPG signals from facial videos and used the attention mechanism to remove noise: it filters out irrelevant features from video sequences and learns rich representations, thereby enhancing the effectiveness of spatial-temporal maps and improving remote HR measurement to some extent. They utilized all generated spatial-temporal maps to train the HR estimator and estimate rPPG signals.

Complex spatial-temporal map: Spatial-temporal maps are typically constructed directly from the RGB color channels, which may leave the generated maps lacking sufficient pulse information. To address this limitation, Niu et al. [47] proposed a new benchmark spatial-temporal map method called RhythmNet, shown in Fig. 6. In RhythmNet, facial images are converted to YUV color channels instead of the traditional RGB channels before generating spatial-temporal maps, effectively separating the visual features of HR from the large amount of background signal. Furthermore, to account for the temporal correlation between adjacent HR measurements in video sequences, they utilized gated recurrent units (GRU) [50]. In addition to the color-channel transformation of [47], Song et al. [36] considered using rPPG signals directly to construct spatial-temporal maps. They extracted rPPG signals from ROIs using the CHROM method [12] and generated spatial-temporal maps from these preliminary estimates, yielding maps with stronger motion robustness and clearer structure for the subsequent CNN to learn from. Similar to the idea of Song et al. [36], Hao et al. [49] chose the POS method [28] for initial rPPG signal estimation.
Table 3
Analysis of spatial-temporal map methods.
Name | Year | Network | Methods | Description
SynRhythm [45] | 2018 | 2D CNN | Spatial-temporal map | The first method to use spatial-temporal maps for remote HR measurement, addressing data scarcity through transfer learning.
ST-Attention [46] | 2019 | 2D CNN | Spatial-temporal map + attention | HR estimation from spatial-temporal maps, with noise removal via an attention mechanism.
RhythmNet [47] | 2020 | 2D CNN | Spatial-temporal map + GRU | GRUs model the relationship between adjacent HR measurements in video sequences; a combined 2D CNN and GRU pipeline estimates HR.
CVD [48] | 2020 | 2D CNN | Spatial-temporal map + disentangled feature learning | Removes noise in spatial-temporal maps via cross-validated feature disentanglement, supervised simultaneously by the rPPG signal and HR.
NAS-HR [49] | 2021 | 2D CNN | Spatial-temporal map + NAS | Uses NAS to find a lightweight optimal 2D CNN that estimates HR from spatial-temporal maps.
Table 4
Analysis of 3D CNN methods.
Name | Year | Network | Methods | Description
3D CNN [53] | 2019 | 3D CNN | Data augmentation | Pioneering use of a 3D CNN for signal extraction, with data augmentation generating videos with synthetic rPPG signals and a multi-layer perceptron estimating HR.
PhysNet [54] | 2019 | 3D CNN | - | Compares 2D CNN+RNN and 3D CNN spatiotemporal networks, indicating that 3D CNNs are better suited to rPPG methods.
rPPGNet [55] | 2019 | 3D CNN | Attention | A two-stage 3D CNN method that estimates rPPG signals and overcomes the challenges of highly compressed facial videos.
HeartTrack [56] | 2020 | 3D CNN | Attention | 3D CNN combined with soft and hard attention mechanisms for signal extraction.
AutoHR [57] | 2020 | 3D CNN | NAS | Uses NAS to automatically find a suitable backbone 3D CNN for rPPG signal extraction.
DeeprPPG [58] | 2020 | 3D CNN | - | Uses different skin regions as input for rPPG signal estimation, allowing customizable ROI selection and broader applications.
Siamese-rPPG [59] | 2020 | Siamese 3D CNN | Spatiotemporal aggregation | A Siamese network whose two branches take the cheek and forehead regions as ROIs, each feeding a 3D CNN for rPPG signal estimation.
ETA-rPPGNet [60] | 2021 | 3D CNN | Attention | Proposes an ETA module that uses effective temporal-domain attention to improve the accuracy and stability of HR estimation, with a 3D CNN for rPPG signal estimation.
SAM-rPPGNet [61] | 2021 | 2D CNN + 3D CNN | Attention | Proposes a SAM for learning salient features to reduce head-motion noise, used in conjunction with a 3D CNN for signal estimation.
Fig. 8. The architecture of 3D CNN.

The initial 3D CNN method [53] did not demonstrate a clear advantage over previous 2D CNN methods. However, the benchmark-level 3D CNN method PhysNet subsequently proposed by Yu et al. [54] demonstrated the advantages of 3D CNN methods. PhysNet likewise performs no preprocessing and directly feeds the raw RGB video frames into the 3D CNN backbone. Its backbone effectively learns the temporal and spatial contextual features of facial sequences and directly outputs the rPPG signal without post-processing. To compare the performance of 3D and 2D CNNs, the authors also built a 2D CNN version of PhysNet. Experimental results on the private OBF dataset showed an RMSE of 2.94 for the 2D CNN version versus 1.81 for the 3D CNN version, a significant performance difference that established the important position of 3D CNNs in rPPG signal estimation. Interestingly, PhysNet was also the first to consider applying rPPG signals to emotion recognition.

3.3.2. Attention mechanism methods

Facial videos often contain redundant information, and motion artifacts introduced by body movements can significantly bias rPPG signal estimation. To address these limitations and obtain more stable rPPG signals, researchers have introduced attention mechanisms into video-based rPPG estimation. These attention mechanisms help learn salient features related to facial information in videos, allowing the model to focus on relevant information and reduce motion artifacts.

Attention in backbone: The denoising effectiveness of 3D CNNs can be attributed to attention mechanisms, which enable the model to select the most relevant facial regions based on spatial and temporal information. There are two types of attention mechanisms: hard attention and soft attention. Soft attention generally performs better, but hard attention has a lower computational cost. HeartTrack [56] is a 3D CNN approach that combines both. In HeartTrack, attention mechanisms are used to enhance the denoising capability of the 3D CNN: the hard attention mechanism helps HeartTrack ignore irrelevant background information, while the soft attention mechanism helps filter out occluded regions. In extensive experiments on the UBFC-rPPG dataset, HeartTrack achieves the best RMSE of 3.37, outperforming the initial 3D CNN method [53]. Heavily compressed videos can make it difficult for a 3D CNN backbone to capture salient features from facial videos, degrading the quality of the extracted rPPG signals. To overcome this issue, Yu et al. [55] proposed a two-stage approach consisting of two 3D CNNs with different tasks: the Spatio-Temporal Video Enhancement Network (STVEN), responsible for video enhancement, and rPPGNet, which serves as the backbone for rPPG signal estimation. This two-stage approach can effectively handle heavily compressed facial videos, as shown in Fig. 9. STVEN enhances video quality and retains as much information as possible from compressed facial video inputs. Within the backbone rPPGNet, an attention mechanism extracts dominant rPPG features from the skin region. rPPGNet can extract rPPG signals independently or be jointly trained with STVEN for better performance. Experimental results show that rPPGNet performs excellently and demonstrates strong robustness on compressed videos: on the heavily compressed MAHNOB-HCI dataset [62], rPPGNet obtains an RMSE of 5.93, while the baseline PhysNet [54] achieves 8.76, validating the effectiveness of STVEN and rPPGNet.

Attention in other modules: Differing from the use of attention within the backbone 3D CNN, Hu et al. [61] proposed a spatial-temporal attention module (SAM) based on 3D CNN for learning salient features. They further proposed a 3D CNN approach called SAM-rPPGNet, which incorporates the attention mechanism.
Table 5
Analysis of RNN methods.
Name | Year | Network | Methods | Description
Bian et al. [69] | 2019 | LSTM | - | The first rPPG method using LSTM: traditional methods extract coarse signals, and a two-layer LSTM filters them.
Botina et al. [71] | 2020 | LSTM | - | A Long Short-Term Deep-Filter is proposed for filtering rPPG signals.
Huang et al. [72] | 2020 | 2D CNN + LSTM | - | 2D CNN for spatial feature extraction, LSTM for capturing temporal information, with fully connected layers for HR estimation.
RhythmNet [47] | 2020 | 2D CNN + GRU | Spatial-temporal map | GRUs model the relationship between adjacent HR measurements in video sequences; a combined 2D CNN and GRU pipeline estimates HR.
Meta-rPPG [73] | 2020 | 2D CNN + LSTM | Meta learning | Uses a transductive meta-learner on unlabeled data for rapid adaptation to diverse sample distributions, with a 2D CNN + BiLSTM spatiotemporal architecture for signal extraction.
3.4.1. LSTM methods

A typical LSTM unit performs several basic operations to retain or forget information. The retained information can be interpreted as the cell state, while the forgotten information can be interpreted as the hidden state, which is a key concept of LSTM. The cell state can retain relevant information throughout the entire input time series, while the hidden state carries information from previous steps. These two states can effectively exploit the contextual information of rPPG signals. The memory function of LSTM is illustrated in Fig. 13, where [X0, X1, X2] represents the input sequence, [H0, H1, H2] the corresponding hidden states (cell states), and [Y0, Y1, Y2] the outputs. The colors of the matrix blocks (green, red, blue) represent the different information contained in the input time series at t = 0, 1, 2. At t = 2, information from previous inputs can still flow to the last hidden state or directly to the output.

Fig. 13. The memory function of LSTM.

LSTM for denoising: The use of LSTM can effectively filter out noise and retain useful signal in a data-driven manner. Bian et al. [69] proposed the first rPPG method that utilizes LSTM. They trained a two-layer LSTM to filter coarse rPPG signals, as illustrated in Fig. 14. Instead of estimating the rPPG signal directly with the LSTM, they first estimate a coarse rPPG signal from facial videos using traditional methods and then feed it into the trained two-layer LSTM for filtering, obtaining a refined version of the original coarse signal. To train the two-layer LSTM, they algorithmically generated a large number of heavily noised synthetic signals to enhance the model's generalization ability. Similarly considering LSTM for signal filtering, Botina-Monsalve et al. [71] designed a Long Short-Term Memory Deep Filter (LSTM-DF) specifically for rPPG signal filtering. The LSTM-based LSTM-DF can learn the characteristic shape of rPPG signals, especially their temporal structure, thereby reducing noise and improving quality. Experimental results showed that extracting coarse signals with traditional methods and then filtering them with an LSTM improves performance to some extent, but the signals extracted by traditional methods remain too coarse for excellent signal quality to be achieved even with filtering.

LSTM for estimating: CNN-based methods have been shown to yield better-quality rPPG signals than traditional methods, and the combination of LSTM and CNN has proven to significantly improve rPPG estimation quality. Huang et al. [72] proposed an approach that combines a 2D CNN with LSTM: the 2D CNN extracts spatial features and local temporal information from each frame's ROI in the input video, while the LSTM captures global temporal information across consecutive frames. The LSTM output is fed directly into a fully connected layer for HR estimation, bypassing explicit rPPG signal extraction to save computation time. Experiments on the UBFC-rPPG dataset showed an RMSE of 2.84, with the HR updated roughly every second. However, because Huang et al. [72] bypass signal extraction and estimate HR directly with a fully connected layer, some errors may be introduced. Wang et al. [70] used a 2D CNN as the backbone for feature extraction and designed a two-stream network with separate streams for feature extraction and rPPG signal extraction, corresponding to two different tasks, as shown in Fig. 15. In the TWO-STREAM approach, spatiotemporal maps generated from the input video are fed into the feature extraction stream and the rPPG signal extraction stream. The feature extraction stream is based on a 2D CNN and extracts synchronized spatial features from the spatiotemporal maps, improving the robustness of face detection and reducing ROI alignment errors. The rPPG signal extraction stream combines a 2D CNN and LSTM, where the 2D CNN performs initial rPPG signal extraction and a two-layer LSTM is used
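The two-layer LSTM filtering idea that recurs in these methods can be sketched in a few lines of PyTorch. This is our own minimal formulation in the spirit of [69,71], not the published configurations; hidden size, class name, and the training pairing are illustrative assumptions.

import torch
import torch.nn as nn

class LSTMDenoiser(nn.Module):
    """Two-layer LSTM filter sketch: a coarse rPPG trace in, a cleaned trace out."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, coarse):                  # coarse: (B, T)
        h, _ = self.lstm(coarse.unsqueeze(-1))  # (B, T, hidden)
        return self.out(h).squeeze(-1)          # refined trace (B, T)

# Training pairs noisy (possibly synthetic) traces with clean targets, e.g.:
# loss = nn.functional.mse_loss(model(noisy), clean)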
Table 6
Analysis of GAN methods.
Name | Year | Network | Methods | Description
Deep-HR [80] | 2021 | 2D CNN + DNN + GAN | RFB | GAN-based modules enhance the detected ROI and eliminate noise; a 2D CNN extracts the signal.
PulseGAN [82] | 2021 | GAN | CHROM | A GAN filters the signal, generating a high-fidelity rPPG signal from a coarse rPPG signal.
Dual-GAN [84] | 2021 | 2D CNN + dual GAN | Spatial-temporal map + disentangled feature learning | A BVP-GAN learns the denoising mapping from input to real BVP, while a Noise-GAN learns the noise distribution; the two mutually promote each other to enhance feature disentanglement.
Table 7
Analysis of transformer methods.
Name | Year | Network | Methods | Description
Instantaneous_transformer [90] | 2022 | 2D CNN + ViT | - | The first method utilizing a transformer for real-time physiological estimation.
PhysFormer [91] | 2022 | ViT | Temporal-difference learning | A temporal-difference transformer for exploring long-range temporal-spatial relationships in rPPG measurement.
PhysFormer++ [93] | 2023 | ViT | Temporal-difference learning + SlowFast | Adds a dual-channel SlowFast architecture with cross-speed interaction to PhysFormer for more robust handling of head motion.
APNET [95] | 2022 | MaxViT | Axis projection | Proposes APNET, which gathers information from each direction by projecting videos onto different axes.
RADIANT [94] | 2023 | ViT | Signal embedding | A domain-generalized rPPG network based on decoupled feature learning; the first method addressing the domain generalization issue in rPPG methods.
EfficientPhys [97] | 2023 | 2D CNN + Swin Transformer | TSM | Eliminates preprocessing entirely and compares 2D CNN-based and transformer-based backbone networks.
Fig. 26. The architecture of RErPPGNet.

Fig. 27. The architecture of PRN augmented.
Table 8
Analysis of data augmentation methods.
Name | Year | Network | Methods | Description
Multi-task [98] | 2021 | 3D CNN | Data augmentation | A multi-task framework that learns the rPPG signal extraction model and the data augmentation model simultaneously.
rPPGRNet + THRNet [101] | 2021 | 3D CNN | Data augmentation | rPPGRNet recovers rPPG information, while THRNet enhances discriminative features of facial images and suppresses small noise in them.
RErPPGNet [99] | 2022 | 3D CNN | Data augmentation + double cycle-consistent learning | Double cycle-consistent learning for data augmentation significantly enhances signal estimation quality.
PRN augmented [102] | 2022 | 3D CNN | Data augmentation + style transfer | A 3D CNN-based skin tone generator converts facial images with different skin tones into a consistent dark-toned style.
Table 9
Analysis of other methods.
Name | Year | Network | Methods | Description
Nowara et al. [103] | 2021 | Attention network + LSTM | Attention | Proposes an anti-attention mechanism that uses facial, hair, and background regions to reduce noise caused by head movement and illumination.
Hu et al. [106] | 2021 | DNN | Attention | An attention module and a temporal fusion module serve as the network's fundamental building blocks.
AND-rPPG [104] | 2022 | TCN | AU | Applies action units (AU) to denoise temporal signals, improving their modeling.
rPPG-FuseNet [105] | 2022 | DCNN | Spatial-temporal map | Fuses RGB and MSR signals, using two DCNNs to estimate rPPG signals.
DG-rPPGNet [108] | 2022 | Domain generalization network | Disentangled feature learning | A domain-generalized rPPG network based on disentangled feature learning, highlighting the domain generalization issue in rPPG methods for the first time.
TDMTALOS [109] | 2022 | 2D CNN | DTC | A lightweight model that uses a TDM module to estimate rPPG signals and the TALOS loss function to handle bias.
Arbitrary_Resolution_rPPG [107] | 2022 | 3D CNN + MT CNN | Data augmentation | Two plug-and-play modules, PFE and TFA, alleviate the degradation caused by changes in distance and head movements.
4. Unsupervised deep learning methods

Fig. 32. The architecture of Gideon et al.'s method.
Table 10
Analysis of unsupervised deep learning methods.
Name | Year | Network | Methods | Description
Gideon et al. [111] | 2021 | 3D CNN | Contrastive learning | The first unsupervised rPPG method, using contrastive learning to estimate rPPG signals.
SLF-RPM [112] | 2022 | 3D CNN | Data augmentation + contrastive learning | Proposes a landmark-based spatial augmentation method to improve the effectiveness of contrastive learning.
Fusion ViViT [114] | 2022 | ViT | Contrastive learning | Uses RGB and NIR for joint feature representation with transformer-based contrastive learning.
Contrast-Phys [115] | 2022 | 3D CNN | Data augmentation + contrastive learning | Proposes ST-rPPG blocks, based on four observations about rPPG signals, for contrastive learning of spatiotemporal rPPG signals.
Yue et al. [116] | 2022 | 3D CNN | Data augmentation + contrastive learning | LFA for data augmentation and positive/negative sample generation, and REA for estimating rPPG signals.
SimPer [117] | 2023 | 2D CNN | Data augmentation + contrastive learning | Learns efficient and robust periodic representations through relative sampling rates and a generalized contrastive loss.
SiNC [118] | 2023 | 3D CNN | Data augmentation + penalized regression | Uses penalized regression in the loss design; the first unsupervised method without contrastive learning.
rPPG-MAE [119] | 2023 | ViT | Spatial-temporal map + MAE | The first rPPG method to incorporate MAE, alongside a novel PC-STMap, achieving the best unsupervised performance.
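Most of the contrastive methods in Table 10 build on the same scaffold: representations of clips that should share a pulse are pulled together, while representations of frequency-altered or unrelated clips are pushed apart. A generic InfoNCE-style sketch is given below; it is our own minimal formulation, not the loss of any specific paper above.

import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.1):
    """Generic InfoNCE loss sketch for contrastive rPPG pretraining.
    anchor/positive: (B, D) embeddings of clips sharing a pulse;
    negatives: (B, K, D) embeddings of, e.g., frequency-altered clips."""
    def sim(a, b):
        return F.cosine_similarity(a, b, dim=-1) / tau
    pos = sim(anchor, positive).unsqueeze(-1)                 # (B, 1)
    neg = sim(anchor.unsqueeze(1), negatives)                 # (B, K)
    logits = torch.cat([pos, neg], dim=1)                     # positive at index 0
    labels = torch.zeros(anchor.size(0), dtype=torch.long,
                         device=anchor.device)
    return F.cross_entropy(logits, labels)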
Table 11
Summary of public camera physiological measurement datasets.
Dataset | Subjects | Videos | Imaging | Gold standard | Free access
DEAP [78] | 32 | 874 | Resolution: 720 x 576; frame rate: 56 fps | ECG | Yes
MAHNOB-HCI [62] | 27 | 527 | Resolution: 1040 x 1392; frame rate: 24 fps | ECG | Yes
AFRL [122] | 25 | 300 | Resolution: 658 x 492; frame rate: 120 fps | PPG, ECG, RR | No
PURE [38] | 10 | 60 | Resolution: 640 x 480; frame rate: 30 fps | PPG, SpO2 | Yes
MMSE-HR [44] | 40 | 102 | Resolution: 1040 x 1392; frame rate: 25 fps | HR, BP | Yes
COHFACE [67] | 40 | 160 | Resolution: 640 x 480; frame rate: 20 fps | PPG | Yes
ECG-Fitness [37] | 17 | 204 | Resolution: 1920 x 1080; frame rate: 30 fps | PPG, ECG | Yes
OBF [52] | 100 | 200 | Resolution: 1920 x 1080; frame rate: 60 fps | PPG, ECG, RR | No
VIPL-HR [65] | 107 | 3130 | Resolution: 960 x 720, 1920 x 1080, 640 x 480; frame rate: 60 fps, 30 fps | PPG, HR, SpO2 | Yes
MR-NIRP [78] | 19 | 190 | Resolution: 640 x 640; frame rate: 60 fps | PPG | Yes
UBFC-rPPG [42] | 50 | 50 | Resolution: 640 x 480; frame rate: 30 fps | PPG, HR | Yes
VicarPPG-2 [123] | 50 | 50 | Resolution: 1280 x 720; frame rate: 30 fps | PPG, HR | Yes
MMVS [101] | 129 | 762 | Resolution: 1920 x 1080; frame rate: 25 fps | PPG | No
V4V [124] | 179 | 1358 | Resolution: 1720 x 720; frame rate: 25 fps | PPG, HR, BP | Yes
UBFC-Phys [125] | 56 | 168 | Resolution: 1024 x 1024; frame rate: 30 fps | PPG, HR | Yes
Scamps [126] | 2800 | 2800 | Resolution: 320 x 240; frame rate: 30 fps | PPG, PR, RR | Yes
MMPD [127] | 22 | 55 | Resolution: 1280 x 720; frame rate: 30 fps | PPG, HR | Yes
with a total of 874 videos recorded at a resolution of 720 x 576 and a frame rate of 50 fps. Each participant was asked to watch a 1-minute music video to induce varying emotional states, leading to changes in HR. DEAP collected authentic PPG signals, from which the real HR values can be calculated.

MAHNOB-HCI [62] is a multimodal database involving 27 participants, each recording 20 videos, for a total of 527 videos. The videos were recorded at a resolution of 780 x 580 and a frame rate of 61 fps. While the original purpose of MAHNOB-HCI was emotion recognition and implicit tagging research, its inclusion of real physiological signals such as the electrocardiogram (ECG) also makes it suitable for evaluating rPPG-based remote HR measurement methods. All participants took part in emotion induction and implicit tagging experiments, during which HR fluctuated with the participants' emotions. Additionally, six cameras captured different views of the participants (frontal, profile, wide-angle, close-up), making this dataset suitable for evaluating performance under pose and angle variations.

AFRL [122] was proposed by the U.S. Air Force Research Laboratory and includes recordings from 25 participants (17 males and 8 females), consisting of 300 videos. Each video was recorded at a resolution of 658 x 492 and a frame rate of 120 fps. For each participant, six recordings were made, with head movement increasing from task to task. In the first two tasks, participants were asked to sit still; they then rotated their heads around the vertical axis at angular velocities of 10°/s, 20°/s, and 30°/s, completing three motion tasks. In the last task, participants were asked to randomly reposition their heads to one of nine predefined locations every second. The background was either a solid black fabric or a patterned colored fabric. Real physiological signals, including PPG, ECG, and respiration, were also collected.

PURE [38] consists of recordings from 10 participants (8 males, 2 females). Each participant recorded six one-minute videos, for a total of 60 videos at a resolution of 640 x 480 and a frame rate of 30 fps. Each participant performed six different tasks designed to introduce variations in head movement: (1) sitting still, (2) talking, (3) slow head movement, (4) fast head movement, (5) rotating the head at a 20° angle, and (6) rotating the head at a 35° angle. The PURE dataset also considered illumination changes by using natural sunlight and cloud cover through a large window. Real PPG signals were collected using a CMS50E finger pulse oximeter with a sampling rate of 60 Hz. It is worth mentioning that the images in PURE are stored in lossless PNG format, which benefits rPPG signal estimation.

MMSE-HR [44] involves 40 participants from diverse racial backgrounds, including Asian, White, Black, and Hispanic/Latino participants. A total of 102 videos were recorded, each at a resolution of 1040 x 1392 and a frame rate of 25 fps. The original purpose of MMSE-HR was facial expression analysis, but the true values of physiological signs such as HR were recorded. Because MMSE-HR includes participants with different skin tones, it is well suited to evaluating the performance of methods across skin tones.

COHFACE [67] is a publicly available dataset proposed by the Idiap Research Institute, designed to let researchers evaluate their rPPG methods under standardized and fair criteria. COHFACE consists of 40 participants (28 males, 12 females), each of whom recorded four video segments, for a total of 160 videos at a resolution of 640 x 480 and a frame rate of 20 fps. Each participant wore a contact-based PPG sensor to obtain real PPG signals and related data. Lighting conditions were taken into consideration during recording: two video segments were recorded per participant under each of two conditions: (1) studio lighting, with windows closed to exclude natural light and sufficient artificial light to stably illuminate the participant's face; and (2) natural light, with windows open and all artificial lights turned off. The main limitation of COHFACE is that the videos are heavily compressed, introducing significant noise that can greatly affect rPPG signal estimation.

ECG-Fitness [37] comprises 17 participants (14 males, 3 females) engaged in four different activities (speaking, rowing, exercising on a stationary bicycle, and using an elliptical trainer). The videos were recorded using two Logitech C920 web cameras and a FLIR thermal imager under three distinct lighting conditions: natural light from nearby windows, 400 W halogen lamps, and 30 W LED lamps. Each participant generated 12 videos across the three lighting conditions and four activity states, for a total of 204 one-minute videos at a resolution of 1920 x 1080 pixels and 30 frames per second. Remarkably, ECG-Fitness is unique in containing data for the rowing activity.
OBF [52] is a large dataset proposed by the University of Oulu a frame rate of 60 frames per second. Each participant contributed
in Finland, specifically designed for remote physiological signal mea- four videos, with the first video depicting a static state. In the second
surement. The OBF dataset comprises 100 subjects, with a total of video, each participant executed five different pre-planned body/head
200 high-quality RGB facial videos, each lasting 5 min, recorded at movements, including head tilting left and right (nodding), head mov-
a resolution of 1920 × 1080 and 60 fps. The subjects in the OBF ing up and down (head shaking), a combination of head shaking and
dataset consist of two types: healthy participants and atrial fibrillation nodding (rotation), moving their eyes while keeping the head still,
(AF) patients, with recordings taken of the resting state and post- and natural head movements while listening to music (dance). In
exercise state (5 min of exercise) for healthy participants, and pre- and the third video, participants were engaged in a stress-inducing game,
post-cardioversion states for AF patients. In addition, the OBF dataset and in the fourth video, participants sat unrestrictedly after undergo-
includes contact-based devices for recording real PPG signals and other ing fatigue-inducing physical exercise. VicarPPG-2 employed CMS50E
information. Due to the high video quality of the OBF dataset, it can pulse oximeters connected to the participants’ fingertips to record
enhance the performance of rPPG methods to a certain extent. authentic PPG waveforms. This dataset is well-suited for evaluating the
VIPL-HR [65] is a challenging large-scale multimodal dataset that robustness of rPPG methods in extreme scenarios, such as stress and
includes data from 107 subjects. Three different types of videos, namely excessive physical activity states.
RGB videos, NIR videos, and smartphone camera videos, were recorded MMVS [101] is a private dataset that contains multimodal and
using RGB cameras, RGB-D cameras, and smartphone cameras, re- multisubject physiological signals. It includes data from 129 healthy
spectively. A total of 3130 visible light facial videos were recorded subjects, ranging in age from 16 to 83 years old. A total of 762 videos
in the VIPL-HR dataset. RGB videos were recorded using both RGB were recorded, with each video recorded at a resolution of 1920 × 1080
cameras and RGB-D cameras, with a resolution of 960 × 720 and a and a frame rate of 25 fps, lasting approximately one minute. Uniform
frame rate of 25 fps for RGB camera recordings, and a resolution of indoor ambient lighting was used, without specific pre-set backgrounds.
1920 × 1080 and a frame rate of 30 fps for RGB-D camera recordings. MMVS utilizes finger-based pulse oximeters to record real PPG signals,
NIR videos were recorded using RGB-D cameras, which are capable of and employs programming techniques to calibrate the PPG signals with
recording both RGB and NIR videos, with a resolution of 640 × 480 video frames.
and a frame rate of 30 fps for NIR recordings. Smartphone camera V4V [124] is a physiological dataset specifically introduced for the
videos were recorded using smartphone cameras, with a resolution ICCV 2021 Vision for Vitals Challenge. It comprises a total of 179
of 1920 × 1080 and a frame rate of 30 fps. The purpose of using participants, including African Americans, Caucasians, and Asians, each
multiple types of videos is to enable researchers to test the robustness of whom engaged in up to 10 experimental tasks. Each task is metic-
of their methods across different video modalities. Furthermore, the ulously designed to elicit specific emotions among the participants,
dataset introduces two influencing factors, namely head motion (stable, resulting in a total of 1358 videos. These videos vary in length from 5 s
large motion, speaking) and illumination changes (lab, dark, bright), to 206 s, recorded at a resolution of 1280 × 720 pixels and a frame rate
for researchers to evaluate the overall robustness of their proposed of 25 fps. V4V leverages the BIOPAC MP150 data acquisition system
methods. Additionally, VIPL-HR includes various real labels, such as HR, SpO2, and BVP, for comprehensive analysis.

MR-NIRP [78] is the first physiological video dataset that includes driving scenarios. It consists of 190 videos recorded from 19 subjects while driving and while sitting inside a parked car, during which each subject also performed actions such as speaking and randomly moving their head. The videos were captured at a resolution of 640 × 640 and a frame rate of 60 fps. MR-NIRP is designed to evaluate the applicability of rPPG methods in driving scenarios beyond conventional laboratory environments. The dataset records real PPG signals synchronized with the video using a finger pulse oximeter. RGB and NIR data are collected simultaneously, although in practice researchers often use the NIR data for training and testing. It is worth mentioning that this dataset has some imperfections, such as many zero values in the PPG signals, which pose challenges for evaluating rPPG methods.

UBFC-rPPG [42] is a dataset specifically designed for evaluating rPPG methods. It comprises 50 videos, each recorded from a different subject, at a resolution of 640 × 480 and a frame rate of 30 fps. The recordings take variations in sunlight and indoor lighting into consideration. UBFC-rPPG consists of two sub-datasets: sub-dataset 1 is a simplified version with 8 videos in which subjects are asked to sit still, although some videos may involve movement; sub-dataset 2 is a more practical set of 42 videos in which subjects play a time-sensitive mathematical game to raise their HR. UBFC-rPPG is currently one of the most widely used datasets. Its videos are uncompressed and of good quality, and ground-truth data such as HR and PPG signals are recorded, which is convenient for researchers. Although UBFC-rPPG includes two sub-datasets, in practice researchers often use only sub-dataset 2, owing to its more careful recording setup and good video quality.

VicarPPG-2 [123] consists of 10 participants with an average age of 29 years. A total of 40 videos were recorded, each with a duration of 5 min, captured at a resolution of 1280 × 720 pixels.

V4V [124] collects authentic labels, including PPG signals, heart rate, blood pressure, and other physiological measurements. It is worth noting that, despite its substantial scale and diverse challenges, V4V maintains consistent lighting conditions throughout the dataset.

UBFC-Phys [125] is a dataset primarily designed for emotion recognition and consists of 56 participants, including 46 females and 10 males. Participants took part in an experiment inspired by the Trier Social Stress Test (TSST). Each participant was required to complete three tasks (resting, speaking, and arithmetic), resulting in a total of 168 videos, each recorded at a resolution of 1024 × 1024 pixels and a frame rate of 35 fps. UBFC-Phys uses the Empatica E4 wristband to collect PPG signals and skin conductance (EDA) measurements. Additionally, participants filled out a questionnaire before and after the experiment to compute their self-reported anxiety scores. In the future, UBFC-Phys may become an important publicly available dataset for research on rPPG-based emotion recognition.

SCAMPS [126] is a large-scale synthetic physiological dataset that includes 2800 videos, with a resolution of 320 × 240 and a frame rate of 30 fps. SCAMPS provides frame-level ground-truth labels, including PPG, pulse interval, respiratory waveform, respiratory interval, and 10 facial actions, and also offers video-level ground-truth labels for multiple physiological indicators. These parameters are used to generate 20-second PPG waveforms at 300 Hz together with action unit intensities, and each video is rendered using the corresponding waveform, action unit intensities, and randomly sampled appearance attributes such as skin texture, hair, clothing, lighting, and environment. The extensive synthetic data in SCAMPS has demonstrated its potential in various applications, since collecting such data in the real world is challenging for existing datasets. However, SCAMPS is typically used for training rather than testing.

MMPD [127] is the first dataset recorded entirely with smartphone cameras. MMPD includes 33 subjects and a total of 660 one-minute videos, recorded at a resolution of 1280 × 720 and a frame rate of 30 fps; for ease of sharing, the researchers compressed the videos to a resolution of 320 × 240. MMPD considers four different skin tones, four different lighting conditions (LED high, LED low, incandescent, natural), and four different activities (resting, head rotation, conversation, and walking) to provide researchers with diverse environmental conditions for testing the robustness of their methods. Additionally, MMPD conducted four further experiments to investigate the impact of motion on static scenes, requiring subjects to perform high knee raises or other vigorous exercise to raise their HR before recording; after completing the exercises, subjects were given sufficient rest time to calm down before participating in the next experiment. MMPD also records real labels such as HR and actual PPG signals.
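Most of these datasets distribute their reference signals as simple per-subject files. As an illustrative, non-authoritative example, the loader below assumes the ground-truth layout commonly reported for UBFC-rPPG sub-dataset 2, namely a plain-text file with three whitespace-separated rows (PPG trace, per-frame HR, timestamps); the file name, path, and layout are assumptions that should be verified against the dataset's own documentation, since sub-dataset 1 ships a different format.

```python
# Hedged example: parse a UBFC-rPPG-style ground-truth file, assuming the
# commonly reported sub-dataset-2 layout of three whitespace-separated
# rows (PPG trace, per-frame HR, timestamps). Check the dataset README;
# sub-dataset 1 uses a different format.
import numpy as np

def load_ground_truth(path: str):
    with open(path) as f:
        rows = [np.array(line.split(), dtype=float)
                for line in f if line.strip()]
    ppg, hr, timestamps = rows[0], rows[1], rows[2]
    return ppg, hr, timestamps

ppg, hr, t = load_ground_truth("subject42/ground_truth.txt")  # hypothetical path
print(f"{ppg.size} samples, mean reference HR = {hr.mean():.1f} bpm")
```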
Table 12
A summary of the performance of the conventional methods. MAE and RMSE in bpm. The best results are in bold.
Name Year DEAP MAHNOB-HCI PURE MMSE-HR COHFACE VIPL-HR UBFC-rPPG MMPD
MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R
GREEN [13] 2008 8.10 11.17 0.80 – – – 4.39 11.60 0.99 11.53 21.77 – 10.94 16.72 – – – – 7.50 14.14 0.62 11.73 15.75 0.24
ICA [19] 2010 – – – −8.95 25.9 0.08 15.23 21.25 – 5.28 – 0.70 8.89 14.55 0.42 – – – 5.17 11.76 0.65 7.94 11.67 0.28
PCA [20] 2011 – – – – – – 22.25 30.20 – – – – – – – – – – – 12.08 0.54 – – –
CHROM [12] 2013 7.47 10.31 0.82 −2.89 10.70 0.82 2.073 2.50 0.99 9.41 13.97 0.55 7.80 12.45 0.26 11.40 16.99 0.28 2.37 4.91 0.89 5.89 9.72 0.39
PBV [25] 2014 – – – – – – 23.31 30.73 0.51 – – – – – – – – – 13.63 24.12 0.32 6.46 9.66 0.50
LiCVPR [26] 2014 – – – −3.30 7.62 0.81 28.22 30.96 −0.38 – – – – – – – – – – – – – – –
2SR [27] 2016 – – – – – – 2.44 3.06 0.98 – – – 20.97 25.98 −0.32 11.50 17.20 0.30 15.95 11.65 – – – –
POS [28] 2017 7.93 10.25 0.82 – – – 3.14 10.57 0.95 5.77 – 0.82 – – – 5.79 8.94 0.73 4.05 8.75 0.78 5.22 9.74 0.46
5.2. Evaluation metrics and performance comparison

When evaluating rPPG methods for remote HR measurement, researchers typically use three metrics in combination to assess the performance of a method: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the Pearson correlation coefficient (R). MAE and RMSE are measured in beats per minute (bpm), with smaller values indicating lower error. R ranges from −1 to 1, with values closer to 1 indicating stronger agreement between estimated and reference HR. In this paper, these three evaluation metrics are likewise used for performance comparison. In Table 12 to Table 14 we present the performance of the traditional, supervised, and unsupervised methods, respectively, on the most commonly used public datasets, including DEAP [78], MAHNOB-HCI [62], PURE [38], MMSE-HR [44], COHFACE [67], VIPL-HR [65], MR-NIRP [78], UBFC-rPPG [42], SCAMPS [126], and MMPD [127]. Although SCAMPS [126] is also publicly available, it is commonly used for training rather than testing, so it is not included as an experimental object. All experimental data are obtained from our own experiments and from experimental data published by other researchers. Owing to the lack of experimental data on some datasets, Table 12 does not include results on MR-NIRP, Table 13 does not include results on MMPD and MR-NIRP, and Table 14 does not include results on MMSE-HR and MMPD.
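To make the protocol concrete, the sketch below converts an estimated rPPG trace to HR via the dominant peak of its Welch power spectrum and then scores a set of predictions with the three metrics. The 0.7-4 Hz search band (42-240 bpm) is a common but not universal choice, and is an assumption of this sketch rather than a requirement of the metrics themselves.

```python
# Minimal sketch of the evaluation protocol above: convert an estimated
# rPPG waveform to HR via the peak of its Welch power spectrum, then
# score predictions against reference HR with MAE, RMSE and Pearson R.
import numpy as np
from scipy.signal import welch
from scipy.stats import pearsonr

def hr_from_rppg(signal, fs):
    """Estimate HR (bpm) as the dominant spectral peak in the HR band."""
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 512))
    band = (freqs >= 0.7) & (freqs <= 4.0)   # 42-240 bpm, a common choice
    return 60.0 * freqs[band][np.argmax(psd[band])]

def evaluate(pred_bpm, ref_bpm):
    pred_bpm, ref_bpm = np.asarray(pred_bpm), np.asarray(ref_bpm)
    mae = np.mean(np.abs(pred_bpm - ref_bpm))
    rmse = np.sqrt(np.mean((pred_bpm - ref_bpm) ** 2))
    r, _ = pearsonr(pred_bpm, ref_bpm)
    return mae, rmse, r

# Toy check: a noisy synthetic pulse at 1.2 Hz (72 bpm), 30 fps camera.
rng = np.random.default_rng(0)
t = np.arange(0, 20, 1 / 30.0)
rppg = np.sin(2 * np.pi * 1.2 * t) + 0.1 * rng.standard_normal(t.size)
print(hr_from_rppg(rppg, fs=30.0))                       # approximately 72
print(evaluate([71.8, 75.0, 88.9], [72.0, 74.0, 90.0]))  # (MAE, RMSE, R)
```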
5.3. Toolboxes

With the increasing attention from researchers, open-source toolboxes have been proposed to facilitate the study of rPPG. These toolboxes assist researchers in completing essential steps of rPPG methods, such as ROI selection and the conversion of PPG signals to HR, thereby greatly facilitating research in this field. McDuff et al. [128] introduced the first open-source toolbox, iPhys, a MATLAB toolbox capable of implementing various methods, including classical traditional methods such as GREEN [13], POS [28], CHROM [12], and ICA [19]. It also provides functionality for common steps in rPPG methods, such as face detection, ROI definition, and skin segmentation, and offers functions for plotting and signal quality assessment to evaluate performance. Similarly, Pilz et al. [129] developed a new open-source toolbox, PPGI-Toolbox, written in MATLAB. The primary purpose of PPGI-Toolbox is to implement their proposed methods, namely Local Group Invariance (LGI) [130] and Riemannian-PPGI (SPH) [129], while also incorporating classic traditional methods such as 2SR [27] and POS [28] for benchmarking. Furthermore, Boccignone et al. [131] proposed a MATLAB open-source toolbox that covers a wide range of traditional methods.

The MATLAB toolboxes above are only capable of implementing traditional methods. Recently, some researchers have proposed Python toolboxes that can also handle deep learning methods. PyVHR [132] is the first toolbox that can implement deep learning methods; it is distributed as an installable Python package that is easy to install and use. With PyVHR, researchers can implement and evaluate eight traditional methods and one deep learning method, MTTS-CAN [32], on 10 datasets, which facilitates benchmarking of rPPG methods. PyVHR also provides other commonly used preprocessing and postprocessing techniques, such as ROI selection, signal conversion, PSD calculation, and plotting. Moreover, deep learning methods proposed by researchers can be tested with PyVHR, but cannot be trained with it. PyVHR can also be readily used for various applications such as anti-spoofing, activity detection, affective computing, and biometrics. rPPG Toolbox [133] is the most recently proposed rPPG toolbox and currently the most comprehensive one; it can be used for both training and testing of deep learning methods. rPPG Toolbox includes code for preprocessing multiple public datasets, implementations of supervised and unsupervised deep learning methods (including training code), as well as postprocessing and evaluation tools. The toolbox supports four public datasets, namely SCAMPS [126], UBFC-rPPG [42], PURE [38], and MMPD [127]. rPPG Toolbox provides a parameter file through which researchers can modify training and testing parameters, allowing them to freely customize it to the requirements of their methods. By fully utilizing rPPG Toolbox, researchers can reduce the time required to deploy their methods and facilitate fair evaluation of various methods.
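To give a flavor of how such a toolbox is used, the snippet below follows the pipeline pattern shown in pyVHR's public documentation. Treat it as a sketch rather than a verified recipe: the argument names and return values have varied between releases and should be checked against the installed version, and the input path is hypothetical.

```python
# Sketch of a pyVHR run, modeled on the project's README examples.
# Parameter names and return values are assumptions to verify against
# the installed version's documentation (pip install pyVHR).
from pyVHR.analysis.pipeline import Pipeline

pipe = Pipeline()
times, bpm_est, uncertainty = pipe.run_on_video(
    "subject1/vid.avi",       # hypothetical input video
    roi_approach="patches",   # patch-based ROIs on the detected face
    method="cpu_POS",         # one of the bundled traditional methods
    bpm_type="welch",         # PSD-based conversion from signal to BPM
)
```

rPPG Toolbox, by contrast, is driven by its parameter/configuration files and launched from its training and testing scripts, matching the description above.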
6. Research gaps

Despite the significant achievements and advancements in rPPG methods for HR measurement, there are still many areas that have not been fully addressed or explored by researchers. In this section, we summarize the key influencing factors of current research, in order to guide researchers in exploring new directions from these challenges.

6.1. Influencing factors

The performance of rPPG methods can be influenced by various interfering factors; in fact, most methods mentioned in this paper aim to overcome these adverse factors in order to achieve better performance. The main influencing factors currently are motion artifacts, lighting changes, video compression, and skin color variations. Motion artifacts refer to ghosting effects caused by head or body movements of the subject during facial video recording, which can significantly impact the performance of rPPG. To address this issue, several methods [12,39,54,55,57] have been proposed; for example, a spatio-temporal attention module was designed in [61] to learn salient features and reduce the impact of motion artifacts.
Table 13
A summary of the performance of the supervised methods. MAE and RMSE in bpm. The best results are in bold.
Name Year DEAP MAHNOB-HCI PURE MMSE-HR COHFACE VIPL-HR UBFC-rPPG
MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R
HR-CNN [37] 2018 – – – 7.25 9.24 0.51 1.84 2.37 0.98 – – – 8.10 10.80 0.29 – – – 4.90 5.89 0.64
DeepPhys [39] 2018 – – – 4.57 – – 0.83 1.54 0.99 – – – 8.25 14.71 0.28 11.00 13.80 0.11 6.27 10.82 0.65
EVM-CNN [40] 2018 6.96 8.81 0.84 – 3.26 0.95 – – – – 6.95 0.98 – – – – – – – – –
SynRhythm [45] 2018 4.48 6.52 0.89 0.30 4.49 – 2.71 4.86 0.98 −0.85 5.03 0.86 – 4.49 – – – – 5.59 6.82 0.72
3D CNN [53] 2019 – – – – – – – – – – – – – – – – – – 5.45 8.64 –
PhysNet [54] 2019 – – – 6.85 8.76 0.69 2.10 2.60 0.99 – 13.25 0.44 8.63 9.36 0.54 10.80 14.80 0.20 2.95 3.67 0.97
rPPGNet [55] 2019 6.21 7.73 0.83 4.03 5.93 0.88 0.74 1.21 1.00 – – – – – – – – – 0.56 0.73 0.99
ST-Attention [46] 2019 – – – – – – – – – – – – – – – 5.40 7.99 0.66 – – –
TWO-STREAM [70] 2019 – – – – – – 9.81 11.81 0.42 – – – 8.09 9.96 0.40 – – – – – –
Bian et al. [69] 2019 – – – – – – – – – 4.35 10.15 0.83 – – – – – – – – –
Meta-rPPG [73] 2020 5.16 6.00 0.87 – – – 2.52 4.63 0.98 – – – 9.31 12.27 0.19 – – – 5.97 7.42 0.53
MTTS-CAN [32] 2020 – – – – – – 2.48 9.01 0.92 3.85 7.21 0.86 – – – – – – 1.70 2.72 0.99
CVD [48] 2020 – – – – – – – – – – – – – – – 5.02 7.97 0.79 – – –
Song et al. [36] 2020 5.65 7.17 0.85 5.98 7.45 0.75 – – – – – – – – – – – – – – –
RhythmNet [47] 2020 7.47 8.96 0.82 – 3.99 0.87 – – – – 7.33 0.78 – – – 5.30 8.14 0.76 – – –
Siamese-rPPG [59] 2020 – – – – – – 0.51 1.56 0.83 – – – 0.70 1.29 0.73 – – – 0.48 0.97 –
AutoHR [57] 2020 – – – – – – – – – – 5.87 0.89 – – – 5.68 8.68 0.72 – – –
DeeprPPG [58] 2020 – – – – – – 0.28 0.43 0.99 – – – 3.07 7.06 0.86 – – – – – –
HeartTrack [56] 2020 – – – – – – – – – – – – – – – – – – 2.41 3.37 0.98
Huang et al. [72] 2020 – – – – – – – – – – – – – – – – – – 2.08 2.84 –
Deep-HR [80] 2021 – – – 2.08 3.41 0.92 – – – – – – – – – – – – – – –
PulseGAN [82] 2021 4.86 5.70 0.88 – – – 2.28 4.29 0.99 – – – – – – – – – 1.19 2.10 0.98
Dual-GAN [84] 2021 3.25 4.11 0.91 – – – 0.82 1.31 0.99 – – – – – – 4.93 7.68 0.81 0.44 0.67 0.99
Multi-task [98] 2021 – – – – – – 0.40 1.07 0.92 – – – 0.68 1.65 0.72 – – – 0.47 2.09 –
NAS-HR [49] 2021 – – – – – – 1.65 2.02 0.99 – – – – – – 5.12 8.01 0.79 – – –
Nowara et al. [103] 2021 – – – – – – – – – 2.27 4.90 0.94 – – – – – – – – –
rPPGRNet + THRNet [101] 2021 4.23 5.45 0.89 – – – – – – – – – – – – – – – – – –
SAM-rPPGNet [61] 2021 – – – – – – 0.74 1.21 1.00 – – – 5.19 7.52 0.68 – – – – – –
PRNet [74] 2021 – – – 5.01 6.42 0.84 – – – – – – – – – – – – 5.29 7.24 0.73
Hu et al. [106] 2021 – – – – – – 0.23 0.48 0.99 0.43 1.16 0.99 – – – – – – 1.43 3.13 0.97
Instantaneous_transformer [90] 2022 – – – – – – – – – – – – 19.66 22.65 – – – – 11.28 13.94 –
Physformer [91] 2022 3.03 3.96 0.92 3.25 3.97 0.87 1.10 1.75 0.99 2.84 5.36 0.92 – – – 4.97 7.79 0.78 0.40 0.71 0.99
RErPPGNet [99] 2022 – – – – – – 0.38 0.54 0.96 – – – – – – – – – 0.41 0.56 0.99
AND-rPPG [104] 2022 – – – – – – – – – – – – 6.81 8.06 0.63 – – – 2.67 4.07 0.92
rPPG-FuseNet [105] 2022 – – – 2.08 3.41 0.92 – – – −0.65 4.57 0.87 – – – 4.32 8.03 0.81 1.52 2.86 0.92
DG-rPPGNet [108] 2022 – – – – – – 3.02 4.69 – – – – 7.19 8.99 – – – – 0.63 1.35 –
PRN augmented [102] 2022 – – – – – – – – – – – – – – – – – – 0.68 1.31 0.86
APNET [95] 2022 – – – – – – – – – – – – – – – – – – 0.53 0.77 0.97
TDM + TALOS [109] 2022 – – – – – – 1.83 2.30 0.99 – – – – – – – – – 2.32 3.08 0.99
EfficientPhys [97] 2023 – – – – – – – – – – – – – – – – – – 1.14 1.81 0.99
Arbitrary_Resolution_rPPG [107] 2023 – – – – – – 1.44 2.50 – – – – 1.31 3.92 – – – – 0.76 1.62 –
PhysFormer++ [93] 2023 – – – 3.23 3.88 0.87 – – – 2.71 5.15 0.93 – – – 4.88 7.62 0.80 – – –
RADIANT [94] 2023 – – – – – – – – – – – – 8.01 10.12 – – – – 2.91 4.52 –
Table 14
A summary of the performance of the unsupervised methods. MAE and RMSE in bpm. The best results are in bold.
Name Year DEAP MAHNOB-HCI PURE COHFACE VIPL-HR MR-NIRP UBFC-rPPG
MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R MAE RMSE R
Gideon et al. [111] 2021 5.13 6.16 0.86 – – – 2.3 2.9 0.99 1.50 4.60 0.90 9.80 15.48 0.38 4.75 9.14 0.61 1.85 4.28 0.93
SLF-RPM [112] 2022 – – – 3.60 4.67 0.92 – – – – – – 12.56 16.59 0.32 – – – 8.39 9.70 0.70
Fusion viViT [114] 2022 – – – – – – – – – – – – 11.70 14.86 −0.09 12.90 16.94 0.51 – – –
Contrast-Phys [115] 2022 – – – – – – 1.00 1.40 0.99 – – – 7.49 14.40 0.49 2.68 4.77 0.85 0.64 1.00 0.99
Yue et al. [116] 2022 4.20 5.18 0.90 – – – 1.23 2.01 0.99 – – – – – – – – – 0.58 0.94 0.99
SimPer [117] 2023 – – – – – – 3.98 – – – – – – – – – – – 4.24 – –
SiNC [118] 2023 – – – – – – 0.61 1.84 1.00 – – – – – – – – – 0.59 1.83 0.99
rPPG-MAE [119] 2023 – – – – – – 0.40 0.90 0.99 – – – 4.52 7.49 0.81 – – – 0.17 0.21 0.99
Lighting changes can cause color variations in the face and affect the reflection of light, posing challenges for rPPG methods in dealing with continuous lighting changes in a video. In [134], researchers conducted a detailed evaluation of the performance of rPPG methods under different lighting conditions, and different approaches have been proposed to address lighting changes [14,47,103,105,107]. For instance, rPPG-FuseNet [105] combines MSR signals for remote HR estimation, which mitigates the effects of different lighting factors. Video compression can be a challenging factor in real-world applications: although many videos in existing datasets are uncompressed and of high quality, rPPG methods may need to be applied to compressed videos in practical scenarios. Solutions have been proposed specifically for video compression [37,55,59,98]; for example, a spatio-temporal video enhancement network was proposed in [55] to improve video quality while retaining as much information as possible, thereby addressing the issues caused by video compression to some extent. Finally, the color variations caused by changes in blood volume are subtle, so skin color itself can affect the measurement results. In remote HR measurement, subjects with lighter skin tones often yield better results, because the higher melanin concentration of darker skin absorbs more of the incident light and weakens the observable pulse signal [102]. To address this issue, various methods have been proposed [8,102,112], such as the skin color transformation generator in [102], which converts the skin color of all videos to dark skin while preserving the underlying blood volume changes; this largely alleviates the problem and mitigates bias against particular skin color populations to some extent.
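Across all of these factors, a common first line of defense in rPPG pipelines is to suppress energy outside the plausible heart-rate band before estimating HR, since much of the disturbance from motion and illumination falls outside it. The sketch below is generic and illustrative; the cut-offs and filter order are conventional choices, not taken from any specific cited method.

```python
# Illustrative pre/post-processing: detrend an rPPG trace and keep only
# the 0.7-4 Hz band where plausible heart rates live. The 4th-order
# Butterworth with zero-phase filtfilt is a common choice, not a
# prescription from the surveyed methods.
import numpy as np
from scipy.signal import butter, detrend, filtfilt

def clean_rppg(raw: np.ndarray, fs: float,
               lo: float = 0.7, hi: float = 4.0) -> np.ndarray:
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="bandpass")
    return filtfilt(b, a, detrend(raw))
```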
6.2. Lightweight models

Currently, researchers in the field of rPPG are predominantly directing their research towards deep learning methods, which diverge from traditional approaches: where traditional methods emphasize algorithms, deep learning methods typically focus on models and networks. However, the rapid advancement of deep learning has resulted in increasingly larger and more complex models. Despite the notable achievements of many methods [91,99,108] that have utilized such complex models and networks to attain excellent performance, the size of these architectures presents challenges for practical implementation. Consequently, some researchers are shifting their attention towards lightweight methods, leading to the development of lightweight models [49,109] that aim to reduce computational cost and time complexity, shorten HR measurement time, and enhance processing speed. Nevertheless, these lightweight models often exhibit inferior performance compared to state-of-the-art methods. Therefore, finding a balance between lightweight models and optimal performance is likely to be a critical research direction for future researchers.

6.3. Open resources

Open-source resources play a crucial role in supporting researchers in their investigations, and for rPPG methods, datasets are an important resource. Despite the availability of several datasets for evaluating rPPG methods, there is still a limited number of high-quality open datasets. Currently, the three most commonly used datasets are UBFC-rPPG [42], PURE [38], and COHFACE [67], which primarily focus on two influencing factors: motion artifacts and illumination changes. However, these datasets lack consideration of certain specific factors, such as changes in body state, emotional fluctuations, and environmental variations, making it challenging to comprehensively evaluate the merits of a method. Moreover, the factors emphasized in the current datasets do not fully encompass potential future developments, such as multi-person measurement and long-distance estimation, which may require new datasets for supplementation. In the context of deep learning methods, open sourcing of code is of paramount importance for researchers and newcomers to the field. However, accessing code for various learning-based methods is currently challenging, and it requires collective efforts from researchers to improve this situation. Additionally, there is a need for further improvement and updates to the open-source toolboxes that help researchers train and test their models efficiently. Most existing toolboxes only provide integrated methods and datasets, which restricts researchers from flexibly deploying their own proposed networks and models.

6.4. Research on unsupervised deep learning method

Notwithstanding the rapid advancement of deep learning methodologies for rPPG, the prevailing paradigm at present is supervised learning. Supervised approaches, however, necessitate authentic physiological labels, thereby amplifying the intricacy of training and testing and impeding their practical deployability. In 2021, Gideon et al. [111] were at the vanguard of employing contrastive learning to achieve unsupervised deep learning for rPPG, culminating in the emergence of several nascent unsupervised methodologies [112,114-116,118]. Nevertheless, research on unsupervised approaches has been relatively sluggish, with the majority of researchers still fixated on supervised learning. Furthermore, the performance of existing unsupervised methods still lags significantly behind supervised methods and fails to meet the requisite benchmark for practical applications. Consequently, further incisive investigation and exploration are warranted in the burgeoning realm of unsupervised methodologies, as they may hold the promise of becoming the mainstream direction for the future advancement of rPPG methods for remote HR monitoring.

6.5. Measurement in dark environments

Presently, rPPG methods rely heavily on common RGB videos, exhibiting good performance in well-lit conditions. However, RGB videos suffer from reduced visibility in low-light situations, rendering rPPG methods potentially inaccurate or even completely ineffective in special real-world scenarios, such as nighttime conditions [135]. NIR cameras, which augment the amount of light reflected from the face, enable NIR videos to maintain higher visibility in dark environments and are commonly employed in nocturnal settings. Consequently, some researchers have proposed dedicated rPPG methods tailored for NIR videos [136,137] to measure heart rates in dark conditions. Additionally, certain approaches [114,138] consider the joint utilization of RGB and NIR videos as a multimodal input strategy to mitigate the impact of lighting variations, thereby enhancing the quality of rPPG signal estimation and, consequently, improving remote heart rate measurement. Nevertheless, these methods on the whole still exhibit suboptimal performance and warrant further research, which holds significant implications for extending the applicability of rPPG methods to more complex scenarios.

7. Applications

With the rapid advancement of research and technology, rPPG methods have found applications in diverse domains beyond remote HR measurement, providing compelling evidence of their research potential and application prospects. In this section, we introduce some of the latest applications that have been achieved using rPPG methods, as well as potential future applications, with the aim of providing researchers with insights and inspiration for further exploration in this exciting field.

7.1. Measuring multiple vital signs

In addition to HR measurement, rPPG methods have been utilized to measure a wide range of other physiological parameters. Blood pressure, a critical indicator of cardiovascular health, is commonly used for detecting conditions like hypertension.
Numerous studies have employed rPPG methods for remote blood pressure monitoring, resulting in promising measurement outcomes and highlighting the potential of rPPG methods in this application [6,139-142]. Blood oxygen saturation (SpO2), which measures the capacity of the blood to carry and transport oxygen and indicates how saturated the blood is with oxygen, is crucial for assessing oxygenation status; lower SpO2 values suggest hypoxia and can be indicative of health risks. While some rPPG methods have been used for SpO2 measurement [143-146], their performance is still moderate, and further research is needed to improve their accuracy and reliability. RR and HRV, which are often measured alongside HR, have also been successfully measured using rPPG methods with excellent results. Recently, Kossack et al. [147] pioneered the application of rPPG methods to assessing tissue perfusion, the amount of blood flowing through tissue per unit of time; insufficient tissue perfusion indicates inadequate blood supply to local tissues or organs. The successful application of rPPG methods to evaluating tissue perfusion underscores their potential for measuring other physiological parameters. In future research, researchers can consider expanding the application of rPPG methods to measure additional physiological parameters, such as arterial stiffness and transcutaneous oxygen saturation, to further explore the capabilities and potential of rPPG methods in remote physiological monitoring.
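For context on how camera-based SpO2 estimation typically works: dual-wavelength approaches such as [144] exploit the different absorption of oxygenated and deoxygenated hemoglobin at two wavelengths, and map the "ratio of ratios" of the pulsatile (AC) to steady (DC) signal components to a saturation value through an empirically calibrated linear model. The sketch below is schematic; the calibration coefficients are placeholders, and real systems fit them against a reference oximeter for a specific camera and wavelength pair.

```python
# Schematic ratio-of-ratios SpO2 estimate from two color channels.
# A and B are PLACEHOLDER calibration constants; real systems fit them
# against a reference oximeter, and channel choice depends on the camera.
import numpy as np

def spo2_ratio_of_ratios(chan1: np.ndarray, chan2: np.ndarray,
                         A: float = 100.0, B: float = 5.0) -> float:
    ac_dc = lambda x: x.std() / x.mean()   # pulsatile over steady component
    R = ac_dc(chan1) / ac_dc(chan2)        # the "ratio of ratios"
    return A - B * R                       # empirical linear mapping
```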
7.2. Affective computing

rPPG methods have shown promising potential in affective computing due to their combination of image processing and physiological sensing. Researchers have successfully demonstrated the application of rPPG methods in affective computing, particularly in stress estimation and emotion recognition. McDuff et al. [148] first utilized rPPG methods to measure HRV and then estimated the stress levels of subjects from HRV with an accuracy of 85%, showcasing the potential of rPPG methods in stress estimation. Subsequently, in [125], researchers further explored the potential of rPPG methods in stress estimation and proposed a multimodal dataset, UBFC-Phys [125], for emotion and stress estimation. Emotion recognition is currently a hot research topic: in [149], Gupta et al. first considered the use of rPPG methods for micro-expression recognition; the authors of PhysNet [54] proposed a new method for remote HR measurement and also considered using rPPG for emotion recognition; and Yu et al. [150] combined knowledge graphs with remote HR measurement for emotion recognition, achieving promising results. In addition, researchers have proposed rPPG methods for pain recognition [151], demonstrating the potential of rPPG methods in that task. In the foreseeable future, researchers can further explore the application of rPPG methods in affective computing domains such as human-computer interaction and psychological testing.

7.3. Deepfake detection

Deepfake, a portmanteau of "deep learning" and "fake", refers to the use of deep learning algorithms to simulate and fabricate audio and video content. Deepfake has become a highly popular field, with the most common application being AI-based face swapping, alongside voice synthesis, facial synthesis, and video generation. Its emergence has made it possible to manipulate or generate highly realistic and difficult-to-detect audio and video content, ultimately making it challenging for observers to discern truth from falsity with the naked eye. Therefore, researchers have been paying attention to how to distinguish such high-tech falsified content. The study by Ciftci et al. [152] successfully demonstrated that the measurement of HR from facial videos, one of the main applications of rPPG methods, can be used to determine whether a video is real or fake. As a result, researchers have begun widely employing rPPG methods for Deepfake detection, proposing various novel methods [153-157] and achieving promising performance, effectively demonstrating the potential of rPPG methods in this field. It is worth mentioning that the application of rPPG methods to Deepfake detection remains one of the most valuable current research directions.
7.4. Face anti-spoofing

Since the onset of the information age, the use of individual biometric characteristics, such as fingerprints and facial features, for identity verification has gained immense popularity. Currently, facial recognition and fingerprint recognition are the most prevalent methods of identity authentication. Facial recognition, which relies on facial feature analysis [158], is vulnerable to biometric spoofing attacks. For instance, malicious actors can obtain facial photos or videos of a target from other sources and employ them in photo or replay attacks, successfully deceiving facial recognition systems and exposing the target to significant risk [159]. Consequently, there has been growing interest among researchers in developing anti-spoofing techniques for facial recognition, commonly known as face anti-spoofing (FAS). With the rapid advancement of rPPG methods, researchers have recognized the potential of leveraging rPPG techniques to enhance facial recognition systems [160], and the utilization of rPPG methods for face anti-spoofing has emerged as a prominent research area. Kossack et al. [161] conducted a localized analysis of rPPG signals aimed at thwarting facial spoofing by assessing the blood flow information in the subject's facial region. Simultaneously, numerous other researchers have introduced novel anti-spoofing methods based on rPPG signals [162-166], showcasing the substantial growth and research potential of rPPG methods in the field of FAS.
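Both Deepfake detection and face anti-spoofing ultimately rest on the same cue: a genuine, live face carries a periodic blood-volume signal, while replayed, printed, or synthesized faces generally do not. The toy score below makes that intuition concrete; it is an illustrative sketch, not a published detector, and the threshold is arbitrary.

```python
# Toy liveness/authenticity cue: measure how much of the recovered
# rPPG trace's spectral energy is concentrated in the heart-rate band.
# Genuine faces tend to show a pronounced in-band peak; spoofed or
# synthesized faces tend not to. The threshold is illustrative only.
import numpy as np
from scipy.signal import welch

def pulse_band_ratio(rppg: np.ndarray, fs: float) -> float:
    freqs, psd = welch(rppg, fs=fs, nperseg=min(len(rppg), 256))
    band = (freqs >= 0.7) & (freqs <= 4.0)
    return psd[band].sum() / (psd.sum() + 1e-12)

def looks_live(rppg: np.ndarray, fs: float, thresh: float = 0.6) -> bool:
    return pulse_band_ratio(rppg, fs) > thresh
```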
8. Conclusion

In recent years, rPPG methods for HR measurement have gained increasing attention from researchers and have shown remarkable potential for development. In this paper, we provide a comprehensive review of this promising technology, encompassing traditional methods and deep learning approaches, with a particular focus on deep learning methods. We further categorize deep learning methods into supervised and unsupervised approaches, providing a classification and overview of their principles and mechanisms, with special emphasis on the emerging and promising field of unsupervised methods. We also introduce research resources for rPPG methods, including datasets and toolboxes, and systematically summarize the performance of existing methods on these datasets to assist researchers in accelerating their research. Additionally, we discuss current research challenges and gaps in rPPG methods and propose potential future research directions. Finally, we highlight the broad applications of rPPG methods in various fields, demonstrating their wide-ranging potential. Based on the thriving development of rPPG methods in remote HR measurement, we offer the following recommendations: (1) more effort should be devoted to measuring different physiological indicators and applying them in diverse scenarios, to further deepen the practical significance of rPPG methods; (2) rPPG research should continue to focus on the various influencing factors, to raise the performance of rPPG methods to real-world application levels; and (3) unsupervised deep learning methods should be further investigated, as they can overcome the reliance on real labels in supervised methods and facilitate practical applications. We believe that this paper provides researchers with a more comprehensive understanding of rPPG methods for HR measurement, guides researchers to focus on real challenges, promotes further exploration of this field, and inspires more applications of rPPG methods in medical and other domains.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Hanguang Xiao reports financial support was provided by Chongqing Natural Science Foundation. Hanguang Xiao reports a relationship with Chongqing Natural Science Foundation that includes: funding grants.
Data availability

No data was used for the research described in the article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61971078) and the Chongqing Natural Science Foundation (Grant No. CSTB2022NSCQ-MSX0923). This study does not involve any ethical issue.

References

[1] A. Challoner, C. Ramsay, A photoelectric plethysmograph for the measurement of cutaneous blood flow, Phys. Med. Biol. 19 (3) (1974) 317.
[2] L. Scalise, Non contact heart monitoring, Adv. Electrocardiogr.-Methods Anal. 4 (2012) 81–106.
[3] A. Gudi, M. Bittner, J. van Gemert, Real-time webcam heart-rate and variability estimation with clean ground truth for evaluation, Appl. Sci. 10 (23) (2020) 8630.
[4] C. Massaroni, A. Nicolo, M. Sacchetti, E. Schena, Contactless methods for measuring respiratory rate: A review, IEEE Sens. J. 21 (11) (2020) 12821–12839.
[5] R. Yousefi, M. Nourani, Separating arterial and venous-related components of photoplethysmographic signals for accurate extraction of oxygen saturation and respiratory rate, IEEE J. Biomed. Health Inf. 19 (3) (2014) 848–857.
[6] F. Schrumpf, P. Frenzel, C. Aust, G. Osterhoff, M. Fuchs, Assessment of deep learning based blood pressure prediction from PPG and rPPG signals, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3820–3830.
[7] D.F. Swinehart, The Beer–Lambert law, J. Chem. Educ. 39 (7) (1962) 333.
[8] L.A. Aarts, V. Jeanne, J.P. Cleary, C. Lieber, J.S. Nelson, S.B. Oetomo, W. Verkruysse, Non-contact heart rate monitoring utilizing camera photoplethysmography in the neonatal intensive care unit—A pilot study, Early Hum. Dev. 89 (12) (2013) 943–948.
[9] L.A. Aarts, V. Jeanne, J.P. Cleary, C. Lieber, J.S. Nelson, S.B. Oetomo, W. Verkruysse, Non-contact heart rate monitoring utilizing camera photoplethysmography in the neonatal intensive care unit—A pilot study, Early Hum. Dev. 89 (12) (2013) 943–948.
[10] A. Al-Naji, K. Gibson, S.-H. Lee, J. Chahl, Monitoring of cardiorespiratory signal: Principles of remote measurements and review of methods, IEEE Access 5 (2017) 15776–15790.
[11] D.J. McDuff, J.R. Estepp, A.M. Piasecki, E.B. Blackford, A survey of remote optical photoplethysmographic imaging methods, in: 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC, IEEE, 2015, pp. 6398–6404.
[12] G. De Haan, V. Jeanne, Robust pulse rate from chrominance-based rPPG, IEEE Trans. Biomed. Eng. 60 (10) (2013) 2878–2886.
[13] W. Verkruysse, L.O. Svaasand, J.S. Nelson, Remote plethysmographic imaging using ambient light, Opt. Express 16 (26) (2008) 21434–21445.
[14] P.V. Rouast, M.T. Adam, R. Chiong, D. Cornforth, E. Lux, Remote heart rate measurement using low-cost RGB face video: a technical literature review, Front. Comput. Sci. 12 (2018) 858–872.
[15] F.-T.-Z. Khanam, A. Al-Naji, J. Chahl, Remote monitoring of vital signs in diverse non-clinical and clinical scenarios using computer vision systems: A review, Appl. Sci. 9 (20) (2019) 4474.
[16] X. Chen, J. Cheng, R. Song, Y. Liu, R. Ward, Z.J. Wang, Video-based heart rate measurement: Recent advances and future prospects, IEEE Trans. Instrum. Meas. 68 (10) (2018) 3600–3615.
[17] A. Ni, A. Azarang, N. Kehtarnavaz, A review of deep learning-based contactless heart rate measurement methods, Sensors 21 (11) (2021) 3719.
[18] C.-H. Cheng, K.-L. Wong, J.-W. Chin, T.-T. Chan, R.H. So, Deep learning methods for remote heart rate measurement: A review and future research agenda, Sensors 21 (18) (2021) 6296.
[19] M.-Z. Poh, D.J. McDuff, R.W. Picard, Non-contact, automated cardiac pulse measurements using video imaging and blind source separation, Opt. Express 18 (10) (2010) 10762–10774.
[20] M. Lewandowska, J. Rumiński, T. Kocejko, J. Nowak, Measuring pulse rate with a webcam—a non-contact method for evaluating cardiac activity, in: 2011 Federated Conference on Computer Science and Information Systems, FedCSIS, IEEE, 2011, pp. 405–410.
[21] Y. Sun, S. Hu, V. Azorin-Peris, S. Greenwald, J. Chambers, Y. Zhu, Motion-compensated noncontact imaging photoplethysmography to monitor cardiorespiratory status during exercise, J. Biomed. Opt. 16 (7) (2011) 077010.
[22] Z. Guo, Z.J. Wang, Z. Shen, Physiological parameter monitoring of drivers based on video data and independent vector analysis, in: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, IEEE, 2014, pp. 4374–4378.
[23] H. Qi, Z. Guo, X. Chen, Z. Shen, Z.J. Wang, Video-based human heart rate measurement using joint blind source separation, Biomed. Signal Process. Control 31 (2017) 309–320.
[24] A. Al-Naji, A.G. Perera, J. Chahl, Remote monitoring of cardiorespiratory signals from a hovering unmanned aerial vehicle, Biomed. Eng. Online 16 (2017) 1–20.
[25] G. De Haan, A. Van Leest, Improved motion robustness of remote-PPG by using the blood volume pulse signature, Physiol. Meas. 35 (9) (2014) 1913.
[26] X. Li, J. Chen, G. Zhao, M. Pietikainen, Remote heart rate measurement from face videos under realistic situations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4264–4271.
[27] W. Wang, S. Stuijk, G. De Haan, A novel algorithm for remote photoplethysmography: Spatial subspace rotation, IEEE Trans. Biomed. Eng. 63 (9) (2016) 1974–1984.
[28] W. Wang, A.C. Den Brinker, S. Stuijk, G. De Haan, Algorithmic principles of remote PPG, IEEE Trans. Biomed. Eng. 64 (7) (2017) 1479–1491.
[29] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, E. Moulines, A blind source separation technique using second-order statistics, IEEE Trans. Signal Process. 45 (2) (1997) 434–444.
[30] X. Chen, Z.J. Wang, M. McKeown, Joint blind source separation for neurophysiological data analysis: Multiset and multimodal methods, IEEE Signal Process. Mag. 33 (3) (2016) 86–107.
[31] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, CVPR 2001, IEEE, 2001, p. I.
[32] X. Liu, J. Fromm, S. Patel, D. McDuff, Multi-task temporal shift attention networks for on-device contactless vitals measurement, Adv. Neural Inf. Process. Syst. 33 (2020) 19400–19411.
[33] B. Wei, X. He, C. Zhang, X. Wu, Non-contact, synchronous dynamic measurement of respiratory rate and heart rate based on dual sensitive regions, Biomed. Eng. Online 16 (2017) 1–21.
[34] A. Asthana, S. Zafeiriou, S. Cheng, M. Pantic, Robust discriminative response map fitting with constrained local models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3444–3451.
[35] C. Tomasi, T. Kanade, Detection and tracking of point features, Int. J. Comput. Vis. 9 (1991) 137–154.
[36] R. Song, S. Zhang, C. Li, Y. Zhang, J. Cheng, X. Chen, Heart rate estimation from facial videos using a spatiotemporal representation with convolutional neural networks, IEEE Trans. Instrum. Meas. 69 (10) (2020) 7411–7421.
[37] R. Špetlík, V. Franc, J. Matas, Visual heart rate estimation with convolutional neural network, in: Proceedings of the British Machine Vision Conference, Newcastle, UK, 2018, pp. 3–6.
[38] R. Stricker, S. Müller, H.-M. Gross, Non-contact video-based pulse rate measurement on a mobile service robot, in: The 23rd IEEE International Symposium on Robot and Human Interactive Communication, IEEE, 2014, pp. 1056–1062.
[39] W. Chen, D. McDuff, DeepPhys: Video-based physiological measurement using convolutional attention networks, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 349–365.
[40] Y. Qiu, Y. Liu, J. Arteaga-Falconi, H. Dong, A. El Saddik, EVM-CNN: Real-time contactless heart rate estimation from facial video, IEEE Trans. Multimed. 21 (7) (2018) 1778–1787.
[41] J. Lin, C. Gan, S. Han, TSM: Temporal shift module for efficient video understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7083–7093.
[42] S. Bobbia, R. Macwan, Y. Benezeth, A. Mansouri, J. Dubois, Unsupervised skin tissue segmentation for remote photoplethysmography, Pattern Recognit. Lett. 124 (2019) 82–90.
[43] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, W. Freeman, Eulerian video magnification for revealing subtle changes in the world, ACM Trans. Graph. 31 (4) (2012) 1–8.
[44] Z. Zhang, J.M. Girard, Y. Wu, X. Zhang, P. Liu, U. Ciftci, S. Canavan, M. Reale, A. Horowitz, H. Yang, et al., Multimodal spontaneous emotion corpus for human behavior analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3438–3446.
[45] X. Niu, H. Han, S. Shan, X. Chen, SynRhythm: Learning a deep heart rate estimator from general to specific, in: 2018 24th International Conference on Pattern Recognition, ICPR, IEEE, 2018, pp. 3580–3585.
[46] X. Niu, X. Zhao, H. Han, A. Das, A. Dantcheva, S. Shan, X. Chen, Robust remote heart rate estimation from face utilizing spatial-temporal attention, in: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2019, IEEE, 2019, pp. 1–8.
[47] X. Niu, S. Shan, H. Han, X. Chen, RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation, IEEE Trans. Image Process. 29 (2020) 2409–2423.
[48] X. Niu, Z. Yu, H. Han, X. Li, S. Shan, G. Zhao, Video-based remote physiological measurement via cross-verified feature disentangling, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, Springer, 2020, pp. 295–310.
[49] H. Lu, H. Han, NAS-HR: Neural architecture search for heart rate estimation from face videos, Virtual Real. Intell. Hardw. 3 (1) (2021) 33–42.
[50] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, in: NIPS 2014 Workshop on Deep Learning, December 2014, 2014.
[51] H. Liu, K. Simonyan, Y. Yang, DARTS: Differentiable architecture search, in: International Conference on Learning Representations, 2019.
[52] X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-Voltti, M. Tulppo, G. Zhao, The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection, in: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, IEEE, 2018, pp. 242–249.
[53] F. Bousefsaf, A. Pruski, C. Maaoui, 3D convolutional neural networks for remote pulse rate measurement and mapping from facial video, Appl. Sci. 9 (20) (2019) 4364.
[54] Z. Yu, X. Li, G. Zhao, Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks, 2019, arXiv preprint arXiv:1905.02419.
[55] Z. Yu, W. Peng, X. Li, X. Hong, G. Zhao, Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 151–160.
[56] O. Perepelkina, M. Artemyev, M. Churikova, M. Grinenko, HeartTrack: Convolutional neural network for remote video-based heart rate monitoring, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 288–289.
[57] Z. Yu, X. Li, X. Niu, J. Shi, G. Zhao, AutoHR: A strong end-to-end baseline for remote heart rate measurement with neural searching, IEEE Signal Process. Lett. 27 (2020) 1245–1249.
[58] S.-Q. Liu, P.C. Yuen, A general remote photoplethysmography estimator with spatiotemporal convolutional network, in: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2020, IEEE, 2020, pp. 481–488.
[59] Y.-Y. Tsou, Y.-A. Lee, C.-T. Hsu, S.-H. Chang, Siamese-rPPG network: Remote photoplethysmography signal estimation from face videos, in: Proceedings of the 35th Annual ACM Symposium on Applied Computing, 2020, pp. 2066–2073.
[60] M. Hu, F. Qian, D. Guo, X. Wang, L. He, F. Ren, ETA-rPPGNet: Effective time-domain attention network for remote heart rate measurement, IEEE Trans. Instrum. Meas. 70 (2021) 1–12.
[61] M. Hu, F. Qian, X. Wang, L. He, D. Guo, F. Ren, Robust heart rate estimation with spatial–temporal attention network from facial videos, IEEE Trans. Cogn. Dev. Syst. 14 (2) (2021) 639–647.
[62] M. Soleymani, J. Lichtenauer, T. Pun, M. Pantic, A multimodal database for affect recognition and implicit tagging, IEEE Trans. Affect. Comput. 3 (1) (2011) 42–55.
[63] Z. Qiu, T. Yao, T. Mei, Learning spatio-temporal representation with pseudo-3D residual networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5533–5541.
[64] Y. Xu, L. Xie, X. Zhang, X. Chen, G.-J. Qi, Q. Tian, H. Xiong, PC-DARTS: Partial channel connections for memory-efficient architecture search, in: International Conference on Learning Representations, 2020.
[65] X. Niu, H. Han, S. Shan, X. Chen, VIPL-HR: A multi-modal database for pulse estimation from less-constrained face video, in: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V 14, Springer, 2019, pp. 562–576.
[66] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verification using a "siamese" time delay neural network, Adv. Neural Inf. Process. Syst. 6 (1993).
[67] G. Heusch, A. Anjos, S. Marcel, A reproducible study on remote heart rate measurement, 2017, arXiv preprint arXiv:1709.00962.
[68] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[69] M. Bian, B. Peng, W. Wang, J. Dong, An accurate LSTM based video heart rate estimation method, in: Pattern Recognition and Computer Vision: Second Chinese Conference, PRCV 2019, Xi'an, China, November 8–11, 2019, Proceedings, Part III, Springer, 2019, pp. 409–417.
[70] Z.-K. Wang, Y. Kao, C.-T. Hsu, Vision-based heart rate estimation via a two-stream CNN, in: 2019 IEEE International Conference on Image Processing, ICIP, IEEE, 2019, pp. 3327–3331.
[71] D. Botina-Monsalve, Y. Benezeth, R. Macwan, P. Pierrart, F. Parra, K. Nakamura, R. Gomez, J. Miteran, Long short-term memory deep-filter in remote photoplethysmography, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 306–307.
[72] B. Huang, C.-M. Chang, C.-L. Lin, W. Chen, C.-F. Juang, X. Wu, Visual heart rate estimation from facial video based on CNN, in: 2020 15th IEEE Conference on Industrial Electronics and Applications, ICIEA, IEEE, 2020, pp. 1658–1662.
[73] E. Lee, E. Chen, C.-Y. Lee, Meta-rPPG: Remote heart rate estimation using a transductive meta-learner, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, Springer, 2020, pp. 392–409.
[74] B. Huang, C.-L. Lin, W. Chen, C.-F. Juang, X. Wu, A novel one-stage framework for visual pulse rate estimation using deep neural networks, Biomed. Signal Process. Control 66 (2021) 102387.
[75] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1126–1135.
[76] A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, the Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, Springer, 2016, pp. 483–499.
[77] S.X. Hu, P.G. Moreno, Y. Xiao, X. Shen, G. Obozinski, N.D. Lawrence, A. Damianou, Empirical bayes transductive meta-learning with synthetic gradients, 2020, arXiv preprint arXiv:2004.12696.
[78] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, I. Patras, DEAP: A database for emotion analysis; using physiological signals, IEEE Trans. Affect. Comput. 3 (1) (2011) 18–31.
[79] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Neural Information Processing Systems, 2014.
[80] M. Sabokrou, M. Pourreza, X. Li, M. Fathy, G. Zhao, Deep-HR: Fast heart rate estimation from face video under realistic conditions, Expert Syst. Appl. 186 (2021) 115596.
[81] S. Liu, D. Huang, et al., Receptive field block net for accurate and fast object detection, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 385–400.
[82] R. Song, H. Chen, J. Cheng, C. Li, Y. Liu, X. Chen, PulseGAN: Learning to generate realistic pulse waveforms in remote photoplethysmography, IEEE J. Biomed. Health Inf. 25 (5) (2021) 1373–1384.
[83] M. Mirza, S. Osindero, Conditional generative adversarial nets, 2014, arXiv preprint arXiv:1411.1784.
[84] H. Lu, H. Han, S.K. Zhou, Dual-GAN: Joint BVP and noise modeling for remote physiological measurement, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12404–12413.
[85] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Adv. Neural Inf. Process. Syst. 30 (2017).
[86] H. Xiao, L. Li, Q. Liu, X. Zhu, Q. Zhang, Transformers in medical image segmentation: A review, Biomed. Signal Process. Control 84 (2023) 104791.
[87] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16 × 16 words: Transformers for image recognition at scale, 2020, arXiv preprint arXiv:2010.11929.
[88] G. Balakrishnan, A. Zhao, M.R. Sabuncu, J. Guttag, A.V. Dalca, VoxelMorph: a learning framework for deformable medical image registration, IEEE Trans. Med. Imaging 38 (8) (2019) 1788–1800.
[89] Z. Yu, X. Li, P. Wang, G. Zhao, TransRPPG: Remote photoplethysmography transformer for 3D mask face presentation attack detection, IEEE Signal Process. Lett. 28 (2021) 1290–1294.
[90] A. Revanur, A. Dasari, C.S. Tucker, L.A. Jeni, Instantaneous physiological estimation using video transformers, in: Multimodal AI in Healthcare: A Paradigm Shift in Health Intelligence, Springer, 2022, pp. 307–319.
[91] Z. Yu, Y. Shen, J. Shi, H. Zhao, P.H. Torr, G. Zhao, PhysFormer: facial video-based physiological measurement with temporal difference transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4186–4196.
[92] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE Signal Process. Lett. 23 (10) (2016) 1499–1503.
[93] Z. Yu, Y. Shen, J. Shi, H. Zhao, Y. Cui, J. Zhang, P. Torr, G. Zhao, PhysFormer++: Facial video-based physiological measurement with SlowFast temporal difference transformer, Int. J. Comput. Vis. 131 (6) (2023) 1307–1330.
[94] A.K. Gupta, R. Kumar, L. Birla, P. Gupta, RADIANT: Better rPPG estimation using signal embeddings and transformer, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 4976–4986.
[95] D.-Y. Kim, S.-Y. Cho, K. Lee, C.-B. Sohn, A study of projection-based attentive spatial–temporal map for remote photoplethysmography measurement, Bioengineering 9 (11) (2022) 638.
[96] Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, Y. Li, MaxViT: Multi-axis vision transformer, in: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, Springer, 2022, pp. 459–479.
[97] X. Liu, B. Hill, Z. Jiang, S. Patel, D. McDuff, EfficientPhys: Enabling simple, fast and accurate camera-based cardiac measurement, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5008–5017.
[98] Y.-Y. Tsou, Y.-A. Lee, C.-T. Hsu, Multi-task learning for simultaneous video generation and remote photoplethysmography estimation, in: Proceedings of the Asian Conference on Computer Vision, Springer, 2021, pp. 392–407.
[99] C.-J. Hsieh, W.-H. Chung, C.-T. Hsu, Augmentation of rPPG benchmark datasets: Learning to remove and embed rPPG signals via double cycle consistent learning from unpaired facial videos, in: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVI, Springer, 2022, pp. 372–387.
[100] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image transla- [125] R.M. Sabour, Y. Benezeth, P. De Oliveira, J. Chappe, F. Yang, Ubfc-phys: A
tion using cycle-consistent adversarial networks, in: Proceedings of the IEEE multimodal database for psychophysiological studies of social stress, IEEE Trans.
International Conference on Computer Vision, 2017, pp. 2223–2232. Affect. Comput. (2021).
[101] Z. Yue, S. Ding, S. Yang, H. Yang, Z. Li, Y. Zhang, Y. Li, Deep super-resolution [126] D. McDuff, M. Wander, X. Liu, B. Hill, J. Hernandez, J. Lester, T. Baltrusaitis,
network for rPPG information recovery and noncontact heart rate estimation, Scamps: Synthetics for camera measurement of physiological signals, Adv.
IEEE Trans. Instrum. Meas. 70 (2021) 1–11. Neural Inf. Process. Syst. 35 (2022) 3744–3757.
[102] Y. Ba, Z. Wang, K.D. Karinca, O.D. Bozkurt, A. Kadambi, Style transfer with bio- [127] J. Tang, K. Chen, Y. Wang, Y. Shi, S. Patel, D. McDuff, X. Liu, MMPD:
realistic appearance manipulation for skin-tone inclusive rPPG, in: 2022 IEEE Multi-domain mobile video physiology dataset, 2023, arXiv preprint arXiv:
International Conference on Computational Photography, ICCP, IEEE, 2022, pp. 2302.03840.
1–12. [128] D. McDuff, E. Blackford, Iphys: An open non-contact imaging-based physiolog-
[103] E.M. Nowara, D. McDuff, A. Veeraraghavan, The benefit of distraction: De- ical measurement toolbox, in: 2019 41st Annual International Conference of
noising camera-based physiological measurements using inverse attention, in: the IEEE Engineering in Medicine and Biology Society, EMBC, IEEE, 2019, pp.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 6521–6524.
ICCV, 2021, pp. 4955–4964. [129] C. Pilz, On the vector space in photoplethysmography imaging, in: Proceedings
[104] B. Lokendra, G. Puneet, AND-rPPG: A novel denoising-rPPG network for of the IEEE/CVF International Conference on Computer Vision Workshops,
improving remote heart rate estimation, Comput. Biol. Med. 141 (2022) 2019.
105146. [130] C.S. Pilz, S. Zaunseder, J. Krajewski, V. Blazek, Local group invariance for
[105] K.B. Jaiswal, T. Meenpal, rPPG-FuseNet: Non-contact heart rate estimation from heart rate estimation from face videos in the wild, in: Proceedings of the IEEE
facial video via RGB/MSR signal fusion, Biomed. Signal Process. Control 78 Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp.
(2022) 104002. 1254–1262.
[106] M. Hu, D. Guo, M. Jiang, F. Qian, X. Wang, F. Ren, rPPG-based heart rate [131] G. Boccignone, D. Conte, V. Cuculo, A. d’Amelio, G. Grossi, R. Lanzarotti, An
estimation using spatial-temporal attention network, IEEE Trans. Cogn. Dev. open framework for remote-PPG methods and their assessment, IEEE Access 8
Syst. 14 (4) (2021) 1630–1641. (2020) 216083–216103.
[107] J. Li, Z. Yu, J. Shi, Learning motion-robust remote photoplethysmography through arbitrary resolution videos, in: AAAI Conference on Artificial Intelligence, 2023.
[108] W.-H. Chung, C.-J. Hsieh, S.-H. Liu, C.-T. Hsu, Domain generalized RPPG network: Disentangled feature learning with domain permutation and domain augmentation, in: Proceedings of the Asian Conference on Computer Vision, 2022, pp. 807–823.
[109] J. Comas, A. Ruiz, F. Sukno, Efficient remote photoplethysmography with temporal derivative modules and time-shift invariant loss, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2182–2191.
[110] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: International Conference on Machine Learning, PMLR, 2020, pp. 1597–1607.
[111] J. Gideon, S. Stent, The way to my heart is through contrastive learning: Remote photoplethysmography from unlabelled video, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3995–4004.
[112] H. Wang, E. Ahn, J. Kim, Self-supervised representation learning framework for remote physiological measurement using spatiotemporal augmentation loss, in: AAAI Conference on Artificial Intelligence, 2022.
[113] H. Nyquist, Certain topics in telegraph transmission theory, Trans. Am. Inst. Electr. Eng. 47 (2) (1928) 617–644.
[114] S. Park, B.-K. Kim, S.-Y. Dong, Self-supervised RGB-NIR fusion video vision transformer framework for rPPG estimation, IEEE Trans. Instrum. Meas. 71 (2022) 1–10.
[115] Z. Sun, X. Li, Contrast-phys: Unsupervised video-based remote physiological measurement via spatiotemporal contrast, in: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII, Springer, 2022, pp. 492–510.
[116] Z. Yue, M. Shi, S. Ding, Video-based remote physiological measurement via self-supervised learning, 2022, arXiv preprint arXiv:2210.15401.
[117] Y. Yang, X. Liu, J. Wu, S. Borac, D. Katabi, M.-Z. Poh, D. McDuff, SimPer: Simple self-supervised learning of periodic targets, 2022, arXiv preprint arXiv:2210.03115.
[118] J. Speth, N. Vance, P. Flynn, A. Czajka, Non-contrastive unsupervised learning of physiological signals from video, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[119] X. Liu, Y. Zhang, Z. Yu, H. Lu, H. Yue, J. Yang, rPPG-MAE: Self-supervised pre-training with masked autoencoders for remote physiological measurement, 2023, arXiv preprint arXiv:2306.02301.
[120] A.v.d. Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, 2018, arXiv preprint arXiv:1807.03748.
[121] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
[122] J.R. Estepp, E.B. Blackford, C.M. Meier, Recovering pulse rate during motion artifact with a multi-imager array for non-contact imaging photoplethysmography, in: 2014 IEEE International Conference on Systems, Man, and Cybernetics, SMC, IEEE, 2014, pp. 1462–1469.
[123] A. Gudi, M. Bittner, J. Van Gemert, Real-time webcam heart-rate and variability estimation with clean ground truth for evaluation, Appl. Sci. 10 (23) (2020) 8630.
[124] A. Revanur, Z. Li, U.A. Ciftci, L. Yin, L.A. Jeni, The first vision for vitals (V4V) challenge for non-contact video-based physiological estimation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2760–2767.
[132] G. Boccignone, D. Conte, V. Cuculo, A. D'Amelio, G. Grossi, R. Lanzarotti, E. Mortara, pyVHR: a Python framework for remote photoplethysmography, PeerJ Comput. Sci. 8 (2022) e929.
[133] X. Liu, X. Zhang, G. Narayanswamy, Y. Zhang, Y. Wang, S. Patel, D. McDuff, Deep physiological sensing toolbox, 2022, arXiv preprint arXiv:2210.00716.
[134] Z. Yang, H. Wang, F. Lu, Assessment of deep learning-based heart rate estimation using remote photoplethysmography under different illuminations, IEEE Trans. Hum.-Mach. Syst. 52 (6) (2022) 1236–1246.
[135] Y. Cho, S.J. Julier, N. Marquardt, N. Bianchi-Berthouze, Robust tracking of respiratory rate in high-dynamic range scenes using mobile thermal imaging, Biomed. Opt. Express 8 (10) (2017) 4480–4503.
[136] S.B. Park, G. Kim, H.J. Baek, J.H. Han, J.H. Kim, Remote pulse rate measurement from near-infrared videos, IEEE Signal Process. Lett. 25 (8) (2018) 1271–1275.
[137] J. Cheng, P. Wang, R. Song, Y. Liu, C. Li, Y. Liu, X. Chen, Remote heart rate measurement from near-infrared videos based on joint blind source separation with delay-coordinate transformation, IEEE Trans. Instrum. Meas. 70 (2020) 1–13.
[138] D.Q. Le, J.-C. Chiang, W.-N. Lie, Remote PPG estimation from RGB-NIR facial image sequence for heart rate estimation, in: 2022 IEEE International Symposium on Circuits and Systems, ISCAS, IEEE, 2022, pp. 2077–2081.
[139] D. Djeldjli, F. Bousefsaf, C. Maaoui, F. Bereksi-Reguig, A. Pruski, Remote estimation of pulse wave features related to arterial stiffness and blood pressure using a camera, Biomed. Signal Process. Control 64 (2021) 102242.
[140] H. Luo, D. Yang, A. Barszczyk, N. Vempala, J. Wei, S.J. Wu, P.P. Zheng, G. Fu, K. Lee, Z.-P. Feng, Smartphone-based blood pressure measurement using transdermal optical imaging technology, Circ. Cardiovasc. Imaging 12 (8) (2019) e008857.
[141] X. Fan, Q. Ye, X. Yang, S.D. Choudhury, Robust blood pressure estimation using an RGB camera, J. Ambient Intell. Humaniz. Comput. 11 (2020) 4329–4336.
[142] B.-F. Wu, B.-J. Wu, B.-R. Tsai, C.-P. Hsu, A facial-image-based blood pressure measurement system without calibration, IEEE Trans. Instrum. Meas. 71 (2022) 1–13.
[143] G. Casalino, G. Castellano, G. Zaza, A mHealth solution for contact-less self-monitoring of blood oxygen saturation, in: 2020 IEEE Symposium on Computers and Communications, ISCC, IEEE, 2020, pp. 1–7.
[144] D. Shao, C. Liu, F. Tsow, Y. Yang, Z. Du, R. Iriya, H. Yu, N. Tao, Noncontact monitoring of blood oxygen saturation using camera and dual-wavelength imaging system, IEEE Trans. Biomed. Eng. 63 (6) (2015) 1091–1098.
[145] L. Kong, Y. Zhao, L. Dong, Y. Jian, X. Jin, B. Li, Y. Feng, M. Liu, X. Liu, H. Wu, Non-contact detection of oxygen saturation based on visible light imaging device using ambient light, Opt. Express 21 (15) (2013) 17464–17471.
[146] A.H. Ayesha, D. Qiao, F. Zulkernine, A web application for experimenting and validating remote measurement of vital signs, in: Information Integration and Web Intelligence: 24th International Conference, IiWAS 2022, Virtual Event, November 28–30, 2022, Proceedings, Springer, 2022, pp. 237–251.
[147] B. Kossack, E. Wisotzky, P. Eisert, S.P. Schraven, B. Globke, A. Hilsmann, Perfusion assessment via local remote photoplethysmography (rPPG), in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2192–2201.
[148] D. McDuff, S. Gontarek, R. Picard, Remote measurement of cognitive stress via heart rate variability, in: 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2014, pp. 2957–2960.
[149] P. Gupta, B. Bhowmick, A. Pal, Exploring the feasibility of face video based instantaneous heart-rate for micro-expression spotting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1316–1323.
[150] W. Yu, S. Ding, Z. Yue, S. Yang, Emotion recognition from facial expressions and contactless heart rate using knowledge graph, in: 2020 IEEE International Conference on Knowledge Graph, ICKG, IEEE, 2020, pp. 64–69.
[151] V. Kessler, P. Thiam, M. Amirian, F. Schwenker, Pain recognition with camera photoplethysmography, in: 2017 Seventh International Conference on Image Processing Theory, Tools and Applications, IPTA, IEEE, 2017, pp. 1–5.
[152] U.A. Ciftci, I. Demir, L. Yin, FakeCatcher: Detection of synthetic portrait videos using biological signals, IEEE Trans. Pattern Anal. Mach. Intell. (2020).
[153] S. Fernandes, S. Raj, E. Ortiz, I. Vintila, M. Salter, G. Urosevic, S. Jha, Predicting heart rate variations of deepfake videos using neural ODE, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
[154] J. Hernandez-Ortega, R. Tolosana, J. Fiérrez, A. Morales, DeepFakesON-Phys: DeepFakes detection based on heart rate estimation, in: AAAI Conference on Artificial Intelligence, 2021.
[155] Y. Xu, R. Zhang, C. Yang, Y. Zhang, Z. Yang, J. Liu, New advances in remote heart rate estimation and its application to DeepFake detection, in: 2021 International Conference on Culture-Oriented Science & Technology, ICCST, IEEE, 2021, pp. 387–392.
[156] H. Qi, Q. Guo, F. Juefei-Xu, X. Xie, L. Ma, W. Feng, Y. Liu, J. Zhao, DeepRhythm: Exposing deepfakes with attentional visual heartbeat rhythms, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 4318–4327.
[157] G. Boccignone, S. Bursic, V. Cuculo, A. D'Amelio, G. Grossi, R. Lanzarotti, S. Patania, DeepFakes have no heart: A simple rPPG-based method to reveal fake videos, in: Image Analysis and Processing–ICIAP 2022: 21st International Conference, Lecce, Italy, May 23–27, 2022, Proceedings, Part II, Springer, 2022, pp. 186–195.
[158] I.M. Alsaadi, Physiological biometric authentication systems, advantages, disadvantages and future development: A review, Int. J. Sci. Technol. Res. 4 (12) (2015) 285–289.
[159] S. Kumar, S. Singh, J. Kumar, A comparative study on face spoofing attacks, in: 2017 International Conference on Computing, Communication and Automation, ICCCA, IEEE, 2017, pp. 1104–1108.
[160] S. Liu, P.C. Yuen, S. Zhang, G. Zhao, 3D mask face anti-spoofing with remote photoplethysmography, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, Springer, 2016, pp. 85–100.
[161] B. Kossack, E.L. Wisotzky, A. Hilsmann, P. Eisert, Local remote photoplethysmography signal analysis for application in presentation attack detection, in: VMV, 2019, pp. 135–142.
[162] X. Li, J. Komulainen, G. Zhao, P.-C. Yuen, M. Pietikäinen, Generalized face anti-spoofing by detecting pulse from face videos, in: 2016 23rd International Conference on Pattern Recognition, ICPR, IEEE, 2016, pp. 4244–4249.
[163] S.-Q. Liu, X. Lan, P.C. Yuen, Remote photoplethysmography correspondence feature for 3D mask face presentation attack detection, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 558–573.
[164] B. Lin, X. Li, Z. Yu, G. Zhao, Face liveness detection by rPPG features and contextual patch-based CNN, in: Proceedings of the 2019 3rd International Conference on Biometric Engineering and Applications, 2019, pp. 61–68.
[165] Z. Yu, X. Li, P. Wang, G. Zhao, TransRPPG: Remote photoplethysmography transformer for 3D mask face presentation attack detection, IEEE Signal Process. Lett. 28 (2021) 1290–1294.
[166] Z. Yu, R. Cai, Z. Li, W. Yang, J. Shi, A.C. Kot, Benchmarking joint face spoofing and forgery detection with visual and physiological cues, 2022, arXiv preprint arXiv:2208.05401.