Paper 2

Systematic Review
published: 01 May 2018

doi: 10.3389/fbioe.2018.00033
Chen Wang1*, Thierry Pun1,2 and Guillaume Chanel1,2

1
Computer Vision and Multimedia Laboratory, Computer Science Department, University of Geneva, Geneva, Switzerland,
2
Swiss Center for Affective Sciences, Campus Biotech, University of Geneva, Geneva, Switzerland
Remotely measuring physiological activity can provide substantial benefits for both
the medical and the affective computing applications. Recent research has proposed
different methodologies for the unobtrusive detection of heart rate (HR) using human
face recordings. These methods are based on subtle color changes or motions of
the face due to cardiovascular activities, which are invisible to human eyes but can
be captured by digital cameras. Several approaches have been proposed such as
signal processing and machine learning. However, these methods are compared with
different datasets, and there is consequently no consensus on method performance.
In this article, we describe and evaluate several methods defined in literature, from
2008 until present day, for the remote detection of HR using human face recordings.
Edited by:
Danilo Emilio De Rossi,
The general HR processing pipeline is divided into three stages: face video processing,
Università degli Studi face blood volume pulse (BVP) signal extraction, and HR computation. Approaches
di Pisa, Italy
presented in the paper are classified and grouped according to each stage. At each
Reviewed by:
stage, algorithms are analyzed and compared based on their performance using the
Gholamreza Anbarjafari,
University of Tartu, Estonia public database MAHNOB-HCI. Results found in this article are limited on MAHNOB-
Andrea Bonarini, HCI dataset. Results show that extracted face skin area contains more BVP informa-
Politecnico di Milano, Italy
tion. Blind source separation and peak detection methods are more robust with head
*Correspondence:
Chen Wang
motions for estimating HR.
[email protected]
Keywords: heart rate, remote sensing, physiological signals, photoplethysmography, human–computer interaction
Specialty section:
This article was submitted to INTRODUCTION
Bionics and Biomimetics,
a section of the journal Heart rate (HR) is a measure of physiological activity and it can indicate a person’s health and affective
Frontiers in Bioengineering status (Malik, 1996; Armony and Vuilleumier, 2013). Physical exercise, mental stress, and medicines
and Biotechnology
all influence on cardiac activities. Consequently, HR information can be used in a wide range of
Received: 21 July 2017 applications, such as medical diagnosis, fitness assessment, and emotion recognition. Traditional
Accepted: 13 March 2018
methods of measuring HR rely on electronic or optical sensors. The majority of these methods
Published: 01 May 2018
require skin-contact, such as electrocardiograms (ECGs), sphygmomanometry and pulse oximetry,
Citation: and the later giving a photoplethysmogram (PPG). Among all cardiac pulse measurements, the
Wang C, Pun T and Chanel G
current gold standard is the usage of ECG (Dawson et al., 2010), which places adhesive gel electrodes
(2018) A Comparative Survey of
Methods for Remote Heart Rate
on the participants’ limbs or chest surface. Another, widely applied contact method, is to compute
Detection From Frontal Face Videos. the blood volume pulse (BVP) from a PPG captured by an oximeter emitting and measuring light
Front. Bioeng. Biotechnol. 6:33. at proper wavelengths (Allen, 2007). However, the skin-contact measurements can be considered as
doi: 10.3389/fbioe.2018.00033 inconvenient, unpractical, and may cause uncomfortable feelings.
Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 1 May 2018 | Volume 6 | Article 33

Wang et al. Modify to HR Detection From Face Videos
In the past decade, researchers have focused on remote each step of the pipeline to finally obtain a full implementation
(i.e., contactless) detection methods, which are mainly based on with the best performance (presenting in Section “Comparative
computer vision techniques. Using human faces as physiologi- Analysis”). This objective is achieved by testing the methods on
cal measurement resources was first proposed in 2007 (Pavlidis a unique set of data: the MAHNOB-HCI database. Given the
et al., 2007). According to Pavlidis et al. (2007), the face area methods’ popularity, this analysis is limited to color intensity-
facilitated observation as it featured a thin layer of tissue. With based methods.
facial thermal imaging, HR can be detected based on bioheat
models (Garbey et al., 2007; Pavlidis et al., 2007). After that, the REMOTE METHODS FOR HR DETECTION
PPG technique, which is non-invasive and optical, was used for
detecting HR. The method is often implemented with dedicated To the best of our knowledge, we are unaware of existing reviews
light sources such as red lights or infrared lights (Allen, 2007; that touch upon this topic. To access the method performance,
Jeanne et al., 2013). this article investigates several methods, which were published
In 2008, Verkruysse et al. (2008) showed the possibility of using in international conferences and journals from 2008 until 2017.
PPG under ambient light to estimate HR from videos of human This time period was selected because 2008 was the year when the
face. Then in 2010, Poh et al. (2010) developed a framework for remote HR detection was first proposed. Methods that require no
automatic HR detection using the color of human face recordings skin-contact and no specific light sources were exclusively taken
obtained from a standard camera. This framework was widely into account because they are more likely to be applied outside
adopted and modified in Poh et al. (2011), Pursche et al. (2012), the laboratory.
and Kwon et al. (2012). For all those methods, the core idea is to The existing remote methods for obtaining HR from face
recover the heartbeat signal using blind source separation (BSS) on videos can be classified as either color intensity-based methods or
the temporal changes of face color. Later in 2013, another method motion-based methods. Currently, the intensity-based methods
for estimating HR based on subtle head motions (Balakrishnan are the most popular (Poh et al., 2011; Kwon et al., 2012; Pursche
et al., 2013, Rubinstein, 2013) was proposed. Besides, researchers et al., 2012; etc.) shown in Table 1. Intensity-based methods come
(Li et al., 2014; Stricker et al., 2014; Xu et al., 2014) investigated from PPG signals captured by digital cameras. Blood absorbs
the estimation of HR directly by applying diverging noise reduc- light more than the surrounding tissues and variations in blood
tion algorithms and optical modeling methods. Alternatively, the volume affect light transmission and reflectance (Verkruysse
usage of manifold learning methods mapping multi-dimensional et al., 2008). That leads to the subtle color changes on human
face video data into one-dimensional space has been studied to skin, which is invisible to human eyes but recorded by cameras.
reveal the HR signal as well. Diverse optical models are applied to extract the intensity of color
As shown above, remote HR detection has been an active field changes caused by pulse. As shown in Figure 1, hemoglobin and
of research for the past decade and produced different strategies oxyhemoglobin both have high ability of absorption in the green
using diverse processing methods and models. However, many color range and low in the red color range. But all three color
implementations were evaluated on different datasets and it is channels contain PPG information (Verkruysse et al., 2008).
consequently difficult to compare them. Furthermore, no survey More detailed information on PPG-based methods can be found
paper has been conducted, with the objective of gathering, clas- in the work of Allen (2007) and Sun et al. (2012).
sifying, and analyzing the existing work within this domain. The Head motions caused by pulse are mixed together with other
objectives of this article are first to fill this gap by presenting a involuntary and voluntary head movements. Subtle upright head
general pipeline composed of several steps and how the different motions in the vertical direction are mainly caused by pulse
state-of-the-art methods can be classified based on the pipeline activities, while the bobbing movements are caused by respira-
(presenting in Section “Remote Methods for HR Detection”). tion (Da et al., 2011; Balakrishnan et al., 2013). Motion-based
The second objective is to evaluate the mainstream methods at methods for detecting HR stemmed from ballistocardiogram
Table 1 | Classification of state-of-the-art methods.
Dimensionality Blind source Independent component Poh et al. (2010), Poh et al. (2011), Pursche et al. (2012), Kwon et al. (2012), Lewandowska et al.
reduction separation analysis (2011), Sahindrakar et al. (2011), Datcu et al. (2013), Jensen and Hannemose (2014), Yu et al. (2015),
Lam and Yoshinori (2015), Kumar et al. (2015), and McDuff et al. (2017)
Principle component Lewandowska et al. (2011), Wei et al. (2012), Rubinstein (2013), Irani et al. (2014), Balakrishnan et al.
analysis (2013), and Chen et al. (2017)
Other dimensionality methods Wei et al. (2012), Rubinstein (2013), and Tran et al. (2015)
Optical modeling Green channel Verkruysse et al. (2008), Pursche et al. (2012), Stricker et al. (2014), Li et al. (2014), Zaunseder et al.
(2014), Muender et al. (2016), Mestha et al. (2014), Kumar et al. (2015), and Moreno et al. (2015)
Other optical modeling methods Pursche et al. (2012), Stricker et al. (2014), Li et al. (2014), Zaunseder et al. (2014), Muender et al.
(2016), Mestha et al. (2014), Kumar et al. (2015), and Moreno et al. (2015)
Motion-based methods Balakrishnan et al. (2013), Rubinstein (2013), and Irani et al. (2014)
Machine learning Monkaresi et al. (2014), Tarassenko et al. (2014), Osman et al. (2015), and Villarroel et al. (2017)

(Starr et al., 1939). Ballistocardiographic head movement is and HR computation (Figure 2). Face video processing aims to
obtained by lying a participant on a low-friction platform from detect faces, improve the motion robustness, reduce quantization
which displacements are measured to get cardiac information. In errors, and prepare the featured signals for further BVP signal
Da et al. (2011), head motion was measured by accelerometers extraction. There are more algorithm variations at this stage
to monitor HR. Balakrishnan et al. (2013) proposed to detect than at BVP signal extraction and HR computation. For BVP
HR remotely from face videos through head motions. The basic signal extraction, temporal filtering, component analysis, and
approach consists of tracking features from a person’s head, filter- other approaches are used to recover HR information from noisy
ing out the velocity of interest, and then extracting the periodic signals. The HR computation stage aims to compute HR from the
signal caused by heartbeats. cardiac signal obtained from the previous stage. At this stage, the
Both subtle color changes and head motions can be easily methods can be grouped into time domain analysis and frequency
“hidden” by other signals during recording. The accuracy of HR domain analysis. For the time domain processing, peak detection
estimation is influenced by the participants’ movements, complex is diffusely applied to get the inter-beat interval (IBI) from which
facial features (face shape, hair, glasses, beards, etc.), facial expres- HR is computed. In frequency domain, the power spectral density
sions, camera noise and distortion, and changing light conditions. is mostly used, where the dominant frequency is taken as HR.
Many papers in this field use strictly controlled experiment set- HR computation can become complex for applications including
tings to eliminate the influential factors. Besides well-controlled buffer handling functions to present HR results after a certain
conditions, algorithms for noise reduction and signal recovery are time period (Stricker et al., 2014).
applied to retrieve HR information. For intensity-based methods,
averaging the pixel values inside a region of interest (ROI) is often Experiment Setting
applied to overcome sensor and quantization noise. Subsequently, Only a few papers used public datasets for remote HR estimation
temporal filters are adopted to extract the signal of interest (Poh from face videos (Li et al., 2014; Werner et al., 2014; Lam and
et al., 2010; Wu et al., 2012). As for motion-based approaches, Yoshinori, 2015; Tulyakov et al., 2016). While other researchers
similar algorithms are used such as face tracking and noise gathered their own datasets, where experiment settings vary sub-
reduction. stantially from camera settings, lighting situations to ground truth
To categorize existing methods, we divide the HR detec- HR measurements as shown in Appendix I in Supplementary
tion procedure into three stages based on the implementation Material. The experimental setting often consists of placing a stable
sequence: face video processing, face BVP signal extraction, digital video camera in front of the participant under a controlled
lighting condition. Furthermore, a ground truth HR measurement
is also collected using a more traditional method. Figure 3 shows
an example of the experimental setting. Finger BVP serves as the
ground truth (13 out of 42 papers), while the face recordings are
captured by the built-in camera of a laptop computer.
The digital cameras used for capturing videos are mainly com-
mercial cameras like web cameras or portable device cameras.
Following the Nyquist–Shannon sampling theorem, it is possible
to capture HR signals at a frame rate of eight frames per second
(fps), under the hypothesis that the human heartbeat frequency
lies between 0.4 and 4 Hz. According to Sun et al. (2013), a frame
Figure 1 | Hemoglobin (green) and oxyhemoglobin (blue) absorption
rate between 15 and 30 fps is sufficient for HR detection. Among
spectra (Jensen and Hannemose, 2014).
existing research, captured video frame rate differs from 15
Figure 2 | General schematic diagram for remote heart rate (HR) detection from face videos.

used for motion-based methods. For intensity-based methods,

when the selected region of the face is too large, the HR signal may
be hidden in background noise. On the other hand, if the selected
ROI is too small, the quantization noise caused by the camera
may not be fully attenuated by the averaging of pixels intensity
inside the ROI. For motion-based methods, significantly more
computation time is required for a larger ROI. But there might
not be enough feature points for effective motion tracking when
the ROI is too small.
This step is similar for both intensity-based methods and
motion-based methods (Figure 3). We classify the methods into
two groups: box ROI detection and model-based ROI detection.
Box ROI is the general area of the face regulated by a rectangle
sometimes coupled with skin detection. While model-based ROI
detection extracts the accurate face contours.
The easiest way of implementing a box ROI extraction is to
manually select the desired area, such as the largest facial area
Figure 3 | Experimental setup (Poh et al., 2010). with a rectangle as the bounding box on the first frame. This
solution is applied for motion-based methods when the video
resource contains solely hidden facial features, such as being
(Poh et al., 2010) to 100 fps (Zaunseder et al., 2014). 30 fps, how- covered by masks, or if the participants back is turned toward
ever, is the most often used within literature (Kwon et al., 2012; the camera (Balakrishnan et al., 2013). It is simple but highly
Pursche et al., 2012; Wei et al., 2012; etc.). It is important to note subjective. Among automatic box face detection methods, the
that for the majority of commercial digital cameras, the frame face detector proposed by Viola and Jones (2001) is often applied
rate is not fixed. Sudden movements or illumination changes may for HR detection (Poh et al., 2010; Balakrishnan et al., 2013; Irani
force to drop or interpolate frames depending on the camera used. et al., 2014; etc.). This method works rapidly and achieves reason-
Frame rate is also closely related with frame resolution. Generally, able detection accuracy which is 93.9% tested on MIT + CMU
cameras capture higher resolution frames at relatively low frame frontal face test set (Viola and Jones, 2001). To remove face edges
rates and vice versa. Both resolution and frame rate influence the and background area of the box ROI only part of the detected face
HR estimation performance and computation load directly. By area is used. According to Poh et al. (2010), 60% of width and full
examining the table in Appendix I in Supplementary Material, it height of detected facial area are used. While Mestha et al. (2014)
can be seen that the majority of research tends to utilize the video use the middle 50% of the rectangle’s width and 90% of its height.
graphic array standard with a video resolution of 640 × 480 pixels Some papers suggested to divide the roughly detected face region
per frame. into a coarse grid with multiple ROIs, with the aim of removing
Illumination is strictly controlled for some experiments (De the effect of head movements and facial expressions (Verkruysse
Haan and Vincent, 2013; Mestha et al., 2014; etc.) with specified et al., 2008; Sun et al., 2013; Kumar et al., 2015; Moreno et al.,
fluorescent lights and no natural sunlight. There is also research 2015). Skin detection methods are usually applied with other face
using indirect sunlight only or fluorescent lights as supplemen- detection solutions such as box ROI approach for HR estimation.
tary. The distance between the tester and the camera highly Further details on this specific group can be consulted in the work
depends on the lens properties. The distance is commonly set at of Vezhnevets et al. (2003), Xu et al. (2014), Sahindrakar et al.
1.5 m to capture the entire face while minimizing the quantity (2011), and Kakumanu et al. (2007).
of visualized background. In addition, the duration of recorded Model-based approaches have been applied with accurate
videos varies as well. Many face recordings are short-term with localization and tracking of facial landmarks (Werner et al.,
an approximate duration of 1 min. A setting description of refer- 2014; Lam and Yoshinori, 2015; Tulyakov et al., 2016). Datcu et al.
ences can be found in Appendix I in Supplementary Material. (2013) uses a statistical method called active appearance model
to handle shape and texture variation. In Stricker et al. (2014),
deformable model fitting by regularized landmark mean-shift
Face Video Processing (Saragih et al., 2011) is applied. Li et al. (2014) applies a similar
ROI Selection model named discriminative response map fitting with 66 facial
Facial ROI selection is used to obtain blood circulation features landmarks inside the face region which is detected a priori by a
and get the raw BVP signal, which highly influences the following Viola Jones face detector (Viola and Jones, 2001). Tulyakov et al.
HR detection steps. First, it affects the tracking directly since a (2016) uses the facial landmark fitting tracker—Intraface (De la
commonly applied tracking method uses first frame ROI (Poh Torre et al., 2015). Alternatively to previously explored detection
et al., 2010; Kwon et al., 2012; Pursche et al., 2012; etc.). Second, methods used in HR estimation, several other approaches exist
the selected ROI regions are regarded as the source of cardiac such as OpenFace (Baltrušaitis et al., 2016), which can detect and
information. The pixel values inside a ROI are used for intensity- track facial landmarks. Each area selected per frame is dynamic
based methods, while feature point locations inside a ROI are when using these model-based methods. Overall this makes the

selection process more robust as it can vary in time based on the Depending on the color channel selection, all pixels of the cor-
features themselves, increasing its efficiency when handling head responding color channel within the ROI area are averaged at
motions and facial expressions. These algorithms, however, are each frame. For a RGB video with n frames, the signal after spatial
more computationally expensive and time-consuming than box average can be expressed as a vector: X(j) = (x1(j), x2(j), …, xn(j)),
ROI detection. Further details on face detection methods can j = 1, 2, 3 where j stands for the color channels. This method is
be found in Hjelmås and Low (2001), Vezhnevets et al. (2003), simple and efficient to get raw featured signals for intensity-based
Rother et al. (2004), and Baltrušaitis et al. (2016). methods. Several research papers used the spatial-averaged signal
directly, as shown in Table 1. On the other hand, some works
Color Channel Decomposition (De Haan and Vincent, 2013; Tulyakov et al., 2016) apply optical
This step is specific for intensity-based methods. The basic idea models and use the chrominance features for HR estimation, which
is that the pixel intensity captured by a digital camera can be takes light transmission and reflection on skin into consideration.
decomposed into the illumination intensity and reflectance of For motion-based methods, the location of time-series xk(n),
the skin. However, several approaches are proposed to relate the yk(n) for each feature point k on frame n is tracked. Only the
pixel value to the PPG signal. This can lead to various choices vertical component yk(n) is taken to extract the trajectory from
of color channel decomposition and combination. For example, each feature point. The longitudinal trajectories are then used as
Huelsbusch and Blazek (2002) separated the noise from the PPG raw featured signals.
signal by building a linear combination of two color channels to
achieve motion robustness. An in-depth description of optical Face BVP Signal Extraction
modeling can be found within the literature referenced in Table 1. Now that the feature signal has been obtained from face videos,
Color channels are based on the color models. There are mainly the heartbeat can be effectively extracted. This section is divided
three color models applied in HR detection: Red-Green-Blue (RGB), into two subsections exploring noise reduction and dimensional-
Hue-Saturation-Intensity (HSI), and YCbCr where Y stands for ity reduction methods.
luminance component and Cb and Cr refer to blue-difference and
red-difference chroma components, respectively. The HSI model Noise Reduction
decouples the intensity component from the hue and saturation As previously explored, color and motion changes caused by
that carry color information of a color image. The skin-color lies the cardiac activities are often noisy. Thus this step is applied
in a certain range of H ([0 50]) and S ([0.23 0.68]) channels and the on the raw signals to remove such changes in light and track-
illumination changes information is separated in I channel. With ing errors. For intensity-based methods, the light variations are
each heartbeat, there is a clear drop in hue channel but its ampli- recorded together with intensity changes caused by blood pulses
tude is very small. For HSI model, only the H channel can be used (Verkruysse et al., 2008; Li et al., 2014; Zaunseder et al., 2014;
for BVP signal extraction. It is motion sensitive but performs better etc.). For motion-based methods, trackers capture trajectories
than RGB model without head motions. According to Sahindrakar that are not solely caused by heartbeats, thus it is necessary to
et al. (2011), YCbCr produced better results in detecting HR than apply noise reduction as well (Balakrishnan et al., 2013; Irani
HSI with limited rotation and no transition. Among these three et al., 2014). We present noise reduction methods based on two
models, the most robust model is still RGB. categories: temporal filtering and background noise estimation.
Among current detection methods, the main color space is still Temporal filtering contains a series of filters that remove irrelevant
RGB, though some research criticizes that it intermixes the color information and keep trajectories and color frequencies that are
and intensity information. According to Verkruysse et al. (2008), of interest. Background noise estimation uses the background to
Stricker et al. (2014), and Ruben (2015), all channels contain PPG estimate the noise caused by light changes.
information, but the green channel gives the strongest signal-to- For temporal filtering, various temporal filters are applied to
noise ratio (SNR). Consequently, the green channel has been the exclude and amplify low-amplitude changes revealing hidden
most popularly used for extracting HR (Verkruysse et al., 2008; information (Poh et al., 2010; Wang, 2017). It contains detrend-
Li et al., 2014; Zaunseder et al., 2014; Chen et al., 2015; etc.). ing, moving-average, and bandpass filters, which are often
However, Lewandowska et al. (2011) showed that the combina- applied to reduce irrelevant noise (Li et al., 2014). A detrending
tion of the R and G channels contain the majority of cardiac filter aims to reduce slow and non-stationary trends of signals
information. Several research papers have also investigated the (Li et al., 2014). After applying a detrending filter, the low fre-
usage of all three color channels in conjunction with BSS for the quencies of the raw signal are reduced drastically. This method
BVP signal extraction (Poh et al., 2011; Kwon et al., 2012; Pursche is as effective as a high-pass, low-cutoff filter with substantially
et al., 2012; etc.). less latency. The moving-average filter removes random noise
with temporal average of consecutive frames. It can efficiently
Raw Featured Signal smooth the trajectories and sudden color changes caused by light
Intensity-based methods use the intensity changes along the time or motions. Additional methods such as a bandpass filter can also
as raw signal containing BVP information, while motion-based be used to remove irrelevant frequencies. Bandpass filters can be
methods use the vertical component of the trajectories instead. Butterworth or other FIR bandpass filters in the literature. It can
The spatial average is commonly employed in the majority of be a Butterworth filter (Balakrishnan et al., 2013; Irani et al., 2014;
intensity-based methods, which aims to increase the SNR of PPG Osman et al., 2015; etc.) or other FIR bandpass filters (Li et al.,
signals and enhance the subtle color changes (Verkruysse et al., 2008). 2014) with cutoff frequency of normal HR. The cutoff frequency

could be 0.7–4 (Villarroel et al., 2017), 0.25–2 (Wei et al., 2012), Joint Approximate Diagonalization of Eigen-matrices (JADE)
or other values. The parameter setting for these three types of (Cardoso, 1999) is popular for HR detection since it is numeri-
filters differs from papers. For example, Ruben (2015) applies a cally efficient in computation (Poh et al., 2010, 2011; Kwon et al.,
fourth order bandpass zero-phase Butterworth filter, while Irani 2012; Pursche et al., 2012; etc.). JADE is a high-order measures
et al. (2014) employs an eighth order Butterworth filter to flat pass of independence for ICA. Further details on the JADE algorithm
band maximally. More temporal filtering information applied for can be found in the work of Hyvärinen and Oja (2000), while
HR detection can be found in the work of Yu et al. (2014) and methods for optimizing JADE is further described by Kumar
Tarvainen et al. (2002). et al. (2015).
Background noise estimation methods target intensity chan Principle component analysis can be used to extract both the
ges and are only suitable under some situations. It is based on intensity-based pulse signal and the head longitude trajectories
the assumptions that (a) both the ROI and background share the caused by pulse (Lewandowska et al., 2011; Balakrishnan et al.,
same light source and (b) the background is static and relatively 2013; Rubinstein, 2013). For motion-based method, the frequency
monotone (Li et al., 2014). Under these assumptions, the intensity spectra of the PCA components with the highest periodicity is
changes in the background are caused by illumination only and selected quantified from spectral power, meanwhile the com-
are correlated with the light noise in the HR signal extracted from ponent with maximum variance is selected for intensity-based
face recordings. Adaptive filters are applied on noised HR signal method as BVP signal (Lewandowska et al., 2011). Compared
and background signal to remove the noise (Chan and Zhang, with ICA, PCA has lower computation complexity. PCA is
2002; Cennini et al., 2010; Li et al., 2014). concerned with finding the directions along which the data have
Once filtered, signals can be used either directly for post- maximum variance in addition to the relative importance of
processing or for further signal extraction (dimensionality reduc- these directions. For HR detection, the goal of applying PCA is
tion). If the signal is used directly, the green channel is mostly to extract the cardiac pulse information from the head motions
used since it contains the stronger PPG signal (Verkruysse et al., or pixel intensity changes, represented it into principal compo-
2008). Under the second case, all the signal channels are kept. nents consisting of a new set of orthogonal variables (Abdi and
Williams, 2010; Balakrishnan et al., 2013). Mathematically, PCA
Dimensionality Reduction Methods depends on the Eigen-decomposition of positive semi-definite
The BVP signal is a periodic one-dimensional signal in the time matrices and the singular value decomposition of rectangular
domain. Dimensionality reduction algorithms are used to reduce matrices (Wold et al., 1987).
the dimensionality from raw signals in order to more clearly
reveal BVP information. The main idea is to find a mapping HR Computation
between higher dimensional space, such as three-dimensional Once the BVP signal is effectively extracted, a post-processing
RGB color spaces and one-dimensional space uncovering cardiac procedure follows. HR can be estimated from time domain
information. Dimensionality reduction contains classic linear analysis (peak detection methods) or frequency domain analy-
algorithms, e.g., BSS methods [e.g., independent component sis. Signals can be transformed to the frequency domain using
analysis (ICA) and principle component analysis (PCA)], linear standard methods, such as the Fast Fourier Transform (FFT)
discriminant analysis, and manifold learning methods such as and discrete cosine transformation (DCT) methods. Currently,
Isomap, Laplacian Eigenmap (LE), and locally linear embedding supervised learning methods are only applied at this stage for
(Zhang and Zha, 2004). Wei et al. (2012) tested nine commonly both the time and frequency domains.
used dimensionality reduction methods on RGB color channels Frequency domain algorithms are the most common post-
and the result demonstrated that LE performs best for extracting processing methods within the literature. The extracted HR signal
BVP information on their dataset. is converted to the frequency domain either by FFT (Poh et al.,
2010; Pursche et al., 2012; Yu et al., 2013; etc.) or DCT (Irani
Blind Source Separation et al., 2014). Using these methods, there is an assumption that the
After Poh’s publication in 2010, the mainstream technique to HR is the most periodic signal and thus, has the highest power
recover the BVP signal has been BSS which assumes that the of the spectrum within the frequency band corresponding to
observed signals (in our case the featured signals) are a mixture of normal human HR. The drawback is that it can only compute
source signals (BVP and noise). The goal of BSS is to recover the the HR over a certain period instead of detecting instantaneous
sources signals without or with a little prior information about HR changes.
their properties. The most popular BSS methods for HR detection Peak detection methods (Poh et al., 2011; Li et al., 2014) detect
from face video are ICA (Hyvärinen and Oja, 2000) and PCA the peak of HR signals in the time domain directly. With the
(Wold et al., 1987; Abdi and Williams, 2010). detected peaks, IBI can be calculated. The IBI intervals are then
Independent component analysis is based on the assumption averaged and the HR is computed from the average IBI. IBI allows
that all the sources are mutually independent. The basic princi- for the beat-to-beat assessment of HR, however, it is quite sensi-
ple is to maximize the statistical independence of all observed tive to noise. To achieve more reliable results, a sliding window of
components in order to find the underlying components (Liao short-time period is often implemented to average the HR result
and Carin, 2002; Yang, 2006). For cardiac pulse detection, the over the whole video (Li et al., 2014; Ruben, 2015).
observed signals are captured by camera color sensors, which are Supervised learning methods have also been investigated as
mixed with the heartbeat signals. Among various ICA algorithms, a potential solution for HR calculation (Monkaresi et al., 2014;

Tarassenko et al., 2014; Osman et al., 2015). Monkaresi et al. et al. (2017), even with a low constant rate factor to compress
(2014) and Tarassenko et al. (2014) both use supervised learning videos, the signal-to-noise ratio degrades considerably in face
method for power spectrum analysis. With features extracted BVP signals.
from PSD, auto-regression or k-nearest neighbor classifier is
used to predict the HR signal with a degree of accuracy. Osman Method Discussion
et al. (2015) extract the first order derivative of green channels The ROI definition is important since it contains the raw BVP
as the features. A feature at time t is positively labeled, if there signals. However, as mentioned in Section “ROI Selection,” there
is a ground truth BVP peak lies within a certain time tolerance is still no consensus on which ROIs are the most relevant for
and vice versa. These data are then used to train a support vector HR computation. For example, Pursche et al. (2012) claim that
machine algorithm, capable of predicting IBI. the center of the face region provides better PPG information
compared with other facial parts. By contrast, Lewandowska
Discussion et al. (2011) and Stricker et al. (2014) assert that the forehead can
Developed methods tend to be strongly tied into the dataset represent the whole facial region although it can be unreliable
and the specific experiment protocol they were designed for. if covered with hair. In Datcu et al. (2013), cheek and forehead
Unfortunately, this means that they neither generalize nor adapt are suggested as the most reliable parts containing the strongest
to other datasets or scenarios, especially real-life situations. This PPG signals. Moreno et al. (2015) believe that forehead, cheeks,
section analyses the limitation aspect from setting to results. and mouth area provide more accurate heartbeat signals, in
comparison with other parts such as nose and eyes. While Irani
Experiment Setting et al. (2014) state that the forehead and area around the nose
As shown in Section “Experiment Setting,” experimental settings are more reliable. These differences are caused specifically by
can vary significantly. For most experiments, both the partici- the datasets used. For example, with videos containing head
pants’ behaviors and the environment are well-controlled which motions and facial expressions (Lewandowska et al., 2011;
are not applicable for practical use. Non-grid motions like facial Stricker et al., 2014), the eyes and mouth areas tend to be less
expressions are more difficult to handle compared with grid stable than the forehead area, since they are influenced by facial
motions, e.g., head rotate horizontally or vertically. Detecting HR muscles.
from face videos with spontaneous facial expressions is valuable The most popular method to extract the face BVP signal is
for further study such as long-term monitoring and affective by far ICA, as previously detailed in Section “Face BVP Signal
computing. Extraction.” It works experimentally, nevertheless, it also has
So far, there is no research designed specifically to test each some limitations. Based on the Beer–Lambert law, reflected
influential factor of experimental setting such as video resolution, light intensity through facial tissue varies nonlinearly with
frame rate, and illumination changes. Consequently, it is diffi- distance (Wei et al., 2012; Xu et al., 2014), while both PCA and
cult to distinguish whether the study results are affected by the ICA assume that the observed signal is a linear combination of
experimental setting or the implemented approach. For example, several sources. Supporting the possibility that ICA is not the
low video resolution will lead to a limited number of pixels in the best option, Kwon et al. (2012) demonstrated that ICA perfor-
face area, which may be insufficient to extract the BVP signal. mance is slightly lower compared with simple green channel
A high frame rate provides more information but may increase detection methods. Besides, according to Mestha et al. (2014)
the computational load. and Sahindrakar et al. (2011), ICA needs at least 30 s to be
Furthermore, most of the self-collected datasets are not pub- accurately estimated and cannot handle large head motions.
licly accessible, complicating further investigations as researchers The noise estimation method proposed by Li et al. (2014) can
must continuously collect and construct new datasets, which only be applied with monotone backgrounds. Also, it is not suit-
can be time consuming. In fact, it complicates evaluation as able for real-time HR prediction with its adaptive filter, which
methods are tested cross-dataset. A few studies used the open needs a certain time duration to guarantee estimation accuracy.
dataset MAHNOB-HCI (Li et al., 2014; Lam and Yoshinori, For stable frontal face videos, the green channel tends to be a
2015; Tulyakov et al., 2016). However, their results for the same good solution, and provides a low computational complexity
method are not consistent with each other (Li et al., 2014; Lam method for the extraction of BVP signals. Particularly noisy
and Yoshinori, 2015). This is probably due to differences in the sources of the face can often be mitigated through the usage of
implementation. methods like PCA or ICA, and can often achieve better results
Besides, participants with darker skin tone are also rarely used than using the green channel exclusively (Poh et al., 2011; Li
within both self-collected and open datasets. The higher amount et al., 2014).
of melanin present in darker skin tones absorbs a significant At the HR computation stage, the frequency domain methods
amount of incident light, and thus degrades the quality of the are not capable to detect instantaneous heartbeat changes, and
camera-based PPG signals, making the system ineffective for are not as robust as time domain methods according to Poh et al.
extracting vital signs (Kumar et al., 2015). Equipment limitations (2011). Supervised learning methods are mainly (three out of
also exist. For example, the majority of digital cameras are unable four papers) applied at this stage so far. There is one recent paper
to hold stable frame rates during recording, which can impact apply auto-regression to extract BVP signals after the video pro-
HR computation. Video compression is another factor that can cessing (Villarroel et al., 2017), but there is no end-to-end usage
influence performance (Ruben, 2015). According to McDuff of supervised learning methods. Future research might thus focus

on the development of machine learning methods trained to take HR only and require users to stay stable. None of these products
raw videos as inputs and compute the HR information as outputs. provide an estimation of their performance.
Without inter-processing stage, the performance of supervised
learning methods may be improved significantly. COMPARATIVE ANALYSIS
HR Estimation It is of great importance to quantify and compare the perfor-
The HR discussed in this article is actually the average of HR mance of the main algorithms presented in the previous section.
during a certain time interval (e.g., 30 s or 1 min), which cannot In this section, we study how the existing approaches perform
reveal the instantaneous physiological information. The literature in a close to realistic scenario with both gird and non-grid head
often computes HR by examining videos with various lengths and movements.
subsequently calculating the average HR over that time period. In Given the amount of workload, not all methods were tested
reality, the time interval between two connective heartbeats is not exhaustively. For motion-based methods, only two methods
stable. By contrast to averaged HR, we refer to instantaneous or were proposed and applied on face videos (Balakrishnan et al.,
dynamic HR when talking about HR calculated for each IBI. This 2013; Irani et al., 2014). Both of them required limited head
information can be used to reveal short lived phenomenon such as movements. Our test dataset, MANNOB–HCI, is not ideal for
emotions. It can further be used to compute heart rate variability this group of methods. Therefore, we focus on the validation of
(HRV). According to Pavlidis et al. (2007) and Dawson et al. intensity-based approaches which are most often applied. Despite
(2010), HRV is directly related with emotion and disease diagnosis, of this restriction, some methods evaluated in this section can
which is of great value in medical and affective computing domain. also be used as reference for motion-based methods. This is
To the best of our knowledge, so far there is no work focusing for instance the case for ROI selection which can be used in a
on dynamic HR. Also, for average HR estimation, there are no motion-based framework.
state-of-the-art approaches that are robust enough to be fully To offer a panoramic view of the state-of-the-art, this article
operated under real situations with grid and non-grid move- attempted to cover the analysis with methods from different cate-
ments, illumination changes and noise caused by the camera. gories within each processing stage as previously stated in Section
Even with well-established public dataset, one or two influential “Remote Methods for HR Detection.” The methods considered
factors are still under controlled. for comparative analysis are selected based on their popularity
(times that they were adopted for other papers) and their category
Commercial Applications and Software (supervised learning, dimensionality reduction, etc.).
Currently, for application purposes, PPG is mainly obtained The method validation is divided into two parts. In the first
from contact sensors. There are few commercial applications part, we test and compare the algorithms at each stage sequ
and software available estimating HR remotely from the color entially as shown in Figure 4. For one stage, we test several
changes on faces such as Cardiio,1 Pulse Meter,2 and Vital Signs alternatives while keeping the algorithms applied at the other two
Camera.3 Cardiio and Pulse Meter are both phone applications. stages fixed. Once a stage has been tested, the best method on this
They present a circle or a rectangle on the screen for users to stage will be chosen for the following stage test. Before that, the
place their face areas and keep still for a certain time period. Vital algorithms selected for fixed stages are based on simplicity. At the
sign camera is developed by Philips with both software and phone pre-processing stage, the main target is to find out the efficient
application. They apply frequency domain method to calculate face segmentation. For signal extraction, various algorithms are
HR. All these applications and software compute the average tested to compare the most efficient method capable of separating
the HR signal from noise and irrelevant information. At the post-
processing stage, time domain and frequency domain methods
1
Cardiio. Cardiio. https://www.cardiio.com/ (Accessed: April 09, 2018). are compared for HR computation. The second part focuses on
2
Rapsodo. Pulse Meter. http://www.appspy.com/app/667129/pulse-meter (Accessed:
April 09, 2018).
the implementation of state-of-the-art methods presented in the
3
Philips. Philips vitals signs camera. http://www.vitalsignscamera.com (Accessed: work of Poh et al. (2011), Li et al. (2014), and Osman et al. (2015),
April 09, 2018). which are used as baselines for validation.
Figure 4 | Overview of region of interest selection (Stricker et al., 2014).

The state-of-art methods were tested on the MAHNOB-HCI the same width as eyes and the height is the distance between
(Soleymani et al., 2012) database. In MAHNOB-HCI, 27 partici- upper lip border and the lower eye border. For the skin area,
pants (15 females and 12 males) were recorded while watching we removed the eye and mouth regions to avoid noise caused
movie clips to elicit emotions, which can influence their HR and by blinks and other facial expressions. The regions we tested
stimulate facial expressions. This public dataset is multimodal are shown in Figure 5. Since all the five ROI selections are
including frontal face videos and HR information recorded using determined by facial landmarks, the ROI areas are dynamic
a gold standard technique: ECG. The frame rate is 61 fps and the and may change due to the head motions from frame-to-frame.
ECG sampling frequency is 256 Hz. Furthermore, this dataset is Given that in the MAHNOB-HCI database, the participants’
publicly accessible, allowing research to be easily reproduced. We electroencephalogram (EEG) was recorded using a head cap,
filtered 465 samples (i.e., ECG and facial videos which each cor- the forehead area was partially covered by the sensors that may
respond to an emotion-elicited movie clip) from MAHNOB-HCI influence the performance.
obtained from 24 participants, where 12 were males and 12 were To avoid influential factors from other processing steps, we use
females. the spatial-averaged green channel directly of each ROI and then
To reduce the possible risks of bias, we regulated the study apply temporal filtering to obtain the HR signal. The cutoff fre-
from data source to performance validation. The face videos used quency is set from 0.7 to 2 Hz which corresponds to HR between
for the study are from both genders and different skin tones. For 42 and 120 bpm. The PSD method (Poh et al., 2010) is applied to
one participant, 14–20 videos are used to avoid bias from a certain calculate the averaged HR over 30 s.
scenario due to a specific stimulus. All participants we selected
offered their consent before experimentation, where the record- Face BVP Signal Extraction
ing duration surpasses 65 s. Among the 465 samples, 20 samples For this part, we tested the main methods mentioned in Section
did not contain corresponding ECG signals and 1 recording “Noise Reduction.” Temporal filtering with a detrending filter
presented faulty filtering. These samples were removed leaving and a bandpass filter are applied on the raw featured signal before
a total of 444 samples. The validation followed by Li et al. (2014) methods testing in this section. Three signal extraction methods
and Lam and Yoshinori (2015) started each video recording with were compared. PCA, ICA, and background noise estimation
a delay of 5 s. Videos are subsequently cut into 30 s segments, methods are evaluated on spatial-averaged RGB signals from
and subsequently synchronized with their corresponding ECG extracted skin ROI. For ICA, the component with highest energy
signal. All the methods and evaluations are implemented using in the frequency domain is selected as the BVP signal, while the
MATLAB R2016a. To validate obtained results, the HR ground component with the highest variance is selected for PCA. The
truth is obtained by first detecting the R peaks from ECG signals implementation of ICA and PCA follows Poh et al. (2011) and
using the TEAP (Soleymani et al., 2017), which uses the standard Rubinstein (2013), respectively. Background noise estimation is
Tompkins’ method (Tompkins, 1993). Absent and falsely detected implemented following Li et al. (2014) and uses green channel for
peaks are then manually corrected. The mean HR is the averaged extracting HR signals after illumination rectification with nor-
value computed from the instantaneous HR. Thus, a precise aver- malized least mean square adaptive filter. HR is estimated using
age HR is guaranteed as ground truth. the PSD of the extracted face BVP signal (Poh et al., 2010) as well.
HR Computing
Comparison at Each Stage The peak detection method and PSD method are evaluated for the
Face Segmentation calculation of HR. Selected ICA components extracted from skin
As shown in Section “Face Video Processing,” there is no con-
sistent conclusion about which part of the face reveals most
HR information. Thus, there is an interest in comparing HR
detection accuracy for several facial regions. The face region
contains useful features of HR information that may differ
from every frame since appearance changes are spatially and
temporally localized. Instead of using a constant and prese-
lected ROI, we adopt OpenFace (Baltrušaitis et al., 2016) to
detect face segmentations automatically and dynamically.
OpenFace can extract 66 landmarks from the face, marking
the location of eyes, nose, eyebrows, mouth, and the contour
of visage. We compared the performance of forehead, cheeks,
chin, whole face, and extracted skin area (accurate face contour
area without eyes and mouth), which are frequently selected
as ROIs in the state-of-the-art. For the rectangular forehead
area, we used the distance between inner corners of the eyes
as the rectangle width, while the distance from the uppermost
Figure 5 | Face segmentation (1. Forehead; 2. Cheeks; 3. Chin; 4. Whole
face contour landmark to the uppermost eyes landmark
face; 5. Extracted skin).
constitutes the rectangle height. Similarly, the cheek areas had

ROI are used as the HR signal. For peak detection, we applied the mentioned in the papers. The implementation schematic is
algorithm from the open source Toolbox for Emotional feAture shown in Figure 6.
extraction from Physiological signals (TEAP) (Soleymani et al., Poh et al. and Osman et al. tested their method on self-collected
2017). There are different ways of computing PSD. In this section, datasets with BVP signals as ground truth, while Li et al. used
it is estimated via the periodogram method following Monkaresi both self-collected and the MAHNOB-HCI datasets.
et al. (2014). The work of Osman et al. (2015) used finger BVP signals to
label their extracted features. In our reproduction, ECG signals
Comparison on Complete Methods are used from MAHNOB-HCI. We followed the same method but
We reproduced three methods from the work of Poh et al. labeled the feature positive if the R peak existed within the time
(2011), Li et al. (2014), and Osman et al. (2015). This was tolerance of the detected peak from face videos. We randomly
done to compare our analysis with the three main categories selected 12 subjects as training dataset and the other 12 subjects
of HR estimation methods: ICA, background noise estima- as testing dataset. For training, we have 5,000 positive features
tion, and machine learning, respectively. We implement and 5,000 negative features as Osman et al. (2015). For testing,
these three approaches step-by-step and set parameters as there are 6,765 positive features and 7,127 negative ones.
Figure 6 | Schematic diagram for complete method comparison. (A) Schematic diagram from Poh et al. (2011). (B) Schematic diagram from Li et al. (2014).
(C) Schematic diagram from Osman et al. (2015).

Figure 7 | Continued

Figure 7 | Method performance at each stage. “best” indicate the best methods according to average performance. Other methods are tested against the
best method (ns, non-significant; *, p < 0.05; **, p < 0.01). (A) Face segmentation performance. (B) Performance at face blood volume pulse (BVP) extraction.
(C) Performance at heart rate (HR) computation.
RESULTS AND DISCUSSION successfully detected face information which adds noise in face
BVP signal. Considering the computation complexity and detec-
Results at Each Stage tion efficiency, Viola Jones face detector is suitable for slight head
For face segmentation, we can see from Figure 7A that the movements and can be used as a prior method for more complex
extracted skin area performs better than the other facial areas, face detection algorithms.
followed by the entire facial region. Though skin area performs To see how head movements influence signal extraction, we
a bit better than face area, there is no significant difference tested the methods on one video with little head motions. For
between them (t = 1.72; p = 0.07). There is no noticeable differ- this situation, background noise estimation performs better since
ence between the left cheek and the right cheek and the forehead the illumination was the main source of noise. However, there is
does not perform better than other facial areas. According to little difference between the root mean squared error of the green
our results, the more skin area used as BVP resource, the better channel (12.12), PCA (12.51), ICA (11.80), and background noise
performance we can achieve (extracted skin area performs better estimation (11.69). Evaluating on all the samples, ICA performs
than cheek, forehead, and chin areas). The facial expressions, better then background noise estimation (t = 2.27; p = 0.04) and
such as laugh and blinks, tends to add extra noise to the BVP PCA (t = 4.84; p < 0.001) as shown in Figure 7B. Under the
signals but does not influence it considerably when taking the experimental setting of MAHNOB-HCI, head motions tend to
whole face into consideration. We also compared the whole face have a higher influence on the HR detection, rather than illumi-
area with the area detected by Viola Jones face detector (Viola nation changes.
and Jones, 2001). When there are no abrupt head moments and For HR computation, Figure 7C shows that the implemented
other interruptions, the featured raw signals obtained from these peak detection method is more robust than the PSD method
two methods are very similar. The Pearson correlation is statisti- (t = 2.52; p = 0.03). Peak detection reduces the error by averaging
cally significant (r = 0.99; p < 0.001). When the video includes all the IBI over the video duration. For PSD, once the noise takes
more spontaneous movements, OpenFace is more robust with a the dominant frequency and lies in the human HR range, there
higher success detection rate and the correlation between the two is no solution to detect the right HR. The best result from each
methods is degraded (r = 0.83; p < 0.001). Viola Jones detec- stage is shown in Table 2 with extracted skin area, ICA and peak
tor fails face detection on several frames and then uses the last detection method.

Table 2 | Obtained performance for the best method at each stage. Interruptions such as hands partially covering the face will
Stage M (SD)/bpm RMSE/bpm ρ
influence the performance of this method. Furthermore, if the
illumination changes significantly during the HR detection
Face video processing (extracted skin area) 5.34 (14.98) 15.05 0.20 process, noise estimation methods could be applied to improve
Blood volume pulse signal extraction 4.09 (13.37) 13.56 0.32
the overall performance.
(independent component analysis)
Heart rate computing (peak detection) 3.01 (12.14) 12.23 0.55 The main limitation of this study comes specifically from
the selected database and testing methods. Possible risks of bias
Performance is measured by M (mean error), SD (standard derivation), RMSE (root
mean squared error), and ρ (correlation coefficient).
can derive from the small number of subjects (24 participants)
and the selected data source (444 videos and corresponding
ECG signals from MAHNOB dataset). The specific experi-
Table 3 | Obtained performance for the complete methods. mental setting could favor some of the considered methods
and could potentially be detrimental for others. For example,
Method M (SD)/bpm RMSE/bpm ρ
the forehead area is reported have a good performance
Poh et al. (2011) 4.07 (13.04) 13.81 0.28 (Lewandowska et al., 2011; Stricker et al., 2014), while our
Li et al. (2014) 2.15 (10.04) 10.33 0.68 results showed that this was not significant, potentially due to
Osman et al. (2015) 3.37 (12.08) 12.79 0.47 the EEG head cap which partially covers the forehead area in
Performance is measured by M (mean error), SD (standard derivation), RMSE (root MAHNOB videos.
mean squared error), and ρ (correlation coefficient).
CONCLUSION
Results From Complete Methods Remote HR measurements from face videos have improved
The results of three methods tested on MAHNOB-HCI are shown during the last few years. Among the research in this domain,
in Table 3. Unsurprisingly the performance of Poh et al. (2011)’s the designed models, parameter settings, chosen algorithms,
method is significantly dropped than its self-reported results and equipment are plenty, complex, and vary enormously. Some
using the proprietary dataset (mean bias is 0.64, RMSE is 4.63, approaches achieve high accuracy under well-controlled situa-
and correlation coefficient is 0.95). This is probably due to the tions but degrade with illumination changes and head motions.
fact that the original dataset is rather stationary and avoids head In this article, we performed (a) the collection and classification of
motions. Li et al. (2014) and Lam and Yoshinori (2015) both state-of-the-art methods into three stages and (b) the comparison
tested the Poh et al. (2011) method on MAHNOB-HCI dataset. of their performance under HCI conditions.
Our testing result is a bit worse than Li et al. (2014) whose mean The MAHNOB-HCI dataset is used for algorithm testing
bias is 2.04, RMSE is 13.6, and correlation coefficient is 0.36, but and analysis since it is a publicly accessible dataset. Our results
better than the result from Lam and Yoshinori (2015) (RMSE is showed that at the pre-processing stage, accurate face detection
21.3). Following Li et al. (2014) method, we obtained better per- algorithms performed better than rough ROI detection. The
formance than Lam and Yoshinori (2015), but worse results than extracted facial skin area used as the source HR signal obtained
that presented by Li et al. (2014) themselves. Although evalua- better result than any other facial ROI. For signal extraction,
tions are all based on the MAHNOB-HCI dataset, samples, and under most cases, ICA method obtained decent results. When
algorithm parameters are not exactly the same. As to the method the background is monotone, removing the noise estimated from
from Osman et al. (2015), we cannot really compare the results the background increased the HR detection accuracy efficiently.
since it was tested on self-collected dataset only. It shows better As for post-processing, peak detection in time domain was more
result than Poh et al. (2011), but not as good as Li’s method. From reliable than the frequency domain methods. In conclusion, we
our test, the method of Li et al. (2014) performs significantly bet- built an efficient pipeline for non-intrusive HR detection from
ter than Poh et al. (2011) (t = 5.00; p < 0.001) and Osman et al. face videos by combining the methods we found to be the best.
(2015) (t = 4.51; p < 0.001). This pipeline with skin area extraction, ICA and peak detection
We can see from Tables 2 and 3 that the face segmentation demonstrated a state-of-the-art accuracy.
influences results significantly. With the best performed method Though considerable progress has been made in this domain,
of each stage, it can achieve competitive results compared with there are still many difficulties. The state-of-the-art approaches
complete state-of-the-art methods which are more complex. Thus, are not robust enough when applied under natural conditions
we obtained and proposed an efficient pipeline with extracted and are still unable to detect HR in real-time. Technically speak-
skin area, ICA and peak detection for detecting HR remotely ing, machine learning methods may be promising for remote
from face videos. HR detection. Especially with finger oximeter as ground truth
This pipeline could be applied for other studies with front measurement, since the BVP signal detected from facial regions
face recordings under environmental illumination. It is robust should have a similar shape with collected finger or ear BVP
with facial expressions and limited head motions (translation signals. So far there are only three papers using machine learning
and orientation), which is often the case for the majority of methods and both of them concentrate on the post-processing
human–computer interaction processes (online education, category. With the development of deep end-to-end learning, the
computer gaming, etc.). The pipeline is expected to perform robustness and accuracy of HR detection may increase efficiently
better under more stable conditions with less head movements. even under naturalistic situations.

Instantaneous HR reveals more information about the sub- ACKNOWLEDGMENTS

ject’s physiological and affective status than mean HR. According
to Malik (1996) and Armony and Vuilleumier (2013), the heart The authors would like to show the gratitude to Louis Philippe
beat variability can be used for investigating mental workload Simoes Castelo Branco, Computer Vision and Multimedia Labo
or to detect emotions such as anger and sadness. The work ratory, Computer Science Department, University of Geneva, for
from Lakens (2013) has already shown the possibility of using reviewing the early version of the manuscript. His corrections and
smartphones to measure HR variation associated with relived suggestions greatly improved the quality of the manuscript.
experiences of anger and happiness. Thus psychological insights
of HR changes and the link to affective status can be taken into FUNDING
consideration for further study. In short, more efforts could
be devoted to reliable instantaneous HR detection in realistic This work is supported by the Swiss National Science Foundation
scenarios. under the grant No. 200021E - 164326.
AUTHOR CONTRIBUTIONS SUPPLEMENTARY MATERIAL

CW designed and implemented the methods. CW and GC ana- The Supplementary Material for this article can be found online at
lyzed and interpreted the results. All authors contributed to the https://www.frontiersin.org/articles/10.3389/fbioe.2018.00033/
redaction of the manuscript. full#supplementary-material.
REFERENCES De la Torre, F., Chu, W.-S., Xiong, X., Vicente, F., Ding, X., and Jeffrey
Cohn, J. (2015). “IntraFace,” in Automatic Face and Gesture Recognition
Abdi, H., and Williams, L. J. (2010). Principal component analysis. Wiley Interdiscip. (FG), 2015 11th IEEE International Conference and Workshops on, Vol. 1
Rev. 2, 433–459. doi:10.1002/wics.101 (Pittsburgh: IEEE).
Allen, J. (2007). Photoplethysmography and its application in clinical physio Garbey, M., Sun, N., Merla, A., and Pavlidis, I. (2007). Contact-free measurement
logical measurement. Physiol. Meas. 28, R1. doi:10.1088/0967-3334/28/3/R01 of cardiac pulse based on the analysis of thermal imagery. IEEE Trans. Biomed.
Armony, J., and Vuilleumier, P. (eds) (2013). The Cambridge Handbook of Human Eng. 54, 1418–1426. doi:10.1109/TBME.2007.891930
Affective Neuroscience. Cambridge University Press. Hjelmås, E., and Low, B. K. (2001). Face detection: a survey. Comput. Vis. Image
Balakrishnan, G., Durand, F., and Guttag, J. (2013). “Detecting pulse from head Understand. 83, 236–274. doi:10.1006/cviu.2001.0921
motions in video,” in Proceedings of the IEEE Conference on Computer Vision Huelsbusch, M., and Blazek, V. (2002). “Contactless mapping of rhythmical phe-
and Pattern Recognition. Portland, Oregon. nomena in tissue perfusion using PPGI,” in Proc. SPIE 4683, Medical Imaging
Baltrušaitis, T., Robinson, P., and Morency, L.-P. (2016). “Openface: an open source 2002: Physiology and Function from Multidimensional Images, Vol. 110, San
facial behavior analysis toolkit,” in Applications of Computer Vision (WACV), Diego, CA.
2016 IEEE Winter Conference on (IEEE). Hyvärinen, A., and Oja, E. (2000). Independent component analysis: algorithms
Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. and applications. Neural Networks 13, 411–430. doi:10.1016/S0893-6080(00)
Neural Computat. 11, 157–192. doi:10.1162/089976699300016863 00026-5
Cennini, G., Arguel, J., Akşit, K., and van Leest, A. (2010). Heart rate monitoring Irani, R., Nasrollahi, K., and Moeslund, T. B. (2014). “Improved pulse detection
via remote photoplethysmography with motion artifacts reduction. Opt. Exp. from head motions using DCT,” in Computer Vision Theory and Applications
18, 4867–4875. doi:10.1364/OE.18.004867 (VISAPP), 2014 International Conference on, Vol. 3 (Lisbon, Portugal: IEEE).
Chan, K. W., and Zhang, Y. T. (2002). Adaptive reduction of motion artifact from Jeanne, V., Asselman, M., den Brinker, B., and Bulut, M. (2013). “Camera-based
photoplethysmographic recordings using a variable step-size LMS filter. Sensors heart rate monitoring in highly dynamic light conditions,” in 2013 International
2, 1343–1346. doi:10.1109/ICSENS.2002.1037314 Conference on Connected Vehicles and Expo (ICCVE) (Las Vegas, NV: IEEE).
Chen, D.-Y., Wang, J. J., Lin, K. Y., Chang, H. H., Wu, H. K., Chen, Y. S., et al. (2015). Jensen, J. N., and Hannemose, M. (2014). Camera-based Heart Rate Monitoring.
Image sensor-based heart rate evaluation from face reflectance using hilbert– Lyngby, Denmark: Department of Applied Mathematics and Computer Science,
huang transform. IEEE Sens. J. 15, 618–627. doi:10.1109/JSEN.2014.2347397 DTU Computer, 17.
Chen, W., and Picard, R. W. (2017). “Eliminating physiological information from Kakumanu, P., Makrogiannis, S., and Bourbakis, N. (2007). A survey of skin-
facial videos,” in Automatic Face & Gesture Recognition (FG 2017), 2017 12th color modeling and detection methods. Pattern Recognit. 40, 1106–1122.
IEEE International Conference on (IEEE). doi:10.1016/j.patcog.2006.06.010
Da, H., Winokur, E. S., and Sodini, C. G. (2011). “A continuous, wearable, and Kumar, M., Veeraraghavan, A., and Sabharwal, A. (2015). DistancePPG: robust
wireless heart monitor using head ballistocardiogram (BCG) and head elec- non-contact vital signs monitoring using a camera. Biomed. Opt. Exp. 6,
trocardiogram (ECG),” in 2011 Annual International Conference of the IEEE 1565–1588. doi:10.1364/BOE.6.001565
Engineering in Medicine and Biology Society (Boston, Massachusetts: IEEE). Kwon, S., Kim, H., and Suk Park, K. (2012). “Validation of heart rate extraction
Datcu, D., Cidota, M., Lukosch, S., and Rothkrantz, L. (2013). “Noncontact using video imaging on a built-in camera system of a smartphone,” in 2012
automatic heart rate analysis in visible spectrum by specific face regions,” in Annual International Conference of the IEEE Engineering in Medicine and
Proceedings of the 14th International Conference on Computer Systems and Biology Society (San Diego, CA: IEEE).
Technologies (New York: ACM). Lakens, D. (2013). Using a smartphone to measure heart rate changes during
Dawson, J. A., Kamlin, C. O. F., Wong, C., Te Pas, A. B., Vento, M., Cole, T. J., relived happiness and anger. Trans. Affect. Comput. 4, 238–241. doi:10.1109/
et al. (2010). Changes in heart rate in the first minutes after birth. Arch. T-AFFC.2013.3
Dis. Childhood Fetal Neonatal Ed. 95, F177–F181. doi:10.1136/adc.2009. Lam, A., and Yoshinori, K. (2015). “Robust heart rate measurement from video
169102 using select random patches,” in Proceedings of the IEEE International Conference
De Haan, G., and Vincent, J. (2013). Robust pulse rate from chrominance- on Computer Vision. Santiago, Chile.
based rPPG. IEEE Transac. Biomed. Eng. 60, 2878–2886. doi:10.1109/TBME. Lewandowska, M., Rumiński, J., Kocejko, T., and Nowak, J. (2011). “Measuring
2013.2266196 pulse rate with a webcam—a non-contact method for evaluating cardiac

activity,” in Computer Science and Information Systems (FedCSIS), 2011 Federated Communication, 2014 RO-MAN: The 23rd IEEE International Symposium on
Conference on (Szczecin, Poland: IEEE). (Edinburgh, Scotland: IEEE), p. 1056–1062.
Li, X., Chen, J., Zhao, G., and Pietikainen, M. (2014). “Remote heart rate measure- Sun, Y., Hu, S., Azorin-Peris, V., Kalawsky, R., and Greenwald, S. (2013). Noncontact
ment from face videos under realistic situations,” in Proceedings of the IEEE imaging photoplethysmography to effectively access pulse rate variability.
Conference on Computer Vision and Pattern Recognition. Columbus, Ohio. J. Biomed. Opt. 18, 061205–061205. doi:10.1117/1.JBO.18.6.061205
Liao, X., and Carin, L. (2002). “A new algorithm for independent component Sun, Y., Papin, C., Azorin-Peris, V., Kalawsky, R., Greenwald, S., and Hu, S. (2012).
analysis with or without constraints,” in Sensor Array and Multichannel Signal Use of ambient light in remote photoplethysmographic systems: comparison
Processing Workshop Proceedings, 2002 (Rosslyn, VA: IEEE). between a high-performance camera and a low-cost webcam. J. Biomed. Opt.
Malik, M. (1996). Heart rate variability. Ann. Noninvasive Electrocardiol. 1, 17, 0370051–03700510. doi:10.1117/1.JBO.17.3.037005
151–181. doi:10.1111/j.1542-474X.1996.tb00275.x Tarassenko, L., Villarroel, M., Guazzi, A., Jorge, J., Clifton, D. A., and Pugh, C.
McDuff, D. J., Blackford, E. B., and Estepp, J. R. (2017). “The impact of video (2014). Non-contact video-based vital sign monitoring using ambient light and
compression on remote cardiac pulse measurement using imaging photopleth- auto-regressive models. Physiol. Meas. 35, 807. doi:10.1049/htl.2014.0077
ysmography,” in Automatic Face & Gesture Recognition (FG 2017), 2017 12th Tarvainen, M. P., Ranta-Aho, P. O., and Karjalainen, P. A. (2002). An advanced
IEEE International Conference on (IEEE). detrending method with application to HRV analysis. IEEE Trans. Biomed. Eng.
Mestha, L. K., Kyal, S., Xu, B., Lewis, L. E., and Kumar, V. (2014). “Towards contin- 49, 172–175. doi:10.1109/10.979357
uous monitoring of pulse rate in neonatal intensive care unit with a webcam,” in Tompkins, W. J. (1993). Biomedical Digital Signal Processing: C-language
2014 36th Annual International Conference of the IEEE Engineering in Medicine Examples and Laboratory Experiments for the IBM PC. Hauptbd. Editorial
and Biology Society (Chicago, IL: IEEE). Prentice Hall.
Monkaresi, H., Calvo, R. A., and Yan, H. (2014). A machine learning approach Tran, D. N., Lee, H., and Kim, C. (2015). “A robust real time system for remote
to improve contactless heart rate monitoring using a webcam. IEEE J Biomed. heart rate measurement via camera,” in 2015 IEEE International Conference on
Health Inform. 18, 1153–1160. doi:10.1109/JBHI.2013.2291900 Multimedia and Expo (ICME) (Torino, Italy: IEEE).
Moreno, J., Ramos-Castro, J., Movellan, J., Parrado, E., Rodas, G., and Capdevila, L. Tulyakov, S., Alameda-Pineda, X., Ricci, E., Yin, L., Cohn, J. F., and Sebe, N. (2016).
(2015). Facial video-based photoplethysmography to detect HRV at rest. Int. “Self-adaptive matrix completion for heart rate estimation from face videos
J. Sports Med. 36, 474–480. doi:10.1055/s-0034-1398530 under realistic conditions,” in Proceedings of the IEEE Conference on Computer
Muender, T., Miller, M. K., Birk, M. V., and Mandryk, R. L. (2016). “Extracting heart Vision and Pattern Recognition (Las Vegas, NV: IEEE).
rate from videos of online participants,” in Proceedings of the SIGCHI Conference Verkruysse, W., Svaasand, L. O., and Stuart Nelson, J. (2008). Remote plethysmo-
on Human Factors in Computing Systems (CHI’2016). San Jose, CA. graphic imaging using ambient light. Opt. Exp. 16, 21434–21445. doi:10.1364/
Osman, A., Turcot, J., and El Kaliouby, R. (2015). “Supervised learning approach to OE.16.021434
remote heart rate estimation from facial videos,” in Automatic Face and Gesture Vezhnevets, V., Sazonov, V., and Andreeva, A. (2003). “A survey on pixel-based skin
Recognition (FG), 2015 11th IEEE International Conference and Workshops on, color detection techniques”, in Proc. Graphicon, Vol. 3, 85–92.
Vol. 1 (Washington: IEEE). Villarroel, M., Jorge, J., Pugh, C., and Tarassenko, L. (2017). “Non-contact vital sign
Pavlidis, I., Dowdall, J., Sun, N., Puri, C., Fei, J., and Garbey, M. (2007). Interacting monitoring in the clinic,” in Automatic Face & Gesture Recognition (FG 2017),
with human physiology. Comput. Vis. Img. Understand. 108, 150–170. 2017 12th IEEE International Conference on (IEEE).
doi:10.1016/j.cviu.2006.11.018 Viola, P., and Jones, M. (2001). “Rapid object detection using a boosted cascade
Poh, M.-Z., McDuff, D. J., and Picard, R. W. (2010). Non-contact, automated of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR
cardiac pulse measurements using video imaging and blind source separation. 2001. Proceedings of the 2001 IEEE Computer Society Conference on, Vol. 1
Opt. Exp. 18, 10762–10774. doi:10.1364/OE.18.010762 (Kauai, HI: IEEE).
Poh, M.-Z., McDuff, D. J., and Picard, R. W. (2011). Advancements in noncontact, Wang, W., den Brinker, A. C., Stuijk, S., and de Haan, G. (2017). “Color-distortion
multiparameter physiological measurements using a webcam. IEEE Trans. filtering for remote photoplethysmography,” in Automatic Face & Gesture
Biomed. Eng. 58, 7–11. doi:10.1109/TBME.2010.2086456 Recognition (FG 2017), 2017 12th IEEE International Conference on (IEEE).
Pursche, T., Krajewski, J., and Moeller, R. (2012). “Video-based heart rate measure- Wei, L., Tian, Y., Wang, Y., Ebrahimi, T., and Huang, T. (2012). “Automatic web-
ment from human faces,” in 2012 IEEE International Conference on Consumer cam-based human heart rate measurements using laplacian eigenmap,” in Asian
Electronics (ICCE) (Berlin: IEEE). Conference on Computer Vision (Berlin, Heidelberg: Springer).
Rother, C., Kolmogorov, V., and Blake, A. (2004). Grabcut: interactive foreground Werner, P., Al-Hamadi, A., Walter, S., Gruss, S., and Traue, H. C. (2014). “Automatic
extraction using iterated graph cuts. ACM Trans. Graph. 23, 309–314. heart rate estimation from painful faces,” in 2014 IEEE International Conference
doi:10.1145/1015706.1015720 on Image Processing (ICIP) (Paris, France: IEEE).
Ruben, N. E. (2015). Remote Heart Rate Estimation Using Consumer-Grade Wold, S., Esbensen, K., and Geladi, P. (1987). Principal component analysis.
Cameras [Dissertation]. Logan, UT: Utah State University. Chemom. Intell. Lab. Syst. 2.1-3, 37–52. doi:10.1016/0169-7439(87)80084-9
Rubinstein, M. (2013). Analysis and Visualization of Temporal Variations in Video Wu, H. Y., Rubinstein, M., Shih, E., Guttag, J. V., Durand, F., and Freeman, W.
[Dissertation]. Cambridge, MA: Massachusetts Institute of Technology. (2012). ACM Transactions on Graphics (TOG) – Proceedings of ACM SIGGRAPH
Sahindrakar, P., de Haan, G., and Kirenko, I. (2011). Improving Motion Robustness 2012, Vol. 31. New York, NY: ACM.
of Contact-Less Monitoring of Heart Rate Using Video Analysis. Eindhoven, The Xu, S., Sun, L., and Kunde Rohde, G. (2014). Robust efficient estimation of
Netherlands: Technische Universiteit Eindhoven, Department of Mathematics heart rate pulse from video. Biomedical Opt. Exp. 5, 1124–1135. doi:10.1364/
and Computer Science. BOE.5.001124
Saragih, J. M., Lucey, S., and Cohn, J. F. (2011). Deformable model fitting by reg- Yang, F. (2006). 独立分量分析的原理与应用, Tsinghua University Press.
ularized landmark mean-shift. Int. J. Comput. Vis. 91, 200–215. doi:10.1007/ Yu, Y.-P., Kwan, B. H., Lim, C. L., Wong, S. L., and Raveendran, P. (2013). “Video-
s11263-010-0380-4 based heart rate measurement using short-time Fourier transform,” in Intelligent
Soleymani, M., Lichtenauer, J., Pun, T., and Pantic, M. (2012). A multimodal database Signal Processing and Communications Systems (ISPACS), 2013 International
for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 3, 42–55. Symposium on (Okinawa, Japan: IEEE).
doi:10.1109/T-AFFC.2011.25 Yu, Y.-P., Raveendran, P., and Lim, C.-L. (2014). “Heart rate estimation from
Soleymani, M., Villaro-Dixon, F., Pun, T., and Chanel, G. (2017). Toolbox for facial images using filter bank,” in Communications, Control and Signal
emotional feAture extraction from physiological signals (TEAP). Front. ICT Processing (ISCCSP), 2014 6th International Symposium on (Athens, Greece:
4:1. doi:10.3389/fict.2017.00001 IEEE).
Starr, I., Rawson, A. J., Schroeder, H. A., and Joseph, N. R. (1939). Studies on the Yu, Y.-P., Raveendran, P., and Lim, C.-L. (2015). Dynamic heart rate measure-
estimation of cardiac output in man, and of abnormalities in cardiac functions, ments from video sequences. Biomed. Opt. Exp. 6, 2466–2480. doi:10.1364/
from the heart’s recoil and the blood’s impacts; the ballistocardiogram. Am. BOE.6.002466
J. Physiol. 127, 1–28. Zaunseder, S., Heinke, A., Trumpp, A., and Malberg, H. (2014). ““Heart beat detec-
Stricker, R., Müller, S., and Gross, H. M. (2014). “Non-contact video-based pulse tion and analysis from videos.” Electronics and Nanotechnology (ELNANO),”
rate measurement on a mobile service robot”, in Robot and Human Interactive in 2014 IEEE 34th International Conference on (Kyiv, Ukraine: IEEE).

Zhang, Z., and Zha, H. (2004). Principal manifolds and nonlinear dimensionality Copyright © 2018 Wang, Pun and Chanel. This is an open-access article distrib-
reduction via tangent space alignment. SIAM J Sci. Comput. 26, 313–338. uted under the terms of the Creative Commons Attribution License (CC BY).
doi:10.1137/S1064827502419154 The use, distribution or reproduction in other forums is permitted, provided the
original author(s) and the copyright owner are credited and that the original
Conflict of Interest Statement: The authors declare that the research was con- publication in this journal is cited, in accordance with accepted academic prac-
ducted in the absence of any commercial or financial relationships that could be tice. No use, distribution or reproduction is permitted which does not comply
construed as a potential conflict of interest. with these terms.

Paper 2

Uploaded by

Copyright:

Available Formats

Paper 2

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Paper 2

Uploaded by

Copyright:

Available Formats

Systematic Review

published: 01 May 2018

Chen Wang1*, Thierry Pun1,2 and Guillaume Chanel1,2

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 1 May 2018 | Volume 6 | Article 33

Table 1 | Classification of state-of-the-art methods.

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 2 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 3 May 2018 | Volume 6 | Article 33

used for motion-based methods. For intensity-based methods,

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 4 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 5 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 6 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 7 May 2018 | Volume 6 | Article 33

Figure 4 | Overview of region of interest selection (Stricker et al., 2014).

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 8 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 9 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 10 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 11 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 12 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 13 May 2018 | Volume 6 | Article 33

Instantaneous HR reveals more information about the sub- ACKNOWLEDGMENTS

AUTHOR CONTRIBUTIONS SUPPLEMENTARY MATERIAL

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 14 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 15 May 2018 | Volume 6 | Article 33

Frontiers in Bioengineering and Biotechnology | www.frontiersin.org 16 May 2018 | Volume 6 | Article 33

You might also like