Non-contact sensing of neonatal pulse rate using camera-based imaging: a clinical feasibility study

M Paul; S Karthik; J Joseph; M Sivaprakasam; J Kumutha; S Leonhardt; C Hoog Antink

doi:10.1088/1361-6579/ab755c

Abbreviations

BA	Bland–Altman
BR	breathing rate
CMOS	complementary metal-oxide-semiconductor
FFT	fast Fourier transform
FOV	field of view
HD	high-definition
IRT	infrared thermography
KCF	kernelized correlation filter
LED	light-emitting diode
LWIR	long-wavelength infrared
MA	movement activity
NICU	neonatal intensive care unit
NIR	near-infrared
PPG	photoplethysmography
PPGI	photoplethysmography imaging
PR	pulse rate
RGB	red, green, blue
RPC	coefficient of reproducibility
ROI	region of interest
SMCH	Saveetha Medical College Hospital
SQI	signal quality index
STFT	short-time Fourier transform
VIS	visible light

1. Introduction

Current physiological monitoring of infants involves a lot of wires and cabling associated with skin-attached sensors, for example electrocardiography, pulse oximeters, or temperature probes. In addition to the discomfort, contact-based sensors carry the risk of causing injuries, such as 'medical adhesive-related skin injuries' (Lund 2014), which are a problem for patients with vulnerable and fragile skin such as infants.

Non-contact sensing modalities using cameras have emerged over the last years. In the near future, these modalities could complement or even replace some of the existing contact-based technologies. The lack of disposables and the ease of use (no sensors need to be attached), will be considered an advantage by most users. Thus, fast deployment is desirable. With this in mind, in this project, we investigate the feasibility and accuracy of the estimation of the pulse rate (PR) of neonates in the neonatal intensive care unit (NICU) during clinical routine. The PR is equivalent to the heart rate measured in the periphery, for example, by using optical means such as pulse oximeters.

1.1. Sensing modalities

A setup with two camera-based sensing modalities was used to record data of infants: photoplethysmography imaging (PPGI) and infrared thermography (IRT). We report an evaluation using the former modality as preliminary results of the study.

In a way, PPGI is an extension of photoplethysmography (PPG), as it allows spatially resolved measurements of blood volume pulsations in tissue by measuring changes of reflected and backscattered light. In this case, the sensor of a video camera replaces the photosensitive elements in, for example, a pulse oximeter of a contact-PPG probe by combining multiple sensor pixels to facilitate remote and multi-spot measurements. Visible light (VIS) (0.38 to 0.78 µm) or near-infrared (NIR) (IR-A: 0.78 to 1.4 µm) light can be used, thus, measurement with ambient light is possible. However, the PPG signal depends on the light absorption of blood, thus, the effective measurement range is more limited. Depending on the application constraints, the cameras can be equipped with optical filters to reduce the effect of unwanted light. For the most part, movement and unpredictable ambient light changes are the two challenges affecting the measurements. In IRT, specialized cameras sensitive to thermal radiation in the mid-wavelength infrared (3 to 8 µm) or long-wavelength infrared (LWIR) (8 to 15 µm) are used. Specifics about the modality are given in Abbas et al (2011).

In this work, the focus is on PPGI as it is more suitable to assess heart activity either by the PPG principle or by ballistography (movement). By contrast, IRT is suited to analyze breathing activity, for example, by breathing rate (BR) (Abbas et al 2011). However, there is evidence, at least for adults, that ballistocardiographic head movements (caused by the heart) may be used in IRT video sequences to assess heart activity (Barbosa Pereira et al 2018).

Sample images of the two modalities used in the study are given in figure 1 as an illustration.

**Figure 1.** The silhouette of an infant below a radiant warmer is shown. The field of views (FOVs) of the various cameras used in the study are demonstrated (see table 3). The top row shows thermal images in false-color. The second row shows images of three PPGI cameras. In the case of the color camera (red, green, blue (RGB)), the raw image was demosaiced and made perceivable as a color image. The third row shows the individual color channels of the demosaiced RGB camera. The FOV of $\mathrm{CAM}_\mathrm{IRT,HD}$ differs from the other cameras due to positioning (see figure 3).
Download figure:
Standard image High-resolution image

1.2. State-of-the-art

Camera-based sensing is not currently routinely used in clinical practice for monitoring vital signs. Consequently, studies on infants are rare. A recent survey (AlZubaidi et al 2018) has pointed out the potential of contactless sensing modalities (including PPGI and IRT) and their applications in the NICU. To the best of our knowledge, the first PPGI measurements in an incubator setting via a specialized scientific video camera and matched lighting (green) were conducted by Vagedes et al in 2004 (Vagedes et al 2004). At that time, motion artifacts and the requirement to minimize and block stray light were identified as the challenges to overcome (see Hülsbusch (2008)).

Compared to adults, infants have a higher PR and thinner skin. Patients of this group are more heterogenous (Vrancken et al 2018). In infants, the not yet fully controlled movements can be spasmodic and become gentler when maturing: tremors in hands and legs would be replaced by the movement of the arms and legs (Cobos-Torres et al 2018).

A brief overview of the existing studies of neonates and infants is given below.

1.2.1. Subjects

Clinical studies involving several infants and assessing PR based on PPGI can be found in Vagedes et al (2004), Scalise et al (2012), Aarts et al (2013), Klaessens et al (2014), Mestha et al (2014), Villarroel et al (2014), Fernando et al (2015), Blanik et al (2016), Paul et al (2017), Antognoli et al (2018), Cobos-Torres et al (2018).

Measurements on one or a few infants are included in, for example, Wu et al (2012), Zhao et al (2013), Wang et al (2018). The applicability of PPGI in various clinically relevant scenarios has been researched in Aarts et al (2013). As PPGI requires light, skin pigmentation is one aspect that affects the signal-to-noise ratio of the PPG signal. For instance, a lower signal-to-noise ratio has been reported for a subject with darker skin (Aarts et al 2013). In some publications, the skin tone is explicitly mentioned (Fernando et al 2015, Wang et al 2018, Chaichulee et al 2018). A subject set similar to the one explored here containing Indian neonates was presented in Mestha et al (2014).

1.2.2. Recording length

Studies differ greatly in terms of recording length. Antognoli and co-workers (Antognoli et al 2018), for example, recorded until a 10 s window of movement-free video was available, Aarts and colleagues (Aarts et al 2013) recorded between 1 min and 5 min, while Villarroel and co-workers (Villarroel et al 2014) recorded for several hours over consecutive days.

1.2.3. Cameras

In terms of equipment, measurements were conducted with simple webcams (Scalise et al 2012, Antognoli et al 2018), digital cameras (Klaessens et al 2014, Cobos-Torres et al 2018, Gibson et al 2019), industrial and scientific cameras (Zhao et al 2013, Aarts et al 2013) or even augmented reality goggles (Fernando et al 2015). Image resolutions were typically 640 × 480 px and below, while high-definition (HD) resolutions have been rare (e.g. Villarroel et al (2014), or Paul et al (2017) and Antognoli et al (2018)). In the past, charge-coupled device sensors were used, but nowadays, complementary metal-oxide-semiconductor (CMOS) sensors are prevalent.

1.2.4. Light

The PPGI was used with ambient light (natural, artificial, or a mixture of both) (Mestha et al 2014, Villarroel et al 2014, Cobos-Torres et al 2018), using green light (Vagedes et al 2004, Scalise et al 2012) or NIR light (Klaessens et al 2014, Blanik et al 2016).

1.2.5. Algorithms

Raw camera-based PPG time series are normally generated by spatially averaging pixels (e.g. Hülsbusch (2008)). Furthermore, if more than one color channel is available, these time series can be combined, for example, by independent component analysis (Scalise et al 2012, Villarroel et al 2014). Regions of interest (ROIs) have to be identified and tracked to form a continuous time series. The identification is still often performed manually in feasibility studies, and regions are tracked subsequently by typical algorithms.

Wang and colleagues (Wang et al 2018) have recently made one very interesting finding that can be used on single channels: when more non-skin pixels are part of an ROI, the spatial variance yields a better time series compared to spatial averaging. By way of contrast, when skin pixels dominate, spatial averaging would be preferred.

To estimate PR by also exploiting the spatial resolution, our group has used spatio-temporal mappings by applying grid-based approaches (Blanik et al 2016, Paul et al 2017).

Frequency approaches, such as wavelet analysis (Vagedes et al 2004), or Fourier-based analysis, such as power spectral densities/spectrograms (Vagedes et al 2004, Scalise et al 2012, Aarts et al 2013, Mestha et al 2014, Cobos-Torres et al 2018) have often been applied to estimate PR.

An approach that allows compensating for interference if a non-patient ROI is also available was used by Villarroel and co-workers (Villarroel et al 2014): for this, autoregressive models and pole cancellation techniques were used.

The ROIs are often identified manually or by segmentation. Recent research has been directed at segmentation, among others, by a convolutional neural network (Chaichulee et al 2017, Chaichulee et al 2018) or by fuzzy inference systems (Kaur et al 2017) to improve time series extraction.

In this work, the focus is on the feasibility of the PR assessment of neonates using videos exploiting PPG and ballistographic signals without making any further distinction between the two signals. For this, we present an algorithmic processing chain similar to the one used in Mestha et al (2014). Here, we also benchmark signals against a more simplistic approach by tracking the subjects as a whole. Furthermore, the algorithms are challenged by reduced amplitudes due to skin pigmentation in VIS. In addition, we would also like to highlight observations important for the deployment of camera-based sensing in neonatal applications, particularly noise sources for PPGI.

1.3. Outline

In section 2, we describe our study design for camera-based measurements in the NICU and the approaches for signal assessment and evaluation. Subsequently, the results are presented in section 3 and are discussed and evaluated in section 4. Finally, this work concludes with the key findings and future research directions in section 5.

2. Methods

2.1. Study design

We conducted a single-center pilot study where we observed neonates utilizing video cameras. Twenty neonates were involved as subjects (15 South, five North Indian; seven males, 13 females). The study was conducted in April 2018 at Saveetha Medical College Hospital (SMCH) and was approved by the Institutional Ethics Committee of Saveetha University (SMC/IEC/2018/03/067).

A total of 25 measurements of 10 min length each were recorded. One of the recordings had to be omitted from the evaluation because of technical difficulties and one recording was only 8.40 min long but had been included. Thus, the results presented belong to M = 24 measurements of 19 neonates (14 South, five North Indian; six males, 13 females).

The study aimed to show the feasibility of camera-based PR assessment in a clinical scenario. Therefore, the measurements took place during clinical routine with the constraint that interaction with the infant was not possible from the side of the bed with the measurement setup. Here, routine care means that caregivers regularly checked on the neonate, for example, if there was an alarm or the neonate was crying.

With the intention to calm the neonate, caregivers or parents, reacted by massaging feet or hands, by gently stroking the infant's back or abdomen, helped with thumb sucking, or even rocked the infant back to sleep. Such actions introduced unwanted movement and local changes in light intensity not only in the background but also in image parts containing the neonate. Therefore, the measurements reflect reality fairly accurate and contain several challenging situations.

2.1.1. Patient beds

The hospital's standard bed, the Infant Radiant Warmer NWS 102 (Phoenix Medical System Private Limited, Chennai 600032, India) was used as a patient bed. Furthermore, two measurements were conducted in a Transport Incubator TINC 101 (Phoenix Medical System Private Limited, Chennai 600032, India) with two neonates who were born prematurely.

2.1.2. Subject characteristics

The subject and measurement characteristics of the 24 measurements are given in tables 1 and 2, respectively. If a subject has been measured multiple times, the data has also been included. Weight and size were unavailable for one subject; nevertheless, the measurement is also included. There are regional differences in skin tone in India: geographically, from north to south, skin tones become darker on average. This information was added as darker skin is assumed to post more challenges compared to lighter skin due to the increased amount of melanin.

Table 1. Subject characteristics

Body size Head circ Weight (cm) (cm) (g)

Mean	46.3	32.2	2247
SD	5.0	2.0	522

Table 2. Measurement characteristics

Gender Indian (female / male) (North / South)

17 / 7	7 / 17

2.2. Measurement setup

A sketch is given in figure 2 and the realization is depicted in figure 3 to illustrate the measurement setup: a camera setup was positioned on the right-hand side of the patient bed (frame of reference: subject in a supine position, the head oriented to the controls of the radiant warmer). The camera-to-subject distance was approximately 70 cm for the measurements with the radiant warmer (figure 1) and less for the ones using an incubator (figure 4). In the case of the incubator specifically, the setup had to be moved in such a way that a moderate FOV through opened clappers could be achieved. Moreover, light from behind the setup had to be shielded to reduce reflections on the plastic foil which was used to cover the openings of the clappers.

2.2.1. Video cameras

Multiple cameras were used to record the babies resulting in the images given in figure 1 (radiant warmer) and figure 4 (incubator). Accordingly, specifications of all cameras used in the experiments are given in table 3 and below.

**Figure 3.** Measurements were conducted bedside. Several PPGI and IRT cameras recorded videos during standard monitoring and routine care.
Download figure:
Standard image High-resolution image

Three sensitive CMOS cameras were used for PPGI: two Grasshopper 3 GS3-U3-23S6M-C and one color camera (GS3-U3-23S6C-C; FLIR, USA). The two monochrome cameras were each equipped with an optical filter: 850 nm (BN850) or 940 nm (BN940) (Midwest Optical Systems, Inc. USA). In the following, these cameras are identified as $\mathrm{CAM_{850}}$ , $\mathrm{CAM_{940}}$ and $\mathrm{CAM_{RGB}}$ , respectively. Similarly, color channels are identified as $\mathrm{CAM_{RGB~X}}$ , where 'X' is 'R', 'G' or 'B'. The color channels from short to long wavelengths are characterized as follows: $B\mathrel{\widehat{ = }} 470\; {\rm nm}$ , $G\mathrel{\widehat{ = }} 525\; {\rm nm}$ and $R\mathrel{\widehat{ = }} 640\; {\rm nm}$ . The lenses used were the CF12.5HA-1 (Fujifilm Holdings K.K., Japan), a 12.50 mm fixed focal lens.

**Figure 4.** The FOVs of the various cameras used in the study recording an infant in an incubator (see table 3). The top row shows a false color coded thermal image. Similarly, the second row shows images of three PPGI cameras. In the case of the color camera, the raw image was demosaiced and made perceivable as a color image. The incubator clappers have been replaced by a thin plastic foilto facilitate thermal imaging while maintaining a microclimate. Light reflections are visible in $\mathrm{CAM_{850}}$ on the plastic foil.
Download figure:
Standard image High-resolution image

**Figure 4.** The FOVs of the various cameras used in the study recording an infant in an incubator (see table 3). The top row shows a false color coded thermal image. Similarly, the second row shows images of three PPGI cameras. In the case of the color camera, the raw image was demosaiced and made perceivable as a color image. The incubator clappers have been replaced by a thin plastic foilto facilitate thermal imaging while maintaining a microclimate. Light reflections are visible in $\mathrm{CAM_{850}}$ on the plastic foil.
Download figure:
Standard image High-resolution image

Table 3. Cameras used in the experiment.

Fps Resolution ADC Lens Modality Camera Type (Hz) (px × px) Mode (bit) Trigger (mm) Optical filters Model

PPGI	$\mathrm{CAM_{850}}$	mono	25	1920 × 1200	7	12	o	12.5	NIR narrow bandpass; 850 nm	Grasshopper 3 GS3-U3-23S6M-C
	$\mathrm{CAM_{940}}$	mono	25	1920 × 1200	7	12	o	12.5	NIR narrow bandpass; 940 nm	Grasshopper 3 GS3-U3-23S6M-C
	$\mathrm{CAM_{RGB}}$	color	25	1920 × 1200	7	12	o	12.5	(internal) NIR cut-off ca. 700 nm	Grasshopper 3 GS3-U3-23S6C-C
IRT	$\mathrm{CAM_{IRT}}$	LWIR	25	640 × 480	NA	16	o	10	-	Gobi-640-GigE
	$\mathrm{CAM_{IRT,HD}}$	LWIR	30	1024 × 768	NA	16	-	30	-	VarioCAM HD head 820S
none	$\mathrm{CAM_{ref}}$	mono	10	1920 × 1200	7	12	-	12.5	-	Grasshopper 3 GS3-U3-23S6M-C

We used up to two separate cameras for IRT: a Gobi-640-GigE (Xenics nv, Belgium) was used for all measurements and, in the case of the radiant warmer measurements, a VarioCAM HD head 820S (InfraTec GmbH, Germany) LWIR camera was added. These cameras are identified as $\mathrm{CAM_{IRT}}$ and $\mathrm{CAM_{IRT,HD}}$ , respectively.

Moreover, another camera, $\mathrm{CAM_{ref}}$ , was used to record the reference monitor.

2.2.2. Lighting

We used matched light sources for the two NIR cameras ( $\mathrm{CAM_{850}}$ , $\mathrm{CAM_{940}}$ ) consisting of an S75-850-W (' $\mathrm{LED_{850}}$ '), and S75-940-W (' $\mathrm{LED_{940}}$ ') lamp (Smart Vision Lights, USA) with the intention of testing PPGI for night measurements and reducing possible stray light. These machine lighting lamps each consist of 2 × 3 light-emitting diodes (LEDs). Moreover, the light coming from these lamps was diffused by using Lee 416 filters (LEE Filters, UK) and stray light was reduced by covering all but the front of the lamps. The lamps were positioned above the respective camera to be as close to the matching cameras and provide frontal illumination.

No dedicated light source was used for $\mathrm{CAM_{RGB}}$ . Thus, only the ambient light was used which was composed of the following:

ambient light originating from ceiling-mounted lamps;
sunlight entering from behind the camera setup; and
the 'glow' of the heating elements of the radiant warmer.

2.2.3. Synchronization

All cameras, except $\mathrm{CAM_{IRT,HD}}$ and $\mathrm{CAM_{ref}}$ , were triggered, and thus synchronized, by a self-built timing controller based on a MSP430F5329 microcontroller (Texas Instruments, USA). The timing controller can synchronize devices and power up to six of them. The timing controller was connected to an external 24 V medical power supply. In addition to the cameras, the LED lamps were powered via the controller. We had decided to forego triggering the LEDs to avoid problems with temporal light modulations.

2.2.4. Measurement computer and data management

Recordings were acquired and stored on a notebook, a Thinkpad P50 (Lenovo, China) which was equipped with two 1 TB SSDs (Samsung 960 Evo), a quad-core processor (Intel i7-6820HQ) and 32 GB RAM.

Two PPGI cameras and one IRT camera were connected directly to the notebook. Because the number of ports on the notebook was limited, the other cameras were connected via a separate USB 3.1 docking station (Delock 87298). For each measurement, about 240 GB were recorded totaling about 5.80 TB of raw video data.

2.2.5. Reference

A Radical-7 (Masimo, USA) pulse oximeter (the standard monitoring device in the neonatal unit of the hospital) was used for reference. The device displays the following measurements: the PR, perfusion index, peripheral oxygen saturation and the PPG signal. We recorded the monitor with $\mathrm{CAM_{ref}}$ at 10 Hz and extracted the numerical values via optical character recognition and visual inspection to obtain a synchronous reading. Thus, we used these monitor-extracted PR values ( $\mathrm{PR}_\mathrm{ref}$ ) as the basis for the evaluation of the camera-extracted PR estimates $\mathrm{PR}_\mathrm{est}$ .

2.3. Measurement

Temperature and humidity were read from an analog thermometer and hygrometer to test for similar environmental conditions (figure 2). Furthermore, the lenses were focused manually and individually, and the cameras were given at least 20 min to warm up.

The actual measurement consisted of recording the neonates, preferably in a phase of inactivity so that movement would be minimal. {However, inactivity had been rare. In addition, some recordings had to be discarded due to power outages and then repeated after several minutes. While the measurement setup was not affected by power outages, time was given to the subject to calm down again. Videos were recorded via our C++ camera framework (Paul et al 2018) and via the thermography software IRBIS 3 (InfraTec GmbH, Germany).

2.4. Processing chain

A sketch of the offline processing chain is given in figure 5. The processing pipeline was implemented in Matlab 2017b (The MathWorks, Inc. USA) and executed on a dedicated server.

Regarding the offline processing, we first tracked all ROIs and replaced 'not a number' values (unsuccessful tracking) with previous values or by using linear interpolation (only in the first segment if the first values were missing). The remaining processing chain was implemented to work with signal segments.

2.4.1. Image preprocessing

Videos $V_{\lambda}$ were stored as three-dimensional cubes $V_{\lambda}(x,y,t)$ composed of raw pixel data ( $2\ {\rm B\ px}^{-1} = 16\ {\rm bit\ px}^{-1}$ ). This preprocessing stage consisted of a step where each frame was rotated if a camera was mounted upside down, a stage where data was converted to 8 bit (for the tracking but not signal extraction) and a stage which separated different color changes if necessary (e.g. in the case of $\mathrm{CAM_{RGB}}$ ).

The separation into color channels for $\mathrm{CAM_{RGB}}$ only consisted of extracting the pixels corresponding to the Bayer pattern. Indeed, this is in contrast to other publications that use undefined demosaicing algorithms (e.g. videos were not recorded bayered). Consequently, we used raw pixel values for the videos $V_{\mathrm{R}}$ , $V_{\mathrm{G}}$ , and $V_{\mathrm{B}}$ , where the resolution is only a quarter of $V_{\mathrm{RGB}}$ because conventional RGB sensors use twice the number of green pixels compared to red and blue. Thus, each pixel in $V_{\mathrm{G}}$ was calculated as the mean of the green pixels of these two virtual green channels.

**Figure 5.** Processing chain
Download figure:
Standard image High-resolution image

2.4.2. ROI selection

We selected several ROIs manually which are presented in table 4. To illustrate those ROIs, a sketch is given in figure 6. The relevance of the ROIs and the process of annotation are described in the following.

Table 4. Selected ROIs anticipated vital signs. Camera-based sensing is suitable to asses pulse rate (PR), breathing rate (BR) and movement activity (MA).

Anticipated vital sign Region PR BR MA Comment

Bounding box (bbox)	o	o	o	Contains the whole body; simplest region
Head	o		o	Ballistographic signals shake the head; deliberate head turns
Face	o			Not covered by clothing
Forehead	o			Flat and homogenous region
Nose		o		Typically air jets visible (IRT)
Torso	o	o		Ballistography; skin visible if not covered by clothing
Torso+background	o	o		Ballistography (cardio + breathing)
Arm (left, right)			o	Prone to movement, often covered
Hand (left, right)			o	Prone to movement
Leg (left, right)			o	Prone to movement, often covered
Foot (left, right)			o	Prone to movement, sometimes covered

2.4.2.1. ROI relevance

The largest ROI which was evaluated was the bounding box (bbox). Because it covers the whole body, not only all vital parameters should be present but, in the same manner, movement can interfere with the extraction of PR and BR.

We also selected the extremities (here, we were only interested whether PR could be extracted) to provide more detailed information about movement compared to bbox. The PR signal was considered to be deteriorated by movement activity (MA). We assumed these regions to be less well suited to extract PR estimates: this is due to the simple reason that neonates are generally covered with clothing to maintain body temperature. In addition to the torso, arms and legs are often covered as well, thus, preventing access to the skin. However, fortunately, hands and feet are sometimes accessible.

Furthermore, the head was selected because it is possible to assess the heart activity in adults by analyzing head movements caused by ballistic blood movement and, more importantly, because of the accessibility. In the scenario given, often only the face is not covered, and thus, provides a view of the skin. The forehead is a particularly good candidate region for signal retrieval because it is relatively flat (compared to other body parts).

The nose is a facial landmark and can be used for tracking.

We selected two regions from the torso: one which covered the body (torso) and a second which contained the boundary between the torso and the background (torso+bg). While the first region should contain signal components due to light-tissue interaction, the second region was chosen because it could contain ballistic movement (due to heart and breathing activity).

2.4.2.2. ROI annotation

As was stated previously, we selected the ROIs manually and only in the first frame. Thus, if a body part was not visible in the first frame, no region was selected. Furthermore, if a region was partly visible, either the assumed rectangular bound was used as the region or only the visible part of the region was selected as the ROI. The cases are presented here by examples:

if only one finger of the hand was not visible, the hand region would be extended to the assumed border;
if the subject was facing away from the camera, the face region was selected to contain only the visible part.

In addition, all ROIs were selected to be parallel to the image coordinate system, and as such, no rotations were considered. Moreover, rectangular ROIs were chosen for their simplicity and computational efficiency.

2.4.3. ROI tracking

Afterwards, we applied the kernelized correlation Filter (KCF) tracker on all regions selected (Henriques et al 2015). However, the KCF tracker was implemented without a failure recovery mechanism. Consequently, out-of-view targets became a problem.

The size of the template has a great impact on the processing time. This was considered by scaling the template and image in case a certain area (number of pixels) was surpassed.

2.4.4. Time series extraction

Next, the raw signal, a time series, was generated by averaging all pixels of a rectangular region (ROI) spatially. In other words, each time series consisted of the concatenation of spatial averages extracted from a tracked ROI.

2.4.5. Temporal filtering

Subsequently, the signal was divided into overlapping signal segments of length $t_{\mathrm{seg}} = 10\ {\rm s}$ , which, in our case, overlapped for all values except the oldest one (hop-size v = 1). The sampling frequency f_s of the times series was identical to the recording frequency of the videos: f_s = 25 Hz. We designed a finite impulse response bandpass filter of order 3 × f_s and cutoff frequencies $f_{c1} = 1.30\; {\rm Hz}\;\hat{ = }\;78\;{\rm bpm}$ and $f_{c2} = 5\; {\rm Hz}\;\hat{ = }\;300\; {\rm bpm}$ for temporal filtering. This filter was applied after subtracting the mean from a segment. The filter was used for zero-phase filtering using Matlab's function filtfilt, effectively doubling the filter order.

2.4.6. Frequency processing

In the next stage, PR was retrieved from a time-frequency representation of several signal segments using the short-time Fourier transform (STFT).

The STFT consists of a series of Fourier transforms (FFTs) of temporal segments that can overlap. Before applying the FFT, a Hamming window was applied to reduce spectral leakage. We specifically chose the segment size to be $t_{\mathrm{seg}} = 10\ {\rm s}$ ( $n_{\mathrm{seg}} = 250\ {\rm frames}$ ), the overlap to be $n_{\mathrm{ov}} = 249\ {\rm frames}$ and the length of the FFT to be $n_{\mathrm{FFT}} = 1024$ . Furthermore, because $n_{\mathrm{FFT}}>n_{\mathrm{seg}} = 250\ {\rm frames}$ , the remaining values were zero-padded. In addition, the segment size was chosen to cover about 20 heartbeats ( $PR = 120\; {\rm bpm}\;\hat{ = }\;2\; {\rm Hz}$ ) and about six breaths ( $BR = 40\; {\rm bpm}\;\hat{ = }\;0.67\; {\rm Hz}$ ).

2.4.7. Pulse rate extraction

2.4.7.1. PR estimation

We estimated the peak frequency $f_\mathrm{peak}$ in each FFT segment by extracting the frequency with the highest signal energy within the heart frequency band ( $90\ {\rm to}\ 220\ {\rm bpm}\;\hat{ = }\;1.50\ {\rm to}\ 3.67\ {\rm Hz}$ ). Thus, a signal of $\mathrm{F}_\mathrm{peak}$ values at sampling frequency was generated. The PR estimates at 1 Hz ( $\mathrm{PR}_\mathrm{est}$ ) were calculated by downsampling the mean estimate of $p = \mathrm{round}(f_s)$ $f_\mathrm{peak}$ values to compare with the pulse oximeter reference ( $\mathrm{PR}_\mathrm{ref}$ ).

In order to estimate the peak frequency, it is sufficient to find the maximum of the absolute values of the positive FFT coefficients; the other computations are used for scaling and visualization using the spectrograms.

2.4.7.2. Artifact detection

Signal artifacts were clearly present in the raw time series. We used a signal quality index (SQI) to assess the ratio of energy $E_{f_{\mathrm{peak}}}$ at $f_\mathrm{peak}$ to the energy $E_{{\mathrm{tot}}}$ of the whole frequency spectrum, similar to that presented in Blanik et al (2016), for artifact detection. Here, we use

$\begin{equation} \mathrm{SQI} = \frac{E_{f_{\mathrm{peak}}}}{E_{{\mathrm{tot}}}}, f_{\mathrm{peak}}\in [90\ {\rm to}\ 220\ {\rm bpm}]. \end{equation} \tag{ 1 }$

In other words, low SQI values mark signal segments whose energy outside of the heart band is relatively high and, thus, may not be used to extract an estimate $f_\mathrm{peak}$ . We used an empirical value, a lower threshold $\tau_{\mathrm{SQI}} = 0.075$ , which was chosen based on one video sequence where there was only a little movement present to identify those signal segments.

The segments identified were then automatically flagged as artifacts: if one such segment was contained in the calculation of $\mathrm{PR}_\mathrm{est}$ , the final estimate was flagged both as an artifact and unreliable.

2.5. Reference and signal alignment

We used manually induced motion artifacts to synchronize camera measurements and reference by firmly pressing or removing the reference PPG probe for a short time. Thus, no valid reference signal was available during these synchronization events $E_\mathrm{sync}$ . Then, for each measurement, one extracted PR estimate signal ( $\mathrm{PR}_\mathrm{est}$ ) was visually compared and manually aligned to the reference ( $\mathrm{PR}_\mathrm{ref}$ ). Because of the initial synchronization of all cameras, the delay between the reference $\mathrm{PR}_\mathrm{ref}$ and all estimates $\mathrm{PR}_\mathrm{est}$ was maintained constant.

2.6. Evaluation

First, in order to evaluate the methods described quantitatively, the time slots $T_\mathrm{ref}$ where a valid reference was available were identified. Subsequently, for each ROI, the intervals $T_\mathrm{cam,ref}$ where reference $\mathrm{PR}_\mathrm{ref}$ and camera estimate $\mathrm{PR}_\mathrm{est}$ were valid, were computed. For a single ROI, this is illustrated in figure 7. To clarify, $T_\mathrm{cam,ref}$ varied with the ROI, because the ROIs were affected by different artifacts at different points in time, while $T_\mathrm{ref}$ is a fixed time for all ROIs of one camera within a measurement.

**Figure 7.** Sketch of a temporal measurement sequence showing the reference $\mathrm{PR}_\mathrm{ref}$ (**ref**) and a camera-based estimate of $\mathrm{PR}_\mathrm{est}$ (**cam**). Certain events E reduce the time for which the signals can be compared: in fact, a reference is only available for a time $T_\mathrm{ref}< T_\mathrm{rec}$ . This is because there is no reference available ( $E_\mathrm{ref,na}$ ) at the beginning and the end of the recording. Additionally, manual synchronization events $E_\mathrm{sync}$ are excluded because the reference is invalid. Furthermore, the time $T_\mathrm{cam, ref}$ analyzable is even shorter if an artifact is detected in the camera signal ( $E_\mathrm{cam,a}$ ). Here, the time where $\mathrm{PR}_\mathrm{est}$ deviates by only a few beats per minute (e.g. 3 bpm) $T_\mathrm{3\ bpm}$ is used for the evaluation. In particular, artifacts $E_\mathrm{ref,a}$ in the reference can not be detected.
Download figure:
Standard image High-resolution image

**Figure 7.** Sketch of a temporal measurement sequence showing the reference $\mathrm{PR}_\mathrm{ref}$ (**ref**) and a camera-based estimate of $\mathrm{PR}_\mathrm{est}$ (**cam**). Certain events E reduce the time for which the signals can be compared: in fact, a reference is only available for a time $T_\mathrm{ref}< T_\mathrm{rec}$ . This is because there is no reference available ( $E_\mathrm{ref,na}$ ) at the beginning and the end of the recording. Additionally, manual synchronization events $E_\mathrm{sync}$ are excluded because the reference is invalid. Furthermore, the time $T_\mathrm{cam, ref}$ analyzable is even shorter if an artifact is detected in the camera signal ( $E_\mathrm{cam,a}$ ). Here, the time where $\mathrm{PR}_\mathrm{est}$ deviates by only a few beats per minute (e.g. 3 bpm) $T_\mathrm{3\ bpm}$ is used for the evaluation. In particular, artifacts $E_\mathrm{ref,a}$ in the reference can not be detected.
Download figure:
Standard image High-resolution image

Nevertheless, we considered $T_\mathrm{cam,ref}$ as the basis for the evaluation, but only for the ROIs where $T_\mathrm{cam,ref}$ was 30 s or longer. This is chosen to reflect how well the method performs in artifact-free regions, hence, assesses the performance of camera-based sensing and provides an upper limit of performance in this clinical setting.

We calculated the time fractions where the PR estimates of a ROI deviate by only a few beats per minute (3 bpm) from the reference within $T_\mathrm{cam,ref}$ (2) to identify the best results per measurement and camera. Furthermore, we calculated the time fractions where camera and reference were valid (3). In other words, we calculated the coverages:

$\begin{equation} \mathrm{COV}_{\mathrm{3\ bpm/(cam,ref)},r} = \frac{T_{\mathrm{b\ bpm},r}}{T_{\mathrm{cam,ref},r}} \end{equation} \tag{ 2 }$

and

$\begin{equation} \mathrm{COV}_{\mathrm{cam,ref},r} = \frac{T_{\mathrm{cam,ref},r}}{T_\mathrm{ref}}, \end{equation} \tag{ 3 }$

where r indexes the ROIs. We then identified the best result (maximum) per camera and measurement. As an example for a single camera and two different measurements, a graphical representation of the $\mathrm{COV}_{\mathrm{3\ bpm/(cam,ref)},r}$ is given in figure 8, showing the infants of figure 1 (radiant warmer) and figure 4 (incubator) for $\mathrm{CAM_{RGB,G}}$ . By contrast, the results are visualized as boxplots below.

**Figure 8.** Examples of $\mathrm{COV}_{\mathrm{3\ bpm/(cam,ref)},r}$ in percent for a measurement of a subject below the radiant warmer (a), and for another subject nursed in an incubator (b).
Download figure:
Standard image High-resolution image

**Figure 8.** Examples of $\mathrm{COV}_{\mathrm{3\ bpm/(cam,ref)},r}$ in percent for a measurement of a subject below the radiant warmer (a), and for another subject nursed in an incubator (b).
Download figure:
Standard image High-resolution image

To put the results into perspective, instead of only showing the best result per measurement for the coverages defined by (2) and (3), we also show the results of bbox. The rationale behind using bbox as a benchmarking region is that it is one of the easiest approaches one can think of, provided that signals of adequate quality could be extracted, deployment of camera-based monitoring would be simplified and development costs could be reduced.

In addition to the coverages, we also list the number of occurrences of each best performing ROI: we simply counted how often each ROI of each camera in each measurement performed best. The combined results of all cameras are given as bar plots.

Furthermore, we identified the measurement with the highest value of $\mathrm{COV}_{\mathrm{cam,ref}}$ using (3) and used this as an example.

3. Results

In this section, we present the results from evaluating the data sets. The results of the temporal coverage with deviations smaller or equal to 3 bpm from the reference are provided ( $\mathrm{COV}_{\mathrm{3\ bpm/(cam,ref)}}$ ). That is to say, we consider only the best performing ROI and bbox per camera and measurement. Here, results with $T_\mathrm{cam,ref}<30\ {\rm s}$ are excluded from the analysis. The coverage $\mathrm{COV}_\mathrm{cam,ref}$ is also given to put the results into perspective. These results are presented as boxplots showing the median (horizontal line), 25 and 75 percentiles (p₂₅ and p₇₅; box dimensions). Whiskers are connected to the last data points not considered outliers. Outliers are defined as values greater than $p_{75}+1.5 (p_{75}-p_{25})$ and less than $p_{25}+1.5 (p_{75}-p_{25})$ . Thus, the outliers also include the best and the worst performing measurements. The corresponding ROIs are given as bar plots.

Additionally, the results of the ROI with the best $\mathrm{COV}_\mathrm{cam,ref}$ and the corresponding bbox are provided. Specifically, spectrograms, plots of the peak frequency $F_\mathrm{peak}$ , SQI and $\mathrm{PR}_\mathrm{est}$ vs the $\mathrm{PR}_\mathrm{ref}$ are shown. Furthermore, Bland–Altman (BA) plots and correlation plots (created via the functions in Klein (2017)) illustrate the performance achievable.

3.1. Temporal coverage $\mathbf{COV}_{\mathbf{3\ bpm/(cam,ref)}}$

The temporal coverage $\mathrm{COV}_{\mathrm{3\ bpm/(cam,ref)}}$ ( $T_\mathrm{cam,ref}\geq30\ {\rm s}$ ) of bbox and the best performing ROI are given in figures 9 and 10, respectively. Applying the temporal constraint reduces the number of valid measurements (at least one ROI for one camera) by one when considering all ROIs. In case of the two NIR cameras, nine measurements per camera had no ROI which satisfies the constraint. The R and B channel of the RGB camera were each affected in five measurements while the G channel only had two measurements with no ROI. When considering only bbox, eight measurements had no bbox which satisfies the constraint. Of the remaining measurements, seven contained at least one RGB channel with valid signals. Of all measurements, 15 had at least one valid NIR channel. The number of occurrences of best performing ROIs for all measurements and all cameras combined is given in figure 11.

3.1.1. Bbox

All median values for bbox are below 60%, while most are below 10%, close to 0% coverage (figure 9). Considering the medians, the blue channel of the RGB camera is the best performing. By contrast, $\mathrm{CAM}_\mathrm{940}$ performed worst. The 75th percentiles for all channels of the RGB camera are above 80%. Additionally, the NIR cameras' 75th percentiles are below 10%, while outliers reach close to 70% ( $\mathrm{CAM}_\mathrm{850}$ ) and close to 40% ( $\mathrm{CAM}_\mathrm{940}$ ).

3.1.2. Best performing ROI

Selecting the best performing ROI improves the coverage compared to bbox (figure 10): medians go as high as 88.68% for $\mathrm{CAM_G}$ , and similar values are reached for the other color channels. Moreover, the other medians stay above the best result of bbox. The 75th and 25th percentiles shift to higher values compared to bbox. Regarding the color camera, the 25th percentiles are above 50% coverage. In addition, the 75th percentile of $\mathrm{CAM}_\mathrm{850}$ improves to above 70%. Similarly, the 75th percentile of $\mathrm{CAM}_\mathrm{940}$ is higher than 10%, while occasional outliers reach more than 70%.

**Figure 9.** Relative temporal coverage $\mathrm{COV}_\mathrm{3\ bpm/(cam,ref)}$ of all cameras with $\mathrm{PR}_{\mathrm{est}}$ deviating less than **3 bpm** from the reference when selecting the **bbox** per measurement considering only the results of ROIs with at least 30 s of artifact-free signal segments.
Download figure:
Standard image High-resolution image

**Figure 9.** Relative temporal coverage $\mathrm{COV}_\mathrm{3\ bpm/(cam,ref)}$ of all cameras with $\mathrm{PR}_{\mathrm{est}}$ deviating less than **3 bpm** from the reference when selecting the **bbox** per measurement considering only the results of ROIs with at least 30 s of artifact-free signal segments.
Download figure:
Standard image High-resolution image

**Figure 10.** Relative temporal coverage $\mathrm{COV}_\mathrm{3\ {bpm}/(cam,ref)}$ of all cameras with $\mathrm{PR}_{\mathrm{est}}$ deviating less than **3 bpm** from the reference when selecting the **best region** per measurement considering only the results of ROIs with at least 30 s of artifact-free signal segments.
Download figure:
Standard image High-resolution image

3.1.3. Number of occurrences of best performing ROIs

The number of occurrences of the best performing ROIs in the given scenario considering all measurements are shown in figure 11. These ROIs are very often the right foot, nose, torso or face. The right arm and hand satisfied the condition equally often as bbox and the left hand. The head and the left arm were never the best performing.

3.2. Temporal coverage $\mathbf{COV}_\mathbf{cam,ref}$

We next considered $\mathrm{COV}_\mathrm{cam,ref}$ without a constraint to $T_\mathrm{cam,ref}$ . The corresponding plots are given in figures 12 and 13, respectively.

3.2.1. Bbox

All median values for bbox are below 10% and close to 0% coverage (figure 12). The 75th percentiles for all cameras are below 30% but outliers can reach above 40%, even 70%. In all aspects, the NIR cameras perform better than the different channels of the RGB camera.

**Figure 12.** Temporal coverage $\mathrm{COV}_\mathrm{cam,ref}$ of all cameras when selecting **bbox** per measurement.
Download figure:
Standard image High-resolution image

3.2.2. Best performing ROI

Similar to $\mathrm{COV}_{\mathrm{3\ bpm/(cam,ref)}}$ , selecting the best performing ROI (figure 13) improved the coverage when compared with bbox (figure 12): medians went as high as 24% for $\mathrm{CAM_{RGB,G}}$ and stay above the best result of bbox (~ 7% vs ~ 6% for $\mathrm{CAM_{940}}$ ). The 75th and 25th percentiles shift to higher values. To be more precise, the 75th percentiles range between 20 and 40% coverage. Occasionally, values above 50% and even above 70% were reached.

3.2.3. Number of occurrences of best performing ROIs

The ROIs performing best regarding $\mathrm{COV}_\mathrm{cam,ref}$ are given in figure 14. Long coverages could often be achieved with torso, bbox, face or nose. Other ROIs occurred less often. Left leg and foot occurred seldom, while the left arm not even once.

3.3. Example of best performing ROI

According to the analysis, the results of the recording with the highest temporal coverage $\mathrm{COV}_\mathrm{cam,ref}$ belongs to the green channel of the color camera ( $\mathrm{CAM}_\mathrm{RGB,G}$ ) tracking the right arm (arm, right) of the 12th subject.

As a best-case example, more plots are provided for the said region: namely, the spectrogram of the bandpass filtered signal is given in figure 15 as well as $F_\mathrm{peak}$ and SQI. In addition, $\mathrm{PR}_\mathrm{est}$ is plotted vs $\mathrm{PR}_\mathrm{ref}$ in figure 16. To visualize the agreement, a BA and a correlation plot are given in figure 17. For comparison, the respective figures for bbox are shown in figures 18–20.

**Figure 15.** (a) Spectrogram of the bandpass filtered signal of the overall best performing ROI (**arm, right** in $\mathrm{CAM}_\mathrm{RGB,G}$ ). The measurement is disturbed by artifacts visible due to the characteristic energy-frequency distribution. The black lines correspond to the frequency band where $F_\mathrm{peak}$ is calculated from (90 to 220 bpm). (b) The second row shows $F_\mathrm{peak}$ while the third row shows the corresponding SQI values (c). Low signal segments/artifacts were automatically marked.
Download figure:
Standard image High-resolution image

**Figure 16.** Extracted PR ( $\mathrm{PR}_\mathrm{est}$ ) of a tracked region (**arm, right** in $\mathrm{CAM}_\mathrm{RGB,G}$ ) compared to the patient monitor reference ( $\mathrm{PR}_\mathrm{ref}$ ). No camera-estimated PR values are given for segments with a low SQI. Hence, more good signal segments were detected compared to the signal given in figure 19. The reference signal ( $\mathrm{PR}_\mathrm{ref}$ ) during a synchronization event is treated like an artifact.
Download figure:
Standard image High-resolution image

**Figure 16.** Extracted PR ( $\mathrm{PR}_\mathrm{est}$ ) of a tracked region (**arm, right** in $\mathrm{CAM}_\mathrm{RGB,G}$ ) compared to the patient monitor reference ( $\mathrm{PR}_\mathrm{ref}$ ). No camera-estimated PR values are given for segments with a low SQI. Hence, more good signal segments were detected compared to the signal given in figure 19. The reference signal ( $\mathrm{PR}_\mathrm{ref}$ ) during a synchronization event is treated like an artifact.
Download figure:
Standard image High-resolution image

**Figure 17.** Correlation plot (a) and Bland–Altman (BA) plot (b) for the best performing measurement (**arm, right** in $\mathrm{CAM}_\mathrm{RGB,G}$ ).
Download figure:
Standard image High-resolution image

**Figure 18.** (a) Spectrogram of the bandpass filtered signal of **bbox** corresponding to the overall best performing ROI. The measurement is disturbed by artifacts visible due to the characteristic energy-frequency distribution. The black lines correspond to the frequency band where $F_\mathrm{peak}$ is calculated from (90 to 220 bpm). (b) The second row shows $F_\mathrm{peak}$ while the third row shows the corresponding SQI values (c). Low signal segments/artifacts were automatically marked.
Download figure:
Standard image High-resolution image

**Figure 18.** (a) Spectrogram of the bandpass filtered signal of **bbox** corresponding to the overall best performing ROI. The measurement is disturbed by artifacts visible due to the characteristic energy-frequency distribution. The black lines correspond to the frequency band where $F_\mathrm{peak}$ is calculated from (90 to 220 bpm). (b) The second row shows $F_\mathrm{peak}$ while the third row shows the corresponding SQI values (c). Low signal segments/artifacts were automatically marked.
Download figure:
Standard image High-resolution image

**Figure 19.** Extracted PR ( $\mathrm{PR}_\mathrm{est}$ ) of **bbox** compared to the patient monitor reference ( $\mathrm{PR}_\mathrm{ref}$ ). No camera estimated PR values are given for segments with a low SQI. Hence, less good signal segments were detected compared to the signal given in figure 16.
Download figure:
Standard image High-resolution image

**Figure 20.** Correlation plot (a) and BA plot (b) for **bbox** belonging to the best performing measurement.
Download figure:
Standard image High-resolution image

4. Discussion

4.1. Temporal coverage $\mathbf{COV}_{\mathbf{3\ bpm/(cam,ref)}}$

One can conclude from the boxplots in figures 9 and 10 that the presented frequency estimation algorithm delivers adequate estimates in artifact-free periods which are comparable to the contact-based reference by up to ±3 bpm. This upper limit is satisfied for various valid measurements over relatively long times of the measurements resulting in very high coverages. The results for the best performing ROI are better than for bbox (according to the medians). Furthermore, more measurements fulfilled the time constraint when considering all ROIs compared to only considering bbox (23 vs 16). Nevertheless, there are also measurements where bbox has high coverage. One example is the measurement given in figure 18. The channels of the color camera generally performed best, the NIR cameras could also be used for some measurements.

We identified the number of occurrences of the best performing ROIs (see figure 11). This differs from camera to camera and here, we consider the results of all cameras independent of the wavelengths. We can observe that ROIs which were visible and close to the camera occurred more often (e.g. the right foot). This could be due to the number of pixels available to form a time signal. Furthermore, the face occurred more often than the forehead, which could be due to visibility and/or tracking. If the subject moved, the bigger face ROI is likely to cover more useful pixels compared to the forehead. The nose is an ROI that contains skin pixels and if visible, should be trackable. The torso ROIs also fared well, probably because of ballistocardiography. It is noteworthy that, by showing only the best performing ROIs, the ROIs that fared similarly well are underrepresented.

4.2. Temporal coverage $\mathbf{COV}_\mathbf{cam,ref}$

The data presented in figures 12 and 13 demonstrate that most extracted signals have a low SQI. The fact that the plot in figure 13 reflects the best performing ROIs further supports this statement. Moreover, we assume not only movement and occlusions to be the main causes for this but also the manual ROI annotation. To be more precise, the approach has the following drawbacks:

only rectangular ROIs were considered; and
there was no recovery of the tracker implemented.

Consequently, complex movement patterns could deteriorate the signals very early in the processing chain.

In addition, these results may not be used to generalize which wavelength is the most suitable for measurements in the NICU: following the discussion above and by analyzing figures 12 and 13, one could easily jump to the conclusion that the RGB camera and especially the green channel are more suitable than NIR wavelengths. However, concluding this would, for example, disregard the number of pixels used to extract the time series signal (see section 2.4). Furthermore, the amount of light available to each camera and the FOV was not considered. When looking at figure 14, for example, we can see, as expected, that more exposed ROIs occurred more often, while ROIs which are not visible to the camera or were not marked (e.g. left extremities) occurred less often. Thus, the data is insufficient to give a clear statement about the unsuitability of certain wavelengths.

More importantly, the results show that the coverage improves when selecting the best performing region compared to selecting bbox. Hence, identifying and locating high SQI regions is worthwhile.

4.3. Example of best performing ROI

To give an example of the performance achievable, the signal extraction of the overall best performing ROI and the corresponding bbox signal extraction are shown in figures 15–20. The spectrograms indicate that it is possible to identify the 1st and even 2nd harmonic of $\mathrm{F}_\mathrm{peak}$ . However, in the presence of artifacts, the energy is spread along the whole frequency range and, consequently, the lines corresponding to $F_\mathrm{peak}$ (and $\mathrm{PR}_\mathrm{est}$ ) are sometimes not visible. This region is more often affected than smaller ones for the simple reason that any movement can affect bbox.

Accordingly, the SQI is very low in these segments and those are, thus, flagged as artifacts. As a result, bbox has only 244 good segments while arm, right has 476; see the correlation plots (figures 20(a) and 17(a)) and BA plots (figures 20(b) and 17(b)). According to the BA plots, bbox could yield performance similar to the best performing ROI regarding the coefficient of reproducibility (RPC) (1.96 times the standard deviation of the differences between reference and estimate), 2.2 vs 2.10 bpm.

4.4. Additional observations (qualitative)

4.4.1. Blood vessels

Light interacts differently with tissue depending, amongst others, on its wavelength. Shorter wavelengths (e.g. blue) are more strongly absorbed in the upper skin layers compared to longer wavelengths (e.g. NIR). The usage of NIR allows information about the vascular network of the neonate to be retrieved, as can be seen in figure 21. Here, we did not exploit this property, but the visible vascular network could be used as landmarks in tracking algorithms. Another potential application is vascular diagnostics.

**Figure 21.** Vessels becoming more visible with longer wavelengths. The image in 940 nm is grainier compared to 850 nm because of the amount of light used. Strong light absorption for shorter wavelengths can be observed as lower intensity values in the skin pixels of the green and blue channels.
Download figure:
Standard image High-resolution image

4.4.2. Sampling rate and exposure time

The recording frame rate was limited to 25 fps due to bandwidth constraints and light which could be used. We could observe by visual inspection that typical movement, for example, of the extremities, resulted in blurry images. We consider this blurring problematic when trying to extract the small signals associated with the PPG signal because the content of the ROI changes a lot relatively and locating the correct ROI position is more difficult. The blurring is caused, for example, by exposure times being too long (19.50 ms) and not matched with the movement speed. The frame rate used in combination with the long exposure time is certainly sufficient to capture the PR at rest. When faster movement is involved, the tuning of these parameters is necessary. On the positive side, a higher frame rate would allow one to shrink the search region for tracking algorithms and lower exposure times would reduce the blurring. On the negative side, both actions reduce the amount of light available for imaging.

4.4.3. Noise sources

We observed strong movement of the neonates themselves and that induced by caregiving, which resulted in light intensity changes. Furthermore, a ceiling fan positioned directly beneath a ceiling light also introduced light changes. Fortunately, the frequency range affected was above the PR band and below the Nyquist frequency (here, $f_\mathrm{nyquist} = 12.50\; {\rm Hz}$ ) due to the fan's rotating speed. In addition, the room was equipped with air conditioning. Using the IRT cameras, we could see intensity/temperature changes which we, at the moment, attribute to air movement. The effect of these temperature changes on the patient's comfort and signal changes in the PPGI signals need more attention as patient discomfort could cause more movement.

The medical devices used in the study were also identified as noise sources: radiant warmer, incubator and PPG reference. The radiant warmer is built to direct thermal radiation at the baby's body. However, the reflective material of the warmer does not discern between the sources of radiation and just reflects everything, such as light and the thermal radiation of the clinical staff. Specifically, we suspect that light from the glowing heating filaments is modulated at a low frequency and thus disturbed the PPGI cameras.

Similar problems could be observed regarding the incubator: in order to be able to film the inside of the incubator not only with the PPGI cameras but also with an IRT camera, a thermal window was realized by applying a thin but visibly transparent plastic foil. While it helped to sustain the microclimate, it caused unwanted reflections in combination with light from behind the measurement setup and from the measurement lighting. We found that optical measurements from outside the incubator will require additional engineering prior to commercialization, for example, modifications to the incubator. In the meantime, cameras should be positioned as close to the incubator casing as possible.

The reference PPG probe uses pulsed light instead of constant light. Thus, camera PR estimates of this measurement site were influenced by this light source and are, thus, not reliable. Covering the body parts affected should not be the first option. Instead, if the probe used constant light (or was synchronized with the setup), it would have helped the cameras.

All in all, the influence of medical equipment on remote signal retrieval cannot be neglected. Thus, we suggest moving cables and contact-based sensors to locations out of the FOV of cameras when evaluating contact-free methods. By contrast, when deploying the techniques in the field, we recommend to use sensors which can be used as optical markers for the cameras. The same applies to clothing and blankets: a complex pattern should be preferred to uniform colors. If possible, clothing should be chosen in such a way that it moves even under slight body movement (e.g. breathing movement).

5. Conclusions and outlook

In this work, we tested the feasibility of extracting PR via the camera-based sensing modality PPGI in a realistic scenario in an Indian NICU. Consequently, we recorded video sequences of neonates nursed below a radiant warmer and in an incubator viewed from the side. The measurement setup which was used to record both PPGI and a second sensing modality (IRT) was also introduced.

As a result, we successfully demonstrated the feasibility of PR estimation via PPGI with cameras using visible and NIR light.

We extracted several ROIs for signal retrieval and implemented a straightforward energy-based SQI to discard signal artifacts, which we associate with movement or low signal quality in the frequency band of interest. Firstly, manually annotated regions were tracked by the KCF tracker because it is known to be computationally cheaper than other methods. However, this algorithm was challenged with occlusions. Moreover, tracking all the videos and generating time series for different ROIs were still very time-consuming as various ROIs were sometimes close to image dimensions.

Under these circumstances, PR estimates extracted with the STFT algorithm showed good agreement when compared with the PPG reference: in some cases, deviations of ±3 bpm could be reached. In addition, it could be demonstrated that it is advantageous to extract signals not only from the whole body but by exploiting smaller and more localized ROIs.

Furthermore, average temporal coverage based on the single best performing ROIs was below 40% (considering the 75th percentiles) and occasionally reached more than 80%. In any case, these values can be attributed, among others, to the low performance of the KCF tracker, because it was used without failure recovery, which could be implemented in the future. We also expect to reach higher values, when fusing the results from multiple ROIs and/or cameras.

Furthermore, we predict better results for wavelengths not performing well in the dataset presented (850 nm and 940 nm) when optimizing recording parameters such as light sources, gain and camera sensitivity. In any event, we suggest using lower exposure times, providing adequate (diffuse and constant) lighting, and shielding against other light sources. The lower exposure times are needed to reduce motion blur and can only be achieved if enough light hits the sensor. As was observed, other light sources, namely medical devices such as PPG probes or radiant warmers, could be identified as potential noise sources.

It should be noted that we only recorded from one side. Consequently, certain ROIs were less often visible (e.g. extremities of the side facing away or the turned face). To account for this, future setups might use distributed cameras covering three views (left, right, and top). In addition to the face and torso, available skin pixels should be exploited, especially those closer to the camera (extremities) if available.

Given these points, in the future, we will concentrate more on methods for retrieving the raw signal from video sequences: we still consider the stable identification and tracking of ROIs as the main challenges. In order to address these problems, we plan to use more advanced image segmentation methods.

In conclusion, we are optimistic that camera-based sensing modalities can replace some of the contact-based sensors and reduce the wiring in the NICU in the near future.

Acknowledgments

The authors would like to thank Muhammad Faiz Md Shakhih for annotating the video sequences. We would like to thank Prof. Dr Vladimir Blazek and Dr Marian Walter for discussing the results. Furthermore, we would like to thank the clinical staff at Saveetha Medical College Hospital for their cooperation and help in conducting the study. Finally, we would like to express our special thanks to the children and parents for their participation.

Declarations

Ethics approval and consent to participate

The conducted study is a purely observational one. We obtained ethical approval from the Institutional Ethics Committee of Saveetha University (SMC/IEC/2018/03/067) and obtained informed consent from the parents of the children involved in the study.

Consent for publication

Not applicable.

Availability of data and material

The datasets generated and/or analyzed during the current study are not publicly available due to being clinical data. No video data can be made publicly available.

Competing interests

The authors declare that they have no competing interests.

Funding

The research project 'Noncontact Assessment of Vital Parameters of Neonates in Intensive care' (NAVPANI) was supported by Germany's Federal Ministry for Education and Research (BMBF) funding code 01DQ17008 and the Indian Council of Medical Research (ICMR). C. Hoog Antink gratefully acknowledges financial support provided by the German Research Foundation [Deutsche Forschungsgemeinschaft (DFG), LE 817/26-1].