5.1. Evaluation Methodology
Pose-event rate: We measure the output pose-event rate in Hz as $N_p / T$, where $N_p$ and $T$ are the total number of predicted pose events and the total duration, respectively.
Error metric: Alignment quality is evaluated via a scale-invariant landmark localization error, computed as the average Euclidean distance divided by a scale estimate. To account for large pose variations, scale estimates are based on bounding boxes of landmarks, either calculating the box diagonal [45] or the geometric mean of width and height [5], since the commonly used inter-ocular distance underestimates the scale under yaw. For more accurate scale estimates under large poses, we calculate the root mean square (RMS) of the three landmark deviations from their center. Thus, the pose alignment error is

$$e = \frac{1}{3s} \sum_{i=1}^{3} \left\| \mathbf{p}_i - \hat{\mathbf{p}}_i \right\|, \qquad s = \sqrt{\frac{1}{3} \sum_{i=1}^{3} \left\| \mathbf{p}_i - \bar{\mathbf{p}} \right\|^2}, \tag{10}$$

where $\bar{\mathbf{p}}$ is the center of the ground-truth landmarks, and $\mathbf{p}_i$ and $\hat{\mathbf{p}}_i$ are the ground-truth and predicted coordinates of the $i$-th landmark, respectively.
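For concreteness, the metric can be sketched in a few lines of NumPy; the function below is an illustrative implementation of Equation (10), assuming the three landmarks are given as rows of (x, y) coordinates.

```python
import numpy as np

def pose_alignment_error(gt, pred):
    """Scale-invariant alignment error of Equation (10).

    gt, pred: (3, 2) arrays of ground-truth and predicted landmark
    coordinates (the eyes-mouth triangle).
    """
    center = gt.mean(axis=0)  # center of the ground-truth landmarks
    # size measure s: RMS of the three landmark deviations from the center
    s = np.sqrt(np.mean(np.sum((gt - center) ** 2, axis=1)))
    # average Euclidean distance between landmarks, normalized by s
    return np.mean(np.linalg.norm(gt - pred, axis=1)) / s
```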
Accuracy and precision: We evaluate accuracy and precision using the error metric in Equation (10). Accuracy is measured by a failure percentage. We assume that, within the maximum error allowed for a successful alignment, human annotations of the same pose should agree with 0.99 probability. Therefore, the failure threshold is set to the 99th percentile of the annotator disagreement error, which is 0.308. Precision is then calculated as the average error of the predictions that are deemed successful.
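A minimal sketch of how these two statistics can be derived from a set of per-sample errors, using the 0.308 threshold from the annotator analysis above:

```python
import numpy as np

def accuracy_and_precision(errors, threshold=0.308):
    """Failure percentage (accuracy) and average error of the
    successful alignments (precision) over a set of test errors."""
    errors = np.asarray(errors)
    failed = errors > threshold          # 99th percentile of annotator disagreement
    failure_pct = 100.0 * failed.mean()
    precision = errors[~failed].mean() if (~failed).any() else float("nan")
    return failure_pct, precision
```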
Regular temporal re-sampling: Due to sparse prediction, we perform regular re-sampling via bounded nearest-neighbor (BNN) interpolation, since each event has a finite time-support. This is realized by assigning the index of the predicted pose-change event at time-point $t$ according to

$$i^*(t) = \underset{i}{\arg\min}\; d_i(t), \qquad d_i(t) = \frac{|t - t_i|}{w},$$

where $t_i$ is the time-stamp of the $i$-th pose-event and $w$ is its time-support. Thus, if the normalized distance is $d_{i^*}(t) > 1$, then $i^*(t) = \varnothing$, meaning time-point $t$ is not supported by any pose-event, i.e., alignment is not activated. For those time-points, we take the given initial pose as the prediction so as not to introduce a misleading bias into the error statistics. The re-sampling step size is 1 ms. For regular re-sampling of the ground-truth, which is available only at key-frames, linear interpolation is applied.
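The following sketch illustrates BNN re-sampling on a 1 ms grid; the `support` parameter (half-width of each event's time-support) and the array layout are assumptions for illustration, not the paper's exact interface.

```python
import numpy as np

def bnn_resample(event_times, poses, init_pose, duration, support, step=1e-3):
    """Regular re-sampling of sparse pose events via bounded
    nearest-neighbor interpolation on a 1 ms grid.

    event_times: (N,) sorted pose-event time-stamps in seconds (N >= 2)
    poses:       (N, D) predicted pose vectors
    init_pose:   (D,) fallback for time-points supported by no event
    support:     assumed half-width of each event's time-support
    """
    grid = np.arange(0.0, duration, step)
    out = np.tile(np.asarray(init_pose, dtype=float), (len(grid), 1))
    # nearest event index for every grid point
    idx = np.clip(np.searchsorted(event_times, grid), 1, len(event_times) - 1)
    left, right = event_times[idx - 1], event_times[idx]
    nearest = np.where(grid - left <= right - grid, idx - 1, idx)
    # activate alignment only where the normalized distance is <= 1
    supported = np.abs(grid - event_times[nearest]) / support <= 1.0
    out[supported] = poses[nearest[supported]]
    return out
```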
Speed-balanced evaluation: With ECs, motion is a major source of variation, unlike in conventional camera observations. Head movement speed in particular is a crucial variable that can affect performance. A highly imbalanced speed distribution in a test dataset can be misleading when results are generalized to dissimilar speed distributions; it is therefore essential to assess speed-dependent alignment performance. For this evaluation, we quantize speed by partitioning the regularly re-sampled test samples according to their ground-truth speed levels. Moreover, we can then practically estimate expectations over any distribution of the speed levels. In this study, we take the expectation over a uniform speed distribution to obtain a more generalized average performance score, i.e., to suppress the speed bias of the test set. We calculate the observed speed on the image plane by estimating the RMS speed of the pose vector (concatenated landmark coordinates) at time-point $t$ by differentiation,

$$v(t) = \frac{d_{\mathrm{RMS}}(t)}{\Delta t},$$

where $d_{\mathrm{RMS}}(t)$ is the RMS distance between successive ground-truth pose vectors and $\Delta t$ is the re-sampling step. Then we compose six speed levels with the upper interval boundaries $\{75, 150, 225, 300, 375, 450\}$ pix/s (pixels per second). The range of $[0, 1)$ pix/s is evaluated separately, by labeling those samples as still head poses. We chose six speed levels because this number allows us to observe sufficiently different levels of movement speed while remaining small enough to report the evaluations concisely. The upper limit was chosen to exclude very high speed outliers, based on a statistical outlier analysis and then rounding to 450 for convenience, as it is divisible by six.
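A sketch of the speed estimation and quantization described above; the helper names and array layout are illustrative. Level 0 corresponds to still poses ([0, 1) pix/s), and level 7 would collect the excluded outliers above 450 pix/s.

```python
import numpy as np

# upper interval boundaries of the six moving-speed levels (pix/s)
BOUNDS = (1, 75, 150, 225, 300, 375, 450)

def speed_levels(pose_vectors, step=1e-3):
    """RMS image-plane speed per time-point and its quantized level.

    pose_vectors: (T, D) regularly re-sampled ground-truth pose vectors.
    Returns speeds (pix/s) and levels: 0 = still [0,1), 1..6 = moving,
    7 = excluded outliers above 450 pix/s.
    """
    diff = np.diff(pose_vectors, axis=0)
    d_rms = np.sqrt(np.mean(diff ** 2, axis=1))   # RMS distance between successive poses
    speed = np.concatenate([[0.0], d_rms / step]) # differentiation: distance / time-step
    return speed, np.digitize(speed, BOUNDS)

def speed_balanced_mean(errors, levels):
    """Expectation of the error under a uniform distribution of speed levels."""
    return np.mean([errors[levels == k].mean() for k in range(1, 7)])
```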
Dual cross validation: We evaluate by cross validation with training and test subjects separated, since subject variation is limited to the 18 subjects in our dataset. As we have two different predictors running in a cascade, their training and testing are performed with two different cross validations. For the detector, we apply 3-fold cross validation, providing splits of 12 training and 6 test subjects; for the pose regression, 6-fold cross validation is applied, providing splits of 15 training and 3 test subjects. We opted for a smaller training set for the former to considerably reduce the training and testing times, as the method requires multiple passes over the whole EC database. Because the model complexity of the detector is very low, i.e., there are only a few parameters to estimate, smaller training sets do not deteriorate the detection performance. Also, the training set of the detector is further divided so as to adjust the FNRs on four subjects within each training fold while fitting on the remaining eight. We set the FNR goal to 0.05 as it provides a good compromise between detection and pose-event rates, and we observed that the resulting FNR values on the test sets met this goal closely.
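A minimal sketch of the subject-separated fold construction (the further 8/4 division of each detector training fold for FNR adjustment is omitted):

```python
import numpy as np

def subject_folds(subjects, n_folds, seed=0):
    """Subject-separated cross-validation folds over the 18 subjects."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(subjects)
    for fold in np.array_split(ids, n_folds):
        test = set(fold.tolist())
        yield [s for s in ids.tolist() if s not in test], sorted(test)

subjects = range(18)
detector_cv = list(subject_folds(subjects, 3))   # 12 train / 6 test per fold
regressor_cv = list(subject_folds(subjects, 6))  # 15 train / 3 test per fold
```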
Initialization: In a complete face analysis system, a face template is initially aligned via a coarse similarity transformation estimate. Commonly, the face detection bounding box is used to obtain coarse location and scale estimates. A better initialization can be made in a tracking framework based on the temporal continuity of pose/shape vectors [45], with joint detection and alignment [41], or using a coarse pose estimation [46]. Good initialization simplifies the alignment task. However, regression-based methods can be sensitive to the face detector as well as to the bounding box configuration, since a mapping model learned in the training stage may not correspond well to the mappings arising from initialization with a different detector applied at test time, as noted by Sagonas et al. [59]. For experimentation, we simulate face detectors of different quality regarding position and scale uncertainty, with the expectation that a typical detector predicts the face position within the size of the eyes-mouth triangle. We assume that rotation and the rest of the affine pose components are totally unknown; therefore, we fix them to identity transformations. For this reason, in-plane and out-of-plane rotations can cause very high errors even when the translation and scale initialization is good. Translation is sampled using a uniform density over a circular region of radius $r$, and scale is sampled using a density that is uniform at logarithmic scale over a given range. Since $s$ in Equation (10) is the statistical size measure, we parameterize the translation radius $r$ in units of $s$. We experiment with the three levels of scale-space localization uncertainty settings given in Table 2. While the low setting can be considered representative of a state-of-the-art face detector with conventional cameras, or of a tracker with significant jitter noise, the medium and high settings correspond to less precise detectors. We also simulate high localization imprecision because EC-only face detectors may be much less precise than conventional face detectors due to differential observations and motion-induced uncertainty.
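The perturbation sampling can be sketched as follows; the radius and log-scale range values passed in at the bottom are placeholders, since the actual ranges are those of Table 2.

```python
import numpy as np

def sample_initialization(rng, radius, log2_scale_range):
    """Sample a perturbed similarity-transform initialization.

    radius:           translation radius, in units of the size measure s
    log2_scale_range: (lo, hi) exponent bounds for the log-uniform scale
    (both values are placeholders for the actual Table 2 ranges)
    """
    # translation: uniform density over a circular region of radius `radius`
    angle = rng.uniform(0.0, 2.0 * np.pi)
    rho = radius * np.sqrt(rng.uniform())         # sqrt() gives area-uniform samples
    translation = rho * np.array([np.cos(angle), np.sin(angle)])
    # scale: uniform density at logarithmic scale
    scale = 2.0 ** rng.uniform(*log2_scale_range)
    # rotation and remaining affine components are fixed to identity
    return translation, scale

rng = np.random.default_rng(0)
t, s = sample_initialization(rng, radius=0.3, log2_scale_range=(-0.3, 0.3))
```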
5.2. Comparison against Human Performance
Since there is no prior work on pose alignment with ECs, we evaluate our method by comparing against a second set of human annotators who annotate the landmarks on the EC observations as described in
Section 3.3. Thus, the errors due to the disagreement between the two sets of annotators serve as a very strong EC baseline.
For feature search and selection in the training for regression, we sample event-cell pairs over a square area of fixed width centered at the input pose center, because a relatively large sampling domain helps to capture patterns that can occur far away due to motion or due to poor initialization. We apply down-sampling to lower the spatial input resolution, as explained in Section 5.3. We empirically found the following feature selection parameter values, which help to complete training in a reasonable amount of time without degrading the performance significantly. The prior on the coordinate pairs is modeled as in Equation (9). The random feature selection pool size is set to 1000. We sample the cell-mass difference thresholds of the node splits, i.e., the threshold in Equation (8), within the centered density interval of 0.02 events/(ms·pix²). Forty trials are performed to find an optimal pair of features and decision threshold at each tree node.
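As an illustration of the per-node search, the sketch below scores 40 randomly drawn feature/threshold pairs by variance reduction of the regression residuals; the exact split criterion and the density-based threshold range of Equation (8) are simplified here to a uniform draw over the observed feature range.

```python
import numpy as np

def best_node_split(features, residuals, rng, n_trials=40):
    """Randomized split search at one regression-tree node.

    features:  (n_samples, 1000) pre-computed cell-mass differences for
               the pooled candidate coordinate pairs
    residuals: (n_samples, D) current boosting residuals (targets)
    """
    best_feat, best_thr, best_score = None, None, -np.inf
    for _ in range(n_trials):                     # 40 trials per node
        f = int(rng.integers(features.shape[1]))  # random feature from the pool
        thr = rng.uniform(features[:, f].min(), features[:, f].max())
        left = features[:, f] <= thr
        if left.all() or not left.any():
            continue
        # score by (negative) weighted residual variance after the split
        score = -(left.mean() * np.var(residuals[left], axis=0).sum()
                  + (~left).mean() * np.var(residuals[~left], axis=0).sum())
        if score > best_score:
            best_feat, best_thr, best_score = f, thr, score
    return best_feat, best_thr
```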
We adjust the predictor model to an optimal complexity as follows; this choice is comparatively evaluated in Section 5.4. A forest of 40 trees is used at each stage of gradient boosting, as it is sufficiently large to regularize while small enough not to degrade accuracy. Tree depth is set to four. Sixteen initialization clusters are estimated to fit the model as well as to be used in prediction. Both the bottom-level and top-level cascade lengths are set to 10; hence, feature sampling for pose-invariance is done every 10 stages and the total cascade length is 100.
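For reference, the selected complexity settings can be collected in a single configuration object (an illustrative structure, not the paper's code):

```python
from dataclasses import dataclass

@dataclass
class PredictorConfig:
    """Selected model complexity settings (evaluated in Section 5.4)."""
    trees_per_stage: int = 40    # forest size per gradient-boosting stage
    tree_depth: int = 4
    init_clusters: int = 16      # initialization clusters (fitting and prediction)
    bottom_cascade: int = 10     # stages between pose-invariant feature samplings
    top_cascade: int = 10        # total stages: bottom_cascade * top_cascade = 100
```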
Table 3 compares our predictor performance and the inter-human annotation disagreements by average accuracy and precision, in terms of failure percentages and precision errors, separately for intense and moderate head motion clips. The averages are taken excluding the still frames, which are in the RMS speed range of $[0, 1)$ pix/s. Here the predictor runs under the low initial localization uncertainty setting (see Table 2). We see that inter-human disagreements in terms of average accuracy and precision are smaller than the prediction errors, as expected. However, for moderate head motion clips, the average prediction errors are very close to the inter-human disagreements.
Human superiority is seen for almost all head movement speed levels in Table 4 as well. The only exceptions, where the results are comparable, are the second fastest level ($[300, 375)$ pix/s) and the speed levels above 75 pix/s in the moderate motion category. In general, alignment mismatches worsen at the fastest and slowest movements. While inter-human disagreements increase only slightly, predictor performance worsens strikingly, especially at the highest speed level. These degradations are possibly due to higher degrees of ambiguity: within a fixed temporal window, significant motion blur can occur at the fastest speeds, and observations can become very weak at the slowest speeds.
We also report the resulting pose-event rates in Table 3. The EC skips still head pose moments by means of motion detection, thus avoiding unnecessary processing for alignment. The resulting average pose-event rate is 32.2 Hz, as seen in Table 3, whereas processing conventional video frames would correspond to a fixed rate of 50 Hz due to the 20 ms time-steps. Please note that the pose-event rate is even lower for moderate motion clips.
Face detectors or trackers provide the face position and scale with some degree of uncertainty to initialize the alignment. To characterize performance with respect to scale-space localization uncertainty, and thus to be able to determine the precision levels required of EC-based face detectors and trackers, we evaluate under the three uncertainty ranges given in Table 2.
Figure 3 compares the accuracy and precision of our predictor and the human annotators (using an optimal model complexity configuration for each uncertainty level according to Section 5.4). Please note that the resulting average pose-event rates for the three uncertainty levels are almost the same: 32.2 Hz, 32.0 Hz, and 32.7 Hz from low to high uncertainty, respectively. On the left of Figure 3, the heat map shows the failure percentages of the individual speed levels as well as the averages over all levels, excluding the still head labels. As expected, failures escalate with increasing uncertainty range at all speed levels. The worsening at the fastest and slowest motion levels, explained above, becomes much more prominent with increasing localization uncertainty. Moreover, the two slowest speed levels cause more severe degradations, which can be explained by undetected motion failing to activate the alignment, since in the absence of activity EC sensors emit pixel-events only due to noise.
On the right of Figure 3, cumulative distributions of the precision error are shown after speed-level balancing, i.e., sub-sampling without replacement from each speed interval to the size of the smallest speed partition. For all uncertainty levels, the curves start from similarly low error values; but as the error increases, higher localization uncertainty makes the error distribution more dissimilar to the annotators' curve, worsening the precision.
Figure 4 shows example snapshots of alignments (orange triangular shapes) at the high initial localization uncertainty setting (initializations shown as blue dashed triangular shapes), compared to the two sets of human annotators (ground-truth: red dots; second set: green circles). We see accurate prediction of landmarks (below the failure threshold of 0.308) for various poses involving talking lips and eye blinks as well as roll, yaw and pitch rotations. An example of alignment on eye blinks is at cell "6b", and on talking lips at "1c", "2b", "2d", and "6d". Alignment can also be accurate when the mouth is cropped, as seen at "7d". The precision of the predictor is, in general, slightly worse than that of the human annotators; however, there are occasional opposite cases as well. For instance, due to the hair occlusion of the left eye at cell "4c", the human disagreement is higher than the prediction error, whereas the prediction locates the left eye similarly to the ground-truth. Small annotator disagreements are also observed due to the talking lips in "2d" and on the eyes due to a fast movement at "7b".
5.3. Evaluation of Face Resolution
In our dataset, face resolution is quite high: the frontal-view eye-to-eye distance is 67 pixels on average. Depending on the sensor, lens, and distance to the camera, the acquired face resolution can be much lower in other applications. Lower resolutions might have an unexpected negative impact on performance, since event sensor data exhibits a high degree of motion-dependent variation, which may complicate the alignment if necessary details are lost. Conversely, lower resolutions may also lead to improvements, since learning is facilitated by a reduced feature search space, which originally contains much redundancy for the alignment task. To clarify these possibilities, we investigate the effects of lower resolutions on performance.
We experiment by applying spatial down-sampling via pixel-event cells, with three down-scaling ratios realized by odd cell sizes (listed in Table 5). Example snapshots of the resulting spatial scales are shown in Figure 5.
Table 5 shows the corresponding event-cell sizes, resolutions, and alignment failure percentages for the three levels of uncertainty, depending on the down-sampling ratio. Failures increase markedly one octave below the best-performing ratio for all uncertainty levels. In this experiment, to reduce the experimentation time, we use four initialization clusters and a tree depth of four. It is seen that the optimal down-sampling ratio also greatly depends on the localization uncertainty: the optimal ratio is finer for the low uncertainty level and coarser for the medium and high uncertainty levels (see Table 5 for the exact ratios). These results indicate the benefit of eliminating redundancy and fine details for the coarse pose alignment task with ECs, especially in the more challenging cases of higher localization uncertainty, and they also demonstrate the pose alignment capability for a wider range of applications and with cheaper, lower-resolution sensors.
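One plausible realization of the event-cell down-sampling is to accumulate signed pixel-event counts into square cells, as sketched below; the signed accumulation and the array interface are assumptions for illustration.

```python
import numpy as np

def downsample_events(xs, ys, polarities, cell, width, height):
    """Accumulate pixel-events into square cells of odd size `cell`,
    yielding a lower-resolution signed cell-mass grid."""
    out_h, out_w = height // cell, width // cell
    grid = np.zeros((out_h, out_w))
    cx, cy = xs // cell, ys // cell
    keep = (cx < out_w) & (cy < out_h)            # drop partial border cells
    signs = np.where(polarities[keep] > 0, 1.0, -1.0)
    np.add.at(grid, (cy[keep], cx[keep]), signs)  # signed event accumulation
    return grid
```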
5.4. Evaluation of Model Complexity
The complexity of a predictor alters the trade-off between computational requirements and performance. Moreover, higher-complexity models can cause over-learning, which degrades generalization to the test set. For these reasons, we analyze the performance of EC-based pose alignment by varying the complexity of our predictor, for the different levels of localization uncertainty. The main parameters that change the model complexity are the tree depth, the cascade length, and the number of initialization clusters. We observed significant improvements up to 16 initialization clusters. For instance, more than 3% of failures are recovered by increasing the number of clusters from four to 16 for the high uncertainty case, and about 1% for the other ranges. A higher number of clusters improves the performance by reducing the chance of getting stuck at poorly initialized predictions and by enabling richer data augmentation in training. However, from 16 to 32 clusters the improvements were negligible, at the expense of doubling the computations. Therefore, we set the number of clusters to 16.
Second, we investigate the effect of tree depth. Table 6 shows that increasing the tree depth improves the learning capacity up to a certain level, after which over-fitting starts to cause significant degradation. While a depth of six is best for the high and medium uncertainty cases, a depth of four obtains the lowest failure rate for the low uncertainty case.
Finally, we evaluate the cascade length in Figure 6, which shows the failure percentage versus the top-level cascade length for each localization uncertainty level, using the best tree depth according to Table 6. The majority of the improvement is achieved rapidly in the first few stages, after which the failure percentages decrease slowly. We picked a length of 10 for all the comparative evaluations in this paper, though some small improvement is still possible with longer cascades. Similarly, the length of the bottom-level cascade is also fixed at 10 (hence the total number of stages is 100), as varying it produced insignificant changes in the results. The results with the selected optimal parameters are reported in Table 7.