1. Introduction
With the continuous advance and gradual maturity of computer science, people expect to obtain and process more information about themselves by means of computer technology, for example by tracking human limb motion. Because it carries personal and gait characteristics, human motion plays an important role in application fields such as posture analysis and virtual reality. Against this background, three-dimensional (3D) human posture reconstruction based on videos is a popular research area [1]. 3D human posture estimation based on monocular video sequences has received particular attention, owing to its low cost and fewer limitations. Individual applications have their own requirements for 3D human pose estimation, but the two key performance indicators for pose estimation algorithms are accuracy and real-time performance.
Recently, because they provide depth information, depth cameras [2,3] have been applied to estimating 3D human poses and representing human activity. Kong et al. [4] presented a hybrid framework to detect joints automatically based on a depth camera; 3D human poses were then estimated using the located human skeleton model. Stommel et al. [5] proposed a novel method for estimating 3D human poses based on the spatiotemporal segmentation of key points, provided by depth contours, using Kinect camera data. However, the estimation accuracy is affected by the capture distance. Therefore, a traditional camera is used for obtaining human postures when the distance between the subject and the camera increases. Because of the absence of depth information, it is very difficult to estimate 3D human poses based on monocular video sequences. To address this challenge, a number of methods have been developed. Mehta et al. [6] proposed a real-time method to capture global 3D skeletal poses and estimate human poses based on a single RGB camera, combining a convolutional neural network with kinematic skeleton fitting. Atrevi et al. [7] extracted 3D poses using a traditional camera without any depth information, based on the correspondence between silhouettes and skeletons. Sigal et al. [8] and Babagholami–Mohamadabadi et al. [9] proposed a baseline algorithm and a sparse representation, respectively, to estimate 3D human poses. Furthermore, the Bayesian framework was improved by estimating a posterior distribution for sparse codes. Based on the obtained spatial and temporal features, Li et al. [10] presented an algorithm for estimating a sequence of human poses in unconstrained videos. Based on the spatial model, the detection precision of body parts was improved. To reduce the interference among similar human poses, several algorithms use the corresponding depth images. Dinh et al. [11] presented an approach to recover 3D human poses in real time from a depth image using principal direction analysis. Based on the introduced prior models of human poses and depth images, He et al. [12] developed a latent variable pictorial structure for estimating human poses using a monocular camera. Wu et al. [13] presented a method, called model-based recursive matching, to estimate human poses based on a depth image and a 3D point cloud.
In recent years, deep learning has made considerable progress and has also obtained satisfactory results in human pose estimation. Marin–Jimenez et al. [14] proposed a deep depth pose model to obtain 3D positions of body joints and reconstruct human poses. Hong et al. [15] improved traditional methods by adopting a locality-preserving restriction, based on a denoising auto-encoder for estimating 3D human poses. Sedai et al. [16] and Guo et al. [17] proposed a discriminative fusion method and Markov random fields, respectively, to reconstruct human poses using shape and appearance features. To solve the problem of occlusion, multi-view video sequences are applied for 3D human pose estimation. Sharifi et al. [18] proposed a marker-based human pose tracking and estimation method, based on particle swarm optimization with search space partitioning.
Human pose estimation is also critical in some other areas, such as action recognition and behavior monitoring. Yang et al. [19] proposed a novel recurrent attention convolutional neural network for recognizing human action based on sequences of video frames. Furthermore, the region of interest is visualized in order to analyze human action efficiently. Chaaraoui et al. [20] proposed a framework to recognize human behavior using multi-view cameras. Furthermore, a privacy-by-context method was used for protecting the privacy of inhabitants. Batchuluun et al. [21] recognized human behavior using camera systems, including visible light and thermal cameras. The accuracy of human behavior prediction was improved by the proposed fuzzy system.
A large number of research efforts aim to reconstruct high-quality 3D human poses. However, the current methods suffer from the following key shortcomings: the errors in estimated human limb motion are large, the real-time reconstruction of different 3D human poses needs to be improved, and the connection between adjacent limbs after pose adjustment is not smooth.
To address these shortcomings, we combine the iterative calculation of joint points and conformal geometric algebra (CGA) to estimate accurate 3D human poses. The images containing different human poses are captured by a camera, and the experimental 3D human models are selected from the database of free 3D models [22]. Compared with the existing work, the main contributions of this work include: (1) strip information of clothes and prior data on different human limbs are used for locating joint points, which can solve the occlusion problem between a limb part and the human torso; (2) iterative calculation of the skeleton model is proposed for estimating the 3D coordinates of joint points, in order to avoid the high cost of multiple cameras or a depth camera; (3) CGA makes limb motion on a 3D human model more convenient and efficient, owing to its advantages in expressing rotation and transformation; and (4) a high-precision virtual human model is applied for 3D human pose estimation, which can generate more realistic and reasonable human poses.
This paper is organized as follows. In Section 2, the whole estimation process of 3D human poses is demonstrated. In Section 2.1, a 3D skeleton model and 3D limb parts are first introduced. The methods of locating the human joint points on a target human body and of handling the occlusion problem are then described. Subsequently, an iterative calculation of the skeleton model is applied for estimating the 3D coordinates of joint points. In Section 2.2, the motion directions and angles of the various human limbs are first calculated by CGA. To estimate the 3D human poses, rigid transformation is then applied for adjusting the postures of limb parts on a high-precision model. In Section 3, the performance of joint point location and the reconstruction performance for different human poses are analyzed. In addition, the result of 3D human pose reconstruction based on the proposed method is compared with existing algorithms when limb occlusion occurs. Finally, Section 4 concludes the paper.
3. Experimental Results and Validation
To test the proposed joint point location in the target human body images and the 3D human pose reconstruction, the experiments were carried out on captured human body images and motion sequences for different human poses. To prevent the labels from failing to indicate the correct joint positions when the human pose changed, a second person adjusted the positions of all the labels once the target human body had assumed each pose. That is, the human motion images were captured only when all of the labels were in satisfactory positions, based on the manual operation of the second person. The 3D human model in the experiment was taken from the free 3D model database [22]. Furthermore, in order to demonstrate the location accuracy of the joint points in the target human body images, we compared the experimental results of 3D human pose reconstruction based on manual location and on the proposed joint point location method. In addition, the proposed 3D human pose estimation method was compared with the human pose reconstruction method in [23]. The whole algorithm was developed in Visual Studio 2010 and executed on an Intel Core i5 1.7 GHz PC.
All of the target human body images and human motion sequences were captured by a traditional monocular camera and a mobile phone located at a fixed position. The camera was parallel to the projected plane during capture. That is, the capture setup eliminated human error in the depth information of the human joint points. During the experiments, the joint points of the target human body were identified using the pasted labels. Therefore, the lengths of the various human skeleton parts and their corresponding coordinates in the human motion image can be measured based on the located joint points of the target human body. The focal length of the camera was then calculated by the connection model of three limb parts in [23], using the located clavicle point, right shoulder point, right elbow point and right wrist point. As shown in Figure 7, subjects of the experiment needed to keep the standard standing posture, and the green points are the located human joint points. The width and height of the captured images were 1536 and 2048 pixels, respectively.
The location of the joint points of the target human body has a great effect on 3D human pose reconstruction. In this part, the performance of 3D human pose estimation using manual location and the proposed joint point location method was tested first. In the experiments, the scale factors of the joint points were calculated based on three different joint point location groups (see Figure 8). The human joint points obtained using the proposed automatic location method are shown in Figure 8a. As shown in the figure, the obtained human joint points were generally located at the center of the pasted labels, which indicates that the proposed joint point location method was effective and accurate. The first group of manually located joint points had little error compared with the accurately located joint points, as shown in Figure 8b. In the second group of manually located joint points, the error between the marked joint points and the accurate joint points is very large, as shown in Figure 8c. The front view and side view of the estimated 3D human pose, using the above three groups of joint points, are shown in Figure 8e,f,h–g, respectively. As shown in the figures, the estimated 3D human pose based on the joint points located by the proposed method can reflect the human body pose in the motion image. However, the 3D human poses estimated using the two groups of manually located joint points had large errors. In addition, the error between the estimated 3D human pose and the real 3D human pose was greater for the second group of joint points than for the first group. That is, the error of the estimated 3D human pose depends on the accuracy of the location of the joint points.
To test the accuracy of the proposed 3D human pose estimation method, different 3D human poses were estimated using the rigid motion of the limb parts, based on the different human motion images captured. In this part, in order to evaluate the pose estimation method itself, the joint points of the target human body were obtained by manual location to improve the accuracy of the joint point location. The 3D human poses estimated using the rigid motion of the limb parts for the different captured poses are shown in Figure 9. The captured images of the various human poses are shown in Figure 9a; the front view of the estimated 3D human poses is shown in Figure 9b; and the corresponding side view of the estimated 3D human poses is shown in Figure 9c. As shown in the figures, the poses in the captured images were estimated by the proposed method. Similar to Equation (17), the average errors of the 3D coordinates of the joint points for the eight human motion images were calculated. The errors on the right elbow, right wrist, left elbow, left wrist, right knee, right ankle, left knee and left ankle were 0.42, 3.12, 0.58, 4.22, 1.45, 1.52, 0.62 and 1.64, respectively. Therefore, the proposed method obtained a satisfactory 3D human pose estimation result, because all of the errors were acceptable. It also demonstrated that the rigid motion of the limb parts, based on conformal transformation, is feasible and effective.
In this part, the tracking of human motion poses is studied by combining the proposed joint point location method with the 3D human pose estimation method, based on real human motion video sequences. Human motion sequence images were captured by a stationary camera, so that human error in the depth information of the human joint points can be eliminated. The experimental motion sequences are the actions of the subjects in the images. A total of 50 frames of a human motion sequence were obtained at equal intervals. 3D human poses were estimated from these sequence images using the proposed joint point location and 3D pose reconstruction methods. The 3D pose estimation results on frames 1, 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50 are shown in Figure 10. The captured images of the human motion sequence are shown in Figure 10a. The front view of the estimated 3D human poses using a 3D human model is shown in Figure 10b. The side view of the estimated 3D human poses is shown in Figure 10c in order to display the depth information of the joint points. As shown in the figures, the accuracy of the 3D pose estimation differs from frame to frame. Furthermore, the average errors of the 3D coordinates of the joint points for the eleven human motion images are calculated using Equation (17). The errors on the right elbow, right wrist, left elbow, left wrist, right knee, right ankle, left knee and left ankle are 0.39, 3.24, 0.56, 4.34, 1.56, 1.97, 0.68 and 1.83, respectively. Therefore, as a whole, the 3D pose estimation results for the human motion sequence images are satisfactory. The error in 3D human pose estimation came mainly from the inaccurate location of the joint points and from the illumination variation across the human motion frames when the rigid motion of the limb parts is applied. In addition, compared with other skeleton parts,
and
had greater errors. This can be attributed to accumulated error, because the scale factors of the joint points are calculated iteratively.
To assess the accuracy of the tracking of human motion poses, the joint points obtained by the proposed automatic location method are compared with the accurate joint points obtained by manual location (see Figure 11). The width errors of the located right wrist point, left wrist point, right ankle point and left ankle point, based on the human motion sequence images shown in Figure 10, are shown in Figure 11a. Figure 11b shows the corresponding height errors of the above joint points. As shown in the figures, all of the location errors fall within the range (3, 8) pixels. That is, the proposed method locates joint points satisfactorily. In addition, the location errors of the left wrist and right ankle points have their lowest values in the 30th and 40th frames of the motion sequence images (see Figure 10). The reason is that the left wrist and right ankle points are nearer to the camera in these frames. Therefore, the location of the two joint points was more accurate, because the corresponding pasted labels can be identified more easily. Furthermore, the location errors of all the joint points have their largest values from the 5th to the 15th frame and from the 45th to the 50th frame. In addition, as shown in Figure 11a, the right wrist point had its largest error from the 10th to the 15th frame and from the 25th to the 30th frame, because the point was occluded by other human parts. Similarly, the left wrist point had its largest error from the 5th to the 15th frame, because the left wrist point was disturbed by the left elbow point, owing to their close positions (see Figure 10).
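The width and height errors discussed above can be illustrated with a minimal sketch. It assumes that the width error is the absolute horizontal pixel difference and the height error the absolute vertical pixel difference between the automatically located point and the manually located point in each frame; the function name `axis_errors` and the sample coordinates are hypothetical and are not taken from the experiments.

```python
def axis_errors(auto_pts, manual_pts):
    """Per-frame absolute pixel errors of one joint point.

    auto_pts, manual_pts: lists of (x, y) pixel coordinates, one per frame.
    Returns (width_errors, height_errors). Treating the errors as absolute
    per-axis differences is an assumption made for this sketch.
    """
    if len(auto_pts) != len(manual_pts):
        raise ValueError("both lists must cover the same frames")
    width_errors = [abs(a[0] - m[0]) for a, m in zip(auto_pts, manual_pts)]
    height_errors = [abs(a[1] - m[1]) for a, m in zip(auto_pts, manual_pts)]
    return width_errors, height_errors


# Hypothetical coordinates of one joint point in two frames.
auto = [(100, 200), (105, 210)]
manual = [(104, 195), (105, 216)]
width_err, height_err = axis_errors(auto, manual)
```

Plotting `width_err` and `height_err` against the frame index would give curves of the kind described for Figure 11a and Figure 11b.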
To further assess the location accuracy of the joint points, the proposed method was compared with five human pose estimation methods: those of Yang et al. [31], Chen et al. [32], Chou et al. [33], Chu et al. [34] and Luvizon et al. [35]. The PCKh measure, proposed by Andriluka et al. [36], was applied for measuring the location accuracy of the joint points. The result is shown in Figure 12. In the lower part of the figure, the 30 experimental images, corresponding to various human poses, are presented, and in the upper part, the average location accuracy of all of the human joint points in these 30 images is shown. As shown in the figure, apart from some specific poses, the location accuracy of the joint points was around 90% for all of the methods. Therefore, the five state-of-the-art human pose estimation methods and the proposed algorithm can all obtain satisfactory location results for the human joint points. Based on the 30 images, the average location accuracy of the joint points located by the proposed method was 93.02%. The corresponding average location accuracy of the other five methods was 93.23%, 93.11%, 92.78%, 93.03% and 92.04%, and their variance in location accuracy was 9.71, 7.59, 8.47, 8.83 and 9.26, respectively. The variance of the proposed method was 7.29. Therefore, although it had no superiority in location accuracy, the proposed method had the smallest variation in joint point location error across the various human poses. That is, compared with the other five methods, the proposed method was the most stable algorithm. Furthermore, not all methods achieved high performance on sitting people (e.g., poses 12, 17, 19). The reason is that the distances among the joint points decrease for sitting people, so that several joint points are often mistakenly recognized. In addition, consider poses 13, 14, 15, 16, 17, 19, 20, 22, 23, 24, 27, 28 and 29, which correspond to people with joint point occlusion. Averaged over these poses, the proposed method was superior to [35] by 1.46% PCKh (91.05% vs. 89.59%), to [34] by 0.44% PCKh (91.05% vs. 90.61%), to [33] by 0.15% PCKh (91.05% vs. 90.90%) and to [31] by 0.11% PCKh (91.05% vs. 90.94%). Compared with the method in [32], the proposed method was only 0.27% lower (91.05% vs. 91.32%). Therefore, the proposed method is feasible for joint point location when occlusion occurs.
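To make the accuracy and stability figures above concrete, the following is a minimal sketch of how a PCKh score and its mean and variance across poses could be computed. It assumes the common PCKh@0.5 convention, in which a joint counts as correctly located when its distance to the ground truth is at most half the head segment length. The function names, the sample coordinates and the use of the population variance are assumptions of this sketch; the source does not state which variance estimator was used.

```python
import math
from statistics import mean, pvariance


def pckh(pred, truth, head_size, alpha=0.5):
    """Percentage of joints located within alpha * head_size of the ground truth.

    pred, truth: lists of (x, y) joint coordinates in pixels, in the same order.
    head_size: head segment length in pixels for this person (assumed given).
    """
    threshold = alpha * head_size
    hits = sum(1 for p, t in zip(pred, truth) if math.dist(p, t) <= threshold)
    return 100.0 * hits / len(truth)


def summarize(per_pose_scores):
    """Mean and population variance of per-pose PCKh scores (assumed estimator)."""
    return mean(per_pose_scores), pvariance(per_pose_scores)


# Hypothetical joints for one pose; the threshold is 0.5 * 20 = 10 pixels.
pred = [(10, 10), (50, 50), (90, 90), (30, 30)]
truth = [(12, 10), (50, 58), (90, 120), (30, 31)]
score = pckh(pred, truth, head_size=20)

# Hypothetical per-pose scores for one method.
avg, var = summarize([90.0, 94.0])
```

In this toy pose, three of the four joints fall within the threshold. A method with a lower variance across poses is the more stable one in the sense used above, even if its mean score is not the highest.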
To demonstrate how the limb parts vary over a human motion sequence, the rotation angles of the limb parts were extracted using rigid transformation. The variation and error variation of the rotation angles of the left forearm and right calf, obtained by applying the proposed method to the 50 frames of motion sequence images (see Figure 10), are shown in Figure 13. The variation curve of the rotation angles, when the 3D poses of the left forearm and right calf were estimated, is shown in Figure 13a. The rotation angle of the left forearm has its maximum value in the 10th and the 35th frames, and its minimum value from the 20th to the 25th frame. This indicates that the left forearm moved from the stationary state to the raised state twice. In addition, as shown in Figure 10, the two maximum values of the rotation angle correspond to the actions of right leg extension, and the minimum value corresponds to the middle action between these two actions. The rotation angle of the right calf in the first 15 frames was similar to that of the left forearm. In addition, the rotation angle of the right calf remained steady in the last 30 frames, because the relative position of the right thigh and right calf remains unchanged for the action of the shot. The error variation curve of the rotation angles, when the 3D poses of the left forearm and right calf are estimated, is shown in
Figure 13b. The performance of the proposed method was satisfactory, because all of the errors were less than 8 degrees. In addition, the errors from the 30th to the 45th frame were greater than those in other frames. The reason is that the left forearm and right calf are near to the screen in these frames, where the same error in the location of the 2D joint points can lead to more deviation in the 3D human pose.
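As an illustration of how a limb rotation angle of this kind can be obtained from joint coordinates, the sketch below computes the angle between a limb's direction in a reference pose and its direction in the current frame. It uses ordinary vector algebra rather than CGA, so it is a simplified stand-in for the rigid transformation step and not the method itself; the function names and the coordinates are hypothetical.

```python
import math


def limb_direction(parent, child):
    """Direction vector of a limb from its parent joint to its child joint."""
    return tuple(c - p for p, c in zip(parent, child))


def rotation_angle_deg(ref_dir, cur_dir):
    """Angle in degrees between two 3D limb directions.

    atan2(|a x b|, a . b) is used instead of acos because it stays
    numerically stable for nearly parallel directions.
    """
    ax, ay, az = ref_dir
    bx, by, bz = cur_dir
    cross = (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)
    cross_norm = math.sqrt(sum(c * c for c in cross))
    dot = ax * bx + ay * by + az * bz
    return math.degrees(math.atan2(cross_norm, dot))


# Hypothetical forearm: elbow at the origin, wrist moves from +x to +y.
ref = limb_direction((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
cur = limb_direction((0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
angle = rotation_angle_deg(ref, cur)

# Error of an estimated angle against a reference angle, in degrees.
angle_error = abs(angle - 85.0)
```

Evaluating `rotation_angle_deg` for each frame gives a variation curve, and the absolute difference from a reference angle gives an error curve, of the kind described for Figure 13.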
To further demonstrate the effectiveness of the proposed 3D human pose estimation method, we compared it with the method in [23]. In this part, experiments on the smoothness of the adjacent limb parts and on the accuracy of the 3D poses estimated by the two methods are presented. The accuracy is compared on images with occlusion.
The 3D human poses estimated from human motion images with occlusion, using the proposed method and the method in [23], are shown in Figure 14. As shown in the figures, the 3D human motion estimated by the proposed method was more realistic, because the algorithm for the smooth connection of adjacent limb parts was introduced. By contrast, the method in [23] could not describe realistic 3D human motion well, owing to the distortion of the articulation when the 3D human poses were estimated. The reason is that the method in [23] only estimated the 3D poses of the limb parts and ignored the treatment of the articulated point.
In addition, as shown in Figure 14, the proposed method obtained more accurate 3D human pose estimation results. By contrast, the method in [23] could not estimate 3D human poses with the same accuracy; in particular, the deviation of the occluded limb parts was greater. This is attributed to the estimation error in the coordinates of the occluded joint points. Owing to the introduced treatment of the occluded limb parts, the proposed method handled the occlusion successfully and obtained a satisfactory result.
In this part, the front and side views of the 3D human poses estimated using the proposed method and the method in [37] are presented, based on the image annotations in the MPII human pose dataset. As shown in Figure 15, the estimation accuracy of the 3D human parts using the proposed method was better than that using the method in [37], especially for the thigh and calf parts in model 1 and the forearms in model 2. Therefore, compared with the method in [37], the proposed method can satisfactorily estimate the human poses by calculating the 3D coordinates and applying CGA.
The estimation accuracy can be calculated by comparing the predicted 3D coordinates of the joint points to the ground truth data. The calculated average errors of the 3D coordinates of the joint points, for the four human motion images with occlusion in
Figure 14, are shown in
Table 4. The error
of the joint point is defined as follows:
where
are the coordinates of the ground truth joint points using the method of manual location in the
i-th image,
are the estimated coordinates of the joint points using the above two methods in the
i-th image. As shown in
Table 4, the proposed method located the joint points substantially more accurately than the method in [23], and this accuracy is critical in 3D human pose estimation.
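As an illustration of this kind of error measure, the sketch below computes an average 3D error for one joint point over several images. It assumes that the error is the mean Euclidean distance between the ground-truth and the estimated coordinates; this is a common choice made for the sketch and is not taken from the definition in the text. The function name and the sample coordinates are hypothetical.

```python
import math


def mean_joint_error(truth_by_image, est_by_image):
    """Average Euclidean distance between ground-truth and estimated 3D
    coordinates of one joint point across images.

    truth_by_image, est_by_image: lists of (x, y, z), one entry per image.
    The mean Euclidean distance is an assumed form of the error.
    """
    if len(truth_by_image) != len(est_by_image):
        raise ValueError("both lists must cover the same images")
    distances = [math.dist(t, e) for t, e in zip(truth_by_image, est_by_image)]
    return sum(distances) / len(distances)


# Hypothetical coordinates of one joint point in two images.
truth = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
estimated = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
error = mean_joint_error(truth, estimated)
```

Running the function once per joint point and per method would produce a table of average errors comparable in layout to Table 4.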
In Table 5, the computation time of joint point location for the six methods presented in Figure 12 is shown. Compared with the other five methods, the proposed method had the lowest computation time. The computation times of the other five methods were similar; among them, the maximum and minimum values correspond to the methods of Chen et al. [32] and Luvizon et al. [35], respectively. Furthermore, the numbers of vertices and the computation times for the various human parts, using CGA and the method in [38], are shown in Table 6. Compared with the method in [38], CGA had a longer computation time. The reason is that the method in [38] changes human limb poses by direct geometric calculation, whereas in our method several CGA transformations were implemented through the corresponding function package, and these function calls increased the overall computation time. However, the computation time was acceptable and worthwhile for the whole human mesh model, because the accuracy was greatly improved when the proposed method was applied. Nevertheless, the proposed method is not applicable to real-time human pose estimation, because mesh transformation based on CGA takes some time. In many application areas of computer animation and human body simulation, however, it is not sufficient to estimate 3D human poses based solely on the human skeleton, as the other methods mentioned above do. The contribution of the proposed method is a high-precision human mesh model that can be used for 3D human pose estimation, so that it can be applied in many practical fields, such as virtual reality. In conclusion, compared with the other methods, the proposed method is mainly suitable for human pose estimation using high-precision 3D human mesh models and off-line processing in practical applications.
According to the experiments, the proposed method currently estimates 3D poses for only one person. The distance between the target person and the camera was around 3 m, and the images were captured in full light. Currently, the proposed method can predict poses with one occluded joint point. However, the method cannot be applied when two adjacent joint points are occluded simultaneously. We will address this case in future research.