Real-Time Recognition of the User's Arm Gestures in 2D Space with a Smart Camera
Gesture control technology is one of the most important technologies introduced today to facilitate human-machine communication. In this article, we propose a Smart Camera One (SCO) to remotely monitor a user's arm movements in order to control a machine. Its main role is therefore to act as an intermediary between the user and the machine to facilitate communication. This instrument can measure and monitor the different arm positions in real time with an accuracy of up to 81.5%. The technique consists of learning to perform live tasks: SCO is able to execute a user's task immediately, without going through a machine learning phase based on demonstrations. During the test, the setup of the demonstrator is captured visually by SCO, which can recognize users according to their skin color and distinguish them from colorful backgrounds. This is based on image processing with an intelligent algorithm implemented in the SCO through the Python programming language and the OpenCV library, as described in this article. Although initially a specialized application, SCO could be important in several areas, from remote control of mobile robots to gaming, education, and even marketing. Extensive experiments demonstrate the effectiveness of the developed model, as shown in [1].

Current Trends and Applications
Most people can accomplish a task simply by watching another person perform it in front of them only once. In the world of human-machine communication, matters will remain complicated until a machine can imitate human behaviors and then effectively reproduce them. Gesture recognition technology is considered one of the most important and widely used technologies, because it can provide users with convenient information while they complete a task, facilitating communication with, and remote control of, any device [2]. Gesture recognition can be considered a way for computers to understand human body language, building a richer bridge between machines and humans than primitive text user interfaces or even graphical user interfaces, which still limit most input to conventional keyboard and mouse interactions; gestures allow interaction without mechanical devices [3].

There are two types of gestures in computer interfaces. The first type, offline gestures, are processed after the user interacts with the object. Robots can be taught how to perform tasks using self-learning mechanisms that allow them to simulate human behavior, but these learning mechanisms require expensive real-world data that are difficult to collect [4]. In the method proposed by Bonardi et al. [5], robots have been successfully trained in simulation using algorithms known as task-embedded control networks (TecNets), without the need for a real human to physically guide the robot, so the method does not require any interaction with human trainers. The technique is to learn to accomplish tasks from one or more demonstrations and to produce actions for a variation of the same task, e.g., different motor speeds. With TecNets, tasks are hard to forget, and many tasks can be performed after being learned. During the test, the configuration of the demonstrator is captured visually by a camera, so the learning is directly applicable from a human demonstration. The camera acts like an eye, observing various actions and then recording the data needed for rehearsal. However, TecNets are unable to perform a user's task immediately, without going through the robot learning phase based on demonstrations.

In the second type of gestures, named direct action gestures, the concept of recognition and direct manipulation is used. Simple gestures allow users to control devices without physically touching them. Most often it is facial or hand movements, more than any other physical movement, that make up these gestures [6]. This avoided having to touch interfaces such as smartphones, laptops, games, TVs, and music equipment during the COVID-19 pandemic [7],[8]. However, most body-based systems or handheld sensors use built-in sensors (e.g., accelerometers) to track moving positions; the disadvantage of data-glove-based systems, for example, is that the user must wear gloves to operate the system. Alrubayi et al. propose a pattern recognition model for static gestures [9],[10]. The dataset was captured using a DataGlove device, and the finger and wrist movements can be measured as mentioned in [11].
Fig. 1. (a) Schematic of the experiment after placing the SCO (SCaMO) to test user gestures. (b) The normal alignment of the lower limb: note that the toes and knees face forward, and the upper limb moves at the sides in straight movements without any bending. (c) Visual representation of the joint angles of the upper limb and lower limb.
Fig. 2. Some of the dynamic gestures and static gestures of the user detected by the SCO under different circumstances.
Fig. 2 shows some of these gestures: static gestures only, as in Fig. 2i; only dynamic gestures in the image, as shown in Fig. 2c, Fig. 2d, Fig. 2g, and Fig. 2h; and dynamic gestures and static gestures in one picture, as shown in Fig. 2e, Fig. 2f, and Fig. 2j. The algorithm converts dynamic gestures to static ones, since the movements are recognized and then included among the static gestures.

Concept of the Algorithm
Our algorithm applies a method identical to the measurement of the angles θ. The principle is to determine the coordinates of the reflection point that represents a pixel in an image. Fig. 3 shows the five basic steps through which the algorithm passes inside the camera. The algorithm is divided into two main parts that complement each other. The first part performs the 'Face detection,' and the second part performs the 'Fixing intervals' on the image. In the first part, when recording a video, the algorithm tracks the movement of the user's head in all directions. Lu et al. [25] proposed a new computer vision-based algorithm derived from face detection technology. In our algorithm, part of the real-time face detection algorithm from the OpenCV library is also used, but with some modifications, including a change in the size of the green frame that surrounds the user's head during movement. We also extract the center point (xcp, ycp) and work with it instead of the (x, y) obtained from the library, as shown in Fig. 4a. The following equations show how this center point is obtained.
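The exact listing is not reproduced here, but a minimal sketch of this first part, assuming OpenCV's bundled Haar-cascade frontal-face detector and a standard webcam index of 0 (both assumptions, not details given by the authors), could look as follows; the center point (xcp, ycp) is taken here as the middle of the detected face rectangle (x, y, w, l):

```python
import cv2

# Assumed: OpenCV's bundled Haar cascade for frontal faces (a sketch, not the authors' exact model).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)  # regular web camera (assumed device index)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, l) in faces:            # (x, y, w, l): face detection parameters
        # Center point of the face box, used instead of (x, y)
        xcp, ycp = x + w // 2, y + l // 2
        # Small green point (Gpt) drawn near the nose-tip region
        cv2.circle(frame, (xcp, ycp), 2, (0, 255, 0), -1)
        # Green frame around the user's head (size adjustable)
        cv2.rectangle(frame, (x, y), (x + w, y + l), (0, 255, 0), 1)
    cv2.imshow("SCO", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```

Working with the single center point rather than the whole rectangle keeps the later interval logic independent of the detected face size.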
Fig. 3. The schematic diagram of arm gesture recognition using SCO.

The purpose is to make visible in the video a very small green point (Gpt), drawn on the user's nose tip. It has an important role for the user, as it shows whether the SCaMO has detected the gesture of the arms or not. It appears automatically when the user is in front of the camera, and only while the algorithm inside the camera is running. The user can adjust software parameters, such as the HSV calibration parameters of the video, as mentioned by Chen et al. [26], according to the effects of external factors and changes in the position of the user. The dcs distance (Fig. 1a) that separates the user from the camera in order to capture the movement of the arms is set at a few meters. This distance varies according to the size of the user's body, the resolution of the camera, and external factors such as lighting and shadows that degrade the quality of the image through the appearance of image noise. The metric we adopt to obtain a good distance dcs, and hence good results, is the Gpt indicator appearing on the video when the user is in the correct position. On average, this distance can reach approximately 3 m with a regular web camera, as is the case in this article. The dcs distance can be enlarged many times when we adopt a zoom and work with a more advanced camera.
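As an illustration of the kind of HSV calibration mentioned above, the sketch below thresholds a frame in HSV space to isolate skin-colored pixels; the lower and upper bounds are placeholder values that the user would tune for the lighting, shadows, and background at hand, not values taken from the article:

```python
import cv2
import numpy as np

def skin_mask(frame_bgr, lower=(0, 30, 60), upper=(25, 200, 255)):
    """Return a binary mask of skin-colored pixels.

    The HSV bounds are illustrative defaults only; in practice they are
    calibrated (e.g., with trackbars) to the current lighting conditions.
    """
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lower, np.uint8), np.array(upper, np.uint8))
    # Light morphological clean-up to suppress image noise
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return mask
```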
Also, the importance of using the point (xcp, ycp) in the algorithm is to divide the image into two parts (the right and the left side of the user). Equation (5) gives the centroid of a rectangle consisting of n distinct points x1 to xn; but in the context of image processing, (6) and (7) are used to obtain the coordinates of the centroid, where xcp is the abscissa of the centroid, ycp is the ordinate of the centroid, and M denotes the image moment:

C = \frac{1}{n}\sum_{i=1}^{n} x_i    (5)

x_{cp} = \frac{M_{10}}{M_{00}}    (6)

y_{cp} = \frac{M_{01}}{M_{00}}    (7)
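In practice, the moments M00, M10, and M01 in (6) and (7) can be obtained directly from OpenCV, so the centroid of a binary region reduces to a few lines; this is a sketch of that computation, not the authors' listing:

```python
import cv2

def centroid_from_mask(binary_mask):
    """Compute (xcp, ycp) from image moments, as in (6) and (7)."""
    m = cv2.moments(binary_mask, binaryImage=True)
    if m["m00"] == 0:                    # empty mask: no centroid defined
        return None
    xcp = int(m["m10"] / m["m00"])       # abscissa of the centroid
    ycp = int(m["m01"] / m["m00"])       # ordinate of the centroid
    return xcp, ycp
```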
Fig. 4a shows that the right slit is equal to the left slit because the user is in the middle of the video. In other, more probable cases, the user is not centered in the video, and therefore one side is larger than the other. The reason is that the camera is installed on a support that is fixed in all directions. In our algorithm, the camera can determine the position of the arm even when the user is not placed in the center of the video, thanks to the programming intelligence of the proposed algorithm. This advantage therefore makes it possible to change the work equipment; for example, it is possible to run our algorithm on a computer with an integrated camera instead of using the RPi with a web camera, as mentioned by N. Lalithamani [27].

Through the xcp point, the algorithm divides the image into two parts, each part carrying half of the body. The ycp parameter plays an important role in determining the position of the hand; that is, it indicates the direction of the hand, specifically whether the hand is raised or not. In the coordinates of the nose tip (xcp, ycp), the point xcp marks the distinction between the 'Right side of user' and the 'Left side of user', while ycp represents the level at which the position of the raised hand is determined. According to the physical length and the distance between the user and the camera, the dcs distance is related to the size of the user in the image: if dcs is large, then the size of the user in the image is small, and therefore Y1 is large, and vice versa. In our example, the image size is 640 x 480 pixels, and the sum of the three regions is less than or equal to the image height of 480 pixels. Y1 represents the 'Arms up' range and is specified by the interval [y0, y1], with y1 = ycp. Y2 represents the 'Arms at shoulder height' range and is specified by the interval [y1, y2].

In the next phase, the algorithm works mainly on the fixing intervals, where it divides the right-hand slot for the user into three equal parts (X1r = X2r = X3r). The same is true for the left-hand slot (X1l = X2l = X3l) on the abscissa axis, as shown in Fig. 4.
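Putting the previous two paragraphs together, the interval bookkeeping can be sketched as below. It assumes a 640 x 480 frame, takes y1 = ycp as the boundary between the 'Arms up' and 'Arms at shoulder height' ranges, splits the remaining height evenly between Y2 and Y3, and treats X2 as the middle third of each half of the image; the boundary choices beyond y1 = ycp, and the mapping of image halves to the user's sides, are assumptions for illustration only:

```python
def build_intervals(xcp, ycp, width=640, height=480):
    """Split the frame into the Y ranges and the X2 hand strips used by the algorithm.

    Assumptions (illustrative only): y0 = 0 and y1 = ycp; y2 splits the remaining
    height evenly; each half of the image (left/right of xcp) is cut into three
    equal vertical strips, of which only the middle one (X2) is analyzed.
    """
    y0, y1 = 0, ycp                      # Y1: 'Arms up'                -> [y0, y1]
    y2 = y1 + (height - y1) // 2         # Y2: 'Arms at shoulder height' -> [y1, y2]
    y3 = height                          # Y3: 'Arms down'              -> [y2, y3]

    third_r = xcp // 3                   # one half of the image: columns [0, xcp)
    third_l = (width - xcp) // 3         # the other half: columns [xcp, width)
    x2r = (third_r, 2 * third_r)                 # middle strip of the first half
    x2l = (xcp + third_l, xcp + 2 * third_l)     # middle strip of the second half
    return {"Y": (y0, y1, y2, y3), "X2r": x2r, "X2l": x2l}
```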
Fig. 4. (a) The method of recognizing the user's arm movement through the camera's internal algorithm. (b) The threshold area X2r. (c) The threshold area X2l.
The purpose of these splits is to reduce the time required to process the video image in order to recognize the result. The two parts X1l and X3r contain the body, head, torso, and other parts of the user's body that are not important to process and that the algorithm cannot distinguish. Therefore, we only have to analyze the X2l and X2r parts of the image. This represents 2/6 of the image size, which corresponds to reducing more than 80% of the time required to obtain the result; this is especially important since the shooting is via video and requires immediate results. Fig. 4b and Fig. 4c show, respectively, the parts X2r and X2l. These two parts are two small RGB images. The algorithm converts them to two binary images and then searches for the first white pixel coordinates: (xbml, ybml) in Fig. 4c and (xbmr, ybmr) in Fig. 4b. Searching starts from right to left, e.g., from point x2l to point x1l, to find the first white pixel in Fig. 4c. This is unlike Fig. 4b, in which the search direction is from left to right, e.g., from point x1r to point x2r. The point xbml then belongs to [x2l, x1l], which is the interval X2l; moreover, the point xbmr belongs to [x1r, x2r], which is the interval X2r. From the point ybml, we can deduce the gesture of the right arm; likewise, from the point ybmr, we can deduce the gesture of the left arm. If ybmr belongs to Y1, then the hand is up, since that point is in the 'Arms up' range. If ybmr belongs to Y2, then the hand is in the middle, since this point is in the 'Arms at shoulder height' range.
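A hedged sketch of this final step is given below: it assumes the X2r and X2l strips have already been binarized (for example with the skin-color mask sketched earlier), scans the strip column by column in the stated direction, and maps the ordinate of the first white pixel to one of the three Y ranges; the helper names and the tie-breaking within a column (taking the topmost white row) are assumptions:

```python
import numpy as np

def first_white_pixel(binary_strip, left_to_right=True):
    """Return (x, y) of the first white pixel found when scanning the strip
    column by column in the given direction, or None if the strip is empty."""
    cols = range(binary_strip.shape[1])
    if not left_to_right:
        cols = reversed(cols)
    for x in cols:
        rows = np.nonzero(binary_strip[:, x])[0]
        if rows.size:
            return x, int(rows[0])      # topmost white row in this column
    return None

def classify_arm(yb, y1, y2):
    """Map the ordinate of the first white pixel to a gesture label."""
    if yb is None:
        return "no arm detected"
    if yb <= y1:                        # inside Y1
        return "arm up"
    if yb <= y2:                        # inside Y2
        return "arm at shoulder height"
    return "arm down"                   # inside Y3
```

The same two helpers serve both sides; only the scan direction and the strip change.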
Fig. 5. The monitor displays the user present in front of the SCO and shows the result in the small black window at the bottom: (a) hand down; (b) hand in the middle; (c) hand up.