CIRP Annals - Manufacturing Technology: Sichao Liu, Jianjing Zhang, Lihui Wang (1), Robert X. Gao
A R T I C L E   I N F O

Article history:
Available online 23 April 2024

Keywords:
Robot
Assembly
Vision AI

A B S T R A C T

Autonomous robots that understand human instructions can significantly enhance the efficiency of human-robot assembly operations where robotic support is needed to handle unknown objects and/or provide on-demand assistance. This paper introduces a vision AI-based method for human-robot collaborative (HRC) assembly, enabled by a large language model (LLM). Upon 3D object reconstruction and pose establishment through neural object field modelling, a visual servoing-based mobile robotic system performs object manipulation and provides navigation guidance to the mobile robot. The LLM provides text-based logic reasoning and high-level control command generation for natural human-robot interaction. The effectiveness of the presented method is experimentally demonstrated.

© 2024 The Author(s). Published by Elsevier Ltd on behalf of CIRP. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

* Corresponding author. E-mail address: [email protected] (L. Wang).
https://doi.org/10.1016/j.cirp.2024.03.004
1. Introduction

Autonomous robot-driven collaborative assembly promotes human-robot interaction and control for on-demand robot assistance, especially in personalised product assembly scenarios [1]. These scenarios involve operations that cannot be predefined but need to be dynamically adapted to. While it is natural for humans to instruct a robot to pick up a spare part when a component is broken or missing during assembly operations, having the robot not only autonomously recognise the needed part but also parse and decompose human instructions into executable actions to provide assistance has remained a challenge. Establishing the 3D model and 6D pose of an object is the first step toward part recognition and manipulation [2]. In general, existing methods rely on available CAD models, category/instance-level prior knowledge or known camera poses to create the object pose. This is impractical for handling objects that are previously unknown [3]. In recent years, vision artificial intelligence (AI) has been introduced for 3D reconstruction of unknown objects and pose tracking [4]. Recently, neural rendering has been investigated for 3D modelling of unknown products [5]; however, to achieve on-demand assistance, pose tracking has to be integrated with 3D rendering and reconstruction techniques.

To assist a robot in understanding human language commands for assembly, natural language processing (NLP) models enable sentence parsing for cause-and-effect analysis [6]. However, traditional NLP models lack the ability for assembly contextual understanding and high-level text instruction analysis. Recently, large language models (LLMs) have demonstrated capabilities of text understanding and reasoning [7]. As an example, Figure 01 humanoid robots powered by OpenAI's visual-language models can converse, reason and plan their actions as they work [8]. Leveraging LLMs for high-level planning, robot manipulation and code generation has also been investigated [9,10], but without exploring the logic reasoning behind text instructions. With an understanding of text commands, autonomous mobile robots (AMRs) supported by vision and navigation capabilities can achieve motion control, object detection and manipulation when executing assembly tasks [11]. These functions are critical to reliable assembly operations, enabling the right objects to be handled in the right way [12].

This paper presents a vision AI-based HRC assembly technique supported by an LLM and an AMR. A neural object field-based model is presented for accurate 3D reconstruction and 6D pose estimation of objects. The model enables a visual servoing-based autonomous mobile robotic system with object mapping capability to navigate around the assembly environment for object detection, tracking and manipulation. Finally, LLM-driven logic reasoning of text instructions and high-level robot control commands is presented for natural human-robot interactions in assembly.

2. 3D modelling and pose estimate of unknown objects

2.1. Vision AI-based HRC assembly

As shown in Fig. 1, the vision AI-based HRC assembly starts with RGB-D video collection of an object (e.g., a valve cover) along scanning paths, with the output being object frames and masks (a video frame includes a colour and a depth image). The frames and masks serve as the input for training a network to build the 3D model of the object with an optimised pose. The object is subsequently detected by a camera-driven visual servoing system installed on the AMR. Separately, a laser scanner (Lidar) creates a simultaneous localisation and mapping (SLAM) map of the assembly environment along the moving path of the robot, enabling it to navigate safely around the assembly environment. Since the robot does not initially know which objects are to be acted upon, object mapping with labelling is used to create landmarks. To control the robot for task execution, new capabilities of the LLM are explored to reason about and extract the control logic steps behind text instructions issued by a human operator. Finally, high-level control commands with vocabulary-based object indexing and mapping are used for robot motion control and assembly task execution.
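To make the last step concrete, the sketch below shows one way a text instruction could be decomposed by an LLM into vocabulary-indexed commands. It is a minimal illustration only: the call_llm() helper, the command vocabulary and the object map are hypothetical placeholders, not the interfaces used in this work.

```python
# Hypothetical sketch: decomposing a human instruction into high-level robot
# commands with an LLM, then indexing objects against labelled landmarks.
import json

COMMANDS = ["navigate_to", "pick", "place", "hand_over"]              # assumed vocabulary
OBJECT_MAP = {"valve cover": "landmark_01", "gasket": "landmark_02"}  # assumed landmarks

def build_prompt(instruction: str) -> str:
    return (
        "Decompose the assembly instruction into ordered steps. Answer only with "
        'a JSON list of objects with keys "command" (one of '
        + ", ".join(COMMANDS) + ') and "object".\nInstruction: ' + instruction
    )

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-style LLM endpoint returning the raw reply text."""
    raise NotImplementedError

def instruction_to_commands(instruction: str) -> list:
    steps = json.loads(call_llm(build_prompt(instruction)))
    for step in steps:                       # map object names to SLAM landmarks
        step["landmark"] = OBJECT_MAP.get(step["object"])
    return steps

# e.g. "Bring me the valve cover" -> navigate_to / pick / hand_over steps,
# each resolved to landmark_01 on the SLAM map.
```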
2.2. Neural representation of unknown objects

The goal of neural representation of an unknown object is to build its optimal pose estimate for robotic manipulation when the CAD model and instance-level prior information of the object, as well as the camera poses, are not available [3].
As shown in Fig. 2, four modules have been developed to realise 3D reconstruction and 6D pose estimation of unknown objects [4]. Specifically, Module ❶ receives RGB-D video streams of an object from a depth camera and produces object frames and masks by using a segmentation network. Subsequently, pixel-wise dense match features between the current and previous frames, together with their masks, are extracted to generate a coarse pose of the current frame.
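The coarse-pose computation is not spelled out here; a common choice, assumed for illustration, is to back-project the matched pixels to 3D with the depth image and solve a least-squares rigid fit (Kabsch), as sketched below.

```python
# Sketch (assumption): coarse relative pose from matched 3D points obtained by
# back-projecting pixel-wise dense matches with the depth image.
import numpy as np

def rigid_fit(src: np.ndarray, dst: np.ndarray):
    """Rotation R and translation t minimising ||dst - (R @ src.T).T - t||."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Composing (R, t) with the previous frame's pose yields the coarse pose of the
# current frame, which Module 2 then refines against the key frames.
```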
Meanwhile, the current frame and its coarse pose are transmitted to Module ❷ to perform pose comparison with a set of key frames provided by Module ❸, a frame pool that stores frames with informative object features and adds the first frame to set a canonical coordinate system for subsequent frames. If significant feature changes between the current frame and the existing key frames in the pool are detected, online pose graph optimisation is performed to refine and update the pose, and the current frame is added to the frame pool as a new key frame. Otherwise, the current frame is discarded. By iterating over all frames, the key frames and their pose estimates are obtained.
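A minimal sketch of this key-frame bookkeeping is given below; the feature_overlap() measure and the 0.7 threshold are illustrative assumptions rather than the criterion used in the actual frame pool.

```python
# Illustrative key-frame pool (Module 3): keep a frame only if it differs enough
# from the frames already stored. Threshold and overlap measure are assumptions.
from dataclasses import dataclass, field

@dataclass
class Frame:
    features: object          # descriptors of the masked object
    pose: object              # current pose estimate

@dataclass
class KeyFramePool:
    threshold: float = 0.7
    keyframes: list = field(default_factory=list)

    def update(self, frame: Frame, feature_overlap) -> bool:
        if not self.keyframes:                 # first frame fixes the canonical frame
            self.keyframes.append(frame)
            return True
        best = max(feature_overlap(frame, kf) for kf in self.keyframes)
        if best < self.threshold:              # significant feature change detected
            # pose graph optimisation would refine frame.pose here before storage
            self.keyframes.append(frame)
            return True
        return False                           # redundant frame is discarded
```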
Given that real-time neural processing of all frames takes significant computational resources, only key frames are stored in Module ❸, while other frame information is discarded. Next, Module ❹ receives all the posed key frames from Module ❸ as inputs to the neural object field [5]. The training network learns to accumulate information into a consistent 3D representation that captures both the geometry and appearance of the object by using a neural signed distance function (Neural SDF). Finally, the 3D model of the object with a 6D pose is built for robotic manipulation.
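For readers unfamiliar with the representation, a neural SDF is an MLP that maps a 3D point to a signed distance to the object surface, here paired with an appearance output. The PyTorch sketch below illustrates the idea only; it is not the network architecture used in this work.

```python
# Minimal Neural SDF sketch (illustrative architecture, not the paper's network):
# a coordinate MLP returns the signed distance and a colour for each 3D point.
import torch
import torch.nn as nn

class NeuralSDF(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sdf_head = nn.Linear(hidden, 1)    # geometry: signed distance
        self.rgb_head = nn.Linear(hidden, 3)    # appearance: colour

    def forward(self, xyz: torch.Tensor):
        h = self.backbone(xyz)
        return self.sdf_head(h), torch.sigmoid(self.rgb_head(h))

# Training on the posed key frames accumulates observations into this field;
# the object surface is recovered as the zero level set of the signed distance.
```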
The pose estimate of each frame and the neural object field can be expressed as

$P_k = f\{\tilde{P}_k(I_c, I_d, M, C),\ F(k),\ F_N^K\},\quad k \in \mathbb{N},\ P_0 = \mathrm{null}$  (1)

$N_F = R(G, A, F^K, k),\quad k \in \mathbb{N}$  (2)

where R is the object representation function. It takes the object's geometry (G) and appearance (A) as inputs to construct the 3D shape and appearance of the object while adjusting the poses of the key frames. In this study, a recorded video with 708 frames and 229 key frames is used to obtain the 3D model and pose of a valve cover, implemented in a CUDA environment. The outcome of the 3D reconstruction of the valve cover is shown in Figs. 3(b) and (c). The mesh models rendered by point clouds (front & back) provide precise 3D neural rendering with detailed representations of the object's complex structure (highlighted by a red circle in Fig. 3(c)).

Fig. 3. 3D reconstruction and pose estimate for an object (valve cover): (a) object image as ground truth; (b) & (c) mesh models rendered by point cloud (front & back); (d) object's 6D pose with a grasping point (white dot).

With the 3D model created, its fine 6D pose is built simultaneously, as shown in Fig. 3(d), which provides the location and orientation of the object as inputs to robotic grasping. Since the object's pose centre is computed from the visible point cloud and may not represent an appropriate grasping point, the centre of a re-defined coordinate frame, created from the centre and oriented bounding box of the object's mesh model, is selected as the grasping point (indicated by a white dot in Fig. 3(d)). Finally, the object's 6D pose with a proper grasping point is identified and tracked for object grasping.
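One simple way to realise such a re-defined frame, assumed here purely for illustration, is to fit an oriented bounding box to the mesh vertices by principal component analysis and take its centre and axes:

```python
# Illustrative stand-in for the grasping-point selection: centre and axes of a
# PCA-based oriented bounding box fitted to the reconstructed mesh vertices.
import numpy as np

def oriented_box_centre(vertices: np.ndarray):
    """vertices: (N, 3) mesh vertices -> (box centre, 3x3 box axes)."""
    mean = vertices.mean(axis=0)
    centred = vertices - mean
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)   # principal axes
    proj = centred @ Vt.T                                     # points in box frame
    mid = 0.5 * (proj.min(axis=0) + proj.max(axis=0))         # box centre, box frame
    return mean + mid @ Vt, Vt.T                              # back to object frame

# The returned centre serves as the grasping point and the axes define the
# re-defined coordinate frame attached to the tracked 6D pose.
```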
4. System implementation

From the top camera view, the object pose is tracked in real time and passed to a ROS (Robot Operating System)-based motion planner that generates the robot trajectories. The robot arm is controlled to grasp the object at the customised grasping point. This forms a closed loop of visual servoing that establishes object-camera-robot data streams for robot control and object manipulation. It further provides the capability of handling dynamic situations (e.g., moving objects).
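The closed loop can be pictured as a simple position-based servoing cycle; in the sketch below, get_object_pose(), get_gripper_pose() and send_velocity() are hypothetical interfaces standing in for the camera tracker and the ROS motion planner, and the gain and tolerance values are arbitrary.

```python
# Hypothetical closed-loop visual servoing cycle: track the grasping point,
# compute the gripper error, and command a proportional Cartesian velocity.
import time
import numpy as np

GAIN = 0.8          # proportional gain (illustrative)
TOLERANCE = 0.005   # stop within 5 mm of the grasping point

def servo_to_grasp(get_object_pose, get_gripper_pose, send_velocity, rate_hz=30):
    while True:
        error = np.asarray(get_object_pose()) - np.asarray(get_gripper_pose())
        if np.linalg.norm(error) < TOLERANCE:
            send_velocity(np.zeros(3))      # converged: stop and trigger the grasp
            return
        send_velocity(GAIN * error)         # object-camera-robot loop closes here
        time.sleep(1.0 / rate_hz)           # control-loop period
```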