Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting

Ola Shorinwa^⋆, Johnathan Tucker^⋆, Aliyah Smith, Aiden Swann, Timothy Chen,
Roya Firoozi, Monroe Kennedy III, Mac Schwager

^⋆ The co-first authors contributed equally.

Stanford University, Stanford, CA 94305, USA. {shorinwa, jatucker, aliyah1, swann, chengine, rfiroozi, schwager, monroek}@stanford.edu

This work was supported in part by DARPA grant HR001120C0107, NSF Graduate Research Fellowship DGE-1656518 and DGE-2146755, NSF Grant 2220867, and by a gift from Meta. We are grateful for this support. Toyota Research Institute provided funds to support this work.

Abstract

We present Splat-MOVER, a modular robotics stack for open-vocabulary robotic manipulation, which leverages the editability of Gaussian Splatting (GSplat) scene representations to enable multi-stage manipulation tasks. Splat-MOVER consists of: (i) ASK-Splat, a GSplat representation that distills semantic and grasp affordance features into the 3D scene, enabling geometric, semantic, and affordance understanding of 3D scenes; (ii) SEE-Splat, a real-time scene-editing module that uses 3D semantic masking and infilling to visualize the motions of objects that result from robot interactions in the real world, creating a “digital twin” of the evolving environment throughout the manipulation task; and (iii) Grasp-Splat, a grasp generation module that uses ASK-Splat and SEE-Splat to propose affordance-aligned candidate grasps for open-world objects. ASK-Splat is trained in real-time from RGB images in a brief scanning phase prior to operation, while SEE-Splat and Grasp-Splat run in real-time during operation. We demonstrate the superior performance of Splat-MOVER in hardware experiments on a Kinova robot compared to two recent baselines in four single-stage, open-vocabulary manipulation tasks and in four multi-stage manipulation tasks, using the edited scene to reflect changes due to prior manipulation stages, which is not possible with existing baselines. The project page is available at https://splatmover.github.io, and the code for the project will be made available after review.

Index Terms:

Gaussian Splatting, Robotic Manipulation, Scene Editing

I Introduction

Open-world robotic manipulation requires spatial and semantic understanding of the scene in which a robot is operating. In particular, a robot must be able to identify the location and geometry of objects in its environment based on their semantic attributes. However, spatial and semantic scene understanding alone is often inadequate in robotic manipulation. In many cases, it is critical for the robot to know the best place to grasp the object to perform the required task, that is, to be able to detect and localize grasp affordances on the object [1, 2]. Furthermore, for multi-stage manipulation tasks, where the pre-conditions for the next stage are met through the success of a previous stage, it is essential to continually update the 3D scene model to reflect the objects’ motions and rearrangements due to the robot’s previous actions, to have an accurate scene representation that is useful for subsequent stages.

Consequently, in this work, we introduce Splat-MOVER, a modular robotics stack for Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting. Splat-MOVER consists of three modules: (i) a 3D Affordance-and-Semantic Knowledge Gaussian Splatting scene representation (ASK-Splat), (ii) a Scene-Editing-Enabled module for Gaussian Splatting scenes (SEE-Splat), and (iii) a grasp generation module (Grasp-Splat) that uses ASK-Splat and SEE-Splat to plan grasps in multi-stage manipulation tasks. These components together make up Splat-MOVER, illustrated in the Splat-MOVER overview figure. In a brief pre-scanning phase, we train the ASK-Splat model from posed RGB images of the workspace, simultaneously embedding CLIP and grasp affordance features from the images into the 3D scene model. Then, given a natural language prompt of a multi-stage manipulation task, Splat-MOVER uses ASK-Splat to localize and mask the objects from the prompt in the 3D scene, providing 3D initial and target coordinates for the robotic manipulator at each stage of the manipulation task. Subsequently, SEE-Splat edits the ASK-Splat scene, reflecting the dynamic poses of the objects throughout the manipulation task. Finally, Grasp-Splat uses the pre-trained GraspNet model [3] to generate candidate grasps, which are then re-ranked based on their affordance scores from the embedded affordance features in ASK-Splat, to obtain grasp poses for each object, at each stage of the manipulation task. Point-to-point motion planning for the robot arm is accomplished with standard off-the-shelf planning tools, e.g., MoveIt [4].

We showcase Splat-MOVER’s effectiveness through hardware experiments on a Kinova robot, where Splat-MOVER achieves significantly improved success rates across four single-stage open-vocabulary manipulation tasks compared to two recent baseline methods: LERF-TOGO [5] and F3RM [6]. In three single-stage manipulation tasks, Splat-MOVER improves the success rate of LERF-TOGO by a factor of at least $2.4$, while achieving similar success rates in a fourth task ($95\%$ compared to LERF-TOGO’s $100\%$). Likewise, Splat-MOVER improves the success rates of F3RM by a factor ranging from $1.2$ to $3.3$ across the four single-stage manipulation tasks. In addition, we demonstrate the performance of Splat-MOVER in four multi-stage manipulation tasks, where we leverage SEE-Splat to reflect the updates in the scene resulting from prior manipulation stages, a capability absent in existing baseline approaches.

We summarize our contributions below:

(i) We introduce ASK-Splat, a Gaussian Splatting scene model with embedded affordance and semantic features, enabling geometric, semantic, and affordance scene understanding.

(ii) We propose SEE-Splat, a scene-editing module that uses 3D object masks from ASK-Splat to enable real-time editing of Gaussian Splatting scenes, reflecting the motion of objects in the scene due to the robot’s actions.

(iii) We introduce Grasp-Splat, a grasp-generation module that proposes affordance-aligned grasp candidates for specified objects, leveraging the embedded affordance features in ASK-Splat.

These three modules come together to make Splat-MOVER, which takes in a natural-language description of a multi-stage manipulation task, and produces an executable motion plan to accomplish the task.

II Related Work

Open-World Robotic Manipulation. Advances in foundation models [7] (trained on large-scale datasets [8, 9]) have enabled the development of open-world robotic manipulation methods, where robots manipulate objects given natural-language task instructions at runtime, without being trained on those specific objects or tasks. Existing methods fall into one of two categories: (i) (“smart policy”) end-to-end visuo-motor policies based on a pre-trained large vision-language model [10, 11, 8], or (ii) (“smart map”) methods that use a rich 3D scene representation that embeds features from a vision-language foundation model, interfacing with a traditional manipulation planning stack to drive robot motion. Our method is of the “smart map” variety. Two other recent methods also take this general approach: LERF-TOGO [5] and F3RM [6]. LERF-TOGO [5] builds upon the semantic NeRF representation in LERF [12], coupled with GraspNet [3] for grasp generation. Similarly, F3RM [6] uses the NeRF-based distilled feature field from [13] to embed task-relevant features learned from human demonstrations into the 3D scene model, which are used to generate and optimize candidate grasps. We compare against LERF-TOGO and F3RM as baselines.

Despite their effectiveness, these methods still face a number of challenges. LERF-TOGO [5] requires the specification of a grasp location on an object by a human operator or a large language model, which might not be readily available. Likewise, F3RM [6] requires human demonstrations, which might be difficult to collect. Both methods rely on NeRFs as the underlying scene representation, which can be time-consuming to train and is not easily edited to represent changes in the environment (e.g., due to manipulation actions by the robot). In this paper, we embed affordance features from a pre-trained 2D affordance model VRB [14] into the 3D scene, thereby avoiding the need for human demonstrations or human specification of grasp locations. We also build our scene on the Gaussian Splatting representation, which is faster to train and render than NeRF, and easier to edit to reflect object motions. We introduce a novel GSplat scene editing algorithm specifically designed for the tabletop, object-centric edits needed in robotic manipulation.

Language-Embedded NeRFs and Gaussian Splats. The core enabling technology behind the “smart map” variety of open-world manipulation methods is the neural feature field concept, which distills information from 2D data sources into a view-consistent 3D field by back-propagating through a differentiable image renderer. Neural Radiance Field (NeRF) [15] represents one of the earliest high-fidelity instances of this concept, distilling the 2D color information from RGB images into a 3D color and density field. NeRFs have been integrated into a number of robotics tasks, spanning navigation [16, 17], SLAM [18, 19, 20], and manipulation [21, 22]. Recent works have distilled semantic features from vision-language foundation models (such as CLIP [23], DINO [24], or LSeg [25]) into NeRFs [26, 27, 12], yielding a rich visual-semantic 3D scene representation. These methods were designed to highlight 3D semantic relevance to a text query (e.g., MaskCLIP [13] and LERF [12]), or to produce 3D object masks for scene editing (e.g., CLIP-NeRF [27] and Distilled Feature Fields (DFFs) [26]).

Unfortunately, adoption of these rich representations in robotics is still hampered by two key challenges: (i) slow training and rendering times, and (ii) inability to reflect dynamic scenes. Gaussian Splatting [28], a recent photorealistic volumetric scene representation, addresses these fundamental limitations by providing faster training and rendering speeds than NeRFs, while also offering the potential for modeling dynamic scenes through real-time scene editing, as we demonstrate in this paper. Gaussian Splatting represents the environment using 3D Gaussians, coupled with a fast rasterization-based image renderer. Some recent works [29, 30, 31, 32, 33] have explored embedding semantic features into 3D Gaussian Splatting, similarly to our work, but none of these have been used for robotic manipulation.

One key novelty in our method is that we distill grasp affordance features into the 3D field, together with CLIP semantic features. We use the pretrained VRB model [14], which infers a per-pixel grasp affordance metric from 2D RGB images, trained from videos of humans interacting with objects. Other works learn 2D grasp affordance models from still images ([34, 35, 36]) or from human-object interaction points in video data ([37, 38]), but none have distilled these 2D models into 3D fields to aid in robotic grasping. Both the semantic and affordance features in our model are crucial to grasp success, as shown in our experimental studies.

In concurrent work, GaussianGrasper [39] leverages the deep-learned model AnyGrasp [40] to generate robotic grasps from GSplat environments. However, GaussianGrasper does not embed grasp affordances or consider scene editing for multi-stage tasks, as we do. The concurrent work ManiGaussian [41] uses a GSplat to represent the robot arm itself, and predicts optimal arm actions for grasping given this representation. The method is shown in simulation only. Finally, the workshop paper [42] considers a GSplat method for tracking objects during robot manipulation motions, but does not consider planning or distilled affordances, as is our focus here.

III Preliminaries

We introduce notation relevant to this paper, before presenting Gaussian Splatting. We denote a 2D feature embedding as ${f\in\mathbb{R}^{a\times b\times c}}$, where the first two dimensions $(a,b)$ represent the spatial dimensions, while the last dimension $c$ represents the dimension of the embedding. Similarly, we denote 1D feature embeddings as ${f\in\mathbb{R}^{c}}$. We provide a brief review of Gaussian Splatting, which we build upon to obtain an affordance-and-semantic-aware scene representation. We direct interested readers to [28] for a more in-depth discussion of Gaussian Splatting. In Gaussian Splatting, the scene is represented using 3D Gaussians, parameterized by a mean ${\mu\in\mathbb{R}^{3}}$ and covariance ${\Sigma\in\mathbb{R}^{3\times 3}}$. Further, the covariance matrix $\Sigma$ is represented as the product of a scaling (diagonal) matrix ${S\in\mathbb{R}^{3\times 3}}$ and a rotation matrix ${R\in\mathbb{R}^{3\times 3}}$ (given by a quaternion), expressing $\Sigma$ as: ${\Sigma=RSS^{\mathsf{T}}R^{\mathsf{T}}}$. An opacity parameter ${\alpha\in\mathbb{R}}$ and spherical harmonics parameters are assigned to each Gaussian. The spherical harmonics parameters enable the Gaussians to capture the view-dependent properties of the color of each point in space.

Tile-based rasterization with $\alpha$-blending is utilized for rendering the Gaussians. Specifically, the 3D Gaussians are projected to 2D, with the projected covariance matrix given by: ${\Sigma^{\prime}=JW\Sigma W^{\mathsf{T}}J^{\mathsf{T}},}$ where ${J\in\mathbb{R}^{2\times 3}}$ denotes the Jacobian of the projective transformation (approximated by an affine approximation) and ${W\in\mathbb{R}^{3\times 3}}$ denotes the viewing transformation. The color of each pixel $C$ in the rendered image results from the point-based $\alpha$-blending procedure: ${C=\sum_{i=1}^{N}c_{i}\alpha_{i}\prod_{j=1}^{i-1}(1-\alpha_{j}),}$ where $N$ represents the number of Gaussians overlapping the pixel (ordered front to back), $c_{i}$ is computed from the spherical harmonics representation of the color of Gaussian $i$, and $\alpha_{i}$ denotes the opacity of Gaussian $i$ multiplied by its unnormalized pdf (overloading the notation for the opacity parameter). The parameters of each Gaussian are optimized through stochastic gradient descent, leveraging the differentiability of the tile-based rasterization procedure to compute the gradients of the loss function $\mathcal{L}$ with respect to the parameters. In Gaussian Splatting, the loss function $\mathcal{L}$ is a convex combination of the $\ell_{1}$-photometric loss and the structural similarity index measure (SSIM) loss [43], with: ${\mathcal{L}=(1-\lambda)\mathcal{L}_{1}+\lambda\mathcal{L}_{\mathrm{SSIM}}}$.
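To make the two formulas above concrete, the following minimal NumPy sketch builds ${\Sigma=RSS^{\mathsf{T}}R^{\mathsf{T}}}$ from a quaternion and a scale vector, and composites a single pixel with the $\alpha$-blending sum. It is an illustration only, not the tile-based rasterizer of [28]: the function names are hypothetical, the quaternion is assumed to be in $(w,x,y,z)$ order, and the per-Gaussian colors and effective opacities are assumed to be given already sorted front to back.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion in (w, x, y, z) order to a 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(q, scale):
    """Sigma = R S S^T R^T, with S = diag(scale)."""
    R = quat_to_rotmat(q)
    S = np.diag(np.asarray(scale, dtype=float))
    return R @ S @ S.T @ R.T

def alpha_blend(colors, alphas):
    """C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j), front to back."""
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance reaching Gaussian i: product of (1 - alpha_j) over j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    return (alphas * transmittance) @ colors

# Axis-aligned Gaussian (identity rotation).
sigma_identity = covariance([1.0, 0.0, 0.0, 0.0], [0.1, 0.2, 0.3])

# The same Gaussian rotated by 90 degrees about the z-axis.
half = np.pi / 4
sigma_rotated = covariance([np.cos(half), 0.0, 0.0, np.sin(half)], [0.1, 0.2, 0.3])

# Three Gaussians (red, green, blue) overlapping one pixel.
pixel = alpha_blend([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.5, 0.5, 1.0])
```

In this toy case the blending weights are $0.5$, $0.25$, and $0.25$, so the nearest Gaussian dominates the pixel color, and the rotation swaps the variances along the $x$- and $y$-axes.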

IV Affordance-and-Semantic-Knowledge Gaussian Splatting

Given a set of RGB images ${\mathcal{D}=\{\mathcal{I}_{1},\ldots,\mathcal{I}_{N}\}}$, ASK-Splat embeds two kinds of features in the GSplat scene: (1) CLIP features, and (2) grasp affordance features. Using the CLIP features, ASK-Splat generates a 3D heat map in response to an open-vocabulary text prompt, which denotes the semantic relevance of regions in the scene to the prompt. For example, the prompt “apple” will give a 3D heat map that is hottest over the apple, but also somewhat hot over the pear and orange (which are semantically “close” to apple). Although the affordance feature field cannot be queried directly in natural language, it provides information on the graspability of all points in the scene. We use the CLIP features to mask out an object of interest, and the affordance features to highlight graspable locations within the mask. We illustrate these components in Figure 1.
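The query-then-mask logic described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure used in ASK-Splat: relevancy is taken to be plain cosine similarity between per-Gaussian features and a text embedding, the features are toy 3-dimensional vectors rather than CLIP embeddings, and the function names and the threshold value are hypothetical.

```python
import numpy as np

def cosine_relevancy(features, text_embedding):
    """Cosine similarity between each per-Gaussian feature and the text query."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    return f @ t

def most_graspable(mask, affordance, top_k=1):
    """Indices of the top_k highest-affordance Gaussians inside the mask."""
    idx = np.flatnonzero(mask)
    order = np.argsort(-affordance[idx])
    return idx[order[:top_k]]

# Toy scene: two "apple" Gaussians, one "pear" Gaussian, one "table" Gaussian.
features = np.array([
    [1.00, 0.00, 0.0],   # apple
    [0.96, 0.28, 0.0],   # apple (slightly different view)
    [0.80, 0.60, 0.0],   # pear: semantically close to apple
    [0.00, 0.00, 1.0],   # table
])
affordance = np.array([0.2, 0.9, 0.95, 0.1])   # hypothetical grasp scores
query = np.array([1.0, 0.0, 0.0])              # stand-in for the "apple" embedding

relevancy = cosine_relevancy(features, query)  # the 3D "heat map"
mask = relevancy >= 0.9                        # hypothetical relevancy threshold
best = most_graspable(mask, affordance)
```

The pear is warm in the heat map but falls below the threshold, so its high affordance score is ignored and the grasp location is selected from the apple Gaussians only.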

Figure 1: ASK-Splat grounds 2D visual attributes (e.g., color and lighting effects), grasp affordance, and semantic embeddings within a 3D GSplat representation and is trained entirely from RGB images. Leveraging 3D ASK-Splat, SEE-Splat enables open-vocabulary scene-editing via semantic localization of relevant Gaussian primitives in the scene, followed by 3D masking and transformation ${\xi(t)}$ of these Gaussians.

Grounding Language Semantics in 3D GSplats. A naive approach to encode semantic information within a Gaussian Splatting representation would be to assign an additional parameter ${f\in\mathbb{R}^{C}}$ to each Gaussian, representing the semantic feature embedding associated with the Gaussian. However, such an approach introduces significant memory and computation challenges, effectively eliminating the real-time rendering rates of Gaussian Splatting. To preserve the real-time rendering speed of Gaussian Splatting, we learn a lower-dimensional latent space associated with the high-dimensional semantic feature embeddings, and subsequently leverage the lower-dimensional latent features for semantic knowledge distillation. Given an RGB image ${\mathcal{I}\in\mathbb{R}^{H\times W\times 3}}$, we begin by extracting ‘ground-truth’ image embeddings from the vision-language foundation model CLIP, denoted by ${f_{\mathrm{gt}}\in\mathbb{R}^{H_{f}\times W_{f}\times C}}$, using MaskCLIP [13]. We note that although the image embeddings from [13] are sometimes described as dense, the resulting image embeddings for the original input image are essentially patch-based, with ${H_{f}<H}$ and ${W_{f}<W}$. Further, we note that the dimension of $f_{\mathrm{gt}}$ (denoted by $C$) depends on the CLIP model used, with common values ranging between $512$ (for the CLIP-ViT models) and $1024$ (for the CLIP-ResNet models).

Subsequently, we learn a lower-dimensional encoding of the semantic embeddings using an autoencoder, depicted in Figure 1. The autoencoder consists of an encoder ${g_{\phi}^{\mathrm{enc}}:\mathbb{R}^{H_{f}\times W_{f}\times C}\mapsto\mathbb{R}^{H_{f}\times W_{f}\times l}}$ with parameters $\phi$, mapping semantic embeddings of dimension $C$ from the semantic feature space to the semantic latent space, and a decoder ${g_{\theta}^{\mathrm{dec}}:\mathbb{R}^{H_{f}\times W_{f}\times l}\mapsto\mathbb{R}^{H_{f}\times W_{f}\times C}}$ with parameters $\theta$, mapping the semantic latent embeddings of dimension $l$ back to the semantic feature space, where ${l=3\ll C}$. We distill the latent CLIP features into the Gaussian Splatting representation by augmenting the attributes of each Gaussian with an additional parameter ${f_{s}\in\mathbb{R}^{l}}$, denoting the semantic feature of the Gaussian, which is optimized via gradient descent. We describe the distillation and optimization procedures in greater detail in Appendix I.
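The effect of the compression from $C$ to $l=3$ dimensions can be illustrated with a linear stand-in for the autoencoder. The sketch below is not the learned encoder and decoder described above (whose architecture is given in Appendix I): it swaps in a closed-form linear autoencoder obtained from PCA, and it uses synthetic features that lie near a 3-dimensional subspace in place of real CLIP embeddings. All names are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(features, latent_dim=3):
    """Fit a linear encoder/decoder pair with PCA. features has shape (N, C)."""
    mean = features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    basis = vt[:latent_dim]                      # shape (l, C), orthonormal rows

    def encode(f):
        return (f - mean) @ basis.T              # R^C -> R^l

    def decode(z):
        return z @ basis + mean                  # R^l -> R^C

    return encode, decode

rng = np.random.default_rng(0)
C, l, n_patches = 512, 3, 1000                   # n_patches plays the role of H_f * W_f

# Synthetic stand-in for flattened patch embeddings f_gt of shape (H_f * W_f, C).
features = rng.normal(size=(n_patches, l)) @ rng.normal(size=(l, C))
features += 0.01 * rng.normal(size=(n_patches, C))

encode, decode = fit_linear_autoencoder(features, latent_dim=l)
latent = encode(features)                        # what each Gaussian would store
reconstruction = decode(latent)                  # recovered C-dimensional features
relative_error = np.linalg.norm(reconstruction - features) / np.linalg.norm(features)
```

With $C=512$ and $l=3$, each Gaussian stores 3 values instead of 512 for its semantic feature, which is the memory saving that motivates the latent space. Real CLIP features are not exactly low-rank, so a nonlinear learned autoencoder would be expected to reconstruct them better than this linear stand-in.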

Grounding Affordance in 3D Gaussian Splatting. ASK-Splat embeds object-specific grasp affordances directly within the 3D Gaussian Splatting environment, yielding a heat map over objects in the scene relating to their graspability. We distill visual grasp affordances from a vision-affordance foundation model trained on large human-object interaction datasets, VRB [14], into the scene representation, although we note that other vision-affordance foundation models can also be used. This endows robots with the ability to identify the parts of an object that a human is more likely to grasp, capturing common-sense human knowledge and experience in interacting with objects.

We obtain the ground-truth affordance scores by evaluating VRB on the training dataset $\mathcal{D}$, and leverage the 3D Gaussian primitives to embed the 2D visual grasp affordance scores in 3D. By augmenting the attributes of each Gaussian in the scene with an additional parameter ${\beta\in\left[0,1\right]}$, representing the score of the Gaussian affording the task of grasping, we ground the grasp affordances within the geometric primitives representing the occupied regions of the scene. We enforce the box constraints on $\beta$ using the sigmoid activation function, which also provides smooth gradients during the training procedure. We provide additional details in Appendix I.

V Scene-Editing-Enabled Gaussian Splatting

SEE-Splat consists of three components: (i) a semantic component utilizing natural-language queries to generate a relevancy map based on semantic similarity; (ii) a Gaussian masking module that generates a sparse point cloud of the relevant objects, comprising all semantically-relevant Gaussians; and (iii) a scene transformation component that edits the 3D scene by inserting, removing, or modifying the geometric, spatial, or visual properties of the Gaussians. In Figure 1, we illustrate each component of SEE-Splat by editing a real-world cooking scene to visualize the action of moving the saucepot from the table to the electric stove, showing localization of the saucepot, extraction of the semantically-relevant Gaussians, and transformation of the relevant Gaussians. Given a natural language query, we generate a semantic similarity heatmap of the object within ASK-Splat, which is used to mask the Gaussians belonging to the object, given a threshold for semantic relevancy. To edit ASK-Splat, SEE-Splat applies a transform ${\xi:\mathcal{G}_{s}\mapsto\mathcal{G}_{s}}$ to the relevant Gaussians to update their spatial attributes, where $\mathcal{G}_{s}$ represents the space of the Gaussian primitives. Together, these components enable robots to visualize the effects of their interactions with other objects in a virtual 3D scene, prior to executing these actions in the real world. Essentially, SEE-Splat provides a digital twin of the scene. Moreover, provided sensor feedback is available, SEE-Splat enables real-time visualization of the real world in a virtual environment, enabling the Gaussian primitives to accurately reflect the real-time geometry and visual properties of objects in the real world. We provide a more detailed discussion of each component in Appendix II.
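As one concrete instance of the transform $\xi$, the sketch below applies a rigid-body motion to the masked Gaussians only: means are rotated and translated, and covariances are rotated as ${\Sigma\mapsto Q\Sigma Q^{\mathsf{T}}}$. Restricting $\xi$ to a rigid transform is an assumption made for illustration, the function and variable names are hypothetical, and the sketch does not rotate the spherical harmonics coefficients, so view-dependent color is left unchanged.

```python
import numpy as np

def transform_gaussians(means, covs, mask, rotation, translation):
    """Rigidly move the masked Gaussians; leave all other Gaussians untouched.

    means: (N, 3), covs: (N, 3, 3), mask: (N,) bool,
    rotation: (3, 3) rotation matrix Q, translation: (3,) vector.
    """
    new_means = means.copy()
    new_covs = covs.copy()
    # mu' = Q mu + t, written for row vectors.
    new_means[mask] = means[mask] @ rotation.T + translation
    # Sigma' = Q Sigma Q^T, broadcast over the masked Gaussians.
    new_covs[mask] = rotation @ covs[mask] @ rotation.T
    return new_means, new_covs

# Three Gaussians; the first and third belong to the object being moved.
means = np.array([[1.0, 0.0, 0.0], [5.0, 5.0, 0.0], [2.0, 0.0, 0.0]])
covs = np.tile(np.diag([0.01, 0.04, 0.09]), (3, 1, 1))
mask = np.array([True, False, True])

# Rotate by 90 degrees about the z-axis, then shift by one unit along x.
Q = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t = np.array([1.0, 0.0, 0.0])

moved_means, moved_covs = transform_gaussians(means, covs, mask, Q, t)
```

Because the edit only rewrites attributes of the selected primitives, no retraining is needed, which is what makes this kind of edit cheap for an explicit representation such as Gaussian Splatting.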

VI Affordance-Aligned Grasp Generation

Grasp-Splat utilizes the dense point cloud of the object generated by SEE-Splat to propose grasp configurations for grasping the object, while harnessing the grasp affordance knowledge in ASK-Splat to identify generally more stable grasp configurations associated with the specified object. The proposed grasps are generated from the point cloud using the deep-learned model GraspNet [3], which estimates 6D grasp poses for a parallel-jaw gripper from a point cloud of the object, along with estimated grasp-quality scores associated with these grasp poses. We note that the grasps generated by GraspNet are not always ideal. Consequently, Grasp-Splat executes a grasp selection procedure to identify more-promising candidate grasps. We hypothesize that leveraging the grasp affordance of each part of the object when generating candidate grasps could be essential to identifying better candidate grasps. As a result, we introduce the grasp metric ${\nu:\mathrm{SE}(3)\mapsto\mathbb{R}}$, which computes the grasp affordance at a specified grasp pose ${X\in\mathrm{SE}(3)}$. Grasp-Splat ranks the candidate grasps generated by GraspNet based on the grasp scores given by $\nu(X)$, leveraging the affordance associated with each grasp pose to identify grasp configurations that are more likely to succeed, as depicted in Figure 2. Moreover, since GraspNet does not consider the position of the robot relative to the object, the proposed grasps might require post-processing to account for the relative position of the robot. We elaborate more on our grasp-selection procedure and its application to multi-stage robotic manipulation in Appendix III.
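A re-ranking step of this kind can be sketched as follows. The exact definition of $\nu$ is given in Appendix III and is not reproduced here; as a stand-in, this sketch assumes $\nu(X)$ is the mean affordance $\beta$ of the $k$ Gaussians nearest to the translation component of $X$, which ignores grasp orientation. The candidate poses are hand-written placeholders rather than GraspNet outputs, and all names are hypothetical.

```python
import numpy as np

def grasp_affordance(pose, means, beta, k=5):
    """Assumed nu(X): mean affordance of the k Gaussians nearest the grasp position."""
    position = pose[:3, 3]                         # translation part of X in SE(3)
    dists = np.linalg.norm(means - position, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(beta[nearest].mean())

def rank_grasps(poses, means, beta, k=5):
    """Return candidate indices sorted by descending affordance, and the scores."""
    scores = np.array([grasp_affordance(X, means, beta, k) for X in poses])
    return np.argsort(-scores), scores

def pose_at(x, y, z):
    """Homogeneous transform with identity rotation at the given position."""
    X = np.eye(4)
    X[:3, 3] = [x, y, z]
    return X

# Toy object along the x-axis: a "handle" (x <= 1) with high affordance
# and a "body" (x > 1) with low affordance.
xs = np.linspace(0.0, 3.0, 31)
means = np.stack([xs, np.zeros_like(xs), np.zeros_like(xs)], axis=1)
beta = np.where(xs <= 1.0, 0.9, 0.1)

candidates = [pose_at(2.5, 0.0, 0.0), pose_at(0.5, 0.0, 0.0)]
order, scores = rank_grasps(candidates, means, beta)
```

Here the second candidate, located on the handle, is ranked first even though it was proposed second, which mirrors the intent of re-ranking grasp proposals by affordance rather than by the proposal order.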

TABLE I: Grasping success rate and the percentage of grasps in the affordance region (AGSR) of each object for LERF-TOGO [5], F3RM [6], and Splat-MOVER across $20$ feasible trials are presented in the first column. In addition, we present the pick and place success rates for the two-stage manipulation task across four scenarios where each scenario has two pick and place stages. From top to bottom: *Cooking* task, where a robot is asked to move a saucepan to an electric burner, followed by placing a fruit in the saucepan; *Chopping* task, where a robot is asked to move a knife to a chopping board, followed by placing a fruit next to the knife; *Cleaning* task, where a robot is asked to move a cleaning spray into a bin, followed by placing a sponge next to the bin; and *Workshop* task, where a robot is asked to move a power drill to a work mat, followed by moving a wooden block next to the drill. The top-performing stat is shown in bold.

VII Hardware Evaluations

We compare Splat-MOVER to the existing open-vocabulary robotic manipulation methods LERF-TOGO [5] and F3RM [6] in four tasks: the Cooking task, Chopping task, Cleaning task, and Workshop task, illustrated in Table I, executed on a Kinova robot. We utilize the publicly-available implementation of LERF-TOGO provided by the authors. Since F3RM requires collecting human demonstrations in virtual-reality, which we could not provide in our evaluations, we implement F3RM and utilize GraspNet for grasp generation.

Since the NeRF-based methods LERF-TOGO and F3RM are not amenable to scene-editing, we limit our comparisons to the first stage of the manipulation task, where we consider grasping a saucepan, knife in a knife guard, cleaning spray, and power drill. In the hardware experiments on the robot, we describe a candidate grasp as being feasible if the MoveIt planner successfully finds a plan to execute the grasp. We execute $20$ feasible trials for each object, and in Table I, provide the grasping success rate of each method, as well as the percentage of grasps that lie within the affordance region of the object (AGSR), where the affordance region of each object is defined as follows: the entire handle of the saucepan, the entire knife with the guard on, the region of the cleaning spray excluding its cap, and the middle region of the power drill below its top compartment and above its battery compartment. From Table I, Splat-MOVER achieves the best grasping success rates and the highest percentage of grasps in the affordance region in grasping the saucepan, knife, and cleaning spray. LERF-TOGO attains a perfect success rate in grasping the power drill; however, LERF-TOGO always grasped the top of the drill, outside its affordance region, shown in Appendix IV. In contrast, Splat-MOVER achieves a $95\%$ success rate, grasping the object within its affordance region.

In addition, we evaluate the pick-and-place success rate of all the methods in each manipulation task, where the place success rate is conditioned on the number of successful trials in picking the object. The publicly-available implementation of LERF-TOGO does not support the specification of a place location. Hence, we do not evaluate its place success rate. In Table I, we present the pick-and-place success rates in the four tasks. In the Cooking task, Splat-MOVER achieves a perfect success rate in picking up the saucepan, with a place success rate of $60\%$. We note that the place step involves placing the saucepan on an electric burner, which is quite challenging, potentially explaining the lower place success rate. Nonetheless, Splat-MOVER outperforms LERF-TOGO and F3RM in the first stage of the task, the only stage of the task that both methods can perform. We present additional results, along with details on the experimental setup, in Appendix IV.

VIII Conclusion

We present Splat-MOVER, a robotics stack for multi-stage open-vocabulary robotic manipulation, consisting of three modules: ASK-Splat, SEE-Splat, and Grasp-Splat. ASK-Splat enables semantic and affordance queries via natural-language interaction to identify relevant objects within 3D scenes, while SEE-Splat provides real-time, dynamic scene editing, enabling visualization of the evolution of the real-world scene due to the robot’s interaction with objects within the scene. Grasp-Splat builds upon these two modules for affordance-aware grasp generation, necessary for effective multi-stage robotic manipulation. We demonstrate the effectiveness of Splat-MOVER in real-world experiments in comparison to two recent baseline methods. Compared to the existing works, Splat-MOVER endows robots with the unique capability for multi-stage, open-vocabulary manipulation with minimal human inputs, by leveraging distilled grasp affordance knowledge and real-time dynamic scene editing, which are essential to multi-stage manipulation.

IX Limitations and Future Work

We present a few limitations of Splat-MOVER and provide directions for future work. ASK-Splat distills grasp affordance knowledge from foundation models into 3D Gaussian Splatting scenes. The effectiveness of the distilled grasp affordances and the generalization capability of our model therefore depend heavily on those of the affordance foundation model. For example, when trained using the VRB foundation model, an ASK-Splat scene may not generalize well to objects that differ markedly from those seen in the EPIC-KITCHENS dataset. Future work will examine enhancing the generalization capability of ASK-Splat by training and employing diverse cross-environment, cross-task vision-affordance foundation models, rather than relying on VRB, which is specifically trained for kitchen tasks. Further, in its current form, the distilled visual grasp affordance does not depend on the orientation of the candidate grasp, which prevents Grasp-Splat from fully harnessing grasp affordances to identify better grasp configurations in $\mathrm{SE}(3)$. Future work will consider methods for extending the computation of grasp affordance to $\mathrm{SE}(3)$. Moreover, future work will seek to integrate fast online Gaussian Splatting into ASK-Splat, eliminating the need for a brief scanning phase of the scene prior to training, in addition to speeding up the training procedure. Additionally, we will seek to integrate sensor feedback into SEE-Splat to enable closed-loop, real-time scene editing, improving the accuracy of the scene representation. By closing the scene-editing loop with sensor feedback, we can ensure that our 3D scene representation updates dynamically in response to unexpected real-world events. For example, if the pan slips out of the robot gripper, SEE-Splat would be able to promptly reflect this change so that the robot’s plans can be modified accordingly.

References

[1] J. J. Gibson, “The ecological approach to the visual perception of pictures,” Leonardo, vol. 11, no. 3, pp. 227–235, 1978.
[2] D. A. Norman, The psychology of everyday things. Basic books, 1988.
[3] H.-S. Fang, C. Wang, M. Gou, and C. Lu, “Graspnet-1billion: A large-scale benchmark for general object grasping,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11444–11453.
[4] S. Chitta, I. Sucan, and S. Cousins, “MoveIt! [ROS topics],” IEEE Robotics & Automation Magazine, vol. 19, no. 1, pp. 18–19, 2012.
[5] A. Rashid, S. Sharma, C. M. Kim, J. Kerr, L. Y. Chen, A. Kanazawa, and K. Goldberg, “Language embedded radiance fields for zero-shot task-oriented grasping,” in 7th Annual Conference on Robot Learning (CoRL). PMLR, 2023, pp. 178–200.
[6] W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled feature fields enable few-shot language-guided manipulation,” in 7th Annual Conference on Robot Learning (CoRL), 2023.
[7] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman et al., “Foundation models in robotics: Applications, challenges, and the future,” arXiv preprint arXiv:2312.07843, 2023.
[8] Open X-Embodiment Collaboration, A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar et al., “Open X-Embodiment: Robotic learning datasets and RT-X models,” https://arxiv.org/abs/2310.08864, 2023.
[9] K. Somasundaram, J. Dong, H. Tang, J. Straub, M. Yan, M. Goesele, J. J. Engel, R. De Nardi, and R. Newcombe, “Project aria: A new tool for egocentric multi-modal ai research,” arXiv preprint arXiv:2308.13561, 2023.
[10] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, Y. Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J. Quiambao, K. Rao, M. Ryoo, G. Salazar, P. Sanketi, K. Sayed, J. Singh, S. Sontakke, A. Stone, C. Tan, H. Tran, V. Vanhoucke, S. Vega, Q. Vuong, F. Xia, T. Xiao, P. Xu, S. Xu, T. Yu, and B. Zitkovich, “Rt-1: Robotics transformer for real-world control at scale,” in arXiv preprint arXiv:2212.06817, 2022.
[11] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Pertsch, K. Rao, K. Reymann, M. Ryoo, G. Salazar, P. Sanketi, P. Sermanet, J. Singh, A. Singh, R. Soricut, H. Tran, V. Vanhoucke, Q. Vuong, A. Wahid, S. Welker, P. Wohlhart, J. Wu, F. Xia, T. Xiao, P. Xu, S. Xu, T. Yu, and B. Zitkovich, “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in arXiv preprint arXiv:2307.15818, 2023.
[12] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, “LERF: Language embedded radiance fields,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 19 729–19 739.
[13] C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from CLIP,” in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 696–712.
[14] S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak, “Affordances from human videos as a versatile representation for robotics,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 13 778–13 790.
[15] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
[16] M. Adamkiewicz, T. Chen, A. Caccavale, R. Gardner, P. Culbertson, J. Bohg, and M. Schwager, “Vision-only robot navigation in a neural radiance world,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4606–4613, 2022.
[17] T. Chen, P. Culbertson, and M. Schwager, “Catnips: Collision avoidance through neural implicit probabilistic scenes,” IEEE Transactions on Robotics, vol. 40, pp. 2712–2728, 2024.
[18] A. Rosinol, J. J. Leonard, and L. Carlone, “Nerf-slam: Real-time dense monocular slam with neural radiance fields,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 3437–3444.
[19] E. Sucar, S. Liu, J. Ortiz, and A. J. Davison, “imap: Implicit mapping and positioning in real-time,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6229–6238.
[20] J. Yu, J. E. Low, K. Nagami, and M. Schwager, “Nerfbridge: Bringing real-time, online neural radiance field training to robotics,” arXiv preprint arXiv:2305.09761, 2023.
[21] J. Ichnowski, Y. Avigal, J. Kerr, and K. Goldberg, “Dex-NeRF: Using a neural radiance field to grasp transparent objects,” in Conference on Robot Learning (CoRL), 2020.
[22] J. Kerr, L. Fu, H. Huang, Y. Avigal, M. Tancik, J. Ichnowski, A. Kanazawa, and K. Goldberg, “Evo-nerf: Evolving nerf for sequential robot grasping of transparent objects,” in 6th annual conference on robot learning, 2022.
[23] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (ICML). PMLR, 2021, pp. 8748–8763.
[24] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 2021, pp. 9630–9640. [Online]. Available: https://doi.org/10.1109/ICCV48922.2021.00951
[25] B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven semantic segmentation,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=RriDjddCLN
[26] S. Kobayashi, E. Matsumoto, and V. Sitzmann, “Decomposing nerf for editing via feature field distillation,” in Advances in Neural Information Processing Systems, vol. 35, 2022. [Online]. Available: https://arxiv.org/pdf/2205.15585.pdf
[27] C. Wang, M. Chai, M. He, D. Chen, and J. Liao, “CLIP-NeRF: Text-and-image driven manipulation of neural radiance fields,” in CVPR, 2022, pp. 3835–3844.
[28] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D Gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, pp. 1–14, 2023.
[29] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “Langsplat: 3d language gaussian splatting,” arXiv preprint arXiv:2312.16084, 2023.
[30] X. Hu, Y. Wang, L. Fan, J. Fan, J. Peng, Z. Lei, Q. Li, and Z. Zhang, “Semantic anything in 3d gaussians,” arXiv preprint arXiv:2401.17857, 2024.
[31] S. Zhou, H. Chang, S. Jiang, Z. Fan, Z. Zhu, D. Xu, P. Chari, S. You, Z. Wang, and A. Kadambi, “Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields,” arXiv preprint arXiv:2312.03203, 2023.
[32] X. Zuo, P. Samangouei, Y. Zhou, Y. Di, and M. Li, “Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding,” arXiv preprint arXiv:2401.01970, 2024.
[33] G. Liao, J. Li, Z. Bao, X. Ye, J. Wang, Q. Li, and K. Liu, “Clip-gs: Clip-informed gaussian splatting for real-time and view-consistent 3d semantic understanding,” arXiv preprint arXiv:2404.14249, 2024.
[34] H. O. Song, M. Fritz, D. Goehring, and T. Darrell, “Learning to detect visual grasp affordance,” IEEE Transactions on Automation Science and Engineering, vol. 13, no. 2, pp. 798–809, 2015.
[35] P. Ardón, E. Pairet, R. P. Petrick, S. Ramamoorthy, and K. S. Lohan, “Learning grasp affordance reasoning through semantic relations,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4571–4578, 2019.
[36] N. Yamanobe, W. Wan, I. G. Ramirez-Alpizar, D. Petit, T. Tsuji, S. Akizuki, M. Hashimoto, K. Nagata, and K. Harada, “A brief review of affordance in robotic manipulation research,” Advanced Robotics, vol. 31, no. 19-20, pp. 1086–1101, 2017.
[37] T. Nagarajan, C. Feichtenhofer, and K. Grauman, “Grounded human-object interaction hotspots from video,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8688–8697.
[38] M. Goyal, S. Modi, R. Goyal, and S. Gupta, “Human hands as probes for interactive object understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 3293–3303.
[39] Y. Zheng, X. Chen, Y. Zheng, S. Gu, R. Yang, B. Jin, P. Li, C. Zhong, Z. Wang, L. Liu et al., “Gaussiangrasper: 3d language gaussian splatting for open-vocabulary robotic grasping,” arXiv preprint arXiv:2403.09637, 2024.
[40] H.-S. Fang, C. Wang, H. Fang, M. Gou, J. Liu, H. Yan, W. Liu, Y. Xie, and C. Lu, “Anygrasp: Robust and efficient grasp perception in spatial and temporal domains,” IEEE Transactions on Robotics (T-RO), 2023.
[41] G. Lu, S. Zhang, Z. Wang, C. Liu, J. Lu, and Y. Tang, “Manigaussian: Dynamic gaussian splatting for multi-task robotic manipulation,” arXiv preprint arXiv:2403.08321, 2024.
[42] Y. Li and D. Pathak, “Object-aware gaussian splatting for robotic manipulation,” in ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation.
[43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
[44] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price et al., “Scaling egocentric vision: The epic-kitchens dataset,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 720–736.
[45] L. Medeiros, “Language Segment-Anything,” https://github.com/luca-medeiros/lang-segment-anything, 2023.
[46] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, “Dynamic head: Unifying object detection heads with attentions,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7373–7382.
[47] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213–229.
[48] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” arXiv preprint arXiv:2203.03605, 2022.
[49] M. Tancik, E. Weber, R. Li, B. Yi, T. Wang, A. Kristoffersen, J. Austin, K. Salahi, A. Ahuja, D. McAllister, A. Kanazawa, and E. Ng, “Nerfstudio: A framework for neural radiance field development,” in SIGGRAPH, 2023.
[50] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[51] D. Coleman, I. A. Şucan, S. Chitta, and N. Correll, “Reducing the barrier to entry of complex robotic software: a moveit! case study,” Journal of Software Engineering for Robotics, vol. 5(1), p. 3–16, 2014.

Appendix A Affordance-and-Semantic-Knowledge Gaussian Splatting

Here, we provide additional details of the distilled knowledge in ASK-Splat.

A-A Grounding Language Semantics in $3$ D Gaussian Splatting

In this work, we utilize feedforward MLPs in defining the encoder and decoder with ${l=3\ll C}$ . However, we note that larger values of $l$ generally result in more expressive semantic scene representations, at the expense of increased memory and rendering costs. We train the autoencoder with the loss function $\mathcal{L}_{g}$ , given by:

$$\begin{aligned}\mathcal{L}_{g}={}&\kappa_{g}\sum_{i=1}^{\lvert\mathcal{D}\rvert}\left\|g_{\theta}^{\mathrm{dec}}(g_{\phi}^{\mathrm{enc}}(f_{\mathrm{gt},i}))-f_{\mathrm{gt},i}\right\|_{2}^{2}\\&+\frac{1}{\lvert\mathcal{D}\rvert}\sum_{i=1}^{\lvert\mathcal{D}\rvert}\left(1-\psi\left(g_{\theta}^{\mathrm{dec}}(g_{\phi}^{\mathrm{enc}}(f_{\mathrm{gt},i})),f_{\mathrm{gt},i}\right)\right),\end{aligned}\tag{1}$$

where ${g_{\theta}^{\mathrm{dec}}(g_{\phi}^{\mathrm{enc}}(\cdot))}$ represents the composition of the encoder and decoder that outputs the reconstruction of its inputs, ${\psi:\mathbb{R}^{U\times V\times C}\times\mathbb{R}^{U\times V\times C}\mapsto\mathbb{R}}$ denotes the cosine-similarity function (where we note that $\psi$ applies to inputs of arbitrary height and width), and $\mathcal{D}$ denotes the dataset of images used in training the Gaussian Splatting scene, with $f_{\mathrm{gt},i}$ denoting the ground-truth semantic features of image $i$. The first term in (1) represents the mean-squared-error (MSE) reconstruction loss with ${\kappa_{g}\in\mathbb{R}_{++}}$ denoting the constant associated with this term, while the second term represents the cosine-similarity loss between the ground-truth embeddings and the reconstruction.
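As a rough illustration of (1), the following sketch evaluates the two terms on plain Python lists. It assumes that each feature map has been flattened into a single vector, so that $\psi$ reduces to the cosine similarity between two vectors; the function names and the default value of `kappa_g` are illustrative assumptions and are not taken from the Splat-MOVER codebase.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def autoencoder_loss(reconstructions, targets, kappa_g=1.0):
    """Evaluate the two terms of (1).

    reconstructions[i] stands for dec(enc(f_gt_i)) and targets[i] for f_gt_i,
    both flattened to 1-D lists (an assumption made for this sketch).
    """
    n = len(targets)
    # First term: weighted sum of squared reconstruction errors.
    squared_error = sum(
        sum((r - t) ** 2 for r, t in zip(rec, tgt))
        for rec, tgt in zip(reconstructions, targets)
    )
    # Second term: average cosine dissimilarity over the dataset.
    cosine_term = sum(
        1.0 - cosine_similarity(rec, tgt)
        for rec, tgt in zip(reconstructions, targets)
    ) / n
    return kappa_g * squared_error + cosine_term
```

A perfect reconstruction drives both terms to zero, while a reconstruction orthogonal to its target incurs the full cosine penalty of one in addition to the squared error.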

Given a trained encoder, we map the ground-truth image embeddings from CLIP to the lower-dimensional latent space and distill the lower-dimensional embeddings into the Gaussian Splatting representation. We assign a semantic feature ${f_{s}\in\mathbb{R}^{l}}$ to each Gaussian. To render $2$ D semantic feature maps of the scene, we utilize the same tile-based rasterization procedure presented in [28], culling $3$ D Gaussians whose $99\%$ confidence ellipsoids do not intersect the view frustum associated with the pose of the camera. We optimize the semantic feature parameters using the loss function:

$$\begin{aligned}\mathcal{L}_{s}={}&\kappa_{s}\sum_{i=1}^{\lvert\mathcal{D}\rvert}\left\|\hat{\mathcal{I}}_{i}^{s}-g_{\phi}^{\mathrm{enc}}(f_{\mathrm{gt},i})\right\|_{2}^{2}\\&+\frac{1}{\lvert\mathcal{D}\rvert}\sum_{i=1}^{\lvert\mathcal{D}\rvert}\left(1-\psi\left(\hat{\mathcal{I}}_{i}^{s},g_{\phi}^{\mathrm{enc}}(f_{\mathrm{gt},i})\right)\right),\end{aligned}\tag{2}$$

where ${\hat{\mathcal{I}}_{i}^{s}\in\mathbb{R}^{H\times W\times l}}$ denotes the $2$ D semantic feature map rendered from the Gaussian Splats and ${\kappa_{s}\in\mathbb{R}_{++}}$ denotes the constant term in the MSE loss, given by the first component of $\mathcal{L}_{s}$. Although not explicitly stated in (2), we resize the output of $g_{\phi}^{\mathrm{enc}}$ using bilinear interpolation to obtain a ground-truth semantic map of compatible dimensions.

Given a good initialization of the Gaussians (e.g., when the sparse point cloud from structure-from-motion is utilized in initializing the Gaussians), the semantic feature parameters associated with each Gaussian can be trained simultaneously with other spatial and visual-related parameters associated with the Gaussians, along with the autoencoder’s parameters. Nonetheless, empirical evaluations suggest that a sequential training procedure in which the semantic parameters are trained after the non-semantic parameters of the Gaussians have been trained yields better localized semantic feature maps. We hypothesize that this observation may result from more consistent grounding of the semantic features.

Now, we present our approach to generating semantic feature maps of the scene given a natural-language query. We compute the text embedding of the query using CLIP, which we use in evaluating the similarity between the specified query and the objects in the scene. We utilize the cosine-similarity metric, which is widely used in prior work [23, 12]. Consistent with prior work, we allow for the specification of negative queries to help distinguish between dissimilar objects and the object of interest described by a positive query. However, we note that, in practice, a positive query suffices without negative queries. We compute the embeddings of each item in the sets of negative and positive queries, and subsequently compute the cosine similarity between the predicted semantic feature given by the Gaussian Splats and each query in these sets. Finally, the similarity score between a feature point $p$ and the positive query is given by:

$$\mathrm{sim}(\mathcal{Q}_{s},\mathcal{Q}_{s}^{\prime})=\min_{i\in[\lvert\mathcal{Q}_{s}^{\prime}\rvert]}\gamma(p,\nu_{a}({\mathcal{Q}_{s}}),\nu_{b}({q_{s,i}^{\prime}})),\tag{3}$$

where $\mathcal{Q}_{s}$ denotes the set of positive queries, ${\mathcal{Q}_{s}^{\prime}=\{{q_{s,i}^{\prime}},\ \forall i\in[\lvert\mathcal{Q}_{s}^{\prime}\rvert]\}}$ denotes the set of negative queries, ${\nu_{a}:\mathcal{S}\mapsto\mathbb{R}^{C}}$ computes the average semantic embedding of a set of text prompts ${\bar{s}\in\mathcal{S}}$, ${\nu_{b}(q_{s,i}^{\prime})}$ represents the semantic embedding of the negative query $q_{s,i}^{\prime}$, and ${\gamma:\mathbb{R}^{3}\times\mathbb{R}^{C}\times\mathbb{R}^{C}\mapsto\mathbb{R}}$ represents the pairwise softmax function over the positive query embedding and the $i$ th negative query embedding at the $3$ D feature point, outputting the probability associated with the positive query embedding. In general, when rendering the semantic similarity maps, we apply a threshold of $0.5$ to the similarity score computed in (3) to distinguish feature points associated with the query from dissimilar ones.
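The scoring rule in (3) can be sketched in a few lines of Python. The sketch assumes that the semantic feature at the point $p$ and all query embeddings are already available as vectors in the same space; the helper names are hypothetical, and the softmax is taken directly over the two cosine similarities without any temperature scaling, which is an assumption of this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_embedding(embeddings):
    """Average embedding of a set of prompts (the role of nu_a in (3))."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def pairwise_softmax(feature, positive, negative):
    """Probability assigned to the positive query in a two-way softmax."""
    s_pos = cosine(feature, positive)
    s_neg = cosine(feature, negative)
    shift = max(s_pos, s_neg)  # subtract the maximum for numerical stability
    e_pos = math.exp(s_pos - shift)
    e_neg = math.exp(s_neg - shift)
    return e_pos / (e_pos + e_neg)


def similarity_score(feature, positive_embeddings, negative_embeddings):
    """Worst-case positive probability over all negative queries, as in (3)."""
    positive = mean_embedding(positive_embeddings)
    return min(pairwise_softmax(feature, positive, neg) for neg in negative_embeddings)


def is_relevant(score, threshold=0.5):
    """Apply the 0.5 threshold used when rendering similarity maps."""
    return score > threshold
```

Taking the minimum over the negative queries means that a feature point is kept only if the positive query wins against every negative query.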

A-B Grounding Affordance in $3$ D Gaussian Splatting

We embed grasp affordances in ASK-Splat. During training, we utilize the same tile-based rasterization procedure discussed in Section A-A in rendering the $2$ D visual grasp affordance of the scene and optimize the grasp affordance parameter $\beta$ using the loss function: ${\mathcal{L}_{\beta}=\kappa_{\beta}\sum_{i=1}^{\lvert\mathcal{D}\rvert}\left\|\hat{\mathcal{I}}_{i}^{\beta}-\mathcal{I}_{i}^{\beta}\right\|_{2}^{2},}$ which represents the MSE loss between the ground-truth $2$ D visual grasp affordance ${\mathcal{I}_{i}^{\beta}\in\mathbb{R}^{H\times W\times 1}}$ and the rendered visual grasp affordance ${\hat{\mathcal{I}}_{i}^{\beta}\in\mathbb{R}^{H\times W\times 1}}$. We optimize the affordance parameters concurrently with the non-semantic parameters of each Gaussian. From a trained ASK-Splat scene, we can generate dense $2$ D visual grasp affordance maps, as well as sparser $3$ D visual grasp affordance maps, by directly evaluating the affordance score associated with each Gaussian.

Appendix B Scene-Editing-Enabled Gaussian Splatting

We present the components that make up SEE-Splat, our module for Scene-Editing-Enabled Gaussian Splatting representations, which enables the identification and localization of relevant objects within a scene for insertion, removal, or modification of their visual or spatial properties.

B-A Semantic Localization via ASK-Splat

Given a natural-language query specifying an object of interest, SEE-Splat leverages ASK-Splat to identify semantically relevant Gaussians, as discussed in Section A-A. In the main paper (cf. Fig. 3), we show the localization of an electric stove, saucepan, and a fruit in a real-world Cooking scene. To improve the localization accuracy, the text prompt can include the geometric and visual properties of the object, such as its color, in addition to its semantic class. At this stage, SEE-Splat generates a semantic similarity map, from which relevant Gaussians are extracted, given a threshold on the semantic score.

B-B Masking the Gaussians in SEE-Splat

Given a semantic similarity map of the scene, SEE-Splat generates a mask identifying the Gaussians relevant to the specified object. This procedure begins with thresholding the semantic score of each Gaussian to remove dissimilar Gaussians from the set of relevant Gaussians, which creates a sparse point cloud of the relevant Gaussians, constructed from the means of these Gaussians. However, photo-realistic rendering of Gaussian environments requires denser point clouds. Consequently, SEE-Splat lifts the features of the point cloud from the $3$ D Euclidean space to a $7$ D feature space, by augmenting each point in the point cloud with its RGB color and semantic score. Subsequently, SEE-Splat identifies all neighboring points in the scene within a specified distance of the point cloud in the $7$ D feature space using an efficient KD-tree query. SEE-Splat incorporates these points into the point cloud to create a denser point cloud, comprising all semantically-relevant Gaussians, while removing outliers from the set of points. In the main paper (cf. Fig. 3), we show the Gaussians extracted by SEE-Splat, as a point cloud with well-defined geometry, given a natural-language query for each object.
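The masking procedure above can be sketched as follows. For brevity, the example replaces the KD-tree query with a brute-force radius search, omits the outlier-removal step, and leaves the $7$ D features unscaled; the function name, the threshold, and the radius are all hypothetical choices made for illustration.

```python
import math


def select_object_gaussians(means, colors, scores, threshold=0.5, radius=0.1):
    """Return the sorted indices of Gaussians assigned to the queried object.

    means:  list of (x, y, z) Gaussian centers
    colors: list of (r, g, b) values in [0, 1]
    scores: list of semantic similarity scores in [0, 1]
    """
    # Lift every Gaussian to a 7-D feature: position, color, semantic score.
    features = [tuple(m) + tuple(c) + (s,) for m, c, s in zip(means, colors, scores)]

    # Seed set: Gaussians that pass the semantic threshold.
    seeds = [i for i, s in enumerate(scores) if s >= threshold]
    selected = set(seeds)

    # Densify: add any Gaussian within `radius` of a seed in the 7-D space.
    # (Brute force here; a KD-tree would make this query efficient.)
    for j, feature in enumerate(features):
        if j in selected:
            continue
        if any(math.dist(feature, features[i]) <= radius for i in seeds):
            selected.add(j)
    return sorted(selected)
```

Because position, color, and score are concatenated without rescaling, the radius mixes units; in practice each group of coordinates would likely need its own weighting.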

B-C Editing the Gaussians in SEE-Splat

Leveraging the Gaussian primitives in ASK-Splat, SEE-Splat enables real-time scene-editing by inserting new Gaussians into the scene, removing Gaussians, and modifying the properties of the Gaussians, to reflect (or simulate) changes in the real-world. SEE-Splat supports seamless insertion and removal of Gaussians by introducing or deleting the relevant Gaussians from the set of Gaussians representing the scene, respectively. In addition, SEE-Splat supports both rigid and non-rigid transformation of the Gaussians, enabling simulated motion of the Gaussians, as well as changes to the shape of the Gaussians via non-isometric scaling. Specifically, given a function specifying the transformation ${\xi:\mathcal{G}_{s}\mapsto\mathcal{G}_{s}}$ (where $\mathcal{G}_{s}$ represents the space of the Gaussian primitives), SEE-Splat updates the spatial attributes of the relevant Gaussians by applying $\xi$ to these Gaussians. In the case of rigid transformations, ${\xi}$ can be described by an $\mathrm{SE}(3)$ transformation matrix, specifying rotation and translation of the Gaussians. We can render the edited scene to generate photo-realistic visualizations. Although we do not consider physics-based simulations in this work, we note that physics can be incorporated into SEE-Splat to achieve greater physical realism. We expound on this point in our discussion on the limitations of SEE-Splat.
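For the rigid case, applying $\xi$ amounts to rotating and translating the mean of each Gaussian and rotating its covariance. The sketch below illustrates this on a single Gaussian using nested lists; it parameterizes the Gaussian by its mean and covariance for clarity, whereas Gaussian Splatting implementations typically store a rotation and a scale instead, and the function names are hypothetical.

```python
def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    """Transpose of a 3x3 matrix."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def apply_rigid_transform(mean, covariance, rotation, translation):
    """Apply an SE(3) transform (rotation, translation) to one Gaussian.

    The mean maps to R @ mean + t and the covariance to R @ cov @ R^T.
    """
    new_mean = [m + t for m, t in zip(matvec(rotation, mean), translation)]
    new_covariance = matmul(matmul(rotation, covariance), transpose(rotation))
    return new_mean, new_covariance
```

For example, a rotation of ninety degrees about the vertical axis swaps the first two principal variances of an axis-aligned Gaussian, while the translation only shifts its mean.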

Deletion and transformation of the Gaussians introduce artifacts into the scene representation, degrading its photo-realistic qualities. To address this challenge, SEE-Splat enables $3$ D Gaussian infilling by introducing new Gaussians with similar attributes in regions with missing geometry, which we illustrate in Appendix I. Figure 3 provides an illustration of such artifacts (e.g., the hole in the table), when the scene is edited to visualize the effects of moving the saucepan to the electric stove. Through $3$ D Gaussian infilling, SEE-Splat generates a photo-realistic rendering of the edited scene, eliminating these artifacts.

Appendix C Grasping and Manipulation with Splat-MOVER

We present Grasp-Splat and discuss its application to multi-stage robotic manipulation via Splat-MOVER.

C-A Grasp-Splat for Grasp Proposal

We note that the grasps generated by GraspNet are not always ideal. For example, the grasps generated by GraspNet in Figure 2 are either infeasible or challenging to execute. As a result, Grasp-Splat ranks the grasps proposed by GraspNet based on the grasp affordance scores obtained from ASK-Splat. By leveraging the affordance score associated with each grasp pose, Grasp-Splat identifies grasp configurations that are more likely to succeed, as depicted in Figure 2.
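The re-ranking step can be sketched as a sort over candidate grasps. How the affordance score is associated with a grasp pose is not detailed in this appendix, so the example assumes, purely for illustration, that each grasp inherits the affordance of the Gaussian nearest to its position; all names are hypothetical.

```python
import math


def nearest_affordance(position, gaussian_means, gaussian_affordances):
    """Affordance of the Gaussian closest to `position` (an assumed lookup rule)."""
    nearest = min(
        range(len(gaussian_means)),
        key=lambda i: math.dist(position, gaussian_means[i]),
    )
    return gaussian_affordances[nearest]


def rank_grasps(grasp_positions, gaussian_means, gaussian_affordances):
    """Return (score, position) pairs sorted from highest to lowest affordance."""
    scored = [
        (nearest_affordance(p, gaussian_means, gaussian_affordances), p)
        for p in grasp_positions
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

With this ordering, grasps near high-affordance regions, such as a handle, are tried before grasps elsewhere on the object.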

C-B Multi-Stage Robotic Manipulation

For multi-stage robotic manipulation, we begin by decomposing the manipulation task into stages. Our approach supports the specification of the stages comprising the task by a human or by a large language model (LLM). In the case where the natural-language description of the task does not specify the stages involved in the task, we query an LLM for the stages required to complete the manipulation task. For each stage in the manipulation task, we utilize ASK-Splat, SEE-Splat, and Grasp-Splat to identify the relevant object and generate candidate grasp poses. Likewise, we query ASK-Splat for the target location for placing the object. We evaluate the feasibility of each candidate grasp using an off-the-shelf motion planner for the robotic manipulator, inputting the point cloud of the scene, extracted from ASK-Splat and SEE-Splat, into the motion planner, which uses it for collision detection during motion planning. We attempt the candidate grasps in ranked order, starting with the top candidate and moving on to the next one whenever the motion planner fails to compute a plan for the selected grasp.
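The fallback over candidate grasps described above can be sketched as a simple loop. The planner is represented by a caller-supplied function that returns a plan or `None` on failure; this interface is a placeholder for illustration and does not correspond to any specific motion-planning API.

```python
def first_feasible_grasp(ranked_grasps, plan_fn):
    """Return the first (grasp, plan) pair for which planning succeeds.

    ranked_grasps: candidate grasps, best first
    plan_fn:       hypothetical callable mapping a grasp to a plan, or None on failure
    """
    for grasp in ranked_grasps:
        plan = plan_fn(grasp)
        if plan is not None:
            return grasp, plan
    # No candidate could be planned; the caller decides how to recover.
    return None, None
```

A stage would then execute the returned plan, edit the scene to reflect the outcome, and repeat the same procedure for the next stage.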

We execute the motion plan returned by the motion planner on the robotic manipulator. We note that the end-effector trajectory can be published to SEE-Splat for real-time visualization of the task in the virtual scene. In this case, we can apply the relative transformation between consecutive end-effector poses to the spatial attributes of the Gaussians associated with the object being manipulated. In addition, we note that alternative approaches exist for computing the relative transformations of the object between consecutive frames. For example, if object-tracking information is available from sensors in the scene, SEE-Splat could leverage this information to update the spatial attributes of the Gaussians, rendering a video showing the real-time changes in the scene of the manipulator, including the motion of the object, as the manipulation task progresses. We proceed to the next stage in the manipulation task at the conclusion of the current stage, repeating the same procedures with the updated representation of the scene provided by SEE-Splat.
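To make the update from consecutive end-effector poses concrete, one possible formulation is given below. It assumes that the grasped object moves rigidly with the gripper and that the end-effector poses ${T_{k},T_{k+1}\in\mathrm{SE}(3)}$ are expressed in the same world frame as the Gaussians; the symbols $T_{k}$, $R_{\Delta}$, $t_{\Delta}$, $\mu_{j}$, and $\Sigma_{j}$ are introduced here for illustration only.

```latex
% Relative motion of the end-effector between steps k and k+1 (world frame).
\Delta T_{k} = T_{k+1}\, T_{k}^{-1}
             = \begin{bmatrix} R_{\Delta} & t_{\Delta} \\ 0^{\top} & 1 \end{bmatrix}

% Update of the mean and covariance of each Gaussian j on the grasped object.
\mu_{j} \leftarrow R_{\Delta}\, \mu_{j} + t_{\Delta}, \qquad
\Sigma_{j} \leftarrow R_{\Delta}\, \Sigma_{j}\, R_{\Delta}^{\top}
```

Under these assumptions, repeating the update at every step of the trajectory keeps the object's Gaussians aligned with the gripper in the rendered scene.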

Appendix D Evaluations

We present additional experimental results of ASK-Splat, SEE-Splat, and Splat-MOVER in open-vocabulary, multi-stage robotic manipulation problems, including a discussion of the experimental setup.

D-A Experimental Setup

D-A1 ASK-Splat

We distill grasp affordances from the vision-affordance foundation model VRB [14], which is trained on the EPIC-KITCHENS dataset [44], consisting of videos of humans performing kitchen tasks, such as cutting fruits and vegetables. We note that the generalization of the affordance knowledge in ASK-Splat is limited by that of VRB, the underlying foundation model. VRB utilizes Language Segment-Anything (LangSAM) [45], which requires the specification of objects within each image for which it predicts the contact locations and motion direction after contact. This requirement is not limiting, in practice, as end-to-end object detectors that provide bounding boxes for all objects in the scene [46, 47, 48] could be used. We distill the grasp affordance scores from the heatmaps computed by VRB and the semantic embeddings from the vision-language foundation model CLIP, using the RN50x64 CLIP-ResNet model [5]. We implement ASK-Splat in Nerfstudio [49]. To train ASK-Splat, we record a video of each scene using a smartphone and utilize the training API available in Nerfstudio, using the sparse point cloud computed via structure-from-motion [50] for initialization.

D-A2 Scenes

We consider only real-world scenes in our experiments, including a Kitchen scene (consisting of common kitchen cookware such as saucepans, chopping boards, and knives); a Cleaning scene (consisting of common household cleaners such as disinfectant wipes, dish soaps, and cleaning sprays); a Meal scene (consisting of tableware such as plates, spoons, forks, and cups); a Random scene (consisting of random items such as a pair of scissors, chess pieces, and keyholders); and a Workshop scene (consisting of tools such as a power drill, work mat, and scraper). Figure 5 shows these scenes. We note that the Workshop and Random scenes contain out-of-distribution objects with respect to the EPIC-KITCHENS dataset (i.e., objects not found in a typical kitchen), such as the power drill and the chess pieces.

D-A3 Splat-MOVER

We consider the multi-stage robotic manipulation task where the robot must sequentially pick and place two different objects, moving them to a common goal location. The task is specified by a user who provides an open-vocabulary command, e.g., “Pick up the saucepan and move it to the burner, then pick up the lid and put it on the saucepan.” For simplicity, we limit the task to two sequential pick-and-place maneuvers. However, we note that Splat-MOVER does not impose this limitation and is amenable to longer multi-stage manipulation tasks. Furthermore, we consider three adjacency primitives (“on”, “next to”, and “inside”) for the goal location of the second object, where each primitive is defined based on the geometry of the first object.

Specifically, we evaluate Splat-MOVER in four multi-stage manipulation tasks across three scenes: the Kitchen, Cleaning, and Workshop scenes. In the Kitchen scene, we consider a Cooking task where the robot is asked to place a saucepan on an electric burner (Stage 1) and subsequently place a fruit inside the saucepan (Stage 2). Further, in the Kitchen scene, we consider a Chopping task where the robot is asked to place a knife on a chopping board (Stage 1) and subsequently place a fruit next to the knife (Stage 2). We consider a Cleaning task (in the Cleaning scene), where the robot is asked to place a cleaning spray in a bin (Stage 1) and subsequently place a sponge next to the cleaning spray (Stage 2). Lastly, in the Workshop scene, the robot is asked to place a power drill on a work mat (Stage 1) and subsequently place a wooden block next to the drill (Stage 2), which we refer to as the Workshop task.

D-A4 Hardware Experiments

We implement Splat-MOVER in grasping and placing tasks on a Kinova Gen3 robot, equipped with a Robotiq parallel-jaw gripper. The Kinova robot is a $7$ -DoF robot with a maximum reach of $902$ mm. We interface with the robot using the Robot Operating System (ROS), through which we send waypoints, which are tracked by the default low-level controllers provided by the robot. We utilize the MoveIt ROS package [51] for motion planning for the Kinova robot given a specified grasp pose. At each stage of the manipulation task, we extract a point cloud and a mesh from ASK-Splat and SEE-Splat, reflecting the progress in the task up to that stage, which we use as the environment representation within MoveIt for collision avoidance during planning.

D-B ASK-Splat Representation

We train ASK-Splat on a number of different environments and evaluate the grasp affordance and semantic segmentation of the resulting Gaussian Splats. In Figure 5, we show the RGB image, grasp affordance heatmap computed by VRB, and the grasp affordance heatmap rendered from ASK-Splat composited with the rendered RGB image. The heatmap shows the regions in each object amenable to grasping. Qualitatively, from Figure 5, ASK-Splat encodes the grasp affordance given by VRB, identifying reasonable regions on each object for grasping. Although VRB provides the $2$ D motion direction associated with each grasp affordance region, we do not distill this knowledge into ASK-Splat, as we found the $2$ D motion directions to be quite noisy and relatively uninformative.

We compute the Structural Similarity Index (SSIM) [43] for each scene to assess the quality of the distilled affordance compared to the VRB-generated grasp affordance. The SSIM metric ranges from $-1$ to $1$, with larger values indicating greater similarity. As expected, the Workshop scene yields the smallest SSIM value of ${0.592\pm 7.20\times 10^{-2}}$, recalling that the objects in this scene, such as the power drill and the scraper, are outside the training distribution of the VRB model. Nevertheless, the model shows relatively good generalization performance, given that the grasp affordance region lies around the handle of the drill, shown in Figure 5 (bottom row). In contrast, the Meal scene achieves the highest SSIM score of ${0.681\pm 8.91\times 10^{-2}}$, noting that the objects in the scene can be found in the dataset used in training the VRB model. Further, the Cleaning, Kitchen, and Random scenes achieve SSIM scores of ${0.648\pm 9.06\times 10^{-2}}$, ${0.647\pm 1.30\times 10^{-1}}$, and ${0.614\pm 8.37\times 10^{-2}}$, respectively.
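For readers unfamiliar with the metric, the sketch below computes a simplified, single-window variant of SSIM between two flattened heatmaps, together with the mean and standard deviation over a set of images. The standard metric [43] averages the same expression over local windows, so this global variant is only an approximation used for illustration; the population variance and the function names are assumptions of this example.

```python
import math


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized, flattened images."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants with the commonly used coefficients 0.01 and 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_and_std(values):
    """Mean and population standard deviation of per-image scores."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std
```

Identical heatmaps yield a score of one, while anti-correlated heatmaps yield a negative score, which matches the range of the metric stated above.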

Figure 6 shows the semantic masks generated by ASK-Splat across different scenes. Given a natural-language query, ASK-Splat localizes the relevant object in the scene based on the cosine similarity of the Gaussians to the query. In Figure 6, ASK-Splat identifies the salt shaker, flower, and pair of scissors. However, the success of robotic manipulation tasks depends on the integration of semantic scene understanding with grasp affordance. As such, we also show the semantic-affordance masks generated by ASK-Splat in Figure 6. With the semantic-affordance masks, a robot can not only identify a relevant object to grasp, but also identify where on that object to grasp.
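The masking step described above can be sketched as follows. This is an illustrative sketch rather than the ASK-Splat implementation: it assumes each Gaussian carries a semantic feature vector and a scalar affordance score, that the query has already been embedded in the same feature space, and that fixed thresholds are applied. The function names and the threshold values of $0.5$ are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def semantic_mask(features, query, sim_threshold=0.5):
    """Mark Gaussians whose features are similar enough to the query."""
    return [cosine_similarity(f, query) >= sim_threshold for f in features]


def semantic_affordance_mask(features, affordances, query,
                             sim_threshold=0.5, aff_threshold=0.5):
    """Keep Gaussians that match the query AND have a high affordance score."""
    sem = semantic_mask(features, query, sim_threshold)
    return [s and a >= aff_threshold for s, a in zip(sem, affordances)]


# Toy scene with three Gaussians (illustrative values only).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
affordances = [0.2, 0.8, 0.9]
query = [1.0, 0.0]
object_mask = semantic_mask(features, query)
grasp_mask = semantic_affordance_mask(features, affordances, query)
```

In this toy scene, the first two Gaussians belong to the queried object, but only the second also has a high affordance score, so it is the only one retained in the semantic-affordance mask.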

D-C Splat-MOVER for Multi-Stage Robotic Manipulation

We compare Splat-MOVER to two prior methods, LERF-TOGO [5] and F3RM [6], in four tasks: the Cooking task, Chopping task, Cleaning task, and Workshop task, described in Section D-A3. Figure 7 shows a few candidate grasps proposed by GraspNet, F3RM, LERF-TOGO, and Grasp-Splat for the objects in these tasks. GraspNet does not consider the semantic features of the object in generating candidate grasps; as a result, its proposed grasps are not localized in regions where a human might grasp the object. In contrast, F3RM, LERF-TOGO, and Grasp-Splat generate grasps closer to the handle of the respective objects. For example, their proposed grasps lie relatively close to the handle of the saucepan and the knife. F3RM and LERF-TOGO generate candidate grasps using external guidance on where to grasp the object (such as its handle): a text prompt provided by a human operator or an LLM in LERF-TOGO, or a dataset of human demonstrations in F3RM. Grasp-Splat, however, does not require any external guidance to generate candidate grasps of similar quality, harnessing the grasp affordances provided by ASK-Splat. We summarize the capabilities of each of these methods in Table II.

TABLE II: Representation Capabilities of LERF-TOGO [5], F3RM [6], and Splat-MOVER

Capabilities	Semantic Knowledge	Affordance Knowledge	Scene-Editing
LERF-TOGO [5]	✓	✗	✗
F3RM [6]	✓	✗	✗
Splat-MOVER (ours)	✓	✓	✓

In addition, we evaluate the pick-and-place success rate of all the methods in the Cooking task, Chopping task, Cleaning task, and Workshop task, where the place success rate is conditioned on the number of successful trials in picking the object. Table I provides the pick-and-place success rates in the Chopping task. Splat-MOVER achieves the highest pick success rate ($85\%$) in Stage $1$ of the task. Although F3RM achieves the highest place success rate, it achieves a much lower pick success rate compared to Splat-MOVER. In addition, in the Cleaning and Workshop tasks, Splat-MOVER achieves the highest success rates in the first stage of each task, and further achieves relatively high success rates in the second stage of each task, as shown in Table I. LERF-TOGO achieves a perfect success rate in picking up the power drill in the first stage of the Workshop task. Since LERF-TOGO and F3RM are not amenable to multi-stage manipulation tasks, we cannot evaluate the success rate of these methods for the entire manipulation task. In contrast, Splat-MOVER enables multi-stage robotic manipulation, achieving task success rates of $40\%$, $65\%$, $70\%$, and $80\%$ in the Cooking, Chopping, Cleaning, and Workshop tasks, respectively. We note that the Cooking task is the most challenging of the four tasks, given the small margin for error tolerated in placing the saucepan on the electric burner.
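The conditional place success rate used above can be illustrated with a short sketch. It assumes each trial is logged as a pair of booleans (picked, placed); the function name `pick_place_rates` and the trial outcomes below are hypothetical and are not data from the experiments.

```python
def pick_place_rates(trials):
    """Return (pick_rate, place_rate) for a list of (picked, placed) trials.

    The place rate is conditioned on a successful pick: its denominator is
    the number of successful picks, not the total number of trials.
    """
    total = len(trials)
    picks = sum(1 for picked, _ in trials if picked)
    places = sum(1 for picked, placed in trials if picked and placed)
    pick_rate = picks / total if total else None
    place_rate = places / picks if picks else None  # undefined without picks
    return pick_rate, place_rate


# Hypothetical log of four trials (illustrative only).
trials = [(True, True), (True, False), (False, False), (True, True)]
pick_rate, place_rate = pick_place_rates(trials)
# Fraction of all trials in which the object was both picked and placed.
pick_and_place_rate = pick_rate * place_rate
```

Here three of four picks succeed and two of those three picks lead to a successful place, so the conditional place rate is $2/3$ even though only half of all trials end with the object placed. This is why a method can report a high place success rate alongside a low pick success rate.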