¹¹institutetext: University of Washington ¹¹email: [email protected]
²²institutetext: NVIDIA

Learning to Build by
Building Your Own Instructions

Aaron Walsman 11 Muru Zhang 11 Adam Fishman 11 Ali Farhadi 11 Dieter Fox 1122

Abstract

Structural understanding of complex visual objects is an important unsolved component of artificial intelligence. To study this, we develop a new technique for the recently proposed Break-and-Make problem in LTRON where an agent must learn to build a previously unseen LEGO assembly using a single interactive session to gather information about its components and their structure. We attack this problem by building an agent that we call InstructioNet that is able to make its own visual instruction book. By disassembling an unseen assembly and periodically saving images of it, the agent is able to create a set of instructions so that it has the information necessary to rebuild it. These instructions form an explicit memory that allows the model to reason about the assembly process one step at a time, avoiding the need for long-term implicit memory. This in turn allows us to train on much larger LEGO assemblies than has been possible in the past. To demonstrate the power of this model, we release a new dataset of procedurally built LEGO vehicles that contain an average of 31 bricks each and require over one hundred steps to disassemble and reassemble. We train these models using online imitation learning which allows the model to learn from its own mistakes. Finally, we also provide some small improvements to LTRON and the Break-and-Make problem that simplify the learning environment and improve usability. This data and updated environments can be found at github.com/aaronwalsman/ltron/blob/v1.1.0. Additional training code can be found at github.com/aaronwalsman/ltron-torch/tree/eccv-24.

1 Introduction

The ability to understand and execute complex assembly problems is one of the hallmarks of human intelligence. Humans use this ability to construct tools and reverse engineer previously unseen part-based objects.

The recently proposed Break-and-Make problem[42] is designed to train agents to develop these abilities using complex LEGO structures. In this problem, an agent must learn to build a previously unseen assembly by actively inspecting it. To do this the agent is given access to an interactive simulator that allows it to disassemble the structure in order to reveal hidden components and see how everything fits together. Once it is confident that it knows the structure, the agent is presented with an empty scene and must build the model again from scratch. This problem is designed to simulate a reverse engineering problem. The agent must take apart a complex structure in order to learn how to make it. By training agents to effectively reverse engineer these systems, we can discover new tools for understanding and building intelligent systems that can reason about complex structures.

The Break-and-Make problem is quite challenging, as it requires long-term memory, and interaction with a complex visual environment using a 2D cursor-based action space. Baseline approaches that use transformers[40] and LSTMs[16] for long-term memory have struggled to make progress on larger models. We introduce a new model InstructioNet for this problem that uses an explicit memory to store a stack of self-curated instruction images. InstructioNet slowly adds to this memory by iteratively disassembling part of the model, then saving its most recent observation to the top of this instruction stack. This stack provides the model with a series of short-term visual targets to use when reassembling the model later on, just like a real-world LEGO instruction book. When rebuilding the model the agent only has to reason about its current observation and the page of the instruction stack that it is currently working on. Once the agent’s current assembly matches this instruction, it can turn the page and get a new short-term target to work towards. Figure 1 shows a successful example of this on our new RC-Vehicles dataset.

In addition to this memory-based model, we also detail several practical components that are either necessary or improve training performance in this space. These include online imitation learning, conditional action heads and new loss functions for the dense 2D cursor-based action space.

Refer to caption — Figure 1: An example of InstructioNet completing the Break and Make task on a previously unseen example from RC-Vehicles. Our model saves 34 distinct images to the instruction stack over the first 69 steps. It then successfully rebuilds the model from scratch using these images over the course of another 135 steps.

Our primary contributions are:

1.

We introduce a new instruction-stack based model for the Break and Make problem.
2.

We provide several modeling and training improvements for complex visual action spaces that use a pixel-based cursor.
3.

We provide a new simplified visual interface for the Break and Make problem, along with RC-Vehicles a new dataset of challenging LEGO models for assembly problems.
4.

Our new model and training recipe achieve much better performance than previous baselines, especially on larger, more complicated models.

2 Related Work

2.1 Interactive Understanding and Assembly

Despite a proliferation of interactive environments with increasing sophistication[8, 35, 34, 37, 44, 47], long-term structural understanding and assembly problems remain an open challenge in artificial intelligence. Researchers have explored this in the context of furniture assembly[50, 27], Minecraft[11], CAD models[18] and robotic assembly problems[48]. In this work, we build on LTRON[42], a recent LEGO simulator designed to provide a building environment for learning agents. Constructing plans for assembly using a disassembly process has also been explored in the context of multi-part CAD based models[39], however this work is more concerned with finding collision-free paths through free-space, and does not directly reason about connection points.

LEGO bricks have been a popular substrate for learning assembly problems across a variety of subfields in artificial intelligence. These include design problems[30], robotic assembly[14], shape reconstruction [20, 21, 26], generative modelling[38], and image guided building[6, 23]. Most similar to our work, Wang et al.[43] build LEGO structures from existing instructions. Our setting is more challenging because the agent must learn to make its own instructions rather than assuming they are already provided. Furthermore, the action space in LTRON is more difficult as it requires the agent to use a 2D cursor to interact with the scene and contains assemblies with bricks attached to the sides of objects, that cannot be described using simple stacking.

2.2 Memory

Memory structures for interactive problems have been studied for decades. Early attempts at implicit (problem-agnostic) memory structures include simple RNNs[33, 19] and more complex variants such as LSTMs[17] and GRUs[4]. These methods propagate information forward in time using specialized network architecture. Neural Turing Machines [13] use a learned external memory module to read and write to long term storage. Recently attention-based transformers[40] have become one of the most popular ways to build neural networks that rely on past information to make future decisions. While transformers are quite effective, they are computationally expensive for large sequences of observations, which limits their use for very long-term memory.

In contrast to implicit memory structures, explicit memory uses some knowledge of the environment or task in order to build a structure more appropriate to the problem. This can come in the form of geometric or topological map building such as in SLAM[24, 5, 15], semantic maps in embodied navigation[12] or more complex structures that combine multiple components[2]. Our approach uses an explicit memory structure designed around the intuitive understanding that assembling a structure can be completed in the reverse order of disassembling it. This motivates the use of a stack that allows the agent to sequentially build up a series of experiences, and then pop them off one-by-one in reverse order later.

2.3 Inverse-Graphics

Reconstructing geometry and reasoning about 3D structures from images has been an important issue in research fields such as computer-aided design[10, 46] and robotics [3, 36, 22]. In particular, prior works such as [45, 25] use 3D shapes while [49, 28, 29] use images to guide the inverse inference process. However, when building complex structures such as LEGO models, it is challenging to generate a set of sufficient visual images to predict the reconstruction without dynamically interacting with the object or the environment.

3 Methods

3.1 Environments and Data

The Break and Make environment in LTRON[42] is designed to test an agent’s ability to perform complex building tasks on previously unseen target LEGO assemblies. In this, and many other complex construction problems, no single view of the target object is enough to fully describe it. Therefore, in order for the model to complete its objective, it must first interactively discover all the components of the LEGO assembly by disassembling it and remembering where all the individual bricks went during that process. In order to allow for this kind of interactive discovery, the Break and Make problem divides each interactive episode into two distinct phases. In the first Break phase, the agent is presented with a previously unseen LEGO model and is allowed to interactively inspect and disassemble it, while constructing some long-term memory of the object. Once it is done, it takes a dedicated Phase Switch action at which point the agent is presented with an empty scene and must use the memory it built in the Break phase to rebuild the model from scratch.

In this paper, we use a slightly modified version of the original Break and Make environment with a few changes for simplicity of training. The original Break and Make environment provided two images as its observation space, one “table" image representing the current scene, and another “hand" image representing a single brick that had just been removed from the scene, or was about to be added. In order to simplify the observation space of our model, we have removed the “hand" image. In this updated environment, when a new brick is inserted into the scene, it is simply placed in a floating location above the existing assembly computed using the bounding box of existing bricks, or at the origin if no bricks are present. We also add a new Translate action mode which allows the agent to shift a brick by a fixed amount by using the cursor to select a connection point and specifying a direction and discrete offset value. We found this useful for helping the agent recover from small mistakes.

The complete set of action primitives in the new environment are: Rotate: rotates a single brick about a connection point specified by clicking a 2D screen location. The rotation angle is selected from a set of discrete values in 90 degree increments. Translate: translates a single brick specified by clicking a 2D screen location. The translation distance and direction is again selected from a set of discrete values corresponding to multiples of the stud spacing and brick heights. Pick: inserts a new brick into the scene, by selecting a discrete shape and color index. The new brick is placed in a floating location above the current assembly. Assemble: attaches one brick to another by specifying two connection points in the scene. Disassemble: removes a brick from the scene by specifying a connection point. All of these actions will only succeed if the operation can be performed without causing a collision. Figure 2 shows the manipulation components of the action space of our updated environment.

Note that despite including collision checking, this environment design does not attempt to simulate real-world physical interaction, as it does not consider forces or other complex dynamics. This means that the LTRON interface is closer to CAD modelling, where an operator must use a set of discrete tools and a 2D mouse in order to manipulate a virtual 3D object, than a physical manipulation setting that might be encountered in robotics or industrial automation.

We also use a reduced resolution of $128\times 128$ pixels for training efficiency. In addition, we have made a number of improvements to the original LTRON simulator. These updates provide better collision checking and support for a larger number of connection point styles than was available previously. Note that while both the original and our updated environment allow for camera rotation, we found that rotating the camera was not necessary to successfully reassemble the models in the datasets considered here, so we used a fixed camera angle for all experiments.

In order to focus on construction ability and avoid the confounding issues of long-tailed part distributions, we used the 2, 4 and 8 brick random construction assemblies in LTRON and did not train on the Open Model Repository (OMR) data. Due to the simulator updates mentioned above, we regenerated this random construction data to account for improvements in collision checking. These new assemblies avoid certain rare configurations that resulted in small brick penetrations in the original data.

In addition to the random construction data, we have also developed a new RC-Vehicles dataset of randomly constructed vehicles. These were generated with a series of scripted rules defining distributions over the vehicle dimensions, swappable components such as tires and windshield shapes, and optional features such as wings, headlights and helicopter blades. When combined, these distributions have over 21 bits of entropy defining the shape of the vehicle and 29 bits of entropy defining its color combinations. These models vary in size from 19 to 73 bricks, making them substantially larger and more complex than the previous random construction data. Examples of these vehicles can be seen in Figure 3.

Note that both the random brick and the RC-Vehicles assemblies are not merely top-down stacks of bricks, but require some parts to be placed on the sides, front and back of the constructed object.

3.2 Instruction Stack

Our new model InstructioNet works by storing an explicit stack of instruction images in order to remember the structure of an assembly at various stages of deconstruction. To do this, we augment the action space discussed in the previous section with two additional Push Instruction and Pop Instruction actions. Push takes the current image from the simulator and adds it to the top of the instruction stack, while Pop removes the top image from the instruction stack. During training and inference, we do not restrict the size of the instruction stack.

Our learned policy takes in the current image from the simulator as well as the top image of the instruction stack. During the Break phase when the agent is trying to gather more information about the LEGO assembly, the agent can compare these two images and see if they are similar. If they are, then the agent should disassemble the model further. Otherwise, the agent should take the Push action to store the new information that has just been gathered. After completing this process several times, the agent should have an instruction stack with an image of the fully completed assembly on the bottom, and increasingly disassembled images as you move closer to the top. At the start of the Make phase, the agent will be presented with an empty scene, so the current image from the environment will be empty, while the top of the instruction stack will contain the last brick the model saw during the disassembly process. The agent must then build until the assembly in the current image matches the assembly in the top instruction image. When they match, the agent can take the Pop action and will then reveal a new instruction image with slightly more of the original assembly remaining. In this way, the model only needs to reason about two images at a time, which greatly reduces the complexity of the policy.

3.3 Model

The learned policy is implemented using a modified vision transformer[9] with multiple heads that are responsible for different components of the action space. This model tokenizes the workspace and the instruction images into $16\times 16$ pixel patches. Given that the environment produces images that are $128\times 128$ pixels, this results in 64 tokens per image. The patches from both images are passed through a single linear layer and added to a learned positional encoding. The model then concatenates a single decoder token and a binary embedding of the current phase (Break or Make) for 130 total tokens. The transformer consists of 12 blocks with 512 channels and 8 heads each.

To compute an action, the output of the decoder token is then passed through a set of decoder heads to predict distributions for the action mode such as Disassemble, Assemble, or Rotate and mode-specific parameters such as Rotate Direction when performing a Rotate action or Brick Shape and Color when inserting a new brick. In order to predict 2D cursor click and release locations, the 64 output tokens of transformer blocks 3,6,9 and 12 that correspond to the current image are combined using two separate DPT[31] decoders to produce dense feature maps at the resolution of the original input image. We found it beneficial to condition these click locations on the high level action and parameters sampled from the initial decoder heads. This is accomplished by passing the sampled high level actions to an embedding layer and adding the resulting feature to the output of the decoder token. This value is then used to compute a distribution over click locations using dot product attention over the dense features computed by the DPT decoder. This conditional structure is similar to models used in game AI with complex action spaces [41]. Figure 4 shows our policy model and how these components fit together.

3.4 Online Training

We trained our model using online imitation learning similar to DAgger[32] and show ablations using behavior cloning as well. To do this we built a fast expert that can provide online supervision for trajectories generated by the learning model during training. This is in contrast to [42] which used an expert that was too slow for online labeling. Note that while our new expert is much faster, it is not able to construct plans in cases where the agent makes too many mistakes or deviates too far from the target assembly. In these cases, we simply terminate the training episode early. When there are multiple possible best actions that the expert could suggest, one of them is selected at random. To avoid ambiguity, our expert instructs the agent to push an image to the instruction stack each time a brick is removed during the break phase.

The online training algorithm alternates between generating new data by acting in the environment according to either the expert or the learning model, and then training on a randomized subset of the data generated over the past several iterations. When generating data, a fixed percentage of the environment steps are generated by sampling actions according to the expert, and the rest are sampled by acting according to the agent. We refer to this expert mixture constant as $\alpha$ , and for our main experiments, we found that $\alpha=0.75$ , representing a mix of $75\%$ expert-generated and $25\%$ model-generated data worked well. Note that regardless of which model is controlling the simulator, the expert’s actions are always used for supervision. Incorporating trajectories generated by the agent in this way allows the model to learn to recover from its mistakes: when the model takes an inappropriate action it will reach part of the state space that would not have been encountered if acting according to the expert, yet seeing the expert’s advice in these states shows the model how to correctly recover from this behavior.

As noted by Czarnecki et al. [7], this method of using the model to generate data with direct supervision from the expert can be unstable. This instability occurs frequently when the data distribution shifts as the model gets better at some parts of the state space and learns to take different actions. This problem is exacerbated in settings with long trajectories where many nearby frames appear similar to each other. If proper care is not taken, the model can quickly overfit to the data it has most recently seen and catastrophically forget prior examples. The original DAgger algorithm addresses this by periodically retraining the entire model on all data collected so far. Unfortunately, this is somewhat impractical for long training runs with millions of steps. To mitigate this issue we maintain a replay buffer and train on data randomly sampled from this experience. See the supplementary material for the pseudo-code we use to train our model.

3.5 Cursor Losses

We experimented with several approaches to computing losses for the cursor click and release locations. During online training, we would like the click-and-release decoders to each produce a distribution over locations that can be sampled in order to generate a variety of training data. To compute the probability of clicking on a particular pixel $p(i,j)$ , the dense raw value $x_{i,j}$ predicted at that pixel location is normalized using a standard softmax:

p(i,j)=\frac{\exp(x_{i,j})}{\sum_{i^{\prime},j^{\prime}}\exp(x_{i^{\prime},j^{% \prime}})}

When the expert’s action suggests clicking on a particular LEGO connection point in the scene, there are usually multiple “acceptable" pixels that correspond to the same connection point which complicates the choice of loss function.

One option is to use binary cross-entropy loss using the mask of acceptable pixel locations as a target. This assumes a different probabilistic interpretation of the output, one where multiple pixels can be chosen at the same time instead of just one, but still may be a useful way to encourage the model to put a high probability on the acceptable pixels. This loss function encourages the model to increase the probability of all acceptable pixels, without considering the cross-pixel relationships.

Another option commonly used in keypoint detection is to construct target heatmaps using a small Gaussian distribution around correct locations and supervise the output values using a mean-squared-error loss [1]. We opted against this as it does not lend itself well to softmax sampling, and may add probability mass outside the desired pixel boundaries. However, we can use the mean-squared-error loss to simply push all acceptable pixels toward a large positive constant, and all unacceptable pixels towards a large negative constant. Here we used a constant such that if only one pixel in the image assumed the positive constant, and all others assumed the negative constant, the probability of selecting the single pixel in the softmax would be 0.999. For $128\times 128$ pixel images, these constants are $\pm 8.3$ .

Finally, we consider a loss function in which the probability of acceptable pixels is summed and the probability of unacceptable pixels is summed forming a new two-way distribution. We then supervise this new distribution to maximize the probability of choosing an acceptable pixel using cross-entropy.

L=-\log\sum_{i,j}y_{i,j}\frac{\exp x_{i,j}}{\sum_{i^{\prime},j^{\prime}}\exp x% _{i^{\prime},j^{\prime}}}

This allows the model to place probability mass on any of the acceptable pixels while decreasing the probability mass of all unacceptable pixels. We find that this loss function outperforms the others and discuss this in Section 4.4.

4 Experiments

4.1 Evaluation

We use the metrics from [42] to evaluate the quality of our learned agents. The first ( $F1_{b}$ ) is an $F1$ score over the brick shape and color, ignoring pose. The second ( $F1_{a}$ ) is over the brick shape, color, and pose after computing the best single rigid transformation to bring the estimated and ground-truth assemblies into alignment with each other. Assembly Edit Distance ( $AED$ ) measures how many rigid transforms (edits) are required to bring all the bricks in the estimated assembly into alignment with the bricks in the ground-truth assembly, with additional penalties for extra and missing bricks. The final ( $F1_{e}$ ) is an $F1$ score over the edges (connections) between bricks after using the alignment computed in $AED$ to construct a mapping between bricks in the estimated assembly and bricks in the ground-truth assembly.

4.2 Break and Make

To evaluate the effectiveness of the InstructioNet model, we trained it on our modified version of the Break and Make task using the randomly constructed assemblies and vehicles discussed in Section 3.1. Table 1 shows the performance of our model on these datasets compared to the reported numbers of the LSTM, Studnet-A, and Studnet-B baselines from [42]. The Studnet models are causal transformers that take in a series of deduplicated tiles from the beginning of an episode until the current time step and use a series of decoders to make predictions for the current action. All three of these models require a long horizon to make decisions, while InstructioNet requires only the current frame and the top of the instruction stack. See [42] for details of the Studnet and LSTM models. Note that due to the updated observation space and the small changes to the random construction data due to the improved collision checker, these models should not be considered to have been trained in the same environment, and so the comparisons are only approximate. Regardless of these differences, the InstructioNet model is able to reconstruct large models with much greater accuracy than was previously possible, and the large performance gap clearly demonstrates a new level of capability. Note that we were not able to train the LSTM and Studnet methods from [42] on the new RC-Vehicles data as they require the entire history of past frames to make each decision. The RC-Vehicles assemblies can take over 150 steps to correctly disassemble and reassemble, which are much larger sequences than we could effectively train on available hardware.

RC-2*	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
InstructioNet	0.98	0.95	0.93	0.18
LSTM	0.61	0.38	0.43	2.16
Studnet-A	0.90	0.86	0.58	1.11
Studnet-B	0.87	0.77	0.57	1.30
RC-4*	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
InstructioNet	0.80	0.69	0.71	2.39
LSTM	0.41	0.09	0.13	7.25
Studnet-A	0.56	0.29	0.24	5.80
Studnet-B	0.64	0.34	0.25	5.48
RC-8*	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
InstructioNet	0.68	0.62	0.63	6.30
LSTM	0.02	0.00	0.02	16.05
Studnet-A	0.02	0.01	0.01	15.87
Studnet-B	0.38	0.14	0.12	13.90
RC-Vehicles	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
InstructioNet	0.59	0.51	0.53	43.36

Table 1: InstructioNet compared against the LSTM and Studnet baselines from [42]. See Section 4.1 for details on these metrics, and 4.2 for an important note on the direct comparability of these methods.

4.3 Online Training

In order to show the effectiveness of online training using sequences of actions and observations generated by the learning model, we also train a model on sequences generated only by the expert on the RC-2 and RC-4 datasets. To do this, we set the expert mixture ( $\alpha$ ) to $1.0$ , which is equivalent to behavior cloning. The results are shown in the Online Training section of Table 2. While these models underperform relative to the default expert mixture ( $\alpha=0.75$ ) that includes training from an online expert, it still shows that significant progress can be made with offline training on this problem.

Online Training (Section 4.3)
RC-2	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
$\bm{\alpha=0.75}$	0.98	0.95	0.93	0.18
$\alpha=1.0$	0.97	0.93	0.90	0.29
RC-4	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
$\bm{\alpha=0.75}$	0.80	0.69	0.71	2.39
$\alpha=1.0$	0.77	0.68	0.66	2.88
Loss Functions (Section 4.4)
RC-2	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
Summed CE	0.98	0.95	0.93	0.18
BCE	0.91	0.72	0.72	0.92
MSE	0.91	0.50	0.66	1.00
Conditional Actions (Section 4.5)
RC-2	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
1.7M	0.98	0.94	0.92	0.22
2.6M	0.98	0.95	0.93	0.18
Cut 1.7M	0.97	0.89	0.84	0.43
Cut 2.6M	0.98	0.00	0.50	1.09
Selective Modification (Section 4.6)
RC-2	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
Original	0.98	0.95	0.93	0.18
Altered Color	0.98	0.93	0.95	0.21
RC-4	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
Original	0.80	0.69	0.71	2.39
Altered Color	0.78	0.67	0.68	2.63

Table 2: The results of various ablations, see the sections listed above for details.

4.4 Loss Functions

We evaluate the effectiveness of our cursor loss function by comparing it against the binary cross entropy and constant regression methods discussed in Section 3.5. We find that even on the relatively easy two-brick models, the summed-probability loss outperforms these other techniques. The results are shown in the Loss Functions section of Table 2. Note that due to the difference in magnitude between these losses, we adjusted the learning rate for these methods in an attempt to achieve the best results possible. We found that both benefited from a higher learning rate of $5\times 10^{-4}$ rather than the default $5\times 10^{-5}$ used for the summed-probability loss.

4.5 Conditional Action Generation

We also test the importance of sequentially conditioning the action heads on one another as discussed in Section 3.3 by training a new model where these conditional connections are cut. This corresponds to cutting the magenta connections coming out of the Action Head, Parameter Head and Click Head in Figure 4. We found that cutting these connections leads to training instability where after a certain point the model loses its ability to effectively use the cursor to connect bricks together. In light of this, we report results after 1.7M frames, right before the instability occurs in addition to the default 2.6M frames after the instability occurs. We also report an evaluation of the default model at 1.7M frames for comparison. Note that even before training became unstable, the model without the conditional connections was significantly underperforming the default model. The Conditional Actions section of Table 2 shows these results.

4.6 Selective Modification

We also tested our model on a new task that requires the agent to rebuild the model with one of the brick colors altered. In this setting, the model receives two additional tokens, one which specifies the color to change, and the other that specifies a new color. The agent’s objective is to rebuild the model with all bricks using the original color instead built with the new color. The Selective Modification section of Table 2 shows the performance on this task for two and four brick models. While, performance degrades slightly on this problem, the model still performs quite well. This demonstrates the model’s ability to reason not only about reproducing the exact same model seen during the break phase, but also incorporating new instructions when rebuilding.

4.7 Hyperparameters

Unless otherwise mentioned, we used AdamW with a learning rate of $5\times 10^{-5}$ , $\beta_{1}=0.9$ , $\beta_{2}=0.95$ and weight decay of 0.1. For RC-8 and RC-V, the learning rate was cut to $1\times 10^{-5}$ after 15.8M frames. All models were trained on a single Nvidia 4090 or A40 graphics card. The larger RC-8 and RC-Vehicles runs were trained on 19.7M training steps which took five consecutive days per run. The RC-2 and RC-4 datasets were trained on 2.6M and 7.9M steps respectively which took between one and three consecutive days per run. Table 3 shows the training hyperparameters used to train the models for each dataset.

Hyperparameters	RC-2	RC-4	RC-8	RC-V
Total Training Steps	2.6M	7.9M	19.7M	19.7M
New Data Steps Per Epoch	8K	8K	8K	32K
Training Steps Per Epoch	16K	16K	16K	65K
Replay Buffer Size	32K	32K	32K	131K
Expert Data Mixture ( $\alpha$ )	0.75	0.75	0.75	0.75

Table 3: Hyperparameters used to train the different datasets.

4.8 Qualitative Evaluation

Figure 5 shows ten representative failure and success cases of our model on the RC-Vehicles dataset sorted by their F1a score. Example A shows a case where the model fails to complete the Break phase due to a small ornament on top of the car that the model does not realize it needs to remove. In examples B, C, and D the model successfully completes the Break phase, but then struggles to complete the early part of the model. In example E, the model initially placed the wings correctly, but then misclicked as it was placing a later piece and inadvertently moved one of the wings to the wrong location and it was not able to recover. In example F, the model incorrectly built the front grille and fails to either undo its mistakes or move on. In example G, the model misplaced one brick in the back of the car and was also not able to correct its mistake. Examples H and I show cases where the model almost completely reconstructs the assembly, but gets hung up on the small ornamental details on the roof. Finally, example J shows an almost perfect reconstruction where one hidden brick is incorrect.

5 Conclusion

We have demonstrated substantially improved performance over previous baselines on the Break and Make problem, using a model with explicit instruction memory. The failure modes of this approach suggest that performance could be improved by working on solutions that avoid getting stuck, and that find ways to push forward even if it means making a local mistake. While our approach is successful, it requires an online expert which can provide explicit instructions not only for the inspection and reconstruction process but also for the process of storing and retrieving memory. This limits the utility of this method in real-world settings, where an online expert may not be available. The InstructioNet approach may not be appropriate for problems that do not follow our assumption that assembly can be completed by approximately reversing the disassembly process. Nevertheless the advantage of this method over prior approaches which considered the entire observation history points to the effectiveness of considering only a portion of memory at a time when making decisions.

References

[1] Bulat, A., Tzimiropoulos, G.: Human pose estimation via convolutional part heatmap regression. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14. pp. 717–732. Springer (2016)
[2] Chang, M., Gervet, T., Khanna, M., Yenamandra, S., Shah, D., Min, S.Y., Shah, K., Paxton, C., Gupta, S., Batra, D., et al.: Goat: Go to any thing. arXiv preprint arXiv:2311.06430 (2023)
[3] Chen, Q., Memmel, M., Fang, A., Walsman, A., Fox, D., Gupta, A.: Urdformer: Constructing interactive realistic scenes from real images via simulation and generative modeling. In: Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@ CoRL2023 (2023)
[4] Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
[5] Choset, H., Nagatani, K.: Topological simultaneous localization and mapping (slam): toward exact localization without explicit localization. IEEE Transactions on robotics and automation 17(2), 125–137 (2001)
[6] Chung, H., Kim, J., Knyazev, B., Lee, J., Taylor, G.W., Park, J., Cho, M.: Brick-by-brick: Combinatorial construction with deep reinforcement learning. Advances in Neural Information Processing Systems 34, 5745–5757 (2021)
[7] Czarnecki, W.M., Pascanu, R., Osindero, S., Jayakumar, S., Swirszcz, G., Jaderberg, M.: Distilling policy distillation. In: The 22nd international conference on artificial intelligence and statistics. pp. 1331–1340. PMLR (2019)
[8] Deitke, M., Han, W., Herrasti, A., Kembhavi, A., Kolve, E., Mottaghi, R., Salvador, J., Schwenk, D., VanderBilt, E., Wallingford, M., et al.: Robothor: An open simulation-to-real embodied ai platform. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3164–3174 (2020)
[9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
[10] Du, T., Inala, J.P., Pu, Y., Spielberg, A., Schulz, A., Rus, D., Solar-Lezama, A., Matusik, W.: Inversecsg: Automatic conversion of 3d models to csg trees. ACM Transactions on Graphics (TOG) 37(6), 1–16 (2018)
[11] Fan, L., Wang, G., Jiang, Y., Mandlekar, A., Yang, Y., Zhu, H., Tang, A., Huang, D.A., Zhu, Y., Anandkumar, A.: Minedojo: Building open-ended embodied agents with internet-scale knowledge. In: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022), https://openreview.net/forum?id=rc8o_j8I8PX
[12] Gordon, D., Fox, D., Farhadi, A.: What should i do now? marrying reinforcement learning and symbolic planning. arXiv preprint arXiv:1901.01492 (2019)
[13] Graves, A., Wayne, G., Danihelka, I.: Neural turing machines. arXiv preprint arXiv:1410.5401 (2014)
[14] Gupta, A., Fox, D., Curless, B., Cohen, M.: Duplotrack: a real-time system for authoring and guiding duplo block assembly. In: Proceedings of the 25th annual ACM symposium on User interface software and technology. pp. 389–402 (2012)
[15] Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: Rgb-d mapping: Using kinect-style depth cameras for dense 3d modeling of indoor environments. The international journal of Robotics Research 31(5), 647–663 (2012)
[16] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
[17] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
[18] Jones, B., Hildreth, D., Chen, D., Baran, I., Kim, V.G., Schulz, A.: Automate: A dataset and learning approach for automatic mating of cad assemblies. ACM Transactions on Graphics (TOG) 40(6), 1–18 (2021)
[19] Jordan, M.I.: Serial order: A parallel distributed processing approach. In: Advances in psychology, vol. 121, pp. 471–495. Elsevier (1997)
[20] Kim, J.W., Kang, K.K., Lee, J.H.: Survey on automated lego assembly construction (2014)
[21] Lee, S., Kim, J., Kim, J.W., Moon, B.R.: Finding an optimal lego® brick layout of voxelized 3d object using a genetic algorithm. In: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation. pp. 1215–1222 (2015)
[22] Lee, Y., Hu, E.S., Lim, J.J.: Ikea furniture assembly environment for long-horizon complex manipulation tasks. In: 2021 ieee international conference on robotics and automation (icra). pp. 6343–6349. IEEE (2021)
[23] Lennon, K., Fransen, K., O’Brien, A., Cao, Y., Beveridge, M., Arefeen, Y., Singh, N., Drori, I.: Image2lego: Customized lego set generation from images. arXiv preprint arXiv:2108.08477 (2021)
[24] Leonard, J.J., Durrant-Whyte, H.F.: Simultaneous map building and localization for an autonomous mobile robot. In: IROS. vol. 3, pp. 1442–1447 (1991)
[25] Li, C., Pan, H., Bousseau, A., Mitra, N.J.: Sketch2cad: Sequential cad modeling by sketching in context. ACM Transactions on Graphics (TOG) 39(6), 1–14 (2020)
[26] Li, K., Bian, J.W., Castle, R., Torr, P.H., Prisacariu, V.A.: Mobilebrick: Building lego for 3d reconstruction on mobile devices. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4892–4901 (2023)
[27] Lim, J.J., Pirsiavash, H., Torralba, A.: Parsing ikea objects: Fine pose estimation. In: Proceedings of the IEEE international conference on computer vision. pp. 2992–2999 (2013)
[28] Mo, K., Guerrero, P., Yi, L., Su, H., Wonka, P., Mitra, N., Guibas, L.J.: Structurenet: Hierarchical graph networks for 3d shape generation. arXiv preprint arXiv:1908.00575 (2019)
[29] Niu, C., Li, J., Xu, K.: Im2struct: Recovering 3d shape structure from a single rgb image. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4521–4529 (2018)
[30] Peysakhov, M., Regli, W.C.: Using assembly representations to enable evolutionary design of lego structures. Ai Edam 17(2), 155–168 (2003)
[31] Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12179–12188 (2021)
[32] Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. pp. 627–635. JMLR Workshop and Conference Proceedings (2011)
[33] Rumelhart, D.E., Hinton, G.E., Williams, R.J., et al.: Learning internal representations by error propagation (1985)
[34] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., et al.: Habitat: A platform for embodied ai research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9339–9347 (2019)
[35] Shen, B., Xia, F., Li, C., Martín-Martín, R., Fan, L., Wang, G., Buch, S., D’Arpino, C., Srivastava, S., Tchapmi, L.P., et al.: igibson, a simulation environment for interactive tasks in large realisticscenes. arXiv preprint arXiv:2012.02924 (2020)
[36] Suárez-Ruiz, F., Zhou, X., Pham, Q.C.: Can robots assemble an ikea chair? Science Robotics 3(17), eaat6385 (2018)
[37] Szot, A., Clegg, A., Undersander, E., Wijmans, E., Zhao, Y., Turner, J., Maestre, N., Mukadam, M., Chaplot, D.S., Maksymets, O., et al.: Habitat 2.0: Training home assistants to rearrange their habitat. Advances in Neural Information Processing Systems 34, 251–266 (2021)
[38] Thompson, R., Ghalebi, E., DeVries, T., Taylor, G.W.: Building lego using deep generative models of graphs. arXiv preprint arXiv:2012.11543 (2020)
[39] Tian, Y., Xu, J., Li, Y., Luo, J., Sueda, S., Li, H., Willis, K.D., Matusik, W.: Assemble them all: Physics-based planning for generalizable assembly by disassembly. ACM Trans. Graph. 41(6) (2022)
[40] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
[41] Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., et al.: Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)
[42] Walsman, A., Zhang, M., Kotar, K., Desingh, K., Farhadi, A., Fox, D.: Break and make: Interactive structural understanding using lego bricks. In: European Conference on Computer Vision. pp. 90–107. Springer (2022)
[43] Wang, R., Zhang, Y., Mao, J., Cheng, C.Y., Wu, J.: Translating a visual lego manual to a machine-executable plan. In: European Conference on Computer Vision. pp. 677–694. Springer (2022)
[44] Wani, S., Patel, S., Jain, U., Chang, A.X., Savva, M.: Multi-on: Benchmarking semantic map memory using multi-object navigation. In: Neural Information Processing Systems (NeurIPS) (2020)
[45] Willis, K.D., Pu, Y., Luo, J., Chu, H., Du, T., Lambourne, J.G., Solar-Lezama, A., Matusik, W.: Fusion 360 gallery: A dataset and environment for programmatic cad construction from human design sequences. ACM Transactions on Graphics (TOG) 40(4), 1–24 (2021)
[46] Xu, X., Peng, W., Cheng, C.Y., Willis, K.D., Ritchie, D.: Inferring cad modeling sequences using zone graphs. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6062–6070 (2021)
[47] Yan, C., Misra, D., Bennnett, A., Walsman, A., Bisk, Y., Artzi, Y.: Chalet: Cornell house agent learning environment. arXiv preprint arXiv:1801.07357 (2018)
[48] Zakka, K., Zeng, A., Lee, J., Song, S.: Form2fit: Learning shape priors for generalizable assembly from disassembly. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 9404–9410. IEEE (2020)
[49] Zhan, G., Fan, Q., Mo, K., Shao, L., Chen, B., Guibas, L.J., Dong, H., et al.: Generative 3d part assembly via dynamic graph learning. Advances in Neural Information Processing Systems 33, 6315–6326 (2020)
[50] Zhang, J., Cherian, A., Liu, Y., Ben-Shabat, Y., Rodriguez, C., Gould, S.: Aligning step-by-step instructional diagrams to video demonstrations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2483–2492 (2023)

Appendix 0.A Additional Ablations

0.A.1 Camera Motion

We also test the extent to which direct image comparison enables our results. To evaluate this, we retrained our model on the RC-V dataset, but with the camera in each frame rotated about the center of the assembly by $\pm 0.1$ radians, and translated by $\pm 10$ LDU in X, Y and Z. This means that when comparing the current simulator image with the top image of the construction stack, they will be viewed from slightly different viewpoints, and will not be aligned pixel-by-pixel. For reference, we also report numbers for the original model that was trained without camera motion evaluated on data with camera motion. The results of this experiment is shown in Table 4. Clearly these shifts in viewpoint negatively impact performance in a substantial way even when retrained on this data (see row Motion(T)), though the model is still able to achieve some success. We hypothesize that this decrease in performance is due to the difficulty of accurately comparing small offsets at low resolution. The problem becomes much worse when evaluating the model that was not trained with camera motion on data that contains camera motion (see row Motion(Unt)). In this case the model completely fails given that it was only ever trained on neatly aligned images.

RC-V	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
No Camera Motion	0.59	0.51	0.53	46.36
Camera Motion (Trained)	0.45	0.29	0.33	56.37
Camera Motion (Untrained)	0.00	0.00	0.00	78.75

Table 4: The default training approach (No Motion) compared against a model trained under small camera motion (Motion(Trained)) and the model trained on no motion, but evaluated under camera motion (Motion (Untrained)).

0.A.2 Expert Instruction Images

Given that our model only requires the current simulator image, the top image of the instruction stack and the current phase, we also explore the success of the model in a setting where we use the expert to generate the instruction stack during the break phase, but then do assembly with a learned agent using the expert’s instruction stack as input. Note that the models used here are not retrained, but use the same checkpoints from our main results. The improvement in performance when using the expert instructions aligns with the fact that the model sometimes fails to complete the Break phase on its own. Having access to the expert instructions and starting each episode in the Make phase removes these failure cases. Unsurprising the gap shrinks for smaller models where the overall performance is higher. While it is theoretically possible that the distribution of expert instructions differs from that which is typically produced by the agent, to the extent that this distribution shift exists, it does not appear to be hindering the results here.

RC-2	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
Model generated instructions	0.98	0.95	0.93	0.18
Expert generated instructions	0.99	0.96	0.96	0.17
RC-4	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
Model generated instructions	0.80	0.69	0.71	2.39
Expert generated instructions	0.86	0.77	0.76	1.94
RC-8	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
Model generated instructions	0.68	0.62	0.63	6.30
Expert generated instructions	0.83	0.75	0.75	4.39
RC-V	$F1_{b}\uparrow$	$F1_{e}\uparrow$	$F1_{a}\uparrow$	$AED\downarrow$
Model generated instructions	0.59	0.51	0.53	46.36
Expert generated instructions	0.71	0.63	0.65	33.81

Table 5: The fully trained model evaluated on instructions generated by the model and by the expert.

Appendix 0.B Online Expert Details

Here we include details on the logical procedure used to generate expert actions. At each time step, the expert has access to the current assembly $\hat{A}$ and the target assembly $A$ . We also keep track of a list of assemblies $\bar{A}$ that correspond to each time an instruction image was pushed onto the instruction stack. Algorithm 1 shows the procedure for generating expert actions. We use the shorthand $|A|$ to refer to the number of bricks in an assembly. The subroutine $\text{matching}(A,B)$ computes the single transform that best aligns the assemblies $A$ and $B$ and creates a lookup table between bricks that match under the alignment. The subroutine $\text{matching\_statistics}(m,A,B)$ takes a matching and a lookup table $m$ and two assemblies and computes a set of true positives $t_{p}$ , false positives $f_{p}$ and false negatives $f_{n}$ . It also returns the disconnected true positives $d_{p}$ which are bricks in the estimated assembly that match the shape and color of a brick in the ground truth assembly, but is not in the correct pose and is not connected properly, as well as connected true positives $c_{p}$ which are bricks that match the shape and color of a brick in the ground truth assembly, but are not in the correct pose, but do have at least one connection point connected correctly. The REMOVE, ROTATE, and ASSEMBLE require that an appropriate connection point is visible for cursor selection. If one is not, TERMINATE EARLY is returned instead.

Algorithm 1 Online Expert

Current Assembly

\hat{A}

Target Assembly

A

Stack of Instruction Assemblies

\bar{A}

Current Phase

p

Compute assembly matching

m,T=\text{match}(\bar{A}_{top},\hat{A})

t_{p},d_{p},c_{p},f_{p},f_{n}=\text{match\_statistics}(m,\bar{A}_{top},\hat{A})

m_{p}=d_{p}\cup c_{p}

|f_{n}|>1

|m_{p}|>1

or (

|m_{p}|

and

|A|\neq|\bar{A}_{top}|

) then

Return TERMINATE EARLY

end if

p=\text{Break}

then

r=|\bar{A}_{top}|-|\hat{A}|

r>1

r<0

then

Return TERMINATE EARLY

else if

r=1

then

Return PUSH INSTRUCTION

else if

|\hat{A}|=0

then

Return SWITCH TO MAKE PHASE

end if

else

\hat{A}=\bar{A}_{top}

then

Return POP INSTRUCTION

else if

\hat{A}=A

then

Return DONE

end if

|f_{p}|>0

then

Return REMOVE(

f_{p}

)

else if

|f_{n}|>0

then

Return INSERT(

f_{n}

)

else if

|c_{p}|>1

then

Return ROTATE(

c_{p}

)

else if

|d_{p}|>1

then

\text{orientation\_correct}(d_{p})

then

Return ASSEMBLE(

d_{p}

)

else

Return ROTATE(

d_{p}

)

end if

During the Break phase, this expert provides advice that removes one brick at a time, then pushes a new instruction image until no more bricks remain. It then switches to the make phase. During the Make phase, the expert will add a new brick and move it into place until the current assembly matches the assembly that was stored with the top instruction image, then pop that image off the stack. When rolling out using the learning agent, the expert is capable of recovering from incorrectly inserted bricks (by removing them), incorrectly placed bricks (by moving them). However it is not able to handle situations where the current assembly differs from the assembly corresponding to the top of the instruction stack by two or more bricks.

Appendix 0.C Online Training Algorithm

Our online training algorithm mixes off-policy data generated by the expert with on-policy data generated by the learner. The percentage of expert-generated data is controlled by a constant $\alpha$ which is set to 0.75 by default ( $75\%$ expert generated data). In both scenarios, the expert’s advice is used a label regardless of which policy was used to construct actions in the environment.

The algorithm maintains a fixed-capacity replay buffer $D$ of recent experience. Training proceeds in epochs, in which a fixed number of new dataset transitions $N_{steps/epoch}$ are generated according to the expert ratio $\alpha$ . These transitions are added to the replay buffer, evicting the oldest data. The model is then trained for a fixed number of steps $N_{train/epoch}$ on data sampled uniformly from the replay buffer. This process repeats until a maximum number of steps $N_{s}teps$ have taken in the environment. In practice, online data generation was parallelized across 32 copies of the environment. See Algorithm 2 for step-by-step psuedocode.

Algorithm 2 Online Training

LTRON Environment

E

Expert

\pi^{*}

Total Steps

N_{steps}

Epoch Rollout Steps

N_{steps/epoch}

Epoch Train Steps

N_{train/epoch}

Max Dataset Size

C

Expert-guided percentage

\alpha

Initialize policy

\pi_{\theta}

Initialize

D=\left\{\right\}

while

i<N_{steps}

for

j=1

N_{steps/epoch}

Sample action

a^{*}\sim\pi^{*}(o)

Sample action

a\sim\pi_{\theta}(o)

j/N_{steps/epoch}<\alpha

then

Execute action

o=E(a^{*})

else

Execute action

o=E(a)

end if

Add

(o,a^{*})

D

|D|>C

, evict the oldest entry

end for

for

j=1

N_{train/epoch}

Sample a batch

(o,a^{*})\sim D

Train

\pi_{\theta}(o)\rightarrow a

end for

i=i+N_{steps/epoch}

end while

Learning to Build by Building Your Own Instructions

Abstract

1 Introduction

2 Related Work

2.1 Interactive Understanding and Assembly

2.2 Memory

2.3 Inverse-Graphics

3 Methods

3.1 Environments and Data

3.2 Instruction Stack

3.3 Model

3.4 Online Training

3.5 Cursor Losses

4 Experiments

4.1 Evaluation

4.2 Break and Make

4.3 Online Training

4.4 Loss Functions

4.5 Conditional Action Generation

4.6 Selective Modification

4.7 Hyperparameters

4.8 Qualitative Evaluation

5 Conclusion

References

Appendix 0.A Additional Ablations

0.A.1 Camera Motion

0.A.2 Expert Instruction Images

Appendix 0.B Online Expert Details

Appendix 0.C Online Training Algorithm

Learning to Build by
Building Your Own Instructions