One of our newest internal evaluation tasks is one-shot assembly: a person constructs a small structure, and the robot copies it. We're evaluating our models on how well they can build Legos end-to-end, from pixels to 100 Hz actions. No task-specific engineering, no custom instructions: the model sees what you build and replicates it.
Why this matters:
- Visual understanding: the model figures out “what to build” by looking at what is in front of it. Pixels in, Lego copies out.
- Next-level dexterity: Lego assembly demands sub-millimeter precision, careful re-grasps, nudges, and forceful interactions, e.g. presses timed to the instant the studs align (see one third-party perspective).
- Sequential reasoning: for each brick, the model must choose the right one, orient it, stage it, and attach it correctly.
We were inspired by your suggestions on our previous Lego throwing demos, and as far as we know, this is the world's first robot to assemble Legos with end-to-end visuomotor control. If you have tasks you want to see robots do, we'd love to hear about them.
Note: there are expected bounds to the generalization of what's shown in the video: we've only tested model capabilities on 3-brick structures of two-by-four Lego bricks drawn from 4 colors. Counting how many possibilities this presents is not easy. (If it is easy for you, please reach out for a job.) If we agree that there are 1,560 uncolored 3-brick arrangements of two-by-four Lego bricks, then 4 color options for each of the 3 bricks gives 4 × 4 × 4 × 1,560 = 99,840 possible structures.
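As a quick sanity check on the arithmetic above (taking the 1,560 uncolored-arrangement count as given, not derived here), the colored total can be computed directly:

```python
# Number of color choices per brick and bricks per structure.
COLORS = 4
BRICKS = 3

# Uncolored 3-brick arrangements of two-by-four bricks,
# taken as given from the text (not derived here).
UNCOLORED_ARRANGEMENTS = 1560

# Each of the 3 bricks independently takes one of 4 colors,
# so multiply the uncolored count by 4^3.
total = COLORS ** BRICKS * UNCOLORED_ARRANGEMENTS
print(total)  # 99840
```

This treats every coloring of every arrangement as distinct, which matches the back-of-the-envelope count in the text.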