At Generalist, we're working towards a future where robots can "just do anything," and we're excited to share a step in this direction.
One of our newest internal benchmark tasks is one-shot assembly. Our team constructs a small structure, and the robot copies it. We're evaluating our models on how well they can build Legos – end-to-end, from pixels to 100Hz actions. No task-specific engineering, no custom instructions: it sees what you build and replicates it.
Why this matters:
Visual understanding: the model figures out “what to build” by looking at what is in front of it. Pixels in, Lego copies out (see the sketch after this list).
Next-level dexterity: Lego assembly demands sub-millimeter precision, careful re-grasps, nudges, and forceful interactions, e.g. presses timed to the instant studs align (see one third-party perspective).
A recent perspective slots this task into the highest level of sophistication of general-purpose robots: "Level 4 represents the final evolution where robots can perform force-dependent, delicate tasks with pinpoint accuracy. These tasks require the Dexterity to understand and react with nuance to the physical forces of the environment."
Sequential reasoning: for each brick, the model must choose the right one, orient it, stage it, and attach it correctly.
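To make “pixels in, Lego copies out” concrete, here is a minimal sketch of what a goal-image-conditioned policy interface could look like. This is an illustration only, not our actual stack; policy, camera, and robot are hypothetical stand-ins.

```python
# Hypothetical sketch: one-shot assembly as goal-image conditioning.
# The "instruction" is just a picture of the human-built structure;
# the policy maps (current pixels, goal pixels) -> actions.

def replicate_structure(policy, camera, robot):
    goal_image = camera.read()            # snapshot of the structure to copy
    while not policy.is_done():
        obs = camera.read()               # current view of the workspace
        action = policy(obs, goal_image)  # actions conditioned on the goal
        robot.send_action(action)
```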
We were inspired by your suggestions on our previous Lego throwing demos, and as far as we know, this is the world's first robot to assemble Legos with end-to-end visuomotor control. If you have tasks that you want to see robots do, we'd love to hear them.
Note: there are expected bounds to the generalization of what's shown in the video: we've only tested model capabilities on three-brick structures of two-by-four Lego bricks in four colors. Calculating how many possibilities this presents is not easy. (If this is easy for you, please reach out for a job.) If we agree that uncolored three-brick combinations of two-by-four Lego bricks have 1,560 combinations, then having 4 color options for each of the 3 bricks gives 4 × 4 × 4 × 1,560 = 99,840 possible combinations.
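For concreteness, here is that arithmetic as a short Python snippet, granting the premise above of 1,560 uncolored three-brick combinations:

```python
# Count of structures covered by the demo, taking as given the premise
# that three two-by-four Lego bricks form 1,560 uncolored combinations.
uncolored_combinations = 1_560
colors_per_brick = 4
bricks_per_structure = 3

# Each of the 3 bricks independently takes one of the 4 colors.
total = colors_per_brick ** bricks_per_structure * uncolored_combinations
print(total)  # prints 99840
```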
Research Preview
Jun 17, 2025
Today we're excited to share a glimpse of what we're building at Generalist.
As a first step towards our mission of deploying general-purpose robots, we are pushing the frontiers of what end-to-end AI models can achieve in the real world. We've been training models and evaluating their capabilities as dexterous sensorimotor policies across different embodiments, environments, and physical interactions. We're sharing capability demonstrations on tasks stressing different
aspects of manipulation: fine motor control, spatial and temporal precision, generalization across robots and
settings, and robustness to external disturbances.
In each of these videos, the robot is fully autonomous and controlled in real time by an end-to-end deep
neural network mapping pixels and other sensor data to 100Hz actions. The entire hardware and software stack
jointly enables reactive, smooth, and precise dexterous control from neural networks.
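As a rough illustration of what such a loop involves, here is a sketch with hypothetical names (policy, camera, proprio, robot), not our actual software:

```python
import time

CONTROL_HZ = 100       # actions emitted at 100Hz, as in the demos
DT = 1.0 / CONTROL_HZ  # 10 ms budget per tick

def control_loop(policy, camera, proprio, robot):
    """Hypothetical end-to-end loop: pixels and sensor data in, actions out."""
    next_tick = time.monotonic()
    while True:
        obs = {
            "pixels": camera.read(),    # latest available camera frame(s)
            "proprio": proprio.read(),  # joint angles, torques, etc.
        }
        action = policy(obs)            # one forward pass of the network
        robot.send_action(action)       # low-level command to the arm
        next_tick += DT                 # hold a fixed 100Hz control rate
        time.sleep(max(0.0, next_tick - time.monotonic()))
```

The point of the sketch is only that perception, inference, and actuation must all fit inside each 10 ms tick for the loop to stay reactive.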
We're using these tasks to test model capabilities along various axes of autonomous dexterity. The tasks
require nuanced behaviors like pushing, pulling, twisting, and multi-step re-grasping. Bi-manual
coordination allows actions like stabilizing and breaking apart Lego structures, tensioning flexible
materials, and dynamically creating funnels for small part handling. High-frequency control is important for
real-time behaviors like wiggling, throwing, or adjusting in-flight grasps. Precision is a prerequisite for closing a box with millimeter-level tolerances.
Further, the cross-embodiment model transfers across different arms (e.g., the 7-DoF Flexiv Rizon 4 and the 6-DoF UR5) and generalizes well to entirely new environments. For example, the fasteners task used no training data from UR5 arms, and no data for that task was collected in the environment where it was evaluated.
We're encouraged by the early results and the potential this system demonstrates. More to come.
Task: Pick & sort fasteners
Evaluates the ability of an end-to-end model to quickly pick and sort small, thin objects from clutter and place them, correctly oriented, into corresponding compartments. Hardware torque is the limiting factor for cycle time.
Task: Fold a box, pack a bike chain lock & close
Evaluates capabilities in handling articulated and deformable objects with precision over long-horizon sequences; the model adapts to disturbances and modulates force precisely. After assembling the box, the long bike chain lock needs to be coiled into the box in order to fit. Precision is particularly tested in closing the box, which requires simultaneously aligning flaps on both sides of the box, each with millimeter-level tolerance. Note also that the arm is strong enough to crush the box at any moment.
Task: Get the screws back into the glass jar
Evaluates tool use, precision, and bi-manual coordination across a number of maneuvers. The robot is
tasked with efficiently getting all the shiny M4 screws back into a clear container. As needed, it can
scrape them off a magnetic bit holder, bend the paper plate to form a makeshift funnel to pour them,
or pick them up one by one. Scraping requires precise interhand coordination (e.g. when does a scrape
become a grasp?), as does forming and transporting the funnel without spilling.
Task: Break apart, sort, & throw Legos
Evaluates capabilities in precise regrasping, forceful interhand coordination, generalization, and
high-velocity maneuvers. The robot is tasked with deconstructing assembled Legos and sorting the bricks into their color-corresponding bins. This can require re-grasping the bricks to get a better grip before wiggling and twisting them apart. The robot generalizes over a distribution of brick
formations and works with any ordering and positioning of the bins, via visual conditioning. This task
can't be done slowly, due to the physics of throwing bricks.