Retail-VLA-10K
Dataset by DreamVu
A large-scale egocentric video dataset of human manipulation actions in real retail environments, curated by DreamVu for robot training. Formatted for LeRobot v2.1 and free for the research community.
Why use Retail-VLA-10K for the AgiBot World Challenge?
The official Reasoning2Action track includes retail operations as a core task. Here is why this dataset gives your team a concrete edge.
Retail Skills That Match the Challenge Tasks
The challenge tasks stock_and_straighten_shelf and take_wrong_item_shelf map directly to skills in this dataset — Placing on Shelf (1,267 episodes), Picking Up Item (1,550 episodes), Reaching (1,593 episodes), and Grasping (1,590 episodes). This is task-aligned demonstration data, not generic manipulation footage.
Real-World Data to Complement AgiBot's Sim Dataset
The official Reasoning2Action dataset is simulation-based (Genie Sim 3.0). Retail-VLA-10K is curated from real retail environments — real lighting, real product diversity, real shelf clutter. Using both together directly addresses the Sim2Real gap that the challenge is built around.
Same LeRobot v2.1 Format as the Official Dataset
The official AgiBot challenge dataset uses the LeRobot v2.1 layout (meta / data / videos). Retail-VLA-10K uses the exact same structure. There is no reformatting, no custom dataloaders — you can mix and augment your training set immediately and spend time on model development instead.
10,000+ Episodes Ready to Train On
With 10,000+ episodes and 3M+ frames, this dataset is large enough to meaningfully pretrain or fine-tune a VLA policy — not just evaluate one. Teams that start with more high-quality demonstration data have a measurable head start on generalization.
LAPA Latent Actions — No Proprioception Required
Actions are encoded using LAPA (Latent Action Pretraining from Videos), a codebook-based quantization model. You do not need proprioceptive robot data to benefit: LAPA lets you use this human demonstration video directly for latent action pretraining. The approach is hardware-agnostic and compatible with the G2 robot setup.
Free, Immediate, No Paperwork
Released under CC BY-NC 4.0 — no access requests, no waiting, no gating. Download and start training today. The only condition is non-commercial use, which covers all research and challenge participation.
Reasoning2Action — Task Alignment
See how Retail-VLA-10K maps to the 10 official challenge tasks in the Reasoning2Action track.
Track 1 evaluates models across 10 progressively challenging manipulation tasks, ranging from basic to complex — including retail operations, logistics sorting, and long-horizon skills.
Retail-VLA-10K directly supports the two retail-specific tasks and provides the core manipulation primitives — grasping, reaching, placing — that underpin performance across the entire track.
11 Retail Manipulation Skills
Every episode is captured from a first-person (egocentric) perspective, designed to match the natural viewpoint of a deployed robot.
| Skill | Dataset ID | Episodes | Frames |
|---|---|---|---|
| Grasping | manipulation_grasping | 1,590 | 484,619 |
| Reaching | manipulation_reaching | 1,593 | 467,868 |
| Holding | manipulation_holding | 1,558 | 488,087 |
| Picking Up Item | manipulation_picking_up_item | 1,550 | 473,327 |
| Cart Pushing | manipulation_cart_pushing | 1,180 | 354,508 |
| Placing on Shelf | manipulation_placing_item_on_shelf | 1,267 | 395,428 |
| Placing in Cart | manipulation_placing_item_in_cart | 423 | 137,514 |
| Lifting | manipulation_lifting | 445 | 137,354 |
| Object Manipulation | manipulation_object_manipulation | 188 | 55,303 |
| Placing in Basket | manipulation_placing_item_in_basket | 153 | 47,938 |
| Holding Item | manipulation_holding_item | 176 | 53,760 |
| Total | 11 skills | 10,123 | 3,095,706 |
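As a quick sanity check, the per-skill counts in the table reproduce the stated totals. A minimal stdlib snippet (skill IDs and counts copied from the table):

```python
# Per-skill (episodes, frames) counts, copied from the skills table.
SKILLS = {
    "manipulation_grasping": (1_590, 484_619),
    "manipulation_reaching": (1_593, 467_868),
    "manipulation_holding": (1_558, 488_087),
    "manipulation_picking_up_item": (1_550, 473_327),
    "manipulation_cart_pushing": (1_180, 354_508),
    "manipulation_placing_item_on_shelf": (1_267, 395_428),
    "manipulation_placing_item_in_cart": (423, 137_514),
    "manipulation_lifting": (445, 137_354),
    "manipulation_object_manipulation": (188, 55_303),
    "manipulation_placing_item_in_basket": (153, 47_938),
    "manipulation_holding_item": (176, 53_760),
}

total_episodes = sum(e for e, _ in SKILLS.values())  # 10,123
total_frames = sum(f for _, f in SKILLS.values())    # 3,095,706
```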
Format & Structure
Plug-and-play with LeRobot v2.1 pipelines — the same format as the official AgiBot challenge dataset.
Video
640×480 H.264 video at 30 fps. Each skill is a self-contained sub-directory with meta/, data/, and videos/ folders matching the LeRobot v2.1 layout exactly.
Action Encoding
Actions encoded via LAPA (Latent Action Pretraining from Videos). 4 latent action indices per frame, codebook size 8, sequence length 4. No proprioceptive labels required.
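How you feed the 4 per-frame latent indices to a policy is up to your pipeline. One common choice is to flatten them into a single composite token, which with codebook size 8 yields a vocabulary of 8^4 = 4,096 actions. A minimal sketch (the helper name is hypothetical, not part of the dataset tooling):

```python
CODEBOOK_SIZE = 8        # from the dataset spec
INDICES_PER_FRAME = 4    # from the dataset spec

def flatten_latent_action(indices, codebook_size=CODEBOOK_SIZE):
    """Collapse per-frame latent indices into one composite token id.

    Treats the indices as base-8 digits, giving tokens in [0, 8**4).
    """
    token = 0
    for idx in indices:
        if not 0 <= idx < codebook_size:
            raise ValueError(f"index {idx} outside codebook [0, {codebook_size})")
        token = token * codebook_size + idx
    return token
```

For example, `flatten_latent_action([0, 0, 0, 0])` gives 0 and `flatten_latent_action([7, 7, 7, 7])` gives 4095, the largest token id.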
Annotations
Episodes include language task annotations and standard LeRobot metadata — episodes.jsonl, episodes_stats.jsonl, info.json, and tasks.jsonl.
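The `.jsonl` metadata files are plain newline-delimited JSON, so they need nothing beyond the standard library to inspect. A sketch (the record values below are illustrative, not real dataset entries; the `episode_index` / `tasks` / `length` keys follow the LeRobot v2.1 `episodes.jsonl` schema):

```python
import io
import json

def load_jsonl(stream):
    """Parse a LeRobot-style *.jsonl file: one JSON object per line."""
    return [json.loads(line) for line in stream if line.strip()]

# Illustrative stand-in for an open episodes.jsonl file.
sample = io.StringIO(
    '{"episode_index": 0, "tasks": ["place the item on the shelf"], "length": 312}\n'
    '{"episode_index": 1, "tasks": ["grasp the item"], "length": 287}\n'
)
episodes = load_jsonl(sample)
```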
Codebase Version
Packaged for LeRobot v2.1. Episode data stored as .parquet files per chunk. Identical structure to the official AgiBot Reasoning2Action dataset — mix them directly.
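Under the v2.1 layout, each episode's parquet file lives in a zero-padded chunk directory. A sketch of the path convention, assuming the LeRobot default of 1,000 episodes per chunk (verify against the `chunks_size` field in `meta/info.json` of the copy you download):

```python
def episode_parquet_path(episode_index, chunk_size=1_000):
    """Relative path of one episode's parquet file under data/.

    Assumes the LeRobot v2.1 convention: data/chunk-XXX/episode_XXXXXX.parquet,
    zero-padded, with chunk_size episodes per chunk.
    """
    chunk = episode_index // chunk_size
    return f"data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet"
```

For example, episode 0 resolves to `data/chunk-000/episode_000000.parquet`.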
Capture Perspective
First-person (egocentric) viewpoint throughout — matching the camera placement of humanoid and mobile manipulation robots. No viewpoint mismatch to compensate for.
License
CC BY-NC 4.0 — free for research and challenge participation. No access requests, no waiting. Attribution to DreamVu required. Non-commercial use only.
Ready to close the Sim2Real gap?
Download Retail-VLA-10K on Hugging Face and start training alongside the official AgiBot dataset today.