PRISM: Multi-view Video Dataset for Physical AI
Closing the knowledge gap holding physical AI back from real-world deployment.
PRISM is the first dataset to unify all three knowledge dimensions of physical AI — space, physics, and embodied action — in a single real-world deployment domain.
arXiv: Research Paper Abstract
Download PDF: Full Technical Paper
Dataset: PRISM-100K on Hugging Face
Model: Cosmos-Reason2-2B Fine-tuned
GitHub Code: Project Implementation
Need access to the full 270K corpus?
Contact Sales

The bottleneck isn't architecture; it's a structural gap in training data. Existing datasets address at most one knowledge dimension in any given domain, leaving embodied AI fundamentally unprepared for deployment. PRISM closes that gap.
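For a quick look at the public release before requesting the full corpus, a minimal loading sketch is below. The repo id and field names are assumptions, not confirmed by this page; check the dataset card on Hugging Face for the actual schema.

```python
# Minimal sketch: load the public 100K split from the Hugging Face Hub.
# "prism/PRISM-100K" is a hypothetical repo id; see the dataset card for
# the real one and for the actual field names.
from datasets import load_dataset

ds = load_dataset("prism/PRISM-100K", split="train")
print(ds[0].keys())  # expected: video reference, question, answer, probe type (assumed)
```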
Spatial Knowledge
Depth estimation, 3D layout understanding, foreground-background separation, and region-level spatial relationships — from both close-range egocentric and wide-angle 360° panoramic views.
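To make those annotation types concrete, here is a hypothetical sample record; the paths, field names, and probe id are illustrative, not PRISM's published schema.

```python
# Hypothetical spatial-knowledge sample showing how one scene can carry both
# an egocentric and a 360° exocentric view. All fields are illustrative.
spatial_sample = {
    "ego_video": "clips/ego/000123.mp4",     # close-range egocentric clip
    "exo_video": "clips/exo360/000123.mp4",  # wide-angle 360° panoramic clip
    "question": "Is the cart in the foreground or the background relative to the shelf?",
    "choices": ["foreground", "background"],
    "answer": "foreground",
    "probe": "relative_depth_reasoning",
}
```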
Temporal & Physical Knowledge
Causality reasoning, arrow-of-time prediction, object permanence, and physics-grounded chain-of-thought over gravity, momentum, and biomechanics — at zero annotation cost.
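The zero-annotation-cost claim is easiest to see for arrow-of-time: a label can be manufactured simply by reversing the clip. Below is a minimal sketch of that standard construction; PRISM's exact pipeline may differ.

```python
import random

def arrow_of_time_sample(frames):
    """Standard zero-cost arrow-of-time construction (not necessarily PRISM's
    exact pipeline): flip a coin, optionally reverse the frame order, and the
    coin outcome is the supervision label. No human annotation required."""
    is_reversed = random.random() < 0.5
    clip = frames[::-1] if is_reversed else frames
    label = "reversed" if is_reversed else "forward"
    return clip, label
```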
Embodied Action Knowledge
Next-subtask prediction, task completion verification, goal-conditioned reasoning, cross-view activity matching, hand-interaction recognition, and multi-actor social navigation.
Domain-specific SFT beats general pretraining
Fine-tuning on PRISM reduces average error by 66.6% and cuts the Embodied Reasoning error rate by a factor of five — far beyond what scaling general-corpus training can achieve.
Multi-view training is mutually reinforcing
Adding exocentric supervision improves cross-view understanding without degrading egocentric performance. Ego-exo data mixing is a curriculum advantage, not a trade-off.
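A sketch of what such ego-exo mixing can look like at the batch level; the 30% exocentric ratio is an arbitrary illustration, not a ratio reported here.

```python
import random

def mixed_batch(ego_pool, exo_pool, batch_size=32, exo_ratio=0.3):
    """Sample a training batch that mixes egocentric and exocentric clips.
    The exo_ratio default is illustrative, not a published PRISM setting."""
    n_exo = int(batch_size * exo_ratio)
    batch = random.sample(exo_pool, n_exo) + random.sample(ego_pool, batch_size - n_exo)
    random.shuffle(batch)  # interleave views within the batch
    return batch
```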
Supervision format matters as much as scale
LLM-generated chain-of-thought annotations deliver substantially larger gains than template-based alternatives for spatial and causal reasoning tasks.
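To illustrate the contrast between the two supervision formats (neither string is PRISM's actual template or prompt):

```python
def template_annotation(obj, relation, anchor):
    # Template-based supervision: metadata slotted into a fixed sentence,
    # with no intermediate reasoning for the model to learn from.
    return f"The {obj} is {relation} the {anchor}."

def cot_annotation_prompt(obj, anchor):
    # LLM-generated CoT supervision: an annotator model is prompted to spell
    # out the spatial reasoning before the answer (hypothetical prompt text).
    return (
        f"Reason step by step about where the {obj} sits relative to the "
        f"{anchor}, citing depth and occlusion cues, then finish with "
        "'Answer: <relation>'."
    )
```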
60% of the data captures 95% of the gain
The data-scaling curve shows that 162K samples (60% of PRISM) already achieves 87.7% accuracy — within 1.2 pp of the full-data ceiling of 88.9%.
Embodied Reasoning (9 Probes)
Next subtask prediction, task completion verification, goal-conditioned action reasoning, exo-to-ego activity matching, hand interaction recognition, atomic action recognition, atomic action reasoning CoT, multi-actor scene understanding, social navigation reasoning CoT.
Common Sense & Spatial (6 Probes)
Scene VQA, environment VQA (exo), spatial reasoning CoT, affordance reasoning, causality reasoning, exo spatial reasoning.
Spatial Perception (2 Probes)
Relative depth reasoning, 360° spatial layout CoT.
Intuitive Physics (3 Probes)
Arrow-of-time (ego + exo), physics CoT, object permanence.
MCQ Overlay & Multi-format
Probes are provided in both multiple-choice and open-ended formats, including CoT variants, compatible with any VLM or VLA.
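As a sketch of how a single probe item could be rendered in both formats (field names and wording are assumptions, not the published schema):

```python
# Hypothetical probe item and the two renderings described above.
item = {
    "probe": "object_permanence",
    "question": "After the box slides in front of the mug, is the mug still on the table?",
    "choices": ["yes", "no"],
    "answer_idx": 0,
}

def as_mcq(item):
    # Multiple-choice overlay: letter-labeled options, single-letter answer.
    opts = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(item["choices"]))
    return f"{item['question']}\n{opts}\nAnswer with a single letter."

def as_open_ended_cot(item):
    # Open-ended CoT variant: free-form reasoning before the final answer.
    return f"{item['question']} Explain your reasoning step by step, then answer."
```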