PRISM: Multi-view Video Dataset for Physical AI
Closing the knowledge gap holding physical AI back from real-world deployment.
PRISM is the first dataset to unify all three knowledge dimensions of physical AI — space, physics, and embodied action — in a single real-world deployment domain.
Dataset
PRISM-100K on Hugging Face
Need access to the full 270K corpus? Contact Sales.
The bottleneck isn't architecture; it's a structural gap in training data. Existing datasets address at most one knowledge dimension in any given domain, leaving embodied AI fundamentally unprepared for deployment. PRISM closes that gap.
Is the person still evaluating the product or ready to place it in the basket?
What is he doing in the scene?
Describe the scene.
Where is the person located?
What activity is this person performing and why?
How many products did the person evaluate? Why?
Which hand did the person use to pick up the green product?
How many products did the person check? Respond with a number only.
Would you expect to find beef in this aisle? Answer short.
What material is the grabbed item made of?
How many products did the person put into the basket?
Count the products in the basket at the beginning and at the end.
What actions are the hands performing?
Count the products in the basket by the end. What was the last item?
Can we open the refrigerator from where we are standing?
How can we get to the refrigerator from where we are standing?
What is the person in the black-and-white shirt doing?
Spatial Knowledge
Depth estimation, 3D layout understanding, foreground-background separation, and region-level spatial relationships — from both close-range egocentric and wide-angle 360° panoramic views.
Temporal & Physical Knowledge
Causality reasoning, arrow-of-time prediction, object permanence, and physics-grounded chain-of-thought over gravity, momentum, and biomechanics — at zero annotation cost.
Embodied Action Knowledge
Next-subtask prediction, task completion verification, goal-conditioned reasoning, cross-view activity matching, hand-interaction recognition, and multi-actor social navigation.
Domain-specific SFT beats general pretraining
Fine-tuning on PRISM reduces average error by 66.6% and cuts the Embodied Reasoning error rate by a factor of five — far beyond what scaling general-corpus training can achieve.
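Both figures above are relative error reductions, one quoted as a percentage and one as a factor. The sketch below only converts between the two forms so they can be compared on one scale. The function names are illustrative, and no baseline error rates are assumed because this page does not state them.

```python
def factor_to_relative_reduction(factor: float) -> float:
    """Relative error reduction implied by dividing the error rate by `factor`."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 - 1.0 / factor


def relative_reduction_to_factor(reduction: float) -> float:
    """Factor by which the error rate shrinks for a given relative reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# "A factor of five" on Embodied Reasoning, expressed as a relative reduction.
embodied_reduction = factor_to_relative_reduction(5.0)

# The 66.6% average error reduction, expressed as a factor.
average_factor = relative_reduction_to_factor(0.666)

print(f"Embodied Reasoning: {embodied_reduction:.1%} relative error reduction")
print(f"Average: error divided by about {average_factor:.2f}")
```

Read this way, the fivefold cut on Embodied Reasoning corresponds to an 80% relative reduction, and the 66.6% average reduction corresponds to roughly a threefold cut.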
Multi-view training is mutually reinforcing
Adding exocentric supervision improves cross-view understanding without degrading egocentric performance. Ego-exo data mixing is a curriculum advantage, not a trade-off.
Supervision format matters as much as scale
LLM-generated chain-of-thought annotations deliver substantially larger gains than template-based alternatives for spatial and causal reasoning tasks.
60% of the data captures 95% of the gain
The data-scaling curve shows that training on 162K samples (60% of PRISM) already achieves 87.7% accuracy, within 1.2 pp of the full-data ceiling of 88.9%.
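The arithmetic behind this claim can be checked with a short sketch. The sample counts and accuracies are the ones quoted on this page. The baseline accuracy (before fine-tuning) is not stated here, so the code solves for the baseline that the "95% of the gain" headline would imply. That value is an inference from rounded figures, not a reported result.

```python
FULL_SAMPLES = 270_000    # full PRISM corpus size quoted on this page
SUBSET_SAMPLES = 162_000  # subset size quoted for the scaling result
ACC_SUBSET = 87.7         # accuracy (%) at 162K samples
ACC_FULL = 88.9           # accuracy (%) with the full data


def fraction_of_gain(acc_subset: float, acc_full: float, acc_baseline: float) -> float:
    """Share of the full-data improvement over the baseline that the subset reaches."""
    return (acc_subset - acc_baseline) / (acc_full - acc_baseline)


def implied_baseline(acc_subset: float, acc_full: float, gain_fraction: float) -> float:
    """Baseline accuracy for which the subset captures `gain_fraction` of the gain."""
    return (acc_subset - gain_fraction * acc_full) / (1.0 - gain_fraction)


data_fraction = SUBSET_SAMPLES / FULL_SAMPLES
gap_pp = ACC_FULL - ACC_SUBSET
baseline = implied_baseline(ACC_SUBSET, ACC_FULL, 0.95)  # inferred, not reported

print(f"data fraction: {data_fraction:.0%}")
print(f"gap to full-data ceiling: {gap_pp:.1f} pp")
print(f"baseline implied by a 95% gain share: {baseline:.1f}%")
```

Because the headline percentage is rounded, the implied baseline is only approximate. Plugging a different baseline into `fraction_of_gain` shows how sensitive the "share of the gain" figure is to that choice.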
Embodied Reasoning (9 Probes)
Next subtask prediction, task completion verification, goal-conditioned action reasoning, exo-to-ego activity matching, hand interaction recognition, atomic action recognition, atomic action reasoning CoT, multi-actor scene understanding, social navigation reasoning CoT.
Common Sense & Spatial (6 Probes)
Scene VQA, environment VQA (exo), spatial reasoning CoT, affordance reasoning, causality reasoning, exo spatial reasoning.
Spatial Perception (2 Probes)
Relative depth reasoning, 360° spatial layout CoT.
Intuitive Physics (3 Probes)
Arrow-of-time (ego + exo), physics CoT, object permanence.
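For scripting an evaluation over the four probe groups above, the taxonomy can be written down as a plain mapping. This is a minimal sketch: the category and probe names are copied from this page, while the variable names and the dictionary layout are assumptions, not the dataset's actual schema.

```python
# Probe names per category, as listed on this page.
PROBE_SUITE = {
    "Embodied Reasoning": [
        "next subtask prediction",
        "task completion verification",
        "goal-conditioned action reasoning",
        "exo-to-ego activity matching",
        "hand interaction recognition",
        "atomic action recognition",
        "atomic action reasoning CoT",
        "multi-actor scene understanding",
        "social navigation reasoning CoT",
    ],
    "Common Sense & Spatial": [
        "scene VQA",
        "environment VQA (exo)",
        "spatial reasoning CoT",
        "affordance reasoning",
        "causality reasoning",
        "exo spatial reasoning",
    ],
    "Spatial Perception": [
        "relative depth reasoning",
        "360° spatial layout CoT",
    ],
    "Intuitive Physics": [
        "arrow-of-time (ego + exo)",
        "physics CoT",
        "object permanence",
    ],
}

# Probe counts stated in the category headings.
STATED_COUNTS = {
    "Embodied Reasoning": 9,
    "Common Sense & Spatial": 6,
    "Spatial Perception": 2,
    "Intuitive Physics": 3,
}

# Sanity check: the listed probes match the counts in the headings.
for category, probes in PROBE_SUITE.items():
    assert len(probes) == STATED_COUNTS[category], category

total_probes = sum(len(probes) for probes in PROBE_SUITE.values())
print(f"{len(PROBE_SUITE)} categories, {total_probes} probes")
```

The headings and lists are consistent with each other and add up to 20 probes across the four categories.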
MCQ Overlay & Multi-format
Multiple-choice and open-ended versions including CoT, compatible with any VLM or VLA.
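As an illustration of what a multiple-choice overlay on an open-ended item could look like, the sketch below wraps a question and its reference answer with distractors and returns a lettered prompt plus the correct letter. The record fields, the `to_mcq` helper, and the example answer and distractors are all hypothetical. They are not taken from the dataset, and the authors' actual overlay procedure is not described on this page.

```python
import random
from dataclasses import dataclass

LETTERS = "ABCDEFGH"


@dataclass
class OpenEndedItem:
    """Hypothetical open-ended record: a question and its reference answer."""
    question: str
    answer: str


def to_mcq(item: OpenEndedItem, distractors: list[str], seed: int = 0) -> tuple[str, str]:
    """Build a lettered multiple-choice prompt and return (prompt, correct_letter)."""
    if item.answer in distractors:
        raise ValueError("distractors must not contain the reference answer")
    options = [item.answer, *distractors]
    if len(options) > len(LETTERS):
        raise ValueError("too many options")
    # Seeded shuffle so the option order is reproducible across runs.
    random.Random(seed).shuffle(options)
    lines = [f"{LETTERS[i]}. {option}" for i, option in enumerate(options)]
    prompt = item.question + "\n" + "\n".join(lines)
    correct_letter = LETTERS[options.index(item.answer)]
    return prompt, correct_letter


# Question text from this page; the answer and distractors are made up for illustration.
item = OpenEndedItem(
    question="Which hand did the person use to pick up the green product?",
    answer="Right hand",
)
prompt, correct = to_mcq(item, ["Left hand", "Both hands"], seed=7)
print(prompt)
print("correct:", correct)
```

Keeping the open-ended record as the source of truth and generating the multiple-choice view from it means both formats stay in sync, which is one plausible way to serve the same probe to models with different output interfaces.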