Now Available

PRISM: Multi-view Video Dataset for Physical AI

Closing the knowledge gap holding physical AI back from real-world deployment.

The One-Liner

PRISM is the first dataset to unify all three knowledge dimensions of physical AI — space, physics, and embodied action — in a single real-world deployment domain.

Resources

arXiv

Research Paper Abstract

Download PDF

Full Technical Paper

Dataset

PRISM-100K on Hugging Face

Model

Cosmos-Reason2-2B Fine-tuned

GitHub Code

Project Implementation

Need access to the full 270K corpus?

Contact Sales
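
For a quick look before requesting the full corpus, the PRISM-100K release can be pulled from the Hugging Face Hub with the datasets library. A minimal sketch follows; the repository id below is a placeholder, so substitute the actual id from the dataset card linked above.

```python
# Minimal loading sketch for the public PRISM-100K release.
# NOTE: "your-org/PRISM-100K" is a placeholder repo id, not the real one;
# use the id from the Hugging Face dataset card in the Resources list.
from datasets import load_dataset

prism = load_dataset("your-org/PRISM-100K", split="train")

print(len(prism))        # number of samples in the public split
print(prism[0].keys())   # annotation fields of a single sample
```
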
The Problem We're Solving
State-of-the-art vision-language models can describe what they see. But they cannot reliably act in the world.

The bottleneck isn't architecture — it's a structural gap in training data. Existing datasets address at most one knowledge dimension in any given domain, leaving embodied AI fundamentally unprepared for deployment. PRISM closes that gap.

Key Stats
66.6%
Error rate reduction across all probes
+23.8 pp
Average accuracy gain (62.8% → 86.6%)
5×
Embodied Reasoning error reduction (45.5% → 9.1%)
270K
Training samples
~11.8M
Video frames
~730M
Total tokens
20
Capability probes
5
Real supermarkets
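
Two of the headline figures follow directly from the accuracies above; the quick check below reproduces them using standard percentage-point and ratio arithmetic (assumed, not quoted from the paper).

```python
# Quick check of the headline Key Stats, using only the numbers shown above.
base_acc, tuned_acc = 62.8, 86.6

gain_pp = round(tuned_acc - base_acc, 1)   # accuracy gain in percentage points
error_ratio = round(45.5 / 9.1, 1)         # Embodied Reasoning error-rate ratio

print(gain_pp)       # 23.8
print(error_ratio)   # 5.0
```
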
The Three Knowledge Pillars

Spatial Knowledge

Depth estimation, 3D layout understanding, foreground-background separation, and region-level spatial relationships — from both close-range egocentric and wide-angle 360° panoramic views.

Temporal & Physical Knowledge

Causality reasoning, arrow-of-time prediction, object permanence, and physics-grounded chain-of-thought over gravity, momentum, and biomechanics — at zero annotation cost.

Embodied Action Knowledge

Next-subtask prediction, task completion verification, goal-conditioned reasoning, cross-view activity matching, hand-interaction recognition, and multi-actor social navigation.
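
To make the three pillars concrete, here is one hypothetical shape a single training sample could take, pairing a clip with supervision from each pillar. Every field name and string below is illustrative only, not the released schema.

```python
# Illustrative only: field names are assumptions, not the released PRISM schema.
# The point is that one clip carries supervision from all three pillars.
example_sample = {
    "video": "clip_000123.mp4",     # egocentric or 360° exocentric supermarket clip
    "views": ["ego", "exo_360"],    # paired camera views for this sample

    # Spatial knowledge
    "spatial": {
        "question": "Which shelf is closer to the camera, the left or the right one?",
        "answer": "The left shelf.",
    },

    # Temporal & physical knowledge
    "physics": {
        "question": "Is the clip playing forward or in reverse?",
        "answer": "Forward.",
        "cot": "The dropped item falls and stays on the floor, consistent with gravity.",
    },

    # Embodied action knowledge
    "action": {
        "question": "What is the next subtask after the shopper scans the item?",
        "answer": "Place the item in the bag.",
    },
}
```
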

Four Things PRISM Proves
Finding 01

Domain-specific SFT beats general pretraining

Fine-tuning on PRISM reduces average error by 66.6% and cuts the Embodied Reasoning error rate by a factor of five — far beyond what scaling general-corpus training can achieve.

Finding 02

Multi-view training is mutually reinforcing

Adding exocentric supervision improves cross-view understanding without degrading egocentric performance. Ego-exo data mixing is a curriculum advantage, not a trade-off.

Finding 03

Supervision format matters as much as scale

LLM-generated chain-of-thought annotations deliver substantially larger gains than template-based alternatives for spatial and causal reasoning tasks.
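
To illustrate what the two supervision formats look like in practice, here is a hypothetical side-by-side for the same spatial question; both answer strings are invented for illustration and are not drawn from the dataset.

```python
# Hypothetical contrast between the two supervision formats discussed above.
# Both answers are invented examples, not actual PRISM annotations.
question = "Which object is closer to the camera, the basket or the cart?"

# Template-based supervision: a fixed pattern filled with labels.
template_answer = "The basket is closer than the cart."

# LLM-generated chain-of-thought supervision: intermediate reasoning plus the answer.
cot_answer = (
    "The basket occupies a larger portion of the frame and occludes part of the cart, "
    "and its base sits lower in the image, so it is nearer to the camera. "
    "Answer: the basket."
)
```
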

Finding 04

60% of the data captures 95% of the gain

The data-scaling curve shows that training on 162K samples (60% of PRISM) already reaches 87.7% accuracy, within 1.2 pp of the full-data ceiling of 88.9%.
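
The scaling claim checks out arithmetically; the sketch below reproduces the 60% fraction, the 1.2 pp gap, and the roughly 95% share of the gain, taking the 62.8% baseline from the Key Stats block as the starting accuracy (an assumption, since Finding 04 does not restate it).

```python
# Reproduce the Finding 04 numbers from the figures quoted on this page.
base_acc = 62.8       # assumed baseline accuracy, taken from the Key Stats block
partial_acc = 87.7    # accuracy at 60% of the data
full_acc = 88.9       # full-data ceiling

print(162_000 / 270_000)                 # 0.6  -> 162K is 60% of 270K
print(round(full_acc - partial_acc, 1))  # 1.2  -> gap in percentage points
print(round((partial_acc - base_acc) / (full_acc - base_acc), 2))  # 0.95 -> share of gain
```
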

What's Inside

Embodied Reasoning (9 Probes)

Next subtask prediction, task completion verification, goal-conditioned action reasoning, exo-to-ego activity matching, hand interaction recognition, atomic action recognition, atomic action reasoning CoT, multi-actor scene understanding, social navigation reasoning CoT.

Common Sense & Spatial (6 Probes)

Scene VQA, environment VQA (exo), spatial reasoning CoT, affordance reasoning, causality reasoning, exo spatial reasoning.

Spatial Perception (2 Probes)

Relative depth reasoning, 360° spatial layout CoT.

Intuitive Physics (3 Probes)

Arrow-of-time (ego + exo), physics CoT, object permanence.

MCQ Overlay & Multi-format

Multiple-choice and open-ended versions including CoT, compatible with any VLM or VLA.
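
Because every probe ships in both a multiple-choice and an open-ended CoT form, an evaluation harness only needs to branch on the question format. The records and helper below are a hypothetical sketch of that overlay, not the released schema.

```python
# Hypothetical sketch of the MCQ overlay; field names and strings are assumptions.
open_ended = {
    "probe": "relative_depth_reasoning",
    "format": "open_ended_cot",
    "question": "Which object is closer to the camera, and why?",
}

mcq = {
    "probe": "relative_depth_reasoning",
    "format": "mcq",
    "question": "Which object is closer to the camera?",
    "choices": ["A. The basket", "B. The cart", "C. They are equidistant"],
}

def build_prompt(record: dict) -> str:
    """Render either format as a plain-text prompt for a VLM or VLA."""
    if record["format"] == "mcq":
        return record["question"] + "\n" + "\n".join(record["choices"])
    return record["question"]

print(build_prompt(mcq))
```
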