Now Available

PRISM: Multi-view Video Dataset for Physical AI

Closing the knowledge gap holding physical AI back from real-world deployment.

The One-Liner

PRISM is the first dataset to unify all three knowledge dimensions of physical AI — space, physics, and embodied action — in a single real-world deployment domain.

Resources

Watch Demo: Interactive PRISM Demo

arXiv: Research Paper Abstract

Download PDF: Full Technical Paper

Dataset: PRISM-100K on Hugging Face

Need access to the full 270K corpus? Contact Sales.

The Challenge

The Problem We're Solving

State-of-the-art vision-language models can describe what they see. But they cannot reliably act in the world.

The bottleneck isn't architecture — it's a structural gap in training data. Existing datasets address at most one knowledge dimension in any given domain, leaving embodied AI fundamentally unprepared for deployment. PRISM closes that gap.

Performance

Key Stats

66.6%
Error rate reduction across all probes

+23.8 pp
Average accuracy gain in percentage points (62.8% → 86.6%)

5×
Embodied Reasoning error reduction (45.5% → 9.1%)

270K
Training samples

~11.8M
Video frames

~730M
Total tokens

Capability probes

Real Supermarkets
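
The headline figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `error_reduction` is an assumption, not part of any PRISM tooling, and it uses only the accuracy and error rates quoted under Key Stats.

```python
def error_reduction(err_before: float, err_after: float) -> tuple[float, float]:
    """Return (relative reduction in percent, reduction factor) for two error rates."""
    if err_before <= 0 or err_after <= 0:
        raise ValueError("error rates must be positive")
    relative = (err_before - err_after) / err_before * 100.0
    factor = err_before / err_after
    return relative, factor

# Embodied Reasoning error rates quoted in Key Stats (percent).
relative, factor = error_reduction(45.5, 9.1)
print(f"{relative:.1f}% relative reduction, {factor:.1f}x fewer errors")

# Average accuracy gain, expressed in percentage points.
gain_pp = 86.6 - 62.8
print(f"average accuracy gain: {gain_pp:.1f} pp")
```

A drop from 45.5% to 9.1% is an 80% relative reduction, which is the same thing as the 5× factor quoted above.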

Demo

Dataset Video Examples

Is the person still evaluating the product or ready to place it in the basket?

Cosmos-Reason2-2B: The person is still evaluating the product. Next, he should place the guava in the basket.

PRISM: The person is still evaluating the guava's quality. The next action would be to place it in the basket if it meets their standards.

What is he doing in the scene?

Cosmos-Reason2-2B: He is reaching into a fruit display case.

PRISM: The person is inspecting a blackberry. They hold it in their hand and examine it closely.

Describe the scene.

Cosmos-Reason2-2B: The shelves are stocked with a variety of products, including personal care items, food, and household goods.

PRISM: The scene shows the person walking down a wide aisle. The shelves are filled with personal care products, including shampoos and conditioners.

Where is the person located?

Cosmos-Reason2-2B: The person is located in the aisle of a store.

PRISM: The person is in the Aisle (Personal Care/Flour).

What activity is this person performing and why?

Cosmos-Reason2-2B: The person is carefully reading the label on the shampoo bottle.

PRISM: The person is holding the Vatika shampoo bottle and examining the back label for ingredients, usage instructions, or expiration date.

How many products did the person evaluate? Why?

Cosmos-Reason2-2B: The person evaluated two products, the blue box and the green box.

PRISM: The person is evaluating two products, a green box and a blue box, to compare them for a decision.

Which hand did the person use to pick up the green product?

Cosmos-Reason2-2B: The person used their right hand to pick up the green product.

PRISM: The person's left hand was holding the item, and their right hand was reaching and grasping it.

How many products did the person check? Respond with a number only.

Cosmos-Reason2-2B: 1

PRISM: 2

Would you expect to find beef in this aisle? Answer short.

Cosmos-Reason2-2B: Yes

PRISM: No, the visible items are tea boxes, indicating this is a tea aisle, not a grocery aisle for beef.

What material is the grabbed item made of?

Cosmos-Reason2-2B: The grabbed item is made of plastic.

PRISM: The grabbed item is a blue packet of Bru coffee. The packet is made of plastic.

How many products did the person put into the basket?

Cosmos-Reason2-2B: Two products.

PRISM: The person put one product, a small red container, into the basket.

Count the products in the basket at the beginning and at the end.

Cosmos-Reason2-2B: There are 3 products at the beginning and 4 at the end.

PRISM: The basket contains 2 items at the start, and 3 items at the end.

What actions are the hands performing?

Cosmos-Reason2-2B: The hands are holding a blue basket and placing a jar into it.

PRISM: The left hand is holding the basket, and the right hand is grasping an item.

Count the products in the basket by the end. What was the last item?

Cosmos-Reason2-2B: There is one product in the basket by the end, which is a blue bottle.

PRISM: The shopping basket contains 2 items. The last item added was the Hershey's Cocoa container.

Can we open the refrigerator from where we are standing?

Cosmos-Reason2-2B: Yes, the refrigerator is on the left side of the aisle, and it is possible to open it from where you are standing.

PRISM: No, the refrigerator is too far away to be opened from this position.

How can we get to the refrigerator from where we are standing?

Cosmos-Reason2-2B: We can walk straight ahead to the refrigerator, which is located on the left side of the aisle.

PRISM: Walk forward along the aisle, moving past the bulk item display, to reach the refrigerator on the left side.

What is the person in the black-and-white shirt doing?

Cosmos-Reason2-2B: The person is pushing a shopping cart and walking towards the right side of the frame.

PRISM: The person in the black-and-white shirt is pushing a shopping cart down the aisle, looking at items on the shelves.

Methodology

The Three Knowledge Pillars

Spatial Knowledge

Depth estimation, 3D layout understanding, foreground-background separation, and region-level spatial relationships — from both close-range egocentric and wide-angle 360° panoramic views.

Temporal & Physical Knowledge

Causality reasoning, arrow-of-time prediction, object permanence, and physics-grounded chain-of-thought over gravity, momentum, and biomechanics — at zero annotation cost.

Embodied Action Knowledge

Next-subtask prediction, task completion verification, goal-conditioned reasoning, cross-view activity matching, hand-interaction recognition, and multi-actor social navigation.
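
One reason arrow-of-time prediction can come "at zero annotation cost" is that the label is a by-product of how a clip is played back. The sketch below shows the general self-supervised recipe, not PRISM's actual pipeline: the function name, the label strings, and the 50/50 reversal rate are all assumptions, and integers stand in for video frames.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")

def make_arrow_of_time_sample(
    frames: Sequence[Frame], rng: random.Random
) -> Tuple[List[Frame], str]:
    """Return the clip in original or reversed order plus the matching label.

    The label is decided by the coin flip, so no human annotation is needed.
    """
    if rng.random() < 0.5:  # assumed 50/50 split between directions
        return list(reversed(frames)), "backward"
    return list(frames), "forward"

rng = random.Random(0)   # fixed seed so repeated runs give the same samples
clip = [0, 1, 2, 3, 4]   # placeholder frame indices, not real video data
for _ in range(3):
    ordered, label = make_arrow_of_time_sample(clip, rng)
    print(label, ordered)
```

A model trained on such pairs has to pick up cues like gravity and object motion to tell the two playback directions apart.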

Analysis

Four Things PRISM Proves

Finding 01

Domain-specific SFT beats general pretraining

Fine-tuning on PRISM reduces average error by 66.6% and cuts the Embodied Reasoning error rate fivefold (45.5% → 9.1%), far beyond what scaling general-corpus training can achieve.

Finding 02

Multi-view training is mutually reinforcing

Adding exocentric supervision improves cross-view understanding without degrading egocentric performance. Ego-exo data mixing is a curriculum advantage, not a trade-off.

Finding 03

Supervision format matters as much as scale

LLM-generated chain-of-thought annotations deliver substantially larger gains than template-based alternatives for spatial and causal reasoning tasks.

Finding 04

60% of the data captures 95% of the gain

The data-scaling curve shows that 162K samples (60% of PRISM) already achieve 87.7% accuracy, within 1.2 pp of the full-data ceiling of 88.9%.
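
The headline of Finding 04 follows from simple arithmetic. The sketch below assumes the gain is measured from the 62.8% baseline accuracy quoted under Key Stats. The page does not state which baseline the "95%" figure uses, so treat that choice as an assumption.

```python
FULL_SAMPLES = 270_000     # full PRISM corpus
SUBSET_SAMPLES = 162_000   # subset reported in Finding 04
BASELINE_ACC = 62.8        # ASSUMED baseline, taken from Key Stats (percent)
SUBSET_ACC = 87.7          # accuracy with the 162K subset (percent)
FULL_ACC = 88.9            # full-data ceiling (percent)

data_fraction = SUBSET_SAMPLES / FULL_SAMPLES
gap_pp = FULL_ACC - SUBSET_ACC
gain_fraction = (SUBSET_ACC - BASELINE_ACC) / (FULL_ACC - BASELINE_ACC)

print(f"data used: {data_fraction:.0%}")          # data used: 60%
print(f"gap to ceiling: {gap_pp:.1f} pp")         # gap to ceiling: 1.2 pp
print(f"gain captured: {gain_fraction:.1%}")      # gain captured: 95.4%
```

Under that baseline assumption, the 162K subset recovers roughly 95% of the improvement that the full corpus delivers.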

Documentation

What's Inside

Embodied Reasoning (9 Probes)

Next subtask prediction, task completion verification, goal-conditioned action reasoning, exo-to-ego activity matching, hand interaction recognition, atomic action recognition, atomic action reasoning CoT, multi-actor scene understanding, social navigation reasoning CoT.

Common Sense & Spatial (6 Probes)

Scene VQA, environment VQA (exo), spatial reasoning CoT, affordance reasoning, causality reasoning, exo spatial reasoning.

Spatial Perception (2 Probes)

Relative depth reasoning, 360° spatial layout CoT.

Intuitive Physics (3 Probes)

Arrow-of-time (ego + exo), physics CoT, object permanence.
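
The four probe families above can be held in a small lookup table, which also makes the total probe count easy to verify. This is an illustrative sketch: the dictionary name and key spellings are assumptions, and "arrow-of-time (ego + exo)" is counted as a single probe so that the family total matches the "3 Probes" stated above.

```python
# Probe names copied from the family descriptions; the structure is hypothetical.
PROBE_FAMILIES = {
    "Embodied Reasoning": [
        "next subtask prediction",
        "task completion verification",
        "goal-conditioned action reasoning",
        "exo-to-ego activity matching",
        "hand interaction recognition",
        "atomic action recognition",
        "atomic action reasoning CoT",
        "multi-actor scene understanding",
        "social navigation reasoning CoT",
    ],
    "Common Sense & Spatial": [
        "scene VQA",
        "environment VQA (exo)",
        "spatial reasoning CoT",
        "affordance reasoning",
        "causality reasoning",
        "exo spatial reasoning",
    ],
    "Spatial Perception": [
        "relative depth reasoning",
        "360° spatial layout CoT",
    ],
    "Intuitive Physics": [
        "arrow-of-time (ego + exo)",
        "physics CoT",
        "object permanence",
    ],
}

for family, probes in PROBE_FAMILIES.items():
    print(f"{family}: {len(probes)} probes")
print("total:", sum(len(p) for p in PROBE_FAMILIES.values()))  # total: 20
```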

MCQ Overlay & Multi-format

Multiple-choice and open-ended versions, including chain-of-thought (CoT) variants, compatible with any vision-language model (VLM) or vision-language-action model (VLA).
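
To make the multiple-choice overlay idea concrete, the sketch below wraps an open-ended question in lettered options. It is not the PRISM schema: the function name, the prompt layout, and the distractor are hypothetical, and the question is borrowed from the demo examples purely for illustration.

```python
import random
import string
from typing import List, Tuple

def to_mcq(question: str, answer: str, distractors: List[str],
           rng: random.Random) -> Tuple[str, str]:
    """Build a lettered multiple-choice prompt and return (prompt, correct letter)."""
    options = [answer] + list(distractors)
    rng.shuffle(options)  # shuffle so the correct option is not always "A"
    letters = string.ascii_uppercase[:len(options)]
    lines = [question] + [f"{letter}. {option}"
                          for letter, option in zip(letters, options)]
    correct = letters[options.index(answer)]
    return "\n".join(lines), correct

prompt, correct = to_mcq(
    "Would you expect to find beef in this aisle?",
    answer="No",
    distractors=["Yes"],          # hypothetical distractor
    rng=random.Random(0),         # fixed seed for a reproducible option order
)
print(prompt)
print("correct:", correct)
```

Scoring a multiple-choice response then reduces to comparing one letter, while the open-ended version of the same question needs a text-matching or judge-based metric.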