Pi0.5 Attention Analysis
A guided tour through the attention patterns of a Vision-Language-Action model, revealing how it decides where to look and what to do.
[Figure: Raw Camera View (left) and Attention Heatmap (right)]
On the left is a plain camera image from the robot's workspace. On the right is the same image, overlaid to show where Pi0.5 is attending when it decides how to move. The bright spots mark the model's attention, and what it focuses on tells us a great deal about how it makes decisions.
A LIBERO benchmark task: pick up the alphabet soup and place it in the basket. Simple for humans, but for a robot it requires perceiving objects, planning grasps, and executing a sequence of 7-DoF arm movements.
Scrub through these key moments to see the full trajectory: approach, grasp, transport, and placement.
Pi0.5 uses a two-stage architecture. A Vision-Language Model creates a shared understanding of images, text, and state. Then an expert action head queries that understanding through cross-attention to produce motor commands.
The key insight: the action head doesn't see the images directly. It only accesses them through the attention mechanism. This means the attention patterns are literally the model's window into the visual world.
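To make the "queries through cross-attention" idea concrete, here is a minimal single-head sketch in NumPy. It is a toy under stated assumptions, not Pi0.5's implementation: the real model uses multiple heads and learned projections, and the sizes below (812 prefix tokens, 10 action tokens, width 64) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(action_queries, prefix_keys, prefix_values):
    """Single-head scaled dot-product cross-attention.

    action_queries: (num_action_tokens, d)
    prefix_keys, prefix_values: (num_prefix_tokens, d)
    Returns (output, weights); weights has shape
    (num_action_tokens, num_prefix_tokens) and each row sums to 1.
    """
    d = action_queries.shape[-1]
    scores = action_queries @ prefix_keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ prefix_values, weights

# Hypothetical sizes, random stand-ins for real activations.
rng = np.random.default_rng(0)
num_prefix, num_action, d = 812, 10, 64
q = rng.normal(size=(num_action, d))
k = rng.normal(size=(num_prefix, d))
v = rng.normal(size=(num_prefix, d))
out, attn = cross_attention(q, k, v)
```

Each row of `attn` is a probability distribution over the image, text, and state tokens. Those rows are the quantity the visualizations in this tour display.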
Everything gets serialized into a single token sequence. Understanding this layout is key to reading the attention visualizations that follow.
| Segment | Token range (inclusive) | Contents |
| --- | --- | --- |
| Image 1 | Tokens 0–255 | 256 patches (16×16 grid) |
| Image 2 | Tokens 256–511 | 256 patches (16×16 grid) |
| Image 3 | Tokens 512–767 | 256 patches (16×16 grid) |
| Text | Tokens 768–~797 | ~30 instruction tokens |
| State | Tokens ~798–~811 | ~14 proprioceptive tokens |
The attention patterns across these segments reveal exactly what information the model uses for each action decision. When we see bright patches in the image region, that's where the model is looking. When text tokens light up, those are the words driving the action.
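The layout in the table can be turned into a small lookup that splits one attention row into per-segment totals. This is a sketch: the text and state lengths are approximate in the table, so the values 30 and 14 used here are assumptions, and the names `SEGMENTS`, `segment_of`, and `attention_mass_by_segment` are hypothetical.

```python
import numpy as np

# Segment boundaries as (start, end) with end exclusive.
# Text and state lengths (30 and 14) are assumed from the approximate table.
SEGMENTS = {
    "image_1": (0, 256),
    "image_2": (256, 512),
    "image_3": (512, 768),
    "text": (768, 798),
    "state": (798, 812),
}

def segment_of(token_index):
    """Return the name of the segment that contains token_index."""
    for name, (start, end) in SEGMENTS.items():
        if start <= token_index < end:
            return name
    raise IndexError(f"token index {token_index} is outside the sequence")

def attention_mass_by_segment(weights):
    """Sum a 1-D attention row over each segment of the token sequence."""
    return {name: float(weights[s:e].sum()) for name, (s, e) in SEGMENTS.items()}

# A uniform row as a baseline: each segment's mass equals its share of tokens.
uniform = np.full(812, 1.0 / 812)
mass = attention_mass_by_segment(uniform)
```

Comparing a real attention row against this uniform baseline shows whether a segment receives more or less attention than its token count alone would predict.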
These heatmaps show where the expert action head focuses when computing motor commands. Brighter = more attention. Use the controls to explore different steps and heads.
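One plausible way to build such a heatmap from raw attention weights is sketched below. It assumes the weights arrive as an array of shape (heads, action queries, tokens), that the 256 patches of each view are stored in row-major order, and that a 14× nearest-neighbor upsample (giving 224×224) is good enough for display. None of these details are confirmed by the source, and `view_heatmap` is a hypothetical helper.

```python
import numpy as np

GRID = 16               # patches per side
PATCHES = GRID * GRID   # 256 patches per camera view

def view_heatmap(attn, view, head=None, upsample=14):
    """Turn attention weights into a displayable heatmap for one camera view.

    attn: (num_heads, num_queries, num_tokens) attention weights.
    view: 0, 1, or 2, selecting which block of 256 image tokens to use.
    head: a head index, or None to average over all heads.
    """
    start = view * PATCHES
    patch_attn = attn[:, :, start:start + PATCHES]   # (heads, queries, 256)
    if head is None:
        per_patch = patch_attn.mean(axis=(0, 1))     # average heads and queries
    else:
        per_patch = patch_attn[head].mean(axis=0)    # average queries only
    grid = per_patch.reshape(GRID, GRID)             # assumes row-major patches
    lo, hi = grid.min(), grid.max()
    grid = (grid - lo) / (hi - lo) if hi > lo else np.zeros_like(grid)
    # Nearest-neighbor upsample: each patch becomes an upsample x upsample block.
    return np.kron(grid, np.ones((upsample, upsample)))

# Random stand-in for real weights: 8 heads, 10 action queries, 812 tokens.
rng = np.random.default_rng(0)
attn = rng.random((8, 10, 812))
attn /= attn.sum(axis=-1, keepdims=True)
heatmap_avg = view_heatmap(attn, view=0)
heatmap_head0 = view_heatmap(attn, view=0, head=0)
```

Note that the min-max normalization is applied per heatmap, so brightness is comparable within one image but not across steps or heads.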
Look at the heatmap: attention concentrates on objects, not background. The bright spots reveal where the model looks when deciding how to move the arm. (Controls: Step 0, t=9, Average heads, View 1.)
Now look at Head 0 specifically and compare it with the average view. Individual attention heads specialize in different aspects of the scene, and this head may focus on the target object.
Switch to Head 1 and notice how it attends to different spatial regions. Each head learns a complementary visual strategy, so the heads divide the labor.
Jump to Step 40, back on average heads. The task is now in the grasp phase. Watch how attention shifts to track the soup can as the arm approaches.
At Step 100 the arm is carrying the soup. Attention now highlights the basket destination rather than the soup, which suggests the model is planning ahead.
Finally, switch to Camera View 2 (the wrist camera). The model gets a completely different perspective and attends to different features from this close-up viewpoint.
Notice how the attention isn't spread uniformly. It concentrates on specific objects and regions that are relevant to the current phase of the task. Try switching between different attention heads — each head learns to focus on different aspects.
Pi0.5 generates actions through 10 flow matching steps, starting from pure noise (t=0) and refining to clean actions (t=9). Watch how attention sharpens as actions become clearer.
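The refinement loop can be pictured as Euler integration of a velocity field over 10 steps. The sketch below is a toy: `toy_velocity` stands in for the learned network, the action-chunk shape (10 timesteps × 7 dimensions) is hypothetical, and the convention that flow time runs from 0 (noise) to 1 (clean actions) is an assumption.

```python
import numpy as np

def sample_actions(velocity_fn, noise, num_steps=10):
    """Integrate a velocity field from noise toward actions with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    trajectory = [x.copy()]
    for i in range(num_steps):       # i = 0..9 matches t=0..t=9 in the viewer
        tau = i * dt                 # flow time in [0, 1)
        x = x + dt * velocity_fn(x, tau)
        trajectory.append(x.copy())
    return x, trajectory

# Toy stand-in for the learned model: the exact velocity of a straight-line
# path from the current sample to a known target. A real policy predicts this
# velocity from images, text, and state instead of knowing the target.
rng = np.random.default_rng(0)
target = rng.normal(size=(10, 7))    # hypothetical action chunk

def toy_velocity(x, tau):
    return (target - x) / (1.0 - tau)

noise = rng.normal(size=target.shape)
actions, trajectory = sample_actions(toy_velocity, noise)
```

In the real model, every call to the velocity network runs attention over the image, text, and state tokens, which is why there is one attention map per refinement step to inspect.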
As the actions are refined from pure noise toward clean predictions, attention sharpens onto task-relevant objects. At t=0 the attention is diffuse and scattered; by t=9 it is tightly concentrated. Where the model looks becomes more selective as its action estimate becomes more certain.
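"Sharpening" can be quantified with the Shannon entropy of the attention distribution: a diffuse map has high entropy, a focused one has low entropy. This is one reasonable metric, not necessarily the one used to produce the visualizations, and the two example distributions below are synthetic.

```python
import numpy as np

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy (in nats) of an attention distribution."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # renormalize in case of a sliced row
    return float(-(w * np.log(w + eps)).sum())

# Synthetic examples over the 256 patches of one camera view.
diffuse = np.full(256, 1.0 / 256)        # perfectly uniform attention
peaked = np.full(256, 0.1 / 255)         # 10% spread thinly over 255 patches
peaked[100] = 0.9                        # 90% on a single patch

h_diffuse = attention_entropy(diffuse)
h_peaked = attention_entropy(peaked)
```

Plotting this entropy against the refinement step for real attention rows would show whether, and how quickly, attention concentrates.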
The model doesn't just look at images. It attends to the text instruction and the robot's joint state. These charts show which tokens receive the most attention.
Notice which words get the most attention — they tend to be the object nouns and action verbs most relevant to the current step. Toggle to state tokens to see how proprioceptive information is weighted.
The robot arm dominates every camera frame. You'd expect the model to focus heavily on it. But the attention tells a different story.
The model largely ignores the robot arm in visual attention. A likely explanation is that the arm's position is already encoded in the state tokens (proprioception), so visual attention goes to what proprioception alone cannot supply: the target objects and the destination.
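A claim like this can be checked by comparing the share of attention that falls on a region with the share of the frame that region occupies. The sketch below assumes patch-level boolean masks for the arm and the object are available (for example from a segmentation step, which is not part of the source). The masks and attention values here are synthetic, and both helper names are hypothetical.

```python
import numpy as np

GRID = 16

def attention_share(patch_attn, mask):
    """Fraction of one view's attention that falls inside a patch mask."""
    grid = np.asarray(patch_attn, dtype=float).reshape(GRID, GRID)
    return float(grid[mask].sum() / grid.sum())

def relative_density(patch_attn, mask):
    """Attention share divided by area share; 1.0 means 'as if uniform'."""
    return attention_share(patch_attn, mask) / float(mask.mean())

# Synthetic masks: the "arm" covers half the frame, the "object" a 2x2 region.
arm_mask = np.zeros((GRID, GRID), dtype=bool)
arm_mask[:, :8] = True
obj_mask = np.zeros((GRID, GRID), dtype=bool)
obj_mask[10:12, 10:12] = True

# Synthetic attention: a flat background with a strong peak on the object.
attn_grid = np.ones((GRID, GRID))
attn_grid[obj_mask] = 50.0
patch_attn = attn_grid.ravel() / attn_grid.sum()

arm_density = relative_density(patch_attn, arm_mask)
obj_density = relative_density(patch_attn, obj_mask)
```

A relative density below 1 for the arm and well above 1 for the object would support the observation; values near 1 would indicate that attention simply tracks area.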
Inside the PaliGemma language model, text tokens and image patches attend to each other. This cross-modal attention grounds language in visual perception. But which tokens matter most?
Highlighted tokens are task-relevant nouns — these receive disproportionate attention from image patches.
Start with the "Text → Image" view — this shows where the language model's text representations look in the visual field. It highlights general task relevance. Then switch to "Image → Text" — this is where it gets really interesting. You'll see a bar chart showing how much attention each text token receives from the image patches. Notice how "soup" and "basket" clearly stand out above function words like "the" and "and." The model has learned to connect visual objects with their names.
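A bar chart of this kind can be computed by averaging, over image-patch queries, the attention each text token receives. The sketch below makes several assumptions: the attention matrix has image patches as rows and all tokens as columns, scores are max-normalized so the most-attended token scores 1.0, and the instruction is split on whitespace rather than with the model's real tokenizer. The weights are random placeholders, so the resulting ranking is meaningless.

```python
import numpy as np

def text_attention_received(attn, text_slice):
    """Mean attention each text token receives from image patches.

    attn: (num_image_patches, num_tokens), rows are image-patch queries.
    text_slice: slice selecting the text-token columns.
    Returns scores scaled so the most-attended token equals 1.0.
    """
    received = attn[:, text_slice].mean(axis=0)
    return received / received.max()

# Whitespace split as a stand-in for the real tokenizer (an assumption).
tokens = "pick up the alphabet soup and place it in the basket".split()
text_slice = slice(768, 768 + len(tokens))

# Random placeholder weights for the 256 patches of one view.
rng = np.random.default_rng(0)
attn = rng.random((256, 812))
attn /= attn.sum(axis=-1, keepdims=True)

scores = text_attention_received(attn, text_slice)
ranked = sorted(zip(tokens, scores), key=lambda pair: -pair[1])
```

With real weights, `ranked` would list the tokens in the order the bars appear from tallest to shortest.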
"basket" (~1.0 attention weight) and "soup" (~0.61) tower above function words like "the" (~0.13). The model has learned to connect visual objects with their linguistic labels — this is visual grounding in action.
Go beyond the pre-rendered visualizations. This explorer connects to a live analysis server, letting you investigate any combination of step, layer, head, and timestep in real time.
pip install -r requirements.txt
python src/analyzer/main.py