Beyond 3D VQAs: Injecting 3D Spatial Priors into
Vision-Language Models for Enhanced Geometric Reasoning

¹FAIR at Meta  ²UC Berkeley  ³HKU
CVPR 2026 Main Track
GASP teaser
TL;DR
  • Why it matters: VLMs' internal visual representations have near-zero geometric consistency. They can't reliably tell if the same object appears in two different views. Existing fixes (VQA fine-tuning, 3D encoders) either overfit or add rigid modules.
  • Our approach: Instead of memorizing QA pairs, GASP teaches the LLM fundamental geometry through point correspondence + depth consistency supervision at every transformer layer. The training head is discarded at inference with zero overhead.
  • The payoff: Internal correspondence jumps from <5% to 70%+. Downstream: +18.2% on spatial reasoning, +29.0% on object counting, all without any 3D VQA data.

Abstract

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks, including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data.

Key Findings


VLMs Are Geometrically Blind

VLMs cannot track objects across views: internal Q-K correspondence accuracy sits below 5%. GASP boosts peak layer-wise correspondence to 70%+.

Geometry > Memorization

Fine-tuning on the same data in VQA format actually drops performance. GASP's geometric losses on that data yield +18.2% camera pose, +29.0% counting, +15.0% multi-view.

Train Geometry, Infer Standard

Head discarded after training. Geometric priors baked into attention weights. Zero overhead at inference.

Method

GASP method overview

GASP Framework. We augment the VLM by attaching a lightweight correspondence head to every LLM transformer layer. The head is initialized via SVD of the pre-trained query projection weights. During training, it receives a dual geometric supervision signal:

  • Contrastive Correspondence Loss: An InfoNCE loss on ground-truth point correspondences from large-scale video scenes (DL3DV) enforces 2D view-invariance across frames.
  • Depth Consistency Loss: A soft-argmax depth prediction using the correspondence distribution acts as a discriminative geometric regularizer, forcing the model to distinguish visually similar objects at different depths.
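The two objectives above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the tensor shapes, temperature value, and function names are assumptions, and the real losses operate on per-layer head features inside the LLM.

```python
import numpy as np

def info_nce_loss(q_feats, k_feats, tau=0.07):
    """Contrastive correspondence loss: row i of q_feats (query frame)
    should match row i of k_feats (target frame); other rows are negatives."""
    q = q_feats / np.linalg.norm(q_feats, axis=1, keepdims=True)
    k = k_feats / np.linalg.norm(k_feats, axis=1, keepdims=True)
    logits = q @ k.T / tau                                  # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))              # NLL of the true matches

def depth_consistency_loss(corr_logits, target_depths, gt_depths):
    """Soft-argmax depth: the expected depth under the correspondence
    distribution should match the query point's ground-truth depth."""
    p = np.exp(corr_logits - corr_logits.max(axis=1, keepdims=True))
    p = p / p.sum(axis=1, keepdims=True)                    # (N, M) softmax over target patches
    pred_depth = p @ target_depths                          # (N,) expected depth per point
    return float(np.mean((pred_depth - gt_depths) ** 2))
```

The depth term is what disambiguates visually similar patches: two look-alike objects at different depths produce different expected depths, so a confident but geometrically wrong match is penalized even if it is visually plausible.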

At inference, the correspondence head is discarded. The geometric priors are permanently embedded in the LLM's learned attention weights, enabling robust spatial reasoning without auxiliary inputs or additional parameters.
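One way to read the SVD initialization and the discard-at-inference design, sketched in NumPy. The low-rank truncation, `rank` value, and class structure are assumptions for illustration; the source only states that the head is initialized from an SVD of the query projection and dropped after training.

```python
import numpy as np

def init_head_from_query_proj(w_q, rank=64):
    """Hypothetical SVD init: keep the top-`rank` singular directions of
    the pre-trained query projection as the head's starting weights."""
    u, s, vt = np.linalg.svd(w_q, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

class LayerWithCorrespondenceHead:
    """One transformer layer's hidden states plus a training-only head.
    discard_head() mirrors GASP's inference mode: the head is dropped and
    only the (now geometry-aware) layer weights remain."""
    def __init__(self, w_q, rank=64):
        self.head = init_head_from_query_proj(w_q, rank)

    def head_features(self, hidden):
        # hidden: (num_tokens, d_model) -> features fed to the geometric losses
        return hidden @ self.head.T

    def discard_head(self):
        self.head = None  # zero extra parameters or compute at inference
```

The design choice this illustrates: because supervision flows through the head into every layer's weights during training, deleting the head afterwards loses nothing at inference time.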

Training Data Examples

Training correspondence examples. For each scene, the top row shows RGB frames with point tracks (red dots) projected from the query frame; the bottom row shows corresponding depth maps used for depth consistency supervision.
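The point tracks shown here come from standard depth-based reprojection: lift a query-frame pixel to 3D using its depth, then project it into another frame using the camera poses. A minimal pinhole-camera sketch; shared intrinsics `K` and camera-to-world pose matrices are assumptions about the data format, not a statement of the DL3DV pipeline.

```python
import numpy as np

def reproject(points_px, depths, K, T_w_from_a, T_w_from_b):
    """Project pixels from view A into view B via their 3D locations.
    points_px: (N, 2) pixel coords in view A; depths: (N,) metric depths;
    K: (3, 3) shared intrinsics; T_w_from_*: (4, 4) camera-to-world poses."""
    ones = np.ones((points_px.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([points_px, ones]).T   # (3, N) unit-depth rays
    cam_a = rays * depths                                      # 3D points in view A's frame
    world = T_w_from_a @ np.vstack([cam_a, ones.T])            # homogeneous world coords
    cam_b = np.linalg.inv(T_w_from_b) @ world                  # into view B's frame
    proj = K @ cam_b[:3]
    return (proj[:2] / proj[2]).T                              # (N, 2) pixels in view B
```

Pixel pairs produced this way give the positives for the contrastive loss, and the depths give the targets for the depth consistency term.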

Correspondence Analysis

Correspondence analysis

Visual correspondence diagnostic on LLaVA-NeXT-Video-7B (top) and Qwen2.5-VL-7B (bottom). We analyze three complementary metrics across all transformer layers:

  • (a, d) Layer-wise PCK: Baseline VLMs exhibit near-zero correspondence accuracy across all layers. GASP dramatically improves this, with peak accuracy exceeding 70%.
  • (b, e) Confidence-Accuracy Correlation: Baselines show negative correlation (high confidence predicts incorrect matches), revealing positional bias. GASP achieves strong positive correlation, indicating calibrated geometric understanding.
  • (c, f) Temporal Robustness: GASP maintains over 85% of its matching performance at long temporal distances, while baseline accuracy collapses to under 5%.
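The layer-wise PCK diagnostic in (a, d) can be sketched as follows. The matching rule (argmax of Q·K similarity per layer, scored against ground-truth correspondences at a pixel threshold) follows the text; the function names and patch-center bookkeeping are illustrative.

```python
import numpy as np

def pck(pred_pts, gt_pts, threshold):
    """Percentage of Correct Keypoints: fraction of predicted matches
    within `threshold` pixels of the ground-truth correspondence."""
    dists = np.linalg.norm(pred_pts - gt_pts, axis=1)
    return float(np.mean(dists <= threshold))

def layerwise_pck(q_by_layer, k_by_layer, patch_centers, gt_pts, threshold):
    """For each layer, match query tokens to target patches by Q.K
    similarity, then score the implied pixel locations with PCK."""
    scores = []
    for q, k in zip(q_by_layer, k_by_layer):
        match = (q @ k.T).argmax(axis=1)          # best target patch per query token
        scores.append(pck(patch_centers[match], gt_pts, threshold))
    return scores
```

Run over all transformer layers, this yields the per-layer curves in the figure; the baseline curves hover near zero at every depth, while GASP's peak layers exceed 70%.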

Correspondence Visualization

Visual correspondence visualization

Visual correspondence visualization. GASP's learned correspondence on both VLM backbones (LLaVA-NeXT-Video and Qwen2.5-VL). (a) Patch-wise correspondence, with optical-flow color coding encoding movement. (b) Attention heatmap showing the learned correspondence for a query point.

Quantitative Results

Comparison with state-of-the-art VLMs on spatial reasoning benchmarks. We evaluate on All-Angles Bench, VSI-Bench, and BLINK. GASP consistently improves spatial reasoning across both VLM backbones.

*Column groups: Cam. Pose / Manip. / Rel. Dir. belong to All-Angles Bench; Obj. Count / Route / Rel. Dir. / App. Order to VSI-Bench; Spa. Rela. / Rel. Depth / Multi-View to BLINK.*

| Methods | Cam. Pose | Manip. | Rel. Dir. | Obj. Count | Route | Rel. Dir. | App. Order | Spa. Rela. | Rel. Depth | Multi-View |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 27.3 | 41.4 | 40.9 | 46.2 | 31.5 | 41.3 | 28.5 | 76.9 | 64.5 | 60.2 |
| Gemini-1.5-Pro | 25.0 | 40.3 | 29.8 | 56.2 | 36.0 | 46.3 | 34.6 | 67.1 | 50.0 | 41.3 |
| InternVL2.5-8B | 31.8 | 43.7 | 34.1 | 16.9 | 28.8 | 41.1 | 34.7 | 89.5 | 77.4 | 44.4 |
| Qwen2.5-VL-72B | 34.1 | 45.0 | 48.3 | 14.3 | 28.4 | 27.6 | 31.4 | 88.8 | 81.5 | 53.4 |
| LLaVA-Onevision-72B | 20.5 | 47.7 | 33.8 | 43.5 | 32.5 | 39.9 | 44.6 | 78.3 | 78.2 | 53.4 |
| VG-LLM | 16.5 | 30.0 | 26.9 | 67.9 | 32.4 | 40.7 | 59.2 | 84.3 | 77.2 | 50.8 |
| AoTD | 32.4 | 37.6 | 26.7 | 23.5 | 28.8 | 41.4 | 23.3 | 61.5 | 49.2 | 45.1 |
| VLM-3R | 22.7 | 35.9 | 30.9 | 70.2 | 45.4 | 80.5 | 40.1 | 48.3 | 47.6 | 50.1 |
| *LLaVA-NeXT-Video-7B:* | | | | | | | | | | |
| Baseline (SFT) | 22.7 | 39.9 | 24.7 | 23.5 | 24.7 | 32.4 | 11.5 | 53.1 | 44.4 | 42.1 |
| + DL3DV VQA | 19.8 | 38.1 | 28.2 | 21.4 | 25.1 | 31.8 | 9.2 | 54.5 | 44.0 | 42.5 |
| + GASP (Full) | 40.9 | 43.5 | 29.8 | 52.5 | 32.5 | 41.2 | 22.0 | 47.6 | 48.4 | 57.1 |
| Δ Improvement | ↑18.2 | ↑3.6 | ↑5.1 | ↑29.0 | ↑7.8 | ↑8.8 | ↑10.5 | ↓5.5 | ↑4.0 | ↑15.0 |
| *Qwen2.5-VL-7B:* | | | | | | | | | | |
| Baseline (SFT) | 34.1 | 41.3 | 36.9 | 33.8 | 26.8 | 34.3 | 26.5 | 80.2 | 78.9 | 41.5 |
| + DL3DV VQA | 31.5 | 41.5 | 36.2 | 33.2 | 27.1 | 34.3 | 25.3 | 81.0 | 78.1 | 42.0 |
| + GASP (Full) | 52.8 | 40.1 | 37.2 | 41.6 | 30.4 | 40.6 | 35.0 | 88.8 | 80.7 | 53.4 |
| Δ Improvement | ↑18.7 | ↓1.2 | ↑0.3 | ↑7.8 | ↑3.6 | ↑6.3 | ↑8.5 | ↑8.6 | ↑1.8 | ↑11.9 |

Generalization Analysis

Generalization heatmap

The overfitting problem of 3D-VQA fine-tuning. We show the performance change of specialized spatial VLMs relative to their base models across five benchmarks. While VQA fine-tuning yields large gains on specific datasets (e.g., VSI-Bench), it consistently degrades performance on out-of-distribution benchmarks like MMSI-Bench and SpaceVista.

This stark pattern confirms that standard VQA-based training memorizes dataset-specific biases rather than acquiring genuine spatial understanding. GASP avoids this by learning fundamental geometric priors that generalize across domains.

BibTeX

@inproceedings{yeh2026gasp,
  title={Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning},
  author={Yeh, Chun-Hsiao and Qian, Shengyi and Wang, Manchen and Ma, Yi and Tighe, Joseph and Xiao, Fanyi},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}