Seeing from Another Perspective:
Evaluating Multi-View Understanding in MLLMs

1 UC Berkeley · 2 HKU · 3 NYU · 4 University of Oxford · 5 UC Davis · 6 TranscEngram · 7 SLAI
*Equal Contribution
AAAI 2026
[All-Angles Bench teaser figure]
TL;DR
  • A Benchmark for Multi-View Understanding: We introduce All-Angles Bench, a large-scale benchmark with over 2,100 human-annotated multi-view QA pairs across 90 real-world scenes.
  • Performance Evaluation: We benchmark 27 leading MLLMs, including Gemini-2.5-Flash, Claude-4-Sonnet, and GPT-4o. Our results reveal a substantial gap between MLLMs and human performance.
  • Decoding MLLM Shortcomings: We identify two major failure modes: (1) weak cross-view correspondence under occlusions and (2) poor estimation of coarse camera poses.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various visual understanding tasks. However, their ability to reason about 3D scenes from multiple viewpoints remains largely unexplored. We introduce All-Angles Bench, a comprehensive benchmark designed to evaluate multi-view understanding in MLLMs. Our benchmark contains 2,132 question-answer pairs carefully annotated across 90 diverse real-world scenes sourced from EGO4D-EXO and EgoHumans, spanning six fundamental tasks: counting, attribute identification, relative distance, relative direction, manipulation, and camera pose estimation. We evaluate 27 leading MLLMs and reveal a substantial performance gap between current models and human-level multi-view understanding. Our analysis identifies two critical failure modes: (1) weak cross-view correspondence, where models fail to reconcile fragmented information across views, and (2) poor coarse camera pose estimation, where even simple viewpoint alignment poses significant challenges. These findings highlight key directions for advancing multi-view reasoning in MLLMs.

Key Findings


Huge Gap from Human Performance

In camera pose estimation, human annotators reach 88.9% accuracy, while top MLLMs trail by 50 percentage points or more. Many open-source models perform worse than random guessing.

Open-Source Models Can Compete

Ovis2-34B and Qwen2.5-VL-72B outperform closed-source models on orientation-sensitive tasks. Domain-specific refinements like video-focused training enhance geometric reasoning.

3D VLMs: Gains but Limited

Specialized spatial reasoning MLLMs like SpaceR achieve notable gains (e.g., 51.1 on camera pose), but their strengths are often confined to specific subtasks.

All-Angles Bench: A Comprehensive Evaluation Benchmark

We introduce All-Angles Bench, a benchmark designed to evaluate the multi-view reasoning capabilities of MLLMs.

  • Feature 1: Contains 2,132 question-answer pairs carefully annotated across 90 diverse real-world scenes sourced from EGO4D-EXO and EgoHumans.
  • Feature 2: Comprises 6 tasks: counting, attribute identification, relative distance, relative direction, manipulation, and camera pose estimation.
  • Feature 3: Investigates major aspects of 3D scene understanding, ranging from establishing correspondences between objects to associating relative object and camera poses (a hypothetical record layout is sketched below).
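To make the benchmark structure concrete, here is a minimal sketch of what a single entry might look like. The field names (`id`, `scene_id`, `task`, `views`, `choices`, `paired_id`) are illustrative assumptions, not the released schema.

```python
# Hypothetical All-Angles Bench entry; field names are assumptions for
# illustration, not the released data format.
example_record = {
    "id": "q_000123",                        # assumed question identifier
    "scene_id": "egohumans_scene_01",        # one of the 90 real-world scenes
    "task": "counting",                      # one of the six task types
    "views": ["view_1.jpg", "view_2.jpg"],   # the multi-view images
    "question": "How many people appear in the scene across all views?",
    "choices": ["A. 3", "B. 4", "C. 5"],     # assumes a multiple-choice format
    "answer": "B",
    "paired_id": "q_000124",                 # its rephrased counterpart, if any
}
```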

Figure 1. Overview of All-Angles Bench. The benchmark covers six fundamental multi-view understanding tasks across diverse real-world scenes.

How Do We Build All-Angles Bench?

Our benchmark is constructed through a rigorous three-stage pipeline ensuring high quality and consistency.

  • Data Collection & Question Type Design: Involves curating 90 diverse multi-view scenes and designing six tasks to evaluate multi-view reasoning.
  • Question Creation & Human Annotation: Utilizes MLLMs for initial question generation, followed by refinement and validation through human annotation to ensure clarity, correctness, and relevance.
  • Paired-Question Generation & Human Quality Check: Assesses cross-view consistency by systematically rephrasing questions or altering their perspective to generate paired questions while preserving visual correspondences, followed by a final quality-control pass (a minimal sketch of the pairing step follows this list).
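As a toy illustration of the pairing step, the sketch below re-anchors a question from one named view to another. The real pipeline uses MLLM-assisted rephrasing followed by human quality control; the helper here is purely hypothetical.

```python
def make_paired_question(question: str, view_a: str, view_b: str) -> str:
    """Swap the two view mentions to re-anchor the question's perspective.

    The paired question probes the same underlying scene fact, but the
    correct option may change with the viewpoint; a model with genuine
    multi-view understanding should answer both versions correctly.
    """
    placeholder = "<SWAP>"
    swapped = question.replace(view_a, placeholder)
    swapped = swapped.replace(view_b, view_a)
    return swapped.replace(placeholder, view_b)

q = "Taking View-1 as the reference, where is the chef relative to the person highlighted in View-2?"
print(make_paired_question(q, "View-1", "View-2"))
# -> "Taking View-2 as the reference, where is the chef relative to the person highlighted in View-1?"
```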

Figure 2. Three-stage data construction pipeline: data collection, human annotation, and paired-question generation with quality control.

Evaluation on All-Angles Bench

We consolidate results from both closed-source and open-source MLLM evaluations. As the primary results show, a substantial performance gap remains between current MLLMs, closed- and open-source alike, and human-level multi-view understanding.

Methods | Avg. | Attr. | Cam. Pose | Count. | Manip. | Rel. Dir. | Rel. Dist.
Human Level | 82.0 | 93.3 | 88.9 | 86.3 | 72.0 | 79.5 | 95.7
GPT-4o | 52.4 | 66.7 | 16.7 | 52.9 | 40.0 | 53.8 | 63.8
Gemini-2.0-Flash | 58.4 | 62.2 | 38.9 | 64.7 | 48.0 | 56.4 | 68.1
Claude-3.7-Sonnet | 52.8 | 60.0 | 38.9 | 37.3 | 38.0 | 56.4 | 80.9
InternVL2.5-38B | 60.8 | 73.3 | 27.8 | 70.6 | 42.0 | 64.1 | 68.1
Qwen2.5-VL-72B | 58.4 | 73.3 | 22.2 | 52.9 | 44.0 | 61.5 | 76.6

GPT-4o | 47.8 | 66.8 | 35.8 | 43.0 | 42.6 | 38.9 | 51.2
Gemini-1.5-Pro | 47.4 | 59.8 | 33.5 | 39.4 | 45.2 | 38.6 | 55.1
Gemini-1.5-Flash | 46.6 | 62.9 | 43.8 | 35.9 | 43.9 | 33.2 | 52.4
Gemini-2.0-Flash | 52.3 | 68.4 | 33.0 | 64.9 | 41.0 | 41.8 | 58.9
Claude-3.5-Sonnet | 48.2 | 63.2 | 33.0 | 41.8 | 41.2 | 43.5 | 55.3
Claude-3.7-Sonnet | 50.0 | 68.4 | 35.8 | 41.4 | 40.1 | 46.9 | 56.7
DeepSeek-VL2-Small | 45.5 | 65.3 | 27.8 | 39.0 | 42.6 | 32.7 | 51.6
DeepSeek-VL2 | 47.8 | 70.5 | 24.4 | 39.0 | 46.2 | 33.5 | 54.7
InternVL2.5-2B | 41.0 | 59.5 | 15.9 | 42.6 | 34.2 | 30.7 | 48.8
InternVL2.5-4B | 45.8 | 66.6 | 18.2 | 47.8 | 36.6 | 35.8 | 54.7
InternVL2.5-8B | 49.9 | 73.9 | 28.4 | 48.6 | 41.6 | 40.3 | 54.5
InternVL2.5-38B | 55.6 | 80.4 | 31.3 | 56.6 | 45.2 | 49.7 | 58.7
InternVL2.5-78B | 52.5 | 79.4 | 27.3 | 52.6 | 39.7 | 43.5 | 59.3
Qwen2.5-VL-3B | 45.2 | 62.7 | 22.2 | 45.0 | 37.2 | 36.4 | 53.8
Qwen2.5-VL-72B | 55.7 | 77.5 | 29.5 | 55.4 | 43.7 | 54.3 | 60.7
Ovis2-2B | 46.2 | 61.9 | 26.7 | 49.0 | 42.0 | 35.5 | 51.4
Ovis2-4B | 46.6 | 65.5 | 21.6 | 53.4 | 34.0 | 36.1 | 56.9
Ovis2-8B | 49.1 | 70.5 | 17.0 | 49.4 | 43.5 | 41.2 | 54.7
Ovis2-16B | 53.2 | 75.5 | 29.5 | 56.6 | 44.3 | 46.3 | 56.1
Ovis2-34B | 55.3 | 79.4 | 26.7 | 53.8 | 46.2 | 50.6 | 59.7
Cambrian-8B | 39.2 | 59.8 | 19.9 | 33.1 | 33.0 | 33.0 | 43.5
Cambrian-13B | 36.5 | 59.0 | 25.6 | 30.7 | 27.3 | 32.1 | 37.9
Cambrian-34B | 41.9 | 63.7 | 20.5 | 38.2 | 37.2 | 35.2 | 43.7
LLaVA-OV-Qwen2-7B | 45.9 | 64.5 | 22.2 | 39.4 | 44.5 | 35.2 | 52.0
LLaVA-OV-Qwen2-72B | 52.5 | 73.4 | 26.7 | 45.4 | 45.6 | 46.3 | 60.3
LLaVA-Video-Qwen2-7B | 42.8 | 64.8 | 12.5 | 42.2 | 32.6 | 37.2 | 50.8
LLaVA-Video-Qwen2-72B | 53.1 | 73.6 | 27.8 | 46.2 | 45.2 | 46.6 | 61.9

Table 1. Evaluation results on All-Angles Bench across closed-source and open-source MLLMs.
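For reference, per-task accuracy of the kind reported above can be computed in a few lines. This is a minimal sketch assuming the hypothetical record layout shown earlier and single-letter multiple-choice predictions; it is not the official evaluation script, and whether the reported Avg. is a macro average over tasks is an assumption here.

```python
from collections import defaultdict

def score_by_task(records, predictions):
    """Per-task accuracy (%) plus an unweighted macro average.

    `records` follows the hypothetical schema sketched earlier;
    `predictions` maps question IDs to the model's chosen letter.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["task"]] += 1
        pred = predictions.get(rec["id"], "").strip().upper()
        if pred.startswith(rec["answer"]):
            hits[rec["task"]] += 1
    scores = {t: 100.0 * hits[t] / totals[t] for t in totals}
    scores["avg"] = sum(scores.values()) / len(scores)  # macro average over tasks
    return scores
```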

Finding 1: Tasks that are simple for humans, such as coarse camera pose estimation, remain challenging for MLLMs.

While human annotators score highly across All-Angles Bench, both open- and closed-source MLLMs struggle. In camera pose estimation, humans reach 88.9% accuracy, while top MLLMs such as Gemini-2.0-Flash, Qwen2.5-VL-72B, and InternVL2.5-38B trail by 50 percentage points or more. Many open-source models perform worse than random guessing, often failing to align viewpoints or interpret geometric relationships, highlighting a significant gap from human-level reasoning.

Finding 2: Certain open-source MLLMs surpass closed-source ones in orientation-sensitive tasks.

Interestingly, Ovis2-34B and Qwen2.5-VL-72B outperform closed-source models such as Gemini-2.0-Flash and Claude-3.7-Sonnet on object manipulation and relative direction. Qwen2.5-VL-72B benefits from robust video understanding and fine-grained visual grounding, excelling at tracking object re-orientation across views. The success of these open-source models suggests that domain-specific refinements, such as video-focused training, can strengthen orientation and geometric reasoning, offering insights for improving multi-view MLLMs.

3D Spatial Reasoning Models | Avg. | Attr. | Cam. Pose | Count. | Manip. | Rel. Dir. | Rel. Dist.
VG-LLM (Zheng et al. 2025) | 33.7 | 56.7 | 16.5 | 26.3 | 30.0 | 26.9 | 34.2
AoTD (Shi et al. 2025) | 36.8 | 41.5 | 32.4 | 27.9 | 37.6 | 26.7 | 45.5
VLM-3R (Fan et al. 2025) | 40.3 | 56.1 | 22.7 | 39.4 | 35.9 | 30.9 | 45.6
CoF (Ghazanfari et al. 2025) | 47.8 | 75.7 | 35.8 | 41.4 | 38.7 | 38.7 | 49.0
SpaceR (Ouyang et al. 2025) | 49.7 | 72.8 | 51.1 | 46.2 | 41.2 | 38.1 | 49.4

Table 2. Performance of 3D specialized spatial reasoning MLLMs on All-Angles Bench.

Finding 3: 3D specialized spatial reasoning MLLMs close the gap but are often limited to specific subtasks.

Recent MLLMs purpose-built for spatial reasoning, such as SpaceR, VLM-3R, and AoTD, achieve notable gains across spatial subtasks. In particular, SpaceR scores 51.1 on Camera Pose Estimation, surpassing many general-purpose MLLMs. However, their strengths are often confined to the specific subtasks they explicitly target. These results suggest that while injecting spatial priors or specialized architectures helps, it does not yet fully solve the general multi-view challenge.

Paired Q&A Inconsistency in MLLMs

We classify model responses to each question pair into three categories: CC (both correct), WW (both wrong), and IC (inconsistent: one correct, one wrong). A high IC rate indicates weak multi-view understanding, where a simple rewording flips the model's answer.
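The breakdown is straightforward to compute. Below is a minimal sketch, assuming each pair is reduced to a (correct-on-original, correct-on-paired) flag; the outcomes in the example call are made up.

```python
def consistency_breakdown(pairs):
    """Percentages of CC / WW / IC outcomes over paired questions.

    `pairs` is a list of (correct_on_original, correct_on_paired) booleans.
    """
    n = len(pairs)
    cc = sum(a and b for a, b in pairs)              # both correct
    ww = sum((not a) and (not b) for a, b in pairs)  # both wrong
    ic = n - cc - ww                                 # exactly one correct
    return {"CC": 100 * cc / n, "WW": 100 * ww / n, "IC": 100 * ic / n}

# Toy example with made-up outcomes:
print(consistency_breakdown([(True, True), (True, False), (False, False), (False, True)]))
# -> {'CC': 25.0, 'WW': 25.0, 'IC': 50.0}
```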

Evaluating six top MLLMs, we find: 1) GPT-4o has the highest IC rate (~70%) on relative-distance tasks, while the others hover around 40%; 2) all models struggle with relative direction, exceeding 40% IC, showing difficulty with orientation shifts; 3) Gemini-2.0-Flash and Claude-3.7-Sonnet show balanced inconsistency across tasks, whereas Ovis2-34B and GPT-4o vary significantly by task.


Figure 5. Paired Q&A inconsistency analysis across six top MLLMs. High IC scores indicate weak multi-view understanding.

MLLMs Fail with Multi-View Correspondence

While MLLMs often succeed when all people are visible in a single view (Complete-Visibility), they sometimes fail to reconcile fragmented information across views (Partial-Visibility): GPT-4o, for example, occasionally reports the largest per-view count rather than matching people across views.
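The failure mode is easy to state in code. In the toy sketch below (person IDs and visibility are invented for illustration), taking the largest per-view count under-counts whenever no single view sees everyone, whereas reconciling identities across views gives the right total:

```python
# Toy example: who is visible in each view (IDs are invented).
view_people = {
    "view_1": {"p1", "p2", "p3"},   # p4 is occluded in this view
    "view_2": {"p2", "p3", "p4"},   # p1 is out of frame in this view
}

# Observed GPT-4o-style shortcut: report the largest single-view count.
naive_count = max(len(people) for people in view_people.values())  # -> 3

# Cross-view reconciliation: union of matched identities across views.
true_count = len(set.union(*view_people.values()))                 # -> 4

print(naive_count, true_count)
```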


Figure 6a. GPT-4o fails to reconcile fragmented information across views under partial-visibility.

We evaluate four prompting strategies on GPT-4o, InternVL2.5-38B, and Ovis2-34B: 1) Zero-Shot CoT, 2) Self-Consistency, 3) Identification CoT, and 4) Coarse Correspondence. Standard CoT strategies offer limited gains, but Coarse Correspondence, which uses visual markers shared across views, boosts performance substantially, improving GPT-4o's partial-visibility accuracy by +25.5 points. Pure reasoning prompts such as Zero-Shot CoT often degrade performance for stronger models, suggesting that visual grounding matters more than linguistic reasoning for multi-view understanding.

Model | View Type | Baseline | ZS-CoT | Self-Consist. | Ident.-CoT | Co. Corr.
GPT-4o | Compl. Vis. | 65.5 | 63.6 (-1.9) | 61.8 (-3.7) | 69.1 (+3.6) | 80.0 (+14.5)
GPT-4o | Partial Vis. | 41.8 | 50.9 (+9.1) | 52.7 (+10.9) | 61.8 (+20.0) | 67.3 (+25.5)
InternVL2.5-38B | Compl. Vis. | 73.2 | 63.6 (-9.6) | 61.8 (-11.4) | 67.2 (-6.0) | 78.2 (+5.0)
InternVL2.5-38B | Partial Vis. | 65.5 | 60.0 (-5.5) | 61.8 (-3.7) | 67.2 (+1.7) | 74.5 (+9.0)
Ovis2-34B | Compl. Vis. | 65.5 | 61.8 (-3.7) | 63.6 (-1.9) | 67.2 (+1.7) | 76.4 (+10.9)
Ovis2-34B | Partial Vis. | 60.0 | 54.5 (-5.5) | 52.7 (-7.3) | 63.6 (+3.6) | 74.5 (+14.5)

Table 3. Comparison of four prompting strategies on multi-view counting. Coarse Correspondence (Co. Corr.) provides the largest gains across all models.
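To make the comparison concrete, here are hedged sketches of the prompt variants; the exact wording used in the evaluation is not reproduced here. Self-Consistency is a decoding strategy (sampling several answers and taking a majority vote) rather than a prompt, and Coarse Correspondence additionally requires drawing matching instance marks onto the images themselves, which plain text can only reference.

```python
# Illustrative prompt variants (assumed wording, not the paper's prompts).
QUESTION = "How many people are in the scene across all views?"

baseline = QUESTION

zero_shot_cot = QUESTION + " Let's think step by step."

identification_cot = (
    QUESTION
    + " First list each distinct person you can identify in every view,"
      " then merge entries that refer to the same person before counting."
)

# Assumes the input images were pre-annotated so that the same person
# carries the same numeric mark in every view.
coarse_correspondence = (
    "Objects with the same numeric mark are identical across views. "
    + QUESTION
)

# Self-Consistency: sample the baseline prompt several times and
# majority-vote over the answers (no prompt change, so no string shown).
```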

MLLMs Fail at Coarse Camera Pose Estimation

While GPT-4o and Gemini-2.0-Flash perform moderately well in single-view scene reconstruction, they struggle with aligning different camera perspectives. Errors in camera pose estimation lead to incorrect directional reasoning, impacting multi-view consistency in MLLMs.
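To illustrate what "coarse" camera pose estimation asks for, the sketch below buckets the relative yaw between two cameras into rough categories; the angle convention and bucket thresholds are invented for illustration, not taken from the benchmark.

```python
def coarse_relative_yaw(yaw_a_deg: float, yaw_b_deg: float) -> str:
    """Bucket the relative yaw between two cameras into coarse options.

    Thresholds and the left/right sign convention are illustrative
    assumptions, not the benchmark's definition.
    """
    delta = (yaw_b_deg - yaw_a_deg + 180) % 360 - 180  # wrap to (-180, 180]
    if abs(delta) < 45:
        return "roughly the same direction"
    if abs(delta) > 135:
        return "roughly opposite directions"
    return "rotated left" if delta > 0 else "rotated right"

print(coarse_relative_yaw(10, 170))  # -> "roughly opposite directions"
```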


Figure 7. MLLMs struggle with aligning different camera perspectives, leading to incorrect directional reasoning.

Root Cause Analysis & Qualitative Failures

We illustrate a qualitative failure in Camera Pose Estimation. While Gemini-2.0-Flash attempts to deduce the camera layout through detailed reasoning and even generates a text-based diagram, it fails to correctly resolve the spatial relationships between views (e.g., misplacing View-2 relative to View-3). This highlights the persistent gap between linguistic reasoning and precise 3D spatial grounding.


Figure 8a. Gemini-2.0-Flash fails to correctly resolve spatial relationships between views despite detailed reasoning.

We further categorize errors for Gemini-2.0-Flash and GPT-4o. The breakdown reveals that Cross-View Spatial Misalignment (blue) is the dominant failure mode for spatial tasks like Relative Distance and Camera Pose Estimation. In contrast, semantic tasks like Counting and Attribute Identification suffer significantly more from Object Mismatch (red) and Visual Hallucination (yellow).


Figure 8b. Error categorization for Gemini-2.0-Flash and GPT-4o across different task types.

BibTeX

@inproceedings{yeh2026seeing,
  title={Seeing from Another Perspective: Evaluating Multi-View Understanding in {MLLMs}},
  author={Yeh, Chun-Hsiao and Wang, Chenyu and Tong, Shengbang and Cheng, Ta-Ying and Wang, Ruoyu and Chu, Tianzhe and Zhai, Yuexiang and Chen, Yubei and Gao, Shenghua and Ma, Yi},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={14},
  pages={12000--12008},
  year={2026}
}