Seeing from Another Perspective:
Evaluating Multi-View Understanding in MLLMs


1UC Berkeley, 2HKU, 3NYU, 4University of Oxford, 5UC Davis, 6TranscEngram, 7SLAI
    *Equal Contribution
AAAI 2026
All-Angles-Bench



  • 📌 A Benchmark for Multi-View Understanding: We introduce All-Angles Bench, a large-scale benchmark with over 2,100 human-annotated multi-view QA pairs across 90 real-world scenes.

  • 📊 Performance Evaluation: We benchmark 27 leading MLLMs, including Gemini-2.5-Flash, Claude-4-Sonnet, and GPT-4o. Our results reveal a substantial gap between MLLMs and humans.

  • 🔍 Decoding MLLM Shortcomings: We identify two major failure modes in MLLMs: (1) weak cross-view correspondence under occlusions and (2) poor estimation of coarse camera poses.

All-Angles Bench: A Comprehensive Evaluation Benchmark


    We introduce All-Angles Bench, a benchmark designed to evaluate the multi-view reasoning capabilities of MLLMs.
    - Feature 1: Contains 2,132 question-answer pairs carefully annotated across 90 diverse real-world scenes sourced from EGO4D-EXO and EgoHumans (a schematic entry is sketched after this list).
    - Feature 2: Comprises 6 tasks: counting, attribute identification, relative distance, relative direction, manipulation, and camera pose estimation.
    - Feature 3: Investigates major aspects of 3D scene understanding, ranging from establishing correspondences between objects to associating relative object and camera poses.
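
    The released data format is not reproduced on this page; as a rough orientation only, a single benchmark entry can be thought of as a record like the minimal sketch below. The field names (scene_id, views, paired_id, etc.) are hypothetical and may differ from the official schema.

    from dataclasses import dataclass
    from typing import List, Optional

    # Hypothetical schema for one All-Angles Bench QA pair; the actual field
    # names and file layout of the released benchmark may differ.
    @dataclass
    class BenchEntry:
        scene_id: str            # scene identifier (scenes come from EGO4D-EXO or EgoHumans)
        views: List[str]         # paths to the multi-view frames of this scene
        task: str                # one of the six task categories listed above
        question: str            # human-annotated multi-view question
        choices: List[str]       # multiple-choice options
        answer: str              # ground-truth option label, e.g. "B"
        paired_id: Optional[str] = None  # id of the paired (rephrased) question, if any

    TASKS = [
        "counting",
        "attribute identification",
        "relative distance",
        "relative direction",
        "manipulation",
        "camera pose estimation",
    ]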

How Do We Make All-Angles Bench?


    Our benchmark is constructed through a rigorous three-stage pipeline ensuring high quality and consistency.
    - [Data Collection & Question Type Design] involves curating 90 diverse multi-view scenes and designing six tasks to evaluate multi-view reasoning.
    - [Question Creation & Human Annotation] utilizes MLLMs for initial question generation, followed by refinement and validation through human annotation to ensure clarity, correctness, and relevance.
    - [Paired-Question Generation & Human Quality Check] assesses cross-view consistency by systematically rephrasing questions or altering perspectives to generate paired questions while preserving visual correspondences, followed by a final quality-control pass (a simplified sketch of the pairing idea follows below).
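
    To make the pairing idea concrete, the sketch below derives a paired variant of a view-referencing question by permuting the order of the input views and remapping the view indices mentioned in the text, so the underlying visual correspondence (and thus the correct answer) is unchanged. This is an illustration only, not the actual pipeline, which relies on MLLM-assisted rephrasing followed by human quality checks; the dict keys and "View-N" phrasing are assumptions.

    import random
    import re
    from typing import Dict, List

    def make_paired_question(entry: Dict, seed: int = 0) -> Dict:
        """Illustrative pairing: shuffle the presentation order of the views and
        remap any "View-N" references so the correct answer is preserved.
        Assumes every referenced view index exists in entry["views"]."""
        rng = random.Random(seed)
        views: List[str] = entry["views"]
        perm = list(range(len(views)))
        rng.shuffle(perm)

        # Old view index -> new view index after shuffling.
        remap = {old: new for new, old in enumerate(perm)}

        def remap_text(text: str) -> str:
            # Assumes view references are written as "View-1", "View-2", ... (1-based).
            return re.sub(
                r"View-(\d+)",
                lambda m: f"View-{remap[int(m.group(1)) - 1] + 1}",
                text,
            )

        paired = dict(entry)
        paired["views"] = [views[i] for i in perm]
        paired["question"] = remap_text(entry["question"])
        paired["choices"] = [remap_text(c) for c in entry["choices"]]
        return paired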

Evaluation on All-Angles Bench


    We consolidate performance from both closed-source and open-source MLLM evaluations. We use dark gray to highlight the top result among all models in each sub-task, while light gray marks the second-best result. As the primary results above show, there remains a substantial performance gap between both closed- and open-source MLLMs and human-level multi-view understanding. We highlight several findings below.


    Finding 1: Tasks that are simple for humans, such as coarse camera pose estimation, pose challenges for MLLMs.


    While humans achieve near-perfect accuracy on All-Angles Bench, both open- and closed-source MLLMs struggle. In camera pose estimation, human annotators reach 88.9% accuracy, while top MLLMs like Gemini-2.0-Flash, Qwen2.5-VL-72B, and InternVL2.5-38B lag by over 50%. Many open-source models perform worse than random guessing, often failing to align viewpoints or interpret geometric relationships, highlighting a significant gap from human-level reasoning.



    Finding 2: Certain open-source MLLMs surpass closed-source ones in orientation-sensitive tasks.


    Interestingly, Ovis2-34B and Qwen2.5-VL-72B outperform closed-source models like Gemini-2.0 and Claude-3.7-Sonnet in object manipulation and relative direction. Qwen2.5-VL-72B benefits from robust video understanding and fine-grained visual grounding, excelling at tracking object re-orientation across views. The success of open-source models suggests that domain-specific refinements, such as video-focused training, can enhance orientation and geometric reasoning—offering insights for improving multi-view MLLMs.

    Finding 3: 3D-specialized spatial reasoning MLLMs close the gap but are often limited to specific subtasks.


    Recent MLLMs purpose-built for spatial reasoning—such as SpaceR, VLM-3R, and AoTD—achieve notable gains across spatial subtasks. In particular, SpaceR scores 51.1 on Camera Pose Estimation, surpassing many general-purpose MLLMs. However, their strengths are often confined to the specific subtasks they explicitly target. These results suggest that while injecting spatial priors or specialized architectures helps, it does not yet fully solve the general multi-view challenge.

Paired Q&A Inconsistency in MLLMs


    We classify model responses into three categories: CC (both correct), WW (both wrong), and IC (inconsistent—one correct, one wrong). High IC scores indicate weak multi-view understanding, where simple rewording leads to failure.
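
    As a minimal sketch (assuming per-pair correctness flags have already been produced by an evaluation harness), the three categories can be tallied as follows.

    from collections import Counter
    from typing import Dict, Iterable, Tuple

    def consistency_breakdown(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
        """Each element is (correct_on_original, correct_on_paired) for one
        paired question; returns the CC / WW / IC shares as percentages."""
        counts: Counter = Counter()
        for a, b in pairs:
            if a and b:
                counts["CC"] += 1      # both correct
            elif not a and not b:
                counts["WW"] += 1      # both wrong
            else:
                counts["IC"] += 1      # inconsistent: one correct, one wrong
        total = max(sum(counts.values()), 1)
        return {k: 100.0 * counts[k] / total for k in ("CC", "WW", "IC")}

    # Example: three paired questions, one answered inconsistently.
    print(consistency_breakdown([(True, True), (True, False), (False, False)]))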

    Evaluating six top MLLMs, we find: 1) GPT-4o has the highest IC score (~70%) on relative distance tasks, while others hover around 40%. 2) All models struggle with relative direction, exceeding 40% IC, showing difficulty with orientation shifts. 3) Gemini-2.0-Flash and Claude-3.7-Sonnet have balanced inconsistency across tasks, whereas Ovis2-34B and GPT-4o show significant task-based variability.

MLLMs Fail with Multi-View Correspondence

While MLLMs often succeed when everyone is visible in a single viewpoint (Complete-Visibility), they sometimes fail to reconcile fragmented information across views (Partial-Visibility): GPT-4o, for example, occasionally picks the largest per-view count rather than reconciling people across views.

We evaluate four prompting strategies: 1) Zero-Shot CoT, 2) Self-Consistency, 3) Identification CoT, and 4) Coarse Correspondence on GPT-4o, InternVL2.5-38B, and Ovis2-34B.

While standard CoT strategies offer limited gains, Coarse Correspondence—which uses visual markers—significantly boosts performance, improving GPT-4o's partial-visibility accuracy by +25.5%. However, pure reasoning prompts (like Zero-Shot CoT) often degrade performance for stronger models (e.g., InternVL), suggesting that visual grounding is more critical than linguistic reasoning for multi-view understanding.
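
As a rough sketch of two of these strategies, the snippet below outlines Zero-Shot CoT and Self-Consistency around a hypothetical query_model(views, prompt) call. The prompt wording, the query_model interface, and the answer parsing are assumptions for illustration, not the exact setup used in our evaluation.

    from collections import Counter
    from typing import Callable, List

    COT_SUFFIX = "\nLet's think step by step, then answer with a single option letter."

    def zero_shot_cot(query_model: Callable[[List[str], str], str],
                      views: List[str], question: str) -> str:
        # Zero-Shot CoT: append a reasoning trigger and take a single response.
        return query_model(views, question + COT_SUFFIX)

    def self_consistency(query_model: Callable[[List[str], str], str],
                         views: List[str], question: str, n: int = 5) -> str:
        # Self-Consistency: sample several responses (assuming a stochastic
        # decoder, e.g. temperature > 0) and majority-vote on the final letter.
        # The last character is naively treated as the option letter; a real
        # harness would parse answers more robustly.
        answers = [query_model(views, question + COT_SUFFIX).strip()[-1]
                   for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]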

MLLMs Fail with Coarse Camera Estimation


    While GPT-4o and Gemini-2.0-Flash perform moderately well in single-view scene reconstruction, they struggle with aligning different camera perspectives. Errors in camera pose estimation lead to incorrect directional reasoning, impacting multi-view consistency in MLLMs.

Root Cause Analysis & Qualitative Failures

We illustrate a qualitative failure in Camera Pose Estimation. While Gemini-2.0-Flash attempts to deduce the camera layout through detailed reasoning and even generates a text-based diagram, it fails to correctly resolve the spatial relationships between views (e.g., misplacing View-2 relative to View-3). This highlights the persistent gap between linguistic reasoning and precise 3D spatial grounding.

We further categorize errors for Gemini-2.0-Flash and GPT-4o. The breakdown reveals that Cross-View Spatial Misalignment (blue) is the dominant failure mode for spatial tasks like Relative Distance and Camera Pose Estimation. In contrast, semantic tasks like Counting and Attribute Identification suffer significantly more from Object Mismatch (red) and Visual Hallucination (yellow).

BibTeX


    @article{yeh2025seeing,
      title={Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs},
      author={Chun-Hsiao Yeh and Chenyu Wang and Shengbang Tong and Ta-Ying Cheng and Rouyu Wang and Tianzhe Chu and Yuexiang Zhai and Yubei Chen and Shenghua Gao and Yi Ma},
      journal={arXiv preprint arXiv:2504.15280},
      year={2025}
    }