We consolidate the evaluation results of both closed-source and open-source MLLMs.
We use dark gray to highlight the top result among all models in each subtask, while
light gray marks the second-best result.
As the primary results above show, a substantial performance gap remains between both closed- and open-source MLLMs and human-level multi-view understanding.
We highlight several findings below.
Finding 1: Tasks that are simple for humans, such as coarse camera pose estimation, pose challenges for MLLMs.
While humans achieve near-perfect accuracy on All-Angles Bench, both open- and closed-source MLLMs struggle.
In camera pose estimation, human annotators reach 88.9% accuracy, while top MLLMs like Gemini-2.0-Flash,
Qwen2.5-VL-72B, and InternVL2.5-38B lag behind human performance by over 50%. Many open-source models perform worse than random guessing,
often failing to align viewpoints or interpret geometric relationships, which highlights a significant gap from human-level reasoning.
Finding 2: Certain open-source MLLMs surpass closed-source ones in orientation-sensitive tasks.
Interestingly, Ovis2-34B and Qwen2.5-VL-72B outperform closed-source models such as Gemini-2.0 and Claude-3.7-Sonnet
on the object manipulation and relative direction subtasks.
Qwen2.5-VL-72B benefits from robust video understanding and fine-grained visual grounding, excelling at tracking object re-orientation across views.
The success of open-source models suggests that domain-specific refinements, such as video-focused training,
can enhance orientation and geometric reasoning—offering insights for improving multi-view MLLMs.
Finding 3: MLLMs specialized for 3D spatial reasoning narrow the gap but are often limited to specific subtasks.
Recent MLLMs purpose-built for spatial reasoning—such as SpaceR, VLM-3R, and AoTD—achieve notable gains across spatial subtasks.
In particular, SpaceR scores 51.1 on camera pose estimation, surpassing many general-purpose MLLMs.
However, their strengths are often confined to the specific subtasks they explicitly target.
These results suggest that while injecting spatial priors or specialized architectures helps, it does not yet fully solve the general multi-view challenge.