We consolidate the evaluation results of both closed-source and open-source MLLMs.
We use dark gray to highlight the top result among all models in each subtask, while
light gray marks the second-best result.
As the primary results above show, a substantial performance gap remains between both closed- and open-source MLLMs and human-level multi-view understanding.
We highlight several findings below.
Finding 1: Tasks that are simple for humans, such as coarse camera pose estimation, pose challenges for MLLMs.
While humans achieve near-perfect accuracy on All-Angles Bench, both open- and closed-source MLLMs struggle.
In camera pose estimation, human annotators reach 88.9% accuracy, while top MLLMs like Gemini-2.0-Flash,
Qwen2.5-VL-72B, and InternVL2.5-38B lag behind human performance by over 50%. Many open-source models perform worse than random guessing,
often failing to align viewpoints or interpret geometric relationships, which highlights a significant gap from human-level reasoning.
Finding 2: Certain open-source MLLMs surpass closed-source ones in orientation-sensitive tasks.
Interestingly, Ovis2-34B and Qwen2.5-VL-72B outperform closed-source models such as Gemini-2.0 and Claude-3.7-Sonnet
on the object manipulation and relative direction subtasks.
Qwen2.5-VL-72B benefits from robust video understanding and fine-grained visual grounding, excelling at tracking object re-orientation across views.
The success of open-source models suggests that domain-specific refinements, such as video-focused training,
can enhance orientation and geometric reasoning—offering insights for improving multi-view MLLMs.
Finding 3: MLLMs specialized for 3D spatial reasoning narrow the gap but are often limited to specific subtasks.
Recent MLLMs purpose-built for spatial reasoning—such as SpaceR, VLM-3R, and AoTD—achieve notable gains across spatial subtasks.
In particular, SpaceR scores 51.1 on camera pose estimation, surpassing many general-purpose MLLMs.
However, their strengths are often confined to the specific subtasks they explicitly target.
These results suggest that while injecting spatial priors or specialized architectures helps, it does not yet fully solve the general multi-view challenge.