We consolidate results from both closed-source and open-source MLLM evaluations. Deeper gray highlights the top result among all models in each sub-task, while light gray marks the second-best result.
As the primary results above show, a substantial performance gap remains between both closed- and open-source MLLMs and human-level multi-view understanding. We highlight several findings from our observations below.
Finding 1: Tasks that are simple for humans, such as coarse camera pose estimation, pose challenges for MLLMs.
While humans achieve near-perfect accuracy on All-Angles Bench, both open- and closed-source MLLMs struggle.
In camera pose estimation, human annotators reach 88.9% accuracy, while top MLLMs like Gemini-2.0-Flash,
Qwen2.5-VL-72B, and InternVL2.5-38B lag by over 50%. Many open-source models perform worse than random guessing,
often failing to align viewpoints or interpret geometric relationships, highlighting a significant gap from human-level reasoning.
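To make the "worse than random guessing" comparison concrete, the minimal sketch below computes a chance baseline for an n-way multiple-choice question and the absolute gap to the reported human accuracy of 88.9%. The number of answer options and the example model accuracies are illustrative assumptions, not results from the benchmark.

```python
# Minimal sketch: comparing reported accuracies against a chance baseline.
# The number of answer options (n_options) and the model accuracies below
# are assumptions for illustration, not actual benchmark values.

def chance_baseline(n_options: int) -> float:
    """Expected accuracy of uniform random guessing on n-way multiple choice."""
    return 1.0 / n_options

def gap_to_human(model_acc: float, human_acc: float = 0.889) -> float:
    """Absolute accuracy gap between a model and human annotators (88.9%)."""
    return human_acc - model_acc

# Hypothetical model accuracies on camera pose estimation (illustrative only).
reported = {"model_a": 0.36, "model_b": 0.21}
baseline = chance_baseline(n_options=4)  # assumed 4-way multiple choice

for name, acc in reported.items():
    print(f"{name}: acc={acc:.2f}, gap to human={gap_to_human(acc):.2f}, "
          f"below chance baseline ({baseline:.2f}): {acc < baseline}")
```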
Finding 2: Certain open-source MLLMs surpass closed-source ones in orientation-sensitive tasks.
Interestingly, Ovis2-34B and Qwen2.5-VL-72B outperform closed-source models like Gemini-2.0 and Claude-3.7-Sonnet
in object manipulation and relative direction.
Qwen2.5-VL-72B benefits from robust video understanding and fine-grained visual grounding, excelling at tracking object re-orientation across views.
The success of open-source models suggests that domain-specific refinements, such as video-focused training,
can enhance orientation and geometric reasoning, offering insights for improving multi-view MLLMs.