VKR (Video Knowledge Reasoning), TSR (Temporal Spatial Reasoning), VPA (Video Plot Analysis), TSG (Temporal Spatial Grounding).
The best results are highlighted in bold and the second-best are underlined.
Rank | Model | FTR | VTC | VTG | VKR | TSR | VPA | TSG | Avg |
---|---|---|---|---|---|---|---|---|---|
1 | o1 🥇 | 66.7 | 52.2 | 56.9 | 74.3 | 61.0 | 60.2 | 0.0 | 56.7 |
2 | Gemini-2.0-Flash 🥈 | 66.2 | 51.2 | 62.0 | 64.4 | 54.1 | 58.1 | 4.2 | 51.7 |
3 | GPT-4o 🥉 | 54.7 | 49.1 | 44.8 | 68.6 | 48.9 | 57.6 | 2.8 | 46.9 |
4 | Gemini-1.5-Pro | 55.1 | 45.3 | 52.9 | 62.0 | 45.0 | 45.6 | 0.7 | 44.0 |
5 | Claude 3.5 Sonnet | 45.3 | 46.3 | 34.3 | 64.2 | 44.0 | 49.3 | 0.7 | 41.0 |
6 | Aria-25B | 45.3 | 45.0 | 33.6 | 56.2 | 43.7 | 38.8 | 2.8 | 38.2 |
7 | Qwen2.5-VL-72B | 45.0 | 39.9 | 34.1 | 56.2 | 38.1 | 48.9 | 2.1 | 37.9 |
8 | LLaVA-Video-72B | 49.7 | 49.1 | 17.5 | 49.7 | 43.7 | 43.2 | 2.1 | 36.6 |
9 | LLaVA-OneVision-72B | 47.8 | 42.2 | 25.9 | 52.3 | 45.9 | 38.1 | 0.0 | 36.4 |
10 | InternVideo2.5-8B | 40.9 | 43.5 | 14.0 | 42.1 | 48.1 | 41.7 | 0.0 | 33.0 |
11 | LLaVA-Video-7B | 47.2 | 36.6 | 18.9 | 41.8 | 40.7 | 40.3 | 0.0 | 32.5 |
12 | VideoLLaMA3-7B | 44.7 | 36.6 | 24.5 | 43.1 | 36.3 | 39.6 | 0.7 | 32.5 |
13 | InternVL2.5-78B | 40.9 | 39.8 | 9.8 | 52.9 | 29.6 | 39.6 | 0.0 | 30.9 |
14 | LLaVA-OneVision-7B | 35.8 | 34.8 | 24.5 | 39.9 | 37.8 | 41.0 | 0.0 | 30.7 |
15 | Qwen2.5-VL-7B | 37.1 | 26.7 | 29.4 | 47.1 | 34.8 | 36.0 | 0.7 | 30.4 |
16 | MiniCPM-o2.6-8B | 31.4 | 30.4 | 12.6 | 43.8 | 30.4 | 38.1 | 0.0 | 26.9 |
17 | InternVL2.5-8B | 32.7 | 29.8 | 11.9 | 33.3 | 25.9 | 30.9 | 0.7 | 23.9 |
18 | mPLUG-Owl3-7B | 13.2 | 6.2 | 2.8 | 5.9 | 15.6 | 7.2 | 0.0 | 7.3 |
19 | Llama-3.2-11B-Vision | 4.4 | 4.3 | 7.0 | 6.5 | 6.7 | 5.8 | 0.0 | 4.9 |
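
The caption's convention (best score per column in bold, second-best underlined) can be recomputed from the rows above. The following is a minimal Python sketch, not the authors' tooling: the helper names `parse_rows` and `top_two` are hypothetical, and the tie handling (models with equal scores share a rank) is an assumption that the source does not state.

```python
# Column names and scores copied from the table above (model name first).
COLUMNS = ["FTR", "VTC", "VTG", "VKR", "TSR", "VPA", "TSG", "Avg"]
ROWS = """
o1 | 66.7 | 52.2 | 56.9 | 74.3 | 61.0 | 60.2 | 0.0 | 56.7
Gemini-2.0-Flash | 66.2 | 51.2 | 62.0 | 64.4 | 54.1 | 58.1 | 4.2 | 51.7
GPT-4o | 54.7 | 49.1 | 44.8 | 68.6 | 48.9 | 57.6 | 2.8 | 46.9
Gemini-1.5-Pro | 55.1 | 45.3 | 52.9 | 62.0 | 45.0 | 45.6 | 0.7 | 44.0
Claude 3.5 Sonnet | 45.3 | 46.3 | 34.3 | 64.2 | 44.0 | 49.3 | 0.7 | 41.0
Aria-25B | 45.3 | 45.0 | 33.6 | 56.2 | 43.7 | 38.8 | 2.8 | 38.2
Qwen2.5-VL-72B | 45.0 | 39.9 | 34.1 | 56.2 | 38.1 | 48.9 | 2.1 | 37.9
LLaVA-Video-72B | 49.7 | 49.1 | 17.5 | 49.7 | 43.7 | 43.2 | 2.1 | 36.6
LLaVA-OneVision-72B | 47.8 | 42.2 | 25.9 | 52.3 | 45.9 | 38.1 | 0.0 | 36.4
InternVideo2.5-8B | 40.9 | 43.5 | 14.0 | 42.1 | 48.1 | 41.7 | 0.0 | 33.0
LLaVA-Video-7B | 47.2 | 36.6 | 18.9 | 41.8 | 40.7 | 40.3 | 0.0 | 32.5
VideoLLaMA3-7B | 44.7 | 36.6 | 24.5 | 43.1 | 36.3 | 39.6 | 0.7 | 32.5
InternVL2.5-78B | 40.9 | 39.8 | 9.8 | 52.9 | 29.6 | 39.6 | 0.0 | 30.9
LLaVA-OneVision-7B | 35.8 | 34.8 | 24.5 | 39.9 | 37.8 | 41.0 | 0.0 | 30.7
Qwen2.5-VL-7B | 37.1 | 26.7 | 29.4 | 47.1 | 34.8 | 36.0 | 0.7 | 30.4
MiniCPM-o2.6-8B | 31.4 | 30.4 | 12.6 | 43.8 | 30.4 | 38.1 | 0.0 | 26.9
InternVL2.5-8B | 32.7 | 29.8 | 11.9 | 33.3 | 25.9 | 30.9 | 0.7 | 23.9
mPLUG-Owl3-7B | 13.2 | 6.2 | 2.8 | 5.9 | 15.6 | 7.2 | 0.0 | 7.3
Llama-3.2-11B-Vision | 4.4 | 4.3 | 7.0 | 6.5 | 6.7 | 5.8 | 0.0 | 4.9
"""


def parse_rows(text):
    """Parse pipe-delimited rows into {model: {column: score}}."""
    table = {}
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        table[cells[0]] = dict(zip(COLUMNS, map(float, cells[1:])))
    return table


def top_two(table, column):
    """Return ((best, models), (second_best, models)) for one column.

    Assumption: models with the same score share a rank.
    """
    distinct = sorted({row[column] for row in table.values()}, reverse=True)

    def holders(score):
        return [model for model, row in table.items() if row[column] == score]

    best, second = distinct[0], distinct[1]
    return (best, holders(best)), (second, holders(second))


TABLE = parse_rows(ROWS)

if __name__ == "__main__":
    for col in COLUMNS:
        (best, best_models), (second, second_models) = top_two(TABLE, col)
        print(f"{col}: best {best} ({', '.join(best_models)}); "
              f"second-best {second} ({', '.join(second_models)})")
```

Under this tie rule, a column such as TSG can have more than one second-best model, so a renderer following the caption would underline each of them.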