Temporal: FT (Forward Temporal), RT (Reverse Temporal), AS (Action Sequence)
Reasoning: CR (Causal Relationship), PU (Plot Understanding), CI (Counterfactual Inference)
Spatial: SR (Spatial Relationship)
Counting: GC (General Counting)
The best results are highlighted in bold and the second-best are underlined.
# | Model | Perception: OA | HA | OD | FM | Reasoning: CR | PU | CI | Temporal: FT | RT | AS | Spatial: SR | Counting: GC | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Gemini-1.5-Pro 🥇 | 70.9 | 74.8 | 34.8 | 58.8 | 80.7 | 76.8 | 48.6 | 70.4 | 70.5 | 46.4 | 70.0 | 56.1 | 67.9 |
2 | GPT-4o 🥈 | 76.6 | 68.9 | 41.3 | 60.8 | 67.0 | 73.3 | 67.6 | 68.1 | 70.5 | 50.0 | 54.0 | 48.4 | 65.4 |
3 | Gemini-1.5-Flash 🥉 | 61.2 | 64.4 | 28.3 | 52.6 | 72.3 | 64.2 | 37.8 | 58.0 | 54.5 | 35.1 | 56.0 | 52.1 | 57.3 |
4 | GPT-4o-Mini | 68.8 | 61.0 | 30.4 | 49.0 | 65.1 | 63.6 | 32.4 | 48.3 | 56.8 | 41.1 | 62.0 | 45.3 | 56.3 |
5 | mPLUG-Owl3-7B | 57.4 | 59.7 | 39.1 | 43.9 | 60.7 | 58.4 | 27.0 | 61.3 | 75.0 | 38.6 | 50.0 | 37.9 | 54.3 |
6 | LLaVA-Video-7B | 64.3 | 54.9 | 32.6 | 56.1 | 50.0 | 59.5 | 48.6 | 47.9 | 54.5 | 49.1 | 52.0 | 36.8 | 52.6 |
7 | Qwen2-VL-7B | 49.6 | 54.9 | 32.6 | 47.4 | 58.0 | 57.2 | 70.3 | 54.6 | 52.3 | 28.1 | 48.0 | 32.6 | 50.7 |
8 | LLaVA-OV-7B | 59.7 | 54.5 | 32.6 | 36.8 | 46.4 | 59.0 | 35.1 | 53.8 | 59.1 | 36.8 | 50.0 | 32.6 | 49.9 |
9 | MiniCPM-V 2.6-8B | 50.4 | 51.8 | 17.4 | 49.1 | 53.6 | 61.8 | 37.8 | 49.6 | 50.0 | 31.6 | 48.0 | 27.4 | 48.0 |
10 | LLaVA-NeXT-INST-IT-7B | 57.4 | 58.4 | 26.1 | 42.4 | 43.0 | 49.2 | 31.6 | 49.2 | 42.2 | 26.3 | 42.0 | 27.4 | 46.3 |
11 | LLaVA-NeXT-7B | 56.6 | 55.6 | 34.8 | 52.5 | 43.0 | 48.6 | 31.6 | 42.6 | 42.2 | 28.1 | 42.0 | 30.5 | 46.0 |
12 | VideoLLaMA2-7B | 47.3 | 45.8 | 26.1 | 45.6 | 41.1 | 52.0 | 35.1 | 44.5 | 50.0 | 29.8 | 44.0 | 32.6 | 43.4 |
13 | InternVL2.5-8B | 50.4 | 48.2 | 26.1 | 57.9 | 37.5 | 47.4 | 51.4 | 40.3 | 38.6 | 36.8 | 30.0 | 31.6 | 43.2 |
14 | PLLaVA-7B | 45.7 | 48.2 | 21.7 | 45.6 | 39.3 | 54.9 | 24.3 | 47.1 | 45.5 | 28.1 | 40.0 | 28.4 | 43.0 |
15 | InternVL2-8B | 48.1 | 47.8 | 23.9 | 35.1 | 42.9 | 51.4 | 59.5 | 42.0 | 36.4 | 28.1 | 46.0 | 24.2 | 42.7 |
16 | ShareGPT4Video-8B | 40.3 | 43.1 | 21.7 | 45.6 | 40.2 | 45.7 | 51.4 | 43.7 | 40.9 | 24.6 | 46.0 | 30.5 | 40.6 |
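
The best and second-best entries in a column can be recovered directly from the scores. The following is a minimal sketch, not the authors' tooling: the helper name `top_two` is hypothetical, and the values are copied from the OA and CI columns of the table above. Ties are broken by table order, which is an assumption.

```python
# Model names in the same order as the table rows (ranks 1 to 16).
MODELS = [
    "Gemini-1.5-Pro", "GPT-4o", "Gemini-1.5-Flash", "GPT-4o-Mini",
    "mPLUG-Owl3-7B", "LLaVA-Video-7B", "Qwen2-VL-7B", "LLaVA-OV-7B",
    "MiniCPM-V 2.6-8B", "LLaVA-NeXT-INST-IT-7B", "LLaVA-NeXT-7B",
    "VideoLLaMA2-7B", "InternVL2.5-8B", "PLLaVA-7B", "InternVL2-8B",
    "ShareGPT4Video-8B",
]

# Two columns copied from the table: OA and CI.
SCORES = {
    "OA": [70.9, 76.6, 61.2, 68.8, 57.4, 64.3, 49.6, 59.7,
           50.4, 57.4, 56.6, 47.3, 50.4, 45.7, 48.1, 40.3],
    "CI": [48.6, 67.6, 37.8, 32.4, 27.0, 48.6, 70.3, 35.1,
           37.8, 31.6, 31.6, 35.1, 51.4, 24.3, 59.5, 51.4],
}


def top_two(models, values):
    """Return (best, second_best) model names for one score column.

    Hypothetical helper: sorts by score in descending order; equal scores
    keep their table order because Python's sort is stable.
    """
    ranked = sorted(zip(models, values), key=lambda pair: pair[1], reverse=True)
    return ranked[0][0], ranked[1][0]


if __name__ == "__main__":
    for column, values in SCORES.items():
        best, second = top_two(MODELS, values)
        print(f"{column}: best={best}, second-best={second}")
```

On these two columns the sketch reports GPT-4o ahead of Gemini-1.5-Pro for OA, and Qwen2-VL-7B ahead of GPT-4o for CI, so the top model in a column is not always the top model by Avg.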