We introduce 'Thinking with Video', a new paradigm leveraging video generation for multimodal reasoning. Our VideoThinkBench shows that Sora-2 surpasses GPT-5 by 10% on Eyeballing Puzzles and reaches 69% accuracy on MMMU.
"Thinking with Text" and "Thinking with Images" paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU).
Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs and even surpasses them on several tasks, such as Eyeballing Puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 69.2% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance.
In summary, our findings suggest that video generation models have the potential to serve as unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.
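As a rough illustration of the self-consistency idea mentioned above, the sketch below takes the answers extracted from several independently generated outputs for one question and returns the most frequent one. It is a minimal sketch, not the procedure used in the paper: the function name `majority_vote`, the whitespace normalisation, and the sample answers are all assumptions made for illustration, and the step that extracts an answer from a generated video is not shown.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Hypothetical helper: each entry is the answer read off one sampled
    generation; None marks a sample where no answer could be extracted.
    """
    # Strip whitespace so "18" and " 18 " count as the same answer.
    cleaned = [a.strip() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    # Counter.most_common breaks ties by first occurrence.
    return Counter(cleaned).most_common(1)[0][0]


# Hypothetical answers extracted from five samples for one question.
sampled = ["18", "18", "20", None, "18"]
print(majority_vote(sampled))
```

Voting over several samples trades extra generation cost for robustness to occasional wrong or unreadable outputs.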
[Example gallery; task labels: Arc Connect, Circle Center, Circle Tangent Line, Circle Tangent Point, Isosceles Trapezoid, Midpoint, Orthocenter, Parallel, Parallelogram, Perpendicular Bisector, Ray Intersection, Ray Reflection, Hexagon Color Pattern Match, Grid Color Pattern Match, Shape Color Pattern Match, Rectangle Height Color Match, Color Mixing Perception & Application, Color Gradient Perception & Application, Grid Size Pattern Match, Cycle Size Pattern Match, Grid Shape & Size Pattern Match, Reflection Recognition & Application, ARC-AGI-2, Square Maze, GSM8K, MathVista, MMMU, MMBench]
@misc{tong2025thinkingvideovideogeneration,
title={Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm},
author={Jingqi Tong and Yurong Mou and Hangcheng Li and Mingzhe Li and Yongzhuo Yang and Ming Zhang and Qiguang Chen and Tianyi Liang and Xiaomeng Hu and Yining Zheng and Xinchi Chen and Jun Zhao and Xuanjing Huang and Xipeng Qiu},
year={2025},
eprint={2511.04570},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.04570},
}