Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Jingqi Tong1,2,*, Yurong Mou1,*, Hangcheng Li1,2,*, Mingzhe Li2,*, Yongzhuo Yang1,*, Ming Zhang1, Qiguang Chen3, Tianyi Liang1,2, Xiaomeng Hu4, Yining Zheng1, Xinchi Chen1, Jun Zhao1,†, Xuanjing Huang1, Xipeng Qiu1,2,†
1Fudan University    2Shanghai Innovation Institute    3Central South University    4The Chinese University of Hong Kong
* Core contribution   † Corresponding authors

TL;DR

We introduce "Thinking with Video", a new paradigm that leverages video generation for multimodal reasoning. On our VideoThinkBench, Sora-2 surpasses GPT-5 by 10% on Eyeballing Puzzles and reaches 69% accuracy on MMMU.

Overview figure: Thinking with Video, with demo examples on GSM8K, Visual Puzzle, Ray Intersection, ARC-AGI-2, and Maze tasks.

Abstract

"Thinking with Text" and "Thinking with Images" paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU).

Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses them on several tasks, such as Eyeballing Puzzles. On text-centric tasks, it achieves 92% accuracy on MATH and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the sources of these abilities, and we find that self-consistency and in-context learning can improve Sora-2's performance.

In summary, our findings suggest that video generation models can serve as unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.
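
To make the paradigm concrete, the sketch below shows how one evaluation loop with self-consistency could look. This is a minimal illustration, not our released code: generate_video, extract_final_frame, and read_answer_from_frame are hypothetical placeholders rather than a real Sora-2 API, while the majority vote over sampled videos mirrors the self-consistency strategy mentioned above.

from collections import Counter


def generate_video(prompt: str, seed: int) -> str:
    """Placeholder for a video-generation call (e.g., a Sora-2 endpoint). Hypothetical."""
    raise NotImplementedError


def extract_final_frame(video_path: str):
    """Placeholder: grab the last frame, where the model is asked to write its answer."""
    raise NotImplementedError


def read_answer_from_frame(frame) -> str:
    """Placeholder: read the written answer off the frame (OCR or a VLM reader)."""
    raise NotImplementedError


def think_with_video(question: str, n_samples: int = 5) -> str:
    """Sample several videos for one question and majority-vote the answers
    read from their final frames (self-consistency)."""
    prompt = (
        "Reason step by step in the video and write the final answer "
        f"clearly in the last frame. Question: {question}"
    )
    answers = []
    for seed in range(n_samples):
        video = generate_video(prompt, seed=seed)       # one reasoning "rollout"
        frame = extract_final_frame(video)              # frame holding the answer
        answers.append(read_answer_from_frame(frame))   # parse the written answer
    # Majority vote across samples (self-consistency).
    return Counter(answers).most_common(1)[0][0]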

Eyeballing Puzzles

Examples: Arc Connect, Circle Center, Circle Tangent Line, Circle Tangent Point, Isosceles Trapezoid, Midpoint, Orthocenter, Parallel, Parallelogram, Perpendicular Bisector, Ray Intersection, Ray Reflection.

Visual Puzzles

Examples: Hexagon Color Pattern Match, Grid Color Pattern Match, Shape Color Pattern Match, Rectangle Height Color Match, Color Mixing Perception & Application, Color Gradient Perception & Application, Grid Size Pattern Match, Cycle Size Pattern Match, Grid Shape & Size Pattern Match, Reflection Recognition & Application.

ARC-AGI-2

Examples: four ARC-AGI-2 tasks.

Maze

Example: Square Maze.

GSM8K

Examples: two GSM8K problems.

Multimodal Reasoning

Examples: MathVista (two), MMMU, MMBench.

Failure Cases

BibTeX

@misc{tong2025thinkingvideovideogeneration,
      title={Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm}, 
      author={Jingqi Tong and Yurong Mou and Hangcheng Li and Mingzhe Li and Yongzhuo Yang and Ming Zhang and Qiguang Chen and Tianyi Liang and Xiaomeng Hu and Yining Zheng and Xinchi Chen and Jun Zhao and Xuanjing Huang and Xipeng Qiu},
      year={2025},
      eprint={2511.04570},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.04570}, 
}