
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

CVPR 2026
Jingqi Tong2,4,5,*, Yurong Mou2,4,5,*, Hangcheng Li2,4,5,*, Mingzhe Li2,4,5,*, Yongzhuo Yang4,5,*, Ming Zhang4, Qiguang Chen7, Tianyi Liang2,5, Xiaomeng Hu6, Yining Zheng1,3,5, Xinchi Chen1,3,4,5,†, Jun Zhao4,†, Xuanjing Huang1,3,4, Xipeng Qiu1,2,3,4,5,†
1Institute of Trustworthy Embodied AI, Fudan University 2Shanghai Innovation Institute 3Shanghai Key Laboratory of Multimodal Embodied AI
4College of Computer Science and Artificial Intelligence, Fudan University 5OpenMOSS Team 6The Chinese University of Hong Kong 7Central South University
* Core contribution   † Corresponding authors
Paper PDF · GitHub · 🤗 Benchmark · 🤗 Daily Paper · Leaderboard · Twitter

TL;DR

We introduce "Thinking with Video", a new paradigm that leverages video generation for multimodal reasoning. On our VideoThinkBench, Sora-2 surpasses GPT-5 by 10% on Eyeballing Puzzles and reaches 69.2% accuracy on MMMU.

Thinking with Video - Research Overview (example tasks: GSM8K, Visual Puzzle, Ray Reflection, ARC-AGI-2, Maze)

Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and cannot represent dynamic processes or continuous changes, and (2) text and vision remain separate modalities, which hinders unified multimodal understanding and generation. We therefore propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which covers both vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., GSM8K and MMMU).

Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is comparable to state-of-the-art (SOTA) VLMs, and even surpasses GPT-5 by 10% on Eyeballing Puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the sources of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance.
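The self-consistency gain mentioned above can be illustrated with a minimal majority-vote sketch. This is an illustration of the general technique, not the paper's evaluation code: the sampling and answer-parsing steps are assumed to have already produced a list of candidate final answers, and only the aggregation is shown.

```python
from collections import Counter

def self_consistency_vote(answers):
    """Return the most frequent answer among independent samples.

    Minimal sketch of self-consistency: query the model several
    times (e.g., generate several videos for the same question),
    parse a final answer from each, then take a majority vote.
    Ties are broken by first occurrence (Counter preserves
    insertion order).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]

# Example: five sampled answers to a GSM8K-style question.
samples = ["42", "42", "41", "42", "40"]
print(self_consistency_vote(samples))  # → 42
```

Because each sample is independent, accuracy typically improves with the number of samples as long as the model is right more often than any single wrong answer is repeated.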

In summary, our findings suggest that video generation models can serve as unified multimodal understanding and generation models, positioning "Thinking with Video" as a promising unified multimodal reasoning paradigm.

Leaderboard on VideoThinkBench (minitest)

Vision-Centric Tasks

Video Generation Models 🎞️

Image Generation Models 🖼️

Vision-Language Models 📃

Note:
"Eyeballing Point/Line/Shape" refers to the Point, Line, and Shape Tasks in the Eyeballing Puzzles. Results for video generation models use Major Frame evaluation.
"Visual Symmetry/Gradient/Compositionality" refers to the Symmetry, Gradient, and Compositionality Tasks in the Visual Puzzles.
"Maze Square/Hexagon/Circle" refers to the Square, Hexagon, and Circle Mazes in the Maze Tasks.
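The Major Frame evaluation noted above can be sketched as follows. This reflects our reading of the protocol (parse a candidate answer from each sampled frame of the generated video, then keep the answer that appears in the most frames); `major_frame_answer` is an illustrative helper, not the paper's code, and the per-frame parser is assumed to exist upstream.

```python
from collections import Counter

def major_frame_answer(frame_answers):
    """Aggregate per-frame parsed answers into one prediction.

    frame_answers: one parsed answer per sampled frame, or None
    for frames where no answer could be parsed (e.g., early
    frames where the model is still "drawing" its solution).
    Returns the answer appearing in the most frames, or None if
    no frame yielded an answer.
    """
    counts = Counter(a for a in frame_answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Early frames are blank; later frames converge on the answer.
print(major_frame_answer([None, "B", "C", "C", "C"]))  # → C
```

Voting over frames makes the score robust to transient frames, since the final answer usually dominates the tail of the generated video.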

Text-Centric Tasks

Video Generation Models 🎞️

Image Generation Models 🖼️

Vision-Language Models 📃

Leaderboard on VideoThinkBench (test)

Vision-Centric Tasks

Text-Centric Tasks

Eyeballing Puzzles

Arc Connect

Circle Center

Circle Tangent Line

Circle Tangent Point

Isosceles Trapezoid

Midpoint

Orthocenter

Parallel

Parallelogram

Perpendicular Bisector

Ray Intersection

Ray Reflection

Visual Puzzles

Hexagon Color Pattern Match

Grid Color Pattern Match

Grid Size Pattern Match

Reflection Recognition & Application

Color Gradient Perception & Application

Cycle Size Pattern Match

Shape Color Pattern Match

Rectangle Height Color Match

Color Mixing Perception & Application

Grid Shape & Size Pattern Match

ARC-AGI-2


Maze

Square Maze

GSM8K


Multimodal Reasoning

MathVista


MMMU

MMBench

Failure Cases

Acknowledgement

Core Contributor of this Website: Jiahui Lin, Hongji Chen, Junpeng Zhang

BibTeX

@article{tong2025thinking,
  title={Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm},
  author={Tong, Jingqi and Mou, Yurong and Li, Hangcheng and Li, Mingzhe and Yang, Yongzhuo and Zhang, Ming and Chen, Qiguang and Liang, Tianyi and Hu, Xiaomeng and Zheng, Yining and others},
  journal={arXiv preprint arXiv:2511.04570},
  year={2025}
}