Vincent Sitzmann: The Bitter Lesson of Computer Vision

Presenter: Vincent Sitzmann
Host Institute: MIT (Scene Representation Group)
This post is a translation of a blog article by Vincent Sitzmann (Assistant Professor at MIT, leading the Scene Representation Group), originally published on his personal website. The central thesis: computer vision as we know it is about to go away — the historical boundaries between vision, robot learning, and control will dissolve as the field converges on end-to-end perception-action loops, making hand-crafted intermediate representations like 3D structure obsolete.

Computer Vision Is About to Go Away

I believe that computer vision as we know it is about to go away.

Historically, we have treated vision as a mapping from images to intermediate representations — classes, segmentation masks, or 3D reconstructions. But in the era of the Bitter Lesson, these distinct tasks are becoming qualitatively no different than edge detection: historical artifacts of scoping “solvable intermediate problems” rather than solving intelligence.

While the “LLM moment” in NLP clarified that language modeling is the ultimate objective, the vision community is still debating the flavor of its own revolution. We continue to fine-tune models for specific tasks like point tracking, segmentation, or 3D reconstruction — even as world models emerge, skirt all conventional intermediate representations, and directly solve a problem dramatically more general than everything our community has tackled in the past.

In this post, I argue that the future of computer vision is as part of end-to-end perception-action loops. The historical boundaries between computer vision, robot learning, and control will dissolve. Frontier research will no longer draw a boundary between “seeing” and “learning to act.”

As a special case, I will discuss the waning importance of 3D representations: I predict that just as we no longer hand-craft features for detection, we will soon stop using 3D as part of embodied intelligence.

How We Arrived at Today's Scope of Computer Vision

To understand where the field is going, it is instructive to ask what vision actually is.

Historically, we have treated vision as the “visual perception” sub-module of intelligence — often summarized as “knowing what is where.” However, this is not a well-defined task. It does not specify a strictly falsifiable input-output behavior: The inputs are images or video, sure — but what are the outputs? Consequently, it does not lend itself to being definitively “solved.”

In the real world, there is a much clearer metric for perception: intelligent action. An agent has succeeded at perceiving the world when it can map present and past percepts to actions that accomplish its goals, especially when exposed to new and unseen environments. This is easily falsifiable: I want to be able to demonstrate a task to my robot such as cleaning out the dishwasher, and I expect the robot to succeed at this task. If it succeeded, it clearly perceived what was important.

Why, then, did we not start there? In the past, learning perception-action loops directly was intractable. Because the role of a scientist is to work on the solvable, we split off computer vision. The community converged on a niche of building algorithms that map images to intermediate representations that appeared practically useful — classification, segmentation, optical flow, 3D reconstruction, and SLAM.

Simultaneously, robot learning and control were scoped as the study of algorithms that ingest these specific representations — point clouds, bounding boxes, and masks — and map them to actions.

This factorization was a necessary compromise for the time. However, I believe this “modular” model of embodied intelligence is quickly losing its raison d’être.
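To make the contrast concrete, here is a deliberately toy sketch of the two factorizations. Everything in it (the 1-D “images”, the fake 3-D lifting, the threshold rules) is hypothetical stand-in logic, not any real vision or control system; the only point is the shape of the interfaces.

```python
from typing import List, Tuple

Image = List[float]          # a toy "1-D image" of pixel intensities
Action = str

def vision_module(img: Image) -> List[Tuple[int, float, float]]:
    # stand-in for 3D reconstruction: lift each pixel to a fake 3-D point
    return [(i, v, 0.0) for i, v in enumerate(img)]

def control_module(points: List[Tuple[int, float, float]]) -> Action:
    # stand-in for a planner consuming the intermediate representation
    return "grasp" if any(y > 0.4 for _, y, _ in points) else "wait"

def modular_agent(img: Image) -> Action:
    # the classic pipeline: a human-designed representation at the interface
    return control_module(vision_module(img))

def end_to_end_agent(img: Image) -> Action:
    # stand-in for a single learned mapping; its internal features are
    # learned, not scoped in advance by researchers
    return "grasp" if max(img) > 0.4 else "wait"

# Both satisfy the same falsifiable I/O contract: percepts in, actions out.
assert modular_agent([0.2, 0.5]) == end_to_end_agent([0.2, 0.5]) == "grasp"
```

Note that the only externally testable contract is identical for both agents: images in, actions out. The intermediate point cloud is invisible to any behavioral test, which is exactly why it is a design choice rather than a necessity.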

Case Study: How 3D May Become Obsolete for Training Embodied Intelligence Models

Rich Sutton’s Bitter Lesson states: “General methods leveraging massive computation… consistently outperform human-crafted, task-specific systems, even though the latter feel clever initially.”

In computer vision, most researchers readily apply this lesson to algorithms, acknowledging that neural networks with physical inductive biases are rarely scalable. Yet, surprisingly few apply the same logic to representations.

Take the very notion of a 3D representation, be it a point cloud, a radiance field, a signed distance function, or a voxel grid. Consider the fundamental loop of embodied intelligence: perception in, action out. In a world where we can train end-to-end algorithms to tackle this behavior directly, hand-crafting an explicit intermediate representation like “3D structure” becomes exactly the kind of clever, human-designed bottleneck that the Bitter Lesson warns against.

To see why, try a thought experiment. Look at the room you are currently sitting in. If I gave you a perfect 3D reconstruction of this scene — a NeRF, a point cloud, whatever you choose — what real problem would you be able to solve with it?

There are of course niche applications such as novel view synthesis. But for any task involving embodied intelligence, you still need a separate, intelligent algorithm to ingest that 3D representation and decide what to do. The overall input-output behavior remains images to actions, reducing the 3D reconstruction to a clever pre-processing step. In the long arc of embodied intelligence, this factorization will not pass the test of time.

In fact, many tasks traditionally thought to rely on 3D are already being solved better by end-to-end learning. Take novel view synthesis: the state of the art in few-shot view synthesis has for a while now relied not on differentiable 3D rendering but on generative world models. When my students Boyuan and Kiwhan developed History-Guided Video Diffusion, they generated novel views of RealEstate10k that looked far better than any 3D-structured algorithm I had ever worked on — and they did so almost as an afterthought.

SE(3) Camera Poses Will Go, Too

You might argue that these generative models are still conditioned on camera poses, obtained via conventional multi-view geometry (COLMAP) or learned equivalents. However, I predict that just like 3D representations, algorithms that output camera poses will also become obsolete. My lab has already shown that novel view synthesis can be formalized purely as a representation learning problem — without any concepts from multi-view geometry. No poses, no 3D.

Ego-motion (and therefore camera pose) is simply the most basic action an agent can take. It is not special. Ultimately, we must solve the problem of an AI controlling a body it has never inhabited before. In that context, inferring ego-motion is trivial compared to the complex control problems a general-purpose agent must solve. Whatever algorithm we converge on will handle ego-motion implicitly, without needing us to bake it in.

To Get Models Competent at 3D Editing, Don't Train Them for 3D Editing

What about engineering tasks such as architecture, CAD, or manufacturing? Surely, we need explicit 3D representations to build a house or 3D-print an engine part. I agree that for the human-machine interface, having a 3D mesh representation and a CAD-like editor may be reasonable. However, my argument is not about how we talk to the machine, but how we train models that will ultimately help us automate 3D design tasks.

Here, again, to obtain models that are maximally competent at assisting with the manipulation of both physical and digital 3D objects, we should not train them to explicitly produce expert-crafted 3D representations, nor bake such representations into their architectures. Instead, our goal should be to train general-purpose physical intelligence models directly on raw data, allowing them to learn their own internal, task-relevant structure. These internal representations need not — and likely will not — correspond to any human-designed notion of 3D blocking, meshing, or reconstruction. Only after such a model has been trained should it be fine-tuned to interface with whatever representation or toolchain we humans like to use.

As for the final step — manufacturing the artifact — in the near term, we will similarly fine-tune models on 3D printer APIs or mesh file formats. In the very long run, I note that a 3D printer or excavator is essentially a robot: A physical machine that we seek to automate via AI. Hence, I find it plausible that we will eventually solve these challenges of 3D manufacturing in the same way in which we will solve embodied intelligence more generally, by yielding direct control of the actuators of the machine to the AI.

The Key Challenge of Perception-Action Loops and World Models

The core challenge of embodied intelligence is the lack of paired perception-action data at scale. Deploying large numbers of robots in the real world is extremely expensive, and even if we could do so, it remains unclear what we would have them do. Collecting valuable data requires agents to perform meaningful, diverse behaviors. Today, this is largely achieved through teleoperation. This works remarkably well for self-driving, where humans already drive the cars anyway, but it scales far less naturally to humanoid robots with dexterous hands.

The long-term goal is robots that collect data autonomously, driven by intrinsic motivation — much like toddlers. While this concept of “intrinsic reward” has a rich history in the RL community, current algorithms are far too sample-inefficient to be deployed on real robots. Moreover, unleashing large numbers of agents with essentially random policies into the physical world, where they can hurt themselves and others, is simply not viable.

This, then, is the central question facing embodied intelligence today: how do we move toward closing the perception-action loop without having direct access to large-scale action data?

The Role of World Models

This is where world models enter the picture. On the surface, they may appear to be just another intermediate task — learned simulators that do not themselves address the core challenge. And indeed, on their own, they do not.

They do, however, offer two promising angles.

First, video (and potentially audio) generative modeling provides a clearly scalable pre-training objective. Crucially, video does not merely capture raw sensory data — it also implicitly encodes a vast amount of information, not only about physics and how the world works, but also about human knowledge of skills, tasks, and their structure. Training a finite neural network to approximate this complex process could yield representations that serve as a basis for fine-tuning into policies with useful know-how. However, this remains speculative: to date, I am not aware of any clear demonstration that video models can easily be fine-tuned into policies, though there are some early signs of life.
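As a deliberately minimal illustration of such a pre-training objective, the sketch below fits a next-frame predictor to a synthetic “video” by stochastic gradient descent. Every choice here (synthetic data, a linear model, squared error) is a toy stand-in for the large generative models discussed above; it only shows the shape of the objective: compress the dynamics of raw frames into a model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 50, 4                                   # frames, "pixels" per frame
video = np.cumsum(rng.normal(size=(T, D)), axis=0)   # smooth synthetic video

W = np.zeros((D, D))                           # world model: x_{t+1} ≈ W x_t
lr = 1e-3
for _ in range(200):                           # SGD over all transitions
    for t in range(T - 1):
        err = W @ video[t] - video[t + 1]
        W -= lr * np.outer(err, video[t])      # gradient of the squared error

loss = float(np.mean((video[:-1] @ W.T - video[1:]) ** 2))
baseline = float(np.mean(video[1:] ** 2))      # loss of predicting all zeros
print(f"trained next-frame MSE {loss:.2f} vs zero-predictor MSE {baseline:.2f}")
```

The trained predictor beats the trivial baseline by absorbing the regularity of the data, which is the entire bet behind video pre-training: that the regularities absorbed at scale are the ones a policy later needs.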

Second, video models can be extended to be action-conditional, letting them serve as simulators in which agents can be trained. In principle, this enables a form of data amplification: expensive real-world interactions can be used to bootstrap a model that supports much richer virtual experience. At the same time, this approach exposes a fundamental chicken-and-egg problem. Training interactive world models requires paired action-observation data, which is precisely the resource we lack — excepting, again, self-driving, where such data is plentiful. Unsurprisingly, existing systems exhibit only limited forms of interactivity, often reminiscent of video games, which likely made up a significant part of their training data.
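The data-amplification loop can be sketched in a few lines. Everything below is a hypothetical toy (a 1-D state, a tabular “world model” of per-action deltas): it only illustrates fitting a model on scarce real transitions and then rolling out cheap virtual experience, with the implicit caveat that the model is only as good as the action-paired data it was fit on.

```python
ACTIONS = [-1.0, 0.0, 1.0]

def real_env_step(x: float, a: float) -> float:
    # ground-truth dynamics: action effect plus a small drift
    return x + a + 0.1

# 1) Collect a small batch of expensive real (state, action, next state) data.
real_data, x = [], 0.0
for i in range(30):
    a = ACTIONS[i % len(ACTIONS)]
    x_next = real_env_step(x, a)
    real_data.append((x, a, x_next))
    x = x_next

# 2) "Train" the world model: mean observed state delta for each action.
deltas = {a: sum(xn - xs for xs, act, xn in real_data if act == a) /
             sum(1 for _, act, _ in real_data if act == a)
          for a in ACTIONS}

def world_model_step(x: float, a: float) -> float:
    return x + deltas[a]

# 3) Data amplification: cheap virtual rollouts inside the learned model,
#    usable to train or evaluate a policy without touching the real robot.
def rollout(policy, x0: float, steps: int = 20):
    x, traj = x0, []
    for _ in range(steps):
        a = policy(x)
        x = world_model_step(x, a)
        traj.append((x, a))
    return traj

go_home = lambda x: -1.0 if x > 0 else 1.0    # toy policy: drive state to 0
virtual = rollout(go_home, x0=5.0)
```

The chicken-and-egg problem is visible even here: step 2 needs the paired data from step 1, and any action the real data never covered would leave the simulator blind to it.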

For these reasons, I do not believe that video generative models will solve embodied intelligence. They may not even be a necessary component of the final solution. Rather, they should be seen as one of several early attempts at identifying a scalable pre-training objective for perception-action learning. At present, however, there is no clear answer to what the “right” pre-training task should be. The same is true for many of the other ingredients required to close the perception-action loop. Questions of intrinsic motivation, exploration, long-horizon memory, continual learning, and real-time control with large models remain wide open.

Nevertheless, things have changed: I believe we are now at a time when tackling these questions head-on is viable. This, then, is the central point of this article: a call to abandon the conventional boundaries between computer vision and robot learning, and instead ponder the problems that arise when we seek to build machines that both perceive and act.

Further Reading

Sitzmann's Group Work in This Space

  • Diffusion Forcing and History-Guided Video Diffusion, which first showed stable auto-regressive rollout with diffusion models and the potential to simulate video games – used to train the Oasis model and many other mainstream world models today.
  • “True Self-Supervised Novel View Synthesis is Transferable”, which defines novel view synthesis without relying on any concepts from conventional multi-view geometry. It can be seen as a “latent action model” in which “camera poses” and “ego motion” are no different from any other action that may occur between video frames.
  • Large Video Planner, which demonstrates that video generative models are already useful to robotics today by generating “video plans” for solving a variety of tasks, though many challenges remain: how do we extract policies from these videos?
  • Generative View Stitching, a way of generating long videos with short-context video models such that the video stays consistent with a pre-defined camera trajectory.