OP | Posted on 2025-1-29 19:42:18
I don't have too much to add on top of this earlier post on V3, and I think it applies to R1 too (which is the more recent, thinking equivalent).
I will say that Deep Learning has a legendary ravenous appetite for compute, like no other algorithm that has ever been developed in AI. You may not always be utilizing it fully, but I would never bet against compute as the upper bound for achievable intelligence in the long run. Not just for an individual final training run, but also for the entire innovation/experimentation engine that silently underlies all the algorithmic innovations.
Data has historically been seen as a separate category from compute, but even data is downstream of compute to a large extent - you can spend compute to create data. Tons of it. You've heard this called synthetic data generation, but less obviously, there is a very deep connection (an equivalence, even) between "synthetic data generation" and "reinforcement learning". In the trial-and-error learning process of RL, the "trial" is the model generating (synthetic) data, which it then learns from based on the "error" (/reward). Conversely, when you generate synthetic data and then rank or filter it in any way, your filter is straight-up equivalent to a 0-1 advantage function - congrats, you're doing crappy RL.
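To make that equivalence concrete, here is a minimal Python sketch with toy stand-ins I made up for this post (sample_completion, passes_filter, sft_update are not any real training API): sampling completions, keeping only the ones that pass a filter, and fine-tuning on the keepers applies the same gradient signal as a policy-gradient update with a 0/1 advantage.

import random

# Toy stand-ins, purely for illustration: a "model" that samples an answer,
# a filter (the 0/1 reward), and a stub for "take a gradient step on
# log p(completion | question), scaled by weight".
def sample_completion(question):
    return str(random.randint(0, 20))

def passes_filter(question, completion):
    return completion == "12"   # e.g. an exact-match correctness check

def sft_update(question, completion, weight=1.0):
    print(f"train on {completion!r} with weight {weight}")

question = "What is 5 + 7?"

# View 1: synthetic data generation + filtering, then fine-tune on the keepers.
for _ in range(8):
    c = sample_completion(question)
    if passes_filter(question, c):
        sft_update(question, c)                 # rejected samples implicitly get weight 0

# View 2: the same loop written as policy-gradient RL with a 0-1 advantage.
for _ in range(8):
    c = sample_completion(question)
    advantage = 1.0 if passes_filter(question, c) else 0.0
    sft_update(question, c, weight=advantage)   # identical gradient signal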
Last thought. Not sure if this is obvious. There are two major types of learning, in both children and in deep learning. There is 1) imitation learning (watch and repeat, i.e. pretraining, supervised finetuning), and 2) trial-and-error learning (reinforcement learning). My favorite simple example is AlphaGo - 1) is learning by imitating expert players, 2) is reinforcement learning to win the game. Almost every single shocking result of deep learning, and the source of all *magic*, is always 2. 2 is significantly more powerful. 2 is what surprises you. 2 is when the paddle learns to hit the ball behind the blocks in Breakout. 2 is when AlphaGo beats even Lee Sedol. And 2 is the "aha moment" when DeepSeek (or o1 etc.) discovers that it works well to re-evaluate its assumptions, backtrack, try something else, etc. It's the solving strategies you see this model use in its chain of thought. It's how it goes back and forth thinking to itself. These thoughts are *emergent* (!!!), and this is actually seriously incredible, impressive and new (as in publicly available and documented, etc.). The model could never learn this with 1 (by imitation), because the cognition of the model and the cognition of the human labeler are different. The human would never know how to correctly annotate these kinds of solving strategies, or what they should even look like. They have to be discovered during reinforcement learning as empirically and statistically useful towards a final outcome.
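(For anyone who wants the 1-vs-2 distinction in code: below is a minimal, hypothetical PyTorch sketch of the two objectives on a toy discrete policy. The setup and names are mine, not from AlphaGo or DeepSeek; it only shows that in 1 the targets come from an expert, while in 2 the policy samples its own actions and only a reward says whether they were any good.)

import torch
import torch.nn.functional as F

# A tiny policy over 4 discrete actions, just to show the shape of the two losses.
policy = torch.nn.Linear(8, 4)
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

# 1) Imitation learning (pretraining / SFT): push the policy toward expert
#    actions with plain cross-entropy -- "watch and repeat".
def imitation_step(states, expert_actions):
    loss = F.cross_entropy(policy(states), expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()

# 2) Trial-and-error learning (REINFORCE): the policy samples its own actions
#    (the "trial"), and only the reward (the "error") grades them.
def rl_step(states, reward_fn):
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    rewards = reward_fn(states, actions)
    loss = -(dist.log_prob(actions) * rewards).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Toy usage: imitate made-up "expert" labels, then do RL against a made-up
# reward that only pays out for action 3.
states = torch.randn(16, 8)
imitation_step(states, torch.randint(0, 4, (16,)))
rl_step(states, lambda s, a: (a == 3).float())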
(Last last thought/reference, this time for real: RL is powerful, but RLHF is not. RLHF is not RL. I have a separate rant on that in an earlier tweet:
https://x.com/karpathy/status/1821277264996352246?lang=en)