The Short Case for Nvidia Stock

Posted on 2025-2-2 06:36:42
I finally found the time to read this article (shared by Baoyu) word for word, and I have decided to keep reducing my Nvidia position until it is fully closed out. The author is not a typical hedge fund analyst; he has been deeply involved in the AI field since 2010 and has gradually become a developer with several popular open-source AI projects. This rare combination of investment expertise and AI technical insight makes his views on how AI progress affects stock market valuations especially worth paying attention to. The core arguments of the article can be summarized as follows:
  • The data wall for general-purpose AI is real, which means the pre-training scaling law is approaching its limits.
  • The inference scaling law behind chain-of-thought (CoT) reasoning introduces a new compute paradigm.
  • Nvidia's moat consists mainly of software (CUDA and its ecosystem) and high-speed interconnect, but both are being breached by approaches from Cerebras and Groq.
  • Nvidia's valuation exploded because it was the only viable solution in the early stages of the AI boom, but as time passes the market will see more alternatives.
  • The extreme performance optimizations demonstrated by DeepSeek will significantly lower AI training costs, which in turn affects expectations for Nvidia's future revenue.
This does not mean Nvidia will be overtaken or that its revenue will decline, but its current lofty valuation rests on the market's expectation of very high growth. In the short to medium term, the stock therefore faces considerable risk of a correction. And given the overall market uncertainty this year, fully exiting NVDA may mean missing some upside, but it is still a relatively prudent choice.


The Short Case for Nvidia Stock
As someone who spent ~10 years working as a generalist investment analyst at various long/short hedge funds (including stints at Millennium and Balyasny), while also being something of a math and computer nerd who has been studying deep learning since 2010 (back when Geoff Hinton was still talking about Restricted Boltzmann Machines and everything was still programmed using MATLAB, and researchers were still trying to show that they could get better results at classifying handwritten digits than by using Support Vector Machines), I'd like to think that I have a fairly unusual perspective on how AI technology is developing and how this relates to equity valuations in the stock market.
For the past few years, I have been working more as a developer, and have several popular open-source projects for working with various forms of AI models/services (e.g., see LLM Aided OCR, Swiss Army Llama, Fast Vector Similarity, Source to Prompt, and Pastel Inference Layer for a few recent examples). Basically, I am using these frontier models all day, every day, in about as intense a way as possible. I have 3 Claude accounts so I don't run out of requests, and signed up for ChatGPT Pro within minutes of it being available.
I also try to keep on top of the latest research advances, and carefully read all the major technical report papers that come out from the major AI labs. So I think I have a pretty good read on the space and how things are developing. At the same time, I've shorted a ton of stocks in my life and have won the best idea prize on the Value Investors Club twice (for TMS long and PDH short if you're keeping track at home).
I say this not to brag, but rather to help establish my bona fides as someone who could opine on the subject without coming across as hopelessly naive to either technologists or professional investors. And while there are surely many people who know the math/science better, and people who are better at long/short investing in the stock market than me, I doubt there are very many who are in the middle of the Venn diagram to the extent I can claim to be.
With all that said, whenever I meet with and chat with my friends and ex colleagues from the hedge fund world, the conversation quickly turns to Nvidia. It's not every day that a company goes from relative obscurity to being worth more than the combined stock markets of England, France, or Germany! And naturally, these friends want to know my thoughts on the subject. Because I am such a dyed-in-the-wool believer in the long term transformative impact of this technology— I truly believe it's going to radically change nearly every aspect of our economy and society in the next 5-10 years, with basically no historical precedent— it has been hard for me to make the argument that Nvidia's momentum is going to slow down or stop anytime soon.
But even though I've thought the valuation was just too rich for my blood for the past year or so, a confluence of recent developments has caused me to flip a bit to my usual instinct, which is to be a bit more contrarian in outlook and to question the consensus when it seems to be more than priced in. The saying "what the wise man believes in the beginning, the fool believes in the end" became famous for a good reason.
The Bull Case
Before we get into the developments that give me pause, let's pause to briefly review the bull case for NVDA shares, which is basically now known by everyone and his brother. Deep learning and AI are the most transformative technologies since the internet, and poised to change basically everything in our society. Nvidia has somehow ended up with something close to a monopoly in terms of the share of aggregate industry capex that is spent on training and inference infrastructure.
Some of the largest and most profitable companies in the world, like Microsoft, Apple, Amazon, Meta, Google, Oracle, etc., have all decided that they must do and spend whatever it takes to stay competitive in this space because they simply cannot afford to be left behind. The amount of capex dollars, gigawatts of electricity used, square footage of new-build data centers, and, of course, the number of GPUs, has absolutely exploded and seems to show no sign of slowing down. And Nvidia is able to earn insanely high 90%+ gross margins on the most high-end, datacenter oriented products.
We've just scratched the surface here of the bull case. There are many additional aspects to it now, which have made even people who were already very bullish become incrementally more bullish. Besides things like the rise of humanoid robots, which I suspect is going to take most people by surprise when they are rapidly able to perform a huge number of tasks that currently require an unskilled (or even skilled) human worker (e.g., doing laundry, cleaning, organizing, and cooking; doing construction work like renovating a bathroom or building a house in a team of workers; running a warehouse and driving forklifts, etc.), there are other factors which most people haven't even considered.
One major thing that you hear the smart crowd talking about is the rise of "a new scaling law," which has created a new paradigm for thinking about how compute needs will increase over time. The original scaling law, which is what has been driving progress in AI since AlexNet appeared in 2012 and the Transformer architecture was invented in 2017, is the pre-training scaling law: that the more billions (and now trillions) worth of tokens we can use as training data, and the larger the parameter count of the models we are training, and the more FLOPS of compute that we expend on training those models on those tokens, the better the performance of the resulting models on a large variety of highly useful downstream tasks.
Not only that, but this improvement is somewhat knowable, to the point where the leading AI labs like OpenAI and Anthropic have a pretty good idea of just how good their latest models would be even before they started the actual training runs— in some cases, predicting the benchmarks of the final models to within a couple percentage points. This "original scaling law" has been vitally important, but always caused some doubts in the minds of people projecting the future with it.
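The reason such predictions are even possible is that pre-training loss tends to follow a remarkably smooth parametric curve in parameter count and token count. Below is a minimal Python sketch of a Chinchilla-style loss formula; the coefficients are illustrative values in the spirit of the published Chinchilla fit, not numbers tied to any particular lab's models.

```python
# Minimal sketch of a pre-training scaling law in the Chinchilla-style
# parametric form L(N, D) = E + A / N**alpha + B / D**beta, where N is the
# parameter count and D is the number of training tokens. The coefficients
# are illustrative; the point is the shape of the curve, not exact predictions.

def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.69, A: float = 406.4, B: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted pre-training loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling up data and parameters keeps shaving off loss, but with diminishing returns:
for n, d in [(7e9, 1.4e12), (70e9, 1.4e12), (70e9, 15e12), (400e9, 15e12)]:
    print(f"N={n:.0e}, D={d:.0e} -> predicted loss {predicted_loss(n, d):.3f}")
```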
For one thing, we seem to have already exhausted the world's accumulated set of high quality training data. Of course, that's not literally true— there are still so many old books and periodicals that haven't yet been properly digitized, and even if they have, are not properly licensed for use as training data. The problem is that, even if you give credit for all that stuff— say the sum total of "professionally" produced English language written content from the year 1500 to, say, the year 2000, it's not such a tremendous amount in percentage terms when you're talking about a training corpus of nearly 15 trillion tokens, which is the scale of current frontier models.
For a quick reality check of those numbers: Google Books has digitized around 40mm books so far; if a typical book has 50k to 100k words, or 65k to 130k tokens, then that's between 2.6T and 5.2T tokens just from books, though surely a large chunk of that is already included in the training corpora used by the big labs, whether it's strictly legal or not. And there are lots of academic papers, with the arXiv website alone having over 2mm papers. And the Library of Congress has over 3 billion digitized newspaper pages. Taken together, that could be as much as 7T tokens in total, but since much of this is in fact included in training corpora, the remaining "incremental" training data probably isn't all that significant in the grand scheme of things.
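To make that back-of-the-envelope arithmetic explicit, here is a quick sketch using only the rough figures quoted above (the ~1.3 tokens-per-word ratio is the usual rule of thumb and is an assumption on my part):

```python
# Back-of-the-envelope check of the token counts quoted above, using the
# article's own rough numbers (40 million digitized books, 50k-100k words per
# book, ~1.3 tokens per word). Nothing here is precise; it just shows why the
# remaining "incremental" data is small next to a ~15T-token frontier corpus.

BOOKS = 40e6
WORDS_PER_BOOK = (50_000, 100_000)
TOKENS_PER_WORD = 1.3            # assumption: common rule-of-thumb ratio

low = BOOKS * WORDS_PER_BOOK[0] * TOKENS_PER_WORD
high = BOOKS * WORDS_PER_BOOK[1] * TOKENS_PER_WORD
print(f"Books alone: {low/1e12:.1f}T to {high/1e12:.1f}T tokens")

FRONTIER_CORPUS = 15e12          # ~15 trillion tokens for current frontier models
print(f"Even ~7T 'extra' tokens is only {7e12/FRONTIER_CORPUS:.0%} of a 15T corpus")
```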
Of course, there are other ways to gather more training data. You could automatically transcribe every single YouTube video for example, and use that text. And while that might be helpful on the margin, it's certainly of much lower quality than, say, a highly respected textbook on Organic Chemistry as a source of useful knowledge about the world. So we've always had a looming "data wall" when it comes to the original scaling law; although we know we can keep shoveling more and more capex into GPUs and building more and more data centers, it's a lot harder to mass produce useful new human knowledge which is correct and incremental to what is already out there. Now, one intriguing response to this has been the rise of "synthetic data," which is text that is itself the output of an LLM. And while this seems almost nonsensical that it would work to "get high on your own supply" as a way of improving model quality, it actually seems to work very well in practice, at least in the domain of math, logic, and computer programming.
The reason, of course, is that these are areas where we can mechanically check and prove the correctness of things. So we can sample from the vast universe of possible math theorems or possible Python scripts, and then actually check if they are correct, and only include them in our corpus if they are. And in this way, we can very dramatically expand our collection of high quality training data, at least in these kinds of areas.
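A toy sketch of that generate-and-verify idea is below. The "model" here is just a random generator that sometimes proposes wrong answers; real pipelines use theorem provers, unit tests, and actual code execution as the checker, but the filtering logic is the same.

```python
# Toy sketch of the "generate and verify" idea behind synthetic data: sample
# candidate (problem, answer) pairs from some generator (standing in for an
# LLM), mechanically check each one, and keep only the verified pairs.
import random

def propose_example():
    """Stand-in for an LLM proposing a worked example (sometimes wrong)."""
    a, b = random.randint(-9, 9), random.randint(-9, 9)
    s, p = a + b, a * b                                       # x^2 - s*x + p = 0
    roots = (a, b) if random.random() > 0.3 else (a + 1, b)   # roughly 30% are wrong
    return s, p, roots

def is_correct(s, p, roots):
    """Mechanically verify the claimed roots by direct substitution."""
    return all(r * r - s * r + p == 0 for r in roots)

corpus = []
for _ in range(1000):
    s, p, roots = propose_example()
    if is_correct(s, p, roots):                               # only verified examples are kept
        corpus.append((f"Solve x^2 - {s}x + {p} = 0",
                       f"x = {roots[0]} or x = {roots[1]}"))

print(f"Kept {len(corpus)} verified synthetic examples out of 1000 proposals")
```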
And then there are all the other kinds of data we could be training AI on besides text. For example, what if we take the entire whole genome sequencing (around 200 GB to 300 GB uncompressed for a single human being) for 100 million people? That's a lot of data obviously, although the vast majority of it would be nearly identical between any two people. Of course, this could be misleading to compare to textual data from books and the internet for various reasons:
  • Raw genome size isn't directly comparable to token counts
  • The information content of genomic data is very different from text
  • The training value of highly redundant data isn't clear
  • The computational requirements for processing genomic data are different
But it's still another large source of diverse information that we could train huge models on in the future, which is why I included it.
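Purely to put numbers on that comparison, here is the arithmetic using the figures from the text, plus one outside assumption (that any two humans differ on only roughly 0.1% of their genome, so most of the raw bytes are redundant):

```python
# Rough arithmetic on the genome example: 200-300 GB of uncompressed sequencing
# data per person and 100 million people come from the text; the ~0.1%
# inter-person variation figure is an assumption used only for illustration.

PEOPLE = 100e6
GB_PER_GENOME = 250          # midpoint of the 200-300 GB range quoted above
raw_bytes = PEOPLE * GB_PER_GENOME * 1e9
print(f"Raw sequencing data: ~{raw_bytes / 1e18:.0f} exabytes")

UNIQUE_FRACTION = 0.001      # assumption: ~0.1% of the genome varies between people
unique_bytes = raw_bytes * UNIQUE_FRACTION
print(f"Genuinely person-specific information: very roughly ~{unique_bytes / 1e15:.0f} petabytes")
```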
So while there is some hope in terms of being able to capture more and more additional training data, if you look at the rate at which training corpora have grown in recent years, it quickly becomes obvious that we are close to hitting a wall in terms of data availability for "generally useful" knowledge that can get us closer to the ultimate goal of getting artificial super-intelligence which is 10x smarter than John von Neumann and is an absolute world-class expert on every specialty known to man.
Besides the limited amount of available data, there have always been a couple other things that have lurked in the back of the mind of proponents of the pre-training scaling law. A big one of these is, after you've finished training the model, what are you supposed to do with all that compute infrastructure? Train the next model? Sure, you can do that, but given the rapid improvement in GPU speed and capacity, and the importance of electricity and other opex in the economic calculations, does it even really make sense to use your 2 year old cluster to train your new model? Surely you'd rather use the brand new data center you just built that costs 10x the old data center and is 20x more powerful because of better technology. The problem is, at some point you do need to amortize the up-front cost of these investments and recoup it with a stream of (hopefully positive) operating profit, right?
The market is so excited about AI that it has thankfully ignored this, allowing companies like OpenAI to post breathtaking from-inception, cumulative operating losses while garnering increasingly eye-popping valuations in follow-up investment rounds (although, to their credit, they have also been able to demonstrate very fast growing revenues). But eventually, for this situation to be sustainable over a full market cycle, these data center costs do need to eventually be recouped, hopefully with a profit, which over time is competitive with other investment opportunities on a risk-adjusted basis.
The New Paradigm
OK, so that was the pre-training scaling law. What's this "new" scaling law? Well, that's something that people really just started focusing on in the past year: inference time compute scaling. Before, the vast majority of all the compute you'd expend in the process was the up-front training compute to create the model in the first place. Once you had the trained model, performing inference on that model— i.e., asking a question or having the LLM perform some kind of task for you— used a certain, limited amount of compute.
Critically, the total amount of inference compute (measured in various ways, such as FLOPS, GPU memory footprint, etc.) was much, much less than what was required for the pre-training phase. Of course, the amount of inference compute does flex up when you increase the context window size of the models and the amount of output that you generate from them in one go (although researchers have made breathtaking algorithmic improvements on this front relative to the initial quadratic scaling people originally expected in scaling this up). But essentially, until recently, inference compute was generally a lot less intensive than training compute, and scaled basically linearly with the number of requests you are handling— the more demand for text completions from ChatGPT, for instance, the more inference compute you used up.
With the advent of the revolutionary Chain-of-Thought ("COT") models introduced in the past year, most notably in OpenAI's flagship O1 model (but very recently in DeepSeek's new R1 model, which we will talk about later in much more detail), all that changed. Instead of the amount of inference compute being directly proportional to the length of the output text generated by the model (scaling up for larger context windows, model size, etc.), these new COT models also generate intermediate "logic tokens"; think of this as a sort of scratchpad or "internal monologue" of the model while it's trying to solve your problem or complete its assigned task.
This represents a true sea change in how inference compute works: now, the more tokens you use for this internal chain of thought process, the better the quality of the final output you can provide the user. In effect, it's like giving a human worker more time and resources to accomplish a task, so they can double and triple check their work, do the same basic task in multiple different ways and verify that they come out the same way; take the result they came up with and "plug it in" to the formula to check that it actually does solve the equation, etc.
It turns out that this approach works almost amazingly well; it is essentially leveraging the long anticipated power of what is called "reinforcement learning" with the power of the Transformer architecture. It directly addresses the single biggest weakness of the otherwise phenomenally successful Transformer model, which is its propensity to "hallucinate".
Basically, the way Transformers work in terms of predicting the next token at each step is that, if they start out on a bad "path" in their initial response, they become almost like a prevaricating child who tries to spin a yarn about why they are actually correct, even if they should have realized mid-stream using common sense that what they are saying couldn't possibly be correct.
Because the models are always seeking to be internally consistent and to have each successive generated token flow naturally from the preceding tokens and context, it's very hard for them to course-correct and backtrack. By breaking the inference process into what is effectively many intermediate stages, they can try lots of different things and see what's working and keep trying to course-correct and try other approaches until they can reach a fairly high threshold of confidence that they aren't talking nonsense.
Perhaps the most extraordinary thing about this approach, beyond the fact that it works at all, is that the more logic/COT tokens you use, the better it works. Suddenly, you now have an additional dial you can turn so that, as you increase the amount of COT reasoning tokens (which uses a lot more inference compute, both in terms of FLOPS and memory), the higher the probability is that you will give a correct response— code that runs the first time without errors, or a solution to a logic problem without an obviously wrong deductive step.
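One crude way to see why spending more inference compute buys accuracy is the "self-consistency" trick of sampling several independent reasoning chains and taking a majority vote. That is not how O1's internal chain of thought actually works, but it exposes the same dial: more reasoning tokens, higher probability of a correct final answer, at linear extra compute cost.

```python
# Illustration only: majority voting over k independent reasoning chains, each
# of which is correct with probability p_single. Assumes k is odd and that
# wrong chains don't all agree on the same wrong answer.
from math import comb

def majority_vote_accuracy(p_single: float, k: int) -> float:
    """P(majority of k independent chains is correct)."""
    return sum(comb(k, i) * p_single**i * (1 - p_single)**(k - i)
               for i in range((k // 2) + 1, k + 1))

for k in (1, 5, 15, 45):
    print(f"{k:>2} chains (~{k}x the compute): {majority_vote_accuracy(0.7, k):.3f} accuracy")
```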
I can tell you from a lot of firsthand experience that, as good as Anthropic's Claude3.5 Sonnet model is at Python programming— and it is indeed VERY good— whenever you need to generate anything long and complicated, it invariably ends up making one or more stupid mistakes. Now, these mistakes are usually pretty easy to fix, and in fact you can normally fix them by simply feeding the errors generated by the Python interpreter back in as a follow-up inference prompt, without any further explanation (or, more usefully, by pasting in the complete set of "problems" detected in the code by your code editor, using what is called a linter), but it was still an annoying additional step. And when the code becomes very long or very complicated, it can sometimes take a lot longer to fix, and might even require some manual debugging by hand.
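For the curious, that manual fix-up loop looks roughly like the sketch below. `ask_llm` is a hypothetical placeholder for whatever chat/completions API you actually call; everything else is standard library. A COT model effectively front-loads this kind of checking internally, which is why its first answer is so often already correct.

```python
# Sketch of the "paste the traceback back in" loop described above.
# `ask_llm` is a hypothetical stand-in for a real model API call.
import subprocess
import sys
import tempfile

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to your model provider, returning Python source."""
    raise NotImplementedError("wire this up to the API of your choice")

def generate_working_code(task: str, max_rounds: int = 3) -> str:
    code = ask_llm(task)
    for _ in range(max_rounds):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return code                      # it ran cleanly; we're done
        # Feed the interpreter's error output straight back in, no explanation needed.
        code = ask_llm(f"{task}\n\nYour previous code failed with:\n{result.stderr}\nPlease fix it.")
    return code
```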
The first time I tried the O1 model from OpenAI was like a revelation: I was amazed how often the code would be perfect the very first time. And that's because the COT process automatically finds and fixes problems before they ever make it to a final response token in the answer the model gives you.
In fact, the O1 model used in OpenAI's ChatGPT Plus subscription for $20/month is basically the same model as the O1-Pro model featured in their new ChatGPT Pro subscription for 10x the price ($200/month, which raised plenty of eyebrows in the developer community); the main difference is that O1-Pro thinks for a lot longer before responding, generating vastly more COT logic tokens, and consuming a far larger amount of inference compute for every response.
This is quite striking in that, even a very long and complex prompt for Claude3.5 Sonnet or GPT4o, with ~400kb+ of context given, generally takes less than 10 seconds to begin responding, and often less than 5 seconds. Whereas that same prompt to O1-Pro could easily take 5+ MINUTES before you get a response (although OpenAI does show you some of the "reasoning steps" that are generated during the process while you wait; critically, OpenAI has decided, presumably for trade secret related reasons, to hide from you the exact reasoning tokens it generates, showing you instead a highly abbreviated summary of these).
As you can probably imagine, there are tons of contexts where accuracy is paramount— where you'd rather give up and tell the user you can't do it at all rather than give an answer that could be trivially proven wrong or which involves hallucinated facts or otherwise specious reasoning. Anything involving money/transactions, medical stuff, legal stuff, just to name a few.
Basically, wherever the cost of inference is trivial relative to the hourly all-in compensation of the human knowledge worker who is interacting with the AI system, that's a case where it becomes a complete no-brainer to dial up the COT compute (the major drawback is that it increases the latency of responses by a lot, so there are still some contexts where you might prefer to iterate faster by getting lower latency responses that are less accurate or correct).
Some of the most exciting news in the AI world came out just a few weeks ago and concerned OpenAI's new unreleased O3 model, which was able to solve a large variety of tasks that were previously deemed to be out of reach of current AI approaches in the near term. And the way it was able to do these hardest problems (which include exceptionally tough "foundational" math problems that would be very hard for even highly skilled professional mathematicians to solve), is that OpenAI threw an insane amount of compute resources at the problems— in some cases, spending $3k+ worth of compute power to solve a single task (compare this to traditional inference costs for a single task, which would be unlikely to exceed a couple dollars using regular Transformer models without chain-of-thought).
It doesn't take an AI genius to realize that this development creates a new scaling law that is totally independent of the original pre-training scaling law. Now, you still want to train the best model you can by cleverly leveraging as much compute as you can and as many trillion tokens of high quality training data as possible, but that's just the beginning of the story in this new world; now, you could easily use incredibly huge amounts of compute just to do inference from these models at a very high level of confidence or when trying to solve extremely tough problems that require "genius level" reasoning to avoid all the potential pitfalls that would lead a regular LLM astray.
But Why Should Nvidia Get to Capture All The Upside?
Even if you believe, as I do, that the future prospects for AI are almost unimaginably bright, the question still remains, "Why should one company extract the majority of the profit pool from this technology?" There are certainly many historical cases where a very important new technology changed the world, but the main winners were not the companies that seemed the most promising during the initial stages of the process. The Wright Brothers' airplane company in all its current incarnations across many different firms today isn't worth more than $10b despite them inventing and perfecting the technology well ahead of everyone else. And while Ford has a respectable market cap of $40b today, it's just 1.1% of Nvidia's current market cap.
To understand this, it's important to really understand why Nvidia is currently capturing so much of the pie today. After all, they aren't the only company that even makes GPUs. AMD makes respectable GPUs that, on paper, have comparable numbers of transistors, which are made using similar process nodes, etc. Sure, they aren't as fast or as advanced as Nvidia's GPUs, but it's not like the Nvidia GPUs are 10x faster or anything like that. In fact, in terms of naive/raw dollars per FLOP, AMD GPUs are something like half the price of Nvidia GPUs.
Looking at other semiconductor markets such as the DRAM market, despite the fact that it is also very highly consolidated with only 3 meaningful global players (Samsung, Micron, SK-Hynix), gross margins in the DRAM market range from negative at the bottom of the cycle to ~60% at the very top of the cycle, with an average in the 20% range. Compare that to Nvidia's overall gross margin in recent quarters of ~75%, which is dragged down by the lower-margin and more commoditized consumer 3D graphics category.
So how is this possible? Well, the main reasons have to do with software— better drivers that "just work" on Linux and which are highly battle-tested and reliable (unlike AMD, which is notorious for the low quality and instability of their Linux drivers), and highly optimized open-source code in popular libraries such as PyTorch that has been tuned to work really well on Nvidia GPUs.
It goes beyond that though— the very programming framework that coders use to write low-level code that is optimized for GPUs, CUDA, is totally proprietary to Nvidia, and it has become a de facto standard. If you want to hire a bunch of extremely talented programmers who know how to make things go really fast on GPUs, and pay them $650k/year or whatever the going rate is for people with that particular expertise, chances are that they are going to "think" and work in CUDA.
Besides software superiority, the other major thing that Nvidia has going for it is what is known as interconnect— essentially, the bandwidth that connects thousands of GPUs together efficiently so they can be jointly harnessed to train today's leading-edge foundational models. In short, the key to efficient training is to keep all the GPUs as fully utilized as possible all the time— not waiting around idling until they receive the next chunk of data they need to compute the next step of the training process.
The bandwidth requirements are extremely high— much, much higher than the typical bandwidth that is needed in traditional data center use cases. You can't really use traditional networking gear or fiber optics for this kind of interconnect, since it would introduce too much latency and wouldn't give you the pure terabytes per second of bandwidth that is needed to keep all the GPUs constantly busy.
Nvidia made an incredibly smart decision to purchase the Israeli company Mellanox back in 2019 for a mere $6.9b, and this acquisition is what provided them with their industry leading interconnect technology. Note that interconnect speed is a lot more relevant to the training process, where you have to harness together the output of thousands of GPUs at the same time, than the inference process (including COT inference), which can use just a handful of GPUs— all you need is enough VRAM to store the quantized (compressed) model weights of the already-trained model.
So those are arguably the major components of Nvidia's "moat" and how it has been able to maintain such high margins for so long (there is also a "flywheel" aspect to things, where they aggressively invest their super-normal profits into tons of R&D, which in turn helps them improve their tech at a faster rate than the competition, so they are always in the lead in terms of raw performance).
But as was pointed out earlier, what customers really tend to care about, all other things being equal, is performance per dollar (both in up-front capex cost of equipment and in energy usage, so performance per watt), and even though Nvidia's GPUs are certainly the fastest, they are not the best price/performance when measured naively in terms of FLOPS.
But the thing is, all other things are NOT equal, and the fact that AMD's drivers suck, that popular AI software libraries don't run as well on AMD GPUs, that you can't find really good GPU experts who specialize in AMD GPUs outside of the gaming world (why would they bother when there is more demand in the market for CUDA experts?), that you can't wire thousands of them together as effectively because of lousy interconnect technology for AMD— all this means that AMD is basically not competitive in the high-end data center world, and doesn't seem to have very good prospects for getting there in the near term.
Well, that all sounds very bullish for Nvidia, right? Now you can see why the stock is trading at such a huge valuation! But what are the other clouds on the horizon? Well, there are a few that I think merit significant attention. Some have been lurking in the background for the last few years, too small to make a dent considering how quickly the pie has been growing, but they are now getting ready to potentially inflect upwards. Others are very recent developments (as in, the last 2 weeks) that might dramatically change the near-term trajectory of incremental GPU demand.
The Major Threats
At a very high level, you can think of things like this: Nvidia operated in a pretty niche area for a very long time; they had very limited competition, and the competition wasn't particularly profitable or growing fast enough to ever pose a real threat, since they didn't have the capital needed to really apply pressure to a market leader like Nvidia. The gaming market was large and growing, but didn't feature earth shattering margins or particularly fabulous year over year growth rates.
A few big tech companies started ramping up hiring and spending on machine learning and AI efforts around 2016-2017, but it was never a truly significant line item for any of them on an aggregate basis— more of a "moonshot" R&D expenditure. But once the big AI race started in earnest with the release of ChatGPT in 2022— only a bit over 2 years ago, although it seems like a lifetime ago in terms of developments— that situation changed very dramatically.
Suddenly, big companies were ready to spend many, many billions of dollars incredibly quickly. The number of researchers showing up at the big research conferences like Neurips and ICML went up very, very dramatically. All the smart students who might have previously studied financial derivatives were instead studying Transformers, and $1mm+ compensation packages for non-executive engineering roles (i.e., for independent contributors not managing a team) became the norm at the leading AI labs.
It takes a while to change the direction of a massive cruise ship; and even if you move really quickly and spend billions, it takes a year or more to build greenfield data centers and order all the equipment (with ballooning lead times) and get it all set up and working. It takes a long time to hire and onboard even smart coders before they can really hit their stride and familiarize themselves with the existing codebases and infrastructure.
But now, you can imagine that absolutely biblical amounts of capital, brainpower, and effort are being expended in this area. And Nvidia has the biggest target of any player on their back, because they are the ones who are making the lion's share of the profits TODAY, not in some hypothetical future where the AI runs our whole lives.
So the very high level takeaway is basically that "markets find a way"; they find alternative, radically innovative new approaches to building hardware that leverage completely new ideas to sidestep barriers that help prop up Nvidia's moat.
The Hardware Level Threat
Consider, for example, the so-called "wafer scale" AI training chips from Cerebras, which dedicate an entire 300mm silicon wafer to an absolutely gargantuan chip that contains orders of magnitude more transistors and cores on a single die (see this recent blog post from them explaining how they were able to solve the "yield problem" that had been preventing this approach from being economically practical in the past).
To put this into perspective, if you compare Cerebras' newest WSE-3 chip to Nvidia's flagship data-center GPU, the H100, the Cerebras chip has a total die area of 46,225 square millimeters compared to just 814 for the H100 (and the H100 is itself considered an enormous chip by industry standards); that's a multiple of ~57x! And instead of having 132 "streaming multiprocessor" cores enabled on the chip like the H100 has, the Cerebras chip has ~900,000 cores (granted, each of these cores is smaller and does a lot less, but it's still an almost unfathomably large number in comparison). In more concrete apples-to-apples terms, the Cerebras chip can do around ~32x the FLOPS in AI contexts as a single H100 chip. Since an H100 sells for close to $40k a pop, you can imagine that the WSE-3 chip isn't cheap.
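A quick sanity check on those ratios, using only the figures quoted in this paragraph (so all the outputs are approximate by construction):

```python
# Cerebras WSE-3 vs Nvidia H100, using the numbers from the text above.

wse3_die_mm2, h100_die_mm2 = 46_225, 814
print(f"Die area ratio: ~{wse3_die_mm2 / h100_die_mm2:.0f}x")        # ~57x

wse3_cores, h100_sms = 900_000, 132
print(f"Core count ratio: ~{wse3_cores / h100_sms:,.0f}x (each WSE core is much smaller)")

flops_ratio = 32                                                      # the article's ~32x AI FLOPS figure
h100_price = 40_000
print(f"At ~{flops_ratio}x the FLOPS of a ${h100_price:,} H100, one WSE-3 stands in "
      f"for roughly ${flops_ratio * h100_price:,} worth of H100s")
```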
So why does this all matter? Well, instead of trying to battle Nvidia head-on by using a similar approach and trying to match the Mellanox interconnect technology, Cerebras has used a radically innovative approach to do an end-run around the interconnect problem: inter-processor bandwidth becomes much less of an issue when everything is running on the same super-sized chip. You don't even need to have the same level of interconnect because one mega chip replaces tons of H100s.
And the Cerebras chips also work extremely well for AI inference tasks. In fact, you can try it today for free here and use Meta's very respectable Llama-3.3-70B model. It responds basically instantaneously, at ~1,500 tokens per second. To put that into perspective, anything above 30 tokens per second feels relatively snappy to users based on comparisons to ChatGPT and Claude, and even 10 tokens per second is fast enough that you can basically read the response while it's being generated.
Cerebras is also not alone; there are other companies, like Groq (not to be confused with the Grok model family trained by Elon Musk's X AI). Groq has taken yet another innovative approach to solving the same fundamental problem. Instead of trying to compete with Nvidia's CUDA software stack directly, they've developed what they call a "tensor processing unit" (TPU) that is specifically designed for the exact mathematical operations that deep learning models need to perform. Their chips are designed around a concept called "deterministic compute," which means that, unlike traditional GPUs where the exact timing of operations can vary, their chips execute operations in a completely predictable way every single time.
This might sound like a minor technical detail, but it actually makes a massive difference for both chip design and software development. Because the timing is completely deterministic, Groq can optimize their chips in ways that would be impossible with traditional GPU architectures. As a result, they've been demonstrating for the past 6+ months inference speeds of over 500 tokens per second with the Llama series of models and other open source models, far exceeding what's possible with traditional GPU setups. Like Cerebras, this is available today and you can try it for free here.
Using a comparable Llama3 model with "speculative decoding," Groq is able to generate 1,320 tokens per second, on par with Cerebras and far in excess of what is possible using regular GPUs. Now, you might ask what the point is of achieving 1,000+ tokens per second when users seem pretty satisfied with ChatGPT, which is operating at less than 10% of that speed. And the thing is, it does matter. It makes it a lot faster to iterate and not lose focus as a human knowledge worker when you get instant feedback. And if you're using the model programmatically via the API, which is increasingly where much of the demand is coming from, then it can enable whole new classes of applications that require multi-stage inference (where the output of previous stages is used as input in successive stages of prompting/inference) or which require low-latency responses, such as content moderation, fraud detection, dynamic pricing, etc.
But even more fundamentally, the faster you can serve requests, the faster you can cycle things, and the busier you can keep the hardware. Although Groq's hardware is extremely expensive, clocking in at $2mm to $3mm for a single server, it ends up costing far less per request fulfilled if you have enough demand to keep the hardware busy all the time.
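That is really just an amortization argument. The sketch below uses the $2mm to $3mm server price and the ~1,300 tokens-per-second figure from the surrounding text; the 3-year depreciation window and the utilization levels are illustrative assumptions of mine, and the result ignores power, staff, and networking costs entirely.

```python
# Why utilization dominates the economics: hardware-only cost per million tokens.

server_cost = 2_500_000          # midpoint of the $2mm-$3mm range quoted above
tokens_per_sec = 1_300           # roughly the Groq/Cerebras figures from the text
seconds_in_3_years = 3 * 365 * 24 * 3600   # assumed depreciation window

for utilization in (0.05, 0.25, 0.75):     # assumed utilization scenarios
    tokens_served = tokens_per_sec * seconds_in_3_years * utilization
    cost_per_million = server_cost / (tokens_served / 1e6)
    print(f"{utilization:.0%} busy -> ~${cost_per_million:.2f} per million tokens (hardware only)")
```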
And like Nvidia with CUDA, a huge part of Groq's advantage comes from their own proprietary software stack. They are able to take the same open source models that other companies like Meta, DeepSeek, and Mistral develop and release for free, and decompose them in special ways that allow them to run dramatically faster on their specific hardware.
Like Cerebras, they have taken different technical decisions to optimize certain particular aspects of the process, which allows them to do things in a fundamentally different way. In Groq's case, it's because they are entirely focused on inference level compute, not on training: all their special sauce hardware and software only give these huge speed and efficiency advantages when doing inference on an already trained model.
But if the next big scaling law that people are excited about is for inference level compute— and if the biggest drawback of COT models is the high latency introduced by having to generate all those intermediate logic tokens before they can respond— then even a company that only does inference compute, but which does it dramatically faster and more efficiently than Nvidia can— can introduce a serious competitive threat in the coming years. At the very least, Cerebras and Groq can chip away at the lofty expectations for Nvidia's revenue growth over the next 2-3 years that are embedded in the current equity valuation.
Besides these particularly innovative, if relatively unknown, startup competitors, there is some serious competition coming from some of Nvidia's biggest customers themselves who have been making custom silicon that specifically targets AI training and inference workloads. Perhaps the best known of these is Google, which has been developing its own proprietary TPUs since 2016. Interestingly, although it briefly sold TPUs to external customers, Google has been using all its TPUs internally for the past several years, and it is already on its 6th generation of TPU hardware.
Amazon has also been developing its own custom chips called Trainium2 and Inferentia2. And while Amazon is building out data centers featuring billions of dollars of Nvidia GPUs, they are also at the same time investing many billions in other data centers that use these internal chips. They have one cluster that they are bringing online for Anthropic that features over 400k chips.
Amazon gets a lot of flak for totally bungling their internal AI model development, squandering massive amounts of internal compute resources on models that ultimately are not competitive, but the custom silicon is another matter. Again, they don't necessarily need their chips to be better and faster than Nvidia's. What they need is for their chips to be good enough, but build them at a breakeven gross margin instead of the ~90%+ gross margin that Nvidia earns on its H100 business.
OpenAI has also announced their plans to build custom chips, and they (together with Microsoft) are obviously the single largest user of Nvidia's data center hardware. As if that weren't enough, Microsoft has itself announced its own custom chips!
And Apple, the most valuable technology company in the world, has been blowing away expectations for years now with their highly innovative and disruptive custom silicon operation, which now completely trounces the CPUs from both Intel and AMD in terms of performance per watt, which is the most important factor in mobile (phone/tablet/laptop) applications. And they have been making their own internally designed GPUs and "Neural Processors" for years, even though they have yet to really demonstrate the utility of such chips outside of their own custom applications, like the advanced software based image processing used in the iPhone's camera.
While Apple's focus seems somewhat orthogonal to these other players in terms of its mobile-first, consumer oriented, "edge compute" focus, if it ends up spending enough money on its new contract with OpenAI to provide AI services to iPhone users, you have to imagine that they have teams looking into making their own custom silicon for inference/training (although given their secrecy, you might never even know about it directly!).
Now, it's no secret that there is a strong power law distribution of Nvidia's hyper-scaler customer base, with the top handful of customers representing the lion's share of high-margin revenue. How should one think about the future of this business when literally every single one of these VIP customers is building their own custom chips specifically for AI training and inference?
When thinking about all this, you should keep one incredibly important thing in mind: Nvidia is largely an IP based company. They don't make their own chips. The true special sauce for making these incredible devices arguably comes more from TSMC, the actual fab, and ASML, which makes the special EUV lithography machines used by TSMC to make these leading-edge process node chips. And that's critically important, because TSMC will sell their most advanced chips to anyone who comes to them with enough up-front investment and is willing to guarantee a certain amount of volume. They don't care if it's for Bitcoin mining ASICs, GPUs, TPUs, mobile phone SoCs, etc.
As much as senior chip designers at Nvidia earn per year, surely some of the best of them could be lured away by these other tech behemoths for enough cash and stock. And once they have a team and resources, they can design innovative chips (again, perhaps not even 50% as advanced as an H100, but with that Nvidia gross margin, there is plenty of room to work with) in 2 to 3 years, and thanks to TSMC, they can turn those into actual silicon using the exact same process node technology as Nvidia.
The Software Threat(s)
As if these looming hardware threats weren't bad enough, there are a few developments in the software world in the last couple years that, while they started out slowly, are now picking up real steam and could pose a serious threat to the software dominance of Nvidia's CUDA. The first of these is the horrible Linux drivers for AMD GPUs. Remember we talked about how AMD has inexplicably allowed these drivers to suck for years despite leaving massive amounts of money on the table?
Well, amusingly enough, the infamous hacker George Hotz (famous for jailbreaking the original iPhone as a teenager, and currently the CEO of self-driving startup Comma.ai and AI computer company Tiny Corp, which also makes the open-source tinygrad AI software framework), recently announced that he was sick and tired of dealing with AMD's bad drivers, and desperately wanted to be able to leverage the lower cost AMD GPUs in their TinyBox AI computers (which come in multiple flavors, some of which use Nvidia GPUs, and some of which use AMD GPUs).
Well, he is making his own custom drivers and software stack for AMD GPUs without any help from AMD themselves; on Jan. 15th of 2025, he tweeted via his company's X account that "We are one piece away from a completely sovereign stack on AMD, the RDNA3 assembler. We have our own driver, runtime, libraries, and emulator. (all in ~12,000 lines!)" Given his track record and skills, it is likely that they will have this all working in the next couple months, and this would allow for a lot of exciting possibilities of using AMD GPUs for all sorts of applications where companies currently feel compelled to pay up for Nvidia GPUs.
OK, well that's just a driver for AMD, and it's not even done yet. What else is there? Well, there are a few other areas on the software side that are a lot more impactful. For one, there is now a massive concerted effort across many large tech companies and the open source software community at large to make more generic AI software frameworks that have CUDA as just one of many "compilation targets".
That is, you write your software using higher-level abstractions, and the system itself can automatically turn those high-level constructs into super well-tuned low-level code that works extremely well on CUDA. But because it's done at this higher level of abstraction, it can just as easily get compiled into low-level code that works extremely well on lots of other GPUs and TPUs from a variety of providers, such as the massive number of custom chips in the pipeline from every big tech company.
The most famous examples of these frameworks are MLX (sponsored primarily by Apple), Triton (sponsored primarily by OpenAI), and JAX (developed by Google). MLX is particularly interesting because it provides a PyTorch-like API that can run efficiently on Apple Silicon, showing how these abstraction layers can enable AI workloads to run on completely different architectures. Triton, meanwhile, has become increasingly popular as it allows developers to write high-performance code that can be compiled to run on various hardware targets without having to understand the low-level details of each platform.
These frameworks allow developers to write their code once using high powered abstractions and then target tons of platforms automatically— doesn't that sound like a better way to do things, which would give you a lot more flexibility in terms of how you actually run the code?
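To make the "many compilation targets" point concrete, here is a tiny sketch in JAX, one of the frameworks named above. The same high-level function gets JIT-compiled by XLA for whatever backend is present, and nothing in the source mentions CUDA at all.

```python
# Write-once, compile-anywhere: a small JAX function that XLA will compile for
# CPU, an Nvidia GPU, or a TPU depending on what hardware it finds at runtime.
import jax
import jax.numpy as jnp

@jax.jit
def mlp_layer(x, w, b):
    """One dense layer with a ReLU, written against the abstract array API."""
    return jnp.maximum(x @ w + b, 0.0)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 512))
w = jax.random.normal(key, (512, 512)) * 0.02
b = jnp.zeros(512)

print("Running on:", jax.devices())     # e.g. CPU, CUDA, or TPU devices
print("Output shape:", mlp_layer(x, w, b).shape)
```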
In the 1980s, all the most popular, best selling software was written in hand-tuned assembly language. The PKZIP compression utility for example was hand crafted to maximize speed, to the point where a competently coded version written in the standard C programming language and compiled using the best available optimizing compilers at the time, would run at probably half the speed of the hand-tuned assembly code. The same is true for other popular software packages like WordStar, VisiCalc, and so on.
Over time, compilers kept getting better and better, and every time the CPU architectures changed (say, from Intel releasing the 486, then the Pentium, and so on), that hand-rolled assembler would often have to be thrown out and rewritten, something that only the smartest coders were capable of (sort of like how CUDA experts are on a different level in the job market versus a "regular" software developer). Eventually, things converged so that the speed benefits of hand-rolled assembly were outweighed dramatically by the flexibility of being able to write code in a high-level language like C or C++, where you rely on the compiler to make things run really optimally on the given CPU.
Nowadays, very little new code is written in assembly. I believe a similar transformation will end up happening for AI training and inference code, for similar reasons: computers are good at optimization, and flexibility and speed of development is increasingly the more important factor— especially if it also allows you to save dramatically on your hardware bill because you don't need to keep paying the "CUDA tax" that gives Nvidia 90%+ margins.
Yet another area where you might see things change dramatically is that CUDA might very well end up being more of a high level abstraction itself— a "specification language" similar to Verilog (used as the industry standard to describe chip layouts) that skilled developers can use to describe high-level algorithms that involve massive parallelism (since they are already familiar with it, it's very well constructed, it's the lingua franca, etc.), but then instead of having that code compiled for use on Nvidia GPUs like you would normally do, it can instead be fed as source code into an LLM which can port it into whatever low-level code is understood by the new Cerebras chip, or the new Amazon Trainium2, or the new Google TPUv6, etc. This isn't as far off as you might think; it's probably already well within reach using OpenAI's latest O3 model, and surely will be possible generally within a year or two.
The Theoretical Threat
Perhaps the most shocking development which was alluded to earlier happened in the last couple of weeks. And that is the news that has totally rocked the AI world, and which has been dominating the discourse among knowledgeable people on Twitter despite its complete absence from any of the mainstream media outlets: that a small Chinese startup called DeepSeek released two new models that have basically world-competitive performance levels on par with the best models from OpenAI and Anthropic (blowing past the Meta Llama3 models and other smaller open source model players such as Mistral). These models are called DeepSeek-V3 (basically their answer to GPT-4o and Claude3.5 Sonnet) and DeepSeek-R1 (basically their answer to OpenAI's O1 model).
Why is this all so shocking? Well, first of all, DeepSeek is a tiny Chinese company that reportedly has under 200 employees. The story goes that they started out as a quant trading hedge fund similar to TwoSigma or RenTec, but after Xi Jinping cracked down on that space, they used their math and engineering chops to pivot into AI research. Who knows if any of that is really true or if they are merely some kind of front for the CCP or the Chinese military. But the fact remains that they have released two incredibly detailed technical reports, for DeepSeek-V3 and DeepSeek-R1.
These are heavy technical reports, and if you don't know a lot of linear algebra, you probably won't understand much. But what you should really try is to download the free DeepSeek app on the AppStore here and install it using a Google account to log in and give it a try (you can also install it on Android here), or simply try it out on your desktop computer in the browser here. Make sure to select the "DeepThink" option to enable chain-of-thought (the R1 model) and ask it to explain parts of the technical reports in simple terms.
This will simultaneously show you a few important things:
  • One, this model is absolutely legit. There is a lot of BS that goes on with AI benchmarks, which are routinely gamed so that models appear to perform great on the benchmarks but then suck in real world tests. Google is certainly the worst offender in this regard, constantly crowing about how amazing their LLMs are, when they are so awful in any real world test that they can't even reliably accomplish the simplest possible tasks, let alone challenging coding tasks. These DeepSeek models are not like that— the responses are coherent, compelling, and absolutely on the same level as those from OpenAI and Anthropic.
  • Two, that DeepSeek has made profound advancements not just in model quality, but more importantly in model training and inference efficiency. By being extremely close to the hardware and by layering together a handful of distinct, very clever optimizations, DeepSeek was able to train these incredible models using GPUs in a dramatically more efficient way. By some measurements, over ~45x more efficiently than other leading-edge models. DeepSeek claims that the complete cost to train DeepSeek-V3 was just over $5mm. That is absolutely nothing by the standards of OpenAI, Anthropic, etc., which were well into the $100mm+ level for training costs for a single model as early as 2024.

How in the world could this be possible? How could this little Chinese company completely upstage all the smartest minds at our leading AI labs, which have 100 times more resources, headcount, payroll, capital, GPUs, etc? Wasn't China supposed to be crippled by Biden's restriction on GPU exports? Well, the details are fairly technical, but we can at least describe them at a high level. It might have just turned out that the relative GPU processing poverty of DeepSeek was the critical ingredient to make them more creative and clever, necessity being the mother of invention and all.
A major innovation is their sophisticated mixed-precision training framework that lets them use 8-bit floating point numbers (FP8) throughout the entire training process. Most Western AI labs train using "full precision" 32-bit numbers (this basically specifies the number of gradations possible in describing the output of an artificial neuron; 8 bits in FP8 lets you store a much wider range of numbers than you might expect— it's not just limited to 256 different equal-sized magnitudes like you'd get with regular integers, but instead uses clever math tricks to store both very small and very large numbers— though naturally with less precision than you'd get with 32 bits.) The main tradeoff is that while FP32 can store numbers with incredible precision across an enormous range, FP8 sacrifices some of that precision to save memory and boost performance, while still maintaining enough accuracy for many AI workloads.
DeepSeek cracked this problem by developing a clever system that breaks numbers into small tiles for activations and blocks for weights, and strategically uses high-precision calculations at key points in the network. Unlike other labs that train in high precision and then compress later (losing some quality in the process), DeepSeek's native FP8 approach means they get the massive memory savings without compromising performance. When you're training across thousands of GPUs, this dramatic reduction in memory requirements per GPU translates into needing far fewer GPUs overall.
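A toy illustration of block-wise low-precision storage with per-tile scale factors is below. NumPy has no FP8 type, so int8 is used as a stand-in, and this is emphatically not DeepSeek's actual kernel; it only shows the core idea that small tiles, each with its own scale, preserve accuracy at a fraction of the memory of FP32.

```python
# Block-wise quantization sketch: store each (block x block) tile as int8 plus
# one FP32 scale factor, then reconstruct. Roughly 4x less memory than FP32.
import numpy as np

def quantize_blocks(x: np.ndarray, block: int = 128):
    h, w = x.shape
    q = np.empty((h, w), dtype=np.int8)
    scales = np.empty((h // block, w // block), dtype=np.float32)
    for i in range(0, h, block):
        for j in range(0, w, block):
            tile = x[i:i+block, j:j+block]
            scale = np.abs(tile).max() / 127.0 + 1e-12     # one scale per tile
            q[i:i+block, j:j+block] = np.round(tile / scale).astype(np.int8)
            scales[i // block, j // block] = scale
    return q, scales

def dequantize_blocks(q, scales, block: int = 128):
    out = q.astype(np.float32)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            out[i*block:(i+1)*block, j*block:(j+1)*block] *= scales[i, j]
    return out

w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_blocks(w)
err = np.abs(w - dequantize_blocks(q, s)).mean()
print(f"Mean abs reconstruction error {err:.4f} at ~4x less memory than FP32")
```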
Another major breakthrough is their multi-token prediction system. Most Transformer based LLM models do inference by predicting the next token— one token at a time. DeepSeek figured out how to predict multiple tokens while maintaining the quality you'd get from single-token prediction. Their approach achieves about 85-90% accuracy on these additional token predictions, which effectively doubles inference speed without sacrificing much quality. The clever part is they maintain the complete causal chain of predictions, so the model isn't just guessing— it's making structured, contextual predictions.
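The arithmetic behind "85-90% accuracy roughly doubles speed" is simple expected value: each step always yields the next token, and the extra predicted token is kept whenever it turns out to be right. (This glosses over how multi-token predictions are actually verified, but the expectation is the point.)

```python
# Expected tokens emitted per decoding step with one extra predicted token.
for accept_rate in (0.85, 0.90):
    expected_tokens_per_step = 1 + accept_rate      # 1 guaranteed + 1 accepted with prob p
    print(f"Acceptance {accept_rate:.0%}: ~{expected_tokens_per_step:.2f} tokens per step, "
          f"i.e. roughly {expected_tokens_per_step:.1f}x one-token-at-a-time decoding")
```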
One of their most innovative developments is what they call Multi-head Latent Attention (MLA). This is a breakthrough in how they handle what are called the Key-Value indices, which are basically how individual tokens are represented in the attention mechanism within the Transformer architecture. Although this is getting a bit too advanced in technical terms, suffice it to say that these KV indices are some of the major uses of VRAM during the training and inference process, and part of the reason why you need to use thousands of GPUs at the same time to train these models— each GPU has a maximum of 96 GB of VRAM, and these indices eat that memory up for breakfast.
Their MLA system finds a way to store a compressed version of these indices that captures the essential information while using far less memory. The brilliant part is this compression is built directly into how the model learns— it's not some separate step they need to do, it's built directly into the end-to-end training pipeline. This means that the entire mechanism is "differentiable" and able to be trained directly using the standard optimizers. All this stuff works because these models are ultimately finding much lower-dimensional representations of the underlying data than the so-called "ambient dimensions". So it's wasteful to store the full KV indices, even though that is basically what everyone else does.
Not only do you end up wasting tons of space by storing way more numbers than you need, which gives a massive boost to the training memory footprint and efficiency (again, slashing the number of GPUs you need to train a world class model), but it can actually end up improving model quality because it can act like a "regularizer," forcing the model to pay attention to the truly important stuff instead of using the wasted capacity to fit to noise in the training data. So not only do you save a ton of memory, but the model might even perform better. At the very least, you don't get a massive hit to performance in exchange for the huge memory savings, which is generally the kind of tradeoff you are faced with in AI training.
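The sketch below shows the memory arithmetic that makes this attractive: instead of caching a full key and value vector per token per head, you cache one small latent vector per token and expand it on the fly with learned projection matrices. This is a simplified stand-in, not DeepSeek's exact MLA formulation, and the dimensions are made up for illustration.

```python
# Simplified latent/compressed KV-cache arithmetic (illustrative dimensions).
import numpy as np

seq_len, n_heads, head_dim, latent_dim = 4096, 64, 128, 512

# Conventional cache: keys AND values for every head, for every token.
full_cache_floats = seq_len * n_heads * head_dim * 2
# Latent cache: one shared compressed vector per token.
latent_cache_floats = seq_len * latent_dim

print(f"Per-layer cache entries: {full_cache_floats:,} vs {latent_cache_floats:,} "
      f"(~{full_cache_floats / latent_cache_floats:.0f}x smaller)")

# Expanding the latent back into per-head keys at attention time (weights are
# learned end-to-end in the real model, random here):
W_uk = np.random.randn(latent_dim, n_heads * head_dim) * 0.02
latents = np.random.randn(seq_len, latent_dim)
keys = (latents @ W_uk).reshape(seq_len, n_heads, head_dim)
print("Reconstructed key tensor shape:", keys.shape)
```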
They also made major advances in GPU communication efficiency through their DualPipe algorithm and custom communication kernels. This system intelligently overlaps computation and communication, carefully balancing GPU resources between these tasks. They only need about 20 of their GPUs' streaming multiprocessors (SMs) for communication, leaving the rest free for computation. The result is much higher GPU utilization than typical training setups achieve.
Another very smart thing they did is to use what is known as a Mixture-of-Experts (MOE) Transformer architecture, but with key innovations around load balancing. As you might know, the size or capacity of an AI model is often measured in terms of the number of parameters the model contains. A parameter is just a number that stores some attribute of the model; either the "weight" or importance a particular artificial neuron has relative to another one, or the importance of a particular token depending on its context (in the "attention mechanism"), etc.
Meta's latest Llama3 models come in a few sizes: for example, a 1B-parameter version (the smallest), an 8B version, a 70B-parameter model (the most commonly deployed one), and even a massive 405B-parameter model. The largest model is of limited utility for most users, because you would need tens of thousands of dollars' worth of GPUs in your computer just to run inference at tolerable speeds, at least if you deployed it in the naive full-precision form. Therefore most of the real-world usage and excitement surrounding these open-source models is at the 8B-parameter or highly quantized 70B-parameter level, since that's what can fit in a consumer-grade Nvidia 4090 GPU, which you can buy now for under $1,000.
So why does any of this matter? Well, in a sense, the parameter count and precision tell you something about how much raw information or data the model has stored internally. Note that I'm not talking about reasoning ability, or the model's "IQ" if you will: it turns out that models with even surprisingly modest parameter counts can show remarkable cognitive performance when it comes to solving complex logic problems, proving theorems in plane geometry, SAT math problems, etc.
But those small models aren't necessarily going to be able to tell you every aspect of every plot twist in every single novel by Stendhal, whereas the really big models potentially can. The "cost" of that extreme level of knowledge is that the models become very unwieldy both to train and to do inference on, because you always need to store every single one of those 405B parameters (or whatever the parameter count is) in the GPUs' VRAM at the same time in order to do any inference with the model.
The beauty of the MOE model approach is that you can decompose the big model into a collection of smaller models that each know different, largely non-overlapping pieces of knowledge. DeepSeek's innovation here was developing what they call an "auxiliary-loss-free" load balancing strategy that maintains efficient expert utilization without the usual performance degradation that load balancing introduces. Then, depending on the nature of the inference request, you can intelligently route the inference to whichever "expert" models within that collection are best able to answer that question or solve that task.
You can loosely think of it as being a committee of experts who have their own specialized knowledge domains: one might be a legal expert, the other a computer science expert, the other a business strategy expert. So if a question comes in about linear algebra, you don't give it to the legal expert. This is of course a very loose analogy and it doesn't actually work like this in practice.
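A toy top-2 router in PyTorch makes the "only a few experts fire per token" point concrete. This is generic MoE routing, not DeepSeek's auxiliary-loss-free load-balancing scheme, and the sizes are tiny on purpose:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-2 mixture-of-experts layer: each token only touches 2 of 8 experts."""
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: [tokens, d_model]
        scores = self.router(x)                                # which experts fit each token
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                            # naive dispatch, fine for a toy
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

x = torch.randn(16, 64)
print(TinyMoE()(x).shape)                                      # torch.Size([16, 64])
```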
The real advantage of this approach is that it allows the model to contain a huge amount of knowledge without being very unwieldy, because even though the aggregate number of parameters is high across all the experts, only a small subset of these parameters is "active" at any given time, which means that you only need to store this small subset of weights in VRAM in order to do inference. In the case of DeepSeek-V3, they have an absolutely massive MOE model with 671B parameters, so it's much bigger than even the largest Llama3 model, but only 37B of these parameters are active at any given time— enough to fit in the VRAM of two consumer-grade Nvidia 4090 GPUs (under $2,000 total cost), rather than requiring one or more H100 GPUs, which cost something like $40k each.
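The back-of-the-envelope arithmetic behind that claim, under my own assumptions (fp16 weights for the dense model, fp8 for the active experts, and ignoring the KV cache, activations, and where the full expert set lives when it isn't active):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough GB of memory needed just to hold the weights."""
    return params_billion * bytes_per_param    # 1e9 params * bytes, divided by 1e9 bytes/GB

print(weight_vram_gb(405, 2.0))   # dense Llama-3.1-405B at fp16: ~810 GB of weights
print(weight_vram_gb(37, 1.0))    # DeepSeek-V3's 37B active params at fp8: ~37 GB
print(weight_vram_gb(671, 1.0))   # the full 671B expert set still totals ~671 GB somewhere
```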
It's rumored that both ChatGPT and Claude use an MoE architecture, with some leaks suggesting that GPT-4 had a total of 1.8 trillion parameters split across 8 models containing 220 billion parameters each. Despite that being a lot more doable than trying to fit all 1.8 trillion parameters in VRAM, it still requires multiple H100-grade GPUs just to run the model because of the massive amount of memory used.
Beyond what has already been described, the technical papers mention several other key optimizations. These include their extremely memory-efficient training framework that avoids tensor parallelism, recomputes certain operations during backpropagation instead of storing them, and shares parameters between the main model and auxiliary prediction modules. The sum total of all these innovations, when layered together, has led to the ~45x efficiency improvement numbers that have been tossed around online, and I am perfectly willing to believe these are in the right ballpark.
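The "recompute instead of store" trick is the same basic idea PyTorch exposes as activation checkpointing; a minimal illustration of the memory-for-FLOPs trade (not DeepSeek's custom framework):

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(32, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # intermediate activations are discarded...
y.sum().backward()                             # ...and recomputed here during backprop
print(x.grad.shape)                            # torch.Size([32, 1024])
```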
One very strong indicator that it's true is the cost of DeepSeek's API: despite this nearly best-in-class model performance, DeepSeek charges something like 95% less for inference requests via its API than comparable models from OpenAI and Anthropic. In a sense, it's a bit like comparing Nvidia's GPUs to the new custom chips from competitors: even if they aren't quite as good, the value for money is so much better that they can still be a no-brainer depending on the application, as long as you can qualify the performance level, prove that it's good enough for your requirements, and confirm that the API availability and latency are good enough. (So far, people have been amazed at how well DeepSeek's infrastructure has held up despite the truly incredible surge of demand owing to the performance of these new models.)
But unlike the case of Nvidia, where the cost differential is the result of them earning monopoly gross margins of 90%+ on their data-center products, the cost differential of the DeepSeek API relative to the OpenAI and Anthropic API could be simply that they are nearly 50x more compute efficient (it might even be significantly more than that on the inference side— the ~45x efficiency was on the training side). Indeed, it's not even clear that OpenAI and Anthropic are making great margins on their API services— they might be more interested in revenue growth and gathering more data from analyzing all the API requests they receive.
Before moving on, I'd be remiss if I didn't mention that many people are speculating that DeepSeek is simply lying about the number of GPUs and GPU hours spent training these models because they actually possess far more H100s than they are supposed to have given the export restrictions on these cards, and they don't want to cause trouble for themselves or hurt their chances of acquiring more of these cards. While it's certainly possible, I think it's more likely that they are telling the truth, and that they have simply been able to achieve these incredible results by being extremely clever and creative in their approach to training and inference. They explain how they are doing things, and I suspect that it's only a matter of time before their results are widely replicated and confirmed by other researchers at various other labs.
A Model That Can Really Think
The newer R1 model and technical report might be even more mind-blowing, since they beat Anthropic to chain-of-thought reasoning at scale and are now basically the only lab besides OpenAI that has made this technology work. But note that OpenAI only released the O1 preview model in mid-September of 2024. That's only ~4 months ago! Something you absolutely must keep in mind is that, unlike OpenAI, which is incredibly secretive about how these models really work at a low level and won't release the actual model weights to anyone besides partners like Microsoft and others who sign heavy-duty NDAs, these DeepSeek models are both completely open-source and permissively licensed. DeepSeek has released extremely detailed technical reports explaining how they work, as well as code that anyone can look at and try to replicate.
With R1, DeepSeek essentially cracked one of the holy grails of AI: getting models to reason step-by-step without relying on massive supervised datasets. Their DeepSeek-R1-Zero experiment showed something remarkable: using pure reinforcement learning with carefully crafted reward functions, they managed to get models to develop sophisticated reasoning capabilities completely autonomously. This wasn't just about solving problems— the model organically learned to generate long chains of thought, self-verify its work, and allocate more computation time to harder problems.
The technical breakthrough here was their novel approach to reward modeling. Rather than using complex neural reward models that can lead to "reward hacking" (where the model finds bogus ways to boost its rewards that don't actually lead to better real-world performance), they developed a clever rule-based system that combines accuracy rewards (verifying final answers) with format rewards (encouraging structured thinking). This simpler approach turned out to be more robust and scalable than the process-based reward models that others have tried.
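A toy version of such a rule-based reward might look like the sketch below; the tag names, weights, and exact-match check are made up for illustration and are not DeepSeek's actual reward code.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Combine a format reward (structured <think>/<answer> output) with an
    accuracy reward (the final answer matches a verifiable reference)."""
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                               response, flags=re.S))
    m = re.search(r"<answer>(.*?)</answer>", response, flags=re.S)
    answer_ok = bool(m) and m.group(1).strip() == reference_answer.strip()
    return 1.0 * float(answer_ok) + 0.2 * float(format_ok)

print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 1.2
```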
What's particularly fascinating is that during training, they observed what they called an "aha moment," a phase where the model spontaneously learned to revise its thinking process mid-stream when encountering uncertainty. This emergent behavior wasn't explicitly programmed; it arose naturally from the interaction between the model and the reinforcement learning environment. The model would literally stop itself, flag potential issues in its reasoning, and restart with a different approach, all without being explicitly trained to do this.
The full R1 model built on these insights by introducing what they call "cold-start" data— a small set of high-quality examples— before applying their RL techniques. They also solved one of the major challenges in reasoning models: language consistency. Previous attempts at chain-of-thought reasoning often resulted in models mixing languages or producing incoherent outputs. DeepSeek solved this through a clever language consistency reward during RL training, trading off a small performance hit for much more readable and consistent outputs.
The results are mind-boggling: on AIME 2024, one of the most challenging high school math competitions, R1 achieved 79.8% accuracy, matching OpenAI's O1 model. On MATH-500, it hit 97.3%, and it reached the 96.3rd percentile on Codeforces programming competitions. But perhaps most impressively, they managed to distill these capabilities down to much smaller models: their 14B-parameter version outperforms many models several times its size, suggesting that reasoning ability isn't just about raw parameter count but about how you train the model to process information.
The Fallout
The recent scuttlebutt on Twitter and Blind (a corporate rumor website) is that these models caught Meta completely off guard and that they perform better than the new Llama4 models which are still being trained. Apparently, the Llama project within Meta has attracted a lot of attention internally from high-ranking technical executives, and as a result they have something like 13 individuals working on the Llama stuff who each individually earn more per year in total compensation than the combined training cost for the DeepSeek-V3 models which outperform it. How do you explain that to Zuck with a straight face? How does Zuck keep smiling while shoveling multiple billions of dollars to Nvidia to buy 100k H100s when a better model was trained using just 2k H100s for a bit over $5mm?
But you better believe that Meta and every other big AI lab is taking these DeepSeek models apart, studying every word in those technical reports and every line of the open source code they released, trying desperately to integrate these same tricks and optimizations into their own training and inference pipelines. So what's the impact of all that? Well, naively it sort of seems like the aggregate demand for training and inference compute should be divided by some big number. Maybe not by 45, but maybe by 25 or even 30? Because whatever you thought you needed before these model releases, it's now a lot less.
Now, an optimist might say "You are talking about a mere constant of proportionality, a single multiple. When you're dealing with an exponential growth curve, that stuff gets washed out so quickly that it doesn't end up mattering all that much." And there is some truth to that: if AI really is as transformational as I expect, if the real-world utility of this tech is measured in the trillions, if inference-time compute is the new scaling law of the land, if we are going to have armies of humanoid robots running around doing massive amounts of inference constantly, then maybe the growth curve is still so steep and extreme, and Nvidia has a big enough lead, that it will still work out.
But Nvidia is pricing in a LOT of good news in the coming years for that valuation to make sense, and when you start layering all these things together into a total mosaic, it starts to make me at least feel extremely uneasy about spending ~20x the 2025 estimated sales for their shares. What happens if you even see a slight moderation in sales growth? What if it turns out to be 85% instead of over 100%? What if gross margins come in a bit from 75% to 70%— still ridiculously high for a semiconductor company?
Wrapping it All Up
At a high level, NVIDIA faces an unprecedented convergence of competitive threats that make its premium valuation increasingly difficult to justify at 20x forward sales and 75% gross margins. The company's supposed moats in hardware, software, and efficiency are all showing concerning cracks. The whole world— thousands of the smartest people on the planet, backed by untold billions of dollars of capital resources— is trying to assail them from every angle.
On the hardware front, innovative architectures from Cerebras and Groq demonstrate that NVIDIA's interconnect advantage— a cornerstone of its data center dominance— can be circumvented through radical redesigns. Cerebras' wafer-scale chips and Groq's deterministic compute approach deliver compelling performance without needing NVIDIA's complex interconnect solutions. More traditionally, every major NVIDIA customer (Google, Amazon, Microsoft, Meta, Apple) is developing custom silicon that could chip away at high-margin data center revenue. These aren't experimental projects anymore— Amazon alone is building out massive infrastructure with over 400,000 custom chips for Anthropic.
The software moat appears equally vulnerable. New high-level frameworks like MLX, Triton, and JAX are abstracting away CUDA's importance, while efforts to improve AMD drivers could unlock much cheaper hardware alternatives. The trend toward higher-level abstractions mirrors how assembly language gave way to C/C++, suggesting CUDA's dominance may be more temporary than assumed. Most importantly, we're seeing the emergence of LLM-powered code translation that could automatically port CUDA code to run on any hardware target, potentially eliminating one of NVIDIA's strongest lock-in effects.
Perhaps most devastating is DeepSeek's recent efficiency breakthrough, achieving comparable model performance at approximately 1/45th the compute cost. This suggests the entire industry has been massively over-provisioning compute resources. Combined with the emergence of more efficient inference architectures through chain-of-thought models, the aggregate demand for compute could be significantly lower than current projections assume. The economics here are compelling: when DeepSeek can match GPT-4 level performance while charging 95% less for API calls, it suggests either NVIDIA's customers are burning cash unnecessarily or margins must come down dramatically.
The fact that TSMC will manufacture competitive chips for any well-funded customer puts a natural ceiling on NVIDIA's architectural advantages. But more fundamentally, history shows that markets eventually find a way around artificial bottlenecks that generate super-normal profits. When layered together, these threats suggest NVIDIA faces a much rockier path to maintaining its current growth trajectory and margins than its valuation implies. With five distinct vectors of attack— architectural innovation, customer vertical integration, software abstraction, efficiency breakthroughs, and manufacturing democratization— the probability that at least one succeeds in meaningfully impacting NVIDIA's margins or growth rate seems high. At current valuations, the market isn't pricing in any of these risks.

I hope you enjoyed reading this article. If you work at a hedge fund and are interested in consulting with me on NVDA or other AI-related stocks or investing themes, I'm already signed up as an expert on GLG and Coleman Research.


 楼主| 发表于 2025-2-2 06:49:31 | 显示全部楼层
英伟达股票的卖空案例
作为一名曾在各种多头/空头对冲基金(包括在 Millennium 和 Balyasny 工作过)担任过通才投资分析师约 10 年的人,同时也是一个自 2010 年以来一直在研究深度学习的数学和计算机迷(当时 Geoff Hinton 还在谈论受限玻尔兹曼机,一切编程仍使用MATLAB,研究人员仍在试图证明他们可以在分类手写数字方面获得比使用支持向量机更好的结果),我认为我对人工智能技术的发展及其与股票市场股权估值的关系有一个相当不同寻常的看法。

在过去的几年中,我更多地以开发人员的身份工作,并拥有几个流行的开源项目,用于处理各种形式的 AI 模型/服务(例如,请参阅LLM Aided OCR、Swiss Army Llama、Fast Vector Similarity、Source to Prompt和Pastel Inference Layer等几个最近的例子)。基本上,我每天都在尽可能密集地使用这些前沿模型。我有 3 个 Claude 帐户,因此我不会用完请求,并在 ChatGPT Pro 上线几分钟后就注册了它。

我还努力了解最新的研究进展,并仔细阅读各大人工智能实验室发布的所有重要技术报告论文。因此,我认为自己对这个领域以及事物的发展情况有很好的了解。与此同时,我一生中做空了大量股票,并两次获得价值投资者俱乐部的最佳创意奖(如果你一直在关注,则为TMS 多头和PDH 空头)。

我这样说不是为了吹牛,而是为了帮助确立我的信誉,即我可以就这个问题发表意见,而不会让技术人员或专业投资者觉得我幼稚无比。虽然肯定有很多人比我更懂数学/科学,也有很多人在股市的多头/空头投资方面比我更出色,但我怀疑很少有人能像我一样处于维恩图的中间位置。

尽管如此,每当我与对冲基金界的朋友和前同事见面聊天时,话题很快就会转到 Nvidia。一家公司从默默无闻发展到市值超过英国、法国或德国股市总和的现象并不常见!这些朋友自然想知道我对这个问题的看法。因为我坚信这项技术将带来长期变革性影响——我真的相信它将在未来 5-10 年内彻底改变我们经济和社会的几乎每个方面,这基本上是史无前例的——所以我很难说 Nvidia 的发展势头会在短期内放缓或停止。

但是,尽管在过去一年左右的时间里,我一直认为这个估值对我来说太高了,但最近一系列的发展让我有点改变了我的惯常本能,那就是在观点上更加逆势而行,当共识似乎被完全反映在价格中时,我会质疑它。“智者一开始相信什么,愚者最终也相信什么”这句话之所以出名是有原因的。

牛市案例
在讨论让我犹豫不决的进展之前,让我们先简要回顾一下 NVDA 股票的牛市情况,现在基本上人尽皆知。深度学习和人工智能是自互联网以来最具变革性的技术,有望改变我们社会的一切。就用于培训和推理基础设施的行业资本支出份额而言,Nvidia 在某种程度上几乎处于垄断地位。

世界上一些最大、最赚钱的公司,如微软、苹果、亚马逊、Meta、谷歌、甲骨文等,都已决定不惜一切代价保持这一领域的竞争力,因为他们根本无法承受落后的后果。资本支出金额、千兆瓦用电量、新建数据中心面积,当然还有 GPU 数量,都呈爆炸式增长,而且似乎没有放缓的迹象。而 Nvidia 能够在最高端、面向数据中心的产品上获得高达 90% 以上的毛利率。

我们只是触及了牛市的表面。现在还有许多其他方面,甚至让那些已经非常乐观的人也变得更加乐观。除了人形机器人的兴起,我怀疑当它们能够迅速完成大量目前需要非熟练(甚至熟练)人类工人才能完成的任务(例如,洗衣服、打扫、整理和做饭;与工人团队一起完成翻新浴室或建造房屋等建筑工作;经营仓库和驾驶叉车等)时,大多数人甚至还没有考虑到其他因素。

您听到聪明人谈论的一件大事是“新缩放定律”的兴起,它创造了一种新的范式思维,思考计算需求将如何随着时间的推移而增加。最初的缩放定律是预训练缩放定律,自2012 年AlexNet出现和 2017 年 Transformer 架构发明以来,它一直在推动人工智能的发展:我们可以用作训练数据的价值数十亿(现在是数万亿)的 token,我们正在训练的模型的参数数量越大,我们在这些 token 上训练这些模型所花费的计算 FLOPS 越多,生成的模型在各种非常有用的下游任务上的性能就越好。

不仅如此,这种改进在某种程度上是可以知晓的,以至于 OpenAI 和 Anthropic 等领先的人工智能实验室在开始实际训练之前就已经非常清楚他们的最新模型会有多好——在某些情况下,他们能够预测最终模型的基准,误差在几个百分点以内。这种“原始缩放定律”至关重要,但总是会让用它来预测未来的人心存疑虑。

首先,我们似乎已经耗尽了世界上积累的高质量训练数据集。当然,这并非完全正确——还有许多旧书和期刊尚未被正确数字化,即使已经数字化,也没有获得作为训练数据的适当许可。问题是,即使你把所有这些东西都归功于你——比如说从 1500 年到 2000 年“专业”制作的英语书面内容的总和,当你谈论一个近 15 万亿个标记的训练语料库时,从百分比来看,它并不是一个如此巨大的数量,这是当前前沿模型的规模。

让我们快速检查一下这些数字的真实情况:到目前为止,Google Books 已经数字化了大约 4000 万本书;如果一本典型的书有 5 万到 10 万个单词,或者 6.5 万到 13 万个标记,那么仅从书中获得的标记就介于 2.6T 到 5.2T 之间,尽管其中很大一部分肯定已经包含在大型实验室使用的训练语料库中,无论这是否严格合法。还有很多学术论文,仅 arXiv 网站就有超过 200 万篇论文。而国会图书馆拥有超过 30 亿页数字化的报纸。加起来,总共可能多达 7T 个标记,但由于其中大部分实际上都包含在训练语料库中,因此其余的“增量”训练数据在宏观上可能并不那么重要。

当然,还有其他方法可以收集更多训练数据。例如,你可以自动转录每个 YouTube 视频,并使用该文本。虽然这可能在边际上有所帮助,但其质量肯定远低于一本备受推崇的有机化学教科书,因为有机化学是有关世界的有用知识来源。因此,当谈到原始缩放定律时,我们总是面临一道迫在眉睫的“数据墙”;虽然我们知道我们可以继续将越来越多的资本支出投入到 GPU 中并建立越来越多的数据中心,但要大规模生产有用的人类新知识,这些知识是正确的,并且是对已有知识的增量。现在,对此的一个有趣的反应是“合成数据”的兴起,即文本本身就是法学硕士的输出。虽然通过“提高自己的供应量”来提高模型质量似乎几乎是荒谬的,但它实际上似乎在实践中非常有效,至少在数学、逻辑和计算机编程领域是如此。

当然,原因在于我们可以在这些领域机械地检查和证明事物的正确性。因此,我们可以从大量可能的数学定理或可能的 Python 脚本中抽样,然后实际检查它们是否正确,并且只在正确的情况下才将它们纳入我们的语料库。通过这种方式,我们可以极大地扩展我们高质量训练数据的收集,至少在这些领域是如此。

除了文本之外,我们还可以训练 AI 的其他数据种类也很多。例如,如果我们对 1 亿人进行全基因组测序(一个人的未压缩数据约为 200 GB 到 300 GB),结果会怎样?这显然是大量数据,尽管绝大多数数据在任何两个人之间都几乎相同。当然,出于各种原因,将这些数据与书籍和互联网上的文本数据进行比较可能会产生误导:

原始基因组大小不能直接与标记计数进行比较
基因组数据的信息内容与文本有很大不同
高度冗余数据的训练价值尚不明确
处理基因组数据的计算要求不同
但它仍然是另一个巨大的多样化信息来源,我们将来可以用它训练大型模型,这就是我将它包括在内的原因。

因此,尽管在获取更多额外训练数据方面仍有希望,但如果你看一下近年来训练语料库的增长速度,就会很快发现,我们在“普遍有用”知识的数据可用性方面已接近瓶颈,而这些知识可以让我们更接近最终目标,即获得比约翰·冯·诺依曼聪明 10 倍的超级人工智能,并且是人类已知的每个专业领域的绝对世界级专家。

除了有限的可用数据量之外,预训练扩展定律支持者的脑海中还一直隐藏着其他一些问题。其中最重要的一个问题是,在完成模型训练后,你应该如何处理所有这些计算基础设施?训练下一个模型?当然,你可以这样做,但考虑到 GPU 速度和容量的快速提升,以及电力和其他运营支出在经济计算中的重要性,使用你 2 年前的集群来训练你的新模型真的有意义吗?你肯定更愿意使用你刚刚建造的全新数据中心,它的成本是旧数据中心的 10 倍,由于技术更好,它的功能是旧数据中心的 20 倍。问题是,在某个时候你确实需要摊销这些投资的前期成本,并通过(希望是正的)运营利润流来收回成本,对吗?

市场对人工智能如此兴奋,幸运的是它忽略了这一点,这使得像 OpenAI 这样的公司从一开始就出现了惊人的累计运营亏损,同时在后续投资轮中获得越来越令人瞠目结舌的估值(尽管值得称赞的是,它们也能够展示出非常快速增长的收入)。但最终,要使这种情况在整个市场周期内持续下去,这些数据中心的成本确实需要最终收回,希望能有利润,随着时间的推移,这些利润在风险调整后与其他投资机会相比具有竞争力。

新范式
好的,这就是训练前的缩放定律。这个“新”缩放定律是什么?嗯,这是人们在过去一年里真正开始关注的事情:推理时间计算缩放。以前,你在这个过程中所花费的所有计算的绝大部分都是前期训练计算,用于首先创建模型。一旦你有了训练好的模型,对该模型进行推理——即提出问题或让 LLM 为你执行某种任务——就会使用一定数量的计算。

至关重要的是,推理计算的总量(以各种方式衡量,例如 FLOPS、GPU 内存占用等)远远少于预训练阶段所需的量。当然,当你增加模型的上下文窗口大小和一次性从中生成的输出量时,推理计算量确实会增加(尽管研究人员在这方面取得了惊人的算法改进,相对于人们最初预期的二次扩展)。但本质上,直到最近,推理计算通常比训练计算的密集程度要低得多,并且基本上与你处理的请求数量成线性关系——例如,ChatGPT 对文本完成的需求越多,你使用的推理计算就越多。

随着去年推出的革命性思想链 (COT) 模型的出现,这一切都发生了变化,最引人注目的是 OpenAI 的旗舰 O1 模型(但最近 DeepSeek 的新 R1 模型也出现了,我们将在后面详细讨论)。这些新的 COT 模型还生成中间“逻辑标记”;可以将其视为模型在尝试解决您的问题或完成其分配的任务时的一种便笺簿或“内部独白”。

这代表了推理计算工作方式的真正巨大变化:现在,您在这个内部思维链过程中使用的令牌越多,您为用户提供的最终输出质量就越好。实际上,这就像给人类工人更多的时间和资源来完成一项任务,这样他们就可以反复检查他们的工作,以多种不同的方式执行相同的基本任务,并验证他们是否以相同的方式得出结果;将他们得出的结果“代入”公式以检查它是否确实解决了方程式,等等。

事实证明,这种方法的效果几乎令人惊讶;它本质上是将人们期待已久的“强化学习”功能与 Transformer 架构的功能结合起来。它直接解决了 Transformer 模型最大的弱点,即它容易产生“幻觉”。

基本上,Transformers 在每个步骤中预测下一个标记的方式是,如果它们在初始反应中走上了一条错误的“道路”,它们就会变得像一个搪塞的孩子,试图编造一个故事来解释为什么自己是正确的,即使他们应该在中途利用常识意识到他们所说的话不可能是正确的。

由于模型总是寻求内部一致性,并让每个连续生成的标记自然地从前面的标记和上下文中流出,因此它们很难纠正和回溯。通过将推理过程分解为多个中间阶段,它们可以尝试许多不同的事情,看看哪些是有效的,并不断尝试纠正和尝试其他方法,直到它们能够达到相当高的置信度,相信它们不是在胡说八道。

除了这种方法确实有效之外,也许最特别之处在于,你使用的逻辑/COT 令牌越多,效果就越好。突然间,你现在有了可以转动的额外拨盘,这样,随着你增加 COT 推理令牌的数量(这需要使用更多的推理计算,无论是 FLOPS 还是内存),你给出正确答案的可能性就越大——第一次运行的代码没有错误,或者没有明显错误的推理步骤的逻辑问题解决方案。

我可以从大量的亲身经历中告诉你,尽管 Anthropic 的 Claude3.5 Sonnet 模型在 Python 编程方面表现优异(而且确实非常出色),但每当你需要生成任何长而复杂的东西时,它总是会犯一个或多个愚蠢的错误。现在,这些错误通常很容易修复,事实上,你通常可以通过简单地输入 Python 解释器生成的错误来修复它们,而无需任何进一步的解释,作为后续推理提示(或者,更有用的是,使用所谓的 Linter,粘贴代码编辑器在代码中发现的完整“问题”集),但这仍然是一个令人讨厌的额外步骤。当代码变得非常长或非常复杂时,有时可能需要更长的时间来修复,甚至可能需要手动进行一些手动调试。

我第一次尝试 OpenAI 的 O1 模型时,感觉就像是得到了启示:我惊讶地发现代码在第一次运行的时候就如此完美。这是因为 COT 流程会自动发现并修复问题,在问题进入模型给出的答案的最终响应标记之前。

事实上,OpenAI 每月 20 美元的 ChatGPT Plus 订阅中使用的 O1 模型与其新推出的 ChatGPT Pro 订阅中使用的 O1-Pro 模型基本相同,新订阅的价格是前者的 10 倍(每月 200 美元,这引起了开发者社区的广泛关注);主要区别在于,O1-Pro 在响应之前会思考更长时间,从而生成更多的 COT 逻辑令牌,并且每次响应都会消耗大量的推理计算。

这非常引人注目,因为即使是 Claude3.5 Sonnet 或 GPT4o 的非常长且复杂的提示,给出的上下文超过 400kb,通常也只需不到 10 秒就能开始响应,通常不到 5 秒。而同样的 O1-Pro 提示可能需要 5 分钟以上才能得到响应(尽管 OpenAI 确实会向您展示在您等待的过程中生成的一些“推理步骤”;至关重要的是,OpenAI 决定(大概是出于商业机密的原因)向您隐藏它生成的确切推理标记,而是向您显示这些标记的高度简化摘要)。

您可能可以想象,在很多情况下,准确性至关重要——您宁愿放弃并告诉用户您根本做不到,也不愿给出一个很容易被证明是错误的答案,或者包含幻觉事实或其他似是而非的推理。任何涉及金钱/交易、医疗、法律的事情,仅举几例。

基本上,只要推理成本相对于与 AI 系统交互的人类知识工作者的每小时全额报酬来说微不足道,那么拨打 COT 计算就变得完全是轻而易举的事(主要缺点是它会大大增加响应的延迟,因此在某些情况下,您可能希望通过获得较低延迟但不太准确或正确的响应来实现更快的迭代)。

几周前,人工智能领域传出了一些最令人兴奋的消息,这些消息与 OpenAI 尚未发布的新 O3 模型有关,该模型能够解决大量以前被认为在短期内无法用现有人工智能方法解决的任务。它之所以能够解决这些最困难的问题(其中包括极其困难的“基础”数学问题,即使是高技能的专业数学家也很难解决),是因为 OpenAI 为这些问题投入了大量的计算资源——在某些情况下,花费价值 3000 美元以上的计算能力来解决单个任务(相比之下,单个任务的传统推理成本不太可能超过几美元,而使用没有思维链的常规 Transformer 模型)。

无需人工智能天才就能意识到,这一发展创造了一种完全独立于原始预训练缩放定律的新缩放定律。现在,您仍然希望通过巧妙地利用尽可能多的计算和尽可能多的万亿高质量训练数据来训练出最好的模型,但这只是这个新世界故事的开始;现在,您可以轻松使用大量计算,以非常高的置信度从这些模型中进行推理,或者在尝试解决需要“天才级”推理的极其棘手的问题时,避免所有可能导致普通 LLM 误入歧途的潜在陷阱。

但为什么 Nvidia 能够独享所有的优势呢?
即使你和我一样相信人工智能的未来前景几乎是不可想象的光明,问题仍然存在:“为什么一家公司应该从这项技术中榨取大部分利润?”历史上确实有很多非常重要的新技术改变了世界的案例,但主要的赢家并不是在这个过程的初始阶段看起来最有前途的公司。莱特兄弟的飞机公司如今在许多不同的公司中都以各种形式存在,尽管他们发明并完善了这项技术,但其价值不超过 100 亿美元。尽管福特今天的市值高达 400 亿美元,但这仅是 Nvidia 目前市值的 1.1%。

要理解这一点,首先要真正理解为什么 Nvidia 目前占据了如此大的市场份额。毕竟,他们并不是唯一一家生产 GPU 的公司。AMD 生产的 GPU 质量不错,理论上晶体管数量相当,采用类似的工艺节点制造,等等。当然,它们的速度和先进程度不如 Nvidia 的 GPU,但这并不意味着 Nvidia GPU 的速度快 10 倍或类似。事实上,从每 FLOP 的美元成本来看,AMD GPU 的价格大约是 Nvidia GPU 的一半。

看看其他半导体市场,例如 DRAM 市场,尽管该市场也高度整合,只有 3 家重要的全球参与者(三星、美光、SK-Hynix),但 DRAM 市场的毛利率从周期底部的负值到周期顶部的约 60% 不等,平均在 20% 左右。相比之下,Nvidia 最近几个季度的整体毛利率约为 75%,这受到利润率较低且商品化程度更高的消费 3D 图形类别的拖累。

那么这怎么可能呢?主要原因与软件有关——更好的驱动程序可以在 Linux 上“正常工作”,并且经过高度测试和可靠(不像 AMD,它的 Linux 驱动程序质量低下且不稳定),以及流行库中高度优化的开源代码,例如PyTorch,这些库已经过调整,可以在 Nvidia GPU 上很好地工作。

但事情远不止于此——程序员用来编写针对 GPU 优化的低级代码的编程框架 CUDA 完全由 Nvidia 专有,并且已成为事实上的标准。如果你想雇佣一群非常有才华、知道如何让 GPU 运行得非常快的程序员,并支付他们 65 万美元/年的薪水或具有该特定专业知识的人的现行工资,那么他们很可能会“思考”并使用 CUDA。

除了软件优势之外,Nvidia 的另一大优势是所谓的互连——本质上就是将数千个 GPU 高效连接在一起的带宽,以便它们可以联合起来训练当今领先的基础模型。简而言之,高效训练的关键是始终尽可能充分利用所有 GPU——而不是等待空闲,直到它们收到计算训练过程下一步所需的下一块数据。

带宽要求极高——远高于传统数据中心用例所需的典型带宽。您实际上无法使用传统网络设备或光纤进行这种互连,因为这会带来太多延迟,并且无法提供让所有 GPU 持续忙碌所需的纯每秒 TB 级带宽。

2019 年,Nvidia 做出了一个非常明智的决定,以区区 69 亿美元的价格收购了以色列公司 Mellanox,这次收购为他们带来了业界领先的互连技术。请注意,互连速度与训练过程的关系比推理过程(包括 COT 推理)要密切得多,在训练过程中,你必须同时整合数千个 GPU 的输出,而在推理过程中,你只需要少量 GPU — 你所需要的只是足够的 VRAM 来存储已训练模型的量化(压缩)模型权重。

因此,这些可以说是 Nvidia“护城河”的主要组成部分,也是它能够长期维持如此高利润率的原因(其中也有一个“飞轮”方面,他们积极地将超额利润投入到大量研发中,这反过来帮助他们以比竞争对手更快的速度改进技术,因此他们在原始性能方面始终处于领先地位)。

但正如前面指出的那样,在其他所有条件相同的情况下,客户真正关心的是每美元的性能(包括设备的前期资本支出成本和能源使用,即每瓦性能),尽管 Nvidia 的 GPU 无疑是最快的,但仅以 FLOPS 来衡量,它们并不是最好的性价比。

但问题是,其他所有条件都不一样,AMD 的驱动程序很烂,流行的 AI 软件库在 AMD GPU 上运行效果不佳,在游戏世界之外找不到真正优秀的专门研究 AMD GPU 的 GPU 专家(当市场对 CUDA 专家的需求更大时,他们为什么还要费心呢?),由于 AMD 的互连技术很差,你无法将数千个 GPU 有效地连接在一起——所有这些都意味着 AMD 在高端数据中心领域基本上没有竞争力,而且在短期内似乎也没有很好的前景。

好吧,这一切听起来对 Nvidia 来说都很乐观,对吧?现在您可以明白为什么该股的估值如此之高了!但地平线上还有哪些阴云呢?嗯,我认为很少有值得特别关注的。有些阴云在过去几年里一直潜伏在幕后,但考虑到蛋糕增长的速度,它们太小了,无法产生影响,但它们正准备向上弯曲。其他则是最近才出现的发展(例如,最近两周),可能会极大地改变 GPU 需求增量的近期轨迹。

主要威胁
从非常高的层次来看,你可以这样想:Nvidia 在一个非常小众的领域运营了很长时间;他们的竞争对手非常有限,而且竞争对手的利润并不特别丰厚,增长速度也不够快,无法构成真正的威胁,因为他们没有足够的资本来真正对 Nvidia 这样的市场领导者施加压力。游戏市场规模庞大且不断增长,但利润率并不惊人,年增长率也不特别惊人。

2016-2017 年左右,一些大型科技公司开始加大招聘力度,加大对机器学习和人工智能的投入,但从总体上看,这对任何一家公司来说都不是真正重要的项目,更像是“登月式”研发支出。但随着 2022 年 ChatGPT 的发布,人工智能大竞赛正式拉开帷幕——虽然从发展角度来看,这似乎已经是很久以前的事了,但其实只有两年多一点的时间——这种情况发生了巨大变化。

突然之间,大公司准备以惊人的速度投入数十亿美元。参加Neurips和ICML等大型研究会议的研究人员数量急剧增加。所有以前可能研究金融衍生品的聪明学生都转而研究 Transformers,非执行工程职位(即不管理团队的独立贡献者)的 100 万美元以上的薪酬待遇成为领先 AI 实验室的常态。

改变一艘大型游轮的方向需要一段时间;即使你行动迅速并投入数十亿美元,也需要一年或更长时间才能建成绿地数据中心并订购所有设备(交货时间不断增加)并使其全部设置好并运行。即使是聪明的程序员也需要很长时间才能雇用和入职,然后他们才能真正发挥自己的才能并熟悉现有的代码库和基础设施。

但现在,你可以想象,在这个领域投入了绝对惊人的资本、智力和努力。而 Nvidia 是所有公司中最大的目标,因为他们是今天赚取最大利润的人,而不是在某个假设的未来,那时人工智能将主宰我们的整个生活。

因此,从高层次上看,结论基本上就是“市场找到了出路”;它们找到了替代性的、彻底创新的硬件构建方法,利用全新的想法来绕过有助于支撑 Nvidia 护城河的障碍。

硬件级威胁
例如,Cerebras 所谓的“晶圆级” AI 训练芯片,将整个 300 毫米硅晶圆专用于一个绝对庞大的芯片,该芯片在单个芯片上包含数量级更多的晶体管和内核(请参阅他们最近的博客文章,解释了他们如何解决过去阻碍这种方法在经济上实用的“产量问题”)。

为了更直观地说明这一点,如果将 Cerebras 最新的 WSE-3 芯片与 Nvidia 的旗舰数据中心 GPU H100 进行比较,Cerebras 芯片的总芯片面积为 46,225 平方毫米,而 H100 仅为 814 平方毫米(而 H100 本身按照行业标准被视为一款巨型芯片);这是约 57 倍的倍数!而且,H100 上启用的“流式多处理器”(SM)核心只有 132 个,而 Cerebras 芯片拥有约 900,000 个核心(当然,每个核心都更小,功能也少得多,但相比之下,这仍然是一个几乎不可思议的大数字)。更具体地说,Cerebras 芯片在 AI 环境中的 FLOPS 大约是单个 H100 芯片的 32 倍。由于 H100 的售价接近 4 万美元,因此您可以想象 WSE-3 芯片并不便宜。

那么,这一切为什么重要呢?事实上,Cerebras 并没有试图使用类似的方法与 Nvidia 正面交锋,并试图匹敌 Mellanox 互连技术,而是使用了一种彻底创新的方法来解决互连问题:当所有东西都运行在同一个超大芯片上时,处理器间带宽不再是问题。你甚至不需要拥有相同级别的互连,因为一个巨型芯片可以取代大量的 H100。

Cerebras 芯片在 AI 推理任务中也表现得非常好。事实上,您今天可以在这里免费试用,并使用 Meta 非常受人尊敬的 Llama-3.3-70B 型号。它基本上可以即时响应,每秒约 1,500 个令牌。从这个角度来看,根据与 ChatGPT 和 Claude 的比较,任何超过每秒 30 个令牌的速度对用户来说都感觉相对敏捷,甚至每秒 10 个令牌的速度也足够快,您基本上可以在生成响应时读取响应。

Cerebras 并非孤军奋战,还有其他公司,比如 Groq(不要将其与埃隆·马斯克的 X AI 训练的Grok模型系列混淆)。Groq 采取了另一种创新方法来解决同样的基本问题。他们没有试图直接与 Nvidia 的 CUDA 软件堆栈竞争,而是开发了所谓的“张量处理单元”(TPU),专门用于深度学习模型需要执行的精确数学运算。他们的芯片是围绕一种称为“确定性计算”的概念设计的,这意味着,与操作的确切时间可能有所不同的传统 GPU 不同,他们的芯片每次都以完全可预测的方式执行操作。

这听起来可能只是一个小技术细节,但实际上它对芯片设计和软件开发都有着巨大的影响。由于时间是完全确定的,Groq 可以以传统 GPU 架构无法实现的方式优化其芯片。因此,在过去 6 个多月的时间里,他们一直在使用 Llama 系列模型和其他开源模型展示每秒超过 500 个 token 的推理速度,远远超过了传统 GPU 设置所能达到的速度。与 Cerebras 一样,该产品现已上市,您可以在此处免费试用。

使用具有“推测解码”功能的类似 Llama3 模型,Groq 每秒能够生成 1,320 个令牌,与 Cerebras 相当,远远超过使用常规 GPU 所能达到的速度。现在,您可能会问,当用户似乎对 ChatGPT 非常满意时,实现每秒 1,000 多个令牌有什么意义,因为 ChatGPT 的运行速度不到该速度的 10%。事实是,这确实很重要。当您获得即时反馈时,它可以更快地进行迭代,并且不会像人类知识工作者那样失去专注力。如果您通过 API 以编程方式使用该模型(这越来越多地成为需求的来源),那么它可以启用需要多阶段推理(其中前几个阶段的输出用作提示/推理的连续阶段的输入)或需要低延迟响应的全新应用程序类别,例如内容审核、欺诈检测、动态定价等。

但更根本的是,处理请求的速度越快,循环速度就越快,硬件的繁忙程度也就越高。虽然 Groq 的硬件非常昂贵,单台服务器的价格高达 200 万到 300 万美元,但如果您有足够的需求让硬件一直保持繁忙,那么满足每个请求的成本就会低得多。

和 Nvidia 的 CUDA 一样,Groq 的优势很大一部分来自于他们自己的专有软件堆栈。他们能够采用 Meta、DeepSeek 和 Mistral 等其他公司开发和免费发布的相同开源模型,并以特殊方式分解它们,使它们在特定硬件上的运行速度大大提高。

与 Cerebras 一样,他们采取了不同的技术决策来优化流程的某些特定方面,这使他们能够以完全不同的方式做事。就 Groq 而言,这是因为他们完全专注于推理级计算,而不是训练:他们所有的特殊硬件和软件只有在对已经训练好的模型进行推理时才能提供巨大的速度和效率优势。

但是,如果人们所期待的下一个大扩展定律是推理级计算——如果 COT 模型的最大缺点是必须生成所有这些中间逻辑令牌才能做出响应而导致的高延迟——那么即使是一家只进行推理计算但速度和效率都比 Nvidia 快得多的公司——也可能在未来几年带来严重的竞争威胁。至少,Cerebras 和 Groq 可以蚕食当前股票估值中对 Nvidia 未来 2-3 年收入增长的高预期。

除了这些特别具有创新性、但相对不为人知的初创竞争对手之外,Nvidia 的一些最大客户本身也带来了一些激烈的竞争,这些客户一直在制造专门针对 AI 训练和推理工作负载的定制芯片。其中最知名的可能是谷歌,该公司自 2016 年以来一直在开发自己的专有 TPU。有趣的是,尽管谷歌曾短暂地向外部客户出售 TPU,但过去几年来,谷歌一直在内部使用其所有 TPU,而且它已经拥有第六代TPU 硬件。

亚马逊还一直在开发自己的定制芯片Trainium2和Inferentia2。亚马逊在建设配备数十亿美元 Nvidia GPU 的数据中心的同时,也在投资数十亿美元建设使用这些内部芯片的其他数据中心。他们有一个正在为 Anthropic 上线的集群,该集群拥有超过 40 万个芯片。

亚马逊因完全搞砸了其内部 AI 模型开发而受到大量批评,将大量内部计算资源浪费在最终没有竞争力的模型上,但定制芯片又是另一回事。同样,他们不一定需要他们的芯片比 Nvidia 的更好更快。他们需要的是他们的芯片足够好,但要以盈亏平衡的毛利率来制造它们,而不是 Nvidia 在其 H100 业务上赚取的约 90% 以上的毛利率。

OpenAI 还宣布了打造定制芯片的计划,他们(与微软一起)显然是 Nvidia 数据中心硬件的最大单一用户。似乎这还不够,微软自己也宣布了他们自己的定制芯片!

而苹果,作为全球市值最高的科技公司,多年来一直以其高度创新和颠覆性的定制硅片运营超出预期,其每瓦性能现已完全超越英特尔和 AMD 的 CPU,而性能是移动(手机/平板电脑/笔记本电脑)应用中最重要的因素。多年来,他们一直在制造自己内部设计的 GPU 和“神经处理器”,尽管他们尚未真正展示此类芯片在他们自己的定制应用程序之外的实用性,例如 iPhone 相机中使用的基于软件的高级图像处理。

虽然苹果的重点似乎与其他参与者在移动优先、消费者导向、“边缘计算”方面有些正交,但如果它最终在与 OpenAI 的新合同上投入足够的资金,为 iPhone 用户提供人工智能服务,你必须想象他们有团队正在研究制作自己的定制硅片用于推理/训练(尽管鉴于他们的保密性,你可能永远不会直接知道它!)。

现在,Nvidia 的超大规模客户群呈现出强大的幂律分布,这已经不是什么秘密了,其中少数顶级客户占据了高利润收入的最大份额。当这些 VIP 客户中的每一个都在为 AI 训练和推理构建自己的定制芯片时,人们应该如何看待这项业务的未来?

在考虑所有这些时,你应该记住一件非常重要的事情:Nvidia 在很大程度上是一家基于 IP 的公司。他们不生产自己的芯片。制造这些令人难以置信的设备的真正秘诀可能更多地来自台积电(实际的晶圆厂)和 ASML,后者制造了台积电用来制造这些尖端工艺节点芯片的特殊 EUV 光刻机。这一点至关重要,因为台积电会将他们最先进的芯片卖给任何有足够前期投资并愿意保证一定数量的人。他们不在乎它是用于比特币挖矿 ASIC、GPU、TPU 还是手机 SoC 等。

鉴于 Nvidia 的高级芯片设计师每年的薪水,其中一些最优秀的设计师肯定会被其他科技巨头以足够的现金和股票挖走。一旦他们拥有了团队和资源,他们就可以在 2 到 3 年内设计出创新的芯片(同样,可能还不到 H100 的 50%,但考虑到 Nvidia 的毛利率,还有很大的发挥空间),而且由于有台积电,他们可以使用与 Nvidia 完全相同的工艺节点技术将这些芯片变成真正的硅片。

软件威胁
似乎这些迫在眉睫的硬件威胁还不够严重,过去几年软件领域出现了一些发展,虽然起步缓慢,但现在正在真正发力,可能对 Nvidia CUDA 的软件主导地位构成严重威胁。其中第一个就是 AMD GPU 的糟糕 Linux 驱动程序。还记得我们谈到 AMD 多年来如何莫名其妙地允许这些驱动程序糟糕透顶,尽管这笔钱已经花光了?

有趣的是,臭名昭著的黑客乔治·霍兹(因在十几岁时破解第一代 iPhone 而闻名,目前是自动驾驶初创公司 Comma.ai 和人工智能计算机公司 Tiny Corp 的首席执行官,该公司还制作了开源 tinygrad 人工智能软件框架)最近宣布,他厌倦了处理 AMD 的糟糕驱动程序,并迫切希望能够在他们的 TinyBox 人工智能计算机中利用成本较低的 AMD GPU(有多种版本,其中一些使用 Nvidia GPU,一些使用 AMD GPU)。

事实上,他正在为 AMD GPU 制作自己的定制驱动程序和软件堆栈,而 AMD 自己并没有提供任何帮助。2025 年 1 月 15 日,他通过公司的 X 账户发推文称: “我们距离 AMD 上完全自主的堆栈,即 RDNA3 汇编程序,只差一步之遥。我们有自己的驱动程序、运行时、库和模拟器。(总共约 12,000 行!)”鉴于他的过往记录和技能,他们很可能会在未来几个月内完成所有这些工作,这将为使用 AMD GPU 进行各种应用程序提供许多令人兴奋的可能性,而目前这些应用程序公司都觉得有必要为 Nvidia GPU 付费。

好吧,这只是 AMD 的一个驱动程序,它甚至还没有完成。还有什么呢?好吧,软件方面还有其他几个影响更大的领域。首先,现在许多大型科技公司和整个开源软件社区都在做出巨大的努力,以制作更通用的 AI 软件框架,而 CUDA 只是众多“编译目标”之一。

也就是说,你使用更高级别的抽象来编写软件,系统本身可以自动将这些高级构造转换为经过精心调校的低级代码,这些代码在 CUDA 上运行得非常好。但由于它是在这种更高级别的抽象上完成的,因此它可以轻松地编译成低级代码,这些代码可以在来自各种提供商的许多其他 GPU 和 TPU 上运行得非常好,例如来自各大科技公司管道中的大量定制芯片。

这些框架最著名的例子是 MLX(主要由 Apple 赞助)、Triton(主要由 OpenAI 赞助)和 JAX(由 Google 开发)。MLX 特别有趣,因为它提供了一个类似 PyTorch 的 API,可以在 Apple Silicon 上高效运行,展示了这些抽象层如何使 AI 工作负载能够在完全不同的架构上运行。与此同时,Triton 变得越来越受欢迎,因为它允许开发人员编写高性能代码,这些代码可以编译为在各种硬件目标上运行,而无需了解每个平台的低级细节。

这些框架允许开发人员使用高性能抽象编写一次代码,然后自动针对大量平台 - 这听起来不是一种更好的做事方式吗?它会为您在实际运行代码的方式方面提供更多的灵活性?

在 20 世纪 80 年代,所有最流行、最畅销的软件都是用手工调整的汇编语言编写的。例如,PKZIP 压缩实用程序是手工编写的,以最大限度地提高速度,以至于用标准 C 编程语言编写并使用当时最好的优化编译器编译的编码版本运行速度可能只有手工调整的汇编代码的一半。其他流行软件包(如 WordStar、VisiCalc 等)也是如此。

随着时间的推移,编译器变得越来越好,每当 CPU 架构发生变化时(例如,从英特尔发布 486,然后是奔腾,等等),手动编写的汇编程序通常必须被抛弃并重写,只有最聪明的程序员才能做到这一点(有点像 CUDA 专家在就业市场上与“普通”软件开发人员处于不同的水平)。最终,事情趋于一致,以至于手动编写的汇编程序的速度优势被能够使用高级语言(如 C 或 C++)编写代码的灵活性大大抵消,在这些语言中,您依靠编译器使程序在给定的 CPU 上真正以最佳方式运行。

如今,很少有新代码是用汇编语言编写的。我相信,人工智能训练和推理代码最终也会发生类似的转变,原因也类似:计算机擅长优化,而灵活性和开发速度正日益成为更重要的因素——尤其是如果它还能让你大幅节省硬件费用,因为你不需要继续支付“CUDA 税”,而这为 Nvidia 带来了 90% 以上的利润。

您可能会看到事情发生巨大变化的另一个领域是,CUDA 最终可能会成为一种高级抽象 - 一种类似于Verilog 的“规范语言” (用作描述芯片布局的行业标准),熟练的开发人员可以使用它来描述涉及大规模并行性的高级算法(因为他们已经熟悉它,它构造得非常好,它是通用语言等),但不是像平常一样将该代码编译用于 Nvidia GPU,而是可以将其作为源代码输入到 LLM 中,然后将其移植到新的 Cerebras 芯片、新的 Amazon Trainium2 或新的 Google TPUv6 等可以理解的任何低级代码中。这并不像你想象的那么遥远;使用 OpenAI 最新的 O3 模型,它可能已经触手可及,并且肯定会在一两年内普遍实现。

理论上的威胁
也许之前提到的最令人震惊的发展发生在最近几周。这则新闻彻底震撼了人工智能世界,尽管主流媒体上完全没有报道,但它却主导了推特上知识渊博人士的讨论:一家名为 DeepSeek 的中国小型初创公司发布了两个新模型,它们的性能水平基本上与 OpenAI 和 Anthropic 的最佳模型相当(超越了 Meta Llama3 模型和其他较小的开源模型参与者,如 Mistral)。这些模型被称为DeepSeek-V3(基本上是他们对 GPT-4o 和 Claude3.5 Sonnet 的回答)和DeepSeek-R1(基本上是他们对 OpenAI 的 O1 模型的回答)。

为什么这一切如此令人震惊?首先,DeepSeek 是一家小型中国公司,据报道员工人数不到 200 人。据说他们最初是一家类似于 TwoSigma 或 RenTec 的量化交易对冲基金,但在习近平打击该领域后,他们利用自己的数学和工程能力转向人工智能研究。谁知道这些是真的还是他们只是中共或中国军方的某种幌子。但事实是,他们已经发布了两份非常详细的技术报告,分别是 DeepSeek-V3 和 DeepSeek-R1。

这些都是很繁琐的技术报告,如果你对线性代数了解不多,可能看不懂太多。但你真正应该尝试的是在这里下载 AppStore 上的免费 DeepSeek 应用,并使用 Google 帐户登录并试用(你也可以在这里在 Android 上安装它),或者简单地在浏览器中在台式电脑上试用它。确保选择“DeepThink”选项以启用思路链(R1 模型),并要求它用简单的术语解释技术报告的部分内容。

这将同时向你展示一些重要的事情:

首先,这个模型绝对合法。AI 基准测试中有很多 BS,这些基准测试经常被操纵,使得模型在基准测试中表现很好,但在实际测试中却很糟糕。在这方面,谷歌无疑是最糟糕的,他们不断吹嘘他们的 LLM 有多棒,但他们在任何实际测试中的表现都非常糟糕,甚至无法可靠地完成最简单的任务,更不用说具有挑战性的编码任务了。这些 DeepSeek 模型不是这样的——响应是连贯的、令人信服的,绝对与 OpenAI 和 Anthropic 的响应处于同一水平。

第二,DeepSeek 不仅在模型质量方面取得了长足进步,更重要的是在模型训练和推理效率方面也取得了长足进步。通过非常接近硬件,并将一些独特、非常巧妙的优化结合在一起,DeepSeek 能够以显著更高效的方式使用 GPU 训练这些令人难以置信的模型。根据一些测量,其效率比其他前沿模型高出约 45 倍。DeepSeek 声称训练 DeepSeek-V3 的全部成本略高于 500 万美元。按照 OpenAI、Anthropic 等的标准,这绝对不算什么,早在 2024 年,它们单个模型的训练成本就已超过 1 亿美元。

这怎么可能?这家小小的中国公司怎么能完全抢走我们领先的人工智能实验室里所有最聪明的人的风头,而这些实验室拥有比我们多 100 倍的资源、员工人数、工资单、资本、GPU 等?拜登对 GPU 出口的限制难道不应该让中国陷入困境吗?好吧,细节相当技术性,但我们至少可以从高层次描述它们。也许事实证明,DeepSeek 相对贫乏的 GPU 处理能力是使其更具创造力和聪明才智的关键因素,需要是发明之母。

一项重大创新是他们复杂的混合精度训练框架,使他们能够在整个训练过程中使用 8 位浮点数 (FP8)。大多数西方人工智能实验室使用“全精度”32 位数字进行训练(这基本上指定了描述人工神经元输出的可能等级数;FP8 中的 8 位让您可以存储比您预期的更广泛的数字 - 它不仅限于 256 个不同的相等大小的量级,就像您使用常规整数获得的那样,而是使用巧妙的数学技巧来存储非常小和非常大的数字 - 虽然自然精度低于 32 位。)主要的权衡是,虽然 FP32 可以在巨大的范围内以令人难以置信的精度存储数字,但 FP8 牺牲了部分精度以节省内存并提高性能,同时仍保持足够的精度以满足许多人工智能工作负载。

DeepSeek 通过开发一个巧妙的系统解决了这个问题,该系统将数字分解成小块以进行激活,将块分解为权重,并在网络的关键点策略性地使用高精度计算。与其他以高精度进行训练然后进行压缩(在此过程中会损失一些质量)的实验室不同,DeepSeek 的原生 FP8 方法意味着他们可以节省大量内存而不会影响性能。当您在数千个 GPU 上进行训练时,每个 GPU 的内存需求大幅减少意味着总体上需要的 GPU 数量会大大减少。

另一个重大突破是他们的多标记预测系统。大多数基于 Transformer 的 LLM 模型通过预测下一个标记(一次一个标记)来进行推理。DeepSeek 想出了如何预测多个标记,同时保持单标记预测的质量。他们的方法在这些额外的标记预测上实现了大约 85-90% 的准确率,这有效地将推理速度提高了一倍,而没有牺牲太多质量。巧妙之处在于他们保持了预测的完整因果链,因此模型不仅仅是猜测——它正在做出结构化的、上下文相关的预测。

他们最具创新性的开发之一就是所谓的多头潜在注意力 (MLA)。这是他们在处理所谓的键值索引方面取得的突破,键值索引基本上就是 Transformer 架构中各个标记在注意力机制中的表示方式。虽然这在技术上有点太高级了,但可以说这些键值索引是 VRAM 在训练和推理过程中的一些主要用途,也是您需要同时使用数千个 GPU 来训练这些模型的部分原因——每个 GPU 最多有 96 GB 的 VRAM,而这些索引会把这些内存全部吃光。

他们的 MLA 系统找到了一种存储这些索引的压缩版本的方法,这种方法可以在占用更少内存的情况下捕获基本信息。最妙的是,这种压缩直接内置在模型的学习方式中 — 这不是他们需要做的某个单独步骤,而是直接内置在端到端训练管道中。这意味着整个机制是“可微分的”,并且能够直接使用标准优化器进行训练。所有这些东西都有效,因为这些模型最终会找到比所谓的“环境维度”低得多的底层数据表示。因此,存储完整的 KV 索引是一种浪费,尽管这基本上是其他人所做的。

您不仅会因为存储了比所需多得多的数字而浪费大量空间,从而大大提高训练内存占用和效率(再次大幅减少训练世界级模型所需的 GPU 数量),而且实际上还可以提高模型质量,因为它可以充当“正则化器”,迫使模型关注真正重要的东西,而不是使用浪费的容量来适应训练数据中的噪音。因此,您不仅可以节省大量内存,而且模型甚至可能表现更好。至少,您不会因为节省大量内存而遭受性能的大幅下降,这通常是您在 AI 训练中面临的权衡。

他们还通过 DualPipe 算法和自定义通信内核在 GPU 通信效率方面取得了重大进展。该系统智能地重叠计算和通信,仔细平衡这些任务之间的 GPU 资源。他们只需要大约 20 个 GPU 的流式多处理器 (SM) 进行通信,其余的则用于计算。结果是 GPU 利用率远高于典型的训练设置。

他们做的另一件非常聪明的事情是使用所谓的混合专家 (MOE) Transformer 架构,但在负载平衡方面进行了关键创新。您可能知道,AI 模型的大小或容量通常以模型包含的参数数量来衡量。参数只是一个存储模型某些属性的数字;特定人工神经元相对于另一个人工神经元的“权重”或重要性,或特定标记在其上下文中的重要性(在“注意机制”中),等等。

Meta 最新的 Llama3 模型有几种大小,例如:10 亿参数版本(最小)、70B 参数模型(最常部署的模型),甚至还有 405B 参数的大型模型。对于大多数用户来说,这种最大的模型实用性有限,因为您需要在计算机中安装价值数万美元的 GPU 才能以可接受的速度运行推理,至少如果您部署的是简单的全精度版本。因此,这些开源模型在现实世界中的大多数使用和兴奋点都在 8B 参数或高度量化的 70B 参数级别,因为这正是消费级 Nvidia 4090 GPU 可以容纳的,而您现在可以以不到 1,000 美元的价格购买它。

那么,为什么这些很重要呢?从某种意义上说,参数数量和精度可以告诉你模型内部存储了多少原始信息或数据。请注意,我说的不是推理能力,也不是模型的“智商”:事实证明,在解决复杂的逻辑问题、证明平面几何定理、SAT 数学问题等方面,即使参数数量出奇地少的模型也能表现出非凡的认知性能。

但这些小型模型不一定能告诉你司汤达每部小说中每个情节转折的每个方面,而真正的大型模型却有可能做到这一点。这种极端知识水平的“代价”是,模型变得非常难以训练和推理,因为你总是需要同时将这 405B 个参数中的每一个(或任何参数数量)存储在 GPU 的 VRAM 中,以便对模型进行任何推理。

MOE 模型方法的优点在于,您可以将大模型分解为一组较小的模型,每个模型都了解不同的、不重叠(至少是完全不重叠)的知识。DeepSeek 在这方面的创新是开发他们所谓的“无辅助损失”负载平衡策略,该策略可保持专家的有效利用,而不会出现负载平衡通常带来的性能下降。然后,根据推理请求的性质,您可以智能地将推理路由到该组较小模型中最能回答该问题或解决该任务的“专家”模型。

你可以粗略地把它想象成一个由拥有各自专业知识领域的专家组成的委员会:一个可能是法律专家,另一个可能是计算机科学专家,另一个可能是商业战略专家。所以如果有关于线性代数的问题,你不会把它交给法律专家。这当然是一个非常宽泛的类比,在实践中它实际上并不是这样运作的。

这种方法的真正优势在于,它允许模型包含大量知识,而不会非常笨重,因为即使所有专家的参数总数很高,但在任何给定时间,这些参数中只有一小部分是“活跃的”,这意味着您只需将这一小部分权重存储在 VRAM 中即可进行推理。在 DeepSeek-V3 的情况下,他们有一个绝对庞大的 MOE 模型,具有671B 参数,因此它比最大的 Llama3 模型还要大得多,但在任何给定时间,这些参数中只有 37B 是活跃的——足以装入两个消费级 Nvidia 4090 GPU 的 VRAM(总成本低于 2,000 美元),而不需要一个或多个 H100 GPU,每个 GPU 的成本约为 40,000 美元。

据传 ChatGPT 和 Claude 都使用了 MoE 架构,一些泄露的消息表明 GPT-4 共有 1.8 万亿个参数,分布在 8 个模型中,每个模型包含 2200 亿个参数。尽管这比试图将所有 1.8 万亿个参数放入 VRAM 中要容易得多,但由于使用了大量的内存,它仍然需要多个 H100 级 GPU 才能运行该模型。

除了已经描述的内容之外,技术论文还提到了其他几项关键优化。这些包括极其节省内存的训练框架,该框架避免了张量并行性,在反向传播期间重新计算某些操作而不是存储它们,并在主模型和辅助预测模块之间共享参数。所有这些创新的总和,当层层叠加在一起时,已经导致了约 45 倍的效率改进数字,这些数字在网上流传开来,我完全愿意相信这些数字是正确的。

一个非常有力的指标就是 DeepSeek 的 API 成本:尽管 DeepSeek 的模型性能几乎是同类中最好的,但通过其 API 进行推理请求的费用比 OpenAI 和 Anthropic 的同类模型低 95%左右。从某种意义上说,这有点像将 Nvidia 的 GPU 与竞争对手的新定制芯片进行比较:即使它们不是那么好,但性价比要高得多,因此根据应用程序的不同,它仍然是轻而易举的事,只要您可以限定性能水平并证明它足以满足您的要求,并且 API 可用性和延迟足够好(到目前为止,尽管由于这些新模型的性能而出现了令人难以置信的需求激增,但 DeepSeek 的基础设施仍然表现得如此出色,这让人们感到惊讶)。

但与 Nvidia 的情况不同,Nvidia 的成本差异是其数据中心产品获得 90% 以上的垄断毛利率的结果,而 DeepSeek API 相对于 OpenAI 和 Anthropic API 的成本差异可能只是因为它们的计算效率高出近 50 倍(在推理方面甚至可能更高——~45 倍的效率是在训练方面)。事实上,甚至不清楚 OpenAI 和 Anthropic 是否从 API 服务中获得了丰厚的利润——他们可能更感兴趣的是收入增长,以及通过分析收到的所有 API 请求来收集更多数据。

在继续之前,如果我不提一下,那我就太失职了,很多人都在猜测 DeepSeek 在训练这些模型所用的 GPU 数量和 GPU 小时数上撒了谎,因为他们实际上拥有的 H100 数量远远超过了这些卡的出口限制,他们不想给自己惹麻烦,也不想损害自己获得更多这些卡的机会。虽然这当然是可能的,但我认为他们更有可能说的是实话,他们只是通过在训练和推理方面极其聪明和富有创造力的方法才能够取得这些令人难以置信的成果。他们解释了他们是如何做事的,我怀疑他们的结果被其他各个实验室的其他研究人员广泛复制和证实只是时间问题。

能够真正思考的模型
较新的 R1 模型和技术报告甚至可能更加令人震惊,因为他们能够在 Chain-of-thought 上击败 Anthropic,现在基本上是除 OpenAI 之外唯一能够大规模使用这项技术的公司。但请注意,O1 预览模型是 OpenAI 于 2024 年 9 月中旬发布的。这只是大约 4 个月前的事!你绝对必须记住的一点是,与 OpenAI 不同,OpenAI 对这些模型在低水平上的实际工作方式非常保密,并且不会向除微软等签署了重要保密协议的合作伙伴以外的任何人透露实际的模型权重,这些 DeepSeek 模型都是完全开源的,并且获得了许可。他们发布了非常详细的技术报告,解释了它们的工作原理,以及任何人都可以查看和尝试复制的代码。

借助 R1,DeepSeek 基本上破解了人工智能的圣杯之一:让模型逐步推理,而无需依赖大量监督数据集。他们的 DeepSeek-R1-Zero 实验展示了一些非凡的成果:使用纯强化学习和精心设计的奖励函数,他们设法让模型完全自主地开发复杂的推理能力。这不仅仅是解决问题——模型有机地学会了生成长链思维、自我验证其工作,并为更难的问题分配更多的计算时间。

这里的技术突破是他们新颖的奖励建模方法。他们没有使用可能导致“奖励黑客”的复杂神经奖励模型(即模型找到虚假的方法来提高奖励,但实际上并不会带来更好的现实世界模型性能),而是开发了一个巧妙的基于规则的系统,该系统将准确性奖励(验证最终答案)与格式奖励(鼓励结构化思维)相结合。事实证明,这种更简单的方法比其他人尝试过的基于过程的奖励模型更强大、更可扩展。

特别有趣的是,在训练过程中,他们观察到了所谓的“顿悟时刻”,即模型在遇到不确定性时自发学会在中途修改其思维过程的阶段。这种突发行为并非明确编程;它是模型与强化学习环境之间的交互自然产生的。模型会自行停止,标记其推理中的潜在问题,并以不同的方法重新启动,而所有这些都无需经过明确训练。

完整的 R1 模型基于这些见解,在应用强化学习技术之前引入了他们所谓的“冷启动”数据(一小组高质量示例)。他们还解决了推理模型的主要挑战之一:语言一致性。之前对思维链推理的尝试经常导致模型混合语言或产生不连贯的输出。DeepSeek 通过在强化学习训练期间巧妙的语言一致性奖励解决了这个问题,以较小的性能损失换取更易读和更一致的输出。

结果令人难以置信:在最具挑战性的高中数学竞赛之一 AIME 2024 上,R1 的准确率达到了 79.8%,与 OpenAI 的 O1 模型相当。在 MATH-500 上,它的准确率达到了 97.3%,在 Codeforces 编程竞赛中达到了第 96.3 百分位。但也许最令人印象深刻的是,他们设法将这些功能提炼到更小的模型中:他们的 14B 参数版本比许多大小为其几倍的模型表现更好,这表明推理能力不仅与原始参数数量有关,还与如何训练模型来处理信息有关。

后果
Twitter 和 Blind(一家企业谣言网站)最近有传言称,这些模型让 Meta 措手不及,而且它们的表现比仍在训练中的新 Llama4 模型更好。显然,Meta 内部的 Llama 项目吸引了高级技术主管的大量关注,因此他们有大约 13 个人从事 Llama 项目,每个人每年的总薪酬都高于表现优于它的 DeepSeek-V3 模型的总培训成本。你如何面不改色地向扎克伯格解释这一点?扎克伯格为何在向 Nvidia 投入数十亿美元购买 10 万台 H100 的同时保持微笑,而一个更好的模型仅用 2000 台 H100 就能训练出来,价格略高于 500 万美元?

但你最好相信,Meta 和其他所有大型人工智能实验室都在拆开这些 DeepSeek 模型,研究这些技术报告中的每一个字和他们发布的开源代码的每一行,拼命地试图将这些相同的技巧和优化集成到他们自己的训练和推理管道中。那么这一切的影响是什么呢?好吧,天真地认为,训练和推理计算的总需求应该除以某个大数字。也许不是除以 45,但可能是除以 25 甚至 30?因为无论你在这些模型发布之前认为你需要什么,现在都少了很多。

现在,乐观主义者可能会说:“你谈论的只是一个比例常数,一个倍数。当你处理指数增长曲线时,这些东西很快就会被淘汰,最终变得不那么重要了。”这确实有道理:如果人工智能真的像我预期的那样具有变革性,如果这项技术在现实世界中的效用以万亿为单位来衡量,如果推理时间计算是新的扩展法则,如果我们将拥有大批人形机器人四处奔波,不断进行大量推理,那么也许增长曲线仍然如此陡峭和极端,而 Nvidia 拥有足够大的领先优势,那么它仍然会成功。

但 Nvidia 为使这一估值合理,将未来几年的大量好消息纳入了定价,当你开始将所有这些因素叠加在一起,形成一幅完整的拼图时,我开始对花费约 20 倍于 2025 年预计销售额的价格购买其股票感到极度不安。如果你看到销售增长略有放缓,会发生什么?如果结果是 85% 而不是 100% 以上,会发生什么?如果毛利率从 75% 降至 70% 左右,对于一家半导体公司来说,这仍然高得离谱,会发生什么?

总结
从高层来看,英伟达面临着前所未有的竞争威胁,这使得其 20 倍预期销售额和 75% 毛利率的溢价估值越来越难以证明其合理性。该公司在硬件、软件和效率方面的所谓护城河都出现了令人担忧的裂痕。全世界——地球上数以千计最聪明的人,在数十亿美元的资本资源的支持下——正试图从各个角度攻击他们。

在硬件方面,Cerebras 和 Groq 的创新架构表明,NVIDIA 的互连优势(其数据中心主导地位的基石)可以通过彻底的重新设计来规避。Cerebras 的晶圆级芯片和 Groq 的确定性计算方法无需 NVIDIA 复杂的互连解决方案即可提供令人信服的性能。更传统的是,每个主要的 NVIDIA 客户(谷歌、亚马逊、微软、Meta、苹果)都在开发定制芯片,这可能会蚕食高利润数据中心的收入。这些不再是实验项目——仅亚马逊一家就为 Anthropic 构建了拥有超过 400,000 个定制芯片的庞大基础设施。

软件护城河似乎同样脆弱。MLX、Triton 和 JAX 等新的高级框架正在抽象化 CUDA 的重要性,而改进 AMD 驱动程序的努力可能会解锁更便宜的硬件替代品。向更高级别抽象的趋势反映了汇编语言如何让位于 C/C++,这表明 CUDA 的主导地位可能比想象的更为短暂。最重要的是,我们看到了 LLM 驱动的代码转换的出现,它可以自动将 CUDA 代码移植到任何硬件目标上运行,从而有可能消除 NVIDIA 最强大的锁定效应之一。

也许最令人震惊的是 DeepSeek 最近的效率突破,以大约 1/45 的计算成本实现了可比的模型性能。这表明整个行业一直在大量过度配置计算资源。再加上通过思维链模型出现的更高效的推理架构,对计算的总体需求可能比目前的预测要低得多。这里的经济效益是令人信服的:当 DeepSeek 能够达到 GPT-4 级别的性能,同时 API 调用费用降低 95% 时,这表明要么 NVIDIA 的客户在浪费资金,要么利润率必须大幅下降。

台积电将为任何资金充足的客户生产具有竞争力的芯片,这一事实自然限制了 NVIDIA 的架构优势。但从根本上讲,历史表明,市场最终会找到绕过人为瓶颈的方法,从而产生超额利润。这些威胁加在一起表明,NVIDIA 在维持其当前增长轨迹和利润率方面面临的困难要比其估值所暗示的要大得多。有五个不同的攻击向量——架构创新、客户垂直整合、软件抽象、效率突破和制造民主化——至少有一个成功对 NVIDIA 的利润率或增长率产生重大影响的可能性似乎很高。按目前的估值,市场还没有将这些风险中的任何一个计入价格。

希望您喜欢阅读这篇文章。如果您在对冲基金工作,并且有兴趣就 NVDA 或其他 AI 相关股票或投资主题向我咨询,我已经注册成为GLG和Coleman Research的专家。