Leveraging Chain‑of‑Thought Network Effects To Compete With Open Source Models
4/22/2025
In 2023, an internal memo leaked from Google claimed that neither Google nor OpenAI could compete with open-source AI models. This sparked a discussion about the future of AI development. Some analysts argued the future of AI would mirror that of open-source software: companies initially develop proprietary software but gradually concede to open-source alternatives over decades. This dynamic played out starting in the 1980s, culminating when Microsoft, by then a major cloud service provider, quietly incorporated the Linux kernel into its stack. The flaw in this analogy is that while AI and software development share properties such as openness, they differ in fundamental ways. For instance, open-source software projects like the GNU Compiler Collection are nearly four decades old, older than many working programmers. In contrast, AI models, both open-source and proprietary, are often deprecated as costly and outdated within six months. This mismatch has also caused confusion. When DeepSeek published its open-source reasoning model R1, the stock market reacted with a sharp dip before correcting; financial analysts likely misread the implications of a freely downloadable open-source model. Those following the AI research community, however, already knew that proprietary AI models maintain only about an 18-month lead over open-source alternatives.
Currently, frontier AI companies operate knowing their models will likely be obsolete within a year. Their rationale is that demonstrating state-of-the-art capabilities secures continued funding, propelling them toward the ultimate goal: real-world AI applications that yield significant economic and scientific breakthroughs. While frontier AI firms frequently replace their models, they have focused on building a committed, paying user base. So far, AI models have struggled to escape the gravity well of commodity pricing. Models with similar benchmark performance are released weekly, letting users switch between services at minimal cost by changing an API key. Companies attempt to differentiate themselves by building internal infrastructure that boosts inference performance without publishing their methods. However, any such advantage is eventually reproduced independently by academia or competing firms.
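To see how low that switching cost really is, here is a minimal sketch, assuming both vendors expose an OpenAI-compatible chat completions endpoint; the base URLs and model names are hypothetical placeholders, not real services.

```python
# Minimal sketch of switching inference providers by changing an API key and base URL.
# Assumes both providers expose an OpenAI-compatible chat completions API;
# the URLs and model names below are hypothetical placeholders.
from openai import OpenAI

PROVIDERS = {
    "provider_a": {"base_url": "https://api.provider-a.example/v1", "model": "reasoner-large"},
    "provider_b": {"base_url": "https://api.provider-b.example/v1", "model": "open-r1-clone"},
}

def ask(provider: str, api_key: str, prompt: str) -> str:
    """Send one prompt to the chosen provider and return the text reply."""
    cfg = PROVIDERS[provider]
    client = OpenAI(api_key=api_key, base_url=cfg["base_url"])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Switching vendors is a one-argument change:
# ask("provider_b", OTHER_KEY, prompt) instead of ask("provider_a", KEY, prompt).
```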
The release of chain-of-thought reasoning models has changed how companies allocate compute. Instead of dedicating most compute to a single training run followed by cheap, short inference, they now trade training-time compute for inference-time compute to achieve better performance, as the comparison below summarizes.
| Aspect | Standard (“raw”) inference | Chain-of-Thought inference | Why it differs |
|---|---|---|---|
| Up-front training FLOPs | High – model needs to internalize reasoning during pre-training | Lower – you rely on runtime reasoning instead | CoT shifts some reasoning compute out of training |
| Per-query FLOPs | Low | High – extra forward passes to “think out loud” | Each reasoning step is an additional decode cycle |
| Latency | Short | Longer | More tokens ⇒ more wall-clock time |
| Tokens generated | Few | Many – prompt + steps + answer | Intermediate thoughts are streamed back |
| Typical accuracy | Baseline | +5–15 pp on reasoning tasks | Verbalized thoughts guide the model |
| Cost per 1k queries | Low | High | More tokens × more milliseconds × more GPU time |
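To make the cost row concrete, here is a back-of-the-envelope sketch; the token counts and per-token price are hypothetical, chosen only to show the shape of the calculation, not measured values from any provider.

```python
# Back-of-the-envelope cost comparison between raw and CoT inference.
# All numbers are hypothetical illustrations, not measured values.

PRICE_PER_OUTPUT_TOKEN = 0.002 / 1_000  # assumed: $0.002 per 1,000 generated tokens

def cost_per_1k_queries(tokens_per_query: int) -> float:
    """Dollar cost of serving 1,000 queries at the assumed output-token price."""
    return 1_000 * tokens_per_query * PRICE_PER_OUTPUT_TOKEN

raw_tokens = 20    # short direct answer
cot_tokens = 300   # restated prompt + reasoning steps + answer

print(f"raw: ${cost_per_1k_queries(raw_tokens):.2f} per 1k queries")  # $0.04
print(f"CoT: ${cost_per_1k_queries(cot_tokens):.2f} per 1k queries")  # $0.60
```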
When researchers initially developed Chain-of-Thought prompting, the primary goal was simply to give traditional transformer-based models reasoning abilities. A key advantage open-source models currently possess is rapid scalability. For instance, when DeepSeek released its R1 model, service providers integrated it into their APIs within approximately 48 hours. Competing with this level of scalability is challenging, especially considering the even cheaper alternative of running models internally on private infrastructure. This is where proprietary AI companies can differentiate themselves: by offering the opposite of this fragmented scaling model. Consolidating users of the same model architecture onto unified infrastructure lets companies capitalize on the often repetitive nature of user prompts.
This consolidation enables a powerful optimization: Chain-of-Thought (CoT) trace reuse. Since CoT involves generating intermediate reasoning steps, and many user queries follow similar logical paths, the "thought process" generated for one user can often be cached and partially or fully reused for subsequent, similar queries from other users. This significantly reduces redundant computation, lowering inference costs and latency for common requests.
Example: Reusing CoT Steps
CoT generates intermediate reasoning steps. If these steps are common across queries, they can be cached and reused.
- Query 1: "How many legs does a spider have?"
  - CoT Steps: [Identify animal: spider] -> [Recall leg count: 8] -> [Answer: 8]
  - (Cache [Recall leg count: 8])
- Query 2: "Do spiders have 8 legs?"
  - CoT Steps: [Identify animal: spider] -> [Recall leg count: 8] (Reused!) -> [Compare: 8 == 8] -> [Answer: Yes]

Reusing the [Recall leg count: 8] step saves computation for Query 2.
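A minimal sketch of what such a step cache could look like, assuming reasoning steps can be normalized into hashable keys; the step names and cache layout are illustrative, not any provider's actual implementation.

```python
# Illustrative sketch of Chain-of-Thought step caching.
# Assumes reasoning steps can be normalized into hashable keys; the steps and
# cache layout are hypothetical, not a description of any production system.

cot_cache: dict[str, str] = {}

def run_step(step_key: str, compute_step) -> str:
    """Return a cached result for a reasoning step, computing it only on a miss."""
    if step_key in cot_cache:
        return cot_cache[step_key]   # reuse the earlier trace, no new decode cycle
    result = compute_step()          # the expensive model call happens only once
    cot_cache[step_key] = result
    return result

# Query 1 pays for the step; Query 2 reuses it essentially for free.
run_step("recall_leg_count:spider", lambda: "8")  # computed
run_step("recall_leg_count:spider", lambda: "8")  # served from cache
```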
Furthermore, not only can thought traces be cached; during off-peak hours, collected traces can be mined to mint novel tokens that save inference time.
Example: Minting and Reusing a Specialized Reasoning Token
Let's illustrate how the "Developing Specialized Reasoning Tokens" concept works:
1. First time (no token yet) — model has to think out loud
Q: If all birds have feathers and a sparrow is a bird, does the sparrow have feathers?
<think>
1. All birds → feathers. (premise)
2. Sparrow → bird. (premise)
3. Therefore sparrow → feathers. (modus-ponens chain)
</think>
A: Yes. The sparrow has feathers.
2. We mint a novel token after observing that pattern
- Name: [MP_CHAIN]
- Meaning: “Apply two-step modus-ponens and state the conclusion.” (Stored in the retrieval layer with its embedding.)
3. Future queries can reuse the token — no verbose CoT needed
Q: A robin is a bird. All birds have feathers. [MP_CHAIN]
A: Yes. The robin has feathers.
What happened:

| Stage | Tokens generated | Latency | Accuracy |
|---|---|---|---|
| 1. Explicit CoT | ~50 | Slow | ✓ |
| 3. With [MP_CHAIN] | ~10 | Fast | ✓ (same) |
The heavy reasoning FLOPs were paid once during step 1; every later call just injects the 1-token shortcut.
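A minimal sketch of the offline minting pass under two assumptions: reasoning traces are stored as sequences of step identifiers, and any pattern that recurs often enough is promoted to a single shortcut token. The frequency threshold and token names are hypothetical, not a published algorithm.

```python
# Illustrative offline pass that mints shortcut tokens from collected CoT traces.
# Traces are assumed to be stored as tuples of step identifiers; the threshold
# and token names are hypothetical choices, not a published method.
from collections import Counter

def mint_tokens(traces: list[tuple[str, ...]], min_count: int = 100) -> dict[tuple[str, ...], str]:
    """Map frequently recurring reasoning patterns to newly minted shortcut tokens."""
    pattern_counts = Counter(traces)
    vocabulary: dict[tuple[str, ...], str] = {}
    for pattern, count in pattern_counts.items():
        if count >= min_count:
            # A new reserved token stands in for the whole multi-step pattern.
            vocabulary[pattern] = f"[MINTED_{len(vocabulary)}]"
    return vocabulary

# A two-step modus-ponens pattern seen often enough gets its own token, which
# future prompts can inject instead of regenerating ~50 reasoning tokens.
traces = [("all_X_have_Y", "Z_is_X", "therefore_Z_has_Y")] * 150
print(mint_tokens(traces))
# {('all_X_have_Y', 'Z_is_X', 'therefore_Z_has_Y'): '[MINTED_0]'}
```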
Interestingly, since the advent of Chain-of-Thought transformer models, the reported doubling time for agentic capabilities has decreased from about 7 months to 4 months. While this acceleration wasn't necessarily anticipated by the original designers, perhaps the technique of generating reusable novel tokens to save inference time without a performance trade-off could steepen this improvement curve further, leading to even shorter doubling times.
In such a scaling environment, the most advanced models might not belong to nimble open-source developers. Even after a model is published, optimizing it with techniques like novel token reuse requires constant maintenance and centralized infrastructure to store and manage these tokens. This inherently favors large-scale deployments where many users run inference on shared infrastructure. Ultimately, the advantage may lie with providers who can cultivate the strongest network effects among their users, continuously refining the model based on collective usage patterns. To conclude, the research that introduced chain-of-thought reasoning aimed only at improving performance, but the technique now opens an additional scaling dimension: by scaling the number of users, models improve in both quality and efficiency.