
The AI Industry Is About to Bifurcate. Here's Our Bet
Models like Claude Mythos will coexist with small, specialized ones. They serve fundamentally different purposes. We're building for the side that keeps regulated industries running.

Yesterday, Anthropic announced Claude Mythos Preview. It scores 93.9% on SWE-bench Verified, 97.6% on the USA Mathematical Olympiad, and it autonomously discovered thousands of zero-day vulnerabilities across every major operating system and web browser. A 27-year-old bug in OpenBSD. A 16-year-old flaw in FFmpeg. It even chained four vulnerabilities together to escape a browser sandbox on its own. The model is so capable that Anthropic decided not to sell it. They restricted access to a handful of cybersecurity partners through a program called Project Glasswing, backed by $100 million in usage credits, and they have been briefing federal officials about the risks.
It is, by every measurable dimension, the most powerful AI model ever built. And it has absolutely nothing to do with what most enterprises actually need.
This is not a criticism. It is an observation about where we are, and where the industry is going. I think we are witnessing the beginning of a deep structural split in AI. Two fundamentally different kinds of compute, serving fundamentally different purposes, for fundamentally different buyers. And at bem, we have placed our entire company on one side of that split.
Two Kinds of Compute
On one side, you have the generalist frontier. Multi-trillion parameter models. Enormous context windows. Capabilities that span from mathematical proofs to autonomous code exploitation. Claude Mythos scores 94.5% on GPQA Diamond, a graduate-level science reasoning benchmark. It hits 80% on GraphWalks BFS at million-token context, nearly four times GPT-5.4's score. On CyberGym, the cybersecurity benchmark, it jumped from Opus 4.6's 66.6% to 83.1%. These are extraordinary machines, and they will get more extraordinary. They are built for flexibility, exploration, and open-ended reasoning. They are the polymaths.
On the other side, you have something quieter. Smaller models. Distilled, fine-tuned, domain-specific. Models that don't need to prove theorems or discover zero-days. Models that need to read a certificate of insurance and extract the policyholder, the effective date, and the coverage limits correctly, every single time, on the ten-thousandth document just as reliably as on the first. These are the specialists.
The industry is about to bifurcate along this line, and the gap between these two sides will widen, not narrow, in the next two years.
The data already tells this story. A March 2026 survey of 650 enterprise technology leaders found that 78% of organizations have at least one AI agent pilot running, but only 14% have successfully scaled a single one to production at organizational scale. Deloitte's State of AI 2026 report puts it even more starkly: 80% of enterprise AI pilots fail to reach production, double the failure rate of traditional IT projects. And it is not a technology problem. The models are capable. The tooling has improved. What is missing is the operational infrastructure that turns a capable model into a reliable system.
The Polymath Fallacy
Here is the thing about brilliant generalists: they make terrible employees for repetitive, critical work. A frontier model is, above all, a very confident performer. The question is less whether it has the knowledge than how persuasively it expresses whatever it believes as knowledge, and people rely on that confidence carelessly. I have started calling this AI Psychosis: the feeling of enormous productivity because a tool keeps giving you confident answers, when you have not done the hard work of checking whether those answers are correct.
We see this constantly with our customers. An insurance carrier runs a frontier model against a stack of loss runs and it gets 92% of the fields right. That sounds impressive until you realize the remaining 8% includes policy limits, deductible amounts, and claim totals. In insurance, an 8% error rate on those fields is not a minor inconvenience. It is an E&O exposure. A compliance violation. A reason to not trust the system at all.
You brainstorm with a polymath. You operate with a specialist. And if you are deploying AI to process a million documents for regulated customers, you do not want a brilliant improviser. You want a system trained to do the job precisely, reliably, and with full traceability.
The frontier models are not designed for this. They are designed to be flexible, creative, and general-purpose. That is their strength and their limitation. When you ask Claude Mythos to find zero-day vulnerabilities in Firefox, it is spectacular. When you ask a frontier model to extract 47 fields from a Spanish-language remittance advice for the ten-thousandth time today, it is overkill and unreliable in exactly the ways that matter most.
Where bem Lives
We build production infrastructure for unstructured data. Document extraction. Routing. Decision-making. Visibility. Observability. The full pipeline from raw document to schema-validated structured JSON, running against an API that our customers call millions of times. We work primarily in insurance, logistics, fintech, and healthcare. Industries where the documents are messy, the stakes are high, and "mostly right" is the same as wrong.
Our bet, from the very beginning, has been on small models, self-training loops, and deterministic infrastructure. Not because small models are better in the abstract. They are not. A 3-billion parameter model is not going to pass the USA Mathematical Olympiad. But a 3-billion parameter model fine-tuned on insurance loss runs will extract structured data from those documents faster, cheaper, and more accurately than a frontier model that also happens to know how to write sonnets and exploit browser sandboxes.
This is not a theoretical position. It is what we see in production every day. The economics are different. The latency is different. The consistency is different. And critically, the auditability is different. When a regulated enterprise asks "why did your system classify this claim this way," you need an answer that is more specific than "the model thought it was right."
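What does an answer more specific than "the model thought it was right" look like in practice? A minimal sketch of an auditable field-level decision record, assuming a hypothetical structure: the field names, `FieldDecision` class, and model version string are illustrative, not bem's actual API.

```python
# Sketch of an audit record that makes "why did the system
# classify this claim this way" answerable. All names here are
# hypothetical and illustrative, not bem's real schema.
import json
from dataclasses import dataclass, asdict

@dataclass
class FieldDecision:
    field: str            # schema field that was populated
    value: str            # extracted value
    source_page: int      # where in the document it came from
    model_version: str    # exact model that produced it
    confidence: float     # model-reported confidence
    validated: bool       # whether it passed schema validation

record = FieldDecision(
    field="policy_limit",
    value="1000000",
    source_page=3,
    model_version="loss-run-extractor-v12",
    confidence=0.97,
    validated=True,
)

# Every answer traces back to a document, a page, and a model version.
print(json.dumps(asdict(record), indent=2))
```

The point is that each populated field carries its own provenance, so an auditor can follow any value back to its source instead of trusting an opaque aggregate accuracy number.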
| | Generalist Frontier | Specialized Infrastructure |
|---|---|---|
| Model size | Hundreds of billions of parameters and growing | 1B - 30B parameters, fine-tuned |
| Optimized for | Flexibility and reasoning breadth | Precision on narrow tasks |
| Cost per 1M tokens | $10 - $100+ | < $1 self-hosted |
| Consistency at volume | Degrades unpredictably | Stable across runs |
| Audit trail | Opaque reasoning | Schema-validated and traceable |
| Deployment | Cloud API dependency | Self-hosted, zero-egress |
| Serves | Developers, researchers, and creatives | Operations teams in regulated industries |

Not either/or. Different tools for different jobs.
Determinism Is the Product
I think a lot of people in our industry misunderstand what enterprises are actually buying when they buy AI. They are not buying intelligence. They are buying reliability. They are buying the ability to say, with confidence, that this system will produce correct, consistent, auditable results on this category of document, at this volume, within this latency, every single day.
That is a fundamentally different product than "here is a brilliant model, go figure out how to make it work." And it requires fundamentally different infrastructure. It requires schema validation on outputs. It requires human-in-the-loop supervision at the right decision points, not as a crutch but as an architectural feature. It requires observability into every step of the pipeline so that when something goes wrong, and things always go wrong, you can see exactly where and why. It requires self-training loops so the system gets better on your documents, not documents in general.
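The shape of that architecture can be sketched in a few lines. This is a toy example, assuming an illustrative schema and confidence threshold (the field names, `CONFIDENCE_FLOOR` value, and routing labels are my assumptions, not bem's implementation): validation is a gate, and human review is an explicit routing outcome rather than an afterthought.

```python
# Toy sketch: schema validation as an architectural gate, with
# human review as a first-class routing outcome. The required
# fields and threshold below are illustrative assumptions.
REQUIRED_FIELDS = {"policyholder", "effective_date", "policy_limit"}
CONFIDENCE_FLOOR = 0.9  # below this, a human makes the call

def route(extraction: dict) -> str:
    """Decide whether an extraction ships or goes to review."""
    fields = extraction.get("fields", {})
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        return "human_review"  # incomplete output never ships silently
    if min(f["confidence"] for f in fields.values()) < CONFIDENCE_FLOOR:
        return "human_review"  # low confidence is a decision point
    return "accept"

doc = {"fields": {
    "policyholder": {"value": "Acme Co", "confidence": 0.99},
    "effective_date": {"value": "2026-01-01", "confidence": 0.95},
    "policy_limit": {"value": "1000000", "confidence": 0.72},
}}
print(route(doc))  # the low-confidence policy_limit routes to a human
```

Notice that the low-confidence `policy_limit`, exactly the kind of field where an 8% error rate becomes an E&O exposure, is what triggers review, even though the other fields look fine.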
This is what we are building at bem. Not a model. A production system. The knowledge and operational rails that get you as close to determinism as the underlying technology allows.
The Human in the System
There has been a lot of discussion in the media about the role AI plays in insurance underwriting and claims validation. I never want bem to be in a position where we enable blind automation of decisions that affect people's lives. Part of the functionality we are releasing gives humans control over the AI system and visibility into it, rather than trying to replace them. The human being has the last call, especially in regulated industries.
I think this is going to be regulated, especially by more progressive states. California's recent AI procurement bill requires vendors to attest to safety, bias, and civil rights policies. I agree with the intention. The enforcement will be extremely difficult because technically, even the companies building these models do not fully understand what influences them during training. There are hundreds of billions of parameters and they lose traceability. But the direction is clear: the market is moving toward accountability, and accountability requires visibility.
That is exactly what the specialized side of this bifurcation provides. Not a black box that gives you confident answers, but infrastructure that shows you the full thought process so you can be the one telling the system "no, you made the wrong judgment here."
The Mortar and the Brick
I want to be clear: I am not arguing against frontier models. They are extraordinary. Claude Mythos scoring 97.6% on USAMO, a competition that challenges the world's most gifted young mathematicians, is a genuine breakthrough. The 55-point jump over Opus 4.6 on mathematical reasoning represents a different category of capability entirely. The fact that it solved a corporate network attack simulation that would have taken a human expert over ten hours is remarkable.
These models are the mortar. They are flexible, adaptive, general-purpose. They fill in the gaps. They reason across domains. They explore new territory.
But mortar without bricks is just a puddle. And the bricks are the specialized, deterministic, observable systems that actually run the operational workflows of critical industries. The document extraction pipeline that processes ten thousand invoices overnight. The routing logic that decides which claim gets fast-tracked and which gets flagged for manual review. The decision infrastructure that an auditor can examine step by step.
At bem, we are the bricks. We are building the critically operational layer for industries that cannot afford confident improvisation. Small models, self-trained on domain data, running through composable primitives (transform, split, route, join, enrich, decide, supervise), with full visibility into every step. That is not a limitation. It is the product.
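The composable-primitives idea can be sketched with plain function composition. The primitive names (transform, decide, supervise) come from the list above, but their signatures and the urgency heuristic here are illustrative assumptions, not bem's actual API.

```python
# Toy sketch of composable pipeline primitives. Names come from
# the post; the signatures and logic are illustrative assumptions.
from functools import reduce

def transform(doc):   # normalize the raw document
    return {**doc, "text": doc["text"].strip().lower()}

def decide(doc):      # attach a decision with an explicit reason
    flagged = "urgent" in doc["text"]
    return {**doc, "decision": "fast_track" if flagged else "standard",
            "reason": "matched 'urgent'" if flagged else "no urgency signal"}

def supervise(doc):   # mark fast-tracked decisions for human review
    return {**doc, "review_queue": doc["decision"] == "fast_track"}

def pipeline(*steps):
    """Compose primitives left to right; each step sees the last one's output."""
    return lambda doc: reduce(lambda d, step: step(d), steps, doc)

process = pipeline(transform, decide, supervise)
result = process({"text": "  URGENT: water damage claim  "})
print(result["decision"], result["reason"])
```

Because each primitive returns a plain record carrying its own reason, the full thought process is visible at every step, which is the property that makes the whole pipeline auditable.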
AI in isolation does not do anything. It is just vibes. The value comes from grounding it in real knowledge, real documents, real operational constraints. And 90% of an organization's knowledge lives in the ether of conversations and unstructured documents, not in a memo that was shelved six months ago and never touched since.
The Bet
So here is our bet, stated plainly. The AI industry is splitting into two compute paradigms. One is getting bigger, more expensive, and more general. The other is getting smaller, cheaper, more specialized, and more deterministic. Both will thrive. But the second one, the one most people are not paying attention to because it does not generate breathless headlines about zero-day exploits and mathematical proofs, is the one that will power the actual operational infrastructure of the world's largest industries.
That is where we are. That is where we are going. And honestly, we have never been more convinced that this is the right side of the split to be on. Not because frontier models are wrong. Because the enterprises we serve need something they cannot provide: the boring, reliable, auditable, observable, deterministic infrastructure that turns AI from a promising pilot into a system you would bet your operations on.
An investor once told me we have too many opinions. I wear that as a badge of honor. We have been building this for three years, and the market is finally catching up to the thesis. The bifurcation is here. And we know which side we are on.

Written by
Antonio Bustamante
Apr 8, 2026 · Updates

