The Router Is the Product: MoE at the Application Infra Layer

Every time you send a prompt to an AI app, something behind the scenes has to decide which model, on which inference server, actually answers it. Most people never think about this. But that decision (the “router”) is quietly turning into one of the most important pieces of the AI stack. This is a walkthrough of what routers actually do today, what’s still shallow about them, and where the smart money says they’re headed next.

Routing isn’t one thing — it’s two, and only one of them is smart

Say the word “router” to five AI companies and you’ll get five different products. Almost all of them fall into one of two buckets.

The first is provider selection. You’ve already decided you want, say, GPT-4o — the router’s only job is finding which company hosting that model is cheapest or fastest right now. It’s basically what a CDN does for websites: pick the nearest healthy server. Useful, but it never actually reads your prompt. BitRouter is a clean example — it ranks providers by cost, latency, or throughput, and doesn’t pretend to be anything more.
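A minimal sketch of provider selection, assuming a hypothetical list of hosts serving one fixed model. The `Provider` fields, host names, and numbers are invented for illustration and are not BitRouter's actual API or data. Note that the prompt never appears anywhere in the function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str               # hypothetical host name
    usd_per_mtok: float     # price per million tokens
    p50_latency_ms: float   # median latency
    tokens_per_sec: float   # throughput
    healthy: bool = True


def pick_provider(providers: list[Provider], strategy: str = "cost") -> Provider:
    """Pick a host for an already-chosen model; the prompt is never inspected."""
    candidates = [p for p in providers if p.healthy]
    if not candidates:
        raise RuntimeError("no healthy provider available")
    if strategy == "cost":
        return min(candidates, key=lambda p: p.usd_per_mtok)
    if strategy == "latency":
        return min(candidates, key=lambda p: p.p50_latency_ms)
    if strategy == "throughput":
        return max(candidates, key=lambda p: p.tokens_per_sec)
    raise ValueError(f"unknown strategy: {strategy}")


# Invented example data: host-c is cheapest and fastest but currently unhealthy.
HOSTS = [
    Provider("host-a", usd_per_mtok=2.50, p50_latency_ms=420, tokens_per_sec=90),
    Provider("host-b", usd_per_mtok=2.10, p50_latency_ms=610, tokens_per_sec=70),
    Provider("host-c", usd_per_mtok=1.90, p50_latency_ms=380, tokens_per_sec=120, healthy=False),
]
```

With this data, `pick_provider(HOSTS, "cost")` skips the unhealthy host and returns `host-b`, while the latency and throughput strategies both return `host-a`.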

The second bucket is model selection: instead of “you told me the model, I’ll find the best host,” the router looks at your actual prompt and decides which model should even answer it. A simple FAQ goes to a cheap, fast model; a gnarly coding problem goes to a frontier one. This is what people mean by “intelligent” routing.

But here’s the catch — even the “intelligent” routers today are mostly asking one question: how hard does this prompt look? That’s a real improvement over dumb load balancing, but it’s still shallow. Nobody’s asking the harder, more useful question: what does this specific business actually need from this specific answer? A travel-booking bot, a hospital intake assistant, and an insurance claims tool all have completely different definitions of a “good” response. One cares mostly about speed and cost. One cares mostly about never being confidently wrong. One cares mostly about being able to explain, on paper, why it said what it said. Difficulty-scoring alone can’t tell those apart. That gap is the real story here, and it’s the thread running through everything below.
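To make the limitation concrete, here is a toy difficulty grader, assuming hand-picked keyword hints and weights that are purely illustrative (production routers use trained classifiers, not rules like these). The point is the function signature: it sees only the prompt, so a travel bot and a hospital intake assistant sending the same text get the same routing decision.

```python
import re

# Hypothetical keyword hints; chosen for illustration only.
HARD_HINTS = ("prove", "debug", "refactor", "optimize", "stack trace", "race condition")


def difficulty_score(prompt: str) -> float:
    """Toy heuristic in [0, 1] combining length, keyword hints, and code markers."""
    text = prompt.lower()
    length_part = min(len(text.split()) / 200, 1.0)            # longer prompts look harder
    hint_part = min(sum(h in text for h in HARD_HINTS) / 2, 1.0)
    code_part = 1.0 if re.search(r"def |class |[{};]", prompt) else 0.0
    return 0.3 * length_part + 0.4 * hint_part + 0.3 * code_part


def pick_tier(prompt: str, threshold: float = 0.3) -> str:
    """Map the score to a model tier; nothing about the business is an input."""
    return "frontier" if difficulty_score(prompt) >= threshold else "cheap"
```

A short FAQ such as "What are your opening hours?" lands in the cheap tier, and a prompt asking to debug a race condition in a code snippet lands in the frontier tier, regardless of who is asking or what a wrong answer would cost them.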

The five ways routers actually make decisions

Crack open how today’s smarter routers work and you’ll find five genuinely different approaches, each with its own failure mode.

Some are learned meta-models — trained on thousands of examples of “here’s a prompt, here’s how two models answered it, here’s which was better,” learning to predict winners directly. Not Diamond works this way, and its standout feature is letting you train the predictor on your own data instead of a generic benchmark. RouteLLM, a well-documented open-source project out of Berkeley, does something similar but only chooses between two models — a strong one and a weak one — and it’s remarkably good at it: on one benchmark it matched 95% of GPT-4’s quality while sending only a quarter of the traffic to GPT-4.
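The strong-versus-weak setup can be sketched as a threshold on a predicted win probability. This is an illustration of the general idea, not RouteLLM's actual API: the probabilities below are hard-coded stand-ins for what a trained predictor would output, and the function names are hypothetical.

```python
def calibrate_threshold(win_probs: list[float], strong_fraction: float) -> float:
    """Return the cutoff that sends about `strong_fraction` of a sample to the strong model."""
    if not win_probs:
        raise ValueError("need at least one sample")
    if not 0 < strong_fraction <= 1:
        raise ValueError("strong_fraction must be in (0, 1]")
    ranked = sorted(win_probs, reverse=True)
    k = max(1, round(len(ranked) * strong_fraction))
    return ranked[k - 1]


def route(win_prob: float, threshold: float) -> str:
    """Send the prompt to the strong model only when the predictor is confident it is needed."""
    return "strong" if win_prob >= threshold else "weak"


# Stand-in predictions: probability that the strong model's answer would be preferred.
SAMPLE = [0.91, 0.12, 0.40, 0.77, 0.05, 0.33, 0.64, 0.18]
```

Calibrating on `SAMPLE` with `strong_fraction=0.25` yields a threshold of 0.77, so two of the eight prompts go to the strong model. Moving the threshold is how this kind of router trades quality against cost.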

Others are prompt graders — they score how hard or what kind of task a prompt is, then map that score to a model tier. OrcaRouter does this and keeps learning from live traffic instead of freezing after training, with a rules layer on top so you can override it when you know better. Amazon Bedrock does a version of this too, but only within one vendor’s lineup — it’ll pick between two flavors of Claude, but won’t reach outside Anthropic. Weave Router adds something almost nobody else bothers with: it knows switching models mid-conversation can break a “warm” prompt cache, and factors that into the decision. vLLM Semantic Router, backed by Red Hat, IBM, AMD, and Hugging Face, isn’t really “just” a router anymore — it bundles jailbreak detection, hallucination checks, and caching into one system.
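The cache-awareness point can be illustrated with a small cost comparison. This is a sketch under stated assumptions, not Weave Router's actual logic: it looks at input-token cost only, and the prices and the discounted cached-token price are hypothetical.

```python
def input_cost(tokens: int, usd_per_mtok: float) -> float:
    """Cost in USD for a number of input tokens at a per-million-token price."""
    return tokens * usd_per_mtok / 1_000_000


def should_switch(
    prefix_tokens: int,          # conversation prefix already cached on the current model
    new_tokens: int,             # tokens added by the latest turn
    current_price: float,        # current model, uncached input price per million tokens
    current_cached_price: float, # current model, cached input price per million tokens
    candidate_price: float,      # candidate model, input price (its cache is cold)
) -> bool:
    """Switch only if paying full price for the whole prefix elsewhere is still cheaper."""
    stay = input_cost(prefix_tokens, current_cached_price) + input_cost(new_tokens, current_price)
    switch = input_cost(prefix_tokens + new_tokens, candidate_price)
    return switch < stay
```

With a 50,000-token warm prefix, a current model at $3.00 (or $0.30 cached) and a candidate at $1.00, staying costs about $0.0165 for the turn and switching costs about $0.0505, so the cheaper-per-token model is the more expensive choice. With no prefix cached, the same prices favor switching.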

A third camp treats the problem as a graph — modeling which queries pair well with which models the way you’d model a social network. Academically elegant, but worth a grain of salt: on the field’s own standardized test, one of these graph-based routers scored the lowest of all the academic entries.

The fourth camp is the most philosophically different, and belongs to a company called Martian — more on them shortly. The fifth is really just provider selection wearing a fancier name.

The one place you can actually check anyone’s claims

Every vendor says their router is “the smartest.” Almost none of that is independently checkable — except for one project: RouterArena, a standardized benchmark out of Rice University and UC Berkeley’s Sky Computing Lab (the same group behind RouteLLM). It tests routers against a broad dataset — nine domains, forty-four categories, three difficulty levels — and reports accuracy, cost, robustness, latency, and “optimality” separately instead of collapsing everything into one number. It also penalizes routers that can only pick from one company’s models, since a router that can’t look outside OpenAI isn’t really solving the full problem, and it disqualifies anyone caught training on its own test data.

In one snapshot of its leaderboard, OrcaRouter came out ahead of GPT-5’s own built-in router, ahead of Azure’s, and comfortably ahead of both Martian and Not Diamond. That’s not gospel — these rankings move as products update — but it’s the closest thing this space has to a referee, and it’s worth checking directly rather than trusting any single company’s marketing page.

So, which one should you actually use?

Here’s the honest comparison, algorithm and all:

Router	Access	Self-host?	How it decides	Where it shines	Where it falls short
BitRouter	Both	Yes + Cloud	Fixed cost/latency/throughput rules	Simple, transparent, no lock-in	Doesn’t judge prompts at all
Requesty	Managed	No	Failover + region rules	Fast failover, data residency	No real prompt intelligence either
OrcaRouter	Managed, zero markup	Partial (lite OSS)	Prompt grading + learns from live traffic	Best independently-verified score; auditable overrides	Newer player, shorter track record
Not Diamond	Own gateway needs your keys	No	Learned model trained on preference data; trainable on your own data	Only one that learns your workload specifically	Lower benchmark score on generic tests
Weave Router	Either	Yes	Embedding-based clustering, cache-aware	Rare cache-awareness; built for coding agents	Narrower focus, less proven elsewhere
Amazon Bedrock	Managed	No	Quality prediction within one model family	Zero setup if you’re on AWS already	Can’t route across different vendors
RouteLLM	Self-hosted	Yes	Four trainable classifiers on preference data	Free, well-documented, generalizes well	Only picks between two models, not many
vLLM Semantic Router	Self-hosted	Yes	Sixteen signals feeding a policy engine	Most complete open-source option	You own all the operational overhead
Martian	Managed	No	Tries to understand why a model works, not just correlate	Genuinely different research bet, real enterprise use	Hasn’t topped the independent benchmark yet

If you want something cloud-hosted and non-BYOK (meaning you don’t bring your own provider API keys), with real intelligence behind it rather than just cost math, the order that makes sense is: OrcaRouter first, because it’s the only one with a verified top score, zero markup, and a way to override it when it’s wrong. Not Diamond second, specifically if you have your own evaluation data and want a router trained on it rather than a generic benchmark. Bedrock third, only if you’re already deep in AWS and can live with staying inside one vendor’s model family. Martian is worth piloting for its research direction, but don’t take its marketing at face value until you’ve checked it against the live leaderboard yourself.

You might be wondering why any of this is worth discussing when OpenRouter Auto already exists: it runs Not Diamond under the hood for intelligent routing and has its own gateway handling cost, latency, and throughput, which makes it look like a no-brainer. My experience with OpenRouter, especially with OpenClaw, hasn’t been great, and that sent me down the rabbit hole of finding better options. The search turned up a lot more than my original requirements called for. Now I want an intelligent router that fits Claude Code for coding work, OpenClaw and Hermes for routine agentic work, and the other harnesses I’m experimenting with.

My goal is to isolate the model from the harness and let the user own the layer between them (call it memory, or evolving context specific to the user or business), since both the model and the harness will keep evolving over time. The same applies to enterprises, at scale.

If self-hosting is genuinely on the table — because of compliance, data residency, or just wanting full control — vLLM Semantic Router is the most complete option out there, and RouteLLM is the simplest, best-documented one if you don’t need more than two models. The trade-off in plain terms: going cloud and non-BYOK gets you running in minutes, but you’re trusting someone else’s infrastructure and pricing. Going self-hosted gets you full control and no vendor risk, but now you’re the one operating it.

Where this is actually headed

Everything above describes routing as it exists right now. Here’s the more interesting part — where it’s going, and some of it is already starting to happen in public.

Routing is going to stop being “pick one model” and start being “assemble an answer out of several.” Think about what agentic AI workflows actually look like: one piece writing code, another reviewing it, another testing it, another writing a status update. That’s not one prompt anymore — it’s a dozen small tasks stitched together, each with different needs. Sending all of them to your most expensive model is like hiring a surgeon to change a lightbulb. A research paper called Router-R1, published at NeurIPS in 2025, points at where this goes: instead of picking a model once, the router itself acts like an LLM that alternates between “thinking” and “calling out” to different models mid-task, learning through trial and error — with an explicit penalty for calling expensive models when it didn’t need to. Expect “the router” and “the orchestrator” to become the same thing, and expect the smart selection to happen at the level of each small sub-task, not the whole job.
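The idea of penalizing unnecessary expensive calls can be sketched as a reward function. This is a generic cost-penalized reward written for illustration, not the formulation used in the Router-R1 paper; the model names, prices, `alpha`, and `budget_usd` are all hypothetical.

```python
def cost_penalized_reward(
    correct: bool,
    calls: list[tuple[str, int]],        # (model name, tokens used) for each call made
    price_per_mtok: dict[str, float],    # hypothetical USD price per million tokens
    alpha: float = 0.5,                  # how strongly spend is punished
    budget_usd: float = 0.01,            # spend at or above this gets the full penalty
) -> float:
    """Reward a correct answer, minus a penalty that grows with what the calls cost."""
    spend = sum(tokens * price_per_mtok[model] / 1_000_000 for model, tokens in calls)
    penalty = alpha * min(spend / budget_usd, 1.0)
    return (1.0 if correct else 0.0) - penalty


# Invented prices for a small and a large model.
PRICES = {"small": 0.5, "large": 10.0}
```

Under these assumptions a correct answer obtained with 2,000 tokens on the small model scores about 0.95, while the same correct answer obtained on the large model scores 0.5. A router trained against a signal shaped like this learns to reach for the expensive model only when the cheap one would fail.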

Routers are going to start learning your specific business, not just a generic benchmark. A router trained to be good at trivia and coding tests has no idea that a wrong answer in a medical app is a liability event, while a wrong answer in a travel app is just an annoyed customer. The fix is the same idea from Router-R1, pointed at a different target: instead of training the router to maximize benchmark accuracy, you train it against a reward that actually reflects your business — fewer doctor overrides, cleaner audit trails, higher booking conversion. Change what you’re rewarding, and you get a genuinely different router personality, one that keeps adjusting itself from real traffic instead of getting retrained by hand every quarter. OrcaRouter’s live-learning and Not Diamond’s custom training are the earliest, simplest versions of this. The real version — a router that tunes itself continuously to what “good” means in your specific industry — is still ahead of us, but the pieces already exist.
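A small sketch of how changing the reward changes the router. Everything here is an assumption made for illustration: the two policies, their accuracy, latency, and cost figures, and the per-business weights are invented, and real business rewards would be built from signals like override rates or conversions rather than hand-set constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyStats:
    accuracy: float          # fraction of answers that are right
    latency_s: float         # average response time
    cost_usd: float          # average cost per request
    has_audit_trail: bool    # whether decisions can be explained on paper


@dataclass(frozen=True)
class RewardProfile:
    wrong_answer_penalty: float
    latency_weight: float    # penalty per second
    cost_weight: float       # penalty per dollar
    audit_bonus: float


def expected_reward(s: PolicyStats, p: RewardProfile) -> float:
    """Expected per-request reward of a routing policy under one business's priorities."""
    score = s.accuracy - (1.0 - s.accuracy) * p.wrong_answer_penalty
    score -= p.latency_weight * s.latency_s + p.cost_weight * s.cost_usd
    return score + (p.audit_bonus if s.has_audit_trail else 0.0)


POLICIES = {
    "cheap": PolicyStats(accuracy=0.90, latency_s=0.5, cost_usd=0.001, has_audit_trail=False),
    "frontier": PolicyStats(accuracy=0.98, latency_s=4.0, cost_usd=0.03, has_audit_trail=True),
}
PROFILES = {
    "travel": RewardProfile(wrong_answer_penalty=1.0, latency_weight=0.2, cost_weight=10.0, audit_bonus=0.0),
    "hospital": RewardProfile(wrong_answer_penalty=20.0, latency_weight=0.02, cost_weight=1.0, audit_bonus=0.0),
    "insurance": RewardProfile(wrong_answer_penalty=5.0, latency_weight=0.02, cost_weight=1.0, audit_bonus=0.5),
}


def best_policy(business: str) -> str:
    profile = PROFILES[business]
    return max(POLICIES, key=lambda name: expected_reward(POLICIES[name], profile))
```

With these invented numbers, the travel profile prefers the cheap policy, while the hospital and insurance profiles prefer the frontier policy, even though the policies themselves never changed. Only the definition of “good” did.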

Brian Armstrong (@brian_armstrong), June 27, 2026:

“How to keep AI spend flat while token usage grows exponentially: Not with friction and spend alerts. With better defaults, routing, and caching.

Better Defaults (not Usage Caps) – Engineers can choose any model they want, but defaults matter. We’re experimenting with defaulting…”

Cheap, open models are closing the price gap faster than anyone expected, and it’s already showing up in CEOs’ actual budgets. In June 2026, Coinbase’s CEO Brian Armstrong posted, in public, that the company had cut its AI spending nearly in half, without cutting anyone’s access, by doing five unglamorous things: defaulting engineers to open-weight models instead of frontier ones, automatically routing simple execution tasks away from expensive “planning-grade” models, pushing their cache hit rate from 5% up to 60%, keeping context windows lean, and tracking spend against actual impact instead of imposing blanket usage caps. The pricing behind this is genuinely striking: one open-weight model, Zhipu’s GLM 5.2, costs roughly $1.40 and $4.40 per million input and output tokens, compared to Claude Opus’s $5 and $25. That’s about 3.6 times cheaper on input and 5.7 times cheaper on output, and GLM 5.2 even beat a frontier model on a tough coding benchmark, though it still lags behind on deeper reasoning tasks. Snowflake and an AI startup called Lindy have made similar moves. It’s worth a note of caution here too: independent evaluators have found real gaps between what these models score on published benchmarks and how they actually perform once deployed, and any open-weight model, especially one from China, brings its own questions around data handling and export rules that self-hosting only partly answers. Still, the direction is unmistakable, and it points at something specific: a router that ships with cheap-model defaults, task-aware routing, and caching already built in saves every company from rebuilding this playbook from scratch on their own.
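Working through the quoted prices: the per-token gap is about 3.6x on input and 5.7x on output, and the gap for a whole job depends on its input-to-output mix. The prices below are the ones quoted above; the 3M-input, 1M-output workload is an assumption chosen only to show the arithmetic.

```python
# (input, output) USD per million tokens, as quoted in the text above.
GLM_PRICES = (1.40, 4.40)
OPUS_PRICES = (5.00, 25.00)


def job_cost(input_tokens: int, output_tokens: int, prices: tuple[float, float]) -> float:
    """Total USD cost of a job at per-million-token input and output prices."""
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


input_ratio = OPUS_PRICES[0] / GLM_PRICES[0]    # about 3.57
output_ratio = OPUS_PRICES[1] / GLM_PRICES[1]   # about 5.68

# Assumed workload: 3M input tokens and 1M output tokens.
glm_total = job_cost(3_000_000, 1_000_000, GLM_PRICES)    # 4.20 + 4.40 = 8.60
opus_total = job_cost(3_000_000, 1_000_000, OPUS_PRICES)  # 15.00 + 25.00 = 40.00
blended_ratio = opus_total / glm_total                    # about 4.65
```

For input-heavy workloads the effective saving sits closer to the 3.6x end, and for output-heavy workloads closer to 5.7x.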

Caching is quietly becoming the router’s job, not each model provider’s. Every AI company offers some version of caching today, but only within their own model. The router is the one layer that sees every request across every model you use, which makes it the natural place to decide “don’t switch models here, it’ll break a warm cache” or “this prompt is functionally the same one we answered five minutes ago, just skip the call entirely.” Coinbase, notably, called caching their single biggest lever — bigger than model selection itself. Expect this to move fully into the routing layer rather than staying stuck inside each provider’s own API.
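A minimal sketch of a cache that lives in the routing layer and is therefore independent of which model would have answered. It matches prompts only after normalizing case and whitespace; recognizing prompts that are “functionally the same” would need semantic matching, which this sketch does not attempt. The class and function names are hypothetical.

```python
import hashlib
import time
from typing import Callable, Optional


class RouterCache:
    """Exact-match response cache with a time-to-live, keyed on the normalized prompt."""

    def __init__(self, ttl_s: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]   # expired: drop it and report a miss
            return None
        return answer

    def put(self, prompt: str, answer: str) -> None:
        self._store[self._key(prompt)] = (self._clock(), answer)


def answer_with_cache(
    prompt: str, cache: RouterCache, call_model: Callable[[str], str]
) -> tuple[str, bool]:
    """Return (answer, was_cache_hit); the model is called only on a miss."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached, True
    answer = call_model(prompt)
    cache.put(prompt, answer)
    return answer, False
```

Because the cache sits in front of model selection, a hit skips both the routing decision and the provider call. The injectable `clock` is there only to make expiry easy to test.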

Martian’s bet on understanding models from the inside out is really a bet on trust, not just accuracy. Most routers, including the clever ones above, are pattern-matchers: they notice that a certain kind of prompt has historically gone well with a certain model, without really knowing why. Martian is trying something different — using techniques adjacent to mechanistic interpretability to actually understand what’s happening inside a model, so it can predict performance instead of just correlating with history. They’ve backed this with a real, peer-reviewed benchmark paper built with a UC Berkeley lab, logging over 400,000 inference outcomes, and it reportedly powers a real enterprise deployment at Accenture. It hasn’t topped the independent leaderboard yet, so treat the performance claims with some skepticism for now — but the underlying idea matters more than today’s ranking. As routing moves into regulated spaces like insurance or healthcare, “we understand mechanistically why this model fits this query” is a far better answer to a regulator than “it’s usually worked before.”

None of this is going to be sold — it’s going to spread. Nobody adopted Kubernetes because of a sales pitch, and this won’t be different. It’s already showing up as a CEO posting a cost breakdown on social media, as four unrelated tech companies independently contributing to the same open-source routing project because they all hit the same wall, as researchers putting router models straight onto Hugging Face for anyone to poke at. The routers that win long-term will be the ones an engineer can try on a random Tuesday afternoon — quick setup, honest pricing, decisions you can actually see and audit — not the ones that make you sit through a sales call first.

What’s the takeaway?

Routing today is mostly about judging how hard a prompt looks, and that’s already a real improvement over blind cost math — but the next leap is judging what a specific business actually needs, learning that from real usage instead of a fixed formula, and doing it while riding a genuine, ongoing collapse in what open-weight models cost to run. Add in caching finally becoming the router’s responsibility instead of each model’s afterthought, and a growing push toward routers that can explain themselves instead of just historically working — and it’s easy to see why this unglamorous piece of plumbing is turning into one of the more consequential parts of the whole AI stack. Whoever builds the version that does all of this without demanding a procurement meeting first is going to own a very large, very quiet layer of how AI actually gets used.