Stop Picking One Model: A 2026 Guide to LLM Routing Libraries
Libraries that automatically send each prompt to the best model for the task — RouteLLM, LiteLLM, semantic-router, vLLM Semantic Router, Not Diamond and more. How they decide, what they save (40–85%), and the honest tradeoffs from Hacker News.
We keep arguing about which model is “best.” It’s the wrong question. With ~280 models on this site spanning a 50x price range and wildly different strengths, the model that’s best for an 8-hour coding agent is absurd overkill for a one-line classification, and the model that’s best for the classification is useless for the agent. The interesting question in 2026 isn’t which model, it’s which model for this specific request, and answering it automatically is a solved-enough problem that there’s now a whole category of libraries for it.
They’re called LLM routers. Think of one as an air-traffic controller sitting in front of your model fleet: it looks at each incoming prompt and dispatches it to the model that wins on the axis you care about — cost, quality, latency, or safety. Cheap model for the easy 80%, frontier model for the hard 20%.
The payoff is real. Published numbers cluster around 40–70% cost savings with under 2% quality loss, and LMSYS’s RouteLLM reports up to 85% cost reduction while keeping ~95% of GPT-4 quality on MT-Bench. When the routine path is a model like GLM-5.2 or DeepSeek at a tenth of frontier pricing, the math gets hard to ignore.
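To make that math concrete, here is a back-of-the-envelope sketch. The prices are illustrative assumptions (frontier normalized to 1.0, the cheap model at a tenth of that), the 80/20 split is the rule of thumb from above, and the calculation ignores the router's own cost and any difference in token counts between models.

```python
def blended_cost(cheap_share: float, cheap_price: float, frontier_price: float) -> float:
    """Average per-request cost when `cheap_share` of traffic goes to the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price + (1.0 - cheap_share) * frontier_price


def savings_vs_frontier(cheap_share: float, cheap_price: float, frontier_price: float) -> float:
    """Fraction saved relative to sending every request to the frontier model."""
    return 1.0 - blended_cost(cheap_share, cheap_price, frontier_price) / frontier_price


# Illustrative numbers only: 80% of traffic on a model priced at 1/10 of frontier.
saving = savings_vs_frontier(cheap_share=0.8, cheap_price=0.1, frontier_price=1.0)
print(f"{saving:.0%}")  # 72%
```

Under those assumptions the saving lands at the top of the published 40–70% band; a less favorable traffic split or price ratio pulls it down quickly.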
How routers actually decide
Not all routing is the same. There are five broad strategies, and most serious tools combine several:
- Rule / metadata routing — route on explicit signals: token count, requested feature (vision, tools), tenant, or a model hint. Deterministic and zero added latency, but blind to meaning.
- Semantic routing — embed the prompt and match it against reference utterances per route (“this looks like a SQL question → send to the coding model”). Sub-millisecond, no LLM call.
- Predictive / classifier routing — a small trained model scores prompt difficulty and predicts which tier will succeed. RouteLLM’s matrix-factorization and BERT classifiers live here; they classify in roughly 400ms.
- Cascade / escalation — try the cheap model first, evaluate the answer, escalate to a stronger model only if it fails a confidence or verification check. Pay for the expensive model only when you need it.
- Consensus / ensemble — fan the same prompt to several models and aggregate. Higher quality, higher cost — used when correctness dominates.
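As a rough illustration of the simplest strategy in the list, rule / metadata routing, here is a minimal sketch. The `Request` fields, the model names, the 2,000-token cutoff, and the 4-characters-per-token estimate are all hypothetical placeholders, not any library's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    prompt: str
    needs_vision: bool = False
    needs_tools: bool = False
    model_hint: Optional[str] = None  # explicit override supplied by the caller


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumes ~4 characters per token)."""
    return max(1, len(text) // 4)


def route_by_rules(req: Request) -> str:
    """Deterministic routing on explicit signals only; never inspects meaning."""
    if req.model_hint:
        return req.model_hint
    if req.needs_vision:
        return "vision-model"
    if req.needs_tools:
        return "tool-model"
    if estimate_tokens(req.prompt) > 2000:
        return "long-context-model"
    return "cheap-default"
```

The order of the checks is the whole policy here: the first matching rule wins, which is what keeps this kind of router predictable and easy to debug.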
The tradeoff axis is always the same: a smarter router catches more nuance but adds latency, failure modes, and non-determinism. The best tools let you pick how much intelligence to spend on the routing decision itself.
The open-source libraries
LiteLLM — the workhorse gateway
The de-facto standard. An MIT-licensed Python SDK and self-hostable proxy that speaks the OpenAI format to 100+ models across every major provider, with fallbacks, load balancing, budget controls, and retries. Routing started as load-balancing/fallback logic, but recent versions add semantic routing and a rule-based Complexity Router that, per the docs, scores difficulty with zero external calls and sub-millisecond latency. If you adopt one piece of routing infrastructure, it’s usually this. The common gripe (see the Hacker News section below): the codebase is sprawling.
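The fallback-and-retry behavior a gateway provides can be sketched in a few lines of plain Python. This is not LiteLLM's API or implementation; `call_with_fallbacks` and the provider callables are hypothetical stand-ins meant only to show the control flow.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (name, function that takes a prompt)


def call_with_fallbacks(
    prompt: str,
    providers: Sequence[Provider],
    retries_per_provider: int = 1,
) -> Tuple[str, str]:
    """Try each provider in order, retrying on errors, and return (name, answer)."""
    errors: List[str] = []
    for name, call in providers:
        for attempt in range(retries_per_provider + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real gateway would catch narrower error types
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the answer is deliberate: once fallbacks exist, knowing which backend actually answered becomes part of the response.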
RouteLLM — the research-grade classifier
From LMSYS (the LMArena team). It’s a framework for training and evaluating routers, shipping pretrained classifiers that decide between a strong and a weak model. Drop-in OpenAI replacement, and it leans on LiteLLM for the actual provider calls. The reference numbers — ~85% cheaper at ~95% GPT-4 quality — are the ones everyone cites. Best when you want a principled, benchmarked cost/quality knob rather than hand-written rules.
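One way to picture that cost/quality knob: assume the router emits a score between 0 and 1 for each prompt (higher meaning the strong model is more likely to be needed), and you pick a threshold so that a chosen share of traffic goes to the strong model. The sketch below illustrates that idea in plain Python; the function names are hypothetical and this is not RouteLLM's code.

```python
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], strong_fraction: float) -> float:
    """Pick a threshold so roughly `strong_fraction` of sample prompts score at or above it."""
    if not scores:
        raise ValueError("need at least one sample score")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    ranked = sorted(scores, reverse=True)
    k = round(strong_fraction * len(ranked))
    if k == 0:
        return float("inf")  # no prompt reaches the strong model
    return ranked[k - 1]


def pick_model(score: float, threshold: float) -> str:
    """Route to the strong model only when the score clears the threshold."""
    return "strong" if score >= threshold else "weak"
```

Calibrating on a sample of your own traffic matters because score distributions differ by workload; tied scores can also push slightly more than the target share to the strong model.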
semantic-router — meaning-based, microsecond decisions
A focused library that routes purely on embeddings: define routes by example utterances, and prompts get matched by semantic similarity with no LLM in the path. It’s now integrated into LiteLLM, but stands alone well when you want intent classification (and guardrail routing) decoupled from a heavyweight gateway.
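The mechanism is easy to sketch. The example below swaps in a toy bag-of-words vector for a real embedding model so it runs with the standard library alone; the route names, example utterances, and the 0.3 similarity cutoff are assumptions for illustration, and none of this is the semantic-router API.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


ROUTES: Dict[str, List[str]] = {
    "coding": ["write a sql query", "fix this python bug", "explain this stack trace"],
    "chitchat": ["how are you today", "tell me a joke", "what is your name"],
}


def semantic_route(prompt: str, routes: Dict[str, List[str]] = ROUTES,
                   min_score: float = 0.3, default: str = "general") -> str:
    """Return the route whose closest example utterance is most similar to the prompt."""
    query = embed(prompt)
    best_route, best_score = default, 0.0
    for name, utterances in routes.items():
        score = max(cosine(query, embed(u)) for u in utterances)
        if score > best_score:
            best_route, best_score = name, score
    return best_route if best_score >= min_score else default
```

With real embeddings the utterance vectors would be computed once up front, so the per-request work is one embedding call plus a handful of dot products. The `min_score` floor is what keeps off-topic prompts from being forced into the nearest route.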
vLLM Semantic Router — production mixture-of-models
An open-source router built for serving heterogeneous fleets (local + cloud). It runs 16 signal families and 12 routing strategies (rules, latency heuristics, RL, ML selection), and folds safety into the routing layer — jailbreak detection, PII, hallucination checks before a request reaches a model. It claims ~98x faster routing via Flash Attention + prompt compression, taking decisions from seconds to tens of milliseconds. Notably backed by contributors from Google, Red Hat, IBM, and Microsoft.
LLMRouter — the academic toolkit
ulab-uiuc/LLMRouter implements 16+ routing algorithms across four families (single-round, multi-round, agentic, personalized): KNN, SVM, MLP, matrix factorization, Elo, graph-based, and BERT-based routers. Overkill for most apps, but the right place to start if you want to research routing strategies rather than just ship one.
any-llm — the lightweight option
A newer mozilla.ai library that routes across 20+ providers by changing a single string ("openai/gpt-5" → "anthropic/claude-opus-4.8"), using each provider’s official SDK instead of reimplementing APIs, with no proxy server. Positioned as a leaner alternative to LiteLLM for teams that want provider-switching without the operational surface.
The managed services
If you’d rather not run routing yourself:
| Tool | Type | Routing approach |
|---|---|---|
| Fugu (Sakana AI) | Service (orchestration) | A trained “conductor” LM that routes and synthesizes across a swappable model pool, sold as a single model |
| OpenRouter | Service | Provider/price/throughput ordering; an auto meta-model |
| Not Diamond | Service | Trained meta-router that predicts the best model per prompt |
| Martian | Service | Real-time adaptive routing |
| Unify | Service | Best model and provider per prompt |
| Portkey | Gateway (OSS core) | Conditional routing, fallbacks, circuit breakers, load balancing |
| Vercel AI Gateway | Gateway | Rules-based fallback, provider ordering |
| Braintrust | Platform | Eval/quality-score-based selection |
Not Diamond is the purest expression of the classic idea — a meta-router that claims to beat any single foundation model on aggregate benchmarks by always dispatching to the right one. They also maintain a useful awesome-ai-model-routing reading list.
Fugu — routing as the model
The most interesting recent entry inverts the whole pattern. Sakana AI’s Fugu (launched June 22, 2026, and #1 on Hacker News) isn’t a library you put in front of models — it’s a trained “conductor” language model that routes and synthesizes across a swappable pool of frontier models, exposed as a single API. Sakana trains no frontier model of its own; the product is the orchestration layer itself. It’s built on two ICLR 2026 works — TRINITY (a lightweight coordinator managing multiple LLMs over multiple turns) and Conductor (reinforcement learning to discover natural-language coordination strategies).
The numbers are the headline: Fugu Ultra reportedly outperforms Opus 4.8, GPT-5.5, and Gemini 3.1 Pro on many benchmarks — scoring 73.7 on SWE-bench Pro — and stands shoulder-to-shoulder with Claude Fable 5. The timing was pointed: VentureBeat framed it as “No Fable 5? No problem”, landing days after Anthropic’s flagship was pulled offline. Pricing mirrors the incumbents — $20/mo Standard, $100/mo Pro, $200/mo Max — and sakana/fugu-ultra is also callable per-token (~$5/$30 per 1M) via OpenRouter. Compare both access paths on the Fugu Ultra model page, or see the Sakana AI provider page.
Fugu matters because it reframes the category: if a conductor model with no frontier weights of its own can beat the best single model by orchestrating others, then “routing” stops being plumbing and becomes a product tier of its own. It’s the strongest evidence yet that the model layer is commoditizing and the coordination layer is where differentiation moves next.
What Hacker News actually thinks
The discourse is more skeptical than the vendor decks, which is healthy:
- LiteLLM is “rock-solid in practice” — but its complexity is a recurring complaint; commenters point at a 7,000+ line utils.py as the cost of supporting everything. The any-llm thread is largely a referendum on that tradeoff.
- Fragmentation fatigue. Every new router triggers the xkcd “Standards” reflex — yet another abstraction over the same providers. The honest rebuttal is that the OpenAI-format lingua franca means these tools are mostly interchangeable at the call site.
- Does routing beat just using one cheap-good model? The sharpest critique: if an open model like GLM-5.2 is already ~95% as good at a tenth of the price, a lot of “routing for cost” collapses into “just default to the cheap model and escalate rarely.” For many teams, a two-tier cascade (cheap default → frontier on failure) captures most of the savings with a fraction of the complexity.
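The two-tier cascade from that last point fits in a dozen lines. In this sketch `cheap`, `frontier`, and `passes_check` are hypothetical callables you would supply yourself; the check could be a schema validation, a test run, or a confidence score, and its quality determines how well the cascade works.

```python
from typing import Callable, Dict, Union

Model = Callable[[str], str]


def cascade(
    prompt: str,
    cheap: Model,
    frontier: Model,
    passes_check: Callable[[str, str], bool],
) -> Dict[str, Union[str, bool]]:
    """Answer with the cheap model first; escalate only if its draft fails the check."""
    draft = cheap(prompt)
    if passes_check(prompt, draft):
        return {"model": "cheap", "answer": draft, "escalated": False}
    # The cheap call is already paid for, so an escalation costs both calls.
    return {"model": "frontier", "answer": frontier(prompt), "escalated": True}
```

Note the cost shape: escalated requests pay for two model calls and roughly double the latency, so the cascade only wins when the cheap model passes most of the time.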
When not to route
Routing is infrastructure, and infrastructure has a bill:
- Added latency — even a 400ms classifier is 400ms on every request; semantic/rule routers mitigate this, predictive ones don’t.
- Misroutes — a classifier that sends a hard prompt to the weak model produces a confidently wrong answer. You need evals to know your route accuracy, not vibes.
- Non-determinism — “which model answered?” becomes a debugging variable. Log the routing decision with every response.
- Maintenance — model lineups change weekly (we’d know). Routes and classifiers need re-tuning as models launch and prices move.
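Measuring misroutes does not need much machinery if you log every routing decision and have an eval set labeled with the cheapest tier that actually passes. A minimal sketch, assuming a two-tier setup and hypothetical field names:

```python
from dataclasses import dataclass
from typing import Dict, Sequence

TIER_RANK = {"cheap": 0, "frontier": 1}


@dataclass
class RoutedExample:
    prompt_id: str
    routed_to: str    # tier the router chose (taken from your routing log)
    cheapest_ok: str  # cheapest tier that passed your eval for this prompt


def route_report(examples: Sequence[RoutedExample]) -> Dict[str, float]:
    """Summarize how often the router matched, undershot, or overshot the needed tier."""
    total = len(examples)
    if total == 0:
        raise ValueError("need at least one labeled example")
    under = sum(TIER_RANK[e.routed_to] < TIER_RANK[e.cheapest_ok] for e in examples)
    over = sum(TIER_RANK[e.routed_to] > TIER_RANK[e.cheapest_ok] for e in examples)
    return {
        "accuracy": (total - under - over) / total,
        "under_routed": under / total,  # hard prompts sent to the weak model (quality risk)
        "over_routed": over / total,    # easy prompts sent to the strong model (wasted spend)
    }
```

Splitting misroutes into under- and over-routing is the useful part: the first costs you quality, the second costs you money, and they call for opposite threshold adjustments.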
If you call one model for one task, you don’t need a router. The value shows up at scale, with a real mix of easy and hard requests, and a cost line that hurts.
How to choose
- Just want provider-switching + fallbacks: LiteLLM (or any-llm if you want it lean).
- Want a benchmarked cost/quality dial: RouteLLM.
- Want intent/safety routing: semantic-router, or vLLM Semantic Router for production fleets.
- Don’t want to run anything: OpenRouter, Not Diamond, Martian, or Unify.
- Want quality decisions tied to your own evals: Braintrust.
The throughline connects directly to where models are heading. As we argued in our Chinese open-weight models piece, the smart 2026 architecture is model-agnostic — and a router is the component that makes “model-agnostic” actually operational. Pair a good router with a cheap, capable open default and a frontier escalation path, and you get most of the quality at a fraction of the cost — automatically, per request.
Compare the models and providers a router would pick between on Inference Hub. Routing landscape compiled from project docs and Hacker News discussions, as of June 2026.