May 5, 2026
LLM Routing: How to Pick the Right Model for Every Request



Why Sending Everything to One Model Is a Bad Default
Most AI systems start the same way: pick the best model available, point all requests at it, ship. It works until it doesn't. Then you get a bill far bigger than you planned for, latency complaints on simple queries, and occasional failures on complex ones.
The problem is that "best model" means different things depending on the task. A question like "what's the capital of France?" does not need GPT-4o or Claude 3.5 Sonnet. A question like "review this 200-line contract clause for ambiguous indemnification language" absolutely does. Treating these identically wastes money on the first and risks quality on the second.
LLM routing is the practice of deciding, per request, which model (or set of models) should handle it. It sounds obvious once stated. The implementation details are where it gets interesting.
What an LLM Router Actually Does
An LLM router sits between your application and your model providers. Every request goes through it. The router inspects the request, applies some decision logic, and forwards it to the appropriate backend.
That decision logic can be:
Rule-based: simple conditions on token count, topic, or user tier
Model-based: a small classifier or embedding model that predicts which backend will produce the best result
Cost-optimized: always try the cheapest model first, escalate if confidence is low
Latency-sensitive: route to whichever endpoint is currently fastest
Hybrid: combinations of the above, with fallback chains
The router does not need to be elaborate. A well-designed rule-based router can capture most of the value with a few dozen lines of code. The complexity scales with how much you care about optimizing the remainder.
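To make the rule-based option concrete, here is a minimal sketch. The model names, the topic list, the "enterprise" tier, and the 200-token cutoff are all placeholder assumptions, and the token estimate is a crude word count rather than a real tokenizer.

```python
from dataclasses import dataclass

# Hypothetical backend names; substitute your own model identifiers.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "large-model"

# Illustrative topics that always go to the stronger model (an assumption).
SENSITIVE_TOPICS = {"legal", "medical", "finance"}


@dataclass
class Request:
    prompt: str
    topic: str = "general"
    user_tier: str = "free"


def estimate_tokens(text: str) -> int:
    """Rough estimate: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


def route(request: Request, max_cheap_tokens: int = 200) -> str:
    """Return the backend name for a request by applying rules in order."""
    if request.topic in SENSITIVE_TOPICS:
        return STRONG_MODEL
    if request.user_tier == "enterprise":
        return STRONG_MODEL
    if estimate_tokens(request.prompt) > max_cheap_tokens:
        return STRONG_MODEL
    return CHEAP_MODEL
```

The rules are evaluated top to bottom, so the order encodes priority: topic beats tier, and tier beats length.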
The Four Main Routing Strategies
1. Complexity-Based Routing
The most common starting point. You classify each incoming request as simple or complex, then send simple queries to a cheaper, faster model and complex ones to a more capable model.
Complexity signals you can use: token count, presence of specialized vocabulary, number of sub-tasks in the prompt, whether the user is asking for generation vs. retrieval vs. reasoning. A small BERT-class classifier trained on labeled examples from your own workload will typically outperform any heuristic here.
The RouteLLM project from LMSYS (the team behind Chatbot Arena) released open-source routers trained on preference data. The paper reports cost reductions of more than 2x in some cases without a meaningful drop in response quality, achieved by sending the queries a weaker model can handle to that model and reserving the stronger one for the rest. Their research is worth reading if you want to understand how preference-based training translates into routing decisions: arxiv.org/abs/2406.18665.
2. Cascade Routing
Try the cheap model first. If its output meets a confidence threshold, return it. If not, send the request to a more powerful model.
This requires a way to measure confidence. For classification tasks, that's straightforward: you can use the model's softmax probabilities. For open-ended generation, it's harder. Common proxies: self-consistency (run the cheap model twice, check if outputs agree), a small reward model scoring the output quality, or asking the model to rate its own certainty (unreliable but fast).
Cascade routing works well when the majority of your requests are genuinely easy and you have a measurable quality threshold. It adds both latency and cost on the hard cases, because you pay for the cheap model's attempt before the strong model runs. As a rough rule of thumb, if more than 40% of your requests end up escalating, the overhead starts to hurt.
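A cascade with the self-consistency proxy might look like the sketch below. The `cheap` and `strong` callables stand in for whatever client functions you use; they are assumptions, not a real SDK. Exact-match agreement after normalization only makes sense for short, factual answers. Longer generations would need a similarity measure or a reward model instead.

```python
from typing import Callable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def cascade(
    prompt: str,
    cheap: Callable[[str], str],
    strong: Callable[[str], str],
    samples: int = 2,
) -> Tuple[str, str]:
    """Try the cheap model first; escalate if its samples disagree.

    Returns (output, tier), where tier is "cheap" or "strong".
    """
    outputs = [cheap(prompt) for _ in range(samples)]
    if len({normalize(o) for o in outputs}) == 1:
        # All samples agree: treat that as sufficient confidence.
        return outputs[0], "cheap"
    # Disagreement: pay for the stronger model.
    return strong(prompt), "strong"
```

Note that every escalated request pays for `samples` cheap calls plus one strong call, which is where the overhead on hard cases comes from.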
3. Domain-Specialized Routing
Some problems have specialized models that outperform general-purpose frontier models. Medical coding, legal analysis, code generation in niche frameworks, financial document parsing: there are fine-tuned models for many of these that cost less and perform better within their domain.
Domain routing classifies the request by topic or task type and sends it to the best-suited model. This is less about cost optimization and more about quality. A specialized code model running on cheaper infrastructure can match or beat a general-purpose frontier model on some code generation tasks at a fraction of the price. Hugging Face's StarCoder2 writeup is a useful reference for understanding where specialized code models sit on the capability curve.
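As a starting point before you have a trained classifier, domain detection can be sketched with keyword overlap. The keyword sets and model names here are invented for illustration. A real deployment would most likely replace `classify_domain` with an embedding or fine-tuned classifier.

```python
import re

# Illustrative keyword sets per domain (assumptions, not a vetted vocabulary).
DOMAIN_KEYWORDS = {
    "code": {"function", "traceback", "compile", "refactor"},
    "legal": {"contract", "indemnification", "clause", "liability"},
    "medical": {"diagnosis", "icd", "dosage", "symptom"},
}

# Hypothetical specialist model names.
DOMAIN_MODELS = {
    "code": "code-specialist",
    "legal": "legal-specialist",
    "medical": "medical-specialist",
}
DEFAULT_MODEL = "general-model"


def classify_domain(prompt: str) -> str:
    """Pick the domain with the most keyword hits, or "general" if none match."""
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    scores = {d: len(words & kw) for d, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"


def route_by_domain(prompt: str) -> str:
    """Map the detected domain to a model, falling back to the general model."""
    return DOMAIN_MODELS.get(classify_domain(prompt), DEFAULT_MODEL)
```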
4. Availability and Latency Routing
Provider APIs go down. Rate limits get hit. Regional latency varies. A production system needs fallback logic that is not just about quality: it's about reliability.
This type of router maintains a live view of endpoint health and routes accordingly. At its simplest, it's a retry-with-fallback pattern. At its most sophisticated, it's a load balancer with circuit breakers, weighted round-robin across providers, and automatic failover. Tools like LiteLLM handle a lot of this plumbing and are worth knowing about even if you build your own routing logic on top.
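The retry-with-fallback pattern plus a basic circuit breaker can be sketched in plain Python. Everything here is illustrative: the provider callables, the failure threshold, and the cooldown are assumptions, and this is not how LiteLLM or any specific gateway implements it.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class CircuitBreaker:
    """Skip a provider for cooldown_s seconds after max_failures consecutive errors."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, allow a trial request through.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


def call_with_fallback(
    prompt: str,
    providers: List[Tuple[str, Callable[[str], str]]],
    breakers: Dict[str, CircuitBreaker],
) -> Tuple[str, str]:
    """Try providers in order, skipping any whose breaker is open.

    Returns (provider_name, output) from the first provider that succeeds.
    """
    last_error: Optional[Exception] = None
    for name, call in providers:
        breaker = breakers.setdefault(name, CircuitBreaker())
        if not breaker.available():
            continue
        try:
            result = call(prompt)
        except Exception as exc:  # in practice, catch provider-specific errors
            breaker.record_failure()
            last_error = exc
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all providers failed or are unavailable") from last_error
```

Keeping the breaker state outside the function (the `breakers` dict) lets it persist across requests, so a provider that keeps failing stops being tried until its cooldown expires.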
Building a Basic Router
Routing does not require a dedicated infrastructure service to start. A simple implementation in Python looks like this conceptually: classify the request, select a model based on the classification, call the appropriate client, handle failures.
The tricky part is the classification step. If you're early and don't have labeled data yet, start with a heuristic: token count under 200 and no specialized vocabulary goes to the cheap model, everything else goes to the capable model. Log everything. After a week of traffic, you'll have labeled examples based on user satisfaction signals, output ratings, and whether the agent completed the task. Use that data to train an actual classifier.
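The starter heuristic described above could be written like this. The term list is a made-up placeholder for your own domain vocabulary, and word count stands in for a real token count. Returning the reason alongside the decision makes the later logging step easier.

```python
import re
from typing import Tuple

# Placeholder vocabulary; replace with terms that signal difficulty in your workload.
SPECIALIZED_TERMS = {"indemnification", "amortization", "contraindication", "mutex"}


def pick_model(prompt: str, token_limit: int = 200) -> Tuple[str, str]:
    """Return (tier, reason). Short prompts without special terms go to "cheap"."""
    words = re.findall(r"[a-z]+", prompt.lower())
    if len(words) >= token_limit:  # word count as a rough token proxy
        return "capable", "token_count"
    hits = SPECIALIZED_TERMS.intersection(words)
    if hits:
        return "capable", "specialized_vocabulary:" + ",".join(sorted(hits))
    return "cheap", "default"
```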
A few things to build in from day one:
Logging at the router level. You want per-request records of which model was selected, why, and what the outcome was. Without this, you're flying blind.
A fallback chain. Primary model fails or times out, try the secondary. Don't let a single provider outage kill your whole system.
Cost tracking. Track token consumption per route. You need to know whether routing is actually saving money, and by how much.
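One way to cover the logging and cost-tracking items together is a single per-request record. The field names and the prices below are assumptions for illustration. Real providers usually price input and output tokens differently, which this sketch ignores.

```python
import json
import time
from collections import defaultdict
from dataclasses import asdict, dataclass
from typing import Dict, Iterable, List

# Hypothetical prices in USD per 1,000 tokens; replace with your providers' rates.
PRICE_PER_1K_TOKENS = {"cheap": 0.0005, "capable": 0.005}


@dataclass
class RouteRecord:
    request_id: str
    model: str          # which tier handled the request
    reason: str         # why the router chose it
    input_tokens: int
    output_tokens: int
    latency_ms: float
    outcome: str        # e.g. "ok", "fallback", "error"

    @property
    def cost_usd(self) -> float:
        total = self.input_tokens + self.output_tokens
        return total / 1000 * PRICE_PER_1K_TOKENS[self.model]


def log_record(record: RouteRecord, sink: List[str]) -> None:
    """Append one JSON line per request; sink could be a file or log shipper."""
    entry = asdict(record)
    entry["cost_usd"] = record.cost_usd
    entry["logged_at"] = time.time()
    sink.append(json.dumps(entry))


def cost_by_model(records: Iterable[RouteRecord]) -> Dict[str, float]:
    """Sum spend per model so you can see whether routing is saving money."""
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.model] += r.cost_usd
    return dict(totals)
```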
The Genta team routinely sets up routing layers like this as part of production agent deployments. The logging and fallback infrastructure tends to surface more optimization opportunities than the routing logic itself.
LLM Gateways vs. Custom Routers
There's a category of tools often called LLM gateways or LLM proxies that provide routing as a managed service. They sit in front of your model calls, normalize the API interface across providers, and offer routing, caching, rate limiting, and observability out of the box.
The tradeoff is control. A gateway gives you fast time-to-value but routes based on generic signals. A custom router can use task-specific signals: things a generic gateway doesn't know about, like whether a request is part of a multi-step agent run, what tools the agent has already called, or what the downstream use of the output will be.
For most teams starting out, an LLM gateway is the right first step. Once you have enough traffic data and a clear understanding of where routing decisions are wrong, you can layer in custom logic: either on top of the gateway or replacing it entirely.
Common Mistakes
The most frequent error is routing based on prompt length alone. Length correlates weakly with complexity. A 50-token instruction to rewrite a contract clause is harder than a 500-token narrative summary request. Length is a useful signal but should not be the only one.
Second: not testing routing decisions. Engineers build a router, measure aggregate cost savings, and call it done. But routing errors are silent: you don't see them unless you're sampling outputs from both paths and comparing quality. Set up an eval pipeline that periodically re-routes a sample of traffic through both models and compares results. The agent evaluation patterns we've written about apply directly here.
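A sampling-based comparison along those lines might be sketched as follows. The `other_model` and `judge` callables are hypothetical: the judge could be a reward model, a rubric-based grader, or a human review queue. Hashing the request ID keeps the sample deterministic, so the same request is always in or out of the eval set.

```python
import hashlib
from typing import Callable, Dict, Optional, Union


def in_eval_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def shadow_compare(
    request_id: str,
    prompt: str,
    routed_output: str,
    other_model: Callable[[str], str],
    judge: Callable[[str, str], float],
    rate: float = 0.05,
) -> Optional[Dict[str, Union[str, float]]]:
    """For sampled requests, also run the model the router did NOT pick and score both."""
    if not in_eval_sample(request_id, rate):
        return None
    other_output = other_model(prompt)
    return {
        "request_id": request_id,
        "routed_score": judge(prompt, routed_output),
        "other_score": judge(prompt, other_output),
    }
```

Run the shadow call off the request path (a queue or background job) so the extra model call does not add user-facing latency.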
Third: premature optimization. It's tempting to build a sophisticated multi-tier cascade with confidence scoring before you have enough traffic to know what you're optimizing. Start simple. A binary router with good logging will teach you more in two weeks than months of upfront architectural planning.
When Routing Matters Most
If you're running fewer than a few thousand requests per day, routing probably won't move your cost needle enough to justify the engineering investment. The math changes fast though: at 100,000 requests per day, routing even 60% of traffic to a model that costs 10x less cuts model spend by roughly half, assuming similar token counts per request.
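The arithmetic behind that example is easy to check. This sketch assumes every request uses about the same number of tokens regardless of which model serves it, and the per-request price in the usage note is a made-up figure.

```python
def blended_cost_fraction(routed_share: float, cost_ratio: float) -> float:
    """Spend relative to sending everything to the expensive model.

    routed_share: fraction of traffic sent to the cheaper model (0 to 1).
    cost_ratio: how many times cheaper that model is (10 means 10x cheaper).
    """
    return (1 - routed_share) + routed_share / cost_ratio


def daily_savings(
    requests_per_day: int,
    cost_per_request: float,
    routed_share: float,
    cost_ratio: float,
) -> float:
    """Absolute savings per day versus the all-expensive baseline."""
    baseline = requests_per_day * cost_per_request
    return baseline * (1 - blended_cost_fraction(routed_share, cost_ratio))
```

With 60% of traffic routed to a model that is 10x cheaper, `blended_cost_fraction(0.60, 10)` is 0.4 + 0.06, or about 0.46 of the baseline. At a hypothetical $0.01 per request and 100,000 requests per day, `daily_savings(100_000, 0.01, 0.60, 10)` comes to about $540 per day.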
Routing also matters when your workload has clearly distinct task types. A single-purpose chatbot with homogeneous queries doesn't need routing. A general-purpose AI assistant handling everything from "draft this email" to "analyze this spreadsheet" almost certainly does.
The other dimension is latency. If your p95 response time matters to users, routing simple queries to a fast, cheap model while reserving the powerful model for genuinely complex requests will improve the latency distribution across your whole workload, not just reduce costs.
Where This Fits in a Production AI Stack
LLM routing is one layer in a broader production stack. It interacts with context management (what you send to the model affects routing decisions), prompt caching (cached prompts change the effective cost calculation), and observability (you can't improve routing without measuring it).
If you're building toward a serious production deployment, routing should probably come after you've stabilized your prompts and context pipeline, and before you start optimizing individual agent steps. The sequence: get something working, instrument it, then route intelligently based on real data.
The infrastructure is learnable and the payoff is real. Teams that treat LLM selection as a static configuration decision tend to leave both performance and cost savings on the table.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.