How to Route Cheap and Premium Models Inside One Agent Workflow

April 12, 2026 • AI Operations • Butler

A practical guide to routing cheap and premium models inside one workflow, with cost logic, escalation rules, and the failure modes that erase savings.

The Butler standing beside a chess table in the manor library, representing strategic model routing decisions inside one agent workflow

Most teams still argue about model choice like they need one permanent winner. That is usually the wrong frame. In a real agent workflow, the useful question is which steps can be done cheaply, and which steps become expensive when they are wrong.

That is why model routing is a workflow decision, not a leaderboard decision. You are not buying intelligence in the abstract. You are deciding where stronger judgment actually reduces downstream cost, retries, review time, and cleanup.

For AI coding and agent operations, that distinction matters a lot. A cheap model can often handle triage, retrieval, parsing, and narrow edits. A premium model earns its keep on planning, ambiguity, repo-scale reasoning, and final review. If you route those steps deliberately, savings can be substantial. If you route them badly, the retry tax eats the win.

Why routing is really a workflow design problem

The wrong way to think about routing is, “Which model is best?”

The better question is, “Which parts of this workflow are cheap to verify, and which parts create expensive damage when they fail?”

That shift changes everything.

A step that is easy to check should usually start cheap. A step that shapes the whole task, touches broad context, or creates risky output should usually get a stronger model earlier. The goal is not to maximize cheap-model usage at all costs. The goal is to minimize total task cost, which includes:

token spend
retries
failed reviews
human cleanup
latency from bad escalations
coordination overhead when the workflow goes sideways

This is the same reason teams should think past sticker-price comparisons alone. If you have not already looked at the pricing side, Butler's AI model pricing comparison for 2026 is a helpful baseline. But price tables do not tell you where routing actually pays off. Workflow shape does.

The routing patterns teams actually use

Most real systems are not doing anything magical. They usually fall into a few practical patterns.

1. Step-fit routing

This is the simplest and often the most reliable pattern.

Use cheaper models for bounded, high-volume steps such as:

task classification
doc or code retrieval
extraction and parsing
simple transforms
narrow implementation passes inside a fixed plan

Use premium models for:

planning
architecture choices
ambiguous requirements
broad repo reasoning
final review on expensive diffs

If your workflow already has clear stages, this is the easiest place to start.

2. Router or judge model

A lightweight model classifies the incoming task as easy, medium, or hard, then routes it to the right tier.

This works well when your task mix is broad, but it is also one of the biggest failure points. A weak judge sends simple work to expensive models or sends hard work down the cheap path where retries pile up.

3. Fallback cascade

Start cheap, then escalate only if the answer fails confidence checks, verification, or policy gates.

This can save real money, but only if the fallback logic is honest. A lot of teams say they have escalation, when what they really have is three cheap retries and a premium model call at the end anyway. That is not routing. That is delay.

4. Risk-routed human checkpoints

Some tasks should not keep bouncing between models once risk gets high enough.

If a workflow touches production, destructive actions, public release, or broad access changes, the real bottleneck may be human approval, not another model retry. That boundary should be explicit. Otherwise teams keep escalating model strength when the actual issue is operational risk acceptance.

A concrete plan-execute-review example

A practical coding workflow might look like this:

Plan

Cheap model triages the request
Cheap model pulls likely files, tests, and docs
Premium model decides the approach, identifies risks, and sets scope

Execute

Cheap or mid-tier model makes the straightforward code changes inside that plan
Cheap model runs narrow checks or formats output for review

Review

Premium model reviews the diff, checks whether the implementation matches the plan, and looks for repo-coupling problems
Human approves if the remaining issue is risk, not reasoning

The premium model is not there to do all the typing. It is there to make the expensive decisions.

Imagine an inbound bug report. A cheap model classifies it, finds the likely modules, and pulls the relevant tests. A premium model decides whether the change is truly local or whether it smells cross-cutting. If the task stays narrow, a cheaper execution model can draft the implementation and run the obvious checks. If the diff expands or tests fail in a surprising way, the workflow escalates before review debt gets ugly.

The cost math only works if you include retries and review burden

The headline savings on routing are real, but only when you measure the full workflow.

A rough 2026 pricing spread still looks something like this:

cheap tier: around $0.05 to $0.10 per million input tokens, and roughly $0.15 to $0.30 per million output tokens
premium tier: around $2.50 to $3.00 per million input tokens, and roughly $10 to $15 per million output tokens

That gap is big enough to matter. If a workflow would cost about $10 with premium models on every step, and 70 to 80 percent of token volume can move to cheaper triage, retrieval, and bounded execution, you may end up closer to $2 to $4. But that only holds if the cheap path does not create extra work.

Here is the practical version of the math:

all-premium cost: $10
routed token cost: $2.50
extra cheap retries: +$0.40
extra premium re-review because scope drifted: +$1.20
human cleanup on one bad escalation: +$0.80 equivalent labor cost
actual effective cost: $4.90

That is still a worthwhile improvement.

Now imagine a worse system:

routed token cost: $2.50
cheap retries and misroutes: +$2.00
delayed premium review: +$2.50
extra human cleanup: +$2.00
actual effective cost: $9.00

At that point you barely saved anything, and you probably made the workflow slower.

This is why Butler's piece on what an AI coding task really costs matters here. Token spend is only one line item. Review burden is the other one people conveniently forget.

Failure modes that break routing systems

A routing setup usually fails in familiar ways.

Judge-model hallucination

The router misreads task difficulty. Easy work gets sent to expensive models, or hard work gets sent to cheap ones. Both are bad. One wastes money. The other creates quality failures and retries.

Over-routing to cheap models

This is the classic “looks efficient on paper” failure. Teams push too much work into the cheap tier, quality drops, and the premium layer ends up doing rescue work instead of selective high-value review.

Retry overhead

A cheap-first cascade can erase savings when false negatives are common. If the system keeps discovering too late that the cheap answer was not good enough, you pay twice.

Latency spikes

Synchronous cascades are fine until every other task escalates. Then response time gets ugly fast. If latency matters, routing logic needs to be simple and observable.

Routing overhead on tiny tasks

For very small jobs, the routing layer itself can cost more than the savings. If a task is trivial and low risk, one direct call may be the right answer.

Teams working on bigger codebases should also keep large-repo failure patterns in mind. Routing does not magically fix context sprawl or hidden repo coupling.

When not to bother routing

Routing is not mandatory maturity. Sometimes it is just extra machinery. You may not need it if:

your tasks are all roughly the same complexity
most tasks are medium-to-hard anyway
latency is strict and synchronous cascades are unacceptable
you do not have enough observability to tune thresholds
your verification layer is weak, so the system cannot tell when cheap output is actually failing

In those cases, a simpler model policy may outperform a clever router.

A simple rule teams can implement this week

If you want a usable starting point, use this:

start each step with the cheapest model that has low verification burden
escalate immediately for planning, architecture, ambiguous scope, or broad repo context
escalate when retries are becoming more expensive than one stronger call
route to human approval when the blocker is risk acceptance rather than model capability

That rule is not glamorous, but it is how real teams keep routing grounded.

Bottom line

Good routing is not about proving that cheap models are secretly enough for everything. It is about assigning expensive judgment to the places where mistakes create expensive downstream work.

If you route by workflow step, failure cost, and review burden, you can often cut spend without making the system flimsy. If you route by leaderboard vibes alone, you usually build a retry machine.

Related coverage

AI Disclosure

This article was researched and drafted with AI assistance, then edited and structured for publication by a human. Pricing and workflow patterns can shift as model capabilities and routing tools change.