LLM Routing Is Pareto Math


tl;dr

You don't really pick “the best model”.

You pick whatever model makes sense for this one request.

That usually means:

  • guess the cost from input/output tokens
  • guess latency from TTFT and output speed
  • set some minimum quality bar
  • pick the cheapest model that still passes

So yeah, model selection is not really a leaderboard thing. It is more like boring infra / control plane stuff.

The basic problem

Every hosted LLM call has three things you can actually look at:

  1. Quality q: some benchmark score. Not truth. Just a rough ordering. I use the Artificial Analysis Intelligence Index because it is good enough for this example.
  2. Cost c: token pricing after input tokens, output tokens, cache, batch discounts, etc.
  3. Latency l: time to first token, then the time it takes to stream the rest.

For a request with n_in input tokens and n_out output tokens, with prices p_in and p_out in dollars per million tokens:

c = (n_in * p_in + n_out * p_out) / 1e6

If your provider has caching or batch discounts, add those rules. Do not just assume “cache is 50% off” everywhere. Providers love making this stuff weird.

Latency is:

l = TTFT + n_out / R

R is the decode rate in tokens/sec. If the UI feels slow, the cause is probably a low R plus queueing.
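To make the two formulas concrete, here is a minimal sketch that plugs in one set of numbers. The function names `request_cost` and `request_latency` are made up for this example. The prices, TTFT, and decode rate are the gemini-3.1-pro row from the example catalog later in this post, so treat them as a snapshot, not a quote.

```python
def request_cost(n_in, n_out, p_in, p_out):
    """Dollar cost of one request; p_in and p_out are $ per million tokens."""
    return (n_in * p_in + n_out * p_out) / 1e6


def request_latency(n_out, ttft, rate):
    """Seconds until the last token: TTFT plus decode time at `rate` tokens/sec."""
    return ttft + n_out / rate


# Illustrative request: 800 input tokens, 250 output tokens.
cost = request_cost(800, 250, p_in=2.00, p_out=12.00)
lat = request_latency(250, ttft=0.9, rate=142)

print(f"cost    = ${cost:.5f}")  # (800*2 + 250*12) / 1e6
print(f"latency = {lat:.2f}s")   # 0.9 + 250/142
```

Neither function knows about caching, batch pricing, or queueing. Those are provider-specific and would need their own rules.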

There is no single winner

A model can be clearly worse than another one.

Say model B, compared with model A, has:

  • same or better quality
  • same or lower cost
  • same or lower latency

and is strictly better on at least one of them. Then A is just worse. Remove it.

B dominates A if:

q_B >= q_A
c_B <= c_A
l_B <= l_A
and at least one inequality is strict

After deleting every dominated model, what is left is the Pareto frontier.

The router is not asking:

what is the smartest model?

It is asking:

what is the best model I can use under my constraints?

More formally:

maximize q_m

subject to:
  c_m <= cost_max
  l_m <= latency_max
  q_m >= quality_floor

In production, you probably want something softer:

U(m) = q_m - lambda_cost * c_m - lambda_latency * l_m

This is not deep science. It is just a scoring function.

Also, the benchmark score is ordinal. Dollars and seconds are not. So do not get too galaxy-brained about the units. It is still useful.

lambda_cost means how much you care about money.

lambda_latency means how much you care about latency.

Your lambdas are basically your business model written as numbers.
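A quick way to see what the lambdas do is to score the same models under a few settings. This is only a sketch: `pick` is a made-up helper, the three rows are precomputed for an 800-in / 250-out request from the example catalog later in this post, and the lambda values are arbitrary illustrations, not recommendations.

```python
# (name, quality, cost in $/request, latency in seconds)
ROWS = [
    ("claude-opus-4.8 @ anthropic", 61, 0.01025, 5.93),
    ("gemini-3.1-pro @ google",     57, 0.00460, 2.66),
    ("gpt-oss-120b @ cerebras",     45, 0.00047, 0.38),
]


def pick(rows, lam_cost, lam_latency):
    """Return the name with the highest U = q - lam_cost*c - lam_latency*l."""
    def utility(row):
        _, q, c, l = row
        return q - lam_cost * c - lam_latency * l
    return max(rows, key=utility)[0]


# Same models, three different "business models".
quality_only = pick(ROWS, lam_cost=0, lam_latency=0)
balanced = pick(ROWS, lam_cost=1e3, lam_latency=2)
frugal = pick(ROWS, lam_cost=1e4, lam_latency=10)

print("quality only:", quality_only)
print("balanced:    ", balanced)
print("frugal:      ", frugal)
```

With these numbers the winner moves from claude-opus to gemini to gpt-oss as the penalties grow. Nothing about the models changed. Only the weights did.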

Tiny router

Here is the dumb version:

from dataclasses import dataclass
from math import inf

@dataclass
class Model:
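    name: str
    q: float      # quality score (benchmark index)
    p_in: float   # $ per million input tokens
    p_out: float  # $ per million output tokens
    tput: float   # decode speed, tokens/sec
    ttft: float   # time to first token, seconds

    def cost(self, n_in, n_out, batch=False):
        c = (n_in * self.p_in + n_out * self.p_out) / 1e6
        # Placeholder: flat 50% batch discount. Swap in your provider's real rule.
        return c * (0.5 if batch else 1.0)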

    def latency(self, n_out):
        return self.ttft + n_out / self.tput

def utility(m, n_in, n_out, lam_cost, lam_latency, **kw):
    return (
        m.q
        - lam_cost * m.cost(n_in, n_out, **kw)
        - lam_latency * m.latency(n_out)
    )

def route(
    catalog,
    n_in,
    n_out,
    q_min=0,
    cost_max=inf,
    latency_max=inf,
    lam_cost=0,
    lam_latency=0,
    **kw
):
    feasible = [
        m for m in catalog
        if m.q >= q_min
        and m.cost(n_in, n_out, **kw) <= cost_max
        and m.latency(n_out) <= latency_max
    ]

    if not feasible:
        return None

    return max(
        feasible,
        key=lambda m: utility(m, n_in, n_out, lam_cost, lam_latency, **kw)
    )

This is the product, honestly. The model is just one input.

Remove bad options first

def frontier(models, n_in, n_out, **kw):
    rows = [
        (m, m.q, m.cost(n_in, n_out, **kw), m.latency(n_out))
        for m in models
    ]

    keep = []

    for a, qa, ca, la in rows:
        dominated = any(
            qb >= qa and cb <= ca and lb <= la
            and (qb > qa or cb < ca or lb < la)
            for _, qb, cb, lb in rows
        )

        if not dominated:
            keep.append(a)

    return keep

This just removes leaderboard spam. After that, there are fewer things to argue about.

Example catalog

Snapshot: 2026-06-08. These numbers will go stale. Put this stuff in config, not hard-coded forever.

CATALOG = [
    Model("claude-opus-4.8 @ anthropic", 61, 5.00, 25.00,  67, 2.2),
    Model("gpt-5.5-xhigh @ openai",      60, 5.00, 30.00,  65, 1.8),
    Model("gemini-3.1-pro @ google",     57, 2.00, 12.00, 142, 0.9),

    Model("gpt-oss-120b @ cerebras",     45, 0.35, 0.75, 3000, 0.3),
    Model("gpt-oss-120b @ groq",         45, 0.15, 0.60,  500, 0.3),

    Model("llama-3.3-70b @ groq",        41, 0.59, 0.79,  250, 0.3),
    Model("llama-3.1-8b @ groq",         24, 0.05, 0.08,  560, 0.2),
]

Say we have a support request:

input:  800 tokens
output: 250 tokens

You get something like:

model                              $/req      lat(s)   Q
claude-opus-4.8 @ anthropic      0.01025      5.93    61
gpt-5.5-xhigh @ openai           0.01150      5.65    60
gemini-3.1-pro @ google          0.00460      2.66    57
gpt-oss-120b @ cerebras          0.00047      0.38    45
gpt-oss-120b @ groq              0.00027      0.80    45
llama-3.3-70b @ groq             0.00067      1.30    41
llama-3.1-8b @ groq              0.00006      0.65    24

You can already see the tradeoff.

The “smartest” one is expensive and not super fast. The fastest one is not the smartest. The cheapest one might still be fine.

That “might still be fine” part is where the money is.
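You can also run the dominance check on that table directly. The sketch below copies the rounded table values into tuples and applies the "B dominates A" rule from earlier. `dominates` is a hypothetical helper name, and the result only holds for this 800-in / 250-out request shape.

```python
# (name, quality, $/request, latency seconds), copied from the table above.
ROWS = [
    ("claude-opus-4.8 @ anthropic", 61, 0.01025, 5.93),
    ("gpt-5.5-xhigh @ openai",      60, 0.01150, 5.65),
    ("gemini-3.1-pro @ google",     57, 0.00460, 2.66),
    ("gpt-oss-120b @ cerebras",     45, 0.00047, 0.38),
    ("gpt-oss-120b @ groq",         45, 0.00027, 0.80),
    ("llama-3.3-70b @ groq",        41, 0.00067, 1.30),
    ("llama-3.1-8b @ groq",         24, 0.00006, 0.65),
]


def dominates(b, a):
    """True if b is no worse than a on every axis and strictly better on one."""
    _, qb, cb, lb = b
    _, qa, ca, la = a
    no_worse = qb >= qa and cb <= ca and lb <= la
    strictly_better = qb > qa or cb < ca or lb < la
    return no_worse and strictly_better


dominated = [a[0] for a in ROWS if any(dominates(b, a) for b in ROWS)]
on_frontier = [a[0] for a in ROWS if a[0] not in dominated]

print("dominated:", dominated)
print("frontier: ", on_frontier)
```

For this request, only llama-3.3-70b @ groq drops out: gpt-oss-120b on either provider scores higher, costs less, and finishes sooner. The other six all win on at least one axis, so they stay.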

Provider matters too

A model name is not the whole product.

The actual product is:

model
+ runtime
+ queue behavior
+ cache policy
+ rate limits
+ failure mode
+ price sheet

Same open-weight model on two providers? Not the same thing.

Different runtime, different TTFT, different tokens/sec, different rate limits, different ways it fails at 3am.

For chat, streaming speed matters because users stare at the cursor.

For batch, streaming speed matters way less. Cheap can beat fast.

For agents, TTFT is not always the bottleneck. Tool loops and context growth can dominate.

For support bots, most requests should probably not hit the frontier model at all.

Cascading

Basic idea:

  1. Try cheap model.
  2. If it seems confident, use it.
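  3. If not, escalate to stronger model.

# `call(model, prompt)` is a placeholder for your provider client; it returns the answer text.
# `conf_fn(prompt, answer)` returns a confidence score between 0 and 1.
def cascade(catalog, prompt, n_in, n_out, conf_fn, threshold=0.8):
    # Cheap tier: low quality floor, heavy penalties on cost and latency.
    cheap = route(
        catalog,
        n_in,
        n_out,
        q_min=20,
        lam_cost=1e4,
        lam_latency=10
    )

    # route() returns None when nothing is feasible, so guard before calling.
    if cheap is not None:
        ans = call(cheap, prompt)
        if conf_fn(prompt, ans) >= threshold:
            return ans, cheap.name

    # Strong tier: high quality floor, softer penalties.
    strong = route(
        catalog,
        n_in,
        n_out,
        q_min=55,
        lam_cost=1e3,
        lam_latency=2
    )

    if strong is None:
        raise RuntimeError("no model meets the strong-tier quality floor")

    ans = call(strong, prompt)
    return ans, strong.name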

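Blended cost, where alpha is the fraction of requests that escalate. Every request pays for the cheap call; escalated requests pay for both: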

c_blended = c_cheap + alpha * c_strong

If 20% of requests escalate:

c_cheap  = $0.0004
c_strong = $0.0090
alpha    = 0.2

c_blended = 0.0004 + 0.2 * 0.0090 = $0.0022/request
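The same arithmetic as a small sketch, plus the break-even escalation rate you get by solving c_cheap + alpha * c_strong = c_strong for alpha. Both function names are made up for this example, and the dollar figures are the illustrative ones above.

```python
def blended_cost(c_cheap, c_strong, alpha):
    """Expected $/request when every request pays c_cheap
    and a fraction alpha also pays c_strong."""
    return c_cheap + alpha * c_strong


def break_even_alpha(c_cheap, c_strong):
    """Escalation rate at which the cascade costs the same as
    always calling the strong model."""
    return 1 - c_cheap / c_strong


c = blended_cost(0.0004, 0.0090, alpha=0.2)
a_star = break_even_alpha(0.0004, 0.0090)

print(f"blended    = ${c:.4f}/request")
print(f"break-even = {a_star:.1%} escalation rate")
```

This only counts dollars. Escalated requests also pay the latency of both calls, which the formula ignores.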

So this is not really “model optimization”.

It is allocation.

Log stuff or you are guessing

If you are not logging this, you are mostly making vibes-based routing decisions:

request_id
provider
model
input_tokens
output_tokens
cache_hit
ttft_ms
decode_tps
total_latency_ms
cost_usd
quality_floor
router_policy
fallback_reason
user_visible_error
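One way to keep those fields honest is to give them a type and emit one JSON line per request. This is a sketch, not a schema: `RouteLog` and `to_json_line` are hypothetical names, the field types are my assumptions, and the sample values are made up.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RouteLog:
    # Field names follow the list above; the types are assumptions.
    request_id: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cache_hit: bool
    ttft_ms: float
    decode_tps: float
    total_latency_ms: float
    cost_usd: float
    quality_floor: float
    router_policy: str
    fallback_reason: Optional[str] = None
    user_visible_error: Optional[str] = None


def to_json_line(record: RouteLog) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


# Made-up values, only to show the shape of one log line.
line = to_json_line(RouteLog(
    request_id="req-0001",
    provider="groq",
    model="gpt-oss-120b",
    input_tokens=800,
    output_tokens=250,
    cache_hit=False,
    ttft_ms=300.0,
    decode_tps=500.0,
    total_latency_ms=800.0,
    cost_usd=0.00027,
    quality_floor=40,
    router_policy="support-default",
))
print(line)
```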

Also track p50, p95, and p99 for TTFT and total latency, per provider and model.

Provider marketing numbers are best-case. Your logs are what actually happened.
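Getting percentiles out of logged latencies does not need much. A minimal sketch using the standard library's `statistics.quantiles`; the helper name `latency_percentiles` and the synthetic samples are assumptions for illustration.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return (p50, p95, p99) from a list of latency samples in milliseconds."""
    # quantiles(..., n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = quantiles(samples_ms, n=100)
    return cuts[49], cuts[94], cuts[98]


# Synthetic samples: mostly fast, with a slow tail. Replace with real log values.
samples = [400.0] * 90 + [1200.0] * 8 + [5000.0] * 2
p50, p95, p99 = latency_percentiles(samples)

print(f"p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

An average over these samples would blur the slow tail into one number. The p95 and p99 values keep it visible, and the tail is what users complain about.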

Small shop vs mid-size company

A small shop usually cares a lot about cost.

It probably wants:

one fast managed endpoint
one fallback path
model name in env vars
no GPU ownership
no inference team
no pager

The first version should be boring:

cheap fast model
strict prompt
small knowledge base
fallback to human
measure everything

A mid-size company has more weird edge cases and more expensive mistakes.

It probably wants:

cheap model for bulk traffic
frontier model for high-value ambiguity
batch API for offline work
policy router
telemetry loop
cost dashboard

Do not just “buy a smarter model” and call it a strategy.

Route better.

Stuff that will bite you

  1. Quality scores are ordinal. Do not pretend one index point is a real physical unit.

  2. Caching is provider-specific. Prompt cache, batch price, priority lane, flex queue, all different.

  3. Throughput is not an SLO. Shared tiers slow down. TTFT grows. TPM (tokens per minute) limits happen.

  4. Context costs money. Long system prompts, retrieved docs, and chat history are not free.

  5. Free tiers are not production economics. They prove your demo works. That is about it.

  6. Fallbacks are UX. “We are busy” is a product choice, not just an exception handler.

  7. Catalog should be config. If price changes require a deploy, your router is too rigid.
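For the last point, here is roughly what "catalog as config" could look like. This is a sketch under assumptions: the JSON layout with `snapshot` and `models` keys is invented for this example, `CatalogEntry` and `load_catalog` are hypothetical names, and the two rows are copied from the example catalog above.

```python
import json
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    # Mirrors the fields of the Model class earlier in the post.
    name: str
    q: float
    p_in: float   # $ per million input tokens
    p_out: float  # $ per million output tokens
    tput: float   # decode tokens/sec
    ttft: float   # seconds


def load_catalog(text):
    """Parse a JSON document shaped like {"snapshot": ..., "models": [...]}."""
    doc = json.loads(text)
    return doc["snapshot"], [CatalogEntry(**row) for row in doc["models"]]


# In a real deployment this string would come from a file or a config service.
CONFIG = """
{
  "snapshot": "2026-06-08",
  "models": [
    {"name": "gpt-oss-120b @ groq", "q": 45, "p_in": 0.15, "p_out": 0.60, "tput": 500, "ttft": 0.3},
    {"name": "llama-3.1-8b @ groq", "q": 24, "p_in": 0.05, "p_out": 0.08, "tput": 560, "ttft": 0.2}
  ]
}
"""

snapshot, catalog = load_catalog(CONFIG)
print(snapshot, [m.name for m in catalog])
```

Keeping the snapshot date next to the prices makes stale numbers easy to spot. A price change then becomes a config edit instead of a deploy.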

The ending

Leaderboard question:

which model is smartest?

Finance question:

which model is cheapest?

Product question:

which model should handle this request?

So route it:

m* = argmax_m [q_m - lambda_cost * c_m - lambda_latency * l_m]

with your actual constraints.

Benchmarks tell you what the frontier looks like.

Telemetry tells you where you should sit on it.
