Dual-LLM Model Routing: How to Cut 80% of Your AI Costs Without Degrading Quality
If you're building an AI-powered product and sending every request to your most capable model, you're overspending by an order of magnitude. Most requests don't need frontier reasoning. They need fast, accurate execution of well-defined operations.
This guide breaks down how we implemented dual-LLM model routing in Planck, our AI calendar agent, and the real-world cost and latency numbers behind it.
The problem with one-model-fits-all
Consider these two user messages to a calendar agent:
- "What's on my calendar tomorrow?"
- "I have back-to-back meetings Tuesday through Thursday, but I need 4 hours of focus time to prep for the board deck. Sarah and Mike both need a 30-minute slot with me this week, but Sarah is OOO Wednesday. What should I move?"
Message 1 is a database lookup wrapped in natural language. It needs intent classification, a tool call to fetch events, and a formatted response. Any model can do this reliably.
Message 2 requires multi-constraint reasoning: availability across multiple people, priority weighting (board prep > regular meetings), temporal constraints (Sarah's OOO), and a recommendation that balances competing needs. This is where model capability matters.
Sending both messages to Claude Sonnet (or GPT-4) means you're paying ~$0.015 per request for work that a $0.0002 model handles just as well 80% of the time.
The routing architecture
Our approach has three components: a classifier, a model map, and a provider abstraction.
Step 1: Request classification
Before calling any LLM, we classify the incoming message into a request type and complexity level. This classification is deterministic (regex + keyword matching), not LLM-based, so it adds zero latency and zero cost.
Request types and their complexity:
Low: get_schedule, set_preference, simple_query
Medium: create_event, check_availability, cancel_event
High: multi_person_scheduling, tradeoff_analysis, negotiation, explain_decision
The classifier checks for signal phrases:
- "Schedule with [person]" or "find time for [group]" → multi_person_scheduling (high)
- "Should I..." or "which is better" → tradeoff_analysis (high)
- "What's on my..." or "show me..." → get_schedule (low)
- "Cancel my..." or "move my..." → medium complexity
This isn't perfect. Some messages are ambiguous. But it doesn't need to be perfect. It needs to route 80% of requests correctly, and the 20% that get routed to a more capable model than strictly necessary just cost a few extra cents.
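The classifier can be sketched as an ordered rule table where the first match wins. The patterns below follow the signal phrases above; the exact production rules and the fallback type name are assumptions.

```python
import re

# Ordered rules: first match wins. Patterns follow the signal phrases
# above; the full production rule set is larger than this sketch.
RULES = [
    (re.compile(r"\b(schedule with|find time for)\b", re.I),
     ("multi_person_scheduling", "high")),
    (re.compile(r"\b(should i|which is better)\b", re.I),
     ("tradeoff_analysis", "high")),
    (re.compile(r"\b(what's on my|show me)\b", re.I),
     ("get_schedule", "low")),
    (re.compile(r"\b(cancel my|move my)\b", re.I),
     ("cancel_event", "medium")),  # "cancel_event" type name is an assumption
]

def classify(message: str) -> tuple[str, str]:
    """Return (request_type, complexity) with zero LLM calls.

    Unmatched messages escalate to the capable tier, so ambiguity
    costs a few extra cents rather than degrading quality.
    """
    for pattern, result in RULES:
        if pattern.search(message):
            return result
    return ("unknown", "high")
```

Because the fallback is "high", a misclassified message over-spends instead of under-reasoning, which matches the failure mode described above.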
Step 2: Model mapping
Each complexity level maps to a model tier:
| Complexity | Anthropic | Google |
|-----------|-----------|--------|
| Low | claude-haiku-4-5-20251001 | gemini-2.5-flash |
| Medium | claude-haiku-4-5-20251001 | gemini-2.5-flash |
| High | claude-sonnet-4-20250514 | gemini-2.5-pro |
Low and medium both route to the fast model. The distinction exists for future flexibility (we may route medium to a mid-tier model eventually) but today the binary split is: simple stuff goes fast, complex stuff goes capable.
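In code, the mapping from the table above is just a nested dict with a capable-tier fallback; the model IDs come straight from the table, while the helper name is an assumption.

```python
# Complexity tier → model ID, per provider (model IDs from the table above).
MODEL_MAP = {
    "anthropic": {
        "low": "claude-haiku-4-5-20251001",
        "medium": "claude-haiku-4-5-20251001",
        "high": "claude-sonnet-4-20250514",
    },
    "google": {
        "low": "gemini-2.5-flash",
        "medium": "gemini-2.5-flash",
        "high": "gemini-2.5-pro",
    },
}

def pick_model(provider: str, complexity: str) -> str:
    # Unknown complexity levels fall back to the capable tier.
    return MODEL_MAP[provider].get(complexity, MODEL_MAP[provider]["high"])
```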
Step 3: Provider abstraction
The agent orchestrator doesn't know which provider or model it's talking to. It calls a unified interface that handles:
- Message formatting (Anthropic's `messages` format vs. Google's `generateContent` format)
- Tool/function call extraction (different response structures)
- Streaming (if enabled)
- Error handling and retries
The provider is selected via a single environment variable (`LLM_PROVIDER=google` or `LLM_PROVIDER=anthropic`). This makes it trivial to switch providers or A/B test.
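Reading that variable can be as simple as the sketch below; the default value and the allowed set are assumptions.

```python
import os

def select_provider() -> str:
    """Pick the LLM provider from the environment.

    Defaulting to "anthropic" when unset is an assumption of this
    sketch; failing loudly on typos avoids silent misconfiguration.
    """
    provider = os.environ.get("LLM_PROVIDER", "anthropic").lower()
    if provider not in ("anthropic", "google"):
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider}")
    return provider
```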
Real-world numbers
After running this in production, here's how requests distribute:
| Tier | % of requests | Avg latency | Cost per request |
|------|--------------|-------------|-----------------|
| Fast (Haiku/Flash) | ~82% | 400-800ms | $0.0001-0.0003 |
| Capable (Sonnet/Pro) | ~18% | 2-5s | $0.003-0.015 |
Blended cost per request: ~$0.001
Compare this to routing everything through Sonnet/Pro: ~$0.01 per request. That's a 10x cost reduction with no measurable quality degradation on simple tasks.
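The blended figure is easy to sanity-check from the traffic split above; the per-request costs below are illustrative midpoints of the ranges in the table, not exact production numbers.

```python
# Blended cost check using the ~82/18 production mix and rough
# midpoints of the per-request cost ranges from the table above.
fast_share, capable_share = 0.82, 0.18
fast_cost, capable_cost = 0.0002, 0.006  # illustrative midpoints, USD

blended = fast_share * fast_cost + capable_share * capable_cost
# 0.82 * 0.0002 + 0.18 * 0.006 = 0.000164 + 0.00108, roughly $0.001/request
```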
The latency improvement matters even more than cost. An 800ms response for "what's on my calendar?" feels instant. A 4-second response feels sluggish. Users don't consciously notice the model difference, but they notice the speed.
What the fast model handles well
We were surprised by how capable the fast models are for calendar operations. These tasks consistently produce correct results with Haiku/Flash:
- Event CRUD: Creating, updating, deleting events with extracted parameters (time, duration, title, attendees). The structured nature of calendar data plays to the strengths of smaller models.
- Availability queries: "Am I free Thursday at 2?" requires a tool call and a yes/no answer. No reasoning needed.
- Preference updates: "Set my focus time to mornings" maps directly to a preference key-value update.
- Schedule summaries: Formatting a list of events into a readable summary.
- Single-person scheduling: "Book a dentist appointment Thursday at 3pm" is one tool call.
The pattern: if the task is primarily intent extraction + tool execution + response formatting, the fast model is sufficient.
What requires the capable model
These tasks consistently benefit from (or require) Sonnet/Pro:
- Multi-constraint scheduling: Balancing availability across 3+ people with constraints (OOO, focus time, timezone differences). The model needs to reason about trade-offs, not just execute.
- Tradeoff analysis: "Should I reschedule my 1:1 to protect my focus block?" requires weighing the importance of the meeting against the focus time, considering how many focus hours the user has already hit this week, and making a recommendation.
- Negotiation strategy: In multi-round scheduling negotiations, the model needs to understand which proposals are likely to get accepted and adjust.
- Explaining decisions: "Why did you schedule it then?" requires the model to reconstruct its reasoning chain and present it clearly.
- Context-heavy operations: When the user references multiple prior messages or expects the agent to synthesize information across a long conversation.
The pattern: if the task requires weighing trade-offs, synthesizing context, or generating a nuanced recommendation, use the capable model.
Implementation considerations
Don't over-classify
Our first version had 12 request types mapped to 4 complexity levels. It was over-engineered. The binary split (fast vs. capable) captures 95% of the value. Adding a mid-tier model and finer-grained classification might squeeze out another 10-15% cost reduction, but the complexity cost isn't worth it yet.
Default to fast, escalate on failure
If the fast model produces a poor response (detected via response validation or user feedback), you can retry with the capable model. This happens rarely (<2% of fast-model requests) but provides a safety net.
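The escalation path fits in a few lines. Here `call_model` and `validate` are hypothetical callables standing in for the real model client and response validator:

```python
def answer(message: str, call_model, validate) -> str:
    """Default to the fast tier; retry once on the capable tier.

    call_model(tier, message) and validate(response) are assumed
    interfaces; in practice validation might check tool-call shape,
    required fields, or user feedback signals.
    """
    response = call_model("fast", message)
    if validate(response):
        return response
    # Rare (<2% of fast-tier requests), but a cheap safety net.
    return call_model("capable", message)
```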
Keep tool definitions identical
Both models receive the same tool definitions. This ensures consistent behavior regardless of which model handles the request. The tools themselves are model-agnostic - they're just function calls against the database and calendar APIs.
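One way to keep definitions identical is to store a single neutral tool spec and translate it per provider at call time. The tool below is a hypothetical example; the Anthropic translation uses the real `input_schema` field from its tool-use API.

```python
# One model-agnostic tool definition, shared by both tiers.
# The tool itself is a hypothetical example.
GET_SCHEDULE_TOOL = {
    "name": "get_schedule",
    "description": "Fetch the user's events for a date range.",
    "parameters": {
        "type": "object",
        "properties": {
            "start": {"type": "string", "description": "ISO date"},
            "end": {"type": "string", "description": "ISO date"},
        },
        "required": ["start"],
    },
}

def to_anthropic(tool: dict) -> dict:
    # Anthropic's tool format nests the JSON Schema under "input_schema".
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["parameters"],
    }
```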
Monitor classification accuracy
Log the classified request type alongside the actual model used and the user's response (did they follow up with a correction?). This gives you a feedback loop to tune the classifier over time.
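A minimal version of that log line, assuming these field names (the schema is an illustration, not the production format):

```python
import json
import time

def log_routing(request_type: str, complexity: str, model: str,
                user_corrected: bool) -> str:
    """One structured log line per request: the minimum fields
    needed to tune the classifier later. Field names are assumptions."""
    return json.dumps({
        "ts": time.time(),
        "request_type": request_type,
        "complexity": complexity,
        "model": model,
        "user_corrected": user_corrected,  # did the user follow up with a fix?
    })
```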
Provider-specific quirks
Anthropic and Google have different tool-use conventions:
- Anthropic returns `tool_use` content blocks with `id`, `name`, `input` fields
- Google returns `functionCall` parts with `name`, `args` fields
- Response termination signals differ (Anthropic's `stop_reason` vs. Google's `finishReason`)
Abstract these differences behind a clean interface early. Don't let provider-specific types leak into your business logic.
When to add a third tier
We haven't needed it yet, but the natural evolution is:
- Tier 0 (no LLM): Deterministic operations that can be handled by pattern matching alone. "Show me today's meetings" doesn't need a model at all - it's a regex match to a database query. This eliminates LLM cost and latency entirely for the simplest operations.
- Tier 1 (fast model): Current simple/medium operations.
- Tier 2 (capable model): Current complex operations.
Tier 0 is the highest-leverage optimization remaining. For a calendar agent, ~30% of requests are simple enough to handle without any LLM. That would bring the blended cost per request below $0.0005.
The takeaway
Model routing isn't a premature optimization. It's a fundamental architectural decision that affects cost, latency, and user experience. The implementation is straightforward: classify the request, pick a model, abstract the provider.
Start with a binary split (fast vs. capable). Default to fast. Escalate when the task genuinely needs reasoning. Monitor and adjust.
Your users won't notice which model answered. They'll notice that it was fast.