smart-router

Expertise-aware model router with semantic domain scoring, context-overflow protection, and security redaction. Automatically selects the optimal AI model using weighted expertise scoring (Feb 2026 benchmarks). Supports Claude, GPT, Gemini, and Grok with automatic fallback chains, HITL gates, and cost optimization.

Install via ClawdBot CLI:

clawdbot install c0nSpIc0uS7uRk3r/smart-router

Intelligently route requests to the optimal AI model using tiered classification with automatic fallback handling and cost optimization.

The router operates transparently: users send messages normally and get responses from the best model for their task. No special commands are needed.
Optional visibility: Include [show routing] in any message to see the routing decision.
The router uses a three-tier decision process:
┌───────────────────────────────────────────────────────────────────┐
│                      TIER 1: INTENT DETECTION                     │
│             Classify the primary purpose of the request           │
├─────────────┬─────────────┬─────────────┬─────────────┬───────────┤
│ CODE        │ ANALYSIS    │ CREATIVE    │ REALTIME    │ GENERAL   │
│ write/debug │ research    │ writing     │ news/live   │ Q&A/chat  │
│ refactor    │ explain     │ stories     │ X/Twitter   │ translate │
│ review      │ compare     │ brainstorm  │ prices      │ summarize │
└──────┬──────┴──────┬──────┴──────┬──────┴──────┬──────┴─────┬─────┘
       │             │             │             │            │
       ▼             ▼             ▼             ▼            ▼
┌───────────────────────────────────────────────────────────────────┐
│                   TIER 2: COMPLEXITY ESTIMATION                   │
├─────────────────────┬─────────────────────┬───────────────────────┤
│ SIMPLE (Tier $)     │ MEDIUM (Tier $$)    │ COMPLEX (Tier $$$$)   │
│ • One-step task     │ • Multi-step task   │ • Deep reasoning      │
│ • Short response OK │ • Some nuance       │ • Extensive output    │
│ • Factual lookup    │ • Moderate context  │ • Critical task       │
│ → Haiku/Flash       │ → Sonnet/Grok/GPT   │ → Opus/GPT-5          │
└─────────────────────┴─────────────────────┴───────────────────────┘
                                 │
                                 ▼
┌───────────────────────────────────────────────────────────────────┐
│                  TIER 3: SPECIAL CASE OVERRIDES                   │
├──────────────────────────────────────┬────────────────────────────┤
│ CONDITION                            │ OVERRIDE TO                │
├──────────────────────────────────────┼────────────────────────────┤
│ Context >100K tokens                 │ → Gemini Pro (1M ctx)      │
│ Context >500K tokens                 │ → Gemini Pro ONLY          │
│ Needs real-time data                 │ → Grok (regardless)        │
│ Image/vision input                   │ → Opus or Gemini Pro       │
│ User explicit override               │ → Requested model          │
└──────────────────────────────────────┴────────────────────────────┘
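The intent and complexity classifiers behind Tiers 1 and 2 are not specified in this skill. A minimal sketch of what they might look like, assuming simple keyword and length heuristics (all keyword lists and thresholds below are illustrative assumptions, not the router's actual implementation):

```python
# Hypothetical sketch of Tier 1/Tier 2 classification.
# Keyword lists and thresholds are illustrative assumptions.
INTENT_KEYWORDS = {
    "CODE": ["write code", "debug", "refactor", "review this function"],
    "ANALYSIS": ["research", "explain", "compare", "analyze"],
    "CREATIVE": ["story", "brainstorm", "write a poem", "draft copy"],
    "REALTIME": ["news", "latest", "price", "twitter", "right now"],
}

def classify_intent(request: str) -> str:
    """Return the first intent whose keywords appear; default GENERAL."""
    lowered = request.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return intent
    return "GENERAL"

def estimate_complexity(request: str) -> str:
    """Crude length-based proxy for Tier 2 complexity (assumption)."""
    words = len(request.split())
    if words < 20:
        return "SIMPLE"
    if words < 150:
        return "MEDIUM"
    return "COMPLEX"
```

A production classifier would weigh semantic signals rather than raw keywords, but the two-function shape is the same.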
When a request contains multiple clear intents (e.g., "Write code to analyze this data and explain it creatively"), the router classifies by the primary purpose of the request, as in Tier 1.
Non-English requests are handled normally, since all supported models have multilingual capabilities:
| Model | Non-English Support |
|-------|---------------------|
| Opus/Sonnet/Haiku | Excellent (100+ languages) |
| GPT-5 | Excellent (100+ languages) |
| Gemini Pro/Flash | Excellent (100+ languages) |
| Grok | Good (major languages) |
Intent detection still works for non-English requests. Edge case: if intent is unclear due to language, default to GENERAL intent with MEDIUM complexity.
| Intent | Simple | Medium | Complex |
|--------|--------|--------|---------|
| CODE | Sonnet | Opus | Opus |
| ANALYSIS | Flash | GPT-5 | Opus |
| CREATIVE | Sonnet | Opus | Opus |
| REALTIME | Grok | Grok | Grok-3 |
| GENERAL | Flash | Sonnet | Opus |
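The routing matrix above can be expressed as a lookup table. A sketch, using this document's model shorthand (the GENERAL/MEDIUM default for unknown combinations is an assumption):

```python
# The routing matrix above as a lookup table.
# Model names are the shorthand used in this document.
ROUTING_MATRIX = {
    ("CODE", "SIMPLE"): "sonnet",     ("CODE", "MEDIUM"): "opus",      ("CODE", "COMPLEX"): "opus",
    ("ANALYSIS", "SIMPLE"): "flash",  ("ANALYSIS", "MEDIUM"): "gpt-5", ("ANALYSIS", "COMPLEX"): "opus",
    ("CREATIVE", "SIMPLE"): "sonnet", ("CREATIVE", "MEDIUM"): "opus",  ("CREATIVE", "COMPLEX"): "opus",
    ("REALTIME", "SIMPLE"): "grok",   ("REALTIME", "MEDIUM"): "grok",  ("REALTIME", "COMPLEX"): "grok-3",
    ("GENERAL", "SIMPLE"): "flash",   ("GENERAL", "MEDIUM"): "sonnet", ("GENERAL", "COMPLEX"): "opus",
}

def lookup_model(intent: str, complexity: str) -> str:
    """Fall back to Sonnet (the GENERAL/MEDIUM cell) for unknown combos."""
    return ROUTING_MATRIX.get((intent, complexity), "sonnet")
```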
When a model becomes unavailable mid-session (token quota exhausted, rate limit hit, API error), the router automatically switches to the next best available model and notifies the user.
When a model switch occurs due to exhaustion, the user receives a notification:
┌───────────────────────────────────────────────────────────────────┐
│ ⚠️  MODEL SWITCH NOTICE                                           │
│                                                                   │
│ Your request could not be completed on claude-opus-4-5            │
│ (reason: token quota exhausted).                                  │
│                                                                   │
│ Request completed using: anthropic/claude-sonnet-4-5              │
│                                                                   │
│ The response below was generated by the fallback model.           │
└───────────────────────────────────────────────────────────────────┘
| Reason | Description |
|--------|-------------|
| token quota exhausted | Daily/monthly token limit reached |
| rate limit exceeded | Too many requests per minute |
| context window exceeded | Input too large for model |
| API timeout | Model took too long to respond |
| API error | Provider returned an error |
| model unavailable | Model temporarily offline |
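The fallback handler below catches provider-specific exceptions that are not defined anywhere in this document. A minimal set of assumed definitions, matching the names in its `except` clauses (the shared base class is an assumption):

```python
# Assumed exception hierarchy for the fallback handler below.
# Names match the except clauses; the hierarchy itself is an assumption.
class RouterError(Exception):
    """Base class for routing/provider failures."""

class TokenQuotaExhausted(RouterError): pass
class RateLimitExceeded(RouterError): pass
class ContextWindowExceeded(RouterError): pass
class APITimeout(RouterError): pass

class APIError(RouterError):
    """Provider returned an error; carries the provider's error code."""
    def __init__(self, code: str, message: str = ""):
        super().__init__(f"{code}: {message}")
        self.code = code
```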
```python
def execute_with_fallback(primary_model: str, fallback_chain: list[str], request: str) -> Response:
    """
    Execute request with automatic fallback and user notification.
    """
    attempted_models = []
    switch_reason = None

    # Try primary model first
    models_to_try = [primary_model] + fallback_chain

    for model in models_to_try:
        try:
            response = call_model(model, request)

            # If we switched models, prepend notification
            if attempted_models:
                notification = build_switch_notification(
                    failed_model=attempted_models[0],
                    reason=switch_reason,
                    success_model=model
                )
                return Response(
                    content=notification + "\n\n---\n\n" + response.content,
                    model_used=model,
                    switched=True
                )

            return Response(content=response.content, model_used=model, switched=False)

        except TokenQuotaExhausted:
            attempted_models.append(model)
            switch_reason = "token quota exhausted"
            log_fallback(model, switch_reason)
            continue
        except RateLimitExceeded:
            attempted_models.append(model)
            switch_reason = "rate limit exceeded"
            log_fallback(model, switch_reason)
            continue
        except ContextWindowExceeded:
            attempted_models.append(model)
            switch_reason = "context window exceeded"
            log_fallback(model, switch_reason)
            continue
        except APITimeout:
            attempted_models.append(model)
            switch_reason = "API timeout"
            log_fallback(model, switch_reason)
            continue
        except APIError as e:
            attempted_models.append(model)
            switch_reason = f"API error: {e.code}"
            log_fallback(model, switch_reason)
            continue

    # All models exhausted
    return build_exhaustion_error(attempted_models)
```
```python
def build_switch_notification(failed_model: str, reason: str, success_model: str) -> str:
    """Build user-facing notification when a model switch occurs."""
    return f"""⚠️ **MODEL SWITCH NOTICE**

Your request could not be completed on `{failed_model}` (reason: {reason}).

**Request completed using:** `{success_model}`

The response below was generated by the fallback model."""
```
```python
def build_exhaustion_error(attempted_models: list[str]) -> Response:
    """Build error when all models are exhausted."""
    models_tried = ", ".join(attempted_models)
    return Response(
        content=f"""❌ **REQUEST FAILED**

Unable to complete your request. All available models have been exhausted.

**Models attempted:** {models_tried}

**What you can do:**
1. **Wait** - Token quotas typically reset hourly or daily
2. **Simplify** - Try a shorter or simpler request
3. **Check status** - Run `/router status` to see model availability

If this persists, your human may need to check API quotas or add additional providers.""",
        model_used=None,
        switched=False,
        failed=True
    )
```
When a model is exhausted, the router selects the next best model for the same task type:
| Original Model | Fallback Priority (same capability) |
|----------------|-------------------------------------|
| Opus | Sonnet β GPT-5 β Grok-3 β Gemini Pro |
| Sonnet | GPT-5 β Grok-3 β Opus β Haiku |
| GPT-5 | Sonnet β Opus β Grok-3 β Gemini Pro |
| Gemini Pro | Flash β GPT-5 β Opus β Sonnet |
| Grok-2/3 | (warn: no real-time fallback available) |
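The fallback-priority table above maps directly to a lookup, restricted at runtime to models that are actually available. A sketch, using this document's model shorthand:

```python
# The fallback-priority table above as a lookup. Grok intentionally has
# no fallback, since no other model provides real-time data.
FALLBACK_CHAINS = {
    "opus":       ["sonnet", "gpt-5", "grok-3", "gemini-pro"],
    "sonnet":     ["gpt-5", "grok-3", "opus", "haiku"],
    "gpt-5":      ["sonnet", "opus", "grok-3", "gemini-pro"],
    "gemini-pro": ["flash", "gpt-5", "opus", "sonnet"],
    "grok-2":     [],  # warn: no real-time fallback available
    "grok-3":     [],  # warn: no real-time fallback available
}

def fallback_for(model: str, available: set[str]) -> list[str]:
    """Restrict the chain to models that are actually available."""
    return [m for m in FALLBACK_CHAINS.get(model, []) if m in available]
```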
After a model switch, the agent should note in the response which model actually produced the output and why the switch occurred. This ensures transparency and sets appropriate expectations.
When using streaming responses, fallback handling requires special consideration:
```python
async def execute_with_streaming_fallback(primary_model: str, fallback_chain: list[str], request: str):
    """
    Handle streaming responses with mid-stream fallback.

    If a model fails DURING streaming (not before), the partial response is lost.
    Strategy: Don't start streaming until the first chunk is received successfully.
    """
    models_to_try = [primary_model] + fallback_chain

    for model in models_to_try:
        try:
            # Test with non-streaming ping first (optional, adds latency)
            # await test_model_availability(model)

            # Start streaming
            stream = await call_model_streaming(model, request)
            first_chunk = await stream.get_first_chunk(timeout=10_000)  # 10s timeout for first chunk

            # If we got here, the model is responding -- continue streaming
            yield first_chunk
            async for chunk in stream:
                yield chunk
            return  # Success

        except (FirstChunkTimeout, StreamError) as e:
            log_fallback(model, str(e))
            continue  # Try next model

    # All models failed
    yield build_exhaustion_error(models_to_try)
```
Key insight: Wait for the first chunk before committing to a model. If the first chunk times out, fall back before any partial response is shown to the user.
```python
RETRY_CONFIG = {
    "initial_timeout_ms": 30_000,        # 30s for first attempt
    "fallback_timeout_ms": 20_000,       # 20s for fallback attempts (faster fail)
    "max_retries_per_model": 1,          # Don't retry the same model
    "backoff_multiplier": 1.5,           # Not used (no same-model retry)
    "circuit_breaker_threshold": 3,      # Failures before skipping model entirely
    "circuit_breaker_reset_ms": 300_000  # 5 min before trying a failed model again
}
```
Circuit breaker: If a model fails 3 times in 5 minutes, skip it entirely for the next 5 minutes. This prevents repeatedly hitting a down service.
When the preferred model fails (rate limit, API down, error), cascade to the next option:
Opus → Sonnet → GPT-5 → Gemini Pro
Opus → GPT-5 → Gemini Pro → Sonnet
Opus → GPT-5 → Sonnet → Gemini Pro
Grok-2 → Grok-3 → (warn: no real-time fallback)
Flash → Haiku → Sonnet → GPT-5
┌───────────────────────────────────────────────────────────────────┐
│                    LONG CONTEXT FALLBACK CHAIN                    │
├────────────────────┬──────────────────────────────────────────────┤
│ TOKEN COUNT        │ FALLBACK CHAIN                               │
├────────────────────┼──────────────────────────────────────────────┤
│ 128K - 200K        │ Opus (200K) → Sonnet (200K) → Gemini Pro     │
│ 200K - 1M          │ Gemini Pro → Flash (1M) → ERROR_MESSAGE      │
│ > 1M               │ ERROR_MESSAGE (no model supports this)       │
└────────────────────┴──────────────────────────────────────────────┘
Implementation:

```python
def handle_long_context(token_count: int, available_models: dict) -> str | ErrorMessage:
    """Route long-context requests with graceful degradation."""
    # Tier 1: 128K - 200K tokens (Opus/Sonnet can handle)
    if token_count <= 200_000:
        for model in ["opus", "sonnet", "haiku", "gemini-pro", "flash"]:
            if model in available_models and get_context_limit(model) >= token_count:
                return model

    # Tier 2: 200K - 1M tokens (only Gemini)
    elif token_count <= 1_000_000:
        for model in ["gemini-pro", "flash"]:
            if model in available_models:
                return model

    # Tier 3: > 1M tokens (nothing available) -- fall through to error

    # No suitable model found -> return helpful error
    return build_context_error(token_count, available_models)
```
```python
def build_context_error(token_count: int, available_models: dict) -> ErrorMessage:
    """Build a helpful error message when no model can handle the input."""
    # Find the largest available context window
    max_available = max(
        (get_context_limit(m) for m in available_models),
        default=0
    )

    # Determine what's missing
    missing_models = []
    if "gemini-pro" not in available_models and "flash" not in available_models:
        missing_models.append("Gemini Pro/Flash (1M context)")
    if token_count <= 200_000 and "opus" not in available_models:
        missing_models.append("Opus (200K context)")

    # Format token count for readability
    if token_count >= 1_000_000:
        token_display = f"{token_count / 1_000_000:.1f}M"
    else:
        token_display = f"{token_count // 1000}K"

    gemini_note = "(currently unavailable)" if "gemini-pro" not in available_models else ""
    return ErrorMessage(
        title="Context Window Exceeded",
        message=f"""Your input is approximately **{token_display} tokens**, which exceeds the context window of all currently available models.

**Required:** Gemini Pro (1M context) {gemini_note}
**Your max available:** {max_available // 1000}K tokens

**Options:**
1. **Wait and retry** - Gemini may be temporarily down
2. **Reduce input size** - Remove unnecessary content to fit within {max_available // 1000}K tokens
3. **Split into chunks** - I can process your input sequentially in smaller pieces

Would you like me to help split this into manageable chunks?""",
        recoverable=True,
        suggested_action="split_chunks"
    )
```
Example Error Output:
⚠️ Context Window Exceeded

Your input is approximately **340K tokens**, which exceeds the context
window of all currently available models.

Required: Gemini Pro (1M context) (currently unavailable)
Your max available: 200K tokens

Options:
1. Wait and retry - Gemini may be temporarily down
2. Reduce input size - Remove unnecessary content to fit within 200K tokens
3. Split into chunks - I can process your input sequentially in smaller pieces

Would you like me to help split this into manageable chunks?
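Option 3 ("split into chunks") could be implemented with a simple token-budget splitter. A sketch, assuming a rough 4-characters-per-token heuristic (a real splitter would use the provider's tokenizer and respect paragraph boundaries):

```python
def split_into_chunks(text: str, max_tokens: int = 150_000,
                      chars_per_token: int = 4) -> list[str]:
    """Split oversized input into chunks that fit a model's context window.
    Uses a rough chars-per-token heuristic (assumption, not the skill's
    actual tokenizer)."""
    max_chars = max_tokens * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```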
The router auto-detects available providers at runtime:
1. Check configured auth profiles
2. Build available model list from authenticated providers
3. Construct routing table using ONLY available models
4. If preferred model unavailable, use best available alternative
Example: If only Anthropic and Google are configured, the routing table is built from Claude (Opus/Sonnet/Haiku) and Gemini (Pro/Flash) models only.
The router considers cost when complexity is LOW:
| Model | Cost Tier | Use When |
|-------|-----------|----------|
| Gemini Flash | $ | Simple tasks, high volume |
| Claude Haiku | $ | Simple tasks, quick responses |
| Claude Sonnet | $$ | Medium complexity |
| Grok 2 | $$ | Real-time needs only |
| GPT-5 | $$ | General fallback |
| Gemini Pro | $$$ | Long context needs |
| Claude Opus | $$$$ | Complex/critical tasks |
Rule: Never use Opus ($$$$) for tasks that Flash ($) can handle.
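The routing code later in this document references a `COST_TIERS` mapping and a `select_cheapest` helper; a sketch based on the table above (the model shorthands and the alphabetical tie-break are assumptions):

```python
# Cost tiers from the table above. select_cheapest picks the lowest tier,
# breaking ties alphabetically (tie-break rule is an assumption).
COST_TIERS = {
    "flash": "$", "haiku": "$",
    "sonnet": "$$", "grok-2": "$$", "gpt-5": "$$",
    "gemini-pro": "$$$", "opus": "$$$$",
}

def select_cheapest(models: set[str]) -> str:
    """Return the cheapest model in the pool."""
    return min(models, key=lambda m: (len(COST_TIERS.get(m, "$$$$$")), m))
```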
Add [show routing] to any message:
[show routing] What's the weather in NYC?
Output includes:
[Routed → xai/grok-2-latest | Reason: REALTIME intent detected | Fallback: none available]
Explicit overrides:
Ask: "router status" or "/router" to see:
When processing a request:
1. DETECT available models (check auth profiles)
2. CLASSIFY intent (code/analysis/creative/realtime/general)
3. ESTIMATE complexity (simple/medium/complex)
4. CHECK special cases (context size, vision, explicit override)
5. FILTER by cost tier based on complexity β BEFORE model selection
6. SELECT model from filtered pool using routing matrix
7. VERIFY model available, else use fallback chain (also cost-filtered)
8. EXECUTE request with selected model
9. IF failure, try next in fallback chain
10. LOG routing decision (for debugging)
```python
def route_with_fallback(request):
    """
    Main routing function with CORRECT execution order.
    Cost filtering MUST happen BEFORE routing table lookup.
    """
    # Step 1: Discover available models
    available_models = discover_providers()

    # Step 2: Classify intent
    intent = classify_intent(request)

    # Step 3: Estimate complexity
    complexity = estimate_complexity(request)

    # Step 4: Check special-case overrides (these bypass cost filtering)
    if user_override := get_user_model_override(request):
        # No cost filter for explicit override
        return execute_with_fallback(user_override, [], request)

    token_count = estimate_tokens(request)  # assumed helper; not defined in this document
    if token_count > 128_000:
        return handle_long_context(token_count, available_models)  # Special handling

    if needs_realtime(request):
        # Realtime bypasses cost
        return execute_with_fallback("grok-2", ["grok-3"], request)

    # ┌──────────────────────────────────────────────────────────────┐
    # │ STEP 5: FILTER BY COST TIER -- THIS MUST COME FIRST!         │
    # │                                                              │
    # │ Cost filtering happens BEFORE the routing table lookup,      │
    # │ NOT after. This ensures "what's 2+2?" never considers        │
    # │ Opus even momentarily.                                       │
    # └──────────────────────────────────────────────────────────────┘
    allowed_tiers = get_allowed_tiers(complexity)
    # SIMPLE  -> ["$"]
    # MEDIUM  -> ["$", "$$"]
    # COMPLEX -> ["$", "$$", "$$$", "$$$$"]
    cost_filtered_models = {
        model: meta for model, meta in available_models.items()
        if COST_TIERS.get(model) in allowed_tiers
    }

    # Step 6: NOW select from cost-filtered pool using routing preferences
    preferences = ROUTING_PREFERENCES.get((intent, complexity), [])
    for model in preferences:
        if model in cost_filtered_models:  # Only consider cost-appropriate models
            selected_model = model
            break
    else:
        # No preferred model in cost-filtered pool -> use cheapest available
        selected_model = select_cheapest(cost_filtered_models)

    # Step 7: Build cost-filtered fallback chain
    task_type = get_task_type(intent, complexity)
    full_chain = MASTER_FALLBACK_CHAINS.get(task_type, [])
    filtered_chain = [m for m in full_chain
                      if m in cost_filtered_models and m != selected_model]

    # Steps 8-10: Execute with fallback + logging
    return execute_with_fallback(selected_model, filtered_chain, request)
```
```python
def get_allowed_tiers(complexity: str) -> list[str]:
    """Return allowed cost tiers for a given complexity level."""
    return {
        "SIMPLE": ["$"],                        # Budget only -- no exceptions
        "MEDIUM": ["$", "$$"],                  # Budget + standard
        "COMPLEX": ["$", "$$", "$$$", "$$$$"],  # All tiers -- complex tasks deserve the best
    }.get(complexity, ["$", "$$"])


# Example flow for "what's 2+2?":
#
# 1. available_models = {opus, sonnet, haiku, flash, grok-2, ...}
# 2. intent = GENERAL
# 3. complexity = SIMPLE
# 4. (no special cases)
# 5. allowed_tiers = ["$"]  -> SIMPLE means $ only
#    cost_filtered_models = {haiku, flash}  -> Opus/Sonnet/Grok-2 EXCLUDED
# 6. preferences for (GENERAL, SIMPLE) = [flash, haiku, grok-2, sonnet]
#    first match in cost_filtered = flash ✓
# 7. fallback_chain = [haiku]  -> also cost-filtered
# 8. execute with flash
#
# Result: Opus is NEVER considered, not even momentarily.
```
┌───────────────────────────────────────────────────────────────────┐
│            COST OPTIMIZATION IMPLEMENTATION OPTIONS               │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│ APPROACH 1: Explicit filter_by_cost() (shown above)               │
│ ───────────────────────────────────────────────────               │
│ • Calls get_allowed_tiers(complexity) explicitly                  │
│ • Filters available_models BEFORE routing table lookup            │
│ • Most defensive: impossible to route to the wrong tier           │
│ • Recommended for security-critical deployments                   │
│                                                                   │
│ APPROACH 2: Preference ordering (implicit)                        │
│ ─────────────────────────────────────────                         │
│ • ROUTING_PREFERENCES lists cheapest capable models first         │
│ • For SIMPLE tasks: [flash, haiku, grok-2, sonnet]                │
│ • First available match wins, naturally picking the cheapest      │
│ • Simpler code, but relies on correct preference ordering         │
│                                                                   │
│ This implementation uses BOTH for defense-in-depth:               │
│ • Preference ordering provides the first line of cost awareness   │
│ • Explicit filter_by_cost() guarantees tier enforcement           │
│                                                                   │
│ Implementations that rely solely on preference ordering can add   │
│ the filter_by_cost() function from references/models.md when      │
│ explicit enforcement is needed.                                   │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
Use sessions_spawn for model routing:

```
sessions_spawn(
    task: "user's request",
    model: "selected/model-id",
    label: "task-type-query"
)
```
See references/security.md for full security guidance. See references/models.md for detailed capabilities and pricing.
Generated Mar 1, 2026
A development team uses the router to automatically select the best AI model for code generation, debugging, and refactoring tasks. For example, when a developer submits a complex bug fix request with large code context, the router routes to Claude Opus for its high SWE-bench performance, while simple syntax queries go to cost-effective models like Haiku.
A financial analyst requests real-time stock price analysis and trend predictions. The router detects REALTIME intent and routes to Grok for up-to-date data, ensuring timely insights. For deeper historical analysis with large datasets, it may switch to Gemini Pro to handle context overflow beyond 500K tokens.
A marketing agency uses the router to generate creative copy, stories, and brainstorming ideas. The router identifies CREATIVE intent and routes to models like GPT-5 for nuanced writing, while automatically falling back to Claude Sonnet for cost optimization on medium-complexity tasks.
A university researcher submits a query to analyze and compare scientific papers. The router detects ANALYSIS intent and routes to Claude Opus for deep reasoning, with security redaction to protect sensitive data. For multilingual research queries, it leverages models' excellent non-English support.
A customer service platform integrates the router to handle diverse queries, from simple Q&A to complex troubleshooting. The router classifies intents like GENERAL for translations and CODE for technical issues, using fallback chains to ensure reliable responses and HITL gates for low-confidence cases.
Offer the smart router as a cloud-based API service with tiered pricing based on usage volume and features like HITL gates or advanced analytics. Revenue is generated through monthly subscriptions, with enterprise plans for high-throughput clients needing security and custom routing rules.
License the router technology to other AI platforms or enterprises for embedding into their products. Revenue comes from upfront licensing fees and ongoing support contracts, targeting companies that want to enhance their AI offerings without building routing logic from scratch.
Provide the router via an API with pay-per-request pricing, appealing to developers and small businesses. Revenue is driven by API calls, with cost optimization features helping users reduce expenses by automatically selecting cheaper models for simple tasks.
💬 Integration Tip
Ensure API keys for supported models are set in environment variables, and test with the [show routing] flag to debug intent classification before full deployment.