llm-eval-router

Shadow-test local Ollama models against a cloud baseline with a multi-judge ensemble. Automatically promotes models when statistically proven equivalent, reducing inference costs.
Install via ClawdBot CLI: `clawdbot install nissan/llm-eval-router`

Set up a production-quality shadow evaluation pipeline that automatically promotes local Ollama models when they statistically prove they match cloud model quality, reducing inference costs with evidence, not hope.
Run every task through your best local model (the shadow) in parallel with your cloud baseline (the ground truth). A lightweight judge ensemble scores the local output. After 200+ runs, if the local model holds a 0.95 mean score, promote it to handle that task type in production. Demote it automatically if quality drops.
Pull at least one capable local model first: `ollama pull qwen2.5` or `ollama pull phi4`.

This skill makes outbound API calls to the cloud baseline and judge providers (Anthropic, OpenAI, and Google in the default setup), plus Langfuse if enabled.

What stays local: all accumulated score data in `data/scores/*.json`. Langfuse (optional) can be self-hosted or cloud; if self-hosted, all observability data stays on your network.
Every response is scored on:
| Dimension | Default weight | Analyze weight | What it measures |
|---|---|---|---|
| Structural | 25% | 10% | Format compliance, required keys present |
| Semantic | 25% | 40% | Meaning equivalence to ground truth |
| Factual | 20% | 25% | No hallucinated facts/numbers/entities |
| Completion | 15% | 18% | Task fully addressed |
| Tool use | 10% | 4% | Correct tool/format selection |
| Latency | 5% | 3% | Within acceptable bounds |
Important: Use per-task weight overrides. The default 25/25 split treats structural accuracy equally with semantic similarity, which works for extract/classify/format tasks (where exact format matters) but is wrong for open-ended analysis. difflib.SequenceMatcher on two prose analyses of the same question scores ~0.29 even when they're semantically identical. With structural weight at 25%, this alone caps analyze scores at ~0.59.
```python
# src/evaluator.py - per-task weight profiles
TASK_WEIGHT_OVERRIDES = {
    "analyze": {
        "structural_accuracy": 0.10,  # difflib is NOT meaningful for prose
        "semantic_similarity": 0.40,  # cosine over embeddings captures meaning
        "factual_drift": 0.25,
        "task_completion": 0.18,
        "tool_correctness": 0.04,
        "latency_score": 0.03,
    },
    "code_transform": {
        "structural_accuracy": 0.15,
        "semantic_similarity": 0.35,
        "factual_drift": 0.20,
        "task_completion": 0.20,
        "tool_correctness": 0.07,
        "latency_score": 0.03,
    },
}
```
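Resolving the effective weights for a task can be sketched as a plain dict overlay; the `weights_for` helper is hypothetical, and the dicts are redeclared here so the snippet is self-contained:

```python
# Sketch: overlay per-task overrides on the default weights from the table above.
DEFAULT_WEIGHTS = {
    "structural_accuracy": 0.25, "semantic_similarity": 0.25,
    "factual_drift": 0.20, "task_completion": 0.15,
    "tool_correctness": 0.10, "latency_score": 0.05,
}
TASK_WEIGHT_OVERRIDES = {
    "analyze": {"structural_accuracy": 0.10, "semantic_similarity": 0.40,
                "factual_drift": 0.25, "task_completion": 0.18,
                "tool_correctness": 0.04, "latency_score": 0.03},
}

def weights_for(task_type: str) -> dict[str, float]:
    # Task-specific overrides win; unlisted tasks fall back to the defaults.
    return {**DEFAULT_WEIGHTS, **TASK_WEIGHT_OVERRIDES.get(task_type, {})}
```

Keeping overrides as full profiles (rather than partial patches) makes it easy to assert each profile sums to 1.0 at startup.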
Also: For analyze tasks, constrain output structure via system_prompt so GT and
candidates produce comparably-formatted responses (Finding/Recommendation/Confidence/Reasoning).
This reduces Layer 2 drift and improves difflib scores even at reduced weight.
Layers 1 and 2 (the deterministic and heuristic validators) run on every response at zero cost. Judges only run when L1+L2 pass and the sampling rate triggers.

The control floor model sets the quality baseline: any model scoring below it should be flagged, not promoted.
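The Layer 2 surface-similarity component discussed above can be sketched with the stdlib; the function name is hypothetical, and the real skill likely combines several heuristics:

```python
import difflib

def layer2_drift(candidate: str, ground_truth: str) -> float:
    """Surface similarity between candidate and ground truth, 0.0-1.0.

    This is the difflib component whose weakness on prose motivates the
    per-task weight overrides above.
    """
    return difflib.SequenceMatcher(None, candidate, ground_truth).ratio()
```

Because `SequenceMatcher.ratio()` rewards character-level overlap, two correct but differently phrased analyses score low, which is exactly why analyze tasks down-weight this dimension.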
Create `config/task_types.yaml`:

```yaml
tasks:
  - id: summarize
    description: "Summarize a document in N sentences"
    require_json: false
    judge_dimensions: [semantic, factual, completion]
  - id: classify
    description: "Classify text into one of N categories"
    require_json: true  # response must be valid JSON
    judge_dimensions: [structural, semantic, completion]
  - id: extract
    description: "Extract structured data from unstructured text"
    require_json: true
    judge_dimensions: [structural, factual, completion]
  - id: format
    description: "Reformat content to match a template"
    require_json: false
    judge_dimensions: [structural, semantic, completion]
```
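Loading that config into typed objects might look like the sketch below; the `TaskType` dataclass and `load_task_types` helper are assumptions, and `raw` would come from `yaml.safe_load` on the file above:

```python
from dataclasses import dataclass, field

@dataclass
class TaskType:
    id: str
    description: str
    require_json: bool = False
    judge_dimensions: list[str] = field(default_factory=list)

def load_task_types(raw: dict) -> dict[str, TaskType]:
    # raw = yaml.safe_load(open("config/task_types.yaml")) in practice
    return {t["id"]: TaskType(**t) for t in raw["tasks"]}
```

Indexing by `id` gives the evaluator O(1) lookup of `require_json` and the judge dimensions per task.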
The router assigns each task to a model using a round-robin strategy during
burn-in (building n), then switches to confidence-weighted routing after promotion.
```python
# src/router.py - simplified version
from collections import defaultdict

class Router:
    def __init__(self, candidates: list[str], control_floor: str):
        self.candidates = candidates
        self.control_floor = control_floor
        self._rr_counters = defaultdict(int)

    def route(self, task_type: str, confidence_tracker: "ConfidenceTracker") -> str:
        """Return the best model for this task type."""
        promoted = confidence_tracker.get_promoted(task_type)
        if promoted:
            return promoted  # use the promoted model directly
        # Round-robin during burn-in for fair exposure
        idx = self._rr_counters[task_type] % len(self.candidates)
        self._rr_counters[task_type] += 1
        return self.candidates[idx]
```
For each task, run it through BOTH the local model (candidate) and the cloud baseline (ground truth). Never use the ground truth response in production; it's only for evaluation.
```python
import random
from statistics import median

JUDGE_SAMPLE_RATE = 0.15  # judge only 15% of runs to control API cost

async def evaluate_pair(prompt: str, local_response: str, gt_response: str,
                        task_type: str) -> float:
    # Layer 1: deterministic
    l1_score = validators.layer1(local_response, task_type)
    if l1_score == 0.0:
        return 0.0  # hard fail: safety or format violation

    # Layer 2: heuristic drift
    l2_score = validators.layer2(local_response, gt_response)

    # Sample judges (15%)
    if random.random() < JUDGE_SAMPLE_RATE:
        sonnet_score = await judge_sonnet(prompt, local_response, gt_response)
        mini_score = await judge_gpt4o_mini(prompt, local_response, gt_response)
        if abs(sonnet_score - mini_score) >= 0.20:
            # Judges disagree: bring in a third and take the median
            gemini_score = await judge_gemini(prompt, local_response, gt_response)
            final = median([sonnet_score, mini_score, gemini_score])
        else:
            final = (sonnet_score + mini_score) / 2
        return weighted_score(l1_score, l2_score, final)
    return weighted_score(l1_score, l2_score, judge_score=None)
```
Track scores per model/task pair on disk (so restarts don't lose data):
```python
# src/scoring/confidence.py - simplified
from dataclasses import dataclass, field

@dataclass
class ModelStats:
    model_id: str
    task_type: str
    scores: list[float] = field(default_factory=list)  # all scores (None excluded)
    promoted: bool = False
    demoted: bool = False

    @property
    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    @property
    def n(self) -> int:
        return len(self.scores)

    def should_promote(self) -> bool:
        return self.n >= 200 and self.mean >= 0.95 and not self.promoted

    def should_demote(self) -> bool:
        recent = self.scores[-50:]  # last 50
        if len(recent) < 50:
            return False  # not enough recent data to judge a drop
        pass_rate = sum(1 for s in recent if s >= 0.85) / len(recent)
        return pass_rate < 0.92 and not self.demoted
```
Run this on a cron (every 10-20 minutes via launchd/systemd):
```python
# run_accumulate.py
async def accumulate():
    task_type = pick_next_task()  # round-robin across task types
    prompt, gt_response = generate_task(task_type)  # call cloud baseline
    for candidate in router.get_candidates(task_type):
        local_response = await ollama_client.complete(candidate, prompt)
        score = await evaluate_pair(prompt, local_response, gt_response, task_type)
        confidence_tracker.record(candidate, task_type, score)
        if confidence_tracker.should_promote(candidate, task_type):
            router.promote(candidate, task_type)
            langfuse.log_promotion(candidate, task_type,
                                   confidence_tracker.stats(candidate, task_type))
```
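A plain crontab entry is the simplest scheduling option; the paths below are illustrative, and launchd/systemd timers work equally well:

```shell
# Run the accumulator every 15 minutes; adjust paths to your checkout
*/15 * * * * cd /path/to/llm-eval-router && /usr/bin/env python run_accumulate.py >> logs/accumulate.log 2>&1
```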
```yaml
# config/routing_policy.yaml
control_floor_model: phi4:latest  # never promote below this model's score
task_policies:
  policy_check_high_risk:
    never_local: true  # these tasks always use the cloud model
  summarize:
    min_score_for_routing: 0.85
    fallback_chain: [qwen2.5, llama3.1, phi4]
  classify:
    min_score_for_routing: 0.90  # higher bar for classification
    fallback_chain: [qwen2.5, granite4, llama3.1]
```
Expose a simple HTTP API (FastAPI):

- `POST /run`: route a task through the best available model
- `GET /health`: service status + promoted models + Ollama connectivity
- `GET /status`: full scoreboard (model × task × mean × n)
- `GET /report`: cost heatmap + efficiency analysis
What worked:

- A control floor model: a known-quality floor catches "everything else is also bad" errors. If the floor model beats a candidate, flag it; don't promote.
- Stripping reasoning output: chain-of-thought blocks must be stripped before evaluation. Otherwise Layer 2 drift detection flags the reasoning chain as hallucinated content.
- None, not 0.0, for unsampled runs: a run where no judge scored is not a failing run. Store None and exclude it from the mean; mixing None with 0.0 poisons the mean.
- require_json: false for plain-text tasks: classify and extract tasks that return formatted text (not JSON objects) will fail Layer 1 if you require JSON. Separate the "is the format correct" check from "is it valid JSON."
- Semantic similarity as the primary signal for open-ended tasks: structural accuracy (difflib) is wrong for prose analysis. This lifted the analyze mean from 0.44–0.59 to 0.70.
- A system_prompt that specifies an exact output format (Finding/Recommendation/Confidence/Reasoning): both the ground truth and candidates follow the same template, improving structural alignment and reducing the drift penalty. Without this, Layer 2 drift fires on differently phrased but correct analyses.
- Agent-facing tools (run_task, get_status, get_champions, get_promotion_timeline, get_cost_heatmap): lets an LLM agent query evaluation state without bespoke integration work.
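The reasoning-block stripping above can be done with one regex; the `<think>...</think>` tag convention is an assumption (it matches several popular reasoning models, but check your model's actual output):

```python
import re

# Assumes the common <think>...</think> convention; DOTALL lets the
# block span multiple lines.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_reasoning(response: str) -> str:
    """Remove chain-of-thought blocks before scoring, so Layer 2 drift
    doesn't flag the reasoning chain as hallucinated content."""
    return THINK_RE.sub("", response)
```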
What didn't work:

- Oversized local models: when the model doesn't fit comfortably in memory, the latency dimension alone tanks the composite score. The practical ceiling is ~9GB models on 24GB unified memory to avoid GPU memory swapping.
- Judging every run: it costs more in judge API fees than you save by routing locally. Sample at 15%.
- For embedding similarity, use a qdrant or numpy cosine store instead.
- Global default weights: not overriding them per task type led to all analyze evals silently failing for 112+ runs. Lesson: evaluate your evaluator's scores by task type early; if a whole task type caps at a suspicious ceiling (e.g. 0.59), the metric is wrong, not the models.
With a 20-minute accumulator cadence and 9 candidates × 7 task types:

Per accumulation cycle (one task, one model):

At 6 runs/hour × 24 hours: ~$0.70/day during burn-in.
After first promotions: drops to ~$0.10/day (90%+ of task volume local).
Generated Mar 1, 2026
A fintech startup uses the skill to evaluate local models for summarizing quarterly financial reports, comparing them against Claude as a baseline. After collecting 200+ runs, they promote a local model to handle routine summaries, reducing API costs by 80% while maintaining quality through continuous monitoring.
An e-commerce company employs the skill to test local models for classifying customer support tickets into categories like refunds or technical issues. They use the multi-judge ensemble to ensure accuracy, promoting a model after it proves equivalent to cloud models, cutting down on expensive API calls for high-volume ticket processing.
A legal tech firm uses the skill to evaluate local models for analyzing contract clauses, with ground truth from Claude. They apply per-task weight overrides for analyze tasks to prioritize semantic similarity, ensuring reliable promotion of models that match cloud quality for non-critical legal reviews, saving on API expenses.
A social media platform integrates the skill to test local models for filtering inappropriate content, using cloud models as a baseline. After statistical validation, they promote a local model to handle initial filtering, reducing latency and costs while maintaining safety through demotion triggers if quality drops.
A healthcare analytics company uses the skill to evaluate local models for extracting structured data from patient notes, with ground truth from Anthropic. They leverage the deterministic validators for every run and promote models after meeting the 0.95 mean score threshold, enabling cost-effective data processing while ensuring compliance through local inference.
Offer the skill as a cloud-based service with tiered pricing based on usage volume, providing automated model evaluation and routing for enterprises. Revenue comes from monthly subscriptions, with premium tiers including advanced analytics and custom task type configurations.
Sell licenses for on-premise deployment to organizations with strict data privacy requirements, such as government or healthcare. Revenue is generated through one-time license fees and annual support contracts, with optional add-ons for integration with existing AI infrastructure.
Provide consulting services to help companies implement and customize the skill for specific use cases, such as optimizing task weights or integrating with local Ollama models. Revenue comes from project-based fees and ongoing maintenance agreements, targeting businesses new to AI cost optimization.
Integration Tip
Ensure Ollama is running with capable models and set up per-task weight overrides in config files to align with specific evaluation needs, such as reducing structural weight for analyze tasks.