llm-eval-router

Shadow-test local Ollama models against a cloud baseline with a multi-judge ensemble. Automatically promotes models when statistically proven equivalent, reducing inference costs.
Install via ClawdBot CLI: `clawdbot install nissan/llm-eval-router`

Set up a production-quality shadow evaluation pipeline that automatically promotes local Ollama models when they statistically prove they match cloud model quality, reducing inference costs with evidence, not hope.
Run every task through your best local model (the shadow) in parallel with your cloud baseline (the ground truth). A lightweight judge ensemble scores the local output. After 200+ runs, if the local model holds a 0.95 mean score, promote it to handle that task type in production. Demote it automatically if quality drops.
Pull at least one capable local model first: `ollama pull qwen2.5` or `ollama pull phi4`.

This skill makes outbound API calls to the cloud baseline and judge providers (Anthropic, OpenAI, and Google in the default setup), plus Langfuse if enabled.

What stays local: all accumulated score data in `data/scores/*.json`. Langfuse (optional) can be self-hosted or cloud; if self-hosted, all observability data stays on your network.
Every response is scored on:
| Dimension | Default weight | Analyze weight | What it measures |
|---|---|---|---|
| Structural | 25% | 10% | Format compliance, required keys present |
| Semantic | 25% | 40% | Meaning equivalence to ground truth |
| Factual | 20% | 25% | No hallucinated facts/numbers/entities |
| Completion | 15% | 18% | Task fully addressed |
| Tool use | 10% | 4% | Correct tool/format selection |
| Latency | 5% | 3% | Within acceptable bounds |
Important: Use per-task weight overrides. The default 25/25 split treats structural accuracy equally with semantic similarity, which works for extract/classify/format tasks (where exact format matters) but is wrong for open-ended analysis. difflib.SequenceMatcher on two prose analyses of the same question scores ~0.29 even when they're semantically identical. With structural weight at 25%, this alone caps analyze scores at ~0.59.
```python
# src/evaluator.py - per-task weight profiles
TASK_WEIGHT_OVERRIDES = {
    "analyze": {
        "structural_accuracy": 0.10,  # difflib is NOT meaningful for prose
        "semantic_similarity": 0.40,  # cosine over embeddings captures meaning
        "factual_drift": 0.25,
        "task_completion": 0.18,
        "tool_correctness": 0.04,
        "latency_score": 0.03,
    },
    "code_transform": {
        "structural_accuracy": 0.15,
        "semantic_similarity": 0.35,
        "factual_drift": 0.20,
        "task_completion": 0.20,
        "tool_correctness": 0.07,
        "latency_score": 0.03,
    },
}
```
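Resolving the effective weights for a task can be sketched as a plain dict overlay; the `weights_for` helper is hypothetical, and the dicts are redeclared here so the snippet is self-contained:

```python
# Sketch: overlay per-task overrides on the default weights from the table above.
DEFAULT_WEIGHTS = {
    "structural_accuracy": 0.25, "semantic_similarity": 0.25,
    "factual_drift": 0.20, "task_completion": 0.15,
    "tool_correctness": 0.10, "latency_score": 0.05,
}
TASK_WEIGHT_OVERRIDES = {
    "analyze": {"structural_accuracy": 0.10, "semantic_similarity": 0.40,
                "factual_drift": 0.25, "task_completion": 0.18,
                "tool_correctness": 0.04, "latency_score": 0.03},
}

def weights_for(task_type: str) -> dict[str, float]:
    # Task-specific overrides win; unlisted tasks fall back to the defaults.
    return {**DEFAULT_WEIGHTS, **TASK_WEIGHT_OVERRIDES.get(task_type, {})}
```

Keeping overrides as full profiles (rather than partial patches) makes it easy to assert each profile sums to 1.0 at startup.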
Also: For analyze tasks, constrain output structure via system_prompt so GT and
candidates produce comparably-formatted responses (Finding/Recommendation/Confidence/Reasoning).
This reduces Layer 2 drift and improves difflib scores even at reduced weight.
Layers 1 and 2 (the deterministic and heuristic validators) run on every response at zero cost. Judges only run when L1+L2 pass and the sampling rate triggers.

The control floor model sets the quality baseline: any model scoring below it should be flagged, not promoted.
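The Layer 2 surface-similarity component discussed above can be sketched with the stdlib; the function name is hypothetical, and the real skill likely combines several heuristics:

```python
import difflib

def layer2_drift(candidate: str, ground_truth: str) -> float:
    """Surface similarity between candidate and ground truth, 0.0-1.0.

    This is the difflib component whose weakness on prose motivates the
    per-task weight overrides above.
    """
    return difflib.SequenceMatcher(None, candidate, ground_truth).ratio()
```

Because `SequenceMatcher.ratio()` rewards character-level overlap, two correct but differently phrased analyses score low, which is exactly why analyze tasks down-weight this dimension.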
Create `config/task_types.yaml`:

```yaml
tasks:
  - id: summarize
    description: "Summarize a document in N sentences"
    require_json: false
    judge_dimensions: [semantic, factual, completion]
  - id: classify
    description: "Classify text into one of N categories"
    require_json: true  # response must be valid JSON
    judge_dimensions: [structural, semantic, completion]
  - id: extract
    description: "Extract structured data from unstructured text"
    require_json: true
    judge_dimensions: [structural, factual, completion]
  - id: format
    description: "Reformat content to match a template"
    require_json: false
    judge_dimensions: [structural, semantic, completion]
```
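Loading that config into typed objects might look like the sketch below; the `TaskType` dataclass and `load_task_types` helper are assumptions, and `raw` would come from `yaml.safe_load` on the file above:

```python
from dataclasses import dataclass, field

@dataclass
class TaskType:
    id: str
    description: str
    require_json: bool = False
    judge_dimensions: list[str] = field(default_factory=list)

def load_task_types(raw: dict) -> dict[str, TaskType]:
    # raw = yaml.safe_load(open("config/task_types.yaml")) in practice
    return {t["id"]: TaskType(**t) for t in raw["tasks"]}
```

Indexing by `id` gives the evaluator O(1) lookup of `require_json` and the judge dimensions per task.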
The router assigns each task to a model using a round-robin strategy during
burn-in (building n), then switches to confidence-weighted routing after promotion.
```python
# src/router.py - simplified version
from collections import defaultdict

class Router:
    def __init__(self, candidates: list[str], control_floor: str):
        self.candidates = candidates
        self.control_floor = control_floor
        self._rr_counters = defaultdict(int)

    def route(self, task_type: str, confidence_tracker: "ConfidenceTracker") -> str:
        """Return the best model for this task type."""
        promoted = confidence_tracker.get_promoted(task_type)
        if promoted:
            return promoted  # use the promoted model directly
        # Round-robin during burn-in for fair exposure
        idx = self._rr_counters[task_type] % len(self.candidates)
        self._rr_counters[task_type] += 1
        return self.candidates[idx]
```
For each task, run it through BOTH the local model (candidate) and the cloud baseline (ground truth). Never use the ground truth response in production; it's only for evaluation.
```python
import random
from statistics import median

JUDGE_SAMPLE_RATE = 0.15  # judge only 15% of runs to control API cost

async def evaluate_pair(prompt: str, local_response: str, gt_response: str,
                        task_type: str) -> float:
    # Layer 1: deterministic
    l1_score = validators.layer1(local_response, task_type)
    if l1_score == 0.0:
        return 0.0  # hard fail: safety or format violation

    # Layer 2: heuristic drift
    l2_score = validators.layer2(local_response, gt_response)

    # Sample judges (15%)
    if random.random() < JUDGE_SAMPLE_RATE:
        sonnet_score = await judge_sonnet(prompt, local_response, gt_response)
        mini_score = await judge_gpt4o_mini(prompt, local_response, gt_response)
        if abs(sonnet_score - mini_score) >= 0.20:
            # Judges disagree: bring in a third and take the median
            gemini_score = await judge_gemini(prompt, local_response, gt_response)
            final = median([sonnet_score, mini_score, gemini_score])
        else:
            final = (sonnet_score + mini_score) / 2
        return weighted_score(l1_score, l2_score, final)
    return weighted_score(l1_score, l2_score, judge_score=None)
```
Track scores per model/task pair on disk (so restarts don't lose data):
```python
# src/scoring/confidence.py - simplified
from dataclasses import dataclass, field

@dataclass
class ModelStats:
    model_id: str
    task_type: str
    scores: list[float] = field(default_factory=list)  # all scores (None excluded)
    promoted: bool = False
    demoted: bool = False

    @property
    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    @property
    def n(self) -> int:
        return len(self.scores)

    def should_promote(self) -> bool:
        return self.n >= 200 and self.mean >= 0.95 and not self.promoted

    def should_demote(self) -> bool:
        recent = self.scores[-50:]  # last 50
        if len(recent) < 50:
            return False  # not enough recent data to judge a drop
        pass_rate = sum(1 for s in recent if s >= 0.85) / len(recent)
        return pass_rate < 0.92 and not self.demoted
```
Run this on a cron (every 10-20 minutes via launchd/systemd):
```python
# run_accumulate.py
async def accumulate():
    task_type = pick_next_task()  # round-robin across task types
    prompt, gt_response = generate_task(task_type)  # call cloud baseline
    for candidate in router.get_candidates(task_type):
        local_response = await ollama_client.complete(candidate, prompt)
        score = await evaluate_pair(prompt, local_response, gt_response, task_type)
        confidence_tracker.record(candidate, task_type, score)
        if confidence_tracker.should_promote(candidate, task_type):
            router.promote(candidate, task_type)
            langfuse.log_promotion(candidate, task_type,
                                   confidence_tracker.stats(candidate, task_type))
```
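A plain crontab entry is the simplest scheduling option; the paths below are illustrative, and launchd/systemd timers work equally well:

```shell
# Run the accumulator every 15 minutes; adjust paths to your checkout
*/15 * * * * cd /path/to/llm-eval-router && /usr/bin/env python run_accumulate.py >> logs/accumulate.log 2>&1
```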
```yaml
# config/routing_policy.yaml
control_floor_model: phi4:latest  # never promote below this model's score
task_policies:
  policy_check_high_risk:
    never_local: true  # these tasks always use the cloud model
  summarize:
    min_score_for_routing: 0.85
    fallback_chain: [qwen2.5, llama3.1, phi4]
  classify:
    min_score_for_routing: 0.90  # higher bar for classification
    fallback_chain: [qwen2.5, granite4, llama3.1]
```
Expose a simple HTTP API (FastAPI):

- `POST /run`: route a task through the best available model
- `GET /health`: service status + promoted models + Ollama connectivity
- `GET /status`: full scoreboard (model × task × mean × n)
- `GET /report`: cost heatmap + efficiency analysis
What worked:

- A control floor model: a known-quality floor catches "everything else is also bad" errors. If the floor model beats a candidate, flag it; don't promote.
- Stripping reasoning output: chain-of-thought blocks must be stripped before evaluation. Otherwise Layer 2 drift detection flags the reasoning chain as hallucinated content.
- None, not 0.0, for unsampled runs: a run where no judge scored is not a failing run. Store None and exclude it from the mean; mixing None with 0.0 poisons the mean.
- require_json: false for plain-text tasks: classify and extract tasks that return formatted text (not JSON objects) will fail Layer 1 if you require JSON. Separate the "is the format correct" check from "is it valid JSON."
- Semantic similarity as the primary signal for open-ended tasks: structural accuracy (difflib) is wrong for prose analysis. This lifted the analyze mean from 0.44–0.59 to 0.70.
- A system_prompt that specifies an exact output format (Finding/Recommendation/Confidence/Reasoning): both the ground truth and candidates follow the same template, improving structural alignment and reducing the drift penalty. Without this, Layer 2 drift fires on differently phrased but correct analyses.
- Agent-facing tools (run_task, get_status, get_champions, get_promotion_timeline, get_cost_heatmap): lets an LLM agent query evaluation state without bespoke integration work.
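The reasoning-block stripping above can be done with one regex; the `<think>...</think>` tag convention is an assumption (it matches several popular reasoning models, but check your model's actual output):

```python
import re

# Assumes the common <think>...</think> convention; DOTALL lets the
# block span multiple lines.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_reasoning(response: str) -> str:
    """Remove chain-of-thought blocks before scoring, so Layer 2 drift
    doesn't flag the reasoning chain as hallucinated content."""
    return THINK_RE.sub("", response)
```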
What didn't work:

- Oversized local models: when the model doesn't fit comfortably in memory, the latency dimension alone tanks the composite score. The practical ceiling is ~9GB models on 24GB unified memory to avoid GPU memory swapping.
- Judging every run: it costs more in judge API fees than you save by routing locally. Sample at 15%.
- For embedding similarity, use a qdrant or numpy cosine store instead.
- Global default weights: not overriding them per task type led to all analyze evals silently failing for 112+ runs. Lesson: evaluate your evaluator's scores by task type early; if a whole task type caps at a suspicious ceiling (e.g. 0.59), the metric is wrong, not the models.
With a 20-minute accumulator cadence and 9 candidates × 7 task types:

Per accumulation cycle (one task, one model):

At 6 runs/hour × 24 hours: ~$0.70/day during burn-in.
After first promotions: drops to ~$0.10/day (90%+ of task volume local).
Generated Mar 1, 2026
A fintech startup uses the skill to evaluate local models for summarizing quarterly financial reports, comparing them against Claude as a baseline. After collecting 200+ runs, they promote a local model to handle routine summaries, reducing API costs by 80% while maintaining quality through continuous monitoring.
An e-commerce company employs the skill to test local models for classifying customer support tickets into categories like refunds or technical issues. They use the multi-judge ensemble to ensure accuracy, promoting a model after it proves equivalent to cloud models, cutting down on expensive API calls for high-volume ticket processing.
A legal tech firm uses the skill to evaluate local models for analyzing contract clauses, with ground truth from Claude. They apply per-task weight overrides for analyze tasks to prioritize semantic similarity, ensuring reliable promotion of models that match cloud quality for non-critical legal reviews, saving on API expenses.
A social media platform integrates the skill to test local models for filtering inappropriate content, using cloud models as a baseline. After statistical validation, they promote a local model to handle initial filtering, reducing latency and costs while maintaining safety through demotion triggers if quality drops.
A healthcare analytics company uses the skill to evaluate local models for extracting structured data from patient notes, with ground truth from Anthropic. They leverage the deterministic validators for every run and promote models after meeting the 0.95 mean score threshold, enabling cost-effective data processing while ensuring compliance through local inference.
Offer the skill as a cloud-based service with tiered pricing based on usage volume, providing automated model evaluation and routing for enterprises. Revenue comes from monthly subscriptions, with premium tiers including advanced analytics and custom task type configurations.
Sell licenses for on-premise deployment to organizations with strict data privacy requirements, such as government or healthcare. Revenue is generated through one-time license fees and annual support contracts, with optional add-ons for integration with existing AI infrastructure.
Provide consulting services to help companies implement and customize the skill for specific use cases, such as optimizing task weights or integrating with local Ollama models. Revenue comes from project-based fees and ongoing maintenance agreements, targeting businesses new to AI cost optimization.
Integration Tip
Ensure Ollama is running with capable models and set up per-task weight overrides in config files to align with specific evaluation needs, such as reducing structural weight for analyze tasks.