# peer-review

Multi-model peer review layer using local LLMs via Ollama to catch errors in cloud model output. Fan out critiques to 2-3 local models, aggregate flags, and synthesize a consensus.

**Use when:** validating trade analyses, reviewing agent output quality, testing local model accuracy, or checking any high-stakes Claude output before publishing or acting on it.

**Don't use when:** simple fact-checking (just search the web), tasks that don't benefit from multi-model consensus, time-critical decisions where 60s of latency is unacceptable, or reviewing trivial or low-stakes content.

**Negative examples:**
- "Check if this date is correct" → No. Just web search it.
- "Review my grocery list" → No. Not worth multi-model inference.
- "I need this answer in 5 seconds" → No. Peer review adds 30-60s of latency.

**Edge cases:**
- Short text (<50 words) → Models may not find meaningful issues. Consider skipping.
- Highly technical domain → Local models may lack domain knowledge. Weight flags lower.
- Creative writing → Factual review doesn't apply well. Use only for logical consistency.
Install via ClawdBot CLI: `clawdbot install staybased/peer-review`

**Hypothesis:** Local LLMs can catch ≥30% of real errors in cloud output with a <50% false positive rate.
```
Cloud Model (Claude) produces analysis
             │
             ▼
┌────────────────────────┐
│   Peer Review Fan-Out  │
├────────────────────────┤
│ Drift (Mistral 7B)     │──► Critique A
│ Pip (TinyLlama 1.1B)   │──► Critique B
│ Lume (Llama 3.1 8B)    │──► Critique C
└────────────────────────┘
             │
             ▼
 Aggregator (consensus logic)
             │
             ▼
 Final: original + flagged issues
```
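The fan-out step can be sketched against Ollama's HTTP `/api/generate` endpoint. This is a minimal sketch, not the bundled scripts; the exact prompt wording and request options the real scripts use are assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
# Peer roster from the table below: bot name -> Ollama model tag.
PEERS = {"Drift": "mistral:7b", "Pip": "tinyllama:1.1b", "Lume": "llama3.1:8b"}

def build_request(model: str, text: str) -> dict:
    """Build a non-streaming Ollama /api/generate payload for one reviewer."""
    prompt = (
        "You are a skeptical reviewer. Analyze the following text for errors.\n"
        f"TEXT:\n---\n{text}\n---"
    )
    return {"model": model, "prompt": prompt, "stream": False}

def fan_out(text: str) -> dict:
    """Send the cloud output to every local peer and collect raw critiques."""
    critiques = {}
    for name, model in PEERS.items():
        req = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(build_request(model, text)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            critiques[name] = json.loads(resp.read())["response"]
    return critiques
```

Each peer runs independently, so the three calls could also be issued concurrently to cut wall-clock latency.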
| Bot | Model | Role | Strengths |
|-----|-------|------|-----------|
| Drift 🌊 | Mistral 7B | Methodical analyst | Structured reasoning, catches logical gaps |
| Pip 🐣 | TinyLlama 1.1B | Fast checker | Quick sanity checks, low latency |
| Lume 💡 | Llama 3.1 8B | Deep thinker | Nuanced analysis, catches subtle issues |
| Script | Purpose |
|--------|---------|
| scripts/peer-review.sh | Send single input to all models, collect critiques |
| scripts/peer-review-batch.sh | Run peer review across a corpus of samples |
| scripts/seed-test-corpus.sh | Generate seeded error corpus for testing |
```shell
# Single file review
bash scripts/peer-review.sh <input_file> [output_dir]

# Batch review
bash scripts/peer-review-batch.sh <corpus_dir> [results_dir]

# Generate test corpus
bash scripts/seed-test-corpus.sh [count] [output_dir]
```
Scripts live at workspace/scripts/ — not bundled in skill to avoid duplication.
```
You are a skeptical reviewer. Analyze the following text for errors.
For each issue found, output JSON:
{"category": "factual|logical|missing|overconfidence|hallucinated_source",
 "quote": "...", "issue": "...", "confidence": 0-100}
If no issues found, output: {"issues": []}

TEXT:
---
{cloud_output}
---
```
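Small local models rarely emit perfectly clean JSON, so the collector needs a tolerant parser. A sketch, assuming each critique is either `{"issues": []}`, a bare issue object, a JSON list, or JSON embedded in surrounding chatter:

```python
import json
import re

def parse_critique(raw: str) -> list[dict]:
    """Extract issue objects from a model's raw response.

    Accepts {"issues": [...]}, a bare issue object, a JSON list,
    or JSON buried in surrounding text; returns [] on failure.
    """
    candidates = re.findall(r"\{.*\}|\[.*\]", raw, flags=re.DOTALL)
    for blob in candidates:
        try:
            data = json.loads(blob)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and "issues" in data:
            return data["issues"]
        if isinstance(data, dict) and "category" in data:
            return [data]
        if isinstance(data, list):
            return [d for d in data if isinstance(d, dict)]
    return []
```

Returning `[]` on unparseable output treats a confused reviewer as "no flags", which biases toward fewer false positives.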
| Category | Description | Example |
|----------|-------------|---------|
| factual | Wrong numbers, dates, names | "Bitcoin launched in 2010" |
| logical | Non-sequiturs, unsupported conclusions | "X is rising, therefore Y will fall" |
| missing | Important context omitted | Ignoring a major counterargument |
| overconfidence | Certainty without justification | "This will definitely happen" on 55% event |
| hallucinated_source | Citing nonexistent sources | "According to a 2024 Reuters report..." |
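The aggregator's consensus logic might, for example, keep only issues that at least two peers flag in the same category against overlapping quotes. This is a hypothetical rule, not the skill's actual scoring; the real scripts may weight by model or confidence instead.

```python
def consensus(critiques: dict[str, list[dict]], min_votes: int = 2) -> list[dict]:
    """Keep issues confirmed by at least `min_votes` reviewers.

    Two flags agree when they share a category and one quote contains
    (or equals) the other. `critiques` maps bot name -> parsed issue list.
    """
    flags = [
        {**issue, "bot": bot}
        for bot, issues in critiques.items()
        for issue in issues
    ]
    kept = []
    for flag in flags:
        voters = {
            f["bot"]
            for f in flags
            if f["category"] == flag["category"]
            and (f["quote"] in flag["quote"] or flag["quote"] in f["quote"])
        }
        already = any(
            k["category"] == flag["category"]
            and (k["quote"] in flag["quote"] or flag["quote"] in k["quote"])
            for k in kept
        )
        if len(voters) >= min_votes and not already:
            kept.append(flag)
    return kept
```

With the three-bot roster, `min_votes=2` means a majority must agree before an issue reaches the final report.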
Final recommendation: `publish` | `revise` | `flag_for_human`

| Outcome | TPR | FPR | Decision |
|---------|-----|-----|----------|
| Strong pass | ≥50% | <30% | Ship as default layer |
| Pass | ≥30% | <50% | Ship as opt-in layer |
| Marginal | 20–30% | 50–70% | Iterate on prompts, retest |
| Fail | <20% | >70% | Abandon approach |
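Against the seeded corpus, TPR and FPR map onto the thresholds above roughly as follows. A sketch under the assumption that a seeded error counts as "caught" when any flagged quote overlaps it:

```python
def score_run(seeded_errors: list[str], flagged_quotes: list[str]) -> dict:
    """Score one corpus run against the decision table.

    TPR = seeded errors that were flagged / total seeded errors.
    FPR = flags matching no seeded error / total flags.
    """
    caught = sum(
        any(err in q or q in err for q in flagged_quotes) for err in seeded_errors
    )
    false_flags = sum(
        not any(err in q or q in err for err in seeded_errors)
        for q in flagged_quotes
    )
    tpr = caught / len(seeded_errors) if seeded_errors else 0.0
    fpr = false_flags / len(flagged_quotes) if flagged_quotes else 0.0
    if tpr >= 0.5 and fpr < 0.3:
        decision = "ship_default"      # Strong pass
    elif tpr >= 0.3 and fpr < 0.5:
        decision = "ship_opt_in"       # Pass
    elif tpr >= 0.2 and fpr <= 0.7:
        decision = "iterate"           # Marginal
    else:
        decision = "abandon"           # Fail
    return {"tpr": tpr, "fpr": fpr, "decision": decision}
```

The substring overlap is a crude matcher; a real scorer might compare character offsets from the seeding step instead.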
**Requirements:**
- Ollama models pulled: `mistral:7b`, `tinyllama:1.1b`, `llama3.1:8b`
- `jq` and `curl` installed
- Experiment results: `experiments/peer-review-results/`

When peer review passes validation:
- Expose it as a `POST /review` endpoint
- Log to `#reef-logs` with TPR tracking

Generated Mar 1, 2026
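A hypothetical shape for that `POST /review` endpoint, using only the standard library; the response fields and the placeholder pipeline call are assumptions, not a spec.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ReviewHandler(BaseHTTPRequestHandler):
    """Hypothetical POST /review endpoint wrapping the peer-review pipeline."""

    def do_POST(self):
        if self.path != "/review":
            self.send_error(404)
            return
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        text = json.loads(body).get("text", "")
        # Placeholder: a real service would fan out to the Ollama peers here
        # and run the aggregator before answering.
        result = {"text": text, "issues": [], "recommendation": "publish"}
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# To serve (blocks forever):
# HTTPServer(("127.0.0.1", 8080), ReviewHandler).serve_forever()
```

Given the 30-60s review latency, a production version would likely queue requests rather than answer synchronously.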
A hedge fund uses the peer review skill to validate AI-generated market analyses before making investment decisions. The multi-model consensus catches factual errors in economic data and logical gaps in trading strategies, reducing risk in high-stakes financial operations.
A law firm employs the skill to review AI-drafted contracts and legal briefs for logical inconsistencies and missing clauses. Local models flag overconfident assertions or hallucinated legal precedents, ensuring accuracy before client submission.
Universities integrate peer review to check AI-generated research summaries or paper drafts for factual inaccuracies and unsupported conclusions. It helps researchers avoid publishing errors in technical domains by weighting flags based on model confidence.
Medical institutions use the skill to review AI-assisted diagnostic reports for logical errors or missing context in patient data analysis. It acts as a safety layer before finalizing treatment plans, though technical domain limitations require careful flag interpretation.
Media companies apply peer review to validate AI-written articles or reports for factual mistakes and overconfidence before publication. The Discord workflow synthesizes critiques to recommend revisions, maintaining editorial standards in time-insensitive scenarios.
Offer the peer review skill as a cloud API endpoint (e.g., Reef API) with tiered subscriptions based on review volume and model selection. Revenue comes from monthly fees for businesses needing high-stakes output validation, with logging and TPR tracking as premium features.
Provide custom integration services for enterprises to embed peer review into their existing AI workflows, such as financial or legal systems. Revenue is generated through one-time setup fees and ongoing support contracts, leveraging the skill's scripts and consensus logic.
Release a free version with basic peer review capabilities for individual users or small teams, monetized through a premium tier offering batch processing, advanced error categories, and lower latency options. Revenue streams include upgrades and enterprise licenses.
💬 Integration Tip
Ensure Ollama is running locally with required models pulled, and use the provided scripts in workspace/ for initial testing before full API deployment.