# iterative-code-evolution

Systematically improve code through disciplined analysis, targeted mutations, verification, scoring, and logging to iteratively enhance quality and design.
Install via ClawdBot CLI:

```shell
clawdbot install aaronjmars/iterative-code-evolution
```

A structured methodology for improving code through disciplined reflect → mutate → verify → score cycles, adapted from the ALMA research framework for meta-learning code designs.
Every improvement cycle follows this sequence:
```
┌───────────────────────────────────────────────────────┐
│ 1. ANALYZE  → structured diagnosis of current code    │
│ 2. PLAN     → prioritized, concrete changes           │
│ 3. MUTATE   → implement the changes                   │
│ 4. VERIFY   → run it, check for errors                │
│ 5. SCORE    → measure improvement vs. baseline        │
│ 6. ARCHIVE  → log what was tried and what happened    │
│                                                       │
│         Loop back to 1 with new knowledge             │
└───────────────────────────────────────────────────────┘
```
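The loop above can be sketched as a driver function. The six phase implementations are project-specific, so this sketch takes them as callables; every name here is illustrative, not part of any API:

```python
def evolution_cycle(code, parent_score, analyze, plan, mutate, verify, score):
    """One pass through the six-phase loop. Returns the surviving code,
    its score, and the delta that ARCHIVE should record (None on revert)."""
    diagnosis = analyze(code)            # 1. ANALYZE: structured diagnosis
    changes = plan(diagnosis)[:3]        # 2. PLAN: cap at 3 changes per cycle
    candidate = mutate(code, changes)    # 3. MUTATE: implement the changes
    ok, outputs = verify(candidate)      # 4. VERIFY: run it, check for errors
    if not ok:
        return code, parent_score, None  # revert to the parent variant
    new_score = score(outputs)           # 5. SCORE: measure vs. parent
    return candidate, new_score, new_score - parent_score  # 6. ARCHIVE delta
```

Each returned tuple feeds the next cycle: the surviving code becomes the new parent, and the delta goes into the evolution log.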
Track all iterations in .evolution/log.json at the project root. This is the memory that makes each cycle smarter than the last.
```json
{
  "baseline": {
    "description": "Initial implementation before evolution began",
    "score": 0.0,
    "timestamp": "2025-01-15T10:00:00Z"
  },
  "variants": {
    "v001": {
      "parent": "baseline",
      "description": "Added input validation and error handling",
      "changes_made": [
        {
          "what": "Added type checks on all public methods",
          "why": "Runtime crashes from malformed input in 3/10 test cases",
          "priority": "High"
        }
      ],
      "score": 0.6,
      "delta": "+0.6 vs parent",
      "timestamp": "2025-01-15T10:30:00Z",
      "learned": "Input validation was the primary failure mode; most other logic was sound"
    },
    "v002": {
      "parent": "v001",
      "description": "Refactored parsing logic to handle edge cases",
      "changes_made": [
        {
          "what": "Rewrote parse_input() to use state machine instead of regex",
          "why": "Regex approach failed on nested structures (seen in test cases 7,8)",
          "priority": "High"
        }
      ],
      "score": 0.85,
      "delta": "+0.25 vs parent",
      "timestamp": "2025-01-15T11:00:00Z",
      "learned": "State machine approach generalizes better than regex for this grammar"
    }
  },
  "principles_learned": [
    "Input validation fixes give the biggest early gains",
    "Regex-based parsing breaks on recursive structures; prefer state machines",
    "Small targeted changes score better than large rewrites"
  ]
}
```
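A small helper along these lines (the function name and signature are illustrative, not part of the skill) can append a variant entry and compute its delta against the parent automatically:

```python
import datetime
import json
import pathlib


def log_variant(log_path, variant_id, parent, description, changes, score, learned):
    """Append a variant entry to .evolution/log.json and compute its delta."""
    path = pathlib.Path(log_path)
    log = json.loads(path.read_text())
    # Delta is always measured against the immediate parent, not the baseline.
    parent_score = (log["baseline"]["score"] if parent == "baseline"
                    else log["variants"][parent]["score"])
    log["variants"][variant_id] = {
        "parent": parent,
        "description": description,
        "changes_made": changes,
        "score": score,
        "delta": f"{score - parent_score:+.2f} vs parent",
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
                     .strftime("%Y-%m-%dT%H:%M:%SZ"),
        "learned": learned,
    }
    path.write_text(json.dumps(log, indent=2))
```

Keeping delta computation in one place avoids the easiest bookkeeping mistake: comparing every variant against the baseline instead of its parent.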
Before changing anything, perform a structured analysis of the current code and its outputs. This is the most important phase; it prevents wasted mutations.
Step 1 – Learn from past edits (skip on first iteration)
Review the evolution log. For each previous change, note what was modified, whether the score improved or regressed, and what the `learned` entry recorded. Carry forward entries in `principles_learned` and avoid re-attempting approaches logged as failures.
Step 2 – Component-level assessment
For each meaningful component (function, class, module, pipeline stage), label it:
| Label | Meaning |
|-------|---------|
| Working | Produces correct output, no issues observed |
| Fragile | Works on happy path but fails on edge cases or specific inputs |
| Broken | Produces wrong output or errors |
| Redundant | Duplicates logic found elsewhere, adds complexity without value |
| Missing | A needed component that doesn't exist yet |
For each label, write a one-line explanation of why, linked to specific test outputs or observed behavior.
Step 3 – Quality and coherence check
Look for cross-cutting issues, such as duplicated logic across components, inconsistent error handling, naming drift between modules, or mismatched interfaces between pipeline stages.
Step 4 – Produce prioritized suggestions
Based on Steps 1-3, produce concrete changes. Each suggestion must have:
- PRIORITY: High | Medium | Low
- WHAT: Precise description of the change (code-level, not vague)
- WHY: Link to a specific observation from Steps 1-3
- RISK: What could go wrong if this change is made incorrectly
Rule: Every suggestion must link to an observation. No "this might help" suggestions; only changes grounded in something you actually saw in the code or outputs.
Rule: Limit to 3 suggestions per cycle. More than 3 changes at once makes it impossible to attribute improvement or regression to specific changes.
Pick 1-3 suggestions from the analysis. Selection principles: take High-priority items first, prefer changes whose effects can be attributed independently, and favor the smallest change expected to produce the largest score delta.
Write the new code. Key discipline: tag each change with a comment naming the variant (e.g. `# evo-v003: switched to state machine per edge case failures`) so mutations stay traceable in the source.

Execute the modified code against the same inputs/tests used for scoring.
If it crashes (up to 3 retries), use the reflection-fix protocol: diagnose the error from the traceback, apply the smallest targeted fix, and re-run.
After 3 failed retries, revert to parent variant and log the failure:
```json
{
  "attempted": "Description of what was tried",
  "failure_mode": "The error that couldn't be resolved",
  "learned": "Why this approach doesn't work"
}
```
This failure data is valuable; it prevents re-attempting the same broken approach.
If it runs but produces wrong output:
Don't immediately retry. Go back to Phase 1 (ANALYZE) with the new outputs. The wrong output is diagnostic data.
Compare the new variant's performance against its parent (not just the baseline). Scoring depends on context:
| Context | Score Method |
|---------|-------------|
| Tests exist | Pass rate: tests_passed / total_tests |
| Performance optimization | Metric delta (latency, throughput, memory) |
| Code quality | Weighted checklist (correctness, edge cases, readability) |
| User feedback | Binary: better/worse/same per the user's judgment |
| LLM/prompt output quality | Sample outputs graded against criteria |
Always compute delta vs. parent. This is how you learn which changes help vs. hurt.
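For the test-based row of the table, the score and delta are simple to make explicit. A minimal sketch (function names are illustrative):

```python
def pass_rate(results):
    """results: list of booleans, one per test. Score = tests_passed / total_tests."""
    return sum(results) / len(results)


def delta_vs_parent(variant_score, parent_score):
    """Signed improvement over the immediate parent, not the baseline."""
    return variant_score - parent_score
```

The same two-step shape (absolute score, then delta against the parent) applies to the other scoring contexts; only the score function changes.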
Update .evolution/log.json:
- Fill the `learned` field: one sentence about what this cycle taught you.
- Promote any generalizable insight into `principles_learned`.
- Record failed approaches in `principles_learned` as a pitfall.

Keep branches in .evolution/variants/ with descriptive names. The evolution log tracks which is active.
If you have multiple variants, pick the next one to improve using:
`score(variant) = normalized_reward - 0.5 * log(1 + visit_count)`
Where:
- `normalized_reward` – the variant's score relative to baseline (0-1 range)
- `visit_count` – how many times this variant has been selected for iteration

This balances exploitation (iterating on the best variant) with exploration (trying variants that haven't been touched recently). It prevents getting stuck in local optima.
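The selection rule can be implemented directly. In this sketch, `variants` maps each variant name to a `(normalized_reward, visit_count)` pair; that data shape is an assumption for illustration:

```python
import math


def selection_score(normalized_reward, visit_count):
    # Exploitation term minus an exploration penalty that grows with visits.
    return normalized_reward - 0.5 * math.log(1 + visit_count)


def pick_next(variants):
    """Return the variant name with the highest selection score."""
    return max(variants, key=lambda name: selection_score(*variants[name]))
```

Note how the log penalty works: a strong variant that has been iterated on many times can score below a weaker variant that has never been touched, which is exactly the exploration behavior described above.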
When performing Phase 1, structure your thinking as:
## Evolution Cycle [N] – Analysis
### Lessons from Previous Cycles
- Cycle [N-1] changed [X], score went [up/down] by [amount]
- Principle: [what we learned]
- Pitfall: [what to avoid]
### Component Assessment
| Component | Status | Evidence |
|-----------|--------|----------|
| function_a() | Working | All test cases pass |
| function_b() | Fragile | Fails on empty input (test #4) |
| class_C | Broken | Returns None instead of dict |
### Cross-Cutting Issues
- [Issue 1 with specific evidence]
- [Issue 2 with specific evidence]
### Planned Changes (max 3)
1. **[High]** WHAT: ... | WHY: ... | RISK: ...
2. **[Medium]** WHAT: ... | WHY: ... | RISK: ...
Context: User asks to improve a web scraper that's failing on 40% of target pages.
Cycle 1 – Analysis: `parse_html()` is Broken (crashes on pages missing the expected tag), `fetch_page()` is Working, `extract_links()` is Fragile (misses relative URLs). Planned change: make `parse_html()` tolerate pages without that tag.

Cycle 1 – Mutate: Add cascading selector logic: try the primary selector, then fall back through progressively more general ones.
Cycle 1 – Verify: Runs without crashes.

Cycle 1 – Score: Pass rate 40% → 72%. Delta: +32%.

Cycle 1 – Archive: Learned: "Most failures were selector misses, not logic errors. Fallback chains are high-value."
Cycle 2 – Analysis: `parse_html()` now Working. `extract_links()` still Fragile – relative URLs not resolved. Planned change: use `urljoin` in `extract_links()`.

Cycle 2 – Mutate: Add base URL resolution.
Cycle 2 – Score: 72% → 88%. Delta: +16%.

Cycle 2 – Archive: Learned: "URL resolution was second-biggest failure mode. Always normalize URLs at extraction time."
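The scraper code itself isn't shown, but the Cycle 2 fix (resolve relative URLs with `urljoin` at extraction time) might look like this standard-library-only sketch:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href values, resolving relative URLs against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # evo-v002: normalize URLs at extraction time
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Absolute URLs pass through `urljoin` unchanged, so the fix is safe for pages that already use fully-qualified links.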
Over many cycles, the `principles_learned` list is the most valuable artifact; it encodes what works for this specific codebase.