forgeAutonomous quality engineering swarm that forges production-ready code through continuous behavioral verification, exhaustive E2E testing, and self-healing fix loops. Combines DDD+ADR+TDD methodology with BDD/Gherkin specifications, 7 quality gates, defect prediction, chaos testing, and cross-context dependency awareness. Architecture-agnostic ā works with monoliths, microservices, modular monoliths, and any bounded-context topology.
Install via ClawdBot CLI:
clawdbot install ikennaokpala/forgeQuality forged in, not bolted on.
Forge is a self-learning, autonomous quality engineering swarm that unifies three approaches into one:
| Pillar | Source | What It Does |
|--------|--------|--------------|
| Build | DDD+ADR+TDD methodology | Structured development with quality gates, defect prediction, confidence-tiered fixes |
| Verify | BDD/Gherkin behavioral specs | Continuous behavioral verification ā the PRODUCT works, not just the CODE |
| Heal | Autonomous E2E fix loop | Test ā Analyze ā Fix ā Commit ā Learn ā Repeat |
"DONE DONE" means: the code compiles AND the product behaves as specified. Every Gherkin scenario passes. Every quality gate clears. Every dependency graph is satisfied.
Forge adapts to any project architecture. Before first run, it discovers your project structure:
| Architecture | How Forge Adapts |
|-------------|-----------------|
| Monolith | Single backend process, all contexts in one codebase. Forge runs all tests against one server. |
| Modular Monolith | Single deployment with bounded contexts as modules. Forge discovers modules and tests each context independently. |
| Microservices | Multiple services. Forge discovers service endpoints, tests each service, validates inter-service contracts. |
| Monorepo | Multiple apps/packages in one repo. Forge detects workspace structure (Turborepo, Nx, Lerna, Melos, Cargo workspace). |
| Mobile + Backend | Frontend app with backend API. Forge starts backend, then runs E2E tests against it. |
| Full-Stack Monolith | Frontend and backend in same deployment. Forge tests through the UI layer against real backend. |
On first invocation, Forge analyzes the project to build a context map:
# Forge automatically discovers:
# 1. Backend technology (Rust/Cargo, Node/npm, Python/pip, Go, Java/Maven/Gradle, .NET)
# 2. Frontend technology (Flutter, React, Next.js, Vue, Angular, SwiftUI, Kotlin/Compose)
# 3. Test framework (integration_test, Jest, Pytest, Go test, JUnit, xUnit)
# 4. Project structure (monorepo layout, service boundaries, module boundaries)
# 5. API protocol (REST, GraphQL, gRPC, WebSocket)
# 6. Build system (Make, npm scripts, Gradle tasks, Cargo features)
Forge stores the discovered project map:
{
"architecture": "mobile-backend",
"backend": {
"technology": "rust",
"buildCommand": "cargo build --release --features test-endpoints",
"runCommand": "cargo run --release --features test-endpoints",
"healthEndpoint": "/health",
"port": 8080,
"migrationCommand": "cargo sqlx migrate run"
},
"frontend": {
"technology": "flutter",
"testCommand": "flutter drive --driver=test_driver/integration_test.dart --target={target}",
"testDir": "integration_test/e2e/",
"specDir": "integration_test/e2e/specs/"
},
"contexts": ["identity", "rides", "payments", "..."],
"testDataSeeding": {
"method": "api",
"endpoint": "/api/v1/test/seed",
"authHeader": "X-Test-Key"
}
}
Projects can provide a forge.config.yaml at the repo root to override auto-discovery:
# forge.config.yaml (optional ā Forge auto-discovers if absent)
architecture: microservices
backend:
services:
- name: auth-service
port: 8081
healthEndpoint: /health
buildCommand: npm run build
runCommand: npm start
- name: payment-service
port: 8082
healthEndpoint: /health
buildCommand: npm run build
runCommand: npm start
frontend:
technology: react
testCommand: npx cypress run --spec {target}
testDir: cypress/e2e/
specDir: cypress/e2e/specs/
contexts:
- name: identity
testFile: auth.cy.ts
specFile: identity.feature
- name: payments
testFile: payments.cy.ts
specFile: payments.feature
dependencies:
identity:
blocks: [payments, orders]
payments:
depends_on: [identity]
blocks: [orders]
ABSOLUTE RULE: This skill NEVER uses mocking or stubbing of the backend API.
mockito, wiremock, MockClient, nock, msw, httpretty, etc.)Why No Mocking:
BEFORE ANY TESTING, the backend MUST be built, compiled, and running.
This is the FIRST thing the skill does ā no exceptions.
# 1. Read project config or auto-discover backend settings
# 2. Check if backend is already running
curl -s http://localhost:${BACKEND_PORT}/${HEALTH_ENDPOINT} || {
echo "Backend not running. Starting..."
# 3. Navigate to backend directory
cd ${BACKEND_DIR}
# 4. Ensure environment is configured
cp .env.example .env 2>/dev/null || true
# 5. Build the backend
${BUILD_COMMAND}
# 6. Run database migrations (if applicable)
${MIGRATION_COMMAND}
# 7. Start backend (background)
nohup ${RUN_COMMAND} > backend.log 2>&1 &
echo $! > backend.pid
# 8. Wait for backend to be healthy (up to 60 seconds)
for i in {1..60}; do
if curl -s http://localhost:${BACKEND_PORT}/${HEALTH_ENDPOINT} | grep -q "ok\|healthy\|UP"; then
echo "Backend healthy on port ${BACKEND_PORT}"
break
fi
sleep 1
done
}
# Verify critical endpoints are responding
curl -s http://localhost:${BACKEND_PORT}/${HEALTH_ENDPOINT} | jq .
# Verify test fixtures endpoint (for seeding)
curl -s -H "${TEST_AUTH_HEADER}" http://localhost:${BACKEND_PORT}/${TEST_STATUS_ENDPOINT} | jq .
# Verify API spec matches running API (if OpenAPI/Swagger available)
curl -s http://localhost:${BACKEND_PORT}/${OPENAPI_ENDPOINT} > /tmp/live-spec.json
# Store contract snapshot for regression detection
npx @claude-flow/cli@latest memory store \
--key "contract-snapshot-$(date +%s)" \
--value "$(cat /tmp/live-spec.json | head -c 5000)" \
--namespace forge-contracts
# Seed test data through REAL API ā adapt to your project's seeding endpoint
curl -X POST http://localhost:${BACKEND_PORT}/${SEED_ENDPOINT} \
-H "Content-Type: application/json" \
-H "${TEST_AUTH_HEADER}" \
-d '${SEED_PAYLOAD}'
Before testing, verify Gherkin specs and architecture decision records exist for the target bounded context.
Behavioral specifications define WHAT the product does from the user's perspective. Every test traces back to a Gherkin scenario. If tests pass but specs fail, the product is broken.
Gherkin specs are stored alongside tests:
${SPEC_DIR}/
āāā [context-a].feature
āāā [context-b].feature
āāā [context-c].feature
āāā ...
The exact location depends on your project's test structure. Forge auto-discovers this from the project map.
Each Gherkin Scenario maps to exactly one test function. The mapping is tracked:
Feature: [Context Name]
As a [user role]
I want to [action]
So that [outcome]
Scenario: [Descriptive scenario name]
Given [precondition]
When [action]
Then [expected result]
And [additional verification]
If specs are missing for a target context, the Specification Verifier agent creates them:
${SPEC_DIR}/[context].featureWhen Forge discovers a bounded context without an Architecture Decision Record, the Specification Verifier generates one. ADRs follow an agent-optimized format designed for machine consumption:
# ADR-NNN: [Context] Architecture Decision
## Status
Proposed | Accepted | Deprecated | Superseded by ADR-XXX
## MUST
- [Explicit required behaviors with contract references]
- [Link to OpenAPI spec: /api/v1/[context]/openapi.json]
- [Required integration patterns]
## MUST NOT
- [Explicit forbidden patterns]
- [Anti-patterns to avoid]
- [Coupling violations]
## Verification
- Command: [command to verify this decision holds]
- Expected: [expected output or exit code]
## Dependencies
- Depends on: [list of upstream contexts with ADR links]
- Blocks: [list of downstream contexts with ADR links]
ADR Storage:
docs/decisions/ or the project-configured ADR directoryBefore running tests, verify API response schemas match expected DTOs:
# For each endpoint the context uses:
# 1. Make a real API call
# 2. Compare response structure against expected DTO/schema
# 3. Flag any mismatches as contract violations
Contract violations are treated as Gate 7 failures and must be resolved before functional testing proceeds.
For bounded contexts that share dependencies, validate type consistency across context boundaries:
userId: string but context B sends userId: number{
"sharedTypeViolation": {
"type": "UserId",
"contextA": { "name": "payments", "file": "types/payment.ts", "definition": "string" },
"contextB": { "name": "orders", "file": "types/order.ts", "definition": "number" },
"severity": "error"
}
}
Verify cross-cutting concerns are consistent across all bounded contexts:
Authorization: Bearer ), same token validation approach across all endpointsCross-cutting violations are reported as warnings before Gate evaluation begins.
Bounded contexts have dependencies. When a fix touches context X, all contexts that depend on X must be re-tested.
# Context Dependency Map ā define in forge.config.yaml or auto-discover
# Example for a typical application:
#
# authentication:
# depends_on: []
# blocks: [orders, payments, profile, messaging]
#
# payments:
# depends_on: [authentication]
# blocks: [orders, subscriptions]
#
# orders:
# depends_on: [authentication, payments]
# blocks: [reviews, notifications]
When Bug Fixer modifies a file in context X:
blocks list for X# Initialize anti-drift swarm for Forge
npx @claude-flow/cli@latest swarm init --topology hierarchical --max-agents 10 --strategy specialized
# Load previous fix patterns from memory
npx @claude-flow/cli@latest memory search --query "forge fix patterns" --namespace forge-patterns
# Check current coverage and gate status
npx @claude-flow/cli@latest memory retrieve --key "forge-coverage-status" --namespace forge-state
# Load confidence tiers
npx @claude-flow/cli@latest memory search --query "confidence tier" --namespace forge-patterns
# Check defect predictions for target context
npx @claude-flow/cli@latest memory search --query "defect prediction" --namespace forge-predictions
Forge routes each agent to the appropriate model tier based on task complexity, optimizing for cost without sacrificing quality:
| Agent | Model | Rationale |
|-------|-------|-----------|
| Specification Verifier | sonnet | Reads code + generates Gherkin ā moderate reasoning |
| Test Runner | haiku | Structured execution, output parsing ā low reasoning |
| Failure Analyzer | sonnet | Root cause analysis ā moderate reasoning |
| Bug Fixer | opus | First-principles code fixes ā high reasoning |
| Quality Gate Enforcer | haiku | Threshold comparison ā low reasoning |
| Accessibility Auditor | sonnet | Code analysis + WCAG rules ā moderate reasoning |
| Auto-Committer | haiku | Git operations, message formatting ā low reasoning |
| Learning Optimizer | sonnet | Pattern analysis, prediction ā moderate reasoning |
Projects can override model assignments in forge.config.yaml:
# forge.config.yaml ā Model routing overrides (optional)
model_routing:
spec-verifier: sonnet
test-runner: haiku
failure-analyzer: sonnet
bug-fixer: opus
gate-enforcer: haiku
accessibility-auditor: sonnet
auto-committer: haiku
learning-optimizer: sonnet
When no override is specified, the defaults above are used. This routing reduces token cost by ~60% compared to running all agents on the highest-tier model.
Claude Code MUST spawn these 8 agents in a SINGLE message with run_in_background: true:
// Agent 1: Specification Verifier
Task({
model: "sonnet",
prompt: `You are the Specification Verifier agent. Your mission:
1. VERIFY backend is running: curl -sf http://localhost:${BACKEND_PORT}/${HEALTH_ENDPOINT}
2. Check if Gherkin specs exist for the target bounded context:
- Look in the project's spec directory
3. If specs are MISSING:
- Read the screen/component/route implementation files for the context
- Extract all user-visible features, interactions, states
- Generate Gherkin feature files with scenarios for every cyclomatic path
- Write specs to the correct location
4. If specs EXIST:
- Read current implementations
- Compare against existing scenarios
- Flag scenarios that no longer match implementation (stale specs)
- Generate new scenarios for uncovered features
5. Create spec-to-test mapping:
- Each Scenario name ā test function name
- Store mapping in memory for Test Runner
6. Store results:
npx @claude-flow/cli@latest memory store --key "specs-[context]-[timestamp]" \
--value "[spec status JSON]" --namespace forge-specs
CONSTRAINTS:
- NEVER generate specs for code you haven't read
- NEVER assume UI elements exist without checking implementation
- NEVER create scenarios that duplicate existing coverage
- NEVER modify existing test files ā only spec files
ACCEPTANCE:
- Every implementation file has at least one Gherkin scenario
- Spec-to-test mapping has zero unmapped entries
- All generated scenarios follow Given/When/Then format
- Results stored in forge-specs namespace
Output: List of all Gherkin scenarios with their mapped test functions, and any gaps found.`,
subagent_type: "researcher",
description: "Spec Verification",
run_in_background: true
})
// Agent 2: Test Runner
Task({
model: "haiku",
prompt: `You are the Test Runner agent. Your mission:
1. VERIFY backend is running
2. Check defect predictions from memory:
npx @claude-flow/cli@latest memory search --query "defect prediction [context]" --namespace forge-predictions
- Run predicted-to-fail tests FIRST for faster convergence
3. Run the E2E test suite for the specified context using the project's test command
4. Capture ALL test output including stack traces
5. Parse failures into structured format:
{testId, gherkinScenario, error, stackTrace, file, line, context}
6. Map each failure to its Gherkin scenario (from spec-to-test mapping)
7. Store results in memory for other agents:
npx @claude-flow/cli@latest memory store \
--key "test-run-[timestamp]" \
--value "[parsed results JSON]" \
--namespace forge-results
CONSTRAINTS:
- NEVER skip failing tests
- NEVER modify test code or source code
- NEVER mock API calls or stub responses
- NEVER continue if backend health check fails
ACCEPTANCE:
- All test results stored in memory with structured format
- Zero unparsed failures ā every failure has testId, error, stackTrace, file, line
- Predicted-to-fail tests executed first
- Results include Gherkin scenario mapping for every test`,
subagent_type: "tester",
description: "Test Runner",
run_in_background: true
})
// Agent 3: Failure Analyzer
Task({
model: "sonnet",
prompt: `You are the Failure Analyzer agent. Your mission:
1. Monitor memory for new test results from Test Runner
2. For each failure, analyze:
- Root cause category: element-not-found, assertion-failed, timeout,
api-mismatch, navigation-error, state-error, contract-violation
- Affected file and line number
- Which Gherkin scenario is violated
- Impact on dependent contexts (check dependency graph)
3. Search memory for matching fix patterns with confidence tiers:
npx @claude-flow/cli@latest memory search \
--query "[error pattern]" --namespace forge-patterns
4. If pattern found with confidence >= 0.85 (Gold+):
- Recommend auto-apply
- Include pattern key and success rate
5. If pattern found with confidence >= 0.75 (Silver):
- Suggest fix but flag for review
6. If no matching pattern:
- Perform root cause analysis from first principles
- Generate fix hypothesis
7. Store analysis in memory for Bug Fixer:
npx @claude-flow/cli@latest memory store \
--key "analysis-[testId]-[timestamp]" \
--value "[analysis JSON]" \
--namespace forge-results
CONSTRAINTS:
- NEVER assume root cause without stack trace evidence
- NEVER recommend fixes for passing tests
- NEVER skip dependency graph impact analysis
- NEVER override confidence tier thresholds
ACCEPTANCE:
- Every failure has a root cause category and affected file
- Zero unanalyzed failures
- Dependency impact documented for every failure
- Pattern search executed for every error type`,
subagent_type: "researcher",
description: "Failure Analyzer",
run_in_background: true
})
// Agent 4: Bug Fixer
Task({
model: "opus",
prompt: `You are the Bug Fixer agent. Your mission:
1. Retrieve failure analysis from memory
2. For each failure, apply fix using confidence-tiered approach:
PLATINUM (>= 0.95 confidence):
- Auto-apply the stored fix pattern immediately
- No review needed
GOLD (>= 0.85 confidence):
- Auto-apply the stored fix pattern
- Flag in commit message for awareness
SILVER (>= 0.75 confidence):
- Read the failing test file and source file
- Apply suggested fix with extra verification
- Run targeted test before proceeding
BRONZE or NO PATTERN:
- Read the failing test file
- Read the source file causing the failure
- Implement fix from first principles
- Use defensive patterns appropriate to the test framework
3. After fixing, identify affected context:
- Check dependency graph for cascade impacts
- Flag dependent contexts for re-testing
4. Store the fix pattern with initial confidence:
npx @claude-flow/cli@latest memory store \
--key "fix-[error-type]-[hash]" \
--value '{"pattern":"[fix]","confidence":0.75,"tier":"silver","applied":1,"successes":0}' \
--namespace forge-patterns
5. Signal Test Runner to re-run affected tests
6. Signal Quality Gate Enforcer to check all 7 gates
CONSTRAINTS:
- NEVER change test assertions to make tests pass
- NEVER modify Gherkin specs to match broken behavior
- NEVER introduce new dependencies without flagging
- NEVER apply fixes without reading both test file and source file
ACCEPTANCE:
- Every applied fix has a targeted test re-run result
- Zero fixes without verification
- Fix pattern stored with initial confidence score
- Cascade impacts identified and flagged for re-testing`,
subagent_type: "coder",
description: "Bug Fixer",
run_in_background: true
})
// Agent 5: Quality Gate Enforcer
Task({
model: "haiku",
prompt: `You are the Quality Gate Enforcer agent. Your mission:
After each fix cycle, evaluate ALL 7 quality gates:
GATE 1 ā FUNCTIONAL (100% required):
- All tests in the target context pass
- No regressions in previously passing tests
GATE 2 ā BEHAVIORAL (100% of targeted scenarios):
- Every Gherkin scenario that was targeted has a passing test
- Spec-to-test mapping is complete (no unmapped scenarios)
GATE 3 ā COVERAGE (>=85% overall, >=95% critical paths):
- Calculate path coverage for the context
- Critical paths: authentication, payment, core workflows
- Non-critical paths: preferences, history, settings
GATE 4 ā SECURITY (0 critical/high violations):
- No hardcoded API keys, tokens, or secrets in test files
- No hardcoded test credentials (use env vars or test fixtures)
- Secure storage patterns used (no plaintext sensitive data)
- No SQL injection vectors in dynamic queries
- No XSS vectors in rendered output
- No path traversal in file operations
- Dependencies have no known critical CVEs (when lockfile available)
- When AQE available: delegate to security-scanner for full SAST analysis
GATE 5 ā ACCESSIBILITY (WCAG AA):
- All interactive elements have accessible labels
- Touch/click targets meet minimum size requirements
- Color contrast meets WCAG AA ratios
- Screen reader navigation order is logical
GATE 6 ā RESILIENCE (tested for target context):
- Offline/disconnected state handled gracefully
- Timeout handling shows user-friendly message
- Error states show retry option
- Server errors show generic error, not stack trace
GATE 7 ā CONTRACT (0 mismatches):
- API responses match expected schemas
- No unexpected null fields
- Enum values match expected set
- Pagination format is consistent
For each gate:
- Status: PASS / FAIL / SKIP (with reason)
- Details: what passed, what failed
- Blocking: whether this gate blocks the commit
Store gate results:
npx @claude-flow/cli@latest memory store \
--key "gates-[context]-[timestamp]" \
--value "[gate results JSON]" \
--namespace forge-state
ONLY signal Auto-Committer when ALL 7 GATES PASS.
CONSTRAINTS:
- NEVER approve a commit with ANY blocking gate failure
- NEVER lower thresholds below defined minimums
- NEVER skip gate evaluation ā all 7 gates must be assessed
- NEVER mark a gate as PASS without evidence
ACCEPTANCE:
- Gate results stored in memory with PASS/FAIL/SKIP for all 7 gates
- Every FAIL includes specific details of what failed
- Every SKIP includes reason for skipping
- Auto-Committer only signaled when all blocking gates pass`,
subagent_type: "reviewer",
description: "Quality Gate Enforcer",
run_in_background: true
})
// Agent 6: Accessibility Auditor
Task({
model: "sonnet",
prompt: `You are the Accessibility Auditor agent. Your mission:
1. For each screen/page/component in the target context, audit:
LABELS:
- Every interactive element has an accessible label/aria-label/Semantics label
- Labels are descriptive (not "button1" but "Submit payment")
- Images have alt text or semantic labels
TOUCH/CLICK TARGETS:
- All interactive elements meet minimum size (48x48dp mobile, 44x44px web)
- Flag any undersized targets
CONTRAST:
- Text on colored backgrounds meets WCAG AA ratio (4.5:1 normal, 3:1 large)
- Flag low-contrast combinations
SCREEN READER:
- Accessibility tree has logical reading order
- No duplicate or misleading labels
- Form fields have associated labels
FOCUS/TAB ORDER:
- Focus order follows visual layout
- Focus trap in modals/dialogs
- Focus returns to trigger after dialog closes
2. Generate findings as:
{severity: "critical"|"warning"|"info", element, file, line, issue, fix}
3. Store audit results:
npx @claude-flow/cli@latest memory store \
--key "a11y-[context]-[timestamp]" \
--value "[audit JSON]" \
--namespace forge-state
CONSTRAINTS:
- NEVER skip interactive elements during audit
- NEVER report false positives for decorative images
- NEVER ignore focus/tab order analysis
- NEVER apply fixes ā only report findings for Bug Fixer
ACCEPTANCE:
- Every interactive element audited
- Findings stored with severity, element, file, line, issue, fix
- Zero unaudited interactive elements in target context
- WCAG AA compliance level assessed for every screen`,
subagent_type: "analyst",
description: "Accessibility Auditor",
run_in_background: true
})
// Agent 7: Auto-Committer
Task({
model: "haiku",
prompt: `You are the Auto-Committer agent. Your mission:
1. Monitor for successful fixes where ALL 7 QUALITY GATES PASS
2. For each successful fix:
- Stage only the fixed files (never git add -A)
- Create detailed commit message:
fix(forge): Fix [TEST_ID] - [brief description]
Behavioral Spec: [Gherkin scenario name]
Root Cause: [what caused the failure]
- [specific issue 1]
- [specific issue 2]
Fix Applied:
- [change 1]
- [change 2]
Quality Gates:
- Functional: PASS
- Behavioral: PASS
- Coverage: [X]%
- Security: PASS
- Accessibility: PASS
- Resilience: PASS
- Contract: PASS
Confidence Tier: [platinum|gold|silver|bronze]
Pattern Stored: fix-[error-type]-[hash]
- Commit with the message above
3. Update coverage report with new passing paths
4. Store commit hash in memory for rollback capability:
npx @claude-flow/cli@latest memory store \
--key "commit-[hash]" \
--value "[commit details JSON]" \
--namespace forge-commits
5. Store last known good commit:
npx @claude-flow/cli@latest memory store \
--key "last-green-commit" \
--value "[hash]" \
--namespace forge-state
CONSTRAINTS:
- NEVER use git add -A or git add .
- NEVER commit without all 7 gates passing
- NEVER amend previous commits
- NEVER push to remote ā only local commits
ACCEPTANCE:
- Commit message includes Behavioral Spec, Root Cause, Fix Applied, all 7 gate statuses
- Only fixed files are staged (no unrelated files)
- Commit hash stored in forge-commits namespace
- Last green commit updated in forge-state namespace`,
subagent_type: "reviewer",
description: "Auto-Committer",
run_in_background: true
})
// Agent 8: Learning Optimizer
Task({
model: "sonnet",
prompt: `You are the Learning Optimizer agent. Your mission:
1. After each test cycle, analyze patterns:
- Which error types fail most often?
- Which fix patterns have highest success rate?
- What new defensive patterns should be added?
- Which Gherkin scenarios are most fragile?
2. UPDATE CONFIDENCE TIERS:
For each fix pattern applied this cycle:
- If fix succeeded: confidence += 0.05 (cap at 1.0)
- If confidence crosses 0.95: promote to Platinum
- If confidence crosses 0.85: promote to Gold
- If fix failed: confidence -= 0.10 (floor at 0.0)
- If confidence drops below 0.70: demote to Bronze (learning-only)
Store updated pattern:
npx @claude-flow/cli@latest memory store \
--key "fix-[error-type]-[hash]" \
--value "[updated pattern JSON]" \
--namespace forge-patterns
3. DEFECT PREDICTION:
Analyze which contexts/files are likely to fail next:
- Files changed since last green run
- Historical failure rate per context
- Complexity of recent changes
Store prediction:
npx @claude-flow/cli@latest memory store \
--key "prediction-[date]" \
--value "[prediction JSON]" \
--namespace forge-predictions
4. Train neural patterns on successful fixes:
npx @claude-flow/cli@latest hooks post-task \
--task-id "forge-cycle" --success true --store-results true
5. Update coverage status:
npx @claude-flow/cli@latest memory store \
--key "forge-coverage-status" \
--value "[updated coverage JSON]" \
--namespace forge-state
6. Generate recommendations for test improvements
7. Export learning metrics:
npx @claude-flow/cli@latest neural train --pattern-type forge-fixes --epochs 5
CONSTRAINTS:
- NEVER promote a pattern that failed in the current cycle
- NEVER delete patterns ā only demote below Bronze threshold
- NEVER override confidence scores without evidence from test results
- NEVER generate predictions without historical data
ACCEPTANCE:
- All applied patterns have updated confidence scores
- Prediction stored for next run with context-level probabilities
- Coverage status updated in forge-state namespace
- Zero patterns promoted without success evidence`,
subagent_type: "researcher",
description: "Learning Optimizer",
run_in_background: true
})
7 gates evaluated after each fix cycle. ALL must pass before a commit is created.
| Gate | Check | Threshold | Blocking |
|------|-------|-----------|----------|
| 1. Functional | All tests pass | 100% pass rate | YES |
| 2. Behavioral | Gherkin scenarios satisfied | 100% of targeted scenarios | YES |
| 3. Coverage | Path coverage | >=85% overall, >=95% critical | YES (critical only) |
| 4. Security | No hardcoded secrets, secure storage, SAST checks | 0 critical/high violations | YES |
| 5. Accessibility | Accessible labels, target sizes, contrast | WCAG AA | Warning only |
| 6. Resilience | Offline handling, timeout handling, error states | Tested for target context | Warning only |
| 7. Contract | API response matches expected schema | 0 mismatches | YES |
When gates fail, failures are categorized for targeted re-runs:
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā FORGE AUTONOMOUS LOOP ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā āāāāāāāāāāāā āāāāāāāāāāāā āāāāāāāāāāāā āāāāāāāāāāāā ā
ā ā Specify āāāāā¶ā Test āāāāā¶ā Analyze āāāāā¶ā Fix ā ā
ā ā (Gherkin)ā ā (Run) ā ā (Root ā ā (Tiered) ā ā
ā āāāāāāāāāāāā āāāāāāāāāāāā ā Cause) ā āāāāāāāāāāāā ā
ā ā² āāāāāāāāāāāā ā ā
ā ā ā¼ ā
ā āāāāāāāāāāāā āāāāāāāāāāāā āāāāāāāāāāāā āāāāāāāāāāāā ā
ā ā Learn āāāāāā Commit āāāāāā Gate āāāāāā Audit ā ā
ā ā (Update ā ā (Auto) ā ā (7 Gates)ā ā (A11y) ā ā
ā ā Tiers) ā āāāāāāāāāāāā āāāāāāāāāāāā āāāāāāāāāāāā ā
ā āāāāāāāāāāāā ā
ā ā ā
ā āāāāāāāāāāāāāāāāāā REPEAT āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā ā
ā Loop continues until: ALL 7 GATES PASS or MAX_ITERATIONS (10) ā
ā Gate failures are categorized for targeted re-runs (not full re-run) ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Each agent emits structured progress events during execution for observability:
{"agent": "spec-verifier", "event": "spec_generated", "context": "payments", "scenarios": 12}
{"agent": "test-runner", "event": "test_started", "context": "payments", "test": "user_can_pay"}
{"agent": "test-runner", "event": "test_completed", "context": "payments", "passed": 10, "failed": 2}
{"agent": "failure-analyzer", "event": "root_cause_found", "test": "user_can_pay", "cause": "timeout"}
{"agent": "bug-fixer", "event": "fix_applied", "file": "payments.ts", "confidence": 0.92}
{"agent": "gate-enforcer", "event": "gate_evaluated", "gate": "functional", "status": "PASS"}
{"agent": "auto-committer", "event": "committed", "hash": "abc123", "tests_fixed": 2}
{"agent": "learning-optimizer", "event": "pattern_updated", "pattern": "fix-timeout-xyz", "tier": "gold"}
Progress File:
.forge/progress.jsonl (one JSON object per line)tail -f .forge/progress.jsonlIntegration with Agentic QE AG-UI:
Every fix pattern is tracked with a confidence score that evolves over time:
{
"key": "fix-element-not-found-abc123",
"pattern": {
"error": "Element not found / No element",
"fix": "Ensure element is rendered and visible before interaction",
"files_affected": ["*_test.*"],
"context": "any"
},
"tier": "gold",
"confidence": 0.92,
"auto_apply": true,
"applied_count": 47,
"success_count": 43,
"success_rate": 0.915,
"last_applied": "2026-02-06T14:30:00Z",
"last_failed": "2026-02-01T09:15:00Z"
}
| Tier | Confidence | Auto-Apply | Behavior |
|------|-----------|------------|----------|
| Platinum | >= 0.95 | Yes | Apply immediately without review |
| Gold | >= 0.85 | Yes | Apply and flag in commit message |
| Silver | >= 0.75 | No | Suggest to Bug Fixer, don't auto-apply |
| Bronze | >= 0.70 | No | Store for learning only, never auto-apply |
| Expired | < 0.70 | No | Pattern demoted, needs revalidation |
After each application:
Before running tests, the Learning Optimizer analyzes historical data to predict which tests are most likely to fail:
{
"date": "2026-02-07",
"predictions": [
{ "context": "payments", "probability": 0.73, "reason": "3 files changed in payment module" },
{ "context": "orders", "probability": 0.45, "reason": "depends on payments (changed)" },
{ "context": "identity", "probability": 0.12, "reason": "no changes, stable history" }
],
"recommended_order": ["payments", "orders", "identity"]
}
Tests are executed in descending probability order ā predicted-to-fail tests run FIRST for faster convergence.
For EVERY interactive element, test:
For each target context, inject controlled failures:
For UI-heavy projects, Forge captures and compares screenshots to detect unintended visual changes:
Screenshot Capture by Platform:
| Platform | Method |
|----------|--------|
| Web (Playwright) | page.screenshot({ fullPage: true }) |
| Web (Cypress) | cy.screenshot() |
| Flutter | await tester.binding.setSurfaceSize(size); await expectLater(find.byType(App), matchesGoldenFile('name.png')) |
| Mobile (native) | Platform-specific screenshot capture |
Configuration:
# forge.config.yaml ā Visual regression settings (optional)
visual_regression:
enabled: true
threshold: 0.001 # 0.1% pixel diff tolerance
screenshot_dir: .forge/screenshots
full_page: true
When Agentic QE is available, delegate to the visual-tester agent for parallel viewport comparison across multiple screen sizes.
# Full autonomous run ā all contexts, all gates
/forge --autonomous --all
# Single context autonomous
/forge --autonomous --context [context-name]
# Behavioral verification only (no fixes)
/forge --verify-only
/forge --verify-only --context [context-name]
# Fix-only mode (fix failures, don't generate new tests)
/forge --fix-only --context [context-name]
# Learn mode (analyze patterns, update confidence tiers)
/forge --learn
# Add coverage for new screens/pages/components
/forge --add-coverage --screens [name1],[name2]
# Generate Gherkin specs for a context
/forge --spec-gen --context [context-name]
/forge --spec-gen --all
# Run quality gates without test execution
/forge --gates-only
/forge --gates-only --context [context-name]
# Defect prediction only
/forge --predict
/forge --predict --context [context-name]
# Chaos/resilience testing for a context
/forge --chaos --context [context-name]
/forge --chaos --all
| Namespace | Purpose | Key Pattern |
|-----------|---------|-------------|
| forge-patterns | Fix patterns with confidence tiers | fix-[error-type]-[hash] |
| forge-results | Test run results | test-run-[timestamp] |
| forge-state | Coverage + gate status | forge-coverage-status, gates-[context]-[ts], last-green-commit |
| forge-commits | Commit history | commit-[hash] |
| forge-screens | Implemented screens/pages | screen-[name] |
| forge-specs | Gherkin specifications | specs-[context]-[timestamp] |
| forge-contracts | API contract snapshots | contract-snapshot-[timestamp] |
| forge-predictions | Defect prediction history | prediction-[date] |
Forge can optionally integrate with the Agentic QE framework via MCP for enhanced capabilities. All AQE features are additive ā Forge works identically without AQE.
On startup, Forge checks for AQE availability:
# Check if agentic-qe MCP server is registered
claude mcp list | grep -q "aqe" && echo "AQE available" || echo "AQE not available ā using defaults"
| Forge Component | Without AQE (Default) | With AQE |
|----------------|----------------------|----------|
| Pattern Storage | claude-flow memory (forge-patterns namespace) | ReasoningBank ā HNSW vector-indexed, 150x faster pattern search, experience replay |
| Defect Prediction | Historical failure rates + file changes | defect-intelligence domain ā root-cause-analyzer + defect-predictor agents |
| Security Scanning | Gate 4 static checks (secrets, injection vectors) | security-compliance domain ā full SAST/DAST via security-scanner agent |
| Accessibility Audit | Forge Accessibility Auditor agent | visual-accessibility domain ā visual-tester + accessibility-auditor agents |
| Contract Testing | Gate 7 schema validation | contract-testing domain ā contract-validator + graphql-tester agents |
| Progress Reporting | .forge/progress.jsonl file | AG-UI streaming protocol for real-time UI updates |
When AQE is NOT available, Forge falls back to its built-in behavior for every capability. No configuration is required ā the skill auto-detects and adapts.
# forge.config.yaml ā AQE integration settings (optional)
integrations:
agentic-qe:
enabled: true # auto-detected if not specified
domains:
- defect-intelligence
- security-compliance
- visual-accessibility
- contract-testing
reasoning_bank:
enabled: true # replaces claude-flow memory for forge-patterns namespace
ag_ui:
enabled: true # stream progress events to AG-UI protocol
When AQE is enabled, Forge delegates specific subtasks to specialized AQE agents:
| Forge Agent | AQE Domain | AQE Agents Used |
|-------------|-----------|-----------------|
| Specification Verifier | requirements-validation | bdd-generator, requirements-validator |
| Failure Analyzer | defect-intelligence | root-cause-analyzer, defect-predictor |
| Quality Gate Enforcer (Gate 4) | security-compliance | security-scanner, security-auditor |
| Accessibility Auditor | visual-accessibility | visual-tester, accessibility-auditor |
| Quality Gate Enforcer (Gate 7) | contract-testing | contract-validator, graphql-tester |
| Learning Optimizer | learning-optimization | learning-coordinator, pattern-learner |
Forge agents that have no AQE equivalent (Test Runner, Bug Fixer, Auto-Committer) continue to run as built-in agents regardless of AQE availability.
The Bug Fixer agent uses defensive patterns appropriate to the project's test framework. Examples:
Future<bool> safeTap(WidgetTester tester, Finder finder) async {
await tester.pumpAndSettle();
final elements = finder.evaluate();
if (elements.isNotEmpty) {
await tester.tap(finder.first, warnIfMissed: false);
await tester.pumpAndSettle();
return true;
}
debugPrint('Widget not found: ${finder.description}');
return false;
}
Future<bool> safeEnterText(WidgetTester tester, Finder finder, String text) async {
await tester.pumpAndSettle();
final elements = finder.evaluate();
if (elements.isNotEmpty) {
await tester.enterText(finder.first, text);
await tester.pumpAndSettle();
return true;
}
return false;
}
Future<void> visualDelay(WidgetTester tester, {String? label}) async {
if (label != null) debugPrint('Observing: $label');
await tester.pump(const Duration(milliseconds: 2500));
}
Future<bool> scrollUntilVisible(
WidgetTester tester,
Finder finder,
Finder scrollable,
) async {
for (int i = 0; i < 10; i++) {
await tester.pumpAndSettle();
if (finder.evaluate().isNotEmpty) return true;
await tester.drag(scrollable, const Offset(0, -300));
await tester.pumpAndSettle();
}
return false;
}
Future<void> waitForApiResponse(WidgetTester tester, {int maxWaitMs = 5000}) async {
final startTime = DateTime.now();
while (DateTime.now().difference(startTime).inMilliseconds < maxWaitMs) {
await tester.pump(const Duration(milliseconds: 100));
if (find.byType(CircularProgressIndicator).evaluate().isEmpty) break;
}
await tester.pumpAndSettle();
}
async function safeClick(selector, options = { timeout: 5000 }) {
try {
await page.waitForSelector(selector, { state: 'visible', timeout: options.timeout });
await page.click(selector);
return true;
} catch (e) {
console.warn(`Element not found: ${selector}`);
return false;
}
}
async function waitForApi(urlPattern, options = { timeout: 10000 }) {
return page.waitForResponse(
response => response.url().includes(urlPattern) && response.status() === 200,
{ timeout: options.timeout }
);
}
{
"error": "Element not found / No element / Bad state: No element",
"cause": "Element not rendered, wrong selector, or not in viewport",
"tier": "platinum",
"confidence": 0.97,
"fixes": [
"Wait for element to be rendered before interaction",
"Use safe interaction helpers instead of direct calls",
"Verify selector matches actual element",
"Scroll element into view before interaction"
]
}
{
"error": "Timeout / pumpAndSettle timed out / waiting for selector",
"cause": "Infinite animation, continuous rebuild, or slow API",
"tier": "gold",
"confidence": 0.89,
"fixes": [
"Use fixed-duration wait instead of settle/idle wait",
"Dispose animation controllers in tearDown",
"Check for infinite re-render loops",
"Increase timeout for slow API calls"
]
}
{
"error": "Expected: X, Actual: Y / AssertionError",
"cause": "State not updated or wrong expectation",
"tier": "silver",
"confidence": 0.78,
"fixes": [
"Add delay before assertion for async state updates",
"Verify test data seeding completed",
"Check async operation completion before asserting"
]
}
{
"error": "Type error / null value / schema mismatch",
"cause": "Backend response format changed",
"tier": "gold",
"confidence": 0.86,
"fixes": [
"Update model/DTO to match current API response",
"Add null safety handling",
"Check API version compatibility"
]
}
The Learning Optimizer maintains coverage status per context:
{
"lastRun": "2026-02-07T11:00:00Z",
"backendStatus": {
"healthy": true,
"port": 8080
},
"gateStatus": {
"functional": "PASS",
"behavioral": "PASS",
"coverage": "PASS",
"security": "PASS",
"accessibility": "WARNING",
"resilience": "PASS",
"contract": "PASS"
},
"contexts": {
"[context-a]": { "total": 68, "passing": 68, "failing": 0, "behavioralCoverage": 100 },
"[context-b]": { "total": 72, "passing": 70, "failing": 2, "behavioralCoverage": 97 }
},
"totalPaths": 0,
"passingPaths": 0,
"coveragePercent": 0,
"confidenceTiers": {
"platinum": 0,
"gold": 0,
"silver": 0,
"bronze": 0,
"expired": 0
}
}
fix(forge): Fix [TEST_ID] - [brief description]
Behavioral Spec: [Gherkin scenario name]
Root Cause: [what caused the failure]
- [specific issue 1]
- [specific issue 2]
Fix Applied:
- [change 1]
- [change 2]
Quality Gates:
- Functional: PASS
- Behavioral: PASS
- Coverage: [X]%
- Security: PASS
- Accessibility: PASS/WARNING
- Resilience: PASS
- Contract: PASS
Test Verification:
- Test now passes after fix
- No regression in related tests
- Dependent contexts re-tested: [list]
Confidence Tier: [platinum|gold|silver|bronze]
Pattern Stored: fix-[error-type]-[hash]
If a fix introduces regressions:
# Retrieve last known good commit
npx @claude-flow/cli@latest memory retrieve --key "last-green-commit" --namespace forge-state
# Rollback to that commit
git revert [bad-commit-hash]
# Store rollback event for learning (prevents pattern from being re-applied)
npx @claude-flow/cli@latest memory store \
--key "rollback-[timestamp]" \
--value '{"commit":"[hash]","reason":"[reason]","pattern":"[pattern-key]"}' \
--namespace forge-patterns
# Demote the fix pattern confidence (-0.10)
# Learning Optimizer will handle this automatically
When Bug Fixer's fix causes a cascade regression (tests in dependent contexts fail):
ESCALATION: Test [testId] has regressed 2x after fix attempts.
Original failure: [description]
Cascade failure: [description]
Attempted fixes: [list]
Recommendation: Manual review required.
When two agents disagree (e.g., Bug Fixer wants to change a file that Spec Verifier says shouldn't change):
After each autonomous run, the skill triggers comprehensive learning:
# Train on successful patterns
npx @claude-flow/cli@latest hooks post-task --task-id "forge-run" --success true --store-results true
# Update neural patterns
npx @claude-flow/cli@latest neural train --pattern-type forge-fixes --epochs 5
# Update defect predictions
npx @claude-flow/cli@latest memory store \
--key "prediction-$(date +%Y-%m-%d)" \
--value "[prediction JSON from Learning Optimizer]" \
--namespace forge-predictions
# Export metrics
npx @claude-flow/cli@latest hooks metrics --format json
Forge can be extended per-project by creating a forge.contexts.yaml file alongside the skill:
# forge.contexts.yaml ā Project-specific bounded contexts and screens
contexts:
- name: identity
testFile: click_through_identity_full_test.dart
specFile: identity.feature
paths: 68
subdomains: [Auth, Profiles, Verification]
screens:
- name: Identity Verification
file: lib/screens/compliance/identity_verification_screen.dart
route: /verification
cyclomaticPaths:
- All verifications incomplete -> show progress 0%
- Email only verified -> show 25%
- All verified -> show 100% + celebration state
- name: payments
testFile: click_through_payments_test.dart
specFile: payments.feature
paths: 89
subdomains: [Wallet, Cards, Transactions]
dependencies:
identity:
blocks: [rides, payments, driver]
payments:
depends_on: [identity]
blocks: [rides, subscriptions]
This separates the generic Forge engine from project-specific configuration, making Forge reusable across any codebase.
Before running Forge:
After Forge completes:
Generated Mar 1, 2026
A retail company migrating from a monolithic to microservices architecture uses Forge to ensure all user journeys, from browsing to checkout, remain functional. Forge's architecture adaptability and real API testing validate each service independently and their interactions, preventing regression during the transition.
A financial technology startup employs Forge to rigorously test payment processing and identity verification services without mocking. The skill's continuous behavioral verification and quality gates ensure compliance with regulatory standards by catching serialization and validation issues in real-time.
A hospital network integrates multiple modular monoliths for patient records and appointment scheduling. Forge discovers the project structure and runs exhaustive E2E tests to verify data flows and dependencies, ensuring patient safety and system reliability across contexts.
A transportation company building a mobile app with a backend API uses Forge to test the full stack, including real-time location and payment services. The autonomous fix loop and chaos testing help identify and resolve integration bugs before deployment, enhancing user experience.
An educational technology firm with a monorepo structure leverages Forge to test multiple apps and packages simultaneously. The skill's project discovery and dependency awareness ensure that updates to one module do not break others, supporting rapid scaling and feature additions.
Offer Forge as a cloud-based service with tiered pricing based on usage, such as test runs or project size. This model provides recurring revenue and scalability, appealing to startups and enterprises seeking automated quality engineering without upfront infrastructure costs.
Sell perpetual licenses or annual contracts to large organizations for on-premises deployment. This model includes customization, support, and training services, generating high-value deals and long-term partnerships in regulated industries like finance and healthcare.
Provide professional services to help companies integrate Forge into their development workflows, including architecture analysis and custom configuration. This model leverages expertise to drive adoption and can be combined with subscription or licensing for added revenue streams.
š¬ Integration Tip
Start by running Forge's project discovery to auto-generate a configuration, then review and adjust the forge.config.yaml as needed to align with your specific architecture and testing requirements.
Captures learnings, errors, and corrections to enable continuous improvement. Use when: (1) A command or operation fails unexpectedly, (2) User corrects Clau...
Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.
Search and analyze your own session logs (older/parent conversations) using jq.
Typed knowledge graph for structured agent memory and composable skills. Use when creating/querying entities (Person, Project, Task, Event, Document), linking related objects, enforcing constraints, planning multi-step actions as graph transformations, or when skills need to share state. Trigger on "remember", "what do I know about", "link X to Y", "show dependencies", entity CRUD, or cross-skill data access.
Ultimate AI agent memory system for Cursor, Claude, ChatGPT & Copilot. WAL protocol + vector search + git-notes + cloud backup. Never lose context again. Vibe-coding ready.
Headless browser automation CLI optimized for AI agents with accessibility tree snapshots and ref-based element selection