Playwright Scraper Skill: Beat Anti-Bot Protection with Two-Mode Web Scraping for AI Agents
15,175+ downloads and 30 stars on ClawHub. The playwright-scraper-skill by @waisimon is an OpenClaw skill that gives AI agents a real browser — with full Playwright power and anti-bot protection built in. Verified 100% success rate on Cloudflare-protected sites like Discuss.com.hk. No external API keys needed.
The Problem It Solves
Every AI agent that fetches web content hits the same wall eventually. Simple HTTP fetches return blank pages on JavaScript-heavy sites. Cloudflare intercepts automation attempts and returns 403s. You need a real browser — one that looks, behaves, and waits like a human.
The standard web tools built into AI agents aren't designed for adversarial websites. This skill fills that gap with Playwright, configured specifically for anti-bot evasion.
Two Modes, One Skill
The core architecture is a decision tree. The skill tells your agent exactly which tool to reach for and when:
| Target Website | Anti-Bot Level | Recommended Method |
|---|---|---|
| Regular sites | None | web_fetch (built-in) |
| Dynamic sites (JS) | Medium | playwright-simple.js |
| Cloudflare/protected | High | playwright-stealth.js ⭐ |
| YouTube | Special | deep-scraper (separate) |
| Reddit | Special | reddit-scraper (separate) |
This prevents agents from blindly reaching for the heavy tool. Try the lightest option first.
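The decision tree above can be sketched as a small helper. This is a minimal illustration, not the skill's actual code: the function name `chooseMethod` and the level labels are assumptions for the example.

```javascript
// Sketch of the skill's escalation decision tree (illustrative only).
// The real logic lives in the skill's SKILL.md instructions; names here
// are assumptions for the example.
function chooseMethod(target) {
  // Platform-specific skills take priority over generic scraping.
  if (target === "youtube") return "deep-scraper";
  if (target === "reddit") return "reddit-scraper";
  switch (target) {
    case "static":    return "web_fetch";              // lightest tool first
    case "dynamic":   return "playwright-simple.js";   // JS rendering needed
    case "protected": return "playwright-stealth.js";  // Cloudflare etc.
    default:          return "web_fetch";              // try cheap, then escalate
  }
}

console.log(chooseMethod("protected")); // "playwright-stealth.js"
```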
Simple Mode
For JavaScript-rendered pages without anti-bot protection:
```
node scripts/playwright-simple.js "https://example.com"
```

Returns JSON with title, URL, content preview, and elapsed time. Completes in 3–5 seconds.
Supports environment variables for customization:
```
# Show browser window (debug)
HEADLESS=false node scripts/playwright-simple.js <URL>

# Custom wait time
WAIT_TIME=5000 node scripts/playwright-simple.js <URL>

# Save screenshot
SCREENSHOT_PATH=/tmp/page.png node scripts/playwright-simple.js <URL>
```

Stealth Mode
For Cloudflare challenges, 403s, and heavily protected sites:
```
node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"
```

The stealth mode applies a specific set of anti-detection techniques:
- Hides `navigator.webdriver` — the single most important signal bots expose
- iPhone User-Agent — realistic device fingerprint
- Human-like behavior — random delays, scroll simulation
- Mock permissions — prevents permission API fingerprinting
- Screenshot + HTML export — saves evidence of what the agent saw
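The `navigator.webdriver` patch at the heart of the list above can be sketched as follows. This is a minimal illustration of the technique, not the skill's exact implementation; the `hideWebdriver` helper and the stand-in object are assumptions for the example.

```javascript
// Sketch of the navigator.webdriver patch a stealth script applies
// before any page script runs (illustrative; the skill's actual code
// may differ).
function hideWebdriver(nav) {
  // Replace the property with a getter that reports undefined,
  // the value a normal (non-automated) browser exposes.
  Object.defineProperty(nav, "webdriver", { get: () => undefined });
  return nav;
}

// In Playwright this would be injected per-context, e.g.:
//   await context.addInitScript(() => {
//     Object.defineProperty(navigator, "webdriver", { get: () => undefined });
//   });
// and an iPhone fingerprint would come from a device preset:
//   const context = await browser.newContext({ ...devices["iPhone 13"] });

// Demonstrate on a stand-in object:
const fakeNavigator = { webdriver: true };
hideWebdriver(fakeNavigator);
console.log(fakeNavigator.webdriver); // undefined
```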
Additional environment options:
```
# Save HTML for deeper analysis
SAVE_HTML=true node scripts/playwright-stealth.js <URL>

# Custom User-Agent
USER_AGENT="Mozilla/5.0 (iPhone...)" node scripts/playwright-stealth.js <URL>

# Headful mode for stubborn Cloudflare challenges
HEADLESS=false WAIT_TIME=30000 node scripts/playwright-stealth.js <URL>
```

Installation
```
# Install via ClawHub
clawhub install playwright-scraper-skill

# Install dependencies
cd playwright-scraper-skill
npm install
npx playwright install chromium
```

Requirements: Node.js v18+, ~500MB disk space for Chromium.
Real Performance Numbers
Tested against actual sites (as of February 2026):
| Method | Speed | Anti-Bot | Discuss.com.hk Success Rate |
|---|---|---|---|
| web_fetch (built-in) | Fastest | None | 0% |
| Playwright Simple | 3–5s | Low | ~20% |
| Playwright Stealth | 5–20s | Medium-High | 100% |
| Puppeteer standard | ~5s | Medium | ~80% |
| Crawlee (deep-scraper) | Slow | Detected | 0% |
| Chaser (Rust) | Medium | Detected | 0% |
The lesson from testing: framework overhead gets you detected. Pure Playwright with targeted anti-bot techniques outperforms complex frameworks.
What the Agent Sees (Output Format)
Both scripts return structured JSON:
```json
{
  "title": "Discuss.com.hk — Hot Topics",
  "url": "https://m.discuss.com.hk/#hot",
  "htmlLength": 124508,
  "contentPreview": "...",
  "cloudflare": false,
  "screenshot": "./screenshot-1739876543.png",
  "elapsedSeconds": "8.34",
  "data": {
    "links": [
      { "text": "Post title...", "href": "https://m.discuss.com.hk/..." }
    ]
  }
}
```

Agents can parse this directly. The `cloudflare` boolean tells the agent whether a challenge page was detected so it can adjust wait times automatically.
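An agent-side consumer of this output might escalate the wait time whenever the `cloudflare` flag is set. This is a sketch under assumptions: the field name follows the output format above, but the `nextWaitTime` helper and its doubling/cap values are illustrative, not part of the skill.

```javascript
// Sketch: pick the next WAIT_TIME based on the scraper's JSON output.
// The `cloudflare` field matches the output format shown above; the
// escalation policy (double, cap at 30s) is an illustrative assumption.
function nextWaitTime(result, currentMs = 5000) {
  if (result.cloudflare) {
    // Challenge page detected: double the wait, capped at 30 seconds.
    return Math.min(currentMs * 2, 30000);
  }
  return currentMs; // page loaded cleanly, keep the current wait
}

const output = { cloudflare: true, htmlLength: 1200 };
console.log(nextWaitTime(output, 5000)); // 10000
```

The returned value would then be passed back as the `WAIT_TIME` environment variable on the retry.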
Practical Tips
- Always try `web_fetch` first — it's fastest and free. Only escalate when you get empty or blocked responses.
- Increase `WAIT_TIME` for Cloudflare — the default 5s often isn't enough. Try 10000–15000ms for stubborn challenges.
- `HEADLESS=false` sometimes bypasses Cloudflare — headful mode has a slightly different fingerprint that clears some challenges headless fails.
- Batch scraping: the scripts process one URL at a time. For batch jobs, call them in a shell loop with delays between requests to avoid rate limits.
- Screenshots are your debugging tool — if an agent reports wrong content, the screenshot reveals exactly what Playwright actually rendered.
Considerations
- ~500MB disk requirement — Chromium is large. Not suitable for severely disk-constrained environments.
- 5–20 second latency in Stealth mode — not suitable for real-time applications.
- Cloudflare bypass is not guaranteed — enterprise Cloudflare with CAPTCHA challenges can still block. In that case, `HEADLESS=false WAIT_TIME=30000` is your last resort before rotating IPs.
- No proxy rotation built in — the CHANGELOG lists this as a planned future feature. For high-volume scraping with IP rotation, you'll need to extend the scripts.
- YouTube and Reddit need dedicated skills — the SKILL.md explicitly directs you to `deep-scraper` and `reddit-scraper` for those platforms.
The Bigger Picture
The playwright-scraper-skill represents a practical philosophy: don't overengineer. Crawlee, Puppeteer Extra, Rust-based browsers — the CHANGELOG documents testing all of them on real anti-bot-protected sites. The winner was the simplest approach: pure Playwright with a few surgical anti-detection patches.
For AI agents in 2026, web access means navigating an increasingly adversarial web. Skills like this one close the gap between what agents can ask for and what they can actually retrieve.
View the skill on ClawHub: playwright-scraper-skill