Playwright Scraper Skill: Beat Anti-Bot Protection with Two-Mode Web Scraping for AI Agents
15,175+ downloads and 30 stars on ClawHub. The playwright-scraper-skill by @waisimon is an OpenClaw skill that gives AI agents a real browser — with full Playwright power and anti-bot protection built in. Verified 100% success rate on Cloudflare-protected sites like Discuss.com.hk. No external API keys needed.
The Problem It Solves
Every AI agent that fetches web content hits the same wall eventually. Simple HTTP fetches return blank pages on JavaScript-heavy sites. Cloudflare intercepts automation attempts and returns 403s. You need a real browser — one that looks, behaves, and waits like a human.
The standard web tools built into AI agents aren't designed for adversarial websites. This skill fills that gap with Playwright, configured specifically for anti-bot evasion.
Two Modes, One Skill
The core architecture is a decision tree. The skill tells your agent exactly which tool to reach for and when:
| Target Website | Anti-Bot Level | Recommended Method |
|---|---|---|
| Regular sites | None | web_fetch (built-in) |
| Dynamic sites (JS) | Medium | playwright-simple.js |
| Cloudflare/protected | High | playwright-stealth.js ⭐ |
| YouTube | Special | deep-scraper (separate) |
| Reddit | Special | reddit-scraper (separate) |
This prevents agents from blindly reaching for the heavy tool. Try the lightest option first.
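The decision tree above can be sketched as a small helper. This is a minimal illustration, not the skill's actual code: the function name `chooseMethod` and the level labels are assumptions for the example.

```javascript
// Sketch of the skill's escalation decision tree (illustrative only).
// The real logic lives in the skill's SKILL.md instructions; names here
// are assumptions for the example.
function chooseMethod(target) {
  // Platform-specific skills take priority over generic scraping.
  if (target === "youtube") return "deep-scraper";
  if (target === "reddit") return "reddit-scraper";
  switch (target) {
    case "static":    return "web_fetch";              // lightest tool first
    case "dynamic":   return "playwright-simple.js";   // JS rendering needed
    case "protected": return "playwright-stealth.js";  // Cloudflare etc.
    default:          return "web_fetch";              // try cheap, then escalate
  }
}

console.log(chooseMethod("protected")); // "playwright-stealth.js"
```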
Simple Mode
For JavaScript-rendered pages without anti-bot protection:
```
node scripts/playwright-simple.js "https://example.com"
```

Returns JSON with title, URL, content preview, and elapsed time. Completes in 3–5 seconds.
Supports environment variables for customization:
```
# Show browser window (debug)
HEADLESS=false node scripts/playwright-simple.js <URL>

# Custom wait time
WAIT_TIME=5000 node scripts/playwright-simple.js <URL>

# Save screenshot
SCREENSHOT_PATH=/tmp/page.png node scripts/playwright-simple.js <URL>
```

Stealth Mode
For Cloudflare challenges, 403s, and heavily protected sites:
```
node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"
```

The stealth mode applies a specific set of anti-detection techniques:
- Hides `navigator.webdriver` — the single most important signal bots expose
- iPhone User-Agent — realistic device fingerprint
- Human-like behavior — random delays, scroll simulation
- Mock permissions — prevents permission API fingerprinting
- Screenshot + HTML export — saves evidence of what the agent saw
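The `navigator.webdriver` patch at the heart of the list above can be sketched as follows. This is a minimal illustration of the technique, not the skill's exact implementation; the `hideWebdriver` helper and the stand-in object are assumptions for the example.

```javascript
// Sketch of the navigator.webdriver patch a stealth script applies
// before any page script runs (illustrative; the skill's actual code
// may differ).
function hideWebdriver(nav) {
  // Replace the property with a getter that reports undefined,
  // the value a normal (non-automated) browser exposes.
  Object.defineProperty(nav, "webdriver", { get: () => undefined });
  return nav;
}

// In Playwright this would be injected per-context, e.g.:
//   await context.addInitScript(() => {
//     Object.defineProperty(navigator, "webdriver", { get: () => undefined });
//   });
// and an iPhone fingerprint would come from a device preset:
//   const context = await browser.newContext({ ...devices["iPhone 13"] });

// Demonstrate on a stand-in object:
const fakeNavigator = { webdriver: true };
hideWebdriver(fakeNavigator);
console.log(fakeNavigator.webdriver); // undefined
```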
Additional environment options:
```
# Save HTML for deeper analysis
SAVE_HTML=true node scripts/playwright-stealth.js <URL>

# Custom User-Agent
USER_AGENT="Mozilla/5.0 (iPhone...)" node scripts/playwright-stealth.js <URL>

# Headful mode for stubborn Cloudflare challenges
HEADLESS=false WAIT_TIME=30000 node scripts/playwright-stealth.js <URL>
```

Installation
```
# Install via ClawHub
clawhub install playwright-scraper-skill

# Install dependencies
cd playwright-scraper-skill
npm install
npx playwright install chromium
```

Requirements: Node.js v18+, ~500MB disk space for Chromium.
Real Performance Numbers
Tested against actual sites (as of February 2026):
| Method | Speed | Anti-Bot | Discuss.com.hk Success Rate |
|---|---|---|---|
| web_fetch (built-in) | Fastest | None | 0% |
| Playwright Simple | 3–5s | Low | ~20% |
| Playwright Stealth | 5–20s | Medium-High | 100% |
| Puppeteer standard | ~5s | Medium | ~80% |
| Crawlee (deep-scraper) | Slow | Detected | 0% |
| Chaser (Rust) | Medium | Detected | 0% |
The lesson from testing: framework overhead gets you detected. Pure Playwright with targeted anti-bot techniques outperforms complex frameworks.
What the Agent Sees (Output Format)
Both scripts return structured JSON:
```json
{
  "title": "Discuss.com.hk — Hot Topics",
  "url": "https://m.discuss.com.hk/#hot",
  "htmlLength": 124508,
  "contentPreview": "...",
  "cloudflare": false,
  "screenshot": "./screenshot-1739876543.png",
  "elapsedSeconds": "8.34",
  "data": {
    "links": [
      { "text": "Post title...", "href": "https://m.discuss.com.hk/..." }
    ]
  }
}
```

Agents can parse this directly. The `cloudflare` boolean tells the agent whether a challenge page was detected so it can adjust wait times automatically.
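An agent-side consumer of this output might escalate the wait time whenever the `cloudflare` flag is set. This is a sketch under assumptions: the field name follows the output format above, but the `nextWaitTime` helper and its doubling/cap values are illustrative, not part of the skill.

```javascript
// Sketch: pick the next WAIT_TIME based on the scraper's JSON output.
// The `cloudflare` field matches the output format shown above; the
// escalation policy (double, cap at 30s) is an illustrative assumption.
function nextWaitTime(result, currentMs = 5000) {
  if (result.cloudflare) {
    // Challenge page detected: double the wait, capped at 30 seconds.
    return Math.min(currentMs * 2, 30000);
  }
  return currentMs; // page loaded cleanly, keep the current wait
}

const output = { cloudflare: true, htmlLength: 1200 };
console.log(nextWaitTime(output, 5000)); // 10000
```

The returned value would then be passed back as the `WAIT_TIME` environment variable on the retry.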
Practical Tips
- Always try `web_fetch` first — it's fastest and free. Only escalate when you get empty or blocked responses.
- Increase `WAIT_TIME` for Cloudflare — the default 5s often isn't enough. Try 10000–15000ms for stubborn challenges.
- `HEADLESS=false` sometimes bypasses Cloudflare — headful mode has a slightly different fingerprint that clears some challenges headless fails.
- Batch scraping: the scripts process one URL at a time. For batch jobs, call them in a shell loop with delays between requests to avoid rate limits.
- Screenshots are your debugging tool — if an agent reports wrong content, the screenshot reveals exactly what Playwright actually rendered.
Considerations
- ~500MB disk requirement — Chromium is large. Not suitable for severely disk-constrained environments.
- 5–20 second latency in Stealth mode — not suitable for real-time applications.
- Cloudflare bypass is not guaranteed — enterprise Cloudflare with CAPTCHA challenges can still block. In that case, `HEADLESS=false WAIT_TIME=30000` is your last resort before rotating IPs.
- No proxy rotation built in — the CHANGELOG lists this as a planned future feature. For high-volume scraping with IP rotation, you'll need to extend the scripts.
- YouTube and Reddit need dedicated skills — the SKILL.md explicitly directs you to `deep-scraper` and `reddit-scraper` for those platforms.
The Bigger Picture
The playwright-scraper-skill represents a practical philosophy: don't overengineer. Crawlee, Puppeteer Extra, Rust-based browsers — the CHANGELOG documents testing all of them on real anti-bot-protected sites. The winner was the simplest approach: pure Playwright with a few surgical anti-detection patches.
For AI agents in 2026, web access means navigating an increasingly adversarial web. Skills like this one close the gap between what agents can ask for and what they can actually retrieve.
View the skill on ClawHub: playwright-scraper-skill