# crawl4ai

AI-powered web scraping framework for extracting structured data from websites. Use when Codex needs to crawl, scrape, or extract data from web pages using AI-powered parsing, handle dynamic content, or work with complex HTML structures.
Install via ClawdBot CLI:

```shell
clawdbot install codylrn804/crawl4ai
```

Crawl4ai is an AI-powered web scraping framework designed to extract structured data from websites efficiently. It combines traditional HTML parsing with AI to handle dynamic content, extract text intelligently, and clean and structure data from complex web pages.
Use when Codex needs to:

- Crawl, scrape, or extract data from web pages using AI-powered parsing
- Handle dynamic, JavaScript-rendered content
- Work with complex or messy HTML structures

Trigger phrases: "scrape this site", "extract data from [URL]", "get all titles from [URL]"
Basic page scrape:

```python
from crawl4ai import AsyncWebCrawler, BrowserMode

async def scrape_page(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            browser_mode=BrowserMode.LATEST,
            headless=True
        )
        return result.markdown, result.clean_html
```
Extracting structured content:

```python
from crawl4ai import AsyncWebCrawler

async def extract_products(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            screenshot=True,
            javascript=True,
            bypass_cache=True
        )
        # Extract product data
        products = []
        for item in result.extracted_content:
            if item['type'] == 'product':
                products.append({
                    'name': item['name'],
                    'price': item['price'],
                    'url': item['url']
                })
        return products
```
Scenario: User wants to scrape a website for all article titles.

```python
from crawl4ai import AsyncWebCrawler

async def scrape_articles(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            javascript=True,
            verbose=True
        )
        # Extract article titles from the parsed content
        articles = result.extracted_content if result.extracted_content else []
        titles = [item.get('name', item.get('text', '')) for item in articles]
        return titles
```
Trigger: "Scrape this site for article titles" or "Get all titles from [URL]"
Scenario: Website loads data via JavaScript.

```python
from crawl4ai import AsyncWebCrawler

async def scrape_dynamic_site(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            javascript=True,   # Wait for JS execution
            wait_for="body",   # Wait for a specific element
            delay=1.5,         # Wait time after load
            headless=True
        )
        return result.markdown
```
Trigger: "Scrape this dynamic website" or "This page needs JavaScript to load data"
Scenario: Extract specific fields such as prices and descriptions.

```python
from crawl4ai import AsyncWebCrawler

async def extract_product_details(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            screenshot=True,
            js_code="""
                const products = document.querySelectorAll('.product');
                return Array.from(products).map(p => ({
                    name: p.querySelector('.name')?.textContent,
                    price: p.querySelector('.price')?.textContent,
                    url: p.querySelector('a')?.href
                }));
            """
        )
        return result.extracted_content
```
Trigger: "Extract product details from this page" or "Get price and name from [URL]"
Scenario: Clean messy HTML and extract readable content.

```python
from crawl4ai import AsyncWebCrawler

async def clean_and_parse(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            remove_tags=['script', 'style', 'nav', 'footer', 'header'],
            only_main_content=True
        )
        # Return the cleaned HTML
        return result.clean_html
```
Trigger: "Clean this HTML" or "Extract main content from this page"
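For context on what `remove_tags` does, a rough stand-alone equivalent of tag stripping can be sketched with the standard library alone. The `MainContentExtractor` class and `strip_tags` helper below are illustrative names, not part of crawl4ai:

```python
from html.parser import HTMLParser

# Tags whose contents should be dropped, mirroring the remove_tags example above
SKIP_TAGS = {"script", "style", "nav", "footer", "header"}

class MainContentExtractor(HTMLParser):
    """Collects text while skipping content nested inside unwanted tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside every skipped tag
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_tags(html):
    parser = MainContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

This is only a sketch of the idea; crawl4ai's own cleaning additionally handles markdown conversion and main-content detection.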
Running custom JavaScript only:

```python
from crawl4ai import AsyncWebCrawler

async def custom_scrape(url, custom_js):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            js_code=custom_js,
            js_only=True  # Only execute JS, don't download resources
        )
        return result.extracted_content
```
Multi-page scraping with sessions:

```python
from crawl4ai import AsyncWebCrawler

async def multi_page_scrape(urls):
    async with AsyncWebCrawler() as crawler:
        results = []
        for url in urls:
            result = await crawler.arun(
                url=url,
                session_id=f"session_{url}",
                bypass_cache=True
            )
            results.append({
                'url': url,
                'content': result.markdown,
                'status': result.success
            })
        return results
```
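The loop above fetches pages one at a time. When pages are independent, `asyncio.gather` can fan the fetches out concurrently. The sketch below uses a stub coroutine (`fetch_stub`, a hypothetical stand-in for `crawler.arun`) so the pattern is runnable without crawl4ai installed:

```python
import asyncio

async def fetch_stub(url):
    # Stand-in for crawler.arun(url=url); returns fake markdown for the demo
    await asyncio.sleep(0)
    return f"# Page at {url}"

async def scrape_concurrently(urls):
    # Launch all fetches at once instead of awaiting them one by one
    results = await asyncio.gather(*(fetch_stub(u) for u in urls))
    return dict(zip(urls, results))

pages = asyncio.run(scrape_concurrently(["https://a.test", "https://b.test"]))
```

With a real crawler, keep the single `AsyncWebCrawler` context open around the `gather` call so the browser instance is shared across tasks.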
Error handling:

```python
from crawl4ai import AsyncWebCrawler

async def robust_scrape(url):
    try:
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                timeout=30000  # 30-second timeout (milliseconds)
            )
            if result.success:
                return result.markdown, result.extracted_content
            print(f"Scraping failed: {result.error_message}")
            return None, None
    except Exception as e:
        print(f"Scraping error: {e}")
        return None, None
```
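Transient failures (timeouts, rate limits) often succeed on a second attempt, so it is common to wrap a scrape call in retries with exponential backoff. The `with_retries` helper below is an illustrative sketch, not a crawl4ai API; the `flaky` function simulates a call that fails twice before succeeding:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Simulated scrape that fails on the first two calls
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retries(flaky)  # succeeds on the third attempt
```

In production, catch only the exception types you expect to be transient rather than bare `Exception`.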
Crawl4ai supports multiple output formats:

- Markdown (result.markdown)
- Cleaned HTML (result.clean_html)
- Structured data (result.extracted_content)
- Screenshots (result.screenshot)
- Links (result.links)

Python scripts for common crawling operations:

- scrape_single_page.py - Basic scraping utility
- scrape_multiple_pages.py - Batch scraping with pagination
- extract_from_html.py - HTML parsing helper
- clean_html.py - HTML cleaning utility

Documentation and examples:

- api_reference.md - Complete API documentation
- examples.md - Common use cases and patterns
- error_handling.md - Troubleshooting guide

Generated Mar 1, 2026
Use cases:

- Price monitoring: Automatically scrape competitor websites to extract product prices, descriptions, and availability for real-time price comparison and inventory tracking. This helps businesses adjust pricing strategies and monitor market trends efficiently.
- Content aggregation: Scrape news articles, blogs, and media sites to gather headlines, summaries, and publication dates for content curation and trend analysis. This supports media companies in creating aggregated feeds and performing sentiment analysis.
- Real estate research: Extract property details such as prices, locations, square footage, and amenities from real estate websites for market research and lead generation. This aids agencies in compiling comprehensive property databases and identifying investment opportunities.
- Recruitment analytics: Scrape job postings from various career sites to collect job titles, descriptions, salaries, and company information for recruitment analytics and job matching services. This helps HR firms and job seekers stay updated on market demands.
- Academic research: Extract structured data from academic journals, conference proceedings, and educational websites for literature reviews and data analysis in research projects. This assists researchers in automating data collection and synthesizing information from diverse sources.

Monetization ideas:

- SaaS platform: Offer a cloud-based web scraping platform with tiered pricing based on usage volume, features like AI-powered parsing, and API access. Revenue is generated through monthly or annual subscriptions from businesses needing automated data extraction.
- Custom services: Provide tailored web scraping solutions for specific client needs, such as one-time data extraction projects or ongoing monitoring services. Revenue comes from project-based fees or retainer contracts for continuous data delivery.
- Data products: Scrape and aggregate data from public websites, clean and structure it, then sell the datasets to third parties like market research firms or analytics companies. Revenue is generated through one-time sales or licensing agreements for data access.
💬 Integration Tip
Integrate with existing data pipelines using the provided Python API, and ensure compliance with website terms by implementing rate limiting and respecting robots.txt files.
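Both compliance measures from the tip above can be implemented with the standard library alone. The `RateLimiter` class below is an illustrative sketch (not a crawl4ai feature); in practice you would fetch robots.txt from the site root rather than parsing an inline string:

```python
import time
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (normally fetched from https://<site>/robots.txt)
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed = rp.can_fetch("my-bot", "https://example.com/articles/1")
blocked = rp.can_fetch("my-bot", "https://example.com/private/data")

class RateLimiter:
    """Enforces a minimum interval between consecutive requests."""
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last = 0.0

    def wait(self):
        # Sleep just long enough to keep requests min_interval apart
        elapsed = time.monotonic() - self.last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last = time.monotonic()
```

Call `limiter.wait()` before each `crawler.arun(...)` to keep request rates polite, and skip any URL for which `can_fetch` returns `False`.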