ClawHub Skills Lib

PDF Skill: Extract Text, Tables, and Forms — or Create PDFs From Scratch

March 11, 2026 · 6 min read

16,327 downloads, 237 installs, 25 stars. The PDF skill by @awspace is a PDF toolkit that covers the full document lifecycle: reading text, extracting tables, creating new PDFs, merging, splitting, adding watermarks, OCR for scanned pages, and form filling. It is built on the Python libraries pypdf, pdfplumber, and reportlab.

If your agent interacts with documents — and most do — this skill handles the PDF side completely.

The Problem It Solves

PDFs are everywhere in business workflows: invoices, contracts, reports, research papers, forms. But they're hostile to programmatic access: the format is binary, the structure varies from file to file, and a given PDF may be a scanned image pretending to be text or an interactive form. As a result, each operation usually calls for a different tool.

The PDF skill bundles the right Python library for each task:

  • pypdf — basic operations (merge, split, rotate, metadata, password)
  • pdfplumber — text and table extraction with layout awareness
  • reportlab — creating new PDFs programmatically
  • pytesseract + pdf2image — OCR for scanned documents
  • pdftotext/qpdf/pdftk — CLI tools for common operations

Text Extraction

Basic Extraction

from pypdf import PdfReader

reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Join page texts with newlines so words at page boundaries don't run together.
text = "\n".join(page.extract_text() or "" for page in reader.pages)

Layout-Aware Extraction (pdfplumber)

For documents where spatial layout matters (columns, tables, headers):

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        # Pages without a text layer may yield None or an empty string.
        text = page.extract_text() or ""
        print(text)

Command Line

# Basic text extraction
pdftotext input.pdf output.txt
 
# Preserve layout (columns, spacing)
pdftotext -layout input.pdf output.txt
 
# Specific pages only
pdftotext -f 1 -l 5 input.pdf output.txt
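
When an agent drives pdftotext from Python rather than from a shell, the same flags can be assembled as an argument list. The following is a minimal sketch, not part of the skill itself: the helper names build_pdftotext_cmd and run_pdftotext are hypothetical, and only the -layout, -f, and -l flags shown above are used.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_cmd(
    pdf_path: str,
    txt_path: str,
    layout: bool = False,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
) -> List[str]:
    """Assemble a pdftotext argument list (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")            # preserve columns and spacing
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert (1-based)
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    cmd += [pdf_path, txt_path]
    return cmd


def run_pdftotext(pdf_path: str, txt_path: str, **options) -> None:
    """Run pdftotext; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, **options), check=True)
```

Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.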

Table Extraction

One of the most valuable capabilities for data work:

import pdfplumber
 
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)

Export Tables to Excel

import pandas as pd
import pdfplumber

all_tables = []
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            # Skip empty tables; treat the first row as the header.
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Writing .xlsx files requires an Excel engine such as openpyxl.
if all_tables:
    combined = pd.concat(all_tables, ignore_index=True)
    combined.to_excel("extracted_tables.xlsx", index=False)

This pattern (PDF tables → pandas DataFrame → Excel) covers many financial document and report extraction workflows. Note that pd.concat aligns columns by header name, so tables with different headers end up in one sparse sheet.
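
Extracted header rows are not always clean: cells can be empty, wrapped across lines, or repeated, and that can trip up the DataFrame step. The sketch below shows one way to tidy a table before handing it to pandas. It works on plain lists of rows like those returned by extract_tables(); the function names clean_header and table_to_records are hypothetical and not part of the skill.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def clean_header(header: Row) -> List[str]:
    """Turn raw header cells into unique, non-empty column names."""
    names: List[str] = []
    used = set()
    for index, cell in enumerate(header):
        # Collapse internal whitespace; empty cells get a positional name.
        base = " ".join((cell or "").split()) or f"column_{index + 1}"
        name, suffix = base, 1
        while name in used:              # de-duplicate repeated headers
            suffix += 1
            name = f"{base}_{suffix}"
        used.add(name)
        names.append(name)
    return names


def table_to_records(table: List[Row]) -> List[Dict[str, Optional[str]]]:
    """Convert one table (header row first) into a list of dicts."""
    if not table:
        return []
    columns = clean_header(table[0])
    records = []
    for row in table[1:]:
        # Pad short rows so every record has the same keys.
        padded = list(row) + [None] * (len(columns) - len(row))
        records.append(dict(zip(columns, padded)))
    return records
```

A list of records like this can be passed straight to pd.DataFrame(...), which accepts a list of dicts.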

Creating PDFs

Simple PDF with Canvas

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
 
c = canvas.Canvas("output.pdf", pagesize=letter)
width, height = letter
 
c.drawString(100, height - 100, "Hello World!")
c.line(100, height - 140, 400, height - 140)
c.save()

Multi-Page Document with Platypus

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
 
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
 
story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Body content goes here. " * 20, styles['Normal']))
story.append(PageBreak())
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Second page content", styles['Normal']))
 
doc.build(story)

ReportLab's Platypus flow system handles pagination, headers, styles, and multi-column layouts, which makes serious document generation possible without a desktop publishing tool such as InDesign.

Merge and Split

Merge Multiple PDFs

from pypdf import PdfWriter, PdfReader
 
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
 
with open("merged.pdf", "wb") as output:
    writer.write(output)

Command line:

qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

Split into Individual Pages

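from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    # One writer per page, so each output file holds a single page.
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)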

Command line — page ranges:

qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
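
For longer documents the page ranges can be computed rather than typed by hand. The following sketch generates ranges of a fixed size and builds one qpdf argument list per range, mirroring the commands above. The helper names and the output naming scheme are assumptions for illustration, and the total page count has to be supplied by the caller, for example from len(reader.pages).

```python
from typing import List, Tuple


def page_ranges(total_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return inclusive 1-based (first, last) ranges covering every page."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def qpdf_split_commands(pdf_path: str, total_pages: int, chunk_size: int) -> List[List[str]]:
    """Build one qpdf argument list per range (hypothetical helper)."""
    return [
        ["qpdf", pdf_path, "--pages", ".", f"{first}-{last}", "--",
         f"pages{first}-{last}.pdf"]
        for first, last in page_ranges(total_pages, chunk_size)
    ]
```

Each returned list can be handed to subprocess.run(..., check=True) to perform the split.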

Watermarking

from pypdf import PdfReader, PdfWriter

# watermark.pdf is a one-page PDF containing only the watermark.
watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    # merge_page draws the watermark on top of the existing page content.
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

OCR for Scanned PDFs

For PDFs that are scanned images (no selectable text):

# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path
 
images = convert_from_path('scanned.pdf')
 
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"
 
print(text)

The pipeline is PDF → images → Tesseract OCR → text. It requires poppler (for pdf2image) and Tesseract to be installed on the system.
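
OCR is slow compared with reading embedded text, so a mixed document is often handled by trying normal extraction first and running OCR only on pages that come back nearly empty. The sketch below illustrates that idea. The 20-character threshold and both function names are assumptions, not something the skill prescribes.

```python
from typing import Dict, List, Optional


def pages_needing_ocr(page_texts: List[Optional[str]], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages whose embedded text looks empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def ocr_pages(pdf_path: str, page_numbers: List[int]) -> Dict[int, str]:
    """OCR only the listed pages (needs pdf2image, pytesseract, poppler, Tesseract)."""
    # Imported lazily so the detection helper works without the OCR stack.
    import pytesseract
    from pdf2image import convert_from_path

    results = {}
    for number in page_numbers:
        # first_page/last_page limit rasterisation to a single page.
        image = convert_from_path(pdf_path, first_page=number, last_page=number)[0]
        results[number] = pytesseract.image_to_string(image)
    return results
```

The page_texts list would typically come from calling extract_text() on each page with pypdf or pdfplumber.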

Password Protection

from pypdf import PdfReader, PdfWriter
 
reader = PdfReader("input.pdf")
writer = PdfWriter()
 
for page in reader.pages:
    writer.add_page(page)
 
writer.encrypt("userpassword", "ownerpassword")
 
with open("encrypted.pdf", "wb") as output:
    writer.write(output)

Remove password:

qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf

Image Extraction

# Extract all embedded images (-j saves JPEG-encoded images as .jpg;
# images in other encodings are written as .ppm or .pbm)
pdfimages -j input.pdf output_prefix
# Produces: output_prefix-000.jpg, output_prefix-001.jpg, ...

Quick Reference

Task               Best Tool
Merge PDFs         pypdf or qpdf
Split PDFs         pypdf or qpdf
Extract text       pdfplumber (layout) or pdftotext
Extract tables     pdfplumber + pandas
Create PDFs        reportlab (Platypus for documents, Canvas for precise layout)
OCR scanned        pytesseract + pdf2image
Fill forms         pdf-lib or pypdf (see forms.md)
Password remove    qpdf --decrypt
Watermark          pypdf merge_page
Extract images     pdfimages
Metadata           pypdf reader.metadata
Rotate             pypdf page.rotate() or qpdf
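
The metadata and rotate entries have no example elsewhere in this post, so here is a minimal sketch of both with pypdf. The function names are hypothetical. pypdf only rotates in 90-degree steps, which is why the angle is validated first.

```python
def normalize_rotation(angle: int) -> int:
    """Map any multiple of 90 degrees onto 0, 90, 180, or 270 (clockwise)."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


def rotate_pdf(src: str, dst: str, angle: int) -> dict:
    """Rotate every page clockwise and return the source metadata as a dict."""
    from pypdf import PdfReader, PdfWriter  # imported lazily; needs pypdf

    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page.rotate(normalize_rotation(angle)))
    with open(dst, "wb") as output:
        writer.write(output)
    # reader.metadata can be None when the file carries no metadata.
    return dict(reader.metadata or {})
```

For example, rotate_pdf("scan.pdf", "scan_upright.pdf", -90) would turn each page a quarter turn counter-clockwise, assuming a file named scan.pdf exists.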

Real-World Use Cases

Invoice processing — Extract text and tables from invoices, parse line items, push to accounting system.

Contract analysis — Extract full text for AI review, check for specific clauses, flag unusual terms.

Report generation — Agent aggregates data, creates a multi-page PDF report with headers, tables, and charts using reportlab.

Document digitization — OCR pipeline converts scanned archives into searchable text.

Form automation — Agent fills PDF forms (see forms.md included with the skill) from structured data sources.

Batch processing — Split a large multi-chapter PDF, process each section independently, re-merge with results.
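
To make the invoice case concrete, the sketch below parses line items out of already-extracted text and flags rows whose arithmetic does not add up. The line layout (description, quantity, unit price, line total) is a hypothetical example; real invoices vary widely and usually need a pattern per vendor.

```python
import re
from decimal import Decimal
from typing import Dict, List

# Hypothetical layout: "<description> <qty> <unit price> <line total>".
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<qty>\d+)\s+"
    r"(?P<unit>\d+\.\d{2})\s+(?P<total>\d+\.\d{2})$"
)


def parse_line_items(text: str) -> List[Dict[str, object]]:
    """Pull line items out of extracted invoice text, skipping other lines."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append({
                "description": match["description"],
                "qty": int(match["qty"]),
                "unit_price": Decimal(match["unit"]),
                "total": Decimal(match["total"]),
            })
    return items


def inconsistent_items(items: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return items whose qty * unit_price differs from the stated total."""
    return [i for i in items if i["qty"] * i["unit_price"] != i["total"]]
```

Decimal is used instead of float so that currency comparisons are exact.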

Considerations

  • Scanned vs. digital PDFs — pdfplumber and pypdf extract embedded text. Scanned PDFs (images) require OCR with pytesseract — different pipeline, additional dependencies.
  • Table extraction accuracy — pdfplumber is excellent but complex table layouts (merged cells, nested tables) may need manual verification.
  • reportlab learning curve — Platypus is powerful but has its own document model. For simple PDFs, Canvas is easier. For complex multi-page documents, Platypus pays off.
  • Dependencies — The full capability set requires multiple libraries. The skill includes instructions for each, but OCR specifically needs system-level dependencies (poppler, Tesseract).
  • Proprietary license — The skill has a proprietary license (see LICENSE.txt). Review terms before commercial use.
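
Because the dependencies are split between Python packages and system binaries, a quick preflight check can save a failed run. The following is a sketch using only the standard library. The module and binary lists are assumptions based on the tools named in this post (pdftoppm is the poppler executable that pdf2image relies on), so adjust them to the features you actually use.

```python
import importlib.util
import shutil
from typing import Callable, Dict, List, Optional

# Assumed names for this sketch; trim to the parts of the toolkit in use.
PYTHON_MODULES = ["pypdf", "pdfplumber", "reportlab", "pytesseract", "pdf2image"]
SYSTEM_BINARIES = ["pdftotext", "pdftoppm", "tesseract", "qpdf"]


def missing_dependencies(
    find_module: Callable[[str], object] = importlib.util.find_spec,
    find_binary: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, List[str]]:
    """Report Python modules and executables that cannot be located."""
    return {
        "modules": [m for m in PYTHON_MODULES if find_module(m) is None],
        "binaries": [b for b in SYSTEM_BINARIES if find_binary(b) is None],
    }
```

The lookup functions are parameters so the check can be exercised without changing the machine it runs on.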

The Bigger Picture

PDF is one of those formats that's never going away. It's the standard for legally binding documents, financial reports, academic papers, and official communications. An AI agent without PDF capability is blocked from interacting with a significant portion of business information.

The PDF skill handles the breadth of what agents actually need: reading existing documents, extracting structured data, creating new documents, and handling the edge cases (scanned pages, password protection, forms). Its 16,000+ downloads suggest that many agents already treat it as a foundational skill.


View the skill on ClawHub: pdf
