
PDF Skill: Extract Text, Tables, and Forms — or Create PDFs From Scratch

March 11, 2026 · 6 min read

16,327 downloads, 237 installs, 25 stars. The PDF skill by @awspace is a PDF toolkit that covers the full document lifecycle: reading text, extracting tables, creating new PDFs, merging, splitting, adding watermarks, running OCR on scanned pages, and filling forms. It is built on the Python libraries pypdf, pdfplumber, and reportlab.

If your agent interacts with documents — and most do — this skill handles the PDF side completely.

The Problem It Solves

PDFs are everywhere in business workflows: invoices, contracts, reports, research papers, forms. But they're hostile to programmatic access — binary format, variable structure, sometimes scanned images pretending to be text, sometimes interactive forms. A different tool for each operation is the norm.

The PDF skill pairs each task with a suitable Python library or command-line tool:

pypdf — basic operations (merge, split, rotate, metadata, password)
pdfplumber — text and table extraction with layout awareness
reportlab — creating new PDFs programmatically
pytesseract + pdf2image — OCR for scanned documents
pdftotext/qpdf/pdftk — CLI tools for common operations

Text Extraction

Basic Extraction

from pypdf import PdfReader
 
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
 
# Join pages with newlines so text from adjacent pages does not run together
text = "\n".join(page.extract_text() or "" for page in reader.pages)

Layout-Aware Extraction (pdfplumber)

For documents where spatial layout matters (columns, tables, headers):

import pdfplumber
 
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Command Line

# Basic text extraction
pdftotext input.pdf output.txt
 
# Preserve layout (columns, spacing)
pdftotext -layout input.pdf output.txt
 
# Specific pages only
pdftotext -f 1 -l 5 input.pdf output.txt
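
When an agent drives pdftotext from Python rather than a shell, the same flags can be assembled programmatically. The following is a minimal sketch, not part of the skill itself: the function names `build_pdftotext_cmd` and `pdftotext` are hypothetical, and it assumes the pdftotext binary is on the PATH. It relies on pdftotext's convention that `-` as the output file sends the text to stdout.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None, layout=False):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    if layout:
        cmd.append("-layout")
    # "-" as the output file tells pdftotext to write the text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def pdftotext(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    return result.stdout
```

Calling `pdftotext("input.pdf", first_page=1, last_page=5, layout=True)` would mirror the last two commands above combined. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.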

Table Extraction

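Table extraction is one of the most valuable capabilities for data work. pdfplumber returns each table as a list of rows, and each row as a list of cell values: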

import pdfplumber
 
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)

Export Tables to Excel

# Requires: pip install pdfplumber pandas openpyxl (openpyxl writes the .xlsx file)
import pandas as pd
import pdfplumber
 
with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)
 
if all_tables:
    combined = pd.concat(all_tables, ignore_index=True)
    combined.to_excel("extracted_tables.xlsx", index=False)

This pattern (PDF tables → pandas DataFrame → Excel) covers many financial-document and report extraction workflows. It assumes the first row of each extracted table is a header row.
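
Real tables are often messier than the Excel export above assumes: header cells can come back empty (as None or a blank string), and repeated header labels produce duplicate column names, both of which tend to make the combined DataFrame awkward to work with. The sketch below shows one possible cleanup step before building the DataFrame. The helper name `normalize_table` and the `column_N` naming scheme are illustrative assumptions, not part of the skill or of pdfplumber.

```python
def normalize_table(table):
    """Split a pdfplumber-style table into unique string headers and data rows."""
    header, *rows = table
    seen = {}
    columns = []
    for index, cell in enumerate(header):
        # Empty header cells get a positional placeholder name.
        name = (cell or "").strip() or f"column_{index + 1}"
        # Suffix repeated labels so that every column name is unique.
        seen[name] = seen.get(name, 0) + 1
        if seen[name] > 1:
            name = f"{name}_{seen[name]}"
        columns.append(name)
    # Drop rows in which every cell is empty (None or blank).
    rows = [row for row in rows if any((cell or "").strip() for cell in row)]
    return columns, rows
```

With this in place, the DataFrame line could become `columns, rows = normalize_table(table)` followed by `pd.DataFrame(rows, columns=columns)`. The suffixing is deliberately simple and does not guard against a table that already contains a header such as `Amount_2`.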

Creating PDFs

Simple PDF with Canvas

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
 
c = canvas.Canvas("output.pdf", pagesize=letter)
width, height = letter
 
c.drawString(100, height - 100, "Hello World!")
c.line(100, height - 140, 400, height - 140)
c.save()

Multi-Page Document with Platypus

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
 
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
 
story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Body content goes here. " * 20, styles['Normal']))
story.append(PageBreak())
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Second page content", styles['Normal']))
 
doc.build(story)

ReportLab's Platypus layout engine handles pagination, headers, styles, and multi-column layouts, which is enough for serious document generation without a desktop publishing tool such as InDesign.

Merge and Split

Merge Multiple PDFs

from pypdf import PdfWriter, PdfReader
 
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
 
with open("merged.pdf", "wb") as output:
    writer.write(output)

Command line:

qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

Split into Individual Pages

from pypdf import PdfReader, PdfWriter
 
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

Command line — page ranges:

qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
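
For documents of arbitrary length, the page ranges can be computed instead of typed by hand. The sketch below is an illustration under stated assumptions: `page_ranges` and `qpdf_split_commands` are hypothetical helper names, the output file naming simply copies the pattern used above, and the total page count is assumed to come from elsewhere (for example `len(PdfReader("input.pdf").pages)`).

```python
def page_ranges(total_pages, chunk_size):
    """Return inclusive, 1-based (first, last) page ranges covering the document."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def qpdf_split_commands(input_pdf, total_pages, chunk_size):
    """Build one qpdf argument list per chunk, mirroring the commands above."""
    return [
        ["qpdf", input_pdf, "--pages", ".", f"{first}-{last}", "--",
         f"pages{first}-{last}.pdf"]
        for first, last in page_ranges(total_pages, chunk_size)
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. The last chunk is shortened automatically when the page count is not a multiple of the chunk size.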

Watermarking

from pypdf import PdfReader, PdfWriter
 
watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()
 
for page in reader.pages:
    # Draw the watermark's content on top of the existing page content
    page.merge_page(watermark)
    writer.add_page(page)
 
with open("watermarked.pdf", "wb") as output:
    writer.write(output)

OCR for Scanned PDFs

For PDFs that are scanned images (no selectable text):

# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path
 
images = convert_from_path('scanned.pdf')
 
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"
 
print(text)

The pipeline is PDF → images → Tesseract OCR → text. It requires Poppler (used by pdf2image) and Tesseract to be installed on the system.
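
Because OCR is slow compared with reading embedded text, one common approach is to try normal extraction first and fall back to OCR only for pages that yield almost nothing. The sketch below illustrates that idea. The function names and the 20-character threshold are assumptions chosen for illustration, not settings taken from the skill, and the threshold would need tuning for real documents.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: treat a page as scanned if it yields almost no embedded text."""
    return len((extracted_text or "").strip()) < min_chars


def extract_with_ocr_fallback(pdf_path, min_chars=20):
    """Return one text string per page, using OCR only where it is needed."""
    # Imported lazily so the heuristic above works without the OCR stack.
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            if needs_ocr(text, min_chars):
                # Rasterize just this page (pdf2image page numbers are 1-based).
                image = convert_from_path(
                    pdf_path, first_page=number, last_page=number
                )[0]
                text = pytesseract.image_to_string(image)
            texts.append(text)
    return texts
```

This keeps mixed documents (some digital pages, some scanned) on the fast path wherever possible. A page that contains only a short stamp or page number will still be sent to OCR, which is usually the desired behavior.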

Password Protection

from pypdf import PdfReader, PdfWriter
 
reader = PdfReader("input.pdf")
writer = PdfWriter()
 
for page in reader.pages:
    writer.add_page(page)
 
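# First argument: user password (needed to open the file)
# Second argument: owner password (controls permissions)
writer.encrypt("userpassword", "ownerpassword")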
 
with open("encrypted.pdf", "wb") as output:
    writer.write(output)

To remove the password from the command line (the current password is required):

qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf

Image Extraction

# Extract all embedded images (-j saves JPEG-encoded images as .jpg;
# other images are written as .ppm or .pbm)
pdfimages -j input.pdf output_prefix
# Produces: output_prefix-000.jpg, output_prefix-001.jpg, ...

Quick Reference

Task	Best Tool
Merge PDFs	pypdf or qpdf
Split PDFs	pypdf or qpdf
Extract text	pdfplumber (layout) or pdftotext
Extract tables	pdfplumber + pandas
Create PDFs	reportlab (Platypus for documents, Canvas for precise layout)
OCR scanned	pytesseract + pdf2image
Fill forms	pdf-lib (JavaScript) or pypdf (see forms.md)
Password remove	qpdf --decrypt
Watermark	pypdf merge_page
Extract images	pdfimages
Metadata	pypdf reader.metadata
Rotate	pypdf page.rotate() or qpdf

Real-World Use Cases

Invoice processing — Extract text and tables from invoices, parse line items, push to accounting system.

Contract analysis — Extract full text for AI review, check for specific clauses, flag unusual terms.

Report generation — Agent aggregates data, creates a multi-page PDF report with headers, tables, and charts using reportlab.

Document digitization — OCR pipeline converts scanned archives into searchable text.

Form automation — Agent fills PDF forms (see forms.md included with the skill) from structured data sources.

Batch processing — Split a large multi-chapter PDF, process each section independently, re-merge with results.
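
As an illustration of the invoice-processing case, the rows returned by table extraction still need to be turned into typed line items before they can be pushed anywhere. The sketch below assumes a hypothetical three-column layout of description, quantity, and unit price, with amounts formatted like `$1,234.50`. Real invoices vary widely, so the column order and the cleanup rules here are assumptions.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(cell):
    """Convert a cell such as '$1,234.50' to Decimal, or None if not numeric."""
    cleaned = (cell or "").replace(",", "").replace("$", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def parse_line_items(rows):
    """Turn [description, quantity, unit price] rows into dicts with a total."""
    items = []
    for description, quantity, unit_price in rows:
        qty, price = parse_amount(quantity), parse_amount(unit_price)
        if qty is None or price is None:
            continue  # skip subtotal lines, notes, and other non-item rows
        items.append({
            "description": (description or "").strip(),
            "quantity": qty,
            "unit_price": price,
            "total": qty * price,
        })
    return items
```

Decimal is used instead of float so that currency arithmetic stays exact. Rows that do not have exactly three cells would raise a ValueError in this sketch and would need handling in a production pipeline.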

Considerations

Scanned vs. digital PDFs — pdfplumber and pypdf extract embedded text. Scanned PDFs (images) require OCR with pytesseract — different pipeline, additional dependencies.
Table extraction accuracy — pdfplumber is excellent but complex table layouts (merged cells, nested tables) may need manual verification.
reportlab learning curve — Platypus is powerful but has its own document model. For simple PDFs, Canvas is easier. For complex multi-page documents, Platypus pays off.
Dependencies — The full capability set requires multiple libraries. The skill includes instructions for each, but OCR specifically needs system-level dependencies (poppler, Tesseract).
Proprietary license — The skill has a proprietary license (see LICENSE.txt). Review terms before commercial use.
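
Since the dependencies span both Python packages and system binaries, it can help to check what is actually available before an agent picks a pipeline. The following is a minimal sketch using only the standard library. The module and tool lists are taken from the tools named in this post, and the function names `check_dependencies` and `missing` are hypothetical.

```python
import importlib.util
import shutil

PYTHON_MODULES = ["pypdf", "pdfplumber", "reportlab", "pytesseract", "pdf2image"]
CLI_TOOLS = ["pdftotext", "pdfimages", "qpdf", "tesseract"]


def check_dependencies(modules=PYTHON_MODULES, tools=CLI_TOOLS):
    """Report which Python modules and command-line tools are available."""
    return {
        "modules": {m: importlib.util.find_spec(m) is not None for m in modules},
        "tools": {t: shutil.which(t) is not None for t in tools},
    }


def missing(report):
    """Flatten a report into a sorted list of names that were not found."""
    return sorted(
        name
        for group in report.values()
        for name, found in group.items()
        if not found
    )
```

Running `missing(check_dependencies())` returns an empty list when everything is installed. Note that pdftotext and pdfimages ship with Poppler's command-line utilities, so their presence is also a reasonable signal that pdf2image will be able to find Poppler.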

The Bigger Picture

PDF is one of those formats that's never going away. It's the standard for legally binding documents, financial reports, academic papers, and official communications. An AI agent without PDF capability is blocked from interacting with a significant portion of business information.

The PDF skill handles the breadth of what agents actually need: reading existing documents, extracting structured data, creating new documents, and handling the edge cases (scanned pages, password protection, forms). More than 16,000 downloads suggest that it fills a foundational need.

View the skill on ClawHub: pdf
