PDF Skill: Extract Text, Tables, and Forms — or Create PDFs From Scratch
16,327 downloads, 237 installs, 25 stars. The PDF skill by @awspace is a comprehensive PDF toolkit that covers the full document lifecycle: reading text, extracting tables, creating new PDFs, merging, splitting, adding watermarks, OCR for scanned pages, and form filling. It is built on the Python libraries pypdf, pdfplumber, and reportlab.
If your agent interacts with documents — and most do — this skill handles the PDF side completely.
The Problem It Solves
PDFs are everywhere in business workflows: invoices, contracts, reports, research papers, forms. But they're hostile to programmatic access — binary format, variable structure, sometimes scanned images pretending to be text, sometimes interactive forms. A different tool for each operation is the norm.
The PDF skill bundles the right Python library for each task:
- pypdf — basic operations (merge, split, rotate, metadata, password)
- pdfplumber — text and table extraction with layout awareness
- reportlab — creating new PDFs programmatically
- pytesseract + pdf2image — OCR for scanned documents
- pdftotext/qpdf/pdftk — CLI tools for common operations
Text Extraction
Basic Extraction
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

text = ""
for page in reader.pages:
    text += page.extract_text()
```
Layout-Aware Extraction (pdfplumber)
For documents where spatial layout matters (columns, tables, headers):
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```
Command Line
```bash
# Basic text extraction
pdftotext input.pdf output.txt
# Preserve layout (columns, spacing)
pdftotext -layout input.pdf output.txt
# Specific pages only
pdftotext -f 1 -l 5 input.pdf output.txt
```
Table Extraction
One of the most valuable capabilities for data work:
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```
Export Tables to Excel
```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                # Treat the first row as the header, the rest as data
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

if all_tables:
    combined = pd.concat(all_tables, ignore_index=True)
    # Writing .xlsx files requires the openpyxl package
    combined.to_excel("extracted_tables.xlsx", index=False)
```
This pattern (PDF tables → pandas DataFrame → Excel) covers most financial document and report extraction workflows.
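Cells returned by `extract_tables()` can be `None` where a cell is empty, and stray whitespace is common, so a cleanup pass before building the DataFrame often helps. The following is a minimal sketch of such a pass; `clean_table` is a hypothetical helper name, not part of pdfplumber, and the sample rows are invented for illustration.

```python
from typing import List, Optional

Row = List[Optional[str]]


def clean_table(table: List[Row]) -> List[List[str]]:
    """Replace None cells with "", strip whitespace, and drop fully empty rows."""
    cleaned = []
    for row in table:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):  # keep rows that have at least one non-empty cell
            cleaned.append(cells)
    return cleaned


# Sample data shaped like pdfplumber output (illustrative values only)
raw = [
    ["Item", "Qty", None],
    [" Widget ", "2", "9.50"],
    [None, None, None],
    ["Gadget", None, "4.00"],
]
cleaned = clean_table(raw)
```

The cleaned rows can then be passed to `pd.DataFrame(cleaned[1:], columns=cleaned[0])` in place of the raw table.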
Creating PDFs
Simple PDF with Canvas
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf", pagesize=letter)
width, height = letter
c.drawString(100, height - 100, "Hello World!")
c.line(100, height - 140, 400, height - 140)
c.save()
```
Multi-Page Document with Platypus
```python
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Body content goes here. " * 20, styles['Normal']))
story.append(PageBreak())
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Second page content", styles['Normal']))

doc.build(story)
```
ReportLab's Platypus flow system handles pagination, headers, styles, and multi-column layouts, which is enough for serious document generation without InDesign.
Merge and Split
Merge Multiple PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```
Command line:
```bash
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
```
Split into Individual Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    # One writer per page, so each output file holds a single page
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```
Command line, page ranges:
```bash
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
```
Watermarking
```python
from pypdf import PdfReader, PdfWriter

watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```
OCR for Scanned PDFs
For PDFs that are scanned images (no selectable text):
```python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

images = convert_from_path('scanned.pdf')

text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```
The pipeline is PDF → images → Tesseract OCR → text. It requires poppler (for pdf2image) and Tesseract to be installed on the system.
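Running OCR on every page is slow when only some pages are scanned. The following is a minimal sketch of a per-page fallback, assuming that a page yielding fewer than `min_chars` characters of embedded text is a scan. The names `needs_ocr` and `extract_with_ocr_fallback` and the threshold of 20 are illustrative choices, not part of the skill.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: treat a page as scanned if it yields almost no embedded text."""
    text = (extracted_text or "").strip()
    return len(text) < min_chars


def extract_with_ocr_fallback(pdf_path):
    """Return one text string per page, using OCR only where needed."""
    # Imported lazily so needs_ocr() stays usable without the OCR stack installed
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    pages_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if needs_ocr(text):
                # Render only this page (pdf2image uses 1-based page numbers) and OCR it
                image = convert_from_path(
                    pdf_path, first_page=number, last_page=number
                )[0]
                text = pytesseract.image_to_string(image)
            pages_text.append(text or "")
    return pages_text
```

The threshold is a judgment call: a page holding only a page number would be sent to OCR under this setting, so tune `min_chars` against real documents.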
Password Protection
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```
Remove password:
```bash
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```
Image Extraction
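```bash
# Extract all embedded images
pdfimages -j input.pdf output_prefix
# Produces: output_prefix-000.jpg, output_prefix-001.jpg, ...
# (with -j, only JPEG-encoded images are saved as .jpg; others are written as .ppm/.pbm)
```
Quick Reference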
| Task | Best Tool |
|---|---|
| Merge PDFs | pypdf or qpdf |
| Split PDFs | pypdf or qpdf |
| Extract text | pdfplumber (layout) or pdftotext |
| Extract tables | pdfplumber + pandas |
| Create PDFs | reportlab (Platypus for documents, Canvas for precise layout) |
| OCR scanned | pytesseract + pdf2image |
| Fill forms | pdf-lib or pypdf (see forms.md) |
| Password remove | qpdf --decrypt |
| Watermark | pypdf merge_page |
| Extract images | pdfimages |
| Metadata | pypdf reader.metadata |
| Rotate | pypdf page.rotate() or qpdf |
Real-World Use Cases
Invoice processing — Extract text and tables from invoices, parse line items, push to accounting system.
Contract analysis — Extract full text for AI review, check for specific clauses, flag unusual terms.
Report generation — Agent aggregates data, creates a multi-page PDF report with headers, tables, and charts using reportlab.
Document digitization — OCR pipeline converts scanned archives into searchable text.
Form automation — Agent fills PDF forms (see forms.md included with the skill) from structured data sources.
Batch processing — Split a large multi-chapter PDF, process each section independently, re-merge with results.
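For the batch-processing case, the page ranges passed to qpdf can be computed rather than typed by hand. This is a small sketch under stated assumptions: `page_ranges` and `qpdf_split_commands` are hypothetical helper names, the output naming follows the `pages1-5.pdf` pattern used earlier, and the commands are only built here, not executed.

```python
def page_ranges(total_pages, chunk_size):
    """Yield 1-based, inclusive (first, last) ranges of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def qpdf_split_commands(input_pdf, total_pages, chunk_size):
    """Build one qpdf argument list per chunk, ready for subprocess.run()."""
    commands = []
    for first, last in page_ranges(total_pages, chunk_size):
        output = f"pages{first}-{last}.pdf"
        commands.append(
            ["qpdf", input_pdf, "--pages", ".", f"{first}-{last}", "--", output]
        )
    return commands


commands = qpdf_split_commands("input.pdf", total_pages=12, chunk_size=5)
```

Each entry can be passed to `subprocess.run(command, check=True)`; the total page count can come from `len(PdfReader(path).pages)`.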
Considerations
- Scanned vs. digital PDFs: pdfplumber and pypdf extract embedded text. Scanned PDFs (images) require OCR with pytesseract, which is a different pipeline with additional dependencies.
- Table extraction accuracy: pdfplumber is excellent, but complex table layouts (merged cells, nested tables) may need manual verification.
- reportlab learning curve: Platypus is powerful but has its own document model. For simple PDFs, Canvas is easier. For complex multi-page documents, Platypus pays off.
- Dependencies: The full capability set requires multiple libraries. The skill includes instructions for each, but OCR specifically needs system-level dependencies (poppler, Tesseract).
- Proprietary license: The skill has a proprietary license (see LICENSE.txt). Review terms before commercial use.
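Because several features depend on system-level executables, it can help to check for them before starting a job. The sketch below uses only the standard library. The executable names are assumptions about a typical install (`pdftoppm`, `pdftotext`, and `pdfimages` ship with poppler), and `check_system_tools` is a hypothetical helper name.

```python
import shutil

# Executable names as commonly installed; adjust for your platform
SYSTEM_TOOLS = ["tesseract", "pdftoppm", "pdftotext", "pdfimages", "qpdf"]


def check_system_tools(tools=SYSTEM_TOOLS):
    """Map each executable name to True if it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


status = check_system_tools()
missing = [tool for tool, found in status.items() if not found]
if missing:
    print("Missing system tools:", ", ".join(missing))
```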
The Bigger Picture
PDF is one of those formats that's never going away. It's the standard for legally binding documents, financial reports, academic papers, and official communications. An AI agent without PDF capability is blocked from interacting with a significant portion of business information.
The PDF skill handles the breadth of what agents actually need: reading existing documents, extracting structured data, creating new documents, and handling the edge cases (scanned pages, password protection, forms). With more than 16,000 downloads, it is already widely used as a foundational skill.
View the skill on ClawHub: pdf