pdf-text-extractor
Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
Install via ClawdBot CLI:
clawdbot install Michael-laffin/pdf-text-extractor
Vernox Utility Skill - Perfect for document digitization.
PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. Supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).
clawhub install pdf-text-extractor
const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});
console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Words: ${result.wordCount}`);
// Batch extraction: extractBatch returns a summary object, not a bare array
const results = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});
console.log(`Extracted ${results.successCount} PDFs`);
const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
// OCR will be used (scanned document detected)
extractText
Extract text content from a single PDF file.
Parameters:
- pdfPath (string, required): Path to PDF file
- options (object, optional): Extraction options
  - outputFormat (string): 'text' | 'json' | 'markdown' | 'html'
  - ocr (boolean): Enable OCR for scanned docs
  - language (string): OCR language code ('eng', 'spa', 'fra', 'deu')
  - preserveFormatting (boolean): Keep headings/structure
  - minConfidence (number): Minimum OCR confidence score (0-100)
Returns:
- text (string): Extracted text content
- pages (number): Number of pages processed
- wordCount (number): Total word count
- charCount (number): Total character count
- language (string): Detected language
- metadata (object): PDF metadata (title, author, creation date)
- method (string): 'text' or 'ocr' (extraction method)
extractBatch
Extract text from multiple PDF files at once.
Parameters:
- pdfFiles (array, required): Array of PDF file paths
- options (object, optional): Same as extractText
Returns:
- results (array): Array of extraction results
- totalPages (number): Total pages across all PDFs
- successCount (number): Successfully extracted
- failureCount (number): Failed extractions
- errors (array): Error details for failures
countWords
Count words in extracted text.
Parameters:
- text (string, required): Text to count
- options (object, optional):
  - minWordLength (number): Minimum characters per word (default: 3)
  - excludeNumbers (boolean): Don't count numbers as words
  - countByPage (boolean): Return word count per page
Returns:
- wordCount (number): Total word count
- charCount (number): Total character count
- pageCounts (array): Word count per page
- averageWordsPerPage (number): Average words per page
detectLanguage
Detect the language of extracted text.
Parameters:
- text (string, required): Text to analyze
- minConfidence (number): Minimum confidence for detection
Returns:
- language (string): Detected language code
- languageName (string): Full language name
- confidence (number): Confidence score (0-100)
config.json:
{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
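The "batch" settings suggest that files are processed a few at a time, each with a time limit. How the skill actually applies maxConcurrent and timeoutSeconds is not documented here, so the following is only a sketch of what those two settings commonly mean. The names runBatch, withTimeout, and the worker callback are hypothetical and not part of the skill's API.

```javascript
// Sketch only: a concurrency cap plus a per-task timeout.
// runBatch, withTimeout, and worker are hypothetical names for illustration.

// Reject if the promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run worker(item) for every item, never more than maxConcurrent at once.
// Results keep the order of the input array.
async function runBatch(items, worker, { maxConcurrent = 3, timeoutSeconds = 30 } = {}) {
  const settled = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the queue is empty.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      try {
        const value = await withTimeout(worker(items[index]), timeoutSeconds * 1000);
        settled[index] = { ok: true, value };
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        settled[index] = { ok: false, error: message };
      }
    }
  }

  const laneCount = Math.min(maxConcurrent, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return settled;
}
```

With the values from config.json, a call would look like runBatch(paths, worker, { maxConcurrent: 3, timeoutSeconds: 30 }), where worker is whatever function extracts a single file.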
// Text-based invoice
const invoice = await extractText({ pdfPath: './invoice.pdf' });
console.log(invoice.text);
// "INVOICE #12345 Date: 2026-02-04..."

// Scanned contract (OCR)
const contract = await extractText({
  pdfPath: './scanned-contract.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
console.log(contract.text);
// "AGREEMENT This contract between..."

// Batch of documents
const docs = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf',
    './doc4.pdf'
  ]
});
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
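A batch result carries more than the success count, so it can help to report failures as well. The sketch below uses only the fields listed for the extractBatch return value (successCount, failureCount, totalPages, errors). The shape of each errors entry is not documented; { file, message } is an assumption made for this example, and summarizeBatch is a hypothetical helper.

```javascript
// Sketch only: build a short report from a batch result object.
// ASSUMPTION: each entry in `errors` looks like { file, message }.
function summarizeBatch(batch) {
  const total = batch.successCount + batch.failureCount;
  const lines = [
    `Processed ${batch.successCount}/${total} documents (${batch.totalPages} pages)`,
  ];
  for (const err of batch.errors ?? []) {
    lines.push(`  failed: ${err.file}: ${err.message}`);
  }
  return lines.join('\n');
}

// Example with a hand-written result object (not real output from the skill).
const sampleBatch = {
  results: [],
  totalPages: 12,
  successCount: 3,
  failureCount: 1,
  errors: [{ file: './doc4.pdf', message: 'corrupt file' }],
};
console.log(summarizeBatch(sampleBatch));
```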
License: MIT
Extract text from PDFs. Fast, accurate, zero dependencies.
Generated Mar 1, 2026
Automate extraction of text from scanned invoices for accounting software integration. Use OCR to digitize paper invoices, extract vendor details, amounts, and dates, and feed data into ERP systems for automated reconciliation and payment processing.
Convert scanned legal contracts and agreements into searchable text for law firms. Preserve formatting with markdown output, enabling keyword searches, clause analysis, and archiving in digital document management systems to improve case preparation efficiency.
Extract text from patient records and medical reports in PDF format for electronic health record (EHR) systems. Use batch processing to handle multiple documents, detect languages for multilingual records, and ensure data accuracy with OCR confidence scoring for compliance.
Process research papers and scanned articles for content analysis in academic settings. Extract text to prepare data for LLM processing, count words for literature reviews, and output JSON with metadata for citation management and automated summarization tools.
Digitize scanned inventory reports and supplier PDFs for retail businesses. Extract structured data like product names and quantities, use batch extraction for weekly workflows, and integrate with inventory management software to automate stock updates and forecasting.
Offer a cloud-based PDF extraction service with tiered pricing based on usage volume (e.g., pages processed per month). Target small businesses with a free tier for basic needs and premium plans for advanced features like high-quality OCR and batch processing, generating recurring revenue.
License the skill as an API for integration into existing software platforms, such as document management or workflow automation tools. Charge per API call or through enterprise licensing agreements, providing scalable revenue from developers and large organizations needing embedded extraction capabilities.
Provide consulting services to customize the skill for specific industry needs, such as adding language support or integrating with proprietary systems. Offer implementation support, training, and maintenance contracts, generating project-based and ongoing service revenue.
Integration Tip
Start by testing with text-based PDFs to ensure basic functionality, then enable OCR for scanned documents; use the batch processing feature for handling multiple files efficiently in production workflows.
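The tip above can be sketched as a text-first wrapper that only turns on OCR when plain extraction finds nothing. This assumes the object call form used in the first examples, and that a scanned PDF extracted with ocr: false returns empty text rather than throwing; neither is confirmed by this page. extractWithFallback and its minChars option are hypothetical, and the extract function is passed in so the helper does not depend on how the skill is imported.

```javascript
// Sketch only: try embedded-text extraction first, fall back to OCR.
// ASSUMPTIONS: `extract` behaves like extractText({ pdfPath, options }) and
// returns { text, ... }; empty text signals a scanned document.
async function extractWithFallback(extract, pdfPath, { minChars = 1, language = 'eng' } = {}) {
  const direct = await extract({
    pdfPath,
    options: { outputFormat: 'text', ocr: false },
  });
  if (direct.text && direct.text.trim().length >= minChars) {
    return direct; // embedded text was enough
  }
  // Nothing usable found, so retry with OCR enabled.
  return extract({
    pdfPath,
    options: { outputFormat: 'text', ocr: true, language },
  });
}
```

Usage would be along the lines of extractWithFallback(extractText, './document.pdf'), after which the method field of the result can be checked to see which path was taken.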