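pdf-ocr-layout
A multimodal deep document parsing tool built on Zhipu GLM-OCR, GLM-4.7, and GLM-4.6V.
Use when:
- You need high-precision extraction of tables from documents (PDF/images), converted to Markdown format
- You need illustrations and charts automatically cropped from document pages and saved as separate files
- You need deep semantic understanding of the extracted charts (visual analysis with GLM-4.6V)
- You need logical analysis of the extracted table data (text analysis with GLM-4.7)
Core architecture:
1. Visual extraction: GLM-OCR
2. Semantic understanding: GLM-4.7 (plain text/tables) + GLM-4.6V (multimodal/images)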
Install via ClawdBot CLI:
clawdbot install baokui/pdf-ocr-layout
This tool builds a high-precision document parsing pipeline: using GLM-OCR for layout element extraction, calling GLM-4.7 for logical interpretation of table data, and calling GLM-4.6V for multimodal visual interpretation of images and charts.
This Skill consists of two core script stages, orchestrated through glm_ocr_pipeline.py:
- Extraction stage (scripts/glm_ocr_extract.py)
- Understanding stage (scripts/glm_understanding.py)
# Run complete pipeline: extraction -> cropping -> understanding analysis, supports input in .pdf, .jpg, .png and other formats
python scripts/glm_ocr_pipeline.py \
--file_path "/data/report_page.jpg" \
--output_dir "/data/output"
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file_path | string | ✅ | Absolute path to input file (supports .pdf, .png, .jpg) |
| output_dir | string | ✅ | Result output directory (used to save cropped images and JSON reports) |
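The two parameters above can be sanity-checked before the pipeline is launched. The following is a minimal sketch, not part of the Skill itself: the helper name `check_inputs` and the `SUPPORTED_SUFFIXES` set are assumptions made for illustration, and the suffix set only covers the three extensions the table names explicitly.

```python
from pathlib import Path

# Assumption: only the extensions listed in the parameter table are accepted here.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg"}


def check_inputs(file_path, output_dir):
    """Return a list of human-readable problems; an empty list means the inputs look usable."""
    problems = []
    path = Path(file_path)
    if not path.is_absolute():
        problems.append(f"file_path must be an absolute path: {file_path}")
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported file type: {path.suffix or '(none)'}")
    if not str(output_dir).strip():
        problems.append("output_dir must not be empty")
    return problems


if __name__ == "__main__":
    for problem in check_inputs("/data/report_page.jpg", "/data/output"):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.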
The tool returns a list containing layout elements and their deep understanding:
[
{
"type": "table",
"bbox": [100, 200, 500, 600],
"content_info": "| Revenue | Q1 |\n|---|---|\n| 100M | ... |",
"deep_understanding": "(Generated by GLM-4.7) This table shows Q1 2024 revenue data. Combined with the 'market expansion strategy' mentioned in paragraph 3 of the body text, it can be seen that..."
},
{
"type": "image",
"bbox": [100, 700, 500, 900],
"content_info": "/data/output/images/report_page_img_2.png",
"deep_understanding": "(Generated by GLM-4.6V) This is a system architecture diagram. Visually, it shows the flow of clients connecting to servers through a Load Balancer. Combined with the title 'Fig 3' and context, this diagram is mainly used to illustrate..."
}
]
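A returned list like the one above could be consumed roughly as follows. This is a sketch under the assumption that the result is available as JSON text (whether the pipeline prints it or writes it to a file is not stated here); `RAW_RESULT` is a shortened copy of the sample, and `split_by_type` is a hypothetical helper, not a function of the Skill.

```python
import json

# Shortened copy of the sample output, kept as JSON text (raw string so "\n" stays a JSON escape).
RAW_RESULT = r'''
[
  {"type": "table", "bbox": [100, 200, 500, 600],
   "content_info": "| Revenue | Q1 |\n|---|---|\n| 100M | ... |",
   "deep_understanding": "(Generated by GLM-4.7) This table shows Q1 2024 revenue data."},
  {"type": "image", "bbox": [100, 700, 500, 900],
   "content_info": "/data/output/images/report_page_img_2.png",
   "deep_understanding": "(Generated by GLM-4.6V) This is a system architecture diagram."}
]
'''


def split_by_type(elements):
    """Separate table elements from image elements using the "type" field."""
    tables = [e for e in elements if e.get("type") == "table"]
    images = [e for e in elements if e.get("type") == "image"]
    return tables, images


elements = json.loads(RAW_RESULT)
tables, images = split_by_type(elements)

for table in tables:
    # For tables, content_info holds the Markdown rendering of the table.
    print(table["content_info"])
for image in images:
    # For images, content_info holds the path of the cropped image file.
    print(image["content_info"], "->", image["deep_understanding"])
```

Note that `content_info` means different things per type: Markdown text for a table, a file path for an image.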
ZHIPU_API_KEY must be configured
Dependencies: zhipuai, pillow, beautifulsoup4
All understanding is based on the complete layout logic of the document (Markdown Context), not isolated fragment analysis.
Multi-page PDFs default to processing the first page. For batch processing, please extend the loop logic at the script level.
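One possible way to handle several pages without modifying the scripts is a small wrapper that calls the pipeline once per page image. This is only a sketch of an alternative to extending the loop inside the script: it assumes the PDF pages have already been exported as individual images by some other tool, and `build_command`, `run_batch`, and the `page_001`-style output folders are hypothetical names. Only the `--file_path` and `--output_dir` flags come from the usage shown earlier.

```python
import subprocess
import sys
from pathlib import Path

PIPELINE = "scripts/glm_ocr_pipeline.py"


def build_command(page_path, output_dir):
    """Build the CLI invocation for a single page, using the two documented flags."""
    return [
        sys.executable, PIPELINE,
        "--file_path", str(page_path),
        "--output_dir", str(output_dir),
    ]


def run_batch(page_paths, output_root, runner=subprocess.run):
    """Run the pipeline once per page; each page gets its own output folder.

    `runner` is injectable so the loop can be exercised without calling the API.
    """
    commands = []
    for index, page in enumerate(page_paths, start=1):
        page_out = Path(output_root) / f"page_{index:03d}"
        page_out.mkdir(parents=True, exist_ok=True)
        command = build_command(page, page_out)
        runner(command, check=True)  # check=True stops the batch on the first failing page
        commands.append(command)
    return commands
```

A call such as `run_batch(["/data/pages/p1.png", "/data/pages/p2.png"], "/data/output")` would then process the pages in order. Separate output folders per page avoid file-name collisions between cropped images.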
Generated Mar 1, 2026
Extract and convert tables from quarterly financial PDF reports to Markdown for automated data entry into accounting systems, while analyzing charts for revenue trends using GLM-4.6V to generate insights on performance metrics.
Process research papers in PDF format to extract tables of experimental data as Markdown for database integration, and analyze charts with GLM-4.6V to summarize visual findings in context of the full text for literature reviews.
Analyze legal contracts or case documents to extract tables of terms or schedules as Markdown for contract management systems, and interpret diagrams or exhibits with GLM-4.6V to assess visual evidence in legal contexts.
Convert medical reports or lab results from scanned images to extract patient data tables as Markdown for electronic health records, and analyze medical charts or imaging results with GLM-4.6V to aid in diagnostic summaries.
Process business documents like sales reports to extract performance tables as Markdown for integration into BI tools, and analyze infographics with GLM-4.6V to generate automated insights on market trends and visual data representations.
Offer the tool as a cloud-based service with tiered pricing based on usage volume, targeting enterprises for automated document processing and analysis, generating recurring revenue through monthly or annual subscriptions.
License the API to software developers and integrators for embedding into custom applications, such as CRM or ERP systems, charging per API call or through enterprise licensing agreements for scalable deployment.
Provide tailored solutions and integration services for specific industries, offering customization of the pipeline for unique document formats and training support, with revenue from project-based fees and ongoing maintenance contracts.
💬 Integration Tip
Ensure the ZHIPU_API_KEY is securely configured and test with sample documents to validate output formats before full deployment in production environments.
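A pre-flight check for the key might look like the sketch below. It assumes the key is supplied through an environment variable named ZHIPU_API_KEY; how the scripts actually read the key is not specified here, and `require_api_key` is a hypothetical helper.

```python
import os


def require_api_key(env=None):
    """Return the configured key, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get("ZHIPU_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ZHIPU_API_KEY is not set; configure it before running the pipeline.")
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("ZHIPU_API_KEY is configured.")  # never print the key itself
    except RuntimeError as error:
        print(error)
```

Failing early like this avoids starting an extraction run that would only error out at the first model call.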