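word-reader reads Word documents (.docx and .doc) and extracts their text content. It supports document parsing, table extraction, and image handling. Use it when the user needs to analyze the content of a Word document, extract text, or batch-process documents.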
Install via ClawdBot CLI:
clawdbot install xtfnhcyjpgf/word-reader
Uses Python to parse Word documents and extract text content and structured information.
python3 {baseDir}/scripts/read_word.py <file_path>
# JSON output
python3 {baseDir}/scripts/read_word.py <file_path> --format json
# Plain-text output
python3 {baseDir}/scripts/read_word.py <file_path> --format text
# Markdown output
python3 {baseDir}/scripts/read_word.py <file_path> --format markdown
# Extract text only
python3 {baseDir}/scripts/read_word.py <file_path> --extract text
# Extract table data
python3 {baseDir}/scripts/read_word.py <file_path> --extract tables
# Get document metadata
python3 {baseDir}/scripts/read_word.py <file_path> --extract metadata
# Process all .docx files in a directory
python3 {baseDir}/scripts/read_word.py <directory_path> --batch
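The `--batch` flag is described as processing every .docx file in a directory. How the script finds those files is not documented, so the following is only a minimal sketch that assumes a non-recursive scan. `find_docx_files` is a hypothetical helper, not a function taken from `read_word.py`.

```python
from pathlib import Path


def find_docx_files(directory):
    """Return the .docx files directly inside `directory`, sorted by path.

    Hypothetical helper: the real script may recurse or filter differently.
    """
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    return sorted(
        p for p in root.iterdir()
        # Skip Word lock files such as "~$report.docx".
        if p.is_file() and p.suffix.lower() == ".docx" and not p.name.startswith("~$")
    )
```

Matching the extension case-insensitively and skipping `~$` lock files are design choices of this sketch; they avoid missing `REPORT.DOCX` and avoid parsing temporary files left by an open Word session.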
| Option | Description | Default |
|------|------|--------|
| --format | Output format (json/text/markdown) | text |
| --extract | Content type to extract (text/tables/images/metadata/all) | all |
| --batch | Batch processing mode | false |
| --output | Output file path | stdout |
| --encoding | Text encoding (utf-8/gb2312) | utf-8 |
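The options in the table map naturally onto Python's `argparse`. The sketch below shows one way such a parser could be declared; it is an assumption about the interface, not the actual source of `read_word.py`, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (a sketch, not the real script)."""
    parser = argparse.ArgumentParser(prog="read_word.py")
    parser.add_argument("path", help="Word file, or a directory when --batch is set")
    parser.add_argument("--format", choices=["json", "text", "markdown"], default="text")
    parser.add_argument(
        "--extract",
        choices=["text", "tables", "images", "metadata", "all"],
        default="all",
    )
    parser.add_argument("--batch", action="store_true")
    # None stands in for "write to stdout".
    parser.add_argument("--output", default=None)
    parser.add_argument("--encoding", choices=["utf-8", "gb2312"], default="utf-8")
    return parser
```

Declaring the allowed values with `choices` means a typo such as `--format jsn` is rejected with a usage message before any document is opened.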
{
  "metadata": {
    "title": "Document title",
    "author": "Author name",
    "created": "2024-01-01T10:00:00",
    "modified": "2024-01-01T12:00:00"
  },
  "text": "Full text of the document...",
  "tables": [
    [
      ["Header 1", "Header 2"],
      ["Row 1, Col 1", "Row 1, Col 2"],
      ["Row 2, Col 1", "Row 2, Col 2"]
    ]
  ],
  "images": [
    {
      "filename": "image1.png",
      "description": "Image description",
      "size": "1024x768"
    }
  ]
}
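A downstream program can consume this JSON with the standard library alone. The sketch below assumes the output has exactly the shape of the example (keys `metadata`, `text`, `tables`, `images`) and that the first row of each table is its header; `summarize` is a hypothetical helper.

```python
import json


def summarize(report_json):
    """Summarize a JSON report shaped like the example output."""
    report = json.loads(report_json)
    tables = report.get("tables", [])
    return {
        "title": report.get("metadata", {}).get("title"),
        "characters": len(report.get("text", "")),
        # Each table is a list of rows; the first row is treated as the header.
        "table_headers": [table[0] for table in tables if table],
        "image_files": [image["filename"] for image in report.get("images", [])],
    }
```

Using `.get` with defaults keeps the helper usable when a run was limited with `--extract` and some keys are therefore absent, which is itself an assumption about how the script behaves.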
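# Document title
**Author**: Author name
**Created**: 2024-01-01 10:00:00
## Body
This is the body text of the document...
### Table example
| Header 1 | Header 2 |
|-------|-------|
| Row 1, Col 1 | Row 1, Col 2 |
| Row 2, Col 1 | Row 2, Col 2 |

## Image list
1. **image1.png** (1024x768) - Image description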
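The table in the Markdown output corresponds to one entry of the `tables` array in the JSON output. A minimal sketch of that conversion follows, assuming the first row is the header; `table_to_markdown` is a hypothetical helper, and the real script may format separators or escape cells differently.

```python
def table_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown table."""
    if not rows:
        return ""
    width = len(rows[0])

    def fmt(row):
        # Pad short rows and escape pipes so cell text cannot break the table.
        cells = [str(c).replace("|", "\\|") for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells[:width]) + " |"

    lines = [fmt(rows[0]), "|" + "|".join(["---"] * width) + "|"]
    lines.extend(fmt(row) for row in rows[1:])
    return "\n".join(lines)
```

Rows are padded or truncated to the header width because Word tables with merged cells do not always yield the same number of cells per row.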
python3 {baseDir}/scripts/read_word.py project_requirements.docx --format markdown
python3 {baseDir}/scripts/read_word.py meeting_notes.docx --extract text
python3 {baseDir}/scripts/read_word.py ./docs --batch --format json --output results.json
pip3 install python-docx
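With python-docx installed, the core of a .docx reader is short. The sketch below is not the code of `read_word.py`; it only illustrates how metadata, body text, and tables can be read through the python-docx API (`Document`, `core_properties`, `paragraphs`, `tables`). The function names `read_docx` and `table_to_rows` are hypothetical.

```python
def table_to_rows(table):
    """Convert a python-docx table into a list of rows of cell text."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def read_docx(path):
    """Extract metadata, text, and tables from a .docx file with python-docx."""
    from docx import Document  # provided by the python-docx package

    doc = Document(path)
    props = doc.core_properties
    return {
        "metadata": {
            "title": props.title,
            "author": props.author,
            "created": props.created.isoformat() if props.created else None,
            "modified": props.modified.isoformat() if props.modified else None,
        },
        # Body paragraphs only; text inside tables is collected separately.
        "text": "\n".join(p.text for p in doc.paragraphs),
        "tables": [table_to_rows(t) for t in doc.tables],
    }
```

The import sits inside `read_docx` so that the rest of a script can still load, and report a clear error, on machines where python-docx is missing.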
For .doc format support:
# Ubuntu/Debian
sudo apt-get install antiword
# macOS
brew install antiword
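Since python-docx reads only .docx, legacy .doc files need the external `antiword` program. One plausible way to route files between the two is sketched below; the dispatch logic and the names `choose_backend` and `read_doc_with_antiword` are assumptions, not taken from `read_word.py`.

```python
import shutil
import subprocess
from pathlib import Path


def choose_backend(path):
    """Pick an extraction backend from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".docx":
        return "python-docx"
    if suffix == ".doc":
        return "antiword"
    raise ValueError(f"unsupported file type: {suffix or '(none)'}")


def read_doc_with_antiword(path):
    """Return the plain text of a legacy .doc file by shelling out to antiword."""
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword is not installed or not on PATH")
    result = subprocess.run(
        ["antiword", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Checking for the binary with `shutil.which` first turns a missing installation into a readable error instead of a `FileNotFoundError` from `subprocess`.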
The script automatically handles the following document elements:
Generated Mar 1, 2026
Law firms can use this skill to extract text and metadata from contracts, briefs, and legal documents for review, indexing, or compliance checks. It helps automate the extraction of key clauses, dates, and parties involved, saving time on manual document scanning.
Researchers and universities can process Word documents containing research papers, theses, or survey data to extract text, tables, and metadata for analysis or database population. This facilitates literature reviews and data aggregation from various sources.
Companies can utilize this skill to parse Word reports, such as financial statements or project updates, extracting structured data like tables and text for automated reporting or integration into business intelligence tools. It streamlines data extraction from recurring document formats.
Publishers or media organizations can batch process Word documents to extract text and images for uploading to websites or content management systems. This automates the conversion of documents into web-friendly formats like Markdown or JSON.
Healthcare providers can extract patient information, treatment notes, and metadata from Word-based medical records for digitization or analysis. This aids in organizing data for electronic health records or compliance audits.
Offer a cloud-based service where users upload Word documents via a web interface or API to extract and analyze content. Charge based on usage tiers, such as number of documents processed or features like advanced parsing.
Sell licenses to large organizations for on-premise or custom integration into their existing workflows, such as document management systems. Provide support, customization, and bulk processing capabilities.
Provide a free basic version for text extraction with limited features, and charge for advanced functionalities like batch processing, table extraction, or API access. Target individual users and small teams to upsell to premium plans.
💬 Integration Tip
Integrate this skill into existing document workflows by using its command-line interface for automation, or wrap it in a simple API for web applications to enable seamless document processing.
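For automation from Python, the command-line interface can be wrapped in a small function. This is a sketch under two assumptions: that `--format json` writes a single JSON document to stdout, and that `base_dir` points at the installed skill directory (the `{baseDir}` placeholder used in the commands). `build_command` and `read_word` are hypothetical names.

```python
import json
import subprocess
import sys


def build_command(base_dir, document, extract="all"):
    """Build the argv list for one JSON-producing run of read_word.py."""
    script = f"{base_dir}/scripts/read_word.py"
    return [sys.executable, script, str(document), "--format", "json", "--extract", extract]


def read_word(base_dir, document, extract="all"):
    """Run the script and parse its stdout as JSON (assumes JSON goes to stdout)."""
    result = subprocess.run(
        build_command(base_dir, document, extract),
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list rather than a shell string means file names containing spaces or non-ASCII characters need no quoting.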