Convert tabular-extract skill to git submodule
Move tabular-extract from directly tracked files to a submodule pointing at https://git.prettyhefty.com/Bill/tabular-extract.git
This commit is contained in:
3
.gitmodules
vendored
3
.gitmodules
vendored
@@ -4,3 +4,6 @@
|
|||||||
[submodule "skills/docker-service-architecture"]
|
[submodule "skills/docker-service-architecture"]
|
||||||
path = skills/docker-service-architecture
|
path = skills/docker-service-architecture
|
||||||
url = https://git.prettyhefty.com/Bill/docker-service-architecture-skill.git
|
url = https://git.prettyhefty.com/Bill/docker-service-architecture-skill.git
|
||||||
|
[submodule "skills/tabular-extract"]
|
||||||
|
path = skills/tabular-extract
|
||||||
|
url = https://git.prettyhefty.com/Bill/tabular-extract.git
|
||||||
|
|||||||
1
skills/tabular-extract
Submodule
1
skills/tabular-extract
Submodule
Submodule skills/tabular-extract added at be5b36fbc4
@@ -1,97 +0,0 @@
|
|||||||
---
|
|
||||||
name: tabular-extract
|
|
||||||
description: Extract structured data from document collections into tabular format. Reads PDFs, DOCX, TXT, MD, and JSON files from local paths, infers extraction columns from natural language descriptions, and outputs a markdown table plus a JSON file with values, confidence scores, supporting quotes, and reasoning. Use when the user asks to extract structured data from documents, turn documents into a spreadsheet or table, review or compare multiple documents side by side, or pull specific fields from a set of files.
|
|
||||||
---
|
|
||||||
|
|
||||||
# Tabular Extract
|
|
||||||
|
|
||||||
Extract structured data from document collections into tabular format.
|
|
||||||
|
|
||||||
## Pipeline
|
|
||||||
|
|
||||||
This is a rigid, sequential pipeline. Execute every step in order.
|
|
||||||
|
|
||||||
1. **Discover documents** — find files at the user's path
|
|
||||||
2. **Read documents** — convert each file to text
|
|
||||||
3. **Define schema** — infer extraction columns from user's description
|
|
||||||
4. **Extract data** — read each document and extract each column's value
|
|
||||||
5. **Output results** — display markdown table and save JSON file
|
|
||||||
|
|
||||||
## Step 1: Discover Documents
|
|
||||||
|
|
||||||
Glob the user-provided path for supported file types:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
**/*.pdf **/*.docx **/*.txt **/*.md **/*.json
|
|
||||||
```
|
|
||||||
|
|
||||||
Display the file list and count. Ask the user to confirm before proceeding.
|
|
||||||
If no supported files are found, tell the user and stop.
|
|
||||||
|
|
||||||
## Step 2: Read Documents
|
|
||||||
|
|
||||||
Convert each file to text based on its type:
|
|
||||||
|
|
||||||
| Format | Method |
|
|
||||||
|--------|--------|
|
|
||||||
| .pdf | Use the Read tool with `pages` parameter for large files (>10 pages: read in chunks of 20 pages) |
|
|
||||||
| .docx | Run: `python3 ~/.claude/skills/tabular-extract/scripts/convert_docx.py <filepath>` (requires `pip install python-docx`) |
|
|
||||||
| .txt, .md | Use the Read tool directly |
|
|
||||||
| .json | Use the Read tool directly |
|
|
||||||
|
|
||||||
If a file fails to convert, log it as skipped and continue with remaining files. Do not stop the pipeline.
|
|
||||||
|
|
||||||
## Step 3: Define Extraction Schema
|
|
||||||
|
|
||||||
The user describes what to extract in natural language.
|
|
||||||
|
|
||||||
Infer a structured schema — for each column determine:
|
|
||||||
- **name**: Short, descriptive column header
|
|
||||||
- **type**: One of `text`, `number`, `date`, `boolean`, `list`
|
|
||||||
- **prompt**: Specific extraction instruction
|
|
||||||
|
|
||||||
Present the inferred schema as a table and ask the user to confirm or adjust.
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```
|
|
||||||
| # | Column | Type | Extraction Prompt |
|
|
||||||
|---|--------|------|-------------------|
|
|
||||||
| 1 | Party Name | text | Identify the full legal name of each party to the agreement |
|
|
||||||
| 2 | Effective Date | date | What is the effective date of this agreement? |
|
|
||||||
| 3 | Contract Value | number | What is the total contract value or consideration amount? |
|
|
||||||
```
|
|
||||||
|
|
||||||
## Step 4: Extract Data
|
|
||||||
|
|
||||||
For each document, read its text and extract every column value.
|
|
||||||
|
|
||||||
For each cell, produce:
|
|
||||||
- **value** — the extracted data (typed per column type)
|
|
||||||
- **confidence** — high, medium, or low
|
|
||||||
- **supporting_quote** — exact text from the document
|
|
||||||
- **reasoning** — why this value was chosen
|
|
||||||
|
|
||||||
See `references/extraction-guide.md` for detailed type handling, confidence criteria, and null value handling.
|
|
||||||
|
|
||||||
## Step 5: Output Results
|
|
||||||
|
|
||||||
**Display a markdown table** in the conversation:
|
|
||||||
- One row per document, one column per extraction field
|
|
||||||
- Append `(?)` to low-confidence values
|
|
||||||
- Truncate values longer than 60 characters with `...`
|
|
||||||
|
|
||||||
**Save a JSON file** to `./extraction-results-YYYY-MM-DD.json` in the current working directory.
|
|
||||||
- Use the schema documented in `references/extraction-guide.md`
|
|
||||||
- Include metadata: timestamp, source path, document count, skipped files
|
|
||||||
|
|
||||||
**Print a summary:**
|
|
||||||
- Documents processed / skipped
|
|
||||||
- Confidence distribution (how many high / medium / low extractions)
|
|
||||||
|
|
||||||
## Error Handling
|
|
||||||
|
|
||||||
- **Missing python-docx**: Print install command `pip install python-docx` and ask user to install
|
|
||||||
- **Unreadable file**: Skip file, record in skipped list, continue pipeline
|
|
||||||
- **Large PDF (>10 pages)**: Read in 20-page chunks, concatenate text
|
|
||||||
- **No files found**: Inform user and stop
|
|
||||||
- **User cancels at confirmation**: Stop gracefully
|
|
||||||
@@ -1,94 +0,0 @@
|
|||||||
# Extraction Guide
|
|
||||||
|
|
||||||
## Extraction Prompt Template
|
|
||||||
|
|
||||||
For each document x column, use this reasoning structure:
|
|
||||||
|
|
||||||
1. Read the document text carefully
|
|
||||||
2. Locate text relevant to the extraction prompt
|
|
||||||
3. Extract the value, noting its exact location
|
|
||||||
4. Assess confidence based on clarity of the source text
|
|
||||||
|
|
||||||
## Per-Cell Output Structure
|
|
||||||
|
|
||||||
For each extraction, produce a JSON object:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"value": "<extracted value, typed appropriately>",
|
|
||||||
"confidence": "high | medium | low",
|
|
||||||
"supporting_quote": "<exact text from the document that supports this value>",
|
|
||||||
"reasoning": "<1-2 sentences explaining why this value was chosen>"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Confidence Levels
|
|
||||||
|
|
||||||
- **high**: Value is explicitly stated, unambiguous, directly answers the prompt
|
|
||||||
- **medium**: Value is implied or requires minor inference, or multiple possible values exist
|
|
||||||
- **low**: Value is uncertain, requires significant inference, or the document may not contain the answer
|
|
||||||
|
|
||||||
### Type Handling
|
|
||||||
|
|
||||||
| Column Type | Value Format | Example |
|
|
||||||
|-------------|-------------|---------|
|
|
||||||
| text | Plain string | "Acme Corporation" |
|
|
||||||
| number | Numeric value (no currency symbols) | 500000 |
|
|
||||||
| date | ISO 8601 format (YYYY-MM-DD) | "2024-01-15" |
|
|
||||||
| boolean | true or false | true |
|
|
||||||
| list | JSON array of strings | ["item1", "item2"] |
|
|
||||||
|
|
||||||
### When a Value Cannot Be Found
|
|
||||||
|
|
||||||
If the document does not contain information for a column:
|
|
||||||
- Set value to null
|
|
||||||
- Set confidence to "low"
|
|
||||||
- Set supporting_quote to ""
|
|
||||||
- Set reasoning to explain why the value could not be found
|
|
||||||
|
|
||||||
## Full Output JSON Schema
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"extraction": {
|
|
||||||
"created": "ISO 8601 timestamp",
|
|
||||||
"source_directory": "/absolute/path/to/docs",
|
|
||||||
"documents_processed": 0,
|
|
||||||
"documents_skipped": [],
|
|
||||||
"columns": [
|
|
||||||
{
|
|
||||||
"name": "Column Name",
|
|
||||||
"type": "text|number|date|boolean|list",
|
|
||||||
"prompt": "The extraction prompt used"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"results": [
|
|
||||||
{
|
|
||||||
"document": "filename.pdf",
|
|
||||||
"fields": {
|
|
||||||
"Column Name": {
|
|
||||||
"value": "extracted value",
|
|
||||||
"confidence": "high|medium|low",
|
|
||||||
"supporting_quote": "exact text from document",
|
|
||||||
"reasoning": "explanation"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## Markdown Table Format
|
|
||||||
|
|
||||||
Display results as a pipe-delimited markdown table.
|
|
||||||
Append `(?)` to low-confidence values.
|
|
||||||
Truncate cell values longer than 60 characters with `...`.
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```
|
|
||||||
| Document | Party Name | Date | Amount |
|
|
||||||
|----------|-----------|------|--------|
|
|
||||||
| contract1.pdf | Acme Corp | 2024-01-15 | 500000 |
|
|
||||||
| contract2.pdf | Beta LLC(?) | 2024-03-22 | 1200000 |
|
|
||||||
```
|
|
||||||
@@ -1,83 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Convert a DOCX file to plain text.
|
|
||||||
|
|
||||||
Extracts text from paragraphs and tables.
|
|
||||||
Outputs to stdout for piping into other tools.
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
convert_docx.py <path-to-file.docx>
|
|
||||||
|
|
||||||
Requires:
|
|
||||||
pip install python-docx
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
|
|
||||||
|
|
||||||
def convert_docx(filepath: str) -> str:
|
|
||||||
"""Extract text from a DOCX file, including paragraphs and tables."""
|
|
||||||
try:
|
|
||||||
from docx import Document
|
|
||||||
except ImportError:
|
|
||||||
print(
|
|
||||||
"Error: python-docx is not installed.\n"
|
|
||||||
"Install it with: pip install python-docx",
|
|
||||||
file=sys.stderr
|
|
||||||
)
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
doc = Document(filepath)
|
|
||||||
parts = []
|
|
||||||
|
|
||||||
for element in doc.element.body:
|
|
||||||
tag = element.tag.split("}")[-1] if "}" in element.tag else element.tag
|
|
||||||
|
|
||||||
if tag == "p":
|
|
||||||
# Paragraph
|
|
||||||
for para in doc.paragraphs:
|
|
||||||
if para._element is element:
|
|
||||||
text = para.text.strip()
|
|
||||||
if text:
|
|
||||||
parts.append(text)
|
|
||||||
break
|
|
||||||
|
|
||||||
elif tag == "tbl":
|
|
||||||
# Table
|
|
||||||
for table in doc.tables:
|
|
||||||
if table._element is element:
|
|
||||||
for row in table.rows:
|
|
||||||
cells = [cell.text.strip() for cell in row.cells]
|
|
||||||
parts.append(" | ".join(cells))
|
|
||||||
parts.append("") # blank line after table
|
|
||||||
break
|
|
||||||
|
|
||||||
return "\n".join(parts)
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
if len(sys.argv) != 2:
|
|
||||||
print("Usage: convert_docx.py <path-to-file.docx>", file=sys.stderr)
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
filepath = sys.argv[1]
|
|
||||||
|
|
||||||
if not os.path.exists(filepath):
|
|
||||||
print(f"Error: File not found: {filepath}", file=sys.stderr)
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
if not filepath.lower().endswith(".docx"):
|
|
||||||
print(f"Error: Not a .docx file: {filepath}", file=sys.stderr)
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
try:
|
|
||||||
text = convert_docx(filepath)
|
|
||||||
print(text)
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error converting file: {e}", file=sys.stderr)
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
@@ -1,122 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""Tests for convert_docx.py"""
|
|
||||||
|
|
||||||
import subprocess
|
|
||||||
import sys
|
|
||||||
import tempfile
|
|
||||||
import os
|
|
||||||
|
|
||||||
SCRIPT = os.path.join(os.path.dirname(__file__), "convert_docx.py")
|
|
||||||
|
|
||||||
|
|
||||||
def test_missing_argument():
|
|
||||||
"""Script should print usage and exit 1 when no args given."""
|
|
||||||
result = subprocess.run(
|
|
||||||
[sys.executable, SCRIPT],
|
|
||||||
capture_output=True, text=True
|
|
||||||
)
|
|
||||||
assert result.returncode == 1
|
|
||||||
assert "Usage:" in result.stderr
|
|
||||||
|
|
||||||
|
|
||||||
def test_nonexistent_file():
|
|
||||||
"""Script should error on a file that doesn't exist."""
|
|
||||||
result = subprocess.run(
|
|
||||||
[sys.executable, SCRIPT, "/tmp/nonexistent_file_abc123.docx"],
|
|
||||||
capture_output=True, text=True
|
|
||||||
)
|
|
||||||
assert result.returncode == 1
|
|
||||||
assert "Error" in result.stderr or "not found" in result.stderr.lower()
|
|
||||||
|
|
||||||
|
|
||||||
def test_non_docx_file():
|
|
||||||
"""Script should error on a non-DOCX file."""
|
|
||||||
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
|
|
||||||
f.write(b"hello world")
|
|
||||||
f.flush()
|
|
||||||
result = subprocess.run(
|
|
||||||
[sys.executable, SCRIPT, f.name],
|
|
||||||
capture_output=True, text=True
|
|
||||||
)
|
|
||||||
os.unlink(f.name)
|
|
||||||
assert result.returncode == 1
|
|
||||||
|
|
||||||
|
|
||||||
def test_valid_docx():
|
|
||||||
"""Script should extract text from a valid DOCX file."""
|
|
||||||
try:
|
|
||||||
from docx import Document
|
|
||||||
except ImportError:
|
|
||||||
print("SKIP: python-docx not installed")
|
|
||||||
return
|
|
||||||
|
|
||||||
doc = Document()
|
|
||||||
doc.add_paragraph("Hello from test document")
|
|
||||||
doc.add_paragraph("Second paragraph here")
|
|
||||||
|
|
||||||
with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as f:
|
|
||||||
doc.save(f.name)
|
|
||||||
result = subprocess.run(
|
|
||||||
[sys.executable, SCRIPT, f.name],
|
|
||||||
capture_output=True, text=True
|
|
||||||
)
|
|
||||||
os.unlink(f.name)
|
|
||||||
|
|
||||||
assert result.returncode == 0
|
|
||||||
assert "Hello from test document" in result.stdout
|
|
||||||
assert "Second paragraph here" in result.stdout
|
|
||||||
|
|
||||||
|
|
||||||
def test_docx_with_table():
|
|
||||||
"""Script should extract table content from a DOCX file."""
|
|
||||||
try:
|
|
||||||
from docx import Document
|
|
||||||
except ImportError:
|
|
||||||
print("SKIP: python-docx not installed")
|
|
||||||
return
|
|
||||||
|
|
||||||
doc = Document()
|
|
||||||
doc.add_paragraph("Before table")
|
|
||||||
table = doc.add_table(rows=2, cols=2)
|
|
||||||
table.cell(0, 0).text = "Header1"
|
|
||||||
table.cell(0, 1).text = "Header2"
|
|
||||||
table.cell(1, 0).text = "Value1"
|
|
||||||
table.cell(1, 1).text = "Value2"
|
|
||||||
doc.add_paragraph("After table")
|
|
||||||
|
|
||||||
with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as f:
|
|
||||||
doc.save(f.name)
|
|
||||||
result = subprocess.run(
|
|
||||||
[sys.executable, SCRIPT, f.name],
|
|
||||||
capture_output=True, text=True
|
|
||||||
)
|
|
||||||
os.unlink(f.name)
|
|
||||||
|
|
||||||
assert result.returncode == 0
|
|
||||||
assert "Header1" in result.stdout
|
|
||||||
assert "Value1" in result.stdout
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
tests = [
|
|
||||||
test_missing_argument,
|
|
||||||
test_nonexistent_file,
|
|
||||||
test_non_docx_file,
|
|
||||||
test_valid_docx,
|
|
||||||
test_docx_with_table,
|
|
||||||
]
|
|
||||||
passed = 0
|
|
||||||
failed = 0
|
|
||||||
for test in tests:
|
|
||||||
try:
|
|
||||||
test()
|
|
||||||
print(f" PASS: {test.__name__}")
|
|
||||||
passed += 1
|
|
||||||
except AssertionError as e:
|
|
||||||
print(f" FAIL: {test.__name__} - {e}")
|
|
||||||
failed += 1
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ERROR: {test.__name__} - {e}")
|
|
||||||
failed += 1
|
|
||||||
print(f"\n{passed} passed, {failed} failed")
|
|
||||||
sys.exit(1 if failed else 0)
|
|
||||||
Reference in New Issue
Block a user