Convert tabular-extract skill to git submodule

Move tabular-extract from directly tracked files to a submodule
pointing at https://git.prettyhefty.com/Bill/tabular-extract.git
2026-03-02 23:57:05 -05:00
parent 208b1700d6
commit 41f7973a6f
6 changed files with 4 additions and 396 deletions

.gitmodules

@@ -4,3 +4,6 @@
[submodule "skills/docker-service-architecture"]
path = skills/docker-service-architecture
url = https://git.prettyhefty.com/Bill/docker-service-architecture-skill.git
[submodule "skills/tabular-extract"]
path = skills/tabular-extract
url = https://git.prettyhefty.com/Bill/tabular-extract.git


@@ -1,97 +0,0 @@
---
name: tabular-extract
description: Extract structured data from document collections into tabular format. Reads PDFs, DOCX, TXT, MD, and JSON files from local paths, infers extraction columns from natural language descriptions, and outputs a markdown table plus a JSON file with values, confidence scores, supporting quotes, and reasoning. Use when the user asks to extract structured data from documents, turn documents into a spreadsheet or table, review or compare multiple documents side by side, or pull specific fields from a set of files.
---
# Tabular Extract
Extract structured data from document collections into tabular format.
## Pipeline
This is a rigid, sequential pipeline. Execute every step in order.
1. **Discover documents** — find files at the user's path
2. **Read documents** — convert each file to text
3. **Define schema** — infer extraction columns from user's description
4. **Extract data** — read each document and extract each column's value
5. **Output results** — display markdown table and save JSON file
## Step 1: Discover Documents
Glob the user-provided path for supported file types:
```bash
**/*.pdf **/*.docx **/*.txt **/*.md **/*.json
```
Display the file list and count. Ask the user to confirm before proceeding.
If no supported files are found, tell the user and stop.
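The discovery step could be sketched in Python as follows. This is a minimal illustration, not part of the skill itself: the function name `discover_documents` and the case-insensitive extension match are assumptions.

```python
from pathlib import Path

# Extensions listed in Step 1. Case-insensitive matching is an assumption.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".json"}


def discover_documents(root: str) -> list[Path]:
    """Recursively collect supported files under root, sorted for stable output."""
    base = Path(root)
    return sorted(
        p for p in base.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

An empty list from `discover_documents` corresponds to the "no supported files" case, where the pipeline stops.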
## Step 2: Read Documents
Convert each file to text based on its type:
| Format | Method |
|--------|--------|
| .pdf | Use the Read tool; for large files (>10 pages), pass the `pages` parameter and read in chunks of 20 pages |
| .docx | Run: `python3 ~/.claude/skills/tabular-extract/scripts/convert_docx.py <filepath>` (requires `pip install python-docx`) |
| .txt, .md | Use the Read tool directly |
| .json | Use the Read tool directly |
If a file fails to convert, log it as skipped and continue with remaining files. Do not stop the pipeline.
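The 20-page chunking for large PDFs from the table above can be illustrated with a small helper. This is a sketch under assumptions: `page_chunks` is a hypothetical name, and how the resulting ranges are passed to the Read tool's `pages` parameter is not specified here.

```python
def page_chunks(total_pages: int, chunk_size: int = 20) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (first, last) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]
```

For a 45-page file this yields three ranges: pages 1 to 20, 21 to 40, and 41 to 45. The text of the chunks is then concatenated in order.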
## Step 3: Define Extraction Schema
The user describes what to extract in natural language.
Infer a structured schema — for each column determine:
- **name**: Short, descriptive column header
- **type**: One of `text`, `number`, `date`, `boolean`, `list`
- **prompt**: Specific extraction instruction
Present the inferred schema as a table and ask the user to confirm or adjust.
Example:
```
| # | Column | Type | Extraction Prompt |
|---|--------|------|-------------------|
| 1 | Party Name | text | Identify the full legal name of each party to the agreement |
| 2 | Effective Date | date | What is the effective date of this agreement? |
| 3 | Contract Value | number | What is the total contract value or consideration amount? |
```
## Step 4: Extract Data
For each document, read its text and extract every column value.
For each cell, produce:
- **value** — the extracted data (typed per column type)
- **confidence** — high, medium, or low
- **supporting_quote** — exact text from the document
- **reasoning** — why this value was chosen
See `references/extraction-guide.md` for detailed type handling, confidence criteria, and null value handling.
## Step 5: Output Results
**Display a markdown table** in the conversation:
- One row per document, one column per extraction field
- Append `(?)` to low-confidence values
- Truncate values longer than 60 characters with `...`
**Save a JSON file** to `./extraction-results-YYYY-MM-DD.json` in the current working directory.
- Use the schema documented in `references/extraction-guide.md`
- Include metadata: timestamp, source path, document count, skipped files
**Print a summary:**
- Documents processed / skipped
- Confidence distribution (how many high / medium / low extractions)
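The confidence distribution in the summary could be computed as sketched below. The sketch assumes results shaped like the `results` array in `references/extraction-guide.md`; the function name `confidence_distribution` is hypothetical.

```python
from collections import Counter


def confidence_distribution(results: list[dict]) -> dict[str, int]:
    """Count extracted cells per confidence level across all documents."""
    counts = Counter(
        cell["confidence"]
        for result in results
        for cell in result["fields"].values()
    )
    # Always report all three levels, including those with zero cells.
    return {level: counts.get(level, 0) for level in ("high", "medium", "low")}
```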
## Error Handling
- **Missing python-docx**: Print install command `pip install python-docx` and ask user to install
- **Unreadable file**: Skip file, record in skipped list, continue pipeline
- **Large PDF (>10 pages)**: Read in 20-page chunks, concatenate text
- **No files found**: Inform user and stop
- **User cancels at confirmation**: Stop gracefully


@@ -1,94 +0,0 @@
# Extraction Guide
## Extraction Prompt Template
For each document × column pair, use this reasoning structure:
1. Read the document text carefully
2. Locate text relevant to the extraction prompt
3. Extract the value, noting its exact location
4. Assess confidence based on clarity of the source text
## Per-Cell Output Structure
For each extraction, produce a JSON object:
```json
{
  "value": "<extracted value, typed appropriately>",
  "confidence": "high | medium | low",
  "supporting_quote": "<exact text from the document that supports this value>",
  "reasoning": "<1-2 sentences explaining why this value was chosen>"
}
```
### Confidence Levels
- **high**: Value is explicitly stated, unambiguous, directly answers the prompt
- **medium**: Value is implied or requires minor inference, or multiple possible values exist
- **low**: Value is uncertain, requires significant inference, or the document may not contain the answer
### Type Handling
| Column Type | Value Format | Example |
|-------------|-------------|---------|
| text | Plain string | "Acme Corporation" |
| number | Numeric value (no currency symbols) | 500000 |
| date | ISO 8601 format (YYYY-MM-DD) | "2024-01-15" |
| boolean | true or false | true |
| list | JSON array of strings | ["item1", "item2"] |
### When a Value Cannot Be Found
If the document does not contain information for a column:
- Set value to null
- Set confidence to "low"
- Set supporting_quote to ""
- Set reasoning to explain why the value could not be found
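As an illustration of the rules above, a cell for a missing value might be built like this. The helper name `missing_cell` is an assumption for this sketch. Python's `None` serializes to JSON `null`.

```python
import json


def missing_cell(reason: str) -> dict:
    """Build the per-cell object for a column the document does not cover."""
    return {
        "value": None,           # serialized as JSON null
        "confidence": "low",
        "supporting_quote": "",
        "reasoning": reason,
    }


cell = missing_cell("The document does not mention an effective date.")
serialized = json.dumps(cell)
```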
## Full Output JSON Schema
```json
{
  "extraction": {
    "created": "ISO 8601 timestamp",
    "source_directory": "/absolute/path/to/docs",
    "documents_processed": 0,
    "documents_skipped": [],
    "columns": [
      {
        "name": "Column Name",
        "type": "text|number|date|boolean|list",
        "prompt": "The extraction prompt used"
      }
    ],
    "results": [
      {
        "document": "filename.pdf",
        "fields": {
          "Column Name": {
            "value": "extracted value",
            "confidence": "high|medium|low",
            "supporting_quote": "exact text from document",
            "reasoning": "explanation"
          }
        }
      }
    ]
  }
}
```
## Markdown Table Format
Display results as a pipe-delimited markdown table.
Append `(?)` to low-confidence values.
Truncate cell values longer than 60 characters with `...`.
Example:
```
| Document | Party Name | Date | Amount |
|----------|-----------|------|--------|
| contract1.pdf | Acme Corp | 2024-01-15 | 500000 |
| contract2.pdf | Beta LLC(?) | 2024-03-22 | 1200000 |
```
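The table rules above could be rendered roughly as follows. This is a sketch, not the skill's implementation: `format_cell` and `render_table` are hypothetical names. It also makes three assumptions the guide does not state: the 60-character limit applies before `...` and `(?)` are appended, lists are joined with commas, and null values render as empty text.

```python
MAX_CELL_CHARS = 60


def format_cell(cell: dict) -> str:
    """Render one extracted cell: stringify, truncate, then flag low confidence."""
    value = cell.get("value")
    if value is None:
        text = ""
    elif isinstance(value, bool):
        text = "true" if value else "false"
    elif isinstance(value, list):
        text = ", ".join(str(item) for item in value)
    else:
        text = str(value)
    if len(text) > MAX_CELL_CHARS:
        text = text[:MAX_CELL_CHARS] + "..."
    if cell.get("confidence") == "low":
        text += "(?)"
    return text


def render_table(columns: list[str], results: list[dict]) -> str:
    """Build a pipe-delimited markdown table with one row per document."""
    header = "| " + " | ".join(["Document", *columns]) + " |"
    divider = "|" + "|".join("---" for _ in range(len(columns) + 1)) + "|"
    rows = []
    for result in results:
        cells = [format_cell(result["fields"].get(name, {})) for name in columns]
        rows.append("| " + " | ".join([result["document"], *cells]) + " |")
    return "\n".join([header, divider, *rows])
```

The `results` argument is expected to follow the shape of the `results` array in the full output JSON schema.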


@@ -1,83 +0,0 @@
#!/usr/bin/env python3
"""
Convert a DOCX file to plain text.

Extracts text from paragraphs and tables.
Outputs to stdout for piping into other tools.

Usage:
    convert_docx.py <path-to-file.docx>

Requires:
    pip install python-docx
"""

import sys
import os


def convert_docx(filepath: str) -> str:
    """Extract text from a DOCX file, including paragraphs and tables."""
    try:
        from docx import Document
    except ImportError:
        print(
            "Error: python-docx is not installed.\n"
            "Install it with: pip install python-docx",
            file=sys.stderr
        )
        sys.exit(1)

    doc = Document(filepath)
    parts = []

    # Walk the body in document order so paragraphs and tables stay interleaved.
    for element in doc.element.body:
        tag = element.tag.split("}")[-1] if "}" in element.tag else element.tag
        if tag == "p":
            # Paragraph
            for para in doc.paragraphs:
                if para._element is element:
                    text = para.text.strip()
                    if text:
                        parts.append(text)
                    break
        elif tag == "tbl":
            # Table
            for table in doc.tables:
                if table._element is element:
                    for row in table.rows:
                        cells = [cell.text.strip() for cell in row.cells]
                        parts.append(" | ".join(cells))
                    parts.append("")  # blank line after table
                    break

    return "\n".join(parts)


def main():
    if len(sys.argv) != 2:
        print("Usage: convert_docx.py <path-to-file.docx>", file=sys.stderr)
        sys.exit(1)

    filepath = sys.argv[1]

    if not os.path.exists(filepath):
        print(f"Error: File not found: {filepath}", file=sys.stderr)
        sys.exit(1)

    if not filepath.lower().endswith(".docx"):
        print(f"Error: Not a .docx file: {filepath}", file=sys.stderr)
        sys.exit(1)

    try:
        text = convert_docx(filepath)
        print(text)
    except Exception as e:
        print(f"Error converting file: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()


@@ -1,122 +0,0 @@
#!/usr/bin/env python3
"""Tests for convert_docx.py"""

import subprocess
import sys
import tempfile
import os

SCRIPT = os.path.join(os.path.dirname(__file__), "convert_docx.py")


def test_missing_argument():
    """Script should print usage and exit 1 when no args given."""
    result = subprocess.run(
        [sys.executable, SCRIPT],
        capture_output=True, text=True
    )
    assert result.returncode == 1
    assert "Usage:" in result.stderr


def test_nonexistent_file():
    """Script should error on a file that doesn't exist."""
    result = subprocess.run(
        [sys.executable, SCRIPT, "/tmp/nonexistent_file_abc123.docx"],
        capture_output=True, text=True
    )
    assert result.returncode == 1
    assert "Error" in result.stderr or "not found" in result.stderr.lower()


def test_non_docx_file():
    """Script should error on a non-DOCX file."""
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
        f.write(b"hello world")
        f.flush()
    result = subprocess.run(
        [sys.executable, SCRIPT, f.name],
        capture_output=True, text=True
    )
    os.unlink(f.name)
    assert result.returncode == 1


def test_valid_docx():
    """Script should extract text from a valid DOCX file."""
    try:
        from docx import Document
    except ImportError:
        print("SKIP: python-docx not installed")
        return

    doc = Document()
    doc.add_paragraph("Hello from test document")
    doc.add_paragraph("Second paragraph here")

    with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as f:
        doc.save(f.name)
    result = subprocess.run(
        [sys.executable, SCRIPT, f.name],
        capture_output=True, text=True
    )
    os.unlink(f.name)
    assert result.returncode == 0
    assert "Hello from test document" in result.stdout
    assert "Second paragraph here" in result.stdout


def test_docx_with_table():
    """Script should extract table content from a DOCX file."""
    try:
        from docx import Document
    except ImportError:
        print("SKIP: python-docx not installed")
        return

    doc = Document()
    doc.add_paragraph("Before table")
    table = doc.add_table(rows=2, cols=2)
    table.cell(0, 0).text = "Header1"
    table.cell(0, 1).text = "Header2"
    table.cell(1, 0).text = "Value1"
    table.cell(1, 1).text = "Value2"
    doc.add_paragraph("After table")

    with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as f:
        doc.save(f.name)
    result = subprocess.run(
        [sys.executable, SCRIPT, f.name],
        capture_output=True, text=True
    )
    os.unlink(f.name)
    assert result.returncode == 0
    assert "Header1" in result.stdout
    assert "Value1" in result.stdout


if __name__ == "__main__":
    tests = [
        test_missing_argument,
        test_nonexistent_file,
        test_non_docx_file,
        test_valid_docx,
        test_docx_with_table,
    ]
    passed = 0
    failed = 0
    for test in tests:
        try:
            test()
            print(f"  PASS: {test.__name__}")
            passed += 1
        except AssertionError as e:
            print(f"  FAIL: {test.__name__} - {e}")
            failed += 1
        except Exception as e:
            print(f"  ERROR: {test.__name__} - {e}")
            failed += 1
    print(f"\n{passed} passed, {failed} failed")
    sys.exit(1 if failed else 0)