Bill/claude-code-user-config

Bill Ballou 06f0b3b18d feat: write SKILL.md with complete extraction pipeline

2026-03-02 23:49:30 -05:00

3.9 KiB

name	description
tabular-extract	Extract structured data from document collections into tabular format. Reads PDFs, DOCX, TXT, MD, and JSON files from local paths, infers extraction columns from natural language descriptions, and outputs a markdown table plus a JSON file with values, confidence scores, supporting quotes, and reasoning. Use when the user asks to extract structured data from documents, turn documents into a spreadsheet or table, review or compare multiple documents side by side, or pull specific fields from a set of files.

Tabular Extract

Extract structured data from document collections into tabular format.

Pipeline

This is a rigid, sequential pipeline. Execute every step in order.

1. Discover documents: find files at the user's path
2. Read documents: convert each file to text
3. Define schema: infer extraction columns from the user's description
4. Extract data: read each document and extract each column's value
5. Output results: display a markdown table and save a JSON file

Step 1: Discover Documents

Glob the user-provided path for supported file types:

**/*.pdf **/*.docx **/*.txt **/*.md **/*.json

Display the file list and count. Ask the user to confirm before proceeding. If no supported files are found, tell the user and stop.
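As an illustration of the discovery step, here is a minimal Python sketch. The function name `discover_documents` is hypothetical, and matching extensions case-insensitively is an assumption that goes slightly beyond the glob patterns above.

```python
from pathlib import Path

# Extensions taken from the glob patterns listed in this step.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".json"}


def discover_documents(root: str) -> list[Path]:
    """Return every supported file under root, sorted for stable display."""
    base = Path(root).expanduser()
    return sorted(
        path
        for path in base.rglob("*")
        # Assumption: treat ".PDF" the same as ".pdf".
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

An empty return value corresponds to the "no supported files" case, where the pipeline stops.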

Step 2: Read Documents

Convert each file to text based on its type:

| Format | Method |
|--------|--------|
| .pdf | Use the Read tool. For files longer than 10 pages, pass the `pages` parameter and read in chunks of 20 pages. |
| .docx | Run `python3 ~/.claude/skills/tabular-extract/scripts/convert_docx.py <filepath>` (requires `pip install python-docx`) |
| .txt, .md | Use the Read tool directly |
| .json | Use the Read tool directly |

If a file fails to convert, log it as skipped and continue with remaining files. Do not stop the pipeline.
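The skip-and-continue rule can be sketched as follows. This is only an illustration: `read_all` and the `convert` callback are hypothetical names, and the shape of each skipped-file record is an assumption.

```python
from pathlib import Path
from typing import Callable


def read_all(
    paths: list[Path],
    convert: Callable[[Path], str],
) -> tuple[dict[Path, str], list[dict[str, str]]]:
    """Convert each file to text; record failures instead of raising."""
    texts: dict[Path, str] = {}
    skipped: list[dict[str, str]] = []
    for path in paths:
        try:
            texts[path] = convert(path)
        except Exception as exc:
            # Any conversion failure skips this file only; the loop continues.
            skipped.append({"file": str(path), "reason": str(exc)})
    return texts, skipped
```

The `skipped` list is what later feeds the metadata and the final summary.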

Step 3: Define Extraction Schema

The user describes what to extract in natural language.

Infer a structured schema — for each column determine:

name: Short, descriptive column header
type: One of text, number, date, boolean, list
prompt: Specific extraction instruction

Present the inferred schema as a table and ask the user to confirm or adjust.

Example:

| # | Column | Type | Extraction Prompt |
|---|--------|------|-------------------|
| 1 | Party Name | text | Identify the full legal name of each party to the agreement |
| 2 | Effective Date | date | What is the effective date of this agreement? |
| 3 | Contract Value | number | What is the total contract value or consideration amount? |
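One possible in-memory representation of this schema is sketched below. The `Column` class and `schema_to_markdown` helper are hypothetical names introduced for illustration; only the three fields and the five type names come from this step.

```python
from dataclasses import dataclass

# The five column types named in this step.
COLUMN_TYPES = {"text", "number", "date", "boolean", "list"}


@dataclass(frozen=True)
class Column:
    name: str    # short, descriptive column header
    type: str    # one of COLUMN_TYPES
    prompt: str  # specific extraction instruction

    def __post_init__(self) -> None:
        if self.type not in COLUMN_TYPES:
            raise ValueError(f"unsupported column type: {self.type!r}")


def schema_to_markdown(columns: list[Column]) -> str:
    """Render the schema in the same table layout as the example above."""
    lines = [
        "| # | Column | Type | Extraction Prompt |",
        "|---|--------|------|-------------------|",
    ]
    for index, col in enumerate(columns, start=1):
        lines.append(f"| {index} | {col.name} | {col.type} | {col.prompt} |")
    return "\n".join(lines)
```

Rejecting unknown types at construction time keeps an invalid schema from reaching the extraction step.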

Step 4: Extract Data

For each document, read its text and extract every column value.

For each cell, produce:

value — the extracted data (typed per column type)
confidence — high, medium, or low
supporting_quote — exact text from the document
reasoning — why this value was chosen

See references/extraction-guide.md for detailed type handling, confidence criteria, and null value handling.
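As a rough illustration, a cell could be modeled as below. The field names follow the list above, but the `Cell` class and the `quote_is_verbatim` check are hypothetical; references/extraction-guide.md remains the authority on types, confidence criteria, and null handling.

```python
from dataclasses import dataclass
from typing import Any

CONFIDENCE_LEVELS = ("high", "medium", "low")


@dataclass
class Cell:
    value: Any             # typed per the column's type
    confidence: str        # one of CONFIDENCE_LEVELS
    supporting_quote: str  # exact text from the document
    reasoning: str         # why this value was chosen

    def __post_init__(self) -> None:
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.confidence!r}")


def quote_is_verbatim(cell: Cell, document_text: str) -> bool:
    """Check that the supporting quote appears word for word in the source."""
    return cell.supporting_quote in document_text
```

A failed verbatim check is a useful signal that a quote was paraphrased rather than copied.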

Step 5: Output Results

Display a markdown table in the conversation:

One row per document, one column per extraction field
Append (?) to low-confidence values
Truncate values longer than 60 characters and append `...`
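The display rules above might be implemented along these lines. This is a sketch under assumptions: the helper names are hypothetical, truncation keeps the first 60 characters before appending `...`, the first column is labeled "Document", and pipe characters are escaped so they do not break the table.

```python
MAX_VALUE_LENGTH = 60


def format_cell(value: object, confidence: str) -> str:
    """Apply the truncation and low-confidence marker rules to one value."""
    text = str(value)
    if len(text) > MAX_VALUE_LENGTH:
        text = text[:MAX_VALUE_LENGTH] + "..."
    # Assumption: escape pipes and flatten newlines to keep rows intact.
    text = text.replace("|", "\\|").replace("\n", " ")
    if confidence == "low":
        text += " (?)"
    return text


def render_table(
    columns: list[str],
    rows: dict[str, list[tuple[object, str]]],
) -> str:
    """rows maps a document name to one (value, confidence) pair per column."""
    lines = [
        "| Document | " + " | ".join(columns) + " |",
        "|" + "---|" * (len(columns) + 1),
    ]
    for document, cells in rows.items():
        rendered = [format_cell(value, conf) for value, conf in cells]
        lines.append(f"| {document} | " + " | ".join(rendered) + " |")
    return "\n".join(lines)
```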

Save a JSON file to ./extraction-results-YYYY-MM-DD.json in the current working directory.

Use the schema documented in references/extraction-guide.md
Include metadata: timestamp, source path, document count, skipped files
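A minimal sketch of the save step follows. The key names (`metadata`, `documents`, `skipped_files`, and so on) are assumptions made for illustration; the actual layout is the one documented in references/extraction-guide.md.

```python
import json
from datetime import date, datetime, timezone
from pathlib import Path


def save_results(
    documents: list[dict],
    source_path: str,
    skipped: list[dict],
    out_dir: str = ".",
) -> Path:
    """Write results plus metadata to extraction-results-YYYY-MM-DD.json."""
    payload = {
        # Hypothetical key names; defer to the extraction guide's schema.
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "source_path": source_path,
            "document_count": len(documents),
            "skipped_files": skipped,
        },
        "documents": documents,
    }
    out_path = Path(out_dir) / f"extraction-results-{date.today().isoformat()}.json"
    out_path.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path
```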

Print a summary:

Documents processed / skipped
Confidence distribution (how many high / medium / low extractions)
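The summary can be computed with a simple tally, sketched here. The function name `summarize` and the exact wording of the output lines are illustrative assumptions.

```python
from collections import Counter


def summarize(confidences: list[str], processed: int, skipped: int) -> str:
    """Build the closing summary from one confidence label per extracted cell."""
    counts = Counter(confidences)  # missing levels count as 0
    return (
        f"Documents: {processed} processed, {skipped} skipped\n"
        f"Confidence: {counts['high']} high, "
        f"{counts['medium']} medium, {counts['low']} low"
    )
```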

Error Handling

Missing python-docx: Print the install command `pip install python-docx` and ask the user to install it
Unreadable file: Skip file, record in skipped list, continue pipeline
Large PDF (>10 pages): Read in 20-page chunks, concatenate text
No files found: Inform user and stop
User cancels at confirmation: Stop gracefully
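For the large-PDF case, the page ranges could be computed as in the sketch below. The `page_chunks` helper is hypothetical, and the `"start-end"` string format for the `pages` parameter is an assumption, not something this document specifies.

```python
def page_chunks(
    total_pages: int,
    chunk_size: int = 20,
    threshold: int = 10,
) -> list[str]:
    """Return page ranges to read: one range for small PDFs, 20-page chunks otherwise."""
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    if total_pages <= threshold:
        return [f"1-{total_pages}"]
    return [
        f"{start}-{min(start + chunk_size - 1, total_pages)}"
        for start in range(1, total_pages + 1, chunk_size)
    ]
```

The text from each range is then concatenated in order to form the full document text.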
