Initial commit: tabular-extract skill

Claude Code skill that extracts structured data from document collections into tabular format using Claude's native document understanding capabilities.
2026-03-02 23:56:28 -05:00
commit be5b36fbc4
4 changed files with 396 additions and 0 deletions
--- a/SKILL.md
+++ b/SKILL.md
@@ -0,0 +1,97 @@
+---
+name: tabular-extract
+description: Extract structured data from document collections into tabular format. Reads PDFs, DOCX, TXT, MD, and JSON files from local paths, infers extraction columns from natural language descriptions, and outputs a markdown table plus a JSON file with values, confidence scores, supporting quotes, and reasoning. Use when the user asks to extract structured data from documents, turn documents into a spreadsheet or table, review or compare multiple documents side by side, or pull specific fields from a set of files.
+---
+
+# Tabular Extract
+
+Extract structured data from document collections into tabular format.
+
+## Pipeline
+
+This is a rigid, sequential pipeline. Execute every step in order.
+
+1. **Discover documents** — find files at the user's path
+2. **Read documents** — convert each file to text
+3. **Define schema** — infer extraction columns from user's description
+4. **Extract data** — read each document and extract each column's value
+5. **Output results** — display markdown table and save JSON file
+
+## Step 1: Discover Documents
+
+Glob the user-provided path for supported file types:
+
+```
+**/*.pdf **/*.docx **/*.txt **/*.md **/*.json
+```
+
+Display the file list and count. Ask the user to confirm before proceeding.
+If no supported files are found, tell the user and stop.
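Editor's sketch (not part of the committed file): one way the discovery step could look in plain Python. The `discover` helper and the case-insensitive extension match are assumptions made for illustration; the skill itself only specifies the glob patterns above.

```python
from pathlib import Path

# Extensions listed in Step 1 of the skill.
SUPPORTED = {".pdf", ".docx", ".txt", ".md", ".json"}


def discover(root: str) -> list[Path]:
    """Recursively collect supported files under root, sorted for stable display."""
    base = Path(root)
    return sorted(
        p for p in base.rglob("*")
        # Assumption: match extensions case-insensitively (e.g. "REPORT.PDF").
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

An empty return value corresponds to the "no supported files" case, where the pipeline tells the user and stops.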
+
+## Step 2: Read Documents
+
+Convert each file to text based on its type:
+
+| Format | Method |
+|--------|--------|
+| .pdf | Use the Read tool. For files longer than 10 pages, pass the `pages` parameter and read in 20-page chunks |
+| .docx | Run: `python3 ~/.claude/skills/tabular-extract/scripts/convert_docx.py <filepath>` (requires `pip install python-docx`) |
+| .txt, .md | Use the Read tool directly |
+| .json | Use the Read tool directly |
+
+If a file fails to convert, log it as skipped and continue with remaining files. Do not stop the pipeline.
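Editor's sketch (not part of the committed file): the skip-and-continue rule expressed as control flow. In the skill, the conversion is done with the Read tool and `convert_docx.py`; here `to_text` is a hypothetical stand-in for whichever converter applies, and the shape of the skipped-file records is an assumption.

```python
from pathlib import Path
from typing import Callable


def read_all(
    paths: list[Path],
    to_text: Callable[[Path], str],
) -> tuple[dict[str, str], list[dict[str, str]]]:
    """Convert each file to text; record failures instead of raising."""
    texts: dict[str, str] = {}
    skipped: list[dict[str, str]] = []
    for path in paths:
        try:
            texts[str(path)] = to_text(path)
        except Exception as exc:
            # Any conversion failure means "skip this file", not "abort the run".
            skipped.append({"file": str(path), "reason": str(exc)})
    return texts, skipped
```

Keeping the skipped list alongside the texts makes it available later for the metadata and summary in Step 5.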
+
+## Step 3: Define Extraction Schema
+
+The user describes what to extract in natural language.
+
+Infer a structured schema from that description. For each column, determine:
+- **name**: Short, descriptive column header
+- **type**: One of `text`, `number`, `date`, `boolean`, `list`
+- **prompt**: Specific extraction instruction
+
+Present the inferred schema as a table and ask the user to confirm or adjust.
+
+Example:
+```
+| # | Column | Type | Extraction Prompt |
+|---|--------|------|-------------------|
+| 1 | Party Name | text | Identify the full legal name of each party to the agreement |
+| 2 | Effective Date | date | What is the effective date of this agreement? |
+| 3 | Contract Value | number | What is the total contract value or consideration amount? |
+```
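Editor's sketch (not part of the committed file): the example schema above written as data, with the five allowed types enforced. The `Column` class is a hypothetical illustration; the skill does not prescribe an in-memory representation.

```python
from dataclasses import dataclass

# The five column types listed in Step 3.
ALLOWED_TYPES = {"text", "number", "date", "boolean", "list"}


@dataclass(frozen=True)
class Column:
    name: str    # short, descriptive column header
    type: str    # one of ALLOWED_TYPES
    prompt: str  # specific extraction instruction

    def __post_init__(self) -> None:
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported column type: {self.type!r}")


schema = [
    Column("Party Name", "text", "Identify the full legal name of each party to the agreement"),
    Column("Effective Date", "date", "What is the effective date of this agreement?"),
    Column("Contract Value", "number", "What is the total contract value or consideration amount?"),
]
```

Rejecting unknown types at construction time surfaces a bad schema before the user is asked to confirm it.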
+
+## Step 4: Extract Data
+
+For each document, read its text and extract every column value.
+
+For each cell, produce:
+- **value** — the extracted data (typed per column type)
+- **confidence** — high, medium, or low
+- **supporting_quote** — exact text from the document
+- **reasoning** — why this value was chosen
+
+See `references/extraction-guide.md` for detailed type handling, confidence criteria, and null value handling.
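Editor's sketch (not part of the committed file): a single cell carrying the four fields listed above. The authoritative schema lives in `references/extraction-guide.md`, which is not shown here, so the `Cell` class and the sample values are invented for illustration only.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any

CONFIDENCE_LEVELS = ("high", "medium", "low")


@dataclass
class Cell:
    value: Any             # typed per the column's type
    confidence: str        # "high", "medium", or "low"
    supporting_quote: str  # exact text from the document
    reasoning: str         # why this value was chosen

    def __post_init__(self) -> None:
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.confidence!r}")


# Invented sample values, purely for illustration.
cell = Cell(
    value="2024-01-15",
    confidence="high",
    supporting_quote="This Agreement is effective as of January 15, 2024.",
    reasoning="The effective date is stated explicitly in the opening clause.",
)
print(json.dumps(asdict(cell), indent=2))
```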
+
+## Step 5: Output Results
+
+**Display a markdown table** in the conversation:
+- One row per document, one column per extraction field
+- Append `(?)` to low-confidence values
+- Truncate values longer than 60 characters and end them with `...`
+
+**Save a JSON file** to `./extraction-results-YYYY-MM-DD.json` in the current working directory.
+- Use the schema documented in `references/extraction-guide.md`
+- Include metadata: timestamp, source path, document count, skipped files
+
+**Print a summary:**
+- Documents processed / skipped
+- Confidence distribution (how many high / medium / low extractions)
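Editor's sketch (not part of the committed file): the display rules, the dated filename, and the confidence tally as small helpers. All function names are hypothetical. Two choices here are assumptions rather than rules from the skill: the `(?)` marker is appended after truncation so it is never cut off, and `|` characters in values are escaped so they do not break the table.

```python
from collections import Counter
from datetime import date
from typing import Optional

MAX_LEN = 60
Rows = dict[str, list[tuple[object, str]]]  # document -> [(value, confidence), ...]


def format_cell(value: object, confidence: str) -> str:
    """Apply the truncation and low-confidence rules to one value."""
    text = "" if value is None else str(value)
    if len(text) > MAX_LEN:
        text = text[:MAX_LEN] + "..."
    text = text.replace("|", "\\|")  # assumption: keep table columns intact
    if confidence == "low":
        text += " (?)"
    return text


def render_table(columns: list[str], rows: Rows) -> str:
    """One row per document, one column per extraction field."""
    lines = [
        "| Document | " + " | ".join(columns) + " |",
        "|" + "---|" * (len(columns) + 1),
    ]
    for doc, cells in rows.items():
        rendered = [format_cell(value, conf) for value, conf in cells]
        lines.append("| " + " | ".join([doc, *rendered]) + " |")
    return "\n".join(lines)


def confidence_distribution(rows: Rows) -> dict[str, int]:
    """Count high / medium / low extractions across all cells."""
    counts = Counter(conf for cells in rows.values() for _, conf in cells)
    return {level: counts.get(level, 0) for level in ("high", "medium", "low")}


def output_filename(today: Optional[date] = None) -> str:
    """Build ./extraction-results-YYYY-MM-DD.json for the given (or current) date."""
    return f"./extraction-results-{(today or date.today()).isoformat()}.json"
```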
+
+## Error Handling
+
+- **Missing python-docx**: Print the install command `pip install python-docx` and ask the user to run it
+- **Unreadable file**: Skip file, record in skipped list, continue pipeline
+- **Large PDF (>10 pages)**: Read in 20-page chunks, concatenate text
+- **No files found**: Inform user and stop
+- **User cancels at confirmation**: Stop gracefully
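Editor's sketch (not part of the committed file): two of the checks above as helpers. Both function names are hypothetical, and the `"start-end"` string format for page ranges is an assumption about what the Read tool's `pages` parameter accepts.

```python
import importlib.util


def docx_support_available() -> bool:
    """True if the `docx` module (installed by `pip install python-docx`) is importable."""
    return importlib.util.find_spec("docx") is not None


def page_ranges(total_pages: int, chunk_size: int = 20) -> list[str]:
    """Split pages 1..total_pages into consecutive ranges of at most chunk_size pages."""
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges


print(page_ranges(45))  # ['1-20', '21-40', '41-45']
```

Checking for `docx` before the first conversion lets the install prompt appear once, instead of failing separately on every `.docx` file.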