| name | description |
|---|---|
| tabular-extract | Extract structured data from document collections into tabular format. Reads PDFs, DOCX, TXT, MD, and JSON files from local paths, infers extraction columns from natural language descriptions, and outputs a markdown table plus a JSON file with values, confidence scores, supporting quotes, and reasoning. Use when the user asks to extract structured data from documents, turn documents into a spreadsheet or table, review or compare multiple documents side by side, or pull specific fields from a set of files. |
Tabular Extract
Extract structured data from document collections into tabular format.
Pipeline
This is a rigid, sequential pipeline. Execute every step in order.
- Discover documents — find files at the user's path
- Read documents — convert each file to text
- Define schema — infer extraction columns from user's description
- Extract data — read each document and extract each column's value
- Output results — display markdown table and save JSON file
Step 1: Discover Documents
Glob the user-provided path for supported file types:
**/*.pdf **/*.docx **/*.txt **/*.md **/*.json
Display the file list and count. Ask the user to confirm before proceeding. If no supported files are found, tell the user and stop.
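The discovery step can be sketched in Python as below. This is a minimal illustration, not the skill's own code: the function name `discover_documents` is hypothetical, and matching suffixes case-insensitively is an assumption that is slightly broader than the glob patterns above.

```python
from pathlib import Path

# Extensions taken from the glob patterns listed in this step.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md", ".json"}


def discover_documents(root: str) -> list[Path]:
    """Return supported files under root, searched recursively and sorted."""
    base = Path(root).expanduser()
    if base.is_file():
        # A single file path is accepted as a one-document collection (assumption).
        return [base] if base.suffix.lower() in SUPPORTED_SUFFIXES else []
    return sorted(
        p for p in base.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

An empty result corresponds to the "no supported files" case, where the pipeline tells the user and stops.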
Step 2: Read Documents
Convert each file to text based on its type:
| Format | Method |
|---|---|
| .pdf | Use the Read tool with pages parameter for large files (>10 pages: read in chunks of 20 pages) |
| .docx | Run: python3 ~/.claude/skills/tabular-extract/scripts/convert_docx.py <filepath> (requires pip install python-docx) |
| .txt, .md | Use the Read tool directly |
| .json | Use the Read tool directly |
If a file fails to convert, log it as skipped and continue with remaining files. Do not stop the pipeline.
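The skip-and-continue behavior could look roughly like this. The names `read_all` and `read_plain_text` are hypothetical, and the shape of the skipped-file records (a file path plus a reason) is an assumption. Readers for .pdf and .docx are left to the caller, since those go through the Read tool and the conversion script.

```python
from pathlib import Path
from typing import Callable


def read_plain_text(path: Path) -> str:
    """Reader for .txt, .md and .json: return the file contents as text."""
    return path.read_text(encoding="utf-8")


def read_all(
    paths: list[Path],
    readers: dict[str, Callable[[Path], str]],
) -> tuple[dict[str, str], list[dict[str, str]]]:
    """Convert each file to text; record failures instead of raising."""
    texts: dict[str, str] = {}
    skipped: list[dict[str, str]] = []
    for path in paths:
        reader = readers.get(path.suffix.lower())
        if reader is None:
            skipped.append({"file": str(path), "reason": "unsupported format"})
            continue
        try:
            texts[str(path)] = reader(path)
        except Exception as exc:  # any conversion failure means "skip", not "stop"
            skipped.append({"file": str(path), "reason": str(exc)})
    return texts, skipped
```

Catching a broad `Exception` is deliberate here: one unreadable file should never halt the run, and the reason is kept so it can be reported later.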
Step 3: Define Extraction Schema
The user describes what to extract in natural language.
Infer a structured schema — for each column determine:
- name: Short, descriptive column header
- type: One of text, number, date, boolean, list
- prompt: Specific extraction instruction
Present the inferred schema as a table and ask the user to confirm or adjust.
Example:
| # | Column | Type | Extraction Prompt |
|---|--------|------|-------------------|
| 1 | Party Name | text | Identify the full legal name of each party to the agreement |
| 2 | Effective Date | date | What is the effective date of this agreement? |
| 3 | Contract Value | number | What is the total contract value or consideration amount? |
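
One way to hold the inferred schema in code is a small dataclass per column. The sketch below is illustrative only: the `Column` class and the `schema_table` helper are hypothetical names, and rejecting unknown types at construction time is an assumption.

```python
from dataclasses import dataclass

# The five column types named in this step.
COLUMN_TYPES = ("text", "number", "date", "boolean", "list")


@dataclass(frozen=True)
class Column:
    name: str    # short, descriptive column header
    type: str    # one of COLUMN_TYPES
    prompt: str  # specific extraction instruction

    def __post_init__(self) -> None:
        if self.type not in COLUMN_TYPES:
            raise ValueError(f"unknown column type: {self.type!r}")


def schema_table(columns: list[Column]) -> str:
    """Render the schema as the confirmation table shown to the user."""
    lines = [
        "| # | Column | Type | Extraction Prompt |",
        "|---|--------|------|-------------------|",
    ]
    for index, col in enumerate(columns, start=1):
        lines.append(f"| {index} | {col.name} | {col.type} | {col.prompt} |")
    return "\n".join(lines)


schema = [
    Column("Party Name", "text", "Identify the full legal name of each party to the agreement"),
    Column("Effective Date", "date", "What is the effective date of this agreement?"),
    Column("Contract Value", "number", "What is the total contract value or consideration amount?"),
]
```

Calling `schema_table(schema)` reproduces the example table above, which makes it easy to show the schema back to the user for confirmation.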
Step 4: Extract Data
For each document, read its text and extract every column value.
For each cell, produce:
- value — the extracted data (typed per column type)
- confidence — high, medium, or low
- supporting_quote — exact text from the document
- reasoning — why this value was chosen
See references/extraction-guide.md for detailed type handling, confidence criteria, and null value handling.
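As a rough sketch, each cell can be modeled as a record with the four fields listed above. The `Cell` class and the `quote_is_exact` check are hypothetical and are not taken from references/extraction-guide.md. The check only illustrates one way to enforce that a supporting quote is exact text from the document.

```python
from dataclasses import dataclass
from typing import Any

CONFIDENCE_LEVELS = ("high", "medium", "low")


@dataclass
class Cell:
    value: Any             # typed per the column type
    confidence: str        # high, medium, or low
    supporting_quote: str  # exact text from the document
    reasoning: str         # why this value was chosen

    def __post_init__(self) -> None:
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.confidence!r}")


def quote_is_exact(cell: Cell, document_text: str) -> bool:
    """True when the supporting quote appears verbatim in the document text."""
    return bool(cell.supporting_quote) and cell.supporting_quote in document_text
```

A failed `quote_is_exact` check is a useful signal that a quote was paraphrased rather than copied, and such a cell probably deserves a second look.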
Step 5: Output Results
Display a markdown table in the conversation:
- One row per document, one column per extraction field
- Append (?) to low-confidence values
- Truncate values longer than 60 characters with ...
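
The display rules above can be sketched as follows. Several details are assumptions rather than part of the skill: the helper names, the leading "Document" column, keeping the first 60 characters before adding the ellipsis, and escaping pipe characters so that values do not break the table.

```python
MAX_VALUE_LENGTH = 60


def format_cell(value: object, confidence: str) -> str:
    """Apply the truncation and low-confidence marker rules to one value."""
    text = " ".join(str(value).split())  # collapse newlines and repeated spaces
    if len(text) > MAX_VALUE_LENGTH:
        text = text[:MAX_VALUE_LENGTH] + "..."
    text = text.replace("|", "\\|")  # keep pipes from splitting the table cell
    if confidence == "low":
        text += " (?)"
    return text


def render_table(column_names: list[str], rows: list[tuple[str, list[tuple[object, str]]]]) -> str:
    """Build a markdown table with one row per document.

    Each row is (document name, [(value, confidence), ...]) in column order.
    """
    header = "| Document | " + " | ".join(column_names) + " |"
    divider = "|---" * (len(column_names) + 1) + "|"
    lines = [header, divider]
    for doc_name, cells in rows:
        rendered = [format_cell(value, conf) for value, conf in cells]
        lines.append("| " + " | ".join([doc_name, *rendered]) + " |")
    return "\n".join(lines)
```

For example, `format_cell("Acme Corp", "low")` yields `Acme Corp (?)`, and a value of exactly 60 characters is left untouched.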
Save a JSON file to ./extraction-results-YYYY-MM-DD.json in the current working directory.
- Use the schema documented in references/extraction-guide.md
- Include metadata: timestamp, source path, document count, skipped files
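
Writing the results file might look like the sketch below. The authoritative JSON layout lives in references/extraction-guide.md, which is not reproduced here, so every key name (`metadata`, `source_path`, `document_count`, `skipped_files`, `results`) and the `save_results` helper itself are assumptions made for illustration only.

```python
import json
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Optional


def save_results(
    results: list[dict],
    source_path: str,
    skipped: list[dict],
    out_dir: str = ".",
    today: Optional[date] = None,
) -> Path:
    """Write results plus metadata to extraction-results-YYYY-MM-DD.json."""
    today = today or date.today()
    payload = {
        # Hypothetical key names; the real schema is in the extraction guide.
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "source_path": source_path,
            "document_count": len(results),
            "skipped_files": skipped,
        },
        "results": results,
    }
    out_path = Path(out_dir) / f"extraction-results-{today.isoformat()}.json"
    out_path.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path
```

The `today` parameter exists only to make the file name reproducible in tests. In normal use it is omitted and the current date is used.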
Print a summary:
- Documents processed / skipped
- Confidence distribution (how many high / medium / low extractions)
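
The summary can be computed with a simple counter over every cell's confidence label. This is a sketch only: `summarize` is a hypothetical helper and the exact wording of the output is an assumption.

```python
from collections import Counter


def summarize(confidences: list[str], processed: int, skipped: int) -> str:
    """Build the closing summary from all cell confidence labels."""
    counts = Counter(confidences)  # missing labels count as zero
    return (
        f"Documents processed: {processed}, skipped: {skipped}\n"
        f"Confidence: {counts['high']} high / {counts['medium']} medium / {counts['low']} low"
    )
```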
Error Handling
- Missing python-docx: Print install command pip install python-docx and ask user to install
- Unreadable file: Skip file, record in skipped list, continue pipeline
- Large PDF (>10 pages): Read in 20-page chunks, concatenate text
- No files found: Inform user and stop
- User cancels at confirmation: Stop gracefully
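
The missing-dependency case can be detected before any .docx file is converted. The sketch below uses the standard library's `importlib.util.find_spec`. The `install_hint` helper and its message wording are hypothetical.

```python
import importlib.util
from typing import Optional


def install_hint(module_name: str, package_name: str) -> Optional[str]:
    """Return an install message if module_name cannot be imported, else None."""
    if importlib.util.find_spec(module_name) is not None:
        return None
    return f"Missing dependency. Install it with: pip install {package_name}"
```

The python-docx package is imported under the module name `docx`, so the check for this skill would be `install_hint("docx", "python-docx")`.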