---
name: tabular-extract
description: Extract structured data from document collections into tabular format. Reads PDFs, DOCX, TXT, MD, and JSON files from local paths, infers extraction columns from natural language descriptions, and outputs a markdown table plus a JSON file with values, confidence scores, supporting quotes, and reasoning. Use when the user asks to extract structured data from documents, turn documents into a spreadsheet or table, review or compare multiple documents side by side, or pull specific fields from a set of files.
---

# Tabular Extract

Extract structured data from document collections into tabular format.

## Pipeline

This is a rigid, sequential pipeline. Execute every step in order.

1. **Discover documents**: find files at the user's path
2. **Read documents**: convert each file to text
3. **Define schema**: infer extraction columns from the user's description
4. **Extract data**: read each document and extract each column's value
5. **Output results**: display a markdown table and save a JSON file

## Step 1: Discover Documents

Glob the user-provided path for supported file types:

```bash
**/*.pdf **/*.docx **/*.txt **/*.md **/*.json
```

Display the file list and count. Ask the user to confirm before proceeding.

If no supported files are found, tell the user and stop.

## Step 2: Read Documents

Convert each file to text based on its type:

| Format | Method |
|--------|--------|
| .pdf | Use the Read tool. For large files (>10 pages), use the `pages` parameter to read in chunks of 20 pages |
| .docx | Run: `python3 ~/.claude/skills/tabular-extract/scripts/convert_docx.py ` (requires `pip install python-docx`) |
| .txt, .md | Use the Read tool directly |
| .json | Use the Read tool directly |

If a file fails to convert, log it as skipped and continue with the remaining files. Do not stop the pipeline.

## Step 3: Define Extraction Schema

The user describes what to extract in natural language.
Infer a structured schema. For each column, determine:

- **name**: Short, descriptive column header
- **type**: One of `text`, `number`, `date`, `boolean`, `list`
- **prompt**: Specific extraction instruction

Present the inferred schema as a table and ask the user to confirm or adjust. Example:

```
| # | Column | Type | Extraction Prompt |
|---|--------|------|-------------------|
| 1 | Party Name | text | Identify the full legal name of each party to the agreement |
| 2 | Effective Date | date | What is the effective date of this agreement? |
| 3 | Contract Value | number | What is the total contract value or consideration amount? |
```

## Step 4: Extract Data

For each document, read its text and extract every column value.

For each cell, produce:

- **value**: the extracted data (typed per column type)
- **confidence**: high, medium, or low
- **supporting_quote**: exact text from the document
- **reasoning**: why this value was chosen

See `references/extraction-guide.md` for detailed type handling, confidence criteria, and null value handling.

## Step 5: Output Results

**Display a markdown table** in the conversation:

- One row per document, one column per extraction field
- Append `(?)` to low-confidence values
- Truncate values longer than 60 characters with `...`

**Save a JSON file** in the same directory as the source file(s):

- **Naming convention**: derive the output filename from the source file. Strip the extension and append `-extraction.json`.
  Examples:
  - `Orenda Proposal.pdf` → `Orenda Proposal-extraction.json`
  - `Contract v2.docx` → `Contract v2-extraction.json`
  - For multiple source files, use the common parent directory name: `-extraction.json`
- Use the schema documented in `references/extraction-guide.md`
- Include metadata: timestamp, source path, document count, skipped files

**Print a summary:**

- Documents processed / skipped
- Confidence distribution (how many high / medium / low extractions)

## Error Handling

- **Missing python-docx**: Print the install command `pip install python-docx` and ask the user to install it
- **Unreadable file**: Skip the file, record it in the skipped list, continue the pipeline
- **Large PDF (>10 pages)**: Read in 20-page chunks, concatenate the text
- **No files found**: Inform the user and stop
- **User cancels at confirmation**: Stop gracefully
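
As a rough illustration of Steps 1 and 5, the sketch below shows one way the file discovery, the single-file naming convention, and the table-cell formatting could look in Python. The function names (`discover_documents`, `output_path`, `format_cell`) are hypothetical and not part of this skill. Keeping exactly 60 characters before the `...` and putting a space before `(?)` are assumptions, and the multi-file naming case is deliberately left out.

```python
from pathlib import Path

# Extensions from the Step 1 glob patterns.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".json"}


def discover_documents(root: str) -> list[Path]:
    """Recursively collect supported files under root, sorted for stable output."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in SUPPORTED_EXTENSIONS
    )


def output_path(source: Path) -> Path:
    """Single-file case of Step 5: strip the extension, append '-extraction.json'.

    The result stays in the same directory as the source file.
    """
    return source.with_name(f"{source.stem}-extraction.json")


def format_cell(value: str, confidence: str, max_len: int = 60) -> str:
    """Format one markdown table cell per Step 5.

    Assumption: values longer than max_len keep their first max_len characters
    followed by '...'; low-confidence values get ' (?)' appended.
    """
    text = value if len(value) <= max_len else value[:max_len] + "..."
    return f"{text} (?)" if confidence == "low" else text


if __name__ == "__main__":
    print(output_path(Path("Orenda Proposal.pdf")))  # Orenda Proposal-extraction.json
    print(format_cell("Acme Corp", "low"))           # Acme Corp (?)
```

If `discover_documents` returns an empty list, that corresponds to the "No files found" case above: inform the user and stop.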