From 06f0b3b18d4d53309b454be67502db706cace607 Mon Sep 17 00:00:00 2001
From: Bill Ballou
Date: Mon, 2 Mar 2026 23:49:30 -0500
Subject: [PATCH] feat: write SKILL.md with complete extraction pipeline

---
 skills/tabular-extract/SKILL.md | 120 ++++++++++++++++++--------------
 1 file changed, 66 insertions(+), 54 deletions(-)

diff --git a/skills/tabular-extract/SKILL.md b/skills/tabular-extract/SKILL.md
index 7375020..f5924ad 100644
--- a/skills/tabular-extract/SKILL.md
+++ b/skills/tabular-extract/SKILL.md
@@ -1,85 +1,97 @@
 ---
 name: tabular-extract
-description: [TODO: Complete and informative explanation of what the skill does and when to use it. Include WHEN to use this skill - specific scenarios, file types, or tasks that trigger it.]
+description: Extract structured data from document collections into tabular format. Reads PDFs, DOCX, TXT, MD, and JSON files from local paths, infers extraction columns from natural language descriptions, and outputs a markdown table plus a JSON file with values, confidence scores, supporting quotes, and reasoning. Use when the user asks to extract structured data from documents, turn documents into a spreadsheet or table, review or compare multiple documents side by side, or pull specific fields from a set of files.
 ---
 
 # Tabular Extract
 
-## Overview
+Extract structured data from document collections into tabular format.
 
-[TODO: 1-2 sentences explaining what this skill enables]
+## Pipeline
 
-## Structuring This Skill
+This is a rigid, sequential pipeline. Execute every step in order.
 
-[TODO: Choose the structure that best fits this skill's purpose. Common patterns:
+1. **Discover documents** — find files at the user's path
+2. **Read documents** — convert each file to text
+3. **Define schema** — infer extraction columns from user's description
+4. **Extract data** — read each document and extract each column's value
+5. **Output results** — display markdown table and save JSON file
 
-**1. Workflow-Based** (best for sequential processes)
-- Works well when there are clear step-by-step procedures
-- Example: DOCX skill with "Workflow Decision Tree" → "Reading" → "Creating" → "Editing"
-- Structure: ## Overview → ## Workflow Decision Tree → ## Step 1 → ## Step 2...
+## Step 1: Discover Documents
 
-**2. Task-Based** (best for tool collections)
-- Works well when the skill offers different operations/capabilities
-- Example: PDF skill with "Quick Start" → "Merge PDFs" → "Split PDFs" → "Extract Text"
-- Structure: ## Overview → ## Quick Start → ## Task Category 1 → ## Task Category 2...
+Glob the user-provided path for supported file types:
 
-**3. Reference/Guidelines** (best for standards or specifications)
-- Works well for brand guidelines, coding standards, or requirements
-- Example: Brand styling with "Brand Guidelines" → "Colors" → "Typography" → "Features"
-- Structure: ## Overview → ## Guidelines → ## Specifications → ## Usage...
+```bash
+**/*.pdf **/*.docx **/*.txt **/*.md **/*.json
+```
 
-**4. Capabilities-Based** (best for integrated systems)
-- Works well when the skill provides multiple interrelated features
-- Example: Product Management with "Core Capabilities" → numbered capability list
-- Structure: ## Overview → ## Core Capabilities → ### 1. Feature → ### 2. Feature...
+Display the file list and count. Ask the user to confirm before proceeding.
+If no supported files are found, tell the user and stop.
 
-Patterns can be mixed and matched as needed. Most skills combine patterns (e.g., start with task-based, add workflow for complex operations).
+## Step 2: Read Documents
 
-Delete this entire "Structuring This Skill" section when done - it's just guidance.]
+Convert each file to text based on its type:
 
-## [TODO: Replace with the first main section based on chosen structure]
+| Format | Method |
+|--------|--------|
+| .pdf | Use the Read tool with `pages` parameter for large files (>10 pages: read in chunks of 20 pages) |
+| .docx | Run: `python3 ~/.claude/skills/tabular-extract/scripts/convert_docx.py <file>` (requires `pip install python-docx`) |
+| .txt, .md | Use the Read tool directly |
+| .json | Use the Read tool directly |
 
-[TODO: Add content here. See examples in existing skills:
-- Code samples for technical skills
-- Decision trees for complex workflows
-- Concrete examples with realistic user requests
-- References to scripts/templates/references as needed]
+If a file fails to convert, log it as skipped and continue with remaining files. Do not stop the pipeline.
 
-## Resources
+## Step 3: Define Extraction Schema
 
-This skill includes example resource directories that demonstrate how to organize different types of bundled resources:
+The user describes what to extract in natural language.
 
-### scripts/
-Executable code (Python/Bash/etc.) that can be run directly to perform specific operations.
+Infer a structured schema — for each column determine:
+- **name**: Short, descriptive column header
+- **type**: One of `text`, `number`, `date`, `boolean`, `list`
+- **prompt**: Specific extraction instruction
 
-**Examples from other skills:**
-- PDF skill: `fill_fillable_fields.py`, `extract_form_field_info.py` - utilities for PDF manipulation
-- DOCX skill: `document.py`, `utilities.py` - Python modules for document processing
+Present the inferred schema as a table and ask the user to confirm or adjust.
 
-**Appropriate for:** Python scripts, shell scripts, or any executable code that performs automation, data processing, or specific operations.
+Example:
+```
+| # | Column | Type | Extraction Prompt |
+|---|--------|------|-------------------|
+| 1 | Party Name | text | Identify the full legal name of each party to the agreement |
+| 2 | Effective Date | date | What is the effective date of this agreement? |
+| 3 | Contract Value | number | What is the total contract value or consideration amount? |
+```
 
-**Note:** Scripts may be executed without loading into context, but can still be read by Claude for patching or environment adjustments.
+## Step 4: Extract Data
 
-### references/
-Documentation and reference material intended to be loaded into context to inform Claude's process and thinking.
+For each document, read its text and extract every column value.
 
-**Examples from other skills:**
-- Product management: `communication.md`, `context_building.md` - detailed workflow guides
-- BigQuery: API reference documentation and query examples
-- Finance: Schema documentation, company policies
+For each cell, produce:
+- **value** — the extracted data (typed per column type)
+- **confidence** — high, medium, or low
+- **supporting_quote** — exact text from the document
+- **reasoning** — why this value was chosen
 
-**Appropriate for:** In-depth documentation, API references, database schemas, comprehensive guides, or any detailed information that Claude should reference while working.
+See `references/extraction-guide.md` for detailed type handling, confidence criteria, and null value handling.
 
-### assets/
-Files not intended to be loaded into context, but rather used within the output Claude produces.
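 
+A filled cell might look like this (illustrative values only; the authoritative cell schema is the one documented in `references/extraction-guide.md`):
+
+```json
+{
+  "value": "2024-06-01",
+  "confidence": "high",
+  "supporting_quote": "This Agreement is effective as of June 1, 2024.",
+  "reasoning": "The effective date is stated explicitly in the opening clause."
+}
+```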
+## Step 5: Output Results
 
-**Examples from other skills:**
-- Brand styling: PowerPoint template files (.pptx), logo files
-- Frontend builder: HTML/React boilerplate project directories
-- Typography: Font files (.ttf, .woff2)
+**Display a markdown table** in the conversation:
+- One row per document, one column per extraction field
+- Append `(?)` to low-confidence values
+- Truncate values longer than 60 characters with `...`
 
-**Appropriate for:** Templates, boilerplate code, document templates, images, icons, fonts, or any files meant to be copied or used in the final output.
+**Save a JSON file** to `./extraction-results-YYYY-MM-DD.json` in the current working directory.
+- Use the schema documented in `references/extraction-guide.md`
+- Include metadata: timestamp, source path, document count, skipped files
 
----
+**Print a summary:**
+- Documents processed / skipped
+- Confidence distribution (how many high / medium / low extractions)
 
-**Any unneeded directories can be deleted.** Not every skill requires all three types of resources.
+## Error Handling
+
+- **Missing python-docx**: Print install command `pip install python-docx` and ask user to install
+- **Unreadable file**: Skip file, record in skipped list, continue pipeline
+- **Large PDF (>10 pages)**: Read in 20-page chunks, concatenate text
+- **No files found**: Inform user and stop
+- **User cancels at confirmation**: Stop gracefully
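
For reference, the Step 5 save could be sketched in Python roughly as follows. This is a minimal sketch, not part of the skill's scripts: the filename pattern and metadata fields follow the SKILL.md text, but the JSON key names and the `save_results` helper are illustrative assumptions, not the schema defined in `references/extraction-guide.md`.

```python
import json
from datetime import date, datetime, timezone

def save_results(rows, source_path, skipped, out_dir="."):
    """Write extraction results plus run metadata, as Step 5 describes.

    Key names here are assumptions; the authoritative schema lives in
    references/extraction-guide.md.
    """
    payload = {
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "source_path": source_path,
            "document_count": len(rows),
            "skipped_files": skipped,
        },
        "documents": rows,  # one entry per document, one cell per column
    }
    # Filename pattern from Step 5: ./extraction-results-YYYY-MM-DD.json
    out_path = f"{out_dir}/extraction-results-{date.today().isoformat()}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
    return out_path
```

The summary counts (processed, skipped, confidence distribution) can then be computed directly from `payload` before printing.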