feat: write SKILL.md with complete extraction pipeline

2026-03-02 23:49:30 -05:00
parent 2295bfebdc
commit 06f0b3b18d


---
name: tabular-extract
description: Extract structured data from document collections into tabular format. Reads PDFs, DOCX, TXT, MD, and JSON files from local paths, infers extraction columns from natural language descriptions, and outputs a markdown table plus a JSON file with values, confidence scores, supporting quotes, and reasoning. Use when the user asks to extract structured data from documents, turn documents into a spreadsheet or table, review or compare multiple documents side by side, or pull specific fields from a set of files.
---
# Tabular Extract
## Overview
Extract structured data from document collections into tabular format: read supported files from a local path, infer extraction columns from the user's natural-language description, and produce a markdown table plus a JSON results file.
## Pipeline
This is a rigid, sequential pipeline. Execute every step in order.
1. **Discover documents** — find files at the user's path
2. **Read documents** — convert each file to text
3. **Define schema** — infer extraction columns from user's description
4. **Extract data** — read each document and extract each column's value
5. **Output results** — display markdown table and save JSON file
## Step 1: Discover Documents
Glob the user-provided path for supported file types:
```bash
**/*.pdf **/*.docx **/*.txt **/*.md **/*.json
```
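The discovery step above can be sketched in Python (a minimal sketch; the function name is illustrative and not part of the skill's bundled scripts):

```python
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".txt", ".md", ".json"}

def discover_documents(root: str) -> list[Path]:
    """Recursively collect supported files under the user-provided path."""
    return [p for p in sorted(Path(root).rglob("*"))
            if p.is_file() and p.suffix.lower() in SUPPORTED]

# Display the file list and count before asking for confirmation, e.g.:
#   files = discover_documents("./contracts")
#   print(f"Found {len(files)} supported files")
```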
Display the file list and count. Ask the user to confirm before proceeding.
If no supported files are found, tell the user and stop.
## Step 2: Read Documents
Convert each file to text based on its type:
| Format | Method |
|--------|--------|
| .pdf | Use the Read tool with `pages` parameter for large files (>10 pages: read in chunks of 20 pages) |
| .docx | Run: `python3 ~/.claude/skills/tabular-extract/scripts/convert_docx.py <filepath>` (requires `pip install python-docx`) |
| .txt, .md | Use the Read tool directly |
| .json | Use the Read tool directly |
If a file fails to convert, log it as skipped and continue with remaining files. Do not stop the pipeline.
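The read-and-skip behavior can be sketched as follows, assuming plain-text reads for `.txt`/`.md`/`.json`; PDF and DOCX conversion is delegated as described in the table and is only a placeholder here:

```python
from pathlib import Path

TEXT_TYPES = {".txt", ".md", ".json"}

def read_documents(files: list[Path]) -> tuple[dict[str, str], list[str]]:
    """Return ({path: text}, skipped). A failed or unhandled read is
    recorded in the skipped list and the pipeline continues."""
    texts, skipped = {}, []
    for path in files:
        try:
            if path.suffix.lower() in TEXT_TYPES:
                texts[str(path)] = path.read_text(encoding="utf-8")
            else:
                # .pdf/.docx go through the Read tool or convert_docx.py
                # per the table above; not reproduced in this sketch
                skipped.append(str(path))
        except OSError:
            skipped.append(str(path))
    return texts, skipped
```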
## Step 3: Define Extraction Schema
The user describes what to extract in natural language.
Infer a structured schema — for each column determine:
- **name**: Short, descriptive column header
- **type**: One of `text`, `number`, `date`, `boolean`, `list`
- **prompt**: Specific extraction instruction
Present the inferred schema as a table and ask the user to confirm or adjust.
Example:
```
| # | Column | Type | Extraction Prompt |
|---|--------|------|-------------------|
| 1 | Party Name | text | Identify the full legal name of each party to the agreement |
| 2 | Effective Date | date | What is the effective date of this agreement? |
| 3 | Contract Value | number | What is the total contract value or consideration amount? |
```
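One way to hold the inferred schema in code (a sketch; the `Column` dataclass is an illustrative assumption, not a type bundled with the skill):

```python
from dataclasses import dataclass

@dataclass
class Column:
    """One inferred extraction column."""
    name: str    # short, descriptive header, e.g. "Effective Date"
    type: str    # one of: text, number, date, boolean, list
    prompt: str  # specific extraction instruction posed to each document

schema = [
    Column("Party Name", "text",
           "Identify the full legal name of each party to the agreement"),
    Column("Effective Date", "date",
           "What is the effective date of this agreement?"),
]
```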
## Step 4: Extract Data
For each document, read its text and extract every column value.
For each cell, produce:
- **value** — the extracted data (typed per column type)
- **confidence** — high, medium, or low
- **supporting_quote** — exact text from the document
- **reasoning** — why this value was chosen
See `references/extraction-guide.md` for detailed type handling, confidence criteria, and null value handling.
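The four cell fields can be assembled with a small helper (a sketch; the authoritative field names and JSON schema are those in `references/extraction-guide.md`):

```python
def make_cell(value, confidence: str, supporting_quote: str, reasoning: str) -> dict:
    """Build one extracted cell, validating the confidence level."""
    if confidence not in {"high", "medium", "low"}:
        raise ValueError(f"confidence must be high/medium/low, got {confidence!r}")
    return {
        "value": value,
        "confidence": confidence,
        "supporting_quote": supporting_quote,
        "reasoning": reasoning,
    }
```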
## Step 5: Output Results
**Display a markdown table** in the conversation:
- One row per document, one column per extraction field
- Append `(?)` to low-confidence values
- Truncate values longer than 60 characters with `...`
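The truncation and low-confidence markers can be sketched as a formatting helper (illustrative; exact handling of list values is an assumption):

```python
def format_cell(value, confidence: str, max_len: int = 60) -> str:
    """Render one table cell per the display rules above."""
    text = ", ".join(map(str, value)) if isinstance(value, list) else str(value)
    if len(text) > max_len:
        text = text[:max_len] + "..."
    if confidence == "low":
        text += " (?)"
    return text
```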
**Save a JSON file** to `./extraction-results-YYYY-MM-DD.json` in the current working directory.
- Use the schema documented in `references/extraction-guide.md`
- Include metadata: timestamp, source path, document count, skipped files
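Saving the results file might look like this sketch (field names are assumptions beyond those listed above; the real output schema is defined in `references/extraction-guide.md`):

```python
import json
from datetime import date, datetime
from pathlib import Path

def save_results(rows: list[dict], source_path: str, skipped: list[str]) -> Path:
    """Write results plus metadata to ./extraction-results-YYYY-MM-DD.json."""
    out = Path(f"./extraction-results-{date.today():%Y-%m-%d}.json")
    payload = {
        "metadata": {
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "source_path": source_path,
            "document_count": len(rows),
            "skipped_files": skipped,
        },
        "rows": rows,
    }
    out.write_text(json.dumps(payload, indent=2))
    return out
```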
**Print a summary:**
- Documents processed / skipped
- Confidence distribution (how many high / medium / low extractions)
## Error Handling
- **Missing python-docx**: Print the install command (`pip install python-docx`) and ask the user to install it
- **Unreadable file**: Skip file, record in skipped list, continue pipeline
- **Large PDF (>10 pages)**: Read in 20-page chunks, concatenate text
- **No files found**: Inform user and stop
- **User cancels at confirmation**: Stop gracefully
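The large-PDF rule above can be sketched as a page-range generator (illustrative helper, 1-indexed inclusive ranges):

```python
def page_chunks(num_pages: int, chunk: int = 20):
    """Yield (start, end) page ranges for reading a large PDF in chunks."""
    for start in range(1, num_pages + 1, chunk):
        yield start, min(start + chunk - 1, num_pages)
```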