feat: write SKILL.md with complete extraction pipeline
---
name: tabular-extract
description: Extract structured data from document collections into tabular format. Reads PDFs, DOCX, TXT, MD, and JSON files from local paths, infers extraction columns from natural language descriptions, and outputs a markdown table plus a JSON file with values, confidence scores, supporting quotes, and reasoning. Use when the user asks to extract structured data from documents, turn documents into a spreadsheet or table, review or compare multiple documents side by side, or pull specific fields from a set of files.
---

# Tabular Extract
## Overview

Extract structured data from document collections into tabular format.

## Pipeline

This is a rigid, sequential pipeline. Execute every step in order.

1. **Discover documents** — find files at the user's path
2. **Read documents** — convert each file to text
3. **Define schema** — infer extraction columns from user's description
4. **Extract data** — read each document and extract each column's value
5. **Output results** — display markdown table and save JSON file
## Step 1: Discover Documents

Glob the user-provided path for supported file types:

```bash
**/*.pdf **/*.docx **/*.txt **/*.md **/*.json
```

Display the file list and count. Ask the user to confirm before proceeding.

If no supported files are found, tell the user and stop.
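Applied on its own, the discovery step amounts to a recursive glob filtered by extension. A minimal Python sketch — the `discover_documents` helper and `SUPPORTED_EXTENSIONS` set are illustrative names, not part of the skill's bundled scripts:

```python
from pathlib import Path

# Extensions the pipeline supports, per the glob pattern above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".json"}

def discover_documents(root: str) -> list[Path]:
    """Recursively collect supported files under the user-provided path."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )

files = discover_documents(".")
print(f"Found {len(files)} supported file(s):")
for f in files:
    print(f"  {f}")
```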
## Step 2: Read Documents

Convert each file to text based on its type:

| Format | Method |
|--------|--------|
| .pdf | Use the Read tool; for large files (>10 pages), pass the `pages` parameter and read in chunks of 20 pages |
| .docx | Run `python3 ~/.claude/skills/tabular-extract/scripts/convert_docx.py <filepath>` (requires `pip install python-docx`) |
| .txt, .md | Use the Read tool directly |
| .json | Use the Read tool directly |

If a file fails to convert, log it as skipped and continue with the remaining files. Do not stop the pipeline.
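The read step, with its skip-on-failure rule, can be sketched as a dispatch on file extension. This is a hypothetical outline only: `.txt`/`.md`/`.json` use plain reads, while `.pdf` and `.docx` are delegated to the Read tool and `convert_docx.py` in the actual pipeline:

```python
import json
from pathlib import Path

def read_document(path: Path) -> str:
    """Convert one file to text; raise on anything unreadable."""
    suffix = path.suffix.lower()
    if suffix in (".txt", ".md"):
        return path.read_text(encoding="utf-8")
    if suffix == ".json":
        # Parse and re-serialize so malformed JSON fails here and gets skipped.
        return json.dumps(json.loads(path.read_text(encoding="utf-8")), indent=2)
    # .pdf and .docx are handled by the Read tool / convert_docx.py script.
    raise ValueError(f"{suffix} requires an external converter")

texts: dict[str, str] = {}
skipped: list[str] = []
for path in [Path("notes.md"), Path("report.pdf")]:  # hypothetical file list
    try:
        texts[path.name] = read_document(path)
    except Exception:
        skipped.append(path.name)  # log and continue; never stop the pipeline
```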
|
||||||
- Code samples for technical skills
|
|
||||||
- Decision trees for complex workflows
|
|
||||||
- Concrete examples with realistic user requests
|
|
||||||
- References to scripts/templates/references as needed]
|
|
||||||
|
|
||||||
## Resources
|
## Step 3: Define Extraction Schema
|
||||||
|
|
||||||
This skill includes example resource directories that demonstrate how to organize different types of bundled resources:
|
The user describes what to extract in natural language.
|
||||||
|
|
||||||
### scripts/
|
Infer a structured schema — for each column determine:
|
||||||
Executable code (Python/Bash/etc.) that can be run directly to perform specific operations.
|
- **name**: Short, descriptive column header
|
||||||
|
- **type**: One of `text`, `number`, `date`, `boolean`, `list`
|
||||||
|
- **prompt**: Specific extraction instruction
|
||||||
|
|
||||||
**Examples from other skills:**
|
Present the inferred schema as a table and ask the user to confirm or adjust.
Example:

| # | Column | Type | Extraction Prompt |
|---|--------|------|-------------------|
| 1 | Party Name | text | Identify the full legal name of each party to the agreement |
| 2 | Effective Date | date | What is the effective date of this agreement? |
| 3 | Contract Value | number | What is the total contract value or consideration amount? |
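One way to picture the confirmed schema is as a small list of column records. The dict shape below is a hypothetical illustration of the name/type/prompt attributes, not a format mandated by the skill:

```python
# Hypothetical in-memory form of a confirmed schema.
schema = [
    {"name": "Party Name", "type": "text",
     "prompt": "Identify the full legal name of each party to the agreement"},
    {"name": "Effective Date", "type": "date",
     "prompt": "What is the effective date of this agreement?"},
    {"name": "Contract Value", "type": "number",
     "prompt": "What is the total contract value or consideration amount?"},
]

# Every column type must come from the fixed set defined above.
ALLOWED_TYPES = {"text", "number", "date", "boolean", "list"}
assert all(col["type"] in ALLOWED_TYPES for col in schema)
```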
## Step 4: Extract Data

For each document, read its text and extract every column value.

For each cell, produce:

- **value** — the extracted data (typed per column type)
- **confidence** — high, medium, or low
- **supporting_quote** — exact text from the document
- **reasoning** — why this value was chosen

See `references/extraction-guide.md` for detailed type handling, confidence criteria, and null value handling.
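A single cell can be pictured as a four-field record. The shape below is an illustration of the attributes listed above, with invented example values; the authoritative JSON layout lives in `references/extraction-guide.md`:

```python
# Hypothetical cell record for a "Contract Value" column on one document.
cell = {
    "value": 250000,                # typed per the column type (number)
    "confidence": "high",           # one of: high, medium, low
    "supporting_quote": "for a total consideration of $250,000",
    "reasoning": "The payment clause states the total amount directly.",
}
assert cell["confidence"] in {"high", "medium", "low"}
```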
## Step 5: Output Results

**Display a markdown table** in the conversation:

- One row per document, one column per extraction field
- Append `(?)` to low-confidence values
- Truncate values longer than 60 characters with `...`

**Save a JSON file** to `./extraction-results-YYYY-MM-DD.json` in the current working directory:

- Use the schema documented in `references/extraction-guide.md`
- Include metadata: timestamp, source path, document count, skipped files

**Print a summary:**

- Documents processed / skipped
- Confidence distribution (how many high / medium / low extractions)
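The output step can be sketched end to end. The row/cell layout here is illustrative (the authoritative JSON schema is in `references/extraction-guide.md`), and the sketch writes to a temp directory rather than the skill's `./extraction-results-YYYY-MM-DD.json` working-directory path:

```python
import json
import os
import tempfile
from collections import Counter
from datetime import date

# Hypothetical extraction results for one processed document.
rows = [
    {"document": "msa.pdf",
     "cells": {"Effective Date": {"value": "2024-01-15", "confidence": "high"},
               "Contract Value": {"value": None, "confidence": "low"}}},
]
results = {
    "metadata": {
        "timestamp": date.today().isoformat(),
        "source_path": "./contracts",
        "document_count": len(rows),
        "skipped_files": [],
    },
    "rows": rows,
}

# The real skill saves to ./extraction-results-YYYY-MM-DD.json in the cwd.
out_path = os.path.join(
    tempfile.gettempdir(), f"extraction-results-{date.today().isoformat()}.json")
with open(out_path, "w", encoding="utf-8") as fh:
    json.dump(results, fh, indent=2)

# Summary: documents processed/skipped plus the confidence distribution.
counts = Counter(c["confidence"] for row in rows for c in row["cells"].values())
print(f"Processed {len(rows)} document(s), skipped 0; confidence: {dict(counts)}")
```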
## Error Handling

- **Missing python-docx**: print the install command `pip install python-docx` and ask the user to install it
- **Unreadable file**: skip the file, record it in the skipped list, and continue the pipeline
- **Large PDF (>10 pages)**: read in 20-page chunks and concatenate the text
- **No files found**: inform the user and stop
- **User cancels at confirmation**: stop gracefully