diff --git a/skills/tabular-extract/references/extraction-guide.md b/skills/tabular-extract/references/extraction-guide.md
new file mode 100644
index 0000000..776a9be
--- /dev/null
+++ b/skills/tabular-extract/references/extraction-guide.md
@@ -0,0 +1,94 @@
+# Extraction Guide
+
+## Extraction Prompt Template
+
+For each document x column, use this reasoning structure:
+
+1. Read the document text carefully
+2. Locate text relevant to the extraction prompt
+3. Extract the value, noting its exact location
+4. Assess confidence based on clarity of the source text
+
+## Per-Cell Output Structure
+
+For each extraction, produce a JSON object:
+
+```json
+{
+  "value": "",
+  "confidence": "high | medium | low",
+  "supporting_quote": "",
+  "reasoning": "<1-2 sentences explaining why this value was chosen>"
+}
+```
+
+### Confidence Levels
+
+- **high**: Value is explicitly stated, unambiguous, directly answers the prompt
+- **medium**: Value is implied or requires minor inference, or multiple possible values exist
+- **low**: Value is uncertain, requires significant inference, or the document may not contain the answer
+
+### Type Handling
+
+| Column Type | Value Format | Example |
+|-------------|-------------|---------|
+| text | Plain string | "Acme Corporation" |
+| number | Numeric value (no currency symbols) | 500000 |
+| date | ISO 8601 format (YYYY-MM-DD) | "2024-01-15" |
+| boolean | true or false | true |
+| list | JSON array of strings | ["item1", "item2"] |
+
+### When a Value Cannot Be Found
+
+If the document does not contain information for a column:
+- Set value to null
+- Set confidence to "low"
+- Set supporting_quote to ""
+- Set reasoning to explain why the value could not be found
+
+## Full Output JSON Schema
+
+```json
+{
+  "extraction": {
+    "created": "ISO 8601 timestamp",
+    "source_directory": "/absolute/path/to/docs",
+    "documents_processed": 0,
+    "documents_skipped": [],
+    "columns": [
+      {
+        "name": "Column Name",
+        "type": "text|number|date|boolean|list",
+        "prompt": "The extraction prompt used"
+      }
+    ],
+    "results": [
+      {
+        "document": "filename.pdf",
+        "fields": {
+          "Column Name": {
+            "value": "extracted value",
+            "confidence": "high|medium|low",
+            "supporting_quote": "exact text from document",
+            "reasoning": "explanation"
+          }
+        }
+      }
+    ]
+  }
+}
+```
+
+## Markdown Table Format
+
+Display results as a pipe-delimited markdown table.
+Append `(?)` to low-confidence values.
+Truncate cell values longer than 60 characters with `...`.
+
+Example:
+```
+| Document | Party Name | Date | Amount |
+|----------|-----------|------|--------|
+| contract1.pdf | Acme Corp | 2024-01-15 | 500000 |
+| contract2.pdf | Beta LLC(?) | 2024-03-22 | 1200000 |
+```