feat: add extraction guide reference document

2026-03-02 23:46:41 -05:00
parent 562868e7d8
commit 2295bfebdc

@@ -0,0 +1,94 @@
# Extraction Guide
## Extraction Prompt Template
For each document × column pair (one cell), use this reasoning structure:
1. Read the document text carefully
2. Locate text relevant to the extraction prompt
3. Extract the value, noting its exact location
4. Assess confidence based on clarity of the source text
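The four steps above could be packaged into a per-cell prompt along these lines. This is only a sketch: the function name `build_cell_prompt`, its parameters, and the exact prompt layout are illustrative assumptions, not part of the guide.

```python
# Hypothetical helper: assembles a per-cell prompt from the four reasoning steps.
REASONING_STEPS = [
    "Read the document text carefully",
    "Locate text relevant to the extraction prompt",
    "Extract the value, noting its exact location",
    "Assess confidence based on clarity of the source text",
]


def build_cell_prompt(document_text, column_name, column_type, extraction_prompt):
    """Return a prompt string for one document x column pair (layout is assumed)."""
    steps = "\n".join(
        f"{number}. {step}" for number, step in enumerate(REASONING_STEPS, start=1)
    )
    return (
        f"Column: {column_name} (type: {column_type})\n"
        f"Extraction prompt: {extraction_prompt}\n\n"
        f"Follow these steps:\n{steps}\n\n"
        f"Document text:\n{document_text}"
    )


if __name__ == "__main__":
    print(build_cell_prompt("Acme Corporation agrees to ...", "Party Name", "text",
                            "Who is the counterparty?"))
```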
## Per-Cell Output Structure
For each extraction, produce a JSON object:
```json
{
  "value": "<extracted value, typed appropriately>",
  "confidence": "high | medium | low",
  "supporting_quote": "<exact text from the document that supports this value>",
  "reasoning": "<1-2 sentences explaining why this value was chosen>"
}
```
### Confidence Levels
- **high**: Value is explicitly stated, unambiguous, directly answers the prompt
- **medium**: Value is implied or requires minor inference, or multiple possible values exist
- **low**: Value is uncertain, requires significant inference, or the document may not contain the answer
### Type Handling
| Column Type | Value Format | Example |
|-------------|-------------|---------|
| text | Plain string | "Acme Corporation" |
| number | Numeric value (no currency symbols) | 500000 |
| date | ISO 8601 format (YYYY-MM-DD) | "2024-01-15" |
| boolean | true or false | true |
| list | JSON array of strings | ["item1", "item2"] |
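One way to enforce the formats in the table is a small normalizer applied before a value is written to a cell. The sketch below is an assumption about how that might look: `coerce_value` is a hypothetical name, and the choices to strip `$`, `€`, `£` and thousands separators, and to reject anything but `true`/`false` for booleans, go beyond what the guide specifies.

```python
from datetime import date


def coerce_value(raw, column_type):
    """Normalize a raw extracted value to the format listed for its column type."""
    if raw is None:
        return None  # missing values stay null
    if column_type == "text":
        return str(raw)
    if column_type == "number":
        if isinstance(raw, bool):
            raise ValueError("boolean is not a number")
        if isinstance(raw, (int, float)):
            return raw
        # Assumed cleanup: drop thousands separators and a leading currency symbol.
        cleaned = str(raw).replace(",", "").strip().lstrip("$€£").strip()
        number = float(cleaned)
        return int(number) if number.is_integer() else number
    if column_type == "date":
        # Raises ValueError if the input is not a valid ISO 8601 calendar date.
        return date.fromisoformat(str(raw)).isoformat()
    if column_type == "boolean":
        if isinstance(raw, bool):
            return raw
        text = str(raw).strip().lower()
        if text in ("true", "false"):
            return text == "true"
        raise ValueError(f"not a boolean: {raw!r}")
    if column_type == "list":
        if isinstance(raw, (list, tuple)):
            return [str(item) for item in raw]
        raise ValueError(f"not a list: {raw!r}")
    raise ValueError(f"unknown column type: {column_type!r}")


if __name__ == "__main__":
    print(coerce_value("$500,000", "number"))
    print(coerce_value("2024-01-15", "date"))
```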
### When a Value Cannot Be Found
If the document does not contain information for a column:
- Set `value` to `null`
- Set `confidence` to `"low"`
- Set `supporting_quote` to `""`
- Set `reasoning` to explain why the value could not be found
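As an illustration of the per-cell structure and the not-found rules, the following sketch builds both kinds of cell. The helper names `make_cell` and `not_found_cell` are hypothetical; only the four keys and the not-found values come from this guide.

```python
import json

CONFIDENCE_LEVELS = {"high", "medium", "low"}


def make_cell(value, confidence, supporting_quote, reasoning):
    """Build one per-cell object with the four keys described in this guide."""
    if confidence not in CONFIDENCE_LEVELS:
        raise ValueError(f"unknown confidence level: {confidence!r}")
    return {
        "value": value,
        "confidence": confidence,
        "supporting_quote": supporting_quote,
        "reasoning": reasoning,
    }


def not_found_cell(reasoning):
    """Cell for a column the document has no information for."""
    # None serializes to JSON null; the quote is an empty string.
    return make_cell(None, "low", "", reasoning)


if __name__ == "__main__":
    cell = not_found_cell("The document does not mention a termination date.")
    print(json.dumps(cell, indent=2))
```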
## Full Output JSON Schema
```json
{
  "extraction": {
    "created": "ISO 8601 timestamp",
    "source_directory": "/absolute/path/to/docs",
    "documents_processed": 0,
    "documents_skipped": [],
    "columns": [
      {
        "name": "Column Name",
        "type": "text|number|date|boolean|list",
        "prompt": "The extraction prompt used"
      }
    ],
    "results": [
      {
        "document": "filename.pdf",
        "fields": {
          "Column Name": {
            "value": "extracted value",
            "confidence": "high|medium|low",
            "supporting_quote": "exact text from document",
            "reasoning": "explanation"
          }
        }
      }
    ]
  }
}
```
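A minimal sketch of assembling this envelope in Python follows. The function name `build_extraction` is hypothetical, and two behaviors are assumptions rather than rules from this guide: `documents_processed` is taken to be the number of entries in `results`, and `documents_skipped` is taken to be a list of filenames.

```python
import json
from datetime import datetime, timezone


def build_extraction(source_directory, columns, results, skipped=()):
    """Wrap columns and per-document results in the top-level output object."""
    return {
        "extraction": {
            # Timezone-aware UTC timestamp in ISO 8601 form.
            "created": datetime.now(timezone.utc).isoformat(),
            "source_directory": source_directory,
            # Assumption: one results entry per processed document.
            "documents_processed": len(results),
            # Assumption: skipped documents are recorded by filename.
            "documents_skipped": list(skipped),
            "columns": columns,
            "results": results,
        }
    }


if __name__ == "__main__":
    columns = [{"name": "Party Name", "type": "text", "prompt": "Who is the counterparty?"}]
    results = [{
        "document": "contract1.pdf",
        "fields": {
            "Party Name": {
                "value": "Acme Corp",
                "confidence": "high",
                "supporting_quote": "between Acme Corp and",
                "reasoning": "Named explicitly as a party.",
            }
        },
    }]
    print(json.dumps(build_extraction("/absolute/path/to/docs", columns, results), indent=2))
```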
## Markdown Table Format
Display results as a pipe-delimited markdown table.
Append `(?)` to low-confidence values.
Truncate cell values longer than 60 characters with `...`.
Example:
```
| Document | Party Name | Date | Amount |
|----------|-----------|------|--------|
| contract1.pdf | Acme Corp | 2024-01-15 | 500000 |
| contract2.pdf | Beta LLC(?) | 2024-03-22 | 1200000 |
```
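The display rules above can be sketched as a small renderer. The names `format_cell` and `render_table` are hypothetical, and a few details are assumptions: truncation keeps the first 60 characters before appending `...`, `null` values render as an empty cell, lists are joined with commas, and the `(?)` marker is added after any truncation.

```python
MAX_CELL_LENGTH = 60


def format_cell(cell):
    """Render one per-cell object as table text."""
    value = cell["value"]
    if value is None:
        text = ""
    elif isinstance(value, bool):
        text = "true" if value else "false"
    elif isinstance(value, list):
        text = ", ".join(str(item) for item in value)
    else:
        text = str(value)
    if len(text) > MAX_CELL_LENGTH:
        text = text[:MAX_CELL_LENGTH] + "..."
    if cell["confidence"] == "low":
        text += "(?)"  # flag low-confidence values
    return text


def render_table(column_names, results):
    """Build a pipe-delimited markdown table, one row per document."""
    header = ["Document"] + list(column_names)
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    for result in results:
        cells = [format_cell(result["fields"][name]) for name in column_names]
        lines.append("| " + " | ".join([result["document"]] + cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    rows = [{
        "document": "contract2.pdf",
        "fields": {
            "Party Name": {"value": "Beta LLC", "confidence": "low"},
            "Amount": {"value": 1200000, "confidence": "high"},
        },
    }]
    print(render_table(["Party Name", "Amount"], rows))
```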