feat: add extraction guide reference document

2026-03-02 23:46:41 -05:00
parent 562868e7d8
commit 2295bfebdc
1 changed files with 94 additions and 0 deletions
@@ -0,0 +1,94 @@
+# Extraction Guide
+
+## Extraction Prompt Template
+
+For each document x column, use this reasoning structure:
+
+1. Read the document text carefully
+2. Locate text relevant to the extraction prompt
+3. Extract the value, noting its exact location
+4. Assess confidence based on clarity of the source text
+
+## Per-Cell Output Structure
+
+For each extraction, produce a JSON object:
+
+```json
+{
+  "value": "<extracted value, typed appropriately>",
+  "confidence": "high | medium | low",
+  "supporting_quote": "<exact text from the document that supports this value>",
+  "reasoning": "<1-2 sentences explaining why this value was chosen>"
+}
+```
+
+### Confidence Levels
+
+- **high**: Value is explicitly stated, unambiguous, directly answers the prompt
+- **medium**: Value is implied or requires minor inference, or multiple possible values exist
+- **low**: Value is uncertain, requires significant inference, or the document may not contain the answer
+
+### Type Handling
+
+| Column Type | Value Format | Example |
+|-------------|-------------|---------|
+| text | Plain string | "Acme Corporation" |
+| number | Numeric value (no currency symbols) | 500000 |
+| date | ISO 8601 format (YYYY-MM-DD) | "2024-01-15" |
+| boolean | true or false | true |
+| list | JSON array of strings | ["item1", "item2"] |
+
+### When a Value Cannot Be Found
+
+If the document does not contain information for a column:
+- Set value to null
+- Set confidence to "low"
+- Set supporting_quote to ""
+- Set reasoning to explain why the value could not be found
+
+## Full Output JSON Schema
+
+```json
+{
+  "extraction": {
+    "created": "ISO 8601 timestamp",
+    "source_directory": "/absolute/path/to/docs",
+    "documents_processed": 0,
+    "documents_skipped": [],
+    "columns": [
+      {
+        "name": "Column Name",
+        "type": "text|number|date|boolean|list",
+        "prompt": "The extraction prompt used"
+      }
+    ],
+    "results": [
+      {
+        "document": "filename.pdf",
+        "fields": {
+          "Column Name": {
+            "value": "extracted value",
+            "confidence": "high|medium|low",
+            "supporting_quote": "exact text from document",
+            "reasoning": "explanation"
+          }
+        }
+      }
+    ]
+  }
+}
+```
+
+## Markdown Table Format
+
+Display results as a pipe-delimited markdown table.
+Append `(?)` to low-confidence values.
+Truncate cell values longer than 60 characters with `...`.
+
+Example:
+```
+| Document | Party Name | Date | Amount |
+|----------|-----------|------|--------|
+| contract1.pdf | Acme Corp | 2024-01-15 | 500000 |
+| contract2.pdf | Beta LLC(?) | 2024-03-22 | 1200000 |
+```