Files
AI-Trader/docs/api-specification.md
Bill fb9583b374 feat: transform to REST API service with SQLite persistence (v0.3.0)
Major architecture transformation from batch-only to API service with
database persistence for Windmill integration.

## REST API Implementation
- POST /simulate/trigger - Start simulation jobs
- GET /simulate/status/{job_id} - Monitor job progress
- GET /results - Query results with filters (job_id, date, model)
- GET /health - Service health checks

## Database Layer
- SQLite persistence with 6 tables (jobs, job_details, positions,
  holdings, reasoning_logs, tool_usage)
- Foreign key constraints with cascade deletes
- Replaces JSONL file storage

## Backend Components
- JobManager: Job lifecycle management with concurrency control
- RuntimeConfigManager: Thread-safe isolated runtime configs
- ModelDayExecutor: Single model-day execution engine
- SimulationWorker: Date-sequential, model-parallel orchestration

## Testing
- 102 unit and integration tests (85% coverage)
- Database: 98% coverage
- Job manager: 98% coverage
- API endpoints: 81% coverage
- Pydantic models: 100% coverage
- TDD approach throughout

## Docker Deployment
- Dual-mode: API server (persistent) + batch (one-time)
- Health checks with 30s interval
- Volume persistence for database and logs
- Separate entrypoints for each mode

## Validation Tools
- scripts/validate_docker_build.sh - Build validation
- scripts/test_api_endpoints.sh - Complete API testing
- scripts/test_batch_mode.sh - Batch mode validation
- DOCKER_API.md - Deployment guide
- TESTING_GUIDE.md - Testing procedures

## Configuration
- API_PORT environment variable (default: 8080)
- Backwards compatible with existing configs
- FastAPI, uvicorn, pydantic>=2.0 dependencies

Co-Authored-By: AI Assistant <noreply@example.com>
2025-10-31 11:47:10 -04:00

838 lines
21 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# AI-Trader API Service - Technical Specification
## 1. API Endpoints Specification
### 1.1 POST /simulate/trigger
**Purpose:** Trigger a catch-up simulation from the last completed date to the most recent trading day.
**Request:**
```http
POST /simulate/trigger HTTP/1.1
Content-Type: application/json
```
**Response (202 Accepted):**
```json
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "accepted",
"date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"created_at": "2025-01-20T14:30:00Z",
"message": "Simulation job queued successfully"
}
```
**Response (200 OK - Job Already Running):**
```json
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "running",
"date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"progress": {
"total_model_days": 6,
"completed": 3,
"failed": 0,
"current": {
"date": "2025-01-17",
"model": "gpt-5"
}
},
"created_at": "2025-01-20T14:25:00Z",
"message": "Simulation already in progress"
}
```
**Response (200 OK - Already Up To Date):**
```json
{
"status": "current",
"message": "Simulation already up-to-date",
"last_simulation_date": "2025-01-20",
"next_trading_day": "2025-01-21"
}
```
**Response (409 Conflict):**
```json
{
"error": "conflict",
"message": "Different simulation already running",
"current_job_id": "previous-job-uuid",
"current_date_range": ["2025-01-10", "2025-01-15"]
}
```
**Business Logic:**
1. Load configuration from `config_path` (or default)
2. Determine last completed date from each model's `position.jsonl`
3. Calculate date range: `max(last_dates) + 1 day``most_recent_trading_day`
4. Filter for weekdays only (Monday-Friday)
5. If date_range is empty, return "already up-to-date"
6. Check for existing jobs with same date range → return existing job
7. Check for running jobs with different date range → return 409
8. Create new job in SQLite with status=`pending`
9. Queue background task to execute simulation
10. Return 202 with job details
---
### 1.2 GET /simulate/status/{job_id}
**Purpose:** Poll the status and progress of a simulation job.
**Request:**
```http
GET /simulate/status/550e8400-e29b-41d4-a716-446655440000 HTTP/1.1
```
**Response (200 OK - Running):**
```json
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "running",
"date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"progress": {
"total_model_days": 6,
"completed": 3,
"failed": 0,
"current": {
"date": "2025-01-17",
"model": "gpt-5"
},
"details": [
{"date": "2025-01-16", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 45.2},
{"date": "2025-01-16", "model": "gpt-5", "status": "completed", "duration_seconds": 38.7},
{"date": "2025-01-17", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 42.1},
{"date": "2025-01-17", "model": "gpt-5", "status": "running", "duration_seconds": null}
]
},
"created_at": "2025-01-20T14:25:00Z",
"updated_at": "2025-01-20T14:27:15Z"
}
```
**Response (200 OK - Completed):**
```json
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"progress": {
"total_model_days": 6,
"completed": 6,
"failed": 0,
"details": [
{"date": "2025-01-16", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 45.2},
{"date": "2025-01-16", "model": "gpt-5", "status": "completed", "duration_seconds": 38.7},
{"date": "2025-01-17", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 42.1},
{"date": "2025-01-17", "model": "gpt-5", "status": "completed", "duration_seconds": 40.3},
{"date": "2025-01-20", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 43.8},
{"date": "2025-01-20", "model": "gpt-5", "status": "completed", "duration_seconds": 39.1}
]
},
"created_at": "2025-01-20T14:25:00Z",
"completed_at": "2025-01-20T14:29:45Z",
"total_duration_seconds": 285.0
}
```
**Response (200 OK - Partial Failure):**
```json
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "partial",
"date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"progress": {
"total_model_days": 6,
"completed": 4,
"failed": 2,
"details": [
{"date": "2025-01-16", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 45.2},
{"date": "2025-01-16", "model": "gpt-5", "status": "completed", "duration_seconds": 38.7},
{"date": "2025-01-17", "model": "claude-3.7-sonnet", "status": "failed", "error": "MCP service timeout after 3 retries", "duration_seconds": null},
{"date": "2025-01-17", "model": "gpt-5", "status": "completed", "duration_seconds": 40.3},
{"date": "2025-01-20", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 43.8},
{"date": "2025-01-20", "model": "gpt-5", "status": "failed", "error": "AI model API timeout", "duration_seconds": null}
]
},
"created_at": "2025-01-20T14:25:00Z",
"completed_at": "2025-01-20T14:29:45Z"
}
```
**Response (404 Not Found):**
```json
{
"error": "not_found",
"message": "Job not found",
"job_id": "invalid-job-id"
}
```
**Business Logic:**
1. Query SQLite jobs table for job_id
2. If not found, return 404
3. Return job metadata + progress from job_details table
4. Status transitions: `pending``running``completed`/`partial`/`failed`
---
### 1.3 GET /simulate/current
**Purpose:** Get the most recent simulation job (for Windmill to discover job_id).
**Request:**
```http
GET /simulate/current HTTP/1.1
```
**Response (200 OK):**
```json
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "running",
"date_range": ["2025-01-16", "2025-01-17"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"progress": {
"total_model_days": 4,
"completed": 2,
"failed": 0
},
"created_at": "2025-01-20T14:25:00Z"
}
```
**Response (404 Not Found):**
```json
{
"error": "not_found",
"message": "No simulation jobs found"
}
```
**Business Logic:**
1. Query SQLite: `SELECT * FROM jobs ORDER BY created_at DESC LIMIT 1`
2. Return job details with progress summary
---
### 1.4 GET /results
**Purpose:** Retrieve simulation results for a specific date and model.
**Request:**
```http
GET /results?date=2025-01-15&model=gpt-5&detail=minimal HTTP/1.1
```
**Query Parameters:**
- `date` (required): Trading date in YYYY-MM-DD format
- `model` (optional): Model signature (if omitted, returns all models)
- `detail` (optional): Response detail level
- `minimal` (default): Positions + daily P&L
- `full`: + trade history + AI reasoning logs + tool usage stats
**Response (200 OK - minimal):**
```json
{
"date": "2025-01-15",
"results": [
{
"model": "gpt-5",
"positions": {
"AAPL": 10,
"MSFT": 5,
"NVDA": 0,
"CASH": 8500.00
},
"daily_pnl": {
"profit": 150.50,
"return_pct": 1.5,
"portfolio_value": 10150.50
}
}
]
}
```
**Response (200 OK - full):**
```json
{
"date": "2025-01-15",
"results": [
{
"model": "gpt-5",
"positions": {
"AAPL": 10,
"MSFT": 5,
"CASH": 8500.00
},
"daily_pnl": {
"profit": 150.50,
"return_pct": 1.5,
"portfolio_value": 10150.50
},
"trades": [
{
"id": 1,
"action": "buy",
"symbol": "AAPL",
"amount": 10,
"price": 255.88,
"total": 2558.80
}
],
"ai_reasoning": {
"total_steps": 15,
"stop_signal_received": true,
"reasoning_summary": "Market analysis indicated strong buy signal for AAPL...",
"tool_usage": {
"search": 3,
"get_price": 5,
"math": 2,
"trade": 1
}
},
"log_file_path": "data/agent_data/gpt-5/log/2025-01-15/log.jsonl"
}
]
}
```
**Response (400 Bad Request):**
```json
{
"error": "invalid_date",
"message": "Date must be in YYYY-MM-DD format"
}
```
**Response (404 Not Found):**
```json
{
"error": "no_data",
"message": "No simulation data found for date 2025-01-15 and model gpt-5"
}
```
**Business Logic:**
1. Validate date format
2. Read `position.jsonl` for specified model(s) and date
3. For `detail=minimal`: Return positions + calculate daily P&L
4. For `detail=full`:
- Parse `log.jsonl` to extract reasoning summary
- Count tool usage from log messages
- Extract trades from position file
5. Return aggregated results
---
### 1.5 GET /health
**Purpose:** Health check endpoint for Docker and monitoring.
**Request:**
```http
GET /health HTTP/1.1
```
**Response (200 OK):**
```json
{
"status": "healthy",
"timestamp": "2025-01-20T14:30:00Z",
"services": {
"mcp_math": {"status": "up", "url": "http://localhost:8000/mcp"},
"mcp_search": {"status": "up", "url": "http://localhost:8001/mcp"},
"mcp_trade": {"status": "up", "url": "http://localhost:8002/mcp"},
"mcp_getprice": {"status": "up", "url": "http://localhost:8003/mcp"}
},
"storage": {
"data_directory": "/app/data",
"writable": true,
"free_space_mb": 15234
},
"database": {
"status": "connected",
"path": "/app/data/jobs.db"
}
}
```
**Response (503 Service Unavailable):**
```json
{
"status": "unhealthy",
"timestamp": "2025-01-20T14:30:00Z",
"services": {
"mcp_math": {"status": "down", "url": "http://localhost:8000/mcp", "error": "Connection refused"},
"mcp_search": {"status": "up", "url": "http://localhost:8001/mcp"},
"mcp_trade": {"status": "up", "url": "http://localhost:8002/mcp"},
"mcp_getprice": {"status": "up", "url": "http://localhost:8003/mcp"}
},
"storage": {
"data_directory": "/app/data",
"writable": true
},
"database": {
"status": "connected"
}
}
```
---
## 2. Data Models
### 2.1 SQLite Schema
**Table: jobs**
```sql
CREATE TABLE jobs (
job_id TEXT PRIMARY KEY,
config_path TEXT NOT NULL,
status TEXT NOT NULL CHECK(status IN ('pending', 'running', 'completed', 'partial', 'failed')),
date_range TEXT NOT NULL, -- JSON array of dates
models TEXT NOT NULL, -- JSON array of model signatures
created_at TEXT NOT NULL,
started_at TEXT,
completed_at TEXT,
total_duration_seconds REAL,
error TEXT
);
CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at DESC);
```
**Table: job_details**
```sql
CREATE TABLE job_details (
id INTEGER PRIMARY KEY AUTOINCREMENT,
job_id TEXT NOT NULL,
date TEXT NOT NULL,
model TEXT NOT NULL,
status TEXT NOT NULL CHECK(status IN ('pending', 'running', 'completed', 'failed')),
started_at TEXT,
completed_at TEXT,
duration_seconds REAL,
error TEXT,
FOREIGN KEY (job_id) REFERENCES jobs(job_id) ON DELETE CASCADE
);
CREATE INDEX idx_job_details_job_id ON job_details(job_id);
CREATE INDEX idx_job_details_status ON job_details(status);
```
### 2.2 Pydantic Models
**Request Models:**
```python
from pydantic import BaseModel, Field
from typing import Optional, Literal
class TriggerSimulationRequest(BaseModel):
config_path: Optional[str] = Field(default="configs/default_config.json", description="Path to configuration file")
class ResultsQueryParams(BaseModel):
date: str = Field(..., pattern=r"^\d{4}-\d{2}-\d{2}$", description="Date in YYYY-MM-DD format")
model: Optional[str] = Field(None, description="Model signature filter")
detail: Literal["minimal", "full"] = Field(default="minimal", description="Response detail level")
```
**Response Models:**
```python
class JobProgress(BaseModel):
total_model_days: int
completed: int
failed: int
current: Optional[dict] = None # {"date": str, "model": str}
details: Optional[list] = None # List of JobDetailResponse
class TriggerSimulationResponse(BaseModel):
job_id: str
status: str
date_range: list[str]
models: list[str]
created_at: str
message: str
progress: Optional[JobProgress] = None
class JobStatusResponse(BaseModel):
job_id: str
status: str
date_range: list[str]
models: list[str]
progress: JobProgress
created_at: str
updated_at: Optional[str] = None
completed_at: Optional[str] = None
total_duration_seconds: Optional[float] = None
class DailyPnL(BaseModel):
profit: float
return_pct: float
portfolio_value: float
class Trade(BaseModel):
id: int
action: str
symbol: str
amount: int
price: Optional[float] = None
total: Optional[float] = None
class AIReasoning(BaseModel):
total_steps: int
stop_signal_received: bool
reasoning_summary: str
tool_usage: dict[str, int]
class ModelResult(BaseModel):
model: str
positions: dict[str, float]
daily_pnl: DailyPnL
trades: Optional[list[Trade]] = None
ai_reasoning: Optional[AIReasoning] = None
log_file_path: Optional[str] = None
class ResultsResponse(BaseModel):
date: str
results: list[ModelResult]
```
---
## 3. Configuration Management
### 3.1 Environment Variables
Required environment variables remain the same as batch mode:
```bash
# OpenAI API Configuration
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
# Alpha Vantage API
ALPHAADVANTAGE_API_KEY=...
# Jina Search API
JINA_API_KEY=...
# Runtime Config Path (now shared by API and worker)
RUNTIME_ENV_PATH=/app/data/runtime_env.json
# MCP Service Ports
MATH_HTTP_PORT=8000
SEARCH_HTTP_PORT=8001
TRADE_HTTP_PORT=8002
GETPRICE_HTTP_PORT=8003
# API Server Configuration
API_HOST=0.0.0.0
API_PORT=8080
# Job Configuration
MAX_CONCURRENT_JOBS=1 # Only one simulation job at a time
```
### 3.2 Runtime State Management
**Challenge:** Multiple model-days running concurrently need isolated `runtime_env.json` state.
**Solution:** Per-job runtime config files
- `runtime_env_base.json` - Template
- `runtime_env_{job_id}_{model}_{date}.json` - Job-specific runtime config
- Worker passes custom `RUNTIME_ENV_PATH` to each simulation execution
**Modified `write_config_value()` and `get_config_value()`:**
- Accept optional `runtime_path` parameter
- Worker manages lifecycle: create → use → cleanup
---
## 4. Error Handling
### 4.1 Error Response Format
All errors follow this structure:
```json
{
"error": "error_code",
"message": "Human-readable error description",
"details": {
// Optional additional context
}
}
```
### 4.2 HTTP Status Codes
- `200 OK` - Successful request
- `202 Accepted` - Job queued successfully
- `400 Bad Request` - Invalid input parameters
- `404 Not Found` - Resource not found (job, results)
- `409 Conflict` - Concurrent job conflict
- `500 Internal Server Error` - Unexpected server error
- `503 Service Unavailable` - Health check failed
### 4.3 Retry Strategy for Workers
Models run independently - failure of one model doesn't block others:
```python
async def run_model_day(job_id: str, date: str, model_config: dict):
try:
# Execute simulation for this model-day
await agent.run_trading_session(date)
update_job_detail_status(job_id, date, model, "completed")
except Exception as e:
# Log error, update status to failed, continue with next model-day
update_job_detail_status(job_id, date, model, "failed", error=str(e))
# Do NOT raise - let other models continue
```
---
## 5. Concurrency & Locking
### 5.1 Job Execution Policy
**Rule:** Maximum 1 running job at a time (configurable via `MAX_CONCURRENT_JOBS`)
**Enforcement:**
```python
def can_start_new_job() -> bool:
running_jobs = db.query(
"SELECT COUNT(*) FROM jobs WHERE status IN ('pending', 'running')"
).fetchone()[0]
return running_jobs < MAX_CONCURRENT_JOBS
```
### 5.2 Position File Concurrency
**Challenge:** Multiple model-days writing to same model's `position.jsonl`
**Solution:** Sequential execution per model
```python
# For each date in date_range:
# For each model in parallel: ← Models run in parallel
# Execute model-day sequentially ← Dates for same model run sequentially
```
**Execution Pattern:**
```
Date 2025-01-16:
- Model A (running)
- Model B (running)
- Model C (running)
Date 2025-01-17: ← Starts only after all models finish 2025-01-16
- Model A (running)
- Model B (running)
- Model C (running)
```
**Rationale:**
- Models write to different position files → No conflict
- Same model's dates run sequentially → No race condition on position.jsonl
- Date-level parallelism across models → Faster overall execution
---
## 6. Performance Considerations
### 6.1 Execution Time Estimates
Based on current implementation:
- Single model-day: ~30-60 seconds (depends on AI model latency + tool calls)
- 3 models × 5 days = 15 model-days ≈ 7.5-15 minutes (parallel execution)
### 6.2 Timeout Configuration
**API Request Timeout:**
- `/simulate/trigger`: 10 seconds (just queue job)
- `/simulate/status`: 5 seconds (read from DB)
- `/results`: 30 seconds (file I/O + parsing)
**Worker Timeout:**
- Per model-day: 5 minutes (inherited from `max_retries` × `base_delay`)
- Entire job: No timeout (job runs until all model-days complete or fail)
### 6.3 Optimization Opportunities (Future)
1. **Results caching:** Store computed daily_pnl in SQLite to avoid recomputation
2. **Parallel date execution:** If position file locking is implemented, run dates in parallel
3. **Streaming responses:** For `/simulate/status`, use SSE to push updates instead of polling
---
## 7. Logging & Observability
### 7.1 Structured Logging
All API logs use JSON format:
```json
{
"timestamp": "2025-01-20T14:30:00Z",
"level": "INFO",
"logger": "api.worker",
"message": "Starting simulation for model-day",
"job_id": "550e8400-...",
"date": "2025-01-16",
"model": "gpt-5"
}
```
### 7.2 Log Levels
- `DEBUG` - Detailed execution flow (tool calls, price fetches)
- `INFO` - Job lifecycle events (created, started, completed)
- `WARNING` - Recoverable errors (retry attempts)
- `ERROR` - Model-day failures (logged but job continues)
- `CRITICAL` - System failures (MCP services down, DB corruption)
### 7.3 Audit Trail
All job state transitions logged to `api_audit.log`:
```json
{
"timestamp": "2025-01-20T14:30:00Z",
"event": "job_created",
"job_id": "550e8400-...",
"user": "windmill-service", // Future: from auth header
"details": {"date_range": [...], "models": [...]}
}
```
---
## 8. Security Considerations
### 8.1 Authentication (Future)
For MVP, API relies on network isolation (Docker network). Future enhancements:
- API key authentication via header: `X-API-Key: <token>`
- JWT tokens for Windmill integration
- Rate limiting per API key
### 8.2 Input Validation
- All date parameters validated with regex: `^\d{4}-\d{2}-\d{2}$`
- Config paths restricted to `configs/` directory (prevent path traversal)
- Model signatures sanitized (alphanumeric + hyphens only)
### 8.3 File Access Controls
- Results API only reads from `data/agent_data/` directory
- Config API only reads from `configs/` directory
- No arbitrary file read via API parameters
---
## 9. Deployment Configuration
### 9.1 Docker Compose
```yaml
version: '3.8'
services:
ai-trader-api:
build:
context: .
dockerfile: Dockerfile
ports:
- "8080:8080"
volumes:
- ./data:/app/data
- ./configs:/app/configs
env_file:
- .env
environment:
- MODE=api
- API_PORT=8080
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
restart: unless-stopped
```
### 9.2 Dockerfile Modifications
```dockerfile
# ... existing layers ...
# Install API dependencies
COPY requirements-api.txt /app/
RUN pip install --no-cache-dir -r requirements-api.txt
# Copy API application code
COPY api/ /app/api/
# Copy entrypoint script
COPY docker-entrypoint.sh /app/
RUN chmod +x /app/docker-entrypoint.sh
EXPOSE 8080
CMD ["/app/docker-entrypoint.sh"]
```
### 9.3 Entrypoint Script
```bash
#!/bin/bash
set -e
echo "Starting MCP services..."
cd /app/agent_tools
python start_mcp_services.py &
MCP_PID=$!
echo "Waiting for MCP services to be ready..."
sleep 10
echo "Starting API server..."
cd /app
uvicorn api.main:app --host ${API_HOST:-0.0.0.0} --port ${API_PORT:-8080} --workers 1
# Cleanup on exit
trap "kill $MCP_PID 2>/dev/null || true" EXIT
```
---
## 10. API Versioning (Future)
For v2 and beyond:
- URL prefix: `/api/v1/simulate/trigger`, `/api/v2/simulate/trigger`
- Header-based: `Accept: application/vnd.ai-trader.v1+json`
MVP uses unversioned endpoints (implied v1).
---
## Next Steps
After reviewing this specification, we'll proceed to:
1. **Component 2:** Job Manager & SQLite Schema Implementation
2. **Component 3:** Background Worker Architecture
3. **Component 4:** BaseAgent Refactoring for Single-Day Execution
4. **Component 5:** Docker & Deployment Configuration
5. **Component 6:** Windmill Integration Flows
Please review this API specification and provide feedback or approval to continue.