Major architecture transformation from batch-only to API service with
database persistence for Windmill integration.
## REST API Implementation
- POST /simulate/trigger - Start simulation jobs
- GET /simulate/status/{job_id} - Monitor job progress
- GET /results - Query results with filters (job_id, date, model)
- GET /health - Service health checks
## Database Layer
- SQLite persistence with 6 tables (jobs, job_details, positions,
holdings, reasoning_logs, tool_usage)
- Foreign key constraints with cascade deletes
- Replaces JSONL file storage
## Backend Components
- JobManager: Job lifecycle management with concurrency control
- RuntimeConfigManager: Thread-safe isolated runtime configs
- ModelDayExecutor: Single model-day execution engine
- SimulationWorker: Date-sequential, model-parallel orchestration
## Testing
- 102 unit and integration tests (85% coverage)
- Database: 98% coverage
- Job manager: 98% coverage
- API endpoints: 81% coverage
- Pydantic models: 100% coverage
- TDD approach throughout
## Docker Deployment
- Dual-mode: API server (persistent) + batch (one-time)
- Health checks with 30s interval
- Volume persistence for database and logs
- Separate entrypoints for each mode
## Validation Tools
- scripts/validate_docker_build.sh - Build validation
- scripts/test_api_endpoints.sh - Complete API testing
- scripts/test_batch_mode.sh - Batch mode validation
- DOCKER_API.md - Deployment guide
- TESTING_GUIDE.md - Testing procedures
## Configuration
- API_PORT environment variable (default: 8080)
- Backwards compatible with existing configs
- FastAPI, uvicorn, pydantic>=2.0 dependencies
Co-Authored-By: AI Assistant <noreply@example.com>
AI-Trader API Service - Technical Specification
1. API Endpoints Specification
1.1 POST /simulate/trigger
Purpose: Trigger a catch-up simulation from the last completed date to the most recent trading day.
Request:
POST /simulate/trigger HTTP/1.1
Content-Type: application/json
{
"config_path": "configs/default_config.json" // Optional: defaults to configs/default_config.json
}
Response (202 Accepted):
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "accepted",
"date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"created_at": "2025-01-20T14:30:00Z",
"message": "Simulation job queued successfully"
}
Response (200 OK - Job Already Running):
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "running",
"date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"progress": {
"total_model_days": 6,
"completed": 3,
"failed": 0,
"current": {
"date": "2025-01-17",
"model": "gpt-5"
}
},
"created_at": "2025-01-20T14:25:00Z",
"message": "Simulation already in progress"
}
Response (200 OK - Already Up To Date):
{
"status": "current",
"message": "Simulation already up-to-date",
"last_simulation_date": "2025-01-20",
"next_trading_day": "2025-01-21"
}
Response (409 Conflict):
{
"error": "conflict",
"message": "Different simulation already running",
"current_job_id": "previous-job-uuid",
"current_date_range": ["2025-01-10", "2025-01-15"]
}
Business Logic:
- Load configuration from config_path (or default)
- Determine last completed date from each model's position.jsonl
- Calculate date range: max(last_dates) + 1 day → most_recent_trading_day
- Filter for weekdays only (Monday-Friday)
- If date_range is empty, return "already up-to-date"
- Check for existing jobs with same date range → return existing job
- Check for running jobs with different date range → return 409
- Create new job in SQLite with status=pending
- Queue background task to execute simulation
- Return 202 with job details
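The date-range steps above can be sketched as follows. This is a minimal illustration, not the service's actual code: the function name `compute_date_range` and the choice to return an empty list when no last dates are known are assumptions. The weekday filter does not account for market holidays.

```python
from datetime import date, timedelta


def compute_date_range(last_dates: list[date], most_recent_trading_day: date) -> list[str]:
    """Weekdays from the day after the latest completed date through the target day."""
    if not last_dates:
        return []  # assumption: nothing to catch up on when no history exists
    current = max(last_dates) + timedelta(days=1)
    dates = []
    while current <= most_recent_trading_day:
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            dates.append(current.isoformat())
        current += timedelta(days=1)
    return dates
```

An empty result corresponds to the "already up-to-date" response.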
1.2 GET /simulate/status/{job_id}
Purpose: Poll the status and progress of a simulation job.
Request:
GET /simulate/status/550e8400-e29b-41d4-a716-446655440000 HTTP/1.1
Response (200 OK - Running):
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "running",
"date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"progress": {
"total_model_days": 6,
"completed": 3,
"failed": 0,
"current": {
"date": "2025-01-17",
"model": "gpt-5"
},
"details": [
{"date": "2025-01-16", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 45.2},
{"date": "2025-01-16", "model": "gpt-5", "status": "completed", "duration_seconds": 38.7},
{"date": "2025-01-17", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 42.1},
{"date": "2025-01-17", "model": "gpt-5", "status": "running", "duration_seconds": null}
]
},
"created_at": "2025-01-20T14:25:00Z",
"updated_at": "2025-01-20T14:27:15Z"
}
Response (200 OK - Completed):
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"progress": {
"total_model_days": 6,
"completed": 6,
"failed": 0,
"details": [
{"date": "2025-01-16", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 45.2},
{"date": "2025-01-16", "model": "gpt-5", "status": "completed", "duration_seconds": 38.7},
{"date": "2025-01-17", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 42.1},
{"date": "2025-01-17", "model": "gpt-5", "status": "completed", "duration_seconds": 40.3},
{"date": "2025-01-20", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 43.8},
{"date": "2025-01-20", "model": "gpt-5", "status": "completed", "duration_seconds": 39.1}
]
},
"created_at": "2025-01-20T14:25:00Z",
"completed_at": "2025-01-20T14:29:45Z",
"total_duration_seconds": 285.0
}
Response (200 OK - Partial Failure):
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "partial",
"date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"progress": {
"total_model_days": 6,
"completed": 4,
"failed": 2,
"details": [
{"date": "2025-01-16", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 45.2},
{"date": "2025-01-16", "model": "gpt-5", "status": "completed", "duration_seconds": 38.7},
{"date": "2025-01-17", "model": "claude-3.7-sonnet", "status": "failed", "error": "MCP service timeout after 3 retries", "duration_seconds": null},
{"date": "2025-01-17", "model": "gpt-5", "status": "completed", "duration_seconds": 40.3},
{"date": "2025-01-20", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 43.8},
{"date": "2025-01-20", "model": "gpt-5", "status": "failed", "error": "AI model API timeout", "duration_seconds": null}
]
},
"created_at": "2025-01-20T14:25:00Z",
"completed_at": "2025-01-20T14:29:45Z"
}
Response (404 Not Found):
{
"error": "not_found",
"message": "Job not found",
"job_id": "invalid-job-id"
}
Business Logic:
- Query SQLite jobs table for job_id
- If not found, return 404
- Return job metadata + progress from job_details table
- Status transitions: pending → running → completed/partial/failed
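The specification does not state how the terminal job status is chosen. One plausible rule, sketched here under that assumption, derives it from the per-model-day statuses in job_details: all completed gives completed, all failed gives failed, and a mix gives partial. The function name `derive_job_status` is hypothetical.

```python
def derive_job_status(detail_statuses: list[str]) -> str:
    """Map the statuses of all model-days in a job to a single job status."""
    if not detail_statuses or all(s == "pending" for s in detail_statuses):
        return "pending"
    if any(s in ("pending", "running") for s in detail_statuses):
        return "running"  # some model-days are still outstanding
    failed = sum(1 for s in detail_statuses if s == "failed")
    if failed == 0:
        return "completed"
    if failed == len(detail_statuses):
        return "failed"
    return "partial"
```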
1.3 GET /simulate/current
Purpose: Get the most recent simulation job (for Windmill to discover job_id).
Request:
GET /simulate/current HTTP/1.1
Response (200 OK):
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "running",
"date_range": ["2025-01-16", "2025-01-17"],
"models": ["claude-3.7-sonnet", "gpt-5"],
"progress": {
"total_model_days": 4,
"completed": 2,
"failed": 0
},
"created_at": "2025-01-20T14:25:00Z"
}
Response (404 Not Found):
{
"error": "not_found",
"message": "No simulation jobs found"
}
Business Logic:
- Query SQLite: SELECT * FROM jobs ORDER BY created_at DESC LIMIT 1
- Return job details with progress summary
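A sketch of this lookup with the standard `sqlite3` module, assuming the jobs and job_details tables from section 2.1. The helper name `get_current_job` and the way the progress summary is counted are illustrative assumptions.

```python
import json
import sqlite3


def get_current_job(conn: sqlite3.Connection):
    """Return the newest job with a progress summary, or None if no jobs exist."""
    conn.row_factory = sqlite3.Row
    job = conn.execute(
        "SELECT * FROM jobs ORDER BY created_at DESC LIMIT 1"
    ).fetchone()
    if job is None:
        return None  # the endpoint would answer 404 here

    # Count model-days per status for the progress summary
    counts = {
        row["status"]: row["n"]
        for row in conn.execute(
            "SELECT status, COUNT(*) AS n FROM job_details "
            "WHERE job_id = ? GROUP BY status",
            (job["job_id"],),
        )
    }
    return {
        "job_id": job["job_id"],
        "status": job["status"],
        "date_range": json.loads(job["date_range"]),  # stored as JSON text
        "models": json.loads(job["models"]),
        "progress": {
            "total_model_days": sum(counts.values()),
            "completed": counts.get("completed", 0),
            "failed": counts.get("failed", 0),
        },
        "created_at": job["created_at"],
    }
```

Ordering by created_at works on the TEXT column because ISO 8601 timestamps in a single timezone sort lexicographically.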
1.4 GET /results
Purpose: Retrieve simulation results for a specific date and model.
Request:
GET /results?date=2025-01-15&model=gpt-5&detail=minimal HTTP/1.1
Query Parameters:
- date (required): Trading date in YYYY-MM-DD format
- model (optional): Model signature (if omitted, returns all models)
- detail (optional): Response detail level
  - minimal (default): Positions + daily P&L
  - full: + trade history + AI reasoning logs + tool usage stats
Response (200 OK - minimal):
{
"date": "2025-01-15",
"results": [
{
"model": "gpt-5",
"positions": {
"AAPL": 10,
"MSFT": 5,
"NVDA": 0,
"CASH": 8500.00
},
"daily_pnl": {
"profit": 150.50,
"return_pct": 1.5,
"portfolio_value": 10150.50
}
}
]
}
Response (200 OK - full):
{
"date": "2025-01-15",
"results": [
{
"model": "gpt-5",
"positions": {
"AAPL": 10,
"MSFT": 5,
"CASH": 8500.00
},
"daily_pnl": {
"profit": 150.50,
"return_pct": 1.5,
"portfolio_value": 10150.50
},
"trades": [
{
"id": 1,
"action": "buy",
"symbol": "AAPL",
"amount": 10,
"price": 255.88,
"total": 2558.80
}
],
"ai_reasoning": {
"total_steps": 15,
"stop_signal_received": true,
"reasoning_summary": "Market analysis indicated strong buy signal for AAPL...",
"tool_usage": {
"search": 3,
"get_price": 5,
"math": 2,
"trade": 1
}
},
"log_file_path": "data/agent_data/gpt-5/log/2025-01-15/log.jsonl"
}
]
}
Response (400 Bad Request):
{
"error": "invalid_date",
"message": "Date must be in YYYY-MM-DD format"
}
Response (404 Not Found):
{
"error": "no_data",
"message": "No simulation data found for date 2025-01-15 and model gpt-5"
}
Business Logic:
- Validate date format
- Read position.jsonl for specified model(s) and date
- For detail=minimal: Return positions + calculate daily P&L
- For detail=full:
  - Parse log.jsonl to extract reasoning summary
  - Count tool usage from log messages
  - Extract trades from position file
- Return aggregated results
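The daily P&L calculation is not defined in this specification. A minimal sketch, assuming the portfolio is valued as cash plus holdings at the day's prices and compared against the previous day's portfolio value; the function name, the `prices` argument, and the `previous_value` argument are all hypothetical.

```python
def compute_daily_pnl(positions: dict[str, float], prices: dict[str, float], previous_value: float) -> dict[str, float]:
    """Value the portfolio at the given prices and compare it with the prior value."""
    holdings_value = sum(
        qty * prices[symbol]
        for symbol, qty in positions.items()
        if symbol != "CASH"  # CASH is already a currency amount
    )
    portfolio_value = positions.get("CASH", 0.0) + holdings_value
    profit = portfolio_value - previous_value
    return_pct = profit / previous_value * 100 if previous_value else 0.0
    return {
        "profit": round(profit, 2),
        "return_pct": round(return_pct, 2),
        "portfolio_value": round(portfolio_value, 2),
    }
```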
1.5 GET /health
Purpose: Health check endpoint for Docker and monitoring.
Request:
GET /health HTTP/1.1
Response (200 OK):
{
"status": "healthy",
"timestamp": "2025-01-20T14:30:00Z",
"services": {
"mcp_math": {"status": "up", "url": "http://localhost:8000/mcp"},
"mcp_search": {"status": "up", "url": "http://localhost:8001/mcp"},
"mcp_trade": {"status": "up", "url": "http://localhost:8002/mcp"},
"mcp_getprice": {"status": "up", "url": "http://localhost:8003/mcp"}
},
"storage": {
"data_directory": "/app/data",
"writable": true,
"free_space_mb": 15234
},
"database": {
"status": "connected",
"path": "/app/data/jobs.db"
}
}
Response (503 Service Unavailable):
{
"status": "unhealthy",
"timestamp": "2025-01-20T14:30:00Z",
"services": {
"mcp_math": {"status": "down", "url": "http://localhost:8000/mcp", "error": "Connection refused"},
"mcp_search": {"status": "up", "url": "http://localhost:8001/mcp"},
"mcp_trade": {"status": "up", "url": "http://localhost:8002/mcp"},
"mcp_getprice": {"status": "up", "url": "http://localhost:8003/mcp"}
},
"storage": {
"data_directory": "/app/data",
"writable": true
},
"database": {
"status": "connected"
}
}
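The two responses suggest that a single failing dependency makes the whole service unhealthy. A sketch of that aggregation, assuming every MCP service, the storage check, and the database check must all pass; `summarize_health` is a hypothetical helper name.

```python
def summarize_health(services_up: dict[str, bool], storage_writable: bool, database_connected: bool) -> tuple[int, str]:
    """Return the HTTP status code and overall status string for /health."""
    healthy = all(services_up.values()) and storage_writable and database_connected
    return (200, "healthy") if healthy else (503, "unhealthy")
```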
2. Data Models
2.1 SQLite Schema
Table: jobs
CREATE TABLE jobs (
job_id TEXT PRIMARY KEY,
config_path TEXT NOT NULL,
status TEXT NOT NULL CHECK(status IN ('pending', 'running', 'completed', 'partial', 'failed')),
date_range TEXT NOT NULL, -- JSON array of dates
models TEXT NOT NULL, -- JSON array of model signatures
created_at TEXT NOT NULL,
started_at TEXT,
completed_at TEXT,
total_duration_seconds REAL,
error TEXT
);
CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at DESC);
Table: job_details
CREATE TABLE job_details (
id INTEGER PRIMARY KEY AUTOINCREMENT,
job_id TEXT NOT NULL,
date TEXT NOT NULL,
model TEXT NOT NULL,
status TEXT NOT NULL CHECK(status IN ('pending', 'running', 'completed', 'failed')),
started_at TEXT,
completed_at TEXT,
duration_seconds REAL,
error TEXT,
FOREIGN KEY (job_id) REFERENCES jobs(job_id) ON DELETE CASCADE
);
CREATE INDEX idx_job_details_job_id ON job_details(job_id);
CREATE INDEX idx_job_details_status ON job_details(status);
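SQLite only enforces foreign keys, including ON DELETE CASCADE, when `PRAGMA foreign_keys = ON` has been issued on the connection. The sketch below uses a trimmed-down version of the two tables to show the effect; the helper name is illustrative.

```python
import sqlite3

# Reduced schema: only the columns needed to demonstrate the cascade
SCHEMA = """
CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE job_details (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id TEXT NOT NULL,
    date TEXT NOT NULL,
    model TEXT NOT NULL,
    FOREIGN KEY (job_id) REFERENCES jobs(job_id) ON DELETE CASCADE
);
"""


def remaining_details_after_delete(enable_foreign_keys: bool) -> int:
    """Delete a job and report how many of its job_details rows survive."""
    conn = sqlite3.connect(":memory:")
    if enable_foreign_keys:
        # Must be set per connection, outside of a transaction
        conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO jobs VALUES ('job-1', 'completed')")
    conn.execute(
        "INSERT INTO job_details (job_id, date, model) VALUES ('job-1', '2025-01-16', 'gpt-5')"
    )
    conn.execute("DELETE FROM jobs WHERE job_id = 'job-1'")
    return conn.execute("SELECT COUNT(*) FROM job_details").fetchone()[0]
```

With the pragma enabled the detail row is removed along with its job; without it, orphaned rows can remain.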
2.2 Pydantic Models
Request Models:
from pydantic import BaseModel, Field
from typing import Optional, Literal

class TriggerSimulationRequest(BaseModel):
    config_path: Optional[str] = Field(default="configs/default_config.json", description="Path to configuration file")

class ResultsQueryParams(BaseModel):
    date: str = Field(..., pattern=r"^\d{4}-\d{2}-\d{2}$", description="Date in YYYY-MM-DD format")
    model: Optional[str] = Field(None, description="Model signature filter")
    detail: Literal["minimal", "full"] = Field(default="minimal", description="Response detail level")
Response Models:
class JobProgress(BaseModel):
    total_model_days: int
    completed: int
    failed: int
    current: Optional[dict] = None  # {"date": str, "model": str}
    details: Optional[list] = None  # List of JobDetailResponse

class TriggerSimulationResponse(BaseModel):
    job_id: str
    status: str
    date_range: list[str]
    models: list[str]
    created_at: str
    message: str
    progress: Optional[JobProgress] = None

class JobStatusResponse(BaseModel):
    job_id: str
    status: str
    date_range: list[str]
    models: list[str]
    progress: JobProgress
    created_at: str
    updated_at: Optional[str] = None
    completed_at: Optional[str] = None
    total_duration_seconds: Optional[float] = None

class DailyPnL(BaseModel):
    profit: float
    return_pct: float
    portfolio_value: float

class Trade(BaseModel):
    id: int
    action: str
    symbol: str
    amount: int
    price: Optional[float] = None
    total: Optional[float] = None

class AIReasoning(BaseModel):
    total_steps: int
    stop_signal_received: bool
    reasoning_summary: str
    tool_usage: dict[str, int]

class ModelResult(BaseModel):
    model: str
    positions: dict[str, float]
    daily_pnl: DailyPnL
    trades: Optional[list[Trade]] = None
    ai_reasoning: Optional[AIReasoning] = None
    log_file_path: Optional[str] = None

class ResultsResponse(BaseModel):
    date: str
    results: list[ModelResult]
3. Configuration Management
3.1 Environment Variables
Required environment variables remain the same as batch mode:
# OpenAI API Configuration
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
# Alpha Vantage API
ALPHAADVANTAGE_API_KEY=...
# Jina Search API
JINA_API_KEY=...
# Runtime Config Path (now shared by API and worker)
RUNTIME_ENV_PATH=/app/data/runtime_env.json
# MCP Service Ports
MATH_HTTP_PORT=8000
SEARCH_HTTP_PORT=8001
TRADE_HTTP_PORT=8002
GETPRICE_HTTP_PORT=8003
# API Server Configuration
API_HOST=0.0.0.0
API_PORT=8080
# Job Configuration
MAX_CONCURRENT_JOBS=1 # Only one simulation job at a time
3.2 Runtime State Management
Challenge: Multiple model-days running concurrently need isolated runtime_env.json state.
Solution: Per-job runtime config files
- runtime_env_base.json - Template
- runtime_env_{job_id}_{model}_{date}.json - Job-specific runtime config
- Worker passes custom RUNTIME_ENV_PATH to each simulation execution
Modified write_config_value() and get_config_value():
- Accept optional runtime_path parameter
- Worker manages lifecycle: create → use → cleanup
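A sketch of how the two helpers might accept an optional runtime_path, falling back to RUNTIME_ENV_PATH when it is omitted. The signatures, the `runtime_config_path` helper, and the example key are assumptions for illustration, not the project's actual code.

```python
import json
import os
from pathlib import Path
from typing import Any, Optional


def runtime_config_path(data_dir: str, job_id: str, model: str, date: str) -> Path:
    """Build the per-model-day file name described above."""
    return Path(data_dir) / f"runtime_env_{job_id}_{model}_{date}.json"


def _resolve(runtime_path: Optional[str]) -> Path:
    # Fall back to the shared environment variable when no path is given
    return Path(runtime_path or os.environ["RUNTIME_ENV_PATH"])


def write_config_value(key: str, value: Any, runtime_path: Optional[str] = None) -> None:
    path = _resolve(runtime_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    config[key] = value
    path.write_text(json.dumps(config))


def get_config_value(key: str, default: Any = None, runtime_path: Optional[str] = None) -> Any:
    path = _resolve(runtime_path)
    if not path.exists():
        return default
    return json.loads(path.read_text()).get(key, default)
```

In this sketch the worker would create the file before a model-day starts, pass its path to every call, and delete it once the model-day finishes, so two concurrent model-days never share state.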
4. Error Handling
4.1 Error Response Format
All errors follow this structure:
{
"error": "error_code",
"message": "Human-readable error description",
"details": {
// Optional additional context
}
}
4.2 HTTP Status Codes
- 200 OK - Successful request
- 202 Accepted - Job queued successfully
- 400 Bad Request - Invalid input parameters
- 404 Not Found - Resource not found (job, results)
- 409 Conflict - Concurrent job conflict
- 500 Internal Server Error - Unexpected server error
- 503 Service Unavailable - Health check failed
4.3 Retry Strategy for Workers
Models run independently - failure of one model doesn't block others:
async def run_model_day(job_id: str, date: str, model_config: dict):
    # Model signature recorded in job_details; the "signature" key is an assumed config field
    model = model_config["signature"]
    try:
        # Execute simulation for this model-day
        # (agent is assumed to have been built from model_config beforehand)
        await agent.run_trading_session(date)
        update_job_detail_status(job_id, date, model, "completed")
    except Exception as e:
        # Log error, update status to failed, continue with next model-day
        update_job_detail_status(job_id, date, model, "failed", error=str(e))
        # Do NOT raise - let other models continue
5. Concurrency & Locking
5.1 Job Execution Policy
Rule: Maximum 1 running job at a time (configurable via MAX_CONCURRENT_JOBS)
Enforcement:
def can_start_new_job() -> bool:
    # Pending jobs count toward the limit so a queued job also blocks new ones
    running_jobs = db.query(
        "SELECT COUNT(*) FROM jobs WHERE status IN ('pending', 'running')"
    ).fetchone()[0]
    return running_jobs < MAX_CONCURRENT_JOBS
5.2 Position File Concurrency
Challenge: Multiple model-days writing to same model's position.jsonl
Solution: Sequential execution per model
# For each date in date_range:
#   For each model in parallel:         ← Models run in parallel
#     Execute model-day sequentially    ← Dates for same model run sequentially
Execution Pattern:
Date 2025-01-16:
- Model A (running)
- Model B (running)
- Model C (running)
Date 2025-01-17: ← Starts only after all models finish 2025-01-16
- Model A (running)
- Model B (running)
- Model C (running)
Rationale:
- Models write to different position files → No conflict
- Same model's dates run sequentially → No race condition on position.jsonl
- Date-level parallelism across models → Faster overall execution
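The date-sequential, model-parallel pattern can be sketched with `asyncio.gather`. This is an illustration under assumptions: `run_job` and the injected `run_model_day(date, model)` coroutine are hypothetical names, and failures are captured with `return_exceptions=True` so one model cannot block the others.

```python
import asyncio
from typing import Awaitable, Callable


async def run_job(
    date_range: list[str],
    models: list[str],
    run_model_day: Callable[[str, str], Awaitable[None]],
) -> dict[tuple[str, str], str]:
    """Run dates in order; within each date, run all models concurrently."""
    results: dict[tuple[str, str], str] = {}
    for date in date_range:
        # The next date starts only after every model has finished this one
        outcomes = await asyncio.gather(
            *(run_model_day(date, model) for model in models),
            return_exceptions=True,
        )
        for model, outcome in zip(models, outcomes):
            failed = isinstance(outcome, BaseException)
            results[(date, model)] = "failed" if failed else "completed"
    return results
```

Because each model only ever has one model-day in flight, writes to a given model's position.jsonl never overlap.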
6. Performance Considerations
6.1 Execution Time Estimates
Based on current implementation:
- Single model-day: ~30-60 seconds (depends on AI model latency + tool calls)
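- 3 models × 5 days = 15 model-days ≈ 7.5-15 minutes of total model-day time; with the 3 models running in parallel on each date, wall-clock time is closer to 5 dates × 30-60 seconds ≈ 2.5-5 minutes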
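The arithmetic behind such estimates can be reproduced with a small helper. It assumes every model-day takes the same time and that, in the parallel case, all models for a date run concurrently; the function name is hypothetical.

```python
def estimate_minutes(num_models: int, num_dates: int, seconds_per_model_day: float, models_in_parallel: bool = True) -> float:
    """Rough job duration in minutes for a uniform per-model-day cost."""
    # In parallel mode each date costs one model-day of wall-clock time
    steps = num_dates if models_in_parallel else num_models * num_dates
    return steps * seconds_per_model_day / 60
```

For 3 models and 5 dates at 30-60 seconds per model-day, the helper gives 7.5-15 minutes when model-days run one at a time and 2.5-5 minutes when models run in parallel.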
6.2 Timeout Configuration
API Request Timeout:
- /simulate/trigger: 10 seconds (just queue job)
- /simulate/status: 5 seconds (read from DB)
- /results: 30 seconds (file I/O + parsing)
Worker Timeout:
- Per model-day: 5 minutes (inherited from max_retries × base_delay)
- Entire job: No timeout (job runs until all model-days complete or fail)
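The specification derives the per-model-day limit from the retry settings. If an explicit cap were wanted instead, one option is `asyncio.wait_for`, sketched below; this is a swapped-in technique rather than the mechanism described above, and `run_with_timeout` is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Optional


async def run_with_timeout(model_day: Awaitable[None], timeout_seconds: float = 300.0) -> tuple[str, Optional[str]]:
    """Run one model-day and report (status, error) without raising."""
    try:
        await asyncio.wait_for(model_day, timeout=timeout_seconds)
        return ("completed", None)
    except asyncio.TimeoutError:
        return ("failed", f"Timed out after {timeout_seconds:.0f} seconds")
    except Exception as exc:
        # Any other failure is recorded so the rest of the job can continue
        return ("failed", str(exc))
```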
6.3 Optimization Opportunities (Future)
- Results caching: Store computed daily_pnl in SQLite to avoid recomputation
- Parallel date execution: If position file locking is implemented, run dates in parallel
- Streaming responses: For /simulate/status, use SSE to push updates instead of polling
7. Logging & Observability
7.1 Structured Logging
All API logs use JSON format:
{
"timestamp": "2025-01-20T14:30:00Z",
"level": "INFO",
"logger": "api.worker",
"message": "Starting simulation for model-day",
"job_id": "550e8400-...",
"date": "2025-01-16",
"model": "gpt-5"
}
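One way to produce entries of this shape with the standard `logging` module is a custom formatter. The class name `JsonFormatter` and the convention of passing job_id, date, and model through `extra=` are assumptions for this sketch.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    CONTEXT_FIELDS = ("job_id", "date", "model")

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy optional context supplied via logger.info(..., extra={...})
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)
```

A worker would attach the formatter to a handler and log with, for example, `logger.info("Starting simulation for model-day", extra={"job_id": job_id, "date": date, "model": model})`.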
7.2 Log Levels
- DEBUG - Detailed execution flow (tool calls, price fetches)
- INFO - Job lifecycle events (created, started, completed)
- WARNING - Recoverable errors (retry attempts)
- ERROR - Model-day failures (logged but job continues)
- CRITICAL - System failures (MCP services down, DB corruption)
7.3 Audit Trail
All job state transitions logged to api_audit.log:
{
"timestamp": "2025-01-20T14:30:00Z",
"event": "job_created",
"job_id": "550e8400-...",
"user": "windmill-service", // Future: from auth header
"details": {"date_range": [...], "models": [...]}
}
8. Security Considerations
8.1 Authentication (Future)
For MVP, API relies on network isolation (Docker network). Future enhancements:
- API key authentication via header: X-API-Key: <token>
- JWT tokens for Windmill integration
- Rate limiting per API key
8.2 Input Validation
- All date parameters validated with regex: ^\d{4}-\d{2}-\d{2}$
- Config paths restricted to configs/ directory (prevent path traversal)
- Model signatures sanitized (alphanumeric, hyphens, and dots only, so that signatures such as claude-3.7-sonnet remain valid)
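These checks can be sketched with the standard library. Two points are assumptions of this sketch: the date regex alone accepts impossible dates such as 2025-13-45, so the example also parses the value, and the model pattern admits dots so that signatures like claude-3.7-sonnet from the examples pass. All function names are hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")
MODEL_PATTERN = re.compile(r"[A-Za-z0-9.-]+")


def is_valid_date(value: str) -> bool:
    """Check the YYYY-MM-DD shape, then confirm it is a real calendar date."""
    if not DATE_PATTERN.match(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def is_valid_model_signature(value: str) -> bool:
    return MODEL_PATTERN.fullmatch(value) is not None


def is_allowed_config_path(config_path: str, base_dir: str = "configs") -> bool:
    """Accept only paths that resolve to a location inside base_dir."""
    base = Path(base_dir).resolve()
    candidate = Path(config_path).resolve()
    return base in candidate.parents
```

Resolving the path before comparing it is what defeats inputs such as configs/../.env.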
8.3 File Access Controls
- Results API only reads from data/agent_data/ directory
- Config API only reads from configs/ directory
- No arbitrary file read via API parameters
9. Deployment Configuration
9.1 Docker Compose
version: '3.8'
services:
  ai-trader-api:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    volumes:
      - ./data:/app/data
      - ./configs:/app/configs
    env_file:
      - .env
    environment:
      - MODE=api
      - API_PORT=8080
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    restart: unless-stopped
9.2 Dockerfile Modifications
# ... existing layers ...
# Install API dependencies
COPY requirements-api.txt /app/
RUN pip install --no-cache-dir -r requirements-api.txt
# Copy API application code
COPY api/ /app/api/
# Copy entrypoint script
COPY docker-entrypoint.sh /app/
RUN chmod +x /app/docker-entrypoint.sh
EXPOSE 8080
CMD ["/app/docker-entrypoint.sh"]
9.3 Entrypoint Script
#!/bin/bash
set -e
echo "Starting MCP services..."
cd /app/agent_tools
python start_mcp_services.py &
MCP_PID=$!
# Register cleanup before starting the server so it also runs if uvicorn exits with an error
trap "kill $MCP_PID 2>/dev/null || true" EXIT
echo "Waiting for MCP services to be ready..."
sleep 10
echo "Starting API server..."
cd /app
uvicorn api.main:app --host ${API_HOST:-0.0.0.0} --port ${API_PORT:-8080} --workers 1
10. API Versioning (Future)
For v2 and beyond:
- URL prefix: /api/v1/simulate/trigger, /api/v2/simulate/trigger
- Header-based: Accept: application/vnd.ai-trader.v1+json
MVP uses unversioned endpoints (implied v1).
Next Steps
After reviewing this specification, we'll proceed to:
- Component 2: Job Manager & SQLite Schema Implementation
- Component 3: Background Worker Architecture
- Component 4: BaseAgent Refactoring for Single-Day Execution
- Component 5: Docker & Deployment Configuration
- Component 6: Windmill Integration Flows
Please review this API specification and provide feedback or approval to continue.