mirror of https://github.com/Xe138/AI-Trader.git synced 2026-04-02 01:27:24 -04:00

Bill fb9583b374 feat: transform to REST API service with SQLite persistence (v0.3.0)

Major architecture transformation from batch-only to API service with
database persistence for Windmill integration.

## REST API Implementation
- POST /simulate/trigger - Start simulation jobs
- GET /simulate/status/{job_id} - Monitor job progress
- GET /results - Query results with filters (job_id, date, model)
- GET /health - Service health checks

## Database Layer
- SQLite persistence with 6 tables (jobs, job_details, positions,
  holdings, reasoning_logs, tool_usage)
- Foreign key constraints with cascade deletes
- Replaces JSONL file storage

## Backend Components
- JobManager: Job lifecycle management with concurrency control
- RuntimeConfigManager: Thread-safe isolated runtime configs
- ModelDayExecutor: Single model-day execution engine
- SimulationWorker: Date-sequential, model-parallel orchestration

## Testing
- 102 unit and integration tests (85% coverage)
- Database: 98% coverage
- Job manager: 98% coverage
- API endpoints: 81% coverage
- Pydantic models: 100% coverage
- TDD approach throughout

## Docker Deployment
- Dual-mode: API server (persistent) + batch (one-time)
- Health checks with 30s interval
- Volume persistence for database and logs
- Separate entrypoints for each mode

## Validation Tools
- scripts/validate_docker_build.sh - Build validation
- scripts/test_api_endpoints.sh - Complete API testing
- scripts/test_batch_mode.sh - Batch mode validation
- DOCKER_API.md - Deployment guide
- TESTING_GUIDE.md - Testing procedures

## Configuration
- API_PORT environment variable (default: 8080)
- Backwards compatible with existing configs
- FastAPI, uvicorn, pydantic>=2.0 dependencies

Co-Authored-By: AI Assistant <noreply@example.com>

2025-10-31 11:47:10 -04:00

AI-Trader API Service - Technical Specification

1. API Endpoints Specification

1.1 POST /simulate/trigger

Purpose: Trigger a catch-up simulation from the last completed date to the most recent trading day.

Request (config_path is optional and defaults to configs/default_config.json):

POST /simulate/trigger HTTP/1.1
Content-Type: application/json

{
  "config_path": "configs/default_config.json"
}

Response (202 Accepted):

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "accepted",
  "date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
  "models": ["claude-3.7-sonnet", "gpt-5"],
  "created_at": "2025-01-20T14:30:00Z",
  "message": "Simulation job queued successfully"
}

Response (200 OK - Job Already Running):

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
  "models": ["claude-3.7-sonnet", "gpt-5"],
  "progress": {
    "total_model_days": 6,
    "completed": 3,
    "failed": 0,
    "current": {
      "date": "2025-01-17",
      "model": "gpt-5"
    }
  },
  "created_at": "2025-01-20T14:25:00Z",
  "message": "Simulation already in progress"
}

Response (200 OK - Already Up To Date):

{
  "status": "current",
  "message": "Simulation already up-to-date",
  "last_simulation_date": "2025-01-20",
  "next_trading_day": "2025-01-21"
}

Response (409 Conflict):

{
  "error": "conflict",
  "message": "Different simulation already running",
  "current_job_id": "previous-job-uuid",
  "current_date_range": ["2025-01-10", "2025-01-15"]
}

Business Logic:

Load configuration from config_path (or default)
Determine last completed date from each model's position.jsonl
Calculate date range: max(last_dates) + 1 day → most_recent_trading_day
Filter for weekdays only (Monday-Friday)
If date_range is empty, return "already up-to-date"
Check for existing jobs with same date range → return existing job
Check for running jobs with different date range → return 409
Create new job in SQLite with status=pending
Queue background task to execute simulation
Return 202 with job details
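
The date-range steps above (last completed date, next-day start, weekday filter) might be sketched as follows. This is a minimal illustration rather than the project's code: the function name `pending_trading_dates` is hypothetical, and like the spec it only skips weekends, so market holidays would still need to be filtered some other way.

```python
from datetime import date, timedelta


def pending_trading_dates(last_dates: list[date], most_recent_trading_day: date) -> list[str]:
    """Return the weekdays after max(last_dates), up to and including most_recent_trading_day."""
    current = max(last_dates) + timedelta(days=1)
    pending = []
    while current <= most_recent_trading_day:
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            pending.append(current.isoformat())
        current += timedelta(days=1)
    return pending


# Two models last completed on 2025-01-15 and 2025-01-14; the weekend is skipped.
example = pending_trading_dates([date(2025, 1, 15), date(2025, 1, 14)], date(2025, 1, 20))
```

An empty list corresponds to the "already up-to-date" response.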

1.2 GET /simulate/status/{job_id}

Purpose: Poll the status and progress of a simulation job.

Request:

GET /simulate/status/550e8400-e29b-41d4-a716-446655440000 HTTP/1.1

Response (200 OK - Running):

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
  "models": ["claude-3.7-sonnet", "gpt-5"],
  "progress": {
    "total_model_days": 6,
    "completed": 3,
    "failed": 0,
    "current": {
      "date": "2025-01-17",
      "model": "gpt-5"
    },
    "details": [
      {"date": "2025-01-16", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 45.2},
      {"date": "2025-01-16", "model": "gpt-5", "status": "completed", "duration_seconds": 38.7},
      {"date": "2025-01-17", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 42.1},
      {"date": "2025-01-17", "model": "gpt-5", "status": "running", "duration_seconds": null}
    ]
  },
  "created_at": "2025-01-20T14:25:00Z",
  "updated_at": "2025-01-20T14:27:15Z"
}

Response (200 OK - Completed):

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
  "models": ["claude-3.7-sonnet", "gpt-5"],
  "progress": {
    "total_model_days": 6,
    "completed": 6,
    "failed": 0,
    "details": [
      {"date": "2025-01-16", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 45.2},
      {"date": "2025-01-16", "model": "gpt-5", "status": "completed", "duration_seconds": 38.7},
      {"date": "2025-01-17", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 42.1},
      {"date": "2025-01-17", "model": "gpt-5", "status": "completed", "duration_seconds": 40.3},
      {"date": "2025-01-20", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 43.8},
      {"date": "2025-01-20", "model": "gpt-5", "status": "completed", "duration_seconds": 39.1}
    ]
  },
  "created_at": "2025-01-20T14:25:00Z",
  "completed_at": "2025-01-20T14:29:45Z",
  "total_duration_seconds": 285.0
}

Response (200 OK - Partial Failure):

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "partial",
  "date_range": ["2025-01-16", "2025-01-17", "2025-01-20"],
  "models": ["claude-3.7-sonnet", "gpt-5"],
  "progress": {
    "total_model_days": 6,
    "completed": 4,
    "failed": 2,
    "details": [
      {"date": "2025-01-16", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 45.2},
      {"date": "2025-01-16", "model": "gpt-5", "status": "completed", "duration_seconds": 38.7},
      {"date": "2025-01-17", "model": "claude-3.7-sonnet", "status": "failed", "error": "MCP service timeout after 3 retries", "duration_seconds": null},
      {"date": "2025-01-17", "model": "gpt-5", "status": "completed", "duration_seconds": 40.3},
      {"date": "2025-01-20", "model": "claude-3.7-sonnet", "status": "completed", "duration_seconds": 43.8},
      {"date": "2025-01-20", "model": "gpt-5", "status": "failed", "error": "AI model API timeout", "duration_seconds": null}
    ]
  },
  "created_at": "2025-01-20T14:25:00Z",
  "completed_at": "2025-01-20T14:29:45Z"
}

Response (404 Not Found):

{
  "error": "not_found",
  "message": "Job not found",
  "job_id": "invalid-job-id"
}

Business Logic:

Query SQLite jobs table for job_id
If not found, return 404
Return job metadata + progress from job_details table
Status transitions: pending → running → completed/partial/failed
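
The spec does not spell out how the job-level status is derived from the per-model-day rows. One reading consistent with the example responses above is sketched below; the function name `derive_job_status` and the rule that a job is "failed" only when every model-day failed are assumptions.

```python
def derive_job_status(detail_statuses: list[str]) -> str:
    """Collapse job_details statuses into a single jobs.status value."""
    if not detail_statuses or all(s == "pending" for s in detail_statuses):
        return "pending"
    if any(s in ("pending", "running") for s in detail_statuses):
        return "running"  # work has started but is not finished
    failed = detail_statuses.count("failed")
    if failed == 0:
        return "completed"
    if failed == len(detail_statuses):
        return "failed"  # assumption: total failure maps to "failed"
    return "partial"
```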

1.3 GET /simulate/current

Purpose: Get the most recent simulation job (for Windmill to discover job_id).

Request:

GET /simulate/current HTTP/1.1

Response (200 OK):

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "date_range": ["2025-01-16", "2025-01-17"],
  "models": ["claude-3.7-sonnet", "gpt-5"],
  "progress": {
    "total_model_days": 4,
    "completed": 2,
    "failed": 0
  },
  "created_at": "2025-01-20T14:25:00Z"
}

Response (404 Not Found):

{
  "error": "not_found",
  "message": "No simulation jobs found"
}

Business Logic:

Query SQLite: SELECT * FROM jobs ORDER BY created_at DESC LIMIT 1
Return job details with progress summary

1.4 GET /results

Purpose: Retrieve simulation results for a specific date and model.

Request:

GET /results?date=2025-01-15&model=gpt-5&detail=minimal HTTP/1.1

Query Parameters:

date (required): Trading date in YYYY-MM-DD format
model (optional): Model signature (if omitted, returns all models)
detail (optional): Response detail level
- minimal (default): Positions + daily P&L
- full: + trade history + AI reasoning logs + tool usage stats

Response (200 OK - minimal):

{
  "date": "2025-01-15",
  "results": [
    {
      "model": "gpt-5",
      "positions": {
        "AAPL": 10,
        "MSFT": 5,
        "NVDA": 0,
        "CASH": 8500.00
      },
      "daily_pnl": {
        "profit": 150.50,
        "return_pct": 1.5,
        "portfolio_value": 10150.50
      }
    }
  ]
}

Response (200 OK - full):

{
  "date": "2025-01-15",
  "results": [
    {
      "model": "gpt-5",
      "positions": {
        "AAPL": 10,
        "MSFT": 5,
        "CASH": 8500.00
      },
      "daily_pnl": {
        "profit": 150.50,
        "return_pct": 1.5,
        "portfolio_value": 10150.50
      },
      "trades": [
        {
          "id": 1,
          "action": "buy",
          "symbol": "AAPL",
          "amount": 10,
          "price": 255.88,
          "total": 2558.80
        }
      ],
      "ai_reasoning": {
        "total_steps": 15,
        "stop_signal_received": true,
        "reasoning_summary": "Market analysis indicated strong buy signal for AAPL...",
        "tool_usage": {
          "search": 3,
          "get_price": 5,
          "math": 2,
          "trade": 1
        }
      },
      "log_file_path": "data/agent_data/gpt-5/log/2025-01-15/log.jsonl"
    }
  ]
}

Response (400 Bad Request):

{
  "error": "invalid_date",
  "message": "Date must be in YYYY-MM-DD format"
}

Response (404 Not Found):

{
  "error": "no_data",
  "message": "No simulation data found for date 2025-01-15 and model gpt-5"
}

Business Logic:

Validate date format
Read position.jsonl for specified model(s) and date
For detail=minimal: Return positions + calculate daily P&L
For detail=full:
- Parse log.jsonl to extract reasoning summary
- Count tool usage from log messages
- Extract trades from position file
Return aggregated results

1.5 GET /health

Purpose: Health check endpoint for Docker and monitoring.

Request:

GET /health HTTP/1.1

Response (200 OK):

{
  "status": "healthy",
  "timestamp": "2025-01-20T14:30:00Z",
  "services": {
    "mcp_math": {"status": "up", "url": "http://localhost:8000/mcp"},
    "mcp_search": {"status": "up", "url": "http://localhost:8001/mcp"},
    "mcp_trade": {"status": "up", "url": "http://localhost:8002/mcp"},
    "mcp_getprice": {"status": "up", "url": "http://localhost:8003/mcp"}
  },
  "storage": {
    "data_directory": "/app/data",
    "writable": true,
    "free_space_mb": 15234
  },
  "database": {
    "status": "connected",
    "path": "/app/data/jobs.db"
  }
}

Response (503 Service Unavailable):

{
  "status": "unhealthy",
  "timestamp": "2025-01-20T14:30:00Z",
  "services": {
    "mcp_math": {"status": "down", "url": "http://localhost:8000/mcp", "error": "Connection refused"},
    "mcp_search": {"status": "up", "url": "http://localhost:8001/mcp"},
    "mcp_trade": {"status": "up", "url": "http://localhost:8002/mcp"},
    "mcp_getprice": {"status": "up", "url": "http://localhost:8003/mcp"}
  },
  "storage": {
    "data_directory": "/app/data",
    "writable": true
  },
  "database": {
    "status": "connected"
  }
}
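
The "storage" section of the health response could be assembled with the standard library alone. The sketch below is illustrative: `storage_health` is a hypothetical helper, and omitting free_space_mb when the directory is missing is an assumption.

```python
import os
import shutil


def storage_health(data_directory: str) -> dict:
    """Build the "storage" part of the /health payload."""
    exists = os.path.isdir(data_directory)
    report = {
        "data_directory": data_directory,
        "writable": exists and os.access(data_directory, os.W_OK),
    }
    if exists:
        # disk_usage reports bytes; the response uses whole megabytes
        report["free_space_mb"] = shutil.disk_usage(data_directory).free // (1024 * 1024)
    return report
```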

2. Data Models

2.1 SQLite Schema

Table: jobs

CREATE TABLE jobs (
    job_id TEXT PRIMARY KEY,
    config_path TEXT NOT NULL,
    status TEXT NOT NULL CHECK(status IN ('pending', 'running', 'completed', 'partial', 'failed')),
    date_range TEXT NOT NULL,  -- JSON array of dates
    models TEXT NOT NULL,      -- JSON array of model signatures
    created_at TEXT NOT NULL,
    started_at TEXT,
    completed_at TEXT,
    total_duration_seconds REAL,
    error TEXT
);

CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at DESC);

Table: job_details

CREATE TABLE job_details (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id TEXT NOT NULL,
    date TEXT NOT NULL,
    model TEXT NOT NULL,
    status TEXT NOT NULL CHECK(status IN ('pending', 'running', 'completed', 'failed')),
    started_at TEXT,
    completed_at TEXT,
    duration_seconds REAL,
    error TEXT,
    FOREIGN KEY (job_id) REFERENCES jobs(job_id) ON DELETE CASCADE
);

CREATE INDEX idx_job_details_job_id ON job_details(job_id);
CREATE INDEX idx_job_details_status ON job_details(status);
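
One detail worth checking when implementing this schema: SQLite only honors ON DELETE CASCADE when foreign key enforcement is switched on for the connection, and it is off by default. The sketch below uses Python's built-in sqlite3 module with an abbreviated version of the two tables (columns trimmed for brevity); `open_db` is a hypothetical helper name.

```python
import sqlite3

# Abbreviated schema: only the columns needed to show the cascade.
SCHEMA = """
CREATE TABLE jobs (
    job_id TEXT PRIMARY KEY,
    status TEXT NOT NULL
);
CREATE TABLE job_details (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id TEXT NOT NULL,
    date TEXT NOT NULL,
    model TEXT NOT NULL,
    status TEXT NOT NULL,
    FOREIGN KEY (job_id) REFERENCES jobs(job_id) ON DELETE CASCADE
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Must be issued on every new connection, outside a transaction.
    conn.execute("PRAGMA foreign_keys = ON")
    return conn


conn = open_db()
conn.executescript(SCHEMA)
conn.execute("INSERT INTO jobs VALUES ('job-1', 'completed')")
conn.execute(
    "INSERT INTO job_details (job_id, date, model, status) "
    "VALUES ('job-1', '2025-01-16', 'gpt-5', 'completed')"
)
conn.execute("DELETE FROM jobs WHERE job_id = 'job-1'")
# With enforcement on, the child row is removed along with its job.
remaining = conn.execute("SELECT COUNT(*) FROM job_details").fetchone()[0]
```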

2.2 Pydantic Models

Request Models:

from pydantic import BaseModel, Field
from typing import Optional, Literal

class TriggerSimulationRequest(BaseModel):
    config_path: Optional[str] = Field(default="configs/default_config.json", description="Path to configuration file")

class ResultsQueryParams(BaseModel):
    date: str = Field(..., pattern=r"^\d{4}-\d{2}-\d{2}$", description="Date in YYYY-MM-DD format")
    model: Optional[str] = Field(None, description="Model signature filter")
    detail: Literal["minimal", "full"] = Field(default="minimal", description="Response detail level")

Response Models:

class JobProgress(BaseModel):
    total_model_days: int
    completed: int
    failed: int
    current: Optional[dict] = None  # {"date": str, "model": str}
    details: Optional[list] = None  # List of JobDetailResponse

class TriggerSimulationResponse(BaseModel):
    job_id: str
    status: str
    date_range: list[str]
    models: list[str]
    created_at: str
    message: str
    progress: Optional[JobProgress] = None

class JobStatusResponse(BaseModel):
    job_id: str
    status: str
    date_range: list[str]
    models: list[str]
    progress: JobProgress
    created_at: str
    updated_at: Optional[str] = None
    completed_at: Optional[str] = None
    total_duration_seconds: Optional[float] = None

class DailyPnL(BaseModel):
    profit: float
    return_pct: float
    portfolio_value: float

class Trade(BaseModel):
    id: int
    action: str
    symbol: str
    amount: int
    price: Optional[float] = None
    total: Optional[float] = None

class AIReasoning(BaseModel):
    total_steps: int
    stop_signal_received: bool
    reasoning_summary: str
    tool_usage: dict[str, int]

class ModelResult(BaseModel):
    model: str
    positions: dict[str, float]
    daily_pnl: DailyPnL
    trades: Optional[list[Trade]] = None
    ai_reasoning: Optional[AIReasoning] = None
    log_file_path: Optional[str] = None

class ResultsResponse(BaseModel):
    date: str
    results: list[ModelResult]

3. Configuration Management

3.1 Environment Variables

The environment variables required by batch mode still apply. The API service adds API_HOST, API_PORT, and MAX_CONCURRENT_JOBS:

# OpenAI API Configuration
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-...

# Alpha Vantage API
ALPHAADVANTAGE_API_KEY=...

# Jina Search API
JINA_API_KEY=...

# Runtime Config Path (now shared by API and worker)
RUNTIME_ENV_PATH=/app/data/runtime_env.json

# MCP Service Ports
MATH_HTTP_PORT=8000
SEARCH_HTTP_PORT=8001
TRADE_HTTP_PORT=8002
GETPRICE_HTTP_PORT=8003

# API Server Configuration
API_HOST=0.0.0.0
API_PORT=8080

# Job Configuration (only one simulation job at a time)
MAX_CONCURRENT_JOBS=1

3.2 Runtime State Management

Challenge: Multiple model-days running concurrently need isolated runtime_env.json state.

Solution: One runtime config file per model-day

runtime_env_base.json - Template
runtime_env_{job_id}_{model}_{date}.json - Runtime config for a single model-day within a job
Worker passes the matching file as a custom RUNTIME_ENV_PATH to each simulation execution

Modified write_config_value() and get_config_value():

Accept optional runtime_path parameter
Worker manages lifecycle: create → use → cleanup
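
The create → use → cleanup lifecycle maps naturally onto a context manager. The following is a sketch under stated assumptions: `isolated_runtime_config` is a hypothetical name, the initial contents are illustrative, and the real worker may seed the file from runtime_env_base.json instead.

```python
import json
import os
from contextlib import contextmanager


@contextmanager
def isolated_runtime_config(base_dir, job_id, model, date, initial=None):
    """Create a per-model-day runtime config, yield its path, and always remove it."""
    path = os.path.join(base_dir, f"runtime_env_{job_id}_{model}_{date}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(initial or {}, f)
    try:
        yield path  # the worker passes this path as runtime_path
    finally:
        # Cleanup runs even if the simulation raised
        if os.path.exists(path):
            os.remove(path)
```

A worker would wrap each model-day in `with isolated_runtime_config(...) as runtime_path:` and hand `runtime_path` to write_config_value() and get_config_value().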

4. Error Handling

4.1 Error Response Format

All errors follow this structure:

{
  "error": "error_code",
  "message": "Human-readable error description",
  "details": {
    // Optional additional context
  }
}

4.2 HTTP Status Codes

200 OK - Successful request
202 Accepted - Job queued successfully
400 Bad Request - Invalid input parameters
404 Not Found - Resource not found (job, results)
409 Conflict - Concurrent job conflict
500 Internal Server Error - Unexpected server error
503 Service Unavailable - Health check failed

4.3 Retry Strategy for Workers

Models run independently - failure of one model doesn't block others:

async def run_model_day(job_id: str, date: str, model_config: dict):
    # Assumption: the model signature is stored under the "signature" key
    model = model_config["signature"]
    try:
        # Execute simulation for this model-day; `agent` is the agent
        # instance built for this model (construction not shown)
        await agent.run_trading_session(date)
        update_job_detail_status(job_id, date, model, "completed")
    except Exception as e:
        # Record the failure and return so the remaining model-days still run
        update_job_detail_status(job_id, date, model, "failed", error=str(e))
        # Do NOT raise - let other models continue

5. Concurrency & Locking

5.1 Job Execution Policy

Rule: Maximum 1 running job at a time (configurable via MAX_CONCURRENT_JOBS)

Enforcement:

def can_start_new_job() -> bool:
    running_jobs = db.query(
        "SELECT COUNT(*) FROM jobs WHERE status IN ('pending', 'running')"
    ).fetchone()[0]
    return running_jobs < MAX_CONCURRENT_JOBS
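
If two trigger requests arrive at nearly the same time, a separate check followed by an insert can let both through. One possible way to close that gap, shown here as an assumption rather than the project's implementation, is to run the check and the insert inside a single BEGIN IMMEDIATE transaction. `try_create_job` is a hypothetical helper and the jobs table is abbreviated.

```python
import sqlite3

MAX_CONCURRENT_JOBS = 1


def try_create_job(conn: sqlite3.Connection, job_id: str) -> bool:
    """Insert a pending job only if the concurrency limit allows it."""
    # BEGIN IMMEDIATE takes the write lock up front, so no other writer
    # can slip in between the COUNT and the INSERT.
    conn.execute("BEGIN IMMEDIATE")
    try:
        active = conn.execute(
            "SELECT COUNT(*) FROM jobs WHERE status IN ('pending', 'running')"
        ).fetchone()[0]
        allowed = active < MAX_CONCURRENT_JOBS
        if allowed:
            conn.execute(
                "INSERT INTO jobs (job_id, status) VALUES (?, 'pending')", (job_id,)
            )
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")
    return allowed


# isolation_level=None disables implicit transactions so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
first = try_create_job(conn, "job-a")
second = try_create_job(conn, "job-b")
```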

5.2 Position File Concurrency

Challenge: Multiple model-days writing to same model's position.jsonl

Solution: Sequential execution per model

# For each date in date_range (dates run one after another):
#   Run all models for that date in parallel
#   Wait for every model to finish before starting the next date

Execution Pattern:

Date 2025-01-16:
  - Model A (running)
  - Model B (running)
  - Model C (running)

Date 2025-01-17:  ← Starts only after all models finish 2025-01-16
  - Model A (running)
  - Model B (running)
  - Model C (running)

Rationale:

Models write to different position files → No conflict
Same model's dates run sequentially → No race condition on position.jsonl
Date-level parallelism across models → Faster overall execution
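
The date-sequential, model-parallel pattern can be sketched with asyncio.gather. This is illustrative only: `run_job` and `fake_model_day` are hypothetical names, and the real worker would call the actual model-day executor and update SQLite instead of returning a dictionary.

```python
import asyncio


async def run_job(date_range, models, run_model_day):
    """Run dates in order; within each date, run all models concurrently."""
    results = {}
    for date in date_range:
        # gather waits for every model before the loop moves to the next date;
        # return_exceptions=True keeps one failure from cancelling the others.
        outcomes = await asyncio.gather(
            *(run_model_day(date, model) for model in models),
            return_exceptions=True,
        )
        for model, outcome in zip(models, outcomes):
            results[(date, model)] = (
                "failed" if isinstance(outcome, Exception) else "completed"
            )
    return results


async def fake_model_day(date, model):
    """Stand-in for a real model-day; fails once to show isolation."""
    await asyncio.sleep(0.01)
    if (date, model) == ("2025-01-17", "model-b"):
        raise RuntimeError("simulated timeout")


results = asyncio.run(
    run_job(["2025-01-16", "2025-01-17"], ["model-a", "model-b"], fake_model_day)
)
```

Here the failure of model-b on 2025-01-17 is recorded without affecting model-a on the same date.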

6. Performance Considerations

6.1 Execution Time Estimates

Based on current implementation:

Single model-day: ~30-60 seconds (depends on AI model latency + tool calls)
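3 models × 5 days = 15 model-days ≈ 7.5-15 minutes if run one at a time, or ≈ 2.5-5 minutes when the 3 models run in parallel on each date (5 sequential dates × 30-60 seconds)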
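
As a rough check on these figures: under the date-sequential, model-parallel pattern of section 5.2, wall-clock time scales with the number of dates rather than the number of model-days, provided the models on a given date take similar time. The helper below is a back-of-the-envelope sketch; `estimate_minutes` is a hypothetical name and it ignores scheduling overhead.

```python
def estimate_minutes(n_models: int, n_dates: int, seconds_per_model_day: float,
                     parallel_models: bool = True) -> float:
    """Estimate job duration in minutes from a uniform per-model-day time."""
    # With parallel models, each date costs one model-day of wall-clock time.
    sequential_steps = n_dates if parallel_models else n_dates * n_models
    return sequential_steps * seconds_per_model_day / 60


parallel_range = (estimate_minutes(3, 5, 30), estimate_minutes(3, 5, 60))
serial_range = (
    estimate_minutes(3, 5, 30, parallel_models=False),
    estimate_minutes(3, 5, 60, parallel_models=False),
)
```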

6.2 Timeout Configuration

API Request Timeout:

/simulate/trigger: 10 seconds (just queue job)
/simulate/status: 5 seconds (read from DB)
/results: 30 seconds (file I/O + parsing)

Worker Timeout:

Per model-day: 5 minutes (inherited from max_retries × base_delay)
Entire job: No timeout (job runs until all model-days complete or fail)

6.3 Optimization Opportunities (Future)

Results caching: Store computed daily_pnl in SQLite to avoid recomputation
Parallel date execution: If position file locking is implemented, run dates in parallel
Streaming responses: For /simulate/status, use SSE to push updates instead of polling

7. Logging & Observability

7.1 Structured Logging

All API logs use JSON format:

{
  "timestamp": "2025-01-20T14:30:00Z",
  "level": "INFO",
  "logger": "api.worker",
  "message": "Starting simulation for model-day",
  "job_id": "550e8400-...",
  "date": "2025-01-16",
  "model": "gpt-5"
}
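
A log line of this shape can be produced with the standard logging module and a small custom formatter. The sketch below is one possible approach, not the project's implementation: `JsonFormatter` is a hypothetical class, and the context fields are passed through logging's `extra` argument.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    CONTEXT_FIELDS = ("job_id", "date", "model")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields supplied via extra={...} become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)
```

Usage: attach the formatter to a handler, then call `logger.info("Starting simulation for model-day", extra={"job_id": job_id, "date": date, "model": model})`.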

7.2 Log Levels

DEBUG - Detailed execution flow (tool calls, price fetches)
INFO - Job lifecycle events (created, started, completed)
WARNING - Recoverable errors (retry attempts)
ERROR - Model-day failures (logged but job continues)
CRITICAL - System failures (MCP services down, DB corruption)

7.3 Audit Trail

All job state transitions logged to api_audit.log:

{
  "timestamp": "2025-01-20T14:30:00Z",
  "event": "job_created",
  "job_id": "550e8400-...",
  "user": "windmill-service",  // Future: from auth header
  "details": {"date_range": [...], "models": [...]}
}

8. Security Considerations

8.1 Authentication (Future)

For MVP, API relies on network isolation (Docker network). Future enhancements:

API key authentication via header: X-API-Key: <token>
JWT tokens for Windmill integration
Rate limiting per API key

8.2 Input Validation

All date parameters validated with regex: ^\d{4}-\d{2}-\d{2}$
Config paths restricted to configs/ directory (prevent path traversal)
Model signatures sanitized (alphanumeric characters, hyphens, and dots only, so that signatures such as claude-3.7-sonnet pass)
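
Note that the date regex checks the shape of the string only, so an impossible date such as 2025-13-45 still matches. A sketch of a stricter check that pairs the regex with a calendar parse is shown below; `is_valid_date` is a hypothetical helper name.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_valid_date(value: str) -> bool:
    """Accept only zero-padded YYYY-MM-DD strings that name a real calendar date."""
    if not DATE_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, but not a real date
    return True
```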

8.3 File Access Controls

Results API only reads from data/agent_data/ directory
Config API only reads from configs/ directory
No arbitrary file read via API parameters
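
Restricting config paths to configs/ is usually done by resolving the requested path and confirming it still sits under the allowed root. The following is a minimal sketch of that idea; `resolve_config_path` is a hypothetical helper, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_config_path(config_path: str, configs_root: str = "configs") -> Path:
    """Return the resolved path, or raise if it escapes configs_root."""
    root = Path(configs_root).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check
    candidate = Path(config_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"config_path must be inside {configs_root}/")
    return candidate
```

A request with "configs/../.env" or an absolute path such as "/etc/passwd" would be rejected, which the API could map to a 400 response.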

9. Deployment Configuration

9.1 Docker Compose

version: '3.8'

services:
  ai-trader-api:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    volumes:
      - ./data:/app/data
      - ./configs:/app/configs
    env_file:
      - .env
    environment:
      - MODE=api
      - API_PORT=8080
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    restart: unless-stopped

9.2 Dockerfile Modifications

# ... existing layers ...

# Install API dependencies
COPY requirements-api.txt /app/
RUN pip install --no-cache-dir -r requirements-api.txt

# Copy API application code
COPY api/ /app/api/

# Copy entrypoint script
COPY docker-entrypoint.sh /app/
RUN chmod +x /app/docker-entrypoint.sh

EXPOSE 8080

CMD ["/app/docker-entrypoint.sh"]

9.3 Entrypoint Script

#!/bin/bash
set -e

echo "Starting MCP services..."
cd /app/agent_tools
python start_mcp_services.py &
MCP_PID=$!

# Register cleanup before the blocking uvicorn call so the MCP services
# are also stopped when uvicorn exits with an error (set -e)
trap 'kill "$MCP_PID" 2>/dev/null || true' EXIT

echo "Waiting for MCP services to be ready..."
sleep 10

echo "Starting API server..."
cd /app
uvicorn api.main:app --host "${API_HOST:-0.0.0.0}" --port "${API_PORT:-8080}" --workers 1

10. API Versioning (Future)

For v2 and beyond:

URL prefix: /api/v1/simulate/trigger, /api/v2/simulate/trigger
Header-based: Accept: application/vnd.ai-trader.v1+json

MVP uses unversioned endpoints (implied v1).

Next Steps

After reviewing this specification, we'll proceed to:

Component 2: Job Manager & SQLite Schema Implementation
Component 3: Background Worker Architecture
Component 4: BaseAgent Refactoring for Single-Day Execution
Component 5: Docker & Deployment Configuration
Component 6: Windmill Integration Flows

Please review this API specification and provide feedback or approval to continue.
