# AI-Trader API Service - Technical Specifications Summary

## Overview

This directory contains comprehensive technical specifications for transforming the AI-Trader batch simulation system into an API service compatible with Windmill automation.

## Specification Documents

### 1. [API Specification](./api-specification.md)
**Purpose:** Defines all API endpoints, request/response formats, and data models

**Key Contents:**
- **5 REST Endpoints:**
  - `POST /simulate/trigger` - Queue catch-up simulation job
  - `GET /simulate/status/{job_id}` - Poll job progress
  - `GET /simulate/current` - Get latest job
  - `GET /results` - Retrieve simulation results (minimal/full detail)
  - `GET /health` - Service health check
- **Pydantic Models** for type-safe request/response handling
- **Error Handling** strategies and HTTP status codes
- **SQLite Schema** for jobs and job_details tables
- **Configuration Management** via environment variables

**Status Codes:** 200 OK, 202 Accepted, 400 Bad Request, 404 Not Found, 409 Conflict, 503 Service Unavailable

---

### 2. [Job Manager Specification](./job-manager-specification.md)
**Purpose:** Details the job tracking and database layer

**Key Contents:**
- **SQLite Database Schema:**
  - `jobs` table - High-level job metadata
  - `job_details` table - Per model-day execution tracking
- **JobManager Class Interface:**
  - `create_job()` - Create new simulation job
  - `get_job()` - Retrieve job by ID
  - `update_job_status()` - State transitions (pending → running → completed/partial/failed)
  - `get_job_progress()` - Detailed progress metrics
  - `can_start_new_job()` - Concurrency control
- **State Machine:** Job status transitions and business logic
- **Concurrency Control:** Single-job execution enforcement
- **Testing Strategy:** Unit tests with temporary databases

**Key Feature:** Independent model execution - one model's failure doesn't block others (results in "partial" status)
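
The interface above can be illustrated with a minimal sketch built on the standard-library `sqlite3` module. The column names, the `error` column, and the constructor signature are assumptions made for illustration; the authoritative schema and method signatures live in the job manager specification.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical schema: column names are illustrative assumptions,
# not the schema defined in the job manager specification.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id     TEXT PRIMARY KEY,
    status     TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS job_details (
    job_id TEXT NOT NULL REFERENCES jobs(job_id),
    model  TEXT NOT NULL,
    date   TEXT NOT NULL,
    status TEXT NOT NULL,
    error  TEXT,
    PRIMARY KEY (job_id, model, date)
);
"""


class JobManager:
    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.executescript(SCHEMA)

    def can_start_new_job(self):
        # Single-job enforcement: no job may be pending or running.
        row = self.conn.execute(
            "SELECT COUNT(*) FROM jobs WHERE status IN ('pending', 'running')"
        ).fetchone()
        return row[0] == 0

    def create_job(self, models, dates):
        if not self.can_start_new_job():
            raise RuntimeError("another job is still pending or running")
        job_id = str(uuid.uuid4())
        now = datetime.now(timezone.utc).isoformat()
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO jobs VALUES (?, 'pending', ?)", (job_id, now)
            )
            # One job_details row per model-day.
            self.conn.executemany(
                "INSERT INTO job_details VALUES (?, ?, ?, 'pending', NULL)",
                [(job_id, model, date) for date in dates for model in models],
            )
        return job_id

    def update_job_status(self, job_id, status):
        with self.conn:
            self.conn.execute(
                "UPDATE jobs SET status = ? WHERE job_id = ?", (status, job_id)
            )

    def get_job(self, job_id):
        row = self.conn.execute(
            "SELECT job_id, status, created_at FROM jobs WHERE job_id = ?",
            (job_id,),
        ).fetchone()
        if row is None:
            return None
        return {"job_id": row[0], "status": row[1], "created_at": row[2]}
```

In this sketch a caller would map the `RuntimeError` from `create_job()` to the 409 Conflict status listed in the API specification; that mapping is also an assumption.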

---

### 3. [Background Worker Specification](./worker-specification.md)
**Purpose:** Defines async job execution architecture

**Key Contents:**
- **Execution Pattern:** Date-sequential, Model-parallel
  - All models for Date 1 run in parallel
  - Date 2 starts only after all models finish Date 1
  - Ensures `position.jsonl` integrity: each model writes only to its own file, and never for two dates at once
- **SimulationWorker Class:**
  - Orchestrates job execution
  - Manages date sequencing
  - Handles job-level errors
- **ModelDayExecutor Class:**
  - Executes single model-day simulation
  - Updates job_detail status
  - Isolates runtime configuration
- **RuntimeConfigManager:**
  - Creates temporary `runtime_env_{job_id}_{model}_{date}.json` files
  - Prevents state collisions between concurrent models
  - Cleans up after execution
- **Error Handling:** Graceful failure (models continue despite peer failures)
- **Logging:** Structured JSON logging with job/model/date context

**Performance:** 3 models × 5 days = ~2.5-5 minutes model-parallel (vs. ~7.5-15 minutes fully sequential), assuming 30-60 seconds per model-day and ignoring overhead
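
The runtime configuration isolation described for `RuntimeConfigManager` could look roughly like the context manager below. The function name, its parameters, and the idea of seeding the file with an `initial` dictionary are assumptions for illustration; only the file naming pattern comes from this summary.

```python
import json
import os
from contextlib import contextmanager


@contextmanager
def isolated_runtime_config(base_dir, job_id, model, date, initial=None):
    """Create a per-model-day runtime config file and remove it afterwards.

    Hypothetical helper: the real RuntimeConfigManager interface is defined
    in the worker specification.
    """
    path = os.path.join(base_dir, f"runtime_env_{job_id}_{model}_{date}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(initial or {}, fh)
    try:
        yield path
    finally:
        # Clean up even if the simulation raised.
        if os.path.exists(path):
            os.remove(path)
```

Because the file name includes the job, model, and date, two models running in parallel for the same date never share a config file. Model names containing path separators would need escaping first, which this sketch does not handle.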

---

### 4. [Implementation Specification](./implementation-specifications.md)
**Purpose:** Complete implementation guide covering Agent, Docker, and Windmill

**Key Contents:**

#### Part 1: BaseAgent Refactoring
- **Analysis:** Existing `run_trading_session()` already compatible with API mode
- **Required Changes:** None. The existing code works as-is
- **Worker Integration:** Calls `agent.run_trading_session(date)` directly

#### Part 2: Docker Configuration
- **Modified Dockerfile:** Adds FastAPI dependencies, new entrypoint
- **docker-entrypoint-api.sh:** Starts MCP services → launches uvicorn
- **Health Checks:** Verifies MCP services and database connectivity
- **Volume Mounts:** `./data`, `./configs` for persistence

#### Part 3: Windmill Integration
- **Flow 1: trigger_simulation.ts** - Daily cron triggers API
- **Flow 2: poll_simulation_status.ts** - Polls every 5 min until complete
- **Flow 3: store_simulation_results.py** - Stores results in Windmill DB
- **Dashboard:** Charts and tables showing portfolio performance
- **Workflow Orchestration:** Complete YAML workflow definition
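
The polling flow boils down to a loop that stops on a terminal status. The Windmill script itself is TypeScript; the sketch below only shows the logic in Python. `fetch_status` is a hypothetical callable standing in for a `GET /simulate/status/{job_id}` request, and the `max_polls` cap is an assumption, since this summary does not state a timeout.

```python
import time

# Terminal states taken from the job state machine described in this document.
TERMINAL_STATUSES = {"completed", "partial", "failed"}


def poll_until_done(fetch_status, interval_seconds=300, max_polls=48,
                    sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status.

    fetch_status is a hypothetical zero-argument callable that returns the
    job's current status string. Raises TimeoutError after max_polls attempts.
    """
    for attempt in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(interval_seconds)  # 300 s matches the 5-minute interval
    raise TimeoutError(f"job still not finished after {max_polls} polls")
```

Injecting `sleep` keeps the loop testable without real waiting. Treating "partial" as terminal lets downstream steps store whatever results exist.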

#### Part 4: File Structure
- New `api/` directory with 7 modules
- New `windmill/` directory with scripts and dashboard
- New `docs/` directory (this folder)
- `data/jobs.db` for job tracking

#### Part 5: Implementation Checklist
10-day implementation plan broken into 6 phases

---

## Architecture Highlights

### Request Flow

```
1. Windmill → POST /simulate/trigger
2. API creates job in SQLite (status: pending)
3. API queues BackgroundTask
4. API returns 202 Accepted with job_id
   ↓
5. Worker starts (status: running)
6. For each date sequentially:
     For each model in parallel:
       - Create isolated runtime config
       - Execute agent.run_trading_session(date)
       - Update job_detail status
7. Worker finishes (status: completed/partial/failed)
   ↓
8. Windmill polls GET /simulate/status/{job_id}
9. When complete: Windmill calls GET /results?date=X
10. Windmill stores results in internal DB
11. Windmill dashboard displays performance
```
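
Steps 5-7 of the flow can be sketched with a thread pool, assuming each model-day is an independent blocking call. `run_model_day` is a hypothetical callable standing in for the per-model-day executor; the actual worker and executor classes are defined in the worker specification.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(dates, models, run_model_day):
    """Run dates in order; within each date, run all models concurrently.

    run_model_day(model, date) is a hypothetical stand-in for the model-day
    executor. Returns {(model, date): "completed" | "failed"}.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        for date in dates:  # sequential: the next date waits for this one
            futures = {m: pool.submit(run_model_day, m, date) for m in models}
            for model, future in futures.items():
                try:
                    future.result()  # blocks until this model finishes
                    results[(model, date)] = "completed"
                except Exception:
                    # A failing model does not cancel its peers.
                    results[(model, date)] = "failed"
    return results
```

Waiting on every future before moving to the next date gives the date barrier. Catching exceptions per future gives the failure isolation.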

### Data Flow

```
Input: configs/default_config.json
       ↓
API: Calculates date_range (last position → today)
       ↓
Worker: Executes simulations
       ↓
Output: data/agent_data/{model}/position/position.jsonl
        data/agent_data/{model}/log/{date}/log.jsonl
        data/jobs.db (job tracking)
       ↓
API: Reads position.jsonl + calculates P&L
       ↓
Windmill: Stores in internal DB → Dashboard visualization
```
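
The "last position → today" step amounts to listing the dates that still need simulating. The sketch below assumes plain calendar days with no weekend or market-holiday filtering, and assumes today is included; the real rules belong to the API specification.

```python
from datetime import date, timedelta


def pending_dates(last_position_date, today):
    """Return ISO dates strictly after last_position_date, up to and
    including today. Assumption: every calendar day counts."""
    days = (today - last_position_date).days
    return [
        (last_position_date + timedelta(days=i)).isoformat()
        for i in range(1, days + 1)
    ]
```

If the last recorded position is already dated today, the list is empty and no catch-up work is needed.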

---

## Key Design Decisions

### 1. Pattern B: Lazy On-Demand Processing
- **Chosen:** Windmill controls simulation timing via API calls
- **Benefit:** Centralized scheduling in Windmill
- **Tradeoff:** First Windmill call of the day triggers long-running job

### 2. SQLite vs. PostgreSQL
- **Chosen:** SQLite for MVP
- **Rationale:** Low concurrency (1 job at a time), simple deployment
- **Future:** PostgreSQL for production with multiple concurrent jobs

### 3. Date-Sequential, Model-Parallel Execution
- **Chosen:** Dates run sequentially, models run in parallel per date
- **Rationale:** Prevents position.jsonl race conditions, faster than fully sequential
- **Performance:** Up to ~3x faster than fully sequential execution (3 models in parallel)

### 4. Independent Model Failures
- **Chosen:** One model's failure doesn't block others
- **Benefit:** Partial results better than no results
- **Implementation:** Job status becomes "partial" if any model fails
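
One way to derive the final job status from the per-model-day outcomes is sketched below. The function name is hypothetical, and the rule that a job with every model-day failed ends as "failed" rather than "partial" is an assumption inferred from the three terminal states named in this document.

```python
def derive_job_status(detail_statuses):
    """Collapse per-model-day statuses into one job status.

    Assumes every entry is already terminal ("completed" or "failed").
    """
    statuses = list(detail_statuses)
    if not statuses:
        raise ValueError("a job needs at least one model-day")
    failed = sum(1 for s in statuses if s == "failed")
    if failed == 0:
        return "completed"
    if failed == len(statuses):
        return "failed"  # assumption: nothing succeeded at all
    return "partial"
```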

### 5. Minimal BaseAgent Changes
- **Chosen:** No modifications to agent code
- **Rationale:** The existing `run_trading_session()` method already serves as a suitable API interface
- **Benefit:** Maintains backward compatibility with batch mode

---

## Implementation Prerequisites

### Required Environment Variables
```bash
OPENAI_API_BASE=...
OPENAI_API_KEY=...
ALPHAADVANTAGE_API_KEY=...
JINA_API_KEY=...
RUNTIME_ENV_PATH=/app/data/runtime_env.json
MATH_HTTP_PORT=8000
SEARCH_HTTP_PORT=8001
TRADE_HTTP_PORT=8002
GETPRICE_HTTP_PORT=8003
API_HOST=0.0.0.0
API_PORT=8080
```
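
A small loader can validate these variables at startup. This is only a sketch: the `Settings` class and `load_settings` function are hypothetical names. It is also an assumption that the four API credentials are mandatory while the values listed above for paths and ports act as defaults.

```python
import os
from dataclasses import dataclass

# Assumption: these four have no sensible default and must be set.
REQUIRED = (
    "OPENAI_API_BASE",
    "OPENAI_API_KEY",
    "ALPHAADVANTAGE_API_KEY",
    "JINA_API_KEY",
)
# Assumption: the port values listed above are treated as defaults.
MCP_PORT_DEFAULTS = {"MATH": 8000, "SEARCH": 8001, "TRADE": 8002, "GETPRICE": 8003}


@dataclass(frozen=True)
class Settings:
    runtime_env_path: str
    api_host: str
    api_port: int
    mcp_ports: dict


def load_settings(env=os.environ):
    """Read configuration from a mapping (defaults to the process environment)."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        runtime_env_path=env.get("RUNTIME_ENV_PATH", "/app/data/runtime_env.json"),
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8080")),
        mcp_ports={
            name: int(env.get(f"{name}_HTTP_PORT", default))
            for name, default in MCP_PORT_DEFAULTS.items()
        },
    )
```

Failing fast on a missing key at startup is usually easier to debug than a failed model API call halfway through a job.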

### Required Python Packages (new)
```
fastapi==0.109.0
uvicorn[standard]==0.27.0
pydantic==2.5.3
```

### Docker Requirements
- Docker Engine 20.10+
- Docker Compose 2.0+
- 2GB RAM minimum for container
- 10GB disk space for data

### Windmill Requirements
- Windmill instance (self-hosted or cloud)
- Network access from Windmill to AI-Trader API
- Windmill CLI for deployment (optional)

---

## Testing Strategy

### Unit Tests
- `tests/test_job_manager.py` - Database operations
- `tests/test_worker.py` - Job execution logic
- `tests/test_executor.py` - Model-day execution

### Integration Tests
- `tests/test_api_endpoints.py` - FastAPI endpoint behavior
- `tests/test_end_to_end.py` - Full workflow (trigger → execute → retrieve)

### Manual Testing
- Docker container startup
- Health check endpoint
- Windmill workflow execution
- Dashboard visualization

---

## Performance Expectations

### Single Model-Day Execution
- **Duration:** 30-60 seconds (varies by AI model latency)
- **Bottlenecks:** AI API calls, MCP tool latency

### Multi-Model Job
- **Example:** 3 models × 5 days = 15 model-days
- **Parallel Execution:** ~2.5-5 minutes (5 dates × 30-60 seconds, ignoring overhead)
- **Sequential Execution:** ~7.5-15 minutes (15 model-days × 30-60 seconds)
- **Speedup:** ~3x (number of models)

### API Response Times
- `/simulate/trigger`: < 1 second (just queues job)
- `/simulate/status/{job_id}`: < 100ms (SQLite query)
- `/results?detail=minimal`: < 500ms (file read + JSON parsing)
- `/results?detail=full`: < 2 seconds (parse log files)

---

## Security Considerations

### MVP Security
- **Network Isolation:** Docker network (no public exposure)
- **No Authentication:** Assumes Windmill → API is trusted network

### Future Enhancements
- API key authentication (`X-API-Key` header)
- Rate limiting per client
- HTTPS/TLS encryption
- Input sanitization for path traversal prevention
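
Path traversal matters here because the trigger request carries a `config_path`. A minimal check, assuming configs must live under a `configs/` directory relative to the application root, might look like this. The function name and both directory parameters are hypothetical.

```python
from pathlib import Path


def resolve_config_path(config_path, base_dir=".", allowed_subdir="configs"):
    """Resolve config_path and reject anything outside base_dir/allowed_subdir.

    Hypothetical helper; the directory layout is an assumption.
    """
    base = Path(base_dir).resolve()
    allowed = (base / allowed_subdir).resolve()
    # resolve() collapses ".." segments; an absolute config_path replaces base.
    candidate = (base / config_path).resolve()
    if allowed not in candidate.parents:
        raise ValueError(f"config_path escapes {allowed_subdir}/: {config_path!r}")
    return candidate
```

Resolving before comparing is the important part. A plain string-prefix check on the raw input would accept `configs/../../etc/passwd`.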

---

## Deployment Steps

### 1. Build Docker Image
```bash
docker-compose build
```

### 2. Start API Service
```bash
docker-compose up -d
```

### 3. Verify Health
```bash
curl http://localhost:8080/health
```

### 4. Test Trigger
```bash
curl -X POST http://localhost:8080/simulate/trigger \
  -H "Content-Type: application/json" \
  -d '{"config_path": "configs/default_config.json"}'
```

### 5. Deploy Windmill Scripts
```bash
wmill script push windmill/trigger_simulation.ts
wmill script push windmill/poll_simulation_status.ts
wmill script push windmill/store_simulation_results.py
```

### 6. Create Windmill Workflow
- Import `windmill/daily_simulation_workflow.yaml`
- Configure resource `ai_trader_api` with API URL
- Set cron schedule (daily 6 AM)

### 7. Create Windmill Dashboard
- Import `windmill/dashboard.json`
- Verify data visualization

---

## Troubleshooting Guide

### Issue: Health check fails
**Symptoms:** `curl http://localhost:8080/health` returns 503

**Possible Causes:**
1. MCP services not running
2. Database file permission error
3. API server not started

**Solutions:**
```bash
# Check MCP services
docker-compose exec ai-trader curl http://localhost:8000/health

# Check API logs
docker-compose logs -f ai-trader

# Restart container
docker-compose restart
```

### Issue: Job stuck in "running" status
**Symptoms:** Job never completes, status remains "running"

**Possible Causes:**
1. Agent execution crashed
2. Model API timeout
3. Worker process died

**Solutions:**
```bash
# Check job details for error messages
curl http://localhost:8080/simulate/status/{job_id}

# Check container logs
docker-compose logs -f ai-trader

# If API restarted, stale jobs are marked as failed on startup
docker-compose restart
```
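
The startup recovery mentioned in the last comment could be a single `UPDATE` run before the API accepts requests. The sketch assumes a `jobs` table with a `status` column and treats only "running" jobs as stale. A real implementation might also sweep "pending" jobs whose background task was lost.

```python
import sqlite3


def fail_stale_jobs(conn):
    """Mark jobs left in 'running' by a previous process as failed.

    Returns the number of jobs updated. Table and column names are
    assumptions for illustration.
    """
    with conn:  # commit on success
        cur = conn.execute(
            "UPDATE jobs SET status = 'failed' WHERE status = 'running'"
        )
    return cur.rowcount
```

Without a step like this, a crashed worker would leave a job "running" indefinitely and the single-job concurrency check would block every new trigger.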

### Issue: Windmill can't reach API
**Symptoms:** Connection refused from Windmill scripts

**Solutions:**
- Verify that Windmill and AI-Trader are on the same Docker network
- Check firewall rules
- Use the container name (`ai-trader`) instead of `localhost` in the Windmill resource
- Verify the `API_PORT` environment variable

---

## Migration from Batch Mode

### For Users Currently Running Batch Mode

**Option 1: Dual Mode (Recommended)**
- Keep existing `main.py` for manual testing
- Add new API mode for production automation
- Use different config files for each mode

**Option 2: API-Only**
- Replace batch execution entirely
- All simulations via API calls
- More consistent with production workflow

### Migration Checklist
- [ ] Backup existing `data/` directory
- [ ] Update `.env` with API configuration
- [ ] Test API mode in separate environment first
- [ ] Gradually migrate Windmill workflows
- [ ] Monitor logs for errors
- [ ] Validate results match batch mode output
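
For the last checklist item, a record-by-record comparison of two JSON Lines files is one starting point. The helper names are hypothetical, and nothing is assumed about the fields inside each record. Exact equality may be too strict if agent decisions are not deterministic between runs.

```python
import json


def load_jsonl(path):
    """Parse a JSON Lines file, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def first_mismatch(batch_path, api_path):
    """Return the index of the first differing record, or None if both
    files contain the same records in the same order."""
    batch, api = load_jsonl(batch_path), load_jsonl(api_path)
    for index, (a, b) in enumerate(zip(batch, api)):
        if a != b:
            return index
    if len(batch) != len(api):
        # One file is a strict prefix of the other.
        return min(len(batch), len(api))
    return None
```

Comparing parsed objects rather than raw lines ignores differences in whitespace and key order.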

---

## Next Steps

1. **Review Specifications**
   - Read all 4 specification documents
   - Ask clarifying questions
   - Approve design before implementation

2. **Implementation Phase 1** (Days 1-2)
   - Set up `api/` directory structure
   - Implement database and job_manager
   - Write unit tests

3. **Implementation Phase 2** (Days 3-4)
   - Implement worker and executor
   - Test with mock agents

4. **Implementation Phase 3** (Days 5-6)
   - Implement FastAPI endpoints
   - Test with Postman/curl

5. **Implementation Phase 4** (Day 7)
   - Docker integration
   - End-to-end testing

6. **Implementation Phase 5** (Days 8-9)
   - Windmill integration
   - Dashboard creation

7. **Implementation Phase 6** (Day 10)
   - Final testing
   - Documentation

---

## Questions or Feedback?

Please review all specifications and provide feedback on:
1. API endpoint design
2. Database schema
3. Execution pattern (date-sequential, model-parallel)
4. Error handling approach
5. Windmill integration workflow
6. Any concerns or suggested improvements

**Ready to proceed with implementation?** Confirm approval of specifications to begin Phase 1.