Files
AI-Trader/docs/README-SPECS.md
Bill fb9583b374 feat: transform to REST API service with SQLite persistence (v0.3.0)
Major architecture transformation from batch-only to API service with
database persistence for Windmill integration.

## REST API Implementation
- POST /simulate/trigger - Start simulation jobs
- GET /simulate/status/{job_id} - Monitor job progress
- GET /results - Query results with filters (job_id, date, model)
- GET /health - Service health checks

## Database Layer
- SQLite persistence with 6 tables (jobs, job_details, positions,
  holdings, reasoning_logs, tool_usage)
- Foreign key constraints with cascade deletes
- Replaces JSONL file storage

## Backend Components
- JobManager: Job lifecycle management with concurrency control
- RuntimeConfigManager: Thread-safe isolated runtime configs
- ModelDayExecutor: Single model-day execution engine
- SimulationWorker: Date-sequential, model-parallel orchestration

## Testing
- 102 unit and integration tests (85% coverage)
- Database: 98% coverage
- Job manager: 98% coverage
- API endpoints: 81% coverage
- Pydantic models: 100% coverage
- TDD approach throughout

## Docker Deployment
- Dual-mode: API server (persistent) + batch (one-time)
- Health checks with 30s interval
- Volume persistence for database and logs
- Separate entrypoints for each mode

## Validation Tools
- scripts/validate_docker_build.sh - Build validation
- scripts/test_api_endpoints.sh - Complete API testing
- scripts/test_batch_mode.sh - Batch mode validation
- DOCKER_API.md - Deployment guide
- TESTING_GUIDE.md - Testing procedures

## Configuration
- API_PORT environment variable (default: 8080)
- Backwards compatible with existing configs
- FastAPI, uvicorn, pydantic>=2.0 dependencies

Co-Authored-By: AI Assistant <noreply@example.com>
2025-10-31 11:47:10 -04:00

437 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# AI-Trader API Service - Technical Specifications Summary
## Overview
This directory contains comprehensive technical specifications for transforming the AI-Trader batch simulation system into an API service compatible with Windmill automation.
## Specification Documents
### 1. [API Specification](./api-specification.md)
**Purpose:** Defines all API endpoints, request/response formats, and data models
**Key Contents:**
- **5 REST Endpoints:**
- `POST /simulate/trigger` - Queue catch-up simulation job
- `GET /simulate/status/{job_id}` - Poll job progress
- `GET /simulate/current` - Get latest job
- `GET /results` - Retrieve simulation results (minimal/full detail)
- `GET /health` - Service health check
- **Pydantic Models** for type-safe request/response handling
- **Error Handling** strategies and HTTP status codes
- **SQLite Schema** for jobs and job_details tables
- **Configuration Management** via environment variables
**Status Codes:** 200 OK, 202 Accepted, 400 Bad Request, 404 Not Found, 409 Conflict, 503 Service Unavailable
---
### 2. [Job Manager Specification](./job-manager-specification.md)
**Purpose:** Details the job tracking and database layer
**Key Contents:**
- **SQLite Database Schema:**
- `jobs` table - High-level job metadata
- `job_details` table - Per model-day execution tracking
- **JobManager Class Interface:**
- `create_job()` - Create new simulation job
- `get_job()` - Retrieve job by ID
- `update_job_status()` - State transitions (pending → running → completed/partial/failed)
- `get_job_progress()` - Detailed progress metrics
- `can_start_new_job()` - Concurrency control
- **State Machine:** Job status transitions and business logic
- **Concurrency Control:** Single-job execution enforcement
- **Testing Strategy:** Unit tests with temporary databases
**Key Feature:** Independent model execution - one model's failure doesn't block others (results in "partial" status)
---
### 3. [Background Worker Specification](./worker-specification.md)
**Purpose:** Defines async job execution architecture
**Key Contents:**
- **Execution Pattern:** Date-sequential, Model-parallel
- All models for Date 1 run in parallel
- Date 2 starts only after all models finish Date 1
- Ensures position.jsonl integrity (no concurrent writes)
- **SimulationWorker Class:**
- Orchestrates job execution
- Manages date sequencing
- Handles job-level errors
- **ModelDayExecutor Class:**
- Executes single model-day simulation
- Updates job_detail status
- Isolates runtime configuration
- **RuntimeConfigManager:**
- Creates temporary runtime_env_{job_id}_{model}_{date}.json files
- Prevents state collisions between concurrent models
- Cleans up after execution
- **Error Handling:** Graceful failure (models continue despite peer failures)
- **Logging:** Structured JSON logging with job/model/date context
**Performance:** 3 models × 5 days = ~7-15 minutes (vs. ~22-45 minutes sequential)
---
### 4. [Implementation Specification](./implementation-specifications.md)
**Purpose:** Complete implementation guide covering Agent, Docker, and Windmill
**Key Contents:**
#### Part 1: BaseAgent Refactoring
- **Analysis:** Existing `run_trading_session()` already compatible with API mode
- **Required Changes:** ✅ NONE! Existing code works as-is
- **Worker Integration:** Calls `agent.run_trading_session(date)` directly
#### Part 2: Docker Configuration
- **Modified Dockerfile:** Adds FastAPI dependencies, new entrypoint
- **docker-entrypoint-api.sh:** Starts MCP services → launches uvicorn
- **Health Checks:** Verifies MCP services and database connectivity
- **Volume Mounts:** `./data`, `./configs` for persistence
#### Part 3: Windmill Integration
- **Flow 1: trigger_simulation.ts** - Daily cron triggers API
- **Flow 2: poll_simulation_status.ts** - Polls every 5 min until complete
- **Flow 3: store_simulation_results.py** - Stores results in Windmill DB
- **Dashboard:** Charts and tables showing portfolio performance
- **Workflow Orchestration:** Complete YAML workflow definition
#### Part 4: File Structure
- New `api/` directory with 7 modules
- New `windmill/` directory with scripts and dashboard
- New `docs/` directory (this folder)
- `data/jobs.db` for job tracking
#### Part 5: Implementation Checklist
10-day implementation plan broken into 6 phases
---
## Architecture Highlights
### Request Flow
```
1. Windmill → POST /simulate/trigger
2. API creates job in SQLite (status: pending)
3. API queues BackgroundTask
4. API returns 202 Accepted with job_id
5. Worker starts (status: running)
6. For each date sequentially:
For each model in parallel:
- Create isolated runtime config
- Execute agent.run_trading_session(date)
- Update job_detail status
7. Worker finishes (status: completed/partial/failed)
8. Windmill polls GET /simulate/status/{job_id}
9. When complete: Windmill calls GET /results?date=X
10. Windmill stores results in internal DB
11. Windmill dashboard displays performance
```
### Data Flow
```
Input: configs/default_config.json
API: Calculates date_range (last position → today)
Worker: Executes simulations
Output: data/agent_data/{model}/position/position.jsonl
data/agent_data/{model}/log/{date}/log.jsonl
data/jobs.db (job tracking)
API: Reads position.jsonl + calculates P&L
Windmill: Stores in internal DB → Dashboard visualization
```
---
## Key Design Decisions
### 1. Pattern B: Lazy On-Demand Processing
- **Chosen:** Windmill controls simulation timing via API calls
- **Benefit:** Centralized scheduling in Windmill
- **Tradeoff:** First Windmill call of the day triggers long-running job
### 2. SQLite vs. PostgreSQL
- **Chosen:** SQLite for MVP
- **Rationale:** Low concurrency (1 job at a time), simple deployment
- **Future:** PostgreSQL for production with multiple concurrent jobs
### 3. Date-Sequential, Model-Parallel Execution
- **Chosen:** Dates run sequentially, models run in parallel per date
- **Rationale:** Prevents position.jsonl race conditions, faster than fully sequential
- **Performance:** ~50% faster than sequential (3 models in parallel)
### 4. Independent Model Failures
- **Chosen:** One model's failure doesn't block others
- **Benefit:** Partial results better than no results
- **Implementation:** Job status becomes "partial" if any model fails
### 5. Minimal BaseAgent Changes
- **Chosen:** No modifications to agent code
- **Rationale:** Existing `run_trading_session()` is perfect API interface
- **Benefit:** Maintains backward compatibility with batch mode
---
## Implementation Prerequisites
### Required Environment Variables
```bash
OPENAI_API_BASE=...
OPENAI_API_KEY=...
ALPHAADVANTAGE_API_KEY=...
JINA_API_KEY=...
RUNTIME_ENV_PATH=/app/data/runtime_env.json
MATH_HTTP_PORT=8000
SEARCH_HTTP_PORT=8001
TRADE_HTTP_PORT=8002
GETPRICE_HTTP_PORT=8003
API_HOST=0.0.0.0
API_PORT=8080
```
### Required Python Packages (new)
```
fastapi==0.109.0
uvicorn[standard]==0.27.0
pydantic==2.5.3
```
### Docker Requirements
- Docker Engine 20.10+
- Docker Compose 2.0+
- 2GB RAM minimum for container
- 10GB disk space for data
### Windmill Requirements
- Windmill instance (self-hosted or cloud)
- Network access from Windmill to AI-Trader API
- Windmill CLI for deployment (optional)
---
## Testing Strategy
### Unit Tests
- `tests/test_job_manager.py` - Database operations
- `tests/test_worker.py` - Job execution logic
- `tests/test_executor.py` - Model-day execution
### Integration Tests
- `tests/test_api_endpoints.py` - FastAPI endpoint behavior
- `tests/test_end_to_end.py` - Full workflow (trigger → execute → retrieve)
### Manual Testing
- Docker container startup
- Health check endpoint
- Windmill workflow execution
- Dashboard visualization
---
## Performance Expectations
### Single Model-Day Execution
- **Duration:** 30-60 seconds (varies by AI model latency)
- **Bottlenecks:** AI API calls, MCP tool latency
### Multi-Model Job
- **Example:** 3 models × 5 days = 15 model-days
- **Parallel Execution:** ~7-15 minutes
- **Sequential Execution:** ~22-45 minutes
- **Speedup:** ~3x (number of models)
### API Response Times
- `/simulate/trigger`: < 1 second (just queues job)
- `/simulate/status`: < 100ms (SQLite query)
- `/results?detail=minimal`: < 500ms (file read + JSON parsing)
- `/results?detail=full`: < 2 seconds (parse log files)
---
## Security Considerations
### MVP Security
- **Network Isolation:** Docker network (no public exposure)
- **No Authentication:** Assumes Windmill → API is trusted network
### Future Enhancements
- API key authentication (`X-API-Key` header)
- Rate limiting per client
- HTTPS/TLS encryption
- Input sanitization for path traversal prevention
---
## Deployment Steps
### 1. Build Docker Image
```bash
docker-compose build
```
### 2. Start API Service
```bash
docker-compose up -d
```
### 3. Verify Health
```bash
curl http://localhost:8080/health
```
### 4. Test Trigger
```bash
curl -X POST http://localhost:8080/simulate/trigger \
-H "Content-Type: application/json" \
-d '{"config_path": "configs/default_config.json"}'
```
### 5. Deploy Windmill Scripts
```bash
wmill script push windmill/trigger_simulation.ts
wmill script push windmill/poll_simulation_status.ts
wmill script push windmill/store_simulation_results.py
```
### 6. Create Windmill Workflow
- Import `windmill/daily_simulation_workflow.yaml`
- Configure resource `ai_trader_api` with API URL
- Set cron schedule (daily 6 AM)
### 7. Create Windmill Dashboard
- Import `windmill/dashboard.json`
- Verify data visualization
---
## Troubleshooting Guide
### Issue: Health check fails
**Symptoms:** `curl http://localhost:8080/health` returns 503
**Possible Causes:**
1. MCP services not running
2. Database file permission error
3. API server not started
**Solutions:**
```bash
# Check MCP services
docker-compose exec ai-trader curl http://localhost:8000/health
# Check API logs
docker-compose logs -f ai-trader
# Restart container
docker-compose restart
```
### Issue: Job stuck in "running" status
**Symptoms:** Job never completes, status remains "running"
**Possible Causes:**
1. Agent execution crashed
2. Model API timeout
3. Worker process died
**Solutions:**
```bash
# Check job details for error messages
curl http://localhost:8080/simulate/status/{job_id}
# Check container logs
docker-compose logs -f ai-trader
# If API restarted, stale jobs are marked as failed on startup
docker-compose restart
```
### Issue: Windmill can't reach API
**Symptoms:** Connection refused from Windmill scripts
**Solutions:**
- Verify Windmill and AI-Trader on same Docker network
- Check firewall rules
- Use container name (ai-trader) instead of localhost in Windmill resource
- Verify API_PORT environment variable
---
## Migration from Batch Mode
### For Users Currently Running Batch Mode
**Option 1: Dual Mode (Recommended)**
- Keep existing `main.py` for manual testing
- Add new API mode for production automation
- Use different config files for each mode
**Option 2: API-Only**
- Replace batch execution entirely
- All simulations via API calls
- More consistent with production workflow
### Migration Checklist
- [ ] Backup existing `data/` directory
- [ ] Update `.env` with API configuration
- [ ] Test API mode in separate environment first
- [ ] Gradually migrate Windmill workflows
- [ ] Monitor logs for errors
- [ ] Validate results match batch mode output
---
## Next Steps
1. **Review Specifications**
- Read all 4 specification documents
- Ask clarifying questions
- Approve design before implementation
2. **Implementation Phase 1** (Days 1-2)
- Set up `api/` directory structure
- Implement database and job_manager
- Write unit tests
3. **Implementation Phase 2** (Days 3-4)
- Implement worker and executor
- Test with mock agents
4. **Implementation Phase 3** (Days 5-6)
- Implement FastAPI endpoints
- Test with Postman/curl
5. **Implementation Phase 4** (Day 7)
- Docker integration
- End-to-end testing
6. **Implementation Phase 5** (Days 8-9)
- Windmill integration
- Dashboard creation
7. **Implementation Phase 6** (Day 10)
- Final testing
- Documentation
---
## Questions or Feedback?
Please review all specifications and provide feedback on:
1. API endpoint design
2. Database schema
3. Execution pattern (date-sequential, model-parallel)
4. Error handling approach
5. Windmill integration workflow
6. Any concerns or suggested improvements
**Ready to proceed with implementation?** Confirm approval of specifications to begin Phase 1.