feat: transform to REST API service with SQLite persistence (v0.3.0)

Major architecture transformation from batch-only to API service with
database persistence for Windmill integration.

## REST API Implementation
- POST /simulate/trigger - Start simulation jobs
- GET /simulate/status/{job_id} - Monitor job progress
- GET /results - Query results with filters (job_id, date, model)
- GET /health - Service health checks

## Database Layer
- SQLite persistence with 6 tables (jobs, job_details, positions,
  holdings, reasoning_logs, tool_usage)
- Foreign key constraints with cascade deletes
- Replaces JSONL file storage

## Backend Components
- JobManager: Job lifecycle management with concurrency control
- RuntimeConfigManager: Thread-safe isolated runtime configs
- ModelDayExecutor: Single model-day execution engine
- SimulationWorker: Date-sequential, model-parallel orchestration

## Testing
- 102 unit and integration tests (85% coverage)
- Database: 98% coverage
- Job manager: 98% coverage
- API endpoints: 81% coverage
- Pydantic models: 100% coverage
- TDD approach throughout

## Docker Deployment
- Dual-mode: API server (persistent) + batch (one-time)
- Health checks with 30s interval
- Volume persistence for database and logs
- Separate entrypoints for each mode

## Validation Tools
- scripts/validate_docker_build.sh - Build validation
- scripts/test_api_endpoints.sh - Complete API testing
- scripts/test_batch_mode.sh - Batch mode validation
- DOCKER_API.md - Deployment guide
- TESTING_GUIDE.md - Testing procedures

## Configuration
- API_PORT environment variable (default: 8080)
- Backwards compatible with existing configs
- FastAPI, uvicorn, pydantic>=2.0 dependencies

Co-Authored-By: AI Assistant <noreply@example.com>
This commit is contained in:
2025-10-31 11:47:10 -04:00
parent 5da02b4ba0
commit fb9583b374
45 changed files with 13775 additions and 18 deletions

436
docs/README-SPECS.md Normal file
View File

@@ -0,0 +1,436 @@
# AI-Trader API Service - Technical Specifications Summary
## Overview
This directory contains comprehensive technical specifications for transforming the AI-Trader batch simulation system into an API service compatible with Windmill automation.
## Specification Documents
### 1. [API Specification](./api-specification.md)
**Purpose:** Defines all API endpoints, request/response formats, and data models
**Key Contents:**
- **5 REST Endpoints:**
- `POST /simulate/trigger` - Queue catch-up simulation job
- `GET /simulate/status/{job_id}` - Poll job progress
- `GET /simulate/current` - Get latest job
- `GET /results` - Retrieve simulation results (minimal/full detail)
- `GET /health` - Service health check
- **Pydantic Models** for type-safe request/response handling
- **Error Handling** strategies and HTTP status codes
- **SQLite Schema** for jobs and job_details tables
- **Configuration Management** via environment variables
**Status Codes:** 200 OK, 202 Accepted, 400 Bad Request, 404 Not Found, 409 Conflict, 503 Service Unavailable
---
### 2. [Job Manager Specification](./job-manager-specification.md)
**Purpose:** Details the job tracking and database layer
**Key Contents:**
- **SQLite Database Schema:**
- `jobs` table - High-level job metadata
- `job_details` table - Per model-day execution tracking
- **JobManager Class Interface:**
- `create_job()` - Create new simulation job
- `get_job()` - Retrieve job by ID
- `update_job_status()` - State transitions (pending → running → completed/partial/failed)
- `get_job_progress()` - Detailed progress metrics
- `can_start_new_job()` - Concurrency control
- **State Machine:** Job status transitions and business logic
- **Concurrency Control:** Single-job execution enforcement
- **Testing Strategy:** Unit tests with temporary databases
**Key Feature:** Independent model execution - one model's failure doesn't block others (results in "partial" status)
---
### 3. [Background Worker Specification](./worker-specification.md)
**Purpose:** Defines async job execution architecture
**Key Contents:**
- **Execution Pattern:** Date-sequential, Model-parallel
- All models for Date 1 run in parallel
- Date 2 starts only after all models finish Date 1
- Ensures position.jsonl integrity (no concurrent writes)
- **SimulationWorker Class:**
- Orchestrates job execution
- Manages date sequencing
- Handles job-level errors
- **ModelDayExecutor Class:**
- Executes single model-day simulation
- Updates job_detail status
- Isolates runtime configuration
- **RuntimeConfigManager:**
- Creates temporary runtime_env_{job_id}_{model}_{date}.json files
- Prevents state collisions between concurrent models
- Cleans up after execution
- **Error Handling:** Graceful failure (models continue despite peer failures)
- **Logging:** Structured JSON logging with job/model/date context
**Performance:** 3 models × 5 days = ~7-15 minutes (vs. ~22-45 minutes sequential)
---
### 4. [Implementation Specification](./implementation-specifications.md)
**Purpose:** Complete implementation guide covering Agent, Docker, and Windmill
**Key Contents:**
#### Part 1: BaseAgent Refactoring
- **Analysis:** Existing `run_trading_session()` already compatible with API mode
- **Required Changes:** ✅ NONE! Existing code works as-is
- **Worker Integration:** Calls `agent.run_trading_session(date)` directly
#### Part 2: Docker Configuration
- **Modified Dockerfile:** Adds FastAPI dependencies, new entrypoint
- **docker-entrypoint-api.sh:** Starts MCP services → launches uvicorn
- **Health Checks:** Verifies MCP services and database connectivity
- **Volume Mounts:** `./data`, `./configs` for persistence
#### Part 3: Windmill Integration
- **Flow 1: trigger_simulation.ts** - Daily cron triggers API
- **Flow 2: poll_simulation_status.ts** - Polls every 5 min until complete
- **Flow 3: store_simulation_results.py** - Stores results in Windmill DB
- **Dashboard:** Charts and tables showing portfolio performance
- **Workflow Orchestration:** Complete YAML workflow definition
#### Part 4: File Structure
- New `api/` directory with 7 modules
- New `windmill/` directory with scripts and dashboard
- New `docs/` directory (this folder)
- `data/jobs.db` for job tracking
#### Part 5: Implementation Checklist
10-day implementation plan broken into 6 phases
---
## Architecture Highlights
### Request Flow
```
1. Windmill → POST /simulate/trigger
2. API creates job in SQLite (status: pending)
3. API queues BackgroundTask
4. API returns 202 Accepted with job_id
5. Worker starts (status: running)
6. For each date sequentially:
For each model in parallel:
- Create isolated runtime config
- Execute agent.run_trading_session(date)
- Update job_detail status
7. Worker finishes (status: completed/partial/failed)
8. Windmill polls GET /simulate/status/{job_id}
9. When complete: Windmill calls GET /results?date=X
10. Windmill stores results in internal DB
11. Windmill dashboard displays performance
```
### Data Flow
```
Input: configs/default_config.json
API: Calculates date_range (last position → today)
Worker: Executes simulations
Output: data/agent_data/{model}/position/position.jsonl
data/agent_data/{model}/log/{date}/log.jsonl
data/jobs.db (job tracking)
API: Reads position.jsonl + calculates P&L
Windmill: Stores in internal DB → Dashboard visualization
```
---
## Key Design Decisions
### 1. Pattern B: Lazy On-Demand Processing
- **Chosen:** Windmill controls simulation timing via API calls
- **Benefit:** Centralized scheduling in Windmill
- **Tradeoff:** First Windmill call of the day triggers long-running job
### 2. SQLite vs. PostgreSQL
- **Chosen:** SQLite for MVP
- **Rationale:** Low concurrency (1 job at a time), simple deployment
- **Future:** PostgreSQL for production with multiple concurrent jobs
### 3. Date-Sequential, Model-Parallel Execution
- **Chosen:** Dates run sequentially, models run in parallel per date
- **Rationale:** Prevents position.jsonl race conditions, faster than fully sequential
- **Performance:** ~50% faster than sequential (3 models in parallel)
### 4. Independent Model Failures
- **Chosen:** One model's failure doesn't block others
- **Benefit:** Partial results better than no results
- **Implementation:** Job status becomes "partial" if any model fails
### 5. Minimal BaseAgent Changes
- **Chosen:** No modifications to agent code
- **Rationale:** Existing `run_trading_session()` is perfect API interface
- **Benefit:** Maintains backward compatibility with batch mode
---
## Implementation Prerequisites
### Required Environment Variables
```bash
OPENAI_API_BASE=...
OPENAI_API_KEY=...
ALPHAADVANTAGE_API_KEY=...
JINA_API_KEY=...
RUNTIME_ENV_PATH=/app/data/runtime_env.json
MATH_HTTP_PORT=8000
SEARCH_HTTP_PORT=8001
TRADE_HTTP_PORT=8002
GETPRICE_HTTP_PORT=8003
API_HOST=0.0.0.0
API_PORT=8080
```
### Required Python Packages (new)
```
fastapi==0.109.0
uvicorn[standard]==0.27.0
pydantic==2.5.3
```
### Docker Requirements
- Docker Engine 20.10+
- Docker Compose 2.0+
- 2GB RAM minimum for container
- 10GB disk space for data
### Windmill Requirements
- Windmill instance (self-hosted or cloud)
- Network access from Windmill to AI-Trader API
- Windmill CLI for deployment (optional)
---
## Testing Strategy
### Unit Tests
- `tests/test_job_manager.py` - Database operations
- `tests/test_worker.py` - Job execution logic
- `tests/test_executor.py` - Model-day execution
### Integration Tests
- `tests/test_api_endpoints.py` - FastAPI endpoint behavior
- `tests/test_end_to_end.py` - Full workflow (trigger → execute → retrieve)
### Manual Testing
- Docker container startup
- Health check endpoint
- Windmill workflow execution
- Dashboard visualization
---
## Performance Expectations
### Single Model-Day Execution
- **Duration:** 30-60 seconds (varies by AI model latency)
- **Bottlenecks:** AI API calls, MCP tool latency
### Multi-Model Job
- **Example:** 3 models × 5 days = 15 model-days
- **Parallel Execution:** ~7-15 minutes
- **Sequential Execution:** ~22-45 minutes
- **Speedup:** ~3x (number of models)
### API Response Times
- `/simulate/trigger`: < 1 second (just queues job)
- `/simulate/status`: < 100ms (SQLite query)
- `/results?detail=minimal`: < 500ms (file read + JSON parsing)
- `/results?detail=full`: < 2 seconds (parse log files)
---
## Security Considerations
### MVP Security
- **Network Isolation:** Docker network (no public exposure)
- **No Authentication:** Assumes Windmill → API is trusted network
### Future Enhancements
- API key authentication (`X-API-Key` header)
- Rate limiting per client
- HTTPS/TLS encryption
- Input sanitization for path traversal prevention
---
## Deployment Steps
### 1. Build Docker Image
```bash
docker-compose build
```
### 2. Start API Service
```bash
docker-compose up -d
```
### 3. Verify Health
```bash
curl http://localhost:8080/health
```
### 4. Test Trigger
```bash
curl -X POST http://localhost:8080/simulate/trigger \
-H "Content-Type: application/json" \
-d '{"config_path": "configs/default_config.json"}'
```
### 5. Deploy Windmill Scripts
```bash
wmill script push windmill/trigger_simulation.ts
wmill script push windmill/poll_simulation_status.ts
wmill script push windmill/store_simulation_results.py
```
### 6. Create Windmill Workflow
- Import `windmill/daily_simulation_workflow.yaml`
- Configure resource `ai_trader_api` with API URL
- Set cron schedule (daily 6 AM)
### 7. Create Windmill Dashboard
- Import `windmill/dashboard.json`
- Verify data visualization
---
## Troubleshooting Guide
### Issue: Health check fails
**Symptoms:** `curl http://localhost:8080/health` returns 503
**Possible Causes:**
1. MCP services not running
2. Database file permission error
3. API server not started
**Solutions:**
```bash
# Check MCP services
docker-compose exec ai-trader curl http://localhost:8000/health
# Check API logs
docker-compose logs -f ai-trader
# Restart container
docker-compose restart
```
### Issue: Job stuck in "running" status
**Symptoms:** Job never completes, status remains "running"
**Possible Causes:**
1. Agent execution crashed
2. Model API timeout
3. Worker process died
**Solutions:**
```bash
# Check job details for error messages
curl http://localhost:8080/simulate/status/{job_id}
# Check container logs
docker-compose logs -f ai-trader
# If API restarted, stale jobs are marked as failed on startup
docker-compose restart
```
### Issue: Windmill can't reach API
**Symptoms:** Connection refused from Windmill scripts
**Solutions:**
- Verify Windmill and AI-Trader on same Docker network
- Check firewall rules
- Use container name (ai-trader) instead of localhost in Windmill resource
- Verify API_PORT environment variable
---
## Migration from Batch Mode
### For Users Currently Running Batch Mode
**Option 1: Dual Mode (Recommended)**
- Keep existing `main.py` for manual testing
- Add new API mode for production automation
- Use different config files for each mode
**Option 2: API-Only**
- Replace batch execution entirely
- All simulations via API calls
- More consistent with production workflow
### Migration Checklist
- [ ] Backup existing `data/` directory
- [ ] Update `.env` with API configuration
- [ ] Test API mode in separate environment first
- [ ] Gradually migrate Windmill workflows
- [ ] Monitor logs for errors
- [ ] Validate results match batch mode output
---
## Next Steps
1. **Review Specifications**
- Read all 4 specification documents
- Ask clarifying questions
- Approve design before implementation
2. **Implementation Phase 1** (Days 1-2)
- Set up `api/` directory structure
- Implement database and job_manager
- Write unit tests
3. **Implementation Phase 2** (Days 3-4)
- Implement worker and executor
- Test with mock agents
4. **Implementation Phase 3** (Days 5-6)
- Implement FastAPI endpoints
- Test with Postman/curl
5. **Implementation Phase 4** (Day 7)
- Docker integration
- End-to-end testing
6. **Implementation Phase 5** (Days 8-9)
- Windmill integration
- Dashboard creation
7. **Implementation Phase 6** (Day 10)
- Final testing
- Documentation
---
## Questions or Feedback?
Please review all specifications and provide feedback on:
1. API endpoint design
2. Database schema
3. Execution pattern (date-sequential, model-parallel)
4. Error handling approach
5. Windmill integration workflow
6. Any concerns or suggested improvements
**Ready to proceed with implementation?** Confirm approval of specifications to begin Phase 1.