diff --git a/docs/plans/2025-10-30-data-cache-reuse-design.md b/docs/plans/2025-10-30-data-cache-reuse-design.md new file mode 100644 index 0000000..7e4c5f6 --- /dev/null +++ b/docs/plans/2025-10-30-data-cache-reuse-design.md @@ -0,0 +1,197 @@ +# Data Cache Reuse Design + +**Date:** 2025-10-30 +**Status:** Approved + +## Problem Statement + +Docker containers currently fetch all 103 NASDAQ 100 tickers from Alpha Vantage on every startup, even when price data is volume-mounted and already cached in `./data`. This causes: +- Slow startup times (103 API calls) +- Unnecessary API quota consumption +- Rate limit risks during frequent development iterations + +## Solution Overview + +Implement staleness-based data refresh with configurable age threshold. Container checks all `daily_prices_*.json` files and only refetches if any file is missing or older than `MAX_DATA_AGE_DAYS`. + +## Design Decisions + +### Architecture Choice +**Selected:** Check all `daily_prices_*.json` files individually +**Rationale:** Ensures data integrity by detecting partial/missing files, not just stale merged data + +### Implementation Location +**Selected:** Bash wrapper logic in `entrypoint.sh` +**Rationale:** Keeps data fetching scripts unchanged, adds orchestration at container startup layer + +### Staleness Threshold +**Selected:** Configurable via `MAX_DATA_AGE_DAYS` environment variable (default: 7 days) +**Rationale:** Balances freshness with API usage; flexible for different use cases (development vs production) + +## Technical Design + +### Components + +#### 1. Staleness Check Function +Location: `entrypoint.sh` (after environment validation, before data fetch) + +```bash +should_refresh_data() { + MAX_AGE=${MAX_DATA_AGE_DAYS:-7} + + # Check if at least one price file exists + if ! ls /app/data/daily_prices_*.json >/dev/null 2>&1; then + echo "📭 No price data found" + return 0 # Need refresh + fi + + # Find any files older than MAX_AGE days + STALE_COUNT=$(find /app/data -name "daily_prices_*.json" -mtime +$MAX_AGE | wc -l) + TOTAL_COUNT=$(ls /app/data/daily_prices_*.json 2>/dev/null | wc -l) + + if [ $STALE_COUNT -gt 0 ]; then + echo "📅 Found $STALE_COUNT stale files (>$MAX_AGE days old)" + return 0 # Need refresh + fi + + echo "✅ All $TOTAL_COUNT price files are fresh (<$MAX_AGE days old)" + return 1 # Skip refresh +} +``` + +**Logic:** +- Uses `find -mtime +N` to detect files modified more than N days ago +- Returns shell exit codes: 0 (refresh needed), 1 (skip refresh) +- Logs informative messages for debugging + +#### 2. Conditional Data Fetch +Location: `entrypoint.sh` lines 40-46 (replace existing unconditional fetch) + +```bash +# Step 1: Data preparation (conditional) +echo "📊 Checking price data freshness..." + +if should_refresh_data; then + echo "🔄 Fetching and merging price data..." + cd /app/data + python /app/scripts/get_daily_price.py + python /app/scripts/merge_jsonl.py + cd /app +else + echo "⏭️ Skipping data fetch (using cached data)" +fi +``` + +#### 3. Environment Configuration +**docker-compose.yml:** +```yaml +environment: + - MAX_DATA_AGE_DAYS=${MAX_DATA_AGE_DAYS:-7} +``` + +**.env.example:** +```bash +# Data Refresh Configuration +MAX_DATA_AGE_DAYS=7 # Refresh price data older than N days (0=always refresh) +``` + +### Data Flow + +1. **Container Startup** → entrypoint.sh begins execution +2. **Environment Validation** → Check required API keys (existing logic) +3. **Staleness Check** → `should_refresh_data()` scans `/app/data/daily_prices_*.json` + - No files found → Return 0 (refresh) + - Any file older than `MAX_DATA_AGE_DAYS` → Return 0 (refresh) + - All files fresh → Return 1 (skip) +4. **Conditional Fetch** → Run get_daily_price.py only if refresh needed +5. **Merge Data** → Always run merge_jsonl.py (handles missing merged.jsonl) +6. **MCP Services** → Start services (existing logic) +7. **Trading Agent** → Begin trading (existing logic) + +### Edge Cases + +| Scenario | Behavior | +|----------|----------| +| **First run (no data)** | Detects no files → triggers full fetch | +| **Restart within 7 days** | All files fresh → skips fetch (fast startup) | +| **Restart after 7 days** | Files stale → refreshes all data | +| **Partial data (some files missing)** | Missing files treated as infinitely old → triggers refresh | +| **Corrupt merged.jsonl but fresh price files** | Skips fetch, re-runs merge to rebuild merged.jsonl | +| **MAX_DATA_AGE_DAYS=0** | Always refresh (useful for testing/production) | +| **MAX_DATA_AGE_DAYS unset** | Defaults to 7 days | +| **Alpha Vantage rate limit** | get_daily_price.py handles with warning (existing behavior) | + +## Configuration Options + +| Variable | Default | Purpose | +|----------|---------|---------| +| `MAX_DATA_AGE_DAYS` | 7 | Days before price data considered stale | + +**Special Values:** +- `0` → Always refresh (force fresh data) +- `999` → Never refresh (use cached data indefinitely) + +## User Experience + +### Scenario 1: Fresh Container +``` +🚀 Starting AI-Trader... +🔍 Validating environment variables... +✅ Environment variables validated +📊 Checking price data freshness... +📭 No price data found +🔄 Fetching and merging price data... +✓ Fetched NVDA +✓ Fetched MSFT +... +``` + +### Scenario 2: Restart Within 7 Days +``` +🚀 Starting AI-Trader... +🔍 Validating environment variables... +✅ Environment variables validated +📊 Checking price data freshness... +✅ All 103 price files are fresh (<7 days old) +⏭️ Skipping data fetch (using cached data) +🔧 Starting MCP services... +``` + +### Scenario 3: Restart After 7 Days +``` +🚀 Starting AI-Trader... +🔍 Validating environment variables... +✅ Environment variables validated +📊 Checking price data freshness... +📅 Found 103 stale files (>7 days old) +🔄 Fetching and merging price data... +✓ Fetched NVDA +✓ Fetched MSFT +... +``` + +## Testing Plan + +1. **Test fresh container:** Delete `./data/daily_prices_*.json`, start container → should fetch all +2. **Test cached data:** Restart immediately → should skip fetch +3. **Test staleness:** `touch -d "8 days ago" ./data/daily_prices_AAPL.json`, restart → should refresh +4. **Test partial data:** Delete 10 random price files → should refresh all +5. **Test MAX_DATA_AGE_DAYS=0:** Restart with env var set → should always fetch +6. **Test MAX_DATA_AGE_DAYS=30:** Restart with 8-day-old data → should skip + +## Documentation Updates + +Files requiring updates: +- `entrypoint.sh` → Add function and conditional logic +- `docker-compose.yml` → Add MAX_DATA_AGE_DAYS environment variable +- `.env.example` → Document MAX_DATA_AGE_DAYS with default value +- `CLAUDE.md` → Update "Docker Deployment" section with new env var +- `docs/DOCKER.md` (if exists) → Explain data caching behavior + +## Benefits + +- **Development:** Instant container restarts during iteration +- **API Quota:** ~103 fewer API calls per restart +- **Reliability:** No rate limit risks during frequent testing +- **Flexibility:** Configurable threshold for different use cases +- **Consistency:** Checks all files to ensure complete data