
Data Cache Reuse Design

Date: 2025-10-30 Status: Approved

Problem Statement

Docker containers currently fetch all 103 NASDAQ 100 tickers from Alpha Vantage on every startup, even when price data is volume-mounted and already cached in ./data. This causes:

  • Slow startup times (103 API calls)
  • Unnecessary API quota consumption
  • Rate limit risks during frequent development iterations

Solution Overview

Implement staleness-based data refresh with configurable age threshold. Container checks all daily_prices_*.json files and only refetches if any file is missing or older than MAX_DATA_AGE_DAYS.

Design Decisions

Architecture Choice

Selected: Check all daily_prices_*.json files individually
Rationale: Ensures data integrity by detecting partial/missing files, not just stale merged data

Implementation Location

Selected: Bash wrapper logic in entrypoint.sh
Rationale: Keeps data fetching scripts unchanged, adds orchestration at container startup layer

Staleness Threshold

Selected: Configurable via MAX_DATA_AGE_DAYS environment variable (default: 7 days)
Rationale: Balances freshness with API usage; flexible for different use cases (development vs production)

Technical Design

Components

1. Staleness Check Function

Location: entrypoint.sh (after environment validation, before data fetch)

should_refresh_data() {
    MAX_AGE=${MAX_DATA_AGE_DAYS:-7}

    # 0 means "always refresh"; find -mtime +0 would only match files
    # older than 24 hours, so short-circuit explicitly
    if [ "$MAX_AGE" -eq 0 ]; then
        echo "🔄 MAX_DATA_AGE_DAYS=0 — forcing refresh"
        return 0  # Need refresh
    fi

    # Check if at least one price file exists
    if ! ls /app/data/daily_prices_*.json >/dev/null 2>&1; then
        echo "📭 No price data found"
        return 0  # Need refresh
    fi

    TOTAL_COUNT=$(ls /app/data/daily_prices_*.json 2>/dev/null | wc -l)

    # Detect partial data: missing files are treated as infinitely old
    # (103 matches the NASDAQ 100 ticker universe used by this project)
    EXPECTED=${EXPECTED_TICKER_COUNT:-103}
    if [ "$TOTAL_COUNT" -lt "$EXPECTED" ]; then
        echo "📭 Only $TOTAL_COUNT of $EXPECTED price files present"
        return 0  # Need refresh
    fi

    # Find any files older than MAX_AGE days
    STALE_COUNT=$(find /app/data -name "daily_prices_*.json" -mtime "+$MAX_AGE" | wc -l)

    if [ "$STALE_COUNT" -gt 0 ]; then
        echo "📅 Found $STALE_COUNT stale files (>$MAX_AGE days old)"
        return 0  # Need refresh
    fi

    echo "✅ All $TOTAL_COUNT price files are fresh (<$MAX_AGE days old)"
    return 1  # Skip refresh
}

Logic:

  • Uses find -mtime +N to detect files modified more than N days ago (find truncates age to whole days, so -mtime +7 matches files at least 8 days old)
  • Treats MAX_DATA_AGE_DAYS=0 as "always refresh", since -mtime +0 would only match files older than 24 hours
  • Treats missing files as stale by comparing the file count against the expected ticker count
  • Returns shell exit codes: 0 (refresh needed), 1 (skip refresh)
  • Logs informative messages for debugging
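
The -mtime rounding behavior is easy to verify in isolation; a minimal sketch (assumes GNU coreutils touch -d and find, as in typical Linux base images):

```shell
# Demonstrate that -mtime +7 matches files >= 8 days old, not >= 7
DEMO_DIR=$(mktemp -d)
touch -d "6 days ago" "$DEMO_DIR/daily_prices_FRESH.json"
touch -d "9 days ago" "$DEMO_DIR/daily_prices_STALE.json"

# find truncates file age to whole days before comparing against +7
STALE_COUNT=$(find "$DEMO_DIR" -name "daily_prices_*.json" -mtime +7 | wc -l)
echo "stale: $STALE_COUNT of 2"   # only the 9-day-old file matches
rm -rf "$DEMO_DIR"
```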

2. Conditional Data Fetch

Location: entrypoint.sh lines 40-46 (replace existing unconditional fetch)

# Step 1: Data preparation (conditional fetch, unconditional merge)
echo "📊 Checking price data freshness..."

cd /app/data
if should_refresh_data; then
    echo "🔄 Fetching and merging price data..."
    python /app/scripts/get_daily_price.py
else
    echo "⏭️  Skipping data fetch (using cached data)"
fi

# Merge always runs so a missing or corrupt merged.jsonl is rebuilt from cache
python /app/scripts/merge_jsonl.py
cd /app

3. Environment Configuration

docker-compose.yml:

environment:
  - MAX_DATA_AGE_DAYS=${MAX_DATA_AGE_DAYS:-7}

.env.example:

# Data Refresh Configuration
MAX_DATA_AGE_DAYS=7  # Refresh price data older than N days (0=always refresh)
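
Because docker-compose.yml substitutes ${MAX_DATA_AGE_DAYS:-7} from the host environment, the threshold can also be overridden per run without editing any file. A hedged usage sketch (the service name trader is hypothetical; use the project's actual compose service name):

```shell
# Force a one-off refresh regardless of cache age (hypothetical service name)
MAX_DATA_AGE_DAYS=0 docker compose up trader

# Development: treat anything under 30 days as fresh
MAX_DATA_AGE_DAYS=30 docker compose up trader
```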

Data Flow

  1. Container Startup → entrypoint.sh begins execution
  2. Environment Validation → Check required API keys (existing logic)
  3. Staleness Check → should_refresh_data() scans /app/data/daily_prices_*.json
    • No files found → Return 0 (refresh)
    • Any file older than MAX_DATA_AGE_DAYS → Return 0 (refresh)
    • All files fresh → Return 1 (skip)
  4. Conditional Fetch → Run get_daily_price.py only if refresh needed
  5. Merge Data → Always run merge_jsonl.py (handles missing merged.jsonl)
  6. MCP Services → Start services (existing logic)
  7. Trading Agent → Begin trading (existing logic)

Edge Cases

| Scenario | Behavior |
| --- | --- |
| First run (no data) | Detects no files → triggers full fetch |
| Restart within 7 days | All files fresh → skips fetch (fast startup) |
| Restart after 7 days | Files stale → refreshes all data |
| Partial data (some files missing) | Missing files treated as infinitely old → triggers refresh |
| Corrupt merged.jsonl but fresh price files | Skips fetch, re-runs merge to rebuild merged.jsonl |
| MAX_DATA_AGE_DAYS=0 | Always refresh (useful for testing/production) |
| MAX_DATA_AGE_DAYS unset | Defaults to 7 days |
| Alpha Vantage rate limit | get_daily_price.py handles it with a warning (existing behavior) |
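
The partial-data row can be implemented by comparing the on-disk file count against the expected ticker universe (103 in this project); a minimal sketch:

```shell
# Simulate partial data: 93 of 103 expected price files present
DIR=$(mktemp -d)
EXPECTED=103
for i in $(seq 1 93); do touch "$DIR/daily_prices_T$i.json"; done

FOUND=$(ls "$DIR"/daily_prices_*.json | wc -l)
if [ "$FOUND" -lt "$EXPECTED" ]; then
    RESULT="refresh"    # missing files -> treated as infinitely old
else
    RESULT="skip"
fi
echo "$FOUND/$EXPECTED -> $RESULT"
rm -rf "$DIR"
```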

Configuration Options

| Variable | Default | Purpose |
| --- | --- | --- |
| MAX_DATA_AGE_DAYS | 7 | Days before price data is considered stale |

Special Values:

  • 0 → Always refresh (force fresh data)
  • A large value (e.g. 999) → effectively never refresh (reuse cached data indefinitely)
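
The contrast between the default and a very large threshold can be seen against a single 10-day-old file (0 is not shown via find, since -mtime +0 only matches files older than a day and is better handled as an explicit override in the check function):

```shell
# One 10-day-old price file, checked against two thresholds
DIR=$(mktemp -d)
touch -d "10 days ago" "$DIR/daily_prices_TEST.json"

stale_count() {  # $1 = MAX_DATA_AGE_DAYS threshold
    find "$DIR" -name "daily_prices_*.json" -mtime "+$1" | wc -l
}

WITH_DEFAULT=$(stale_count 7)    # 10 > 7   -> 1 stale file  -> refresh
WITH_LARGE=$(stale_count 999)    # 10 < 999 -> 0 stale files -> keep cache
echo "default=$WITH_DEFAULT large=$WITH_LARGE"
rm -rf "$DIR"
```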

User Experience

Scenario 1: Fresh Container

🚀 Starting AI-Trader...
🔍 Validating environment variables...
✅ Environment variables validated
📊 Checking price data freshness...
📭 No price data found
🔄 Fetching and merging price data...
✓ Fetched NVDA
✓ Fetched MSFT
...

Scenario 2: Restart Within 7 Days

🚀 Starting AI-Trader...
🔍 Validating environment variables...
✅ Environment variables validated
📊 Checking price data freshness...
✅ All 103 price files are fresh (<7 days old)
⏭️  Skipping data fetch (using cached data)
🔧 Starting MCP services...

Scenario 3: Restart After 7 Days

🚀 Starting AI-Trader...
🔍 Validating environment variables...
✅ Environment variables validated
📊 Checking price data freshness...
📅 Found 103 stale files (>7 days old)
🔄 Fetching and merging price data...
✓ Fetched NVDA
✓ Fetched MSFT
...

Testing Plan

  1. Test fresh container: Delete ./data/daily_prices_*.json, start container → should fetch all
  2. Test cached data: Restart immediately → should skip fetch
  3. Test staleness: touch -d "8 days ago" ./data/daily_prices_AAPL.json, restart → should refresh
  4. Test partial data: Delete 10 random price files → should refresh all
  5. Test MAX_DATA_AGE_DAYS=0: Restart with env var set → should always fetch
  6. Test MAX_DATA_AGE_DAYS=30: Restart with 8-day-old data → should skip
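
Tests 1–3 can also be exercised outside a container by running the staleness logic against a temp directory; a sketch (GNU touch -d assumed, function simplified to the existence and age checks):

```shell
# Harness for tests 1-3 of the plan, against a temp dir instead of /app/data
DATA=$(mktemp -d)
MAX_AGE=7

should_refresh() {
    ls "$DATA"/daily_prices_*.json >/dev/null 2>&1 || return 0   # no data
    [ "$(find "$DATA" -name 'daily_prices_*.json' -mtime "+$MAX_AGE" | wc -l)" -gt 0 ]
}

should_refresh && T1=refresh || T1=skip          # test 1: empty dir

touch "$DATA/daily_prices_AAPL.json"             # test 2: fresh file
should_refresh && T2=refresh || T2=skip

touch -d "8 days ago" "$DATA/daily_prices_AAPL.json"   # test 3: stale file
should_refresh && T3=refresh || T3=skip

echo "$T1 $T2 $T3"   # expect: refresh skip refresh
rm -rf "$DATA"
```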

Documentation Updates

Files requiring updates:

  • entrypoint.sh → Add function and conditional logic
  • docker-compose.yml → Add MAX_DATA_AGE_DAYS environment variable
  • .env.example → Document MAX_DATA_AGE_DAYS with default value
  • CLAUDE.md → Update "Docker Deployment" section with new env var
  • docs/DOCKER.md (if exists) → Explain data caching behavior

Benefits

  • Development: Instant container restarts during iteration
  • API Quota: ~103 fewer API calls per restart
  • Reliability: Reduced rate limit risk during frequent testing
  • Flexibility: Configurable threshold for different use cases
  • Consistency: Checks all files to ensure complete data