# Docker TLS Security Setup

This directory contains scripts and configuration for securing Docker API access with TLS authentication, replacing the insecure socket-mounting approach.

## Overview

Previously, the session-manager service mounted the Docker socket (`/var/run/docker.sock`) directly into containers, granting full root access to the host Docker daemon. This is a critical security vulnerability. This setup replaces socket mounting with authenticated TLS API access over the network.

## Security Benefits

- ✅ **No socket mounting**: Eliminates privilege escalation risk
- ✅ **Mutual TLS authentication**: Both client and server authenticate
- ✅ **Encrypted communication**: All API calls are encrypted
- ✅ **Certificate-based access**: Granular access control
- ✅ **Network isolation**: API access is network-bound, not filesystem-bound

## Docker Service Abstraction

The session-manager now uses a clean `DockerService` abstraction layer that separates Docker operations from business logic, enabling better testing, maintainability, and future Docker client changes.
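As a rough illustration of how the abstraction enables daemon-free testing, here is a minimal in-memory mock; the real `MockDockerService` and `ContainerInfo` in the session-manager code may differ in detail, so treat this as a sketch of the interface shape rather than the actual implementation:

```python
import asyncio
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ContainerInfo:
    # Minimal shape; the real ContainerInfo may carry more fields
    id: str
    name: str
    image: str
    status: str = "created"

class MockDockerService:
    """In-memory stand-in for DockerService; no Docker daemon required."""

    def __init__(self) -> None:
        self._containers: Dict[str, ContainerInfo] = {}

    async def create_container(self, name: str, image: str, **kwargs) -> ContainerInfo:
        info = ContainerInfo(id=uuid.uuid4().hex[:12], name=name, image=image)
        self._containers[info.id] = info
        return info

    async def start_container(self, container_id: str) -> None:
        self._containers[container_id].status = "running"

    async def stop_container(self, container_id: str, timeout: int = 10) -> None:
        self._containers[container_id].status = "stopped"

    async def remove_container(self, container_id: str, force: bool = False) -> None:
        self._containers.pop(container_id, None)

    async def get_container_info(self, container_id: str) -> Optional[ContainerInfo]:
        return self._containers.get(container_id)

    async def list_containers(self, all: bool = False) -> List[ContainerInfo]:
        items = list(self._containers.values())
        return items if all else [c for c in items if c.status == "running"]

    async def ping(self) -> bool:
        return True

async def demo() -> str:
    svc = MockDockerService()
    info = await svc.create_container("session-abc", "opencode:latest")
    await svc.start_container(info.id)
    running = await svc.list_containers()
    return running[0].status

print(asyncio.run(demo()))  # running
```

Because `SessionManager` receives the service via its constructor, the mock drops in without touching business logic.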
### Architecture Benefits

- 🧪 **Testability**: MockDockerService enables testing without a Docker daemon
- 🔧 **Maintainability**: Clean separation of concerns
- 🔄 **Flexibility**: Easy to swap Docker client implementations
- 📦 **Dependency Injection**: SessionManager receives DockerService via constructor
- ⚡ **Performance**: Both async and sync Docker operations supported

### Service Interface

```python
class DockerService:
    async def create_container(self, name: str, image: str, **kwargs) -> ContainerInfo
    async def start_container(self, container_id: str) -> None
    async def stop_container(self, container_id: str, timeout: int = 10) -> None
    async def remove_container(self, container_id: str, force: bool = False) -> None
    async def get_container_info(self, container_id: str) -> Optional[ContainerInfo]
    async def list_containers(self, all: bool = False) -> List[ContainerInfo]
    async def ping(self) -> bool
```

### Testing

Run the comprehensive test suite:

```bash
# Test Docker service abstraction
./docker/scripts/test-docker-service.py

# Results: 7/7 tests passed ✅
# - Service Interface ✅
# - Error Handling ✅
# - Async vs Sync Modes ✅
# - Container Info Operations ✅
# - Context Management ✅
# - Integration Patterns ✅
# - Performance and Scaling ✅
```

### Usage in SessionManager

```python
# Dependency injection pattern
session_manager = SessionManager(docker_service=DockerService(use_async=True))

# Or with a mock for testing
test_manager = SessionManager(docker_service=MockDockerService())
```

## File Structure

```
docker/
├── certs/                      # Generated TLS certificates (not in git)
├── scripts/
│   ├── generate-certs.sh       # Certificate generation script
│   ├── setup-docker-tls.sh    # Docker daemon TLS configuration
│   └── test-tls-connection.py # Connection testing script
├── daemon.json                 # Docker daemon TLS configuration
└── .env.example                # Environment configuration template
```

## Quick Start

### 1. Generate TLS Certificates

```bash
# Generate certificates for development
DOCKER_ENV=development ./docker/scripts/generate-certs.sh

# Or for production with custom settings
DOCKER_ENV=production \
DOCKER_HOST_IP=your-server-ip \
DOCKER_HOST_NAME=your-docker-host \
./docker/scripts/generate-certs.sh
```

### 2. Configure Docker Daemon

**For local development (Docker Desktop):**

```bash
# Certificates are automatically mounted in docker-compose.yml
docker-compose up -d
```

**For production/server setup:**

```bash
# Configure the system Docker daemon with TLS
sudo ./docker/scripts/setup-docker-tls.sh
```

### 3. Configure Environment

```bash
# Copy and customize the environment file
cp docker/.env.example .env

# Edit .env with your settings
# DOCKER_HOST_IP=host.docker.internal  # for Docker Desktop
# DOCKER_HOST_IP=your-server-ip        # for production
```

### 4. Test Configuration

```bash
# Test the TLS connection
./docker/scripts/test-tls-connection.py

# Start services
docker-compose --env-file .env up -d session-manager

# Check logs
docker-compose logs session-manager
```

## Configuration Options

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `DOCKER_TLS_VERIFY` | `1` | Enable TLS verification |
| `DOCKER_CERT_PATH` | `./docker/certs` | Certificate directory path |
| `DOCKER_HOST` | `tcp://host.docker.internal:2376` | Docker daemon endpoint |
| `DOCKER_TLS_PORT` | `2376` | TLS port for the Docker API |
| `DOCKER_CA_CERT` | `./docker/certs/ca.pem` | CA certificate path |
| `DOCKER_CLIENT_CERT` | `./docker/certs/client-cert.pem` | Client certificate path |
| `DOCKER_CLIENT_KEY` | `./docker/certs/client-key.pem` | Client key path |
| `DOCKER_HOST_IP` | `host.docker.internal` | Docker host IP |

### Certificate Generation Options

| Variable | Default | Description |
|----------|---------|-------------|
| `DOCKER_ENV` | `development` | Environment name for certificates |
| `DOCKER_HOST_IP` | `127.0.0.1` | IP address for the server certificate |
| `DOCKER_HOST_NAME` | `localhost` | Hostname for the server certificate |
| `DAYS` | `3650` | Certificate validity in days |

## Production Deployment

### Certificate Management

1. **Generate certificates on a secure machine**
2. **Distribute them to servers securely** (SCP, Ansible, etc.)
3. **Set proper permissions**:
   ```bash
   chmod 444 /etc/docker/certs/*.pem      # certs readable by all
   chmod 400 /etc/docker/certs/*-key.pem  # keys readable by root only
   ```
4. **Rotate certificates regularly** (every 6-12 months)
5. **Revoke compromised certificates** and regenerate

### Docker Daemon Configuration

For production servers, use the `setup-docker-tls.sh` script or manually configure `/etc/docker/daemon.json`:

```json
{
  "tls": true,
  "tlsverify": true,
  "tlscacert": "/etc/docker/certs/ca.pem",
  "tlscert": "/etc/docker/certs/server-cert.pem",
  "tlskey": "/etc/docker/certs/server-key.pem",
  "hosts": ["tcp://0.0.0.0:2376"],
  "iptables": false,
  "bridge": "none",
  "live-restore": true,
  "userland-proxy": false,
  "no-new-privileges": true
}
```

### Security Hardening

- **Firewall**: Only allow the TLS port (2376) from trusted networks
- **TLS 1.3**: Ensure modern TLS version support
- **Certificate pinning**: Consider certificate pinning in client code
- **Monitoring**: Log and monitor Docker API access
- **Rate limiting**: Implement API rate limiting

## Troubleshooting

### Common Issues

**"Connection refused"**
- Check whether the Docker daemon is running with TLS
- Verify `DOCKER_HOST` points to the correct endpoint
- Ensure the firewall allows port 2376

**"TLS handshake failed"**
- Verify certificates exist and have correct permissions
- Check certificate validity dates
- Ensure the CA certificate is correct

**"Permission denied"**
- Check certificate file permissions (444 for certs, 400 for keys)
- Ensure the client certificate is signed by the CA

### Debug Commands

```bash
# Test the TLS connection manually
docker --tlsverify \
  --tlscacert=./docker/certs/ca.pem \
  --tlscert=./docker/certs/client-cert.pem \
  --tlskey=./docker/certs/client-key.pem \
  -H tcp://host.docker.internal:2376 \
  version

# Check certificate validity
openssl x509 -in ./docker/certs/server-cert.pem -text -noout

# Test from inside the container
docker-compose exec session-manager ./docker/scripts/test-tls-connection.py
```

## Migration from Socket Mounting

### Before (Insecure)

```yaml
volumes:
  - /var/run/docker.sock:/var/run/docker.sock
```

### After (Secure)

```yaml
volumes:
  - ./docker/certs:/etc/docker/certs:ro
environment:
  - DOCKER_TLS_VERIFY=1
  - DOCKER_HOST=tcp://host.docker.internal:2376
```

### Code Changes Required

Update the Docker client initialization:

```python
import os
import docker

# Before
self.docker_client = docker.from_env()

# After: construct the client directly against the TLS endpoint
tls_config = docker.tls.TLSConfig(
    ca_cert=os.getenv('DOCKER_CA_CERT'),
    client_cert=(os.getenv('DOCKER_CLIENT_CERT'), os.getenv('DOCKER_CLIENT_KEY')),
    verify=True
)
self.docker_client = docker.DockerClient(
    base_url=os.getenv('DOCKER_HOST'),
    tls=tls_config
)
```

## Dynamic Host IP Detection

The session-manager service now includes robust host IP detection to support proxy routing across different Docker environments.

### Supported Environments

- **Docker Desktop (Mac/Windows)**: Uses `host.docker.internal` resolution
- **Linux Docker**: Reads the gateway from `/proc/net/route`
- **Cloud environments**: Respects the `DOCKER_HOST_GATEWAY` and `GATEWAY` environment variables
- **Custom networks**: Tests connectivity to common Docker gateway IPs

### Detection Methods (in priority order)

1. **Docker Internal**: Resolves `host.docker.internal` (Docker Desktop)
2. **Environment Variables**: Checks `HOST_IP`, `DOCKER_HOST_GATEWAY`, `GATEWAY`
3. **Route Table**: Parses `/proc/net/route` for the default gateway
4. **Network Connection**: Tests connectivity to determine local routing
5. **Common Gateways**: Falls back to known Docker bridge IPs

### Configuration

Detection is automatic and cached for 5 minutes.
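The priority chain can be sketched roughly as follows. The function below is a hypothetical illustration (the name, the injectable resolver, and the fallback IPs are assumptions), not the service's actual detection code:

```python
import os
import socket
from typing import Optional

# Common Docker bridge gateway IPs used as a last resort (assumed values)
FALLBACK_GATEWAYS = ["172.17.0.1", "172.18.0.1"]

def detect_host_ip(resolve=socket.gethostbyname) -> Optional[str]:
    # 1. Docker Desktop: host.docker.internal resolves inside containers
    try:
        return resolve("host.docker.internal")
    except socket.gaierror:
        pass
    # 2. Explicit overrides via environment variables
    for var in ("HOST_IP", "DOCKER_HOST_GATEWAY", "GATEWAY"):
        if os.environ.get(var):
            return os.environ[var]
    # 3. Default gateway from the kernel route table (Linux)
    try:
        with open("/proc/net/route") as f:
            for line in f.readlines()[1:]:
                fields = line.split()
                if len(fields) > 2 and fields[1] == "00000000":  # default route
                    gw = int(fields[2], 16)  # gateway is stored little-endian
                    return socket.inet_ntoa(gw.to_bytes(4, "little"))
    except OSError:
        pass
    # 4./5. Connectivity probing is omitted here; fall back to a known gateway
    return FALLBACK_GATEWAYS[0]
```

Each step only runs if the previous one fails, which is why an explicit `HOST_IP` wins in any environment where `host.docker.internal` does not resolve.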
Override with:

```bash
# Force a specific host IP
export HOST_IP=192.168.1.100

# Or in docker-compose.yml
environment:
  - HOST_IP=your-host-ip
```

### Testing

```bash
# Test host IP detection
./docker/scripts/test-host-ip-detection.py

# Run the integration test
./docker/scripts/test-integration.sh
```

### Troubleshooting

**"Could not detect Docker host IP"**
- Check the network configuration: `docker network inspect bridge`
- Verify environment variables
- Test connectivity: `ping host.docker.internal`
- Set an explicit `HOST_IP` if needed

**Proxy routing fails**
- Verify the detected IP is accessible from containers
- Check firewall rules blocking container-to-host traffic
- Ensure the Docker network allows communication

## Structured Logging

Comprehensive logging infrastructure with structured JSON logs, request tracking, and production-ready log management for debugging and monitoring.

### Log Features

- **Structured JSON Logs**: Machine-readable logs for production analysis
- **Request ID Tracking**: Trace requests across distributed operations
- **Human-Readable Development**: Clear logs for local development
- **Performance Metrics**: Built-in request timing and performance tracking
- **Security Event Logging**: Audit trail for security-related events
- **Log Rotation**: Automatic log rotation with size limits

### Configuration

```bash
# Log level and format
export LOG_LEVEL=INFO        # DEBUG, INFO, WARNING, ERROR, CRITICAL
export LOG_FORMAT=auto       # json, human, auto (detects environment)

# File logging
export LOG_FILE=/var/log/lovdata-chat.log
export LOG_MAX_SIZE_MB=10    # Max log file size
export LOG_BACKUP_COUNT=5    # Number of backup files

# Output control
export LOG_CONSOLE=true      # Enable console logging
export LOG_FILE_ENABLED=true # Enable file logging
```

### Testing Structured Logging

```bash
# Test logging functionality and formatters
./docker/scripts/test-structured-logging.py
```

### Log Analysis

**JSON Format (Production):**

```json
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "INFO",
  "logger": "session_manager.main",
  "message": "Session created successfully",
  "request_id": "req-abc123",
  "session_id": "ses-xyz789",
  "operation": "create_session",
  "duration_ms": 245.67
}
```

**Human-Readable Format (Development):**

```
2024-01-15 10:30:45 [INFO ] session_manager.main:create_session:145 [req-abc123] - Session created successfully
```

### Request Tracing

All logs include request IDs for tracing operations across the system:

```python
with RequestContext():
    log_session_operation(session_id, "created")
    # All subsequent logs in this context include request_id
```

## Database Persistence

Session data is now stored in PostgreSQL for reliability, multi-instance deployment support, and elimination of JSON file corruption vulnerabilities.

### Database Configuration

```bash
# PostgreSQL connection settings
export DB_HOST=localhost     # Database host
export DB_PORT=5432          # Database port
export DB_USER=lovdata       # Database user
export DB_PASSWORD=password  # Database password
export DB_NAME=lovdata_chat  # Database name

# Connection pool settings
export DB_MIN_CONNECTIONS=5            # Minimum pool connections
export DB_MAX_CONNECTIONS=20           # Maximum pool connections
export DB_MAX_QUERIES=50000            # Max queries per connection
export DB_MAX_INACTIVE_LIFETIME=300.0  # Connection timeout
```

### Storage Backend Selection

```bash
# Enable database storage (recommended for production)
export USE_DATABASE_STORAGE=true

# Or use JSON file storage (legacy/development)
export USE_DATABASE_STORAGE=false
```

### Database Schema

**Sessions Table:**

- `session_id` (VARCHAR, Primary Key): Unique session identifier
- `container_name` (VARCHAR): Docker container name
- `container_id` (VARCHAR): Docker container ID
- `host_dir` (VARCHAR): Host directory path
- `port` (INTEGER): Container port
- `auth_token` (VARCHAR): Authentication token
- `created_at` (TIMESTAMP): Creation timestamp
- `last_accessed` (TIMESTAMP): Last access timestamp
- `status` (VARCHAR): Session status (creating, running, stopped, error)
- `metadata` (JSONB): Additional session metadata

**Indexes:**

- Primary key on `session_id`
- Status index for filtering active sessions
- Last-accessed index for cleanup operations
- Created-at index for session listing

### Testing Database Persistence

```bash
# Test database connection and operations
./docker/scripts/test-database-persistence.py
```

### Health Monitoring

The `/health` endpoint now includes database status:

```json
{
  "storage_backend": "database",
  "database": {
    "status": "healthy",
    "total_sessions": 15,
    "active_sessions": 8,
    "database_size": "25 MB"
  }
}
```

### Migration Strategy

**From JSON File to Database:**

1. **Back up existing sessions** (if any)
2. **Set environment variables** for the database connection
3. **Enable database storage**: `USE_DATABASE_STORAGE=true`
4. **Restart the service** - automatic schema creation and migration
5. **Verify data migration** in the health endpoint
6. **Monitor performance** and adjust connection pool settings

**Backward Compatibility:**

- JSON file storage remains available for development
- Automatic fallback if the database is unavailable
- Zero-downtime migration is possible

## Container Health Monitoring

Active monitoring of Docker containers with automatic failure detection and recovery mechanisms to prevent stuck sessions and improve system reliability.
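As a rough sketch of the failure-threshold and restart-limit behaviour this section describes (the class and method names are hypothetical, and defaults mirror the configuration table, not the actual implementation):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class HealthTracker:
    failure_threshold: int = 3     # CONTAINER_FAILURE_THRESHOLD
    max_restart_attempts: int = 3  # CONTAINER_MAX_RESTART_ATTEMPTS
    failures: Dict[str, int] = field(default_factory=dict)
    restarts: Dict[str, int] = field(default_factory=dict)

    def record_check(self, session_id: str, healthy: bool,
                     restart: Callable[[str], None]) -> str:
        if healthy:
            self.failures[session_id] = 0
            return "HEALTHY"
        self.failures[session_id] = self.failures.get(session_id, 0) + 1
        if self.failures[session_id] < self.failure_threshold:
            return "UNHEALTHY"
        if self.restarts.get(session_id, 0) >= self.max_restart_attempts:
            return "FAILED"  # give up: no infinite restart loops
        self.restarts[session_id] = self.restarts.get(session_id, 0) + 1
        self.failures[session_id] = 0  # reset the window after a restart
        restart(session_id)
        return "RESTARTING"
```

The key properties are that a single failed check only marks the container unhealthy, a run of consecutive failures triggers a restart, and restarts are capped so a permanently broken container ends up `FAILED` rather than looping.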
### Health Monitoring Features

- **Periodic Health Checks**: Continuous monitoring of running containers every 30 seconds
- **Automatic Failure Detection**: Identifies unhealthy or failed containers
- **Smart Restart Logic**: Automatic container restart with configurable limits
- **Health History Tracking**: Maintains health check history for analysis
- **Status Integration**: Updates session status based on container health

### Configuration

```bash
# Health check intervals and timeouts
CONTAINER_HEALTH_CHECK_INTERVAL=30  # Check every 30 seconds
CONTAINER_HEALTH_TIMEOUT=10.0       # Health check timeout
CONTAINER_MAX_RESTART_ATTEMPTS=3    # Max restart attempts
CONTAINER_RESTART_DELAY=5           # Delay between restarts
CONTAINER_FAILURE_THRESHOLD=3       # Failures before restart
```

### Health Status Types

- **HEALTHY**: Container running normally, with optional health checks passing
- **UNHEALTHY**: Container running but health checks failing
- **RESTARTING**: Container being restarted due to failures
- **FAILED**: Container stopped or permanently failed
- **UNKNOWN**: Unable to determine container status

### Testing Health Monitoring

```bash
# Test health monitoring functionality
./docker/scripts/test-container-health.py
```

### Health Endpoints

**System Health:**

```bash
GET /health  # Includes container health statistics
```

**Detailed Container Health:**

```bash
GET /health/container               # Overall health stats
GET /health/container/{session_id}  # Specific session health
```

**Health Response:**

```json
{
  "container_health": {
    "monitoring_active": true,
    "check_interval": 30,
    "total_sessions_monitored": 5,
    "sessions_with_failures": 1,
    "session_ses123": {
      "total_checks": 10,
      "healthy_checks": 8,
      "failed_checks": 2,
      "average_response_time": 45.2
    }
  }
}
```

### Recovery Mechanisms

1. **Health Check Failure**: Container marked as unhealthy
2. **Consecutive Failures**: After the threshold, an automatic restart is initiated
3. **Restart Attempts**: Limited to prevent infinite restart loops
4. **Session Status Update**: Session status reflects container health
5. **Logging & Alerts**: Comprehensive logging of health events

### Integration Benefits

- **Proactive Monitoring**: Detects issues before users are affected
- **Automatic Recovery**: Reduces manual intervention requirements
- **Improved Reliability**: Prevents stuck sessions and system instability
- **Operational Visibility**: Detailed health metrics and history
- **Scalable Architecture**: Works with multiple concurrent sessions

## Session Authentication

OpenCode servers now require token-based authentication for secure individual user sessions, preventing unauthorized access and ensuring session isolation.

### Authentication Features

- **Token Generation**: Unique, cryptographically secure tokens per session
- **Automatic Expiry**: Configurable token lifetime (default 24 hours)
- **Token Rotation**: Ability to rotate tokens for enhanced security
- **Session Isolation**: Each user session has its own authentication credentials
- **Proxy Integration**: Authentication headers automatically included in proxy requests

### Configuration

```bash
# Token configuration
export SESSION_TOKEN_LENGTH=32            # Token length in characters
export SESSION_TOKEN_EXPIRY_HOURS=24      # Token validity period
export SESSION_TOKEN_SECRET=auto          # Token signing secret (auto-generated)
export TOKEN_CLEANUP_INTERVAL_MINUTES=60  # Expired token cleanup interval
```

### Testing Authentication

```bash
# Test authentication functionality
./docker/scripts/test-session-auth.py

# End-to-end authentication testing
./docker/scripts/test-auth-end-to-end.sh
```

### API Endpoints

**Authentication Management:**

- `GET /sessions/{id}/auth` - Get session authentication info
- `POST /sessions/{id}/auth/rotate` - Rotate a session token
- `GET /auth/sessions` - List authenticated sessions

**Health Monitoring:**

```json
{
  "authenticated_sessions": 3,
  "status": "healthy"
}
```

### Security Benefits

- **Session Isolation**: Users cannot access each other's OpenCode servers
- **Token Expiry**: Automatic cleanup prevents token accumulation
- **Secure Generation**: Cryptographically secure random tokens
- **Proxy Security**: Authentication headers prevent unauthorized proxy access

## HTTP Connection Pooling

Proxy requests now use a global HTTP connection pool instead of creating a new httpx client for each request, eliminating connection overhead and dramatically improving proxy performance.

### Connection Pool Benefits

- **Eliminated Connection Overhead**: No more client creation/teardown per request
- **Connection Reuse**: Persistent keep-alive connections reduce latency
- **Improved Throughput**: Handle significantly more concurrent proxy requests
- **Reduced Resource Usage**: Lower memory and CPU overhead for HTTP operations
- **Better Scalability**: Support higher request rates with the same system resources

### Pool Configuration

The connection pool is automatically configured with optimized settings:

```python
# Connection pool settings
max_keepalive_connections=20  # Keep connections alive
max_connections=100           # Max total connections
keepalive_expiry=300.0        # 5-minute connection lifetime
connect_timeout=10.0          # Connection establishment timeout
read_timeout=30.0             # Read operation timeout
```

### Performance Testing

```bash
# Test HTTP connection pool functionality
./docker/scripts/test-http-connection-pool.py

# Load test proxy performance improvements
./docker/scripts/test-http-pool-load.sh
```

### Health Monitoring

The `/health` endpoint now includes HTTP connection pool status:

```json
{
  "http_connection_pool": {
    "status": "healthy",
    "config": {
      "max_keepalive_connections": 20,
      "max_connections": 100,
      "keepalive_expiry": 300.0
    }
  }
}
```

## Async Docker Operations

Docker operations now run asynchronously using aiodocker to eliminate blocking calls in FastAPI's async event loop, significantly improving concurrency and preventing thread pool exhaustion.
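To see why non-blocking calls help, the toy example below overlaps simulated Docker round-trips with `asyncio.gather`. The `fake_create_container` coroutine is a stand-in for a real aiodocker call so the example runs without a daemon or extra dependencies:

```python
import asyncio
import time

# Stand-in for an async Docker call; the sleep simulates daemon round-trip
# latency so the example runs without a Docker daemon or aiodocker installed.
async def fake_create_container(name: str) -> str:
    await asyncio.sleep(0.1)
    return name

async def create_many(n: int) -> list:
    # The event loop overlaps the waits, so n creations take ~0.1s total
    # instead of ~0.1s * n when performed sequentially with blocking calls.
    return await asyncio.gather(*(fake_create_container(f"ses-{i}") for i in range(n)))

start = time.monotonic()
names = asyncio.run(create_many(10))
elapsed = time.monotonic() - start
print(f"created {len(names)} containers in {elapsed:.2f}s")
```

With blocking docker-py calls, the same ten operations would serialize (or consume ten worker threads); with async I/O they share one event loop.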
### Async Benefits

- **Non-Blocking Operations**: Container creation, management, and cleanup no longer block the event loop
- **Improved Concurrency**: Handle multiple concurrent user sessions without performance degradation
- **Better Scalability**: Support higher throughput with the same system resources
- **Thread Pool Preservation**: Prevent exhaustion of async thread pools

### Configuration

```bash
# Enable async Docker operations (recommended)
export USE_ASYNC_DOCKER=true

# Or disable for sync mode (legacy)
export USE_ASYNC_DOCKER=false
```

### Testing Async Operations

```bash
# Test async Docker functionality
./docker/scripts/test-async-docker.py

# Load test concurrent operations
./docker/scripts/test-async-docker-load.sh
```

### Performance Impact

Async operations provide significant performance improvements:

- **Concurrent Sessions**: Handle 10+ concurrent container operations without blocking
- **Response Times**: Faster session creation under load
- **Resource Efficiency**: Better CPU utilization with non-blocking I/O
- **Scalability**: Support more users per server instance

## Resource Limits Enforcement

Container resource limits are now actively enforced to prevent resource exhaustion attacks and ensure fair resource allocation across user sessions.
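As a quick illustration of how the limit values combine (hypothetical helpers, not the service's actual parsing code): the memory limit uses Docker's size-suffix convention, and the CPU quota divided by the period gives the number of cores a container may use.

```python
# Byte multipliers for Docker-style memory suffixes
UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_memory_limit(value: str) -> int:
    """'4g' -> bytes, mirroring Docker's memory suffix convention."""
    value = value.strip().lower()
    if value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)  # bare number means bytes

def cpu_cores(quota_us: int, period_us: int) -> float:
    """quota/period gives the CPU share: 100000/100000 -> 1.0 core."""
    return quota_us / period_us

assert parse_memory_limit("4g") == 4 * 1024**3
assert cpu_cores(100000, 100000) == 1.0
assert cpu_cores(50000, 100000) == 0.5  # half a core
```

So the defaults of `4g` and `100000/100000` cap each container at 4 GiB of RAM and one full CPU core.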
### Configurable Limits

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `CONTAINER_MEMORY_LIMIT` | `4g` | Memory limit per container |
| `CONTAINER_CPU_QUOTA` | `100000` | CPU quota (microseconds per period) |
| `CONTAINER_CPU_PERIOD` | `100000` | CPU period (microseconds) |
| `MAX_CONCURRENT_SESSIONS` | `3` | Maximum concurrent user sessions |
| `MEMORY_WARNING_THRESHOLD` | `0.8` | Memory usage warning threshold (80%) |
| `CPU_WARNING_THRESHOLD` | `0.9` | CPU usage warning threshold (90%) |

### Resource Protection Features

- **Memory Limits**: Prevents containers from consuming unlimited RAM
- **CPU Quotas**: Ensures fair CPU allocation across sessions
- **Session Throttling**: Blocks new sessions when resources are constrained
- **System Monitoring**: Continuous resource usage tracking
- **Graceful Degradation**: Alerts and throttling before system failure

### Testing Resource Limits

```bash
# Test resource limit configuration and validation
./docker/scripts/test-resource-limits.py

# Load testing with enforcement verification
./docker/scripts/test-resource-limits-load.sh
```

### Health Monitoring

The `/health` endpoint now includes comprehensive resource information:

```json
{
  "resource_limits": {
    "memory_limit": "4g",
    "cpu_quota": 100000,
    "max_concurrent_sessions": 3
  },
  "system_resources": {
    "memory_percent": 0.65,
    "cpu_percent": 0.45
  },
  "resource_alerts": []
}
```

### Resource Alert Levels

- **Warning**: System resources approaching limits (80% memory, 90% CPU)
- **Critical**: System resources at dangerous levels (95%+ usage)
- **Throttling**: New sessions blocked when critical alerts are active

## Security Audit Checklist

- [ ] TLS certificates generated with strong encryption
- [ ] Certificate permissions set correctly (400/444)
- [ ] No socket mounting in docker-compose.yml
- [ ] Environment variables properly configured
- [ ] TLS connection tested successfully
- [ ] Host IP detection working correctly
- [ ] Proxy routing functional across environments
- [ ] Resource limits properly configured and enforced
- [ ] Session throttling prevents resource exhaustion
- [ ] System resource monitoring active
- [ ] Certificate rotation process documented
- [ ] Firewall rules restrict Docker API access
- [ ] Docker daemon configured with security options
- [ ] Monitoring and logging enabled for API access
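A small helper like the following (hypothetical, not part of the scripts above) can verify the 400/444 certificate-permissions item automatically:

```python
import os
import stat

def check_cert_permissions(cert_dir: str) -> list:
    """Return a list of .pem files whose mode deviates from 444 (certs)
    or 400 (private keys named *-key.pem)."""
    problems = []
    for name in sorted(os.listdir(cert_dir)):
        if not name.endswith(".pem"):
            continue
        mode = stat.S_IMODE(os.stat(os.path.join(cert_dir, name)).st_mode)
        expected = 0o400 if name.endswith("-key.pem") else 0o444
        if mode != expected:
            problems.append(f"{name}: {oct(mode)} (expected {oct(expected)})")
    return problems
```

Running it against `/etc/docker/certs` (or `./docker/certs` locally) and failing on a non-empty result makes this checklist item enforceable in CI.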