Docker TLS Security Setup

This directory contains scripts and configuration for securing Docker API access with TLS authentication, replacing the insecure socket mounting approach.

Overview

Previously, the session-manager service mounted the Docker socket (/var/run/docker.sock) directly into containers, granting full root access to the host Docker daemon. This is a critical security vulnerability.

This setup replaces socket mounting with authenticated TLS API access over the network.

Security Benefits

  • No socket mounting: Eliminates privilege escalation risk
  • Mutual TLS authentication: Both client and server authenticate
  • Encrypted communication: All API calls are encrypted
  • Certificate-based access: Granular access control
  • Network isolation: API access is network-bound, not filesystem-bound

Docker Service Abstraction

The session-manager now uses a clean DockerService abstraction layer that separates Docker operations from business logic, improving testability and maintainability and making it straightforward to swap Docker client implementations later.

Architecture Benefits

  • 🧪 Testability: MockDockerService enables testing without Docker daemon
  • 🔧 Maintainability: Clean separation of concerns
  • 🔄 Flexibility: Easy to swap Docker client implementations
  • 📦 Dependency Injection: SessionManager receives DockerService via constructor
  • ⚡ Performance: Both async and sync Docker operations supported

Service Interface

class DockerService:
    async def create_container(self, name: str, image: str, **kwargs) -> ContainerInfo
    async def start_container(self, container_id: str) -> None
    async def stop_container(self, container_id: str, timeout: int = 10) -> None
    async def remove_container(self, container_id: str, force: bool = False) -> None
    async def get_container_info(self, container_id: str) -> Optional[ContainerInfo]
    async def list_containers(self, all: bool = False) -> List[ContainerInfo]
    async def ping(self) -> bool

Testing

Run the comprehensive test suite:

# Test Docker service abstraction
./docker/scripts/test-docker-service.py

# Results: 7/7 tests passed ✅
# - Service Interface ✅
# - Error Handling ✅
# - Async vs Sync Modes ✅
# - Container Info Operations ✅
# - Context Management ✅
# - Integration Patterns ✅
# - Performance and Scaling ✅

Usage in SessionManager

# Dependency injection pattern
session_manager = SessionManager(docker_service=DockerService(use_async=True))

# Or with mock for testing
test_manager = SessionManager(docker_service=MockDockerService())

Files Structure

docker/
├── certs/                    # Generated TLS certificates (not in git)
├── scripts/
│   ├── generate-certs.sh     # Certificate generation script
│   ├── setup-docker-tls.sh   # Docker daemon TLS configuration
│   └── test-tls-connection.py # Connection testing script
├── daemon.json               # Docker daemon TLS configuration
└── .env.example              # Environment configuration template

Quick Start

1. Generate TLS Certificates

# Generate certificates for development
DOCKER_ENV=development ./docker/scripts/generate-certs.sh

# Or for production with custom settings
DOCKER_ENV=production \
DOCKER_HOST_IP=your-server-ip \
DOCKER_HOST_NAME=your-docker-host \
./docker/scripts/generate-certs.sh

2. Configure Docker Daemon

For local development (Docker Desktop):

# Certificates are automatically mounted in docker-compose.yml
docker-compose up -d

For production/server setup:

# Configure system Docker daemon with TLS
sudo ./docker/scripts/setup-docker-tls.sh

3. Configure Environment

# Copy and customize environment file
cp docker/.env.example .env

# Edit .env with your settings
# DOCKER_HOST_IP=host.docker.internal  # for Docker Desktop
# DOCKER_HOST_IP=your-server-ip        # for production

4. Test Configuration

# Test TLS connection
./docker/scripts/test-tls-connection.py

# Start services
docker-compose --env-file .env up -d session-manager

# Check logs
docker-compose logs session-manager

Configuration Options

Environment Variables

Variable            Default                            Description
DOCKER_TLS_VERIFY   1                                  Enable TLS verification
DOCKER_CERT_PATH    ./docker/certs                     Certificate directory path
DOCKER_HOST         tcp://host.docker.internal:2376    Docker daemon endpoint
DOCKER_TLS_PORT     2376                               TLS port for Docker API
DOCKER_CA_CERT      ./docker/certs/ca.pem              CA certificate path
DOCKER_CLIENT_CERT  ./docker/certs/client-cert.pem     Client certificate path
DOCKER_CLIENT_KEY   ./docker/certs/client-key.pem      Client key path
DOCKER_HOST_IP      host.docker.internal               Docker host IP

Certificate Generation Options

Variable          Default       Description
DOCKER_ENV        development   Environment name for certificates
DOCKER_HOST_IP    127.0.0.1     IP address for server certificate
DOCKER_HOST_NAME  localhost     Hostname for server certificate
DAYS              3650          Certificate validity in days

Production Deployment

Certificate Management

  1. Generate certificates on a secure machine
  2. Distribute to servers securely (SCP, Ansible, etc.)
  3. Set proper permissions:
    chmod 444 /etc/docker/certs/*.pem  # certs readable by all
    chmod 400 /etc/docker/certs/*-key.pem  # keys readable by root only
    
  4. Rotate certificates regularly (every 6-12 months)
  5. Revoke compromised certificates and regenerate

Docker Daemon Configuration

For production servers, use the setup-docker-tls.sh script or manually configure /etc/docker/daemon.json:

{
  "tls": true,
  "tlsverify": true,
  "tlscacert": "/etc/docker/certs/ca.pem",
  "tlscert": "/etc/docker/certs/server-cert.pem",
  "tlskey": "/etc/docker/certs/server-key.pem",
  "hosts": ["tcp://0.0.0.0:2376"],
  "iptables": false,
  "bridge": "none",
  "live-restore": true,
  "userland-proxy": false,
  "no-new-privileges": true
}

Security Hardening

  • Firewall: Only allow TLS port (2376) from trusted networks
  • TLS 1.3: Ensure modern TLS version support
  • Certificate pinning: Consider certificate pinning in client code
  • Monitoring: Log and monitor Docker API access
  • Rate limiting: Implement API rate limiting

Troubleshooting

Common Issues

"Connection refused"

  • Check if Docker daemon is running with TLS
  • Verify DOCKER_HOST points to correct endpoint
  • Ensure firewall allows port 2376

"TLS handshake failed"

  • Verify certificates exist and have correct permissions
  • Check certificate validity dates
  • Ensure CA certificate is correct

"Permission denied"

  • Check certificate file permissions (444 for certs, 400 for keys)
  • Ensure client certificate is signed by the CA

Debug Commands

# Test TLS connection manually
docker --tlsverify \
  --tlscacert=./docker/certs/ca.pem \
  --tlscert=./docker/certs/client-cert.pem \
  --tlskey=./docker/certs/client-key.pem \
  -H tcp://host.docker.internal:2376 \
  version

# Check certificate validity
openssl x509 -in ./docker/certs/server-cert.pem -text -noout

# Test from container
docker-compose exec session-manager ./docker/scripts/test-tls-connection.py

Migration from Socket Mounting

Before (Insecure)

volumes:
  - /var/run/docker.sock:/var/run/docker.sock

After (Secure)

volumes:
  - ./docker/certs:/etc/docker/certs:ro
environment:
  - DOCKER_TLS_VERIFY=1
  - DOCKER_HOST=tcp://host.docker.internal:2376

Code Changes Required

Update Docker client initialization:

# Before
self.docker_client = docker.from_env()

# After
tls_config = docker.tls.TLSConfig(
    ca_cert=os.getenv('DOCKER_CA_CERT'),
    client_cert=(os.getenv('DOCKER_CLIENT_CERT'), os.getenv('DOCKER_CLIENT_KEY')),
    verify=True
)
self.docker_client = docker.DockerClient(
    base_url=os.getenv('DOCKER_HOST'),
    tls=tls_config
)

Dynamic Host IP Detection

The session-manager service now includes robust host IP detection to support proxy routing across different Docker environments:

Supported Environments

  • Docker Desktop (Mac/Windows): Uses host.docker.internal resolution
  • Linux Docker: Reads gateway from /proc/net/route
  • Cloud environments: Respects DOCKER_HOST_GATEWAY and GATEWAY environment variables
  • Custom networks: Tests connectivity to common Docker gateway IPs

Detection Methods (in priority order)

  1. Docker Internal: Resolves host.docker.internal (Docker Desktop)
  2. Environment Variables: Checks HOST_IP, DOCKER_HOST_GATEWAY, GATEWAY
  3. Route Table: Parses /proc/net/route for default gateway
  4. Network Connection: Tests connectivity to determine local routing
  5. Common Gateways: Falls back to known Docker bridge IPs
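The route-table step (method 3) can be sketched as a small parser for /proc/net/route, where the gateway of the default route (destination 00000000) is stored as little-endian hex — function names here are illustrative, not the service's actual API:

```python
import socket
import struct
from typing import Optional

def parse_default_gateway(route_table: str) -> Optional[str]:
    """Extract the default gateway from /proc/net/route contents."""
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "00000000":
            # Decode the little-endian hex gateway into dotted-quad form
            return socket.inet_ntoa(struct.pack("<L", int(fields[2], 16)))
    return None

def detect_gateway() -> Optional[str]:
    try:
        with open("/proc/net/route") as fh:
            return parse_default_gateway(fh.read())
    except OSError:
        return None  # not Linux, or /proc unavailable
```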

Configuration

The detection is automatic and cached for 5 minutes. Override with:

# Force specific host IP
export HOST_IP=192.168.1.100

# Or in docker-compose.yml
environment:
  - HOST_IP=your-host-ip

Testing

# Test host IP detection
./docker/scripts/test-host-ip-detection.py

# Run integration test
./docker/scripts/test-integration.sh

Troubleshooting

"Could not detect Docker host IP"

  • Check network configuration: docker network inspect bridge
  • Verify environment variables
  • Test connectivity: ping host.docker.internal
  • Set explicit HOST_IP if needed

Proxy routing fails

  • Verify detected IP is accessible from containers
  • Check firewall rules blocking container-to-host traffic
  • Ensure Docker network allows communication

Structured Logging

The service provides a comprehensive logging infrastructure with structured JSON logs, request ID tracking, and production-ready log management for debugging and monitoring.

Log Features

  • Structured JSON Logs: Machine-readable logs for production analysis
  • Request ID Tracking: Trace requests across distributed operations
  • Human-Readable Development: Clear logs for local development
  • Performance Metrics: Built-in request timing and performance tracking
  • Security Event Logging: Audit trail for security-related events
  • Log Rotation: Automatic log rotation with size limits

Configuration

# Log level and format
export LOG_LEVEL=INFO                    # DEBUG, INFO, WARNING, ERROR, CRITICAL
export LOG_FORMAT=auto                   # json, human, auto (detects environment)

# File logging
export LOG_FILE=/var/log/lovdata-chat.log
export LOG_MAX_SIZE_MB=10                # Max log file size
export LOG_BACKUP_COUNT=5                # Number of backup files

# Output control
export LOG_CONSOLE=true                  # Enable console logging
export LOG_FILE_ENABLED=true             # Enable file logging

Testing Structured Logging

# Test logging functionality and formatters
./docker/scripts/test-structured-logging.py

Log Analysis

JSON Format (Production):

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "INFO",
  "logger": "session_manager.main",
  "message": "Session created successfully",
  "request_id": "req-abc123",
  "session_id": "ses-xyz789",
  "operation": "create_session",
  "duration_ms": 245.67
}

Human-Readable Format (Development):

2024-01-15 10:30:45 [INFO   ] session_manager.main:create_session:145 [req-abc123] - Session created successfully

Request Tracing

All logs include request IDs for tracing operations across the system:

with RequestContext():
    log_session_operation(session_id, "created")
    # All subsequent logs in this context include request_id
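One common way to implement such a context is with the stdlib contextvars module — a sketch under that assumption; the project's actual RequestContext may differ:

```python
import contextvars
import uuid
from typing import Optional

# Holds the current request ID for whichever task/coroutine is running
_request_id: contextvars.ContextVar = contextvars.ContextVar("request_id", default=None)

class RequestContext:
    """Assign a request ID on entry, restore the previous one on exit."""

    def __init__(self, request_id: Optional[str] = None):
        self.request_id = request_id or f"req-{uuid.uuid4().hex[:6]}"

    def __enter__(self):
        self._token = _request_id.set(self.request_id)
        return self

    def __exit__(self, *exc):
        _request_id.reset(self._token)

def current_request_id() -> Optional[str]:
    """Read the active request ID; log helpers would attach this to records."""
    return _request_id.get()
```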

Database Persistence

Session data is now stored in PostgreSQL for reliability, multi-instance deployment support, and elimination of JSON file corruption vulnerabilities.

Database Configuration

# PostgreSQL connection settings
export DB_HOST=localhost                    # Database host
export DB_PORT=5432                         # Database port
export DB_USER=lovdata                      # Database user
export DB_PASSWORD=password                 # Database password
export DB_NAME=lovdata_chat                 # Database name

# Connection pool settings
export DB_MIN_CONNECTIONS=5                 # Minimum pool connections
export DB_MAX_CONNECTIONS=20                # Maximum pool connections
export DB_MAX_QUERIES=50000                 # Max queries per connection
export DB_MAX_INACTIVE_LIFETIME=300.0       # Connection timeout

Storage Backend Selection

# Enable database storage (recommended for production)
export USE_DATABASE_STORAGE=true

# Or use JSON file storage (legacy/development)
export USE_DATABASE_STORAGE=false

Database Schema

Sessions Table:

  • session_id (VARCHAR, Primary Key): Unique session identifier
  • container_name (VARCHAR): Docker container name
  • container_id (VARCHAR): Docker container ID
  • host_dir (VARCHAR): Host directory path
  • port (INTEGER): Container port
  • auth_token (VARCHAR): Authentication token
  • created_at (TIMESTAMP): Creation timestamp
  • last_accessed (TIMESTAMP): Last access timestamp
  • status (VARCHAR): Session status (creating, running, stopped, error)
  • metadata (JSONB): Additional session metadata

Indexes:

  • Primary key on session_id
  • Status index for filtering active sessions
  • Last accessed index for cleanup operations
  • Created at index for session listing
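The table and indexes above correspond roughly to the following DDL, kept here as a Python constant — a sketch; exact column types, defaults, and index names may differ from the migration the service actually runs:

```python
# DDL matching the sessions table described above (illustrative).
SESSIONS_DDL = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id     VARCHAR PRIMARY KEY,
    container_name VARCHAR NOT NULL,
    container_id   VARCHAR,
    host_dir       VARCHAR,
    port           INTEGER,
    auth_token     VARCHAR,
    created_at     TIMESTAMP NOT NULL DEFAULT now(),
    last_accessed  TIMESTAMP NOT NULL DEFAULT now(),
    status         VARCHAR NOT NULL DEFAULT 'creating',
    metadata       JSONB DEFAULT '{}'::jsonb
);
CREATE INDEX IF NOT EXISTS idx_sessions_status ON sessions (status);
CREATE INDEX IF NOT EXISTS idx_sessions_last_accessed ON sessions (last_accessed);
CREATE INDEX IF NOT EXISTS idx_sessions_created_at ON sessions (created_at);
"""
```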

Testing Database Persistence

# Test database connection and operations
./docker/scripts/test-database-persistence.py

Health Monitoring

The /health endpoint now includes database status:

{
  "storage_backend": "database",
  "database": {
    "status": "healthy",
    "total_sessions": 15,
    "active_sessions": 8,
    "database_size": "25 MB"
  }
}

Migration Strategy

From JSON File to Database:

  1. Backup existing sessions (if any)
  2. Set environment variables for database connection
  3. Enable database storage: USE_DATABASE_STORAGE=true
  4. Restart service - automatic schema creation and migration
  5. Verify data migration in health endpoint
  6. Monitor performance and adjust connection pool settings

Backward Compatibility:

  • JSON file storage remains available for development
  • Automatic fallback if database is unavailable
  • Zero-downtime migration possible

Container Health Monitoring

The session-manager actively monitors Docker containers, with automatic failure detection and recovery mechanisms that prevent stuck sessions and improve system reliability.

Health Monitoring Features

  • Periodic Health Checks: Continuous monitoring of running containers every 30 seconds
  • Automatic Failure Detection: Identifies unhealthy or failed containers
  • Smart Restart Logic: Automatic container restart with configurable limits
  • Health History Tracking: Maintains health check history for analysis
  • Status Integration: Updates session status based on container health

Configuration

# Health check intervals and timeouts
CONTAINER_HEALTH_CHECK_INTERVAL=30          # Check every 30 seconds
CONTAINER_HEALTH_TIMEOUT=10.0               # Health check timeout
CONTAINER_MAX_RESTART_ATTEMPTS=3            # Max restart attempts
CONTAINER_RESTART_DELAY=5                   # Delay between restarts
CONTAINER_FAILURE_THRESHOLD=3               # Failures before restart

Health Status Types

  • HEALTHY: Container running normally with optional health checks passing
  • UNHEALTHY: Container running but health checks failing
  • RESTARTING: Container being restarted due to failures
  • FAILED: Container stopped or permanently failed
  • UNKNOWN: Unable to determine container status

Testing Health Monitoring

# Test health monitoring functionality
./docker/scripts/test-container-health.py

Health Endpoints

System Health:

GET /health  # Includes container health statistics

Detailed Container Health:

GET /health/container                    # Overall health stats
GET /health/container/{session_id}       # Specific session health

Health Response:

{
  "container_health": {
    "monitoring_active": true,
    "check_interval": 30,
    "total_sessions_monitored": 5,
    "sessions_with_failures": 1,
    "session_ses123": {
      "total_checks": 10,
      "healthy_checks": 8,
      "failed_checks": 2,
      "average_response_time": 45.2
    }
  }
}

Recovery Mechanisms

  1. Health Check Failure: Container marked as unhealthy
  2. Consecutive Failures: After threshold, automatic restart initiated
  3. Restart Attempts: Limited to prevent infinite restart loops
  4. Session Status Update: Session status reflects container health
  5. Logging & Alerts: Comprehensive logging of health events
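The recovery flow above can be sketched as a failure counter with a restart cap — class and status names here are illustrative; the real monitor also performs the actual Docker restart and logging:

```python
class RestartPolicy:
    """Track consecutive health-check failures and decide whether to restart."""

    def __init__(self, failure_threshold: int = 3, max_restart_attempts: int = 3):
        self.failure_threshold = failure_threshold
        self.max_restart_attempts = max_restart_attempts
        self.consecutive_failures = 0
        self.restart_attempts = 0

    def record_check(self, healthy: bool) -> str:
        if healthy:
            self.consecutive_failures = 0
            return "HEALTHY"
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return "UNHEALTHY"
        if self.restart_attempts >= self.max_restart_attempts:
            return "FAILED"  # give up: prevents infinite restart loops
        self.restart_attempts += 1
        self.consecutive_failures = 0  # a restart resets the failure window
        return "RESTARTING"
```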

Integration Benefits

  • Proactive Monitoring: Detects issues before users are affected
  • Automatic Recovery: Reduces manual intervention requirements
  • Improved Reliability: Prevents stuck sessions and system instability
  • Operational Visibility: Detailed health metrics and history
  • Scalable Architecture: Works with multiple concurrent sessions

Session Authentication

OpenCode servers now require token-based authentication for secure individual user sessions, preventing unauthorized access and ensuring session isolation.

Authentication Features

  • Token Generation: Unique cryptographically secure tokens per session
  • Automatic Expiry: Configurable token lifetime (default 24 hours)
  • Token Rotation: Ability to rotate tokens for enhanced security
  • Session Isolation: Each user session has its own authentication credentials
  • Proxy Integration: Authentication headers automatically included in proxy requests

Configuration

# Token configuration
export SESSION_TOKEN_LENGTH=32          # Token length in characters
export SESSION_TOKEN_EXPIRY_HOURS=24    # Token validity period
export SESSION_TOKEN_SECRET=auto        # Token signing secret (auto-generated)
export TOKEN_CLEANUP_INTERVAL_MINUTES=60 # Expired token cleanup interval
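With these settings, token generation might look like the following sketch using the stdlib secrets module — function names are illustrative, not the service's actual API:

```python
import secrets
from datetime import datetime, timedelta, timezone

TOKEN_LENGTH = 32        # mirrors SESSION_TOKEN_LENGTH
TOKEN_EXPIRY_HOURS = 24  # mirrors SESSION_TOKEN_EXPIRY_HOURS

def generate_session_token() -> dict:
    """Create a cryptographically secure token with an expiry timestamp."""
    return {
        # token_hex(n) yields 2n hex characters, so halve the target length
        "token": secrets.token_hex(TOKEN_LENGTH // 2),
        "expires_at": datetime.now(timezone.utc) + timedelta(hours=TOKEN_EXPIRY_HOURS),
    }

def is_expired(token_info: dict) -> bool:
    return datetime.now(timezone.utc) >= token_info["expires_at"]
```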

Testing Authentication

# Test authentication functionality
./docker/scripts/test-session-auth.py

# End-to-end authentication testing
./docker/scripts/test-auth-end-to-end.sh

API Endpoints

Authentication Management:

  • GET /sessions/{id}/auth - Get session authentication info
  • POST /sessions/{id}/auth/rotate - Rotate session token
  • GET /auth/sessions - List authenticated sessions

Health Monitoring:

{
  "authenticated_sessions": 3,
  "status": "healthy"
}

Security Benefits

  • Session Isolation: Users cannot access each other's OpenCode servers
  • Token Expiry: Automatic cleanup prevents token accumulation
  • Secure Generation: Cryptographically secure random tokens
  • Proxy Security: Authentication headers prevent unauthorized proxy access

HTTP Connection Pooling

Proxy requests now use a global HTTP connection pool instead of creating new httpx clients for each request, eliminating connection overhead and dramatically improving proxy performance.

Connection Pool Benefits

  • Eliminated Connection Overhead: No more client creation/teardown per request
  • Connection Reuse: Persistent keep-alive connections reduce latency
  • Improved Throughput: Handle significantly more concurrent proxy requests
  • Reduced Resource Usage: Lower memory and CPU overhead for HTTP operations
  • Better Scalability: Support higher request rates with the same system resources

Pool Configuration

The connection pool is automatically configured with optimized settings:

# Connection pool settings
max_keepalive_connections=20    # Keep connections alive
max_connections=100            # Max total connections
keepalive_expiry=300.0         # 5-minute connection lifetime
connect_timeout=10.0           # Connection establishment timeout
read_timeout=30.0              # Read operation timeout

Performance Testing

# Test HTTP connection pool functionality
./docker/scripts/test-http-connection-pool.py

# Load test proxy performance improvements
./docker/scripts/test-http-pool-load.sh

Health Monitoring

The /health endpoint now includes HTTP connection pool status:

{
  "http_connection_pool": {
    "status": "healthy",
    "config": {
      "max_keepalive_connections": 20,
      "max_connections": 100,
      "keepalive_expiry": 300.0
    }
  }
}

Async Docker Operations

Docker operations now run asynchronously using aiodocker to eliminate blocking calls in FastAPI's async event loop, significantly improving concurrency and preventing thread pool exhaustion.

Async Benefits

  • Non-Blocking Operations: Container creation, management, and cleanup no longer block the event loop
  • Improved Concurrency: Handle multiple concurrent user sessions without performance degradation
  • Better Scalability: Support higher throughput with the same system resources
  • Thread Pool Preservation: Prevent exhaustion of async thread pools

Configuration

# Enable async Docker operations (recommended)
export USE_ASYNC_DOCKER=true

# Or disable for sync mode (legacy)
export USE_ASYNC_DOCKER=false
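The toggle can be wired with a small factory that reads USE_ASYNC_DOCKER — an illustrative sketch; the service class is injected here so the example stays self-contained:

```python
import os

def _env_flag(name: str, default: str = "true") -> bool:
    """Interpret common truthy strings from the environment."""
    return os.getenv(name, default).strip().lower() in ("1", "true", "yes")

def make_docker_service(service_cls):
    """Instantiate the given service class with the async/sync switch resolved.

    `service_cls` would be the DockerService shown earlier, whose constructor
    accepts a `use_async` keyword.
    """
    return service_cls(use_async=_env_flag("USE_ASYNC_DOCKER"))
```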

Testing Async Operations

# Test async Docker functionality
./docker/scripts/test-async-docker.py

# Load test concurrent operations
./docker/scripts/test-async-docker-load.sh

Performance Impact

Async operations provide significant performance improvements:

  • Concurrent Sessions: Handle 10+ concurrent container operations without blocking
  • Response Times: Faster session creation under load
  • Resource Efficiency: Better CPU utilization with non-blocking I/O
  • Scalability: Support more users per server instance

Resource Limits Enforcement

Container resource limits are now actively enforced to prevent resource exhaustion attacks and ensure fair resource allocation across user sessions.

Configurable Limits

Environment Variable      Default  Description
CONTAINER_MEMORY_LIMIT    4g       Memory limit per container
CONTAINER_CPU_QUOTA       100000   CPU quota (microseconds per period)
CONTAINER_CPU_PERIOD      100000   CPU period (microseconds)
MAX_CONCURRENT_SESSIONS   3        Maximum concurrent user sessions
MEMORY_WARNING_THRESHOLD  0.8      Memory usage warning threshold (80%)
CPU_WARNING_THRESHOLD     0.9      CPU usage warning threshold (90%)
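The memory and CPU limits translate into docker-py container keyword arguments roughly as follows — the helper name is illustrative; `mem_limit`, `cpu_quota`, and `cpu_period` are real docker-py parameters for `containers.run()`:

```python
import os

def resource_limit_kwargs() -> dict:
    """Build the resource-limit arguments passed when creating a container."""
    return {
        "mem_limit": os.getenv("CONTAINER_MEMORY_LIMIT", "4g"),
        "cpu_quota": int(os.getenv("CONTAINER_CPU_QUOTA", "100000")),
        "cpu_period": int(os.getenv("CONTAINER_CPU_PERIOD", "100000")),
    }
```

A quota equal to the period (100000/100000) caps each container at one full CPU.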

Resource Protection Features

  • Memory Limits: Prevents containers from consuming unlimited RAM
  • CPU Quotas: Ensures fair CPU allocation across sessions
  • Session Throttling: Blocks new sessions when resources are constrained
  • System Monitoring: Continuous resource usage tracking
  • Graceful Degradation: Alerts and throttling before system failure

Testing Resource Limits

# Test resource limit configuration and validation
./docker/scripts/test-resource-limits.py

# Load testing with enforcement verification
./docker/scripts/test-resource-limits-load.sh

Health Monitoring

The /health endpoint now includes comprehensive resource information:

{
  "resource_limits": {
    "memory_limit": "4g",
    "cpu_quota": 100000,
    "max_concurrent_sessions": 3
  },
  "system_resources": {
    "memory_percent": 0.65,
    "cpu_percent": 0.45
  },
  "resource_alerts": []
}

Resource Alert Levels

  • Warning: System resources approaching limits (80% memory, 90% CPU)
  • Critical: System resources at dangerous levels (95%+ usage)
  • Throttling: New sessions blocked when critical alerts active

Security Audit Checklist

  • TLS certificates generated with strong encryption
  • Certificate permissions set correctly (400/444)
  • No socket mounting in docker-compose.yml
  • Environment variables properly configured
  • TLS connection tested successfully
  • Host IP detection working correctly
  • Proxy routing functional across environments
  • Resource limits properly configured and enforced
  • Session throttling prevents resource exhaustion
  • System resource monitoring active
  • Certificate rotation process documented
  • Firewall rules restrict Docker API access
  • Docker daemon configured with security options
  • Monitoring and logging enabled for API access