Docker TLS Security Setup
This directory contains scripts and configuration for securing Docker API access with TLS authentication, replacing the insecure socket mounting approach.
Overview
Previously, the session-manager service mounted the Docker socket (/var/run/docker.sock) directly into containers, granting full root access to the host Docker daemon. This is a critical security vulnerability.
This setup replaces socket mounting with authenticated TLS API access over the network.
Security Benefits
- ✅ No socket mounting: Eliminates privilege escalation risk
- ✅ Mutual TLS authentication: Both client and server authenticate
- ✅ Encrypted communication: All API calls are encrypted
- ✅ Certificate-based access: Granular access control
- ✅ Network isolation: API access is network-bound, not filesystem-bound
Docker Service Abstraction
The session-manager now uses a clean DockerService abstraction layer that separates Docker operations from business logic, making the code easier to test and maintain and allowing the underlying Docker client to be swapped out later.
Architecture Benefits
- 🧪 Testability: MockDockerService enables testing without Docker daemon
- 🔧 Maintainability: Clean separation of concerns
- 🔄 Flexibility: Easy to swap Docker client implementations
- 📦 Dependency Injection: SessionManager receives DockerService via constructor
- ⚡ Performance: Both async and sync Docker operations supported
Service Interface
class DockerService:
async def create_container(self, name: str, image: str, **kwargs) -> ContainerInfo
async def start_container(self, container_id: str) -> None
async def stop_container(self, container_id: str, timeout: int = 10) -> None
async def remove_container(self, container_id: str, force: bool = False) -> None
async def get_container_info(self, container_id: str) -> Optional[ContainerInfo]
async def list_containers(self, all: bool = False) -> List[ContainerInfo]
async def ping(self) -> bool
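The MockDockerService mentioned below could look roughly like this minimal in-memory sketch of the interface above (the `ContainerInfo` fields are assumptions; the real class may carry more metadata):

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ContainerInfo:
    # Assumed fields, inferred from the interface above.
    id: str
    name: str
    image: str
    status: str = "created"

class MockDockerService:
    """In-memory stand-in for DockerService, for tests without a Docker daemon."""

    def __init__(self) -> None:
        self._containers: Dict[str, ContainerInfo] = {}

    async def create_container(self, name: str, image: str, **kwargs) -> ContainerInfo:
        info = ContainerInfo(id=uuid.uuid4().hex[:12], name=name, image=image)
        self._containers[info.id] = info
        return info

    async def start_container(self, container_id: str) -> None:
        self._containers[container_id].status = "running"

    async def stop_container(self, container_id: str, timeout: int = 10) -> None:
        self._containers[container_id].status = "exited"

    async def remove_container(self, container_id: str, force: bool = False) -> None:
        self._containers.pop(container_id, None)

    async def get_container_info(self, container_id: str) -> Optional[ContainerInfo]:
        return self._containers.get(container_id)

    async def list_containers(self, all: bool = False) -> List[ContainerInfo]:
        return [c for c in self._containers.values() if all or c.status == "running"]

    async def ping(self) -> bool:
        return True
```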
Testing
Run the comprehensive test suite:
# Test Docker service abstraction
./docker/scripts/test-docker-service.py
# Results: 7/7 tests passed ✅
# - Service Interface ✅
# - Error Handling ✅
# - Async vs Sync Modes ✅
# - Container Info Operations ✅
# - Context Management ✅
# - Integration Patterns ✅
# - Performance and Scaling ✅
Usage in SessionManager
# Dependency injection pattern
session_manager = SessionManager(docker_service=DockerService(use_async=True))
# Or with mock for testing
test_manager = SessionManager(docker_service=MockDockerService())
Files Structure
docker/
├── certs/ # Generated TLS certificates (not in git)
├── scripts/
│ ├── generate-certs.sh # Certificate generation script
│ ├── setup-docker-tls.sh # Docker daemon TLS configuration
│ └── test-tls-connection.py # Connection testing script
├── daemon.json # Docker daemon TLS configuration
└── .env.example # Environment configuration template
Quick Start
1. Generate TLS Certificates
# Generate certificates for development
DOCKER_ENV=development ./docker/scripts/generate-certs.sh
# Or for production with custom settings
DOCKER_ENV=production \
DOCKER_HOST_IP=your-server-ip \
DOCKER_HOST_NAME=your-docker-host \
./docker/scripts/generate-certs.sh
2. Configure Docker Daemon
For local development (Docker Desktop):
# Certificates are automatically mounted in docker-compose.yml
docker-compose up -d
For production/server setup:
# Configure system Docker daemon with TLS
sudo ./docker/scripts/setup-docker-tls.sh
3. Configure Environment
# Copy and customize environment file
cp docker/.env.example .env
# Edit .env with your settings
# DOCKER_HOST_IP=host.docker.internal # for Docker Desktop
# DOCKER_HOST_IP=your-server-ip # for production
4. Test Configuration
# Test TLS connection
./docker/scripts/test-tls-connection.py
# Start services
docker-compose --env-file .env up -d session-manager
# Check logs
docker-compose logs session-manager
Configuration Options
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `DOCKER_TLS_VERIFY` | `1` | Enable TLS verification |
| `DOCKER_CERT_PATH` | `./docker/certs` | Certificate directory path |
| `DOCKER_HOST` | `tcp://host.docker.internal:2376` | Docker daemon endpoint |
| `DOCKER_TLS_PORT` | `2376` | TLS port for Docker API |
| `DOCKER_CA_CERT` | `./docker/certs/ca.pem` | CA certificate path |
| `DOCKER_CLIENT_CERT` | `./docker/certs/client-cert.pem` | Client certificate path |
| `DOCKER_CLIENT_KEY` | `./docker/certs/client-key.pem` | Client key path |
| `DOCKER_HOST_IP` | `host.docker.internal` | Docker host IP |
Certificate Generation Options
| Variable | Default | Description |
|---|---|---|
| `DOCKER_ENV` | `development` | Environment name for certificates |
| `DOCKER_HOST_IP` | `127.0.0.1` | IP address for server certificate |
| `DOCKER_HOST_NAME` | `localhost` | Hostname for server certificate |
| `DAYS` | `3650` | Certificate validity in days |
Production Deployment
Certificate Management
- Generate certificates on a secure machine
- Distribute to servers securely (SCP, Ansible, etc.)
- Set proper permissions:
  - `chmod 444 /etc/docker/certs/*.pem` (certificates readable by all)
  - `chmod 400 /etc/docker/certs/*-key.pem` (keys readable by root only)
- Rotate certificates regularly (every 6-12 months)
- Revoke compromised certificates and regenerate
Docker Daemon Configuration
For production servers, use the setup-docker-tls.sh script or manually configure /etc/docker/daemon.json:
{
"tls": true,
"tlsverify": true,
"tlscacert": "/etc/docker/certs/ca.pem",
"tlscert": "/etc/docker/certs/server-cert.pem",
"tlskey": "/etc/docker/certs/server-key.pem",
"hosts": ["tcp://0.0.0.0:2376"],
"iptables": false,
"bridge": "none",
"live-restore": true,
"userland-proxy": false,
"no-new-privileges": true
}
Security Hardening
- Firewall: Only allow TLS port (2376) from trusted networks
- TLS 1.3: Ensure modern TLS version support
- Certificate pinning: Consider certificate pinning in client code
- Monitoring: Log and monitor Docker API access
- Rate limiting: Implement API rate limiting
Troubleshooting
Common Issues
"Connection refused"
- Check if Docker daemon is running with TLS
- Verify `DOCKER_HOST` points to the correct endpoint
- Ensure firewall allows port 2376
"TLS handshake failed"
- Verify certificates exist and have correct permissions
- Check certificate validity dates
- Ensure CA certificate is correct
"Permission denied"
- Check certificate file permissions (444 for certs, 400 for keys)
- Ensure client certificate is signed by the CA
Debug Commands
# Test TLS connection manually
docker --tlsverify \
--tlscacert=./docker/certs/ca.pem \
--tlscert=./docker/certs/client-cert.pem \
--tlskey=./docker/certs/client-key.pem \
-H tcp://host.docker.internal:2376 \
version
# Check certificate validity
openssl x509 -in ./docker/certs/server-cert.pem -text -noout
# Test from container
docker-compose exec session-manager ./docker/scripts/test-tls-connection.py
Migration from Socket Mounting
Before (Insecure)
volumes:
- /var/run/docker.sock:/var/run/docker.sock
After (Secure)
volumes:
- ./docker/certs:/etc/docker/certs:ro
environment:
- DOCKER_TLS_VERIFY=1
- DOCKER_HOST=tcp://host.docker.internal:2376
Code Changes Required
Update Docker client initialization:
# Before
self.docker_client = docker.from_env()
# After
tls_config = docker.tls.TLSConfig(
ca_cert=os.getenv('DOCKER_CA_CERT'),
client_cert=(os.getenv('DOCKER_CLIENT_CERT'), os.getenv('DOCKER_CLIENT_KEY')),
verify=True
)
self.docker_client = docker.DockerClient(
    base_url=os.getenv('DOCKER_HOST'),
    tls=tls_config
)
Dynamic Host IP Detection
The session-manager service now includes robust host IP detection to support proxy routing across different Docker environments:
Supported Environments
- Docker Desktop (Mac/Windows): Uses `host.docker.internal` resolution
- Linux Docker: Reads the gateway from `/proc/net/route`
- Cloud environments: Respects the `DOCKER_HOST_GATEWAY` and `GATEWAY` environment variables
- Custom networks: Tests connectivity to common Docker gateway IPs
Detection Methods (in priority order)
1. Docker Internal: Resolves `host.docker.internal` (Docker Desktop)
2. Environment Variables: Checks `HOST_IP`, `DOCKER_HOST_GATEWAY`, `GATEWAY`
3. Route Table: Parses `/proc/net/route` for the default gateway
4. Network Connection: Tests connectivity to determine local routing
5. Common Gateways: Falls back to known Docker bridge IPs
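For illustration, the route-table method could be implemented along these lines (a sketch only; the service's actual detection code is not shown here and may differ):

```python
import socket
import struct
from typing import Optional

def default_gateway_from_proc(route_file: str = "/proc/net/route") -> Optional[str]:
    """Parse the Linux route table and return the default gateway IP, if any."""
    try:
        with open(route_file) as fh:
            next(fh)  # skip the header row
            for line in fh:
                fields = line.split()
                # A destination of 00000000 with the RTF_GATEWAY flag (0x2)
                # set marks the default route.
                if fields[1] == "00000000" and int(fields[3], 16) & 0x2:
                    # The gateway is a little-endian hex IP, e.g. 0101A8C0 -> 192.168.1.1
                    return socket.inet_ntoa(struct.pack("<L", int(fields[2], 16)))
    except (OSError, StopIteration):
        pass
    return None
```

This only works on Linux hosts, which is why it sits behind the `host.docker.internal` and environment-variable checks in the priority order above.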
Configuration
The detection is automatic and cached for 5 minutes. Override with:
# Force specific host IP
export HOST_IP=192.168.1.100
# Or in docker-compose.yml
environment:
- HOST_IP=your-host-ip
Testing
# Test host IP detection
./docker/scripts/test-host-ip-detection.py
# Run integration test
./docker/scripts/test-integration.sh
Troubleshooting
"Could not detect Docker host IP"
- Check network configuration: `docker network inspect bridge`
- Verify environment variables
- Test connectivity: `ping host.docker.internal`
- Set an explicit `HOST_IP` if needed
Proxy routing fails
- Verify detected IP is accessible from containers
- Check firewall rules blocking container-to-host traffic
- Ensure Docker network allows communication
Structured Logging
Comprehensive logging infrastructure with structured JSON logs, request tracking, and production-ready log management for debugging and monitoring.
Log Features
- Structured JSON Logs: Machine-readable logs for production analysis
- Request ID Tracking: Trace requests across distributed operations
- Human-Readable Development: Clear logs for local development
- Performance Metrics: Built-in request timing and performance tracking
- Security Event Logging: Audit trail for security-related events
- Log Rotation: Automatic log rotation with size limits
Configuration
# Log level and format
export LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
export LOG_FORMAT=auto # json, human, auto (detects environment)
# File logging
export LOG_FILE=/var/log/lovdata-chat.log
export LOG_MAX_SIZE_MB=10 # Max log file size
export LOG_BACKUP_COUNT=5 # Number of backup files
# Output control
export LOG_CONSOLE=true # Enable console logging
export LOG_FILE_ENABLED=true # Enable file logging
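A minimal sketch of how the formatter and rotation settings above could be wired up with the standard library (the project's actual logging module is assumed to be more featureful):

```python
import json
import logging
import os
from logging.handlers import RotatingFileHandler

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

def configure_logging() -> logging.Logger:
    logger = logging.getLogger("session_manager")
    logger.setLevel(os.getenv("LOG_LEVEL", "INFO"))
    if os.getenv("LOG_FILE_ENABLED", "true") == "true" and os.getenv("LOG_FILE"):
        handler = RotatingFileHandler(
            os.environ["LOG_FILE"],
            maxBytes=int(os.getenv("LOG_MAX_SIZE_MB", "10")) * 1024 * 1024,
            backupCount=int(os.getenv("LOG_BACKUP_COUNT", "5")),
        )
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```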
Testing Structured Logging
# Test logging functionality and formatters
./docker/scripts/test-structured-logging.py
Log Analysis
JSON Format (Production):
{
"timestamp": "2024-01-15T10:30:45.123Z",
"level": "INFO",
"logger": "session_manager.main",
"message": "Session created successfully",
"request_id": "req-abc123",
"session_id": "ses-xyz789",
"operation": "create_session",
"duration_ms": 245.67
}
Human-Readable Format (Development):
2024-01-15 10:30:45 [INFO ] session_manager.main:create_session:145 [req-abc123] - Session created successfully
Request Tracing
All logs include request IDs for tracing operations across the system:
with RequestContext():
log_session_operation(session_id, "created")
# All subsequent logs in this context include request_id
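One way such a context could be implemented is with `contextvars`; the class name mirrors the snippet above, but the real implementation is assumed to differ:

```python
import contextvars
import uuid
from typing import Optional

# Holds the request ID for the current async task or thread.
_request_id = contextvars.ContextVar("request_id", default=None)

class RequestContext:
    """Attach a request ID to everything logged inside the `with` block."""

    def __enter__(self) -> "RequestContext":
        self._token = _request_id.set(f"req-{uuid.uuid4().hex[:6]}")
        return self

    def __exit__(self, *exc) -> None:
        _request_id.reset(self._token)

def current_request_id() -> Optional[str]:
    """Read the active request ID (None outside any RequestContext)."""
    return _request_id.get()
```

Because `ContextVar` values are task-local, concurrent requests in the same event loop each see their own ID.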
Database Persistence
Session data is now stored in PostgreSQL for reliability, multi-instance deployment support, and elimination of JSON file corruption vulnerabilities.
Database Configuration
# PostgreSQL connection settings
export DB_HOST=localhost # Database host
export DB_PORT=5432 # Database port
export DB_USER=lovdata # Database user
export DB_PASSWORD=password # Database password
export DB_NAME=lovdata_chat # Database name
# Connection pool settings
export DB_MIN_CONNECTIONS=5 # Minimum pool connections
export DB_MAX_CONNECTIONS=20 # Maximum pool connections
export DB_MAX_QUERIES=50000 # Max queries per connection
export DB_MAX_INACTIVE_LIFETIME=300.0 # Connection timeout
Storage Backend Selection
# Enable database storage (recommended for production)
export USE_DATABASE_STORAGE=true
# Or use JSON file storage (legacy/development)
export USE_DATABASE_STORAGE=false
Database Schema
Sessions Table:
- `session_id` (VARCHAR, Primary Key): Unique session identifier
- `container_name` (VARCHAR): Docker container name
- `container_id` (VARCHAR): Docker container ID
- `host_dir` (VARCHAR): Host directory path
- `port` (INTEGER): Container port
- `auth_token` (VARCHAR): Authentication token
- `created_at` (TIMESTAMP): Creation timestamp
- `last_accessed` (TIMESTAMP): Last access timestamp
- `status` (VARCHAR): Session status (creating, running, stopped, error)
- `metadata` (JSONB): Additional session metadata
Indexes:
- Primary key on `session_id`
- Status index for filtering active sessions
- Last accessed index for cleanup operations
- Created at index for session listing
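For reference, the schema above implies DDL roughly like the following (held here as a Python string; the real migration code may differ in exact types, defaults, and index names):

```python
# Sketch of the sessions-table DDL implied by the schema description above.
# Column names come from this README; everything else is an assumption.
SESSIONS_DDL = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id     VARCHAR PRIMARY KEY,
    container_name VARCHAR,
    container_id   VARCHAR,
    host_dir       VARCHAR,
    port           INTEGER,
    auth_token     VARCHAR,
    created_at     TIMESTAMP DEFAULT now(),
    last_accessed  TIMESTAMP DEFAULT now(),
    status         VARCHAR DEFAULT 'creating',
    metadata       JSONB DEFAULT '{}'
);
CREATE INDEX IF NOT EXISTS idx_sessions_status        ON sessions (status);
CREATE INDEX IF NOT EXISTS idx_sessions_last_accessed ON sessions (last_accessed);
CREATE INDEX IF NOT EXISTS idx_sessions_created_at    ON sessions (created_at);
"""
```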
Testing Database Persistence
# Test database connection and operations
./docker/scripts/test-database-persistence.py
Health Monitoring
The /health endpoint now includes database status:
{
"storage_backend": "database",
"database": {
"status": "healthy",
"total_sessions": 15,
"active_sessions": 8,
"database_size": "25 MB"
}
}
Migration Strategy
From JSON File to Database:
- Backup existing sessions (if any)
- Set environment variables for database connection
- Enable database storage: `USE_DATABASE_STORAGE=true`
- Restart the service; schema creation and migration happen automatically
- Verify data migration in health endpoint
- Monitor performance and adjust connection pool settings
Backward Compatibility:
- JSON file storage remains available for development
- Automatic fallback if database is unavailable
- Zero-downtime migration possible
Container Health Monitoring
Active monitoring of Docker containers with automatic failure detection and recovery mechanisms to prevent stuck sessions and improve system reliability.
Health Monitoring Features
- Periodic Health Checks: Continuous monitoring of running containers every 30 seconds
- Automatic Failure Detection: Identifies unhealthy or failed containers
- Smart Restart Logic: Automatic container restart with configurable limits
- Health History Tracking: Maintains health check history for analysis
- Status Integration: Updates session status based on container health
Configuration
# Health check intervals and timeouts
CONTAINER_HEALTH_CHECK_INTERVAL=30 # Check every 30 seconds
CONTAINER_HEALTH_TIMEOUT=10.0 # Health check timeout
CONTAINER_MAX_RESTART_ATTEMPTS=3 # Max restart attempts
CONTAINER_RESTART_DELAY=5 # Delay between restarts
CONTAINER_FAILURE_THRESHOLD=3 # Failures before restart
Health Status Types
- HEALTHY: Container running normally with optional health checks passing
- UNHEALTHY: Container running but health checks failing
- RESTARTING: Container being restarted due to failures
- FAILED: Container stopped or permanently failed
- UNKNOWN: Unable to determine container status
Testing Health Monitoring
# Test health monitoring functionality
./docker/scripts/test-container-health.py
Health Endpoints
System Health:
GET /health # Includes container health statistics
Detailed Container Health:
GET /health/container # Overall health stats
GET /health/container/{session_id} # Specific session health
Health Response:
{
"container_health": {
"monitoring_active": true,
"check_interval": 30,
"total_sessions_monitored": 5,
"sessions_with_failures": 1,
"session_ses123": {
"total_checks": 10,
"healthy_checks": 8,
"failed_checks": 2,
"average_response_time": 45.2
}
}
}
Recovery Mechanisms
- Health Check Failure: Container marked as unhealthy
- Consecutive Failures: After threshold, automatic restart initiated
- Restart Attempts: Limited to prevent infinite restart loops
- Session Status Update: Session status reflects container health
- Logging & Alerts: Comprehensive logging of health events
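The failure-threshold logic in the steps above could be sketched as follows (threshold constants mirror the configuration section; the class and method names are illustrative, not the service's actual API):

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3      # CONTAINER_FAILURE_THRESHOLD
MAX_RESTART_ATTEMPTS = 3   # CONTAINER_MAX_RESTART_ATTEMPTS

@dataclass
class HealthTracker:
    """Tracks one session's health checks and decides the recovery action."""
    consecutive_failures: int = 0
    restart_attempts: int = 0

    def record(self, healthy: bool) -> str:
        """Record one health check and return the action to take."""
        if healthy:
            self.consecutive_failures = 0
            return "none"
        self.consecutive_failures += 1
        if self.consecutive_failures < FAILURE_THRESHOLD:
            return "none"
        if self.restart_attempts >= MAX_RESTART_ATTEMPTS:
            return "mark_failed"        # give up: prevents infinite restart loops
        self.restart_attempts += 1
        self.consecutive_failures = 0   # a restart resets the failure window
        return "restart"
```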
Integration Benefits
- Proactive Monitoring: Detects issues before users are affected
- Automatic Recovery: Reduces manual intervention requirements
- Improved Reliability: Prevents stuck sessions and system instability
- Operational Visibility: Detailed health metrics and history
- Scalable Architecture: Works with multiple concurrent sessions
Session Authentication
OpenCode servers now require token-based authentication for secure individual user sessions, preventing unauthorized access and ensuring session isolation.
Authentication Features
- Token Generation: Unique cryptographically secure tokens per session
- Automatic Expiry: Configurable token lifetime (default 24 hours)
- Token Rotation: Ability to rotate tokens for enhanced security
- Session Isolation: Each user session has its own authentication credentials
- Proxy Integration: Authentication headers automatically included in proxy requests
Configuration
# Token configuration
export SESSION_TOKEN_LENGTH=32 # Token length in characters
export SESSION_TOKEN_EXPIRY_HOURS=24 # Token validity period
export SESSION_TOKEN_SECRET=auto # Token signing secret (auto-generated)
export TOKEN_CLEANUP_INTERVAL_MINUTES=60 # Expired token cleanup interval
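A hedged sketch of how such tokens might be generated and expired with the standard library (the service's real token module is not shown in this README):

```python
import secrets
from datetime import datetime, timedelta, timezone
from typing import Tuple

TOKEN_LENGTH = 32          # SESSION_TOKEN_LENGTH
TOKEN_EXPIRY_HOURS = 24    # SESSION_TOKEN_EXPIRY_HOURS

def generate_session_token() -> Tuple[str, datetime]:
    """Return a cryptographically secure token and its expiry timestamp."""
    # token_urlsafe(n) yields ~1.3 chars per byte, so trim to the target length.
    token = secrets.token_urlsafe(TOKEN_LENGTH)[:TOKEN_LENGTH]
    expires_at = datetime.now(timezone.utc) + timedelta(hours=TOKEN_EXPIRY_HOURS)
    return token, expires_at

def is_token_expired(expires_at: datetime) -> bool:
    return datetime.now(timezone.utc) >= expires_at
```

Using `secrets` rather than `random` is what makes the tokens suitable for authentication.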
Testing Authentication
# Test authentication functionality
./docker/scripts/test-session-auth.py
# End-to-end authentication testing
./docker/scripts/test-auth-end-to-end.sh
API Endpoints
Authentication Management:
- `GET /sessions/{id}/auth` - Get session authentication info
- `POST /sessions/{id}/auth/rotate` - Rotate session token
- `GET /auth/sessions` - List authenticated sessions
Health Monitoring:
{
"authenticated_sessions": 3,
"status": "healthy"
}
Security Benefits
- Session Isolation: Users cannot access each other's OpenCode servers
- Token Expiry: Automatic cleanup prevents token accumulation
- Secure Generation: Cryptographically secure random tokens
- Proxy Security: Authentication headers prevent unauthorized proxy access
HTTP Connection Pooling
Proxy requests now use a global HTTP connection pool instead of creating new httpx clients for each request, eliminating connection overhead and dramatically improving proxy performance.
Connection Pool Benefits
- Eliminated Connection Overhead: No more client creation/teardown per request
- Connection Reuse: Persistent keep-alive connections reduce latency
- Improved Throughput: Handle significantly more concurrent proxy requests
- Reduced Resource Usage: Lower memory and CPU overhead for HTTP operations
- Better Scalability: Support higher request rates with the same system resources
Pool Configuration
The connection pool is automatically configured with optimized settings:
# Connection pool settings
max_keepalive_connections=20 # Keep connections alive
max_connections=100 # Max total connections
keepalive_expiry=300.0 # 5-minute connection lifetime
connect_timeout=10.0 # Connection establishment timeout
read_timeout=30.0 # Read operation timeout
Performance Testing
# Test HTTP connection pool functionality
./docker/scripts/test-http-connection-pool.py
# Load test proxy performance improvements
./docker/scripts/test-http-pool-load.sh
Health Monitoring
The /health endpoint now includes HTTP connection pool status:
{
"http_connection_pool": {
"status": "healthy",
"config": {
"max_keepalive_connections": 20,
"max_connections": 100,
"keepalive_expiry": 300.0
}
}
}
Async Docker Operations
Docker operations now run asynchronously using aiodocker to eliminate blocking calls in FastAPI's async event loop, significantly improving concurrency and preventing thread pool exhaustion.
Async Benefits
- Non-Blocking Operations: Container creation, management, and cleanup no longer block the event loop
- Improved Concurrency: Handle multiple concurrent user sessions without performance degradation
- Better Scalability: Support higher throughput with the same system resources
- Thread Pool Preservation: Prevent exhaustion of async thread pools
Configuration
# Enable async Docker operations (recommended)
export USE_ASYNC_DOCKER=true
# Or disable for sync mode (legacy)
export USE_ASYNC_DOCKER=false
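When async mode is disabled, blocking docker-py calls still must not stall the event loop; one common pattern is to offload them to a worker thread (a sketch under that assumption, with `run_docker_call` as an illustrative name):

```python
import asyncio
import os

USE_ASYNC_DOCKER = os.getenv("USE_ASYNC_DOCKER", "true").lower() == "true"

async def run_docker_call(blocking_fn, *args, **kwargs):
    """Run a blocking docker-py call without blocking the event loop.

    With USE_ASYNC_DOCKER enabled, the service would instead call aiodocker's
    native coroutines; this fallback offloads the sync call to a thread.
    """
    return await asyncio.to_thread(blocking_fn, *args, **kwargs)
```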
Testing Async Operations
# Test async Docker functionality
./docker/scripts/test-async-docker.py
# Load test concurrent operations
./docker/scripts/test-async-docker-load.sh
Performance Impact
Async operations provide significant performance improvements:
- Concurrent Sessions: Handle 10+ concurrent container operations without blocking
- Response Times: Faster session creation under load
- Resource Efficiency: Better CPU utilization with non-blocking I/O
- Scalability: Support more users per server instance
Resource Limits Enforcement
Container resource limits are now actively enforced to prevent resource exhaustion attacks and ensure fair resource allocation across user sessions.
Configurable Limits
| Environment Variable | Default | Description |
|---|---|---|
| `CONTAINER_MEMORY_LIMIT` | `4g` | Memory limit per container |
| `CONTAINER_CPU_QUOTA` | `100000` | CPU quota (microseconds per period) |
| `CONTAINER_CPU_PERIOD` | `100000` | CPU period (microseconds) |
| `MAX_CONCURRENT_SESSIONS` | `3` | Maximum concurrent user sessions |
| `MEMORY_WARNING_THRESHOLD` | `0.8` | Memory usage warning threshold (80%) |
| `CPU_WARNING_THRESHOLD` | `0.9` | CPU usage warning threshold (90%) |
Resource Protection Features
- Memory Limits: Prevents containers from consuming unlimited RAM
- CPU Quotas: Ensures fair CPU allocation across sessions
- Session Throttling: Blocks new sessions when resources are constrained
- System Monitoring: Continuous resource usage tracking
- Graceful Degradation: Alerts and throttling before system failure
Testing Resource Limits
# Test resource limit configuration and validation
./docker/scripts/test-resource-limits.py
# Load testing with enforcement verification
./docker/scripts/test-resource-limits-load.sh
Health Monitoring
The /health endpoint now includes comprehensive resource information:
{
"resource_limits": {
"memory_limit": "4g",
"cpu_quota": 100000,
"max_concurrent_sessions": 3
},
"system_resources": {
"memory_percent": 0.65,
"cpu_percent": 0.45
},
"resource_alerts": []
}
Resource Alert Levels
- Warning: System resources approaching limits (80% memory, 90% CPU)
- Critical: System resources at dangerous levels (95%+ usage)
- Throttling: New sessions blocked when critical alerts active
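The alert levels above map naturally onto a small classifier (warning thresholds from the configuration table; the 95% critical cutoff and function names are assumptions, and the real monitor would feed in live readings, e.g. from psutil):

```python
MEMORY_WARNING_THRESHOLD = 0.8   # MEMORY_WARNING_THRESHOLD
CPU_WARNING_THRESHOLD = 0.9      # CPU_WARNING_THRESHOLD
CRITICAL_THRESHOLD = 0.95        # assumed cutoff for the "critical" level

def classify_resources(memory_percent: float, cpu_percent: float) -> str:
    """Return the alert level for current usage (fractions in 0.0-1.0)."""
    if memory_percent >= CRITICAL_THRESHOLD or cpu_percent >= CRITICAL_THRESHOLD:
        return "critical"   # new sessions should be throttled
    if memory_percent >= MEMORY_WARNING_THRESHOLD or cpu_percent >= CPU_WARNING_THRESHOLD:
        return "warning"
    return "ok"

def allow_new_session(memory_percent: float, cpu_percent: float) -> bool:
    """Session throttling: block new sessions while a critical alert is active."""
    return classify_resources(memory_percent, cpu_percent) != "critical"
```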
Security Audit Checklist
- [ ] TLS certificates generated with strong encryption
- [ ] Certificate permissions set correctly (400/444)
- [ ] No socket mounting in docker-compose.yml
- [ ] Environment variables properly configured
- [ ] TLS connection tested successfully
- [ ] Host IP detection working correctly
- [ ] Proxy routing functional across environments
- [ ] Resource limits properly configured and enforced
- [ ] Session throttling prevents resource exhaustion
- [ ] System resource monitoring active
- [ ] Certificate rotation process documented
- [ ] Firewall rules restrict Docker API access
- [ ] Docker daemon configured with security options
- [ ] Monitoring and logging enabled for API access