Docker TLS Security Setup
This directory contains scripts and configuration for securing Docker API access with TLS authentication, replacing the insecure socket mounting approach.
Overview
Previously, the session-manager service mounted the Docker socket (/var/run/docker.sock) directly into containers, granting full root access to the host Docker daemon. This is a critical security vulnerability.
This setup replaces socket mounting with authenticated TLS API access over the network.
Security Benefits
- ✅ No socket mounting: Eliminates privilege escalation risk
- ✅ Mutual TLS authentication: Both client and server authenticate
- ✅ Encrypted communication: All API calls are encrypted
- ✅ Certificate-based access: Granular access control
- ✅ Network isolation: API access is network-bound, not filesystem-bound
Docker Service Abstraction
The session-manager now uses a clean DockerService abstraction layer that separates Docker operations from business logic, making the code easier to test and maintain and allowing the underlying Docker client to be swapped out later.
Architecture Benefits
- 🧪 Testability: MockDockerService enables testing without Docker daemon
- 🔧 Maintainability: Clean separation of concerns
- 🔄 Flexibility: Easy to swap Docker client implementations
- 📦 Dependency Injection: SessionManager receives DockerService via constructor
- ⚡ Performance: Both async and sync Docker operations supported
Service Interface
class DockerService:
async def create_container(self, name: str, image: str, **kwargs) -> ContainerInfo
async def start_container(self, container_id: str) -> None
async def stop_container(self, container_id: str, timeout: int = 10) -> None
async def remove_container(self, container_id: str, force: bool = False) -> None
async def get_container_info(self, container_id: str) -> Optional[ContainerInfo]
async def list_containers(self, all: bool = False) -> List[ContainerInfo]
async def ping(self) -> bool
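The MockDockerService mentioned below could look roughly like this minimal in-memory sketch of the interface above (the `ContainerInfo` fields are assumptions; the real class may carry more metadata):

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ContainerInfo:
    # Assumed fields, inferred from the interface above.
    id: str
    name: str
    image: str
    status: str = "created"

class MockDockerService:
    """In-memory stand-in for DockerService, for tests without a Docker daemon."""

    def __init__(self) -> None:
        self._containers: Dict[str, ContainerInfo] = {}

    async def create_container(self, name: str, image: str, **kwargs) -> ContainerInfo:
        info = ContainerInfo(id=uuid.uuid4().hex[:12], name=name, image=image)
        self._containers[info.id] = info
        return info

    async def start_container(self, container_id: str) -> None:
        self._containers[container_id].status = "running"

    async def stop_container(self, container_id: str, timeout: int = 10) -> None:
        self._containers[container_id].status = "exited"

    async def remove_container(self, container_id: str, force: bool = False) -> None:
        self._containers.pop(container_id, None)

    async def get_container_info(self, container_id: str) -> Optional[ContainerInfo]:
        return self._containers.get(container_id)

    async def list_containers(self, all: bool = False) -> List[ContainerInfo]:
        return [c for c in self._containers.values() if all or c.status == "running"]

    async def ping(self) -> bool:
        return True
```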
Testing
Run the comprehensive test suite:
# Test Docker service abstraction
./docker/scripts/test-docker-service.py
# Results: 7/7 tests passed ✅
# - Service Interface ✅
# - Error Handling ✅
# - Async vs Sync Modes ✅
# - Container Info Operations ✅
# - Context Management ✅
# - Integration Patterns ✅
# - Performance and Scaling ✅
Usage in SessionManager
# Dependency injection pattern
session_manager = SessionManager(docker_service=DockerService(use_async=True))
# Or with mock for testing
test_manager = SessionManager(docker_service=MockDockerService())
Files Structure
docker/
├── certs/ # Generated TLS certificates (not in git)
├── scripts/
│ ├── generate-certs.sh # Certificate generation script
│ ├── setup-docker-tls.sh # Docker daemon TLS configuration
│ └── test-tls-connection.py # Connection testing script
├── daemon.json # Docker daemon TLS configuration
└── .env.example # Environment configuration template
Quick Start
1. Generate TLS Certificates
# Generate certificates for development
DOCKER_ENV=development ./docker/scripts/generate-certs.sh
# Or for production with custom settings
DOCKER_ENV=production \
DOCKER_HOST_IP=your-server-ip \
DOCKER_HOST_NAME=your-docker-host \
./docker/scripts/generate-certs.sh
2. Configure Docker Daemon
For local development (Docker Desktop):
# Certificates are automatically mounted in docker-compose.yml
docker-compose up -d
For production/server setup:
# Configure system Docker daemon with TLS
sudo ./docker/scripts/setup-docker-tls.sh
3. Configure Environment
# Copy and customize environment file
cp docker/.env.example .env
# Edit .env with your settings
# DOCKER_HOST_IP=host.docker.internal # for Docker Desktop
# DOCKER_HOST_IP=your-server-ip # for production
4. Test Configuration
# Test TLS connection
./docker/scripts/test-tls-connection.py
# Start services
docker-compose --env-file .env up -d session-manager
# Check logs
docker-compose logs session-manager
Configuration Options
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `DOCKER_TLS_VERIFY` | `1` | Enable TLS verification |
| `DOCKER_CERT_PATH` | `./docker/certs` | Certificate directory path |
| `DOCKER_HOST` | `tcp://host.docker.internal:2376` | Docker daemon endpoint |
| `DOCKER_TLS_PORT` | `2376` | TLS port for Docker API |
| `DOCKER_CA_CERT` | `./docker/certs/ca.pem` | CA certificate path |
| `DOCKER_CLIENT_CERT` | `./docker/certs/client-cert.pem` | Client certificate path |
| `DOCKER_CLIENT_KEY` | `./docker/certs/client-key.pem` | Client key path |
| `DOCKER_HOST_IP` | `host.docker.internal` | Docker host IP |
Certificate Generation Options
| Variable | Default | Description |
|---|---|---|
| `DOCKER_ENV` | `development` | Environment name for certificates |
| `DOCKER_HOST_IP` | `127.0.0.1` | IP address for server certificate |
| `DOCKER_HOST_NAME` | `localhost` | Hostname for server certificate |
| `DAYS` | `3650` | Certificate validity in days |
Production Deployment
Certificate Management
- Generate certificates on a secure machine
- Distribute to servers securely (SCP, Ansible, etc.)
- Set proper permissions:
  - `chmod 444 /etc/docker/certs/*.pem` (certificates readable by all)
  - `chmod 400 /etc/docker/certs/*-key.pem` (keys readable by root only)
- Rotate certificates regularly (every 6-12 months)
- Revoke compromised certificates and regenerate
Docker Daemon Configuration
For production servers, use the setup-docker-tls.sh script or manually configure /etc/docker/daemon.json:
{
"tls": true,
"tlsverify": true,
"tlscacert": "/etc/docker/certs/ca.pem",
"tlscert": "/etc/docker/certs/server-cert.pem",
"tlskey": "/etc/docker/certs/server-key.pem",
"hosts": ["tcp://0.0.0.0:2376"],
"iptables": false,
"bridge": "none",
"live-restore": true,
"userland-proxy": false,
"no-new-privileges": true
}
Security Hardening
- Firewall: Only allow TLS port (2376) from trusted networks
- TLS 1.3: Ensure modern TLS version support
- Certificate pinning: Consider certificate pinning in client code
- Monitoring: Log and monitor Docker API access
- Rate limiting: Implement API rate limiting
Troubleshooting
Common Issues
"Connection refused"
- Check if Docker daemon is running with TLS
- Verify `DOCKER_HOST` points to the correct endpoint
- Ensure firewall allows port 2376
"TLS handshake failed"
- Verify certificates exist and have correct permissions
- Check certificate validity dates
- Ensure CA certificate is correct
"Permission denied"
- Check certificate file permissions (444 for certs, 400 for keys)
- Ensure client certificate is signed by the CA
Debug Commands
# Test TLS connection manually
docker --tlsverify \
--tlscacert=./docker/certs/ca.pem \
--tlscert=./docker/certs/client-cert.pem \
--tlskey=./docker/certs/client-key.pem \
-H tcp://host.docker.internal:2376 \
version
# Check certificate validity
openssl x509 -in ./docker/certs/server-cert.pem -text -noout
# Test from container
docker-compose exec session-manager ./docker/scripts/test-tls-connection.py
Migration from Socket Mounting
Before (Insecure)
volumes:
- /var/run/docker.sock:/var/run/docker.sock
After (Secure)
volumes:
- ./docker/certs:/etc/docker/certs:ro
environment:
- DOCKER_TLS_VERIFY=1
- DOCKER_HOST=tcp://host.docker.internal:2376
Code Changes Required
Update Docker client initialization:
# Before
self.docker_client = docker.from_env()
# After
tls_config = docker.tls.TLSConfig(
ca_cert=os.getenv('DOCKER_CA_CERT'),
client_cert=(os.getenv('DOCKER_CLIENT_CERT'), os.getenv('DOCKER_CLIENT_KEY')),
verify=True
)
self.docker_client = docker.DockerClient(
    base_url=os.getenv('DOCKER_HOST'),
    tls=tls_config
)
Dynamic Host IP Detection
The session-manager service now includes robust host IP detection to support proxy routing across different Docker environments:
Supported Environments
- Docker Desktop (Mac/Windows): Uses `host.docker.internal` resolution
- Linux Docker: Reads the gateway from `/proc/net/route`
- Cloud environments: Respects the `DOCKER_HOST_GATEWAY` and `GATEWAY` environment variables
- Custom networks: Tests connectivity to common Docker gateway IPs
Detection Methods (in priority order)
1. Docker Internal: Resolves `host.docker.internal` (Docker Desktop)
2. Environment Variables: Checks `HOST_IP`, `DOCKER_HOST_GATEWAY`, `GATEWAY`
3. Route Table: Parses `/proc/net/route` for the default gateway
4. Network Connection: Tests connectivity to determine local routing
5. Common Gateways: Falls back to known Docker bridge IPs
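For illustration, the route-table method could be implemented along these lines (a sketch only; the service's actual detection code is not shown here and may differ):

```python
import socket
import struct
from typing import Optional

def default_gateway_from_proc(route_file: str = "/proc/net/route") -> Optional[str]:
    """Parse the Linux route table and return the default gateway IP, if any."""
    try:
        with open(route_file) as fh:
            next(fh)  # skip the header row
            for line in fh:
                fields = line.split()
                # A destination of 00000000 with the RTF_GATEWAY flag (0x2)
                # set marks the default route.
                if fields[1] == "00000000" and int(fields[3], 16) & 0x2:
                    # The gateway is a little-endian hex IP, e.g. 0101A8C0 -> 192.168.1.1
                    return socket.inet_ntoa(struct.pack("<L", int(fields[2], 16)))
    except (OSError, StopIteration):
        pass
    return None
```

This only works on Linux hosts, which is why it sits behind the `host.docker.internal` and environment-variable checks in the priority order above.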
Configuration
The detection is automatic and cached for 5 minutes. Override with:
# Force specific host IP
export HOST_IP=192.168.1.100
# Or in docker-compose.yml
environment:
- HOST_IP=your-host-ip
Testing
# Test host IP detection
./docker/scripts/test-host-ip-detection.py
# Run integration test
./docker/scripts/test-integration.sh
Troubleshooting
"Could not detect Docker host IP"
- Check network configuration: `docker network inspect bridge`
- Verify environment variables
- Test connectivity: `ping host.docker.internal`
- Set an explicit `HOST_IP` if needed
Proxy routing fails
- Verify detected IP is accessible from containers
- Check firewall rules blocking container-to-host traffic
- Ensure Docker network allows communication
Structured Logging
Comprehensive logging infrastructure with structured JSON logs, request tracking, and production-ready log management for debugging and monitoring.
Log Features
- Structured JSON Logs: Machine-readable logs for production analysis
- Request ID Tracking: Trace requests across distributed operations
- Human-Readable Development: Clear logs for local development
- Performance Metrics: Built-in request timing and performance tracking
- Security Event Logging: Audit trail for security-related events
- Log Rotation: Automatic log rotation with size limits
Configuration
# Log level and format
export LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
export LOG_FORMAT=auto # json, human, auto (detects environment)
# File logging
export LOG_FILE=/var/log/lovdata-chat.log
export LOG_MAX_SIZE_MB=10 # Max log file size
export LOG_BACKUP_COUNT=5 # Number of backup files
# Output control
export LOG_CONSOLE=true # Enable console logging
export LOG_FILE_ENABLED=true # Enable file logging
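A minimal sketch of how the formatter and rotation settings above could be wired up with the standard library (the project's actual logging module is assumed to be more featureful):

```python
import json
import logging
import os
from logging.handlers import RotatingFileHandler

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

def configure_logging() -> logging.Logger:
    logger = logging.getLogger("session_manager")
    logger.setLevel(os.getenv("LOG_LEVEL", "INFO"))
    if os.getenv("LOG_FILE_ENABLED", "true") == "true" and os.getenv("LOG_FILE"):
        handler = RotatingFileHandler(
            os.environ["LOG_FILE"],
            maxBytes=int(os.getenv("LOG_MAX_SIZE_MB", "10")) * 1024 * 1024,
            backupCount=int(os.getenv("LOG_BACKUP_COUNT", "5")),
        )
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```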
Testing Structured Logging
# Test logging functionality and formatters
./docker/scripts/test-structured-logging.py
Log Analysis
JSON Format (Production):
{
"timestamp": "2024-01-15T10:30:45.123Z",
"level": "INFO",
"logger": "session_manager.main",
"message": "Session created successfully",
"request_id": "req-abc123",
"session_id": "ses-xyz789",
"operation": "create_session",
"duration_ms": 245.67
}
Human-Readable Format (Development):
2024-01-15 10:30:45 [INFO ] session_manager.main:create_session:145 [req-abc123] - Session created successfully
Request Tracing
All logs include request IDs for tracing operations across the system:
with RequestContext():
log_session_operation(session_id, "created")
# All subsequent logs in this context include request_id
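One way such a context could be implemented is with `contextvars`; the class name mirrors the snippet above, but the real implementation is assumed to differ:

```python
import contextvars
import uuid
from typing import Optional

# Holds the request ID for the current async task or thread.
_request_id = contextvars.ContextVar("request_id", default=None)

class RequestContext:
    """Attach a request ID to everything logged inside the `with` block."""

    def __enter__(self) -> "RequestContext":
        self._token = _request_id.set(f"req-{uuid.uuid4().hex[:6]}")
        return self

    def __exit__(self, *exc) -> None:
        _request_id.reset(self._token)

def current_request_id() -> Optional[str]:
    """Read the active request ID (None outside any RequestContext)."""
    return _request_id.get()
```

Because `ContextVar` values are task-local, concurrent requests in the same event loop each see their own ID.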
Database Persistence
Session data is now stored in PostgreSQL for reliability, multi-instance deployment support, and elimination of JSON file corruption vulnerabilities.
Database Configuration
# PostgreSQL connection settings
export DB_HOST=localhost # Database host
export DB_PORT=5432 # Database port
export DB_USER=lovdata # Database user
export DB_PASSWORD=password # Database password
export DB_NAME=lovdata_chat # Database name
# Connection pool settings
export DB_MIN_CONNECTIONS=5 # Minimum pool connections
export DB_MAX_CONNECTIONS=20 # Maximum pool connections
export DB_MAX_QUERIES=50000 # Max queries per connection
export DB_MAX_INACTIVE_LIFETIME=300.0 # Connection timeout
Storage Backend Selection
# Enable database storage (recommended for production)
export USE_DATABASE_STORAGE=true
# Or use JSON file storage (legacy/development)
export USE_DATABASE_STORAGE=false
Database Schema
Sessions Table:
- `session_id` (VARCHAR, Primary Key): Unique session identifier
- `container_name` (VARCHAR): Docker container name
- `container_id` (VARCHAR): Docker container ID
- `host_dir` (VARCHAR): Host directory path
- `port` (INTEGER): Container port
- `auth_token` (VARCHAR): Authentication token
- `created_at` (TIMESTAMP): Creation timestamp
- `last_accessed` (TIMESTAMP): Last access timestamp
- `status` (VARCHAR): Session status (creating, running, stopped, error)
- `metadata` (JSONB): Additional session metadata
Indexes:
- Primary key on `session_id`
- Status index for filtering active sessions
- Last accessed index for cleanup operations
- Created at index for session listing
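For reference, the schema above implies DDL roughly like the following (held here as a Python string; the real migration code may differ in exact types, defaults, and index names):

```python
# Sketch of the sessions-table DDL implied by the schema description above.
# Column names come from this README; everything else is an assumption.
SESSIONS_DDL = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id     VARCHAR PRIMARY KEY,
    container_name VARCHAR,
    container_id   VARCHAR,
    host_dir       VARCHAR,
    port           INTEGER,
    auth_token     VARCHAR,
    created_at     TIMESTAMP DEFAULT now(),
    last_accessed  TIMESTAMP DEFAULT now(),
    status         VARCHAR DEFAULT 'creating',
    metadata       JSONB DEFAULT '{}'
);
CREATE INDEX IF NOT EXISTS idx_sessions_status        ON sessions (status);
CREATE INDEX IF NOT EXISTS idx_sessions_last_accessed ON sessions (last_accessed);
CREATE INDEX IF NOT EXISTS idx_sessions_created_at    ON sessions (created_at);
"""
```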
Testing Database Persistence
# Test database connection and operations
./docker/scripts/test-database-persistence.py
Health Monitoring
The /health endpoint now includes database status:
{
"storage_backend": "database",
"database": {
"status": "healthy",
"total_sessions": 15,
"active_sessions": 8,
"database_size": "25 MB"
}
}
Migration Strategy
From JSON File to Database:
- Backup existing sessions (if any)
- Set environment variables for database connection
- Enable database storage: `USE_DATABASE_STORAGE=true`
- Restart the service; schema creation and migration happen automatically
- Verify data migration in health endpoint
- Monitor performance and adjust connection pool settings
Backward Compatibility:
- JSON file storage remains available for development
- Automatic fallback if database is unavailable
- Zero-downtime migration possible
Container Health Monitoring
Active monitoring of Docker containers with automatic failure detection and recovery mechanisms to prevent stuck sessions and improve system reliability.
Health Monitoring Features
- Periodic Health Checks: Continuous monitoring of running containers every 30 seconds
- Automatic Failure Detection: Identifies unhealthy or failed containers
- Smart Restart Logic: Automatic container restart with configurable limits
- Health History Tracking: Maintains health check history for analysis
- Status Integration: Updates session status based on container health
Configuration
# Health check intervals and timeouts
CONTAINER_HEALTH_CHECK_INTERVAL=30 # Check every 30 seconds
CONTAINER_HEALTH_TIMEOUT=10.0 # Health check timeout
CONTAINER_MAX_RESTART_ATTEMPTS=3 # Max restart attempts
CONTAINER_RESTART_DELAY=5 # Delay between restarts
CONTAINER_FAILURE_THRESHOLD=3 # Failures before restart
Health Status Types
- HEALTHY: Container running normally with optional health checks passing
- UNHEALTHY: Container running but health checks failing
- RESTARTING: Container being restarted due to failures
- FAILED: Container stopped or permanently failed
- UNKNOWN: Unable to determine container status
Testing Health Monitoring
# Test health monitoring functionality
./docker/scripts/test-container-health.py
Health Endpoints
System Health:
GET /health # Includes container health statistics
Detailed Container Health:
GET /health/container # Overall health stats
GET /health/container/{session_id} # Specific session health
Health Response:
{
"container_health": {
"monitoring_active": true,
"check_interval": 30,
"total_sessions_monitored": 5,
"sessions_with_failures": 1,
"session_ses123": {
"total_checks": 10,
"healthy_checks": 8,
"failed_checks": 2,
"average_response_time": 45.2
}
}
}
Recovery Mechanisms
- Health Check Failure: Container marked as unhealthy
- Consecutive Failures: After threshold, automatic restart initiated
- Restart Attempts: Limited to prevent infinite restart loops
- Session Status Update: Session status reflects container health
- Logging & Alerts: Comprehensive logging of health events
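The failure-threshold logic in the steps above could be sketched as follows (threshold constants mirror the configuration section; the class and method names are illustrative, not the service's actual API):

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3      # CONTAINER_FAILURE_THRESHOLD
MAX_RESTART_ATTEMPTS = 3   # CONTAINER_MAX_RESTART_ATTEMPTS

@dataclass
class HealthTracker:
    """Tracks one session's health checks and decides the recovery action."""
    consecutive_failures: int = 0
    restart_attempts: int = 0

    def record(self, healthy: bool) -> str:
        """Record one health check and return the action to take."""
        if healthy:
            self.consecutive_failures = 0
            return "none"
        self.consecutive_failures += 1
        if self.consecutive_failures < FAILURE_THRESHOLD:
            return "none"
        if self.restart_attempts >= MAX_RESTART_ATTEMPTS:
            return "mark_failed"        # give up: prevents infinite restart loops
        self.restart_attempts += 1
        self.consecutive_failures = 0   # a restart resets the failure window
        return "restart"
```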
Integration Benefits
- Proactive Monitoring: Detects issues before users are affected
- Automatic Recovery: Reduces manual intervention requirements
- Improved Reliability: Prevents stuck sessions and system instability
- Operational Visibility: Detailed health metrics and history
- Scalable Architecture: Works with multiple concurrent sessions
Session Authentication
OpenCode servers now require token-based authentication for secure individual user sessions, preventing unauthorized access and ensuring session isolation.
Authentication Features
- Token Generation: Unique cryptographically secure tokens per session
- Automatic Expiry: Configurable token lifetime (default 24 hours)
- Token Rotation: Ability to rotate tokens for enhanced security
- Session Isolation: Each user session has its own authentication credentials
- Proxy Integration: Authentication headers automatically included in proxy requests
Configuration
# Token configuration
export SESSION_TOKEN_LENGTH=32 # Token length in characters
export SESSION_TOKEN_EXPIRY_HOURS=24 # Token validity period
export SESSION_TOKEN_SECRET=auto # Token signing secret (auto-generated)
export TOKEN_CLEANUP_INTERVAL_MINUTES=60 # Expired token cleanup interval
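A hedged sketch of how such tokens might be generated and expired with the standard library (the service's real token module is not shown in this README):

```python
import secrets
from datetime import datetime, timedelta, timezone
from typing import Tuple

TOKEN_LENGTH = 32          # SESSION_TOKEN_LENGTH
TOKEN_EXPIRY_HOURS = 24    # SESSION_TOKEN_EXPIRY_HOURS

def generate_session_token() -> Tuple[str, datetime]:
    """Return a cryptographically secure token and its expiry timestamp."""
    # token_urlsafe(n) yields ~1.3 chars per byte, so trim to the target length.
    token = secrets.token_urlsafe(TOKEN_LENGTH)[:TOKEN_LENGTH]
    expires_at = datetime.now(timezone.utc) + timedelta(hours=TOKEN_EXPIRY_HOURS)
    return token, expires_at

def is_token_expired(expires_at: datetime) -> bool:
    return datetime.now(timezone.utc) >= expires_at
```

Using `secrets` rather than `random` is what makes the tokens suitable for authentication.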
Testing Authentication
# Test authentication functionality
./docker/scripts/test-session-auth.py
# End-to-end authentication testing
./docker/scripts/test-auth-end-to-end.sh
API Endpoints
Authentication Management:
- `GET /sessions/{id}/auth` - Get session authentication info
- `POST /sessions/{id}/auth/rotate` - Rotate session token
- `GET /auth/sessions` - List authenticated sessions
Health Monitoring:
{
"authenticated_sessions": 3,
"status": "healthy"
}
Security Benefits
- Session Isolation: Users cannot access each other's OpenCode servers
- Token Expiry: Automatic cleanup prevents token accumulation
- Secure Generation: Cryptographically secure random tokens
- Proxy Security: Authentication headers prevent unauthorized proxy access
HTTP Connection Pooling
Proxy requests now use a global HTTP connection pool instead of creating new httpx clients for each request, eliminating connection overhead and dramatically improving proxy performance.
Connection Pool Benefits
- Eliminated Connection Overhead: No more client creation/teardown per request
- Connection Reuse: Persistent keep-alive connections reduce latency
- Improved Throughput: Handle significantly more concurrent proxy requests
- Reduced Resource Usage: Lower memory and CPU overhead for HTTP operations
- Better Scalability: Support higher request rates with the same system resources
Pool Configuration
The connection pool is automatically configured with optimized settings:
# Connection pool settings
max_keepalive_connections=20 # Keep connections alive
max_connections=100 # Max total connections
keepalive_expiry=300.0 # 5-minute connection lifetime
connect_timeout=10.0 # Connection establishment timeout
read_timeout=30.0 # Read operation timeout
Performance Testing
# Test HTTP connection pool functionality
./docker/scripts/test-http-connection-pool.py
# Load test proxy performance improvements
./docker/scripts/test-http-pool-load.sh
Health Monitoring
The /health endpoint now includes HTTP connection pool status:
{
"http_connection_pool": {
"status": "healthy",
"config": {
"max_keepalive_connections": 20,
"max_connections": 100,
"keepalive_expiry": 300.0
}
}
}
Async Docker Operations
Docker operations now run asynchronously using aiodocker to eliminate blocking calls in FastAPI's async event loop, significantly improving concurrency and preventing thread pool exhaustion.
Async Benefits
- Non-Blocking Operations: Container creation, management, and cleanup no longer block the event loop
- Improved Concurrency: Handle multiple concurrent user sessions without performance degradation
- Better Scalability: Support higher throughput with the same system resources
- Thread Pool Preservation: Prevent exhaustion of async thread pools
Configuration
# Enable async Docker operations (recommended)
export USE_ASYNC_DOCKER=true
# Or disable for sync mode (legacy)
export USE_ASYNC_DOCKER=false
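When async mode is disabled, blocking docker-py calls still must not stall the event loop; one common pattern is to offload them to a worker thread (a sketch under that assumption, with `run_docker_call` as an illustrative name):

```python
import asyncio
import os

USE_ASYNC_DOCKER = os.getenv("USE_ASYNC_DOCKER", "true").lower() == "true"

async def run_docker_call(blocking_fn, *args, **kwargs):
    """Run a blocking docker-py call without blocking the event loop.

    With USE_ASYNC_DOCKER enabled, the service would instead call aiodocker's
    native coroutines; this fallback offloads the sync call to a thread.
    """
    return await asyncio.to_thread(blocking_fn, *args, **kwargs)
```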
Testing Async Operations
# Test async Docker functionality
./docker/scripts/test-async-docker.py
# Load test concurrent operations
./docker/scripts/test-async-docker-load.sh
Performance Impact
Async operations provide significant performance improvements:
- Concurrent Sessions: Handle 10+ concurrent container operations without blocking
- Response Times: Faster session creation under load
- Resource Efficiency: Better CPU utilization with non-blocking I/O
- Scalability: Support more users per server instance
Resource Limits Enforcement
Container resource limits are now actively enforced to prevent resource exhaustion attacks and ensure fair resource allocation across user sessions.
Configurable Limits
| Environment Variable | Default | Description |
|---|---|---|
| `CONTAINER_MEMORY_LIMIT` | `4g` | Memory limit per container |
| `CONTAINER_CPU_QUOTA` | `100000` | CPU quota (microseconds per period) |
| `CONTAINER_CPU_PERIOD` | `100000` | CPU period (microseconds) |
| `MAX_CONCURRENT_SESSIONS` | `3` | Maximum concurrent user sessions |
| `MEMORY_WARNING_THRESHOLD` | `0.8` | Memory usage warning threshold (80%) |
| `CPU_WARNING_THRESHOLD` | `0.9` | CPU usage warning threshold (90%) |
Resource Protection Features
- Memory Limits: Prevents containers from consuming unlimited RAM
- CPU Quotas: Ensures fair CPU allocation across sessions
- Session Throttling: Blocks new sessions when resources are constrained
- System Monitoring: Continuous resource usage tracking
- Graceful Degradation: Alerts and throttling before system failure
Testing Resource Limits
# Test resource limit configuration and validation
./docker/scripts/test-resource-limits.py
# Load testing with enforcement verification
./docker/scripts/test-resource-limits-load.sh
Health Monitoring
The /health endpoint now includes comprehensive resource information:
{
"resource_limits": {
"memory_limit": "4g",
"cpu_quota": 100000,
"max_concurrent_sessions": 3
},
"system_resources": {
"memory_percent": 0.65,
"cpu_percent": 0.45
},
"resource_alerts": []
}
Resource Alert Levels
- Warning: System resources approaching limits (80% memory, 90% CPU)
- Critical: System resources at dangerous levels (95%+ usage)
- Throttling: New sessions blocked when critical alerts active
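The alert levels above map naturally onto a small classifier (warning thresholds from the configuration table; the 95% critical cutoff and function names are assumptions, and the real monitor would feed in live readings, e.g. from psutil):

```python
MEMORY_WARNING_THRESHOLD = 0.8   # MEMORY_WARNING_THRESHOLD
CPU_WARNING_THRESHOLD = 0.9      # CPU_WARNING_THRESHOLD
CRITICAL_THRESHOLD = 0.95        # assumed cutoff for the "critical" level

def classify_resources(memory_percent: float, cpu_percent: float) -> str:
    """Return the alert level for current usage (fractions in 0.0-1.0)."""
    if memory_percent >= CRITICAL_THRESHOLD or cpu_percent >= CRITICAL_THRESHOLD:
        return "critical"   # new sessions should be throttled
    if memory_percent >= MEMORY_WARNING_THRESHOLD or cpu_percent >= CPU_WARNING_THRESHOLD:
        return "warning"
    return "ok"

def allow_new_session(memory_percent: float, cpu_percent: float) -> bool:
    """Session throttling: block new sessions while a critical alert is active."""
    return classify_resources(memory_percent, cpu_percent) != "critical"
```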
Security Audit Checklist
- [ ] TLS certificates generated with strong encryption
- [ ] Certificate permissions set correctly (400/444)
- [ ] No socket mounting in docker-compose.yml
- [ ] Environment variables properly configured
- [ ] TLS connection tested successfully
- [ ] Host IP detection working correctly
- [ ] Proxy routing functional across environments
- [ ] Resource limits properly configured and enforced
- [ ] Session throttling prevents resource exhaustion
- [ ] System resource monitoring active
- [ ] Certificate rotation process documented
- [ ] Firewall rules restrict Docker API access
- [ ] Docker daemon configured with security options
- [ ] Monitoring and logging enabled for API access