docker related

2026-01-18 23:29:04 +01:00
parent 2f5464e1d2
commit 7a9b4b751e
30 changed files with 6004 additions and 1 deletions

docker/README.md
# Docker TLS Security Setup
This directory contains scripts and configuration for securing Docker API access with TLS authentication, replacing the insecure socket mounting approach.
## Overview
Previously, the session-manager service mounted the Docker socket (`/var/run/docker.sock`) directly into containers, granting full root access to the host Docker daemon. This is a critical security vulnerability.
This setup replaces socket mounting with authenticated TLS API access over the network.
## Security Benefits
- **No socket mounting**: Eliminates privilege escalation risk
- **Mutual TLS authentication**: Both client and server authenticate
- **Encrypted communication**: All API calls are encrypted
- **Certificate-based access**: Granular access control
- **Network isolation**: API access is network-bound, not filesystem-bound
## Docker Service Abstraction
The session-manager now uses a clean `DockerService` abstraction layer that separates Docker operations from business logic, enabling better testing, maintainability, and future Docker client changes.
### Architecture Benefits
- 🧪 **Testability**: MockDockerService enables testing without Docker daemon
- 🔧 **Maintainability**: Clean separation of concerns
- 🔄 **Flexibility**: Easy to swap Docker client implementations
- 📦 **Dependency Injection**: SessionManager receives DockerService via constructor
- **Performance**: Both async and sync Docker operations supported
### Service Interface
```python
class DockerService:
    async def create_container(self, name: str, image: str, **kwargs) -> ContainerInfo
    async def start_container(self, container_id: str) -> None
    async def stop_container(self, container_id: str, timeout: int = 10) -> None
    async def remove_container(self, container_id: str, force: bool = False) -> None
    async def get_container_info(self, container_id: str) -> Optional[ContainerInfo]
    async def list_containers(self, all: bool = False) -> List[ContainerInfo]
    async def ping(self) -> bool
```
### Testing
Run the comprehensive test suite:
```bash
# Test Docker service abstraction
./docker/scripts/test-docker-service.py
# Results: 7/7 tests passed ✅
# - Service Interface ✅
# - Error Handling ✅
# - Async vs Sync Modes ✅
# - Container Info Operations ✅
# - Context Management ✅
# - Integration Patterns ✅
# - Performance and Scaling ✅
```
### Usage in SessionManager
```python
# Dependency injection pattern
session_manager = SessionManager(docker_service=DockerService(use_async=True))
# Or with mock for testing
test_manager = SessionManager(docker_service=MockDockerService())
```
## Files Structure
```
docker/
├── certs/                      # Generated TLS certificates (not in git)
├── scripts/
│   ├── generate-certs.sh       # Certificate generation script
│   ├── setup-docker-tls.sh     # Docker daemon TLS configuration
│   └── test-tls-connection.py  # Connection testing script
├── daemon.json                 # Docker daemon TLS configuration
└── .env.example                # Environment configuration template
```
## Quick Start
### 1. Generate TLS Certificates
```bash
# Generate certificates for development
DOCKER_ENV=development ./docker/scripts/generate-certs.sh
# Or for production with custom settings
DOCKER_ENV=production \
DOCKER_HOST_IP=your-server-ip \
DOCKER_HOST_NAME=your-docker-host \
./docker/scripts/generate-certs.sh
```
### 2. Configure Docker Daemon
**For local development (Docker Desktop):**
```bash
# Certificates are automatically mounted in docker-compose.yml
docker-compose up -d
```
**For production/server setup:**
```bash
# Configure system Docker daemon with TLS
sudo ./docker/scripts/setup-docker-tls.sh
```
### 3. Configure Environment
```bash
# Copy and customize environment file
cp docker/.env.example .env
# Edit .env with your settings
# DOCKER_HOST_IP=host.docker.internal # for Docker Desktop
# DOCKER_HOST_IP=your-server-ip # for production
```
### 4. Test Configuration
```bash
# Test TLS connection
./docker/scripts/test-tls-connection.py
# Start services
docker-compose --env-file .env up -d session-manager
# Check logs
docker-compose logs session-manager
```
## Configuration Options
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `DOCKER_TLS_VERIFY` | `1` | Enable TLS verification |
| `DOCKER_CERT_PATH` | `./docker/certs` | Certificate directory path |
| `DOCKER_HOST` | `tcp://host.docker.internal:2376` | Docker daemon endpoint |
| `DOCKER_TLS_PORT` | `2376` | TLS port for Docker API |
| `DOCKER_CA_CERT` | `./docker/certs/ca.pem` | CA certificate path |
| `DOCKER_CLIENT_CERT` | `./docker/certs/client-cert.pem` | Client certificate path |
| `DOCKER_CLIENT_KEY` | `./docker/certs/client-key.pem` | Client key path |
| `DOCKER_HOST_IP` | `host.docker.internal` | Docker host IP |
### Certificate Generation Options
| Variable | Default | Description |
|----------|---------|-------------|
| `DOCKER_ENV` | `development` | Environment name for certificates |
| `DOCKER_HOST_IP` | `127.0.0.1` | IP address for server certificate |
| `DOCKER_HOST_NAME` | `localhost` | Hostname for server certificate |
| `DAYS` | `3650` | Certificate validity in days |
## Production Deployment
### Certificate Management
1. **Generate certificates on a secure machine**
2. **Distribute to servers securely** (SCP, Ansible, etc.)
3. **Set proper permissions**:
```bash
chmod 444 /etc/docker/certs/*.pem # certs readable by all
chmod 400 /etc/docker/certs/*-key.pem # keys readable by root only
```
4. **Rotate certificates regularly** (every 6-12 months)
5. **Revoke compromised certificates** and regenerate
### Docker Daemon Configuration
For production servers, use the `setup-docker-tls.sh` script or manually configure `/etc/docker/daemon.json`:
```json
{
"tls": true,
"tlsverify": true,
"tlscacert": "/etc/docker/certs/ca.pem",
"tlscert": "/etc/docker/certs/server-cert.pem",
"tlskey": "/etc/docker/certs/server-key.pem",
"hosts": ["tcp://0.0.0.0:2376"],
"iptables": false,
"bridge": "none",
"live-restore": true,
"userland-proxy": false,
"no-new-privileges": true
}
```
### Security Hardening
- **Firewall**: Only allow TLS port (2376) from trusted networks
- **TLS 1.3**: Ensure modern TLS version support
- **Certificate pinning**: Consider certificate pinning in client code
- **Monitoring**: Log and monitor Docker API access
- **Rate limiting**: Implement API rate limiting
## Troubleshooting
### Common Issues
**"Connection refused"**
- Check if Docker daemon is running with TLS
- Verify `DOCKER_HOST` points to correct endpoint
- Ensure firewall allows port 2376
**"TLS handshake failed"**
- Verify certificates exist and have correct permissions
- Check certificate validity dates
- Ensure CA certificate is correct
**"Permission denied"**
- Check certificate file permissions (444 for certs, 400 for keys)
- Ensure client certificate is signed by the CA
### Debug Commands
```bash
# Test TLS connection manually
docker --tlsverify \
--tlscacert=./docker/certs/ca.pem \
--tlscert=./docker/certs/client-cert.pem \
--tlskey=./docker/certs/client-key.pem \
-H tcp://host.docker.internal:2376 \
version
# Check certificate validity
openssl x509 -in ./docker/certs/server-cert.pem -text -noout
# Test from container
docker-compose exec session-manager ./docker/scripts/test-tls-connection.py
```
## Migration from Socket Mounting
### Before (Insecure)
```yaml
volumes:
- /var/run/docker.sock:/var/run/docker.sock
```
### After (Secure)
```yaml
volumes:
- ./docker/certs:/etc/docker/certs:ro
environment:
- DOCKER_TLS_VERIFY=1
- DOCKER_HOST=tcp://host.docker.internal:2376
```
### Code Changes Required
Update Docker client initialization:
```python
# Before
self.docker_client = docker.from_env()

# After
import os
import docker

tls_config = docker.tls.TLSConfig(
    ca_cert=os.getenv('DOCKER_CA_CERT'),
    client_cert=(os.getenv('DOCKER_CLIENT_CERT'), os.getenv('DOCKER_CLIENT_KEY')),
    verify=True
)
self.docker_client = docker.DockerClient(
    base_url=os.getenv('DOCKER_HOST'),
    tls=tls_config
)
```
## Dynamic Host IP Detection
The session-manager service now includes robust host IP detection to support proxy routing across different Docker environments:
### Supported Environments
- **Docker Desktop (Mac/Windows)**: Uses `host.docker.internal` resolution
- **Linux Docker**: Reads gateway from `/proc/net/route`
- **Cloud environments**: Respects `DOCKER_HOST_GATEWAY` and `GATEWAY` environment variables
- **Custom networks**: Tests connectivity to common Docker gateway IPs
### Detection Methods (in priority order)
1. **Docker Internal**: Resolves `host.docker.internal` (Docker Desktop)
2. **Environment Variables**: Checks `HOST_IP`, `DOCKER_HOST_GATEWAY`, `GATEWAY`
3. **Route Table**: Parses `/proc/net/route` for default gateway
4. **Network Connection**: Tests connectivity to determine local routing
5. **Common Gateways**: Falls back to known Docker bridge IPs
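A condensed sketch of this fallback chain (step 4's connectivity probe is omitted, and the function name is illustrative, not the service's actual code):

```python
import os
import socket
import struct

def detect_docker_host_ip() -> str:
    # 1. Docker Desktop: host.docker.internal resolves inside containers
    try:
        return socket.gethostbyname("host.docker.internal")
    except socket.gaierror:
        pass
    # 2. Explicit overrides via environment variables
    for var in ("HOST_IP", "DOCKER_HOST_GATEWAY", "GATEWAY"):
        value = os.getenv(var)
        if value:
            return value
    # 3. Default gateway from /proc/net/route (Linux; gateway is little-endian hex)
    try:
        with open("/proc/net/route") as f:
            for line in f.readlines()[1:]:
                fields = line.split()
                if fields[1] == "00000000":  # destination 0.0.0.0 = default route
                    return socket.inet_ntoa(struct.pack("<L", int(fields[2], 16)))
    except (FileNotFoundError, IndexError, ValueError):
        pass
    # 5. Last resort: the conventional Docker bridge gateway
    return "172.17.0.1"
```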
### Configuration
The detection is automatic and cached for 5 minutes. Override with:
```bash
# Force specific host IP
export HOST_IP=192.168.1.100
# Or in docker-compose.yml
environment:
- HOST_IP=your-host-ip
```
### Testing
```bash
# Test host IP detection
./docker/scripts/test-host-ip-detection.py
# Run integration test
./docker/scripts/test-integration.sh
```
### Troubleshooting
**"Could not detect Docker host IP"**
- Check network configuration: `docker network inspect bridge`
- Verify environment variables
- Test connectivity: `ping host.docker.internal`
- Set explicit `HOST_IP` if needed
**Proxy routing fails**
- Verify detected IP is accessible from containers
- Check firewall rules blocking container-to-host traffic
- Ensure Docker network allows communication
## Structured Logging
Comprehensive logging infrastructure with structured JSON logs, request tracking, and production-ready log management for debugging and monitoring.
### Log Features
- **Structured JSON Logs**: Machine-readable logs for production analysis
- **Request ID Tracking**: Trace requests across distributed operations
- **Human-Readable Development**: Clear logs for local development
- **Performance Metrics**: Built-in request timing and performance tracking
- **Security Event Logging**: Audit trail for security-related events
- **Log Rotation**: Automatic log rotation with size limits
### Configuration
```bash
# Log level and format
export LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
export LOG_FORMAT=auto # json, human, auto (detects environment)
# File logging
export LOG_FILE=/var/log/lovdata-chat.log
export LOG_MAX_SIZE_MB=10 # Max log file size
export LOG_BACKUP_COUNT=5 # Number of backup files
# Output control
export LOG_CONSOLE=true # Enable console logging
export LOG_FILE_ENABLED=true # Enable file logging
```
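For illustration, a stripped-down formatter producing the JSON shape shown under Log Analysis might look like this (the service's real formatter is more complete; class and function names here are assumptions):

```python
import json
import logging
import os

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context fields attached via logging's `extra=` mechanism
        for field in ("request_id", "session_id", "operation", "duration_ms"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

def configure_logging() -> logging.Logger:
    # LOG_FORMAT=auto / human-readable handling omitted for brevity
    logger = logging.getLogger("session_manager.main")
    logger.setLevel(os.getenv("LOG_LEVEL", "INFO"))
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```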
### Testing Structured Logging
```bash
# Test logging functionality and formatters
./docker/scripts/test-structured-logging.py
```
### Log Analysis
**JSON Format (Production):**
```json
{
"timestamp": "2024-01-15T10:30:45.123Z",
"level": "INFO",
"logger": "session_manager.main",
"message": "Session created successfully",
"request_id": "req-abc123",
"session_id": "ses-xyz789",
"operation": "create_session",
"duration_ms": 245.67
}
```
**Human-Readable Format (Development):**
```
2024-01-15 10:30:45 [INFO ] session_manager.main:create_session:145 [req-abc123] - Session created successfully
```
### Request Tracing
All logs include request IDs for tracing operations across the system:
```python
with RequestContext():
log_session_operation(session_id, "created")
# All subsequent logs in this context include request_id
```
## Database Persistence
Session data is now stored in PostgreSQL for reliability, multi-instance deployment support, and elimination of JSON file corruption vulnerabilities.
### Database Configuration
```bash
# PostgreSQL connection settings
export DB_HOST=localhost # Database host
export DB_PORT=5432 # Database port
export DB_USER=lovdata # Database user
export DB_PASSWORD=password # Database password
export DB_NAME=lovdata_chat # Database name
# Connection pool settings
export DB_MIN_CONNECTIONS=5 # Minimum pool connections
export DB_MAX_CONNECTIONS=20 # Maximum pool connections
export DB_MAX_QUERIES=50000 # Max queries per connection
export DB_MAX_INACTIVE_LIFETIME=300.0 # Connection timeout
```
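These settings map onto asyncpg-style pool parameters; a small helper (names are assumptions, not the service's actual code) could collect them:

```python
import os

def db_pool_settings() -> dict:
    """Translate the environment variables above into pool keyword
    arguments (parameter names follow asyncpg.create_pool)."""
    return {
        "host": os.getenv("DB_HOST", "localhost"),
        "port": int(os.getenv("DB_PORT", "5432")),
        "user": os.getenv("DB_USER", "lovdata"),
        "password": os.getenv("DB_PASSWORD", "password"),
        "database": os.getenv("DB_NAME", "lovdata_chat"),
        "min_size": int(os.getenv("DB_MIN_CONNECTIONS", "5")),
        "max_size": int(os.getenv("DB_MAX_CONNECTIONS", "20")),
        "max_queries": int(os.getenv("DB_MAX_QUERIES", "50000")),
        "max_inactive_connection_lifetime": float(
            os.getenv("DB_MAX_INACTIVE_LIFETIME", "300.0")
        ),
    }
```

A pool can then be created at startup with `await asyncpg.create_pool(**db_pool_settings())`.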
### Storage Backend Selection
```bash
# Enable database storage (recommended for production)
export USE_DATABASE_STORAGE=true
# Or use JSON file storage (legacy/development)
export USE_DATABASE_STORAGE=false
```
### Database Schema
**Sessions Table:**
- `session_id` (VARCHAR, Primary Key): Unique session identifier
- `container_name` (VARCHAR): Docker container name
- `container_id` (VARCHAR): Docker container ID
- `host_dir` (VARCHAR): Host directory path
- `port` (INTEGER): Container port
- `auth_token` (VARCHAR): Authentication token
- `created_at` (TIMESTAMP): Creation timestamp
- `last_accessed` (TIMESTAMP): Last access timestamp
- `status` (VARCHAR): Session status (creating, running, stopped, error)
- `metadata` (JSONB): Additional session metadata
**Indexes:**
- Primary key on `session_id`
- Status index for filtering active sessions
- Last accessed index for cleanup operations
- Created at index for session listing
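Expressed as DDL, the schema and indexes above correspond roughly to the following (exact types and index names in the service may differ):

```python
# Sketch of the sessions schema described above, kept as a DDL constant
SESSIONS_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id     VARCHAR PRIMARY KEY,
    container_name VARCHAR,
    container_id   VARCHAR,
    host_dir       VARCHAR,
    port           INTEGER,
    auth_token     VARCHAR,
    created_at     TIMESTAMP,
    last_accessed  TIMESTAMP,
    status         VARCHAR,
    metadata       JSONB
);
CREATE INDEX IF NOT EXISTS idx_sessions_status        ON sessions (status);
CREATE INDEX IF NOT EXISTS idx_sessions_last_accessed ON sessions (last_accessed);
CREATE INDEX IF NOT EXISTS idx_sessions_created_at    ON sessions (created_at);
"""
```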
### Testing Database Persistence
```bash
# Test database connection and operations
./docker/scripts/test-database-persistence.py
```
### Health Monitoring
The `/health` endpoint now includes database status:
```json
{
"storage_backend": "database",
"database": {
"status": "healthy",
"total_sessions": 15,
"active_sessions": 8,
"database_size": "25 MB"
}
}
```
### Migration Strategy
**From JSON File to Database:**
1. **Backup existing sessions** (if any)
2. **Set environment variables** for database connection
3. **Enable database storage**: `USE_DATABASE_STORAGE=true`
4. **Restart service** - automatic schema creation and migration
5. **Verify data migration** in health endpoint
6. **Monitor performance** and adjust connection pool settings
**Backward Compatibility:**
- JSON file storage remains available for development
- Automatic fallback if database is unavailable
- Zero-downtime migration possible
## Container Health Monitoring
Active monitoring of Docker containers with automatic failure detection and recovery mechanisms to prevent stuck sessions and improve system reliability.
### Health Monitoring Features
- **Periodic Health Checks**: Continuous monitoring of running containers every 30 seconds
- **Automatic Failure Detection**: Identifies unhealthy or failed containers
- **Smart Restart Logic**: Automatic container restart with configurable limits
- **Health History Tracking**: Maintains health check history for analysis
- **Status Integration**: Updates session status based on container health
### Configuration
```bash
# Health check intervals and timeouts
CONTAINER_HEALTH_CHECK_INTERVAL=30 # Check every 30 seconds
CONTAINER_HEALTH_TIMEOUT=10.0 # Health check timeout
CONTAINER_MAX_RESTART_ATTEMPTS=3 # Max restart attempts
CONTAINER_RESTART_DELAY=5 # Delay between restarts
CONTAINER_FAILURE_THRESHOLD=3 # Failures before restart
```
### Health Status Types
- **HEALTHY**: Container running normally with optional health checks passing
- **UNHEALTHY**: Container running but health checks failing
- **RESTARTING**: Container being restarted due to failures
- **FAILED**: Container stopped or permanently failed
- **UNKNOWN**: Unable to determine container status
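The restart decision and check loop can be sketched as follows, against the `DockerService` interface from earlier (a sketch only; `session` fields and function names are illustrative):

```python
import asyncio

def should_restart(consecutive_failures: int, restart_attempts: int,
                   failure_threshold: int = 3, max_attempts: int = 3) -> bool:
    """Restart only after the failure threshold, and never past the attempt cap."""
    return consecutive_failures >= failure_threshold and restart_attempts < max_attempts

async def monitor_container(docker_service, session, interval: int = 30) -> None:
    """Periodic health-check cycle for one session's container."""
    failures, restarts = 0, 0
    while session.status == "running":
        info = await docker_service.get_container_info(session.container_id)
        # A healthy check resets the consecutive-failure counter
        failures = 0 if info and info.status == "running" else failures + 1
        if should_restart(failures, restarts):
            restarts += 1
            failures = 0
            await docker_service.start_container(session.container_id)
        await asyncio.sleep(interval)
```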
### Testing Health Monitoring
```bash
# Test health monitoring functionality
./docker/scripts/test-container-health.py
```
### Health Endpoints
**System Health:**
```bash
GET /health # Includes container health statistics
```
**Detailed Container Health:**
```bash
GET /health/container # Overall health stats
GET /health/container/{session_id} # Specific session health
```
**Health Response:**
```json
{
"container_health": {
"monitoring_active": true,
"check_interval": 30,
"total_sessions_monitored": 5,
"sessions_with_failures": 1,
"session_ses123": {
"total_checks": 10,
"healthy_checks": 8,
"failed_checks": 2,
"average_response_time": 45.2
}
}
}
```
### Recovery Mechanisms
1. **Health Check Failure**: Container marked as unhealthy
2. **Consecutive Failures**: After threshold, automatic restart initiated
3. **Restart Attempts**: Limited to prevent infinite restart loops
4. **Session Status Update**: Session status reflects container health
5. **Logging & Alerts**: Comprehensive logging of health events
### Integration Benefits
- **Proactive Monitoring**: Detects issues before users are affected
- **Automatic Recovery**: Reduces manual intervention requirements
- **Improved Reliability**: Prevents stuck sessions and system instability
- **Operational Visibility**: Detailed health metrics and history
- **Scalable Architecture**: Works with multiple concurrent sessions
## Session Authentication
OpenCode servers now require token-based authentication for secure individual user sessions, preventing unauthorized access and ensuring session isolation.
### Authentication Features
- **Token Generation**: Unique cryptographically secure tokens per session
- **Automatic Expiry**: Configurable token lifetime (default 24 hours)
- **Token Rotation**: Ability to rotate tokens for enhanced security
- **Session Isolation**: Each user session has its own authentication credentials
- **Proxy Integration**: Authentication headers automatically included in proxy requests
### Configuration
```bash
# Token configuration
export SESSION_TOKEN_LENGTH=32 # Token length in characters
export SESSION_TOKEN_EXPIRY_HOURS=24 # Token validity period
export SESSION_TOKEN_SECRET=auto # Token signing secret (auto-generated)
export TOKEN_CLEANUP_INTERVAL_MINUTES=60 # Expired token cleanup interval
```
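Token generation itself only needs the standard library; a sketch honoring `SESSION_TOKEN_LENGTH` and `SESSION_TOKEN_EXPIRY_HOURS` (the helper name is an assumption):

```python
import os
import secrets
from datetime import datetime, timedelta, timezone

def generate_session_token() -> tuple[str, datetime]:
    """Create a cryptographically secure token and its expiry time."""
    length = int(os.getenv("SESSION_TOKEN_LENGTH", "32"))
    expiry_hours = int(os.getenv("SESSION_TOKEN_EXPIRY_HOURS", "24"))
    # token_urlsafe(n) yields ~1.3 characters per byte; trim to the target length
    token = secrets.token_urlsafe(length)[:length]
    expires_at = datetime.now(timezone.utc) + timedelta(hours=expiry_hours)
    return token, expires_at
```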
### Testing Authentication
```bash
# Test authentication functionality
./docker/scripts/test-session-auth.py
# End-to-end authentication testing
./docker/scripts/test-auth-end-to-end.sh
```
### API Endpoints
**Authentication Management:**
- `GET /sessions/{id}/auth` - Get session authentication info
- `POST /sessions/{id}/auth/rotate` - Rotate session token
- `GET /auth/sessions` - List authenticated sessions
**Health Monitoring:**
```json
{
"authenticated_sessions": 3,
"status": "healthy"
}
```
### Security Benefits
- **Session Isolation**: Users cannot access each other's OpenCode servers
- **Token Expiry**: Automatic cleanup prevents token accumulation
- **Secure Generation**: Cryptographically secure random tokens
- **Proxy Security**: Authentication headers prevent unauthorized proxy access
## HTTP Connection Pooling
Proxy requests now use a global HTTP connection pool instead of creating new httpx clients for each request, eliminating connection overhead and dramatically improving proxy performance.
### Connection Pool Benefits
- **Eliminated Connection Overhead**: No more client creation/teardown per request
- **Connection Reuse**: Persistent keep-alive connections reduce latency
- **Improved Throughput**: Handle significantly more concurrent proxy requests
- **Reduced Resource Usage**: Lower memory and CPU overhead for HTTP operations
- **Better Scalability**: Support higher request rates with the same system resources
### Pool Configuration
The connection pool is automatically configured with optimized settings:
```python
# Connection pool settings
max_keepalive_connections=20 # Keep connections alive
max_connections=100 # Max total connections
keepalive_expiry=300.0 # 5-minute connection lifetime
connect_timeout=10.0 # Connection establishment timeout
read_timeout=30.0 # Read operation timeout
```
### Performance Testing
```bash
# Test HTTP connection pool functionality
./docker/scripts/test-http-connection-pool.py
# Load test proxy performance improvements
./docker/scripts/test-http-pool-load.sh
```
### Health Monitoring
The `/health` endpoint now includes HTTP connection pool status:
```json
{
"http_connection_pool": {
"status": "healthy",
"config": {
"max_keepalive_connections": 20,
"max_connections": 100,
"keepalive_expiry": 300.0
}
}
}
```
## Async Docker Operations
Docker operations now run asynchronously using aiodocker to eliminate blocking calls in FastAPI's async event loop, significantly improving concurrency and preventing thread pool exhaustion.
### Async Benefits
- **Non-Blocking Operations**: Container creation, management, and cleanup no longer block the event loop
- **Improved Concurrency**: Handle multiple concurrent user sessions without performance degradation
- **Better Scalability**: Support higher throughput with the same system resources
- **Thread Pool Preservation**: Prevent exhaustion of async thread pools
### Configuration
```bash
# Enable async Docker operations (recommended)
export USE_ASYNC_DOCKER=true
# Or disable for sync mode (legacy)
export USE_ASYNC_DOCKER=false
```
### Testing Async Operations
```bash
# Test async Docker functionality
./docker/scripts/test-async-docker.py
# Load test concurrent operations
./docker/scripts/test-async-docker-load.sh
```
### Performance Impact
Async operations provide significant performance improvements:
- **Concurrent Sessions**: Handle 10+ concurrent container operations without blocking
- **Response Times**: Faster session creation under load
- **Resource Efficiency**: Better CPU utilization with non-blocking I/O
- **Scalability**: Support more users per server instance
## Resource Limits Enforcement
Container resource limits are now actively enforced to prevent resource exhaustion attacks and ensure fair resource allocation across user sessions.
### Configurable Limits
| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `CONTAINER_MEMORY_LIMIT` | `4g` | Memory limit per container |
| `CONTAINER_CPU_QUOTA` | `100000` | CPU quota (microseconds per period) |
| `CONTAINER_CPU_PERIOD` | `100000` | CPU period (microseconds) |
| `MAX_CONCURRENT_SESSIONS` | `3` | Maximum concurrent user sessions |
| `MEMORY_WARNING_THRESHOLD` | `0.8` | Memory usage warning threshold (80%) |
| `CPU_WARNING_THRESHOLD` | `0.9` | CPU usage warning threshold (90%) |
### Resource Protection Features
- **Memory Limits**: Prevents containers from consuming unlimited RAM
- **CPU Quotas**: Ensures fair CPU allocation across sessions
- **Session Throttling**: Blocks new sessions when resources are constrained
- **System Monitoring**: Continuous resource usage tracking
- **Graceful Degradation**: Alerts and throttling before system failure
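Session throttling reduces to an admission check against the configured caps and thresholds (a sketch; the function and parameter names are assumptions):

```python
import os

def can_create_session(active_sessions: int,
                       memory_percent: float,
                       cpu_percent: float) -> bool:
    """Refuse new sessions when the session cap or a warning threshold is hit."""
    if active_sessions >= int(os.getenv("MAX_CONCURRENT_SESSIONS", "3")):
        return False
    if memory_percent >= float(os.getenv("MEMORY_WARNING_THRESHOLD", "0.8")):
        return False
    if cpu_percent >= float(os.getenv("CPU_WARNING_THRESHOLD", "0.9")):
        return False
    return True
```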
### Testing Resource Limits
```bash
# Test resource limit configuration and validation
./docker/scripts/test-resource-limits.py
# Load testing with enforcement verification
./docker/scripts/test-resource-limits-load.sh
```
### Health Monitoring
The `/health` endpoint now includes comprehensive resource information:
```json
{
"resource_limits": {
"memory_limit": "4g",
"cpu_quota": 100000,
"max_concurrent_sessions": 3
},
"system_resources": {
"memory_percent": 0.65,
"cpu_percent": 0.45
},
"resource_alerts": []
}
```
### Resource Alert Levels
- **Warning**: System resources approaching limits (80% memory, 90% CPU)
- **Critical**: System resources at dangerous levels (95%+ usage)
- **Throttling**: New sessions blocked when critical alerts active
## Security Audit Checklist
- [ ] TLS certificates generated with strong encryption
- [ ] Certificate permissions set correctly (400/444)
- [ ] No socket mounting in docker-compose.yml
- [ ] Environment variables properly configured
- [ ] TLS connection tested successfully
- [ ] Host IP detection working correctly
- [ ] Proxy routing functional across environments
- [ ] Resource limits properly configured and enforced
- [ ] Session throttling prevents resource exhaustion
- [ ] System resource monitoring active
- [ ] Certificate rotation process documented
- [ ] Firewall rules restrict Docker API access
- [ ] Docker daemon configured with security options
- [ ] Monitoring and logging enabled for API access