---
title: "Lesson 18 | Production Deployment: VPS to GPU Clusters"
summary: "Industrial-grade deployment solutions: from cloud server selection and cluster orchestration to enterprise intranet deployment and high-availability strategy design."
sortOrder: 180
status: "published"
---

Lesson 18 | Production Deployment: VPS to GPU Clusters
Subtitle: Industrial-grade deployment solutions, covering cloud server selection, cluster orchestration, enterprise intranet deployment, and the design of high-availability and disaster recovery strategies.
Welcome to the 18th lesson in the Hermes Agent tutorial series. In previous lessons, we explored the Hermes Agent's core functionality in depth, from model configuration and skill extension to the memory system. Now it's time to push our agent from the development environment into the wider world of production. This lesson focuses on the industrial-grade deployment of the Hermes Agent, ensuring it runs stably and efficiently in high-concurrency, high-availability production environments.
Learning Objectives
Upon completing this lesson, you will be able to:
- Evaluate and select suitable production deployment environments: From simple VPS to complex GPU clusters, understand the pros, cons, and applicable scenarios for different solutions.
- Master containerized deployment: Use Docker and Docker Compose to package the Hermes Agent and its dependencies, achieving environmental consistency and rapid deployment.
- Implement Kubernetes cluster orchestration: Deploy the Hermes Agent to a Kubernetes (K8s) cluster to achieve auto-scaling, self-healing, and rolling updates.
- Address the challenges of enterprise intranet deployment: Understand the special considerations for deploying an Agent in an isolated network, such as proxy configuration and private dependency management.
- Design and implement High Availability (HA) strategies: Ensure the robustness of the Agent service, eliminate single points of failure, and create a basic disaster recovery plan.
Core Concepts Explained
Before diving into the hands-on section, we must understand several key DevOps and MLOps concepts. These are the cornerstones of building stable, scalable production systems.
1. Choosing a Deployment Environment
- VPS (Virtual Private Server): The most basic cloud server. It provides an isolated virtual environment with limited resources.
- Pros: Low cost, simple configuration, suitable for personal projects, prototype validation, or low-traffic applications.
- Cons: Single Point of Failure (SPOF)—if the server goes down, the service is interrupted; scaling usually requires manual intervention and may cause downtime; limited resources (CPU/RAM/GPU).
- Dedicated GPU Server: A physical server or a high-performance cloud instance equipped with powerful GPUs.
- Pros: Provides ultimate performance for local large model inference, with no virtualization overhead.
- Cons: High cost, complex maintenance, and potentially low resource utilization (you pay even when it's idle).
- Cloud GPU Instances: Such as AWS EC2 P/G series, Google Cloud N1/A2 series, Azure NC series.
- Pros: Pay-as-you-go, elastic scaling, and deep integration with the cloud ecosystem (storage, networking, databases).
- Cons: Configuration and management have a learning curve; long-term running costs may be higher than a dedicated server.
- Kubernetes (K8s) Cluster: The industry standard for container orchestration. It combines a group of physical or virtual machines into a unified resource pool to automatically deploy, scale, and manage containerized applications.
- Pros: High availability, auto-scaling, self-healing, rolling updates, declarative configuration. It is the top choice for building large-scale, resilient microservice applications.
- Cons: Steep learning curve; the cost of setting up and maintaining the cluster itself is relatively high.
2. Containerization & Orchestration
- Docker: An open-source application container engine. It allows developers to package an application and all its dependencies (libraries, runtimes, system tools) into a lightweight, portable container.
- Core Value: "Build once, run anywhere." It ensures consistency from development to testing to production, completely solving the classic "it works on my machine" problem.
- Kubernetes (K8s): Originating from Google's Borg system, K8s is responsible for automating the lifecycle management of hundreds or thousands of containers.
- Core Components:
  - `Pod`: The smallest deployable unit in K8s, usually containing one or more tightly coupled containers.
  - `Deployment`: Defines the desired state of Pods (e.g., number of replicas, image version) and is responsible for maintaining that state.
  - `Service`: Provides a stable network endpoint (IP address and DNS name) for a set of Pods, enabling service discovery and load balancing.
  - `PersistentVolume` (PV) and `PersistentVolumeClaim` (PVC): Used to manage persistent storage, ensuring data is not lost when a Pod restarts or migrates, which is crucial for the Hermes Agent's Memory system.
  - `ConfigMap` & `Secret`: Used to decouple configuration and sensitive information (like API Keys) from the application image.
3. High Availability (HA)
The core goal of HA is to eliminate single points of failure, ensuring the system continues to provide service even when some components fail. Common strategies include:
- Redundancy: Running multiple instances (replicas) of an application, distributed across different physical nodes or availability zones.
- Load Balancing: Distributing traffic across all healthy instances to prevent any single instance from being overloaded.
- Health Checks: Periodically checking the health status of application instances and automatically removing or restarting problematic ones.
- Data Backup & Recovery: Regularly backing up persistent data (like user profiles, memory database) and having a recovery process in place.
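These strategies are easiest to see working together. The toy sketch below is plain Python, not part of Hermes Agent: a `healthy` flag stands in for a real HTTP health probe, and a round-robin balancer routes traffic around an instance that fails its check.

```python
import itertools

class Instance:
    """A stand-in for one Agent replica; `healthy` would normally
    be the result of an HTTP probe against the instance."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

class RoundRobinBalancer:
    """Distributes requests across healthy instances only, mirroring
    what a real load balancer does after health checks remove a node."""
    def __init__(self, instances):
        self.instances = instances
        self._cycle = itertools.cycle(range(len(instances)))

    def pick(self):
        # Try each slot at most once per call; skip unhealthy replicas.
        for _ in range(len(self.instances)):
            inst = self.instances[next(self._cycle)]
            if inst.healthy:
                return inst
        raise RuntimeError("no healthy instances available")

replicas = [Instance("agent-1"), Instance("agent-2"), Instance("agent-3")]
lb = RoundRobinBalancer(replicas)

# Normal operation: traffic rotates across all three replicas.
print([lb.pick().name for _ in range(3)])  # ['agent-1', 'agent-2', 'agent-3']

# agent-2 fails its health check; the balancer routes around it.
replicas[1].healthy = False
assert all(lb.pick().name != "agent-2" for _ in range(6))
```

Kubernetes gives you exactly this behavior for free: a failing readiness probe removes a Pod from the Service's endpoint list until it recovers.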
💻 Hands-on Demo
We will proceed step-by-step, starting with the simplest VPS deployment and gradually upgrading to a complex K8s cluster deployment.
Step 1: Basic Deployment - Running on a VPS with Docker and Docker Compose
This is the quickest and most direct way to productionize, suitable for personal projects or small-scale internal use.
Prerequisites:
- A VPS with Docker and Docker Compose installed.
- The Hermes Agent project code.
1. Create the Dockerfile
In the root directory of your Hermes Agent project, create a Dockerfile to build the application image.
```dockerfile
# Dockerfile
# --- Build stage: install dependencies ---
FROM python:3.11-slim AS builder

# Set the working directory
WORKDIR /app

# Install Poetry (Python dependency management tool)
RUN pip install poetry

# Copy the dependency definition files
COPY poetry.lock pyproject.toml ./

# Install dependencies directly into the system site-packages.
# --no-root: do not install the project itself, only its dependencies.
# --without dev: skip development dependencies to reduce the image size
# (this replaces the deprecated --no-dev flag).
RUN poetry config virtualenvs.create false && \
    poetry install --no-root --without dev --no-interaction --no-ansi

# --- Final image ---
FROM python:3.11-slim
WORKDIR /app

# Copy the installed dependencies from the builder stage
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin

# Copy the application code
COPY . .

# Expose the Agent service's port (e.g., 8000)
EXPOSE 8000

# Command to start the application
CMD ["python", "main.py"]
```
- Note: We are using a multi-stage build. The `builder` stage is responsible for installing all dependencies; the final production image only copies the necessary dependencies and application code from the `builder` stage. This significantly reduces the final image size and improves security.
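A useful companion to the Dockerfile is a `.dockerignore` file, which keeps the build context (and therefore the `COPY . .` layer) small. The entries below are typical for a Python project; the `data/`, `tests/`, and `docs/` directory names are assumptions, so adjust them to your actual layout:

```
# .dockerignore — keep the build context and the final image small
.git
.venv
__pycache__/
*.pyc
.env        # never bake secrets into the image
data/
tests/
docs/
```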
2. Create docker-compose.yml
To manage the Hermes Agent and its potential dependencies (like Redis as a Memory backend), we use Docker Compose.
```yaml
# docker-compose.yml
version: '3.8'

services:
  hermes-agent:
    build: .
    container_name: hermes_agent_prod
    restart: always
    ports:
      - "8000:8000"  # Map port 8000 of the host to port 8000 of the container
    volumes:
      - ./config.prod.yaml:/app/config.yaml  # Mount the production configuration file
      - agent_data:/app/data                 # Mount a data volume for persistence
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}  # Read sensitive information from environment variables
      - MEMORY_BACKEND=redis
      - REDIS_HOST=hermes_redis
    depends_on:
      - hermes_redis

  hermes_redis:
    image: redis:7-alpine
    container_name: hermes_redis_prod
    restart: always
    volumes:
      - redis_data:/data

volumes:
  agent_data:
  redis_data:
```
- Note:
  - `restart: always` ensures the container restarts automatically after an unexpected exit or server reboot.
  - `volumes` are used for data persistence: `config.prod.yaml` mounts the production config file, while `agent_data` and `redis_data` are named volumes that save the Agent's working data and Redis data.
  - `environment` passes configuration, especially sensitive values like `OPENAI_API_KEY`. It's recommended to supply these via environment variables or a `.env` file rather than hardcoding them in the configuration file.
3. Deploy
On your VPS, execute the following commands:
```bash
# Create a .env file and add your API Key
echo "OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx" > .env

# Build and start the services (in detached mode)
docker-compose up --build -d

# Check the running status
docker-compose ps

# View the logs
docker-compose logs -f hermes-agent
```
At this point, your Hermes Agent is running stably on the VPS. However, it is still a single-point service.
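Even before moving to K8s, the Compose setup can detect a hung Agent. The fragment below is a sketch: it assumes the Agent exposes a `/health` endpoint on port 8000 and that `curl` is installed in the image (slim Python images may need it added). It extends the `hermes-agent` service with a Docker healthcheck:

```yaml
services:
  hermes-agent:
    # ...existing configuration from above...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s      # probe every 30 seconds
      timeout: 5s        # fail the probe if no answer within 5 seconds
      retries: 3         # mark unhealthy after 3 consecutive failures
      start_period: 20s  # grace period while the Agent starts up
```

The health status then appears in `docker ps`, and dependent services can wait on it via `depends_on` with `condition: service_healthy`. Note that `restart: always` by itself does not restart a running-but-unhealthy container.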
Step 2: Advanced Deployment - Migrating to a Kubernetes (K8s) Cluster
Now, we will deploy the Agent to K8s to gain high availability and scalability.
Prerequisites:
- An available K8s cluster (e.g., GKE, EKS, AKS provided by cloud vendors, or a self-hosted cluster).
- The `kubectl` command-line tool, configured and connected to your cluster.
- A container image registry (e.g., Docker Hub, GCR, ECR) to store the built Hermes Agent image.
1. Build and Push the Image
```bash
# Log in to your image registry
docker login your-registry.io

# Build the image and tag it
docker build -t your-registry.io/hermes-agent:v1.0.0 .

# Push the image
docker push your-registry.io/hermes-agent:v1.0.0
```
2. Write K8s Manifests (YAML files)
We will create several YAML files to define our deployment.
a. hermes-configmap.yaml - Configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hermes-agent-config
data:
  config.yaml: |
    # Your non-sensitive configuration goes here
    server:
      host: "0.0.0.0"
      port: 8000
    memory:
      backend: "redis"
      redis_host: "hermes-redis-service"  # Use the K8s Service name
```
b. hermes-secret.yaml - Sensitive Information
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hermes-agent-secret
type: Opaque
stringData:
  openai-api-key: "sk-xxxxxxxxxxxxxxxxxxxx"  # In actual production, use a more secure way to manage secrets
```
c. hermes-redis-deployment.yaml - Redis Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hermes-redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hermes-redis
  template:
    metadata:
      labels:
        app: hermes-redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: hermes-redis-service
spec:
  selector:
    app: hermes-redis
  ports:
    - protocol: TCP
      port: 6379
      targetPort: 6379
```
d. hermes-agent-deployment.yaml - Core Hermes Agent Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hermes-agent-deployment
spec:
  replicas: 3  # <-- The starting point for high availability: run 3 replicas
  selector:
    matchLabels:
      app: hermes-agent
  template:
    metadata:
      labels:
        app: hermes-agent
    spec:
      containers:
        - name: hermes-agent
          image: your-registry.io/hermes-agent:v1.0.0  # Use your own image
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: hermes-agent-secret
                  key: openai-api-key
          volumeMounts:
            - name: config-volume
              mountPath: /app/config.yaml
              subPath: config.yaml
          # --- Health checks ---
          livenessProbe:
            httpGet:
              path: /health  # Assuming your application has a /health endpoint
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: config-volume
          configMap:
            name: hermes-agent-config
```
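The probes above assume the Agent actually serves `/health` and `/ready`. The Hermes Agent's web framework isn't shown in this lesson, so here is a minimal stdlib sketch of the contract the probes rely on: respond 200 when alive/ready, 503 otherwise.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = threading.Event()  # set once startup work (model load, etc.) finishes

class ProbeHandler(BaseHTTPRequestHandler):
    """Serves the two probe endpoints the K8s manifest points at."""

    def do_GET(self):
        if self.path == "/health":
            self._respond(200)  # liveness: the process can answer at all
        elif self.path == "/ready":
            # readiness: only accept traffic once initialization is done
            self._respond(200 if READY.is_set() else 503)
        else:
            self._respond(404)

    def _respond(self, code):
        self.send_response(code)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging for the demo
        pass

# Port 0 lets the OS pick a free port for this demo; the real Agent
# would bind the container port (8000) from its config.
server = HTTPServer(("127.0.0.1", 0), ProbeHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

READY.set()  # pretend initialization has completed
status = urllib.request.urlopen(f"http://127.0.0.1:{port}/health").getcode()
print(status)  # 200
server.shutdown()
```

The kubelet treats any 2xx/3xx response as success. A failing readiness probe removes the Pod from the Service's endpoints without restarting it; a failing liveness probe triggers a container restart.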
e. hermes-agent-service.yaml - Exposing the Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: hermes-agent-service
spec:
  type: LoadBalancer  # <-- The cloud provider will automatically create a load balancer
  selector:
    app: hermes-agent
  ports:
    - protocol: TCP
      port: 80          # The port for external access
      targetPort: 8000  # The port exposed by the container
```
3. Deploy to the Cluster
```bash
# Apply all configurations sequentially
kubectl apply -f hermes-configmap.yaml
kubectl apply -f hermes-secret.yaml
kubectl apply -f hermes-redis-deployment.yaml
kubectl apply -f hermes-agent-deployment.yaml
kubectl apply -f hermes-agent-service.yaml

# Check the deployment status
kubectl get deployments
kubectl get pods -o wide
kubectl get service hermes-agent-service  # Get the external IP
```
Now, your Hermes Agent is running in a high-availability mode on the K8s cluster. K8s will automatically handle Pod scheduling, failure recovery, and traffic load balancing.
Step 3: Enterprise-Level Considerations - Intranet Deployment and High Availability Strategies
1. Enterprise Intranet Deployment
When deploying in an enterprise environment without public internet access, special handling is required:
- Network Proxy: If you need to access external APIs (like OpenAI) through an HTTP/HTTPS proxy, set the proxy environment variables in the `env` section of your `Deployment`:

  ```yaml
  env:
    - name: HTTP_PROXY
      value: "http://your-proxy.com:port"
    - name: HTTPS_PROXY
      value: "http://your-proxy.com:port"
  ```

- Private Image Registry: Enterprises often use private registries like Harbor, Nexus, or those provided by cloud vendors. You'll need to create an `imagePullSecrets` entry first and then reference it in your `Deployment` under `spec.template.spec` so that K8s has permission to pull the image.
- Local Models and Dependencies: If the Agent uses a local large model, mount the model files into all Pods using a `PersistentVolume`. For Python dependencies, you can set up a private PyPI mirror (like `devpi`) to manage them.
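One practical note on the proxy variables: most HTTP clients pick them up from the environment automatically, so no code change is needed in the Agent itself. A quick stdlib sketch (the proxy host and port are placeholders, not a real proxy):

```python
import os
import urllib.request

# Simulate the proxy variables that the Deployment's env section injects.
os.environ.pop("http_proxy", None)   # clear any lowercase variants first;
os.environ.pop("https_proxy", None)  # lowercase takes precedence in the lookup
os.environ["HTTP_PROXY"] = "http://your-proxy.com:3128"
os.environ["HTTPS_PROXY"] = "http://your-proxy.com:3128"

# Most Python HTTP clients (urllib, requests, httpx) honor these variables;
# urllib exposes the lookup directly:
proxies = urllib.request.getproxies()
print(proxies["http"])  # http://your-proxy.com:3128
```

If a client ignores the environment, pass the proxy explicitly through its own configuration instead.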
2. Enhancing High Availability (HA) Strategies
- Horizontal Pod Autoscaler (HPA): Automatically increase or decrease the number of Pods based on CPU or memory utilization.
  ```bash
  # Create an HPA that auto-scales up to 10 Pods when CPU utilization exceeds 80%
  kubectl autoscale deployment hermes-agent-deployment --cpu-percent=80 --min=3 --max=10
  ```

- Multi-AZ Deployment: At the K8s cluster level, ensure your worker nodes are distributed across multiple Availability Zones of your cloud provider. This prevents a single data center failure from taking down your entire service.
- Data Backup and Disaster Recovery (DR):
- Database Backup: Set up regular snapshot backups for your Redis or other persistent storage.
- Configuration Backup: Keep all your K8s YAML files under Git version control.
- Tools: Use tools like Velero to back up the entire state of your K8s cluster, including PV snapshots, enabling one-click, cluster-level disaster recovery.
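As a small illustration of the backup side, here is a sketch of a snapshot-retention helper. The `dump-*.rdb` naming is a hypothetical convention for timestamped Redis RDB snapshots, not something Hermes or Redis mandates:

```python
import pathlib
import tempfile

def prune_snapshots(backup_dir, keep=7):
    """Keep only the `keep` most recent snapshot files (sorted by name,
    assuming sortable timestamped names like dump-2024-05-01.rdb) and
    delete the rest. Returns the deleted file names."""
    snapshots = sorted(pathlib.Path(backup_dir).glob("dump-*.rdb"))
    stale = snapshots[:-keep] if keep else snapshots
    for f in stale:
        f.unlink()
    return [f.name for f in stale]

# Demo with throwaway files standing in for Redis RDB snapshots.
with tempfile.TemporaryDirectory() as d:
    for day in range(1, 11):
        (pathlib.Path(d) / f"dump-2024-05-{day:02d}.rdb").touch()
    deleted = prune_snapshots(d, keep=7)
    print(deleted)  # the 3 oldest: dump-2024-05-01..03
```

In a real setup this would run from a cron job alongside the snapshot step, with surviving files shipped to object storage.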
Commands Used
Docker & Docker Compose:
- `docker-compose up --build -d`: Build images and start all services in the background.
- `docker-compose ps`: Check the status of currently running services.
- `docker-compose logs -f <service_name>`: Tail the logs of a service in real time.
- `docker build -t <tag> .`: Build a Docker image.
- `docker push <tag>`: Push an image to a registry.
Kubernetes (kubectl):
- `kubectl apply -f <filename.yaml>`: Apply or update a resource configuration.
- `kubectl get pods`: List the status of all Pods.
- `kubectl get deployments`: List the status of Deployments.
- `kubectl get services`: List the status of Services and their external IPs.
- `kubectl describe pod <pod_name>`: Show a Pod's details and events for troubleshooting.
- `kubectl logs -f <pod_name>`: Tail the logs of a Pod in real time.
- `kubectl scale deployment <name> --replicas=<num>`: Manually scale a Deployment's replica count.
- `kubectl autoscale deployment ...`: Create a Horizontal Pod Autoscaler.
Key Takeaways
- Deployment is an evolutionary process: Start with a simple VPS, and as business complexity and traffic grow, gradually migrate to Docker Compose, and finally adopt Kubernetes for industrial-grade deployment.
- Containerization is the cornerstone of modern deployment: Docker ensures environmental consistency and delivery portability.
- Kubernetes provides ultimate elasticity and reliability: Through mechanisms like `Deployment`, `Service`, and probes, K8s achieves application self-healing, auto-scaling, and high availability.
- Separate configuration from code: Always use `ConfigMap`s, `Secret`s, or environment variables to manage configuration. Never hardcode sensitive information in your code or image.
- High availability is a systems engineering challenge: It involves not just running multiple replicas but also health checks, load balancing, multi-zone deployment, and a comprehensive data backup and recovery strategy.
- Enterprise intranet deployment has its own unique challenges: It requires a focus on networking, security, and private dependency management.
Transforming the Hermes Agent from a local script into an intelligent agent that provides 24/7 uninterrupted service globally is an exciting engineering challenge. We hope this lesson paves the way for you on this journey from development to production.