LLM Deployment on Kubernetes with GPU Nodes


Deploying LLM on Kubernetes with GPU

Kubernetes with GPU nodes is the standard for scalable LLM deployments in the enterprise. It provides autoscaling, rolling updates, health checks, and resource isolation. While more complex than bare metal, it offers significantly better manageability and reliability.

Preparing a Kubernetes cluster for GPUs

NVIDIA Device Plugin is a required component:

# Install via Helm
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set gfd.enabled=true
# Time-slicing (sharing one GPU between several small models) is configured
# through a config file/ConfigMap passed to the chart, not inline --set flags
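
A time-slicing config, per the device plugin's README, is a small YAML file that advertises each physical GPU as several schedulable replicas. A sketch (the replica count of 4 is an example; pick it based on how many small models share a card):

```yaml
# dp-config.yaml: advertise every physical GPU as 4 nvidia.com/gpu replicas
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

The file is passed to the chart (e.g. via `--set-file config.map.config=dp-config.yaml`); check the plugin README for the exact option in your chart version. Note that time-slicing gives no memory isolation between the sharing workloads.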

NVIDIA GPU Operator (for managed K8s or when driver management is needed):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace
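
Before deploying anything heavy, confirm that GPUs are actually schedulable. A throwaway test pod sketch (the CUDA image tag is an assumption; any CUDA base image with nvidia-smi works):

```yaml
# gpu-smoke-test.yaml: verifies the device plugin advertises nvidia.com/gpu
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```

`kubectl logs gpu-smoke-test` should print the nvidia-smi table; if the pod stays Pending, the device plugin is not advertising `nvidia.com/gpu` on any node.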

Deployment for vLLM

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama3-8b
  template:
    metadata:
      labels:
        app: vllm-llama3-8b
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      # GPU nodes
      nodeSelector:
        nvidia.com/gpu.product: "A100-SXM4-80GB"

      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.5.0
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model=/models/llama-3-8b-instruct
            - --tensor-parallel-size=1
            - --max-model-len=8192
            - --max-num-seqs=256
            - --gpu-memory-utilization=0.90
            - --port=8000
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "4"
          volumeMounts:
            - name: model-storage
              mountPath: /models
              readOnly: true
            - name: shm
              mountPath: /dev/shm        # for torch/NCCL shared memory
          env:
            - name: NCCL_DEBUG
              value: "WARN"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60     # model loading takes time
            periodSeconds: 10
            failureThreshold: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30

      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
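
The resource numbers above are not arbitrary. A back-of-the-envelope check of the GPU memory budget (Llama-3-8B in fp16: 32 layers, 8 KV heads under GQA, head dim 128; the 80 GB figure matches the A100 in the nodeSelector):

```python
# Rough GPU memory budget for vLLM serving Llama-3-8B on one 80 GB A100
GB = 10**9

gpu_mem = 80 * GB
budget = 0.90 * gpu_mem             # --gpu-memory-utilization=0.90

weights = 8e9 * 2                   # 8B params x 2 bytes (fp16/bf16)

# KV cache per token: K and V, per layer, per KV head, per head dim, fp16
layers, kv_heads, head_dim = 32, 8, 128
kv_per_token = 2 * layers * kv_heads * head_dim * 2   # bytes

kv_budget = budget - weights
max_cached_tokens = int(kv_budget // kv_per_token)

print(f"weights:        {weights / GB:.0f} GB")      # 16 GB
print(f"KV per token:   {kv_per_token / 1024:.0f} KiB")
print(f"token capacity: {max_cached_tokens:,}")
```

Roughly 427k tokens of KV cache fit, i.e. about 52 concurrent sequences at the full `--max-model-len=8192`, which is why `--max-num-seqs=256` is reasonable for typical shorter prompts.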

Service and Ingress

apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
  namespace: ai-serving
spec:
  selector:
    app: vllm-llama3-8b
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: ai-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"  # для streaming
spec:
  ingressClassName: nginx
  rules:
    - host: llm.company.internal
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: vllm-llama3-8b
                port:
                  number: 80
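
With proxy buffering disabled, streamed completions arrive as server-sent events: one `data: {...}` JSON chunk per line, terminated by `data: [DONE]`. A minimal client-side parser sketch (the chunk shape follows the OpenAI streaming format that vLLM emulates; the sample stream below is synthetic):

```python
import json

def parse_sse_stream(lines):
    """Collect delta tokens from an OpenAI-style SSE completion stream."""
    pieces = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue                      # skip keep-alives / blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            pieces.append(delta["content"])
    return "".join(pieces)

# Synthetic stream in the shape /v1/chat/completions returns with stream=true
stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print(parse_sse_stream(stream))  # Hello, world
```

In production the lines would come from an HTTP client reading `http://llm.company.internal/v1/chat/completions` with `stream: true`, which is exactly the traffic the `proxy-buffering: off` annotation exists for.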

HorizontalPodAutoscaler for GPU

A CPU-based HPA is a poor fit for LLM serving: the bottleneck is the GPU, and CPU utilization on a GPU-bound pod rarely reflects real load. Scale on a custom metric such as queue depth instead:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3-8b
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_queue_size   # custom metric from Prometheus
        target:
          type: AverageValue
          averageValue: "10"       # scale up when > 10 requests are queued
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120       # at most one new pod every 2 minutes
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
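
The HPA above assumes `vllm_queue_size` is visible through the custom metrics API, which needs an adapter in front of Prometheus. A prometheus-adapter rule sketch (assumptions: prometheus-adapter is installed, and the queue gauge is vLLM's `vllm:num_requests_waiting`; verify the exact name in your vLLM version's /metrics output):

```yaml
# prometheus-adapter values fragment: expose vLLM's queue gauge to the HPA
# under the pod metric name vllm_queue_size
rules:
  custom:
    - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        as: "vllm_queue_size"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

After the adapter reloads, `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"` should list the metric.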

PersistentVolume for models

# Mounting one model volume on many nodes (ReadOnlyMany here) requires NFS or a CSI driver (e.g., AWS EFS)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage-pvc
  namespace: ai-serving
spec:
  accessModes: [ReadOnlyMany]
  storageClassName: nfs-fast
  resources:
    requests:
      storage: 200Gi
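
The serving pods mount this claim read-only, so the model files have to land on the share beforehand. One option is a one-shot download Job; a sketch, where the image, the model repo, the `hf-token` Secret, and the `model-storage-rw` claim (a ReadWriteMany claim backed by the same NFS volume) are all illustrative:

```yaml
# One-shot Job that populates the shared model volume
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
  namespace: ai-serving
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.11-slim
          command: ["sh", "-c"]
          args:
            - >-
              pip install -q 'huggingface_hub[cli]' &&
              huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct
              --local-dir /models/llama-3-8b-instruct
          env:
            - name: HF_TOKEN            # gated model: access token required
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-storage-rw
```

Once the Job completes, the `--model=/models/llama-3-8b-instruct` path in the Deployment resolves on every node that mounts the share.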

Multi-GPU with tensor parallelism

# For a 70B model: a pod with 4 GPUs
resources:
  limits:
    nvidia.com/gpu: "4"
    memory: "320Gi"
    cpu: "32"
# and add --tensor-parallel-size=4 to args

Important: a Kubernetes pod is always scheduled onto a single node, so all 4 GPUs requested by one pod come from the same physical host and NVLink works without any affinity rules. Pod anti-affinity solves the opposite problem: spreading replicas across hosts so a single node failure doesn't take down every replica:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: vllm-llama3-8b
        topologyKey: kubernetes.io/hostname

Implementation timeframes

Week 1: Installing NVIDIA Device Plugin, test deployment, checking GPU access

Week 2: Setting up PVC for models, Ingress, health checks

Week 3: HPA with custom metrics, monitoring, rolling updates

Month 2: Multi-model deployment, cost optimization, disaster recovery