On-premise deployment of LLM
On-premise deployment runs the LLM on the company's own hardware in its own data center. It gives complete control over data, predictable costs under high load, and compliance with regulatory requirements (e.g., Russia's Federal Law No. 152-FZ on personal data, banking regulations, and rules governing medical data).
Hardware selection
Entry-level servers (7-13B models):
- Dell PowerEdge R750xa with NVIDIA A30 24GB × 4
- HPE ProLiant DL380 Gen10 Plus with A10 24GB × 4
- Cost: $20,000–$40,000
Mid-range servers (70B in BF16, or several 13B models):
- Supermicro SYS-421GE-TNRT with A100 80GB × 4
- NVIDIA DGX A100 (8× A100 80GB, NVLink)
- Cost: $100,000–$400,000
DGX H100 (flagship):
- 8× H100 80GB SXM5, NVLink4
- 640 GB of HBM3 GPU memory in total
- Price: $400,000+
Economical option (testing, low loads):
- Workstation with RTX 4090 24GB × 2-4
- 7B models in BF16, 70B in 4-bit
- Cost: $15,000–30,000
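A quick way to sanity-check these tiers is a back-of-envelope VRAM estimate. A minimal sketch, assuming ~2 bytes per parameter for BF16 and a ~20% overhead factor for KV cache and activations (both the factor and the example sizes are assumptions, not vendor figures):

# Rough VRAM sizing: weights = params x bytes/param, plus ~20% overhead
# for KV cache and activations (assumed factor; tune for your workload)
params_b=70          # model size in billions of parameters
bytes_per_param=2    # BF16 = 2, INT8 = 1, 4-bit ~ 0.5
echo "scale=1; $params_b * $bytes_per_param * 1.2" | bc
# => ~168 GB for a 70B model in BF16, i.e. at least 4x A100 80GB with
#    tensor parallelism, leaving headroom for the KV cache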
Network infrastructure
InfiniBand is a must when tensor parallelism spans multiple nodes: 400 Gb/s NDR (or 200 Gb/s HDR) InfiniBand versus 100 Gb/s Ethernet makes a critical difference for NCCL all-reduce.
NVLink handles GPU-to-GPU traffic inside a server: NVLink 4 provides 900 GB/s of bidirectional bandwidth per GPU and comes standard on the DGX H100.
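Before trusting NCCL to the fabric, it is worth confirming that each adapter actually links at the expected rate. A quick check using the standard infiniband-diags tools (port names and rates will differ per system):

# Check InfiniBand adapter state and negotiated link rate
ibstat | grep -E "State|Rate"        # expect "State: Active" and e.g. "Rate: 400"
ibv_devinfo | grep -E "state|active_mtu"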
Basic configuration of an on-premise LLM cluster
# Verify and configure the NVIDIA environment
nvidia-smi topo -m          # GPU ↔ CPU ↔ NIC topology
nvidia-smi nvlink --status  # NVLink status
# Configure NCCL for InfiniBand
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_DEBUG=INFO
# Check P2P access between GPUs
python3 -c "
import torch
for i in range(torch.cuda.device_count()):
    for j in range(torch.cuda.device_count()):
        if i != j:
            print(f'GPU{i}→GPU{j}: P2P={torch.cuda.can_device_access_peer(i, j)}')
"
Docker Compose for the production stack
# docker-compose.yml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:v0.5.0
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1,2,3
      - CUDA_VISIBLE_DEVICES=0,1,2,3
    # the image's entrypoint already launches the OpenAI API server,
    # so command carries only its arguments
    command: >
      --model /models/llama-3-70b-instruct
      --tensor-parallel-size 4
      --max-model-len 16384
      --max-num-seqs 128
      --gpu-memory-utilization 0.92
      --host 0.0.0.0
      --port 8000
    volumes:
      - /data/models:/models:ro
      - /dev/shm:/dev/shm
    shm_size: 32gb
    restart: unless-stopped
    ports:
      - "127.0.0.1:8000:8000"
  nginx:
    image: nginx:alpine
    ports: ["443:443", "80:80"]
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
    depends_on: [vllm]
    restart: unless-stopped
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports: ["9400:9400"]
    cap_add: [SYS_ADMIN]
volumes:
  prometheus_data:
  grafana_data:
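The compose file mounts ./nginx.conf and ./prometheus.yml, which are not shown above. A minimal sketch of both, written via heredocs (server_name, certificate paths, and job names are placeholders for your environment; vLLM exposes Prometheus metrics on its API port at /metrics):

# Minimal TLS-terminating reverse proxy in front of vLLM (sketch)
cat > nginx.conf <<'EOF'
events {}
http {
  server {
    listen 443 ssl;
    server_name llm.internal.example;               # placeholder
    ssl_certificate     /etc/nginx/ssl/server.crt;
    ssl_certificate_key /etc/nginx/ssl/server.key;
    ssl_protocols       TLSv1.3;
    location / {
      proxy_pass http://vllm:8000;                  # compose service name
      proxy_read_timeout 300s;                      # long generations
    }
  }
}
EOF

# Minimal Prometheus scrape config for the stack (sketch)
cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: vllm            # vLLM metrics at /metrics on the API port
    static_configs: [{ targets: ['vllm:8000'] }]
  - job_name: dcgm            # GPU metrics from dcgm-exporter
    static_configs: [{ targets: ['dcgm-exporter:9400'] }]
EOF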
On-premise deployment security
Network isolation: the LLM server sits on a separate VLAN and is reachable only through the API gateway. Outbound internet access is allowed only through a proxy, and only for pulling updates.
Encryption: TLS 1.3 for all API calls, LUKS encryption of the model disk, and encrypted traffic between GPU servers in multi-node mode.
Authentication: API keys or OAuth via a corporate IdP (backed by LDAP/Active Directory), plus an audit log of all requests.
Physical security: BIOS passwords, disabled USB ports, and monitored physical access to the racks.
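For the API-key option, vLLM's OpenAI-compatible server can enforce a key natively via its --api-key flag. A sketch (the key value and hostname are placeholders; keep the real key in a secrets manager):

# Enforce an API key at the vLLM layer (in addition to gateway auth):
# add to the api_server arguments in docker-compose.yml:
#   --api-key "$VLLM_API_KEY"
# clients must then send the key as a Bearer token
# (assumes the proxy's certificate is trusted by the client):
curl -s https://llm.internal.example/v1/models \
  -H "Authorization: Bearer $VLLM_API_KEY"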
Backup and DR
# Back up configuration (not the models — they are too large)
rsync -av /etc/docker/ backup-server:/backups/docker-configs/
rsync -av /opt/llm-stack/ backup-server:/backups/llm-stack/
# Models live on a RAID-backed NAS
# Record model checksums so integrity can be verified after a restore
sha256sum /data/models/llama-3-70b/*.safetensors > model_checksums.txt
sha256sum -c model_checksums.txt   # verify
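To make the backup repeatable, the same rsync job can be scheduled from cron (paths and host mirror the commands above; the 02:00 schedule is an example):

# Nightly config backup at 02:00 (example cron entry)
( crontab -l 2>/dev/null; \
  echo '0 2 * * * rsync -av /opt/llm-stack/ backup-server:/backups/llm-stack/' ) | crontab -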
TCO analysis vs. cloud
As a rough guide: for workloads above ~1M tokens/day, on-premise hardware pays for itself in 12–18 months compared to renting cloud GPUs. Below ~100K tokens/day, cloud is cheaper because the hardware would sit idle most of the time.
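A back-of-envelope version of that break-even point, with every price an illustrative assumption (hardware cost, operating overhead, and the cloud GPU rate are not quotes):

# Break-even sketch: on-prem CAPEX vs. renting the same GPUs in the cloud
capex=250000        # assumed server + network + installation, USD
opex_month=3000     # assumed power, cooling, admin share per month, USD
cloud_month=20000   # assumed 8x A100 on-demand (~$3.4/GPU-hr), USD/month
echo "scale=1; $capex / ($cloud_month - $opex_month)" | bc
# => ~14.7 months to break even, consistent with the 12-18 month estimate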