Deploying Language Models on Azure Kubernetes Service (AKS): A Complete Beginner's Guide
A Detailed Step-by-Step Implementation Guide
Introduction
This comprehensive guide explains how to deploy Large Language Models (LLMs) on Azure Kubernetes Service using the vLLM serving engine. Each step is broken down with detailed explanations of why it's necessary and how it contributes to the overall deployment.
Table of Contents
Prerequisites
Infrastructure Setup
Model Deployment Configuration
Deployment Process
Testing and Validation
Production Considerations
Maintenance Procedures
Troubleshooting Guide
Monitoring Setup
Additional Resources and References
Prerequisites
Azure Infrastructure Requirements
Azure Subscription
What: An active Azure subscription with billing enabled
Why: Required for creating and managing Azure resources
How to verify:
az account show
Azure CLI
What: Command-line tool for managing Azure resources
Why: Enables automated resource creation and management
Installation:
# For Ubuntu/Debian
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# For macOS
brew install azure-cli
Kubernetes Tools
What: kubectl and related tools
Why: Required for interacting with Kubernetes clusters
Installation:
# Install kubectl
az aks install-cli
Required Permissions
Azure Permissions
Contributor role or higher on subscription/resource group
Network Contributor for virtual network configuration
Why: Enables creation and management of all required resources
Hugging Face Account (for gated models)
Account with approved access to gated models
Access token with read permissions
Why: Required for downloading and using gated models like Llama
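If you plan to serve a gated model, the access token is typically handed to the serving container through a Kubernetes Secret. A minimal sketch (the secret name hf-token and key token are illustrative, not required names):
# Create a secret holding the Hugging Face access token
kubectl create secret generic hf-token \
--from-literal=token=<your-hugging-face-access-token>
The deployment can then expose the value to the container as the HF_TOKEN environment variable, which the Hugging Face libraries read when downloading weights.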
Infrastructure Setup
1. Environment Preparation
# Set environment variables
export MY_RESOURCE_GROUP_NAME="llm-deployment-rg"
export MY_AKS_CLUSTER_NAME="llm-cluster"
export LOCATION="eastus"
Why these variables?
Resource group name: Logical container for related resources
Cluster name: Unique identifier for your AKS cluster
Location: Determines data center location (choose based on latency requirements)
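Before committing to a region, it helps to confirm that the GPU VM size used later in this guide is actually offered there; a quick check with the Azure CLI (assuming the Standard_NC4as_T4_v3 size from the GPU node pool step):
# List available regions (optional)
az account list-locations -o table
# Check whether the T4 GPU size is offered in the chosen region
az vm list-skus --location $LOCATION --size Standard_NC4as_T4_v3 -o table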
2. Resource Group Creation
az group create \
--name $MY_RESOURCE_GROUP_NAME \
--location $LOCATION
Purpose:
Creates a logical container for all deployment resources
Enables easier resource management and billing tracking
Allows for bulk operations and access control
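A quick check that the group was created as expected:
az group show --name $MY_RESOURCE_GROUP_NAME -o table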
3. AKS Cluster Creation
az aks create \
--resource-group $MY_RESOURCE_GROUP_NAME \
--name $MY_AKS_CLUSTER_NAME \
--node-count 1 \
--generate-ssh-keys \
--network-plugin azure \
--network-policy azure
Key Configuration Explained:
node-count: Initial number of nodes (start small, scale as needed)
generate-ssh-keys: Automatic SSH key generation for node access
network-plugin: Azure CNI for advanced networking features
network-policy: Enables network policy enforcement
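Once the cluster exists, fetch its credentials so that kubectl can talk to it; the rest of this guide assumes this step has been done:
# Merge cluster credentials into your kubeconfig
az aks get-credentials \
--resource-group $MY_RESOURCE_GROUP_NAME \
--name $MY_AKS_CLUSTER_NAME
# Confirm connectivity
kubectl get nodes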
4. Node Pool Configuration
System Node Pool
az aks nodepool add \
--resource-group $MY_RESOURCE_GROUP_NAME \
--cluster-name $MY_AKS_CLUSTER_NAME \
--name system \
--mode System \
--node-count 3 \
--node-vm-size Standard_D2s_v3
Why these specifications?
mode: System: Marks this as a dedicated system node pool
node-count: 3: Provides high availability for system components
Standard_D2s_v3: Balanced CPU/memory for system services
A dedicated pool for system components ensures stability
GPU Node Pool
az aks nodepool add \
--resource-group $MY_RESOURCE_GROUP_NAME \
--cluster-name $MY_AKS_CLUSTER_NAME \
--name gpunp \
--node-count 1 \
--node-vm-size Standard_NC4as_T4_v3 \
--node-taints sku=gpu:NoSchedule \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
Configuration Details:
Standard_NC4as_T4_v3: T4 GPU for optimal LLM inference
node-taints: Ensures only GPU workloads run on these expensive nodes
enable-cluster-autoscaler: Automatic scaling based on demand
min-count/max-count: Scaling boundaries for cost control
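After both pools are created, confirm that they show up and that the GPU taint is in place:
# List the node pools
az aks nodepool list --resource-group $MY_RESOURCE_GROUP_NAME --cluster-name $MY_AKS_CLUSTER_NAME -o table
# Check the taints on the cluster nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints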
5. NVIDIA Device Plugin Installation
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
Why each component matters:
DaemonSet: Ensures the plugin runs on all GPU nodes
tolerations: Allows running on GPU-tainted nodes
priorityClassName: Ensures the plugin isn't evicted
securityContext: Implements security best practices
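Save the manifest (for example as nvidia-device-plugin.yaml, a filename chosen here for illustration), apply it, and confirm the GPU is advertised to Kubernetes:
# Apply the device plugin manifest
kubectl apply -f nvidia-device-plugin.yaml
# Check that the plugin pod is running
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Confirm the GPU node now reports an allocatable GPU
kubectl describe nodes | grep -i "nvidia.com/gpu"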
Model Deployment Configuration
1. Persistent Volume Setup
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: default
Purpose of each setting:
50Gi storage: Accommodates model weights and cache
ReadWriteOnce: Single-node access for data consistency
default storage class: Uses Azure managed disks
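If you are unsure which storage classes your cluster offers, list them before applying the claim:
kubectl get storageclass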
2. Service Configuration
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
  namespace: default
spec:
  ports:
  - name: http-mistral-7b
    port: 80
    targetPort: 8000
  selector:
    app: mistral-7b
  type: LoadBalancer
Key components explained:
LoadBalancer type: Provides external access
Port mapping: Routes external port 80 to container port 8000
Selector: Links the service to the matching deployment pods
3. Deployment Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      # Tolerate the taint applied to the GPU node pool
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - name: mistral-7b
        image: vllm/vllm-openai:latest
        # Arguments are passed to the vLLM OpenAI-compatible server
        args: ["--model", "mistralai/Mistral-7B-Instruct-v0.1"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 20G
          requests:
            nvidia.com/gpu: 1
            memory: 6G
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
Configuration details:
replicas: 1: Single instance per GPU
Resource limits: Prevents memory issues
Volume mounts: Persists the model cache in the PersistentVolumeClaim
Tolerations: Allow the pod to schedule onto the tainted GPU nodes
Health probes: Verify container health (a sketch follows below)
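The manifest above omits probes. A minimal sketch, assuming the vLLM OpenAI-compatible server exposes its /health endpoint on port 8000 (add under the container spec shown earlier); the long initial delay allows time for the model weights to download and load on first start:
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 240
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10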
Deployment Process
1. Basic Deployment
# Create namespace
kubectl create namespace llm-serving
# Apply configurations
kubectl apply -f volume.yaml
kubectl apply -f service.yaml
kubectl apply -f deployment.yaml
2. Verify Deployment
# Check pod status
kubectl get pods -n llm-serving
kubectl describe pod <pod-name>
# Verify service
kubectl get service mistral-7b
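The first start can take several minutes while vLLM downloads the model weights; following the logs makes it easy to watch progress and catch errors early:
# Stream the serving container's logs
kubectl logs -f deployment/mistral-7b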
Testing and Validation
1. API Testing
# Get external IP
export SERVICE_IP=$(kubectl get service mistral-7b -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Test API
curl --location "http://$SERVICE_IP/v1/completions" \
--header 'Content-Type: application/json' \
--data '{
"model": "mistralai/Mistral-7B-Instruct-v0.1",
"prompt": "Test prompt",
"max_tokens": 50
}'
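A quick sanity check before sending prompts is to ask the server which model it is serving; the OpenAI-compatible API exposes this at /v1/models:
curl "http://$SERVICE_IP/v1/models"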
2. Performance Testing
# Monitor GPU usage
kubectl exec -it <pod-name> -- nvidia-smi
# Check response times
time curl -X POST "http://$SERVICE_IP/v1/completions" ...
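For a rough look at behaviour under concurrent load, a simple shell loop is often enough before reaching for a dedicated benchmarking tool (the prompt and request count here are arbitrary):
# Send 5 concurrent requests and report per-request latency
for i in $(seq 1 5); do
curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
--location "http://$SERVICE_IP/v1/completions" \
--header 'Content-Type: application/json' \
--data '{"model": "mistralai/Mistral-7B-Instruct-v0.1", "prompt": "Test prompt", "max_tokens": 50}' &
done
wait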
Production Considerations
Security Implementation
Network Security
# Create network policy
kubectl apply -f network-policy.yaml
# Enable Azure DDoS protection (create a protection plan)
az network ddos-protection create ...
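The network-policy.yaml referenced above is not shown in this guide. A minimal sketch of what it might contain, restricting ingress to the model pods to the serving port only (the labels and port match the manifests above; everything else is an assumption):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mistral-7b-ingress
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: mistral-7b
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - protocol: TCP
      port: 8000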
API Security
Implement authentication (see the sketch after this list)
Set up rate limiting
Enable monitoring
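One option for the authentication item above: recent vLLM OpenAI-server images accept an --api-key argument, so the key can be supplied from a Kubernetes Secret and required on every request. A sketch, assuming a secret named vllm-api-key with key api-key already exists (this replaces the args line in the deployment shown earlier):
        env:
        - name: VLLM_API_KEY_VALUE
          valueFrom:
            secretKeyRef:
              name: vllm-api-key
              key: api-key
        args: ["--model", "mistralai/Mistral-7B-Instruct-v0.1", "--api-key", "$(VLLM_API_KEY_VALUE)"]
Clients then pass the key in a standard Authorization: Bearer header.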
Cost Optimization
Resource Monitoring
# Monitor costs (requires the costmanagement CLI extension)
az costmanagement query ...
# Scale based on usage
kubectl scale deployment mistral-7b --replicas=0
Cost Reduction Strategies
Use spot instances for non-critical workloads (see the sketch after this list)
Implement automatic scaling
Monitor and optimize resource usage
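For the spot-instance strategy above, a separate GPU pool can be added with Spot priority. A sketch (the pool name gpuspot is illustrative; spot nodes can be evicted at any time, so keep latency-sensitive serving on the on-demand pool):
az aks nodepool add \
--resource-group $MY_RESOURCE_GROUP_NAME \
--cluster-name $MY_AKS_CLUSTER_NAME \
--name gpuspot \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--node-vm-size Standard_NC4as_T4_v3 \
--node-count 1 \
--node-taints sku=gpu:NoSchedule \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3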
Maintenance Procedures
1. Regular Updates
# Update deployment
kubectl set image deployment/mistral-7b mistral-7b=vllm/vllm-openai:new-version
# Verify update
kubectl rollout status deployment/mistral-7b
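If an update misbehaves, the previous image can be restored with a rollout undo:
# Roll back to the previous revision
kubectl rollout undo deployment/mistral-7b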
2. Backup and Recovery
# Backup persistent volumes
velero backup create llm-backup
# Restore if needed
velero restore create --from-backup llm-backup
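Assuming Velero is already installed in the cluster with the Azure plugin configured, inspect a backup before relying on it:
# Check backup status and contents
velero backup describe llm-backup
velero backup logs llm-backup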
Troubleshooting Guide
Common Issues and Solutions
GPU Not Detected
Verify NVIDIA plugin installation
Check node labels and taints
Validate GPU driver installation
Memory Issues
Adjust resource limits
Monitor memory usage
Check for memory leaks
Network Issues
Verify network policy configuration
Check service endpoint availability
Validate load balancer configuration
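A few commands that help narrow down the issues above:
# GPU not detected: confirm the device plugin is running and the GPU is allocatable
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
kubectl describe nodes | grep -i "nvidia.com/gpu"
# Memory issues: check live usage against requests and limits
kubectl top pod
# Network issues: confirm the service has endpoints and an external IP
kubectl get endpoints mistral-7b
kubectl describe service mistral-7b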
Monitoring Setup
1. Metrics Collection
# Add the chart repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Prometheus
helm install prometheus prometheus-community/prometheus
# Configure Grafana
helm install grafana grafana/grafana
2. Log Management
# Enable log analytics
az monitor log-analytics workspace create ...
# Configure container insights
az aks enable-addons -a monitoring ...
Additional Resources and References
Documentation
Azure Kubernetes Service documentation
vLLM documentation
Hugging Face Hub documentation
Community Support
Azure Kubernetes Service GitHub
vLLM Discord community
Hugging Face forums