GKE AI/ML Inference

This reference covers deploying AI/ML inference workloads on GKE using GKE Inference Quickstart (GIQ), along with best practices for LLM serving.

MCP Tools: apply_k8s_manifest, get_k8s_resource, get_k8s_logs, get_k8s_rollout_status, describe_k8s_resource, list_k8s_events. CLI-only: gcloud container ai profiles *

When to Use

Deploy an AI model (Llama, Gemma, Mistral, etc.) to GKE
Generate optimized Kubernetes manifests for inference
Select GPU/TPU accelerators for model serving
Configure autoscaling for LLM inference

Prerequisites

A golden path GKE Autopilot cluster (GPU workloads are supported via ComputeClasses and node auto-provisioning, NAP)
gcloud CLI authenticated
Sufficient GPU/TPU quota in the target region

Workflow

1. Discovery: Find Models and Hardware

# List all supported models
gcloud container ai profiles models list --quiet

# Find valid accelerator/server combinations for a model
gcloud container ai profiles list --model=<MODEL_NAME> --quiet

# Example: what can run Gemma 2 9B?
gcloud container ai profiles list --model=gemma-2-9b-it --quiet

2. Generate Manifest

gcloud container ai profiles manifests create \
  --model=<MODEL_NAME> \
  --model-server=<SERVER> \
  --accelerator-type=<ACCELERATOR> \
  --target-ntpot-milliseconds=<NTPOT> --quiet > inference.yaml

Parameters:

--model: Model ID as returned by the models list command (e.g., gemma-2-9b-it, llama-3-8b)
--model-server: Inference server (vllm, tgi, triton, tensorrt-llm)
--accelerator-type: GPU/TPU type (nvidia-l4, nvidia-tesla-a100, nvidia-h100-80gb)
--target-ntpot-milliseconds: Target Normalized Time Per Output Token (optional, for latency optimization)

Example:

gcloud container ai profiles manifests create \
  --model=gemma-2-9b-it \
  --model-server=vllm \
  --accelerator-type=nvidia-l4 \
  --target-ntpot-milliseconds=50 --quiet > inference.yaml
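
To get a feel for what a target NTPOT implies, the arithmetic below converts the 50 ms target from the example into per-request token throughput and an approximate end-to-end latency. It assumes NTPOT is total request latency divided by the number of output tokens; the 500-token response length is an illustrative value, not something taken from a generated profile.

```shell
# Illustrative values: 50 ms matches the example above; 500 output tokens is an assumption.
NTPOT_MS=50
OUTPUT_TOKENS=500

# Tokens per second seen by a single request: 1000 ms / NTPOT.
TOKENS_PER_SEC=$(( 1000 / NTPOT_MS ))

# Approximate end-to-end latency for one response: output tokens * NTPOT.
REQUEST_LATENCY_MS=$(( OUTPUT_TOKENS * NTPOT_MS ))

echo "per-request throughput: ${TOKENS_PER_SEC} tokens/s"
echo "approx. latency for ${OUTPUT_TOKENS} tokens: ${REQUEST_LATENCY_MS} ms"
```

With these inputs a single request streams about 20 tokens per second and a 500-token answer takes roughly 25 seconds, which is a quick way to check whether a candidate target is tight enough for an interactive workload.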

3. Review and Deploy

# Review for placeholders (HF tokens, PVCs)
cat inference.yaml

# Deploy
kubectl apply -f inference.yaml

# Monitor
kubectl get pods -w
kubectl logs -f <POD_NAME>

Some models require Hugging Face tokens. Create a Kubernetes Secret and reference it in the manifest.
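
A minimal sketch of the Secret step, assuming the token is available in an HF_TOKEN environment variable. The Secret name hf-secret and the key hf_api_token are hypothetical; they need to match whatever names the generated inference.yaml actually references.

```shell
# Print a Secret manifest holding a Hugging Face token.
# The name "hf-secret" and key "hf_api_token" are assumptions; use the
# names that the generated inference.yaml refers to.
render_hf_secret() {
  # Refuse to render an empty Secret if the token is missing.
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  hf_api_token: "${HF_TOKEN}"
EOF
}
```

One way to use it is `render_hf_secret | kubectl apply -f -`, which pipes the manifest straight to the cluster so the token is never written to a file on disk.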

GPU ComputeClass for Inference

For Autopilot clusters, create a ComputeClass to target GPU nodes:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: l4-inference
spec:
  priorities:
  - machineFamily: g2
    gpu:
      type: nvidia-l4
      count: 1
    minCores: 4
    minMemoryGb: 16
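
A ComputeClass only takes effect when a workload selects it. The sketch below writes a minimal Deployment that does so through the cloud.google.com/compute-class node selector and an nvidia.com/gpu limit. The Deployment name llm-server, the container name, and the <MODEL_SERVER_IMAGE> placeholder are illustrative assumptions; a manifest generated by GIQ will carry its own values.

```shell
# Write a minimal Deployment sketch that targets the l4-inference ComputeClass.
# "llm-server" and <MODEL_SERVER_IMAGE> are placeholders, not values GIQ emits.
cat > llm-server-computeclass.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      nodeSelector:
        # Schedule onto nodes provisioned for the l4-inference ComputeClass.
        cloud.google.com/compute-class: l4-inference
      containers:
      - name: inference-server
        image: <MODEL_SERVER_IMAGE>
        resources:
          limits:
            # Request one GPU, matching gpu.count in the ComputeClass.
            nvidia.com/gpu: 1
EOF
echo "wrote llm-server-computeclass.yaml"
```

After replacing the image placeholder, the file can be applied with `kubectl apply -f llm-server-computeclass.yaml`.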

Accelerator Selection Guide

Accelerator	Best For	Memory	Relative Cost
NVIDIA T4	Budget inference, lightweight legacy models	16 GB	Lowest
NVIDIA L4 (G2)	Small-medium model inference, video, graphics	24 GB	Low
NVIDIA RTX PRO 6000 (G4)	Multimodal AI, high-fidelity 3D, fine-tuning	96 GB	Medium
Cloud TPU v5e	Cost-effective transformer inference	Varies	Medium
Cloud TPU v5p	High-performance training	Varies	High
Cloud TPU v6e (Trillium)	High-efficiency next-gen training & serving	32 GB/chip	Medium-High
Cloud TPU v7x (Ironwood)	Ultra-scale inference & agentic workflows	192 GB/chip	High
NVIDIA A100	Large model inference, enterprise ML	40/80 GB	High
NVIDIA H100 / H200	Frontier model training, high throughput	80/141 GB	Highest
NVIDIA B200 (A4)	Blackwell-scale training, FP4 precision	192 GB	Highest
NVIDIA GB200 (A4X)	Rack-scale AI (Grace Blackwell Superchip)	Massive	Highest
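
When matching a model to the Memory column, a common rule of thumb is that the weights alone need roughly (parameters in billions) x (bits per parameter) / 8 gigabytes. The helper below is a sketch of that estimate only; it deliberately ignores KV cache, activations, and runtime overhead, and the model sizes passed to it are illustrative.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_PARAM
# Approximate memory for model weights only, in GB (1 GB = 10^9 bytes):
#   params (billions) * bits per parameter / 8
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 9 16   # 9B parameters at 16-bit precision
weights_gb 9 4    # the same model quantized to 4 bits
weights_gb 70 16  # 70B parameters at 16-bit precision
```

The three calls print 18.0, 4.5, and 140.0. By this estimate a 9B model at 16 bits fits a 24 GB L4 with limited headroom for KV cache, while a 70B model at 16 bits exceeds any single 80 GB GPU and needs quantization or multiple GPUs.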

Autoscaling LLM Inference

GPU-based autoscaling

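Scale on a custom GPU utilization metric. The metric name below (gpu_duty_cycle) must match what your custom metrics adapter actually exposes to the HPA: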

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_duty_cycle
      target:
        type: AverageValue
        averageValue: "80"
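
To reason about how an HPA like this reacts, it helps to apply the core Kubernetes scaling rule by hand: desired replicas = ceil(current replicas x current metric / target metric). The sketch below implements only that rule with integer arithmetic; it leaves out the tolerance band, the minReplicas/maxReplicas clamp, and stabilization windows, and the input values are made up for illustration.

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# Integer ceiling of current_replicas * current_metric / target_metric.
desired_replicas() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

desired_replicas 3 95 80   # 3 replicas averaging 95 against a target of 80
desired_replicas 4 40 80   # 4 replicas averaging 40 against the same target
```

The first call prints 4 (scale up by one replica) and the second prints 2 (scale down by half), which shows how far a single evaluation can move the replica count when no behavior policy limits it.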

Best practices for inference autoscaling

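Use DCGM metrics: The golden path cluster enables DCGM monitoring, which supplies GPU utilization metrics
Set appropriate minReplicas: At least 1 for always-on serving; 0 only for batch/on-demand workloads, and note that a standard HPA does not scale to zero unless that capability is explicitly enabled
Tune scale-down delay: Loading LLM weights is slow, so use longer scale-down stabilization windows to avoid removing replicas that are expensive to bring back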
Consider queue depth: Scale on pending requests rather than pure GPU utilization for latency-sensitive workloads
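
The scale-down and queue-depth advice can be sketched as an HPA variant with a behavior section. The metric name inference_queue_depth is hypothetical and stands in for whatever queue metric your model server and metrics adapter expose; the target of 10 pending requests and the 600 s and 120 s windows are illustrative starting points rather than tuned values.

```shell
# Write an HPA sketch that scales on queue depth and scales down slowly.
cat > llm-hpa-queue.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        # Hypothetical name; use the queue-depth metric your adapter exposes.
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleDown:
      # Require 10 minutes of consistently low load before removing replicas.
      stabilizationWindowSeconds: 600
      policies:
      # Remove at most one pod every two minutes.
      - type: Pods
        value: 1
        periodSeconds: 120
    scaleUp:
      # React to queue growth without delay.
      stabilizationWindowSeconds: 0
EOF
echo "wrote llm-hpa-queue.yaml"
```

The asymmetry is the point of the design: scale-up stays immediate so queues drain quickly, while scale-down is slow because every removed replica has to reload the model if traffic returns.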

Optimization Tips

Quantization: Use quantized models (GPTQ, AWQ) to reduce GPU memory and increase throughput
Batching: Configure model server batch size for throughput vs latency trade-off
Tensor parallelism: Split large models across multiple GPUs within a node
KV cache optimization: Tune --gpu-memory-utilization in vLLM for KV cache allocation
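
For vLLM specifically, these tips map onto server flags. The helper below prints a container args fragment as a sketch; the values are illustrative rather than recommendations, the function name is made up for this example, and --quantization=awq only applies when the checkpoint being served is itself AWQ-quantized.

```shell
# render_vllm_args [TENSOR_PARALLEL_SIZE] [GPU_MEMORY_UTILIZATION]
# Print a container "args" fragment for a vLLM server.
# Values are illustrative starting points, not tuned recommendations.
render_vllm_args() {
  tp_size="${1:-1}"       # number of GPUs in the node to shard the model across
  mem_util="${2:-0.90}"   # fraction of GPU memory vLLM may claim (weights + KV cache)
  cat <<EOF
args:
- --tensor-parallel-size=${tp_size}
- --gpu-memory-utilization=${mem_util}
- --max-num-seqs=64
- --quantization=awq
EOF
}

render_vllm_args 2 0.85
```

Here --max-num-seqs caps how many sequences are batched together, which is the throughput versus latency knob, and --tensor-parallel-size should match the number of GPUs requested by the pod.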

Troubleshooting

Issue	Cause	Fix
Invalid model/accelerator combination	Unsupported tuple	Re-run `gcloud container ai profiles list --model=`
GPU quota exceeded	Regional quota limit	Request quota increase or try a different region
OOM on GPU	Model too large for accelerator	Use larger GPU, enable quantization, or use tensor parallelism
Slow cold start	Large model loading from registry	Use local SSD for model caching; pre-pull images
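
The first row of the table can be folded into the generation step itself. The wrapper below is a sketch that assumes gcloud exits with a non-zero status when the combination is rejected; the function name generate_manifest and the temporary-file handling are illustrative, not part of GIQ.

```shell
# generate_manifest MODEL SERVER ACCELERATOR [OUT_FILE]
# Generate a manifest; on failure, keep no partial file and list valid combinations.
generate_manifest() {
  gm_model="$1"; gm_server="$2"; gm_accel="$3"; gm_out="${4:-inference.yaml}"
  if gcloud container ai profiles manifests create \
      --model="$gm_model" \
      --model-server="$gm_server" \
      --accelerator-type="$gm_accel" --quiet > "$gm_out.tmp"; then
    # Only replace the output file once generation has succeeded.
    mv "$gm_out.tmp" "$gm_out"
    echo "wrote $gm_out"
  else
    rm -f "$gm_out.tmp"
    echo "manifest generation failed; supported profiles for $gm_model:" >&2
    gcloud container ai profiles list --model="$gm_model" --quiet >&2 || true
    return 1
  fi
}
```

A call such as `generate_manifest gemma-2-9b-it vllm nvidia-l4` then either leaves a complete inference.yaml behind or prints the supported profiles, so an empty or truncated manifest is never applied by accident.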


Overall Score: 80/100
Grade: B (Good)
Safety: 85
Quality: 82
Clarity: 88
Completeness: 72

Summary

A reference guide for deploying and optimizing AI/ML inference workloads on Google Kubernetes Engine (GKE) using GPUs, TPUs, and model servers. The skill teaches agents to discover supported models and hardware, generate optimized Kubernetes manifests via gcloud, configure autoscaling for LLM serving, and troubleshoot common inference issues.

Detected Capabilities

kubernetes-manifest-generation
gcloud-cli-execution
k8s-resource-deployment
gpu-tpu-resource-selection
pod-monitoring
configuration-authoring

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

deploy llm on gke
gke inference setup
gpu kubernetes cluster
llama model serving
autoscale inference workload

Referenced Domains

External domains referenced in skill content, detected by static analysis.

www.apache.org

Use Cases

Deploy Llama, Gemma, or Mistral models to GKE for production inference
Generate Kubernetes manifests optimized for specific GPU/TPU hardware combinations
Configure autoscaling policies for cost-effective LLM serving on variable workloads
Select appropriate accelerators (T4, L4, A100, TPUs) based on model size and latency requirements
Debug inference deployment issues like GPU quota limits, OOM errors, and slow cold starts

Quality Notes

Comprehensive accelerator selection guide with memory, cost, and use-case mapping enables informed hardware decisions
Clear workflow structure (discovery → manifest generation → deployment → monitoring) is agent-friendly and actionable
Troubleshooting table directly maps symptoms to causes and fixes, supporting autonomous debugging
Best practices section covers advanced optimization techniques (quantization, batching, tensor parallelism, KV cache tuning) for production workloads
Prerequisites are explicit about cluster type (Autopilot), authentication, and quota requirements, preventing common deployment failures
Examples use realistic model names (gemma-2-9b-it, llama-3-8b) and concrete accelerator types, improving reproducibility
ComputeClass YAML example demonstrates GKE-specific node targeting for GPU workloads
Autoscaling section includes both metric selection and tuning guidance (scale-down delay, minReplicas strategy) rather than just syntax
Scope is well-bounded by explicit non-use case: 'Don't use for generic batch jobs or HPC task queues (use gke-batch-hpc instead)'
Missing: error handling patterns for common manifest generation failures (e.g., when gcloud ai profiles returns unsupported combinations)
Missing: guidance on Hugging Face token management beyond 'create a secret' — no example Secret YAML
Missing: multi-node GPU setup and distributed inference patterns for very large models

Model: claude-haiku-4-5-20251001
Analyzed: Jun 24, 2026
