google/gke-inference

Deploys and optimizes AI/ML inference workloads on GKE, using GPUs, TPUs, and model servers. Use when deploying GKE inference servers, configuring GKE GPU resources for inference, or deploying LLMs on GKE. Don't use for generic batch jobs or HPC task queues (use gke-batch-hpc instead).

v1.0 · Saved Jun 24, 2026

GKE AI/ML Inference

This reference covers deploying AI/ML inference workloads on GKE using the GKE Inference Quickstart (GIQ) and best practices for LLM serving.

MCP Tools: apply_k8s_manifest, get_k8s_resource, get_k8s_logs, get_k8s_rollout_status, describe_k8s_resource, list_k8s_events. CLI-only: gcloud container ai profiles *

When to Use

  • Deploy an AI model (Llama, Gemma, Mistral, etc.) to GKE
  • Generate optimized Kubernetes manifests for inference
  • Select GPU/TPU accelerators for model serving
  • Configure autoscaling for LLM inference

Prerequisites

  • A golden path GKE Autopilot cluster (GPU workloads are supported via ComputeClasses and NAP)
  • gcloud CLI authenticated
  • Sufficient GPU/TPU quota in the target region

Workflow

1. Discovery: Find Models and Hardware

# List all supported models
gcloud container ai profiles models list --quiet

# Find valid accelerator/server combinations for a model
gcloud container ai profiles list --model=<MODEL_NAME> --quiet

# Example: what can run Gemma 2 9B?
gcloud container ai profiles list --model=gemma-2-9b-it --quiet

2. Generate Manifest

gcloud container ai profiles manifests create \
  --model=<MODEL_NAME> \
  --model-server=<SERVER> \
  --accelerator-type=<ACCELERATOR> \
  --target-ntpot-milliseconds=<NTPOT> --quiet > inference.yaml

Parameters:

  • --model: Model ID (e.g., gemma-2-9b-it, llama-3-8b)
  • --model-server: Inference server (vllm, tgi, triton, tensorrt-llm)
  • --accelerator-type: GPU/TPU type (nvidia-l4, nvidia-tesla-a100, nvidia-h100-80gb)
  • --target-ntpot-milliseconds: Target Normalized Time Per Output Token (optional, for latency optimization)

Example:

gcloud container ai profiles manifests create \
  --model=gemma-2-9b-it \
  --model-server=vllm \
  --accelerator-type=nvidia-l4 \
  --target-ntpot-milliseconds=50 --quiet > inference.yaml
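
If the model, server, and accelerator combination is unsupported, the redirect above can leave an empty or partial inference.yaml behind. The following is a minimal sketch of a wrapper that guards against that; `generate_manifest` is a hypothetical helper name (not part of gcloud), and it uses only the flags shown above. It writes to a temporary file and replaces the output file only when gcloud succeeds and produces non-empty output.

```shell
# Hypothetical helper: generate_manifest MODEL SERVER ACCELERATOR [NTPOT] [OUTFILE]
generate_manifest() {
  model="$1"; server="$2"; accelerator="$3"; ntpot="$4"
  out="${5:-inference.yaml}"
  tmp="$(mktemp)" || return 1

  # Build the argument list; the NTPOT target is optional.
  set -- --model="$model" --model-server="$server" --accelerator-type="$accelerator"
  if [ -n "$ntpot" ]; then
    set -- "$@" --target-ntpot-milliseconds="$ntpot"
  fi

  # Only keep the result if gcloud exits 0 and wrote something.
  if gcloud container ai profiles manifests create "$@" --quiet > "$tmp" && [ -s "$tmp" ]; then
    mv "$tmp" "$out"
    echo "Wrote $out"
  else
    rm -f "$tmp"
    echo "Manifest generation failed. List valid combinations with:" >&2
    echo "  gcloud container ai profiles list --model=$model --quiet" >&2
    return 1
  fi
}
```

Usage: `generate_manifest gemma-2-9b-it vllm nvidia-l4 50` writes inference.yaml, or prints the discovery command to re-run if the combination is rejected.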

3. Review and Deploy

# Review for placeholders (HF tokens, PVCs)
cat inference.yaml

# Deploy
kubectl apply -f inference.yaml

# Monitor
kubectl get pods -w
kubectl logs -f <POD_NAME>
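
Because model servers can take several minutes to pull images and load weights, it may help to block on the rollout instead of watching pods by hand. This is a sketch under assumptions: `deploy_and_wait` is a hypothetical helper, the Deployment name must match whatever the generated manifest defines, and the 20m default timeout is an illustrative value.

```shell
# Hypothetical helper: deploy_and_wait MANIFEST DEPLOYMENT_NAME [TIMEOUT]
deploy_and_wait() {
  manifest="$1"; deployment="$2"; timeout="${3:-20m}"

  kubectl apply -f "$manifest" || return 1

  # Wait until the Deployment reports all replicas available, or time out.
  if ! kubectl rollout status "deployment/$deployment" --timeout="$timeout"; then
    echo "Rollout did not complete within $timeout; recent events:" >&2
    kubectl get events --sort-by=.lastTimestamp | tail -n 20 >&2
    return 1
  fi
}
```

Usage: `deploy_and_wait inference.yaml <DEPLOYMENT_NAME>`, then `kubectl logs -f <POD_NAME>` to follow model loading.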

Some models require Hugging Face tokens. Create a Kubernetes Secret and reference it in the manifest.
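
As an illustration of the Secret step, the sketch below writes a Secret manifest to a file. The Secret name `hf-secret` and key `hf_api_token` are assumptions chosen for this example; use the names the generated manifest actually references. The token is read from the `HF_TOKEN` environment variable, with a placeholder used when it is unset.

```shell
# Assumed names: hf-secret / hf_api_token are illustrative, not mandated by GIQ.
HF_TOKEN="${HF_TOKEN:-REPLACE_WITH_YOUR_TOKEN}"

# stringData lets Kubernetes base64-encode the value on apply.
cat > hf-secret.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  hf_api_token: "${HF_TOKEN}"
EOF

# The container in inference.yaml would then reference it roughly like this
# (the env var name depends on the model server; check the generated manifest):
#   env:
#   - name: <TOKEN_ENV_VAR>
#     valueFrom:
#       secretKeyRef:
#         name: hf-secret
#         key: hf_api_token
```

Apply it with `kubectl apply -f hf-secret.yaml` before deploying the workload, and avoid committing the file, since it contains the token in plain text.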

GPU ComputeClass for Inference

For Autopilot clusters, create a ComputeClass to target GPU nodes:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: l4-inference
spec:
  priorities:
  - machineFamily: g2
    gpu:
      type: nvidia-l4
      count: 1
    minCores: 4
    minMemoryGb: 16

Accelerator Selection Guide

| Accelerator              | Best For                                       | Memory      | Relative Cost |
|--------------------------|------------------------------------------------|-------------|---------------|
| NVIDIA T4                | Budget inference, lightweight legacy models    | 16 GB       | Lowest        |
| NVIDIA L4 (G2)           | Small-medium model inference, video, graphics  | 24 GB       | Low           |
| NVIDIA RTX PRO 6000 (G4) | Multimodal AI, high-fidelity 3D, fine-tuning   | 96 GB       | Medium        |
| Cloud TPU v5e            | Cost-effective transformer inference           | Varies      | Medium        |
| Cloud TPU v5p            | High-performance training                      | Varies      | High          |
| Cloud TPU v6e (Trillium) | High-efficiency next-gen training & serving    | 32 GB/chip  | Medium-High   |
| Cloud TPU v7x (Ironwood) | Ultra-scale inference & agentic workflows      | 192 GB/chip | High          |
| NVIDIA A100              | Large model inference, enterprise ML           | 40/80 GB    | High          |
| NVIDIA H100 / H200       | Frontier model training, high throughput       | 80/141 GB   | Highest       |
| NVIDIA B200 (A4)         | Blackwell-scale training, FP4 precision        | 192 GB      | Highest       |
| NVIDIA GB200 (A4X)       | Rack-scale AI (Grace Blackwell Superchip)      | Massive     | Highest       |

Autoscaling LLM Inference

GPU-based autoscaling

Use custom metrics for GPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_duty_cycle
      target:
        type: AverageValue
        averageValue: "80"

Best practices for inference autoscaling

  1. Use DCGM metrics: Golden path enables DCGM monitoring for GPU utilization metrics
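  2. Set appropriate minReplicas: At least 1 for always-on serving; 0 for batch/on-demand (in upstream Kubernetes, an HPA can only scale to 0 with the HPAScaleToZero feature gate and an Object or External metric, so verify support on your cluster first)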
  3. Tune scale-down delay: LLM model loading is slow; use longer stabilization windows
  4. Consider queue depth: Scale on pending requests rather than pure GPU utilization for latency-sensitive workloads
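
To make the stabilization-window advice concrete, the sketch below writes the HPA from the previous section with an added `behavior` stanza. The 600-second scale-down window and the one-pod-per-120-seconds policy are illustrative assumptions, not recommended values; tune them against how long your model takes to load.

```shell
# Same HPA as above, plus scale-down damping. Window and policy values are
# illustrative assumptions.
cat > llm-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_duty_cycle
      target:
        type: AverageValue
        averageValue: "80"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react to load immediately
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 min before removing replicas
      policies:
      - type: Pods
        value: 1                        # remove at most one pod...
        periodSeconds: 120              # ...every two minutes
EOF
```

For latency-sensitive serving, the `gpu_duty_cycle` metric could be swapped for a queue-depth metric exported by the model server; the exact metric name depends on the server and on how metrics reach the HPA.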

Optimization Tips

  • Quantization: Use quantized models (GPTQ, AWQ) to reduce GPU memory and increase throughput
  • Batching: Configure model server batch size for throughput vs latency trade-off
  • Tensor parallelism: Split large models across multiple GPUs within a node
  • KV cache optimization: Tune --gpu-memory-utilization in vLLM for KV cache allocation
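
A quick back-of-the-envelope check can show whether quantization or tensor parallelism is needed before deploying. The sketch below estimates weight memory only, as parameters times bits per parameter divided by 8; `weights_gb` is a hypothetical helper, and the estimate deliberately ignores KV cache, activations, and runtime overhead, so treat it as a lower bound.

```shell
# Hypothetical helper: weights_gb PARAMS_IN_BILLIONS BITS_PER_PARAM
# Prints the approximate weight footprint in GB (weights only).
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 9 16   # 9B parameters at 16-bit precision
weights_gb 9 4    # same model with 4-bit quantization
```

At 16-bit precision, a 9B model needs about 18 GB for weights alone, which leaves little of an L4's 24 GB for the KV cache; at 4 bits the weights drop to roughly 4.5 GB. In vLLM, quantized checkpoints and multi-GPU sharding are typically selected with `--quantization` and `--tensor-parallel-size`, alongside the `--gpu-memory-utilization` setting mentioned above.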

Troubleshooting

| Issue                                 | Cause                             | Fix                                                            |
|---------------------------------------|-----------------------------------|----------------------------------------------------------------|
| Invalid model/accelerator combination | Unsupported tuple                 | Re-run `gcloud container ai profiles list --model=`            |
| GPU quota exceeded                    | Regional quota limit              | Request quota increase or try a different region               |
| OOM on GPU                            | Model too large for accelerator   | Use larger GPU, enable quantization, or use tensor parallelism |
| Slow cold start                       | Large model loading from registry | Use local SSD for model caching; pre-pull images               |

Overall Score: 80/100 (Grade B, Good)

  • Safety: 85
  • Quality: 82
  • Clarity: 88
  • Completeness: 72

Summary

A reference guide for deploying and optimizing AI/ML inference workloads on Google Kubernetes Engine (GKE) using GPUs, TPUs, and model servers. The skill teaches agents to discover supported models and hardware, generate optimized Kubernetes manifests via gcloud, configure autoscaling for LLM serving, and troubleshoot common inference issues.

Detected Capabilities

kubernetes-manifest-generation, gcloud-cli-execution, k8s-resource-deployment, gpu-tpu-resource-selection, pod-monitoring, configuration-authoring

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

deploy llm on gke, gke inference setup, gpu kubernetes cluster, llama model serving, autoscale inference workload

Referenced Domains

External domains referenced in skill content, detected by static analysis.

www.apache.org

Use Cases

  • Deploy Llama, Gemma, or Mistral models to GKE for production inference
  • Generate Kubernetes manifests optimized for specific GPU/TPU hardware combinations
  • Configure autoscaling policies for cost-effective LLM serving on variable workloads
  • Select appropriate accelerators (T4, L4, A100, TPUs) based on model size and latency requirements
  • Debug inference deployment issues like GPU quota limits, OOM errors, and slow cold starts

Quality Notes

  • Comprehensive accelerator selection guide with memory, cost, and use-case mapping enables informed hardware decisions
  • Clear workflow structure (discovery → manifest generation → deployment → monitoring) is agent-friendly and actionable
  • Troubleshooting table directly maps symptoms to causes and fixes, supporting autonomous debugging
  • Best practices section covers advanced optimization techniques (quantization, batching, tensor parallelism, KV cache tuning) for production workloads
  • Prerequisites are explicit about cluster type (Autopilot), authentication, and quota requirements, preventing common deployment failures
  • Examples use realistic model names (gemma-2-9b-it, llama-3-8b) and concrete accelerator types, improving reproducibility
  • ComputeClass YAML example demonstrates GKE-specific node targeting for GPU workloads
  • Autoscaling section includes both metric selection and tuning guidance (scale-down delay, minReplicas strategy) rather than just syntax
  • Scope is well-bounded by explicit non-use case: 'Don't use for generic batch jobs or HPC task queues (use gke-batch-hpc instead)'
  • Missing: error handling patterns for common manifest generation failures (e.g., when gcloud ai profiles returns unsupported combinations)
  • Missing: guidance on Hugging Face token management beyond 'create a secret' — no example Secret YAML
  • Missing: multi-node GPU setup and distributed inference patterns for very large models
Model: claude-haiku-4-5-20251001 · Analyzed: Jun 24, 2026
