GKE Batch & HPC Workloads

This reference covers running batch processing and high-performance computing (HPC) workloads on GKE.

MCP Tools: apply_k8s_manifest, get_k8s_resource, describe_k8s_resource, get_k8s_logs, delete_k8s_resource, list_k8s_events

When to Use

Running batch data processing pipelines
HPC simulations (CFD, molecular dynamics, financial modeling)
Large-scale parallel computation (MPI, MapReduce)
ML training jobs
CI/CD build farms

Batch Processing on GKE

Kubernetes Jobs

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  parallelism: 10
  completions: 100
  backoffLimit: 3
  template:
    spec:
      containers:
      - name: worker
        image: <IMAGE>
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
      restartPolicy: Never

JobSet (for Complex Multi-Job Workflows)

The golden path cluster configuration enables JobSet monitoring by listing JOBSET in the cluster's monitoringConfig.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-job
spec:
  replicatedJobs:
  - name: workers
    replicas: 4
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          spec:
            containers:
            - name: worker
              image: <IMAGE>
              resources:
                requests:
                  cpu: "4"
                  memory: "8Gi"
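
JobSet is most useful when it groups more than one kind of Job. The following is a minimal sketch, not taken from the original reference, that pairs a hypothetical driver job with the workers job and sets failurePolicy.maxRestarts so that the set is restarted as a unit when a child Job fails. The names, replica counts, and restart budget are illustrative assumptions.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-job-with-driver   # hypothetical name
spec:
  # Restart the whole JobSet up to 3 times when any child Job fails.
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: driver                   # hypothetical coordinating component
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0            # fail the Job on the first Pod failure
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: <IMAGE>
              resources:
                requests:
                  cpu: "1"
                  memory: "2Gi"
  - name: workers
    replicas: 4
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: <IMAGE>
              resources:
                requests:
                  cpu: "4"
                  memory: "8Gi"
```

Setting backoffLimit to 0 on the child Jobs hands retry handling to the JobSet, which suits tightly coupled components that cannot make progress when a peer is missing.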

Kueue (Job Queuing)

Kueue manages job scheduling and resource allocation for batch workloads:

# Install Kueue
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/latest/download/manifests.yaml

# Define a ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: batch-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default
      resources:
      - name: "cpu"
        nominalQuota: 100
      - name: "memory"
        nominalQuota: "200Gi"
---
# Allow a namespace to use the queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: batch-local
  namespace: batch-jobs
spec:
  clusterQueue: batch-queue
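
The ClusterQueue above refers to a flavor named default, and Kueue keeps a ClusterQueue inactive until a ResourceFlavor with that name exists. The sketch below adds that flavor and a Job that is submitted through the LocalQueue. It assumes the batch-jobs namespace already exists and that the installed Kueue release still serves the v1beta1 API; the Job name is hypothetical.

```yaml
# Empty flavor that satisfies the "default" reference in the ClusterQueue.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default
---
apiVersion: batch/v1
kind: Job
metadata:
  name: queued-job                 # hypothetical name
  namespace: batch-jobs
  labels:
    # Routes the Job to the LocalQueue defined above.
    kueue.x-k8s.io/queue-name: batch-local
spec:
  suspend: true                    # Kueue unsuspends the Job once it is admitted
  parallelism: 2
  completions: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: <IMAGE>
        resources:
          requests:                # counted against the ClusterQueue quota
            cpu: "1"
            memory: "2Gi"
```

With this in place, the Job stays suspended until the queue has enough unused cpu and memory quota to cover its requests.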

HPC on GKE

Compact Placement (Low-Latency Networking)

For tightly-coupled HPC workloads that need low-latency inter-node communication:

# Standard clusters: create node pool with compact placement
gcloud container node-pools create hpc-pool \
  --cluster <CLUSTER_NAME> --region <REGION> \
  --machine-type c3-standard-44 \
  --placement-type COMPACT \
  --num-nodes 8 \
  --enable-autoscaling --min-nodes 0 --max-nodes 16 \
  --quiet
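
Creating the pool does not by itself place workloads on it. As a sketch, a Job can be pinned to the compact-placement pool with the node pool label that GKE sets on every node. The Job name, the Indexed completion mode, and the request sizes are assumptions chosen so that roughly one Pod lands on each c3-standard-44 node.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hpc-job                    # hypothetical name
spec:
  parallelism: 8                   # matches --num-nodes 8 above
  completions: 8
  completionMode: Indexed          # each Pod gets a stable index, usable as a rank
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        # Label applied by GKE to nodes in the hpc-pool node pool.
        cloud.google.com/gke-nodepool: hpc-pool
      containers:
      - name: solver
        image: <IMAGE>
        resources:
          requests:                # sized so two Pods do not fit on one node
            cpu: "32"
            memory: "128Gi"
```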

MPI Workloads

Use the MPI Operator for MPI-based HPC applications:

# Install MPI Operator (server-side apply avoids size limits on the large CRD)
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: hpc-simulation
spec:
  slotsPerWorker: 4
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: <MPI_IMAGE>
            command: ["mpirun", "-np", "32", "./simulation"]
            resources:
              requests:
                cpu: "1"
                memory: "2Gi"
              limits:
                cpu: "2"
                memory: "4Gi"
    Worker:
      replicas: 8
      template:
        spec:
          containers:
          - name: worker
            image: <MPI_IMAGE>
            resources:
              requests:
                cpu: "4"
                memory: "8Gi"
              limits:
                cpu: "8"
                memory: "16Gi"

Cost Optimization for Batch/HPC

Spot VMs for Batch

Batch workloads are ideal Spot VM candidates (interruptible, can checkpoint). Use a ComputeClass with Spot-first priority and activeMigration to return to Spot when available. See the gke-compute-classes skill for the Spot-with-fallback pattern.
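
As a rough sketch of the Spot-first pattern described above, a ComputeClass can list a Spot rule ahead of an on-demand rule and enable activeMigration. The class name, the n2 machine family, and the use of node pool auto-creation are assumptions; the gke-compute-classes skill remains the reference for the full pattern.

```yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: batch-spot-first           # hypothetical name
spec:
  priorities:
  - machineFamily: n2              # assumed family; tried first, on Spot
    spot: true
  - machineFamily: n2              # fallback to on-demand capacity
    spot: false
  activeMigration:
    # Move workloads back to a higher-priority rule (Spot) when it is available.
    optimizeRulePriority: true
  nodePoolAutoCreation:
    enabled: true
---
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-batch-job             # hypothetical name
spec:
  parallelism: 10
  completions: 100
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        # Selects the ComputeClass defined above.
        cloud.google.com/compute-class: batch-spot-first
      containers:
      - name: worker
        image: <IMAGE>
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
```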

Scale-to-Zero

For batch clusters, allow node pools to scale to zero when no jobs are running:

Autopilot (golden path): Automatic, nodes scale to zero when no pods are scheduled
Standard: Set --min-nodes 0 on batch node pools

Best Practices & Production Guidelines

Resource Requests and Limits: Always specify resource requests and limits (CPU, memory, and optionally GPU/TPU) in all batch/HPC manifests. Kueue admits workloads against quota based on their requests, and the autoscaler sizes nodes from them, so missing values lead to incorrect admission decisions and resource starvation in the cluster.
TPU/Spot Cluster Maintenance: For long-running AI training runs on Spot VMs/TPUs, advise using GKE maintenance exclusions to block automatic cluster upgrades and the node restarts they cause during the active training window. Exclusions do not prevent Spot preemption itself; they only avoid adding upgrade-driven disruption on top of it.
MPI Workloads: Use the Kubeflow MPI Operator (the operator installed in the MPI Workloads section) to orchestrate distributed MPI applications via the MPIJob custom resource.
Kueue & JobSet: Use Kueue for multi-tenant job queueing and fair sharing; use JobSet for multi-component tightly coupled workloads.
Resilience: Always set a backoffLimit on Jobs, and implement application-level checkpointing (e.g., using Orbax or PyTorch checkpointing) to survive Spot VM preemption.
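
The resilience and resource guidance above can be combined in a single Job. The sketch below sets requests and limits, keeps backoffLimit, and adds a podFailurePolicy rule so that Pods terminated by a node disruption are retried without using up the retry budget. It assumes a Kubernetes version where podFailurePolicy is available and that preempted Pods receive the DisruptionTarget condition, which depends on graceful node shutdown; the Job name is hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resilient-batch-job        # hypothetical name
spec:
  parallelism: 10
  completions: 100
  backoffLimit: 3                  # budget for genuine application failures
  podFailurePolicy:
    rules:
    # Do not count Pods killed by a disruption against backoffLimit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never         # required when podFailurePolicy is set
      containers:
      - name: worker
        image: <IMAGE>
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
```

Retried Pods still start from scratch, so application-level checkpointing is what preserves progress across a preemption.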

Overall Score

78/100 (Grade B, Good)

Safety: 80
Quality: 76
Clarity: 82
Completeness: 72

Summary

This skill guides agents to run batch processing and HPC workloads on Google Kubernetes Engine (GKE), covering Kubernetes Jobs, JobSet, Kueue job queueing, MPI-based distributed computing, and cost optimization via Spot VMs. It provides YAML manifests and gcloud commands for setting up compact placement, resource quotas, and resilience patterns.

Detected Capabilities

apply Kubernetes manifests
run gcloud commands
read Kubernetes resource definitions
query GKE cluster configuration
download external YAML manifests from GitHub

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

gke batch jobs
hpc on kubernetes
job queuing kueue
mpi distributed computing
jobset multi-job workflows
spot vm batch workloads

Risk Signals

INFO

Download and apply manifest from GitHub (https://github.com/kubernetes-sigs/kueue/releases/latest/download/manifests.yaml)

Kueue installation section

INFO

Download and apply manifest from raw.githubusercontent.com (https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml)

MPI Operator installation section

Referenced Domains

External domains referenced in skill content, detected by static analysis.

github.com
raw.githubusercontent.com
www.apache.org

Use Cases

Run batch data processing pipelines on GKE clusters
Configure HPC simulations (CFD, molecular dynamics) with MPI-based distributed computing
Set up multi-tenant job queues with Kueue for fair resource sharing
Deploy complex multi-job ML training workflows using JobSet
Optimize batch workload costs using Spot VMs with fallback to standard nodes
Configure low-latency inter-node networking for tightly-coupled HPC applications

Quality Notes

Skill clearly delineates scope with explicit 'When to Use' and 'Don't use for' guidance
Multiple concrete YAML examples for Job, JobSet, Kueue, and MPIJob resources with commented inline explanations
Best practices section covers critical production considerations (resource quotas, resilience, checkpointing)
Mentions related skill (gke-compute-classes) for spot VM fallback patterns, showing good skill composition
Appropriate tool allowlist (MCP Kubernetes tools) is declared in header
Well-structured with logical sections: batch processing, HPC, cost optimization, and best practices
Some placeholder tokens (<IMAGE>, <CLUSTER_NAME>, <REGION>) require user substitution — no guidance on failure modes if these are omitted
No explicit error handling guidance for failed manifest applications or missing Kueue/MPI operator CRDs

Model: claude-haiku-4-5-20251001
Analyzed: Jun 24, 2026

gke-batch-hpc