google/gke-reliability

Improves GKE workload reliability, using PDBs, health probes, and topology spread constraints. Use when configuring GKE workload reliability, setting up PDBs, or configuring GKE health probes (liveness, readiness, startup). Don't use for disaster recovery setup or full cluster backups (use gke-backup-dr instead).

v1.0 · Saved Jun 24, 2026

GKE Reliability

This reference covers high availability and reliability configuration for GKE clusters and workloads.

MCP Tools: get_cluster, get_k8s_resource, describe_k8s_resource, apply_k8s_manifest, list_k8s_events

Golden Path Reliability Defaults

| Setting          | Golden Path Value                        | Notes                                  |
|------------------|------------------------------------------|----------------------------------------|
| Cluster type     | Regional (4 zones: us-central1-a/b/c/f)  | Control plane replicated across zones  |
| Upgrade strategy | SURGE (maxSurge: 1)                      | Rolling upgrades with extra capacity   |
| Auto-repair      | true                                     | Unhealthy nodes replaced automatically |
| Auto-upgrade     | true                                     | Nodes follow control plane version     |
| Release channel  | REGULAR                                  | Balanced freshness and stability       |
| Stateful HA      | Enabled                                  | Leader election for stateful workloads |

Workflows

1. Verify Cluster High Availability

# MCP (preferred)
get_cluster(name="projects/<PROJECT>/locations/<REGION>/clusters/<CLUSTER>",
  readMask="location,locations,nodePools.locations")

# gcloud fallback
gcloud container clusters describe <CLUSTER> --region <REGION> \
  --format="json(location, locations)" \
  --quiet
  • If location is a region (e.g., us-central1), the control plane is regional
  • If locations has multiple entries, nodes span multiple zones

2. Pod Disruption Budgets (PDBs)

PDBs ensure minimum pod availability during voluntary disruptions (node upgrades, autoscaler scale-down).

Check existing PDBs:

# MCP (preferred)
get_k8s_resource(parent="...", resourceType="poddisruptionbudget")

# kubectl fallback
kubectl get pdb --all-namespaces

Create PDB:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: default
spec:
  minAvailable: 2       # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app

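Every production Deployment with 2+ replicas should have a PDB. Keep minAvailable below the replica count (or use maxUnavailable: 1): a budget of minAvailable: 2 on a 2-replica Deployment blocks every voluntary eviction and can stall node drains.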
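
For Deployments whose replica count changes over time (for example, under an autoscaler), a percentage-based budget can be easier to maintain than a fixed minAvailable. The following is a minimal sketch of that variant, reusing the placeholder name and label from the PDB above; the 25% value is an assumption chosen for illustration, not a default from this reference.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: default
spec:
  maxUnavailable: 25%   # Assumed value; set either maxUnavailable or minAvailable, never both
  selector:
    matchLabels:
      app: my-app
```

With 4 replicas this allows one pod to be evicted at a time. Percentages are rounded up to a whole pod, so small replica counts still permit at least one eviction.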

3. Health Probes

Every production container should have liveness and readiness probes. Startup probes are recommended for slow-starting apps.

Check existing probes:

# MCP (preferred)
describe_k8s_resource(parent="...", resourceType="deployment", name="<APP>", namespace="<NS>")

# kubectl fallback
kubectl get deployment <APP> -n <NS> -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe"

Recommended probe configuration:

spec:
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /readyz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 3
    startupProbe:             # For slow-starting apps
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 30    # 30 * 5s = 150s, plus the 10s initial delay
  • Readiness: Determines when a pod can accept traffic
  • Liveness: Determines when to restart a container
  • Startup: Suspends liveness and readiness checks until the startup probe succeeds (prevents premature restarts of slow-starting apps)
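
The configuration above assumes an HTTP health endpoint. For containers that do not serve HTTP, the same probe types can use a TCP or exec check instead. The sketch below is illustrative only: the container name, the port, and the /tmp/healthy marker file are assumptions, and the exec check requires a shell in the image.

```yaml
spec:
  containers:
  - name: worker                # Hypothetical non-HTTP container
    readinessProbe:
      tcpSocket:
        port: 5432              # Assumed port; ready once the socket accepts connections
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 3
    livenessProbe:
      exec:                     # Healthy while the command exits with status 0
        command: ["/bin/sh", "-c", "test -f /tmp/healthy"]   # Assumed marker file written by the app
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 3
```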

4. Graceful Shutdown

Ensure applications handle SIGTERM and drain in-flight requests:

spec:
  terminationGracePeriodSeconds: 30    # Default; increase for long-running requests
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]  # Allow LB to deregister
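
When sizing these values, note that the preStop hook runs before SIGTERM is delivered and its duration counts against terminationGracePeriodSeconds. The grace period therefore needs to cover the preStop sleep plus the time the application takes to drain. A sketch for a service with longer requests, where the 60s and 10s figures are assumptions for illustration:

```yaml
spec:
  terminationGracePeriodSeconds: 60    # Assumed budget: 10s preStop + up to 50s for the app to drain
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]  # Runs first; SIGTERM is sent only after it exits
```

If the container is still running when the grace period expires, it is killed with SIGKILL, so in-flight requests longer than the remaining budget are dropped.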

5. Topology Spread Constraints

Distribute pods across zones and nodes to survive failures:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-app
  • Zone spread (DoNotSchedule): Hard requirement -- pods must be balanced across zones
  • Node spread (ScheduleAnyway): Best-effort -- prefer distribution but don't block scheduling

6. Replicas

| Workload Type        | Minimum Replicas     | Reason                                |
|----------------------|----------------------|---------------------------------------|
| Stateless web/API    | 2                    | Survive single pod/node failure       |
| Critical services    | 3                    | Survive zone failure with zone spread |
| Stateful (databases) | 3 (with replication) | Application-level quorum              |
| Batch/jobs           | 1                    | Ephemeral by nature                   |
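
To show how the "critical services" row combines with zone spreading, here is a minimal sketch of a Deployment with 3 replicas and a hard zone constraint. The image reference is hypothetical, and the name and label are the same placeholders used in the earlier snippets. Note that topologySpreadConstraints belongs in the pod template spec, not the Deployment spec.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  replicas: 3                      # One pod per zone in a 3-zone cluster
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app
      containers:
      - name: app
        image: registry.example/my-app:1.0   # Hypothetical image
```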

Best Practices & Production Guidelines

  1. Regional clusters for production: Always use regional clusters to survive zone failures.
  2. PDBs for everything: Every production workload with 2+ replicas needs a PodDisruptionBudget (PDB) to protect against voluntary disruptions.
  3. Probes with explicit timeouts: Every production container must define both liveness and readiness probes. Set initialDelaySeconds, periodSeconds, and timeoutSeconds explicitly on every probe. The Kubernetes default timeoutSeconds is 1 second; raise it if the health endpoint legitimately needs longer, but keep it bounded so a hung endpoint is detected quickly.
  4. Zone spreading: Use topology spread constraints to distribute pods across failure domains (zones and nodes).
  5. Graceful shutdown: Handle SIGTERM and set appropriate terminationGracePeriodSeconds with a preStop sleep hook to allow load balancer deregistration.
  6. Maintenance windows: Schedule upgrades during low-traffic periods (see the gke-upgrades skill).

Overall Score: 82/100 (Grade B, Good)

  • Safety: 85
  • Quality: 80
  • Clarity: 85
  • Completeness: 78

Summary

This skill provides guidance for configuring high availability and reliability for GKE workloads using Pod Disruption Budgets, health probes, topology spread constraints, and regional cluster setup. It references MCP tools for cluster queries and manifest application, offering both preferred MCP workflows and gcloud/kubectl fallbacks.

Detected Capabilities

kubernetes-resource-query, kubernetes-manifest-application, cluster-configuration-inspection, workload-health-probe-configuration

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

pod disruption budgets, gke health probes, graceful shutdown, topology spread constraints, workload reliability, replica configuration

Risk Signals

  • INFO: References MCP tools (get_cluster, get_k8s_resource, describe_k8s_resource, apply_k8s_manifest, list_k8s_events) with clear boundaries. Location: Frontmatter and Workflows sections
  • INFO: Provides gcloud and kubectl fallbacks as alternatives to MCP. Location: All workflow sections
  • INFO: Scope is explicitly bounded to GKE reliability configuration; explicitly excludes disaster recovery and cluster backups. Location: Description field

Referenced Domains

External domains referenced in skill content, detected by static analysis.

www.apache.org

Use Cases

  • Verify GKE cluster regional high availability setup
  • Create and manage Pod Disruption Budgets for production workloads
  • Configure liveness, readiness, and startup probes for containers
  • Implement topology spread constraints for zone-aware pod distribution
  • Design replica counts for stateless, stateful, and batch workloads
  • Set up graceful shutdown handling with SIGTERM and preStop hooks

Quality Notes

  • Strength: Well-organized with clear sections for each reliability concern (HA verification, PDBs, health probes, graceful shutdown, topology spread, replicas)
  • Strength: Includes comprehensive golden path defaults table with documented rationale for each setting
  • Strength: Provides both MCP (preferred) and fallback (gcloud/kubectl) command options for all workflows
  • Strength: Concrete YAML examples for every configuration type with detailed comments explaining key fields
  • Strength: Best practices section clearly states production guidelines with explicit emphasis on probe configuration requirements
  • Strength: References external skill (gke-upgrades) appropriately for related tasks, showing good scope boundaries
  • Weakness: Limited guidance on probe timeout tuning for different application types (only one example provided)
  • Weakness: Does not discuss monitoring/alerting for PDB violations or probe failures
  • Weakness: Graceful shutdown section is minimal; lacks guidance on validating SIGTERM handling or common implementation patterns
  • Weakness: No guidance on testing topology spread constraints or verifying pod distribution across zones
Model: claude-haiku-4-5-20251001 · Analyzed: Jun 24, 2026
