GKE Reliability

This reference covers high availability and reliability configuration for GKE clusters and workloads.

MCP Tools: get_cluster, get_k8s_resource, describe_k8s_resource, apply_k8s_manifest, list_k8s_events

Golden Path Reliability Defaults

Setting	Golden Path Value	Notes
Cluster type	Regional (4 zones: us-central1-a/b/c/f)	Control plane replicated across zones
Upgrade strategy	SURGE (`maxSurge: 1`)	Rolling upgrades with extra capacity
Auto-repair	`true`	Unhealthy nodes replaced automatically
Auto-upgrade	`true`	Nodes follow control plane version
Release channel	REGULAR	Balanced freshness and stability
Stateful HA	Enabled	Leader election for stateful workloads

Workflows

1. Verify Cluster High Availability

# MCP (preferred)
get_cluster(name="projects/<PROJECT>/locations/<REGION>/clusters/<CLUSTER>",
  readMask="location,locations,nodePools.locations")

# gcloud fallback
gcloud container clusters describe <CLUSTER> --region <REGION> \
  --format="json(location, locations)" \
  --quiet

If location is a region (e.g., us-central1) rather than a zone (e.g., us-central1-a), the control plane is regional.
If locations has multiple entries, nodes span multiple zones.
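
The two rules above can be applied mechanically to the JSON returned by the describe command. The following is a minimal sketch, not part of the skill itself: `summarize_ha` is a hypothetical helper, and the zone pattern is a heuristic that assumes zone names are a region name plus a one-letter suffix.

```python
import re

# Heuristic (an assumption): a zone is a region name plus "-<letter>".
_ZONE = re.compile(r"^[a-z]+-[a-z]+\d+-[a-z]$")


def summarize_ha(cluster):
    """Classify a cluster description that has 'location' and 'locations' keys."""
    location = cluster.get("location", "")
    node_zones = cluster.get("locations", [])
    return {
        # A non-empty location that does not look like a zone is treated as a region.
        "regional_control_plane": bool(location) and not _ZONE.match(location),
        # More than one distinct node zone means nodes span zones.
        "multi_zone_nodes": len(set(node_zones)) > 1,
    }


example = {
    "location": "us-central1",
    "locations": ["us-central1-a", "us-central1-b", "us-central1-c", "us-central1-f"],
}
print(summarize_ha(example))
```

A cluster whose location is `us-central1-a` with a single node zone would report `False` for both keys.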

2. Pod Disruption Budgets (PDBs)

PDBs ensure minimum pod availability during voluntary disruptions (node upgrades, autoscaler scale-down).

Check existing PDBs:

# MCP (preferred)
get_k8s_resource(parent="...", resourceType="poddisruptionbudget")

# kubectl fallback
kubectl get pdb --all-namespaces

Create PDB:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: default
spec:
  minAvailable: 2       # Keep below the replica count or node drains will block; alternatively use maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app

Every production Deployment with 2+ replicas should have a PDB.
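
A PDB only helps if it still permits at least one eviction; otherwise node drains stall. The sketch below illustrates the arithmetic for integer values only (percentage values are not handled) and assumes all replicas are healthy. `allowed_disruptions` is a hypothetical helper name, not a Kubernetes API.

```python
def allowed_disruptions(replicas, min_available=None, max_unavailable=None):
    """Pods that may be evicted at once when every replica is healthy."""
    # A PDB sets exactly one of the two fields.
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        return max(0, replicas - min_available)
    return min(replicas, max_unavailable)


# 3 replicas with minAvailable: 2 leaves room for one eviction at a time.
print(allowed_disruptions(3, min_available=2))
# 2 replicas with minAvailable: 2 leaves none, so a drain would block.
print(allowed_disruptions(2, min_available=2))
```

For a two-replica Deployment, `maxUnavailable: 1` is usually the safer choice under this reasoning, since it always leaves one eviction available.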

3. Health Probes

Every production container should have liveness and readiness probes. Startup probes are recommended for slow-starting apps.

Check existing probes:

# MCP (preferred)
describe_k8s_resource(parent="...", resourceType="deployment", name="<APP>", namespace="<NS>")

# kubectl fallback
kubectl get deployment <APP> -n <NS> -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe"

Recommended probe configuration:

spec:
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /readyz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 3
    startupProbe:             # For slow-starting apps
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 30    # 30 * 5s = up to ~150s of probing after the 10s initial delay

Readiness: Determines when a pod can accept traffic
Liveness: Determines when to restart a container
Startup: Holds off liveness and readiness probes until the startup probe succeeds (prevents premature restarts of slow-starting apps)
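
To see how the probe fields combine into time budgets, the sketch below computes two rough figures from the values in the recommended configuration. Both helpers are hypothetical and the results are approximations: they ignore `timeoutSeconds` and scheduling jitter. The fallback values are the Kubernetes defaults (initial delay 0, period 10, failure threshold 3).

```python
def startup_budget_seconds(probe):
    """Approximate upper bound on startup time before the container is restarted."""
    return (probe.get("initialDelaySeconds", 0)
            + probe.get("failureThreshold", 3) * probe.get("periodSeconds", 10))


def liveness_detection_seconds(probe):
    """Approximate time for consecutive failures to trigger a restart."""
    return probe.get("failureThreshold", 3) * probe.get("periodSeconds", 10)


startup = {"initialDelaySeconds": 10, "periodSeconds": 5, "failureThreshold": 30}
liveness = {"initialDelaySeconds": 15, "periodSeconds": 10, "failureThreshold": 3}

print(startup_budget_seconds(startup))       # 10 + 30 * 5
print(liveness_detection_seconds(liveness))  # 3 * 10
```

If an app routinely needs longer to start than the first figure, raising `failureThreshold` on the startup probe is generally preferable to loosening the liveness probe.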

4. Graceful Shutdown

Ensure applications handle SIGTERM and drain in-flight requests:

spec:
  terminationGracePeriodSeconds: 30    # Default; increase for long-running requests
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]  # Allow LB to deregister
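
On the application side, handling SIGTERM usually means setting a flag and letting the request loop finish its current work. The following is a minimal sketch of that pattern, not the skill's prescribed implementation; `serve`, `install_handler`, and `drain_budget_seconds` are hypothetical names. The budget helper reflects that the preStop hook runs inside the termination grace period, so its sleep reduces the time left for draining.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: record the request and let the main loop exit cleanly."""
    shutdown_requested.set()


def install_handler():
    """Register the handler; call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain_budget_seconds(grace_period, pre_stop_sleep):
    """Time left for draining after the preStop hook has used its share."""
    return max(0, grace_period - pre_stop_sleep)


def serve(handle_next_request):
    """Process requests until shutdown is requested; return the count handled.

    The flag is checked only between requests, so a request that has
    started always runs to completion.
    """
    handled = 0
    while not shutdown_requested.is_set():
        handle_next_request()
        handled += 1
    return handled
```

With the manifest above (30s grace period, 5s preStop sleep), `drain_budget_seconds(30, 5)` leaves roughly 25 seconds for in-flight requests before the container is killed.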

5. Topology Spread Constraints

Distribute pods across zones and nodes to survive failures:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-app

Zone spread (DoNotSchedule): Hard requirement -- pods must be balanced across zones
Node spread (ScheduleAnyway): Best-effort -- prefer distribution but don't block scheduling
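
To check whether a running workload actually satisfies `maxSkew: 1`, compare pod counts per zone. The sketch below assumes you have already collected the zone of each pod's node (for example from node labels); `zone_skew` is a hypothetical helper that computes the difference between the fullest and emptiest eligible zone.

```python
from collections import Counter


def zone_skew(pod_zones, eligible_zones):
    """Return max minus min pod count across the eligible zones."""
    if not eligible_zones:
        raise ValueError("eligible_zones must not be empty")
    counts = Counter(pod_zones)
    # Zones with no pods count as 0, which is what makes skew grow.
    per_zone = [counts[zone] for zone in eligible_zones]
    return max(per_zone) - min(per_zone)


zones = ["us-central1-a", "us-central1-b", "us-central1-c"]
balanced = ["us-central1-a", "us-central1-a", "us-central1-b", "us-central1-c"]
lopsided = ["us-central1-a", "us-central1-a", "us-central1-a"]

print(zone_skew(balanced, zones))  # within maxSkew: 1
print(zone_skew(lopsided, zones))  # violates maxSkew: 1
```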

6. Replicas

Workload Type	Minimum Replicas	Reason
Stateless web/API	2	Survive single pod/node failure
Critical services	3	Survive zone failure with zone spread
Stateful (databases)	3 (with replication)	Application-level quorum
Batch/jobs	1	Ephemeral by nature
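
The replica minimums follow from simple worst-case arithmetic. As a rough sketch, assuming pods are spread as evenly as possible across zones (as the zone constraint with `maxSkew: 1` aims for), the hypothetical helper below estimates how many replicas remain when the most heavily loaded zone fails.

```python
import math


def replicas_after_zone_loss(replicas, zones):
    """Worst-case replicas left when one zone fails, assuming an even spread."""
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # With an even spread, the fullest zone holds ceil(replicas / zones) pods.
    largest_zone_share = math.ceil(replicas / zones)
    return replicas - largest_zone_share


print(replicas_after_zone_loss(3, 3))  # critical service across 3 zones
print(replicas_after_zone_loss(2, 3))  # stateless minimum across 3 zones
print(replicas_after_zone_loss(3, 1))  # no zone spread at all
```

Under these assumptions, three replicas across three zones keep two pods serving after a zone failure, while the same three replicas in a single zone keep none.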

Best Practices & Production Guidelines

Regional clusters for production: Always use regional clusters to survive zone failures.
PDBs for everything: Every production workload with 2+ replicas needs a PodDisruptionBudget (PDB) to protect against voluntary disruptions.
Probes with explicit timeouts: Every production container must define both liveness and readiness probes. Set initialDelaySeconds, periodSeconds, and timeoutSeconds explicitly on every probe. If the application needs longer than the Kubernetes default timeout of 1 second, raise timeoutSeconds, but keep it bounded so that a hung connection still fails the probe.
Zone spreading: Use topology spread constraints to distribute pods across failure domains (zones and nodes).
Graceful shutdown: Handle SIGTERM and set appropriate terminationGracePeriodSeconds with a preStop sleep hook to allow load balancer deregistration.
Maintenance windows: Schedule upgrades during low-traffic periods (see the gke-upgrades skill).
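
The probe guideline above lends itself to an automated check. The sketch below is one possible audit over a Deployment manifest loaded as a dictionary (for instance from the JSON output of a describe call); `audit_probes` and the finding messages are hypothetical, not part of any GKE or Kubernetes tooling.

```python
REQUIRED_PROBES = ("livenessProbe", "readinessProbe")
EXPLICIT_FIELDS = ("initialDelaySeconds", "periodSeconds", "timeoutSeconds")


def audit_probes(deployment):
    """List containers that lack required probes or explicit timing fields."""
    findings = []
    containers = (deployment.get("spec", {}).get("template", {})
                  .get("spec", {}).get("containers", []))
    for container in containers:
        name = container.get("name", "<unnamed>")
        for probe_key in REQUIRED_PROBES:
            probe = container.get(probe_key)
            if probe is None:
                findings.append(f"{name}: missing {probe_key}")
                continue
            for field in EXPLICIT_FIELDS:
                if field not in probe:
                    findings.append(f"{name}: {probe_key} lacks explicit {field}")
    return findings


deployment = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "app",
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": 8080},
            "initialDelaySeconds": 15,
            "periodSeconds": 10,
        },
    }]}}}
}
for finding in audit_probes(deployment):
    print(finding)
```

An empty result means every container defines both probes with all three timing fields set explicitly.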

Overall Score: 82/100
Grade: B (Good)
Safety: 85
Quality: 80
Clarity: 85
Completeness: 78

Summary

This skill provides guidance for configuring high availability and reliability for GKE workloads using Pod Disruption Budgets, health probes, topology spread constraints, and regional cluster setup. It references MCP tools for cluster queries and manifest application, offering both preferred MCP workflows and gcloud/kubectl fallbacks.

Detected Capabilities

kubernetes-resource-query
kubernetes-manifest-application
cluster-configuration-inspection
workload-health-probe-configuration

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

pod disruption budgets
gke health probes
graceful shutdown
topology spread constraints
workload reliability
replica configuration

Risk Signals

INFO: References MCP tools (get_cluster, get_k8s_resource, describe_k8s_resource, apply_k8s_manifest, list_k8s_events) with clear boundaries (Frontmatter and Workflows sections)
INFO: Provides gcloud and kubectl fallbacks as alternatives to MCP (All workflow sections)
INFO: Scope is explicitly bounded to GKE reliability configuration; explicitly excludes disaster recovery and cluster backups (Description field)

Referenced Domains

External domains referenced in skill content, detected by static analysis.

www.apache.org

Use Cases

Verify GKE cluster regional high availability setup
Create and manage Pod Disruption Budgets for production workloads
Configure liveness, readiness, and startup probes for containers
Implement topology spread constraints for zone-aware pod distribution
Design replica counts for stateless, stateful, and batch workloads
Set up graceful shutdown handling with SIGTERM and preStop hooks

Quality Notes

Strength: Well-organized with clear sections for each reliability concern (HA verification, PDBs, health probes, graceful shutdown, topology spread, replicas)
Strength: Includes comprehensive golden path defaults table with documented rationale for each setting
Strength: Provides both MCP (preferred) and fallback (gcloud/kubectl) command options for all workflows
Strength: Concrete YAML examples for every configuration type with detailed comments explaining key fields
Strength: Best practices section clearly states production guidelines with explicit emphasis on probe configuration requirements
Strength: References external skill (gke-upgrades) appropriately for related tasks, showing good scope boundaries
Weakness: Limited guidance on probe timeout tuning for different application types (only one example provided)
Weakness: Does not discuss monitoring/alerting for PDB violations or probe failures
Weakness: Graceful shutdown section is minimal; lacks guidance on validating SIGTERM handling or common implementation patterns
Weakness: No guidance on testing topology spread constraints or verifying pod distribution across zones

Model: claude-haiku-4-5-20251001
Analyzed: Jun 24, 2026
