GKE Reliability
This reference covers high availability and reliability configuration for GKE clusters and workloads.
MCP Tools: `get_cluster`, `get_k8s_resource`, `describe_k8s_resource`, `apply_k8s_manifest`, `list_k8s_events`
Golden Path Reliability Defaults
| Setting | Golden Path Value | Notes |
|---|---|---|
| Cluster type | Regional (4 zones: us-central1-a/b/c/f) | Control plane replicated across zones |
| Upgrade strategy | SURGE (maxSurge: 1) | Rolling upgrades with extra capacity |
| Auto-repair | true | Unhealthy nodes replaced automatically |
| Auto-upgrade | true | Nodes follow control plane version |
| Release channel | REGULAR | Balanced freshness and stability |
| Stateful HA | Enabled | Leader election for stateful workloads |
Workflows
1. Verify Cluster High Availability
# MCP (preferred)
get_cluster(name="projects/<PROJECT>/locations/<REGION>/clusters/<CLUSTER>",
            readMask="location,locations,nodePools.locations")
# gcloud fallback
gcloud container clusters describe <CLUSTER> --region <REGION> \
  --format="json(location, locations)" \
  --quiet
- If `location` is a region (e.g., `us-central1`), the control plane is regional
- If `locations` has multiple entries, nodes span multiple zones
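The two checks above can be scripted against the JSON printed by the gcloud fallback. The following is a minimal sketch, not part of any GKE tooling: `classify_cluster` is a hypothetical helper, and the region pattern is an assumption based on the usual naming shape (`us-central1` for a region, `us-central1-a` for a zone).

```python
import json
import re

# Assumption: region names look like "us-central1"; zone names add a
# trailing "-<letter>" (e.g. "us-central1-a") and therefore do not match.
REGION_RE = re.compile(r"[a-z]+-[a-z]+\d+")


def classify_cluster(description: dict) -> dict:
    """Summarize HA properties from a `clusters describe` JSON document."""
    location = description.get("location", "")
    node_zones = description.get("locations", [])
    return {
        "regional_control_plane": REGION_RE.fullmatch(location) is not None,
        "multi_zone_nodes": len(set(node_zones)) > 1,
    }


# Sample input shaped like the output of the gcloud command (values are illustrative).
sample = json.loads(
    '{"location": "us-central1",'
    ' "locations": ["us-central1-a", "us-central1-b", "us-central1-c"]}'
)
print(classify_cluster(sample))
```

A cluster that reports a zone in `location` and a single entry in `locations` would come back with both flags set to `False`.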
2. Pod Disruption Budgets (PDBs)
PDBs ensure minimum pod availability during voluntary disruptions (node upgrades, autoscaler scale-down).
Check existing PDBs:
# MCP (preferred)
get_k8s_resource(parent="...", resourceType="poddisruptionbudget")
# kubectl fallback
kubectl get pdb --all-namespaces
Create PDB:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: default
spec:
  minAvailable: 2 # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
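Every production Deployment with 2+ replicas should have a PDB. Keep `minAvailable` below the replica count (or use `maxUnavailable`) so that node drains can still evict pods; `minAvailable: 2` on a 2-replica Deployment allows no voluntary disruptions at all.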
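To see how a PDB setting interacts with the replica count, the arithmetic can be sketched as below. This is an illustrative calculation, not the disruption controller's code: `allowed_disruptions` is a hypothetical helper that assumes integer values (no percentages) and that all replicas are currently healthy.

```python
from typing import Optional


def allowed_disruptions(
    healthy_pods: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Return how many pods may be evicted voluntarily right now.

    Assumes integer PDB values and that every desired replica is healthy,
    so healthy_pods equals the replica count.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        required_healthy = min_available
    else:
        required_healthy = max(0, healthy_pods - max_unavailable)
    return max(0, healthy_pods - required_healthy)


print(allowed_disruptions(2, min_available=2))    # 0: node drains are blocked
print(allowed_disruptions(3, min_available=2))    # 1
print(allowed_disruptions(2, max_unavailable=1))  # 1
```

The first case is the one to watch for: a PDB that permits zero disruptions stalls voluntary evictions during node upgrades.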
3. Health Probes
Every production container should have liveness and readiness probes. Startup probes are recommended for slow-starting apps.
Check existing probes:
# MCP (preferred)
describe_k8s_resource(parent="...", resourceType="deployment", name="<APP>", namespace="<NS>")
# kubectl fallback
kubectl get deployment <APP> -n <NS> -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe"
Recommended probe configuration:
spec:
  containers:
    - name: app
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
        timeoutSeconds: 2
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 3
      startupProbe: # For slow-starting apps
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 30 # 30 * 5s = 150s of probing, after the 10s initial delay
- Readiness: Determines when a pod can accept traffic
- Liveness: Determines when to restart a container
- Startup: Holds off liveness and readiness checks until the startup probe succeeds (prevents premature restarts)
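The timing implied by the probe settings above can be worked out with a small calculation. This is a rough sketch: both helpers are hypothetical, and the results are approximations that ignore `timeoutSeconds` and kubelet scheduling jitter.

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time a container gets to start before it is restarted."""
    return initial_delay + period * failure_threshold


def liveness_detection_seconds(period: int, failure_threshold: int) -> int:
    """Approximate time to restart a hung container once liveness is active."""
    return period * failure_threshold


# Values taken from the recommended probe configuration.
print(startup_budget_seconds(initial_delay=10, period=5, failure_threshold=30))  # 160
print(liveness_detection_seconds(period=10, failure_threshold=3))                # 30
```

If an application can take longer than the startup budget to boot, raise `failureThreshold` on the startup probe rather than loosening the liveness probe.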
4. Graceful Shutdown
Ensure applications handle SIGTERM and drain in-flight requests:
spec:
  terminationGracePeriodSeconds: 30 # Default; increase for long-running requests
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"] # Allow LB to deregister
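On the application side, handling SIGTERM usually means flagging the shutdown and letting in-flight work finish. The sketch below shows one possible shape using only the Python standard library; the names (`shutdown_requested`, `install_handler`, `drain_budget_seconds`) are illustrative, and a real service would hook the flag into its own server loop.

```python
import signal
import threading

# Set once SIGTERM arrives; request-handling loops should check it.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: flag the shutdown instead of exiting immediately."""
    shutdown_requested.set()


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain_budget_seconds(grace_period: int, prestop_sleep: int) -> int:
    """Seconds left for draining once the preStop hook has finished.

    The preStop hook runs inside terminationGracePeriodSeconds, so its
    duration is subtracted from the time available after SIGTERM.
    """
    return max(0, grace_period - prestop_sleep)


# With the manifest values: 30s grace period, 5s preStop sleep.
print(drain_budget_seconds(30, 5))  # 25
```

Call `install_handler()` at startup, stop accepting new work once `shutdown_requested` is set, and size `terminationGracePeriodSeconds` so the drain budget covers the longest expected request.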
5. Topology Spread Constraints
Distribute pods across zones and nodes to survive failures:
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
- Zone spread (`DoNotSchedule`): Hard requirement -- pods must be balanced across zones
- Node spread (`ScheduleAnyway`): Best-effort -- prefer distribution but don't block scheduling
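The effect of `maxSkew: 1` with `DoNotSchedule` can be illustrated with a small check. This is a simplified model, not the scheduler's implementation: `can_schedule` is a hypothetical helper that ignores `minDomains`, node affinity, and taints, and the zone names are examples.

```python
from typing import Dict


def can_schedule(pods_per_zone: Dict[str, int], zone: str, max_skew: int = 1) -> bool:
    """Check whether adding one pod to `zone` keeps the skew within max_skew.

    Skew is the pod count in the target zone after placement minus the
    lowest current pod count across all eligible zones.
    """
    count_after = pods_per_zone[zone] + 1
    return count_after - min(pods_per_zone.values()) <= max_skew


current = {"us-central1-a": 2, "us-central1-b": 1, "us-central1-c": 1}
print(can_schedule(current, "us-central1-a"))  # False: 3 - 1 = 2 exceeds maxSkew
print(can_schedule(current, "us-central1-b"))  # True: 2 - 1 = 1 is within maxSkew
```

Under `ScheduleAnyway`, the same skew is only used to rank candidates, so a `False` result here would lower a node's priority instead of blocking the pod.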
6. Replicas
| Workload Type | Minimum Replicas | Reason |
|---|---|---|
| Stateless web/API | 2 | Survive single pod/node failure |
| Critical services | 3 | Survive zone failure with zone spread |
| Stateful (databases) | 3 (with replication) | Application-level quorum |
| Batch/jobs | 1 | Ephemeral by nature |
Best Practices & Production Guidelines
- Regional clusters for production: Always use regional clusters to survive zone failures.
- PDBs for everything: Every production workload with 2+ replicas needs a PodDisruptionBudget (PDB) to protect against voluntary disruptions.
- Probes with Explicit Timeouts: Every production container must have both liveness and readiness probes defined. Always explicitly define `initialDelaySeconds`, `periodSeconds`, and `timeoutSeconds` for all probes. Never rely on the Kubernetes default timeout of 1 second if your application requires more, but always set a strict limit to prevent hanging connections.
- Zone spreading: Use topology spread constraints to distribute pods across failure domains (zones and nodes).
- Graceful shutdown: Handle `SIGTERM` and set appropriate `terminationGracePeriodSeconds` with a `preStop` sleep hook to allow load balancer deregistration.
- Maintenance windows: Schedule upgrades during low-traffic periods (see the `gke-upgrades` skill).