GKE Upgrades & Maintenance

Produce clear, actionable documents — upgrade plans, runbooks, or checklists — tailored to the user's environment. Output should be specific to their cluster mode, release channel, version, and workload types rather than generic advice.

Always frame guidance around the auto-upgrade model: auto-upgrade with maintenance windows and exclusions is the preferred control mechanism.

Context Gathering

Before producing any upgrade artifact, establish:

Cluster mode — Standard or Autopilot? (Autopilot has no node pool management, mandatory resource requests, no SSH)
Current and target versions — Node version skew must be within 2 minor versions of control plane.
Release channel — Rapid, Regular, Stable, or Extended.
Environment topology & Rollout Sequencing — Single vs multi-cluster, dev/staging/prod tiers, and whether Rollout Sequencing is used.
Workload sensitivity — StatefulSets, databases, GPU, long-running batch need special handling.

If the user provides these upfront, skip straight to the deliverable. If they're vague, fill in reasonable defaults and flag assumptions.

Core Principles

GKE versions follow Kubernetes version terminology: Major.Minor.Patch (e.g., 1.30.1-gke.1187000). A Minor version bump (e.g., 1.29 → 1.30) introduces new features and APIs. A Patch version bump (e.g., 1.30.1 → 1.30.2) introduces security and bug fixes. Ensure the user understands this distinction.

Sequential control plane, skip-level node pools -- Control plane upgrades are sequential (N → N+1 → N+2). Node pools support skip-level (N+2) upgrades.
Control plane first -- Control plane must be upgraded before node pools. Nodes can trail by up to 2 minor versions.
Environment progression -- Always upgrade dev/staging before production. Use Rollout Sequencing (preferred) to automate and enforce this progression across environments (e.g., dev → staging → prod), or manually coordinate version progression if Rollout Sequencing is not used.
Workload-aware -- Upgrade strategy depends on what's running (stateless, stateful, GPU, batch).
Release channels first -- Always recommend release channels. Note that "No channel" (static versioning) is deprecated and clusters should be migrated to release channels.
Rollback/Downgrade -- Control plane patches and node pools (minor and patch versions) can be rolled back (downgraded to a target version). GKE supports a two-step control plane minor upgrade in which step 1 can be rolled back. Other control plane minor version rollbacks cannot be performed by the customer and require GKE Support.
Node pool upgrade ordering -- When upgrading multiple node pools, always recommend sequential ordering: upgrade non-critical/stateless pools first (acting as a canary) to verify cluster health before upgrading critical stateful (database) or GPU pools.
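
The version rules above can be sketched as two small checks. This is a minimal illustration, not a GKE API: the function names are hypothetical, and it assumes versions are strings in the Major.Minor.Patch form described earlier.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a GKE version such as '1.30.1-gke.1187000'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def control_plane_path(current: str, target: str) -> list[str]:
    """List the minor versions the control plane steps through, one at a time."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version are sketched")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


def node_skew_ok(control_plane: str, node: str, max_skew: int = 2) -> bool:
    """True when nodes are not newer than the control plane and trail by at most max_skew minors."""
    cp_major, cp_minor = parse_minor(control_plane)
    n_major, n_minor = parse_minor(node)
    return cp_major == n_major and 0 <= cp_minor - n_minor <= max_skew


# Control plane 1.28 -> 1.30 goes through 1.29 first; nodes may then skip from 1.28 to 1.30.
print(control_plane_path("1.28.5-gke.100", "1.30.1-gke.1187000"))
print(node_skew_ok("1.30.1-gke.1187000", "1.28.5-gke.100"))
```

A plan generator could call node_skew_ok before proposing an exclusion that holds node pools back, since the skew limit still applies while nodes are frozen.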

Release Channels

Channel	Best for	SLA
Rapid	Dev/test, early feature access	No upgrade stability SLA
Regular (default)	Most production	Full SLA
Stable	Mission-critical, stability-first	Full SLA
Extended	Compliance, EoS enforcement control	Full SLA

Support Lifecycle

Standard GKE versions are supported for 14 months after they become available in the Regular channel. This means:

Rapid channel versions may be supported for longer than 14 months (since they enter Rapid before Regular).
Stable channel versions may be supported for less than 14 months (since they enter Stable after Regular).
Extended support extends this period up to 24 months. Note that extra cost applies only during the extended support period (months 15-24).
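
The lifecycle arithmetic can be illustrated with a short date calculation. This is a rough sketch only: the helper names and the sample date are hypothetical, and the authoritative end-of-support dates are the ones published in the GKE release schedule, which may not fall on exact month boundaries.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to 28 so the result is always valid."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, min(d.day, 28))


def support_dates(regular_available: date) -> dict[str, date]:
    """Approximate support end dates, counted from availability in the Regular channel."""
    return {
        "standard_end": add_months(regular_available, 14),
        "extended_end": add_months(regular_available, 24),
    }


# Hypothetical minor version that reached Regular on 2024-06-15.
print(support_dates(date(2024, 6, 15)))
```

Because the clock starts at Regular availability, a cluster that picked the version up earlier in Rapid sees more than 14 months of support, and one that received it later in Stable sees less.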

Maintenance Windows & Exclusions

Configure maintenance windows to control auto-upgrade timing. GKE also supports node pool level maintenance exclusions (in addition to cluster level) to block upgrades for specific workloads.

Exclusion types & Limits:

"No upgrades" (Scope: no_upgrades): Blocks all upgrades (minor, patch, node).
- Limit: Max 90 days of total exclusion duration in any rolling 365-day window.
- Chaining constraint: Because of the rolling 365-day limit, you cannot chain multiple exclusions to cover a continuous period longer than 90 days (e.g., you cannot cover a 100-day freeze using no_upgrades).
"No minor or node upgrades" (Scope: no_minor_or_node_upgrades): Blocks minor and node upgrades, but allows control plane patch upgrades (low risk).
- Limit: Up to 180 days per exclusion. Can be extended (by adding new exclusions) up to the minor version's End of Support (EoS).
"No minor upgrades" (Scope: no_minor_upgrades): Blocks minor upgrades, but allows control plane patches and node upgrades.
- Limit: Up to 180 days per exclusion. Can be extended up to EoS.
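
To make the limits above concrete, the following sketch tracks how much of the no_upgrades budget has been used and splits a long freeze into consecutive exclusions. The constants restate the limits listed above. The function names are hypothetical, and the budget calculation assumes the supplied exclusions do not overlap each other.

```python
from datetime import date, timedelta

NO_UPGRADES_BUDGET_DAYS = 90   # total no_upgrades days allowed per rolling window
PER_EXCLUSION_MAX_DAYS = 180   # cap for a single no_minor_* exclusion


def no_upgrades_days_used(exclusions, as_of, window_days=365):
    """Days of no_upgrades exclusion that fall inside the window ending at as_of."""
    window_start = as_of - timedelta(days=window_days)
    used = 0
    for start, end in exclusions:
        overlap_start = max(start, window_start)
        overlap_end = min(end, as_of)
        if overlap_end > overlap_start:
            used += (overlap_end - overlap_start).days
    return used


def split_long_freeze(start, end, max_days=PER_EXCLUSION_MAX_DAYS):
    """Split a freeze into back-to-back chunks of at most max_days each."""
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + timedelta(days=max_days), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


past = [(date(2025, 1, 1), date(2025, 1, 31)), (date(2025, 6, 1), date(2025, 7, 1))]
used = no_upgrades_days_used(past, as_of=date(2025, 12, 1))
print("no_upgrades days left:", NO_UPGRADES_BUDGET_DAYS - used)
print(split_long_freeze(date(2025, 1, 1), date(2025, 9, 1)))
```

The chunking only applies to the no_minor_or_node_upgrades and no_minor_upgrades scopes, and each chunk must still end on or before the minor version's End of Support.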

Important Exclusion Rules (MUST follow when recommending exclusions and MUST include in the final text response):

Auto-upgrades only: Maintenance exclusions only block automatic upgrades. Manual upgrades initiated by the user will bypass exclusions. You MUST explain this to the user.
Warn against "No channel": You MUST explicitly warn that disabling release channels ("No channel" / static versioning) is deprecated and must not be used as a replacement for exclusions.
Compare Scopes: You MUST explain the difference between 'No upgrades' (limitations, blocks patches) and 'No minor or node upgrades' (allows patches, longer duration). Recommend 'No minor or node upgrades' when the user wants to allow security patches/fixes while blocking minor version jumps.
Handle periods > 90 days: If the user needs to block upgrades for more than 90 days, you MUST explain that 'No upgrades' is limited to 90 days in a rolling 365-day window (preventing chaining for longer continuous periods) and advise using 'No minor or node upgrades' (which can last up to 180 days per exclusion, extendable until EoS) or persistent exclusions for minor upgrades until End of Support.
Version skew: Be mindful of version skew (between control plane and node pools) when using exclusions. Ensure skew does not exceed the supported 2 minor versions. Use --add-maintenance-exclusion-until-end-of-support for persistent exclusions.
Correct gcloud syntax: When providing gcloud commands for exclusions, you MUST use the separate flag syntax: --add-maintenance-exclusion-name, --add-maintenance-exclusion-start, --add-maintenance-exclusion-end (or --add-maintenance-exclusion-until-end-of-support), and --add-maintenance-exclusion-scope (do NOT use a single comma-separated --add-maintenance-exclusion flag).
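
The separate-flag syntax can be illustrated with a small command builder. It only assembles a command string and does not run gcloud. The cluster name, region, exclusion name, and timestamps are hypothetical, and the --region flag assumes a regional cluster (a zonal cluster would use --zone instead).

```python
import shlex


def exclusion_command(cluster, region, name, start, scope, end=None):
    """Build (not run) a gcloud command that uses the separate exclusion flags."""
    args = [
        "gcloud", "container", "clusters", "update", cluster,
        "--region", region,
        "--add-maintenance-exclusion-name", name,
        "--add-maintenance-exclusion-start", start,
        "--add-maintenance-exclusion-scope", scope,
    ]
    if end is None:
        # Persistent exclusion: lasts until the minor version reaches End of Support.
        args.append("--add-maintenance-exclusion-until-end-of-support")
    else:
        args += ["--add-maintenance-exclusion-end", end]
    return shlex.join(args)


print(exclusion_command(
    "prod-cluster", "us-central1", "holiday-freeze",
    "2025-11-20T00:00:00Z", "no_minor_or_node_upgrades",
    end="2025-12-05T00:00:00Z",
))
```

Generating the command from structured inputs keeps the scope value and the end condition explicit, which makes it harder to slip back into the single comma-separated form.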

Mandatory Upgrade Overrides

GKE reserves the right to override user-defined maintenance windows and exclusions for mandatory operations. These overrides cannot be disabled or blocked.

Common Override Scenarios:

Critical Security Patches: Urgent vulnerability fixes that must be applied immediately to protect infrastructure.
End of Support (EoS) / End of Life (EOL) Enforcement: If a cluster is running an unsupported version, GKE will force upgrade it to a supported version.
Expiring Certificates: If control plane certificates (CAs) are expiring (within 30 days) and rotation is required to prevent cluster unrecoverability.
Maintenance Starvation: GKE requires at least 48 hours of maintenance availability in any rolling 32-day window. If exclusions block too much, GKE may force an upgrade.
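
The maintenance starvation threshold can be estimated with interval arithmetic. The sketch below is an approximation under stated assumptions: the function name is hypothetical, the caller supplies only the maintenance windows that fall inside the 32-day period being examined, and GKE's own accounting may differ in detail.

```python
from datetime import datetime, timedelta

REQUIRED_HOURS = 48  # minimum availability per rolling 32-day window, from the rule above


def available_hours(windows, exclusions):
    """Hours of maintenance-window time not covered by any exclusion."""
    total = timedelta()
    for w_start, w_end in windows:
        free = [(w_start, w_end)]  # start with the whole window, then carve out exclusions
        for e_start, e_end in exclusions:
            next_free = []
            for s, e in free:
                if e_end <= s or e_start >= e:
                    next_free.append((s, e))        # no overlap with this exclusion
                    continue
                if s < e_start:
                    next_free.append((s, e_start))  # part before the exclusion
                if e_end < e:
                    next_free.append((e_end, e))    # part after the exclusion
            free = next_free
        total += sum((e - s for s, e in free), timedelta())
    return total.total_seconds() / 3600


# Five weekly 12-hour windows inside a 32-day period, with one 9-day exclusion.
windows = [
    (datetime(2025, 3, 1, 2) + timedelta(weeks=i), datetime(2025, 3, 1, 14) + timedelta(weeks=i))
    for i in range(5)
]
exclusions = [(datetime(2025, 3, 7), datetime(2025, 3, 16))]
hours = available_hours(windows, exclusions)
print(hours, "hours available;", "OK" if hours >= REQUIRED_HOURS else "below the 48-hour minimum")
```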

Guidance (MUST follow when overrides are discussed):

Correlate with Bulletins: If GKE performs an unexpected upgrade, you MUST explicitly suggest checking GKE Release Notes or Security Bulletins to correlate the event with emergency patches (do not just suggest checking Cloud Audit Logs).
Design for Resilience: Workloads must be designed to survive unexpected control plane or node rotation. You MUST recommend:
- Regional clusters (replicated control plane) to ensure API availability during control plane upgrades.
- Multi-zone workload deployments.
- Replicas > 1 for critical deployments.
- Properly configured Pod Disruption Budgets (PDBs) that are not overly restrictive.

Upgrade Planning

When asked to plan an upgrade, produce a structured document covering:

Version compatibility (breaking changes, deprecated APIs) (minor version upgrades only)
Upgrade path (sequential minor version upgrades) (minor version upgrades only)
Node pool upgrade strategy (Standard only)
Workload readiness (PDBs, resource requests)
Rollback/Contingency procedure (how to revert node pools or coordinate with GKE Support for a control plane rollback)

Compatibility Search Rule:

If compatibility information (e.g., third-party operator compatibility, GPU driver/CUDA compatibility matrix) is not immediately available in the workspace or via a quick web search, do NOT loop or make multiple search attempts. Instead, list the compatibility verification as a critical pre-upgrade action item for the user in the checklist.

Node Pool Strategy (Standard Only)

Recommend Surge upgrade as the default and most common strategy, with per-pool settings:

Stateless: Higher maxSurge (2-3) for speed, maxUnavailable=0 for safety.
Stateful/DB: maxSurge=1, maxUnavailable=0 (conservative).
GPU (fixed reservation): maxSurge=0, maxUnavailable=1 (no surge capacity).
Large (50+ nodes): maxSurge=20, maxUnavailable=0 (max parallelism).
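
The per-pool settings above can be captured in a lookup table that feeds a command builder. This is a sketch: the profile keys, cluster name, pool name, and region are hypothetical, the stateless value of 2 is one choice from the 2-3 range, and the function only assembles a command string without running gcloud.

```python
import shlex

# (maxSurge, maxUnavailable) per workload profile, mirroring the list above.
SURGE_PROFILES = {
    "stateless": (2, 0),
    "stateful": (1, 0),
    "gpu_fixed_reservation": (0, 1),
    "large": (20, 0),
}


def surge_command(cluster, pool, region, profile):
    """Build (not run) the node pool update that applies a profile's surge settings."""
    max_surge, max_unavailable = SURGE_PROFILES[profile]
    return shlex.join([
        "gcloud", "container", "node-pools", "update", pool,
        "--cluster", cluster,
        "--region", region,
        "--max-surge-upgrade", str(max_surge),
        "--max-unavailable-upgrade", str(max_unavailable),
    ])


print(surge_command("prod-cluster", "gpu-pool", "us-central1", "gpu_fixed_reservation"))
```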

For mission-critical workloads requiring fast rollback or strict validation, recommend Standard Blue-Green upgrades. Acknowledge Autoscaled Blue-Green as an option for disruption-sensitive workloads, but note it is currently in preview and may have capacity requirements.

Upgrade Ordering (User-initiated only): When planning manual upgrades, specify the sequence of node pool upgrades. Recommend upgrading stateless pools first, verifying cluster stability, and then upgrading stateful/GPU pools. For auto-upgrades, GKE automatically manages sequential node pool upgrades.

For standard command sequences and runbook templates, see references/runbook-template.md.

Large-Scale AI/ML Clusters (GPU/TPU)

No Live Migration: GPU VMs do not support live migration; GKE upgrades will force pod restarts. You MUST explain this.
Fixed Reservations & Quota: H100/A100 typically use fixed reservations with no spare quota.
- Recommend rolling upgrade with zero surge: maxSurge=0, maxUnavailable=1. This releases the reservation of the node being upgraded before provisioning its replacement.
- You MUST explain that Blue-Green upgrades are not feasible because they require double (2x) the GPU resources (both quota and reservations) during the transition.
Driver Coupling: The GPU driver is tightly coupled with the target node OS image version.
- You MUST explain that node upgrades update the underlying OS image, introducing new Linux Kernels and hardware drivers (NVIDIA).
- You MUST warn that driver updates can break CUDA compatibility.
- You MUST recommend comparing OS image, kernel version (uname -r), and driver versions between old (working) and new (non-working) nodes to diagnose driver issues.
- You MUST recommend deploying a test pod (e.g., vector addition) to verify GPU access.
- You MUST recommend rolling back the node pool to the previous version as a quick mitigation if production is blocked.
- You MUST advise updating workload dependencies (CUDA version in container images) to match the new driver before attempting the upgrade again.
- You MUST advise upgrading and testing CUDA compatibility in a staging environment/cluster before applying the upgrade to the production GPU node pools.
Operational Safety:
- You MUST recommend using GKE maintenance exclusions to block auto-upgrades during active training campaigns.
- Prior to manual upgrades, cordon GPU nodes and wait for active training jobs to checkpoint/complete.
TPU Considerations: TPU slices are recreated atomically (not rolling); maintenance on one slice restarts all slices in the environment.
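
The old-versus-new node comparison can be partly automated from node objects. The sketch below assumes each input is a parsed node as returned by kubectl get node NAME -o json. The function name and sample values are hypothetical. Note that the NVIDIA driver version is not part of nodeInfo, so it still has to be checked separately on the node.

```python
def node_info_diff(old_node: dict, new_node: dict) -> dict:
    """Return the nodeInfo fields whose values differ, as {field: (old, new)}."""
    fields = ("osImage", "kernelVersion", "kubeletVersion", "containerRuntimeVersion")
    old_info = old_node["status"]["nodeInfo"]
    new_info = new_node["status"]["nodeInfo"]
    return {
        f: (old_info.get(f), new_info.get(f))
        for f in fields
        if old_info.get(f) != new_info.get(f)
    }


# Hypothetical values for a working (old) node and a failing (new) node.
old = {"status": {"nodeInfo": {
    "osImage": "image-a", "kernelVersion": "5.15.0",
    "kubeletVersion": "v1.29.1", "containerRuntimeVersion": "containerd://1.7.0",
}}}
new = {"status": {"nodeInfo": {
    "osImage": "image-a", "kernelVersion": "6.1.0",
    "kubeletVersion": "v1.30.1", "containerRuntimeVersion": "containerd://1.7.0",
}}}
print(node_info_diff(old, new))
```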

Checklists

Produce checklists as copyable markdown with checkboxes. See references/checklists.md for the full pre-upgrade and post-upgrade checklist templates. Adapt them to the user's environment.

Stateful Workloads: When stateful workloads (databases) are present, always include checks for PV backup completion and verification of PV reclaim policies (e.g., Retain vs Delete) in the pre-upgrade checklist.

Autopilot Checklists: For Autopilot clusters, ensure the checklists include:

Verification of resources.requests on all containers (Autopilot requirement).
You MUST include specific kubectl commands for API deprecation checks, specifically: kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis to check whether any clients have recently called deprecated APIs.
Verifying PDBs to ensure they don't block node drain (even though GKE manages nodes, PDBs are still respected).
Identifying and deleting "bare pods" (pods not managed by a ReplicaSet/Deployment/StatefulSet) as they won't be rescheduled during node recreation.
Verification of terminationGracePeriodSeconds to ensure pods have enough time to shut down gracefully during node recreation.
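
Two of the checks above (bare pods and missing resources.requests) can be run against a pod listing. This sketch assumes the input is the parsed output of kubectl get pods -A -o json. The function names and the sample data are hypothetical.

```python
def bare_pods(pod_list: dict) -> list[str]:
    """Pods with no ownerReferences, so no controller will recreate them."""
    return [
        f"{p['metadata']['namespace']}/{p['metadata']['name']}"
        for p in pod_list.get("items", [])
        if not p["metadata"].get("ownerReferences")
    ]


def containers_missing_requests(pod_list: dict) -> list[str]:
    """Containers whose spec carries no resources.requests."""
    missing = []
    for p in pod_list.get("items", []):
        for c in p["spec"].get("containers", []):
            if not c.get("resources", {}).get("requests"):
                missing.append(f"{p['metadata']['namespace']}/{p['metadata']['name']}:{c['name']}")
    return missing


# Hypothetical listing: one managed pod with requests, one bare pod without.
sample = {"items": [
    {"metadata": {"namespace": "web", "name": "api-7d9f", "ownerReferences": [{"kind": "ReplicaSet"}]},
     "spec": {"containers": [{"name": "api", "resources": {"requests": {"cpu": "250m"}}}]}},
    {"metadata": {"namespace": "ops", "name": "debug-shell"},
     "spec": {"containers": [{"name": "shell", "resources": {}}]}},
]}
print(bare_pods(sample))
print(containers_missing_requests(sample))
```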

Maintenance runbooks

Produce step-by-step runbooks with actual gcloud and kubectl commands. See references/runbook-template.md for the standard command sequences.

Maintenance Window Pauses

When diagnosing a "stuck" upgrade, consider if it was paused by a maintenance window:

Silent Pause Behavior: If a maintenance window closes before an upgrade (auto or manual) completes, GKE intentionally pauses the rollout to prevent disruption outside allowed times.
Mixed-Version State: The cluster is left in a stable mixed-version state (some nodes upgraded, some not). You MUST explicitly state that this is a supported and safe intended outcome.
Resumption: The upgrade will automatically resume when the next maintenance window opens.
Mitigation for immediate completion: If the user wants to complete the upgrade immediately, you MUST suggest temporarily widening the maintenance window to include the current time (e.g., using gcloud container clusters update ... --maintenance-window-start ... --maintenance-window-end ... --maintenance-window-recurrence ...). Do not suggest re-triggering the manual upgrade or bypassing the window.
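
A temporary window that contains the current time can be derived mechanically. The sketch below only builds a command string. It assumes a regional cluster, a UTC timestamp for "now", and the recurring-window flags --maintenance-window-start, --maintenance-window-end, and --maintenance-window-recurrence. The function name, cluster name, and the 8-hour default are hypothetical choices.

```python
import shlex
from datetime import datetime, timedelta, timezone


def widen_window_command(cluster, region, now, hours=8):
    """Build (not run) a command for a daily window that starts at the current hour."""
    start = now.replace(minute=0, second=0, microsecond=0)
    end = start + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # timestamps are assumed to be in UTC
    return shlex.join([
        "gcloud", "container", "clusters", "update", cluster,
        "--region", region,
        "--maintenance-window-start", start.strftime(fmt),
        "--maintenance-window-end", end.strftime(fmt),
        "--maintenance-window-recurrence", "FREQ=DAILY",
    ])


print(widen_window_command("prod-cluster", "us-central1",
                           datetime(2025, 5, 10, 13, 37, tzinfo=timezone.utc)))
```

Once the upgrade has finished, the original window should be restored so that later auto-upgrades return to the intended schedule.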

Troubleshooting

When a user reports a stuck or failing upgrade, you MUST systematically analyze and address ALL 5 potential causes in your final response. Do not omit checks even if you suspect one is the primary cause:

PDB blocking drain: Identify if any PDB has ALLOWED DISRUPTIONS = 0 using kubectl get pdb -A.
Resource constraints: Check if pods are stuck in Pending due to capacity limits.
Bare pods: Identify pods without owner references that are blocking the drain (recommend deleting them).
Admission webhooks: Check if Validating/Mutating webhooks are rejecting pod creation on new nodes.
PVC attachment issues: Check for volume attachment failures (especially zone constraints).
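
The PDB check can also be scripted against structured output rather than read from the table view. This sketch assumes the input is the parsed output of kubectl get pdb -A -o json and relies on the status.disruptionsAllowed field. The function name and sample data are hypothetical.

```python
def blocking_pdbs(pdb_list: dict) -> list[str]:
    """PDBs that currently allow zero disruptions and can therefore stall a node drain."""
    return [
        f"{item['metadata']['namespace']}/{item['metadata']['name']}"
        for item in pdb_list.get("items", [])
        if item.get("status", {}).get("disruptionsAllowed", 0) == 0
    ]


# Hypothetical listing: one PDB with headroom, one that blocks every eviction.
sample = {"items": [
    {"metadata": {"namespace": "web", "name": "api-pdb"}, "status": {"disruptionsAllowed": 1}},
    {"metadata": {"namespace": "db", "name": "postgres-pdb"}, "status": {"disruptionsAllowed": 0}},
]}
print(blocking_pdbs(sample))
```

A PDB with no status yet is treated as blocking here, which errs on the side of flagging it for review.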

Stockout / Quota Exhaustion Rule:

If the upgrade is stuck due to ZONE_RESOURCE_POOL_EXHAUSTED (stockout) or QUOTA_EXCEEDED for Compute Engine resources:
1. Recommend modifying the upgrade strategy to maxSurge=0, maxUnavailable=1 (rolling in-place) so the upgrade does not need extra surge capacity or quota.
2. For QUOTA_EXCEEDED, suggest requesting a quota increase from Google Cloud.
3. You MUST suggest migrating workloads or creating new node pools in a different zone or region where capacity/quota is available as a mitigation.

Refer to references/troubleshooting.md for the exact diagnostic commands and fix procedures for each step.


Overall Score: 88/100 (Grade A, Excellent)

Safety: 88
Quality: 89
Clarity: 87
Completeness: 85

Summary

This is a specialized GKE cluster upgrade planning and execution skill that guides agents through producing upgrade plans, checklists, runbooks, and troubleshooting guides for Google Kubernetes Engine clusters. It covers both Standard and Autopilot modes, handles version compatibility, node pool strategies (surge, blue-green), maintenance windows, and includes comprehensive troubleshooting for stuck upgrades. The skill emphasizes auto-upgrade with maintenance windows as the control mechanism and provides domain-specific expertise on GPU/TPU considerations, release channels, and workload-aware upgrade strategies.

Detected Capabilities

gcloud command generation and execution (cluster updates, node pool management, operations monitoring)
kubectl command generation for diagnostics (PDB inspection, pod status, metrics, API deprecation checks)
Infrastructure knowledge synthesis (version compatibility, release channels, maintenance windows, node version skew constraints)
Documentation generation (upgrade plans, checklists, runbooks, troubleshooting guides)
Conditional guidance based on cluster topology (Standard vs Autopilot, single vs multi-cluster, GPU/TPU workloads)

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

gke cluster upgrade
kubernetes version bump
node pool maintenance
gke patching
upgrade stuck
maintenance window
blue-green deployment
gpu driver compatibility
rollout sequencing
gke release channel

Referenced Domains

External domains referenced in skill content, detected by static analysis.

cloud.google.com
www.apache.org

Use Cases

Plan and sequence GKE cluster upgrades across dev/staging/prod environments with Rollout Sequencing or manual coordination
Create pre-upgrade and post-upgrade checklists tailored to cluster mode (Standard/Autopilot) and workload types (stateful, GPU, batch)
Generate upgrade runbooks with gcloud and kubectl commands for control plane and node pool upgrades
Diagnose and resolve stuck or failing upgrades by systematically checking PDBs, resource constraints, bare pods, webhooks, and PVC attachment issues
Recommend upgrade strategies (surge, blue-green) per node pool based on workload sensitivity and capacity
Configure maintenance windows and exclusions to control upgrade timing with proper scope and duration limits
Handle GPU/TPU cluster upgrades with driver compatibility analysis and validate CUDA/kernel version changes before production deployment
Troubleshoot mandatory GKE upgrade overrides (security patches, EoS enforcement, certificate expiration) and recommend resilience patterns

Quality Notes

Exceptionally well-scoped skill with clear boundaries: explicitly excludes cluster creation, application onboarding, networking, and security policies; directs users to other skills for out-of-scope tasks
Comprehensive coverage of edge cases: stateful workloads, GPU/TPU clusters, maintenance window pauses, stockout scenarios, certificate expiration, and admission webhook blocking
Excellent safety-first framing: repeatedly emphasizes cloud-safe patterns (maintenance windows as control, Rollout Sequencing for automation, resilience design principles like regional clusters and PDBs)
Strong pedagogical structure: core principles clearly stated before examples, mandatory rules bolded for emphasis (especially around exclusion scope limits and maintenance window syntax)
Well-integrated reference files: checklists and runbooks are modular and copyable, with placeholders for easy adaptation
Detailed troubleshooting methodology: requires systematic analysis of all 5 potential causes (PDB, resource constraints, bare pods, webhooks, PVC), preventing incomplete diagnostics
Specific gcloud/kubectl command syntax provided: separate flag syntax for maintenance exclusions, proper format for deprecation API checks on Autopilot, explicit node pool metadata labels
Proactive conflict avoidance: explicitly warns against using static versioning ('No channel') as a replacement for exclusions, explains the difference between exclusion scopes
GPU/TPU expertise demonstrated: explains live migration absence, fixed reservation patterns, driver coupling to OS images, CUDA compatibility verification in staging, rolling upgrade constraints
Instructions include both 'happy path' and recovery paths: normal upgrade sequences plus comprehensive stuck upgrade diagnostics, rollback guidance, and emergency procedures (credential rotation, DNS endpoint fallback)

Model: claude-haiku-4-5-20251001
Analyzed: Jun 19, 2026


gke-upgrades