GKE Workload Scaling
This reference covers scaling workloads on GKE. The golden path enables VPA, OPTIMIZE_UTILIZATION autoscaling profile, and Node Auto Provisioning by default.
MCP Tools:
get_k8s_resource,describe_k8s_resource,apply_k8s_manifest,patch_k8s_resource,get_cluster,update_cluster,update_node_pool
Golden Path Scaling Defaults
| Setting | Golden Path Value | Notes |
|---|---|---|
autoscaling.autoscalingProfile |
OPTIMIZE_UTILIZATION |
Aggressive scale-down for cost savings |
verticalPodAutoscaling.enabled |
true |
VPA recommendations available |
autoscaling.enableNodeAutoprovisioning |
true |
NAP creates node pools on demand |
| GPU resource limits (T4, A100) | 1000000000 each |
NAP can provision GPU nodes |
Scaling Mechanisms
1. Manual Scaling
kubectl-only — no MCP equivalent for
kubectl scale. Use kubectl directly.
kubectl scale deployment <DEPLOYMENT> --replicas=<N> -n <NAMESPACE>
2. Horizontal Pod Autoscaling (HPA)
Scales the number of pods based on metrics.
Quick setup (kubectl-only — no MCP equivalent for kubectl autoscale):
kubectl autoscale deployment <DEPLOYMENT> --cpu-percent=50 --min=1 --max=10
Manifest approach (recommended — use MCP apply_k8s_manifest):
See assets/hpa-example.yaml for a template.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: <DEPLOYMENT>-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: <DEPLOYMENT>
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50
3. Vertical Pod Autoscaling (VPA)
Adjusts CPU and memory requests to match actual usage. Enabled by default on golden path.
Update modes:
Off— recommendations only (safest, start here)Initial— sets resources only at pod creationAuto— restarts pods to apply new resource valuesInPlaceOrRecreate— updates resources without restart when possible (GKE 1.34+)
Create VPA in recommendation mode:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: <DEPLOYMENT>-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: <DEPLOYMENT>
updatePolicy:
updateMode: "Off"
Read recommendations (prefer MCP describe_k8s_resource):
# MCP (preferred)
describe_k8s_resource(parent="...", resourceType="verticalpodautoscaler", name="<DEPLOYMENT>-vpa", namespace="<NAMESPACE>")
# kubectl fallback
kubectl get vpa <DEPLOYMENT>-vpa -o jsonpath='{.status.recommendation}'
See assets/vpa-example.yaml for a full template.
4. Cluster Autoscaler / Node Auto Provisioning (NAP)
On Autopilot (golden path), node scaling is fully managed. NAP automatically creates and sizes node pools based on workload demands.
For Standard clusters:
# Enable cluster autoscaler on a node pool
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
--enable-autoscaling --node-pool <POOL_NAME> \
--min-nodes <MIN> --max-nodes <MAX> \
--quiet
# Enable NAP
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
--enable-autoprovisioning \
--min-cpu <MIN_CPU> --max-cpu <MAX_CPU> \
--min-memory <MIN_MEM> --max-memory <MAX_MEM> \
--quiet
Autoscaling profiles:
| Profile | Behavior | Golden Path? |
|---|---|---|
BALANCED |
Default GKE; conservative scale-down | No |
OPTIMIZE_UTILIZATION |
Aggressive scale-down; lower idle | Yes |
| : : resources : : |
Best Practices
- Define resource requests: HPA and VPA rely on accurate requests. Always set them.
- Avoid metric conflicts: Do not use HPA and VPA on the same metric. Typical pattern: HPA on CPU, VPA on memory.
- Pod Disruption Budgets: Define PDBs for all production workloads to ensure availability during scaling events.
- HPA stabilization: HPA has a default 5-minute stabilization window. Tune
behaviorfor faster response if needed. - VPA "Auto" caution: Auto mode restarts pods. Ensure your app handles SIGTERM gracefully. VPA requires at least 2 replicas for evictions by default.
- Use ComputeClasses: For workload-specific node targeting (Spot fallback, GPU, specific machine families), use ComputeClasses instead of node selectors.
Rightsizing Workflow
- Deploy VPA in
Offmode for 24+ hours - Read recommendations:
kubectl describe vpa <NAME> - Compare
targetvalues against currentrequests - Apply with 20% buffer:
new_request = target * 1.2 - Use patch format to update Deployment
| Condition | Recommendation | Risk |
|---|---|---|
| CPU request >5x P95 actual | Reduce to P95 * 1.2 |
Medium |
| Memory request >3x P95 actual | Reduce to P95 * 1.2 |
Medium |
| CPU request >2x P95 actual | Rightsizing with 20% buffer | Low |
| No resource limits set | Add limits to prevent noisy-neighbor | Low |