Catalog
google/gke-observability

google

gke-observability

Configures GKE observability, including Cloud Logging, Cloud Monitoring, and managed Prometheus. Use when configuring GKE monitoring, setting up GKE logging, or configuring Prometheus metrics collection. Don't use to configure local application logging frameworks or external APMs outside GKE.

global
New~2.3k
v1.0Saved Jun 24, 2026

GKE Observability

This reference covers monitoring, logging, and metrics configuration for GKE. The golden path enables comprehensive observability including control-plane metrics.

MCP Tools: get_cluster, list_k8s_events, get_k8s_logs, get_k8s_cluster_info, describe_k8s_resource. CLI-only: gcloud container clusters update --monitoring=..., gcloud logging read

Golden Path Observability Defaults

Setting Golden Path Value Notes
loggingConfig components SYSTEM_COMPONENTS, WORKLOADS Full workload logging
monitoringConfig components SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER Full suite including control-plane
managedPrometheusConfig.enabled true Google-managed Prometheus
advancedDatapathObservabilityConfig.enableMetrics true Dataplane V2 flow metrics
loggingService logging.googleapis.com/kubernetes Cloud Logging
monitoringService monitoring.googleapis.com/kubernetes Cloud Monitoring

Control-Plane Metrics (Golden Path Addition)

The golden path adds three control-plane monitoring components not present in default clusters:

Component What It Monitors
APISERVER API server request latency, error rates, admission
: : webhook performance :
SCHEDULER Scheduling latency, pending pods, scheduling failures
CONTROLLER_MANAGER Controller work queue depth, reconciliation latency

These are critical for diagnosing cluster-level issues (slow API responses, scheduling delays, stuck controllers).

Enabling Full Monitoring

# Enable golden path monitoring suite
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET,DCGM \
  --quiet

# Enable Managed Prometheus
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --enable-managed-prometheus \
  --quiet

# Enable Dataplane V2 observability metrics
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --enable-dataplane-v2-flow-observability \
  --quiet

Managed Prometheus

Golden path enables Google Managed Prometheus for metrics collection and querying.

Querying metrics:

  • Use Cloud Monitoring Metrics Explorer in the console
  • Use PromQL via the Prometheus UI or API
  • Grafana dashboards via Managed Grafana

Key GKE metrics:

Metric Source Use
container_cpu_usage_seconds_total cAdvisor Pod CPU usage
container_memory_working_set_bytes cAdvisor Pod memory
: : : usage :
kube_pod_status_phase kube-state-metrics Pod lifecycle
apiserver_request_duration_seconds API Server Control plane
: : : latency :
scheduler_scheduling_duration_seconds Scheduler Scheduling
: : : performance :
node_cpu_seconds_total Kubelet Node CPU
DCGM_FI_DEV_GPU_UTIL DCGM GPU
: : : utilization :

Live Resource Usage (kubectl-only)

No MCP or gcloud equivalent exists for live resource usage. Use kubectl top:

kubectl top pods --all-namespaces --sort-by=cpu
kubectl top nodes
kubectl top pods --containers -n <NAMESPACE>  # per-container breakdown

Cloud Logging (gcloud-only)

Querying cluster logs (no MCP equivalent — use gcloud logging read):

# System component logs
gcloud logging read \
  'resource.type="k8s_cluster" AND resource.labels.cluster_name="<CLUSTER_NAME>"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet

# Workload logs for a specific namespace
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.cluster_name="<CLUSTER_NAME>" AND resource.labels.namespace_name="<NAMESPACE>"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet

# Audit logs (who did what)
gcloud logging read \
  'resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet

Diagnostic Settings

For security monitoring and troubleshooting, enable control-plane audit logs:

# View current logging config
gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
  --format="yaml(loggingConfig)" \
  --quiet

Alerting

Set up alerts for critical conditions:

Condition Metric Threshold
High API server latency apiserver_request_duration_seconds P99 > 5s
Pod crash loops kube_pod_container_status_restarts_total > 5 in 10min
Node not ready kube_node_status_condition condition=Ready, status!=True
High GPU utilization DCGM_FI_DEV_GPU_UTIL > 95% sustained
PVC near capacity kubelet_volume_stats_used_bytes / capacity > 85%
Scheduling failures scheduler_schedule_attempts_total{result="error"} > 0

Proposing Dashboards & Alerts (Production Rules)

When designing or proposing alerting and dashboard strategies for GKE:

  1. Always explicitly name Google Cloud Monitoring as the platform to implement these alerts and dashboards.
  2. Always include API server latency (via apiserver_request_duration_seconds metric) on the dashboard as a critical indicator of control plane health, alongside node CPU/Memory and pod crash loops.

Cost Considerations

Monitoring and logging have associated costs:

  • Cloud Logging: Charged per GiB ingested beyond free tier (50 GiB/project/month)
  • Cloud Monitoring: Free for GKE system metrics; custom metrics charged per time series
  • Managed Prometheus: Charged per samples ingested

To reduce costs in non-production:

# Reduce to system-only monitoring
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --monitoring=SYSTEM \
  --quiet

Not golden path defaults — recommended for production microservice architectures and performance-sensitive workloads.

  • Cloud Trace: Add OpenTelemetry SDK to your app with the opentelemetry-operations-go (or equivalent) exporter. Traces appear in Cloud Trace console. Identifies cross-service latency bottlenecks.
  • Cloud Profiler: Add the Cloud Profiler agent to your app. Profiles CPU and memory usage in production with low overhead. Identifies hotspots and compares across versions.

LQL Query Examples

Common Logging Query Language patterns for GKE troubleshooting:

# Error logs for a specific container
resource.type="k8s_container" AND resource.labels.container_name="my-app" AND severity>=ERROR

# OOMKilled events
resource.type="k8s_event" AND jsonPayload.reason="OOMKilling"

# Pod scheduling failures
resource.type="k8s_event" AND jsonPayload.reason="FailedScheduling"

# Audit logs (who did what)
resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"
Files1
1 files · 11.1 KB

Select a file to preview

Overall Score

82/100

Grade

B

Good

Safety

88

Quality

80

Clarity

85

Completeness

78

Summary

A reference guide for configuring Google Kubernetes Engine (GKE) observability across Cloud Logging, Cloud Monitoring, and managed Prometheus. The skill provides golden-path defaults, command examples, and metric reference tables to help agents configure cluster-wide monitoring, enable control-plane metrics, and set up alerting thresholds.

Detected Capabilities

gcloud CLI executionKubernetes API inspectioncluster configurationmetric queryinglog readingalert threshold definitioncommand-line instruction

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

enable gke monitoringconfigure cloud loggingmanaged prometheus metricsgke observability setupcontrol-plane metricsdataplane v2 observabilitycluster alerting ruleskubernetes audit logs

Risk Signals

INFO

gcloud commands with cluster mutation flags (--monitoring, --enable-managed-prometheus, --enable-dataplane-v2-flow-observability)

Enabling Full Monitoring section
INFO

No direct secrets access or credential references; uses GCP IAM-bound service accounts

throughout
INFO

All gcloud/kubectl invocations are read-heavy (describe, read, top, list_k8s_events) or safe configuration updates

Cloud Logging, Live Resource Usage, Alerting sections

Referenced Domains

External domains referenced in skill content, detected by static analysis.

www.apache.org

Use Cases

  • Enable comprehensive GKE monitoring stack including control-plane metrics
  • Configure Cloud Logging and Cloud Monitoring for Kubernetes workloads
  • Set up managed Prometheus for metrics collection and querying
  • Enable Dataplane V2 flow observability for network metrics
  • Design alerting strategies for critical cluster conditions
  • Query cluster logs and events using Logging Query Language (LQL)
  • Troubleshoot cluster health using control-plane and workload metrics

Quality Notes

  • Clear scope boundaries: applies only to GKE observability, not local app logging frameworks
  • Well-organized reference structure with golden-path defaults, metric tables, and query examples
  • Comprehensive metric reference table aids metric selection for custom dashboards
  • Cost considerations section helps users make informed trade-offs between monitoring comprehensiveness and budget
  • LQL query examples are practical and address common troubleshooting scenarios
  • Controls-plane metric additions (APISERVER, SCHEDULER, CONTROLLER_MANAGER) are explicitly justified with use-case descriptions
  • Distributed tracing and continuous profiling sections appropriately marked as recommended additions, not golden-path defaults
  • Clear tool availability notes (MCP vs. CLI-only) guide agent tool selection
  • Alert thresholds lack severity levels or SLO context—guidance could specify when to escalate
  • No example of full dashboard JSON or monitoring rule configuration—reference only, no templates
Model: claude-haiku-4-5-20251001Analyzed: Jun 24, 2026

Reviews

Add this skill to your library to leave a review.

No reviews yet

Be the first to share your experience.

Add google/gke-observability to your library

Command Palette

Search for a command to run...