Kubernetes dramatically changed how applications are deployed, scaled, and recovered — but its built-in health mechanisms only cover part of the monitoring problem. Liveness and readiness probes tell the Kubernetes control plane whether to restart your container or route traffic to it. They do not tell you whether your public-facing APIs are responding correctly, whether your TLS certificates are three days from expiry, whether your nightly CronJob completed, or whether a misconfigured Ingress rule silently broke a route that no internal probe would detect. This guide covers all of it: internal probe design, external uptime monitoring for K8s-hosted services, Ingress endpoint validation, CronJob heartbeat monitoring, certificate expiry alerting, and incident communication for Kubernetes environments.
Any serious Kubernetes monitoring strategy operates at two fundamentally different layers. Conflating them is the most common mistake teams make when assessing whether their K8s monitoring is adequate.
Internal cluster monitoring covers metrics and health signals that originate from within the Kubernetes cluster. This includes:

- Liveness, readiness, and startup probes evaluated by the kubelet
- `/metrics` endpoints for resource utilization, error rates, and custom application metrics

External monitoring covers health signals observed from outside the cluster — from the perspective of actual end users or downstream consumers of your services. This layer is independent of the cluster's internal state and catches failure modes that internal probes cannot.
Internal probes are the foundation of Kubernetes self-healing. Misconfigured probes are one of the top causes of unnecessary pod restarts, traffic routing to unhealthy pods, and slow rollout recovery. Getting probe configuration right is worth the investment.
A liveness probe answers the question: "Is this container still running correctly, or should Kubernetes restart it?" The probe should check whether the application process is alive and can make basic progress — not whether it is fully healthy and ready for traffic. A liveness probe failure triggers a pod restart, so false positives have operational cost.
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
  timeoutSeconds: 5
```
Key principles for liveness probes:
- Set `initialDelaySeconds` generously to avoid restarting containers that are still starting up. For slow-starting JVM or .NET applications, this may need to be 60–120 seconds. Use startup probes if initialization time is variable.
- Keep the liveness check cheap and local. Do not check external dependencies: a database outage should lead to graceful degradation, not a restart storm.

A readiness probe answers: "Is this container ready to receive traffic right now?" This probe gates traffic routing via the Service endpoint controller. A readiness failure removes the pod from the Endpoints list — it receives no new requests — but does not restart the container. Readiness probes can and should check external dependencies, because their failure model is graceful degradation rather than restart.
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1
  timeoutSeconds: 3
```
A readiness endpoint can check database connectivity, cache reachability, or the availability of required upstream services — signaling that the pod should be excluded from load balancing if critical dependencies are unavailable.
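For example, a dependency check can run directly inside the probe. A minimal sketch, assuming a Postgres dependency and the `pg_isready` client available in the container image (`DB_HOST` is a placeholder environment variable):

```yaml
readinessProbe:
  exec:
    # Exits non-zero (probe fails) if the database is not accepting connections
    command: ["/bin/sh", "-c", "pg_isready -h \"$DB_HOST\" -p 5432 -q"]
  periodSeconds: 5
  failureThreshold: 3
  timeoutSeconds: 3
```

An HTTP `/ready` endpoint that performs the same checks in application code is equally valid, and avoids shipping database client tools in the image.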
For applications with slow or variable initialization times, startup probes prevent the liveness probe from killing the container before it has finished starting. The startup probe runs until it succeeds or until it has failed failureThreshold times (a total budget of failureThreshold × periodSeconds seconds), after which the liveness and readiness probes take over.
```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```
This configuration gives the container up to 300 seconds to start before Kubernetes considers it failed — without setting a high initialDelaySeconds on liveness that would also slow restart-and-recovery cycles.
The most important external monitoring you can add to a Kubernetes environment is HTTP/HTTPS checks against your Ingress-exposed endpoints, run from probe locations outside your cluster. These checks validate the complete request path — DNS resolution, Ingress controller routing (nginx, Traefik, Istio gateway), service discovery, pod reachability, and application response — from a perspective that no internal probe can replicate.
| Failure Scenario | Detected by Liveness / Readiness? | Detected by External Monitor? |
|---|---|---|
| Pod crash / CrashLoopBackOff | Yes — triggers restart | Yes — HTTP timeout or 5xx |
| Bad Ingress rule routing 404 to users | No | Yes — HTTP 404 detected |
| Nginx upstream timeout (502) | No | Yes — HTTP 502 detected |
| Expired TLS cert → browser error | No | Yes — SSL check fails |
| DNS misconfiguration pointing nowhere | No | Yes — DNS check fails |
| Network policy blocking external traffic | No | Yes — connection refused / timeout |
| All pods in CrashLoopBackOff, 0 Ready pods | Yes (each pod) | Yes — 503 from Ingress |
| HPA scaled to 0 accidentally | N/A (no pods) | Yes — 503 or connection error |
| Correct HTTP but wrong response body | No | Yes — response body assertion fails |
| Slow degradation — P99 latency spike | No (unless timeoutSeconds exceeded) | Yes — response time threshold exceeded |
For each Ingress-exposed hostname in your cluster, create an external HTTP monitor that checks the public URL at a 1–5 minute interval from multiple geographic locations. Configure the monitor with:
- The full public URL of a health or critical-path endpoint (e.g., https://api.yourapp.com/health)
- The expected status code (typically 200)
- A response body assertion that validates application-level health
- A response time threshold to catch latency degradation

For external monitors to be maximally effective, your application should expose a dedicated health endpoint that does a lightweight check of application health — database connectivity, critical cache warmup status — and returns a structured response. A common pattern:
```http
GET /health

HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "ok",
  "version": "2.4.1",
  "checks": {
    "database": "ok",
    "cache": "ok"
  }
}
```
Configure your external monitor to assert that the response body contains "status":"ok" — this validates end-to-end application health, not just that the Ingress controller answered the TCP connection.
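Taken together, a monitor definition for this endpoint captures the URL, check locations, and assertions. The field names below are illustrative only, not a specific product's schema:

```yaml
monitor:
  name: api-prod-health
  url: https://api.yourapp.com/health
  interval_seconds: 60
  locations: [us-east, eu-west, ap-southeast]
  assertions:
    - type: status_code
      equals: 200
    - type: body_contains
      value: '"status":"ok"'
    - type: response_time_ms
      max: 1500
```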
Certificate expiry is one of the most embarrassing and avoidable production incidents. In Kubernetes environments, TLS certificates are commonly sourced from Let's Encrypt via cert-manager, from an organization's internal CA, or from a cloud provider's certificate management service (ACM, GCP Managed Certificates). Regardless of source, external certificate monitoring is the safety layer that catches renewal failures before they reach end users.
cert-manager automates Let's Encrypt certificate renewal and is generally reliable. But renewal can fail due to: DNS-01 or HTTP-01 challenge failures caused by misconfigured DNS, network policy blocking the ACME challenge path, rate limit violations from too many renewal requests, cert-manager pod being unavailable during the renewal window, or manual changes to Certificate resources that break the renewal configuration.
When cert-manager fails to renew, the failure is silent unless you are actively monitoring certificate expiry externally. The first signal many teams get is a user reporting a browser security warning.
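When investigating a suspected renewal failure, cert-manager's own view is worth a spot-check. A sketch assuming cert-manager's CRDs are installed; `api-tls` and `prod` are placeholder names:

```bash
# READY=False or a near-past expiry here indicates renewal is failing
kubectl get certificates -A

# Status conditions and events usually name the failing ACME challenge
kubectl describe certificate api-tls -n prod
```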
For each Ingress hostname, configure an SSL certificate monitor that tracks the certificate currently served at that domain and alerts when expiry is within a configurable threshold (typically 14–30 days). The external monitor should check:

- Days remaining until the served certificate expires
- That the certificate's hostname (CN/SAN) matches the domain
- That the full certificate chain is valid as served
This external monitoring is independent of cert-manager's internal state — it observes what browsers and API clients actually see when connecting, making it the authoritative check for whether certificate management is working correctly.
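You can reproduce what such a monitor observes with a one-off manual check, assuming the openssl CLI is available:

```bash
# Print the expiry date and subject of the certificate actually served
echo | openssl s_client -servername api.yourapp.com -connect api.yourapp.com:443 2>/dev/null \
  | openssl x509 -noout -enddate -subject
```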
Kubernetes CronJobs run scheduled workloads — data pipeline steps, report generation jobs, cache warming, backup processes, billing event processing. When a CronJob fails, completes with errors, runs longer than expected, or silently stops running due to a scheduling suspension, cluster-internal monitoring often misses it. CrashLoopBackOff detection works for pods that repeatedly restart; it does not work for jobs that exit successfully but produce incorrect output, or for jobs that never start.
Add a curl ping at the end of your CronJob script or container entrypoint — but only on successful completion:
```bash
#!/bin/bash
set -e

# Your job logic
python /app/process_billing_events.py

# Ping heartbeat monitor on success
curl -fsS --max-time 5 \
  "https://upticknow.com/api/heartbeat/YOUR_MONITOR_ID" \
  || echo "Heartbeat ping failed (non-fatal)"
```
For Kubernetes CronJob specs, the pattern in the job container looks like:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: billing-processor
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: processor
              image: yourapp/billing-processor:v1.4
              command:
                - /bin/sh
                - -c
                - |
                  python /app/run.py && \
                  curl -fsS "https://upticknow.com/api/heartbeat/YOUR_MONITOR_ID"
          restartPolicy: OnFailure
```
Configure the heartbeat monitor in UpTickNow with a grace period slightly longer than the expected job duration (e.g., if the job typically runs in 45 minutes and fires every 24 hours, set the grace period to 25 hours). The monitor will alert if no ping is received within that window — catching missed runs, runs that started but never completed, and jobs that were accidentally suspended.
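Expressed as configuration, the cadence-to-grace relationship looks roughly like this (field names are illustrative, not a specific product schema):

```yaml
heartbeat:
  name: billing-processor-daily
  expected_cadence: 24h
  grace_period: 25h   # cadence plus headroom for the ~45 minute run
```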
This pattern catches the failure modes that internal monitoring misses:

- The job fails partway through: because success and the ping are chained (command && curl), no ping fires
- The job never runs at all, for example because a kubectl patch or Helm release accidentally set spec.suspend: true

For Kubernetes services exposed via Ingress, external DNS records must correctly point to the Ingress controller's load balancer IP or hostname. DNS configuration changes — during cluster migrations, load balancer replacements, or DNS provider changes — can break external access in ways that internal Kubernetes monitoring never observes.
Configure external DNS monitors for each publicly exposed Ingress hostname that verify:

- The record resolves at all (no NXDOMAIN)
- It resolves to the expected load balancer IP or hostname
- Resolution does not change unexpectedly
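A manual equivalent of what the DNS monitor verifies, assuming dig and kubectl access (api.yourapp.com and api-ingress are placeholder names):

```bash
# What public resolvers return
dig +short api.yourapp.com

# What the cluster expects: the Ingress controller's load balancer address
kubectl get ingress api-ingress -n prod \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```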
DNS monitoring is especially important during and after cluster migrations, blue-green cluster switchovers, and any operation that changes the Ingress controller's external IP or hostname.
Running checks from a single probe location can produce false positives when temporary network conditions between that probe and your infrastructure cause failures that users do not actually experience. Multi-region confirmation requires checks to fail from two or more geographically distributed probe locations before triggering an alert.
For Kubernetes clusters serving global traffic, multi-region external monitoring serves a second purpose beyond false-positive reduction: it validates that your CDN, DNS-based geographic routing, and regional load balancing are correctly serving users in their respective regions. A single-region probe cannot detect routing failures that affect users in specific geographic areas.
A production-grade Kubernetes monitoring stack in 2026 combines internal and external layers with clear ownership and alert routing for each signal type.
The internal layer:

- Core infrastructure: cluster resource metrics, pod states, deployment rollout status, PVC utilization, HPA scaling events.
- Dashboards for cluster health, per-service resource utilization, deployment history, and alert state overview.
- Alert routing that sends CrashLoopBackOff and OOMKill alerts to the owning team based on namespace labels or pod annotations (see the routing sketch after this list).
- For microservice architectures: Jaeger or Tempo for cross-service trace correlation when diagnosing latency issues and dependency failures.
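One common way to implement that routing is Prometheus Alertmanager, sketched minimally below. It assumes the kubernetes-mixin alert names, a namespace label on alerts, and a global slack_api_url; channel and team names are placeholders:

```yaml
route:
  receiver: platform-oncall
  routes:
    # Crash-loop alerts from the payments namespace go straight to that team
    - matchers:
        - alertname = "KubePodCrashLooping"
        - namespace = "payments"
      receiver: payments-oncall
receivers:
  - name: platform-oncall
    slack_configs:
      - channel: "#platform-oncall"
  - name: payments-oncall
    slack_configs:
      - channel: "#payments-oncall"
```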
The external layer:

- HTTP/HTTPS checks with response body validation, multi-region confirmation, and response time thresholds. One monitor per exposed Ingress route.
- Certificate expiry alerts at 30 and 14 days for all Ingress-served TLS endpoints, plus hostname mismatch and chain validity checks.
- A dead-man's switch for each critical scheduled job, with an expected completion window slightly wider than the typical job duration.
- DNS record validation for each Ingress hostname, alerting on NXDOMAIN or unexpected IP changes.
- A public or private status page that reflects the availability state of your Kubernetes-hosted services, fed by the external monitoring results.
| Monitor Type | Recommended Frequency / Grace Period | Why |
|---|---|---|
| Ingress endpoint (critical paths) | Every 1 minute | Fast detection of pod failures, routing issues |
| Ingress endpoint (lower-traffic routes) | Every 5 minutes | Balance detection speed with cost |
| SSL certificate expiry check | Every 12–24 hours | Expiry changes slowly; high-frequency is wasteful |
| DNS validation | Every 15–30 minutes | DNS changes rarely; faster detection adds marginal value |
| CronJob heartbeat (daily jobs) | Grace period = job cadence + 1 hour buffer | Alert if the job does not complete within the window |
| CronJob heartbeat (hourly jobs) | Grace period = 90 minutes | Catches missed runs without being overly aggressive |
| TCP port checks (databases, internal services) | Every 2–5 minutes | Port availability as a lightweight proxy for service health |
An effective alert routing strategy maps each monitoring signal to the team responsible for the affected service — not to a central ops team that then has to triage and re-route. For Kubernetes environments, routing should follow namespace or service ownership boundaries.
For external uptime monitors: route alerts to the service team's primary on-call channel (Slack, PagerDuty, or Opsgenie escalation policy). Use team labels in UpTickNow to group monitors by service ownership and configure per-team routing.
For certificate expiry alerts: route to the platform or infrastructure team responsible for cert-manager configuration — typically a different audience than the application service teams.
For CronJob heartbeat failures: route to the team that owns the scheduled job, with the job name and expected cadence in the alert message for immediate context.
For DNS failures: route to the infrastructure team with high urgency — DNS failures typically affect all users across all services simultaneously.
Even in a Kubernetes environment where the cluster self-heals most failures automatically, incidents that affect user-facing availability require external communication. When a pod restart cycle takes 3–5 minutes, when a rolling deployment temporarily degrades capacity, or when a certificate expiry causes a hard outage, users and stakeholders need a reliable place to understand what is happening and what the resolution timeline is.
A status page fed by your external uptime monitoring data closes the loop: when external monitors detect degraded availability, incidents are opened on the status page automatically or manually by the on-call engineer. This is the channel that external customers, third-party integrations, and internal non-technical stakeholders check during incidents — and it needs to be independent of your Kubernetes cluster's health so it remains accessible even when the cluster itself is in a degraded state.
UpTickNow provides the external monitoring layer for Kubernetes environments: HTTP/HTTPS endpoint monitoring with response body validation, SSL certificate monitoring for Ingress-served TLS, DNS record validation, TCP port checks, CronJob heartbeat monitoring with configurable grace windows, and a public status page for incident communication — all in a single platform.
Configuring UpTickNow's HTTP monitor against a Kubernetes Ingress endpoint is identical to any other HTTPS endpoint: provide the URL, configure the response assertions and threshold, select check locations, and configure alert routing to the service team. The monitoring layer has no knowledge of or dependency on your Kubernetes cluster — which is exactly the point. External monitoring should be infrastructure-agnostic, observing your services from the outside the way your users do.
See the full platform at upticknow.com or get started on the pricing page.
Monitor your K8s Ingress endpoints, TLS certificates, CronJob heartbeats, and DNS records from outside the cluster — the way your users experience your services. Start free in minutes.
Try UpTickNow Free