Auto-Discovering Kubernetes Services for Monitoring — Without the YAML Circus
We get it. You chose Kubernetes. Maybe your company needed it, maybe you needed it on your resume, maybe you genuinely run workloads at a scale where it makes sense. But here is the thing: monitoring your services inside that cluster should not require another 200 lines of YAML, custom CRDs, a full Prometheus stack, and a degree in label-selector archaeology.
And yet, somehow, that is exactly what most monitoring tools ask you to do.
The traditional approach (a.k.a. "I forgot to add a monitor for that new service")
Let's paint the picture. You deploy a new service. It gets an Ingress. Traffic flows. Everyone is happy. Three weeks later, it goes down at 2 AM and nobody notices because -- surprise -- nobody remembered to create a monitor for it.
The traditional workflow looks like this:
- Deploy service to K8s
- Remember that you need monitoring (maybe)
- Log into your monitoring tool
- Manually type in the URL, set the interval, pick a notification channel
- Repeat for every service, every endpoint, every cluster
And when you delete a service? Those orphaned monitors sit there forever, pinging a hostname that no longer resolves, silently wasting everyone's time.
We thought this was dumb. So we fixed it.
How auto-discovery works
StatusDude ships a lightweight agent that runs as a pod inside your cluster. Set one environment variable -- K8S_DISCOVERY_ENABLED=true -- and it starts scanning the Kubernetes API every 5 minutes for resources that look like they should be monitored.
The flow is simple:
- Agent queries the K8s API for Ingresses, Services, and HTTPRoutes
- Builds a desired-state manifest -- a list of monitors that should exist based on what it found
- Sends the manifest to the StatusDude APIs -- monitor these!
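The flow above can be sketched in a few lines of Python. This is a hypothetical shape for the desired-state manifest -- the field names (`name`, `type`, `url`, `tags`) are illustrative, not StatusDude's actual API schema:

```python
def build_manifest(discovered: list[dict]) -> dict:
    """Collect discovered endpoints into one desired-state payload."""
    monitors = []
    for item in discovered:
        monitors.append({
            "name": item["name"],
            "type": item["type"],  # "http" or "tcp"
            "url": item["url"],
            # every auto-created monitor carries the k8s-autodiscovery tag
            "tags": ["k8s-autodiscovery"] + item.get("tags", []),
        })
    return {"monitors": monitors}

manifest = build_manifest([
    {"name": "api", "type": "http", "url": "https://api.example.com/"},
])
```

The key property is that the manifest is the complete desired state, not a diff -- which is what makes orphan cleanup trivial later.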
The agent needs exactly one token: its agent API key. No user credentials, no separate service account for the monitoring API, no OAuth dance. One token.
The K8s API calls use the standard kubernetes Python library, which is synchronous, so we wrap everything in asyncio.to_thread() to keep the agent's async event loop unblocked. Discovery runs in the background while the agent continues its normal job of pinging your other monitors.
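Here is a minimal sketch of that pattern. The blocking function stands in for a real kubernetes client call (something like `NetworkingV1Api().list_ingress_for_all_namespaces()`); the point is that `asyncio.to_thread()` moves it off the event loop:

```python
import asyncio
import time

def list_ingresses_blocking() -> list[str]:
    """Stand-in for the synchronous kubernetes client call."""
    time.sleep(0.1)  # simulate a blocking API round-trip
    return ["api.example.com", "app.example.com"]

async def discover() -> list[str]:
    # Run the blocking call in a worker thread so the event loop
    # keeps servicing the agent's other monitor checks meanwhile.
    return await asyncio.to_thread(list_ingresses_blocking)

hosts = asyncio.run(discover())
```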
What gets discovered
The agent scans three types of Kubernetes resources:
Ingresses -- the bread and butter. For each Ingress rule, the agent extracts the hostname and path, determines whether TLS is configured (HTTPS vs HTTP), and creates an HTTP monitor. An Ingress with host: api.example.com and TLS enabled becomes a monitor for https://api.example.com/.
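The extraction logic for an Ingress looks roughly like this. For readability the Ingress is modeled as a plain dict rather than the real kubernetes client objects, and paths are plain strings -- a simplification of the actual spec:

```python
def monitors_from_ingress(ingress: dict) -> list[str]:
    """Turn Ingress rules into monitor URLs; hosts with TLS get https."""
    tls_hosts = {h for tls in ingress.get("tls", [])
                 for h in tls.get("hosts", [])}
    urls = []
    for rule in ingress.get("rules", []):
        host = rule["host"]
        scheme = "https" if host in tls_hosts else "http"
        for path in rule.get("paths", ["/"]):
            urls.append(f"{scheme}://{host}{path}")
    return urls

ingress = {
    "tls": [{"hosts": ["api.example.com"]}],
    "rules": [{"host": "api.example.com", "paths": ["/"]}],
}
urls = monitors_from_ingress(ingress)  # ["https://api.example.com/"]
```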
Services (LoadBalancer / NodePort) -- for services exposed outside the cluster. LoadBalancer services get HTTP monitors on well-known ports (80, 443, 8080, etc.) and TCP monitors on everything else. NodePort services get TCP monitors on their node ports.
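The HTTP-vs-TCP decision for LoadBalancer ports reduces to a port lookup. The exact "well-known" set here is an assumption based on the examples above:

```python
# Assumed well-known web ports -- illustrative, not the agent's exact list.
HTTP_PORTS = {80, 443, 8080, 8443}

def monitor_type_for_port(port: int) -> str:
    """HTTP check on well-known web ports, plain TCP check otherwise."""
    return "http" if port in HTTP_PORTS else "tcp"
```

So a LoadBalancer exposing 443 gets an HTTP monitor, while one exposing 5432 (Postgres, say) gets a TCP reachability check.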
Gateway API HTTPRoutes -- the newer alternative to Ingresses (Long live the King!). The agent reads the hostnames field and creates HTTPS monitors for each one.
You can scope discovery with namespace filtering (K8S_NAMESPACE="production" or "all") and label selectors (K8S_LABEL_SELECTOR="statusdude.io/monitor=true") so you are not monitoring every single thing in the cluster -- just the stuff that matters.
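The scoping check is conceptually simple. This sketch supports a single `key=value` selector rather than the full Kubernetes label-selector grammar -- a simplification for illustration:

```python
def in_scope(namespace: str, labels: dict,
             ns_filter: str = "all",
             label_selector: str = "") -> bool:
    """Mimic K8S_NAMESPACE and K8S_LABEL_SELECTOR filtering (simplified:
    only a single key=value selector, not full selector syntax)."""
    if ns_filter != "all" and namespace != ns_filter:
        return False
    if label_selector:
        key, _, value = label_selector.partition("=")
        if labels.get(key) != value:
            return False
    return True
```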
Smart tagging from K8s labels
Nobody wants to manually tag 50 monitors. The agent extracts tags automatically from Kubernetes labels and metadata:
- k8s-autodiscovery -- applied to everything the agent creates, so you can filter at a glance
- Cluster ID -- the first 12 characters of your kube-system namespace UID (e.g., cluster:a1b2c3d4e5f6)
- Namespace -- production, staging, whatever it finds
- App labels -- app, app.kubernetes.io/name, app.kubernetes.io/component, k8s-app -- the values become tags
The extraction is smart about skipping noise. Labels like pod-template-hash, controller-revision-hash, or helm.sh/chart are ignored. Values that look like hashes (more than 80% hex characters) are skipped too. So app=frontend becomes the tag frontend, but pod-template-hash=7c9f8b6d4a does not become anything.
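Both heuristics fit in a few lines. The allowlist-plus-skiplist combination is one reading of the rules described above, written out for illustration:

```python
SKIP_LABELS = {"pod-template-hash", "controller-revision-hash", "helm.sh/chart"}
TAG_LABELS = {"app", "app.kubernetes.io/name",
              "app.kubernetes.io/component", "k8s-app"}

def looks_like_hash(value: str) -> bool:
    """True when more than 80% of the characters are hex digits."""
    if not value:
        return False
    hex_chars = sum(c in "0123456789abcdefABCDEF" for c in value)
    return hex_chars / len(value) > 0.8

def tags_from_labels(labels: dict) -> list[str]:
    tags = []
    for key, value in labels.items():
        if key in SKIP_LABELS or key not in TAG_LABELS:
            continue  # noise label, or not on the app-label allowlist
        if looks_like_hash(value):
            continue  # value looks like a generated hash, not a name
        tags.append(value)
    return tags
```

So `"frontend"` (3 of 8 characters are hex digits) survives, while `"7c9f8b6d4a"` (10 of 10) is dropped.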
The YAML comparison
Let's see what it takes to monitor a new service with different tools.
Prometheus ServiceMonitor approach:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout-service
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: checkout
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: http
      interval: 30s
      path: /metrics
      scrapeTimeout: 10s
```
That is 19 lines of YAML per service. And it assumes you have the Prometheus Operator CRD installed, a Prometheus instance configured to pick up ServiceMonitors with that label selector, and a Grafana dashboard to actually visualize the data. Oh, and this monitors metrics, not uptime -- you still need alerting rules on top.
StatusDude auto-discovery approach:
```yaml
# That's it. There's no YAML. The agent found it automatically.
```
Zero lines. Deploy your service, give it an Ingress, and the next discovery cycle picks it up. If you want to be selective, add a label: statusdude.io/monitor: "true".
Orphan handling: a deploy removes a service, its monitor auto-pauses
This is the part that made us unreasonably happy to build. When a Kubernetes resource disappears -- someone deletes the Ingress, scales down a service, removes a namespace -- the next discovery cycle notices it is gone.
By default, the orphaned monitor gets paused, not deleted. Your historical uptime data and incident history are preserved. If the service comes back (a rollback, a redeployment), the monitor resumes from where it left off.
If you prefer a clean slate, set K8S_DELETE_ORPHANED_MONITORS=true and orphaned monitors get deleted entirely. Either way, you never end up with ghost monitors pinging dead endpoints.
The reconcile API handles this atomically -- the agent sends its full desired state, and anything managed by that agent that is not in the manifest gets cleaned up. No polling for deletions, no separate garbage collection job, no race conditions.
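One reconcile pass can be sketched as a set difference over agent-managed monitors. This is a conceptual model, not StatusDude's server-side implementation; monitors are keyed by URL here for simplicity:

```python
def reconcile(existing: dict, desired: dict,
              delete_orphans: bool = False) -> dict:
    """One reconcile pass. existing/desired map monitor URL -> config.
    Anything this agent manages that is absent from the desired state
    gets paused (or deleted, mirroring K8S_DELETE_ORPHANED_MONITORS)."""
    result = {}
    for url, cfg in existing.items():
        if url in desired:
            result[url] = {**cfg, "paused": False}
        elif not delete_orphans:
            result[url] = {**cfg, "paused": True}  # history preserved
        # else: orphan dropped entirely
    for url, cfg in desired.items():
        result.setdefault(url, {**cfg, "paused": False})  # newly discovered
    return result
```

Because the agent always sends its full desired state, one pass handles creation, resumption, and orphan cleanup together -- no separate deletion events to track.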
The bottom line
Kubernetes already knows what is running in your cluster. Your monitoring tool should be able to read that information instead of asking you to type it in again. One agent pod, one API token, zero YAML files to maintain. Services appear, monitors appear. Services disappear, monitors pause. Labels become tags. Namespaces become filters.
That is the entire pitch. We think monitoring should be boring -- in the good way, where it just works and you forget it is there until something actually breaks.