You are viewing documentation for a previous version of Platform9 products. For the latest Private Cloud Director documentation, click here.
Managed Kubernetes
5.14
Node Connectivity Monitoring
Rules for the Hostagent
The Prometheus agent is configured to report numerous node-based metrics. Below is the YAML rule set that Catapult uses. Each alert rule defines a name, an expression, a timeframe (the for delay), and annotations that carry the notification's summary and description along with the SaaS management plane (du) and the hostname (host_name).
YAML
bash-5.0# cat resmgr.yml
groups:
  - name: Host Availability Alerts
    rules:
      # du_sidekick_last_checkin_seconds will be the time in seconds since a host checked in via sidekick
      # or -1 if the host is not reported by sidekick at all. The latter case may occur if
      # sidekick has restarted and a host hasn't connected since.
      - expr: >-2
          (
            sum by (host_id,host_name) (
              time() - sidekick_host_last_heartbeat_time{job="sidekickserver"} / 1000
            )
          )
          OR ignoring(host_name)
          (
            sum by (host_id,host_name) (
              (resmgr_host_up{job="resmgr"} == 0) - 1
            )
          )
        record: du_sidekick_last_checkin_seconds
  - name: Hosts disconnected
    rules:
      # resmgr says it's been down for at least 10m, sidekick says still reporting
      # NOTE: the `for` delay _must_ exceed the cutoff (currently 600 seconds) + the scrape
      # period (1m) or else both this and host-down will fire off simultaneously
      - alert: host-disconnected
        expr: >-2
          sum by (du, host_id, host_name) (du_sidekick_last_checkin_seconds < 600)
          AND ON(host_id) resmgr_host_up{job="resmgr"} == 0
        for: 15m
        annotations:
          summary: host-disconnected
          description: "{{ $labels.host_name }} disconnected from control plane {{ $labels.du }} for more than 10 minutes"
          du: "{{ $labels.du }}"
          host_name: "{{ $labels.host_name }}"
      # resmgr says it's been down for at least 10m, sidekick "agrees"
      - alert: host-down
        expr: >-2
          sum by (du, host_id, host_name) (du_sidekick_last_checkin_seconds >= 600 OR du_sidekick_last_checkin_seconds == -1)
          AND ON(host_id) resmgr_host_up{job="resmgr"} == 0
        for: 10m
        annotations:
          summary: host-down
          description: "{{ $labels.host_name }} down {{ $labels.du }} for more than 10 minutes"
          du: "{{ $labels.du }}"
          host_name: "{{ $labels.host_name }}"
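To make the rule logic easier to follow, the sketch below models the recording rule and the two alert expressions for a single host in plain Python. It is an illustration only, not how Prometheus or Catapult evaluates the rules. The function names are hypothetical, the `for` delays are ignored, and the heartbeat is assumed to be a timestamp in milliseconds (inferred from the division by 1000 in the recording rule).

```python
from typing import List, Optional

CUTOFF_SECONDS = 600  # cutoff used by both alert expressions in the rule file


def last_checkin_seconds(now_s: float,
                         heartbeat_ms: Optional[float],
                         resmgr_up: int) -> Optional[float]:
    """Model of the du_sidekick_last_checkin_seconds recording rule.

    heartbeat_ms is None when sidekick reports no heartbeat for the host
    (assumption: the metric is a millisecond timestamp).
    """
    if heartbeat_ms is not None:
        # Left side of the OR: time() - heartbeat / 1000
        return now_s - heartbeat_ms / 1000
    if resmgr_up == 0:
        # Right side of the OR: (resmgr_host_up == 0) - 1 evaluates to -1
        return -1
    # Neither side of the OR yields a sample for this host.
    return None


def matching_alerts(checkin_s: Optional[float], resmgr_up: int) -> List[str]:
    """Return the alert names whose expression matches, ignoring `for`."""
    alerts: List[str] = []
    # Both alerts require a recorded value AND resmgr_host_up == 0.
    if checkin_s is None or resmgr_up != 0:
        return alerts
    if checkin_s < CUTOFF_SECONDS:
        alerts.append("host-disconnected")
    if checkin_s >= CUTOFF_SECONDS or checkin_s == -1:
        alerts.append("host-down")
    return alerts
```

For example, `matching_alerts(30.0, 0)` returns `["host-disconnected"]` and `matching_alerts(900.0, 0)` returns `["host-down"]`. Read literally, a value of -1 also satisfies the `< 600` comparison, so `matching_alerts(-1, 0)` returns both names in this model. In the rule file the two alerts still differ in their `for` delays (15m and 10m).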
Last updated on Jan 23, 2026