VMHA Stuck in "Waiting"

Problem

VMHA remains stuck in the "waiting" state during enablement.

Environment

Private Cloud Director Virtualization – v2025.4 and Higher
Self-Hosted Private Cloud Director Virtualization – v2025.4 and Higher

Cause

A decommissioned host is still listed in Nova's service records. VMHA tries to use that host during setup, hits an error, and remains stuck in the "waiting" state.

Diagnostics

SaaS customers: contact the Platform9 Support Team to confirm whether you are hitting the issue described in this article.

Check VMHA logs:

$ kubectl exec deploy/hamgr -n <REGION_NAMESPACE> -- cat /var/log/pf9/hamgr/hamgr.log | grep -A1 'Enabling HA'

Look for log entries like:

hamgr.log

Enabling HA on some of the hosts [...] including host '[HOST-ID]'
WARNING Role status of host [HOST-ID] is not ok
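
The host IDs named in those warnings can be extracted with a small filter instead of being read off by eye. This is a minimal sketch that assumes the warning text matches the sample above ("Role status of host <id> is not ok"); `flagged_hosts` is a hypothetical helper name, not part of hamgr.

```shell
# Print each unique host ID that appears in a
# "Role status of host <id> is not ok" warning.
# Reads log text on stdin. The message format is assumed from the sample above.
flagged_hosts() {
  sed -n 's/.*Role status of host \([^ ]*\) is not ok.*/\1/p' | sort -u
}

# Usage (same log file as the command above):
# kubectl exec deploy/hamgr -n <REGION_NAMESPACE> -- cat /var/log/pf9/hamgr/hamgr.log | flagged_hosts
```

Each ID printed is a candidate to compare against the compute service and hypervisor listings.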

List the compute services and check whether any hypervisor shows "Status" as disabled and "State" as down.

Identify services that are down, disabled, or associated with non-existent or decommissioned hosts. In the sample output, HOST2.EXAMPLE.COM is the decommissioned node.

$ openstack compute service list
# sample output
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| ID                 | Binary       | Host                | Zone   | Status   | State | Updated At  |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| [HOST1_SERVICE_ID] | nova-compute | [HOST1.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
| [HOST2_SERVICE_ID] | nova-compute | [HOST2.EXAMPLE.COM] | [zone] | disabled | down  | [TIMESTAMP] |
| [HOST3_SERVICE_ID] | nova-compute | [HOST3.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
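
On a region with many hosts, the suspect rows can be filtered out of the listing. The sketch below assumes the OpenStack client's standard `-f value` and `-c` output options, which print the selected columns as space-separated fields; `unhealthy_services` is a hypothetical helper name.

```shell
# Read "ID Host Status State" rows on stdin and print every row
# that is not both enabled and up.
unhealthy_services() {
  awk '$3 != "enabled" || $4 != "up" { print $1, $2, $3, $4 }'
}

# Usage:
# openstack compute service list --service nova-compute \
#   -f value -c ID -c Host -c Status -c State | unhealthy_services
```

A row printed here is only a candidate. Confirm that the host is really decommissioned before treating its service entry as stale.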

List the hypervisors and validate the host mapping. In the sample output, the node [HOST2.EXAMPLE.COM] is in the down state. Check its associated service ID to validate the host mapping.

$ openstack hypervisor list
# sample output
+----------------+---------------------+-----------------+-------------+-------+
| ID             | Hypervisor Hostname | Hypervisor Type | Host IP     | State |
+----------------+---------------------+-----------------+-------------+-------+
| [HOST1_UUID]   | [HOST1.EXAMPLE.COM] | QEMU            | [IP-ADDR-1] | up    |
| [HOST2_UUID]   | [HOST2.EXAMPLE.COM] | QEMU            | [IP-ADDR-2] | down  |
+----------------+---------------------+-----------------+-------------+-------+

$ openstack hypervisor show <HYPERVISOR_ID>
# sample output
$ openstack hypervisor show [HOST2_UUID]
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| aggregates          | []                                   |
| cpu_info            | None                                 |
| host_ip             | [IP-ADDR-2]                          |
| hypervisor_hostname | [HOST2.EXAMPLE.COM]                  |
| hypervisor_type     | QEMU                                 |
| hypervisor_version  | [HYPERVISOR_VERSION]                 |
| id                  | [HOST2_UUID]                         |
| service_host        | [SERVICE_HOST_UUID]                  |
| service_id          | [HOST2_SERVICE_ID]                   |
| state               | down                                 |
| status              | disabled                             |
+---------------------+--------------------------------------+

Resolution

Identify the stale compute service entry in the output of the command below. In the sample output, the node HOST2.EXAMPLE.COM is down.

$ openstack compute service list
# sample output
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| ID                 | Binary       | Host                | Zone   | Status   | State | Updated At  |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+
| [HOST1_SERVICE_ID] | nova-compute | [HOST1.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
| [HOST2_SERVICE_ID] | nova-compute | [HOST2.EXAMPLE.COM] | [zone] | disabled | down  | [TIMESTAMP] |
| [HOST3_SERVICE_ID] | nova-compute | [HOST3.EXAMPLE.COM] | [zone] | enabled  | up    | [TIMESTAMP] |
+--------------------+--------------+---------------------+--------+----------+-------+-------------+

Delete the stale service using the command below. In the sample, two working hypervisors still remain after the stale entry is deleted, which meets the minimum required to enable VMHA.

$ openstack compute service delete <HOST2_SERVICE_ID>
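
The two-hypervisor requirement can be checked before the delete is issued. The guard below is a sketch under the same assumption about `-f value` output (space-separated "ID Host Status State" rows); `enough_hosts_after_delete` is a hypothetical helper name, and "working" is interpreted here as enabled and up.

```shell
# Succeed (exit 0) only if at least two enabled/up compute services
# would remain once the service ID given as $1 is removed.
# Reads "ID Host Status State" rows on stdin.
enough_hosts_after_delete() {
  awk -v stale="$1" '
    $1 != stale && $3 == "enabled" && $4 == "up" { healthy++ }
    END { exit (healthy >= 2 ? 0 : 1) }
  '
}

# Usage:
# if openstack compute service list --service nova-compute \
#      -f value -c ID -c Host -c Status -c State \
#      | enough_hosts_after_delete "<HOST2_SERVICE_ID>"; then
#   openstack compute service delete "<HOST2_SERVICE_ID>"
# else
#   echo "Fewer than two working hypervisors would remain" >&2
# fi
```

The guard only counts rows. It does not verify that the ID passed in belongs to a decommissioned host.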

Wait for VMHA to retry the operation automatically, or disable and re-enable VMHA to trigger a fresh attempt.

Validation:

Ensure VMHA state transitions from waiting to enabled.
Confirm no additional stale hosts remain.

Additional Information:

At least two working hypervisors are required to enable VMHA.
