High CPU Usage Due to Orphaned VM Processes on Hypervisor Nodes

Problem

In some environments, compute nodes may experience elevated CPU utilization caused by orphaned virtual machine (QEMU) processes. This occurs when certain instances are deleted or migrated at the control-plane level (Compute DB), but their corresponding QEMU processes continue to run on the hypervisor.

Symptoms include:

QEMU processes consuming CPU although the VM no longer exists in the database.
Mismatch between the number of instances recorded in the Nova DB and the number running on the hypervisor.
Periodic warnings in nova.compute.manager logs during instance power-state synchronization.

Environment

Private Cloud Director Virtualization - v2025.6 and Higher
Private Cloud Director Kubernetes - v2025.6 and Higher
Self-Hosted Private Cloud Director Virtualization - v2025.6 and Higher
Self-Hosted Private Cloud Director Kubernetes - v2025.6 and Higher
Compute Service

Cause

Compute Service periodically performs a synchronization cycle where it validates running instances on the hypervisor against entries in the Compute database.

In this case:

Several VMs were deleted or migrated, but their QEMU processes persisted on the source hypervisor.
These orphaned processes continued running because the default Nova behavior did not automatically clean them up.
As a result, CPU utilization on the affected hypervisor increased unnecessarily.

Additionally, frequent instance deletions/migrations caused temporary discrepancies in instance counts during sync cycles, contributing to repeated warnings in the logs.
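The comparison behind the sync cycle can be reproduced by hand as a set difference. The sketch below is illustrative only and is not how Compute Service implements it: `find_orphans` is a hypothetical helper, and the two input files (one instance UUID per line) are assumed to have been exported beforehand, for example with `openstack server list --all-projects --host <hostname> -f value -c ID` for the database view and `virsh list --uuid` on the hypervisor, assuming libvirt is the virtualization driver.

```shell
#!/bin/sh
# Hypothetical helper: print UUIDs that appear in the hypervisor list
# but not in the database list. Both arguments are files containing
# one instance UUID per line.
find_orphans() {
    db_file=$1
    hv_file=$2
    tmp_db=$(mktemp) && tmp_hv=$(mktemp) || return 1
    # comm requires sorted input, so sort both lists first.
    sort -u "$db_file" > "$tmp_db"
    sort -u "$hv_file" > "$tmp_hv"
    # -1 drops lines only in the database list, -3 drops lines common
    # to both, leaving lines found only in the hypervisor list.
    comm -13 "$tmp_db" "$tmp_hv"
    rm -f "$tmp_db" "$tmp_hv"
}

# Usage (file names are placeholders):
#   find_orphans db_uuids.txt hypervisor_uuids.txt
```

Any UUID printed is a candidate orphan and should be confirmed against the control plane before acting on it, since an in-flight migration or deletion can cause a temporary mismatch.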

Diagnostics

1. Power State Synchronization Warnings

Compute service logs indicate mismatches between DB-reported VM count and hypervisor-reported VM count:

Bash
$ less /var/log/pf9/ostackhost.log
WARNING nova.compute.manager [...] While synchronizing instance power states, found 83 instances in the database and 86 instances on the hypervisor.

These messages demonstrate:

The presence of additional QEMU processes not tracked in DB.
Consistent mismatch over multiple cycles.
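To see whether the mismatch persists across cycles, the counts can be pulled out of the warnings and compared. This is a minimal sketch that assumes the message wording matches the sample shown above; `sync_mismatches` is a hypothetical helper name, not part of any Platform9 or Nova tooling.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on standard input and print
# "<db_count> <hypervisor_count> <difference>" for every power-state
# sync warning found. Lines that do not match are ignored.
sync_mismatches() {
    sed -n 's/.*found \([0-9][0-9]*\) instances in the database and \([0-9][0-9]*\) instances on the hypervisor.*/\1 \2/p' |
    while read -r db hv; do
        # A positive difference means extra instances on the hypervisor.
        echo "$db $hv $((hv - db))"
    done
}

# Usage (log path taken from this article):
#   sync_mismatches < /var/log/pf9/ostackhost.log
```

A difference that stays positive over several consecutive warnings points to orphaned processes rather than a transient gap from an in-progress deletion or migration.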

Resolution

To ensure Compute Service automatically cleans up orphaned QEMU processes, add the following configuration to /opt/pf9/etc/nova/conf.d/nova_override.conf on all affected compute hosts:

Hypervisor Host
$ vi /opt/pf9/etc/nova/conf.d/nova_override.conf
[DEFAULT]
running_deleted_instance_action = reap

After adding this change, restart the pf9-ostackhost service on the affected hypervisors. The restart also cleans up any VMs that are stuck in the deleting phase according to the Compute Service database.

Hypervisor Host
$ sudo systemctl restart pf9-ostackhost
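After the restart, a quick check can confirm that the override is present in the [DEFAULT] section of the file. The following is a sketch only: `get_reap_action` is a hypothetical helper, and it reads just the single file passed to it, so it does not reflect values that other Nova configuration files might set.

```shell
#!/bin/sh
# Hypothetical helper: print the value of running_deleted_instance_action
# from the [DEFAULT] section of the given config file. Prints nothing if
# the option is not set in that section.
get_reap_action() {
    awk '
        # Track whether the current section header is [DEFAULT].
        /^[ \t]*\[/ { in_default = ($0 ~ /^[ \t]*\[DEFAULT\][ \t]*$/); next }
        in_default && /^[ \t]*running_deleted_instance_action[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")   # strip the key and the "="
            sub(/[ \t]*$/, "")         # strip trailing whitespace
            value = $0
        }
        END { if (value != "") print value }
    ' "$1"
}

# Usage (path and service name taken from this article):
#   get_reap_action /opt/pf9/etc/nova/conf.d/nova_override.conf
#   systemctl is-active pf9-ostackhost
```

The first command is expected to print the configured action, and `systemctl is-active` reports whether the service came back up after the restart.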

Effect of this configuration: When Nova detects an instance running on the hypervisor that does not exist in the Compute Service database, it powers the instance off and deletes it from the hypervisor. As a result:
1. Compute node CPU and memory resources are freed.
2. Hypervisor state stays aligned with the Nova DB state.
3. Orphaned VMs are removed automatically within the standard 30-minute sync cycle.
This configuration is included as part of the standard deployment from Platform9 Cloud Director - v2025.10 and higher.
