vGPU Mapped Instances Fail to Boot With Error "Node device not found"
Problem
- After rebooting a vGPU-capable host (as part of an upgrade or a normal reboot), the vGPU-enabled VMs fail to start.
- This prevents the pf9-ostackhost service from converging properly: new VMs are not spawned, and existing VMs remain stuck in the powering-up state with the error below.

ostackhost.log
INFO nova.virt.libvirt.host [req-ID None None] Secure Boot support detected
ERROR oslo_service.service [req-ID None None] Error starting thread.: libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_[UUID-of-mediated-devices]'
TRACE oslo_service.service Traceback (most recent call last):
TRACE oslo_service.service   File "/opt/pf9/venv/lib/python3.9/site-packages/oslo_service/service.py", line 810, in run_service
TRACE oslo_service.service     service.start()
[..]
TRACE oslo_service.service     raise libvirtError('virNodeDeviceLookupByName() failed')
TRACE oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_[UUID-of-mediated-devices]'

Environment
- Private Cloud Director Virtualisation - v2025.6 and higher
- Self-Hosted Private Cloud Director Virtualisation - v2025.6 and higher
- Component: GPU [NVIDIA drivers v570 and v580]
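The missing node device can also be checked directly against libvirt. A minimal sketch, assuming virsh is installed on the host (it ships with the libvirt tooling the hypervisor uses):

```shell
# List mediated node devices known to libvirt; on an affected host this
# returns nothing, matching the "Node device not found" error above.
if command -v virsh >/dev/null 2>&1; then
    virsh nodedev-list --cap mdev
else
    echo "virsh not available on this machine"
fi
```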
Solution
- This is a known issue tracked as bug PCD-2656. Reach out to the Platform9 Support Team and mention the bug ID to track its progress.
- For remediation, follow the details provided in the Workaround section below.
Root Cause
- The identified cause is that the mdev devices are not re-associated with the consumed vGPUs after a reboot, because the interconnecting logic has not yet been implemented.
- As a result, VMs attempting to go active fail with a "no node device found" error.
Workaround
To confirm this issue:
- The output of $ mdevctl list is empty.
- The output of $ lspci -nnn | grep -i nvidia does not list the SR-IOV virtual functions (VFs).

To resolve the issue, run the GPU configuration script located in /opt/pf9/gpu:
- Move to the directory containing the script: $ cd /opt/pf9/gpu/
- Run the script $ sudo ./pf9-gpu-configure.sh with option 3) vGPU SR-IOV configure.
- Run pf9-gpu-configure.sh with option 6) Validate vGPU to check that the GPU is configured.
- Re-run $ lspci -nnn | grep -i nvidia, which should now list all the VFs for the given GPU.
- Run pf9-gpu-configure.sh with option 4) vGPU host configure.
- In the UI, under the GPU host section, the host should now be visible.
- In the UI, select the host and the required GPU profile for the host, then save the form.
- Monitor the UI until the host completes the converge action.
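The two confirmation checks can be combined into a small script. This is a sketch only, assuming mdevctl and lspci are present on the host:

```shell
#!/usr/bin/env bash
# Sketch: confirm the vGPU boot-failure symptom after a host reboot.

# 1. No mediated devices survived the reboot:
if [ -z "$(mdevctl list 2>/dev/null)" ]; then
    echo "mdevctl list is empty: mdev devices are missing"
fi

# 2. Count visible NVIDIA PCI functions.  A configured SR-IOV GPU exposes
# many functions (PF plus VFs); seeing only the PF matches this issue.
nvidia_fns=$(lspci -nnn 2>/dev/null | grep -ci nvidia)
echo "NVIDIA PCI functions visible: ${nvidia_fns}"
```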
After Step 3, list from the command line the UUIDs associated with the failing VM instances in the pf9-ostackhost logs.

Identify UUID errors
- Check the ostackhost logs to identify the UUIDs causing attachment errors.

OstackHost log
TRACE oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_[UUID_OF_MEDIATED_DEVICES]'

Map UUIDs to bus IDs
- Use the echo command to map each UUID to the appropriate bus on the vGPU host:

Command
$ echo <UUID> > /sys/class/mdev_bus/<BUS_ID>/mdev_supported_types/nvidia-558/create

Example:
$ echo [UUID_OF_MEDIATED_DEVICES] > /sys/class/mdev_bus/0000:21:00.5/mdev_supported_types/nvidia-558/create
Restart the NVIDIA vGPU manager
- Restart the NVIDIA vGPU manager service and verify its status:

Command
$ systemctl restart nvidia-vgpu-mgr
$ systemctl status nvidia-vgpu-mgr

- Restart the ostackhost service and monitor the status of the stuck VMs from the UI.

Command
$ systemctl restart pf9-ostackhost

Validation
List the newly added mdev devices using the command below:
- The mdevctl list command now gives non-empty output.

Command
$ mdevctl list

Example:
$ mdevctl list
[UUID_OF_MEDIATED_DEVICES_1] 0000:21:00.5 nvidia-558d
[UUID_OF_MEDIATED_DEVICES_2] 0000:21:00.6 nvidia-558d
[UUID_OF_MEDIATED_DEVICES_3] 0000:21:00.7 nvidia-558d

- The vGPU-enabled VMs are no longer stuck in the powering-on state.
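The validation can be wrapped in a quick count check; a sketch, assuming mdevctl is on the PATH and one mdev device is expected per vGPU VM:

```shell
#!/usr/bin/env bash
# Sketch: count mdev devices after the workaround.
count=$(mdevctl list 2>/dev/null | wc -l)
echo "mdev devices present: ${count}"
if [ "${count}" -eq 0 ]; then
    echo "still empty: re-run the pf9-gpu-configure.sh steps above" >&2
fi
```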