In the Linux kernel, the following vulnerability has been resolved:
block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
The blk_mq_{add,del}_queue_tag_set() functions add and remove queues from a tagset; the functions make sure that the tagset and its queues are marked as shared when two or more queues are attached to the same tagset. Initially a tagset starts as unshared, and when the number of added queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along with all the queues attached to it. When the number of attached queues drops to one, blk_mq_del_queue_tag_set() needs to mark both the tagset and the remaining queues as unshared.
Both functions need to freeze the current queues in the tagset before setting or unsetting the BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions hold the set->tag_list_lock mutex, which makes sense as we do not want queues to be added or deleted in the process. This used to work fine until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset") made the nvme driver quiesce the tagset instead of quiescing individual queues. blk_mq_quiesce_tagset() does the job and quiesces the queues in set->tag_list while also holding set->tag_list_lock.
This results in deadlock between two threads with these stacktraces:
__schedule+0x47c/0xbb0
 ? timerqueue_add+0x66/0xb0
 schedule+0x1c/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.constprop.0+0x271/0x600
 blk_mq_quiesce_tagset+0x25/0xc0
 nvme_dev_disable+0x9c/0x250
 nvme_timeout+0x1fc/0x520
 blk_mq_handle_expired+0x5c/0x90
 bt_iter+0x7e/0x90
 blk_mq_queue_tag_busy_iter+0x27e/0x550
 ? __blk_mq_complete_request_remote+0x10/0x10
 ? __blk_mq_complete_request_remote+0x10/0x10
 ? __call_rcu_common.constprop.0+0x1c0/0x210
 blk_mq_timeout_work+0x12d/0x170
 process_one_work+0x12e/0x2d0
 worker_thread+0x288/0x3a0
 ? rescuer_thread+0x480/0x480
 kthread+0xb8/0xe0
 ? kthread_park+0x80/0x80
 ret_from_fork+0x2d/0x50
 ? kthread_park+0x80/0x80
 ret_from_fork_asm+0x11/0x20
__schedule+0x47c/0xbb0
 ? xas_find+0x161/0x1a0
 schedule+0x1c/0xa0
 blk_mq_freeze_queue_wait+0x3d/0x70
 ? destroy_sched_domains_rcu+0x30/0x30
 blk_mq_update_tag_set_shared+0x44/0x80
 blk_mq_exit_queue+0x141/0x150
 del_gendisk+0x25a/0x2d0
 nvme_ns_remove+0xc9/0x170
 nvme_remove_namespaces+0xc7/0x100
 nvme_remove+0x62/0x150
 pci_device_remove+0x23/0x60
 device_release_driver_internal+0x159/0x200
 unbind_store+0x99/0xa0
 kernfs_fop_write_iter+0x112/0x1e0
 vfs_write+0x2b1/0x3d0
 ksys_write+0x4e/0xb0
 do_syscall_64+0x5b/0x160
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
The top stack trace shows nvme_timeout() being called to handle an nvme command timeout. The timeout handler is trying to disable the controller, and as a first step it needs to call blk_mq_quiesce_tagset() to tell blk-mq not to call queue callback handlers. The thread is stuck waiting for set->tag_list_lock as it tries to walk the queues in set->tag_list.
The lock is held by the second thread (bottom stack trace), which is waiting for one of the queues to be frozen. The queue usage counter would drop to zero only after nvme_timeout() finishes, and that will never happen because the first thread will wait for the mutex forever.
Given that [un]quiescing a queue is an operation that does not need to sleep, update blk_mq_[un]quiesce_tagset() to use RCU instead of taking set->tag_list_lock, and update blk_mq_{add,del}_queue_tag_set() to use RCU-safe list operations. Also, delete the INIT_LIST_HEAD(&q->tag_set_list) call in blk_mq_del_queue_tag_set(), because we cannot re-initialize the list entry while the list is being traversed under RCU. The deleted queue will not be added to or deleted from a tagset again, and it will be freed in blk_free_queue() after the end of the RCU grace period.
{
"osv_generated_from": "https://github.com/CVEProject/cvelistV5/tree/main/cves/2025/68xxx/CVE-2025-68756.json",
"cna_assigner": "Linux"
}