In the Linux kernel, the following vulnerability has been resolved:
blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
We're seeing crashes from rq_qos_wake_function that look like this:
  BUG: unable to handle page fault for address: ffffafe180a40084
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 100000067 P4D 100000067 PUD 10027c067 PMD 10115d067 PTE 0
  Oops: Oops: 0002 [#1] PREEMPT SMP PTI
  CPU: 17 UID: 0 PID: 0 Comm: swapper/17 Not tainted 6.12.0-rc3-00013-geca631b8fe80 #11
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
  RIP: 0010:_raw_spin_lock_irqsave+0x1d/0x40
  Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 54 9c 41 5c fa 65 ff 05 62 97 30 4c 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 0a 4c 89 e0 41 5c c3 cc cc cc cc 89 c6 e8 2c 0b 00
  RSP: 0018:ffffafe180580ca0 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: ffffafe180a3f7a8 RCX: 0000000000000011
  RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffafe180a40084
  RBP: 0000000000000000 R08: 00000000001e7240 R09: 0000000000000011
  R10: 0000000000000028 R11: 0000000000000888 R12: 0000000000000002
  R13: ffffafe180a40084 R14: 0000000000000000 R15: 0000000000000003
  FS: 0000000000000000(0000) GS:ffff9aaf1f280000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffafe180a40084 CR3: 000000010e428002 CR4: 0000000000770ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
   <IRQ>
   try_to_wake_up+0x5a/0x6a0
   rq_qos_wake_function+0x71/0x80
   __wake_up_common+0x75/0xa0
   __wake_up+0x36/0x60
   scale_up.part.0+0x50/0x110
   wb_timer_fn+0x227/0x450
   ...
So rq_qos_wake_function() calls wake_up_process(data->task), which calls try_to_wake_up(), which faults in raw_spin_lock_irqsave(&p->pi_lock).
p comes from data->task, and data comes from the waitqueue entry, which is stored on the waiter's stack in rq_qos_wait(). Analyzing the core dump with drgn, I found that the waiter had already woken up and moved on to a completely unrelated code path, clobbering what was previously data->task. Meanwhile, the waker was passing the clobbered garbage in data->task to wake_up_process(), leading to the crash.
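To make the "data comes from the waitqueue entry" step concrete, here is a minimal userspace sketch in C. It is not the kernel source: struct wait_data, struct task, data_from_entry() and waiter_sees_own_data() are hypothetical stand-ins, and the field layout is an assumption chosen for illustration. It only shows how a waker that is handed a waitqueue entry can step back to the enclosing structure, and why everything it finds there lives in the waiter's stack frame.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; the layout is illustrative only. */
struct list_head { struct list_head *next, *prev; };
struct wait_queue_entry { struct list_head entry; };
struct task { int pid; };              /* hypothetical stand-in for task_struct */

struct wait_data {                     /* hypothetical model of the waiter's data */
    struct wait_queue_entry wq;
    struct task *task;
    bool got_token;
};

/* Same idea as the kernel's container_of(): step back from a member to its parent. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The waker is only handed the waitqueue entry and derives the rest from it. */
static struct wait_data *data_from_entry(struct wait_queue_entry *curr)
{
    assert(curr != NULL);
    return container_of(curr, struct wait_data, wq);
}

/*
 * Models a waiter: 'data' is an automatic variable, so the pointer returned by
 * data_from_entry() is only meaningful while this function has not returned.
 */
static bool waiter_sees_own_data(struct task *self)
{
    struct wait_data data = { .task = self, .got_token = false };
    struct wait_data *seen = data_from_entry(&data.wq);

    return seen == &data && seen->task == self;
}
```

Once the modeled waiter returns, the storage behind the recovered pointer is free to be reused by later calls, which corresponds to the clobbered data->task described above.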
What's happening is that in between rq_qos_wake_function() deleting the waitqueue entry and calling wake_up_process(), rq_qos_wait() is finding that it already got a token and returning. The race looks like this:
  rq_qos_wait()                           rq_qos_wake_function()
  ==============================================================
  prepare_to_wait_exclusive()
                                          data->got_token = true;
                                          list_del_init(&curr->entry);
  if (data.got_token)
          break;
  finish_wait(&rqw->wait, &data.wq);
    ^- returns immediately because
       list_empty_careful(&wq_entry->entry)
       is true
  ... return, go do something else ...
                                          wake_up_process(data->task)
                                            (NO LONGER VALID!)-^
Normally, finish_wait() is supposed to synchronize against the waker. But, as noted above, it is returning immediately because the waitqueue entry has already been removed from the waitqueue.
The bug is that rq_qos_wake_function() is accessing the waitqueue entry AFTER deleting it. Note that autoremove_wake_function() wakes the waiter and THEN deletes the waitqueue entry, which is the proper order.
Fix it by swapping the order. We also need to use list_del_init_careful() to match the list_empty_careful() in finish_wait().
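The corrected ordering can be sketched as a single-threaded userspace model in C11. This is an illustration under stated assumptions, not the kernel patch: struct node, struct waiter, the node_*() helpers and wake_fixed_order() are hypothetical names, the wakeup itself is reduced to setting a flag, and the release/acquire pair only approximates the pairing between list_del_init_careful() and list_empty_careful(). The model does not reproduce the race; it only shows that the entry is still linked at the moment of the wakeup and that unlinking is the last access.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a list node; not the kernel's struct list_head. */
struct node {
    _Atomic(struct node *) next;
    _Atomic(struct node *) prev;
};

static void node_init(struct node *n)
{
    atomic_store_explicit(&n->prev, n, memory_order_relaxed);
    atomic_store_explicit(&n->next, n, memory_order_relaxed);
}

/* Link n at the tail of the list headed by head (caller is assumed to hold the lock). */
static void node_add_tail(struct node *n, struct node *head)
{
    struct node *last = atomic_load_explicit(&head->prev, memory_order_relaxed);

    atomic_store_explicit(&n->prev, last, memory_order_relaxed);
    atomic_store_explicit(&n->next, head, memory_order_relaxed);
    atomic_store_explicit(&last->next, n, memory_order_relaxed);
    atomic_store_explicit(&head->prev, n, memory_order_relaxed);
}

/* Unlink n; the final store is a release, so it publishes all earlier writes. */
static void node_del_init_careful(struct node *n)
{
    struct node *next = atomic_load_explicit(&n->next, memory_order_relaxed);
    struct node *prev = atomic_load_explicit(&n->prev, memory_order_relaxed);

    atomic_store_explicit(&next->prev, prev, memory_order_relaxed);
    atomic_store_explicit(&prev->next, next, memory_order_relaxed);
    atomic_store_explicit(&n->prev, n, memory_order_relaxed);
    atomic_store_explicit(&n->next, n, memory_order_release);
}

/* Lockless "is n unlinked?" check; the acquire pairs with the release above. */
static bool node_empty_careful(struct node *n)
{
    struct node *next = atomic_load_explicit(&n->next, memory_order_acquire);

    return next == n &&
           next == atomic_load_explicit(&n->prev, memory_order_relaxed);
}

struct waiter {                 /* hypothetical model of the on-stack wait data */
    struct node entry;
    bool got_token;
    bool woken;                 /* stand-in for wake_up_process(data->task) */
    bool linked_when_woken;     /* records whether the entry was still queued */
};

/* Fixed order: hand over the token, wake the waiter, and only then unlink. */
static int wake_fixed_order(struct waiter *w)
{
    assert(w != NULL);
    w->got_token = true;
    w->linked_when_woken = !node_empty_careful(&w->entry);
    w->woken = true;
    node_del_init_careful(&w->entry);   /* last access to *w */
    return 1;
}
```

In this model a waiter that polls node_empty_careful() without taking a lock can only see the entry as unlinked after the waker has finished every other access to the waiter's data, which is the property the reordering is meant to restore.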