In the Linux kernel, the following vulnerability has been resolved:
KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during tdp_iter_next() and do not advance the iterator. Advancing the iterator results in skipping the top-level SPTE and all its children, which is fatal if any of the skipped SPTEs were not visited before yielding.
When zapping all SPTEs, i.e. when min_level == root_level, restarting the iter and then invoking tdp_iter_next() is always fatal if the current gfn has a valid SPTE, as advancing the iterator results in try_step_side() skipping the current gfn, which wasn't visited before yielding.
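The skip described above can be reproduced with a small userspace model. This is only a sketch, not the kernel code: the page-table walk is flattened into an array of eight entries, and every name (toy_iter, toy_iter_next(), toy_cond_resched(), toy_zap_all(), the honor_yield switch) is hypothetical and chosen for illustration. With honor_yield false the "next" step always advances, as in the buggy behavior; with honor_yield true it consumes the yielded flag and revisits the current entry instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ENTRIES 8

/* Hypothetical, flattened stand-in for the TDP MMU iterator. */
struct toy_iter {
	size_t index;   /* entry currently being visited */
	bool yielded;   /* set when the walker "dropped the lock" */
	bool valid;     /* false once the walk has run off the end */
};

static void toy_iter_start(struct toy_iter *it)
{
	it->index = 0;
	it->yielded = false;
	it->valid = true;
}

/*
 * Step to the next entry. When honor_yield is true and the walker yielded,
 * clear the flag and stay on the current entry so that it is still visited.
 * When honor_yield is false, always advance (the buggy behavior).
 */
static void toy_iter_next(struct toy_iter *it, bool honor_yield)
{
	if (honor_yield && it->yielded) {
		it->yielded = false;
		return;
	}
	it->yielded = false;
	it->index++;
	it->valid = it->index < NR_ENTRIES;
}

/* Pretend to yield exactly once, on reaching entry yield_at. */
static bool toy_cond_resched(struct toy_iter *it, size_t yield_at, bool *done)
{
	if (*done || it->index != yield_at)
		return false;
	*done = true;
	it->yielded = true;
	return true;
}

/* Walk all entries, "zapping" each visited one; return how many were zapped. */
static size_t toy_zap_all(bool honor_yield, size_t yield_at)
{
	bool zapped[NR_ENTRIES] = { false };
	bool done = false;
	struct toy_iter it;
	size_t count = 0;

	for (toy_iter_start(&it); it.valid; toy_iter_next(&it, honor_yield)) {
		/* Yield before processing the current entry, then retry. */
		if (toy_cond_resched(&it, yield_at, &done))
			continue;
		assert(it.index < NR_ENTRIES);
		zapped[it.index] = true;
	}

	for (size_t i = 0; i < NR_ENTRIES; i++)
		count += zapped[i];
	return count;
}
```

In this model, toy_zap_all(false, 3) zaps seven of the eight entries because entry 3 is skipped after the yield, while toy_zap_all(true, 3) zaps all eight.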
Sprinkle WARNs on iter->yielded being true in various helpers that are often used in conjunction with yielding, and tag the helper with __must_check to reduce the probability of improper usage.
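The two hardening measures can be sketched in userspace C as follows. This is an illustration under stated assumptions rather than the kernel implementation: toy_must_check stands in for the kernel's __must_check (using the GCC/Clang warn_unused_result attribute), toy_warn_on() stands in for WARN_ON(), and toy_iter_cond_resched() and toy_set_spte() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the kernel's __must_check (GCC/Clang attribute). */
#define toy_must_check __attribute__((warn_unused_result))

/* Counts warnings so that misuse is observable in this sketch. */
static int toy_warn_count;

/* Stand-in for WARN_ON(): report the condition and return it. */
static bool toy_warn_on(bool cond, const char *what)
{
	if (cond) {
		toy_warn_count++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Hypothetical iterator carrying only the yielded flag. */
struct toy_iter {
	bool yielded;
};

/*
 * Hypothetical yield helper. The attribute makes the compiler warn when a
 * caller ignores the return value and so fails to restart its loop.
 */
static bool toy_must_check toy_iter_cond_resched(struct toy_iter *it,
						 bool need_resched)
{
	assert(it != NULL);
	/* Yielding twice without restarting the walk in between is a bug. */
	toy_warn_on(it->yielded, "cond_resched with iter->yielded set");
	if (!need_resched)
		return false;
	it->yielded = true;
	return true;
}

/* Hypothetical modifier: must not run on an iterator that has yielded. */
static void toy_set_spte(struct toy_iter *it)
{
	assert(it != NULL);
	toy_warn_on(it->yielded, "set_spte with iter->yielded set");
}
```

Calling toy_set_spte() right after toy_iter_cond_resched() returns true increments toy_warn_count, which models a caller that keeps using the iterator instead of restarting its loop.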
Failing to zap a top-level SPTE manifests in one of two ways. If a valid SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(), the shadow page will be leaked and KVM will WARN accordingly.
  WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
  RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
  Call Trace:
   <TASK>
   kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
   kvm_destroy_vm+0x162/0x2a0 [kvm]
   kvm_vcpu_release+0x34/0x60 [kvm]
   __fput+0x82/0x240
   task_work_run+0x5c/0x90
   do_exit+0x364/0xa10
   ? futex_unqueue+0x38/0x60
   do_group_exit+0x33/0xa0
   get_signal+0x155/0x850
   arch_do_signal_or_restart+0xed/0x750
   exit_to_user_mode_prepare+0xc5/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x48/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of marking a struct page as dirty/accessed after it has been put back on the free list. This directly triggers a WARN due to encountering a page with page_count() == 0, but it can also lead to data corruption and additional errors in the kernel.
  WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
  RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
  Call Trace:
   <TASK>
   kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
   __handle_changed_spte+0x92e/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   zap_gfn_range+0x549/0x620 [kvm]
   kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
   mmu_free_root_page+0x219/0x2c0 [kvm]
   kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
   kvm_mmu_unload+0x1c/0xa0 [kvm]
   kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
   kvm_put_kvm+0x3b1/0x8b0 [kvm]
   kvm_vcpu_release+0x4e/0x70 [kvm]
   __fput+0x1f7/0x8c0
   task_work_run+0xf8/0x1a0
   do_exit+0x97b/0x2230
   do_group_exit+0xda/0x2a0
   get_signal+0x3be/0x1e50
   arch_do_signal_or_restart+0x244/0x17f0
   exit_to_user_mode_prepare+0xcb/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x4d/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae
Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still incorrectly advance past a top-level entry when yielding on a lower-level entry. But with respect to leaking shadow pages, the bug was introduced by yielding before processing the current gfn.
Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or callers could jump to their "retry" label. The downside of that approach is that tdp_mmu_iter_cond_resched() must be called before anything else in the loop, and there's no easy way to enforce that requirement.
Ideally, KVM would handle the cond_resched() fully within the iterator macro (the code is actually quite clean) and avoid this entire class of bugs, but that is extremely difficult do wh ---truncated---