In the Linux kernel, the following vulnerability has been resolved:
perf/x86/amd: Fix crash due to race between amd_pmu_enable_all, perf NMI and throttling
amd_pmu_enable_all() does:

    if (!test_bit(idx, cpuc->active_mask))
        continue;

    amd_pmu_enable_event(cpuc->events[idx]);
A perf NMI for another event can arrive between these two steps. The perf NMI handler internally disables and re-enables all events, including the one that the interrupted amd_pmu_enable_all() was in the process of enabling. If that unintentionally enabled event has a very low sampling period, it raises an immediate successive NMI and gets throttled, at which point x86_pmu_stop() clears cpuc->events[idx] and the corresponding cpuc->active_mask bit. When amd_pmu_enable_all() resumes after the NMIs are handled, it calls amd_pmu_enable_event() with event == NULL. This causes a kernel crash:
    BUG: kernel NULL pointer dereference, address: 0000000000000198
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    [...]
    Call Trace:
     <TASK>
     amd_pmu_enable_all+0x68/0xb0
     ctx_resched+0xd9/0x150
     event_function+0xb8/0x130
     ? hrtimer_start_range_ns+0x141/0x4a0
     ? perf_duration_warn+0x30/0x30
     remote_function+0x4d/0x60
     __flush_smp_call_function_queue+0xc4/0x500
     flush_smp_call_function_queue+0x11d/0x1b0
     do_idle+0x18f/0x2d0
     cpu_startup_entry+0x19/0x20
     start_secondary+0x121/0x160
     secondary_startup_64_no_verify+0xe5/0xeb
     </TASK>
The amd_pmu_disable_all()/amd_pmu_enable_all() calls inside the perf NMI handler were recently added as part of BRS enablement, but I'm not sure whether we really need them. We can instead just disable BRS at the beginning of the NMI handler and re-enable it while returning. This solves the issue by not enabling events whose active_mask bits are set but which have not yet been enabled in the hardware PMU.
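A hedged sketch of the proposed shape of the NMI handler, assuming BRS-only pause helpers are available (the amd_brs_* names below are illustrative; only amd_pmu_disable_all()/amd_pmu_enable_all() appear in the text above):

    static int amd_pmu_handle_irq(struct pt_regs *regs)
    {
            int handled;

            /* Pause only BRS across the handler, instead of
             * amd_pmu_disable_all(), so that events whose
             * active_mask bit is set but which the interrupted
             * amd_pmu_enable_all() has not yet programmed are
             * not enabled behind its back. */
            amd_brs_disable_all();
            handled = x86_pmu_handle_irq(regs);
            amd_brs_enable_all();

            return handled;
    }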