In the Linux kernel, the following vulnerability has been resolved:
drm/sched: Check scheduler work queue before calling timeout handling
During an IGT GPU reset test we again see an oops, despite commit 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling").
That commit uses the ready condition to decide whether to call drm_sched_fault, which triggers the TDR that leads to a GPU reset. However, the ready condition appears to be overloaded with other meanings. For example, the following call stack, which is part of the GPU reset path:
0  gfx_v9_0_cp_gfx_start
1  gfx_v9_0_cp_gfx_resume
2  gfx_v9_0_cp_resume
3  gfx_v9_0_hw_init
4  gfx_v9_0_resume
5  amdgpu_device_ip_resume_phase2
does the following:
	/* start the ring */
	gfx_v9_0_cp_gfx_start(adev);
	ring->sched.ready = true;
The same approach is used for other ASICs as well: gfx_v8_0_cp_gfx_resume, gfx_v10_0_kiq_resume, etc.
As a result, our GPU reset test causes a GPU fault, which unconditionally calls gfx_v9_0_fault and then drm_sched_fault. The outcome now depends on timing. If the interrupt service routine drm_sched_fault executes after gfx_v9_0_cp_gfx_start has completed, the ready field of the scheduler is already true, even for uninitialized schedulers, and an oops occurs. If the ISR drm_sched_fault completes before gfx_v9_0_cp_gfx_start, the NULL pointer dereference does not occur.
Use the timeout_wq field to prevent the oops for uninitialized schedulers. The field can be initialized with the work queue of the reset domain.
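
As a rough illustration of the guard described above, the following user-space C sketch models a fault handler that checks the work queue pointer rather than the ready flag. The type and function names (struct workqueue, struct gpu_scheduler, sched_fault) are hypothetical stand-ins for this example only. They are not the kernel's definitions, and the sketch does not reproduce the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel work queue; only counts queued items. */
struct workqueue {
	int queued;
};

/* Hypothetical, simplified scheduler: just the two fields discussed above. */
struct gpu_scheduler {
	bool ready;                   /* may be set true by hardware resume paths */
	struct workqueue *timeout_wq; /* NULL until the scheduler is initialized */
};

/*
 * Model of the guarded fault path: timeout work is queued only when the
 * scheduler has a work queue. The ready flag is deliberately ignored,
 * because it can be true for a scheduler that was never initialized.
 * Returns true if work was queued, false if the fault was skipped.
 */
bool sched_fault(struct gpu_scheduler *sched)
{
	if (!sched->timeout_wq)
		return false;

	sched->timeout_wq->queued++;
	return true;
}
```

In this model, a scheduler with ready set to true but timeout_wq left NULL is skipped instead of being dereferenced, which is the case the description identifies as the source of the oops.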
v1: Corrections to commit message (Luben)