CVE-2023-53351

The existing check uses the scheduler's ready condition to decide whether to call drm_sched_fault, which starts TDR handling and leads to a GPU reset. However, the ready condition appears to be overloaded with other meanings. For example, the following call stack, related to GPU reset:

0 gfx_v9_0_cp_gfx_start
1 gfx_v9_0_cp_gfx_resume
2 gfx_v9_0_cp_resume
3 gfx_v9_0_hw_init
4 gfx_v9_0_resume
5 amdgpu_device_ip_resume_phase2

does the following:

    /* start the ring */
    gfx_v9_0_cp_gfx_start(adev);
    ring->sched.ready = true;

The same approach is used for other ASICs as well: gfx_v8_0_cp_gfx_resume, gfx_v10_0_kiq_resume, etc.

As a result, our GPU reset test causes a GPU fault, which unconditionally calls gfx_v9_0_fault and then drm_sched_fault. The outcome now depends on timing. If the interrupt service routine reaches drm_sched_fault after gfx_v9_0_cp_gfx_start has completed, the ready field of the scheduler is already true, even for uninitialized schedulers, and the result is an oops. If the ISR's drm_sched_fault completes before gfx_v9_0_cp_gfx_start, there is no fault and the NULL pointer dereference does not occur.

Use the timeout_wq field to prevent the oops for uninitialized schedulers. The field may be initialized with the work queue of the reset domain.
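The guard described above can be illustrated with a minimal user-space sketch. It does not reproduce the kernel's code: the types mock_workqueue and mock_sched and the function mock_sched_fault are hypothetical stand-ins invented for this example, and only the field names ready and timeout_wq come from the description. The sketch assumes that timeout_wq stays NULL until the scheduler is initialized, while ready may be set earlier by a hardware resume path.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel work queue; it only counts requests. */
struct mock_workqueue {
    int queued;
};

/* Hypothetical stand-in for a GPU scheduler, reduced to the two fields
 * discussed in the description. */
struct mock_sched {
    bool ready;                        /* may be set by a hardware resume path */
    struct mock_workqueue *timeout_wq; /* assumed NULL until scheduler init */
};

/*
 * Sketch of the fault path: queue timeout handling only when the scheduler
 * has a work queue. The ready flag is deliberately not consulted, because
 * it can be true for a scheduler that was never initialized.
 * Returns true if timeout handling was queued, false if it was skipped.
 */
static bool mock_sched_fault(struct mock_sched *sched)
{
    if (sched->timeout_wq == NULL)
        return false; /* uninitialized scheduler: nothing to dereference */

    sched->timeout_wq->queued++;
    return true;
}
```

With this guard, a scheduler whose ready flag is true but whose timeout_wq is NULL is skipped instead of dereferencing a NULL work queue, which is the case the ready-based check could not distinguish.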

v1: Corrections to commit message (Luben)

Database specific

{
    "osv_generated_from": "https://github.com/CVEProject/cvelistV5/tree/main/cves/2023/53xxx/CVE-2023-53351.json",
    "cna_assigner": "Linux"
}

Affected packages

Git / git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

Affected ranges

Type: GIT
Repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Events:
Introduced: 11b3b9f461c5c4f700f6c8da202fcc2fd6418e1f
Fixed: c43a96fc00b662cef1ef0eb22d40441ce2abae8f
Fixed: 2da5bffe9eaa5819a868e8eaaa11b3fd0f16a691

Affected versions

v6.*

v6.3

v6.3-rc7

v6.3.1

v6.3.2

v6.3.3

Database specific

{
    "source": "https://storage.googleapis.com/cve-osv-conversion/osv-output/CVE-2023-53351.json"
}