In the Linux kernel, the following vulnerability has been resolved:
cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
A hung_task problem shown below was found:
INFO: task kworker/0:0:8 blocked for more than 327 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
 <TASK>
 __schedule+0x5a2/0x2050
 ? find_held_lock+0x33/0x100
 ? wq_worker_sleeping+0x9e/0xe0
 schedule+0x9f/0x180
 schedule_preempt_disabled+0x25/0x50
 __mutex_lock+0x512/0x740
 ? cgroup_bpf_release+0x1e/0x4d0
 ? cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? cgroup_bpf_release+0x1e/0x4d0
 ? mutex_lock_nested+0x2b/0x40
 ? __pfx_delay_tsc+0x10/0x10
 mutex_lock_nested+0x2b/0x40
 cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
 ? process_scheduled_works+0x161/0x8a0
 process_scheduled_works+0x23a/0x8a0
 worker_thread+0x231/0x5b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x14d/0x1c0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x59/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
This issue can be reproduced by the following pressure test:
1. A large number of cpuset cgroups are deleted.
2. Set cpu on and off repeatedly.
3. Set watchdog_thresh repeatedly.
The scripts can be obtained at the LINK mentioned above the signature.
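The referenced scripts are not reproduced here. Purely as a hypothetical illustration of the three stress actions (the cgroup mount point /sys/fs/cgroup, the use of cpu1, and the iteration counts are assumptions, not taken from those scripts), a userspace C sketch could look like this:

  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Write a string to a sysfs/procfs file, ignoring errors for brevity. */
  static void write_str(const char *path, const char *val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return;
          fputs(val, f);
          fclose(f);
  }

  int main(void)
  {
          char path[256];
          int i;

          /* 1. Create and then delete a large number of cpuset cgroups. */
          for (i = 0; i < 2000; i++) {
                  snprintf(path, sizeof(path), "/sys/fs/cgroup/test%d", i);
                  mkdir(path, 0755);
          }
          for (i = 0; i < 2000; i++) {
                  snprintf(path, sizeof(path), "/sys/fs/cgroup/test%d", i);
                  rmdir(path);
          }

          /* 2. Set a cpu off and on repeatedly. */
          for (i = 0; i < 100; i++) {
                  write_str("/sys/devices/system/cpu/cpu1/online", "0");
                  write_str("/sys/devices/system/cpu/cpu1/online", "1");
          }

          /* 3. Set watchdog_thresh repeatedly. */
          for (i = 0; i < 100; i++) {
                  write_str("/proc/sys/kernel/watchdog_thresh", "11");
                  write_str("/proc/sys/kernel/watchdog_thresh", "10");
          }

          return 0;
  }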
The reason for this issue is that cgroup_mutex and cpu_hotplug_lock are acquired in different tasks, which may lead to deadlock. It can lead to a deadlock through the following steps:
1. A large number of cpusets are deleted asynchronously, which puts a large number of cgroup_bpf_release works into system_wq. The max_active of system_wq is WQ_DFL_ACTIVE (256). Consequently, all active works are cgroup_bpf_release works, and many cgroup_bpf_release works will be put into the inactive queue. As illustrated in the diagram, there are 256 (in the active queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put an smp_call_on_cpu work into system_wq. However, step 1 has already filled system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work' has to wait until the works that were put into the inactive queue earlier have executed (n cgroup_bpf_release works), so it will be blocked for a while (a toy sketch of this saturation is given after the diagram below).
3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. When cgroup_mutex is acquired by the work at css_killed_work_fn, it will call cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read. However, cpuset_css_offline will be blocked by step 3.
5. At this moment, there are 256 works in the active queue that are cgroup_bpf_release; they are attempting to acquire cgroup_mutex, and as a result, all of them are blocked. Consequently, sscs.work cannot be executed. Ultimately, this situation leads to four processes being blocked, forming a deadlock.
system_wq(step1)            WatchDog(step2)               cpu offline(step3)        cgroup_destroy_wq(step4)
...
2000+ cgroups deleted async
256 actives + n inactives
                            __lockup_detector_reconfigure
                            P(cpu_hotplug_lock.read)
                            put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work waits to be executed
                            waiting for sscs.work to finish
                                                          percpu_down_write
                                                          P(cpu_hotplug_lock.write)
                                                          ...blocking...
                                                                                    css_killed_work_fn
                                                                                    P(cgroup_mutex)
                                                                                    cpuset_css_offline
                                                                                    P(cpu_hotplug_lock.read)
                                                                                    ...blocking...
...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
...blocking...
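The saturation in steps 1-2 can be illustrated with a toy module (a minimal sketch, not part of the patch; the module name, blocker_fn/marker_fn helpers, and the 1-second sleep are made up for illustration). It floods system_wq with more work items than WQ_DFL_ACTIVE and then queues one more item which, like sscs.work, can only run after the earlier overflow items have been processed:

  #include <linux/module.h>
  #include <linux/workqueue.h>
  #include <linux/delay.h>

  /* More items than WQ_DFL_ACTIVE (256), so some land on the inactive list. */
  #define NR_BLOCKERS 300

  static struct work_struct blockers[NR_BLOCKERS];
  static struct work_struct marker;

  /* Stand-in for a cgroup_bpf_release work stuck waiting on cgroup_mutex. */
  static void blocker_fn(struct work_struct *work)
  {
          msleep(1000);
  }

  /* Stand-in for the sscs.work queued by smp_call_on_cpu(). */
  static void marker_fn(struct work_struct *work)
  {
          pr_info("marker work ran\n");
  }

  static int __init wq_saturation_demo_init(void)
  {
          int i;

          for (i = 0; i < NR_BLOCKERS; i++) {
                  INIT_WORK(&blockers[i], blocker_fn);
                  queue_work(system_wq, &blockers[i]);
          }

          /* Queued last: must wait behind the inactive queue, like sscs.work. */
          INIT_WORK(&marker, marker_fn);
          queue_work(system_wq, &marker);

          return 0;
  }

  static void __exit wq_saturation_demo_exit(void)
  {
          int i;

          for (i = 0; i < NR_BLOCKERS; i++)
                  cancel_work_sync(&blockers[i]);
          cancel_work_sync(&marker);
  }

  module_init(wq_saturation_demo_init);
  module_exit(wq_saturation_demo_exit);
  MODULE_LICENSE("GPL");

With everything queued from one CPU, only WQ_DFL_ACTIVE of these items are active at a time on that CPU's pool, so the marker message appears only after the overflow blockers ahead of it have been activated, which is roughly the delay sscs.work sees in step 2.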
To fix the problem, place cgroup_bpf_release works on a dedicated workqueue which can break the loop and solve the problem. System wqs are for misc things which shouldn't create a large number of concurrent work items. If something is going to generate > ---truncated---
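A minimal sketch of that approach is shown below, assuming a dedicated workqueue named cgroup_bpf_destroy_wq with max_active of 1; the workqueue name, the initcall, and the surrounding details are illustrative rather than a verbatim copy of the patch, and cgroup_bpf_release() refers to the release work function that already exists in kernel/bpf/cgroup.c:

  #include <linux/bpf-cgroup.h>
  #include <linux/cgroup.h>
  #include <linux/workqueue.h>

  /* Dedicated workqueue so cgroup bpf destruction no longer competes for
   * system_wq's WQ_DFL_ACTIVE slots. */
  static struct workqueue_struct *cgroup_bpf_destroy_wq;

  static int __init cgroup_bpf_wq_init(void)
  {
          cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
          if (!cgroup_bpf_destroy_wq)
                  panic("Failed to alloc workqueue for cgroup bpf destroy.\n");
          return 0;
  }
  core_initcall(cgroup_bpf_wq_init);

  /* percpu_ref release callback: queue the existing cgroup_bpf_release()
   * work on the dedicated workqueue instead of system_wq. */
  static void cgroup_bpf_release_fn(struct percpu_ref *ref)
  {
          struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

          INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
          queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work);
  }

Because the destruction works no longer occupy system_wq's active slots, sscs.work (and anything else queued on system_wq) can proceed even when thousands of cgroups are being torn down, breaking the cycle described in steps 1-5.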