In the Linux kernel, the following vulnerability has been resolved:
net/mlx5e: Prevent deadlock while disabling aRFS
When disabling aRFS under the priv->state_lock
, any scheduled
aRFS works are canceled using the cancel_work_sync
function,
which waits for the work to end if it has already started.
However, while waiting for the work handler, the handler will
try to acquire the state_lock
which is already acquired.
The worker acquires the lock to delete the rules if the state is down, which is not the worker's responsibility since disabling aRFS deletes the rules.
Add an aRFS state variable, which indicates whether the aRFS is enabled and prevent adding rules when the aRFS is disabled.
Kernel log:
====================================================== WARNING: possible circular locking dependency detected
ethtool/386089 is trying to acquire lock: ffff88810f21ce68 ((workcompletion)(&rule->arfswork)){+.+.}-{0:0}, at: _flushwork+0x74/0x4e0
but task is already holding lock: ffff8884a1808cc0 (&priv->statelock){+.+.}-{3:3}, at: mlx5eethtoolsetchannels+0x53/0x200 [mlx5_core]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&priv->statelock){+.+.}-{3:3}: _mutexlock+0x80/0xc90 arfshandlework+0x4b/0x3b0 [mlx5core] processonework+0x1dc/0x4a0 workerthread+0x1bf/0x3c0 kthread+0xd7/0x100 retfromfork+0x2d/0x50 retfromforkasm+0x11/0x20
-> #0 ((workcompletion)(&rule->arfswork)){+.+.}-{0:0}: _lockacquire+0x17b4/0x2c80 lockacquire+0xd0/0x2b0 _flushwork+0x7a/0x4e0 _cancelworktimer+0x131/0x1c0 arfsdelrules+0x143/0x1e0 [mlx5core] mlx5earfsdisable+0x1b/0x30 [mlx5core] mlx5eethtoolsetchannels+0xcb/0x200 [mlx5core] ethnlsetchannels+0x28f/0x3b0 ethnldefaultsetdoit+0xec/0x240 genlfamilyrcvmsgdoit+0xd0/0x120 genlrcvmsg+0x188/0x2c0 netlinkrcvskb+0x54/0x100 genlrcv+0x24/0x40 netlinkunicast+0x1a1/0x270 netlinksendmsg+0x214/0x460 _socksendmsg+0x38/0x60 _syssendto+0x113/0x170 _x64syssendto+0x20/0x30 dosyscall64+0x40/0xe0 entrySYSCALL64after_hwframe+0x46/0x4e
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&priv->statelock); lock((workcompletion)(&rule->arfswork)); lock(&priv->statelock); lock((workcompletion)(&rule->arfswork));
* DEADLOCK *
3 locks held by ethtool/386089: #0: ffffffff82ea7210 (cblock){++++}-{3:3}, at: genlrcv+0x15/0x40 #1: ffffffff82e94c88 (rtnlmutex){+.+.}-{3:3}, at: ethnldefaultsetdoit+0xd3/0x240 #2: ffff8884a1808cc0 (&priv->statelock){+.+.}-{3:3}, at: mlx5eethtoolsetchannels+0x53/0x200 [mlx5_core]
stack backtrace: CPU: 15 PID: 386089 Comm: ethtool Tainted: G I 6.7.0-rc4netnextmlx55483eb2 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dumpstacklvl+0x60/0xa0 checknoncircular+0x144/0x160 _lockacquire+0x17b4/0x2c80 lockacquire+0xd0/0x2b0 ? _flushwork+0x74/0x4e0 ? savetrace+0x3e/0x360 ? _flushwork+0x74/0x4e0 _flushwork+0x7a/0x4e0 ? _flushwork+0x74/0x4e0 ? _lockacquire+0xa78/0x2c80 ? lockacquire+0xd0/0x2b0 ? markheldlocks+0x49/0x70 _cancelworktimer+0x131/0x1c0 ? markheldlocks+0x49/0x70 arfsdelrules+0x143/0x1e0 [mlx5core] mlx5earfsdisable+0x1b/0x30 [mlx5core] mlx5eethtoolsetchannels+0xcb/0x200 [mlx5core] ethnlsetchannels+0x28f/0x3b0 ethnldefaultsetdoit+0xec/0x240 genlfamilyrcvmsgdoit+0xd0/0x120 genlrcvmsg+0x188/0x2c0 ? ethn ---truncated---