In the Linux kernel, the following vulnerability has been resolved:
net/sched: taprio: avoid disabling offload when it was never enabled
In an incredibly strange API design decision, qdisc->destroy() gets called even if qdisc->init() never succeeded, not exclusively since commit 87b60cfacf9f ("netsched: fix error recovery at qdisc creation"), but apparently also earlier (in the case of qdisccreate_dflt()).
The taprio qdisc does not fully acknowledge this when it attempts full offload, because it starts off with q->flags = TAPRIOFLAGSINVALID in taprioinit(), then it replaces q->flags with TCATAPRIOATTRFLAGS parsed from netlink (in tapriochange(), tail called from taprioinit()).
But in tapriodestroy(), we call tapriodisableoffload(), and this determines what to do based on FULLOFFLOADISENABLED(q->flags).
But looking at the implementation of FULLOFFLOADISENABLED() (a bitwise check of bit 1 in q->flags), it is invalid to call this macro on q->flags when it contains TAPRIOFLAGSINVALID, because that is set to U32MAX, and therefore FULLOFFLOADIS_ENABLED() will return true on an invalid set of flags.
As a result, it is possible to crash the kernel if user space forces an error between setting q->flags = TAPRIOFLAGSINVALID, and the calling of taprioenableoffload(). This is because drivers do not expect the offload to be disabled when it was never enabled.
The error that we force here is to attach taprio as a non-root qdisc, but instead as child of an mqprio root qdisc:
$ tc qdisc add dev swp0 root handle 1: \ mqprio numtc 8 map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0 $ tc qdisc replace dev swp0 parent 1:1 \ taprio numtc 8 map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 base-time 0 \ sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \ flags 0x0 clockid CLOCKTAI Unable to handle kernel paging request at virtual address fffffffffffffff8 [fffffffffffffff8] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 96000004 [#1] PREEMPT SMP Call trace: tapriodump+0x27c/0x310 vsc9959portsetuptc+0x1f4/0x460 felixportsetuptc+0x24/0x3c dsaslavesetuptc+0x54/0x27c tapriodisableoffload.isra.0+0x58/0xe0 tapriodestroy+0x80/0x104 qdisccreate+0x240/0x470 tcmodifyqdisc+0x1fc/0x6b0 rtnetlinkrcvmsg+0x12c/0x390 netlinkrcvskb+0x5c/0x130 rtnetlinkrcv+0x1c/0x2c
Fix this by keeping track of the operations we made, and undo the offload only if we actually did it.
I've added "bool offloaded" inside a 4 byte hole between "int clockid" and "atomic64t picosper_byte". Now the first cache line looks like below:
$ pahole -C tapriosched net/sched/schtaprio.o struct tapriosched { struct Qdisc * * qdiscs; /* 0 8 */ struct Qdisc * root; /* 8 8 */ u32 flags; /* 16 4 */ enum tkoffsets tk_offset; /* 20 4 / int clockid; / 24 4 / bool offloaded; / 28 1 */
/* XXX 3 bytes hole, try to pack */
atomic64_t picos_per_byte; /* 32 0 */
/* XXX 8 bytes hole, try to pack */
spinlock_t current_entry_lock; /* 40 0 */
/* XXX 8 bytes hole, try to pack */
struct sched_entry * current_entry; /* 48 8 */
struct sched_gate_list * oper_sched; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */