CVE-2021-47209

Source
https://nvd.nist.gov/vuln/detail/CVE-2021-47209
Import Source
https://storage.googleapis.com/cve-osv-conversion/osv-output/CVE-2021-47209.json
JSON Data
https://api.osv.dev/v1/vulns/CVE-2021-47209
Related
Published
2024-04-10T19:15:48Z
Modified
2024-09-18T01:00:19Z
Summary
[none]
Details

In the Linux kernel, the following vulnerability has been resolved:

sched/fair: Prevent dead task groups from regaining cfs_rq's

Kevin is reporting crashes which point to a use-after-free of a cfs_rq in update_blocked_averages(). Initial debugging revealed that we have live cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in free_fair_sched_group(). However, it was unclear how that can happen.

His kernel config happened to lead to a layout of struct sched_entity that put the 'my_q' member directly into the middle of the object, which makes it incidentally overlap with SLUB's freelist pointer. That, in combination with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a reliable access violation in form of a #GP, which made the UAF fail fast.

Michal seems to have run into the same issue[1]. He already correctly diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") is causing the preconditions for the UAF to happen by re-adding cfs_rq's also to task groups that have no more running tasks, i.e. also to dead ones. His analysis, however, misses the real root cause and it cannot be seen from the crash backtrace only, as the real offender is tg_unthrottle_up() getting called via sched_cfs_period_timer() via the timer interrupt at an inconvenient time.

When unregister_fair_sched_group() unlinks all cfs_rq's from the dying task group, it doesn't protect itself from getting interrupted. If the timer interrupt triggers while we iterate over all CPUs or after unregister_fair_sched_group() has finished but prior to unlinking the task group, sched_cfs_period_timer() will execute and walk the list of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the dying task group. These will later -- in free_fair_sched_group() -- be kfree()'ed while still being linked, leading to the fireworks Kevin and Michal are seeing.

To fix this race, ensure the dying task group gets unlinked first. However, simply switching the order of unregistering and unlinking the task group isn't sufficient, as concurrent RCU walkers might still see it, as can be seen below:

    CPU1:                                      CPU2:
      :                                        timer IRQ:
      :                                          do_sched_cfs_period_timer():
      :                                            :
      :                                            distribute_cfs_runtime():
      :                                              rcu_read_lock();
      :                                              :
      :                                              unthrottle_cfs_rq():
    sched_offline_group():                             :
      :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
      list_del_rcu(&tg->list);                           :
(1)   :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
      :                                                  :
(2)   list_del_rcu(&tg->siblings);                       :
      :                                                tg_unthrottle_up():
    unregister_fair_sched_group():                       struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
      :                                                  :
      list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);             :
      :                                                  :
      :                                                  if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
(3)   :                                                    list_add_leaf_cfs_rq(cfs_rq);
      :                                                  :
      :                                                  :
      :                                                  :
---truncated---

References

Affected packages

Debian:12 / linux

Package

Name
linux
Purl
pkg:deb/debian/linux?arch=source

Affected ranges

Type
ECOSYSTEM
Events
Introduced
0 (unknown introduced version / all previous versions are affected)
Fixed
5.15.5-1

Ecosystem specific

{
    "urgency": "not yet assigned"
}

Debian:13 / linux

Package

Name
linux
Purl
pkg:deb/debian/linux?arch=source

Affected ranges

Type
ECOSYSTEM
Events
Introduced
0 (unknown introduced version / all previous versions are affected)
Fixed
5.15.5-1

Ecosystem specific

{
    "urgency": "not yet assigned"
}