In the Linux kernel, the following vulnerability has been resolved:
sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
There is a code path in dequeue_entities() that can set the slice of a
sched_entity to U64_MAX, which sometimes results in a crash.
The offending case is when dequeue_entities() is called to dequeue a
delayed group entity, and then the entity's parent's dequeue is delayed.
In that case:
- In the if (entity_is_task(se)) else block at the beginning of
dequeue_entities(), slice is set to
cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
- The first for_each_sched_entity() loop dequeues the entity.
- If the entity was its parent's only child, then the next iteration
tries to dequeue the parent.
- If the parent's dequeue needs to be delayed, then it breaks from the
first for_each_sched_entity() loop _without updating slice_.
- The second for_each_sched_entity() loop sets the parent's ->slice to
the saved slice, which is still U64_MAX.
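The sequence above can be reproduced in a small userspace model. This is
only a sketch: the toy_* types, the my_q and delay_dequeue fields, and the
3000000 ns slice are made up for illustration and are not the kernel's
definitions. Only the control flow follows the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the kernel structures; NOT the real definitions. */
struct toy_cfs_rq {
	unsigned int nr_queued;    /* entities currently queued here */
	uint64_t min_slice;        /* smallest slice among them, if any */
};

struct toy_se {
	struct toy_se *parent;     /* NULL at the top of the hierarchy */
	struct toy_cfs_rq *cfs_rq; /* runqueue this entity is queued on */
	struct toy_cfs_rq *my_q;   /* runqueue owned by this group entity */
	uint64_t slice;
	bool delay_dequeue;        /* hypothetical: this dequeue gets delayed */
};

/* As described above: an empty runqueue yields U64_MAX. */
uint64_t toy_min_slice(const struct toy_cfs_rq *rq)
{
	return rq->nr_queued ? rq->min_slice : UINT64_MAX;
}

/* Simplified model of the buggy flow for a group entity. */
void toy_dequeue_entities(struct toy_se *se)
{
	uint64_t slice;

	assert(se && se->my_q);
	slice = toy_min_slice(se->my_q);      /* U64_MAX if the group is empty */

	for (; se; se = se->parent) {         /* first loop */
		if (se->delay_dequeue)
			break;                /* slice is NOT refreshed here */
		se->cfs_rq->nr_queued--;      /* dequeue this entity */
		if (se->cfs_rq->nr_queued) {
			/* Other children remain: stop dequeuing upward. */
			slice = toy_min_slice(se->cfs_rq);
			se = se->parent;
			break;
		}
	}

	for (; se; se = se->parent)           /* second loop */
		se->slice = slice;            /* propagates the stale value */
}

/* Usage: an empty (delayed) group whose parent's dequeue is delayed. */
uint64_t toy_demo_parent_slice(void)
{
	struct toy_cfs_rq root_q   = { .nr_queued = 2, .min_slice = 3000000 };
	struct toy_cfs_rq parent_q = { .nr_queued = 1, .min_slice = 3000000 };
	struct toy_cfs_rq child_q  = { .nr_queued = 0, .min_slice = 0 };
	struct toy_se parent = { .parent = NULL, .cfs_rq = &root_q,
				 .my_q = &parent_q, .slice = 3000000,
				 .delay_dequeue = true };
	struct toy_se child  = { .parent = &parent, .cfs_rq = &parent_q,
				 .my_q = &child_q, .slice = 3000000,
				 .delay_dequeue = false };

	toy_dequeue_entities(&child);
	return parent.slice;
}
```

In this model toy_demo_parent_slice() returns UINT64_MAX: the parent
inherits the value computed for the empty child runqueue.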
This throws off subsequent calculations with potentially catastrophic
results. A manifestation we saw in production was:
- In update_entity_lag(), se->slice is used to calculate limit, which
ends up as a huge negative number.
- limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
is negative, vlag > limit, so se->vlag is set to the same huge
negative number.
- In place_entity(), se->vlag is scaled, which overflows and results in
another huge (positive or negative) number.
- The adjusted lag is subtracted from se->vruntime, which increases or
decreases se->vruntime by a huge number.
- pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
incorrectly returns false because the vruntime is so far from the
other vruntimes on the queue, causing the
(vruntime - cfs_rq->min_vruntime) * load calculation to overflow.
- Nothing appears to be eligible, so pick_eevdf() returns NULL.
- pick_next_entity() tries to dereference the return value of
pick_eevdf() and crashes.
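The clamp step in the list above is easy to check in isolation. The
following is a userspace sketch, not the kernel's clamp() macro:
toy_clamp() and toy_clamped_vlag() are hypothetical helpers, and the
negative limit of -(2^60) used in the note below is an arbitrary
stand-in for the "huge negative number".

```c
#include <assert.h>
#include <stdint.h>

/* Conventional clamp: cap at hi first, then floor at lo. */
int64_t toy_clamp(int64_t val, int64_t lo, int64_t hi)
{
	if (val >= hi)
		return hi;
	if (val <= lo)
		return lo;
	return val;
}

/* Models se->vlag = clamp(vlag, -limit, limit). */
int64_t toy_clamped_vlag(int64_t vlag, int64_t limit)
{
	assert(limit != INT64_MIN);   /* -INT64_MIN would overflow */
	return toy_clamp(vlag, -limit, limit);
}
```

With a sane limit the lag passes through or is capped as intended, e.g.
toy_clamped_vlag(-5, 100) gives -5 and toy_clamped_vlag(500, 100) gives
100. With limit = -(INT64_C(1) << 60), the upper bound is below any
realistic vlag, so toy_clamped_vlag(1000, limit) returns limit itself.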
Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
huge vruntime ranges and bogus vlag values, and I also traced se->slice
being set to U64_MAX on live systems (which was usually "benign" since
the rest of the runqueue needed to be in a particular state to crash).
Fix it in dequeue_entities() by always setting slice from the first
non-empty cfs_rq.
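One way to picture that rule is as a lookup that skips empty runqueues.
This is an illustrative sketch only and does not reproduce the actual
patch: struct toy_rq_level, its fields, and the fallback parameter are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-level runqueue summary; NOT a kernel structure. */
struct toy_rq_level {
	struct toy_rq_level *parent; /* next level up, NULL at the root */
	unsigned int nr_queued;      /* entities queued at this level */
	uint64_t min_slice;          /* smallest slice among them, if any */
};

/*
 * Walk upward and take the slice from the first level that still has
 * queued entities, so an empty level never contributes a sentinel.
 */
uint64_t toy_slice_from_first_nonempty(const struct toy_rq_level *rq,
				       uint64_t fallback)
{
	for (; rq; rq = rq->parent)
		if (rq->nr_queued)
			return rq->min_slice;
	return fallback;             /* every level was empty */
}

/* Usage: the leaf is empty, so the slice comes from its parent. */
void toy_demo_fix(void)
{
	struct toy_rq_level parent = { .parent = NULL, .nr_queued = 1,
				       .min_slice = 3000000 };
	struct toy_rq_level leaf   = { .parent = &parent, .nr_queued = 0,
				       .min_slice = 0 };

	assert(toy_slice_from_first_nonempty(&leaf, 0) == 3000000);
}
```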