This update for slurm2411 fixes the following issues:
Update to version 24.11.5.
Security issues fixed:
Other changes and issues fixed:
Changes from version 24.11.5
- Return error to scontrol reboot on bad nodelists.
- slurmrestd - Report an error when QOS resolution fails for v0.0.40 endpoints.
- slurmrestd - Report an error when QOS resolution fails for v0.0.41 endpoints.
- slurmrestd - Report an error when QOS resolution fails for v0.0.42 endpoints.
- data_parser/v0.0.42 - Added the +inline_enums flag, which modifies the output when generating the OpenAPI specification. It causes enum arrays to not be defined in their own schema with references ($ref) to them; instead they are dumped inline (see the usage sketch after this list).
- Fix handling of tres-bind map/mask on partial node allocations.
- Fix stepmgr enabled steps being able to request features.
- slurmd - Restrict listening for new incoming RPC requests further into startup.
- slurmd - Avoid auth/slurm related hangs of CLI commands during startup and shutdown.
- slurmctld - Restrict processing new incoming RPC requests further into startup. Stop processing requests sooner during shutdown.
- slurmctld - Avoid auth/slurm related hangs of CLI commands during startup and shutdown.
- slurmctld - Avoid race condition during shutdown or reconfigure that could result in a crash due to delayed processing of a connection while plugins are unloaded.
- Fix handling of % escape characters when printing stdio fields for jobs.
- Fix the %A array job id when expanding patterns (see the filename pattern example after this list).
- Fix an issue with jobs reporting Bad Constraints.
- switch/hpe_slingshot - Prevent potential segfault on failed curl request to the fabric manager.
- %A will now be substituted by the correct value.
- %A will now be substituted by the correct value.
- switch/hpe_slingshot - Fix VNI range not updating on slurmctld restart or reconfigure.
- Fix handling of steps requesting -c and -n values inferior to the job's requested resources, when using stepmgr and nodes are configured with CPUs == Sockets*CoresPerSocket.
- Fix an issue with SwitchParameter.
- Do not reset memory.high and memory.swap.max in slurmd startup or reconfigure, as we are never really touching this in slurmd.
- Fix handling of the case where CoreSpecLimits have been removed from slurm.conf.
- switch/hpe-slingshot - Make sure the slurmctld can free step VNIs after the controller restarts or reconfigures while the job is running.
- Fix slurmctld failure on 2nd takeover.
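A minimal sketch of how the new +inline_enums flag could be exercised, assuming the usual data_parser plugin flag syntax (the flag appended to the parser version with '+'), a rest_auth/local setup, and an arbitrary socket path; adjust for your site:

  # Start slurmrestd with the flagged data_parser (socket path and auth
  # plugin are assumptions for this sketch).
  slurmrestd -a rest_auth/local -d v0.0.42+inline_enums unix:/tmp/slurmrestd.sock &
  # Fetch the generated OpenAPI specification; enum arrays should now be
  # emitted inline instead of through $ref schemas.
  curl --unix-socket /tmp/slurmrestd.sock 'http://localhost/openapi/v3' > openapi.json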
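For the %A fixes above, the documented filename pattern behavior is that %A expands to the array's master job allocation number and %a to the array task id, for example:

  # Produces e.g. slurm-12345_1.out ... slurm-12345_4.out once %A is
  # substituted by the correct value.
  sbatch --array=1-4 --output=slurm-%A_%a.out --wrap="hostname"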
Changes from version 24.11.4
- slurmctld, slurmrestd - Avoid possible race condition that could have caused the process to crash when a listener socket was closed while accepting a new connection.
- slurmrestd - Avoid race condition that could have resulted in the address logged for a UNIX socket being incorrect.
- slurmrestd - Fix parameters in the OpenAPI specification so that the following endpoints have a job_id field:
  GET /slurm/v0.0.40/jobs/state/
  GET /slurm/v0.0.41/jobs/state/
  GET /slurm/v0.0.42/jobs/state/
  GET /slurm/v0.0.43/jobs/state/
- slurmd - Fix tracking of thread counts that could cause incoming connections to be ignored after a burst of simultaneous incoming connections that trigger the delayed response logic.
- Fix SRUN_TIMEOUT forwarding to stepmgr.
- Fix handling of mcs_label requirements.
- acct_gather_energy/{gpu,ipmi} - Fix potential energy consumption adjustment calculation underflow.
- acct_gather_energy/ipmi - Fix regression introduced in 24.05.5 (which introduced the new way of preserving energy measurements through slurmd restarts) when EnergyIPMICalcAdjustment=yes.
- Fix slurmctld deadlock in the assoc mgr.
- Fix an issue when RestrictedCoresPerGPU is enabled.
- slurmd - Avoid crash when slurmd has a communications failure with slurmstepd.
- Prevent slurmctld from showing an error message about PreemptMode=GANG being a cluster-wide option for scontrol update part calls that do not attempt to modify the partition PreemptMode (see the scontrol example after this list).
- Fix GANG preemption on a partition when updating PreemptMode with scontrol.
- Fix CoreSpec and MemSpec limits not being removed from a previously configured slurmd.
- Fix handling when slurmd, slurmstepd, slurmctld, slurmrestd or sackd have a fatal event.
- Fix jobs with --ntasks-per-node and --mem being kept pending forever when the requested memory divided by the number of CPUs surpasses the configured MaxMemPerCPU (see the worked example after this list).
- slurmd - Fix address logged upon new incoming RPC connection from INVALID to the IP address.
- Fix an issue affecting scontrol, sinfo, sview, and the following slurmrestd endpoints:
  GET /slurm/{any_data_parser}/reservation/{reservation_name}
  GET /slurm/{any_data_parser}/reservations
- Add a debugflags=conmgr gated log message for when new incoming connections are deferred because the number of active connections exceeds conmgr_max_connections.
- Preserve FUTURE node state on restart and reconfig instead of reverting to FUTURE state. This will be made the default in 25.05.
- Fix an issue that could cause slurmctld to crash.
- Fix jobs with --cpus-per-gpu and --mem being kept pending forever when the requested memory divided by the number of CPUs surpasses the configured MaxMemPerCPU.
- Ensure that jobs requesting --mem and several --*-per-* options do not violate the MaxMemPerCPU in place.
- slurmctld - Fix use-cases of jobs incorrectly pending held when --prefer features are not initially satisfied.
- slurmctld - Fix jobs incorrectly held when --prefer is not satisfied in some use-cases.
- Ensure that RestrictedCoresPerGPU and CoreSpecCount don't overlap.
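To illustrate the partition PreemptMode items above, a per-partition update of the kind affected looks like this (partition name and values are illustrative):

  # Update only the partition's PreemptMode; with the fix this takes effect
  # correctly, including GANG preemption.
  scontrol update PartitionName=debug PreemptMode=requeue
  # Updating an unrelated partition attribute no longer logs the spurious
  # message about PreemptMode=GANG being a cluster-wide option.
  scontrol update PartitionName=debug MaxTime=2:00:00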
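The MaxMemPerCPU items above come down to per-CPU memory arithmetic; a worked example with illustrative numbers:

  # slurm.conf (illustrative): MaxMemPerCPU=2048   (MB per CPU)
  # Request 4 tasks on one node with 16 GB of memory:
  sbatch --ntasks-per-node=4 --mem=16G --wrap="sleep 60"
  # 16384 MB / 4 CPUs = 4096 MB per CPU, which exceeds MaxMemPerCPU, so
  # Slurm is expected to raise the allocated CPU count (to 8 here) rather
  # than leave the job pending forever, which was the bug being fixed.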
Changes from version 24.11.3
- Fix issue where slurmd -G gave no output.
- Fix a crash of slurmctld after updating a reservation with an empty nodelist. The crash could occur after restarting slurmctld, or if downing/draining a node in the reservation with the REPLACE or REPLACE_DOWN flag.
- Fix the daemon process name being changed to 'watch' from the original daemon name. This could potentially break some monitoring scripts.
- Fix slurmctld being killed by SIGALRM due to a race condition at startup.
- Fix the 'Requested data_parser plugin does not support OpenAPI plugin' error being returned for valid endpoints.
- Fix a race between task/cgroup CPUset and jobacctgather/cgroup. The first was removing the pid from the task_X cgroup directory, causing memory limits to not be applied.
- Set the SLURM_JOB_PARTITION output environment variable to the partition in which the job is running for salloc and srun, in order to match the documentation and the behavior of sbatch (see the example after this list).
- srun - Fix a wrongly constructed SLURM_CPU_BIND environment variable that could get propagated to downward srun calls in certain MPI environments, causing launch failures.
- slurmrestd - Avoid connection to slurmdbd for the following endpoints:
  GET /slurm/v0.0.41/jobs
  GET /slurm/v0.0.41/job/{job_id}
- slurmrestd - Avoid connection to slurmdbd for the following endpoints:
  GET /slurm/v0.0.40/jobs
  GET /slurm/v0.0.40/job/{job_id}
- slurmrestd - Fix possible memory leak when parsing arrays with data_parser/v0.0.40.
- slurmrestd - Fix possible memory leak when parsing arrays with data_parser/v0.0.41.
- slurmrestd - Fix possible memory leak when parsing arrays with data_parser/v0.0.42.
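For the SLURM_JOB_PARTITION change above, a quick check that the variable now reflects the partition the job actually runs in (the partition name is illustrative):

  # Should print the partition the job was scheduled to, matching sbatch.
  srun --partition=normal printenv SLURM_JOB_PARTITION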
Changes from version 24.11.2
- Fix handling of --test-only jobs that can preempt.
- Fix handling of the DAILY, HOURLY, WEEKLY, WEEKDAY, and WEEKEND flags.
- slurmctld will ensure that healthy nodes are not reported as UnavailableNodes in job reason codes.
- Fix handling of reservations with OVERLAP,FLEX or OVERLAP,ANY_NODES when it overlaps nodes with a future maintenance reservation. When a job submission had a time limit that overlapped with the future maintenance reservation, it was rejected. Now the job is accepted but stays pending with the reason 'ReqNodeNotAvail, Reserved for maintenance'.
- pam_slurm_adopt - Avoid errors when explicitly setting some arguments to the default value.
- Fix an issue with PreemptMode=SUSPEND.
- slurmdbd - When changing a user's name, update the lineage at the same time.
- Fix an issue where burst_buffer.lua does not inherit the SLURM_CONF environment variable from slurmctld and fails to run if slurm.conf is in a non-standard location.
- Fix an issue when select/linear and the PreemptParameters=reclaim_licenses options are both set in slurm.conf. Regression in 24.11.1.
- switch/hpe_slingshot - Fix compatibility with newer cxi drivers, specifically when specifying disable_rdzv_get.
- Add the ABORT_ON_FATAL environment variable to capture a backtrace from any fatal() message (see the sketch after this list).
- sched/backfill - Fix node state PLANNED not being cleared from fully allocated nodes during a backfill cycle.
- select/cons_tres - Fix future planning of jobs with bf_licenses.
- Prevent the 'on_data returned rc: Rate limit exceeded, please retry momentarily' error message from being printed in slurmctld logs.
- Fix jobs showing QOS=(null) when not explicitly requesting a QOS.
- Fix an issue with job_resrcs.
- Fix sacctmgr delete/modify/show account operations with where clauses.
- Fix signal handling: in 24.11 the daemons caught the SIGTSTP, SIGTTIN and SIGUSR1 signals and ignored them, while before they were not ignoring them. This also caused slurmctld to not be able to shut down after a SIGTSTP, because slurmscriptd caught the signal and stopped while slurmctld ignored it. Unify and fix these situations and get back to the previous behavior for these signals.
- Document that SIGQUIT is no longer ignored by slurmctld, slurmdbd, and slurmd in 24.11. As of 24.11.0rc1, SIGQUIT is identical to SIGINT and SIGTERM for these daemons, but this change was not documented.
- Fix handling of the boot^ state on unexpected node reboot after return to service.
- Fix an issue with nextstate=resume.
- Fix an issue with nextstate=resume rebooting nodes.
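For the ABORT_ON_FATAL entry above, a minimal sketch of capturing a backtrace, assuming core dumps are enabled and the daemon is started in the foreground (paths are illustrative):

  # With ABORT_ON_FATAL set, a fatal() aborts the process, leaving a core
  # file that can be inspected.
  ulimit -c unlimited
  ABORT_ON_FATAL=1 slurmctld -D
  # After the abort, extract the backtrace from the core file:
  gdb -batch -ex bt /usr/sbin/slurmctld core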
{ "binaries": [ { "libnss_slurm2_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-config": "24.11.5-150300.7.8.1", "perl-slurm_24_11": "24.11.5-150300.7.8.1", "libpmi0_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-auth-none": "24.11.5-150300.7.8.1", "slurm_24_11-torque": "24.11.5-150300.7.8.1", "slurm_24_11-pam_slurm": "24.11.5-150300.7.8.1", "slurm_24_11-webdoc": "24.11.5-150300.7.8.1", "slurm_24_11-doc": "24.11.5-150300.7.8.1", "slurm_24_11-munge": "24.11.5-150300.7.8.1", "slurm_24_11-sql": "24.11.5-150300.7.8.1", "slurm_24_11-sview": "24.11.5-150300.7.8.1", "libslurm42": "24.11.5-150300.7.8.1", "slurm_24_11-plugins": "24.11.5-150300.7.8.1", "slurm_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-node": "24.11.5-150300.7.8.1", "slurm_24_11-slurmdbd": "24.11.5-150300.7.8.1", "slurm_24_11-config-man": "24.11.5-150300.7.8.1", "slurm_24_11-devel": "24.11.5-150300.7.8.1", "slurm_24_11-cray": "24.11.5-150300.7.8.1", "slurm_24_11-lua": "24.11.5-150300.7.8.1", "slurm_24_11-rest": "24.11.5-150300.7.8.1" } ] }
{ "binaries": [ { "libnss_slurm2_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-config": "24.11.5-150300.7.8.1", "perl-slurm_24_11": "24.11.5-150300.7.8.1", "libpmi0_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-auth-none": "24.11.5-150300.7.8.1", "slurm_24_11-torque": "24.11.5-150300.7.8.1", "slurm_24_11-pam_slurm": "24.11.5-150300.7.8.1", "slurm_24_11-webdoc": "24.11.5-150300.7.8.1", "slurm_24_11-doc": "24.11.5-150300.7.8.1", "slurm_24_11-munge": "24.11.5-150300.7.8.1", "slurm_24_11-sql": "24.11.5-150300.7.8.1", "slurm_24_11-sview": "24.11.5-150300.7.8.1", "libslurm42": "24.11.5-150300.7.8.1", "slurm_24_11-plugins": "24.11.5-150300.7.8.1", "slurm_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-node": "24.11.5-150300.7.8.1", "slurm_24_11-slurmdbd": "24.11.5-150300.7.8.1", "slurm_24_11-config-man": "24.11.5-150300.7.8.1", "slurm_24_11-devel": "24.11.5-150300.7.8.1", "slurm_24_11-cray": "24.11.5-150300.7.8.1", "slurm_24_11-lua": "24.11.5-150300.7.8.1", "slurm_24_11-rest": "24.11.5-150300.7.8.1" } ] }
{ "binaries": [ { "libnss_slurm2_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-config": "24.11.5-150300.7.8.1", "perl-slurm_24_11": "24.11.5-150300.7.8.1", "libpmi0_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-auth-none": "24.11.5-150300.7.8.1", "slurm_24_11-torque": "24.11.5-150300.7.8.1", "slurm_24_11-pam_slurm": "24.11.5-150300.7.8.1", "slurm_24_11-webdoc": "24.11.5-150300.7.8.1", "slurm_24_11-doc": "24.11.5-150300.7.8.1", "slurm_24_11-munge": "24.11.5-150300.7.8.1", "slurm_24_11-sql": "24.11.5-150300.7.8.1", "slurm_24_11-sview": "24.11.5-150300.7.8.1", "libslurm42": "24.11.5-150300.7.8.1", "slurm_24_11-plugins": "24.11.5-150300.7.8.1", "slurm_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-node": "24.11.5-150300.7.8.1", "slurm_24_11-slurmdbd": "24.11.5-150300.7.8.1", "slurm_24_11-config-man": "24.11.5-150300.7.8.1", "slurm_24_11-devel": "24.11.5-150300.7.8.1", "slurm_24_11-cray": "24.11.5-150300.7.8.1", "slurm_24_11-lua": "24.11.5-150300.7.8.1", "slurm_24_11-rest": "24.11.5-150300.7.8.1" } ] }
{ "binaries": [ { "libnss_slurm2_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-config": "24.11.5-150300.7.8.1", "perl-slurm_24_11": "24.11.5-150300.7.8.1", "libpmi0_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-auth-none": "24.11.5-150300.7.8.1", "slurm_24_11-torque": "24.11.5-150300.7.8.1", "slurm_24_11-pam_slurm": "24.11.5-150300.7.8.1", "slurm_24_11-webdoc": "24.11.5-150300.7.8.1", "slurm_24_11-doc": "24.11.5-150300.7.8.1", "slurm_24_11-munge": "24.11.5-150300.7.8.1", "slurm_24_11-sql": "24.11.5-150300.7.8.1", "slurm_24_11-sview": "24.11.5-150300.7.8.1", "libslurm42": "24.11.5-150300.7.8.1", "slurm_24_11-plugins": "24.11.5-150300.7.8.1", "slurm_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-node": "24.11.5-150300.7.8.1", "slurm_24_11-slurmdbd": "24.11.5-150300.7.8.1", "slurm_24_11-config-man": "24.11.5-150300.7.8.1", "slurm_24_11-devel": "24.11.5-150300.7.8.1", "slurm_24_11-cray": "24.11.5-150300.7.8.1", "slurm_24_11-lua": "24.11.5-150300.7.8.1", "slurm_24_11-rest": "24.11.5-150300.7.8.1" } ] }
{ "binaries": [ { "libnss_slurm2_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-config": "24.11.5-150300.7.8.1", "perl-slurm_24_11": "24.11.5-150300.7.8.1", "libpmi0_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-auth-none": "24.11.5-150300.7.8.1", "slurm_24_11-torque": "24.11.5-150300.7.8.1", "slurm_24_11-pam_slurm": "24.11.5-150300.7.8.1", "slurm_24_11-webdoc": "24.11.5-150300.7.8.1", "slurm_24_11-doc": "24.11.5-150300.7.8.1", "slurm_24_11-munge": "24.11.5-150300.7.8.1", "slurm_24_11-sql": "24.11.5-150300.7.8.1", "slurm_24_11-sview": "24.11.5-150300.7.8.1", "libslurm42": "24.11.5-150300.7.8.1", "slurm_24_11-plugins": "24.11.5-150300.7.8.1", "slurm_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-node": "24.11.5-150300.7.8.1", "slurm_24_11-slurmdbd": "24.11.5-150300.7.8.1", "slurm_24_11-config-man": "24.11.5-150300.7.8.1", "slurm_24_11-devel": "24.11.5-150300.7.8.1", "slurm_24_11-cray": "24.11.5-150300.7.8.1", "slurm_24_11-lua": "24.11.5-150300.7.8.1", "slurm_24_11-rest": "24.11.5-150300.7.8.1" } ] }
{ "binaries": [ { "libnss_slurm2_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-config": "24.11.5-150300.7.8.1", "perl-slurm_24_11": "24.11.5-150300.7.8.1", "libpmi0_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-auth-none": "24.11.5-150300.7.8.1", "slurm_24_11-torque": "24.11.5-150300.7.8.1", "slurm_24_11-pam_slurm": "24.11.5-150300.7.8.1", "slurm_24_11-webdoc": "24.11.5-150300.7.8.1", "slurm_24_11-doc": "24.11.5-150300.7.8.1", "slurm_24_11-munge": "24.11.5-150300.7.8.1", "slurm_24_11-sql": "24.11.5-150300.7.8.1", "slurm_24_11-sview": "24.11.5-150300.7.8.1", "libslurm42": "24.11.5-150300.7.8.1", "slurm_24_11-plugins": "24.11.5-150300.7.8.1", "slurm_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-node": "24.11.5-150300.7.8.1", "slurm_24_11-slurmdbd": "24.11.5-150300.7.8.1", "slurm_24_11-config-man": "24.11.5-150300.7.8.1", "slurm_24_11-devel": "24.11.5-150300.7.8.1", "slurm_24_11-cray": "24.11.5-150300.7.8.1", "slurm_24_11-lua": "24.11.5-150300.7.8.1", "slurm_24_11-rest": "24.11.5-150300.7.8.1" } ] }
{ "binaries": [ { "libnss_slurm2_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-config": "24.11.5-150300.7.8.1", "perl-slurm_24_11": "24.11.5-150300.7.8.1", "libpmi0_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-openlava": "24.11.5-150300.7.8.1", "slurm_24_11-testsuite": "24.11.5-150300.7.8.1", "slurm_24_11-auth-none": "24.11.5-150300.7.8.1", "slurm_24_11-torque": "24.11.5-150300.7.8.1", "slurm_24_11-hdf5": "24.11.5-150300.7.8.1", "slurm_24_11-pam_slurm": "24.11.5-150300.7.8.1", "slurm_24_11-sjstat": "24.11.5-150300.7.8.1", "slurm_24_11-webdoc": "24.11.5-150300.7.8.1", "slurm_24_11-doc": "24.11.5-150300.7.8.1", "slurm_24_11-munge": "24.11.5-150300.7.8.1", "slurm_24_11-sql": "24.11.5-150300.7.8.1", "slurm_24_11-sview": "24.11.5-150300.7.8.1", "libslurm42": "24.11.5-150300.7.8.1", "slurm_24_11-plugins": "24.11.5-150300.7.8.1", "slurm_24_11": "24.11.5-150300.7.8.1", "slurm_24_11-node": "24.11.5-150300.7.8.1", "slurm_24_11-slurmdbd": "24.11.5-150300.7.8.1", "slurm_24_11-seff": "24.11.5-150300.7.8.1", "slurm_24_11-config-man": "24.11.5-150300.7.8.1", "slurm_24_11-devel": "24.11.5-150300.7.8.1", "slurm_24_11-cray": "24.11.5-150300.7.8.1", "slurm_24_11-lua": "24.11.5-150300.7.8.1", "slurm_24_11-rest": "24.11.5-150300.7.8.1" } ] }