SUSE-RU-2023:4335-1

Source
https://www.suse.com/support/update/announcement/2023/suse-ru-20234335-1/
Import Source
https://ftp.suse.com/pub/projects/security/osv/SUSE-RU-2023:4335-1.json
JSON Data
https://api.osv.dev/v1/vulns/SUSE-RU-2023:4335-1
Related
Published
2023-11-02T01:00:43Z
Modified
2023-11-02T01:00:43Z
Summary
Recommended update for slurm_23_02
Details

This update for slurm_23_02 fixes the following issues:

  • Updated to version 23.02.5 with the following changes:

    • Bug Fixes:

      • Revert a change in 23.02 where SLURM_NTASKS was no longer set in the job's environment when --ntasks-per-node was requested. The method by which it is set, however, is different and should be more accurate in more situations.
      • Change pmi2 plugin to honor the SrunPortRange option. This matches the new behavior of the pmix plugin in 23.02.0. Note that neither of these plugins makes use of the MpiParams=ports= option, and previously both were limited only by the system's ephemeral port range.
      • Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if a node features plugin is configured.
      • Fix and prevent reoccurring reservations from overlapping.
      • job_container/tmpfs - Avoid attempts to share BasePath between nodes.
      • With CR_Cpu_Memory, fix node selection for jobs that request gres and --mem-per-cpu.
      • Fix a regression from 22.05.7 in which some jobs were allocated too few nodes, thus overcommitting cpus to some tasks.
      • Fix a job being stuck in the completing state if the job ends while the primary controller is down or unresponsive and the backup controller has not yet taken over.
      • Fix slurmctld segfault when a node registers with a configured CpuSpecList while slurmctld configuration has the node without CpuSpecList.
      • Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not registering by ResumeTimeout.
      • slurmstepd - Avoid cleanup of config.json-less containers spooldir getting skipped.
      • Fix scontrol segfault when 'completing' command requested repeatedly in interactive mode.
      • Properly handle a race condition between bind() and listen() calls in the network stack when running with SrunPortRange set.
      • Federation - Fix revoked jobs being returned regardless of the -a/--all option for privileged users.
      • Federation - Fix canceling pending federated jobs from non-origin clusters which could leave federated jobs orphaned from the origin cluster.
      • Fix sinfo segfault when printing multiple clusters with --noheader option.
      • Federation - fix clusters not syncing if clusters are added to a federation before they have registered with the dbd.
      • node_features/helpers - Fix node selection for jobs requesting changeable features with the | operator, which could prevent jobs from running on some valid nodes.
      • node_features/helpers - Fix inconsistent handling of & and |, where an AND'd feature was sometimes AND'd to all sets of features instead of just the current set. E.g. foo|bar&baz was interpreted as {foo,baz} or {bar,baz} instead of how it is documented: {foo} or {bar,baz}.
      • Fix job accounting so that when a job is requeued its allocated node count is cleared. After the requeue, sacct will correctly show that the job has 0 AllocNodes while it is pending or if it is canceled before restarting.
      • sacct - AllocCPUS now correctly shows 0 if a job has not yet received an allocation or if the job was canceled before getting one.
      • Fix Intel oneAPI autodetect: detect the /dev/dri/renderD[0-9]+ GPUs, and do not detect /dev/dri/card[0-9]+.
      • Fix node selection for jobs that request --gpus and a number of tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs.
      • Remove MYSQL_OPT_RECONNECT completely.
      • Fix cloud nodes in POWERING_UP state disappearing (getting set to FUTURE) when an scontrol reconfigure happens.
      • openapi/dbv0.0.39 - Avoid assert / segfault on missing coordinators list.
      • slurmrestd - Correct memory leak while parsing OpenAPI specification templates with server overrides.
      • Fix overwriting user node reason with system message.
      • Prevent deadlock when rpc_queue is enabled.
      • slurmrestd - Correct OpenAPI specification generation bug where fields with overlapping parent paths would not get generated.
      • Fix memory leak as a result of a partition info query.
      • Fix memory leak as a result of a job info query.
      • For step allocations, fix --gres=none sometimes not ignoring gres from the job.
      • Fix --exclusive jobs incorrectly gang-scheduling where they shouldn't.
      • Fix allocations with CR_SOCKET, gres not assigned to a specific socket, and block core distribution potentially allocating more sockets than required.
      • Revert a change in 23.02.3 where Slurm would kill a script's process group as soon as the script ended instead of waiting as long as any process in that process group held the stdout/stderr file descriptors open. That change broke some scripts that relied on the previous behavior. Setting time limits for scripts (such as PrologEpilogTimeout) is strongly encouraged to avoid Slurm waiting indefinitely for scripts to finish.
      • Fix slurmdbd -R not returning an error under certain conditions.
      • slurmdbd - Avoid potential NULL pointer dereference in the mysql plugin.
      • Fix regression in 23.02.3 which broke X11 forwarding for hosts when MUNGE sends a localhost address in the encode host field. This happens when the node hostname is mapped to 127.0.0.1 (or similar) in /etc/hosts.
      • openapi/[db]v0.0.39 - fix memory leak on parsing error.
      • data_parser/v0.0.39 - fix updating qos for associations.
      • openapi/dbv0.0.39 - fix updating values for associations with null users.
      • Fix minor memory leak with --tres-per-task and licenses.
      • Fix cyclic socket cpu distribution for tasks in a step where --cpus-per-task < usable threads per core.
      • slurmrestd - For GET /slurm/v0.0.39/node[s], change format of node's energy field current_watts to a dictionary to account for unset value instead of dumping 4294967294.
      • slurmrestd - For GET /slurm/v0.0.39/qos, change format of QOS's field 'priority' to a dictionary to account for unset value instead of dumping 4294967294.
      • slurmrestd - For GET /slurm/v0.0.39/job[s], the 'return code' field in v0.0.39_job_exit_code will be set to -127 instead of being left unset when the job does not have a relevant return code.
    • Other Changes:

      • Remove --uid / --gid options from salloc and srun commands. These options did not work correctly since the CVE-2022-29500 fix in combination with some changes made in 23.02.0.
      • Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu or pn_min_cpus are being automatically adjusted.
      • Change the log message warning for rate limited users from verbose to info.
      • slurmstepd - Cleanup per task generated environment for containers in spooldir.
      • Format batch, extern, interactive, and pending step ids into strings that are human readable.
      • slurmrestd - Reduce memory usage when printing out job CPU frequency.
      • data_parser/v0.0.39 - Add required/memory_per_cpu and required/memory_per_node to sacct --json and sacct --yaml and GET /slurmdb/v0.0.39/jobs from slurmrestd.
      • gpu/oneapi - Store cores correctly so CPU affinity is tracked.
      • Allow slurmdbd -R to work if the root assoc id is not 1.
      • Limit periodic node registrations to 50 instead of the full TreeWidth. Since unresolvable cloud/dynamic nodes must disable fanout by setting TreeWidth to a large number, this would cause all nodes to register at once.
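
The node_features/helpers item in the bug fix list gives foo|bar&baz as an example and says it is documented to mean {foo} or {bar,baz}. The following is a minimal Python sketch of that documented grouping, in which & binds more tightly than |. It is not Slurm's parser: the function name feature_sets is hypothetical, and the sketch ignores parentheses, counts, and any other syntax a real feature expression may carry.

```python
def feature_sets(expr: str) -> list[list[str]]:
    """Return the alternative feature sets described by a simple expression.

    Splitting on '|' first and on '&' second makes '&' bind tighter than '|',
    so an AND'd feature only joins the alternative it appears in.
    """
    alternatives = []
    for alternative in expr.split("|"):
        # Each alternative is a set of AND'd feature names; sort for stable output.
        names = sorted({name.strip() for name in alternative.split("&") if name.strip()})
        alternatives.append(names)
    return alternatives


print(feature_sets("foo|bar&baz"))  # [['foo'], ['bar', 'baz']]
```

The behavior the changelog describes as incorrect would correspond to appending baz to every alternative, giving [['baz', 'foo'], ['bar', 'baz']] for the same input.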
References

Affected packages

SUSE:Linux Enterprise High Performance Computing 15 SP1-LTSS / slurm_23_02

Package

Name
slurm_23_02
Purl
pkg:rpm/suse/slurm_23_02&distro=SUSE%20Linux%20Enterprise%20High%20Performance%20Computing%2015%20SP1-LTSS

Affected ranges

Type
ECOSYSTEM
Events
Introduced
0 (Unknown introduced version / All previous versions are affected)
Fixed
23.02.5-150100.3.11.2

Ecosystem specific

{
    "binaries": [
        {
            "libslurm39": "23.02.5-150100.3.11.2",
            "slurm_23_02-torque": "23.02.5-150100.3.11.2",
            "slurm_23_02-plugin-ext-sensors-rrd": "23.02.5-150100.3.11.2",
            "perl-slurm_23_02": "23.02.5-150100.3.11.2",
            "slurm_23_02-auth-none": "23.02.5-150100.3.11.2",
            "libnss_slurm2_23_02": "23.02.5-150100.3.11.2",
            "slurm_23_02-lua": "23.02.5-150100.3.11.2",
            "slurm_23_02": "23.02.5-150100.3.11.2",
            "slurm_23_02-munge": "23.02.5-150100.3.11.2",
            "slurm_23_02-rest": "23.02.5-150100.3.11.2",
            "slurm_23_02-pam_slurm": "23.02.5-150100.3.11.2",
            "slurm_23_02-slurmdbd": "23.02.5-150100.3.11.2",
            "slurm_23_02-cray": "23.02.5-150100.3.11.2",
            "slurm_23_02-devel": "23.02.5-150100.3.11.2",
            "slurm_23_02-sview": "23.02.5-150100.3.11.2",
            "slurm_23_02-plugins": "23.02.5-150100.3.11.2",
            "slurm_23_02-config": "23.02.5-150100.3.11.2",
            "slurm_23_02-webdoc": "23.02.5-150100.3.11.2",
            "slurm_23_02-config-man": "23.02.5-150100.3.11.2",
            "slurm_23_02-doc": "23.02.5-150100.3.11.2",
            "libpmi0_23_02": "23.02.5-150100.3.11.2",
            "slurm_23_02-sql": "23.02.5-150100.3.11.2",
            "slurm_23_02-node": "23.02.5-150100.3.11.2"
        }
    ]
}
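
The object above stores the fixed builds as a "binaries" list holding a single name-to-version map. As a rough sketch of how that shape could be read, the Python below flattens the list into one dictionary and splits each value at its last hyphen into an upstream version and a package release. The embedded JSON is an abridged copy (3 of the 23 entries), the helper names are hypothetical, and the split assumes the usual RPM version-release layout; it does not implement RPM version comparison.

```python
import json

# Abridged copy of the "Ecosystem specific" object (3 of the 23 entries).
ECOSYSTEM_SPECIFIC = """
{
    "binaries": [
        {
            "libslurm39": "23.02.5-150100.3.11.2",
            "slurm_23_02": "23.02.5-150100.3.11.2",
            "slurm_23_02-slurmdbd": "23.02.5-150100.3.11.2"
        }
    ]
}
"""


def fixed_versions(raw: str) -> dict[str, str]:
    """Merge every object in the "binaries" list into one name -> version map."""
    merged: dict[str, str] = {}
    for group in json.loads(raw)["binaries"]:
        merged.update(group)
    return merged


def split_version_release(evr: str) -> tuple[str, str]:
    """Split a 'version-release' string at its last hyphen (assumed RPM layout)."""
    version, _, release = evr.rpartition("-")
    return version, release


packages = fixed_versions(ECOSYSTEM_SPECIFIC)
for name in sorted(packages):
    version, release = split_version_release(packages[name])
    print(f"{name}: upstream {version}, release {release}")
```

For every entry in this advisory the upstream part is 23.02.5, matching the Slurm release named in the details, and the full string matches the "Fixed" event of the affected range.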