The SLURM scheduler offers several mechanisms for managing jobs. In our current configuration at CiS we encapsulate processes and tasks using cgroups, which protects the nodes and other jobs from runaway applications. Although jobs are encapsulated within cgroups, their statistics are gathered by the jobacct_gather/linux plugin. SLURM does offer the jobacct_gather/cgroup plugin, which would be well suited to a configuration that uses cgroup encapsulation, but this method of accounting has never fully worked for us since we implemented SLURM. The only working option for gathering statistics has therefore been the jobacct_gather/linux plugin.
The impact of tracking jobs within a memory cgroup is that the cache has a large influence on whether a job succeeds. This is because cache + rss + swap make up the overall memory usage of a cgroup, and should this figure exceed the limit applied to the SLURM batch job, the Out Of Memory killer will terminate the job. Unfortunately, the jobacct_gather/linux plugin collects its data from the Linux /proc virtual filesystem, which isn’t aware of a process’s cache. As a result there is a large hole in the memory statistics gathered by SLURM.
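As a rough illustration of the arithmetic above, the following Python sketch sums the cache, rss and swap fields from text in the style of a cgroup v1 memory.stat file and compares the total with a limit. The function names and the sample figures are hypothetical and chosen only for this example; this is not how SLURM or the kernel performs the check.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines in the style of a cgroup v1 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" pairs; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def cgroup_memory_usage(stats):
    """Sum cache + rss + swap; a missing field counts as zero."""
    return sum(stats.get(key, 0) for key in ("cache", "rss", "swap"))


def exceeds_limit(stats, limit_bytes):
    """Return True when the combined usage is above the given limit."""
    return cgroup_memory_usage(stats) > limit_bytes


# Hypothetical sample values, not measurements from a real job:
# 3 GiB of cache and 1 GiB of rss.
sample = """cache 3221225472
rss 1073741824
swap 0
mapped_file 4096
"""
stats = parse_memory_stat(sample)
usage = cgroup_memory_usage(stats)
```

With these sample figures the rss alone sits below a 2 GiB limit, but the combined figure does not, which is the situation a /proc-based view of the job cannot reveal.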
Although the jobacct_gather/cgroup plugin doesn’t work, the available documentation indicated that it should be able to collect rss and cache statistics. It therefore seemed sensible to spend some effort understanding why the plugin wasn’t working. Upon review there were two reasons why the cgroup plugin doesn’t record this data. The first is that it assumes a cgroup hierarchy which isn’t the one SLURM creates. The cgroup plugin assumes a hierarchy of /sys/fs/cgroup/memory/uid_100/job_123/step_0/task_1, whereas the organisation applied to cgroups by the process and task plugins is slightly different, i.e. /sys/fs/cgroup/memory/uid_100/job_123/step_0. As a result the plugin expects the statistics to be located within a non-existent directory. The second reason is that the plugin doesn’t look for cache statistics at all. While the documentation states that a job’s cache usage is pulled from the memory.usage_in_bytes file within the cgroup, there is no reference to this file within the source code.
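The mismatch between the two hierarchies can be sketched in a few lines of Python. The helper names below are hypothetical and only reproduce the two example paths quoted above; they are not taken from the SLURM source.

```python
from pathlib import PurePosixPath

# Mount point of the memory controller, as used in the example paths above.
MEMORY_ROOT = PurePosixPath("/sys/fs/cgroup/memory")


def step_cgroup_dir(uid, job_id, step_id, root=MEMORY_ROOT):
    """Directory layout created by the process and task plugins."""
    return root / f"uid_{uid}" / f"job_{job_id}" / f"step_{step_id}"


def task_cgroup_dir(uid, job_id, step_id, task_id, root=MEMORY_ROOT):
    """Directory the accounting plugin assumed: one extra task_N level."""
    return step_cgroup_dir(uid, job_id, step_id, root) / f"task_{task_id}"


actual = step_cgroup_dir(100, 123, 0)
assumed = task_cgroup_dir(100, 123, 0, 1)
```

Here `assumed` is one level deeper than `actual`, so any statistics file looked up beneath it would not be found.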
To resolve this we have submitted two changes, a bug fix and an enhancement. The bug fix addresses the issue of looking in the wrong location by removing superfluous code that interrogated the non-existent task directory. The enhancement records the cgroup’s cache statistic and adds it to the current vsize metric. The idea is that including the cache statistic within the vsize gives a better understanding of the overall memory consumption of a step/job, that is its heap, stack, data, text, shared objects and cache.
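The effect on the reported figure can be illustrated with a small Python sketch. The function name and the numbers are hypothetical; the actual change lives in the plugin's C source and may differ in detail.

```python
def vsize_with_cache(vsize_bytes, cache_bytes):
    """Return the vsize figure with the cgroup's cache statistic added to it."""
    if vsize_bytes < 0 or cache_bytes < 0:
        raise ValueError("byte counts must be non-negative")
    return vsize_bytes + cache_bytes


# Hypothetical figures: 2 GiB of virtual size plus 512 MiB of cache.
reported = vsize_with_cache(2 * 1024**3, 512 * 1024**2)
```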
The contribution has been reviewed by SchedMD and is currently targeted for the 17.11 release. Details are available here: https://bugs.schedmd.com/show_bug.cgi?id=3531
Sam Gallop, Paul Fretter (CiS)
CiS is responsible for developing and managing the advanced research computing infrastructure, including High Performance Computing (HPC), storage, middleware and application platforms, at the Norwich Bioscience Institutes, enabling world-class research in biological science.