CIMMYT HPC visit to CiS

August 2017 – CiS hosted a knowledge exchange visit from Mr Jaime Campos Serna.  Mr Serna is developing a new HPC cluster service for his Institute (CIMMYT) and visited Norwich on a fact-finding mission.

Jaime Campos Serna (CIMMYT) with some of the CiS dept: Paul Fretter, Michael Burrell, Adam Carrgilson, Chris Bridson, Sam Gallop.

The three ‘R’s of dependability

The three R’s of dependability by Paul Fretter

We need to feel that we can depend on computer services or systems, to a greater or lesser degree, according to their importance, exposure and development status.  Almost by definition, critical systems need to be highly dependable, whereas faults or outages can be tolerated in non-essential and development systems.  In this short article, I attempt to put into words what I think are the core components of the intuition or ‘gut-feeling’ that the seasoned sysadmin or service manager will possess.  I propose that the dependability of a system can be described as a function of three properties: Reliability, Resilience and Robustness, specifically according to three simple questions. …

Download the complete article The three R’s of dependability as a PDF.

Earlham Institute and CiS win Bio-IT best practices award for IT Infrastructure/HPC

Bio-IT World – Best Practices Award

25th May 2017 – Boston, MA

Improving Global Food Security and Sustainability By Applying High-Performance Computing To Unlock The Complex Bread Wheat Genome

Using the SGI UV systems specified, procured and supported by CiS, and the integration of Edico Dragen, the Earlham Institute has won the Bio-IT Best Practices award for IT infrastructure/HPC for their novel approach to mapping the wheat genome.  The award was received by Tim Stitt (on behalf of the Earlham Institute) and Paul Fretter (on behalf of CiS) at the awards ceremony in Boston MA.

Pictured: Paul Fretter (CiS), Allison Proffitt (Bio-IT), Tim Stitt (Earlham Institute) and Chris Davidson (HPE/SGI)

BioIT world best practice award 2017

CiS first recommended and procured the SGI UV systems for Earlham (then TGAC) in 2010-11 and the team has continually developed the service for our customers (EI researchers) to use.   CiS staff who deserve a specific mention for their significant technical contributions are Chris Bridson and Sam Gallop.

CiS submits fix for SLURM memory allocation >4TB

April 2016

In early 2016, CiS specified, procured and installed two SGI UV300™ systems for one of its customers (the Earlham Institute, previously known as TGAC), and implemented the open source SLURM job scheduler instead of the closed-source commercial software that is normally supported on UV systems.  Each of these systems has 256 cores and 12TB RAM, and they were purchased for running multi-TB large memory jobs, e.g. for DNA assembly, as well as more traditional HPC workloads.

During the integration testing we encountered a scenario where jobs requesting more than 4TB of memory would receive unpredictable memory allocations, usually resulting in the job being automatically killed.  The implication of this discovery was that, at the time, no-one else in the SLURM user community was attempting to allocate more than 4TB of RAM on a single host to a SLURM job.

CiS traced the problem in the source code to an incorrectly sized variable in jobacct_gather_set_mem_limit (in slurm_jobacct_gather.c).  We patched the code, tested the fix and applied it to our own installation, before submitting the update to SchedMD for inclusion in future community releases of SLURM.
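To illustrate the kind of failure an undersized variable can cause, here is a minimal sketch in Python.  It is not the SLURM code: it assumes, purely for illustration, that a memory limit is held as an unsigned 32-bit count of kilobytes, and the function names are made up for the example.  Under that assumption the value wraps around at exactly 4TB, which would be consistent with the unpredictable limits we saw.

```python
# Illustration only, not the SLURM source.
# Assumption: the limit is stored as an unsigned 32-bit number of kilobytes.
UINT32_MASK = 0xFFFFFFFF
UINT64_MASK = 0xFFFFFFFFFFFFFFFF
KB_PER_TB = 1024 ** 3  # 1 TB = 1024 * 1024 * 1024 KB


def limit_kb_uint32(mem_tb: int) -> int:
    """Limit in KB as it would be stored in a 32-bit unsigned variable."""
    return (mem_tb * KB_PER_TB) & UINT32_MASK


def limit_kb_uint64(mem_tb: int) -> int:
    """Limit in KB as it would be stored in a 64-bit unsigned variable."""
    return (mem_tb * KB_PER_TB) & UINT64_MASK


if __name__ == "__main__":
    for tb in (3, 4, 5, 12):
        print(f"{tb:>2} TB requested: "
              f"32-bit holds {limit_kb_uint32(tb)} KB, "
              f"64-bit holds {limit_kb_uint64(tb)} KB")
```

With a 32-bit variable a 3TB request survives intact, a 4TB or 12TB request collapses to zero, and a 5TB request looks like 1TB.  Widening the variable removes the wrap-around for any realistic memory size.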

The patch/contribution was accepted by SchedMD in April 2016 and was included in releases 14.11.12, 15.08.11 and 16.05.0-pre3, and is now subsumed into v17.  Details are available here:

Sam Gallop, Paul Fretter (CiS)

CiS is responsible for developing and managing the advanced research computing infrastructure, including High Performance Computing (HPC), storage, middleware and application platforms at the Norwich Bioscience Institutes, enabling World-class research in biological science.

CiS submit enhancements for SLURM memory accounting

The SLURM scheduler has a number of mechanisms it can use to manage jobs.  In our current configuration at CiS we encapsulate processes and tasks using cgroups.  This protects the nodes and other jobs from run-away applications.  Though jobs are encapsulated within cgroups, the statistics are monitored via the jobacct_gather/linux plugin.  While SLURM does offer the jobacct_gather/cgroup plugin, which would be well suited to configurations using cgroup encapsulation, this method of accounting has never fully worked since we implemented SLURM.  The sole option for gathering statistics has therefore been the jobacct_gather/linux plugin.

The impact of tracking jobs within a memory cgroup is that the cache has a large influence on whether a job is successful or not.  This is because cache + rss + swap comprise the overall memory usage of a cgroup, and should this figure exceed the limit imposed by the SLURM batch job then the Out Of Memory killer will terminate the job.  Unfortunately the jobacct_gather/linux plugin collects its data from the Linux /proc virtual filesystem, which isn’t aware of a process’s cache.  Because of this there is a large hole in the memory statistics gathered by SLURM.
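The gap can be shown with a small sketch in Python.  It parses text in the format of a cgroup (v1) memory.stat file and compares the cache + rss + swap total against a limit.  The sample figures and the 8GB limit are invented for the example, and the function names are ours rather than anything in SLURM.

```python
# Sketch only: sample values and the limit are invented for illustration.
GIB = 1024 ** 3

SAMPLE_MEMORY_STAT = """\
cache 7516192768
rss 2147483648
swap 0
"""


def parse_memory_stat(text: str) -> dict:
    """Turn 'name value' lines (memory.stat format) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def cgroup_usage_bytes(stats: dict) -> int:
    """Overall usage as described above: cache + rss + swap."""
    return stats.get("cache", 0) + stats.get("rss", 0) + stats.get("swap", 0)


if __name__ == "__main__":
    limit = 8 * GIB  # hypothetical limit requested by the batch job
    stats = parse_memory_stat(SAMPLE_MEMORY_STAT)
    usage = cgroup_usage_bytes(stats)
    print(f"rss only     : {stats['rss'] / GIB:.1f} GiB")
    print(f"cgroup usage : {usage / GIB:.1f} GiB")
    print(f"over limit   : {usage > limit}")
```

In this made-up case an rss-only view reports 2GiB, comfortably inside the limit, while the cgroup is actually charged 9GiB and the job would be a candidate for the Out Of Memory killer.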

Although the jobacct_gather/cgroup plugin doesn’t work, the available documentation indicated that it was possible for it to collect rss and cache statistics.  It therefore seemed sensible to spend some effort to understand why the jobacct_gather/cgroup plugin wasn’t working.  Upon review there were a couple of reasons why the cgroup plugin doesn’t record this data.  The first is that it assumes a cgroup hierarchy which isn’t employed by SLURM.  The cgroup plugin assumes a hierarchy of /sys/fs/cgroup/memory/uid_100/job_123/step_0/task_1.  The organisation applied to cgroups by the process and task plugins is slightly different, i.e. /sys/fs/cgroup/memory/uid_100/job_123/step_0.  This results in the plugin expecting the statistics to be located within a non-existent directory.  The second reason is that the plugin doesn’t look for cache statistics.  While the documentation states that a job’s cache usage is pulled from the memory.usage_in_bytes file within the cgroup, there is no reference to this file within the source code.
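The path mismatch is easy to see when the two layouts are built side by side.  The following Python sketch simply reconstructs the two example paths quoted above; the helper names are hypothetical and nothing here is taken from the plugin source.

```python
from pathlib import PurePosixPath

# Root of the memory controller, as in the example paths above.
CGROUP_MEM_ROOT = PurePosixPath("/sys/fs/cgroup/memory")


def step_dir(uid: int, job: int, step: int) -> PurePosixPath:
    """Directory layout created by the process and task plugins."""
    return CGROUP_MEM_ROOT / f"uid_{uid}" / f"job_{job}" / f"step_{step}"


def task_dir(uid: int, job: int, step: int, task: int) -> PurePosixPath:
    """Directory layout the accounting plugin expects to find."""
    return step_dir(uid, job, step) / f"task_{task}"


if __name__ == "__main__":
    created = step_dir(100, 123, 0)
    expected = task_dir(100, 123, 0, 1)
    print("created :", created)
    print("expected:", expected)
    # The expected directory sits one level below anything that exists.
    print("expected is a child of created:", expected.parent == created)
```

Because the expected directory is one level deeper than anything that is actually created, any attempt to read statistics from it fails, and the accounting record stays empty.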

To resolve this we have submitted two patches: a bug fix and an enhancement.  The bug fix addresses the issue of looking in the wrong location by removing superfluous code that interrogated the non-existent task directory.  The enhancement records the cgroup’s cache statistic and adds it to the current vsize metric.  The idea is that including the cache statistic within the vsize will provide greater understanding of the overall memory consumption of a step/job, that is, its heap, stack, data, text, shared objects and cache.

The patch/contribution has been reviewed by SchedMD and is currently targeted for the 17.11 release.  Details are available here:

Sam Gallop, Paul Fretter (CiS)

New UV300 big-memory systems for TGAC

We (CiS) have recently completed the procurement of 2 new UV300 systems for The Genome Analysis Centre (TGAC).  Each system comprises 256 CPU cores (16x E7-8867V3 16core Haswell), 12TB RAM, 16x 2TB Intel NVMe FLASH and a fibrechannel-connected 100TB InfiniteStorage IS5100 disk array.

That’s a combined capacity of 512 cores, 24TB RAM, 64TB of NVMe FLASH and 200TB scratch disk.  Right now, this is the largest system of its kind in the World, and we think it is a winning configuration.

These are now the 4th and 5th UV systems that we’ve recommended and procured for TGAC, since the first one in late 2010, and the need/desire for this type of system continues unabated.  We’re excited to have these in our data centre, and we are looking forward to seeing the impact they can make on the various genomic workflows within the organisation.

Although our previous UVs were bought primarily for their large memory footprint, they also run mixed workloads that take advantage of the high core count, and are all configured at 8GB RAM per core.  The new UV300 is different, being memory-heavy and core-light: with only 256 cores to service 12TB of RAM, it works out at a whopping 48GB per core!  This is expected to be a much more performant ratio for large memory jobs, and especially DNA assembly (de Bruijn graph).
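For anyone who wants to check the sums, here is a short Python sketch that reproduces the memory-per-core ratio and the combined capacity from the per-system figures quoted above.  It assumes 1TB = 1024GB, and the dictionary keys are just labels chosen for the example.

```python
GB_PER_TB = 1024  # assumption: binary units, 1 TB = 1024 GB

# Per-system figures for one UV300, as quoted in the text.
UV300 = {
    "cores": 256,
    "ram_tb": 12,
    "nvme_cards": 16,
    "nvme_tb_per_card": 2,
    "scratch_disk_tb": 100,
}
SYSTEMS = 2


def gb_per_core(ram_tb: float, cores: int) -> float:
    """RAM available to each core, in GB."""
    return ram_tb * GB_PER_TB / cores


def combined_capacity(system: dict, count: int) -> dict:
    """Totals across 'count' identical systems."""
    return {
        "cores": system["cores"] * count,
        "ram_tb": system["ram_tb"] * count,
        "nvme_tb": system["nvme_cards"] * system["nvme_tb_per_card"] * count,
        "scratch_disk_tb": system["scratch_disk_tb"] * count,
    }


if __name__ == "__main__":
    print("GB per core:", gb_per_core(UV300["ram_tb"], UV300["cores"]))
    print("Combined   :", combined_capacity(UV300, SYSTEMS))
```

That is six times the 8GB per core of our earlier UV systems.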

The 16x 2TB NVMe FLASH cards in each system are bonded together as a single RAID0 array, with an XVM/XFS layer that provides a single filesystem with parallel IO at 1.6Gbyte/sec+ per card, allowing equality of access from all parts of the system.  Right now, this is the largest implementation of its kind, but as soon as others catch on to the idea I am sure it will be overtaken.  We’ve bought these cards for a number of reasons, but will publicise more about that in due course.

Please see the TGAC press release here:

Why choose UV?

Back in 2010, we knew that TGAC wanted to perform de novo assembly of bread wheat DNA, which is notoriously large and complex (5.5 times bigger than human DNA), but thus far it had not been achieved anywhere on standard hardware.  We knew it would need a lot of memory in a single space, but no-one knew exactly how much, since earlier attempts at other institutions had crashed on systems with as much as 400GB.  We guesstimated that we might need a bare minimum of 2TB RAM, and could easily swallow up to 6TB as the workflow was developed, so we had to find a machine that was capable of this.  Bearing in mind that our customer (TGAC) would be using existing DNA assembly tools, and that we would be stretching their use significantly beyond their originally intended application, we needed a platform that would appear like any other simple server, presenting a single system image and running standard Linux.  After a procurement competition, we reviewed the offerings and it was apparent that the only technology proven to work at this scale was the UV platform, and it had the backing of SGI’s long-standing reputation as a partner in HPC.  So, on TGAC’s behalf, we bought our first UV100 with 768 cores and 6TB RAM back in 2010.  Since then we’ve also procured a second UV100, followed by a UV2000 (2560c/20TB), and now we are adding the two UV300s (2x 256c/12TB) to the estate.

Our approach to implementing UV for TGAC has been vindicated repeatedly, with the sterling efforts of researchers, bioinformaticians and talented programmers in TGAC producing the first assembled drafts of the (bread) wheat genome sequence, as a key part of the International Wheat Genome Sequencing Consortium.  Without SGI UV large shared memory this would not have been possible, and we are proud to have been able to make a small contribution to their success.

Paul Fretter

Thinking of Exascale?

Optalysys / GENESYS prototype

Trying to achieve Exascale performance using general-purpose scalar CPUs would be a very tall order.  The World’s fastest machine is currently Tianhe-2, with a measured (Linpack) performance of about 33 petaflops (Pflop/s), and it already consumes a whopping 24MW of power!  An Exascale system will need to be at least 30 times faster, so what does that say about the power requirements?  (I’ll let you do the math).
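For anyone who would rather not reach for a calculator, here is a back-of-envelope version in Python.  It makes the deliberately crude assumption that power scales linearly with performance, which ignores any future gains in efficiency, and it uses only the two Tianhe-2 figures quoted above.

```python
PFLOPS_PER_EFLOPS = 1000  # 1 exaflop = 1000 petaflops

TIANHE2_PFLOPS = 33  # performance figure quoted above
TIANHE2_MW = 24      # power figure quoted above


def speedup_needed(target_pflops: float, current_pflops: float) -> float:
    """How many times faster the target system must be."""
    return target_pflops / current_pflops


def naive_power_mw(target_pflops: float, current_pflops: float,
                   current_mw: float) -> float:
    """Power draw if it scaled linearly with performance (an assumption)."""
    return current_mw * speedup_needed(target_pflops, current_pflops)


if __name__ == "__main__":
    target = 1 * PFLOPS_PER_EFLOPS
    print(f"Speed-up needed: {speedup_needed(target, TIANHE2_PFLOPS):.1f}x")
    print(f"Naive power    : "
          f"{naive_power_mw(target, TIANHE2_PFLOPS, TIANHE2_MW):.0f} MW")
```

On that naive basis a 1 exaflop machine would draw something over 700MW, which is power-station territory.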

What follows is a personal opinion …

Now, here’s a controversial question, although I am asking this with tongue firmly in cheek: some of the world’s largest (and most power-hungry) supercomputers are employed to model the climate and predict the weather, and so I wonder what the contribution of all that energy expenditure is to global warming?

Even if we could build a conventional (scalar/vector) system that big, the connectivity might be too unwieldy to actually deliver the performance, and the power requirements would be very difficult to deliver.

We need to be more responsible in our energy use, not only for the sake of the climate, but also because we will simply not be able to power these darn things unless we take the difficult road of finding revolutionary ways of doing our large scale maths.  The piecemeal scaling of existing technology, I would controversially suggest, is like low hanging fruit and will only yield small (in the grand scheme of things) improvements.

One answer, I would humbly suggest, is to highlight the applications and algorithms that actually need Exascale, and then develop targeted hardware to act as co-processors to a conventional (scalar/vector) system that will control workloads and handle the IO etc, much the same way as we do with almost any co-processor now (e.g. GP-GPU).

We need disruptive and innovative technologies – and by that I mean we need to think differently from the norm of more scalar/vector cores, better connectivity, closer memory etc.  All those things are important, of course, but they cannot yet promise an order of magnitude (or higher) increase in performance without significantly hiking up the energy bill.

Some technologies to watch:

FPGA – custom-designed ‘soft’ CPUs.  Design your own processor logic, dedicated to your algorithm, without all the overheads and baggage of a general purpose CPU.   These have been around for a long time, but the unfamiliar programming model, as well as the need to really understand what you are doing with processor design, has put most people off getting too involved.  In the biotech world companies like TimeLogic have been doing this for years, but recent advances are getting very interesting indeed – check out how TGAC are getting on  with Edico’s Dragen FPGA system.

Quantum computing – yes, this really is out-of-the-box, which is more than we can say about Schrödinger’s poor cat!  Adiabatic quantum computing could yield optimisations or mathematical shortcuts that will dramatically shorten certain classes of very difficult computational tasks.  Check out this modest little 100 million fold improvement (yep, you read it correctly!) recently observed by Google using the latest D-Wave 2X machine:  OK, this is exotic hardware needing specialist care, but its modest power consumption of only 25kW means it could be deployed very widely.

Optical processing – no, I don’t mean optical switching, although that may be of help.  I really do mean optical processing – maths being done by the manipulation of light.  In this case, it will address applications such as numerical simulations and also pattern matching.  Check out our work with Optalysys here and here on YouTube – we might reach 17 exaflops in the 2020-2022 timeframe, but with a promise of under 5kW of power needed and filling only one equipment rack.  At this scale, a 1 exaflop system will probably fit on your desk and be run from a standard wall outlet!
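To put a rough number on the wall-outlet claim, the Python sketch below scales the projected 17 exaflops at 5kW down to a single exaflop.  It assumes power scales linearly with performance, treats 5kW as an upper bound, and uses a UK 13A socket at a nominal 230V as the "standard wall outlet"; that last figure is our assumption for the example, not something from Optalysys.

```python
# Projected figures quoted above.
PROJECTED_EFLOPS = 17
PROJECTED_POWER_W = 5000  # "under 5kW", treated here as an upper bound

# Assumption for illustration: a UK 13 A socket at a nominal 230 V.
OUTLET_AMPS = 13
OUTLET_VOLTS = 230


def watts_per_exaflop(total_watts: float, total_eflops: float) -> float:
    """Power per exaflop, assuming linear scaling (an assumption)."""
    return total_watts / total_eflops


def outlet_capacity_w(amps: float, volts: float) -> float:
    """Maximum continuous power from a single outlet."""
    return amps * volts


if __name__ == "__main__":
    per_eflop = watts_per_exaflop(PROJECTED_POWER_W, PROJECTED_EFLOPS)
    outlet = outlet_capacity_w(OUTLET_AMPS, OUTLET_VOLTS)
    print(f"Power for 1 exaflop: under {per_eflop:.0f} W")
    print(f"Outlet capacity    : {outlet:.0f} W")
    print(f"Fits on one outlet : {per_eflop < outlet}")
```

If the projection holds, 1 exaflop would need under 300W, roughly a tenth of what a single socket can supply.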

The above are not the only disruptive technologies out there …

Best wishes and have a fabulous winter holiday/break.

Paul Fretter
Head of CiS