Integration of cloud technology with Los Alamos high-performance computing systems improves research

Los Alamos National Laboratory. Courtesy / LANL

LANL News:

Thanks to the continued collaboration between Los Alamos National Laboratory (LANL) and Hewlett Packard Enterprise (HPE), laboratory researchers can now use cloud technologies to conduct complex scientific research more efficiently with high-performance computing applications.

These technologies allow administrators to perform upgrades and maintenance of computer systems without interfering with critical work in progress.

“By leveraging Linux software containers and container orchestration both in user space and for system management, the lab’s latest high-performance institutional computing system, named Chicoma, now gives hundreds of users greater flexibility than was available on previous-generation systems,” said Gary Grider, head of LANL’s HPC division.

Chicoma is one of the first systems deployed using the HPE Cray EX supercomputer, which also leverages HPE Cray System Management, a next-generation software stack with management capabilities and other related services.

HPE Cray System Management promises to minimize downtime and enable administrators to use continuous integration techniques to securely upgrade, patch, and fine-tune systems without disrupting user productivity. When paired with a cloud service model, it provides better manageability, reliability, availability, and resiliency.

“A resilient, well-versioned management plane has the potential to virtually eliminate system downtime for upgrades and administrative actions,” said Alden Stradling, senior administrator for Chicoma.

HPE Cray System Management also allows greater flexibility in upgrades and more aggressive fixes with no visible impact on the user.

“Administrators can now take advantage of modern cloud-ready toolsets that benefit from significant investment and developer attention. Meanwhile, users see this like any other cluster, except with better administrator response to feature requests and much less scheduled downtime,” added Stradling.

Chicoma demonstrates the power of container technologies to support complex workflows and non-native software dependencies entirely in user space.

Using Charliecloud, a container runtime environment originally developed at Los Alamos, scientists were able to deploy a complex bioinformatics toolchain for identifying pathogens in metagenomic samples.

“We used a workflow manager called Cromwell to coordinate container execution and a Python script to automate sample processing,” said Mark Flynn, researcher at Los Alamos. “Our Cromwell workflow manager used a MySQL database running in Charliecloud to track each workflow. Without Charliecloud, it would not have been possible to deploy the Cromwell workflow manager without the assistance of an administrator.”

Charliecloud is a fully unprivileged Linux container runtime environment. Users can install Charliecloud in their home directory, then create and run containers without the intervention of HPC staff.

“Charliecloud enables users to explore innovative solutions to difficult problems,” said Tim Randles, co-founder of Charliecloud, who works in the lab’s HPC design group. “Researchers now have full control over their runtime environment, allowing them to develop and deploy complex workflows using cutting-edge technologies that would have been difficult to support in a traditional HPC environment.”

The workflow manager and its MySQL database ran on one compute node, and the workflow manager generated new Slurm jobs to run the various bioinformatics tools used to process each sample. The Python automation script ran on another compute node and submitted samples to the workflow manager. The Python dependencies were installed using Miniconda.
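For readers curious what a Slurm job that runs a tool inside a Charliecloud container might look like, the following is a minimal Python sketch. It is not the team's actual pipeline: in the setup described above, Cromwell itself generated the Slurm jobs. The image path, the tool name `classify_reads`, and the sample file naming are all hypothetical. The sketch only builds the command lines, using Charliecloud's `ch-run` and Slurm's `sbatch --wrap`, and does not submit anything.

```python
import shlex


def build_ch_run(image_dir, tool_argv):
    """Return the argv that runs a command inside an unpacked Charliecloud image."""
    # "ch-run IMAGE -- CMD ARGS..." is the basic Charliecloud invocation.
    return ["ch-run", image_dir, "--", *tool_argv]


def build_sbatch(job_name, inner_argv):
    """Return the argv for an sbatch call that wraps the inner command."""
    # --wrap takes a single shell command string, so quote the inner argv safely.
    return ["sbatch", f"--job-name={job_name}", f"--wrap={shlex.join(inner_argv)}"]


def plan_sample_jobs(samples, image_dir, tool):
    """Map each sample name to the sbatch argv that would process it."""
    jobs = {}
    for sample in samples:
        # Hypothetical convention: each sample lives in "<sample>.fastq".
        inner = build_ch_run(image_dir, [tool, f"{sample}.fastq"])
        jobs[sample] = build_sbatch(f"{tool}-{sample}", inner)
    return jobs


if __name__ == "__main__":
    # Hypothetical image directory and tool name, for illustration only.
    plan = plan_sample_jobs(["sampleA", "sampleB"], "/images/biotools", "classify_reads")
    for sample, argv in plan.items():
        print(sample, "->", shlex.join(argv))
```

Because the functions return argument lists rather than calling Slurm directly, the plan can be inspected or logged first; on a real cluster, each list could then be passed to `subprocess.run`.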

“We were able to do whatever we needed using Charliecloud containers running in Slurm jobs,” Flynn said.

