Case: Linux HPC cluster automation

Our customer struggled with provisioning and maintaining their Linux HPC cluster manually. We stepped in and deployed Server Lifecycle Management with Foreman and Identity Management with FreeIPA. New nodes could now be provisioned quickly and easily at any time. Users could log in to the cluster nodes using their Active Directory credentials. Changes could be made to all the nodes safely and effectively.

Main technologies

Foreman
FreeIPA
Puppet

Benefits

Quick provisioning
Consistent configurations
Greatly reduced error rates

Numbers

21 managed nodes
4 machine roles
42 automation modules

1 Starting point

The client had a manually managed Linux HPC cluster. They used it to run calculations with software such as Ansys Fluent. The calculation servers' hardware failed fairly regularly due to the extreme CPU load it was subjected to. This, in turn, caused a lot of debugging, because it was never certain whether a particular problem was caused by a misconfiguration or by a hardware failure. They avoided rebuilding cluster nodes because it was a heavy, error-prone process based mainly on tacit knowledge. Software configurations across the cluster had also diverged considerably over time. In addition, the cluster nodes did not use Active Directory authentication, so local Linux user accounts were used instead.

2 Project

We started resolving the client's challenges by turning the basic configuration of the cluster nodes into Puppet code stored in Git. The code included integration with a FreeIPA Linux domain, which in turn was integrated with Active Directory. These first steps removed most of the configuration divergence in their environment and enabled centralized authentication.

As the next step, we automated the provisioning of new cluster nodes with Foreman's built-in PXE boot provisioning. This made it more appealing to simply reprovision a cluster node that was exhibiting strange behavior than to try to debug it.

As the final step, we created a fully virtualized test environment based on VirtualBox and Vagrant for developing cluster automation code and testing PXE boot-based provisioning.

During the project we helped the client's employees get accustomed to a new way of working with Git and Puppet.

3 End result

At the end of the project the client had a complete solution for managing their Linux HPC cluster. All changes to the cluster configuration went through Git version control, could be reviewed by a peer, and were then deployed automatically. Every change was visible and traceable back to its source. No manual steps were required when rebuilding a cluster node or provisioning a new one. Users could log in to cluster nodes using their familiar Active Directory credentials.

"Puppeteers helped us resolve our Red Hat Enterprise Linux issue. I'm looking forward to upgrading and improving our clients' production environments and our development setups with their help."
Aarre Pohjola
A&A Consulting Oy, Finland