1 Starting point
The client had a manually managed Linux HPC cluster. They used it to run calculations with software such as Ansys Fluent. Calculations server hardware failed fairly regularly due to the extreme CPU load it was subjected to. This, in turn, caused lots of debugging because it was never 100% sure if any particular problem was caused by a misconfiguration or hardware failure. They avoided rebuilding cluster nodes because it was a heavy, error-prone process based mainly on tacit knowledge. Software configurations across the cluster also had diverged a lot over time. The cluster nodes also did not use Active Directory authentication, so local Linux user accounts were used instead.