Buildbot is a continuous integration framework that many open source projects seem to use. Unlike continuous integration applications or automation servers such as Jenkins, Buildbot does not make many assumptions about your use case. In fact, when you build your CI or CD pipeline you are actually writing Python code, and hence have all the flexibility of Python at your disposal. In other words, you are not blocked or obstructed by assumptions hidden in the tool itself, as often happens with Jenkins.
With the introduction behind us, let's get to the real topic: how to limit concurrent builds in Buildbot. There are actually many ways to do this, and some of them require understanding a few Buildbot terms and concepts:
- Workers: these are the nodes that run the Buildbot worker software. They could run on a baremetal machine, a VM or a container. They often correspond to an environment in which the software will be built and/or run. For example, you may have a separate worker for each of a bunch of Linux distributions, Windows and macOS.
- Build steps: these are atomic operations like fetching the source code or running a build command.
- Builders: Builders combine Build steps with Workers.
There are several reasons why you might want or need to limit the number of concurrent builds:
- Throttling: you might get throttled if a large number of workers try to run the same build step at the same time. For example, downloading the same file one hundred times at exactly the same time could get you in trouble.
- Exclusive-access resources: the documentation gives an example of a database server that only allows one connection at a time. It could also be a VPN server, for example if you're testing a VPN client using the same set of credentials or certificates.
- Worker resource limitations: if your workers are highly stressed (e.g. CPU, memory, disk) you may need to limit the concurrent builds on them to prevent resource exhaustion.
- Worker host resource limitations: if several workers are hosted on the same node (e.g. baremetal machine or VM), limiting concurrent builds on the individual workers does not help. You need to limit the number of workers building at any given time on their host.
Buildbot provides locking to limit build concurrency. There are two scopes for locks:
- Master locks: these operate on the buildmaster level, but are not automatically "global": a master lock only affects the Builders and Build steps that reference it.
- Worker locks: these are specific to each worker individually.
There are two access modes for locks. Both are available for Master and Worker locks:
- Exclusive mode: one slot is available. When a Builder acquires the lock other Builders are blocked until the lock is freed.
- Counting mode: the same as above, but the number of slots is configurable.
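As a rough illustration of the two scopes and the two access modes, locks could be declared in master.cfg along these lines. The lock names, the worker name "fast-worker" and the slot counts are made up for this sketch; MasterLock, WorkerLock and access() are the Buildbot interfaces as I understand them from the documentation.

```python
from buildbot.plugins import util

# Master lock: a single instance shared across the whole buildmaster.
# maxCount is the number of slots in counting mode (hypothetical value).
container_host_lock = util.MasterLock("container_host", maxCount=2)

# Worker lock: every Worker gets its own independent instance of the lock.
# maxCountForWorker overrides the slot count for named workers (hypothetical name).
compile_lock = util.WorkerLock(
    "compile", maxCount=1, maxCountForWorker={"fast-worker": 3}
)

# The access mode is chosen where the lock is used, not where it is defined.
counting_access = container_host_lock.access("counting")    # up to maxCount holders
exclusive_access = container_host_lock.access("exclusive")  # one holder at a time
```

The objects returned by access() are what gets passed in the locks argument of a Builder or a Build step.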
Locks can be defined for individual Build steps or for the entire Build.
A lock is attached in the configuration of a Builder, which means that the same lock can be shared between Builders. For example, you can create a lock that is shared between Builders whose Workers happen to reside on the same node. Builders that use Workers on other nodes can then have different locking policies, or no locking at all.
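A minimal sketch of sharing one lock between Builders could look like the following. The builder names, worker names, repository URL and the slot count are placeholders invented for the example, not taken from a real setup.

```python
from buildbot.plugins import steps, util

c = BuildmasterConfig = {}

# Hypothetical build factory; the repository URL and command are placeholders.
factory = util.BuildFactory()
factory.addStep(steps.Git(repourl="https://example.org/project.git"))
factory.addStep(steps.ShellCommand(command=["make"]))

# One lock shared by all Builders whose Workers live on the same node.
node_lock = util.MasterLock("shared_node", maxCount=2)

# These three Builders compete for the two slots of node_lock,
# so at most two of them build at any given time.
linux_builders = [
    util.BuilderConfig(
        name=name,
        workernames=[name + "-container"],
        factory=factory,
        locks=[node_lock.access("counting")],
    )
    for name in ("debian", "fedora", "ubuntu")
]

# This Builder's Worker runs on a different node, so it takes no lock.
windows_builder = util.BuilderConfig(
    name="windows", workernames=["windows-vm"], factory=factory
)

c["builders"] = linux_builders + [windows_builder]
```

Passing the lock in BuilderConfig holds it for the entire Build. Passing it in the locks argument of a single step instead would hold it only while that step runs.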
There are a couple of other mechanisms for limiting concurrency besides locks:
- canStartBuild method: a callable that is passed to the BuilderConfig that creates a Builder. Its purpose is to do "something" to determine whether it is ok to start a build or not. It is called for all workers when a BuildRequest is received, e.g. when a Scheduler notices a change in a Git repository. While the method runs on the buildmaster, the checks can be executed on a worker with RemoteCommand. This method probably has its uses, but throttling builds based on load average, as in the official examples, or on any other resource consumption-based check turned out to be very difficult to implement. The reason is that the buildmaster throws the BuildRequest to all Builders at the same time, and if the system is mostly idle at that point, resource consumption (load average, memory, etc.) will be very low. So all Builders are fooled into thinking that there are enough resources available and happily start the build process. Adding a delay to allow resource consumption to grow works, kind of, but is relatively unreliable.
- max_builds option: this is a Worker-level option to easily limit the number of builds running on any given Worker.
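The race described in the canStartBuild item can be illustrated with a toy model in plain Python. Nothing here is Buildbot API: the threshold, the per-build load and both function names are invented for this sketch, and a standard library semaphore stands in for a counting lock.

```python
import threading

LOAD_LIMIT = 2.0  # hypothetical load-average threshold for starting a build


def load_based_admission(num_requests):
    """Count builds admitted when every check samples the load at arrival time."""
    # The host is idle when the BuildRequest arrives, and no build has had
    # time to raise the load before the other checks run.
    observed_load = 0.0
    decisions = [observed_load < LOAD_LIMIT for _ in range(num_requests)]
    return sum(decisions)


def lock_based_admission(num_requests, max_count):
    """Count builds admitted when a counting lock with max_count slots is used."""
    slots = threading.BoundedSemaphore(max_count)
    # A non-blocking acquire returns True only while a slot is still free.
    return sum(slots.acquire(blocking=False) for _ in range(num_requests))


print(load_based_admission(10))      # every request passes the load check
print(lock_based_admission(10, 2))   # only as many as there are slots
```

In this model the load check admits all ten requests because they all see the same idle system, while the slot-based check admits only two no matter what the load looks like at that moment.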
To summarize, these are my "best practices" for limiting concurrency:
- Do not rely on a canStartBuild method that attempts to check resource consumption. The only exception to this rule is a Worker that may, for reasons unrelated to Buildbot, be heavily taxed already.
- Use the max_builds option to limit concurrency on the Worker level. If some of your Builders are light on resources and some are not, then this approach may be too simplistic, but if you're just running roughly identical compilation jobs it should suffice.
- Use counting-mode master locks to prevent a node hosting several Workers from collapsing under load. I used this approach in my Buildbot setup where a virtual machine was hosting a containerized buildmaster and a large number of containerized Linux workers. By registering a lock for the Builders associated with these Linux Workers I was able to prevent the container host from crashing when a BuildRequest arrived. The Builders that used non-containerized Workers did not have any locks, so they could build at their own pace without being affected by the limitations of the container host.
- Use exclusive master locks to control access to shared resources that only allow one connection at any given time.
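As a sketch of the max_builds and exclusive-lock recommendations, a master.cfg fragment might look something like this. The worker names, passwords, build commands and the "vpn_credentials" lock name are all hypothetical.

```python
from buildbot.plugins import steps, util, worker

c = BuildmasterConfig = {}

# max_builds caps the number of concurrent builds on each Worker
# (names, passwords and limits are placeholders).
c["workers"] = [
    worker.Worker("linux-worker", "secret", max_builds=1),
    worker.Worker("big-worker", "secret", max_builds=4),
]

# Master lock guarding a resource that accepts one connection at a time.
vpn_lock = util.MasterLock("vpn_credentials")

factory = util.BuildFactory()
factory.addStep(steps.ShellCommand(command=["make"]))
# Only the step that touches the shared resource holds the lock,
# so compilation on other Builders is not serialized.
factory.addStep(
    steps.ShellCommand(
        command=["make", "vpn-test"],
        locks=[vpn_lock.access("exclusive")],
    )
)

c["builders"] = [
    util.BuilderConfig(
        name="linux",
        workernames=["linux-worker", "big-worker"],
        factory=factory,
    ),
]
```

Putting the exclusive lock on the step rather than on the whole Builder keeps the time the resource is blocked as short as possible.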