Understanding Infrastructure as Code (IaC)

What is infrastructure as code?

Puppet code in terminal

Infrastructure as code (IaC) means code written to manage or create infrastructure components, instead of managing the infrastructure manually by running commands or by using a GUI. If you're not interested in the theory behind IaC you may want to have a look at our introductory Terraform and Puppet articles instead.

Infrastructure as code brings all the tools of software development to system administration work, including version control, automated testing, change management and continuous delivery. With these tools come benefits like improved quality, speed and teamwork, as well as reduced risks and cost when making changes. These are illustrated in the diagram below:

Benefits of Infrastructure as Code

Manually managing infrastructure has several key issues:

  • Inconsistency: everyone likes to do things in their own way
  • Documentation, if it exists, tends to lag behind reality
  • Poor visibility into the current state of infrastructure
  • Lots of room for human error
  • Delay between introducing and spotting a problem
  • Lots of coordination required to understand who did what, when, how and why
  • Difficult to enforce policies

Infrastructure as code tools solve most of the above problems. With IaC you often, but not always, have to spend more time upfront, but much less time debugging and fixing issues or coordinating work.

Many infrastructure as code tools use a declarative programming language to describe the desired state of the system being managed.

What is declarative programming?

In declarative programming you define the desired state of the thing you are managing. This is unlike imperative programming where you define the steps your code will take. Many infrastructure as code tools are purely based on the declarative principle (Terraform, Puppet), some mix declarative and imperative (Ansible, Chef, Puppet Bolt) and some are purely imperative (Docker with Dockerfile).

Here's an example of creating a simple text file, first in the declarative Puppet language and then using imperative Linux shell commands:

file { '/tmp/foobar':
  ensure  => present,
  content => 'foobar',
  owner   => 'joe',
  group   => 'joe',
  mode    => '0755',
}

$ echo -n foobar > /tmp/foobar
$ chown joe:joe /tmp/foobar
$ chmod 755 /tmp/foobar

As you can see, the declarative approach defines the desired state of the file, whereas the imperative approach lists the steps required to reach that state. Similar comparisons can easily be made for, say, Terraform (declarative) and the AWS CLI (imperative). Purely declarative approaches are naturally idempotent; see below for details.

Under the hood even declarative IaC tools run imperative commands but this is hidden from the user of those tools.

Scope of the desired state

Infrastructure as code tools do not manage the complete state of the system, regardless of whether that system is a baremetal server, virtual machine or a container. Instead, they manage a subset of it. Anything not covered by the administrator-defined desired state is left alone. For example, on a Linux desktop system the vast majority of resources are not explicitly managed; instead their initial state comes from the operating system installation and is further changed by package upgrades and system services (logrotate, certbot, etc) as well as manual changes. Here's an example:

Example of managed and unmanaged resources on a Linux desktop system

Containerization greatly reduces the likelihood of manual configurations, but does not have an impact on the operating system defaults. Every time you rebuild a container those defaults may change, e.g. due to security updates applied to the base image or by your container provisioning script (e.g. Dockerfile).

Similarly in the Cloud you always have a mix of unmanaged and managed resources. The former can be Cloud provider defaults or manual configurations. For example in AWS you could have something like this:

Example of managed and unmanaged resources in AWS

On top of this, changes come from external systems. For example, operating system updates change the state of the system outside of configuration management. Similarly, Cloud providers occasionally change their defaults, which may show up as unplanned changes in the managed resources.

What is idempotency?

A natural consequence of the declarative approach is being able to run the same code over and over again without any side-effects or changes to resources being managed. This feature is called idempotency. Here's an example of adding a line to a file, first with declarative Puppet code and then with imperative Linux shell commands:

file_line { 'myline':
  ensure => present,
  path   => '/tmp/myfile',
  line   => 'myline',
}

$ echo myline >> /tmp/myfile

If you run the Puppet code again nothing will happen. When you run the shell command again, a new line is added every time. To solve this problem in shell you'd need to implement idempotency yourself with something like this:

#!/bin/sh

if ! grep -qE '^myline$' /tmp/myfile 2>/dev/null; then
  echo myline >> /tmp/myfile
fi

The same principle applies to other declarative IaC tools like Terraform when you compare them to their imperative counterparts (e.g. the AWS CLI). The need to implement idempotency manually makes writing long, rerunnable and side-effect free scripts very laborious.

The only real alternative to implementing idempotency is to always build the thing you're managing from scratch, which is what Docker does when building ("baking") an image from a Dockerfile. Docker layers are just used to optimize the process, but the overall principle remains the same.
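The build-from-scratch principle can be reduced to a short shell sketch. The directory name and file contents below are illustrative; a directory stands in for an image:

```shell
#!/bin/sh
# Sketch of "baking": rather than converging an existing system,
# rebuild it completely on every configuration change.
IMAGE_DIR=/tmp/baked-image

rm -rf "$IMAGE_DIR"            # discard the previous build entirely
mkdir -p "$IMAGE_DIR/etc"

# Build steps run unconditionally; idempotency is a non-issue because
# nothing survives from the previous build.
printf 'setting=1\n' > "$IMAGE_DIR/etc/app.conf"
echo "image rebuilt"
```

Running the script twice produces an identical result every time, which is how baking achieves repeatability without per-step idempotency checks.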

What is convergence?

Convergence in the declarative configuration management context means taking the minimal steps required to bring the current state (reality) in line with the desired state. In other words, convergence fixes configuration drift, which is created by changes made outside of configuration management. Some of those changes are made manually, some by external tools (e.g. software updates) and some are side-effects of changes in default configurations (e.g. changes at the Cloud provider end).
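As a simplistic illustration of how convergence works in operating system level tools such as Puppet, here is a shell sketch. The file path and desired values are made up, and `stat -c` assumes GNU coreutils:

```shell
#!/bin/sh
# Hypothetical convergence sketch: bring /tmp/demo.conf to the desired
# state (content "setting=1", mode 0644) with minimal steps.
FILE=/tmp/demo.conf
DESIRED_CONTENT='setting=1'
DESIRED_MODE=644

# Create or rewrite the file only if it is missing or has drifted
if [ ! -f "$FILE" ] || [ "$(cat "$FILE")" != "$DESIRED_CONTENT" ]; then
  printf '%s\n' "$DESIRED_CONTENT" > "$FILE"
  echo "converged: content"
fi

# Fix the mode only if it differs from the desired state
if [ "$(stat -c '%a' "$FILE")" != "$DESIRED_MODE" ]; then
  chmod "$DESIRED_MODE" "$FILE"
  echo "converged: mode"
fi
```

On a run where reality already matches the desired state, neither branch fires and nothing changes; that no-op run is what a converged system looks like.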

The terms convergence and configuration drift are important for mutable infrastructure but not for immutable infrastructure.

What is mutable and immutable infrastructure?

If a system's configuration is supposed to be changed over its lifecycle the system is considered mutable. If a system is rebuilt on every configuration change then it is considered immutable. Immutability is rarely absolute, but rebuilding on every configuration change reduces the likelihood of configuration drift, and any configuration drift that does appear will be corrected much faster. The terms mutable and immutable are closely linked to the two main ways of building and maintaining infrastructure: frying and baking.

What are frying and baking?

Managing the configuration of a system over its lifecycle aims to correct configuration drift. This process is sometimes called frying, which refers to a typical cooking process: you start frying something in a pan, then add more ingredients, then fry some more and finally, when the dish is ready, you eat it. Similarly, when managing a system you first make changes to reach the initial desired state, then periodically make changes over the lifetime of the system until it has reached its end of life and you destroy it. Physical computers and virtual machines are often fried. In other words, you set up the initial state of the system and then change it as needed until the system is ready to be decommissioned.

The term "baking" derives from the way you would bake a cake: you prepare the dough, put the dough into the oven, wait a while and finally take the finished cake out. When managing a system this means that you create ("bake") a new instance of the system from scratch every time there's a configuration change.

The diagram below illustrates the differences between baking and frying:

Differences between baking and frying in the configuration management context

As can be seen from the diagram, the main difference is that frying operates on the same system, whereas baking creates a new system from scratch. In frying the lifecycle of the system is typically long, whereas in baking it is typically short.

There are several examples of systems used for baking: Docker builds container images from Dockerfiles, HashiCorp Packer bakes virtual machine and Cloud images, and tools like FOG deploy golden images to physical machines.

Traditional configuration management tools like Puppet, Ansible, Chef and Salt are used to fry a system, in other words manage the system over its lifecycle with incremental changes.

In practice both bake and fry methods are often mixed together. For example:

  • You create a Windows system from a golden image with FOG and then install some user- or department-specific applications on top
  • You install a Red Hat operating system and configure a configuration management agent with kickstart. After installation you let a configuration management system take control over the system.
  • You launch a VM baked by somebody else (e.g. Microsoft, Red Hat) in a public Cloud, then use your own provisioning and/or configuration management scripts to configure it.

What are push and pull models?

Infrastructure as code tools apply their configurations by pushing them to the managed systems, or by letting the managed systems themselves pull and apply the configurations. The diagram below illustrates the concepts:

Push and pull models in configuration management

IaC tools that use the pull model often have an agent running that polls a configuration management server for the latest desired state. If the current state does not match the desired state, the agent takes corrective action. This means that in a pull model the agent is effectively a continuous delivery system. The pull model is best suited for mutable (fried) infrastructure and can be utilized when you have full access to the systems you're managing. Baremetal desktops and virtual machines are good examples of such systems.
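A single iteration of that pull workflow can be sketched in shell. A real agent (Puppet's, for instance) runs this kind of step on a schedule; both paths below are illustrative stand-ins for a configuration server and a managed resource:

```shell
#!/bin/sh
# Hypothetical single iteration of a pull-model agent. The desired
# state is reduced to one file's content, "pulled" from a local path
# that stands in for the configuration management server.
SERVER_COPY=/tmp/server-desired.txt    # stands in for the server
MANAGED_FILE=/tmp/agent-managed.txt    # the resource under management

# The "server" publishes a desired state
printf 'max_connections=100\n' > "$SERVER_COPY"

# Agent: pull the desired state, compare, and converge only on drift
if ! cmp -s "$SERVER_COPY" "$MANAGED_FILE" 2>/dev/null; then
  cp "$SERVER_COPY" "$MANAGED_FILE"
  echo "drift corrected"
fi
```

Looping this with a sleep interval is what turns the agent into the continuous delivery system described above.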

Tools that utilize the push model work differently. They are launched from a controller, which could be your own laptop or a CI/CD system like Jenkins. If the tool in question is declarative, the controller reaches out to the systems being managed, figures out the differences between the desired state and the current state and runs the commands required to reach the desired state. If the tool is imperative, the controller just runs the commands it is told to run or deploys a new pre-baked image. In either case the target systems do not need a dedicated agent to be running. The push model is often the only choice when you have limited (e.g. API-only) access to the system you're managing: this is the case with public Clouds and SaaS, for example.

Infrastructure as code tools can be grouped into the pull and push categories:

  • Puppet server + agents (pure pull model over TLS)
  • Puppet Bolt (pure push model over SSH or WinRM)
  • Masterless Puppet over Git (pull model over SSH or other Git-compatible protocol)
  • Ansible (pure push model over SSH or WinRM)
  • Controller-less Ansible over Git (pull model over SSH or other Git-compatible protocol)
  • Terraform (pure push model over various APIs)
  • Deploying Docker containers using a CI/CD system (pure push model)

As can be seen from these examples, Terraform is the only tool in the list with no pull-model variant at all. This design choice was forced on it, though, because most of the systems it manages are only available through APIs.

In practice pull and push can be combined. For example, some Puppet providers use API calls to push changes to a remote or local system, yet those API calls might be triggered by a Puppet agent that pulls its configurations from a Puppet server.

Separation of data and code

When you install a new application on your computer you often have to configure it. In this scenario the configuration is your data and the application is the code. For example, the WiFi networks you have saved are specific to you; that is your data. The application that connects to those networks is the code. Few would propose mixing the two, especially in software that

  • Is distributed to others
  • Needs to adapt to different use-cases

The same principle applies to infrastructure as code. You want to keep data and code separate to support differences between

  • Development stages (production, staging, development)
  • Geographic locations
  • Deployment scenarios
  • Operating systems
  • Policies

The diagram below illustrates the "development stages" use-case for separating data and code in a typical web application deployment context:

Example of separation of infrastructure data and code for a web application

All infrastructure as code tools allow separating data from code. The quality of the implementation may vary, but it is always possible. Most tools use modules (Puppet, Terraform) or roles (Ansible) which take variables as input. The input variables are your data and drive the logic in the module (the code) so it does what you want. In Dockerfiles you use variables or arguments to separate data from code at both build time and run time.
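The principle can be shown with a small shell sketch. The stage names and settings are hypothetical; in a real setup the per-stage data would live in separate files (e.g. Hiera data in Puppet, tfvars files in Terraform, group_vars in Ansible):

```shell
#!/bin/sh
# Sketch of keeping data separate from code, with hypothetical values.
STAGE=development   # data selector, normally injected by CI/CD

# Data: everything that differs between stages lives here
case "$STAGE" in
  production)  DB_HOST=db.prod.example.com; REPLICAS=3 ;;
  development) DB_HOST=localhost;           REPLICAS=1 ;;
esac

# Code: identical for every stage, driven purely by the data above
echo "deploying $REPLICAS replica(s) against $DB_HOST"
```

Switching the deployment from development to production changes only the data; the code line at the bottom never has to be touched.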
