Writing Ansible modules is easier than you may think. Many times it is easier than trying to hack your way through a problem with raw Ansible yaml code.
Photo credit: Harrison Haines (https://www.pexels.com/it-it/foto/internet-connessione-tablet-app-5247937/)

What are Ansible modules?

Ansible modules provide the infrastructure as code building blocks for your Ansible roles, plays and playbooks. Modules manage things such as packages, files and services. The scope of a module is typically quite narrow: it does one thing but attempts to do it well. Writing custom Ansible modules is not particularly difficult. The first step is to solve the problem with raw Python; then you can convert that Python code into an Ansible module.

Some problems can't be solved elegantly with existing modules

The default modules get you quite far. However, occasionally you may end up with tasks that are quite difficult to do with Ansible yaml code. In these cases the Ansible code you write becomes very ugly or very difficult to understand, or both. Writing custom Ansible modules can greatly simplify things if this happens.

Example of modifying trivial JSON with raw Ansible

Here is an example of how to modify a JSON file with Ansible. The file looks like this:

{
  "alt_domains": ["foo.example.org", "bar.example.org"]
}

What Ansible needs to do is add entries to and remove entries from the alt_domains list. The task sounds simple, but the solution in raw Ansible is very ugly:

- name: load current alt_domains file
  include_vars:
    file: "{{ alt_domains_file }}"
    name: alt_domains
- name: set default value for alt_domain_present
  ansible.builtin.set_fact:
    alt_domain_present: false
# The lookup returns data in this format: {'key': 'alt_domains', 'value': ['foobar.example.org', 'foobar.example.org']}
- name: check if current alt_domain already exists in alt_domains
  ansible.builtin.set_fact:
    alt_domain_present: true
  loop: "{{ query('ansible.builtin.dict', alt_domains) }}"
  when: alt_domain in item.value
- name: add alt_domain to alt_domains
  set_fact:
    alt_domains: "{{ alt_domains | default({}) | combine({\"alt_domains\": [\"{{ alt_domain | mandatory }}\"]}, list_merge=\"append\") }}"

Most would probably agree that the code above is already very nasty. That said, it does not yet even handle removal of entries from the list or writing the results back to disk. If you had to modify non-trivial JSON files, code like the above would make your head explode. There may be other ways to solve this particular problem in raw Ansible, but if there are, I was unable to find any easily.

The solution: writing custom Ansible modules

With Ansible you occasionally end up in a hairy situation where you find yourself hacking your way through a problem. It is in those cases where writing a custom Ansible module probably makes the most sense. To illustrate the point, here's a rudimentary but fully functional implementation for managing an alt_domains file such as the one above:

#!/usr/bin/python
import json

from ansible.module_utils.basic import AnsibleModule

def read_config(module):
  try:
    with open(module.params.get('path'), 'r') as alt_domains_file:
      have = json.load(alt_domains_file)
  except FileNotFoundError:
    have = { "alt_domains": [] }

  return have

def write_config(module, have):
  with open(module.params.get('path'), 'w') as alt_domains_file:
    json.dump(have, alt_domains_file, indent=4, sort_keys=True)
    alt_domains_file.write("\n")

def run_module():
  module_args = dict(
    domain=dict(type='str', required=True),
    path=dict(type='str', required=True),
    state=dict(type='str', required=True, choices=['present', 'absent'])
  )

  result = dict(
    changed=False
  )

  module = AnsibleModule(
    argument_spec=module_args,
    supports_check_mode=True
  )

  if module.check_mode:
    module.exit_json(**result)

  have = read_config(module)
  want = module.params.get('domain')
  state = module.params.get('state')

  if state == 'present' and want in have['alt_domains']:
    result.update(changed=False)
  elif state == 'present' and not (want in have['alt_domains']):
    result.update(changed=True)
    have['alt_domains'].append(want)
    write_config(module, have)
  elif state == 'absent' and want in have['alt_domains']:
    result.update(changed=True)
    have['alt_domains'].remove(want)
    write_config(module, have)
  elif state == 'absent' and not (want in have['alt_domains']):
    result.update(changed=False)
  else:
    module.fail_json(msg="Unhandled exception: domain == %s, state == %s!" % (want, state))

  module.exit_json(**result)

def main():
  run_module()

if __name__ == '__main__':
  main()

This Python code could use some polishing (e.g. proper check_mode support). Yet it is still a lot more readable and understandable than the hackish raw Ansible yaml implementation would be. You also get variable validation for free, without having to resort to strategically placed ansible.builtin.assert calls.
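As a usage example, here is how a task could call the module, assuming it has been saved as library/alt_domains.py next to your playbook (the module and file names are my own choice, not something mandated above):

- name: ensure foo.example.org is present in alt_domains
  alt_domains:
    path: /etc/myapp/alt_domains.json
    domain: foo.example.org
    state: present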

Summary: do not be afraid of writing Ansible modules

Sometimes you may find yourself in a world of hurt while solving a seemingly easy problem with raw Ansible yaml code. This is when you should stop and consider writing an Ansible module instead. Writing a custom Ansible module can make your code much more understandable, flexible and of better quality.

More about Ansible quality assurance from Puppeteers

Open source maturity model from Mindtrek 2022. Applicable to the European Commission's digital sovereignty journey as well. Photo: Samuli Seppänen, 2022

What is software sovereignty

Software sovereignty is a subset of digital sovereignty. In essence, digital sovereignty means controlling your data, hardware and software. In Europe digital sovereignty has been driven by the EU. The reason is the reliance on services from big, global, US-led vendors such as Amazon, Microsoft and Google. This poses a risk to the EU, just as reliance on Chinese manufacturing does.

These worries are compounded by the threats to democracy posed by the rise of authoritarianism (e.g. Russia and China), Trump's rise to power and the MAGA movement in the US, and the rise of far-right nationalist parties in various European countries. Without digital sovereignty in general, or software sovereignty in particular, somebody could "pull the plug" and you would lose access to your own data, hardware and software. Moreover, if you do not control your data, hardware and software, your capability to innovate is severely hindered.

Open source and software sovereignty

Open source software is part of the "software" side of digital sovereignty. If you are not able to see and modify the source code for the applications you run, you need to rely on somebody else to do it. If you are using closed source (proprietary) software, the vendor might never implement the features you'd like it to.

Big organizations may be able to get commercial vendors to customize their software for them. Small actors, such as individuals and small companies, are essentially at the mercy of the vendor. The vendor may implement or drop features, and change the prices or the pricing model, at will. Software as a service (SaaS) is the worst in this regard, as the vendor manages everything, including the configuration of the application. You can typically customize closed-source self-hosted applications to a greater degree than software as a service.

This is where open source comes in. Being open, it allows anyone with the proper skillset to inspect and modify software to suit their needs. This characteristic of open source helps avoid vendor lock-in, even when using commercially supported open source software. Even in the SaaS context you're not out of luck, as you can typically migrate your data from a vendor-managed service to a self-hosted instance.

Perspectives from the Mindtrek 2022 event

At Mindtrek 2022 Miguel Diez Blanco and Gijs Hillenius from the Open Source Programme Office (OSPO) in the European Commission gave a presentation about their open source journey. Timo Väliharju from COSS ("The Finnish Centre for Open Systems and Solutions") gave some perspectives on open source in Europe through his experience in APELL ("Association Professionnelle Européenne du Logiciel Libre"). What follows is essentially a summary of their analysis of the state of open source and digital sovereignty within the EU and the European member states.

European commission's open source journey

The European Commission started its open source journey around the year 2000 by using Linux, Apache and PHP to set up a wiki. Later they set up a lot more wikis. Open source was at that time also used on the infrastructure layer, and later it gradually crept up to the desktop. In 2007 they started to produce open source software themselves (see code.europa.eu). By 2014 the Commission had started contributing to other, external open source projects. So, over the years they climbed up the open source maturity ladder. The Commission's usage of open source software continues to increase. OSPO's goal is to lead by example: working towards open source and software sovereignty tends to encourage others to pick it up, too. Something that works for the EU is likely to work for a national government as well.

Culture of sharing and open source

There are about three thousand developers (employees and contractors) in the European Commission. As seems to often be the case, many of these internal teams previously worked in isolation. The isolation is accidental rather than by design, but it is still harmful to the introduction of open source and hence to achieving software sovereignty.

OSPO tackled the problem by encouraging "inner source" by default. The term means using code developed in-house when possible. This did, however, require a culture of sharing first. While some software projects were good to share as-is, some had issues that the authors had to resolve first. Some projects were not useful outside of the team that had developed them, so the authors decided to keep them private. The cultural change took a couple of years. OSPO encouraged the change by providing really nice tools for those teams that decided to join. That is, they preferred a carrot to a stick.

Outreach to communities

Along with their internal open source journey, OSPO has also reached out to open source communities. They fund public bug bounty programs and organize hackathons for important open source projects. The hackathons help gauge the maturity of those open source projects. They also help OSPO find ways to help the projects become more mature.

OSPO also holds physical and virtual meetings between representatives of European countries once a year. The goal of these meetings is to increase open source usage and software sovereignty through data-based decisions.

Improving security of open source software

OSPO has gone beyond bug bounties in its attempts to improve the security of open source software. FOSSEPS stands for "Free and Open Source Software Solutions for European Public Services". One of its key objectives has been to improve the security of the open source software used by the Commission. OSPO achieved the goal by building an inventory of software used by the EU. It used the inventory to figure out which software required an audit. Once the audits were finished, they fixed the security issues they had identified.

The journey to software sovereignty continues

The European Commission's open source work is still ongoing. In the member states the status of digital sovereignty varies a lot. Some countries, like France and Germany, put a lot of emphasis on open source in their policies, but funding may at times be a bit thin. Other countries, for example Finland and Denmark, consider open source "nice to have" instead of "must have". On the commercial front the challenge is that European open source companies tend to be small. This is why one of APELL's goals is to help them work together more efficiently.

Open source at Puppeteers

We, the Puppeteers, are an open source company. We do Cloud automation with infrastructure as code using open source tools such as Puppet, Terraform, Ansible, Packer and Podman. The majority of the code we write is available on GitHub and in various upstream open source projects. We provide our clients with high-quality, peer-reviewed code and help them avoid any form of vendor lock-in.

If you need help with your Cloud automation project do not hesitate to contact us!

When you version lock your project's Ansible Collections, all the parts fit together perfectly, not just so-and-so (https://www.pexels.com/photo/abstract-art-circle-clockwork-414579/)

What are Ansible Collections?

Ansible is an infrastructure as code tool used for configuration management, network device management, orchestration and other tasks. Ansible Collections are a way to distribute Ansible content such as roles, playbooks and modules. They can be downloaded from Ansible Galaxy, Git repositories or local directories. Basically, collections are a more modern packaging format compared to standalone roles, which they are in the process of replacing. Version-locking Ansible Collections allows you to run the exact same Ansible code every time.
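For example, a single collection can be installed ad hoc from Ansible Galaxy like this (community.general is just an example collection; for real projects the version-locked approach described below is a better fit):

ansible-galaxy collection install community.general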

Version locking Ansible Collections

When you use collections it makes a lot of sense to version-lock them, even though that is not really emphasized in the official documentation. If you don't version-lock dependencies such as collections or roles, you have no guarantee that your code will behave the same each and every time. Whoever bootstraps your Ansible environment could get completely different dependency versions than somebody else. This, in turn, will eventually cause problems when your own playbooks or roles are incompatible with the collections they depend on.

Version locking Ansible collections on the project level ensures that everybody uses the exact same Ansible code, including your local code and its dependencies. Version locking also allows controlling the changes going into project dependencies.

Configuring project-specific collections in Ansible

The first step is to create an Ansible configuration file, ansible.cfg, in the root of your project:

[defaults]
collections_paths = ./collections

This tells Ansible to look for collections in the collections subdirectory of the current directory. The next step is to create a collections/requirements.yml file which lists the dependencies your project needs. While requirements.yml can be placed anywhere, this location is the only one that is compatible with Ansible Tower. Here's a minimal example for version locking Ansible Collections:

---
collections:
- name: https://github.com/Puppet-Finland/ansible-collections-puppeteers-keycloak.git
  type: git
  version: 1.0.1

For details on the requirements.yml syntax please refer to the official documentation.

Note that if any of the collections you list has collection dependencies of their own, those will also be installed automatically. This means there's a fair chance of a "dependency hell" as your requirements.yml file grows and the requirements of different collections become impossible to satisfy at the same time.
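Pinning also works for collections that come straight from Ansible Galaxy rather than from Git. Here is a sketch with example collection names, showing both an exact pin and a version range:

---
collections:
- name: community.general
  version: "7.0.1"
- name: ansible.posix
  version: ">=1.4.0,<2.0.0"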

To install the collections run this command:

ansible-galaxy collection install -r collections/requirements.yml

You should also add ansible_collections to .gitignore, like this:

ansible_collections

That way you don't accidentally version your external dependencies.

More about Ansible quality assurance from Puppeteers

Red Hat Open Tour 2022 entry ID card

Automation use-cases in the Cloud

Johan Wennerberg, a Solution Architect for Red Hat Nordics in Stockholm, gave a presentation at Red Hat Open Tour 2022 Tallinn. In his presentation, titled "Gain robust repeatability as self-service, by automating your automation", he listed several automation use-cases in the Cloud. Each of these use-cases is made possible by automation stacks: technology stacks that produce business value through automation. In practice you need to build your automation stacks on top of infrastructure as code.

Below I will review Johan's automation use-cases in the Cloud one by one. The descriptions are my own as Johan did not go deep into details in his presentation.

Fully-automated provisioning

Fully-automated provisioning typically means building a system from scratch into a fully usable state. The system could be a bare-metal server or desktop, a virtual machine, container image or a dynamically created test environment in a Cloud. Provisioning does not include managing the system over its lifecycle.

Config management

With config management or configuration management your goal is to keep the desired state of a system and to control changes to it. The system could be a container, virtual machine, bare metal system or network device. Common configuration management tools include Puppet, Terraform and Ansible.

Containers are typically short-lived and immutable: this means that configuration management happens before provisioning. Longer-living systems, such as virtual machines or network devices, are typically different: you handle configuration management after provisioning.

Orchestration

Orchestration refers to the automation of workflows and processes which may include multiple systems. For example, you can orchestrate a complex software release process. Another example would be a complex security patching procedure that you need to do in a certain order. Well-known tools for orchestration include Ansible and Puppet Bolt.

Security compliance

Security compliance requires configuration management in practice. If you do not manage the configuration of a system you cannot easily understand its state. This makes it very difficult to enforce its desired state, which includes its security compliance.

Continuous delivery

Continuous delivery automates the release of software by removing manual software release steps. You can build continuous delivery with tools such as Buildbot, Jenkins or GitHub Actions. Generally these tools are useful for software companies that release software packages for others to use.

App deployment

I am not sure what Johan intended "app deployment" to mean in this context. Probably he meant continuous deployment, which involves deploying software automatically. You typically encounter continuous deployment in the SaaS context, where there is only one installation of the software.

You can use continuous delivery tools for continuous deployment as well. That said, there are dedicated continuous delivery tools for special use-cases. For example ArgoCD is used for continuous delivery in Kubernetes.

Ansible variable validation helps you avoid accidentally hammering square pegs into round holes. Photo by Julij Vanello Premru (https://medium.com/@julijVP/square-pegs-in-round-holes-6b3d1a3aa58a)

Overview of Ansible quality assurance

Ansible is an IT automation engine which you can use for configuration management, orchestration and device management, among other things. While you can get started fast with Ansible, ensuring high-quality, bug-free code can be challenging. Moreover, there's not that much official, high-quality or coherent documentation available on Ansible quality assurance best practices. Even low-hanging fruit like Ansible variable validation is not emphasized in the official documentation.

This is the first part of our "Ansible quality assurance" series of blog posts.

Why should you do Ansible variable validation?

Here in part 1 we cover validation of variables. In particular, we focus on variable validation in Ansible roles, although the same approach works anywhere. Variable validation helps avoid playbook failures and hard-to-debug runtime errors and side-effects caused by:

  1. Undefined variables
  2. Invalid variable values

Ansible variable validation with ansible.builtin.assert

Ansible does not have built-in data types. As a result, you need to construct assertions manually. This is in contrast to typed languages like Puppet, where data types are first-class citizens. The main tool you can use for Ansible variable validation is the ansible.builtin.assert module, which builds on top of Jinja2 tests and filters.

Fail as early as possible

You can minimize misconfigurations by failing as early as possible. Typically misconfigurations are caused by missing or wrongly timed variable validation, and they fall into the two categories listed above: undefined variables and invalid variable values.

For this reason you should not only fail on invalid variables, but fail as early as possible. The correct location for Ansible variable validation depends on your use-case, but in most cases you should not rely on validation done only at task execution time.

Additionally, you can avoid late failures by preferring imports over includes. When you import a role or a task file, it is added to your code statically. In other words, importing is about the same as if you had copied and pasted the imported role or task into your own code. A positive side-effect of an import is that you can validate variables before the Ansible run starts. In contrast, includes are evaluated at runtime when the playbook is already running, so missing or invalid variables may go unnoticed until they cause a failure.

A minimal sketch contrasting the two follows below. If you want to learn more about includes and imports, please refer to the official Ansible documentation.
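Both tasks below refer to the same hypothetical myrole used in the later examples:

# Imported role: added statically at parse time, so its asserts can be
# evaluated before the actual run starts
- name: import a role statically
  ansible.builtin.import_role:
    name: myrole

# Included role: resolved only at runtime, so problems in it surface when
# this task executes
- name: include a role dynamically
  ansible.builtin.include_role:
    name: myrole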

Minimal Ansible variable validation: is the variable defined?

The minimal check you should do for every variable is to check for its presence. We focus on roles, so here is an example of calling one:

- ansible.builtin.include_role:
    name: myrole
  vars:
    myrole_myvar: foobar

The myrole/tasks/main.yml file has the assert(s) at the top. At minimum you should check that all the variables the role needs have been set:

---
- name: validate parameters
  ansible.builtin.assert:
    that:
      - myrole_myvar is defined

Now if you forget to pass a value to the role, Ansible will error out immediately with a reasonable error message. This is definitely progress, but it is far from optimal. As you can see, you could still pass, for example, "foobar" instead of an IP address.

Validating that a variable is of a certain type

Validating that a variable is of a certain type is one step up from the "is the variable defined" check:

- name: validate parameters
  ansible.builtin.assert:
    that:
      - my_string is string
      - my_integer is number
      - my_float is float
      - my_boolean is boolean

Validating that a variable belongs to a predefined set

For string variables that take a limited set of values, regular expression matches are very useful:

- name: validate parameters
  ansible.builtin.assert:
    that:
      - myrole_state is match ('^(present|absent)$')

Regular expressions for complex string validation

Regular expressions also help you validate DNS names, for example like this:

- name: validate parameters
  ansible.builtin.assert:
    that:
      - myrole_dnsname is match ('^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)[A-Za-z]{2,6}')

As a regular expression grows more complex, the likelihood of it being buggy grows.

Validating numeric values

Checking numeric values like port numbers is easy:

- name: validate parameters
  ansible.builtin.assert:
    that:
      - myrole_port >= 1 and myrole_port <= 65535

Doing asserts on multiple variable values in one place

Doing multiple asserts in one place is trivial as well:

---
- name: validate parameters
  ansible.builtin.assert:
    that:
      - myrole_state is match ('^(present|absent)$')
      - myrole_dnsname is match ('^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)[A-Za-z]{2,6}')
      - myrole_port >= 1 and myrole_port <= 65535

Validating multiple variables in a loop

Sometimes you have a large number of variables that require the same set of complex validation rules. In that case a loop saves code and reduces repetition. Here is an example where we validate a list of domain names with a regular expression pattern:

- name: Create list of DNS domain parameters for validation
  ansible.builtin.set_fact:
    dns_domains:
      - foo.example.org
      - bar.example.org
      - baz.example.org
- name: Validate DNS domains
  ansible.builtin.assert:
    that: dns_domain is match ('^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)[A-Za-z]{2,6}')
  loop: "{{ dns_domains }}"
  loop_control:
    loop_var: dns_domain

More on Ansible quality assurance from Puppeteers

External resources

We participated in Red Hat Open Tour 2022 Tallinn a while back. Johan Wennerberg, who is a Solution Architect for Red Hat Nordics in Stockholm, gave a presentation titled "Gain robust repeatability as self-service, by automating your automation". Among other things he discussed the importance and use-cases of Cloud infrastructure standardization and automation. Here I will summarize his views on standardization and automation and explain how they tie in with our focus on quality in the Cloud infrastructure and elsewhere.

The presentation

Johan started his presentation with the key topic of standardization, which is the basis for gaining the advantages of automation. So, standardize the components you automate as well as the automation processes. In industrial automation everything is standardized and you won't see manual ad hoc steps there; the same principle should apply to IT automation as well.

Johan listed the benefits of standardization and automation:

  1. Efficient operation / maintenance
  2. Reproducibility
  3. Documentation
  4. Policy / Governance
  5. Audit / Reporting

Standardization includes:

  1. Workload
  2. Middleware / Application platform
  3. Backup and observability
  4. Operating system

All of the above produce predictability and are very spot on.

How does Cloud infrastructure standardization and automation tie in with quality?

Now I want to look at this from our viewpoint, which is quality in the Cloud. You can use version control systems along with infrastructure as code tools to improve the quality of your Cloud infrastructure. These tools can also help you automate your Cloud to a high degree. When using IaC you are implicitly standardizing your infrastructure, because automation in most cases creates standardization. If you deviate from the standards you actually do more work: you'd need to parameterize your code or, in the worst case, hard fork your codebase to support multiple scenarios. Both cases lead to maintenance troubles down the line.

That said, if you focus on Cloud infrastructure standardization and automation, you improve the quality of both your IaC code and your Cloud.

Standardization should be organization-wide to avoid each team ending up with different solutions to the same or similar challenges.

Importance of discipline in standardization

You need a lot of discipline not to cut corners when standardizing and automating your Cloud infrastructure. This is particularly true when operating in multiple environments. For example, you need to ensure that the stakeholders in those environments understand why standardizing makes sense. Moreover, you need to convince them that your way makes the most sense for them. You should emphasize that other options would be more costly in time and money and, more importantly, would produce lower-quality results.

You also need discipline to avoid copy-and-paste reuse of your IaC code. While copy-and-paste is often the quickest solution, it causes issues down the line: with copy-and-paste you're violating the DRY principle, so in practice you end up implementing fixes and features in multiple places. Instead, you should modularize code while maintaining a high degree of standardization.

Microsoft Azure provides a metrics and monitoring framework called Azure Monitor. With it you can monitor your Cloud infrastructure and the services running there. You can view graphs of the metrics, alert on thresholds and do all the usual stuff, just like in AWS CloudWatch.

Some Cloud resources, like Azure Functions, expose "a limited number of useful metrics" in Azure Monitor. This is a polite way of saying that the metrics are "completely useless" for most intents and purposes. For example, for an Azure Function that gets triggered a few times a week, on an irregular schedule, by real people, you don't really have any useful metrics in Azure Monitor that you could alert on if something is off.

Fortunately you can use Application Insights to get more useful metrics. Application Insights extends Azure Monitor, but is not particularly tightly coupled with it. This is clear from the fact that both in the Azure Portal and on the Azure command line they're completely separate things. Fortunately for us, both Azure Monitor and Application Insights are supported by azure-metrics-exporter, which allows getting both sets of metrics into Prometheus.

We are command-line guys, so here is how you can get a list of metrics available in Azure Monitor and Application Insights. First log in to Azure - you can skip the tenant part if you only have access to one:

$ az login --tenant example.onmicrosoft.com

Then we list the Azure Monitor metrics. Note that the resource-uri is the URI of the Azure Function itself:

$ az monitor metrics list-definitions --resource <resource-uri>

Then the Application Insights metrics. Note that the --app parameter expects the URI of a Function's Application Insights "component", not the Function's resource URI:

$ az monitor app-insights metrics get-metadata --app <app-insights-component-uri>

Prometheus integration will follow in a future blog post, stay tuned!

Red Hat Open Tour 2022 entry ID card

We participated in Red Hat Open Tour 2022 Tallinn a few weeks ago. Jaan Tanel Veikesaar from Elering, a gas/energy company in Estonia, gave a really nice presentation about their Ansible automation project. Ansible is a very common infrastructure as code and automation tool. Below I'll go over Jaan's presentation, adding some comments and key takeaways.

The starting point for the Ansible automation project

The starting point at Elering was fairly typical: to create a new VM, one had to go to vSphere, launch a VM and continue from there manually. In other words, automation did not really exist. This started to change when the business started to demand more from IT without adding any more human resources. This essentially made automation a necessity instead of a luxury. So, they did some research and testing, after which Elering ended up with Ansible. They made this choice in part because whenever they encountered an Ansible problem, they could easily find an answer on the Internet.

The lifecycle of the Ansible automation project

In the early stages of the Ansible automation project Elering focused on provisioning. This probably meant one-time creation of virtual machines with known-good configurations. Later they expanded automation to adding DNS entries for VMs, adding disks and other less critical tasks. The infrastructure currently managed by Ansible at Elering is fairly big, about 600 servers. With such a big infrastructure scaling is an issue with Ansible, as we discussed with Elering. At the moment they use Ansible automation primarily for provisioning. When they made big changes they targeted a subset of servers with VMware tags.

Currently Elering's team enforces that the SSH keys of the VMs are correct at all times. In other words, the scope of enforced configuration management is not yet very wide. For quality assurance Elering uses Visual Studio for Ansible syntax checking and linting, among other things.

Manual configuration results in lots of wasted time

As Elering moved forward in their automation project, they started realizing how much wasted effort manual configuration can create. Jaan gave one example of a manually created cluster that broke and took a week to debug and fix. The reason for the breakage was simple: a configuration mistake in one of the cluster members. With automation the team can track changes, including who broke what and when: "See, you made this change which broke this thing". I can personally add that with automation the time between making a mistake and realizing it is typically short: you notice the problem almost immediately after making it. In my experience this is not the case if you manage infrastructure manually.

People were the key to success

The Ansible automation project's success did not owe itself to just automation. The people were the key. Somebody had to take the lead, of course, but the real success grew from the Ansible/system administration team helping other people solve their problems with Ansible automation. That motivated more people to commit themselves to the project. This people factor is very important, because a project is likely to fail if it does not scratch an itch for each participant.

Learning curve and how to overcome it

Another people factor they encountered was the learning curve. People accustomed to Windows were not particularly keen on learning the Unix-y command line required to use plain Ansible. They solved this problem with Ansible Tower, which allowed the Windows people to just use a GUI to do what they needed. Especially at the early stages of a project, when the value of automation is not yet clear, demanding that people climb a steep learning curve is probably too much.

Key takeaways

The key takeaways are:

When you create a CloudFront distribution, AWS creates several DNS A records with the same name (e.g. d25gma2ea3ckma.cloudfront.net) which point to the IPs the distribution is using. Then, typically, you define CNAME(s) pointing to that cloudfront.net address in your own DNS. Each CloudFront distribution has a list of aliases, similar to Subject Alternative Names ("SAN") in SSL certificates. The aliases should match the CNAME(s) you've set in DNS.

Now, this is all good until you need to migrate those Cloudfront distributions to, say, a new AWS account. If you attempt to create a new distribution (here with Terraform) with the same alias you will get this error:

aws_cloudfront_distribution.example: Creating...
╷
│ Error: error creating CloudFront Distribution: CNAMEAlreadyExists: One
| or more aliases specified for the distribution includes an incorrectly
| configured DNS record that points to another CloudFront distribution.
| You must update the DNS record to correct the problem. For more
| information, see
| https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/CNAMEs.html#alternate-domain-names-restrictions
│       status code: 409, request id: 4eeaa2ee-b317-4c64-a221-e09244192add
│ 
│   with aws_cloudfront_distribution.example,
│   on cloudfront.tf line 7, in resource "aws_cloudfront_distribution"
|   "example":
│    7: resource "aws_cloudfront_distribution" "example" {

The error message is very clear. Naturally you assume it tells you the whole truth, so you inform everyone that the CloudFront distribution will be down briefly while DNS propagates to AWS DNS. You wait. And you wait some more. After many hours you lose hope, because everybody's DNS seems to have propagated, but AWS just keeps telling Terraform the same thing. Finally you give up, restore the old CNAME and go back to the drawing board.

The problem here turns out to be simple: the error message does not tell the whole truth. In fact, based on experimentation, it is not enough to remove the CNAME. It is also not enough to disable the old CloudFront distribution - you really need to delete it first. It may be enough to change the aliases in the old distribution so that they no longer overlap with the new distribution you're creating, but I did not test that - SSL might make the process more complicated than expected.
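If the old distribution lives in an account you still control, deleting it with the AWS CLI looks roughly like the sketch below (the distribution ID and ETag are placeholders, and the distribution must already be disabled and fully deployed before the delete succeeds):

# Note the "ETag" value in the output; it is required for deletion
$ aws cloudfront get-distribution-config --id E2EXAMPLE12345 > old-dist.json

# Delete the disabled distribution using the ETag from the previous command
$ aws cloudfront delete-distribution --id E2EXAMPLE12345 --if-match E3EXAMPLEETAG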

This article shows you how to enable Azure Backup on Linux VMs with Terraform. It is recommended to read the Understanding Azure Backup for Linux VMs article first. The Terraform AzureRM provider has three relevant resources: azurerm_recovery_services_vault, azurerm_backup_policy_vm and azurerm_backup_protected_vm.

I'll go through the basic process first, which may work for some people. I'll then take a deep dive into the timeout issues you may or may not encounter in your environment.

The first step is to create a recovery services vault and a backup policy. Here's a simple example:

resource "azurerm_resource_group" "myrg" {
  name     = "myrg"
  location = "northeurope"
}

resource "azurerm_recovery_services_vault" "default" {
  name                = "default"
  location            = azurerm_resource_group.myrg.location
  resource_group_name = azurerm_resource_group.myrg.name
  sku                 = "Standard"
  storage_mode_type   = "LocallyRedundant"
}

resource "azurerm_backup_policy_vm" "default" {
  name                           = "default"
  resource_group_name            = azurerm_resource_group.myrg.name
  recovery_vault_name            = azurerm_recovery_services_vault.default.name
  policy_type                    = "V2"

  # This parameter may not work, depending on things. In theory
  # policy_type V2 should allow setting this more freely than V1, but it
  # seems that in certain cases V1 policy is silently enabled even if you
  # set it to V2 in Terraform.
  #instant_restore_retention_days = 7

  timezone = "UTC"

  backup {
    frequency = "Daily"
    time      = "23:00"
  }

  retention_daily {
    count = 7
  }

  retention_weekly {
    count    = 4
    weekdays = ["Saturday"]
  }

  retention_monthly {
    count    = 6
    weekdays = ["Saturday"]
    weeks    = ["Last"]
  }
}

The next step is to enable backups for a VM, here assumed to be azurerm_linux_virtual_machine.testvm:

resource "azurerm_backup_protected_vm" "testvm" {
  resource_group_name = azurerm_resource_group.myrg.name
  recovery_vault_name = azurerm_recovery_services_vault.default.name
  source_vm_id        = azurerm_linux_virtual_machine.testvm.id
  backup_policy_id    = azurerm_backup_policy_vm.default.id
}

Now, in theory this should be enough plumbing and it may work for you. However, in my case deploying the azurerm_backup_protected_vm just seems to hang indefinitely:

azurerm_backup_protected_vm.testvm: Still creating... [1h19m50s elapsed]                                                                                     
╷                                                                                                                                                             
│ Error: waiting for the Azure Backup Protected VM "VM;iaasvmcontainerv2;myrg;testvm" to be true (Resource Group "myrg") to provision: context deadline exceeded

While enabling backups manually and taking the first backup in the Azure Portal is slow (~15 minutes), it is nowhere near this slow. This problem is not related to ordering, either: even if the VM is created before trying to enable backups on it the same thing happens.

One thing I tried was enabling the VMSnapshotLinux extension manually, so that it would be installed before trying to protect the VM. The first step was to figure out the extension publisher, name and version with Azure CLI:

$ az vm extension image list --location northeurope --output table

Name             Publisher                          Version
----             ---------                          -------
AcronisBackup    Acronis.Backup                     1.0.33
--- snip ---
VMSnapshotLinux  Microsoft.Azure.RecoveryServices   1.0.9188.0
--- snip ---

When used with the azurerm_virtual_machine_extension Terraform resource these fields translate as follows: Publisher maps to publisher, Name maps to type and Version maps to type_handler_version.

So, naively one would assume this Terraform code would install the VMSnapshotLinux extension to a VM, assuming that waagent.service is running:

resource "azurerm_virtual_machine_extension" "testvm_vmsnapshotlinux" {
  name                 = "testvm-vmsnapshotlinux"
  virtual_machine_id   = azurerm_linux_virtual_machine.testvm.id
  publisher            = "Microsoft.Azure.RecoveryServices"
  type                 = "VMSnapshotLinux"
  type_handler_version = "1.0.9188.0"
}

But of course this code did not work due to an esoteric and undocumented feature/bug:

the value of parameter typeHandlerVersion is invalid

Basically the version number in type_handler_version must be shortened from 1.0.9188.0 to 1.0:

type_handler_version = "1.0"

This allowed me to get to the next error:

Publisher 'Microsoft.Azure.RecoveryServices' and type VMSnapshotLinux is reserved for internal use.

This "Internal use" issue can be easily reproduced with Azure CLI:

$ az vm extension set --resource-group myrg --vm-name testvm --name VMSnapshotLinux --publisher Microsoft.Azure.RecoveryServices
(OperationNotAllowed) Publisher 'Microsoft.Azure.RecoveryServices' and type VMSnapshotLinux is reserved for internal use.
Code: OperationNotAllowed
Message: Publisher 'Microsoft.Azure.RecoveryServices' and type VMSnapshotLinux is reserved for internal use.

How nice, so this particular extension can't be explicitly installed. The next debugging step was to check the deployment JSON file in Azure Portal for a manual VM protection action. The deployments are stored under the resource group. And here is what Azure Portal deploys:

{
    "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "armProviderNamespace": {
            "type": "String"
        },
        "vaultName": {
            "type": "String"
        },
        "vaultRG": {
            "type": "String"
        },
        "vaultSubID": {
            "type": "String"
        },
        "policyName": {
            "type": "String"
        },
        "fabricName": {
            "type": "String"
        },
        "protectionContainers": {
            "type": "Array"
        },
        "protectedItems": {
            "type": "Array"
        },
        "sourceResourceIds": {
            "type": "Array"
        },
        "extendedProperties": {
            "type": "Array"
        }
    },
    "resources": [
        {
            "type": "Microsoft.RecoveryServices/vaults/backupFabrics/protectionContainers/protectedItems",
            "apiVersion": "2016-06-01",
            "name": "[concat(parameters('vaultName'), '/', parameters('fabricName'), '/',parameters('protectionContainers')[copyIndex()], '/', parameters('protectedItems')[copyIndex()])]",
            "properties": {
                "protectedItemType": "Microsoft.ClassicCompute/virtualMachines",
                "policyId": "[resourceId(concat(parameters('armProviderNamespace'), '/vaults/backupPolicies'), parameters('vaultName'), parameters('policyName'))]",
                "sourceResourceId": "[parameters('sourceResourceIds')[copyIndex()]]",
                "extendedProperties": "[parameters('extendedProperties')[copyIndex()]]"
            },
            "copy": {
                "name": "protectedItemsCopy",
                "count": "[length(parameters('protectedItems'))]"
            }
        }
    ]
}

It is noteworthy that there is only one resource in this deployment file. Also, the parameters in this file match those used by the azurerm_backup_protected_vm Terraform resource, as do the parameter values. So, Azure Portal does not seem to be doing any "under the hood" magic to enable the backups for a VM. Everything points to a bug in the Terraform AzureRM provider itself.

Looking at this bug report it seems that Terraform's AzureRM provider waits for the first backup to succeed and then considers the resource to be successfully created. Given that enabling the backup and triggering the first backup takes ~15 minutes in Azure Portal it looks as if Terraform does not trigger the first backup at all. Combined with a reasonable "once per day" backup schedule this means Terraform will fail every time unless timeout values are huge and the first backup actually kicks in during the Terraform run.
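If you want to experiment with the huge-timeout route anyway, the resource accepts a standard Terraform timeouts block; here is a sketch (whether the scheduled backup actually starts within the window is another matter entirely):

resource "azurerm_backup_protected_vm" "testvm" {
  resource_group_name = azurerm_resource_group.myrg.name
  recovery_vault_name = azurerm_recovery_services_vault.default.name
  source_vm_id        = azurerm_linux_virtual_machine.testvm.id
  backup_policy_id    = azurerm_backup_policy_vm.default.id

  # Assumption: a generous creation window so that the scheduled backup has a
  # chance to run during the Terraform apply
  timeouts {
    create = "26h"
  }
}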

Summa summarum: if you encounter these timeout issues it may be easiest to just create the backups manually in Azure Portal or with az cli:

az backup protection enable-for-vm --resource-group myrg --vault-name default --vm testvm --policy-name default

Then you can import the already present resource into Terraform and be done with it, while waiting for this bug to be fixed.
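A sketch of that flow, assuming the resource names used earlier (the protected item's full Azure resource ID must be substituted for the placeholder):

# Find the protected item's full resource ID
az backup item list --resource-group myrg --vault-name default --query "[].id" --output tsv

# Import the existing protected VM into the Terraform state
terraform import azurerm_backup_protected_vm.testvm <protected-item-resource-id>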

Azure Backup is an Azure service that allows, among other things, backing up Windows and Linux VMs in Azure. The backups are essentially virtual machine snapshots, but backing up and/or restoring individual files is also possible. This article tries to explain how Azure Backup and Linux VMs interact and what is required for them to work. If you're interested in using Terraform to configure Azure Backup, please have a look at the Enabling Azure Backup on Linux VMs with Terraform article.

Azure <-> VM integrations such as Azure Backup depend on an agent (an operating system service) running on the VM. In the case of Linux systemd distros (like Ubuntu 20.04, below) this agent is called walinuxagent.service:

● walinuxagent.service - Azure Linux Agent
     Loaded: loaded (/lib/systemd/system/walinuxagent.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/walinuxagent.service.d
             └─10-Slice.conf, 11-CPUAccounting.conf, 12-CPUQuota.conf
     Active: active (running) since Mon 2022-08-29 11:43:50 UTC; 37min ago
   Main PID: 507 (python3)
      Tasks: 7 (limit: 2285)
     Memory: 42.2M
        CPU: 3.466s
     CGroup: /azure.slice/walinuxagent.service
             ├─507 /usr/bin/python3 -u /usr/sbin/waagent -daemon
             └─557 python3 -u bin/WALinuxAgent-2.8.0.11-py2.7.egg -run-exthandlers

The walinuxagent is available as a package in the upstream repository, for example see the Ubuntu 20.04 variant here. The commonly-used base images in Azure Marketplace should have this package installed out of the box. The distro packages themselves are built from the Microsoft Azure Linux Agent source code.

Walinuxagent works together with the Azure Wireserver (168.63.129.16) to enable various Azure integrations such as DHCP, DNS and health status. Another use for the walinuxagent is installation of virtual machine extensions. Here the VMSnapshotLinux extension got installed as it was part of the "goal state" (=desired state) for walinuxagent when Azure Backup was enabled for the VM in Azure Portal:

Aug 29 11:53:44 testvm python3[557]: 2022-08-29T11:53:44.001323Z INFO ExtHandler Fetched a new incarnation for the WireServer goal state [incarnation 2]
--- snip ---
Aug 29 11:53:51 testvm python3[557]: 2022-08-29T11:53:51.307613Z INFO ExtHandler ExtHandler All extensions in the goal state have reached a terminal state: [('Microsoft.Azure.RecoveryServices.VMSnapshotLinux', 'success')]

If the virtual machine is down when backups are enabled then walinuxagent can't do its job. However, in all likelihood (untested) the Azure Wireserver will upload the new goal state to it when it starts up the next time and the VMSnapshotLinux extension gets installed correctly.

The aws_instance resource in Terraform can automatically create the default network interface for you. There are cases, however, when you notice that the default network interface is not enough anymore, and modifying it via the limited aws_instance parameters is not sufficient. In these cases you can convert the interface into an aws_network_interface resource, but the process is far from straightforward.

The first step is to create a Terraform resource for the network interface:

resource "aws_network_interface" "myinstance" {
  subnet_id          = aws_subnet.primary.id
  description        = "Network interface for myinstance"
  # Add any parameters you need, such as this:
  ipv6_prefix_count  = 2
}

Now you need to import the network interface into the Terraform state file. There's no need to "terraform apply", as the network interface is already present in AWS. For example:

terraform import aws_network_interface.myinstance eni-0123456789abcedf0

The next step is to add a network_interface block into aws_instance's Terraform code:

network_interface {
  delete_on_termination = false
  network_interface_id  = aws_network_interface.myinstance.id
  device_index          = 0
}  

At this point Terraform will be confused because the state file does not have such a network_interface block. So, get the current state file:

terraform state pull > temp.tfstate

Now add the network_interface block into the state file:

--- snip ---
            "monitoring": false,
            "network_interface": [
              {
                "delete_on_termination": false,
                "device_index": 0,
                "network_interface_id": "eni-0123456789abcedf0"
              }
            ],
            "outpost_arn": "",
            "password_data": "",
            "placement_group": "",
            "placement_partition_number": null,
            "primary_network_interface_id": "eni-0123456789abcedf0",
--- snip ---

Then modify the value of "serial" at the top of the state file, incrementing it by one, e.g. from 50 to 51:

  "serial": 51,

Now push the massaged state file to the remote state:

terraform state push temp.tfstate

Now when you run "terraform plan" you should not see any changes, or at most, changes that are non-destructive and/or to be expected. The delete_on_termination parameter's value might not match: if so, just change the real value in AWS network interface settings to match what Terraform expects.

When deploying with Terraform to Azure you may sometimes encounter errors such as this:

╷                                                                                                                                                             
│ Error: creating Automation Account (Subscription: "83fa140a-2bea-175a-395c-914cae2902ce"                                                                    
│ Resource Group Name: "production"                                             
│ Automation Account Name: "backup-automation"): automationaccount.AutomationAccountClient#CreateOrUpdate: Failure responding to request: StatusCode=409 -- Original Error: autorest/azure: Service returned an error. Status=409 Code="MissingSubscriptionRegistration" Message="The subscription is not registered to use namespace 'Microsoft.Automation'. See https://aka.ms/rps-not-found for how to register subscriptions." Details=[{"code":"MissingSubscriptionRegistration","message":"The subscription is not registered to use namespace 'Microsoft.Automation'. See https://aka.ms/rps-not-found for how to register subscriptions.","target":"Microsoft.Automation"}]

The problem is that in Azure you may need to register the resource provider for the service you intend to manage with Terraform. If you add resources from the Azure Portal this registration is handled automatically. In the above case the Azure Automation provider was missing.

To figure out which provider you need to register either go to Microsoft documentation here, or use az cli to find the correct one:

az provider list --output table

Then register the provider (here for Azure Automation):

az provider register --namespace Microsoft.Automation

It takes a while before the provider is registered, so you can monitor registration state with a command such as this:

az provider show -n Microsoft.Automation
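If you only care about the registration state itself, you can narrow the output with a JMESPath query, for example:

az provider show -n Microsoft.Automation --query registrationState --output tsv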
Keys are used for various purposes in Keycloak realms (https://www.pexels.com/it-it/foto/close-up-chiave-tubolare-hoppe-grigio-115642)

What are Keycloak realm keys?

Keycloak's authentication protocols make use of private and public keys for signing and encrypting, as described in the official documentation. These keys are realm-specific, and by default managed internally in Keycloak. So, when you create a realm using the Keycloak Admin API, kcadm.sh or manually using the Web UI, new keypair(s) get generated automatically. These are called "managed keys". However, you can also use custom realm keys in Keycloak, which is what this article is about. Using custom realm keys allows you, for example, to use a single keypair in multiple realms.

Internally Keycloak handles realm keys as "components" and stores them in the COMPONENT_CONFIG table in the database. That's why you won't see any "keypair" or "key" methods in the REST API, nor can you find the keys as properties of the realm. While kcadm.sh does have a "get keys" method, keys can only be added, deleted or modified as components.

The currently used key - managed or custom - is selected based on the priority of all active keys. So, in this context active does not automatically mean that a realm keypair is being used, just that it has not been explicitly made passive.

Creating custom realm keys

While managed keys are convenient, you can use custom realm keys in Keycloak. The official documentation on how to do this via kcadm and/or the API is missing, so this blog post aims to fill that void. Here we use kcadm.sh, but the process is very similar for raw API calls.

The first step is to create a keypair (borrowed from here):

$ openssl genpkey -algorithm RSA -out private_key.pem -pkeyopt rsa_keygen_bits:2048

The next step is to create a JSON payload, which should look like this:

{
  "name" : "rsa-shared",
  "providerId" : "rsa",
  "providerType" : "org.keycloak.keys.KeyProvider",
  "parentId" : "Test",
  "config" : {
    "privateKey": ["-----BEGIN PRIVATE KEY-----\n<private-key-body>\n-----END PRIVATE KEY-----"],
    "certificate" : [],
    "active" : [ "true" ],
    "priority" : [ "123" ],
    "enabled" : [ "true" ],
    "algorithm" : [ "RS256" ]
  }
}

A couple of notes regarding the payload:

To apply the payload use:

$ /opt/keycloak/bin/kcadm.sh create components -r test -f payload-add-key.json --no-config --server http://localhost:8080/auth --realm master --user admin --password changeme

Getting custom realm key's id

Keycloak uses unique, typically machine-generated identifiers for various resources: this also applies to custom realm keys in Keycloak. According to kcadm.sh documentation you need the key's "Provider ID" to modify or delete it. This is true, but also very confusing: when looking at the keys in JSON you see a property called "providerId", whose value can be "rsa", "rsa-generated" or such. Using that as part of the URL in Admin REST API calls will just fail. Instead, you should use the key's id as part of the URL and ignore its providerId property completely.

To get the ID for a key, list all realm keys and check the value of "id" for your custom realm key:

/opt/keycloak/bin/kcadm.sh get keys -r test --no-config --server http://localhost:8080/auth --realm master --user admin --password changeme

Armed with this information you can now update and remove your custom keys.

Modifying custom realm keys

You can modify custom realm keys in Keycloak with kcadm.sh. The first and easier option is to pass the changes as a value to the -s parameter:

/opt/keycloak/bin/kcadm.sh update components/<provider-id> -r test -s 'config.active=["false"]' --no-config --server http://localhost:8080/auth --realm Test --user admin --password changeme

Alternatively you can make the modification(s) using a JSON payload. In fact, when you're working directly with the Admin REST API that's your only option. Note that in that case the payload you use has to include the full representation of the key object. For example:

{ "id":"<provider-id>",
  "name":"rsa-shared",
  "providerId":"rsa",
  "providerType":"org.keycloak.keys.KeyProvider",
  "parentId":"Test",
  "config": {
    "publicKey" : ["-----BEGIN PUBLIC KEY-----\n<public-key-body>\n-----END PUBLIC KEY-----"],
    "privateKey": ["-----BEGIN PRIVATE KEY-----\n<private-key-body>\n-----END PRIVATE KEY-----"],
    "active":["true"],
    "priority":["101"],
    "enabled":["true"],
    "algorithm":["RS256"]
  }
}

Once you've crafted the payload you can update the key object:

/opt/keycloak/bin/kcadm.sh update components/<provider-id> -r test -f update.json --no-config --server http://localhost:8080/auth --realm Test --user admin --password changeme

Removing custom realm keys

To remove a custom realm key from Keycloak delete the component like this:

/opt/keycloak/bin/kcadm.sh delete components/<provider-id> -r test --no-config --server http://localhost:8080/auth --realm master --user admin --password changeme

Ansible module for managing Keycloak realm keys

If you use Ansible for managing Keycloak you can use Puppeteers' keycloak_realm_key module in the puppeteers.keycloak collection. Here's sample usage:

- name: Manage Keycloak realm key
  keycloak_realm_key:
    name: custom
    state: present
    parent_id: master
    provider_id: "rsa"
    auth_keycloak_url: "http://localhost:8080/auth"
    auth_username: keycloak
    auth_password: keycloak
    auth_realm: master
    config:
      private_key: "{{ private_key }}"
      enabled: true
      active: true
      priority: 120
      algorithm: "RS256"

Note that the private key should be a string with escaped linefeeds (literal '\n' sequences) instead of actual linefeed characters.
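One way to produce such a string in a playbook, shown here only as a rough sketch that assumes the key lives in a private_key.pem file readable by Ansible, is to read the file and escape the linefeeds with Jinja:

  vars:
    # Read the PEM and turn real linefeeds into literal \n sequences.
    # Single quotes keep YAML from interpreting the backslashes itself.
    private_key: '{{ lookup("file", "private_key.pem") | replace("\n", "\\n") }}'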

I was working with Keycloak realm private/public key automation and it was not immediately obvious where Keycloak stores the keys. Figuring it out was actually easy, and this method applies to any web application that uses MySQL/MariaDB, not just Keycloak.

Anyhow, on Ubuntu, you'd navigate to /var/lib/mysql/<name-of-database>. For example:

cd /var/lib/mysql/keycloak

Make sure that no changes have been made recently (within one minute):

$ find -mmin -1

Then go to the web application and change whatever you're trying to locate from the database. Here I changed the Keycloak RSA key priority. Then run the find command again:

$ find -mmin -1
./COMPONENT_CONFIG.ibd

And there it is, in the COMPONENT_CONFIG table in the keycloak database.
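If you want to double-check, you can peek into the table with the mysql client; a minimal sketch, assuming local root access to the database:

$ sudo mysql keycloak -e 'SELECT * FROM COMPONENT_CONFIG LIMIT 5;'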

Using find in this way is a very useful technique and I've been using it for years. But somehow I had never done it in this particular use-case, which is why I chose to document it now.

In AWS, EBS ("Elastic Block Store") is the underlying technology behind the (virtual) hard disks of your instances (virtual machines). You can take snapshots of those virtual hard disks and use those snapshots to, for example:

Here we'll focus on the last use-case: being able to create copies of virtual machines on another AWS account. The reason why I even bothered writing this blog post is that most of the articles on the Internet do not cover this use-case: they assume you're working within one AWS account and/or one region. The use-case covered here requires a few extra steps:

  1. In the origin AWS account take a snapshot of the virtual machine's EBS volume
  2. In the origin AWS account (if needed) copy the EBS snapshot to the region where it will be deployed in the other AWS account
  3. In the origin AWS account configure snapshot permissions to grant access to the target AWS account
  4. In the target AWS account create a snapshot from the snapshot that was shared with you (creating an AMI directly from a snapshot shared with you does not work)
  5. In the target AWS account create an AMI from the snapshot that was created from the EBS volume
  6. In the target AWS account launch a new instance (virtual machine) from the AMI you just created

Why the process needs this extra step (snapshot -> snapshot) I do not know. Possibly it has something to do with how the snapshot permissions/sharing works.
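For reference, here's a rough sketch of the same steps with the AWS CLI; the IDs, regions and the target account ID are placeholders, not values from a real environment:

# Origin account: snapshot the volume and copy the snapshot to the target region
$ aws ec2 create-snapshot --volume-id <volume-id> --description "testvm root disk"
$ aws ec2 copy-snapshot --source-region <origin-region> --source-snapshot-id <snapshot-id> --region <target-region>

# Origin account: share the copied snapshot with the target AWS account
$ aws ec2 modify-snapshot-attribute --region <target-region> --snapshot-id <copied-snapshot-id> --attribute createVolumePermission --operation-type add --user-ids <target-account-id>

# Target account: copy the shared snapshot into your own account, then create an AMI from it
$ aws ec2 copy-snapshot --source-region <target-region> --source-snapshot-id <copied-snapshot-id> --region <target-region>
$ aws ec2 register-image --name testvm-copy --root-device-name /dev/xvda --virtualization-type hvm --block-device-mappings "DeviceName=/dev/xvda,Ebs={SnapshotId=<own-snapshot-id>}"

# Target account: launch a new instance from the AMI
$ aws ec2 run-instances --image-id <ami-id> --instance-type t3.micro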

Microsoft Azure has a nice service for scheduling tasks called Azure Automation. While Azure Automation is able to do other things as well, such as acting as a Powershell DSC pull server, we'll focus on the runbooks and scheduling. Runbooks are scripts that do things, e.g. run maintenance and reporting tasks. Runbooks often, but not necessarily, manipulate objects in Azure. Runbooks are serverless like Azure Functions and the pricing is therefore similar - you pay for the amount of computing resources you use and don't have to pay for a server that's mostly idle.

On the surface Azure Automation, runbooks and schedules seem deceptively simple, and getting something running in the Azure Portal is actually relatively easy. However, several Azure technologies are involved in making the pieces of the puzzle come together:

When starting with Azure Automation I recommend doing it manually first. Once you're able to make it work manually, you can way more easily codify your work. That said, here comes the sample Terraform code to set up Azure Automation to start and stop VMs on schedule - something that seems to be a very common use-case. First we need some plumbing:

data "azurerm_subscription" "primary" {
}

resource "azurerm_resource_group" "development" {
  name = "${var.resource_prefix}-rg"
  location = "northeurope"
}

Then we create the user-managed identity which the Azure Automation account will use and assign it a custom role that has the permissions to start and stop VMs:

resource "azurerm_role_definition" "stop_start_vm" {
  name        = "StopStartVM"
  scope       = data.azurerm_subscription.primary.id
  description = "Allow stopping and starting VMs in the primary subscription"

  permissions {
    actions     = ["Microsoft.Network/*/read",
                   "Microsoft.Compute/*/read",
                   "Microsoft.Compute/virtualMachines/start/action",
                   "Microsoft.Compute/virtualMachines/restart/action",
                   "Microsoft.Compute/virtualMachines/deallocate/action"]
    not_actions = []
  }
}

resource "azurerm_user_assigned_identity" "development_automation" {
  resource_group_name = azurerm_resource_group.development.name
  location            = azurerm_resource_group.development.location

  name = "development-automation"
}

resource "azurerm_role_assignment" "development_automation" {
  scope              = data.azurerm_subscription.primary.id
  role_definition_id = azurerm_role_definition.stop_start_vm.role_definition_resource_id
  principal_id       = azurerm_user_assigned_identity.development_automation.principal_id
}

With the user-managed identity in place we can create the Azure Automation account:

resource "azurerm_automation_account" "development" {
  name                = "development"
  location            = azurerm_resource_group.development.location
  resource_group_name = azurerm_resource_group.development.name
  sku_name            = "Basic"

  identity {
    type = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.development_automation.id]
  }
}

As you can see, the Azure Automation account is linked with the user-managed identity in the identity block.

Now the Simple-Azure-VM-Start-Stop.ps1 runbook can be added:

data "local_file" "simple_azure_vm_start_stop" {
  filename = "${path.module}/scripts/SimpleAzureVMStartStop.ps1"
}

resource "azurerm_automation_runbook" "simple_azure_vm_start_stop" {
  name                    = "Simple-Azure-VM-Start-Stop"
  location                = azurerm_resource_group.development.location
  resource_group_name     = azurerm_resource_group.development.name
  automation_account_name = azurerm_automation_account.development.name
  log_verbose             = "true"
  log_progress            = "true"
  description             = "Start or stop virtual machines"
  runbook_type            = "PowerShell"
  content                 = data.local_file.simple_azure_vm_start_stop.content
}

The runbook is from here, but small modifications were made to make it work with user-managed identities. In particular, the Azure connection part was changed from this simplistic version:

try {
    $null = Connect-AzAccount -Identity
}
catch {
    --- snip ---
}

to a more complex version:

param(
    --- snip ---
    [Parameter(Mandatory = $true)]
    $AccountId,
    --- snip ---
)

--- snip ---

try {
    # Ensures you do not inherit an AzContext in your runbook
    Disable-AzContextAutosave -Scope Process

    # Connect to Azure with user-assigned managed identity
    $AzureContext = (Connect-AzAccount -Identity -AccountId $AccountId).context

    # set and store context
    $AzureContext = Set-AzContext -SubscriptionName $AzureContext.Subscription -DefaultProfile $AzureContext
}
catch {
    --- snip ---
}

The -AccountId parameter somewhat confusingly expects to get the Client ID of the managed identity.

Now, with the runbook in place we can create the schedules:

resource "azurerm_automation_schedule" "nightly_vm_backup_start" {
  name                    = "nightly-vm-backup-start"
  resource_group_name     = azurerm_resource_group.development.name
  automation_account_name = azurerm_automation_account.development.name
  frequency               = "Day"
  interval                = 1
  timezone                = "Etc/UTC"
  start_time              = "2022-08-11T01:00:00+00:00"
  description             = "Start VMs every night for backups"
}

resource "azurerm_automation_schedule" "nightly_vm_backup_stop" {
  name                    = "nightly-vm-backup-stop"
  resource_group_name     = azurerm_resource_group.development.name
  automation_account_name = azurerm_automation_account.development.name
  frequency               = "Day"
  interval                = 1
  timezone                = "Etc/UTC"
  start_time              = "2022-08-11T01:30:00+00:00"
  description             = "Stop VMs every night after backups"
}

The final step is to link the schedules with the runbook using job schedules. Note that the parameters to the runbook are passed here. Also note that the keys (parameter names) have to be lowercase even if they're in mixed case in the Powershell code (e.g. AccountId -> accountid):

resource "azurerm_automation_job_schedule" "nightly_vm_backup_start" {
  resource_group_name     = azurerm_resource_group.development.name
  automation_account_name = azurerm_automation_account.development.name
  schedule_name           = azurerm_automation_schedule.nightly_vm_backup_start.name
  runbook_name            = azurerm_automation_runbook.simple_azure_vm_start_stop.name

  parameters = {
    resourcegroupname = azurerm_resource_group.development.name
    accountid         = azurerm_user_assigned_identity.development_automation.client_id
    vmname            = "testvm"
    action            = "start"
  }
}

resource "azurerm_automation_job_schedule" "nightly_vm_backup_stop" {
  resource_group_name     = azurerm_resource_group.development.name
  automation_account_name = azurerm_automation_account.development.name
  schedule_name           = azurerm_automation_schedule.nightly_vm_backup_stop.name
  runbook_name            = azurerm_automation_runbook.simple_azure_vm_start_stop.name

  parameters = {
    resourcegroupname = azurerm_resource_group.development.name
    accountid         = azurerm_user_assigned_identity.development_automation.client_id
    vmname            = "testvm"
    action            = "stop"
  }
}

With this code you should be able to schedule the startup and shutdown of a VM called "testvm" successfully. If that is not the case, go to Azure portal -> Automation Accounts -> Development -> Runbooks -> Simple-Azure-VM-Start-Stop, edit the runbook and use the "test pane" to debug what is going on. You can get script input, output, errors and all that good stuff from there, and you can trigger the script with various parameters for testing purposes.

This code is also available as a generalized module in GitHub.

Puppet Development Kit is probably the best thing since sliced bread if you work a lot with Puppet. It makes adding basic validation and unit tests trivial with help from rspec-puppet. It also makes it very easy to build module packages for the Puppet Forge.

That said, there is a minor annoyance with it: whenever you run "pdk update" to update your module to the latest PDK template, all your local changes to files such as .gitignore get destroyed, because PDK creates those files from PDK templates and any local changes are mercilessly wiped. Fortunately there is a way to sync your local changes to the files generated from the templates with .sync.yml. While .sync.yml is pretty well documented, the documentation is lots and lots of words with very few examples. So, here's a fairly trivial .sync.yml example that ensures local changes to .pdkignore and .gitignore persist across PDK updates:

---
.gitignore:
  required:
    - '*.log'
.pdkignore:
  required:
    - '*.log'

In PDK template speak .gitignore and .pdkignore are namespaces. For a list of namespaces see the official documentation.
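After editing .sync.yml, re-applying the templates picks up your changes; pdk update also has a --noop mode for previewing what would happen:

$ pdk update --noop
$ pdk update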

Typically Linux nodes are joined to FreeIPA using admin credentials. While this works, it exposes fully privileged credentials unnecessarily, for example when used within a configuration management system (see for example puppet-ipa).

Fortunately joining nodes to FreeIPA is possible with more limited privileges. The first step is to create a new FreeIPA role, e.g. "Enrollment administrator" with three privileges:

Then you create a new user, e.g. "enrollment", and join it to the "Enrollment administrator" role. After that you should be able to join nodes using that "enrollment" user.
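On the command line the setup could look roughly like the sketch below; "Host Enrollment" is shown only as one example privilege, since the exact set of three privileges depends on what your enrollment workflow needs:

$ ipa role-add "Enrollment administrator" --desc "Can enroll new hosts"
$ ipa role-add-privilege "Enrollment administrator" --privileges="Host Enrollment"
$ ipa user-add enrollment --first=Enrollment --last=User --password
$ ipa role-add-member "Enrollment administrator" --users=enrollment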

While this is not perfect security-wise, it is still better than having to expose the admin credentials just to join nodes to FreeIPA.

Cloud-Init is "a standard for customizing" cloud instances, typically on their first boot. It allows mixing state-based configuration management with imperative provisioning commands (details in our IaC article). By using cloud-init most of the annoyances of SSH-based provisioning can be avoided:

That said, neither cloud-init itself, nor its use within Terraform is particularly well documented. Therefore it can be an effort to create cloud-init-based provisioning that works and adapts easily to different use-cases. This article attempts to fill that gap to some extent.

In this particular case I had to convert an existing, imperative SSH-based Puppet agent provisioning process to cloud-init, so there's very little state-based configuration management in all of this. What I ended up with was a three-phase approach:

  1. Put the provisioning scripts on the host
  2. Run all the provisioning scripts that are required for any particular use-case
  3. Remove the provisioning scripts from the host

The first step includes creating a cloud-init yaml config, write-scripts.cfg, that has all the provisioning scripts embedded into it:

#cloud-config
write_files:
  - path: /var/cache/set-hostname.sh
    owner: root:root
    permissions: '0755'
    content: |
      #!/bin/sh
      #
      # Script body start
      --- snip ---
      # Script body end
  - path: /var/cache/add-puppetmaster-to-etc-hosts.sh
    owner: root:root
    permissions: '0755'
    content: |
      #!/bin/sh
      #
      # Script body start
      --- snip ---
      # Script body end
  - path: /var/cache/add-deployment-fact.sh
    owner: root:root
    permissions: '0755'
    content: |
      #!/bin/sh
      #
      # Script body start
      --- snip ---
      # Script body end
  - path: /var/cache/install-puppet.sh
    owner: root:root
    permissions: '0755'
    content: |
      #!/bin/sh
      #
      # Script body start
      --- snip ---
      # Script body end

The key with these scripts is that they are not Terraform templates. Instead, they're static files that take parameters to adapt their behavior, including doing nothing if the user so desires. The main reason for making this file static instead of a template is that it prevents Terraform variable interpolation from getting confused about POSIX shell variables written in the ${} syntax.
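As an illustration, a parameterized script of this kind could look roughly like the sketch below (a stand-in for set-hostname.sh, not the original):

#!/bin/sh
#
# Set the hostname given as the first argument; do nothing if none was given.
[ -z "$1" ] && exit 0
hostnamectl set-hostname "$1"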

The cloud-init part is just thin wrapping to allow "uploading" the scripts to the host. In Terraform we load the above file using a local_file datasource:

# cloud-init config that installs the provisioning scripts
data "local_file" "write_scripts" {
  filename = "${path.module}/write-scripts.cfg"
}

This alone does not do anything; it just makes the file contents available for use in Terraform.

The next step is to create the cloud-init config, run-scripts.cfg.tftpl, that actually runs the scripts and does cleanup after the scripts have run. As the name implies, it is a Terraform template:

#cloud-config
runcmd:
  - [ "/var/cache/set-hostname.sh", "${hostname}" ]
%{ if install_puppet_agent ~}
  - [ "/var/cache/add-puppetmaster-to-etc-hosts.sh", "${puppetmaster_ip}" ]
  - [ "/var/cache/add-deployment-fact.sh", "${deployment}" ]
  - [ "/var/cache/install-puppet.sh", "-n", "${hostname}", "-e", "${puppet_env}", "-p", "${puppet_version}", "-s"]
%{endif ~}
  - [ "rm", "-f", "/var/cache/set-hostname.sh", "/var/cache/add-puppetmaster-to-etc-hosts.sh", "/var/cache/add-deployment-fact.sh", "/var/cache/install-puppet.sh" ]

Note the ~ after the statements: it ensures that a linefeed is not added to the resulting cloud-init configuration file.

By making this file a template we can drive the provisioning logic using "advanced" constructs like real for-loops and if statements which Terraform (or rather, HCL2) itself lacks. Templating also allows making all provisioning steps conditional - something that's very difficult to accomplish with SSH-based provisioning (see my earlier blog post).

The matching Terraform datasource looks like this:

data "template_file" "run_scripts" {
  template = file("${path.module}/run-scripts.cfg.tftpl")
  vars     = {
               hostname             = var.hostname,
               deployment           = var.deployment,
               install_puppet_agent = var.install_puppet_agent,
               puppet_env           = local.puppet_env,
               puppet_version       = var.puppet_version,
               puppetmaster_ip      = var.puppetmaster_ip,
             }
}

As can be seen the template does not magically know the values that are already available in Terraform code - instead, they need to be passed to the template explicitly as a map.
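As a side note, the template_file data source comes from the separate template provider, which is deprecated. If you want to avoid it, the built-in templatefile() function does the same job, roughly like this (you'd then reference local.run_scripts instead of the data source):

locals {
  run_scripts = templatefile("${path.module}/run-scripts.cfg.tftpl", {
    hostname             = var.hostname
    deployment           = var.deployment
    install_puppet_agent = var.install_puppet_agent
    puppet_env           = local.puppet_env
    puppet_version       = var.puppet_version
    puppetmaster_ip      = var.puppetmaster_ip
  })
}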

The next step is to bind the two cloud-init configs into a single, multi-part cloud-init configuration using the cloudinit_config datasource:

data "cloudinit_config" "provision" {
  gzip          = true
  base64_encode = true

  part {
    content_type = "text/cloud-config"
    content      = data.local_file.write_scripts.content
  }

  part {
    content_type = "text/cloud-config"
    content      = data.template_file.run_scripts.rendered
  }
}

The above code shows one of the strengths of cloud-init: you can do provisioning using a combination of shell commands, scripts and cloud-init configurations by setting the content_type appropriately for each part. See cloud-init documentation for more details.

Finally we can pass the rendered cloud-init configuration to the VM resource that will consume it:

resource "aws_instance" "ec2_instance" {
  --- snip ---
  user_data = data.cloudinit_config.provision.rendered
}

You may also want to ensure that changes to the provisioning scripts do not trigger an instance rebuild:

  lifecycle {
    ignore_changes = [
      user_data,
    ]
  }

When developing cloud-init templates it can be useful to validate their contents:

$ cloud-init devel schema --config-file <config-file>

This will catch all the easy errors quickly. According to some sources this command is (or was) nothing but a glorified yaml linter, but still, it is easily available on Linux so worth using.

If provisioning scripts are not working as expected, the cloud-init logs on the instance, /var/log/cloud-init.log and /var/log/cloud-init-output.log, may reveal why.
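For example, on the instance you can check the overall status and then dig into the logs:

$ cloud-init status --long
$ sudo tail -n 50 /var/log/cloud-init-output.log
$ sudo less /var/log/cloud-init.log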


