Puppeteers Blog

Fixing ansible-playbook hangs caused by SSH timeouts

November 29, 2024 

Introduction

Ansible is an infrastructure as code tool that uses SSH as its transport mechanism. When the targeted nodes are close, latency-wise, to the Ansible controller, things usually work fine with the default settings. However, when the targets are far away or latency is high, you may notice that ansible-playbook hangs on some long-running task and never recovers. When you rerun the playbook it may proceed further and then hang at some other task. Or the same task. This is incredibly frustrating, but fortunately also fairly easy to fix.

Improving Ansible reliability with custom SSH options

A common solution to ansible-playbook hangs is to pass SSH options to ansible-playbook. All descriptions below are from the ssh_config man page. Three SSH settings can prevent the SSH server from killing the connection when an Ansible task runs for a long time and "nothing happens" from SSH's perspective:

  • ServerAliveInterval: "Sets a timeout interval in seconds after which if no data has been received from the server, ssh will send a message through the encrypted channel to request a response from the server. The default is 0, indicating that these messages will not be sent to the server... "
  • ServerAliveCountMax: "Sets the number of server alive messages (see below) which may be sent without ssh receiving any messages back from the server... The default value is 3."
  • TCPKeepAlive: "Specifies whether the system should send TCP keepalive messages to the other side... The default is yes (to send TCP keepalive messages)..."

The first two are protocol-level keepalive settings and the last one is a network connection level setting. Setting these to reasonable values (see below) should solve most SSH-related hangs in Ansible.
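To get a feel for how the first two settings interact, here is a minimal sketch in Python. It relies on the ssh_config man page's own example: with ServerAliveInterval=15 and the default ServerAliveCountMax, ssh disconnects from an unresponsive server after approximately 45 seconds. The function name is made up for illustration.

```python
def seconds_until_disconnect(server_alive_interval: int,
                             server_alive_count_max: int = 3) -> int:
    """Approximate time before ssh gives up on an unresponsive server.

    ssh sends one server alive message per interval and disconnects once
    server_alive_count_max of them have gone unanswered.
    """
    if server_alive_interval <= 0:
        # ServerAliveInterval=0 (the default) means no messages are sent at all.
        raise ValueError("server alive messages are disabled when the interval is 0")
    return server_alive_interval * server_alive_count_max


# The values used later in this article: 30 seconds, 3 attempts.
print(seconds_until_disconnect(30, 3))
```

In other words, the settings do two things: the regular messages keep the connection from looking idle, and a truly dead connection is detected in roughly interval times count seconds instead of hanging forever.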

Topping off with some performance improvements

You can improve reliability somewhat by using SSH connection multiplexing:

  • ControlMaster: "Enables the sharing of multiple sessions over a single network connection". Defaults to "no".
  • ControlPersist: "When used in conjunction with ControlMaster, specifies that the master connection should remain open in the background (waiting for future client connections) after the initial client connection has been closed". Defaults to "no".

While these two are not critical reliability-wise, they do speed up ansible-playbook runs. For more performance-enhancing ideas see this article.

What settings are reasonable, then?

That's a good question. Nobody knows the answer, me included. However, based on practical tests over fairly high-latency links (Finland -> Northern Virginia), the following settings in ansible.cfg allow long ansible-playbook runs to finish reliably:

[ssh_connection]
ssh_args = -o ServerAliveInterval=30 -o ServerAliveCountMax=3 -o TCPKeepAlive=yes -o ControlMaster=auto -o ControlPersist=60s

You can also pass the same settings on the command line, should you need to. This can be useful when Ansible is integrated into other tools such as Packer or custom scripts:

ansible-playbook --ssh-extra-args '-o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=30 -o ServerAliveCountMax=3 -o TCPKeepAlive=yes' playbook.yml

You can also pass the settings in the ANSIBLE_SSH_EXTRA_ARGS environment variable.
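If you launch ansible-playbook from a custom script, the environment variable route can be sketched like this in Python. The option values are the ones from this article; the helper names (build_ssh_extra_args, ansible_environment) are hypothetical, and the actual ansible-playbook call is left as a comment so the snippet runs without Ansible installed.

```python
import os

# Same options as in the ansible.cfg example above.
SSH_OPTIONS = {
    "ControlMaster": "auto",
    "ControlPersist": "60s",
    "ServerAliveInterval": "30",
    "ServerAliveCountMax": "3",
    "TCPKeepAlive": "yes",
}


def build_ssh_extra_args(options: dict) -> str:
    """Render the options as a single '-o Key=Value -o Key=Value' string."""
    return " ".join(f"-o {key}={value}" for key, value in options.items())


def ansible_environment(options: dict) -> dict:
    """Copy the current environment and add ANSIBLE_SSH_EXTRA_ARGS to it."""
    env = dict(os.environ)
    env["ANSIBLE_SSH_EXTRA_ARGS"] = build_ssh_extra_args(options)
    return env


print(build_ssh_extra_args(SSH_OPTIONS))

# To run a playbook with these settings you could then do, for example:
#   import subprocess
#   subprocess.run(["ansible-playbook", "playbook.yml"],
#                  env=ansible_environment(SSH_OPTIONS), check=True)
```

Because subprocess passes the environment directly, there is no shell quoting to get wrong here, which is the main attraction of this approach.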

The documentation for the ansible.builtin.ssh connection plugin describes ssh_args, ssh_extra_args and the matching environment variables in detail.

Addendum: passing SSH options to Packer's Ansible provisioner

Packer has a pretty decent Ansible provisioner. As with plain ansible-playbook it works fine until latency grows too much and Ansible tasks get stuck randomly. The official Packer plugin documentation tells us this:

ansible_ssh_extra_args ([]string) - Specifies --ssh-extra-args on command line defaults to -o IdentitiesOnly=yes

https://developer.hashicorp.com/packer/integrations/hashicorp/ansible/latest/components/provisioner/ansible

This is all well and good, but it does not really tell us how the value should be formatted.

Tales of things that did not work

Intuitively, one might start by feeding the options to an HCL2-format Packerfile like this:

ansible_ssh_extra_args = ["-o", "ControlMaster=auto", "-o", "ControlPersist=60s", "-o", "IdentitiesOnly=yes", "-o", "ServerAliveInterval=30", "-o", "TCPKeepAlive=yes"]

This fails, for reasons that are not obvious. A second attempt might be this:

ansible_ssh_extra_args = ["-o ControlMaster=auto", "-o ControlPersist=60s", "-o IdentitiesOnly=yes", "-o ServerAliveInterval=30", "-o TCPKeepAlive=yes"]

No go, either. Does it maybe want all the options as a single string inside the array?

ansible_ssh_extra_args = ["-o ControlMaster=auto -o ControlPersist=60s -o IdentitiesOnly=yes -o ServerAliveInterval=30 -o TCPKeepAlive=yes"]

--- snip ---

 ec2.amazon-ebs.build: fatal: [default]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: command-line line 0: keyword controlmaster extra arguments at end of line", "unreachable": true}

What about single-quoting? That works with plain ansible-playbook:

ansible_ssh_extra_args = ['-o ControlMaster=auto -o ControlPersist=60s -o IdentitiesOnly=yes -o ServerAliveInterval=30 -o TCPKeepAlive=yes']

--- snip ---

Error: Invalid expression

  on packerfile.pkr.hcl line 162, in build:
 162:     ansible_ssh_extra_args = ['-o ControlMaster=auto -o ControlPersist=60s -o IdentitiesOnly=yes -o ServerAliveInterval=30 -o TCPKeepAlive=yes']

Expected the start of an expression, but found an invalid expression token.

Ok, fine, I was clearly completely wrong.

Passing ssh_extra_args to ansible-playbook from Packer correctly

The correct answer is to put the single quotes inside double quotes:

ansible_ssh_extra_args = ["'-o ControlMaster=auto -o ControlPersist=60s -o IdentitiesOnly=yes -o ServerAliveInterval=30 -o TCPKeepAlive=yes'"]

--- snip ---

==> ec2.amazon-ebs.build: Executing Ansible: ansible-playbook -e packer_build_name="build" -e packer_builder_type=amazon-ebs --ssh-extra-args ''-o ControlMaster=auto -o ControlPersist=60s -o IdentitiesOnly=yes -o ServerAliveInterval=30 -o TCPKeepAlive=yes'' --extra-vars enable_marketplace_cleanup=true -e ansible_ssh_private_key_file=/tmp/ansible-key3890725170 -i /tmp/packer-provisioner-ansible874139248 playbook.yml

Note how Packer actually lies in its log output and claims that it wraps the --ssh-extra-args value in doubled single quotes. Pasted into a shell as-is, that is a syntax error for plain ansible-playbook, which just prints its help text.
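One plausible explanation for why the extra quotes help, which I have not verified against Packer's or Ansible's source code: Packer adds its own layer of literal single quotes around the value (as the logged command suggests), and the value is later split into separate ssh arguments using shell-like rules. The sketch below uses Python's shlex module to imitate that splitting. The function name is made up, and the option list is shortened for readability.

```python
import shlex

OPTIONS = "-o ControlMaster=auto -o ControlPersist=60s"


def tokens_after_splitting(value: str) -> list:
    """Split a value the way a POSIX shell would split a command line."""
    return shlex.split(value)


# Double-quoted HCL string: assumed to arrive wrapped in ONE layer of quotes.
one_layer = tokens_after_splitting(f"'{OPTIONS}'")

# Single quotes inside double quotes: arrives wrapped in TWO layers of quotes.
two_layers = tokens_after_splitting(f"''{OPTIONS}''")

print(one_layer)   # a single token that still contains spaces
print(two_layers)  # four separate tokens
```

With one layer of quotes everything stays glued together as a single argument, which would match the "keyword controlmaster extra arguments at end of line" error from ssh. With two layers, the empty quote pairs at both ends vanish and each option becomes its own argument.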

You might be able to pass Ansible SSH settings as environment variables (see here) or as extra_args instead (see here). I have not personally tested either of these approaches, but they look reasonable.

Samuli Seppänen