AWS EC2 instances are subject to two types of status checks (AWS docs):
- System status check (issues with the underlying hardware/networking: "the AWS side")
- Instance status check (issues with the OS, e.g. OOM, file system corruption, broken networking, etc.: "our side")
The official AWS EC2 instance recovery documentation claims that automatic recovery from an EC2 instance failure is only possible for system status checks, but that documentation is outdated: CloudWatch metrics are available for both types of check failures, and people have set up recovery for both types using the AWS CLI.
It seemed fairly straightforward to configure auto-recovery until I hit this error:
Error: Creating metric alarm failed: ValidationError: The EC2 'Recover' Action is not valid for the associated instance. Please remove or change to a different EC2 action.
status code: 400, request id: f1ef242a-be24-45b7-a971-5430291a3081
It turns out that setting up an EC2 "Recover" action for an EC2 instance that has ephemeral (instance store) volumes attached to it does not work, hence the error. I can only guess why this makes any difference.
The first step to get the alarms working is to check which of your EC2 instances actually have ephemeral volumes attached to them. One option is to run a command on the EC2 instance itself: either lsblk or nvme list. The latter requires installing nvme-cli or an equivalent package, but clearly shows which volumes are instance store volumes and which are EBS volumes.
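As a rough sketch of the lsblk route, the script below classifies devices from the JSON output of `lsblk --json -d -o NAME,MODEL`. It assumes an instance type where volumes show up as NVMe devices, and it assumes the model strings are "Amazon EC2 NVMe Instance Storage" for instance store and "Amazon Elastic Block Store" for EBS. The function name and the sample data are made up for illustration, so check the strings against your own instances before relying on them.

```python
import json

# Model strings assumed to be reported by NVMe devices on EC2 (an assumption,
# verify on your own instances).
INSTANCE_STORE_MODEL = "Amazon EC2 NVMe Instance Storage"
EBS_MODEL = "Amazon Elastic Block Store"


def classify_devices(lsblk_json: str) -> dict:
    """Map device name -> 'instance-store', 'ebs' or 'unknown'."""
    result = {}
    for dev in json.loads(lsblk_json).get("blockdevices", []):
        # lsblk reports null for devices without a model (e.g. loop devices).
        model = (dev.get("model") or "").strip()
        if model == INSTANCE_STORE_MODEL:
            kind = "instance-store"
        elif model == EBS_MODEL:
            kind = "ebs"
        else:
            kind = "unknown"
        result[dev["name"]] = kind
    return result


# Hypothetical output of: lsblk --json -d -o NAME,MODEL
SAMPLE = """
{"blockdevices": [
  {"name": "nvme0n1", "model": "Amazon Elastic Block Store"},
  {"name": "nvme1n1", "model": "Amazon EC2 NVMe Instance Storage"},
  {"name": "loop0", "model": null}
]}
"""

if __name__ == "__main__":
    for name, kind in classify_devices(SAMPLE).items():
        print(f"{name}: {kind}")
```

On a real instance you would pipe the lsblk output into the script instead of using the embedded sample.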
It also seems possible to do the same with AWS CLI, but note that you need to target the AMI used by the EC2 instance, not the EC2 instance itself:
$ aws ec2 describe-images --image-ids=ami-097ebb39620d8d54b
{
"Images": [
{
"Architecture": "x86_64",
--- snip ---
"BlockDeviceMappings": [ [14/1929]
{
"DeviceName": "/dev/sda1",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-09874e2e955d8f241",
"VolumeSize": 8,
"VolumeType": "gp2",
"Encrypted": false
}
},
{
"DeviceName": "/dev/sdb",
"VirtualName": "ephemeral0"
},
{
"DeviceName": "/dev/sdc",
"VirtualName": "ephemeral1"
}
],
"Description": "Canonical, Ubuntu, 18.04 LTS, amd64 bionic image build on 2019-05-14",
--- snip ---
Checking the AMI only tells you whether the AMI sets up ephemeral devices by default when you create the instance, not the actual state of the ephemeral devices on your EC2 instance. For that you need lsblk or nvme list (on Linux).
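If you have several AMIs to check, the describe-images output can be filtered with a short script. The following is a minimal sketch that assumes instance store mappings are the ones whose VirtualName starts with "ephemeral", as in the output above. The function name `ephemeral_mappings` and the trimmed-down sample are illustrative only.

```python
import json


def ephemeral_mappings(describe_images_json: str) -> list:
    """Return (device name, virtual name) pairs for instance store mappings."""
    found = []
    for image in json.loads(describe_images_json).get("Images", []):
        for mapping in image.get("BlockDeviceMappings", []):
            # EBS mappings carry an "Ebs" key and no "VirtualName".
            virtual_name = mapping.get("VirtualName", "")
            if virtual_name.startswith("ephemeral"):
                found.append((mapping["DeviceName"], virtual_name))
    return found


# Trimmed-down version of the describe-images output shown above.
SAMPLE = """
{
  "Images": [
    {
      "Architecture": "x86_64",
      "BlockDeviceMappings": [
        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 8, "VolumeType": "gp2"}},
        {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
        {"DeviceName": "/dev/sdc", "VirtualName": "ephemeral1"}
      ]
    }
  ]
}
"""

if __name__ == "__main__":
    for device, virtual in ephemeral_mappings(SAMPLE):
        print(f"{device} -> {virtual}")
```

To use it on real data, save the output of aws ec2 describe-images to a file and pass its contents to the function.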
Yet another way to find the ephemeral volumes is to try adding the alarms and see on which EC2 instances it fails.
Once you know which EC2 instances have ephemeral volumes attached to them, you can detach those volumes. Naturally you need to do it the hard way using the AWS CLI, because the AWS Console does not show you the attachments at all. In theory the information you need for the job is available in the AWS documentation, but crafting a suitable command line and JSON payload can still take some effort. A payload that works for some Ubuntu variants is this:
[
{
"DeviceName": "/dev/sdb",
"VirtualName": "ephemeral0",
"NoDevice": ""
},
{
"DeviceName": "/dev/sdc",
"VirtualName": "ephemeral1",
"NoDevice": ""
}
]
To apply it, put it into remove-ephemeral.json and run this:
$ aws ec2 modify-instance-attribute --instance-id <instance-id> --attribute blockDeviceMapping --block-device-mappings file:///path-to/remove-ephemeral.json
Contrary to what the terraform-providers-aws documentation seems to imply, you can't use "NoDevice": true or "NoDevice": "true" in the JSON payload, or the AWS CLI will barf.
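If your device names differ from the ones above, the payload can be generated instead of written by hand. This is a small sketch, not a definitive tool: the function name `build_remove_payload` is hypothetical, and it simply reproduces the structure of the payload above with "NoDevice" set to an empty string.

```python
import json


def build_remove_payload(mappings):
    """Build the block device mapping list that removes the given devices.

    `mappings` is an iterable of (device_name, virtual_name) pairs.
    NoDevice is an empty string, matching the hand-written payload.
    """
    return [
        {"DeviceName": device, "VirtualName": virtual, "NoDevice": ""}
        for device, virtual in mappings
    ]


if __name__ == "__main__":
    payload = build_remove_payload(
        [("/dev/sdb", "ephemeral0"), ("/dev/sdc", "ephemeral1")]
    )
    # Redirect this output into remove-ephemeral.json.
    print(json.dumps(payload, indent=2))
```

The printed JSON can be saved as remove-ephemeral.json and passed to the modify-instance-attribute command as before.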
After you've modified the instance attributes you should be good, unless the EC2 instance was in a "stopped" state. If it was, you need to start it up; otherwise your changes won't kick in and Terraform will continue complaining. After that trickery Terraform should be happy with the situation: it should neither error out nor try to destroy and rebuild the instance because the ephemeral volumes have been detached.
Finally, you can write Terraform code to create the CloudWatch alarms and recovery/reboot actions for both types of status checks. For example:
resource "aws_cloudwatch_metric_alarm" "system" {
count = var.restart_on_system_failure == true ? 1 : 0
alarm_name = "${var.hostname}_system_check_fail"
alarm_description = "System check has failed"
alarm_actions = ["arn:aws:automate:${var.region}:ec2:recover"]
metric_name = "StatusCheckFailed_System"
namespace = "AWS/EC2"
dimensions = { InstanceId: aws_instance.ec2_instance[0].id }
statistic = "Maximum"
period = "300"
evaluation_periods = "2"
datapoints_to_alarm = "2"
threshold = "1"
comparison_operator = "GreaterThanOrEqualToThreshold"
tags = { "Name": "${var.hostname}_system_check_fail" }
}
resource "aws_cloudwatch_metric_alarm" "instance" {
count = var.restart_on_instance_failure == true ? 1 : 0
alarm_name = "${var.hostname}_instance_check_fail"
alarm_description = "Instance check has failed"
alarm_actions = ["arn:aws:automate:${var.region}:ec2:reboot"]
metric_name = "StatusCheckFailed_Instance"
namespace = "AWS/EC2"
dimensions = { InstanceId: aws_instance.ec2_instance[0].id }
statistic = "Maximum"
period = "300"
evaluation_periods = "3"
datapoints_to_alarm = "3"
threshold = "1"
comparison_operator = "GreaterThanOrEqualToThreshold"
tags = { "Name": "${var.hostname}_instance_check_fail" }
}
This code is taken from terraform-aws_instance_wrapper which we use a lot to wrap useful functionality into Terraform EC2 instance creation.
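To get a feel for how period, evaluation_periods and datapoints_to_alarm interact in the alarms above, here is a simplified sketch of the "M out of N datapoints" evaluation. It is my own approximation, not CloudWatch's actual implementation: it only handles GreaterThanOrEqualToThreshold and ignores missing-data handling, and the function names are made up.

```python
def alarm_fires(datapoints, evaluation_periods, datapoints_to_alarm, threshold=1):
    """Return True if enough of the most recent datapoints breach the threshold.

    Simplified model: GreaterThanOrEqualToThreshold only, no missing data.
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return False  # not enough data collected yet
    breaching = sum(1 for value in window if value >= threshold)
    return breaching >= datapoints_to_alarm


def seconds_until_alarm(period, datapoints_to_alarm):
    """Rough lower bound on how long a check must keep failing."""
    return period * datapoints_to_alarm


if __name__ == "__main__":
    # System alarm: 2 out of 2 periods of 300 seconds.
    print(alarm_fires([0, 1, 1], evaluation_periods=2, datapoints_to_alarm=2))
    print(seconds_until_alarm(300, 2))
    # Instance alarm: 3 out of 3 periods of 300 seconds.
    print(alarm_fires([1, 0, 1], evaluation_periods=3, datapoints_to_alarm=3))
    print(seconds_until_alarm(300, 3))
```

With the values used above, the system alarm needs roughly ten minutes of continuous failure before the recover action runs, and the instance alarm roughly fifteen minutes before the reboot.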