Enabling Azure Backup on Linux VMs with Terraform

September 5, 2022 – Samuli Seppänen

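This article shows you how to enable Azure Backup on Linux VMs with Terraform. It is recommended to read the Understanding Azure Backup for Linux VMs article before trying to enable backups with Terraform. The Terraform AzureRM provider has three relevant resources, all of which are used in the examples below:

  • azurerm_recovery_services_vault
  • azurerm_backup_policy_vm
  • azurerm_backup_protected_vm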

I'll go through the basic process first, which may work for some people. I'll then take a deep dive into the timeout issues you may or may not encounter in your environment.

The first step is to create a recovery services vault and a backup policy. Here's a simple example:

resource "azurerm_resource_group" "myrg" {
  name     = "myrg"
  location = "northeurope"
}

resource "azurerm_recovery_services_vault" "default" {
  name                = "default"
  location            = azurerm_resource_group.myrg.location
  resource_group_name = azurerm_resource_group.myrg.name
  sku                 = "Standard"
  storage_mode_type   = "LocallyRedundant"
}

resource "azurerm_backup_policy_vm" "default" {
  name                           = "default"
  resource_group_name            = azurerm_resource_group.myrg.name
  recovery_vault_name            = azurerm_recovery_services_vault.default.name
  policy_type                    = "V2"

  # This parameter may not work, depending on things. In theory
  # policy_type V2 should allow setting this more freely than V1, but it
  # seems that in certain cases V1 policy is silently enabled even if you
  # set it to V2 in Terraform.
  #instant_restore_retention_days = 7

  timezone = "UTC"

  backup {
    frequency = "Daily"
    time      = "23:00"
  }

  retention_daily {
    count = 7
  }

  retention_weekly {
    count    = 4
    weekdays = ["Saturday"]
  }

  retention_monthly {
    count    = 6
    weekdays = ["Saturday"]
    weeks    = ["Last"]
  }
}
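
The resources above assume that the AzureRM provider is already configured. A minimal sketch of that configuration could look like the following; the version constraint is an assumption made for illustration and not something this article has tested, so pin it to whatever release you actually use:

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      # Assumed constraint for illustration only; pin to a release you have tested.
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  # The features block is mandatory for this provider, even when left empty.
  features {}
}
```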

The next step is to enable backups for a VM, here assumed to be azurerm_linux_virtual_machine.testvm:

resource "azurerm_backup_protected_vm" "testvm" {
  resource_group_name = azurerm_resource_group.myrg.name
  recovery_vault_name = azurerm_recovery_services_vault.default.name
  source_vm_id        = azurerm_linux_virtual_machine.testvm.id
  backup_policy_id    = azurerm_backup_policy_vm.default.id
}

Now, in theory this should be enough plumbing and it may work for you. However, in my case deploying the azurerm_backup_protected_vm just seems to hang indefinitely:

azurerm_backup_protected_vm.testvm: Still creating... [1h19m50s elapsed]
╷
│ Error: waiting for the Azure Backup Protected VM "VM;iaasvmcontainerv2;myrg;testvm" to be true (Resource Group "myrg") to provision: context deadline exceeded

While enabling backups manually and taking the first backup in the Azure Portal is slow (~15 minutes), it is nowhere near this slow. This problem is not related to ordering, either: even if the VM is created before trying to enable backups on it, the same thing happens.

One thing I tried was enabling the VMSnapshotLinux extension manually, so that it would be installed before trying to protect the VM. The first step was to figure out the extension publisher, name and version with Azure CLI:

$ az vm extension image list --location northeurope --output table

Name             Publisher                          Version
----             ---------                          -------
AcronisBackup    Acronis.Backup                     1.0.33
--- snip ---
VMSnapshotLinux  Microsoft.Azure.RecoveryServices   1.0.9188.0
--- snip ---

When used with the azurerm_virtual_machine_extension Terraform resource, these fields translate like this:

  • Name -> type
  • Publisher -> publisher
  • Version -> type_handler_version

So, naively one would assume this Terraform code would install the VMSnapshotLinux extension on a VM, assuming that waagent.service is running:

resource "azurerm_virtual_machine_extension" "testvm_vmsnapshotlinux" {
  name                 = "testvm-vmsnapshotlinux"
  virtual_machine_id   = azurerm_linux_virtual_machine.testvm.id
  publisher            = "Microsoft.Azure.RecoveryServices"
  type                 = "VMSnapshotLinux"
  type_handler_version = "1.0.9188.0"
}

But of course this code did not work due to an esoteric and undocumented feature/bug:

the value of parameter typeHandlerVersion is invalid

Basically the version number in type_handler_version must be shortened from 1.0.9188.0 to 1.0:

type_handler_version = "1.0"

This got me to the next error:

Publisher 'Microsoft.Azure.RecoveryServices' and type VMSnapshotLinux is reserved for internal use.

This "Internal use" issue can be easily reproduced with Azure CLI:

$ az vm extension set --resource-group myrg --vm-name testvm --name VMSnapshotLinux --publisher Microsoft.Azure.RecoveryServices
(OperationNotAllowed) Publisher 'Microsoft.Azure.RecoveryServices' and type VMSnapshotLinux is reserved for internal use.
Code: OperationNotAllowed
Message: Publisher 'Microsoft.Azure.RecoveryServices' and type VMSnapshotLinux is reserved for internal use.

How nice, so this particular extension can't be explicitly installed. The next debugging step was to check the deployment JSON file in Azure Portal for a manual VM protection action. The deployments are stored under the resource group. And here is what Azure Portal deploys:

{
    "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "armProviderNamespace": {
            "type": "String"
        },
        "vaultName": {
            "type": "String"
        },
        "vaultRG": {
            "type": "String"
        },
        "vaultSubID": {
            "type": "String"
        },
        "policyName": {
            "type": "String"
        },
        "fabricName": {
            "type": "String"
        },
        "protectionContainers": {
            "type": "Array"
        },
        "protectedItems": {
            "type": "Array"
        },
        "sourceResourceIds": {
            "type": "Array"
        },
        "extendedProperties": {
            "type": "Array"
        }
    },
    "resources": [
        {
            "type": "Microsoft.RecoveryServices/vaults/backupFabrics/protectionContainers/protectedItems",
            "apiVersion": "2016-06-01",
            "name": "[concat(parameters('vaultName'), '/', parameters('fabricName'), '/',parameters('protectionContainers')[copyIndex()], '/', parameters('protectedItems')[copyIndex()])]",
            "properties": {
                "protectedItemType": "Microsoft.ClassicCompute/virtualMachines",
                "policyId": "[resourceId(concat(parameters('armProviderNamespace'), '/vaults/backupPolicies'), parameters('vaultName'), parameters('policyName'))]",
                "sourceResourceId": "[parameters('sourceResourceIds')[copyIndex()]]",
                "extendedProperties": "[parameters('extendedProperties')[copyIndex()]]"
            },
            "copy": {
                "name": "protectedItemsCopy",
                "count": "[length(parameters('protectedItems'))]"
            }
        }
    ]
}

It is noteworthy that there is only one resource in this deployment file. Also, the parameters in this file match those used by the azurerm_backup_protected_vm Terraform resource, as do the parameter values. So, Azure Portal does not seem to be doing any "under the hood" magic to enable the backups for a VM. Everything points to a bug in the Terraform AzureRM provider itself.

Looking at this bug report, it seems that Terraform's AzureRM provider waits for the first backup to succeed and only then considers the resource to be successfully created. Given that enabling the backup and triggering the first backup takes ~15 minutes in Azure Portal, it looks as if Terraform does not trigger the first backup at all. Combined with a reasonable "once per day" backup schedule, this means Terraform will fail every time unless timeout values are huge and the first backup actually kicks in during the Terraform run.
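
For reference, here is roughly what "huge timeout values" would look like in practice. This is a sketch only: it repeats the azurerm_backup_protected_vm resource from earlier (replacing that definition, not adding a second one) and adds a timeouts block. The 26h value is an assumption, picked only so that a daily schedule would fire at least once while Terraform waits. I have not verified that this makes the apply succeed:

```hcl
resource "azurerm_backup_protected_vm" "testvm" {
  resource_group_name = azurerm_resource_group.myrg.name
  recovery_vault_name = azurerm_recovery_services_vault.default.name
  source_vm_id        = azurerm_linux_virtual_machine.testvm.id
  backup_policy_id    = azurerm_backup_policy_vm.default.id

  timeouts {
    # Hypothetical value: long enough to span one full daily backup cycle.
    create = "26h"
  }
}
```

Keeping a Terraform run alive for a day is rarely practical, which is why the manual route described next tends to be the easier option.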

Summa summarum: if you encounter these timeout issues, it may be easiest to just enable the backups manually in Azure Portal or with Azure CLI:

az backup protection enable-for-vm --resource-group myrg --vault-name default --vm testvm --policy-name default

Then you can import the already present resource into Terraform and be done with it, while waiting for this bug to be fixed.
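
One way to do the import is with an import block, sketched below. Note the assumptions: import blocks require Terraform 1.5 or later, which did not exist when this article was written (on older versions the terraform import command takes the same address and ID). The subscription ID is a placeholder, and the rest of the ID follows the format shown in the provider documentation for azurerm_backup_protected_vm, filled in with the names used in this article. Check the real ID of the protected item in your environment before using it:

```hcl
# Requires Terraform 1.5 or later.
import {
  # Address of the resource block already present in the configuration.
  to = azurerm_backup_protected_vm.testvm

  # Placeholder subscription ID; vault, resource group and VM names are the
  # ones used in this article. Verify the exact ID in your own environment.
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/myrg/providers/Microsoft.RecoveryServices/vaults/default/backupFabrics/Azure/protectionContainers/iaasvmcontainer;iaasvmcontainerv2;myrg;testvm/protectedItems/vm;iaasvmcontainerv2;myrg;testvm"
}
```

After the import, terraform plan should show whether the configuration matches what was created manually.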
