Using Velero for AKS Cross Region Disaster Recovery // Andy Roberts

It’s been a while since I wrote a blog post, but I wanted to jump back in with a big one: Disaster Recovery!

Traditionally with DR in a Virtual Machine infrastructure, you would replicate the VMs/disks etc to your DR location, and when you initiate a failover you simply start them up. This works well in a VM environment because you would normally spend a fair amount of time configuring them, either manually (I hope not) or with automation; either way it’s time consuming but well understood.

But what about Containers and Kubernetes?

The most common answer when you talk about Kubernetes and backups/DR is: don’t bother, just redeploy. That does make sense in some instances, such as when your application is totally stateless and it’s easy to deploy, but what happens if you have many different products, deployed by different tools and possibly by different teams? It was this exact challenge I faced recently, and then to add to the mix, I needed to have Disaster Recovery in a totally different Azure Region (and country in some cases). How do you do that?

For me, the tool I found which fit this best was Velero.

How does it work

Velero is an OSS tool from VMware designed for backup and disaster recovery of any Kubernetes cluster, be it bare metal or cloud managed on AWS, Azure or GCP. It stores its data on object storage like Azure Blob, AWS S3 etc or even virtual disk images, and authentication is available in many forms. For instance, with Azure Blob Storage you can authenticate with Access Keys, Managed Identity and even Workload Identity. It also supports backup and restore to different clusters than the source, and crucially for me, the restore isn’t region dependent, so you can use it for multi-region DR or migrations.

You install Velero on each cluster (via Helm chart or the Velero CLI), give it connection details to your storage for backups, and then set up backups with a CLI tool you run locally.

Azure Backup has actually started to use Velero as its Kubernetes backup tool; however, as of November 2023 this only supports restores to clusters in the same region as the original backup.

Let’s try this out.

You are going to need a few items.

2x AKS Clusters, each in a different region if you want to test multi-region restore
1x Azure Storage Account for Velero
Velero Client on your laptop to interact with Velero

If you already have some AKS clusters you can use them and skip to the next step, but if not let’s create some.


# Set your Subscription Context
mySubscription="Your-Azure-Subscription"
mySubID=$(az account list --query="[?name=='$mySubscription'].id | [0]" -o tsv)
az account set -s $mySubID

# Create a shared resource group
myResourceGroup="rsg-velero"
myPrimaryLocation="australiaeast"
myDRLocation="westus2"

az group create --name $myResourceGroup --location $myPrimaryLocation

# Create my two test clusters
myPrimaryCluster="aks-1"
myDRCluster="aks-2"

# Primary Cluster
az aks create --resource-group $myResourceGroup --location $myPrimaryLocation \
    --name $myPrimaryCluster \
    --node-count 1 \
    --node-vm-size Standard_B2ms

# DR Cluster
az aks create --resource-group $myResourceGroup --location $myDRLocation \
    --name $myDRCluster \
    --node-count 1 \
    --node-vm-size Standard_B2ms

# Check both clusters are created
az aks list --resource-group $myResourceGroup --output table

You will then end up with a couple of clusters.

aks_clusters
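
If you are scripting this step, it can help to confirm both clusters have finished provisioning before moving on. The following is only a sketch, assuming the `az` CLI is logged in to the right subscription; `cluster_state` and `clusters_ready` are hypothetical helper names of my own, not part of the Azure CLI.

```shell
# Hypothetical helper (not part of the az CLI): print a cluster's provisioning state.
cluster_state() {
    local resource_group="$1" cluster="$2"
    az aks show --resource-group "$resource_group" --name "$cluster" \
        --query provisioningState --output tsv
}

# Return 0 only if every named cluster reports "Succeeded".
clusters_ready() {
    local resource_group="$1"
    shift
    local cluster state
    for cluster in "$@"; do
        state=$(cluster_state "$resource_group" "$cluster")
        if [ "$state" != "Succeeded" ]; then
            echo "$cluster is not ready (state: ${state:-unknown})" >&2
            return 1
        fi
        echo "$cluster is ready"
    done
}
```

You would call it as `clusters_ready "$myResourceGroup" "$myPrimaryCluster" "$myDRCluster"` and only continue when it returns successfully.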

So, now you have a couple of AKS clusters in a resource group ready to go, but Velero needs a location to store config/backups etc. There are lots of options for this, but the one I found easiest to set up with AKS is a regular Azure Storage Account, so let’s create one.

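# Storage account and container names (the account name must be globally unique)
veleroStorageAccount="stvelerotest123"
veleroStorageContainer="velero"

# The resource group already exists from the earlier step; this call is idempotent
az group create -n $myResourceGroup --location $myPrimaryLocation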

# Setup Storage Account
az storage account create \
--name $veleroStorageAccount \
--resource-group $myResourceGroup \
--sku Standard_GRS \
--encryption-services blob \
--https-only true \
--kind BlobStorage \
--access-tier Hot

# Setup Container
az storage container create -n $veleroStorageContainer --public-access off --account-name $veleroStorageAccount
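
Because the whole DR story here depends on the backups surviving a regional outage, it may be worth checking that the account really is geo-replicated. This is a sketch only: `is_geo_redundant_sku` and `check_storage_replication` are hypothetical helpers of my own, and the list of SKU names is my assumption of what counts as geo-redundant for this purpose.

```shell
# Return 0 if the SKU name is one of Azure's geo-replicated options.
is_geo_redundant_sku() {
    case "$1" in
        Standard_GRS|Standard_RAGRS|Standard_GZRS|Standard_RAGZRS) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: look up the account's SKU with az and check it.
check_storage_replication() {
    local account="$1" resource_group="$2" sku
    sku=$(az storage account show --name "$account" \
        --resource-group "$resource_group" --query sku.name --output tsv)
    if is_geo_redundant_sku "$sku"; then
        echo "$account uses $sku (geo-redundant)"
    else
        echo "$account uses ${sku:-unknown}, which is not geo-redundant" >&2
        return 1
    fi
}
```

For the account created above you would run `check_storage_replication "$veleroStorageAccount" "$myResourceGroup"`.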

Once your Storage Account is set up, Velero needs a way to connect to it. There are several options, such as:

Storage Access Key
Managed Identity
Workload Identity

This obviously depends on your workload requirements, but for cross-region cluster DR I found that a Storage Access Key is the quickest to set up and the fastest to test. As it’s a secret you wouldn’t commit it to any form of source control, but Velero can read these secret values from a file during install. What I do in production is deploy Velero with an Azure DevOps Pipeline, and during deployment I have it pull this secret the same way as below and update Velero each time, so if I ever need to roll the keys I just redeploy Velero and it picks up the latest secret. So let’s generate the secret and the file.

# Get the Storage Account Key for the Velero Blob.
VELERO_STORAGE_ACCESS_KEY=$(az storage account keys list --account-name $veleroStorageAccount --resource-group $myResourceGroup --query "[0].value" -o tsv)

# Add the Storage Account Key to the Credentials File

cat << EOF  > ./credentials-velero
AZURE_STORAGE_ACCOUNT_ACCESS_KEY=${VELERO_STORAGE_ACCESS_KEY}
AZURE_CLOUD_NAME=AzurePublicCloud
EOF

You should end up with a file like this.

blob_access
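
Since this file holds a storage key in plain text, a quick sanity check before handing it to Velero can save some debugging. The sketch below is my own suggestion rather than anything Velero requires: `check_credentials_file` is a hypothetical helper that confirms both keys from the file above have values and then restricts the file to your user.

```shell
# Hypothetical helper: confirm the credentials file has the two expected keys
# with non-empty values, then make it readable and writable by the owner only.
check_credentials_file() {
    local file="$1" key
    [ -f "$file" ] || { echo "$file not found" >&2; return 1; }
    for key in AZURE_STORAGE_ACCOUNT_ACCESS_KEY AZURE_CLOUD_NAME; do
        if ! grep -Eq "^${key}=.+" "$file"; then
            echo "$file is missing a value for $key" >&2
            return 1
        fi
    done
    chmod 600 "$file"
}
```

Run it as `check_credentials_file ./credentials-velero`. An empty key usually means the `az storage account keys list` call above returned nothing.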

Velero Client Install

If that has all worked, you now need to deploy Velero to your clusters, but first let’s [download the Velero Client](https://velero.io/docs/main/basic-install/#install-the-cli), as you need this to interact with Velero, and you will see how it looks before the cluster is set up. From that link take whatever is appropriate for your OS; for me, let’s do it on Ubuntu.

wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.1/velero-v1.12.1-linux-amd64.tar.gz
tar -xvf velero-v1.12.1-linux-amd64.tar.gz
sudo mv velero-v1.12.1-linux-amd64/velero /usr/local/bin
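
If you are not on an amd64 Linux machine, the same download works with a different file name. As a rough sketch, and assuming the release assets for other platforms follow the same naming pattern as the file above (worth confirming on the releases page), a hypothetical `velero_release_url` helper could build the URL from `uname` output:

```shell
# Hypothetical helper: build the release tarball URL for a Velero version,
# an OS name (as printed by `uname -s`) and a CPU architecture (`uname -m`).
# Assumes assets are named velero-<version>-<os>-<arch>.tar.gz.
velero_release_url() {
    local version="$1" os arch
    os=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    case "$3" in
        x86_64|amd64) arch="amd64" ;;
        aarch64|arm64) arch="arm64" ;;
        *) echo "unsupported architecture: $3" >&2; return 1 ;;
    esac
    echo "https://github.com/vmware-tanzu/velero/releases/download/${version}/velero-${version}-${os}-${arch}.tar.gz"
}
```

For example, `wget "$(velero_release_url v1.12.1 "$(uname -s)" "$(uname -m)")"` on Ubuntu resolves to the same file as the `wget` line above.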

With the Velero client running locally you can run a version check which will show you the version of your client and your server.

# First connect to our cluster
az aks get-credentials -n $myPrimaryCluster -g $myResourceGroup

# Then check Velero
velero version

You’ll get this below, which means Velero isn’t running on your cluster yet.

velero_client

So let’s fix that.

Velero Server Install

Installing Velero can be done with the Velero CLI, which I’ll show below, or with Helm charts, which is what I use in production. For now we’ll install with the CLI as it’s easier to understand.

You need to perform this config on both clusters!

velero install \
    --provider azure \
    --plugins velero/velero-plugin-for-microsoft-azure:v1.8.1 \
    --bucket $veleroStorageContainer \
    --secret-file ./credentials-velero \
    --backup-location-config resourceGroup=$myResourceGroup,storageAccount=$veleroStorageAccount,storageAccountKeyEnvVar=AZURE_STORAGE_ACCOUNT_ACCESS_KEY,subscriptionId=$mySubID \
    --use-volume-snapshots=false
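
Since the install has to happen on both clusters, one option is to wrap it in a loop. This is a sketch under a few assumptions: the variables defined earlier in this post are still set in your shell, the `velero` binary is on your `PATH`, and `install_velero_on_clusters` is a hypothetical function name of my own.

```shell
# Hypothetical wrapper: point kubectl at each cluster in turn and run the same
# Velero install against it. Relies on $myResourceGroup, $veleroStorageAccount,
# $veleroStorageContainer and $mySubID from the earlier steps.
install_velero_on_clusters() {
    local cluster
    for cluster in "$@"; do
        echo "Installing Velero on $cluster"
        az aks get-credentials --name "$cluster" \
            --resource-group "$myResourceGroup" --overwrite-existing || return 1
        velero install \
            --provider azure \
            --plugins velero/velero-plugin-for-microsoft-azure:v1.8.1 \
            --bucket "$veleroStorageContainer" \
            --secret-file ./credentials-velero \
            --backup-location-config "resourceGroup=$myResourceGroup,storageAccount=$veleroStorageAccount,storageAccountKeyEnvVar=AZURE_STORAGE_ACCOUNT_ACCESS_KEY,subscriptionId=$mySubID" \
            --use-volume-snapshots=false || return 1
    done
}
```

Calling `install_velero_on_clusters "$myPrimaryCluster" "$myDRCluster"` leaves your kubectl context on the DR cluster, so switch back to the primary before creating backups.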

Now when you run a version check you should see Client and Server are correct.

velero_version

Next you need to check that your backup storage, which is the Storage Account created earlier, is working.

velero backup-location get

velero_storage

If the Phase shows as Available then you have set up the base requirements for Velero.
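
The phase can take a little while to settle after install, so in a pipeline you may prefer to poll for it rather than check once. A minimal sketch, assuming Velero’s default namespace (`velero`) and default backup location name (`default`); `wait_for_backup_location` is a hypothetical helper, not a Velero command.

```shell
# Hypothetical helper: poll the BackupStorageLocation until its phase is
# Available or the attempts run out.
# Arguments: number of attempts (default 12), seconds between attempts (default 5).
wait_for_backup_location() {
    local attempts="${1:-12}" delay="${2:-5}" phase i=1
    while [ "$i" -le "$attempts" ]; do
        phase=$(kubectl get backupstoragelocation default --namespace velero \
            --output jsonpath='{.status.phase}' 2>/dev/null)
        if [ "$phase" = "Available" ]; then
            echo "Backup location is Available"
            return 0
        fi
        echo "Attempt $i/$attempts: phase is '${phase:-unknown}'"
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

With the defaults, `wait_for_backup_location` gives Velero about a minute to validate the storage account before giving up.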

Create a Backup and Backup Schedule

To create a backup with Velero you use the CLI tool, and you can create a schedule or one-off backups. I’ll set up both below, but first let’s verify we have something to back up, and only in one cluster. I have set up a couple of demo pods, a service etc in my AKS-1 cluster, and you can see below that nothing exists on my AKS-2 cluster.

velero_base_setup

First, let’s create a scheduled backup on AKS-1.

velero schedule create "test-schedule" --schedule="0 0-23 * * *" --include-namespaces test --exclude-resources persistentvolumes --ttl 24h0m0s
velero get schedule

This has given us a schedule called “test-schedule” which will run every hour and back up the test namespace, and you can see that in the results.

velero_schedule

If we want to execute a backup straight away from that schedule, we run the following.

velero backup create --from-schedule test-schedule
# It will output a command to run similar to this
velero backup describe test-schedule-20231109024158

You’ll see it runs very fast, and you’ll end up with something like this.

velero_backup

OK, there is a lot to unpack here.

First is the command to create the backup
Then we see how to describe the backup and get its details
We can see it Completed
Look at the elapsed time, 2 seconds……
24 items were captured as part of this

That’s super impressive!!! I’ve seen Velero on heavily populated clusters with hundreds of objects take no more than 10 seconds to backup as well!
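
If you want a script to act on the result rather than read it off the screen, you can look at the backup’s phase directly. The sketch below assumes the default `velero` namespace; `backup_phase` and `backup_phase_status` are hypothetical helpers, and the grouping of phases into done, failed and still running is my own simplification.

```shell
# Map a Velero backup phase to an exit code:
# 0 = completed, 1 = failed in some way, 2 = anything else (treated as still running).
backup_phase_status() {
    case "$1" in
        Completed) return 0 ;;
        PartiallyFailed|Failed|FailedValidation) return 1 ;;
        *) return 2 ;;
    esac
}

# Hypothetical helper: read the phase of a named backup from the cluster.
backup_phase() {
    kubectl get backups.velero.io "$1" --namespace velero \
        --output jsonpath='{.status.phase}'
}
```

For example, `backup_phase_status "$(backup_phase test-schedule-20231109024158)"` returns 0 only once the backup above has completed.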

Of course if you wanted to do an ad hoc backup you can. The command for that would be

velero backup create my-adhoc-backup # This will backup the entire cluster, all resources, all namespaces

Time to Restore

So we have a backup. Now let’s imagine I have just lost the entire Azure Region and my cluster is dead; I can’t access it. I need to restore to DR, so how do I do that?

This is the beauty I have found with Velero: it is super simple. First let’s switch to our DR cluster AKS-2 and see what backups we have, keeping in mind that all the backups were taken on AKS-1 and nothing is running on AKS-2.

velero_backup_dr

OK, that’s good. I am on my DR cluster with nothing running, but I can see the Prod backup I took earlier from my AKS-1 cluster. That works because both clusters are accessing the same Azure Storage Account for backups/restores, so as long as that is geo-redundant I am good to go.
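
One optional precaution before restoring: Velero’s own disaster recovery guidance suggests switching the backup storage location to read-only on the cluster you are restoring into, so that backup objects are not created or deleted there during the restore. A sketch of that, assuming the default `velero` namespace; `set_backup_location_mode` is a hypothetical wrapper around `kubectl patch`.

```shell
# Hypothetical helper: switch a BackupStorageLocation between ReadOnly and
# ReadWrite by patching its spec.accessMode field.
set_backup_location_mode() {
    local location="$1" mode="$2"
    case "$mode" in
        ReadOnly|ReadWrite) ;;
        *) echo "mode must be ReadOnly or ReadWrite" >&2; return 1 ;;
    esac
    kubectl patch backupstoragelocation "$location" --namespace velero \
        --type merge --patch "{\"spec\":{\"accessMode\":\"$mode\"}}"
}
```

You would run `set_backup_location_mode default ReadOnly` on the DR cluster before the restore, and `set_backup_location_mode default ReadWrite` afterwards if that cluster should start taking its own backups.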

To restore my backup to my new cluster let’s try this…

velero create restore --from-backup test-schedule-20231109024158
velero restore describe test-schedule-20231109024158-20231109025349

velero_restore_dr
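
In the commands above I copied the backup name by hand. In a real failover you probably want the most recent backup from the schedule without looking it up. As a sketch, and relying on the `<schedule>-<timestamp>` naming seen in this post (which sorts chronologically), a hypothetical `latest_backup_for_schedule` helper could pick it from a list of names:

```shell
# Hypothetical helper: read backup names on stdin (one per line) and print the
# newest one created from the given schedule. Assumes names look like
# <schedule>-<YYYYMMDDhhmmss>, as in the examples in this post.
latest_backup_for_schedule() {
    local schedule="$1"
    grep -E "^${schedule}-[0-9]{14}$" | sort | tail -n 1
}

# Usage (assuming the default "velero" namespace):
#   latest=$(kubectl get backups.velero.io --namespace velero --output name \
#       | sed 's|.*/||' | latest_backup_for_schedule test-schedule)
#   velero create restore --from-backup "$latest"
```

Ad hoc backups and backups from other schedules are ignored because they do not match the schedule prefix.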

That’s fast!!! The restore here took 4 seconds to complete. That can’t be right, can it? Let’s check the namespace.

velero_restore_complete

Everything is there, my pods, service, all the secrets and configmaps etc and everything is running, in a new cluster in a new Azure Region!! It worked!

Summary

That was just a quick overview of Velero and how you can utilize it for multi-region DR at a high level with Azure Kubernetes Service. It was a basic demonstration, but the principle is the same whether you are doing a single-cluster backup and restore, a cluster migration or, as demonstrated here, regional disaster recovery of clusters.

Cleanup

To remove all the test resources we created, just delete the original resource group.

az group delete --name $myResourceGroup --yes
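
Deleting the resource group does not touch anything on your laptop, so the credentials file (which contains the storage key) and the kubeconfig entries for the two clusters are still there. A small sketch for tidying those up; `cleanup_local_files` is a hypothetical helper, and it assumes the kubeconfig context and cluster entries are named after the clusters, which is what `az aks get-credentials` does by default.

```shell
# Hypothetical helper: remove the local credentials file and the kubeconfig
# context and cluster entries for each named cluster.
# Arguments: path to the credentials file, then one or more cluster names.
cleanup_local_files() {
    local credentials_file="$1"
    shift
    local cluster
    rm -f "$credentials_file"
    for cluster in "$@"; do
        # Ignore errors so a missing entry does not stop the cleanup.
        kubectl config delete-context "$cluster" >/dev/null 2>&1 || true
        kubectl config delete-cluster "$cluster" >/dev/null 2>&1 || true
    done
}
```

For this walkthrough that would be `cleanup_local_files ./credentials-velero "$myPrimaryCluster" "$myDRCluster"`.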