It’s been a while since I wrote a blog post, but I wanted to jump back in with a big one: Disaster Recovery!
Traditionally, with DR in a Virtual Machine infrastructure, you would replicate the VMs, disks, etc. to your DR location, and when you initiate a failover you simply start them up. This works well in a VM environment because you would normally spend a fair amount of time configuring them, either manually (I hope not) or with automation; either way it’s time consuming but well understood.
But what about Containers and Kubernetes?
The most common answer when you talk about Kubernetes and backups/DR is: don’t bother, just redeploy. That does make sense in some instances, such as when your application is totally stateless and it’s easy to deploy, but what happens if you have many different products, deployed by different tools and possibly by different teams? It was this exact challenge I faced recently, and then, to add to the mix, I needed to have Disaster Recovery in a totally different Azure Region (and country in some cases). How do you do that?
For me, the tool that fit this best was Velero.
How does it work
Velero is an OSS tool from VMware designed for backup and disaster recovery of any Kubernetes cluster, be it bare metal or cloud managed on AWS, Azure or GCP. It stores its data on object storage like Azure Blob, AWS S3 etc or even virtual disk images, and authentication is available in many forms. For instance, with Azure Blob Storage you can authenticate with Access Keys, Managed Identity and even Workload Identity. It also supports backup and restore to different clusters than the source and, crucially for me, the restore isn’t region dependent, so you can use it for multi-region DR or migrations.
You install Velero on each cluster via Helm chart, give it connection details for your backup storage, and then set up backups with a CLI tool you run locally.
Azure Backup has actually started to use Velero as its Kubernetes backup tool; however, as of November 2023 this only supports restores to clusters in the same region as the original backup.
Let’s try this out.
You are going to need a few items.
- 2x AKS Clusters, each in a different region if you want to test multi-region restore
- 1x Azure Storage Account for Velero
- Velero Client on your laptop to interact with Velero
If you already have some AKS Clusters you can use them and skip to the next step, but if not let’s create some.
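The commands in this walkthrough rely on a handful of shell variables that need to be set first. A minimal sketch of one way to do that is below. Every value is a placeholder assumption (names and regions are illustrative only), and `valid_storage_name` is a hypothetical helper added here because Azure storage account names must be 3 to 24 characters of lowercase letters and digits.

```shell
# Placeholder values only; every name and region here is an assumption.
mySubscription="my-subscription-name"
myResourceGroup="rg-velero-demo"
myPrimaryLocation="uksouth"
myDRLocation="northeurope"
myPrimaryCluster="AKS-1"
myDRCluster="AKS-2"
veleroStorageAccount="velerodemo12345"   # must also be globally unique
veleroStorageContainer="velero"

# Hypothetical helper: Azure storage account names must be 3-24
# characters long and contain lowercase letters and digits only.
valid_storage_name() {
  case "$1" in
    *[!a-z0-9]*|"") return 1 ;;
  esac
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 24 ]
}

if valid_storage_name "$veleroStorageAccount"; then
  echo "storage account name looks valid"
else
  echo "invalid storage account name: $veleroStorageAccount" >&2
fi
```

The format check only catches naming mistakes early; it cannot tell you whether the name is already taken by someone else.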
# Set your Subscription Context
mySubID=$(az account list --query "[?name=='$mySubscription'].id | [0]" -o tsv)
az account set -s $mySubID
# Create a shared resource group
az group create --name $myResourceGroup --location $myPrimaryLocation
# Create my two test clusters
# Primary Cluster
az aks create --resource-group $myResourceGroup --location $myPrimaryLocation \
--name $myPrimaryCluster \
--node-count 1
# DR Cluster
az aks create --resource-group $myResourceGroup --location $myDRLocation \
--name $myDRCluster \
--node-count 1
# Check both clusters are created
az aks list --resource-group $myResourceGroup --output table
You will then end up with a couple of clusters.
So, now you have a couple of AKS clusters in a resource group ready to go, but Velero needs a location to store its config, backups, etc. There are lots of options for this, but the one I found easiest to set up with AKS is a regular Azure Storage Account, so let’s create one.
# Setup Storage Account Resource Group
az group create -n $myResourceGroup --location $myPrimaryLocation
# Setup Storage Account
az storage account create \
--name $veleroStorageAccount \
--resource-group $myResourceGroup \
--sku Standard_GRS \
--encryption-services blob \
--https-only true \
--kind BlobStorage \
--access-tier Hot
# Setup Container
az storage container create -n $veleroStorageContainer --public-access off --account-name $veleroStorageAccount
Once your Storage Account is set up, Velero needs a way to connect to it. There are several options, such as
- Storage Access Key
- Managed Identity
- Workload Identity
This obviously depends on your workload requirements, but for cross-region cluster DR I found that a Storage Access Key is the quickest to get working and the fastest to test. As it’s a secret you wouldn’t commit it to any form of source control, but Velero can take these secret values from a file during install. What I do in production is deploy Velero with an Azure DevOps Pipeline; during deployment it pulls the secret the same way as below and updates Velero each time, so if I ever need to roll the keys I just redeploy Velero and it picks up the latest secret. So let’s generate the secret and the file.
# Get the Storage Account Key for the Velero Blob.
VELERO_STORAGE_ACCESS_KEY=$(az storage account keys list --account-name $veleroStorageAccount --resource-group $myResourceGroup --query "[0].value" -o tsv)
# Add the Storage Account Key to the Credentials File
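cat << EOF > ./credentials-velero
AZURE_STORAGE_ACCOUNT_ACCESS_KEY=${VELERO_STORAGE_ACCESS_KEY}
AZURE_CLOUD_NAME=AzurePublicCloud
EOF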
You should end up with a file like this.
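As a sanity check before installing, a sketch like the following confirms the file contains a non-empty key without printing the secret. It assumes the file uses the `AZURE_STORAGE_ACCOUNT_ACCESS_KEY` variable name that the install command references later; `check_velero_credentials` is a hypothetical helper, not part of Velero.

```shell
# Hypothetical helper: verify the credentials file defines a non-empty
# AZURE_STORAGE_ACCOUNT_ACCESS_KEY, without echoing the secret itself.
check_velero_credentials() {
  local file="${1:-./credentials-velero}"
  if [ ! -f "$file" ]; then
    echo "missing: $file" >&2
    return 1
  fi
  if ! grep -Eq '^AZURE_STORAGE_ACCOUNT_ACCESS_KEY=.+' "$file"; then
    echo "no access key set in $file" >&2
    return 1
  fi
  # Restrict the file to the current user, since it holds a secret.
  chmod 600 "$file"
  echo "credentials file OK: $file"
}
```

Calling `check_velero_credentials ./credentials-velero` catches the common case where the key lookup returned nothing and the file was written with an empty value.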
Velero Client Install
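If that has all worked, you now need to deploy Velero to your clusters, but first let’s [download the Velero Client](https://velero.io/docs/main/basic-install/#install-the-cli), as you need this to interact with Velero, and you will see how it looks without the cluster set up. From that link take whatever is appropriate for your OS; for me, let’s do it on Ubuntu.
# Download the v1.12.1 release tarball from the Velero GitHub releases page
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.1/velero-v1.12.1-linux-amd64.tar.gz
tar -xvf velero-v1.12.1-linux-amd64.tar.gz
# Moving into /usr/local/bin normally needs root
sudo mv velero-v1.12.1-linux-amd64/velero /usr/local/bin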
With the Velero client running locally you can run a version check which will show you the version of your client and your server.
# First connect to our cluster
az aks get-credentials -n $myPrimaryCluster -g $myResourceGroup
# Then check Velero
velero version
You’ll get this below, which means Velero isn’t running on your cluster yet.
So let’s fix that.
Velero Server Install
Installing Velero can be done with the Velero CLI, which I’ll show below, or with Helm Charts, which is what I use in production. For now we’ll install with the CLI as it’s easier to understand.
You need to perform this config on both clusters!
velero install \
--provider azure \
--plugins velero/velero-plugin-for-microsoft-azure:v1.8.1 \
--bucket $veleroStorageContainer \
--secret-file ./credentials-velero \
--backup-location-config resourceGroup=$myResourceGroup,storageAccount=$veleroStorageAccount,storageAccountKeyEnvVar=AZURE_STORAGE_ACCOUNT_ACCESS_KEY,subscriptionId=$mySubID
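Since the same install has to run against both clusters, one way to avoid repeating yourself is to wrap it in a function and loop over the cluster names. This is only a sketch: `install_velero_on` and `install_velero_on_all` are hypothetical helper names, and it assumes the variables used earlier are set and `./credentials-velero` exists in the current directory.

```shell
# Sketch: run the same Velero install against one cluster.
install_velero_on() {
  local cluster="$1"
  # Point kubectl and velero at the target cluster.
  az aks get-credentials -n "$cluster" -g "$myResourceGroup" --overwrite-existing
  velero install \
    --provider azure \
    --plugins velero/velero-plugin-for-microsoft-azure:v1.8.1 \
    --bucket "$veleroStorageContainer" \
    --secret-file ./credentials-velero \
    --backup-location-config "resourceGroup=$myResourceGroup,storageAccount=$veleroStorageAccount,storageAccountKeyEnvVar=AZURE_STORAGE_ACCOUNT_ACCESS_KEY,subscriptionId=$mySubID"
}

# Sketch: install on every cluster name passed in, stopping on the first failure.
install_velero_on_all() {
  local cluster
  for cluster in "$@"; do
    echo "Installing Velero on $cluster"
    install_velero_on "$cluster" || return 1
  done
}
```

You would call it as `install_velero_on_all "$myPrimaryCluster" "$myDRCluster"`. Note that it leaves your kubectl context pointing at the last cluster in the list.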
Now when you run a version check you should see Client and Server are correct.
Next you need to check that your backup storage, which is the Storage Account created earlier, is working.
velero backup-location get
If the Phase shows as Available then you have set up the base requirements for Velero.
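Right after an install the location can take a short while to be validated, so in a pipeline it can help to poll rather than check once. The sketch below reads the same status through kubectl. It assumes Velero is in the default `velero` namespace and the location is named `default`, which is what `velero install` creates when no overrides are given; `wait_for_backup_location` is a hypothetical helper.

```shell
# Sketch: poll the default BackupStorageLocation until it reports Available.
# Arguments: number of attempts (default 12), delay in seconds (default 5).
wait_for_backup_location() {
  local attempts="${1:-12}" delay="${2:-5}" phase="" i=1
  while [ "$i" -le "$attempts" ]; do
    phase=$(kubectl -n velero get backupstoragelocation default \
      -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" = "Available" ]; then
      echo "backup location is Available"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "backup location not Available (last phase: ${phase:-unknown})" >&2
  return 1
}
```

With the defaults, `wait_for_backup_location` gives up after roughly a minute and returns a non-zero exit code, which is enough to fail a deployment step.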
Create a Backup and Backup Schedule
To create a backup with Velero, you use the CLI tool, and you can create either a schedule or one-off backups. I’ll set up both below, but first let’s verify we have something to back up, and only in one cluster. I have set up a couple of demo pods, a service, etc. in my AKS-1 cluster, and you can see below that nothing exists on my AKS-2 cluster.
First, then, let’s create a scheduled backup on AKS-1.
velero schedule create "test-schedule" --schedule="0 0-23 * * *" --include-namespaces test --exclude-resources persistentvolumes --ttl 24h0m0s
velero get schedule
This has given us a schedule called “test-schedule” which will run every hour and back up the test namespace, and you can see that in the results.
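The cron expression `0 0-23 * * *` fires at minute 0 of every hour, which is the same as `0 * * * *`. Combined with the 24 hour TTL, that gives a rough idea of how many backups sit in storage at any time. The arithmetic can be sketched as below; `retained_backups` is a hypothetical helper, and the real count can run slightly higher because expired backups are only cleaned up periodically.

```shell
# Rough steady-state estimate: backups retained = TTL / schedule interval.
# Both arguments are in minutes; integer division rounds down.
retained_backups() {
  local ttl_minutes="$1" interval_minutes="$2"
  echo $(( ttl_minutes / interval_minutes ))
}

# Hourly schedule with a 24h TTL
retained_backups $((24 * 60)) 60
```

This is handy when sizing the storage account or deciding whether a longer TTL is affordable.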
If we want to execute a backup straight away from that schedule, we run the following.
velero backup create --from-schedule test-schedule
# It will output a command to run similar to this
velero backup describe test-schedule-20231109024158
You’ll see it runs very fast but like above you’ll end up with something like this.
OK, there is a lot to unpack here.
- First is the command to create the backup
- Then we see how to describe the backup and get its details
- We can see it Completed
- Look at the elapsed time, 2 seconds……
- 24 items were captured as part of this
That’s super impressive!!! I’ve seen Velero on heavily populated clusters with hundreds of objects take no more than 10 seconds to backup as well!
Of course, if you wanted to do an ad hoc backup you can. The command for that would be
velero backup create my-adhoc-backup # This will backup the entire cluster, all resources, all namespaces
Time to Restore
So we have a backup. Now let’s imagine I have just lost the entire Azure Region and my cluster is dead; I can’t access it. I need to restore to DR, so how do I do that?
This is the beauty I have found with Velero: it is super simple. First, let’s switch to our DR cluster, AKS-2, and see what backups we have, remembering that all the backups were taken on AKS-1 and nothing is running in AKS-2.
OK, that’s good. I am on my DR cluster with nothing running, but I can see the Prod backup I took earlier from my AKS-1 cluster. That works because both clusters access the same Azure Storage Account for backups and restores, so as long as that is geo-redundant I am good to go.
To restore my backup to my new cluster, let’s try this…
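In a real failover you probably won’t have the backup name to hand, so it helps to pick the newest backup for a schedule automatically. The sketch below relies on the naming seen earlier, `<schedule>-<YYYYMMDDhhmmss>`, which means a plain lexical sort orders the names by time. `latest_backup_for_schedule` is a hypothetical helper and reads candidate names from stdin.

```shell
# Hypothetical helper: read backup names on stdin and print the newest
# one created from the given schedule (names end in a 14-digit timestamp).
latest_backup_for_schedule() {
  local schedule="$1"
  grep -E "^${schedule}-[0-9]{14}\$" | sort | tail -n 1
}

# Example with literal names, so no cluster is needed to try it:
printf '%s\n' \
  test-schedule-20231109014158 \
  test-schedule-20231109024158 \
  my-adhoc-backup |
  latest_backup_for_schedule test-schedule
```

Against a live cluster the names could be fed in with something like `kubectl -n velero get backups.velero.io -o name | sed 's|.*/||'`, assuming Velero is installed in the `velero` namespace. The schedule name is used inside a regular expression, so names containing characters such as `.` would need escaping.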
velero create restore --from-backup test-schedule-20231109024158
velero restore describe test-schedule-20231109024158-20231109025349
That’s fast!!! The restore here took 4 seconds to complete. That can’t be right, can it? Let’s check the namespace.
Everything is there, my pods, service, all the secrets and configmaps etc and everything is running, in a new cluster in a new Azure Region!! It worked!
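If you want to script that check rather than eyeball it, a sketch like the following waits for the restored pods to become Ready. `verify_restored_namespace` is a hypothetical helper, the `test` namespace default matches the schedule used earlier, and the 120s timeout is an arbitrary assumption. It will report a failure if the namespace contains pods that never become Ready, such as completed Job pods.

```shell
# Sketch: confirm the namespace exists on the DR cluster, then wait for
# every pod in it to report Ready.
verify_restored_namespace() {
  local ns="${1:-test}" timeout="${2:-120s}"
  if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
    echo "namespace $ns not found" >&2
    return 1
  fi
  kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout"
}
```

Running `verify_restored_namespace test` after the restore gives a simple pass or fail signal for a DR runbook.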
That was just a quick, high-level overview of Velero and how you can use it for multi-region DR with Azure Kubernetes Service. It was a basic demonstration, but the principle is the same whether you are doing a single cluster backup and restore, a cluster migration, or, as demonstrated here, regional disaster recovery of clusters.
To remove all the test resources we created, just delete the original resource group.
az group delete --name $myResourceGroup --yes