Longhorn Backup and Restore
This article breaks down using Longhorn (version 1.7) for backups, self-hosting, and recovery. It’s aimed at seasoned DevOps and SysAdmin folks who like doing things themselves. You’ll find a step-by-step guide to setting up and restoring backups with Longhorn, Velero, and GitOps. The goal is to show you a tried-and-true setup for getting your backups and restores working smoothly.
## Brief Introduction
Longhorn is a cloud-native distributed block storage solution tailored for Kubernetes. It is designed with a focus on simplicity, reliability, and minimizing resource consumption. Longhorn supports essential features such as snapshotting for consistent backups, block and filesystem modes, encryption, and automated backups.
Although many cloud providers offer integrated backup and restore solutions, self-hosted environments require us to manage these tasks independently. Often, the simplest solution appears to be a full-disk backup of a VM (e.g., Proxmox). However, this method incurs the overhead of backing up unnecessary data, such as images and OS tooling. Furthermore, selectively restoring a few resources is difficult, and this approach may even compromise cluster consistency in some cases.
A popular strategy is to manage all manifests and configurations using GitOps, with persistent data stored separately—usually on NAS or S3-like storage solutions. During a restore, the system is minimally bootstrapped, manifests are re-applied while keeping pods in a pending status, and then data is restored. Unfortunately, this doesn’t work well with Longhorn’s dynamic PVC provisioning as PVCs have non-deterministic dynamic names that cannot be linked to new payloads.
The solution is to use Velero for manifest backups, Longhorn’s native backup for data, and GitOps for bootstrapping.
## More About Longhorn Volumes
To fully grasp Longhorn’s functionality, it’s crucial to understand the relationship between Longhorn volumes and Kubernetes Persistent Volumes. Think of Longhorn volumes as disks. Kubernetes Persistent Volumes (PVs) represent these disks to Kubernetes, while Persistent Volume Claims (PVCs) act as links between the volume and the workload (similar to mounting a drive).
There are two types of integrations with Longhorn volumes:
- Static volumes: these are manually created in the UI and then linked in manifests.
- Dynamic volumes: these are created dynamically by the Longhorn controller.
Static volumes have deterministic naming, making it easy to re-link them between pods and backups. Dynamic volumes, on the other hand, get generated, opaque names. Without storing Longhorn’s internal state, mounting an old dynamic volume to a new PVC can be extremely complicated, making restoration quite challenging.
If you’re using only static volumes, you’re in good shape (see my report here), and this article might only be partially relevant to you. However, if you have dynamic PVCs and want to preserve them during backups, I hope you find the information in this article useful.
The insights shared here are derived from practical experience.
Example of a static volume. `volumeHandle` is the name of the volume in Longhorn:
```yaml
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: "data"
spec:
  storageClassName: longhorn-static
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: driver.longhorn.io
    fsType: "ext4"
    volumeHandle: "my-longhorn-volume"
    volumeAttributes:
      dataLocality: "best-effort"
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: "5Gi"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: "data"
spec:
  storageClassName: longhorn-static
  volumeName: "data"
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "5Gi"
```
Example of a dynamic volume:

```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "data"
spec:
  storageClassName: longhorn
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "5Gi"
```
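If you ever need to find out which Longhorn volume backs a dynamic PVC (for example, to match it against a backup), you can follow the PVC to its PV and read the CSI `volumeHandle`. A small sketch, assuming a PVC named `data` in the current namespace:

```sh
# Resolve the PV bound to the PVC, then print the Longhorn volume name
PV=$(kubectl get pvc data -o jsonpath='{.spec.volumeName}')
kubectl get pv "$PV" -o jsonpath='{.spec.csi.volumeHandle}'
```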
## HA/DR
High-Availability (HA) and Disaster Recovery (DR) are distinct concepts in IT infrastructure management.
HA functions within the same cluster, aiming to provide uninterrupted operation in the face of failures within the active cluster. For instance, it might involve rebuilding replicas in the background if a volume gets corrupted or a node is evicted. HA assumes a functioning cluster with capabilities for automatic self-healing.
DR comes into play when everything is gone—imagine a complete failure or unrecoverable cluster situation. It is your last line of defense, typically involving some downtime and possible data loss (up to the point of the last backup). DR usually requires manual intervention and doesn’t necessarily need automation, although modern solutions are increasingly focusing on automating these processes.
Longhorn offers support for both:
- HA: Through replicas (when count is more than one) and automatic balancing/rebuilding.
- DR: By providing backups and snapshots stored externally, as well as DR volumes.
This article primarily focuses on DR.
## GitOps
It’s crucial to store cluster configuration in a reproducible way. Nowadays, GitOps and Infrastructure as Code (IaC) approaches are commonly used to maintain manifests that enable the creation of a reproducible cluster at any time, serving as a single source of truth.
It is also important to include all stateful configuration, such as Longhorn default settings, backup storage, and credentials, within this framework. Without it, you may have a hard time restoring the system from a backup.
For managing credentials, your approach may vary. However, I’ve found Sealed Secrets to be convenient. Just remember to periodically update the restore key due to key rotation. For managing manifests, I use Kustomize.
Overall, the cluster configuration using GitOps might look something like this:
```
cluster/
├── main.key
├── sealed-secrets
├── longhorn-system
├── velero
└── ...other apps...
```

`main.key` is the Sealed Secrets backup key and is stored outside of Git.
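A minimal `kustomization.yaml` sketch for the root of such a layout (the resource paths are assumptions matching the tree above):

```yaml
# cluster/kustomization.yaml -- adjust resource paths to your repository
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - sealed-secrets
  - longhorn-system
  - velero
```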
## Environment
In the text below, we are working within a multi-node, HA control plane environment:
- Kubernetes 1.30
Key components installed:
- Longhorn 1.7.2
- Velero 1.14.1
# Backup Configuration
This configuration is primarily based on a GitHub comment, with a few additional enhancements.
## Longhorn
Longhorn is configured to take a full backup every day to an S3 endpoint (MinIO outside the cluster).

`values.yaml` for the Helm chart:

```yaml
# ...
defaultSettings:
  backupTarget: s3://longhorn@minio/
  backupTargetCredentialSecret: cold-storage
  replicaAutoBalance: best-effort
  defaultReplicaCount: 2
  replicaZoneSoftAntiAffinity: false
  replicaSoftAntiAffinity: false
  dataLocality: best-effort
  priorityClass: longhorn-critical
persistence:
  defaultClass: false
  defaultClassReplicaCount: 2
# ...
```
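For reference, `backupTargetCredentialSecret` points to a Secret in the `longhorn-system` namespace holding AWS-style keys. A sketch with placeholder values (in practice you would ship this as a Sealed Secret):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cold-storage
  namespace: longhorn-system
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "minio-access-key"       # placeholder
  AWS_SECRET_ACCESS_KEY: "minio-secret-key"   # placeholder
  AWS_ENDPOINTS: "https://minio.example.com"  # placeholder: your S3 endpoint URL
```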
Default storage class:

```yaml
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  fromBackup: ""
  fsType: "ext4"
  dataLocality: "best-effort"
  migratable: "true"
```
A daily backup is configured via the Longhorn UI (a declarative alternative is shown below).
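If you prefer to keep the schedule in GitOps as well, a `RecurringJob` manifest is the declarative equivalent; a sketch assuming the `default` group and a 7-day retention:

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-backup
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"    # assumption: run at 02:00 every day
  task: backup
  groups:
    - default          # volumes in the default group get this job
  retain: 7            # assumption: keep the last 7 backups
  concurrency: 2
```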
## Velero
Velero configuration encompasses the following:
- An AWS plugin for uploading backups
- Minimal invasive settings: no per-node daemons, no snapshots, and no data movement
- Exclusions: Longhorn namespace (managed through GitOps), secrets (restored by GitOps using Sealed Secrets), and system namespace
`values.yaml` for the Helm chart:

```yaml
configuration:
  defaultSnapshotMoveData: false
  backupStorageLocation:
    - name: cold-storage
      provider: aws
      default: true
      bucket: velero
      config:
        region: minio
        checksumAlgorithm: ""
        s3ForcePathStyle: true
        s3Url: # **** masked, contains URL to S3
      credential:
        name: cold-storage
        key: CONFIG
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
deployNodeAgent: false
snapshotsEnabled: false
credentials:
  useSecret: false
schedules:
  cloud:
    disabled: false
    schedule: "0 4 * * *"
    useOwnerReferencesInBackup: false
    paused: false
    template:
      ttl: "240h"
      storageLocation: cold-storage
      defaultVolumesToFsBackup: false
      snapshotVolumes: false
      includedNamespaces:
        - '*'
      excludedNamespaces:
        - 'kube-system'
        - 'longhorn-system'
      excludedClusterScopedResources: &res
        - persistentvolumes
        - persistentvolumeclaims
        - volumesnapshots
        - volumesnapshotcontents
        - volumes.longhorn.io
        - backups.longhorn.io
        - volumeattachments.longhorn.io
        - backupvolumes.longhorn.io
        - settings.longhorn.io
        - snapshots.longhorn.io
        - nodes.longhorn.io
        - replicas.longhorn.io
        - engines.longhorn.io
        - backingimagedatasources.longhorn.io
        - backingimagemanagers.longhorn.io
        - volumesnapshots.snapshot.storage.k8s.io
        - systembackups.longhorn.io
        - backingimages.longhorn.io
        - sharemanagers.longhorn.io
        - instancemanagers.longhorn.io
        - secrets
      excludedNamespaceScopedResources: *res
```
# Restore Procedure
Here’s a high-level overview of the restore process, assuming you start with a freshly created empty cluster:
- Apply a minimal set of manifests from GitOps, including Velero and, if applicable, the secrets manager (such as Sealed Secrets in this example).
- Restore the latest backup using Velero (for manifest restoration).
- Restore data from Longhorn volumes.
Notes:
- The examples provided use Kustomize, but other tools should work in a similar manner.
- The order of operations is crucial for a successful restoration.
- Do not restore Longhorn manifests or Velero itself from the backup—use those from your GitOps repository to ensure consistency.
## Restore Secrets
Secrets are essential for properly initializing backup targets for both Longhorn and Velero.
The example below applies only if you are using Sealed Secrets; otherwise, follow your own process.
By default, Sealed Secrets is installed in the `kube-system` namespace.
To deploy the controller, apply the following manifests:
```sh
kubectl apply -k sealed-secrets
```
To restore keys, execute these commands:
```sh
kubectl -n kube-system get secret | grep sealed | cut -f 1 -d ' ' | xargs -n 1 kubectl delete -n kube-system secret
kubectl apply -f main.key
kubectl -n kube-system rollout restart deploy/sealed-secrets-controller
```
Wait until the controller has successfully restarted.
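You can wait for the restart to finish programmatically; a small sketch using the same deployment name:

```sh
# Block until the controller deployment finishes rolling out
kubectl -n kube-system rollout status deploy/sealed-secrets-controller --timeout=120s
```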
## Install Longhorn
Create the Longhorn system using the manifests:
```sh
kubectl apply -k longhorn
```
On occasion, you may need to apply this command multiple times, with intervals in between, because resources that depend on freshly created CRDs can fail until those CRDs are registered.
Ensure that the Longhorn system is fully recovered by checking the status of the pods.
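A simple retry-and-wait sketch for this step (the sleep and timeout values are assumptions; tune them to your cluster):

```sh
# Re-apply until all resources, including CRD-dependent ones, apply cleanly
until kubectl apply -k longhorn; do sleep 15; done
# Wait for every Longhorn pod to become Ready
kubectl -n longhorn-system wait --for=condition=Ready pods --all --timeout=10m
```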
## Install Velero
Install Velero with the following command:
```sh
kubectl apply -k velero
```
As with Longhorn, you might need to run this command multiple times, with intervals in between, due to possible CRD definition failures.
Note: If you have scheduled automatic backups in your manifests, consider temporarily disabling them. This precaution prevents any issues with existing backups during the recovery process, as the duration of recovery can be unpredictable.
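One way to pause it with the Velero CLI, assuming the schedule name used in the restore step below (alternatively, set `paused: true` in the Helm values and re-apply):

```sh
# Pause the scheduled backup while recovery is in progress
velero schedule pause velero-cloud
# ...and re-enable it once the cluster is healthy again:
velero schedule unpause velero-cloud
```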
## Restore Manifests
Restore manifests from the backup. These manifests include Longhorn-specific metadata, ensuring that Longhorn does not create new volumes unnecessarily (i.e., a dynamic PVC will still point to the same Longhorn volume).
First, check the latest backup available:
```sh
velero backup get
```
If you used a scheduled backup, you can restore using:
```sh
velero restore create --wait --from-schedule velero-cloud
```
Alternatively, if you need to use a specific backup:
```sh
velero restore create --wait --from-backup foo-bar-example
```
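Afterwards, it’s worth checking the restore status and any warnings (replace the placeholder with the name printed by the previous command):

```sh
# List restores and inspect the details of a specific one
velero restore get
velero restore describe <restore-name> --details
```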
## Restore Data
This part involves manual steps through the Longhorn UI.
Access the Longhorn UI with the following command:
```sh
kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80
```
Open your browser and navigate to http://localhost:8080.
In the UI, perform the following steps:
- Click on “Backups”.
- Select all volumes (consider setting pagination to “show all”).
- Click “Restore from last backup”.
- Choose “Use old name”.
- Click “Apply”.
Next, go to the “Volumes” section and wait for all volumes to reach the “Detached” state. This process can take some time depending on the volume sizes.
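Instead of polling the UI, you can also watch the volume state from the CLI:

```sh
# Watch Longhorn volume custom resources until all report a detached state
kubectl -n longhorn-system get volumes.longhorn.io -w
```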
Once ready:
- Go back to the “Volumes” section.
- Select all volumes (be mindful of pagination!).
- Click “Create PV/PVC” and check the option to reuse the old name.
- Click “Apply”.
The restoration process may take several minutes.
Finally, verify that your Pods are up and running.
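A quick way to spot anything still stuck:

```sh
# List pods that are neither Running nor Completed, across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
```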
Warning
The following information is not stored in standard backups (or at least not accessible via the UI):
- Filesystem type (e.g., EXT4, XFS)
- Encryption flag
- Number of replicas
Incorrect configurations during restoration can result in pods getting stuck in the “ContainerCreating” status.
To mitigate these issues, it’s best to adopt a uniform configuration for all volumes. If you use encryption, apply it consistently across all volumes. Similarly, use a single filesystem type consistently (e.g., not a mix of XFS and EXT4). The number of replicas will follow the default configuration settings.
It’s important to note that PVCs can only be restored if they were bound to a workload at the time of backup. This is usually not an issue, as dynamic PVCs imply that a workload is present. However, if you want to maintain a PV that can be attached to different workloads in the future, consider using static volumes, as described earlier in the article.
# Final Notes
Having used Longhorn for many years across various setups—from geographically distributed clusters to same-rack data centers — I can attest to its flexibility and performance in delivering distributed storage solutions.
Longhorn is an impressive tool that provides flexible and relatively performant distributed storage to its users. Its snapshot feature enables consistent block-level backups without downtime, vital for workloads like databases. The variety of embedded backup types and policies allows even less experienced users to manage persistent storage reliably. However, challenges can arise, particularly during data restoration.
Key takeaways:
- Ensure that your network is reliable. High latency or significant packet loss can lead to node disconnections and rebalancing, although dedicated or high-end switches aren’t necessary since Longhorn has some tolerance for this.
- Block mode performs well for medium I/O loads, but for high-load databases, native replication with bare disks is preferable. For projects like CloudNative PostgreSQL, setting replicas to 1 and locality to `strict-local` offers most of Longhorn’s benefits (such as backup and snapshot) while minimizing overhead (see the sketch after this list).
- Whenever possible, use static volumes for workloads with low dynamics, such as storage or databases. In this scenario, even if manifests weren’t backed up, you can more easily link data from backups with manifests from GitOps.
- Prefer the `Retain` reclaim policy, and avoid removing PVs unless they are backed up. Having a few dangling volumes in Longhorn is preferable to losing data.
- Don’t panic if new volumes are stuck — it might be due to a lack of available space. By default, Longhorn reserves the declared amount, but this can be adjusted.
- Use `dataLocality: best-effort` if your cluster isn’t highly dynamic.
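As referenced above, here is a sketch of a StorageClass for the single-replica, `strict-local` case (the class name is an assumption):

```yaml
# StorageClass for databases that handle replication at the application level
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-local        # assumption: any name works
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
parameters:
  numberOfReplicas: "1"           # the database replicates its own data
  dataLocality: "strict-local"    # keep the single replica on the workload's node
  staleReplicaTimeout: "30"
```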
Regularly perform actual restore procedures. Without this practice, or if shortcuts are taken (like retrieving information from a running cluster rather than relying solely on backups), your backups might become useless at some point. For instance, it was only through regular restore drills that I ran into the Sealed Secrets key-rotation issue, which highlighted the need to update the main key in backups regularly.