
Longhorn Backup and Restore

This article breaks down how to use Longhorn (version 1.7) for backups and recovery in self-hosted environments. It's aimed at seasoned DevOps and SysAdmin folks who like doing things themselves. You'll find a step-by-step guide to setting up and restoring backups with Longhorn, Velero, and GitOps. The goal is to show you a tried-and-true setup for getting your backups and restores working smoothly.

## Brief Introduction

Longhorn is a cloud-native distributed block storage solution tailored for Kubernetes. It is designed with a focus on simplicity, reliability, and minimizing resource consumption. Longhorn supports essential features such as snapshotting for consistent backups, block and filesystem modes, encryption, and automated backups.

Although many cloud providers offer integrated backup and restore solutions, self-hosted environments require us to manage these tasks independently. Often, the simplest solution appears to be a full-disk backup of a VM (e.g., on Proxmox). However, this method carries the significant overhead of backing up unnecessary data, such as container images and OS tooling. Furthermore, selectively restoring a few resources is difficult, and this approach may even compromise cluster consistency in some cases.

A popular strategy is to manage all manifests and configurations using GitOps, with persistent data stored separately, usually on NAS or S3-like storage. During a restore, the system is minimally bootstrapped, manifests are re-applied while pods stay in a pending state, and then the data is restored. Unfortunately, this doesn't work well with Longhorn's dynamic PVC provisioning, because dynamically provisioned volumes get non-deterministic names that cannot easily be linked back to the restored data.

The solution is to use Velero for manifest backups, Longhorn’s native backup for data, and GitOps for bootstrapping.

## More About Longhorn Volumes

To fully grasp Longhorn’s functionality, it’s crucial to understand the relationship between Longhorn volumes and Kubernetes Persistent Volumes. Think of Longhorn volumes as disks. Kubernetes Persistent Volumes (PVs) represent these disks to Kubernetes, while Persistent Volume Claims (PVCs) act as links between the volume and the workload (similar to mounting a drive).

There are two types of integrations with Longhorn volumes:

  • Static volumes: these are manually created in the UI and then linked in manifests.
  • Dynamic volumes: these are created dynamically by the Longhorn controller.

Static volumes have deterministic naming, making it easy to re-link them between pods and backups. Dynamic volumes, on the other hand, are designed with obscure volume names. Without storing Longhorn’s internal information, mounting an old dynamic volume to a new PVC can be extremely complicated, making restoration quite challenging.

If you’re using only static volumes, you’re in good shape (see my report here), and this article might only be partially relevant to you. However, if you have dynamic PVCs and want to preserve them during backups, I hope you find the information in this article useful.

The insights shared here are derived from practical experience.

Example of a static volume.

volumeHandle is the name of the volume in Longhorn.

---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: "data"
spec:
  storageClassName: longhorn-static
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: driver.longhorn.io
    fsType: "ext4"
    volumeHandle: "my-longhorn-volume"
    volumeAttributes:
      dataLocality: "best-effort"
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: "5Gi"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: "data"
spec:
  storageClassName: longhorn-static
  volumeName: "data"
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "5Gi"

Example of a dynamic volume.

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "data"
spec:
  storageClassName: longhorn
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "5Gi"

## HA/DR

High-Availability (HA) and Disaster Recovery (DR) are distinct concepts in IT infrastructure management.

HA functions within the same cluster, aiming to provide uninterrupted operation in the face of failures within the active cluster. For instance, it might involve rebuilding replicas in the background if a volume gets corrupted or a node is evicted. HA assumes a functioning cluster with capabilities for automatic self-healing.

DR comes into play when everything is gone—imagine a complete failure or unrecoverable cluster situation. It is your last line of defense, typically involving some downtime and possible data loss (up to the point of the last backup). DR usually requires manual intervention and doesn’t necessarily need automation, although modern solutions are increasingly focusing on automating these processes.

Longhorn offers support for both:

  • HA: Through replicas (when count is more than one) and automatic balancing/rebuilding.
  • DR: By providing backups and snapshots stored externally, as well as DR volumes.

This article primarily focuses on DR.

## GitOps

It’s crucial to store data in a reproducible way. Nowadays, GitOps and Infrastructure as Code (IaC) approaches are commonly used to maintain manifests that enable the creation of a reproducible cluster at any time, serving as a single source of truth.

It is also important to include all state-related configuration, such as Longhorn default settings, backup storage, and credentials, within this framework. Without it, you may have a hard time restoring the system from a backup.

For managing credentials, your approach may vary. However, I’ve found Sealed Secrets to be convenient. Just remember to periodically refresh the backed-up restore key, since the controller rotates its keys. For managing manifests, I use Kustomize.
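
One way to refresh that offline copy is to export the current sealing keys from the cluster; the label selector below is the one the Sealed Secrets controller applies to its keys:

kubectl -n kube-system get secret \
  -l sealedsecrets.bitnami.com/sealed-secrets-key \
  -o yaml > main.key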

Overall, the cluster configuration using GitOps might look something like this:

cluster/
├── main.key
├── sealed-secrets
├── longhorn-system
├── velero
└── ...other apps...
  • main.key is the Sealed Secrets backup key; it is stored outside of Git
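
For reference, a minimal sketch of the root kustomization tying these directories together (the directory names follow the tree above; adjust to your own layout):

---
# cluster/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - sealed-secrets
  - longhorn-system
  - velero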

## Environment

In the text below, we are working within a multi-node, HA control plane environment:

  • Kubernetes 1.30

Key components installed:

  • Longhorn 1.7.2
  • Velero 1.14.1

# Backup Configuration

This configuration is primarily based on a GitHub comment, with a few additional enhancements.

## Longhorn

Longhorn is configured to take a full backup every day to an S3 endpoint (MinIO outside the cluster).

values.yaml for Helm chart

# ...
defaultSettings:
  backupTarget: s3://longhorn@minio/
  backupTargetCredentialSecret: cold-storage

  replicaAutoBalance: best-effort
  defaultReplicaCount: 2
  replicaZoneSoftAntiAffinity: false
  replicaSoftAntiAffinity: false
  dataLocality: best-effort
  priorityClass: longhorn-critical

persistence:
  defaultClass: false
  defaultClassReplicaCount: 2

# ...
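
The cold-storage secret referenced by backupTargetCredentialSecret holds the S3 credentials for the backup target. A minimal sketch of what it might contain (the key names are the ones Longhorn expects for S3 targets; all values below are placeholders), kept in Git as a Sealed Secret in my setup:

---
apiVersion: v1
kind: Secret
metadata:
  name: cold-storage
  namespace: longhorn-system
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "minio-access-key"                 # placeholder
  AWS_SECRET_ACCESS_KEY: "minio-secret-key"             # placeholder
  AWS_ENDPOINTS: "https://minio.example.internal:9000"  # placeholder MinIO URL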

Default storage class


---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  fromBackup: ""
  fsType: "ext4"
  dataLocality: "best-effort"
  migratable: "true"

A daily backup is configured via the Longhorn UI.
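
If you would rather keep that schedule in Git as well, Longhorn also exposes recurring jobs as a CRD. A minimal sketch, assuming a daily backup of every volume in the default group (the name, cron expression, and retention below are assumptions):

---
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-backup           # assumed name
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"            # assumed schedule: every day at 02:00
  task: backup                 # create a backup (as opposed to a snapshot)
  groups:
    - default                  # applies to volumes in the default group
  retain: 7                    # assumed retention: keep the last 7 backups
  concurrency: 2               # how many volumes to back up in parallel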

## Velero

Velero configuration encompasses the following:

  • An AWS plugin for uploading backups
  • Minimal invasive settings: no per-node daemons, no snapshots, and no data movement
  • Exclusions: Longhorn namespace (managed through GitOps), secrets (restored by GitOps using Sealed Secrets), and system namespace

values.yaml for Helm chart

configuration:
  defaultSnapshotMoveData: false

  backupStorageLocation:
  - name: cold-storage
    provider: aws
    default: true
    bucket: velero
    config:
      region: minio
      checksumAlgorithm: ""
      s3ForcePathStyle: true
      s3Url: # **** masked, contains URL to S3
    credential:
      name: cold-storage
      key: CONFIG


initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins


deployNodeAgent: false
snapshotsEnabled: false
credentials:
  useSecret: false


schedules:
  cloud:
    disabled: false
    schedule: "0 4 * * *"
    useOwnerReferencesInBackup: false
    paused: false
    template:
      ttl: "240h"
      storageLocation: cold-storage

      defaultVolumesToFsBackup: false
      snapshotVolumes: false
      includedNamespaces:
      - '*'
      excludedNamespaces:
      - 'kube-system'
      - 'longhorn-system'
      excludedClusterScopedResources: &res
      - persistentvolumes
      - persistentvolumeclaims
      - volumesnapshots
      - volumesnapshotcontents
      - volumes.longhorn.io
      - backups.longhorn.io
      - volumeattachments.longhorn.io
      - backupvolumes.longhorn.io
      - settings.longhorn.io
      - snapshots.longhorn.io
      - nodes.longhorn.io
      - replicas.longhorn.io
      - engines.longhorn.io
      - backingimagedatasources.longhorn.io
      - backingimagemanagers.longhorn.io
      - volumesnapshots.snapshot.storage.k8s.io
      - systembackups.longhorn.io
      - backingimages.longhorn.io
      - sharemanagers.longhorn.io
      - instancemanagers.longhorn.io
      - secrets
      excludedNamespaceScopedResources: *res
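
Once Velero is running, it is worth confirming that the backup storage location is reachable before relying on the schedule. A quick check:

velero backup-location get   # the cold-storage location should report Available
velero schedule get          # the daily schedule should be listed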

# Restore Procedure

Here’s a high-level overview of the restore process, assuming you start with a freshly created empty cluster:

  1. Apply a minimal set of manifests from GitOps, including Velero and, if applicable, the secrets manager (such as Sealed Secrets in this example).
  2. Restore the latest backup using Velero (for manifest restoration).
  3. Restore data from Longhorn volumes.

Notes:

  • The examples provided use Kustomize, but other tools should work in a similar manner.
  • The order of operations is crucial for a successful restoration.
  • Do not restore Longhorn manifests or Velero itself from the backup—use those from your GitOps repository to ensure consistency.

## Restore Secrets

Secrets are essential for properly initializing backup targets for both Longhorn and Velero.

The example below applies only if you are using Sealed Secrets. Utilize your own process if you are not using Sealed Secrets.

By default, Sealed Secrets are installed in the kube-system namespace.

To deploy the controller, apply the following manifests:

kubectl apply -k sealed-secrets

To restore keys, execute these commands:

kubectl -n kube-system get secret | grep sealed | cut -f 1 -d ' ' | xargs -n 1 kubectl delete -n kube-system secret
kubectl apply -f main.key
kubectl -n kube-system rollout restart deploy/sealed-secrets-controller

Wait until the controller has successfully restarted.
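
A simple way to wait for that, as a sketch:

kubectl -n kube-system rollout status deploy/sealed-secrets-controller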

## Install Longhorn

Create the Longhorn system using the manifests:

kubectl apply -k longhorn

On occasion, you may need to apply this command multiple times, with short intervals in between, because resources that depend on Longhorn's CRDs can fail to apply before the CRDs are registered.

Ensure that the Longhorn system is fully recovered by checking the status of the pods.
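
For example, a quick sketch of how to check:

kubectl -n longhorn-system get pods
kubectl -n longhorn-system wait --for=condition=Ready pod --all --timeout=10m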

## Install Velero

Install Velero with the following command:

kubectl apply -k velero

As with Longhorn, you might need to run this command multiple times, with intervals in between, due to possible CRD definition failures.

Note: If you have scheduled automatic backups in your manifests, consider temporarily disabling them. This precaution prevents any issues with existing backups during the recovery process, as the duration of recovery can be unpredictable.
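
One option is to pause the schedule for the duration of the restore, assuming it is named velero-cloud (as used below) and your Velero CLI version supports pausing schedules; otherwise, set paused: true in the Helm values:

velero schedule pause velero-cloud
# ...and once the restore is complete:
velero schedule unpause velero-cloud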

## Restore Manifests

Restore the manifests from the backup. These manifests include Longhorn-specific metadata, ensuring that Longhorn does not create new volumes unnecessarily (i.e., a dynamic PVC will still point to the same Longhorn volume).

First, check the latest backup available:

velero backup get

If you used a scheduled backup, you can restore using:

velero restore create --wait --from-schedule velero-cloud

Alternatively, if you need to use a specific backup:

velero restore create --wait --from-backup foo-bar-example
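
You can follow progress and confirm the result with the usual Velero commands (the restore name below is a placeholder):

velero restore get
velero restore describe <restore-name> --details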

## Restore Data

This part involves manual steps through the Longhorn UI.

  • Access the Longhorn UI with the following command:

    kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80
    
  • Open your browser and navigate to http://localhost:8080.

  • In the UI, perform the following steps:

    • Click on “Backups”.
    • Select all volumes (consider setting pagination to “show all”).
    • Click “Restore from last backup”.
    • Choose “Use old name”.
    • Click “Apply”.

Next, go to the “Volumes” section and wait for all volumes to reach the “Detached” state. This process can take some time depending on the volume sizes.

Once ready:

  • Go back to the “Volumes” section.
  • Select all volumes (be mindful of pagination!).
  • Click “Create PV/PVC” and check the option to reuse the old name.
  • Click “Apply”.

The restoration process may take several minutes.

Finally, verify that your Pods are up and running.
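
A quick sketch of a final check:

kubectl get pvc --all-namespaces    # all PVCs should be Bound
kubectl get pods --all-namespaces   # no pods stuck in ContainerCreating or Pending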

Warning

The following information is not stored in standard backups (or at least not accessible via the UI):

  • Filesystem type (e.g., EXT4, XFS)
  • Encryption flag
  • Number of replicas

Incorrect configurations during restoration can result in pods getting stuck in the “ContainerCreating” status.

To mitigate these issues, it’s best to adopt a uniform configuration for all volumes. If you use encryption, apply it consistently across all volumes. Similarly, use a single filesystem type consistently (e.g., not a mix of XFS and EXT4). The number of replicas will follow the default configuration settings.

It’s important to note that PVCs can only be restored if they were bound to a workload at the time of backup. This is usually not an issue, as dynamic PVCs imply that a workload is present. However, if you want to maintain a PV that can be attached to different workloads in the future, consider using static volumes, as described earlier in the article.

# Final Notes

Having used Longhorn for many years across various setups, from geographically distributed clusters to same-rack data centers, I can attest to its flexibility and performance.

Longhorn is an impressive tool that provides flexible and relatively performant distributed storage. Its snapshot feature enables consistent block-level backups without downtime, which is vital for workloads like databases. The variety of built-in backup types and policies allows even less experienced users to manage persistent storage reliably. However, challenges can arise, particularly during data restoration.

Key takeaways:

  • Ensure that your network is reliable. High latency or significant packet loss can lead to node disconnections and rebalancing, although dedicated or high-end switches aren’t necessary since Longhorn has some tolerance for this.
  • Block mode performs well for medium I/O loads, but for high-load databases, native replication on bare disks is preferable. For projects like CloudNativePG, setting replicas to 1 and data locality to strict-local offers most of Longhorn’s benefits (such as backups and snapshots) while minimizing overhead; see the sketch after this list.
  • Whenever possible, use static volumes for workloads with low dynamics, such as storage or databases. In this scenario, even if manifests weren’t backed up, you can more easily link data from backups and manifests from GitOps.
  • Prefer the retain policy, and avoid removing PVs unless backed up. Having a few dangling volumes in Longhorn is preferable to losing data.
  • Don’t panic if new volumes are stuck — it might be due to the lack of available space. By default, Longhorn reserves the declared amount, but this can be adjusted.
  • Use “dataLocality: best-effort” if your cluster isn’t highly dynamic.
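
For the CloudNativePG-style case mentioned above, a hedged sketch of a dedicated StorageClass (the name is an assumption; the parameters mirror the default class shown earlier):

---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-single-replica   # assumed name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain             # prefer retain, as recommended above
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "1"           # replication is handled by the database itself
  dataLocality: "strict-local"    # keep the single replica on the node running the workload
  fsType: "ext4"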

Regularly perform actual restore procedures. Without this practice, or if shortcuts are taken (like retrieving information from the running cluster rather than relying solely on backups), your backups might become useless at some point. For instance, it was only through regular restore drills that I encountered the Sealed Secrets key rotation issue, which highlighted the need to regularly update the main key kept alongside the backups.