Sicherung aufgrund des Fehlers „TooManySnapshots“ fehlgeschlagen

Beschreibung

Während der Sicherung werden Longhorn-Volumes gesichert, indem der Volume-Snapshot erstellt und an einen Remote-Speicherort gesendet wird. Wenn Probleme bei der Snapshot-Erstellung des Longhorn-Volumes auftreten und mehr als 248 Snapshots für das Volume erstellt werden sollen, ist die Sicherung nicht erfolgreich.

Sie können auf folgende Arten überprüfen, ob die Sicherung aufgrund des Fehlers TooManySnapshots fehlgeschlagen ist:

Durch Überprüfen der Velero-Protokolle:
```
kubectl logs  -n velero -l app.kubernetes.io/name=velero -c velero |grep "Waiting for volumesnapshotcontents" kubectl logs  -n velero -l app.kubernetes.io/name=velero -c velero |grep "Waiting for volumesnapshotcontents"
```
Beispielausgabe:
```
time="2023-12-15T08:15:59Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f073b88e-bd01-40a8-a645-ec929e276cef to have snapshot handle. Retrying in 5s" backup=velero/daily-2 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:182" pluginName=velero-plugin-for-csi
time="2023-12-15T08:15:59Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f073b88e-bd01-40a8-a645-ec929e276cef to have snapshot handle. Retrying in 5s" backup=velero/daily-2 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:182" pluginName=velero-plugin-for-csitime="2023-12-15T08:15:59Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f073b88e-bd01-40a8-a645-ec929e276cef to have snapshot handle. Retrying in 5s" backup=velero/daily-2 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:182" pluginName=velero-plugin-for-csi
time="2023-12-15T08:15:59Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f073b88e-bd01-40a8-a645-ec929e276cef to have snapshot handle. Retrying in 5s" backup=velero/daily-2 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:182" pluginName=velero-plugin-for-csi
```

Durch Überprüfen der Longhorn-Pod-Protokolle:

kubectl logs  -n longhorn-system -l app=csi-snapshotter --tail=-1 |grep "too many snapshots created"kubectl logs  -n longhorn-system -l app=csi-snapshotter --tail=-1 |grep "too many snapshots created"

Beispielausgabe:

I1215 08:39:41.707351       1 snapshot_controller.go:291] createSnapshotWrapper: CreateSnapshot for content snapcontent-f073b88e-bd01-40a8-a645-ec929e276cef returned error: rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Server Error, detail=, message=failed to create snapshot: proxyServer=10.42.7.56:8501 destination=10.42.7.56:10004: failed to snapshot volume: rpc error: code = Unknown desc = failed to create snapshot snapshot-f073b88e-bd01-40a8-a645-ec929e276cef for volume 10.42.7.56:10004: rpc error: code = Unknown desc = too many snapshots created] from [http://longhorn-backend:9500/v1/volumes/pvc-7d89efa4-3d60-4837-a632-f190cd3cd9ed?action=snapshotCreate]I1215 08:39:41.707351       1 snapshot_controller.go:291] createSnapshotWrapper: CreateSnapshot for content snapcontent-f073b88e-bd01-40a8-a645-ec929e276cef returned error: rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Server Error, detail=, message=failed to create snapshot: proxyServer=10.42.7.56:8501 destination=10.42.7.56:10004: failed to snapshot volume: rpc error: code = Unknown desc = failed to create snapshot snapshot-f073b88e-bd01-40a8-a645-ec929e276cef for volume 10.42.7.56:10004: rpc error: code = Unknown desc = too many snapshots created] from [http://longhorn-backend:9500/v1/volumes/pvc-7d89efa4-3d60-4837-a632-f190cd3cd9ed?action=snapshotCreate]

Lösung

Wenn der Fehler TooManySnapshots auftritt, müssen Sie die Snapshots für alle Volumes bereinigen, indem Sie das folgende Skript ausführen. Weitere Informationen zum Automatisieren dieses Vorgangs finden Sie unter Automatisches Bereinigen von Longhorn-Snapshots.

#!/bin/bash
set -e

# longhorn backend URL
url=
# By default, snapshot older than 10 days will be deleted
days=10

function display_usage() {
	echo "usage: $(basename "$0") [-h] -u longhorn-url -d days"
	echo "  -u	Longhorn URL"
	echo "  -d 	Number of days(should be >0). By default, script will delete snapshot older than 10 days."
	echo "  -h	Print help"
}

while getopts 'hd:u:' flag "$@"; do
	case "${flag}" in
		u)
			url=${OPTARG}
			;;
		d)
			days=${OPTARG}
			[ "$days" ] && [ -z "${days//[0-9]}" ] || { echo "Invalid number of days=$days"; exit 1; }
			;;
		h)
			display_usage
			exit 0
			;;
		:)
			echo "Invalid option: ${OPTARG} requires an argument."
			exit 1
			;;
		*)
			echo "Unexpected option ${flag}"
			exit 1
			;;
	esac
done

[[ -z "$url" ]] && echo "Missing longhorn URL" && exit 1

# check if URL is valid
curl -s --connect-timeout 30 ${url}/v1 >> /dev/null || { echo "Unable to connect to longhorn backend"; exit 1; }

echo "Deleting snapshots older than $days days"

# Fetch list of longhorn volumes
vols=$( (curl -s -X GET ${url}/v1/volumes |jq -r '.data[].name') )

#delete given snapshot for given volume
function delete_snapshot() {
	local vol=$1
	local snap=$2

	[[ -z "$vol" || -z "$snap" ]] && echo "Error: delete_snapshot: Empty argument" && return 1
	curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotDelete -d '{"name": "'$snap'"}'
	echo "Snapshot=$snap deleted for volume=$vol"
}

#perform cleanup for given volume
function cleanup_volume() {
	local vol=$1
	local deleted_snap=0

	[[ -z "$vol" ]] && echo "Error: cleanup_volume: Empty argument" && return 1

	# fetch list of snapshot
	snaps=$( (curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotList | jq  -r '.data[] | select(.usercreated==true) | .name' ) )
	for i in ${snaps[@]}; do
		if [[ $i == "volume-head" ]]; then
			continue
		fi

		# calculate date difference for snapshot
		snapTime=$(curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotGet -d '{"name":"'$i'"}' |jq -r '.created')
		currentTime=$(date "+%s")
		timeDiff=$(($currentTime - ($(date -d $snapTime "+%s")) / 86400))
		if [[ $timeDiff -lt $days ]]; then
			echo "Ignoring snapshot $i, since it is older than $timeDiff days"
			continue
		fi

		#trigger deletion for snapshot
		delete_snapshot $vol $i
		deleted_snap=$((deleted_snap+1))
	done

	if [[ "$deleted_snap" -gt 0 ]]; then
		#trigger purge for volume
		curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotPurge >> /dev/null
	fi

}

for i in ${vols[@]}; do
	cleanup_volume $i
done#!/bin/bash
set -e

# longhorn backend URL
url=
# By default, snapshot older than 10 days will be deleted
days=10

function display_usage() {
	echo "usage: $(basename "$0") [-h] -u longhorn-url -d days"
	echo "  -u	Longhorn URL"
	echo "  -d 	Number of days(should be >0). By default, script will delete snapshot older than 10 days."
	echo "  -h	Print help"
}

while getopts 'hd:u:' flag "$@"; do
	case "${flag}" in
		u)
			url=${OPTARG}
			;;
		d)
			days=${OPTARG}
			[ "$days" ] && [ -z "${days//[0-9]}" ] || { echo "Invalid number of days=$days"; exit 1; }
			;;
		h)
			display_usage
			exit 0
			;;
		:)
			echo "Invalid option: ${OPTARG} requires an argument."
			exit 1
			;;
		*)
			echo "Unexpected option ${flag}"
			exit 1
			;;
	esac
done

[[ -z "$url" ]] && echo "Missing longhorn URL" && exit 1

# check if URL is valid
curl -s --connect-timeout 30 ${url}/v1 >> /dev/null || { echo "Unable to connect to longhorn backend"; exit 1; }

echo "Deleting snapshots older than $days days"

# Fetch list of longhorn volumes
vols=$( (curl -s -X GET ${url}/v1/volumes |jq -r '.data[].name') )

#delete given snapshot for given volume
function delete_snapshot() {
	local vol=$1
	local snap=$2

	[[ -z "$vol" || -z "$snap" ]] && echo "Error: delete_snapshot: Empty argument" && return 1
	curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotDelete -d '{"name": "'$snap'"}'
	echo "Snapshot=$snap deleted for volume=$vol"
}

#perform cleanup for given volume
function cleanup_volume() {
	local vol=$1
	local deleted_snap=0

	[[ -z "$vol" ]] && echo "Error: cleanup_volume: Empty argument" && return 1

	# fetch list of snapshot
	snaps=$( (curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotList | jq  -r '.data[] | select(.usercreated==true) | .name' ) )
	for i in ${snaps[@]}; do
		if [[ $i == "volume-head" ]]; then
			continue
		fi

		# calculate date difference for snapshot
		snapTime=$(curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotGet -d '{"name":"'$i'"}' |jq -r '.created')
		currentTime=$(date "+%s")
		timeDiff=$(($currentTime - ($(date -d $snapTime "+%s")) / 86400))
		if [[ $timeDiff -lt $days ]]; then
			echo "Ignoring snapshot $i, since it is older than $timeDiff days"
			continue
		fi

		#trigger deletion for snapshot
		delete_snapshot $vol $i
		deleted_snap=$((deleted_snap+1))
	done

	if [[ "$deleted_snap" -gt 0 ]]; then
		#trigger purge for volume
		curl -s -X POST ${url}/v1/volumes/${vol}?action=snapshotPurge >> /dev/null
	fi

}

for i in ${vols[@]}; do
	cleanup_volume $i
done

Beim Ausführen des oben genannten Skripts müssen Sie die folgenden Argumente übergeben:

-u – Die Longhorn-Backend-URL. Um die Longhorn-Backend-URL abzurufen, führen Sie den folgenden Befehl aus:

kubectl get svc -n longhorn-system longhorn-backend -o json | jq -r '.spec | (.clusterIP|tostring) + ":" + (.ports[0].port|tostring)' to fetch URLkubectl get svc -n longhorn-system longhorn-backend -o json | jq -r '.spec | (.clusterIP|tostring) + ":" + (.ports[0].port|tostring)' to fetch URL

-d – Die Anzahl der Tage, bevor die Snapshots gelöscht werden.

Hinweis:

Um eine Liste der Volumes abzurufen, die vom Fehler TooManySnapshots betroffen sind, führen Sie den folgenden Befehl aus:

kubectl get volumes -n longhorn-system -o json | jq -r '.items[] | select(([ .status.conditions[] | select(.type == "toomanysnapshots" and .status == "True") ] | length ) == 1 ) | .metadata.name'kubectl get volumes -n longhorn-system -o json | jq -r '.items[] | select(([ .status.conditions[] | select(.type == "toomanysnapshots" and .status == "True") ] | length ) == 1 ) | .metadata.name'