When I had PrometheusTSDBCompactionsFailing alerts, the cause was corrupted WAL files, with error messages in the logs looking like this: `WAL truncation in Compact: create checkpoint: read segments: corruption in segment /prometheus/wal/00018151 at 72: unexpected full record`.
With the following procedure I was able to fix the issue:
- Exec into the pod (or find the mount path of the PersistentVolumeClaim on the host) and delete the corrupted file (in the example above: `rm /prometheus/wal/00018151`).
- Delete all the WAL files in `/prometheus/wal` that are older than the file deleted in the previous step (for example `rm /prometheus/wal/00018150`).
- Create empty files in place of all the deleted files from the previous steps (for example `touch /prometheus/wal/00018150 /prometheus/wal/00018151`).
- Make sure the file ownership and permissions are the same as for the other WAL files (e.g. `chown 1000:2000 /prometheus/wal/00018150 /prometheus/wal/00018151` and `chmod g+w /prometheus/wal/00018150 /prometheus/wal/00018151`; a consolidated sketch of these commands follows this list).
- Restart the pod.
- Depending on how long ago the last successful compaction was, the next compaction might use a lot of memory and take a while. Watch whether the pod gets out-of-memory-killed and, if needed, (temporarily) increase the memory requests and limits of the prometheus container (see the kubectl patch sketch below). Disable the startupProbe and the livenessProbe if the container terminates with exit code zero, you see the message "See you next time!" in the logs, and the pod events (`kubectl describe`) show a failed startup probe.
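For reference, here is a minimal consolidated sketch of the repair commands above. The pod name `prometheus-k8s-0`, the `monitoring` namespace, the segment names, and the UID:GID are placeholders taken from the example; adjust them to your environment. If the prometheus container has no shell or cannot run `chown`, the same commands can be run on the host against the PersistentVolume mount path instead.

```bash
kubectl exec -n monitoring prometheus-k8s-0 -c prometheus -- sh -c '
  # Remove the corrupted segment reported in the logs.
  rm /prometheus/wal/00018151
  # Remove all older segments (00018150 in this example).
  rm /prometheus/wal/00018150
  # Recreate the deleted segments as empty files.
  touch /prometheus/wal/00018150 /prometheus/wal/00018151
  # Match ownership and permissions of the other WAL segments.
  chown 1000:2000 /prometheus/wal/00018150 /prometheus/wal/00018151
  chmod g+w /prometheus/wal/00018150 /prometheus/wal/00018151
'
# Restart the pod so Prometheus replays the repaired WAL.
kubectl delete pod -n monitoring prometheus-k8s-0
```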
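For the temporary memory increase, assuming the instance is managed by the Prometheus Operator (an assumption; the object name `k8s`, the namespace, and the `8Gi` value are placeholders), something along these lines could work. Disabling the startupProbe/livenessProbe is not covered by this sketch, since the operator generates the StatefulSet and pod spec.

```bash
# Temporarily raise the memory requests/limits of the prometheus container
# by patching the Prometheus custom resource ("k8s" and "monitoring" are
# placeholders; revert the values after compaction has caught up).
kubectl patch prometheus k8s -n monitoring --type merge \
  -p '{"spec":{"resources":{"requests":{"memory":"8Gi"},"limits":{"memory":"8Gi"}}}}'
```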
I do not know if this is good practice, though.
Should I open a pull request to extend the PrometheusTSDBCompactionsFailing runbook?