When I had PrometheusTSDBCompactionsFailing alerts, the cause was corrupted WAL files, with error messages in the logs looking like this: `WAL truncation in Compact: create checkpoint: read segments: corruption in segment /prometheus/wal/00018151 at 72: unexpected full record`.
With the following procedure I was able to fix the issue:
- Exec into the pod (or find the mount path of the PersistentVolumeClaim on the host) and delete the corrupted file (in the example above: `rm /prometheus/wal/00018151`).
- Delete all the WAL files in `/prometheus/wal` that are older than the file deleted in the previous step (for example `rm /prometheus/wal/00018150`).
- Create empty files in place of all the deleted files from the previous steps (for example `touch /prometheus/wal/00018150 /prometheus/wal/00018151`).
- Make sure the file ownership and permissions are the same as for the other WAL files (e.g. `chown 1000:2000 /prometheus/wal/00018150 /prometheus/wal/00018151` and `chmod g+w /prometheus/wal/00018150 /prometheus/wal/00018151`; a consolidated sketch of these commands follows this list).
- Restart the pod.
- Depending on how long ago the last successful compaction was, the next compaction might use a lot of memory and take a while. Watch whether the pod gets out-of-memory-killed and, if needed, (temporarily) increase the memory requests and limits of the prometheus container (see the kubectl patch sketch below). Disable the startupProbe and the livenessProbe if the container terminates with exit code zero, you see the message "See you next time!" in the logs, and the pod events (`kubectl describe`) show a failed startup probe.
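For reference, here is a minimal consolidated sketch of the repair commands above. The pod name `prometheus-k8s-0`, the `monitoring` namespace, the segment names, and the UID:GID are placeholders taken from the example; adjust them to your environment. If the prometheus container has no shell or cannot run `chown`, the same commands can be run on the host against the PersistentVolume mount path instead.

```bash
kubectl exec -n monitoring prometheus-k8s-0 -c prometheus -- sh -c '
  # Remove the corrupted segment reported in the logs.
  rm /prometheus/wal/00018151
  # Remove all older segments (00018150 in this example).
  rm /prometheus/wal/00018150
  # Recreate the deleted segments as empty files.
  touch /prometheus/wal/00018150 /prometheus/wal/00018151
  # Match ownership and permissions of the other WAL segments.
  chown 1000:2000 /prometheus/wal/00018150 /prometheus/wal/00018151
  chmod g+w /prometheus/wal/00018150 /prometheus/wal/00018151
'
# Restart the pod so Prometheus replays the repaired WAL.
kubectl delete pod -n monitoring prometheus-k8s-0
```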
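For the temporary memory increase, assuming the instance is managed by the Prometheus Operator (an assumption; the object name `k8s`, the namespace, and the `8Gi` value are placeholders), something along these lines could work. Disabling the startupProbe/livenessProbe is not covered by this sketch, since the operator generates the StatefulSet and pod spec.

```bash
# Temporarily raise the memory requests/limits of the prometheus container
# by patching the Prometheus custom resource ("k8s" and "monitoring" are
# placeholders; revert the values after compaction has caught up).
kubectl patch prometheus k8s -n monitoring --type merge \
  -p '{"spec":{"resources":{"requests":{"memory":"8Gi"},"limits":{"memory":"8Gi"}}}}'
```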
I do not know if this is good practice, though.
Should I open a pull request to extend the PrometheusTSDBCompactionsFailing runbook?