diff --git a/administration/buffering-and-storage.md b/administration/buffering-and-storage.md index 40b60a2f9..c5e3d8dfe 100644 --- a/administration/buffering-and-storage.md +++ b/administration/buffering-and-storage.md @@ -151,6 +151,64 @@ The Service section refers to the section defined in the main [configuration fil | `storage.keep.rejected` | When enabled, the dead-letter queue feature stores failed chunks that can't be delivered. Accepted values: `Off`, `On`. | `Off`| | `storage.rejected.path` | When specified, the dead-letter queue is stored in a subdirectory (stream) under `storage.path`. The default value `rejected` is used at runtime if not set. | _none_ | +### Dead letter queue (DLQ) + +The Dead Letter Queue (DLQ) feature preserves chunks that fail to be delivered to output destinations. Instead of losing this data, Fluent Bit copies the rejected chunks to a dedicated storage location for later analysis and troubleshooting. + +#### When dead letter queue is triggered + +Chunks are copied to the DLQ in the following failure scenarios: + +- **Permanent errors**: When an output plugin returns an unrecoverable error (`FLB_ERROR`). +- **Retry limit reached**: When a chunk exhausts all configured retry attempts. +- **Retries disabled**: When `retry_limit` is set to `no_retries` and a flush fails. +- **Scheduler failures**: When the retry scheduler can't schedule a retry (for example, due to resource constraints). + +#### Requirements + +The DLQ feature requires: + +- `storage.path` must be configured (filesystem storage must be enabled). +- `storage.keep.rejected` must be set to `On`. + +#### Dead letter queue file location and format + +Rejected chunks are stored in a subdirectory under `storage.path`. For example, with the following configuration: + +```yaml +service: + storage.path: /var/log/flb-storage/ + storage.keep.rejected: on + storage.rejected.path: rejected +``` + +Rejected chunks are stored at `/var/log/flb-storage/rejected/`. 
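The requirements and path resolution above can be sketched as a small pre-flight check. This is an illustration only, not Fluent Bit code: the `dlq_directory` helper is hypothetical, the set of truthy spellings is an assumption, and the fallback to `rejected` follows the default described in the table above.

```python
from pathlib import PurePosixPath
from typing import Mapping, Optional

# Assumption: spellings treated as "enabled" for this sketch only.
_TRUTHY = {"on", "true", "yes"}


def dlq_directory(service: Mapping[str, object]) -> Optional[PurePosixPath]:
    """Return the directory where rejected chunks would land, or None if the
    DLQ requirements described above are not met (hypothetical helper)."""
    storage_path = service.get("storage.path")
    if not storage_path:
        # Filesystem storage is not configured, so there is no DLQ.
        return None
    keep = str(service.get("storage.keep.rejected", "off")).strip().lower()
    if keep not in _TRUTHY:
        # The DLQ is disabled by default.
        return None
    # Fall back to the documented runtime default when the key is unset.
    subdir = str(service.get("storage.rejected.path") or "rejected")
    return PurePosixPath(str(storage_path)) / subdir


if __name__ == "__main__":
    service = {
        "storage.path": "/var/log/flb-storage/",
        "storage.keep.rejected": "on",
        "storage.rejected.path": "rejected",
    }
    print(dlq_directory(service))
```

A check like this could run in a deployment script to confirm that both requirements are met before relying on the DLQ.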
+ +Each DLQ file is named using this format: + +```text +<sanitized_tag>_<status_code>_<output_name>_<unique_id>.flb +``` + +For example: `kube_var_log_containers_test_400_http_0x7f8b4c.flb` + +The file contains the original chunk data in the internal format of Fluent Bit, preserving all records and metadata. + +#### Troubleshooting with dead letter queue + +The DLQ feature enables the following capabilities: + +- **Data preservation**: Invalid or rejected chunks are preserved instead of being permanently lost. +- **Root cause analysis**: Investigate why specific data failed to be delivered without impacting live processing. +- **Data recovery**: Replay or transform rejected chunks after fixing the underlying issue. +- **Debugging**: Analyze the exact content of problematic records. + +To examine DLQ chunks, you can use the storage metrics endpoint (when `storage.metrics` is enabled) or directly inspect the files in the rejected directory. + +{% hint style="info" %} +DLQ files remain on disk until manually removed. Monitor disk usage in the rejected directory and implement a cleanup policy for older files. +{% endhint %} + A Service section will look like this: {% tabs %} @@ -165,6 +223,8 @@ service: storage.checksum: off storage.backlog.mem_limit: 5M storage.backlog.flush_on_shutdown: off + storage.keep.rejected: on + storage.rejected.path: rejected ``` {% endtab %} @@ -179,12 +239,14 @@ service: storage.checksum off storage.backlog.mem_limit 5M storage.backlog.flush_on_shutdown off + storage.keep.rejected on + storage.rejected.path rejected ``` {% endtab %} {% endtabs %} -This configuration sets an optional buffering mechanism where the route to the data is `/var/log/flb-storage/`. 
It uses `normal` synchronization mode, without running a checksum and up to a maximum of 5 MB of memory when processing backlog data. Additionally, the dead letter queue is enabled, and rejected chunks are stored in `/var/log/flb-storage/rejected/`. ### Input section configuration diff --git a/administration/configuring-fluent-bit/yaml/service-section.md b/administration/configuring-fluent-bit/yaml/service-section.md index 5ee069358..83949f53c 100644 --- a/administration/configuring-fluent-bit/yaml/service-section.md +++ b/administration/configuring-fluent-bit/yaml/service-section.md @@ -26,8 +26,27 @@ The `service` section defines global properties of the service. The available co | `sp.convert_from_str_to_num` | If enabled, the stream processor converts strings that represent numbers to a numeric type. | `true` | | `windows.maxstdio` | If specified, adjusts the limit of `stdio`. Only provided for Windows. Values from `512` to `2048` are allowed. | `512` | +### Storage configuration + +The following storage-related keys can be set in the `service` section: + +| Key | Description | Default Value | +| --- | ----------- | ------------- | +| `storage.path` | Set a location in the file system to store streams and chunks of data. Required for filesystem buffering. | _none_ | +| `storage.sync` | Configure the synchronization mode used to store data in the file system. Accepted values: `normal` or `full`. | `normal` | +| `storage.checksum` | Enable data integrity check when writing and reading data from the filesystem. Accepted values: `off` or `on`. | `off` | +| `storage.max_chunks_up` | Set the maximum number of chunks that can be `up` in memory when using filesystem storage. | `128` | +| `storage.backlog.mem_limit` | Set the memory limit for backlog data chunks. | `5M` | +| `storage.backlog.flush_on_shutdown` | Attempt to flush all backlog chunks during shutdown. Accepted values: `off` or `on`. 
| `off` | +| `storage.metrics` | Enable storage layer metrics on the HTTP endpoint. Accepted values: `off` or `on`. | `off` | +| `storage.delete_irrecoverable_chunks` | Delete irrecoverable chunks during runtime and at startup. Accepted values: `off` or `on`. | `off` | +| `storage.keep.rejected` | Enable the Dead Letter Queue (DLQ) to preserve chunks that fail to be delivered. Accepted values: `off` or `on`. | `off` | +| `storage.rejected.path` | Subdirectory name under `storage.path` for storing rejected chunks. | `rejected` | + For scheduler and retry details, see [scheduling and retries](../../scheduling-and-retries.md#Scheduling-and-Retries). +For storage and buffering details, see [buffering and storage](../../buffering-and-storage.md). + ## Configuration example The following configuration example that defines a `service` section with [hot reloading](../../hot-reload.md) enabled and a pipeline with a `random` input and `stdout` output: diff --git a/administration/scheduling-and-retries.md b/administration/scheduling-and-retries.md index 649882fc3..ea9672baf 100644 --- a/administration/scheduling-and-retries.md +++ b/administration/scheduling-and-retries.md @@ -95,6 +95,10 @@ The scheduler provides a configuration option called `Retry_Limit`, which can be | `Retry_Limit` | `no_limits` or `False` | When set there no limit for the number of retries that the scheduler can do. | | `Retry_Limit` | `no_retries` | When set, retries are disabled and scheduler doesn't try to send data to the destination if it failed the first time. | +{% hint style="info" %} +When a chunk exhausts all retry attempts or retries are disabled, the data is discarded by default. To preserve rejected data for later analysis, enable the [Dead Letter Queue (DLQ)](buffering-and-storage.md#dead-letter-queue-dlq) feature by setting `storage.keep.rejected` to `on` in the Service section. 
+{% endhint %} + ### Retry example The following example configures two outputs, where the HTTP plugin has an unlimited number of retries, and the Elasticsearch plugin have a limit of `5` retries: diff --git a/administration/troubleshooting.md b/administration/troubleshooting.md index c09c6f340..25aa57ccb 100644 --- a/administration/troubleshooting.md +++ b/administration/troubleshooting.md @@ -2,9 +2,64 @@ +- [Dead letter queue: preserve failed chunks](#dead-letter-queue) - [Tap: generate events or records](#tap) - [Dump internals signal](#dump-internals-and-signal) +## Dead letter queue + +The Dead Letter Queue (DLQ) feature preserves chunks that fail to be delivered to output destinations. This enables troubleshooting delivery failures without losing data. + +### Enable dead letter queue + +To enable the DLQ, add the following to your Service section: + +{% tabs %} +{% tab title="fluent-bit.yaml" %} + +```yaml +service: + storage.path: /var/log/flb-storage/ + storage.keep.rejected: on + storage.rejected.path: rejected +``` + +{% endtab %} +{% tab title="fluent-bit.conf" %} + +```text +[SERVICE] + storage.path /var/log/flb-storage/ + storage.keep.rejected on + storage.rejected.path rejected +``` + +{% endtab %} +{% endtabs %} + +### What gets stored + +Chunks are copied to the DLQ when: + +- An output plugin returns an unrecoverable error. +- A chunk exhausts all configured retry attempts. +- Retries are disabled (`retry_limit: no_retries`) and the flush fails. +- The scheduler fails to schedule a retry. + +### Examine dead letter queue files + +DLQ files are stored in the configured path (for example, `/var/log/flb-storage/rejected/`) with names that include the tag, status code, and output plugin name. This helps identify which records failed and why. + +For example, a file named `kube_var_log_containers_test_400_http_0x7f8b4c.flb` indicates a chunk with tag `kube.var.log.containers.test` that failed with status code `400` when sending to the `http` output. 
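As a rough illustration of reading these file names, the sketch below splits a DLQ file name into its parts. It assumes the layout suggested by the example above: sanitized tag, numeric status code, output plugin name, and a trailing identifier, joined by underscores. The `parse_dlq_name` helper and the `DlqName` type are hypothetical and not part of Fluent Bit. Because tags and plugin names can themselves contain underscores, the split is a best-effort heuristic.

```python
import re
from typing import NamedTuple, Optional


class DlqName(NamedTuple):
    tag: str      # sanitized tag exactly as it appears in the file name
    status: int   # status code reported for the failed flush
    output: str   # output plugin name
    suffix: str   # trailing identifier (assumed to be the last field)


# Assumed layout: <tag>_<status>_<output>_<suffix>.flb
# The greedy tag group makes the status match the right-most all-digit
# segment that still leaves room for an output name and a suffix.
_DLQ_RE = re.compile(
    r"^(?P<tag>.+)_(?P<status>\d+)_(?P<output>.+)_(?P<suffix>[^_]+)\.flb$"
)


def parse_dlq_name(filename: str) -> Optional[DlqName]:
    """Split a DLQ file name into its parts, or return None if it doesn't
    match the assumed layout."""
    match = _DLQ_RE.match(filename)
    if match is None:
        return None
    return DlqName(
        tag=match["tag"],
        status=int(match["status"]),
        output=match["output"],
        suffix=match["suffix"],
    )


if __name__ == "__main__":
    print(parse_dlq_name("kube_var_log_containers_test_400_http_0x7f8b4c.flb"))
```

The sanitized tag can't be mapped back to the original tag reliably, because an underscore in the file name might stand for a dot or for a literal underscore in the tag.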
+ +### Dead letter queue management + +{% hint style="warning" %} +DLQ files remain on disk until manually removed. Monitor disk usage and implement a cleanup policy. +{% endhint %} + +For more details on DLQ configuration, see [Buffering and Storage](buffering-and-storage.md#dead-letter-queue-dlq). + ## Tap Tap can be used to generate events or records detailing what messages pass through Fluent Bit, at what time and what filters affect them.