64 changes: 63 additions & 1 deletion administration/buffering-and-storage.md
Original file line number Diff line number Diff line change
@@ -151,6 +151,64 @@
| `storage.keep.rejected` | When enabled, the dead-letter queue feature stores failed chunks that can't be delivered. Accepted values: `Off`, `On`. | `Off`|
| `storage.rejected.path` | When specified, the dead-letter queue is stored in a subdirectory (stream) under `storage.path`. The default value `rejected` is used at runtime if not set. | _none_ |

### Dead letter queue (DLQ)

The Dead Letter Queue (DLQ) feature preserves chunks that fail to be delivered to output destinations. Instead of losing this data, Fluent Bit copies the rejected chunks to a dedicated storage location for later analysis and troubleshooting.

#### When dead letter queue is triggered

Chunks are copied to the DLQ in the following failure scenarios:

- **Permanent errors**: When an output plugin returns an unrecoverable error (`FLB_ERROR`).
- **Retry limit reached**: When a chunk exhausts all configured retry attempts.
- **Retries disabled**: When `retry_limit` is set to `no_retries` and a flush fails.
- **Scheduler failures**: When the retry scheduler can't schedule a retry (for example, due to resource constraints).

#### Requirements

The DLQ feature has the following requirements:

- `storage.path` must be configured (filesystem storage must be enabled).
- `storage.keep.rejected` must be set to `On`.

#### Dead letter queue file location and format

Rejected chunks are stored in a subdirectory under `storage.path`. For example, with the following configuration:

```yaml
service:
  storage.path: /var/log/flb-storage/
  storage.keep.rejected: on
  storage.rejected.path: rejected
```

Rejected chunks are stored at `/var/log/flb-storage/rejected/`.

Each DLQ file is named using this format:

```text
<sanitized_tag>_<status_code>_<output_name>_<unique_id>.flb
```

For example: `kube_var_log_containers_test_400_http_0x7f8b4c.flb`

The file contains the original chunk data in the internal format of Fluent Bit, preserving all records and metadata.
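
Because the file name encodes the tag, status code, and output name, a script can recover those fields for triage. The following Python snippet is a minimal sketch, not part of Fluent Bit: the function name `parse_dlq_filename`, the `DlqName` type, and the regular expression are assumptions derived only from the name format shown in this section. Sanitized tags and output names can themselves contain underscores, so treat the split as best-effort.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, derived from the documented name format:
#   <sanitized_tag>_<status_code>_<output_name>_<unique_id>.flb
_DLQ_NAME = re.compile(
    r"^(?P<tag>.+)_(?P<status>\d+)_(?P<output>.+)_(?P<uid>[^_]+)\.flb$"
)


class DlqName(NamedTuple):
    """Fields recovered from a DLQ file name (hypothetical helper type)."""
    sanitized_tag: str
    status_code: int
    output_name: str
    unique_id: str


def parse_dlq_filename(name: str) -> Optional[DlqName]:
    """Split a DLQ file name into its parts, or return None if it doesn't match."""
    m = _DLQ_NAME.match(name)
    if m is None:
        return None
    return DlqName(m["tag"], int(m["status"]), m["output"], m["uid"])


if __name__ == "__main__":
    print(parse_dlq_filename("kube_var_log_containers_test_400_http_0x7f8b4c.flb"))
```

The sanitized tag can't be reliably converted back to the original tag, because the sketch has no way to tell which underscores replaced other characters.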

#### Troubleshooting with dead letter queue

The DLQ feature enables the following capabilities:

- **Data preservation**: Invalid or rejected chunks are preserved instead of being permanently lost.
- **Root cause analysis**: Investigate why specific data failed to be delivered without impacting live processing.
- **Data recovery**: Replay or transform rejected chunks after fixing the underlying issue.
- **Debugging**: Analyze the exact content of problematic records.

To examine DLQ chunks, you can use the storage metrics endpoint (when `storage.metrics` is enabled) or directly inspect the files in the rejected directory.

{% hint style="info" %}
DLQ files remain on disk until manually removed. Monitor disk usage in the rejected directory and implement a cleanup policy for older files.
{% endhint %}
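
One way to implement a cleanup policy is a small script run on a schedule. The following Python snippet is a sketch under stated assumptions, not a Fluent Bit feature: the function name `remove_old_dlq_files` is hypothetical, and it assumes DLQ files sit directly in the rejected directory with a `.flb` extension and that file modification time is an acceptable measure of age.

```python
import time
from pathlib import Path


def remove_old_dlq_files(rejected_dir: str, max_age_days: float,
                         dry_run: bool = True) -> list:
    """Return DLQ files older than max_age_days; delete them unless dry_run is set."""
    cutoff = time.time() - max_age_days * 86400  # seconds per day
    matched = []
    for path in sorted(Path(rejected_dir).glob("*.flb")):
        # Assumption: age is judged by the file's last modification time.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            matched.append(path)
    return matched


if __name__ == "__main__":
    # Example path taken from the configuration in this section.
    for p in remove_old_dlq_files("/var/log/flb-storage/rejected/", max_age_days=7):
        print("would remove", p)
```

The `dry_run` default of `True` is a deliberate choice so that a first run only reports what would be deleted.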

A Service section will look like this:

{% tabs %}
@@ -165,6 +223,8 @@
storage.checksum: off
storage.backlog.mem_limit: 5M
storage.backlog.flush_on_shutdown: off
storage.keep.rejected: on
storage.rejected.path: rejected
```

{% endtab %}
@@ -179,12 +239,14 @@
storage.checksum off
storage.backlog.mem_limit 5M
storage.backlog.flush_on_shutdown off
storage.keep.rejected on
storage.rejected.path rejected
```

{% endtab %}
{% endtabs %}

This configuration enables optional filesystem buffering, with data stored under `/var/log/flb-storage/`. It uses `normal` synchronization mode, doesn't run a checksum, and uses up to 5 MB of memory when processing backlog data. Additionally, the dead letter queue is enabled, and rejected chunks are stored in `/var/log/flb-storage/rejected/`.

### Input section configuration

19 changes: 19 additions & 0 deletions administration/configuring-fluent-bit/yaml/service-section.md
@@ -26,8 +26,27 @@ The `service` section defines global properties of the service. The available co
| `sp.convert_from_str_to_num` | If enabled, the stream processor converts strings that represent numbers to a numeric type. | `true` |
| `windows.maxstdio` | If specified, adjusts the limit of `stdio`. Only provided for Windows. Values from `512` to `2048` are allowed. | `512` |

### Storage configuration

The following storage-related keys can be set in the `service` section:

| Key | Description | Default Value |
| --- | ----------- | ------------- |
| `storage.path` | Set a location in the file system to store streams and chunks of data. Required for filesystem buffering. | _none_ |
| `storage.sync` | Configure the synchronization mode used to store data in the file system. Accepted values: `normal` or `full`. | `normal` |
| `storage.checksum` | Enable data integrity check when writing and reading data from the filesystem. Accepted values: `off` or `on`. | `off` |
| `storage.max_chunks_up` | Set the maximum number of chunks that can be `up` in memory when using filesystem storage. | `128` |
| `storage.backlog.mem_limit` | Set the memory limit for backlog data chunks. | `5M` |
| `storage.backlog.flush_on_shutdown` | Attempt to flush all backlog chunks during shutdown. Accepted values: `off` or `on`. | `off` |
| `storage.metrics` | Enable storage layer metrics on the HTTP endpoint. Accepted values: `off` or `on`. | `off` |
| `storage.delete_irrecoverable_chunks` | Delete irrecoverable chunks during runtime and at startup. Accepted values: `off` or `on`. | `off` |
| `storage.keep.rejected` | Enable the Dead Letter Queue (DLQ) to preserve chunks that fail to be delivered. Accepted values: `off` or `on`. | `off` |
| `storage.rejected.path` | Subdirectory name under `storage.path` for storing rejected chunks. | `rejected` |

For scheduler and retry details, see [scheduling and retries](../../scheduling-and-retries.md#Scheduling-and-Retries).

For storage and buffering details, see [buffering and storage](../../buffering-and-storage.md).

## Configuration example

The following configuration example defines a `service` section with [hot reloading](../../hot-reload.md) enabled and a pipeline with a `random` input and `stdout` output:
4 changes: 4 additions & 0 deletions administration/scheduling-and-retries.md
@@ -95,6 +95,10 @@ The scheduler provides a configuration option called `Retry_Limit`, which can be
| `Retry_Limit` | `no_limits` or `False` | When set, there is no limit on the number of retries that the scheduler can do. |
| `Retry_Limit` | `no_retries` | When set, retries are disabled and the scheduler doesn't try to send data to the destination again if the first attempt fails. |

{% hint style="info" %}
When a chunk exhausts all retry attempts or retries are disabled, the data is discarded by default. To preserve rejected data for later analysis, enable the [Dead Letter Queue (DLQ)](buffering-and-storage.md#dead-letter-queue-dlq) feature by setting `storage.keep.rejected` to `on` in the Service section.
{% endhint %}

### Retry example

The following example configures two outputs, where the HTTP plugin has an unlimited number of retries, and the Elasticsearch plugin has a limit of `5` retries:
55 changes: 55 additions & 0 deletions administration/troubleshooting.md
@@ -2,9 +2,64 @@

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=759ddb3d-b363-4ee6-91fa-21025259767a" />

- [Dead letter queue: preserve failed chunks](#dead-letter-queue)
- [Tap: generate events or records](#tap)
- [Dump internals signal](#dump-internals-and-signal)

## Dead letter queue

The Dead Letter Queue (DLQ) feature preserves chunks that fail to be delivered to output destinations. This enables troubleshooting delivery failures without losing data.

### Enable dead letter queue

To enable the DLQ, add the following to your Service section:

{% tabs %}
{% tab title="fluent-bit.yaml" %}

```yaml
service:
  storage.path: /var/log/flb-storage/
  storage.keep.rejected: on
  storage.rejected.path: rejected
```

{% endtab %}
{% tab title="fluent-bit.conf" %}

```text
[SERVICE]
    storage.path /var/log/flb-storage/
    storage.keep.rejected on
    storage.rejected.path rejected
```

{% endtab %}
{% endtabs %}

### What gets stored

Chunks are copied to the DLQ when:

- An output plugin returns an unrecoverable error.
- A chunk exhausts all configured retry attempts.
- Retries are disabled (`retry_limit: no_retries`) and the flush fails.
- The scheduler fails to schedule a retry.

### Examine dead letter queue files

DLQ files are stored in the configured path (for example, `/var/log/flb-storage/rejected/`) with names that include the tag, status code, and output plugin name. This helps identify which records failed and why.

For example, a file named `kube_var_log_containers_test_400_http_0x7f8b4c.flb` indicates a chunk with tag `kube.var.log.containers.test` that failed with status code `400` when sending to the `http` output.
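
When many files accumulate, grouping them by output and status code can show where failures concentrate. The following Python snippet is a rough sketch rather than an official tool: the function name `summarize_dlq` and the regular expression are assumptions based on the file name layout described in this section, and names that don't match are counted under a placeholder key.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed layout: <sanitized_tag>_<status_code>_<output_name>_<unique_id>.flb
_DLQ_NAME = re.compile(r"^.+_(?P<status>\d+)_(?P<output>.+)_[^_]+\.flb$")


def summarize_dlq(rejected_dir: str) -> Counter:
    """Count DLQ files per (output_name, status_code) pair."""
    counts = Counter()
    for path in Path(rejected_dir).glob("*.flb"):
        m = _DLQ_NAME.match(path.name)
        # Files that don't follow the assumed layout go under ("unknown", -1).
        key = (m["output"], int(m["status"])) if m else ("unknown", -1)
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Example path taken from the configuration in this section.
    for (output, status), n in summarize_dlq("/var/log/flb-storage/rejected/").most_common():
        print(f"{output} status={status}: {n} chunk(s)")
```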

### Dead letter queue management

{% hint style="warning" %}
DLQ files remain on disk until manually removed. Monitor disk usage and implement a cleanup policy.
{% endhint %}
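
To monitor disk usage, a periodic check can report how many DLQ files exist and how much space they occupy. The following Python snippet is a minimal sketch: the function name `dlq_disk_usage` is hypothetical, and it assumes the DLQ files sit directly in the rejected directory with a `.flb` extension.

```python
from pathlib import Path


def dlq_disk_usage(rejected_dir: str) -> tuple:
    """Return (file_count, total_bytes) for .flb files in the rejected directory."""
    files = [p for p in Path(rejected_dir).glob("*.flb") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


if __name__ == "__main__":
    # Example path taken from the configuration in this section.
    count, total = dlq_disk_usage("/var/log/flb-storage/rejected/")
    print(f"{count} DLQ file(s), {total / (1024 * 1024):.1f} MiB")
```

The result could feed an alert threshold in whatever monitoring system is already in place.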

For more details on DLQ configuration, see [Buffering and Storage](buffering-and-storage.md#dead-letter-queue-dlq).

## Tap

Tap can be used to generate events or records detailing what messages pass through Fluent Bit, at what time and what filters affect them.