
Conversation


@nkhristinin nkhristinin commented Oct 17, 2025

Gap auto fill scheduler task

Overview

This PR introduces the Gap auto fill scheduler, a task responsible for automatically scheduling backfills for rules that have unprocessed gaps.

The scheduler runs at a configured interval, checks for available system backfill capacity, identifies eligible rules with gaps, and schedules backfill jobs in batches while respecting configured limits and capacity constraints.

It also writes detailed execution information to the event log for visibility and troubleshooting.


New API: Create Gap auto fill scheduler

Endpoint:
POST /internal/alerting/rules/gaps/gap_auto_fill_scheduler

Purpose:
Persist a scheduler configuration as a saved object (SO) and register the scheduler task.

Example request (v1):

{
  "id": "optional-id",
  "name": "Gap fill scheduler",
  "enabled": true,
  "max_backfills": 500,
  "amount_of_retries": 3,
  "gap_fill_range": "now-7d",
  "schedule": { "interval": "1d" },
  "scope": ["security"], // use to differentiate for solutions
  "rule_types": [
    { "type": "siem.rule", "consumer": "securitySolution" }
  ]
}

Example response:

{
  "id": "abc123",
  "name": "Gap fill scheduler",
  "enabled": true,
  "schedule": { "interval": "1d" },
  "gap_fill_range": "now-7d",
  "max_backfills": 500,
  "amount_of_retries": 3,
  "created_by": "elastic",
  "updated_by": "elastic",
  "created_at": "2025-10-29T12:34:56.789Z",
  "updated_at": "2025-10-29T12:34:56.789Z",
  "scheduled_task_id": "gap-auto-fill-scheduler-task:abc123"
}

Updated API: Get rules with gaps

  • Added a sort field to support fetching rules with the oldest or newest gaps.

Event Log

Added new fields: kibana.gap_auto_fill.execution.*
Used to track each scheduler run and its results.

Tracked fields include:

  • status
  • start, end, duration_ms
  • rule_ids[]
  • task_params.name
  • task_params.amount_of_retries
  • task_params.gap_fill_range
  • task_params.interval
  • task_params.max_backfills
  • results[] with rule_id, processed_gaps, status, and error
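
For illustration, a single scheduler-run event could look roughly like the sketch below (a TypeScript object literal). Only event.action and the kibana.gap_auto_fill.execution.* field names listed above come from this PR; all values are hypothetical.

// Hypothetical example of one scheduler-run document in .kibana-event-log*.
// Field names follow the list above; every value is made up for illustration.
const exampleExecutionEvent = {
  event: { action: 'gap-auto-fill-schedule' },
  kibana: {
    gap_auto_fill: {
      execution: {
        status: 'success',
        start: '2025-10-29T12:00:00.000Z',
        end: '2025-10-29T12:00:18.000Z',
        duration_ms: 18000,
        rule_ids: ['rule-1', 'rule-2'],
        task_params: {
          name: 'Gap fill scheduler',
          amount_of_retries: 3,
          gap_fill_range: 'now-7d',
          interval: '1d',
          max_backfills: 500,
        },
        results: [
          { rule_id: 'rule-1', processed_gaps: 4, status: 'success' },
          { rule_id: 'rule-2', processed_gaps: 0, status: 'error', error: 'rule disabled' },
        ],
      },
    },
  },
};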

Saved Object

New type: gap_auto_fill_scheduler

Attributes:

  • name
  • enabled
  • schedule.interval
  • gapFillRange
  • maxBackfills
  • amountOfRetries
  • createdBy, updatedBy
  • createdAt, updatedAt
  • scheduledTaskId
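
For reference, a rough sketch of what the attribute schema could look like with @kbn/config-schema (the style used in the schema fragments quoted in the review below); optionality and exact types are assumptions, not the PR's actual definition.

import { schema } from '@kbn/config-schema';

// Sketch only: attribute names follow the list above; optionality and defaults are guesses.
const gapAutoFillSchedulerAttributesSchema = schema.object({
  name: schema.string(),
  enabled: schema.boolean(),
  schedule: schema.object({
    interval: schema.string(),
  }),
  gapFillRange: schema.string(),
  maxBackfills: schema.number(),
  amountOfRetries: schema.number(),
  createdBy: schema.maybe(schema.string()),
  updatedBy: schema.maybe(schema.string()),
  createdAt: schema.string(),
  updatedAt: schema.string(),
  scheduledTaskId: schema.maybe(schema.string()),
});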

Task

New task type: gap-auto-fill-scheduler-task

Timeout after 40s (default)
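
A minimal sketch of how a task type like this is typically registered with Kibana Task Manager; only the task type name and the 40s timeout are taken from this PR, everything else is illustrative.

// Sketch, assuming the standard Task Manager registration API.
function registerGapAutoFillSchedulerTask(taskManager: {
  registerTaskDefinitions: (defs: Record<string, unknown>) => void;
}) {
  taskManager.registerTaskDefinitions({
    'gap-auto-fill-scheduler-task': {
      title: 'Gap auto fill scheduler',
      timeout: '40s', // default timeout described above
      createTaskRunner: () => ({
        async run() {
          // load scheduler config, check backfill capacity, schedule backfills in batches,
          // and write an execution summary to the event log (see the algorithm below)
        },
        async cancel() {
          // mark the run as cancelled so partial results are still logged
        },
      }),
    },
  });
}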

Task algorithm (High Level)

  1. Initialize
    • Create required clients and load scheduler configuration.
    • Prepare an event logger for the run.
  2. Capacity check
    • Determine remaining system backfill capacity.
    • If none, log “skipped” and exit.
  3. Fetch rules ids with gaps
    • Query for rule IDs that currently have gaps (most recent first).
    • If none, log “skipped” and exit.
  4. Process rules in batches
    • Iterate rule IDs in chunks.
    • Keep only enabled rules.
    • For each batch:
      • Fetch current gaps for these rules and ignore overlaps with active or scheduled backfills.
      • Schedule backfills for discovered gaps.
      • Aggregate per-rule results and statuses.
    • Re-check capacity after each batch and handle cancellation.
    • If capacity is exhausted, log a summary and stop early.
  5. Finalize
    • If no gaps were scheduled, log “skipped”.
    • Otherwise, log summarised result and overall status, then exit.
  6. Error and cancellation handling
    • On error, log error summary and exit.
    • On cancellation, log partial results and exit cleanly.
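
The pseudocode below is a rough, hypothetical sketch of that loop; every helper name is made up and the actual implementation differs in the details.

// Hypothetical sketch of the run loop described above.
interface SchedulerDeps {
  getBackfillCapacity(): Promise<number>;
  findRuleIdsWithGaps(): Promise<string[]>;
  filterEnabledRules(ruleIds: string[]): Promise<string[]>;
  scheduleBackfillsForGaps(ruleIds: string[]): Promise<Array<{ ruleId: string; processedGaps: number }>>;
  isCancelled(): boolean;
  log(status: 'success' | 'skipped' | 'error', detail: unknown): Promise<void>;
}

async function runGapAutoFillScheduler(deps: SchedulerDeps, batchSize = 100) {
  try {
    if ((await deps.getBackfillCapacity()) <= 0) return deps.log('skipped', 'no backfill capacity');

    const ruleIds = await deps.findRuleIdsWithGaps();
    if (ruleIds.length === 0) return deps.log('skipped', 'no rules with gaps');

    const results: Array<{ ruleId: string; processedGaps: number }> = [];
    for (let i = 0; i < ruleIds.length; i += batchSize) {
      if (deps.isCancelled()) break; // timeout/cancellation: report partial results
      const enabled = await deps.filterEnabledRules(ruleIds.slice(i, i + batchSize));
      results.push(...(await deps.scheduleBackfillsForGaps(enabled)));
      if ((await deps.getBackfillCapacity()) <= 0) break; // capacity exhausted: stop early
    }
    return results.length > 0 ? deps.log('success', results) : deps.log('skipped', 'nothing scheduled');
  } catch (e) {
    return deps.log('error', e);
  }
}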

Event log statuses

SUCCESS

  • Cancelled by timeout or explicit cancellation
  • Stopped early due to capacity exhausted (no remaining capacity during loop)
  • Stopped early after post-batch capacity check
  • Completed with at least one successful per-rule result

SKIPPED

  • No system backfill capacity at start
  • No rules with gaps
  • No enabled rules could be scheduled (after processing)

ERROR

  • Unhandled error during execution

Cleanup step for stacked gaps

This PR also introduces a cleanup mechanism for stacked and in-progress gaps.

During each scheduler execution, the cleanup step identifies gaps that are currently marked as in progress and verifies whether a corresponding backfill still exists.
If no active backfill is found for a gap, the scheduler resets its in-progress interval and moves the gap back to the unfilled state.

After processing, the updated_at field of each checked gap is updated.
Gaps that were recently updated by this process will not be re-evaluated for the next 12 hours to reduce redundant checks and load.
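
A hypothetical sketch of that cleanup pass (all types and helper names here are illustrative, not the PR's actual code):

// Reset in-progress gaps whose backfill no longer exists, skipping recently checked gaps.
interface TrackedGap {
  ruleId: string;
  inProgressIntervals: Array<{ gte: string; lte: string }>;
  updatedAt: string;
}

const TWELVE_HOURS_MS = 12 * 60 * 60 * 1000;

async function cleanupStackedGaps(
  gaps: TrackedGap[],
  hasActiveBackfill: (gap: TrackedGap) => Promise<boolean>,
  resetToUnfilled: (gap: TrackedGap) => Promise<void>
) {
  const now = Date.now();
  for (const gap of gaps) {
    // Skip gaps touched within the last 12 hours to reduce redundant checks and load.
    if (now - Date.parse(gap.updatedAt) < TWELVE_HOURS_MS) continue;
    if (gap.inProgressIntervals.length === 0) continue;
    // If no active backfill covers this gap, move it back to the unfilled state.
    if (!(await hasActiveBackfill(gap))) {
      await resetToUnfilled(gap);
    }
  }
}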


How to Test

Enable in kibana.dev.yml

xpack.alerting.gapAutoFillScheduler.enabled: true

  1. Ensure you have rules with gaps

There are two ways to create gaps:

  • Manual method:
    Create and enable a security rule with a 1-minute interval and 0-second lookback.
    After the first run, disable the rule, wait 5 minutes, and then enable it again. You should see an execution error about gaps, and the gap should appear in the Gaps table on the rule's execution tab.

  • Using this tool:
    Run the following command to generate multiple rules and gaps (100 rules, 10 gaps each, a 30m rule interval, removing all existing rules first):

    npm run start -- rules --rules 100 -c -g 10 -i "30m"
    
  2. Create and enable the scheduler

Run the following request (adjust as needed for your environment):

 fetch("http://localhost:5601/internal/alerting/rules/gaps/gap_auto_fill_scheduler", {
     "headers": {
       // add auth and content headers
     },
     "body": JSON.stringify({
       "id": "gap-scheduler",
       "name": "gap-scheduler",
       "enabled": true,
       "max_backfills": 1000,
       "amount_of_retries": 3,
       "gap_fill_range": "now-90d",
       "schedule": { "interval": "1m" },
       "scope": ["security"],
       "rule_types": [
         { "type": "siem.queryRule", "consumer": "siem" },
         { "type": "siem.savedQueryRule", "consumer": "siem" },
         { "type": "siem.eqlRule", "consumer": "siem" },
         { "type": "siem.esqlRule", "consumer": "siem" },
         { "type": "siem.thresholdRule", "consumer": "siem" },
         { "type": "siem.newTermsRule", "consumer": "siem" },
         { "type": "siem.mlRule", "consumer": "siem" },
         { "type": "siem.indicatorRule", "consumer": "siem" }
       ]
     }),
     "method": "POST",
     "mode": "cors",
     "credentials": "include"
  });
  3. Verify that it works

In Discover, search the .kibana-event-log* data view using the query event.action:"gap-auto-fill-schedule" (check the message and status fields).
Screenshot 2025-10-29 at 13 37 51

In the Rules Monitoring table, check that some rules have gaps and some rules are in progress (being backfilled). After some time, all gaps should be filled and the number of gaps should be 0.

Screenshot 2025-10-29 at 13 37 00

Performance

The Gap auto fill scheduler attempts to schedule as many backfills as possible during each run.
It continues processing until it reaches one of the following limits:

  • Task timeout: 40 seconds (default)
  • System backfill capacity: 1000 concurrent backfills (default)

Once either limit is reached, the task stops early and logs the partial results to the event log.

Test Results

  1. 1000 rules × 1 gap each | ~20s | All gaps scheduled successfully
Screenshot 2025-10-30 at 11 28 07
  2. 500 rules × 1000 gaps each | ~40s | Run canceled by timeout after partial scheduling
Screenshot 2025-10-30 at 11 43 32

Backfill client changes

In the Backfill client, we introduced a change that uses pMap to parallelise bulk creation. This allows the operations to run non-sequentially, which significantly improves performance.
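
As a rough illustration of the approach (chunk size, concurrency, and all names below are assumptions, not the exact implementation):

import pMap from 'p-map';

// Instead of awaiting each bulkCreate sequentially, run the chunks through pMap
// with bounded concurrency; pMap returns results in input order.
async function bulkCreateInParallel<T>(
  objects: T[],
  bulkCreate: (chunk: T[]) => Promise<unknown>,
  chunkSize = 100,
  concurrency = 10
) {
  const chunks: T[][] = [];
  for (let i = 0; i < objects.length; i += chunkSize) {
    chunks.push(objects.slice(i, i + chunkSize));
  }
  return pMap(chunks, (chunk) => bulkCreate(chunk), { concurrency });
}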

For 100 rules and 100 gaps, when triggering a manual run:

  • on main, it took ~11s
Screenshot 2025-11-17 at 15 58 01
  • on this branch, it takes ~1.6s
Screenshot 2025-11-17 at 15 52 56

@nkhristinin

/ci

@nkhristinin

/ci

@nkhristinin

/ci

@nkhristinin nkhristinin requested a review from ymao1 November 11, 2025 13:26
@nkhristinin

@elasticmachine merge upstream

@elasticmachine

ignoring request to update branch, pull request is closed

@nkhristinin nkhristinin reopened this Nov 12, 2025
@nkhristinin

@elasticmachine merge upstream

@nkhristinin

@elasticmachine merge upstream

gap_fill_range: '24h',
num_retries: 3,
max_backfills: 100,
scope: 'internal',
Contributor:

nit: should this be ['internal']?

consumer: schema.string(),
})
),
createdBy: schema.maybe(schema.string()),
Contributor:

We populate these fields based on the username we pull from the request object. I believe if security is turned off, we would get an undefined user. So if you're trying to overwrite an existing user in the saved object, undefined would not overwrite.

interval: schema.string(),
}),
gapFillRange: schema.string(),
maxBackfills: schema.number(),
Contributor:

We do have the validation at the API level but this should also catch if we do updates to the saved object that don't come from an API request (maybe not happening in this PR), the saved objects client should validate those.

const taskManager = context.taskManager;

// Throw error if a gap auto fill scheduler already exists for the same (rule type, consumer) pair
const pairs = Array.from(new Set(params.ruleTypes.map((rt) => `${rt.type}:${rt.consumer}`)));
Contributor:

there's no unit test for this error path...I haven't looked at the functional tests yet but if there's no test for that there, we should add a test somewhere.

Contributor Author:

Yes, I do have an API test for that.


// Throw error if a gap auto fill scheduler already exists for the same (rule type, consumer) pair
const pairs = Array.from(new Set(params.ruleTypes.map((rt) => `${rt.type}:${rt.consumer}`)));
if (pairs.length > 0) {
Contributor:

should there be an error if pairs.length === 0? do we always need to specify at least one pair?

Contributor Author:

We should have at least 1 pair, that's correct, but I thought it would be covered by the param types validation:

  ruleTypes: schema.arrayOf(
    schema.object({
      type: schema.string(),
      consumer: schema.string(),
    })
  ),

}
);
} catch (e) {
await soClient.delete(GAP_AUTO_FILL_SCHEDULER_SAVED_OBJECT_TYPE, so.id);
Contributor:

would a log message be useful here?

@elastic-vault-github-plugin-prod elastic-vault-github-plugin-prod bot requested a review from a team as a code owner November 14, 2025 16:37
@jeramysoucy jeramysoucy self-requested a review November 17, 2025 16:20
@nkhristinin

@elasticmachine merge upstream

@vitaliidm vitaliidm self-requested a review November 18, 2025 16:33
}

const gapsInBackfillScheduling = gapsClampedIntervals.map(({ gap }) => gap);
if (ruleGapsClampedIntervals.length === 0) {
Contributor:

nit: can move this before

if (
      maxGapsCountToProcess &&
      totalProcessedGapsCount + ruleGapsClampedIntervals.length > maxGapsCountToProcess
    ) {

const chunkConcurrency = 10;
await pMap(
  chunks,
  async ({ startIndex, items }, idx) => {
Contributor:

I'm not too familiar with pMap. What happens if one of the calls to unsecuredSavedObjectsClient.bulkCreate fails? Does it act like a Promise.all, where all the chunks would then fail? Maybe we should add a try/catch around it so we can populate the orderedResults with something in case this chunk fails? Otherwise the iteration below over the orderedResults might see a null accessor error because no value was populated.

Contributor Author:

Good catch. I updated the logic so that we catch those failures and return them in the response of this method.

});

// Logging per-chunk and per-SO average timings
this.logger.info(
Contributor:

do you think this info log is necessary? maybe a debug log? if the auto gap fill scheduler is running frequently, it could lead to a decent amount of logging. we could keep the final info log after the pMap and leave these as debug

gaps,
})
)
gaps: ruleGaps,
Contributor:

was this a bug before that we weren't filter the gaps by the rule ID?

Contributor Author:

I don't think we used this method with gaps from different rules before; it was working fine when you pass a single rule.


const gapsPerPage = DEFAULT_GAPS_PER_PAGE;

while (true) {
Contributor:

Can we add a circuit breaker for the max number of iterations we'll allow in this while(true) loop? I see there are some break points in here but I'm always leery of a while(true) loop

name: schema.string({ defaultValue: '' }),
enabled: schema.boolean({ defaultValue: true }),
max_backfills: schema.number({ defaultValue: 1000, min: 1, max: 5000 }),
num_retries: schema.number({ defaultValue: 3, min: 1 }),
Contributor:

I may have missed it but where is this used? num_retries

Contributor Author:

We will use it later, when we introduce a new error state for gaps for cases where a backfill fails to fill them.

I decided not to add that functionality here, as this is already a big PR.

@jeramysoucy left a comment:

Kibana security changes LGTM

@nkhristinin nkhristinin requested a review from ymao1 November 19, 2025 12:47
@elasticmachine

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
alerting 348 349 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
alerting 88.1KB 88.1KB -8.0B

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
alerting 61 64 +3

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
alerting 25.4KB 25.4KB -4.0B


cc @nkhristinin


Labels

backport:skip (This PR does not require backporting), release_note:skip (Skip the PR/issue when compiling release notes)
