Gap auto fill task #239573
Conversation
gap_fill_range: '24h',
num_retries: 3,
max_backfills: 100,
scope: 'internal',
nit: should this be ['internal']?
consumer: schema.string(),
})
),
createdBy: schema.maybe(schema.string()),
We populate these fields based on the username we pull from the request object. I believe if security is turned off, we would get an undefined user. So if you're trying to overwrite an existing user in the saved object, undefined would not overwrite.
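The point above can be sketched as follows; this is an illustrative snippet with assumed field names, not the PR's actual code. When security is off, the username is undefined, so an update should only include the field when a user is known, leaving the stored value intact otherwise:

```typescript
// Hypothetical sketch: only write updatedBy when a username is available,
// so an undefined user (security turned off) never clobbers an existing value.
interface SchedulerAttributes {
  createdBy?: string;
  updatedBy?: string;
}

function applyUpdatedBy(
  existing: SchedulerAttributes,
  username: string | undefined
): SchedulerAttributes {
  return {
    ...existing,
    // Spread the field in conditionally; omitting it keeps the old value.
    ...(username !== undefined ? { updatedBy: username } : {}),
  };
}
```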
interval: schema.string(),
}),
gapFillRange: schema.string(),
maxBackfills: schema.number(),
We do have validation at the API level, but this should also catch updates to the saved object that don't come from an API request (maybe not happening in this PR); the saved objects client should validate those.
const taskManager = context.taskManager;

// Throw error if a gap auto fill scheduler already exists for the same (rule type, consumer) pair
const pairs = Array.from(new Set(params.ruleTypes.map((rt) => `${rt.type}:${rt.consumer}`)));
There's no unit test for this error path... I haven't looked at the functional tests yet, but if there's no test for that there either, we should add a test somewhere.
Yes, I do have an API test for that.
// Throw error if a gap auto fill scheduler already exists for the same (rule type, consumer) pair
const pairs = Array.from(new Set(params.ruleTypes.map((rt) => `${rt.type}:${rt.consumer}`)));
if (pairs.length > 0) {
should there be an error if pairs.length === 0? do we always need to specify at least one pair?
Correct, we should have at least one pair, but I thought that would be covered by the param type validation:
ruleTypes: schema.arrayOf(
schema.object({
type: schema.string(),
consumer: schema.string(),
})
),
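To enforce "at least one pair" at validation time rather than in the handler, the check could presumably be expressed directly in the schema (e.g. an assumed `minSize: 1` option on `schema.arrayOf`). Shown below as a plain, hypothetical function so the check is explicit and runnable standalone:

```typescript
// Hypothetical sketch of the "at least one (type, consumer) pair" check
// discussed above; not the PR's actual validation code.
interface RuleTypePair {
  type: string;
  consumer: string;
}

function validateRuleTypes(ruleTypes: RuleTypePair[]): RuleTypePair[] {
  if (ruleTypes.length === 0) {
    // Rejecting an empty array up front means the handler never needs
    // a pairs.length > 0 guard.
    throw new Error('ruleTypes must contain at least one (type, consumer) pair');
  }
  return ruleTypes;
}
```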
}
);
} catch (e) {
await soClient.delete(GAP_AUTO_FILL_SCHEDULER_SAVED_OBJECT_TYPE, so.id);
would a log message be useful here?
}

const gapsInBackfillScheduling = gapsClampedIntervals.map(({ gap }) => gap);
if (ruleGapsClampedIntervals.length === 0) {
nit: can move this before
if (
maxGapsCountToProcess &&
totalProcessedGapsCount + ruleGapsClampedIntervals.length > maxGapsCountToProcess
) {
const chunkConcurrency = 10;
await pMap(
  chunks,
  async ({ startIndex, items }, idx) => {
I'm not too familiar with pMap. What happens if one of the calls to unsecuredSavedObjectsClient.bulkCreate fails? Does it act like Promise.all, where all the chunks would then fail? Maybe we should add a try/catch around it so we can populate orderedResults with something in case a chunk fails; otherwise the iteration below over orderedResults might hit a null accessor error because no value was populated.
Good catch, I updated the logic so that we catch those failures and return them in the response of this method.
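The pattern agreed on above can be sketched like this: run chunks concurrently (a minimal stand-in for pMap), but catch per-chunk failures so a failed bulkCreate records an error entry instead of leaving a hole in orderedResults. All names here are illustrative, not the PR's actual code:

```typescript
// Per-chunk try/catch around a concurrent map, so one failed chunk does not
// reject the whole batch or leave an undefined slot in the ordered results.
type ChunkResult<T> = { startIndex: number; items?: T[]; error?: string };

async function mapChunksSafely<T>(
  chunks: Array<{ startIndex: number; items: T[] }>,
  bulkCreate: (items: T[]) => Promise<T[]>,
  concurrency = 10
): Promise<Array<ChunkResult<T>>> {
  const orderedResults: Array<ChunkResult<T>> = new Array(chunks.length);
  let next = 0;
  // Simple worker pool standing in for pMap's concurrency limiter.
  const workers = Array.from(
    { length: Math.min(concurrency, chunks.length) },
    async () => {
      while (next < chunks.length) {
        const idx = next++;
        const { startIndex, items } = chunks[idx];
        try {
          orderedResults[idx] = { startIndex, items: await bulkCreate(items) };
        } catch (e) {
          // Failure is recorded in place, so later iteration over
          // orderedResults never hits an undefined entry.
          orderedResults[idx] = { startIndex, error: (e as Error).message };
        }
      }
    }
  );
  await Promise.all(workers);
  return orderedResults;
}
```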
});

// Logging per-chunk and per-SO average timings
this.logger.info(
Do you think this info log is necessary? Maybe a debug log? If the auto gap fill scheduler is running frequently, it could lead to a decent amount of logging. We could keep the final info log after the pMap and leave these as debug.
gaps,
})
)
gaps: ruleGaps,
Was this a bug before, that we weren't filtering the gaps by the rule ID?
I don't think we used this method with gaps from different rules before; it was working fine when you pass a single rule.
const gapsPerPage = DEFAULT_GAPS_PER_PAGE;

while (true) {
Can we add a circuit breaker for the max number of iterations we'll allow in this while(true) loop? I see there are some break points in here but I'm always leery of a while(true) loop
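The suggested circuit breaker could look something like this: the paging loop keeps its normal exit conditions, but also caps total iterations so a paging bug cannot spin forever. Names and the limit value are illustrative, not the PR's actual code:

```typescript
// Paging loop with an iteration cap as a safety net around while (true).
function collectAllPages<T>(
  fetchPage: (page: number) => T[],
  maxIterations = 1000
): T[] {
  const out: T[] = [];
  let page = 0;
  while (true) {
    // Circuit breaker: bail out if we somehow never see an empty page.
    if (page >= maxIterations) break;
    const items = fetchPage(page);
    if (items.length === 0) break; // normal exit: no more results
    out.push(...items);
    page++;
  }
  return out;
}
```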
name: schema.string({ defaultValue: '' }),
enabled: schema.boolean({ defaultValue: true }),
max_backfills: schema.number({ defaultValue: 1000, min: 1, max: 5000 }),
num_retries: schema.number({ defaultValue: 3, min: 1 }),
I may have missed it, but where is num_retries used?
We will use it later, when we introduce a new error state for gaps for when a backfill fails to fill them.
I decided not to add that functionality here, as this is already a big PR.
jeramysoucy left a comment:
Kibana security changes LGTM
💚 Build Succeeded
cc @nkhristinin
Gap auto fill scheduler task
Overview
This PR introduces the Gap auto fill scheduler, a task responsible for automatically scheduling backfills for rules that have unprocessed gaps.
The scheduler runs at a configured interval, checks for available system backfill capacity, identifies eligible rules with gaps, and schedules backfill jobs in batches while respecting configured limits and capacity constraints.
It also writes detailed execution information to the event log for visibility and troubleshooting.
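The run loop described above can be sketched roughly as follows; all names are illustrative and this is not the actual task implementation:

```typescript
// Minimal sketch of one scheduler run: check capacity, find rules with
// unprocessed gaps, and schedule backfills until a limit is reached.
interface RuleWithGaps {
  ruleId: string;
  gapCount: number;
}

interface SchedulerDeps {
  availableCapacity: () => number;            // remaining backfill slots
  findRulesWithGaps: () => RuleWithGaps[];    // rules with unprocessed gaps
  scheduleBackfill: (ruleId: string) => void; // enqueue one backfill job
}

function runSchedulerOnce(deps: SchedulerDeps, maxBackfills: number): string[] {
  const scheduled: string[] = [];
  // Respect both system capacity and the configured max_backfills limit.
  const capacity = Math.min(deps.availableCapacity(), maxBackfills);
  for (const rule of deps.findRulesWithGaps()) {
    if (scheduled.length >= capacity) break; // stop early at the limit
    deps.scheduleBackfill(rule.ruleId);
    scheduled.push(rule.ruleId);
  }
  return scheduled; // in the real task, results go to the event log
}
```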
New API: Create Gap auto fill scheduler
Endpoint:
POST /internal/alerting/rules/gaps/gap_auto_fill_scheduler
Purpose:
Persist a scheduler configuration in a saved object and register a task.
Example request (v1):
{
  "id": "optional-id",
  "name": "Gap fill scheduler",
  "enabled": true,
  "max_backfills": 500,
  "amount_of_retries": 3,
  "gap_fill_range": "now-7d",
  "schedule": { "interval": "1d" },
  "scope": ["security"], // used to differentiate between solutions
  "rule_types": [
    { "type": "siem.rule", "consumer": "securitySolution" }
  ]
}
Example response:
{
  "id": "abc123",
  "name": "Gap fill scheduler",
  "enabled": true,
  "schedule": { "interval": "1d" },
  "gap_fill_range": "now-7d",
  "max_backfills": 500,
  "amount_of_retries": 3,
  "created_by": "elastic",
  "updated_by": "elastic",
  "created_at": "2025-10-29T12:34:56.789Z",
  "updated_at": "2025-10-29T12:34:56.789Z",
  "scheduled_task_id": "gap-auto-fill-scheduler-task:abc123"
}
Updated API: Get rules with gaps
Added a sort field to support fetching rules with the oldest or newest gaps.
Event Log
Added new fields:
kibana.gap_auto_fill.execution.*
Used to track each scheduler run and its results.
Tracked fields include:
- status
- start, end, duration_ms
- rule_ids[]
- task_params.name
- task_params.amount_of_retries
- task_params.gap_fill_range
- task_params.interval
- task_params.max_backfills
- results[] with rule_id, processed_gaps, status, and error
Saved Object
New type:
gap_auto_fill_scheduler
Attributes:
- name
- enabled
- schedule.interval
- gapFillRange
- maxBackfills
- amountOfRetries
- createdBy, updatedBy
- createdAt, updatedAt
- scheduledTaskId
Task
New task type:
gap-auto-fill-scheduler-task
Timeout after 40s (default)
Task algorithm (High Level)
Event log statuses:
- SUCCESS
- SKIPPED
- ERROR
Cleanup step for stacked gaps
This PR also introduces a cleanup mechanism for stacked and in-progress gaps.
During each scheduler execution, the cleanup step identifies gaps that are currently marked as in progress and verifies whether a corresponding backfill still exists.
If no active backfill is found for a gap, the scheduler resets its in-progress interval and moves the gap back to the unfilled state.
After processing, the updated_at field of each checked gap is updated.
Gaps that were recently updated by this process will not be re-evaluated for the next 12 hours to reduce redundant checks and load.
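The cleanup step described above can be sketched as follows; field names and the structure are illustrative, though the 12-hour window comes from the description:

```typescript
// Cleanup sketch: reset in-progress gaps with no active backfill back to
// unfilled, stamp updatedAt on checked gaps, and skip recently checked ones.
const RECHECK_WINDOW_MS = 12 * 60 * 60 * 1000; // 12h, per the description

interface Gap {
  id: string;
  status: 'in_progress' | 'unfilled' | 'filled';
  updatedAt: number; // epoch ms
}

function cleanupStackedGaps(
  gaps: Gap[],
  hasActiveBackfill: (gapId: string) => boolean,
  now: number
): Gap[] {
  return gaps.map((gap): Gap => {
    if (gap.status !== 'in_progress') return gap;
    // Recently checked gaps are left alone to reduce redundant load.
    if (now - gap.updatedAt < RECHECK_WINDOW_MS) return gap;
    if (!hasActiveBackfill(gap.id)) {
      // No backfill still running for this gap: move it back to unfilled.
      return { ...gap, status: 'unfilled', updatedAt: now };
    }
    // Backfill still active: just record that the gap was checked.
    return { ...gap, updatedAt: now };
  });
}
```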
How to Test
Enable in kibana.dev.yml:
xpack.alerting.gapAutoFillScheduler.enabled: true
There are two ways to create gaps:
Manual method:
Create and enable a security rule with a 1-minute interval and 0-second lookback.
After the first run, disable the rule, wait 5 minutes, and then enable it again; you should see an execution error about gaps, and see the gap in the gaps table in the execution tab.
Using this tool:
Run the following command to generate multiple rules and gaps (100 rules, 10 gaps each, 30m interval rule, and remove all rules before):
Run the following request (adjust as needed for your environment):
In Discover, search the .kibana-event-log* data view using the query:
event.action: "gap-auto-fill-schedule"
(check the message and status fields)
In the Rules Monitoring table, check that some rules have gaps and some rules are in progress (being backfilled). After some time, all gaps should be filled, and the number of gaps should be 0.
Performance
The Gap auto fill scheduler attempts to schedule as many backfills as possible during each run.
It continues processing until it reaches one of the following limits:
Once either limit is reached, the task stops early and logs the partial results to the event log.
Test Results
Backfill client changes
In the Backfill client, we introduced a change that uses pMap to parallelise bulk creation. This allows the operations to run non-sequentially, which significantly improves performance.
For 100 rules and 100 gaps when we trigger a manual run: