
Conversation


@nkhristinin nkhristinin commented Oct 17, 2025

Gap auto fill scheduler task

Overview

This PR introduces the Gap auto fill scheduler, a task responsible for automatically scheduling backfills for rules that have unprocessed gaps.

The scheduler runs at a configured interval, checks for available system backfill capacity, identifies eligible rules with gaps, and schedules backfill jobs in batches while respecting configured limits and capacity constraints.

It also writes detailed execution information to the event log for visibility and troubleshooting.


New API: Create Gap auto fill scheduler

Endpoint:
POST /internal/alerting/rules/gaps/gap_auto_fill_scheduler

Purpose:
Persist a scheduler configuration as a saved object (SO) and register the scheduler task.

Example request (v1):

{
  "id": "optional-id",
  "name": "Gap fill scheduler",
  "enabled": true,
  "max_backfills": 500,
  "amount_of_retries": 3,
  "gap_fill_range": "now-7d",
  "schedule": { "interval": "1d" },
  "scope": ["security"], // use to differentiate for solutions
  "rule_types": [
    { "type": "siem.rule", "consumer": "securitySolution" }
  ]
}

Example response:

{
  "id": "abc123",
  "name": "Gap fill scheduler",
  "enabled": true,
  "schedule": { "interval": "1d" },
  "gap_fill_range": "now-7d",
  "max_backfills": 500,
  "amount_of_retries": 3,
  "created_by": "elastic",
  "updated_by": "elastic",
  "created_at": "2025-10-29T12:34:56.789Z",
  "updated_at": "2025-10-29T12:34:56.789Z",
  "scheduled_task_id": "gap-auto-fill-scheduler-task:abc123"
}

Updated API: Get rules with gaps

  • Added a sort field to support fetching rules with the oldest or newest gaps.

Event Log

Added new fields: kibana.gap_auto_fill.execution.*
Used to track each scheduler run and its results.

Tracked fields include:

  • status
  • start, end, duration_ms
  • rule_ids[]
  • task_params.name
  • task_params.amount_of_retries
  • task_params.gap_fill_range
  • task_params.interval
  • task_params.max_backfills
  • results[] with rule_id, processed_gaps, status, and error
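
For illustration, a single scheduler-run event could look roughly like the sketch below (a TypeScript object literal). Only event.action and the kibana.gap_auto_fill.execution.* field names listed above come from this PR; all values are hypothetical.

// Hypothetical example of one scheduler-run document in .kibana-event-log*.
// Field names follow the list above; every value is made up for illustration.
const exampleExecutionEvent = {
  event: { action: 'gap-auto-fill-schedule' },
  kibana: {
    gap_auto_fill: {
      execution: {
        status: 'success',
        start: '2025-10-29T12:00:00.000Z',
        end: '2025-10-29T12:00:18.000Z',
        duration_ms: 18000,
        rule_ids: ['rule-1', 'rule-2'],
        task_params: {
          name: 'Gap fill scheduler',
          amount_of_retries: 3,
          gap_fill_range: 'now-7d',
          interval: '1d',
          max_backfills: 500,
        },
        results: [
          { rule_id: 'rule-1', processed_gaps: 4, status: 'success' },
          { rule_id: 'rule-2', processed_gaps: 0, status: 'error', error: 'rule disabled' },
        ],
      },
    },
  },
};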

Saved Object

New type: gap_auto_fill_scheduler

Attributes:

  • name
  • enabled
  • schedule.interval
  • gapFillRange
  • maxBackfills
  • amountOfRetries
  • createdBy, updatedBy
  • createdAt, updatedAt
  • scheduledTaskId
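
For reference, a rough sketch of what the attribute schema could look like with @kbn/config-schema (the style used in the schema fragments quoted in the review below); optionality and exact types are assumptions, not the PR's actual definition.

import { schema } from '@kbn/config-schema';

// Sketch only: attribute names follow the list above; optionality and defaults are guesses.
const gapAutoFillSchedulerAttributesSchema = schema.object({
  name: schema.string(),
  enabled: schema.boolean(),
  schedule: schema.object({
    interval: schema.string(),
  }),
  gapFillRange: schema.string(),
  maxBackfills: schema.number(),
  amountOfRetries: schema.number(),
  createdBy: schema.maybe(schema.string()),
  updatedBy: schema.maybe(schema.string()),
  createdAt: schema.string(),
  updatedAt: schema.string(),
  scheduledTaskId: schema.maybe(schema.string()),
});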

Task

New task type: gap-auto-fill-scheduler-task

Timeout after 40s (default)
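
A minimal sketch of how a task type like this is typically registered with Kibana Task Manager; only the task type name and the 40s timeout are taken from this PR, everything else is illustrative.

// Sketch, assuming the standard Task Manager registration API.
function registerGapAutoFillSchedulerTask(taskManager: {
  registerTaskDefinitions: (defs: Record<string, unknown>) => void;
}) {
  taskManager.registerTaskDefinitions({
    'gap-auto-fill-scheduler-task': {
      title: 'Gap auto fill scheduler',
      timeout: '40s', // default timeout described above
      createTaskRunner: () => ({
        async run() {
          // load scheduler config, check backfill capacity, schedule backfills in batches,
          // and write an execution summary to the event log (see the algorithm below)
        },
        async cancel() {
          // mark the run as cancelled so partial results are still logged
        },
      }),
    },
  });
}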

Task algorithm (High Level)

  1. Initialize
    • Create required clients and load scheduler configuration.
    • Prepare an event logger for the run.
  2. Capacity check
    • Determine remaining system backfill capacity.
    • If none, log “skipped” and exit.
  3. Fetch rules ids with gaps
    • Query for rule IDs that currently have gaps (most recent first).
    • If none, log “skipped” and exit.
  4. Process rules in batches
    • Iterate rule IDs in chunks.
    • Keep only enabled rules.
    • For each batch:
      • Fetch current gaps for these rules and ignore overlaps with active or scheduled backfills.
      • Schedule backfills for discovered gaps.
      • Aggregate per-rule results and statuses.
    • Re-check capacity after each batch and handle cancellation.
    • If capacity is exhausted, log a summary and stop early.
  5. Finalize
    • If no gaps were scheduled, log “skipped”.
    • Otherwise, log summarised result and overall status, then exit.
  6. Error and cancellation handling
    • On error, log error summary and exit.
    • On cancellation, log partial results and exit cleanly.
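
The pseudocode below is a rough, hypothetical sketch of that loop; every helper name is made up and the actual implementation differs in the details.

// Hypothetical sketch of the run loop described above.
interface SchedulerDeps {
  getBackfillCapacity(): Promise<number>;
  findRuleIdsWithGaps(): Promise<string[]>;
  filterEnabledRules(ruleIds: string[]): Promise<string[]>;
  scheduleBackfillsForGaps(ruleIds: string[]): Promise<Array<{ ruleId: string; processedGaps: number }>>;
  isCancelled(): boolean;
  log(status: 'success' | 'skipped' | 'error', detail: unknown): Promise<void>;
}

async function runGapAutoFillScheduler(deps: SchedulerDeps, batchSize = 100) {
  try {
    if ((await deps.getBackfillCapacity()) <= 0) return deps.log('skipped', 'no backfill capacity');

    const ruleIds = await deps.findRuleIdsWithGaps();
    if (ruleIds.length === 0) return deps.log('skipped', 'no rules with gaps');

    const results: Array<{ ruleId: string; processedGaps: number }> = [];
    for (let i = 0; i < ruleIds.length; i += batchSize) {
      if (deps.isCancelled()) break; // timeout/cancellation: report partial results
      const enabled = await deps.filterEnabledRules(ruleIds.slice(i, i + batchSize));
      results.push(...(await deps.scheduleBackfillsForGaps(enabled)));
      if ((await deps.getBackfillCapacity()) <= 0) break; // capacity exhausted: stop early
    }
    return results.length > 0 ? deps.log('success', results) : deps.log('skipped', 'nothing scheduled');
  } catch (e) {
    return deps.log('error', e);
  }
}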

Event log statuses

SUCCESS

  • Cancelled by timeout or explicit cancellation
  • Stopped early due to capacity exhausted (no remaining capacity during loop)
  • Stopped early after post-batch capacity check
  • Completed with at least one successful per-rule result

SKIPPED

  • No system backfill capacity at start
  • No rules with gaps
  • No enabled rules could be scheduled (after processing)

ERROR

  • Unhandled error during execution

Cleanup step for stacked gaps

This PR also introduces a cleanup mechanism for stacked and in-progress gaps.

During each scheduler execution, the cleanup step identifies gaps that are currently marked as in progress and verifies whether a corresponding backfill still exists.
If no active backfill is found for a gap, the scheduler resets its in-progress interval and moves the gap back to the unfilled state.

After processing, the updated_at field of each checked gap is updated.
Gaps that were recently updated by this process will not be re-evaluated for the next 12 hours to reduce redundant checks and load.
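
A hypothetical sketch of that cleanup pass (all types and helper names here are illustrative, not the PR's actual code):

// Reset in-progress gaps whose backfill no longer exists, skipping recently checked gaps.
interface TrackedGap {
  ruleId: string;
  inProgressIntervals: Array<{ gte: string; lte: string }>;
  updatedAt: string;
}

const TWELVE_HOURS_MS = 12 * 60 * 60 * 1000;

async function cleanupStackedGaps(
  gaps: TrackedGap[],
  hasActiveBackfill: (gap: TrackedGap) => Promise<boolean>,
  resetToUnfilled: (gap: TrackedGap) => Promise<void>
) {
  const now = Date.now();
  for (const gap of gaps) {
    // Skip gaps touched within the last 12 hours to reduce redundant checks and load.
    if (now - Date.parse(gap.updatedAt) < TWELVE_HOURS_MS) continue;
    if (gap.inProgressIntervals.length === 0) continue;
    // If no active backfill covers this gap, move it back to the unfilled state.
    if (!(await hasActiveBackfill(gap))) {
      await resetToUnfilled(gap);
    }
  }
}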


How to Test

Enable in kibana.dev.yml

xpack.alerting.gapAutoFillScheduler.enabled: true

  1. Ensure you have rules with gaps

There are two ways to create gaps:

  • Manual method:
    Create and enable a security rule with a 1-minute interval and 0-second lookback.
    After the first run, disable the rule, wait 5 minutes, and then enable it again. You should see an execution error about gaps, and the gap should appear in the Gaps table on the rule's execution tab.

  • Using this tool:
    Run the following command to generate multiple rules and gaps (100 rules, 10 gaps each, a 30m rule interval, removing all existing rules first):

    npm run start -- rules --rules 100 -c -g 10 -i "30m"
    
  2. Create and enable the scheduler

Run the following request (adjust as needed for your environment):

 fetch("http://localhost:5601/internal/alerting/rules/gaps/gap_auto_fill_scheduler", {
     "headers": {
       // add auth and content headers
     },
     "body": JSON.stringify({
       "id": "gap-scheduler",
       "name": "gap-scheduler",
       "enabled": true,
       "max_backfills": 1000,
       "amount_of_retries": 3,
       "gap_fill_range": "now-90d",
       "schedule": { "interval": "1m" },
       "scope": ["security"],
       "rule_types": [
         { "type": "siem.queryRule", "consumer": "siem" },
         { "type": "siem.savedQueryRule", "consumer": "siem" },
         { "type": "siem.eqlRule", "consumer": "siem" },
         { "type": "siem.esqlRule", "consumer": "siem" },
         { "type": "siem.thresholdRule", "consumer": "siem" },
         { "type": "siem.newTermsRule", "consumer": "siem" },
         { "type": "siem.mlRule", "consumer": "siem" },
         { "type": "siem.indicatorRule", "consumer": "siem" }
       ]
     }),
     "method": "POST",
     "mode": "cors",
     "credentials": "include"
  });
  3. Verify that it works

In Discover, search the .kibana-event-log* data view using the query event.action:"gap-auto-fill-schedule" (check the message and status fields).
Screenshot 2025-10-29 at 13 37 51

In the Rules Monitoring table, check that some rules have gaps and some rules are in progress (being backfilled). After some time, all gaps should be filled and the number of gaps should be 0.

Screenshot 2025-10-29 at 13 37 00

Performance

The Gap auto fill scheduler attempts to schedule as many backfills as possible during each run.
It continues processing until it reaches one of the following limits:

  • Task timeout: 40 seconds (default)
  • System backfill capacity: 1000 concurrent backfills (default)

Once either limit is reached, the task stops early and logs the partial results to the event log.

Test Results

  1. 1000 rules × 1 gap each | ~20s | All gaps scheduled successfully
Screenshot 2025-10-30 at 11 28 07
  2. 500 rules × 1000 gaps each | ~40s | Run canceled by timeout after partial scheduling
Screenshot 2025-10-30 at 11 43 32

Backfill client changes

In the Backfill client, we introduced a change that uses pMap to parallelise bulk creation. This allows the operations to run non-sequentially, which significantly improves performance.
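
As a rough illustration of the approach (chunk size, concurrency, and all names below are assumptions, not the exact implementation):

import pMap from 'p-map';

// Instead of awaiting each bulkCreate sequentially, run the chunks through pMap
// with bounded concurrency; pMap returns results in input order.
async function bulkCreateInParallel<T>(
  objects: T[],
  bulkCreate: (chunk: T[]) => Promise<unknown>,
  chunkSize = 100,
  concurrency = 10
) {
  const chunks: T[][] = [];
  for (let i = 0; i < objects.length; i += chunkSize) {
    chunks.push(objects.slice(i, i + chunkSize));
  }
  return pMap(chunks, (chunk) => bulkCreate(chunk), { concurrency });
}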

For 100 rules and 100 gaps, when triggering a manual run:

  • on main, it took ~11s
Screenshot 2025-11-17 at 15 58 01
  • on this branch, it takes ~1.6s
Screenshot 2025-11-17 at 15 52 56

@nkhristinin

/ci

@nkhristinin

/ci

@nkhristinin

/ci

@nkhristinin nkhristinin requested a review from ymao1 November 11, 2025 13:26
@nkhristinin

@elasticmachine merge upstream

@elasticmachine

ignoring request to update branch, pull request is closed

@nkhristinin nkhristinin reopened this Nov 12, 2025
@nkhristinin

@elasticmachine merge upstream

@nkhristinin

@elasticmachine merge upstream

gap_fill_range: '24h',
num_retries: 3,
max_backfills: 100,
scope: 'internal',
Contributor:

nit: should this be ['internal']?

consumer: schema.string(),
})
),
createdBy: schema.maybe(schema.string()),
Contributor:

We populate these fields based on the username we pull from the request object. I believe if security is turned off, we would get an undefined user. So if you're trying to overwrite an existing user in the saved object, undefined would not overwrite.

interval: schema.string(),
}),
gapFillRange: schema.string(),
maxBackfills: schema.number(),
Contributor:

We do have the validation at the API level but this should also catch if we do updates to the saved object that don't come from an API request (maybe not happening in this PR), the saved objects client should validate those.

const taskManager = context.taskManager;

// Throw error if a gap auto fill scheduler already exists for the same (rule type, consumer) pair
const pairs = Array.from(new Set(params.ruleTypes.map((rt) => `${rt.type}:${rt.consumer}`)));
Contributor:

there's no unit test for this error path...I haven't looked at the functional tests yet but if there's no test for that there, we should add a test somewhere.

Contributor Author:

Yes, I do have an API test for that.


// Throw error if a gap auto fill scheduler already exists for the same (rule type, consumer) pair
const pairs = Array.from(new Set(params.ruleTypes.map((rt) => `${rt.type}:${rt.consumer}`)));
if (pairs.length > 0) {
Contributor:

should there be an error if pairs.length === 0? do we always need to specify at least one pair?

Contributor Author:

We should have at least 1 pair, that's correct, but I thought it would be covered by the param types validation:

  ruleTypes: schema.arrayOf(
    schema.object({
      type: schema.string(),
      consumer: schema.string(),
    })
  ),

}
);
} catch (e) {
await soClient.delete(GAP_AUTO_FILL_SCHEDULER_SAVED_OBJECT_TYPE, so.id);
Contributor:

would a log message be useful here?

@elastic-vault-github-plugin-prod elastic-vault-github-plugin-prod bot requested a review from a team as a code owner November 14, 2025 16:37
@jeramysoucy jeramysoucy self-requested a review November 17, 2025 16:20
@nkhristinin

@elasticmachine merge upstream

@vitaliidm vitaliidm self-requested a review November 18, 2025 16:33
}

const gapsInBackfillScheduling = gapsClampedIntervals.map(({ gap }) => gap);
if (ruleGapsClampedIntervals.length === 0) {
Contributor:

nit: can move this before

if (
      maxGapsCountToProcess &&
      totalProcessedGapsCount + ruleGapsClampedIntervals.length > maxGapsCountToProcess
    ) {

const chunkConcurrency = 10;
await pMap(
  chunks,
  async ({ startIndex, items }, idx) => {
Contributor:

I'm not too familiar with pMap. What happens if one of the calls to unsecuredSavedObjectsClient.bulkCreate fails? Does it act like a Promise.all, where all the chunks would then fail? Maybe we should add a try/catch around it so we can populate the orderedResults with something in case this chunk fails? Otherwise the iteration below over the orderedResults might see a null accessor error because no value was populated.

Contributor Author:

Good catch. I updated the logic so that we catch those failures and return them in the response of this method.

});

// Logging per-chunk and per-SO average timings
this.logger.info(
Contributor:

do you think this info log is necessary? maybe a debug log? if the auto gap fill scheduler is running frequently, it could lead to a decent amount of logging. we could keep the final info log after the pMap and leave these as debug

gaps,
})
)
gaps: ruleGaps,
Contributor:

was this a bug before that we weren't filter the gaps by the rule ID?

Contributor Author:

I don't think we used this method with gaps from different rules before; it was working fine when you pass a single rule.


const gapsPerPage = DEFAULT_GAPS_PER_PAGE;

while (true) {
Contributor:

Can we add a circuit breaker for the max number of iterations we'll allow in this while(true) loop? I see there are some break points in here but I'm always leery of a while(true) loop

name: schema.string({ defaultValue: '' }),
enabled: schema.boolean({ defaultValue: true }),
max_backfills: schema.number({ defaultValue: 1000, min: 1, max: 5000 }),
num_retries: schema.number({ defaultValue: 3, min: 1 }),
Contributor:

I may have missed it but where is this used? num_retries

Contributor Author:

We will use it later, when we introduce a new error state for gaps for cases where a backfill fails to fill them.

I decided not to add that functionality here, as this is already a big PR.

@jeramysoucy left a comment:

Kibana security changes LGTM

@nkhristinin nkhristinin requested a review from ymao1 November 19, 2025 12:47
@elasticmachine

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
alerting 348 349 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
alerting 88.1KB 88.1KB -8.0B

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
alerting 61 64 +3

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
alerting 25.4KB 25.4KB -4.0B


cc @nkhristinin


Labels

backport:skip (This PR does not require backporting), release_note:skip (Skip the PR/issue when compiling release notes)
