Conversation

@baileycash-elastic
Contributor

@baileycash-elastic baileycash-elastic commented Nov 4, 2025

Closes #240356
Replaces this PR

Summary

This pull request introduces a new API route and supporting backend logic for bulk updating tags on alerts in the rule registry. The main changes include the addition of a patchTags method to the AlertsClient, a new route for bulk patching alert tags, and reusable scripts for tag updates. These changes make it possible to add, remove, or replace tags on multiple alerts efficiently, either by IDs or by query.

Alerting authorization

The AlertingAuthorization class was changed to support bulk authorization of multiple rule type IDs and consumers. ensureAuthorized was renamed to _ensureAuthorized and now accepts Array<{ ruleTypeId: string; consumers: string[] }>. The logic is the same as before, but it constructs all security actions from the input. This avoids doing one authorization call per (ruleTypeId, consumer) pair. Lastly, a bulkEnsureAuthorized method is exposed as a wrapper around the private _ensureAuthorized.
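As a rough illustration of the idea (the function and parameter names below are assumptions, not the actual AlertingAuthorization internals), the bulk variant can flatten every pair into one de-duplicated set of security actions so that a single privileges check covers the whole operation:

// Sketch only: names are assumptions, not the real AlertingAuthorization code.
interface BulkAuthorizationEntry {
  ruleTypeId: string;
  consumers: string[];
}

// Flattens every (ruleTypeId, consumer) pair into one de-duplicated set of
// security actions, so one privileges request covers the whole bulk operation.
function buildSecurityActions(
  entries: BulkAuthorizationEntry[],
  operation: string,
  getAction: (ruleTypeId: string, consumer: string, operation: string) => string
): string[] {
  const actions = new Set<string>();
  for (const { ruleTypeId, consumers } of entries) {
    for (const consumer of consumers) {
      actions.add(getAction(ruleTypeId, consumer, operation));
    }
  }
  return [...actions];
}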

Alerts client

The code in the AlertsClient is fairly outdated. For this reason, I decided not to reuse the existing functionality and instead wrote the new logic from scratch. A bulkUpdateTags method is introduced that bulk updates the tags of multiple alerts, either by alertIds or by a KQL query. In the IDs scenario, an aggregation retrieves the rule type ID and the consumer of each alert, and then we authorize in bulk. If the user has access, we proceed and update the tags of the alerts; if not, we throw an error. For the query scenario, we combine the authorization filter with the query so that alerts the user does not have access to are filtered out. Lastly, we audit log only once for the whole bulk operation, not once per alert found.
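A condensed sketch of that flow (all dependency and helper names here are illustrative assumptions, not the actual AlertsClient implementation):

// Illustrative only: dependency names are assumptions, not the actual AlertsClient.
interface BulkUpdateTagsParams {
  index: string;
  alertIds?: string[];
  query?: string; // KQL
  addTags?: string[];
  removeTags?: string[];
}

interface Deps {
  // Aggregates rule type IDs and consumers for the given alert IDs.
  getRuleTypesAndConsumers(index: string, alertIds: string[]): Promise<Array<{ ruleTypeId: string; consumers: string[] }>>;
  // Throws if the user is not authorized for any (ruleTypeId, consumer) pair.
  bulkEnsureAuthorized(entries: Array<{ ruleTypeId: string; consumers: string[] }>): Promise<void>;
  // Returns an ES filter limiting results to alerts the user can access.
  getAuthorizationFilter(): Promise<object>;
  // Runs the tag update (update by query with the reusable Painless scripts).
  updateTagsByQuery(args: { index: string; query: object; addTags?: string[]; removeTags?: string[] }): Promise<void>;
  auditLog(message: string): void;
}

async function bulkUpdateTagsSketch(params: BulkUpdateTagsParams, deps: Deps): Promise<void> {
  if (params.alertIds?.length) {
    // IDs path: authorize the aggregated (ruleTypeId, consumer) pairs in bulk first.
    const entries = await deps.getRuleTypesAndConsumers(params.index, params.alertIds);
    await deps.bulkEnsureAuthorized(entries);
    await deps.updateTagsByQuery({
      index: params.index,
      query: { ids: { values: params.alertIds } },
      addTags: params.addTags,
      removeTags: params.removeTags,
    });
  } else {
    // Query path: combine the authorization filter with the caller's query so
    // inaccessible alerts are filtered out (KQL-to-DSL conversion elided here).
    const authorizationFilter = await deps.getAuthorizationFilter();
    await deps.updateTagsByQuery({
      index: params.index,
      query: { bool: { filter: [authorizationFilter, { query_string: { query: params.query } }] } },
      addTags: params.addTags,
      removeTags: params.removeTags,
    });
  }
  // One audit log entry for the whole bulk operation, not one per alert.
  deps.auditLog('bulk_update_alert_tags');
}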

API and Backend Enhancements

  • Added a new API route POST /internal/rac/alerts/tags for bulk updating alert tags, supporting add, remove, and replace operations, with validation and error handling (a sketch of the request body shape follows this list).
  • Implemented the bulkUpdateTags method in AlertsClient, enabling tag updates on alerts by IDs or query, using Elasticsearch scripts for efficient bulk operations.
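As a rough sketch of what a request to the new route might carry (field names are inferred from the summary above and the examples later in this thread; the final schema may differ):

// Assumed request body shape, inferred from this thread; not the final schema.
interface BulkUpdateAlertTagsRequestBody {
  index: string;        // alert index or alias to target
  alertIds?: string[];  // update by explicit alert IDs ...
  query?: string;       // ... or by a KQL query
  addTags?: string[];
  removeTags?: string[];
}

// Example body matching the shape used in the review comments below.
const exampleBody: BulkUpdateAlertTagsRequestBody = {
  index: '.alerts-observability.slo.alerts-default', // hypothetical concrete index/alias
  alertIds: ['30b3b23c-27f5-4064-9947-1ebe376d4561'],
  addTags: ['alert_tag'],
};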

Reusable Update Scripts

  • Added reusable Painless scripts for adding, removing, and replacing alert tags, exported from alert_client_bulk_update_scripts.ts and integrated into the client logic (a sketch of such a script constant follows below).
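For illustration, such a constant might look roughly like this (a sketch based on the script fragment quoted later in the review; the actual exported scripts may differ, and ALERT_WORKFLOW_TAGS is assumed to resolve to the alert workflow tags field):

// Sketch based on the fragment quoted later in this thread; the real
// alert_client_bulk_update_scripts.ts may differ.
const ALERT_WORKFLOW_TAGS = 'kibana.alert.workflow_tags';

export const ADD_TAGS_SCRIPT = `
  if (ctx._source['${ALERT_WORKFLOW_TAGS}'] == null) {
    ctx._source['${ALERT_WORKFLOW_TAGS}'] = new ArrayList();
  }
  for (item in params.addTags) {
    if (!ctx._source['${ALERT_WORKFLOW_TAGS}'].contains(item)) {
      ctx._source['${ALERT_WORKFLOW_TAGS}'].add(item);
    }
  }
`;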

@baileycash-elastic
Contributor Author

/ci

@github-actions github-actions bot added the author:obs-ux-management PRs authored by the obs ux management team label Nov 4, 2025
@baileycash-elastic baileycash-elastic force-pushed the alerting-240356-patch branch 2 times, most recently from 1f4e1b4 to a6c1886 Compare November 4, 2025 23:01
@baileycash-elastic baileycash-elastic added Team:actionable-obs Formerly "obs-ux-management", responsible for SLO, o11y alerting, significant events, & synthetics. Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting labels Nov 4, 2025
@baileycash-elastic baileycash-elastic marked this pull request as ready for review November 4, 2025 23:03
@baileycash-elastic baileycash-elastic requested review from a team as code owners November 4, 2025 23:03
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

validate: {
  body: buildRouteValidation(
    t.intersection([
      t.type({
Contributor

I don't understand why we need to provide the index. Shouldn't the index be derived by the alert? It feels like our internal system should be able to know and keep track of this.

Also, what if we are trying to bulk update alerts that are spread across multiple indices, for example, add a tag to both a synthetics and apm rule that are for the same service? cc: @cnasikas there may be some context I'm missing here.

Member

@cnasikas cnasikas Nov 7, 2025

The code was developed before ResponseOps, but I assume it is needed because it would be very inefficient to resolve the indices of the alerts from their IDs. You might be able to do a bulk get using an alias, but for the query param this is not feasible; you would need to do an aggregation (search) over all alerting indices, handle aggregation limits, etc. Behind the scenes, the bulk update in the alerts client uses update by query, which supports updating multiple documents through an alias. So you can set the index route param to .alerts-observability-* or even alerts-observability-logs*,alerts-observability-apm*.

Contributor

@dominiqueclarke dominiqueclarke Nov 7, 2025

I'm attempting this without success so far

PATCH kbn:/internal/rac/alerts/tags
{
    "index": ".alerts-*",
    "addTags": ["alert_tag"],
    "alertIds": ["30b3b23c-27f5-4064-9947-1ebe376d4561"]
}

Result is

{
  "statusCode": 404,
  "error": "Not Found",
  "message": "alerts with ids 30b3b23c-27f5-4064-9947-1ebe376d4561 and index .alerts-* not found"
}

Attempting to use the UUID

[Screenshot: 2025-11-07 at 10:47 AM]

Contributor

@cnasikas Testing this, it seems I'm not able to use an index pattern in the way you've suggested.

Contributor Author

Users will have to provide the full index due to constraints with esClient.

Contributor

So from the UI, we will have to segment bulk requests so that we have one bulk request per index, it would seem.

Member

You are right, mget does not support index patterns. To avoid having the UI make multiple requests, instead of an mget we can do an aggregation to get the rule type IDs and consumers. Then we can construct the pairs and pass them to the ensureAllAuthorized. The index would still be needed to target the o11y + stack alerts (and not modify the security alerts), but it can be optional and fall back to .alerts-*.

GET .alerts-*/_search
{
  "query": {
    "ids": {
      "values": [
        "36d6d650-1204-44f5-b105-3af0c5d5acde"
      ]
    }
  },
  "aggs": {
    "ruleTypeId": {
      "terms": {
        "field": "kibana.alert.rule.rule_type_id",
        "size": 100
      },
      "aggs": {
        "consumer": {
          "terms": {
            "field": "kibana.alert.rule.consumer",
            "size": 100
          }
        }
      }
    }
  },
  "size": 0
}

@baileycash-elastic
Contributor Author

@elasticmachine merge upstream

});
});

describe('failure scenarios', () => {
Contributor

Could we please also mock the Not Found failure scenario?

noKibanaPrivileges,
];

addTests({
Contributor

Nice, I like how these are structured 🔥

.set('kbn-xsrf', 'true')
.send({
  alertIds: [alertId],
  addTags: ['new-tag'],
Contributor

I noticed all of these only use the addTags param. Could you please add one scenario for tag removal? (No need to duplicate all tests; just one that bulk removes alert tags would be enough.)

  ctx._source['${ALERT_WORKFLOW_TAGS}'] = new ArrayList();
}
for (item in params.addTags) {
  if (!ctx._source['${ALERT_WORKFLOW_TAGS}'].contains(item)) {
Contributor

I think this is not enough; if I use spaces, I am able to create duplicates:

[Screenshot: 2025-11-07 at 13:34]

Should we always trim the tags before applying?
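For example, a small normalization step on the Kibana side before the tags reach the scripts (a sketch, not part of this PR) would make ' alert_tag' and 'alert_tag' collapse into one value:

// Sketch only, not part of this PR: trim and de-duplicate tags before they are
// passed to the Painless scripts, so values differing only by whitespace collapse.
const normalizeTags = (tags: string[]): string[] =>
  Array.from(new Set(tags.map((tag) => tag.trim()).filter((tag) => tag.length > 0)));

normalizeTags([' alert_tag', 'alert_tag', '  ']); // ['alert_tag']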

@dominiqueclarke dominiqueclarke self-requested a review November 7, 2025 14:29
@dominiqueclarke
Contributor

The Sec alert was updated but not the O11y one, which is expected given the Security index used in the request. The response, though, showed a success result for the O11y alert too.

@cnasikas I was able to recreate this too. The difference between the success count and the length of the results array is confusing from a user-experience standpoint, and the "success" status is misleading.

Contributor

@dominiqueclarke dominiqueclarke left a comment

My user has access to the not_default space; I'm using the superuser. There are no alerts in that space, and definitely not the alerts with the UUIDs I provided. However, 403 seems like the wrong response to return here.

[Screenshot: 2025-11-18 at 11:18 AM]

@dominiqueclarke dominiqueclarke self-requested a review November 18, 2025 18:14
Contributor

@dominiqueclarke dominiqueclarke left a comment

When you use the query option, the results array is empty. Maybe it doesn't make sense to include an array item for every individual instance that matches the query, but perhaps we could return a single item for the query with status: success?

Contributor

@dominiqueclarke dominiqueclarke left a comment

Blocking on the bug reported by Umberto and me, where the success status can be incorrect.

@cnasikas
Member

cnasikas commented Nov 19, 2025

EDIT: When I tried to update the tags of all the o11y rules I created, only 3 out of 4 were actually updated:

@umbopepato Which user did you use and what is the consumer of each ES rule?

The Sec alert was updated but not the O11y one, which is expected given the Security index used in the request. The response, though, showed a success result for the O11y alert too.

@cnasikas I was able to recreate this too. The difference between the success count and the length of the results array is confusing from a user-experience standpoint, and the "success" status is misleading.

When you use the query option, the results array is empty. Maybe it doesn't make sense to include an array item for every individual instance that matches the query, but perhaps we could return a single item for the query with status: success?

@umbopepato @dominiqueclarke Given all the misleading findings, I would suggest removing the results array from the response entirely and returning a failures array only when failures exist. It is not possible to know which alerts were updated and which were not with an updateByQuery, and doing another call to find out is not performant and would increase the complexity of the code. Wdyt?
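Shape-wise, something roughly like this (a sketch; field names are assumptions, not the final response):

// Assumed response shape for the suggestion above; names are illustrative.
interface BulkUpdateAlertTagsResponse {
  total: number;   // documents that matched the query
  updated: number; // documents actually updated
  // Only populated when ES reports per-document failures (e.g. version conflicts).
  failures?: Array<{ id: string; message: string }>;
}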

My user has access to the not_default space; I'm using the superuser. There are no alerts in that space, and definitely not the alerts with the UUIDs I provided. However, 403 seems like the wrong response to return here.

I believe that the response is correct here. You are not authorized to perform any action on that space, so a 403 indicates that.

@umbopepato
Member

@umbopepato Which user did you use and what is the consumer of each ES rule?

I used the default elastic superuser

@dominiqueclarke
Contributor

I believe that the response is correct here. You are not authorized to perform any action on that space, so a 403 indicates that.

I'm not sure I fully understand the rationale, but this particular issue is not as critical to me as the other. My user does technically have access to perform all actions in that space (they are a superuser); it's just that there are no alerts in that space. It's more of a 404.

@dominiqueclarke
Contributor

@umbopepato @dominiqueclarke Given all the misleading findings, I would suggest removing the results array from the response entirely and returning a failures array only when failures exist. It is not possible to know which alerts were updated and which were not with an updateByQuery, and doing another call to find out is not performant and would increase the complexity of the code. Wdyt?

Fine by me. That'll help us iterate over the failures to populate the errors in the UI.

@cnasikas
Member

I'm not sure I fully understand the rationale, but this particular issue is not as critical to me as the other. My user does technically have access to perform all actions in that space (they are a superuser); it's just that there are no alerts in that space. It's more of a 404.

Oh, I see what you mean. I thought you did NOT have access to that space. Let me investigate, and I will come back to you.

@cnasikas
Member

@dominiqueclarke I changed the response to return only the failures if they exist. I also return a 404 for the scenario with alerts from a different space. @umbopepato I could not reproduce your bug. Was it a refresh issue with the alerts table?

@elastic-vault-github-plugin-prod elastic-vault-github-plugin-prod bot requested a review from a team as a code owner November 19, 2025 16:02
@cnasikas cnasikas self-assigned this Nov 20, 2025
@cnasikas
Member

@elasticmachine merge upstream

@dominiqueclarke
Contributor

@dominiqueclarke I changed the response to return only the failures if they exist. I also return a 404 for the scenario with alerts from a different space.

I'm really sorry to report that I'm having difficulties with returning failures when there are any. Here I have two IDs, and I'm querying against the SLO alert indices. One ID is from an SLO alert and the other is from a synthetics alert. As you can see, it's reporting total: 1, updated: 1, without reporting any failures.

[Screenshot: 2025-11-20 at 9:36 PM]

@cnasikas
Member

cnasikas commented Nov 21, 2025

I'm really sorry to report that I'm having difficulties with returning failures when there are any. Here I have two IDs, and I'm querying against the SLO alert indices. One ID is from an SLO alert and the other is from a synthetics alert. As you can see, it's reporting total: 1, updated: 1, without reporting any failures.

This is the expected response. We are using updateByQuery behind the scenes, which will update any alert that matches the query: { ids: { values: ['alert-id-1', ....] } }. ES will not report the docs that did not match the filter as failures; it only reports failures for docs that matched the filter but that ES could not update for some reason, such as conflict errors. We cannot throw an error when a doc is missing from the index by doing an mget (ES bulk get) before the query, because mget does not support index aliases. Lastly, the reason we are using updateByQuery for IDs is to be able to support index aliases. If the API becomes public in the future, we can document this behavior.
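To illustrate (a hedged sketch using the Elasticsearch JS client directly, not the AlertsClient code; the index and tag values are placeholders):

import { Client } from '@elastic/elasticsearch';

// Sketch only: shows why a non-matching ID is not reported as a failure.
async function exampleUpdateByQuery(client: Client) {
  const response = await client.updateByQuery({
    index: '.alerts-observability.slo.alerts-default', // index or alias (hypothetical)
    conflicts: 'proceed',
    query: { ids: { values: ['alert-id-1', 'alert-id-2'] } },
    script: {
      lang: 'painless',
      source:
        "if (ctx._source['kibana.alert.workflow_tags'] == null) { ctx._source['kibana.alert.workflow_tags'] = new ArrayList(); } ctx._source['kibana.alert.workflow_tags'].addAll(params.tags);",
      params: { tags: ['alert_tag'] },
    },
  });

  // total/updated only count docs that matched the ids filter; an ID that matched
  // nothing simply lowers the total, while failures lists docs that matched but
  // could not be updated (e.g. version conflicts).
  console.log(response.total, response.updated, response.failures);
}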

Contributor

@dominiqueclarke dominiqueclarke left a comment

We discussed and will move forward with merging this PR.

@cnasikas cnasikas removed the request for review from a team November 22, 2025 12:03
@elasticmachine
Contributor

💚 Build Succeeded

Metrics

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
alerting 886 888 +2
ruleRegistry 210 212 +2
total +4

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
alerting 62 63 +1
ruleRegistry 8 10 +2
total +3
Unknown metric groups

API count

id before after diff
alerting 923 925 +2
ruleRegistry 248 250 +2
total +4

ESLint disabled line counts

id before after diff
@kbn/test-suites-xpack-platform 160 161 +1

Total ESLint disabled count

id before after diff
@kbn/test-suites-xpack-platform 170 171 +1

History

cc @cnasikas

@cnasikas cnasikas merged commit 67de5ef into elastic:main Nov 22, 2025
12 checks passed

Development

Successfully merging this pull request may close these issues.

[Alert Tagging] Enhance bulk update api to support bulk tagging
