Deduplicate findings in batches #13491
base: dev
Conversation
Conflicts have been resolved. A maintainer will review the pull request shortly.
This pull request identifies three related issues where lack of validation and direct use of admin-configurable parameters can lead to denial-of-service or resource exhaustion: an unvalidated DD_IMPORT_REIMPORT_DEDUPE_BATCH_SIZE in the importer that can create one Celery task per finding, an IMPORT_REIMPORT_DEDUPE_BATCH_SIZE used by the dedupe command that can cause excessive DB/worker load when set too low, and a clear_celery_queue management command that accepts arbitrary queue names and could be abused to purge critical queues. All three are administrative features but pose operational risks if misconfigured or executed by an attacker with management access.
Denial of Service via Misconfiguration in dojo/importers/default_importer.py
| Vulnerability | Denial of Service via Misconfiguration |
|---|---|
| Description | The DD_IMPORT_REIMPORT_DEDUPE_BATCH_SIZE setting, which controls the batch size for processing findings, lacks input validation. If an administrator configures this setting to a very low value (e.g., 1), importing a large report will cause the system to dispatch a separate Celery task for almost every single finding. This rapid creation and enqueuing of numerous small tasks can overwhelm the Celery message broker and workers, leading to resource exhaustion and a denial of service for all background processing. While the setting requires administrative access to modify, the absence of validation makes it susceptible to accidental misconfiguration with significant operational impact. |
django-DefectDojo/dojo/importers/default_importer.py
Lines 241 to 244 in 6954cba
if len(batch_finding_ids) >= batch_max_size or is_final_finding:
    finding_ids_batch = list(batch_finding_ids)
    batch_finding_ids.clear()
    if we_want_async(async_user=self.user):
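A minimal sketch of the kind of guard this finding implies, assuming the goal is simply to clamp the admin-configured batch size to a sane range before it is used to slice findings into Celery tasks; the bounds and helper name below are illustrative assumptions, not part of this PR:

```python
# Hypothetical guard for the batch-size setting; the clamp bounds are
# assumptions for illustration, not values taken from DefectDojo.
from django.conf import settings

MIN_DEDUPE_BATCH_SIZE = 100      # assumed floor: avoids one Celery task per finding
MAX_DEDUPE_BATCH_SIZE = 10_000   # assumed ceiling: bounds memory per task


def safe_dedupe_batch_size(default: int = 1000) -> int:
    configured = getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", default)
    try:
        configured = int(configured)
    except (TypeError, ValueError):
        return default  # fall back to the default on malformed values
    return max(MIN_DEDUPE_BATCH_SIZE, min(configured, MAX_DEDUPE_BATCH_SIZE))
```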
Potential for Unauthorized Queue Manipulation in dojo/management/commands/clear_celery_queue.py
| Vulnerability | Potential for Unauthorized Queue Manipulation |
|---|---|
| Description | The clear_celery_queue management command directly uses the --queue argument to purge Celery queues. While this is an administrative command, if an attacker gains the ability to execute Django management commands, they could specify an arbitrary queue name, including critical application queues like 'celery', 'dedupe', or 'default', leading to a denial of service by interrupting essential background tasks and potentially causing data loss or inconsistencies. |
django-DefectDojo/dojo/management/commands/clear_celery_queue.py
Lines 99 to 102 in 6954cba
purged_count = channel.queue_purge(queue=queue)
total_purged += purged_count
self.stdout.write(
    self.style.SUCCESS(f" ✓ Purged {purged_count} messages from queue: {queue}"),
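A hedged sketch of how the --queue argument could be restricted before purging; the queue names below are assumptions for illustration, not DefectDojo's actual Celery queue layout:

```python
# Hypothetical allowlist check run before channel.queue_purge(); queue names
# here are assumptions, not taken from DefectDojo's configuration.
from django.core.management.base import CommandError

PURGEABLE_QUEUES = {"dedupe"}             # assumed: queues that are safe to purge
PROTECTED_QUEUES = {"celery", "default"}  # assumed: critical application queues


def validate_queue_name(queue: str) -> str:
    if queue in PROTECTED_QUEUES:
        raise CommandError(f"Refusing to purge protected queue: {queue}")
    if queue not in PURGEABLE_QUEUES:
        raise CommandError(f"Unknown or unlisted queue: {queue}")
    return queue
```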
Potential Resource Exhaustion via Batch Size Configuration in dojo/management/commands/dedupe.py
| Vulnerability | Potential Resource Exhaustion via Batch Size Configuration |
|---|---|
| Description | The dedupe management command, when run in batch mode, uses the IMPORT_REIMPORT_DEDUPE_BATCH_SIZE setting to determine the size of processing batches. If this setting is configured to a very low value (e.g., 1), each finding will be processed in its own batch. This leads to a significant increase in database queries and/or Celery task submissions, potentially causing performance degradation and resource exhaustion on the database and/or Celery broker/workers. While the command is typically run by administrators, a misconfiguration or malicious setting of this value could severely impact system stability. |
django-DefectDojo/dojo/management/commands/dedupe.py
Lines 127 to 130 in 6954cba
batch_max_size = getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", 1000)
total_findings = findings_queryset.count()
logger.info(f"Processing {total_findings} findings in batches of max {batch_max_size} per test ({mode_str})")
All finding details can be found in the DryRun Security Dashboard.
mtesauro left a comment
Approved
Traditionally, Defect Dojo has deduplicated (new) findings one by one. This works well for small imports and has the benefit of an easy-to-understand codebase and test suite.
For larger imports, however, performance is poor and resource usage is (very) high. A 1000+ finding import can cause a Celery worker to spend minutes on deduplication.
This PR changes the deduplication process for import and reimport to work in batches. The biggest benefit is that there will now be 1 database query per batch (1000 findings), instead of 1 query per finding (1000 queries).
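As a rough illustration of where the single query per batch comes from, here is a conceptual sketch (not the PR's actual implementation) of a hash_code-based candidate lookup done once per batch; the helper name is hypothetical, while the field paths follow DefectDojo's models:

```python
# Conceptual sketch only: one hash_code__in query per batch of new findings,
# instead of one query per finding.
from collections import defaultdict

from dojo.models import Finding


def duplicate_candidates_for_batch(new_findings, product):
    hash_codes = {f.hash_code for f in new_findings if f.hash_code}
    # Single query covering the whole batch.
    existing = Finding.objects.filter(
        test__engagement__product=product,
        hash_code__in=hash_codes,
    ).exclude(id__in=[f.id for f in new_findings])

    by_hash = defaultdict(list)
    for candidate in existing:
        by_hash[candidate.hash_code].append(candidate)
    return {f.id: by_hash.get(f.hash_code, []) for f in new_findings}
```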
During the development of the PR I realized:
Although batching dedupe sounds like a simple PR, the caveat is that with one-by-one deduplication, the deduplication result of the first finding in a report can affect the deduplication result of the findings that follow it (if there are duplicates inside the same report). This should be a corner case and usually means the deduplication configuration needs some fine-tuning. Nevertheless, we wanted to make sure not to cause unexpected/different behavior here. The new tests should cover this.
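To make that corner case concrete, here is a hedged sketch of how intra-report duplicates could be resolved inside a batch while preserving the one-by-one outcome (the earlier finding becomes the original, the later one its duplicate); the helper and its arguments are illustrative, not the PR's API:

```python
# Conceptual sketch of the intra-report duplicate case: if two findings in the
# same batch share a hash_code and no original exists in the database yet, the
# earlier one becomes the original and the later one is marked as its
# duplicate, matching one-by-one behavior.
def resolve_intra_batch_duplicates(batch_findings, existing_by_hash):
    seen_in_batch = {}
    results = []  # list of (finding, original_or_None)
    for finding in batch_findings:  # keep report order
        if not finding.hash_code:
            results.append((finding, None))
            continue
        originals = existing_by_hash.get(finding.hash_code, [])
        if originals:
            results.append((finding, originals[0]))
        elif finding.hash_code in seen_in_batch:
            results.append((finding, seen_in_batch[finding.hash_code]))
        else:
            seen_in_batch[finding.hash_code] = finding
            results.append((finding, None))
    return results
```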
The PR splits the deduplication process into three parts:
One of the reasons for doing this is that we want to use the exact same matching logic for the reimport process. Currently that has an almost identical matching algorithm, but with minor unintentional differences. Once this PR has proven itself, we will adjust the reimport process. Next to the "reimport matching", the reimport process also deduplicates new findings; that part already uses the batchwise deduplication in this PR.

A quick test with the jfrog_xray_unified/very_many_vulns.json sample scan (10k findings) shows the obvious huge improvement in deduplication time. Please note that we're not only doing this for performance, but also to reduce the resources (cloud cost) needed to run Defect Dojo.

initial import (no duplicates):
second import into the same product (all duplicates):
initial import (no duplicates):
Imagine what this can do for reimport performance if we switch that to batch mode.