Deduplicate findings in batches #13491
base: dev
Conversation
Conflicts have been resolved. A maintainer will review the pull request shortly.
This pull request identifies three related issues where lack of validation and direct use of admin-configurable parameters can lead to denial-of-service or resource exhaustion: an unvalidated DD_IMPORT_REIMPORT_DEDUPE_BATCH_SIZE in the importer that can create one Celery task per finding, an IMPORT_REIMPORT_DEDUPE_BATCH_SIZE used by the dedupe command that can cause excessive DB/worker load when set too low, and a clear_celery_queue management command that accepts arbitrary queue names and could be abused to purge critical queues. All three are administrative features but pose operational risks if misconfigured or executed by an attacker with management access.
Denial of Service via Misconfiguration in dojo/importers/default_importer.py
| Vulnerability | Denial of Service via Misconfiguration |
|---|---|
| Description | The DD_IMPORT_REIMPORT_DEDUPE_BATCH_SIZE setting, which controls the batch size for processing findings, lacks input validation. If an administrator configures this setting to a very low value (e.g., 1), importing a large report will cause the system to dispatch a separate Celery task for almost every single finding. This rapid creation and enqueuing of numerous small tasks can overwhelm the Celery message broker and workers, leading to resource exhaustion and a denial of service for all background processing. While the setting requires administrative access to modify, the absence of validation makes it susceptible to accidental misconfiguration with significant operational impact. |
django-DefectDojo/dojo/importers/default_importer.py
Lines 241 to 244 in 6954cba
if len(batch_finding_ids) >= batch_max_size or is_final_finding:
    finding_ids_batch = list(batch_finding_ids)
    batch_finding_ids.clear()
    if we_want_async(async_user=self.user):
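A minimal sketch of the kind of guard this finding implies, assuming the goal is simply to clamp the admin-configured batch size to a sane range before it is used to slice findings into Celery tasks; the bounds and helper name below are illustrative assumptions, not part of this PR:

```python
# Hypothetical guard for the batch-size setting; the clamp bounds are
# assumptions for illustration, not values taken from DefectDojo.
from django.conf import settings

MIN_DEDUPE_BATCH_SIZE = 100      # assumed floor: avoids one Celery task per finding
MAX_DEDUPE_BATCH_SIZE = 10_000   # assumed ceiling: bounds memory per task


def safe_dedupe_batch_size(default: int = 1000) -> int:
    configured = getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", default)
    try:
        configured = int(configured)
    except (TypeError, ValueError):
        return default  # fall back to the default on malformed values
    return max(MIN_DEDUPE_BATCH_SIZE, min(configured, MAX_DEDUPE_BATCH_SIZE))
```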
Potential for Unauthorized Queue Manipulation in dojo/management/commands/clear_celery_queue.py
| Vulnerability | Potential for Unauthorized Queue Manipulation |
|---|---|
| Description | The clear_celery_queue management command directly uses the --queue argument to purge Celery queues. While this is an administrative command, if an attacker gains the ability to execute Django management commands, they could specify an arbitrary queue name, including critical application queues like 'celery', 'dedupe', or 'default', leading to a denial of service by interrupting essential background tasks and potentially causing data loss or inconsistencies. |
django-DefectDojo/dojo/management/commands/clear_celery_queue.py
Lines 99 to 102 in 6954cba
purged_count = channel.queue_purge(queue=queue)
total_purged += purged_count
self.stdout.write(
    self.style.SUCCESS(f" ✓ Purged {purged_count} messages from queue: {queue}"),
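A hedged sketch of how the --queue argument could be restricted before purging; the queue names below are assumptions for illustration, not DefectDojo's actual Celery queue layout:

```python
# Hypothetical allowlist check run before channel.queue_purge(); queue names
# here are assumptions, not taken from DefectDojo's configuration.
from django.core.management.base import CommandError

PURGEABLE_QUEUES = {"dedupe"}             # assumed: queues that are safe to purge
PROTECTED_QUEUES = {"celery", "default"}  # assumed: critical application queues


def validate_queue_name(queue: str) -> str:
    if queue in PROTECTED_QUEUES:
        raise CommandError(f"Refusing to purge protected queue: {queue}")
    if queue not in PURGEABLE_QUEUES:
        raise CommandError(f"Unknown or unlisted queue: {queue}")
    return queue
```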
Potential Resource Exhaustion via Batch Size Configuration in dojo/management/commands/dedupe.py
| Vulnerability | Potential Resource Exhaustion via Batch Size Configuration |
|---|---|
| Description | The dedupe management command, when run in batch mode, uses the IMPORT_REIMPORT_DEDUPE_BATCH_SIZE setting to determine the size of processing batches. If this setting is configured to a very low value (e.g., 1), each finding will be processed in its own batch. This leads to a significant increase in database queries and/or Celery task submissions, potentially causing performance degradation and resource exhaustion on the database and/or Celery broker/workers. While the command is typically run by administrators, a misconfiguration or malicious setting of this value could severely impact system stability. |
django-DefectDojo/dojo/management/commands/dedupe.py
Lines 127 to 130 in 6954cba
batch_max_size = getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", 1000)
total_findings = findings_queryset.count()
logger.info(f"Processing {total_findings} findings in batches of max {batch_max_size} per test ({mode_str})")
All finding details can be found in the DryRun Security Dashboard.
mtesauro left a comment
Approved
Traditionally, Defect Dojo has deduplicated (new) findings one by one. This works well for small imports and has the benefit of an easy-to-understand codebase and test suite.
For larger imports, however, performance is poor and resource usage is (very) high. A 1000+ finding import can cause a Celery worker to spend minutes on deduplication.
This PR changes the deduplication process for import and reimport to work in batches. The biggest benefit is that there will now be 1 database query per batch (1000 findings), instead of 1 query per finding (1000 queries).
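As a rough illustration of where the single query per batch comes from, here is a conceptual sketch (not the PR's actual implementation) of a hash_code-based candidate lookup done once per batch; the helper name is hypothetical, while the field paths follow DefectDojo's models:

```python
# Conceptual sketch only: one hash_code__in query per batch of new findings,
# instead of one query per finding.
from collections import defaultdict

from dojo.models import Finding


def duplicate_candidates_for_batch(new_findings, product):
    hash_codes = {f.hash_code for f in new_findings if f.hash_code}
    # Single query covering the whole batch.
    existing = Finding.objects.filter(
        test__engagement__product=product,
        hash_code__in=hash_codes,
    ).exclude(id__in=[f.id for f in new_findings])

    by_hash = defaultdict(list)
    for candidate in existing:
        by_hash[candidate.hash_code].append(candidate)
    return {f.id: by_hash.get(f.hash_code, []) for f in new_findings}
```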
During the development of the PR I realized:
Although batching dedupe sounds like a simple PR, the caveat is that with one-by-one deduplication, the deduplication result of the first finding in a report can affect the deduplication result of the findings that follow it (if there are duplicates inside the same report). This should be a corner case and usually means the deduplication configuration needs some fine-tuning. Nevertheless, we wanted to make sure not to cause unexpected/different behavior here. The new tests should cover this.
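To make that corner case concrete, here is a hedged sketch of how intra-report duplicates could be resolved inside a batch while preserving the one-by-one outcome (the earlier finding becomes the original, the later one its duplicate); the helper and its arguments are illustrative, not the PR's API:

```python
# Conceptual sketch of the intra-report duplicate case: if two findings in the
# same batch share a hash_code and no original exists in the database yet, the
# earlier one becomes the original and the later one is marked as its
# duplicate, matching one-by-one behavior.
def resolve_intra_batch_duplicates(batch_findings, existing_by_hash):
    seen_in_batch = {}
    results = []  # list of (finding, original_or_None)
    for finding in batch_findings:  # keep report order
        if not finding.hash_code:
            results.append((finding, None))
            continue
        originals = existing_by_hash.get(finding.hash_code, [])
        if originals:
            results.append((finding, originals[0]))
        elif finding.hash_code in seen_in_batch:
            results.append((finding, seen_in_batch[finding.hash_code]))
        else:
            seen_in_batch[finding.hash_code] = finding
            results.append((finding, None))
    return results
```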
The PR splits the deduplication process into three parts:
One of the reasons for doing this is that we want to use the exact same matching logic for the reimport process. Currently that has an almost identical matching algorithm, but with minor unintentional differences. Once this PR has proven itself, we will adjust the reimport process. Next to the "reimport matching", the reimport process also deduplicates new findings; that part already uses the batchwise deduplication in this PR.

A quick test with the jfrog_xray_unified/very_many_vulns.json sample scan (10k findings) shows the obvious huge improvement in deduplication time. Please note that we're not only doing this for performance, but also to reduce the resources (cloud cost) needed to run Defect Dojo.

initial import (no duplicates):
second import into the same product (all duplicates):
initial import (no duplicates):
Imagine what this can do for reimport performance if we switch that to batch mode.