benchmarking v0.6 #254

@bluegenes

Built gtdb-rs226 from scratch (732,475 genomes) to get a large microbial database benchmark.

With 30 simultaneous downloads, 30 cores, and a batch size of 100k:

  • 5.35G peak RAM (maximum resident set size)
  • 11 hours 14 minutes wall-clock time
  • CPU usage: 1069%

Failed Accessions:

  • 441 failed due to missing download links (could not get a fetch link from NCBI)
  • 0 download failures
  • 0 checksum (crc32) failures

I spot-checked 10 random failed accessions, and all 10 were suppressed.

v0.6.2

Command:

sourmash scripts gbsketch gtdb-rs226.gbsketch.csv -o gtdb-rs226.sig.zip -n 30 -r 15 -g -c 30 \
                          --batch-size 100_000 -p dna,k=21,k=31,k=51,scaled=1000 \
                          --write-urlsketch-csv --verbose 2> gtdb-rs226.build-genomic.log

I provided my NCBI API key via the $NCBI_API_KEY environment variable.

Tool output:

== This is sourmash version 4.8.14. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_directsketch 0.6.2
params: ['dna,k=21,k=31,k=51,scaled=1000']
Batch size is set, enabling --no-overwrite-fasta by default.
Downloading and sketching all accessions in 'gtdb-rs226.gbsketch.csv using 30 simultaneous downloads, 15 retries, and 30 threads.
No valid existing signature batches found; building all signatures.
No protein signature templates provided, and --keep-fasta is not set.
Downloading and sketching genomes only.
Successfully downloaded and parsed dehydrated zipfile. Now processing accessions.
Skipped 441 download(s) due to missing download URLs.
Wrote urlsketch csv with download links to 'gtdb-rs226.gbsketch.csv.urlsketch.csv'
finished batch 1: wrote to 'gtdb-rs226.1.sig.zip'
finished batch 2: wrote to 'gtdb-rs226.2.sig.zip'
finished batch 3: wrote to 'gtdb-rs226.3.sig.zip'
finished batch 4: wrote to 'gtdb-rs226.4.sig.zip'
finished batch 5: wrote to 'gtdb-rs226.5.sig.zip'
finished batch 6: wrote to 'gtdb-rs226.6.sig.zip'
finished batch 7: wrote to 'gtdb-rs226.7.sig.zip'
finished batch 8: wrote to 'gtdb-rs226.8.sig.zip'
Wrote list of all batches to 'gtdb-rs226.sig.zip.batches.txt'
...gbsketch is done!
Sigs in 'gtdb-rs226.1.sig.zip', etc
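
As a quick sanity check on the log above, the number of batch files can be compared against what the batch size implies. This is only a sketch: it assumes the batch size counts one genome per accession and that skipped accessions do not occupy a batch slot. The file name pattern is copied from the log lines, and the variable names are illustrative.

```python
import math

# Figures taken from the report above.
total_genomes = 732_475
skipped = 441            # accessions with no download URL
batch_size = 100_000     # --batch-size 100_000

# Assumption: each batch holds up to `batch_size` genomes, so the
# number of batches is the ceiling of processed genomes / batch size.
expected_batches = math.ceil((total_genomes - skipped) / batch_size)

# File names follow the pattern seen in the log ('gtdb-rs226.N.sig.zip').
expected_files = [
    f"gtdb-rs226.{i}.sig.zip" for i in range(1, expected_batches + 1)
]

print(f"expected batches: {expected_batches}")
print("first:", expected_files[0], "last:", expected_files[-1])
```

Under these assumptions the estimate agrees with the eight batches reported in the log.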

Output from /usr/bin/time -v:

        User time (seconds): 415849.58
        System time (seconds): 16567.65
        Percent of CPU this job got: 1069%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 11:13:56
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 5353936
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 18
        Minor (reclaiming a frame) page faults: 61622149
        Voluntary context switches: 652535474
        Involuntary context switches: 58011180
        Swaps: 0
        File system inputs: 84496
        File system outputs: 129107288
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
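
The summary figures at the top can be re-derived from this output. The sketch below recomputes the CPU percentage as (user + system) / elapsed and adds two derived numbers that are not in the original report: sketching throughput and the share of skipped accessions. It assumes every accession that was not skipped was sketched, which matches the reported 0 download and 0 checksum failures. The helper `parse_elapsed` and the variable names are illustrative.

```python
def parse_elapsed(text):
    """Convert an 'h:mm:ss' or 'm:ss' wall-clock string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Values copied from the /usr/bin/time -v output and the report above.
user_s = 415849.58
system_s = 16567.65
elapsed_s = parse_elapsed("11:13:56")
total_genomes = 732_475
skipped = 441

# CPU percentage: total CPU seconds divided by wall-clock seconds.
cpu_percent = 100 * (user_s + system_s) / elapsed_s

# Assumption: all non-skipped accessions were downloaded and sketched.
sketched = total_genomes - skipped
genomes_per_second = sketched / elapsed_s
skipped_percent = 100 * skipped / total_genomes

print(f"elapsed: {elapsed_s:.0f} s")
print(f"CPU usage: {cpu_percent:.0f}%")
print(f"throughput: {genomes_per_second:.1f} genomes/s")
print(f"skipped: {skipped_percent:.3f}% of accessions")
```

A CPU usage of roughly 1069% on 30 threads corresponds to about 10.7 cores busy on average, which would be consistent with a run that spends part of its time waiting on downloads.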
