Built gtdb-rs226 from scratch (732,475 genomes) as a benchmark for building a large microbial database.
With 30 simultaneous downloads, 30 cores, and a batch size of 100k:
- Peak RAM: 5.35G (maximum resident set size)
- Wall-clock time: 11 hours 14 minutes
- CPU usage: 1069%
Failed Accessions:
- 441 failed due to missing download links (could not get a fetch link from NCBI)
- 0 download failures
- 0 checksum (crc32) failures
I spot-checked 10 random failed accessions, and all were suppressed.
Version: sourmash_plugin_directsketch v0.6.2
Command:
sourmash scripts gbsketch gtdb-rs226.gbsketch.csv -o gtdb-rs226.sig.zip -n 30 -r 15 -g -c 30 \
--batch-size 100_000 -p dna,k=21,k=31,k=51,scaled=1000 \
--write-urlsketch-csv --verbose 2> gtdb-rs226.build-genomic.log
I provided my API key via the $NCBI_API_KEY environment variable.
Tool output:
== This is sourmash version 4.8.14. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_directsketch 0.6.2
params: ['dna,k=21,k=31,k=51,scaled=1000']
Batch size is set, enabling --no-overwrite-fasta by default.
Downloading and sketching all accessions in 'gtdb-rs226.gbsketch.csv using 30 simultaneous downloads, 15 retries, and 30 threads.
No valid existing signature batches found; building all signatures.
No protein signature templates provided, and --keep-fasta is not set.
Downloading and sketching genomes only.
Successfully downloaded and parsed dehydrated zipfile. Now processing accessions.
Skipped 441 download(s) due to missing download URLs.
Wrote urlsketch csv with download links to 'gtdb-rs226.gbsketch.csv.urlsketch.csv'
finished batch 1: wrote to 'gtdb-rs226.1.sig.zip'
finished batch 2: wrote to 'gtdb-rs226.2.sig.zip'
finished batch 3: wrote to 'gtdb-rs226.3.sig.zip'
finished batch 4: wrote to 'gtdb-rs226.4.sig.zip'
finished batch 5: wrote to 'gtdb-rs226.5.sig.zip'
finished batch 6: wrote to 'gtdb-rs226.6.sig.zip'
finished batch 7: wrote to 'gtdb-rs226.7.sig.zip'
finished batch 8: wrote to 'gtdb-rs226.8.sig.zip'
Wrote list of all batches to 'gtdb-rs226.sig.zip.batches.txt'
...gbsketch is done!
Sigs in 'gtdb-rs226.1.sig.zip', etc
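As a rough sanity check on the batch count in the log above, the arithmetic can be sketched as follows. This assumes each batch zip holds at most `--batch-size` genomes, which is my reading of the option rather than something the log states; the variable names are only illustrative.

```python
import math

n_accessions = 732_475   # genomes in the input CSV (from the description)
n_skipped = 441          # "Skipped 441 download(s)" in the log
batch_size = 100_000     # value passed to --batch-size

n_sketched = n_accessions - n_skipped

# Assumption: each batch zip holds at most `batch_size` genomes.
n_batches = math.ceil(n_sketched / batch_size)

print(f"genomes sketched: {n_sketched}")
print(f"expected batches: {n_batches}")
```

Under that assumption the expected number of batches agrees with the eight `gtdb-rs226.N.sig.zip` files reported.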
Output from /usr/bin/time -v:
User time (seconds): 415849.58
System time (seconds): 16567.65
Percent of CPU this job got: 1069%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:13:56
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 5353936
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 18
Minor (reclaiming a frame) page faults: 61622149
Voluntary context switches: 652535474
Involuntary context switches: 58011180
Swaps: 0
File system inputs: 84496
File system outputs: 129107288
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
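For reference, the summary numbers at the top can be rederived from the `/usr/bin/time -v` output. The sketch below copies the values from above; the throughput figure is a derived estimate that does not appear in the tool output. I believe the "kbytes" reported on Linux are 1024-byte units, so the peak RAM is shown both ways.

```python
# Values copied from the /usr/bin/time -v output above.
user_s = 415849.58
system_s = 16567.65
elapsed = "11:13:56"      # h:mm:ss
max_rss_kb = 5353936
n_genomes = 732_475       # from the issue description

h, m, s = (int(part) for part in elapsed.split(":"))
elapsed_s = h * 3600 + m * 60 + s

# CPU percentage is total CPU time divided by wall-clock time.
cpu_percent = 100 * (user_s + system_s) / elapsed_s

# Peak RAM: simple division by 1e6 (as quoted in the summary),
# and in GiB assuming the reported kbytes are 1024-byte units.
max_rss_quoted = max_rss_kb / 1e6
max_rss_gib = max_rss_kb / 1024**2

# Derived estimate: average genomes processed per second of wall-clock time.
genomes_per_s = n_genomes / elapsed_s

print(f"elapsed: {elapsed_s} s")
print(f"CPU: {cpu_percent:.0f}%")
print(f"peak RSS: {max_rss_quoted:.2f}G quoted, {max_rss_gib:.2f} GiB")
print(f"throughput: {genomes_per_s:.1f} genomes/s")
```

The CPU percentage works out to roughly 10.7 cores' worth of CPU time on average across the 30 cores requested.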