Description
If running gbsketch, you need to specify accessions with their version numbers, e.g. GCF_023076805.1 not GCF_023076805. We should be clear about this requirement in the docs.
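For illustration, a check like the following could flag unversioned accessions before any download is attempted. This is only a sketch: the regular expression assumes the usual `GCA_`/`GCF_` prefix followed by nine digits, and `split_accession` and `has_version` are hypothetical helper names, not part of gbsketch.

```python
import re

# Assumed pattern: GCA_/GCF_ prefix, nine digits, optional ".<version>" suffix.
ACCESSION_RE = re.compile(r"^(GC[AF]_\d{9})(?:\.(\d+))?$")


def split_accession(accession):
    """Return (base, version); version is an int, or None if it is missing."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    base, version = match.groups()
    return base, (int(version) if version is not None else None)


def has_version(accession):
    """True if the accession carries an explicit version number."""
    return split_accession(accession)[1] is not None
```

With this, `has_version("GCF_023076805.1")` is true and `has_version("GCF_023076805")` is false, so the docs requirement could also be surfaced as an early, explicit error message.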
We could potentially remove this requirement, but the NCBI API POST request we currently use to get the dehydrated file needs the version number to return results. In addition, suppressed NCBI accessions no longer have fetch links, so requesting one leads to a download failure. These failures could be circumvented by using the updated version number, if one exists.
There are certainly ways to get the updated version number, e.g. by checking the GenBank assembly_summary and historical assembly_summary files.
The workflow would then be:
- parse all accession numbers
- parse the assembly summary and/or historical files, updating accession numbers as needed
- use the updated accession numbers to get dehydrated files, then continue as normal
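The steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the gbsketch implementation: it assumes the assembly summary is tab-separated text with `#` comment lines and the assembly accession in the first column, and `latest_versions` and `update_accession` are hypothetical names.

```python
def latest_versions(summary_lines):
    """Map each accession base (e.g. GCF_023076805) to its highest listed version.

    Assumes tab-separated rows, '#' comment lines, accession in column 1.
    """
    latest = {}
    for line in summary_lines:
        if not line.strip() or line.startswith("#"):
            continue
        accession = line.split("\t", 1)[0].strip()
        base, _, version = accession.partition(".")
        if not version.isdigit():
            continue  # skip rows without a usable version
        latest[base] = max(latest.get(base, 0), int(version))
    return latest


def update_accession(accession, latest):
    """Return the accession with its current version, or unchanged if unknown."""
    base = accession.partition(".")[0]
    if base in latest:
        return f"{base}.{latest[base]}"
    return accession
```

Passing only the current assembly summary keeps updates pointed at versions that should still be fetchable. Accessions that are absent from that file are returned unchanged, so the caller can report them separately.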
If we added version checking, we would need to add a --no-update-assembly-version flag to prevent updating when it is not desired. As far as I can tell, updating versions automatically would fix the majority of the gbsketch download issues I'm seeing with v0.6+.
There are some accessions that are only suppressed (no updated version). If we check the historical assembly summary file, we could explicitly note this as the reason.
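As a rough sketch of how the failure reason could be reported, the function below checks the current summary first and then the historical one. It assumes both files are tab-separated with `#` comment lines, and that the `version_status` value sits in the 11th column; that column position should be verified against the file header. `failure_reason` and its helpers are hypothetical names.

```python
VERSION_STATUS_COLUMN = 10  # assumption: 0-based index of version_status


def read_rows(lines):
    """Yield tab-split fields for each non-comment, non-empty line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        yield line.rstrip("\n").split("\t")


def version_of(accession):
    """Numeric version of an accession, or 0 if it has none."""
    version = accession.partition(".")[2]
    return int(version) if version.isdigit() else 0


def failure_reason(accession, current_lines, historical_lines):
    """Return None if the accession is current, else a short explanation."""
    base = accession.partition(".")[0]
    current = [row[0] for row in read_rows(current_lines)
               if row[0].partition(".")[0] == base]
    if accession in current:
        return None
    if current:
        newest = max(current, key=version_of)
        return f"{accession} has been replaced by {newest}"
    for row in read_rows(historical_lines):
        if row[0] == accession and len(row) > VERSION_STATUS_COLUMN:
            if row[VERSION_STATUS_COLUMN] == "suppressed":
                return f"{accession} is suppressed and has no updated version"
    return f"{accession} was not found in either summary file"
```

The returned string could be written to the download failure log so that suppressed-only accessions are distinguishable from ones that simply need a version bump.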