-
Notifications
You must be signed in to change notification settings - Fork 2
Open
Description
Bug report by @irosenboom
It was reported for 2x150bp Novaseq reads that bwa mem performed alignments with extensive softclipping.
For example, some alignments were of 18bp with 82 bp simply softclipped off.
This led to serious problems with the species detected. We never noticed this before with single ended 75bp reads (confirmed here by test with the same datasets).
For these datasets, minimap2short led to much better alignments (provided the supplementary and secondary alignments were removed).
Solutions
- use minimap2short for aligns (not well tested to date, esp with paired end reads!)
- raise the soft clipping penalty from 5 to x for
bwa mem@irosenboom can you please try this and see if your results improve ? I don't have comparable datasets - filter BAM file for long alignments, at least x = 70 bp ?
- Test.
BWA mem options to test:
-M-P-Lclipping penalty # was 5, try 20 ? This is the most likely option to solve the issue
From https://www.mankier.com/1/bwa
ALGORITHM OPTIONS:
-t INT
Number of threads [1]
-k INT
Minimum seed length. Matches shorter than INT will be missed. The alignment speed is usually insensitive to this value unless it significantly deviates from 20. [19]
-w INT
Band width. Essentially, gaps longer than INT will not be found. Note that the maximum gap length is also affected by the scoring matrix and the hit length, not solely determined by this option. [100]
-d INT
Off-diagonal X-dropoff (Z-dropoff). Stop extension when the difference between the best and the current extension score is above |i-j|*A+INT, where i and j are the current positions of the query and reference, respectively, and A is the matching score. Z-dropoff is similar to BLAST's X-dropoff except that it doesn't penalize gaps in one of the sequences in the alignment. Z-dropoff not only avoids unnecessary extension, but also reduces poor alignments inside a long good alignment. [100]
-r FLOAT
Trigger re-seeding for a MEM longer than minSeedLen*FLOAT. This is a key heuristic parameter for tuning the performance. Larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy. [1.5]
-c INT
Discard a MEM if it has more than INT occurence in the genome. This is an insensitive parameter. [500]
-D FLOAT
Drop chains shorter than FLOAT fraction of the longest overlapping chain [0.5]
-m INT
Perform at most INT rounds of mate-SW [50]
-W INT
Drop a chain if the number of bases in seeds is smaller than INT. This option is primarily used for longer contigs/reads. When positive, it also affects seed filtering. [0]
-P
In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.
SCORING OPTIONS:
-A INT
Matching score. [1]
-B INT
Mismatch penalty. The sequence error rate is approximately: {.75 * exp[-log(4) * B/A]}. [4]
-O INT[,INT]
Gap open penalty. If two numbers are specified, the first is the penalty of openning a deletion and the second for openning an insertion. [6]
-E INT[,INT]
Gap extension penalty. If two numbers are specified, the first is the penalty of extending a deletion and second for extending an insertion. A gap of length k costs O + k*E (i.e. [-O](https://www.mankier.com/1/bwa#-O) is for opening a zero-length gap). [1]
-L INT[,INT]
Clipping penalty. When performing SW extension, BWA-MEM keeps track of the best score reaching the end of query. If this score is larger than the best SW score minus the clipping penalty, clipping will not be applied. Note that in this case, the SAM AS tag reports the best SW score; clipping penalty is not deduced. If two numbers are provided, the first is for 5'-end clipping and second for 3'-end clipping. [5]
-U INT
Penalty for an unpaired read pair. BWA-MEM scores an unpaired read pair as scoreRead1+scoreRead2-INT and scores a paired as scoreRead1+scoreRead2-insertPenalty. It compares these two scores to determine whether we should force pairing. A larger value leads to more aggressive read pair. [17]
-x STR
Read type. Changes multiple parameters unless overriden [null]
pacbio:
-k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)
ont2d:
-k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)
intractg:
-B9 -O16 -L5 (intra-species contigs to ref)
INPUT/OUTPUT OPTIONS:
-p
Smart pairing. If two adjacent reads have the same name, they are considered to form a read pair. This way, paired-end and single-end reads can be mixed in a single FASTA/Q stream.
-R STR
Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]
-H ARG
If ARG starts with @, it is interpreted as a string and gets inserted into the output SAM header; otherwise, ARG is interpreted as a file with all lines starting with @ in the file inserted into the SAM header. [null]
-o FILE
Write the output SAM file to FILE. For compatibility with other BWA commands, this option may also be given as -f FILE. [standard ouptut]
-q
Don't reduce the mapping quality of split alignment of lower alignment score.
-5
For split alignment, mark the segment with the smallest coordinate as the primary. It automatically applies option [-q](https://www.mankier.com/1/bwa#-q) as well. This option may help some Hi-C pipelines. By default, BWA-MEM marks highest scoring segment as primary.
-K INT
Process INT input bases in each batch regardless of the number of threads in use [10000000*nThreads]. By default, the batch size is proportional to the number of threads in use. Because the inferred insert size distribution slightly depends on the batch size, using different number of threads may produce different output. Specifying this option helps reproducibility.
-T INT
Don't output alignment with score lower than INT. This option affects output and occasionally SAM flag 2. [30]
-j
Treat ALT contigs as part of the primary assembly (i.e. ignore the db.prefix.alt file).
-h INT[,INT2]
If a query has not more than INT hits with score higher than 80% of the best hit, output them all in the XA tag. If INT2 is specified, BWA-MEM outputs up to INT2 hits if the list contains a hit to an ALT contig. [5,200]
-a
Output all found alignments for single-end or unpaired paired-end reads. These alignments will be flagged as secondary alignments.
-C
Append FASTA/Q comment to SAM output. This option can be used to transfer read meta information (e.g. barcode) to the SAM output. Note that the FASTA/Q comment (the string after a space in the header line) must conform the SAM spec (e.g. BC:Z:CGTAC). Malformated comments lead to incorrect SAM output.
-Y
Use soft clipping CIGAR operation for supplementary alignments. By default, BWA-MEM uses soft clipping for the primary alignment and hard clipping for supplementary alignments.
-M
Mark shorter split hits as secondary (for Picard compatibility).
Metadata
Metadata
Assignees
Labels
No labels