Skip to content

PE reads - bwa mem alignment and softclipping issues #88

@colindaven

Description

@colindaven

Bug report by @irosenboom

It was reported for 2x150bp Novaseq reads that bwa mem performed alignments with extensive softclipping.

For example, some alignments were of 18bp with 82 bp simply softclipped off.

This led to serious problems with the species detected. We never noticed this before with single ended 75bp reads (confirmed here by test with the same datasets).

For these datasets, minimap2short led to much better alignments (provided the supplementary and secondary alignments were removed).

Solutions

  • use minimap2short for aligns (not well tested to date, esp with paired end reads!)
  • raise the soft clipping penalty from 5 to x for bwa mem @irosenboom can you please try this and see if your results improve ? I don't have comparable datasets
  • filter BAM file for long alignments, at least x = 70 bp ?
  • Test.

BWA mem options to test:

  • -M
  • -P
  • -L clipping penalty # was 5, try 20 ? This is the most likely option to solve the issue

From https://www.mankier.com/1/bwa



ALGORITHM OPTIONS:
-t INT

    Number of threads [1]
-k INT

    Minimum seed length. Matches shorter than INT will be missed. The alignment speed is usually insensitive to this value unless it significantly deviates from 20. [19]
-w INT

    Band width. Essentially, gaps longer than INT will not be found. Note that the maximum gap length is also affected by the scoring matrix and the hit length, not solely determined by this option. [100]
-d INT

    Off-diagonal X-dropoff (Z-dropoff). Stop extension when the difference between the best and the current extension score is above |i-j|*A+INT, where i and j are the current positions of the query and reference, respectively, and A is the matching score. Z-dropoff is similar to BLAST's X-dropoff except that it doesn't penalize gaps in one of the sequences in the alignment. Z-dropoff not only avoids unnecessary extension, but also reduces poor alignments inside a long good alignment. [100]
-r FLOAT

    Trigger re-seeding for a MEM longer than minSeedLen*FLOAT. This is a key heuristic parameter for tuning the performance. Larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy. [1.5]
-c INT

    Discard a MEM if it has more than INT occurence in the genome. This is an insensitive parameter. [500]
-D FLOAT

    Drop chains shorter than FLOAT fraction of the longest overlapping chain [0.5]
-m INT

    Perform at most INT rounds of mate-SW [50]
-W INT

    Drop a chain if the number of bases in seeds is smaller than INT. This option is primarily used for longer contigs/reads. When positive, it also affects seed filtering. [0]
-P

    In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.
SCORING OPTIONS:
-A INT

    Matching score. [1]
-B INT

    Mismatch penalty. The sequence error rate is approximately: {.75 * exp[-log(4) * B/A]}. [4]
-O INT[,INT]

    Gap open penalty. If two numbers are specified, the first is the penalty of openning a deletion and the second for openning an insertion. [6]
-E INT[,INT]

    Gap extension penalty. If two numbers are specified, the first is the penalty of extending a deletion and second for extending an insertion. A gap of length k costs O + k*E (i.e. [-O](https://www.mankier.com/1/bwa#-O) is for opening a zero-length gap). [1]
-L INT[,INT]

    Clipping penalty. When performing SW extension, BWA-MEM keeps track of the best score reaching the end of query. If this score is larger than the best SW score minus the clipping penalty, clipping will not be applied. Note that in this case, the SAM AS tag reports the best SW score; clipping penalty is not deduced. If two numbers are provided, the first is for 5'-end clipping and second for 3'-end clipping. [5]
-U INT

    Penalty for an unpaired read pair. BWA-MEM scores an unpaired read pair as scoreRead1+scoreRead2-INT and scores a paired as scoreRead1+scoreRead2-insertPenalty. It compares these two scores to determine whether we should force pairing. A larger value leads to more aggressive read pair. [17]
-x STR

    Read type. Changes multiple parameters unless overriden [null]

    pacbio:

        -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)
    ont2d:

        -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)
    intractg:

        -B9 -O16 -L5 (intra-species contigs to ref)

INPUT/OUTPUT OPTIONS:
-p

    Smart pairing. If two adjacent reads have the same name, they are considered to form a read pair. This way, paired-end and single-end reads can be mixed in a single FASTA/Q stream.
-R STR

    Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]
-H ARG

    If ARG starts with @, it is interpreted as a string and gets inserted into the output SAM header; otherwise, ARG is interpreted as a file with all lines starting with @ in the file inserted into the SAM header. [null]
-o FILE

    Write the output SAM file to FILE. For compatibility with other BWA commands, this option may also be given as -f FILE. [standard ouptut]
-q

    Don't reduce the mapping quality of split alignment of lower alignment score.
-5

    For split alignment, mark the segment with the smallest coordinate as the primary. It automatically applies option [-q](https://www.mankier.com/1/bwa#-q) as well. This option may help some Hi-C pipelines. By default, BWA-MEM marks highest scoring segment as primary.
-K  INT

    Process INT input bases in each batch regardless of the number of threads in use [10000000*nThreads]. By default, the batch size is proportional to the number of threads in use. Because the inferred insert size distribution slightly depends on the batch size, using different number of threads may produce different output. Specifying this option helps reproducibility.
-T INT

    Don't output alignment with score lower than INT. This option affects output and occasionally SAM flag 2. [30]
-j

    Treat ALT contigs as part of the primary assembly (i.e. ignore the db.prefix.alt file).
-h INT[,INT2]

    If a query has not more than INT hits with score higher than 80% of the best hit, output them all in the XA tag. If INT2 is specified, BWA-MEM outputs up to INT2 hits if the list contains a hit to an ALT contig. [5,200]
-a

    Output all found alignments for single-end or unpaired paired-end reads. These alignments will be flagged as secondary alignments.
-C

    Append FASTA/Q comment to SAM output. This option can be used to transfer read meta information (e.g. barcode) to the SAM output. Note that the FASTA/Q comment (the string after a space in the header line) must conform the SAM spec (e.g. BC:Z:CGTAC). Malformated comments lead to incorrect SAM output.
-Y

    Use soft clipping CIGAR operation for supplementary alignments. By default, BWA-MEM uses soft clipping for the primary alignment and hard clipping for supplementary alignments.
-M

    Mark shorter split hits as secondary (for Picard compatibility).


Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions