I am summarizing my experience with digital normalization in reference-based RNA-seq analysis, along with possible future directions. I went through three use cases:
- Diginorm of a group of samples from one experiment (k=20, C=20x), followed by using all the output reads as a single sample to build one transcriptome with Cufflinks (a rough sketch of this pipeline follows the list below). I compared the results to the output of the classical Cufflinks pipeline, where we assemble a transcriptome for each sample and then merge them all into one final transcriptome with cuffmerge. I have two main observations:
  - Good thing: with diginorm we catch transcripts that we miss with the classical Cufflinks pipeline. Apparently these are low-abundance transcripts that digital normalization enriches for. In theory we could obtain the same result by merging all input reads and running Cufflinks on them as a single sample, but this becomes computationally unfeasible once you reach a couple of GBs of sequencing. Most probably this is not important for differential expression analysis, but I would assume it will be extremely important for non-coding RNA analysis (I will explain later).
  - Bad thing: we miss a lot of exon-exon junctions of highly covered genes, because diginorm discards those important junction-spanning reads while enriching for reads from primary transcripts.
- Diginorm of the samples belonging to each experiment (as in use case 1) but at 10x coverage only, then pooling the normalized reads from all experiments (8 experiments in total). This is a scaling up of the first use case, and this kind of integrative analysis is almost impossible without an idea like diginorm. You can view the resultant assembly by following this link. Yes, we miss genes found by other assemblies, but we find new genes/isoforms and extend the UTRs of several gene models. Simply put, it is a new way of visualizing the data. Of course we need to add specific filters to remove predicted primary transcripts (I am already working on something), but I think this can turn into something really useful.
- Diginorm of single deeply sequenced samples at k=20 and C=200x. YES, 200x, I mean it. The output is AMAZING. With classical Cufflinks, a 20 GB sample failed to finish (several times) after 7 days on 32 processors. After 200x normalization of the mapped reads (~16 GB), which retained ~40% of the reads, Cufflinks took 13 minutes to finish the job (see the second sketch after this list). The assembly is very comparable to those of the other samples in the same experiment, which were made with the typical pipeline. This needs to be repeated with samples that I know can finish Cufflinks, so I can compare the effect of diginorm on the same sample at different coverage cutoffs. You can easily see how this would cut the analysis time of RNA-seq assemblies, especially with the deep-coverage experiments expected to come. Two technical points should be highlighted here:
  - With the current implementation of diginorm, we require huge RAM for these deeply sequenced samples, adding a needless computational limitation, while reference-based experiments "usually" do not require that much RAM. As I discussed before with @ctb, a more RAM-conservative data structure would enhance the applicability of diginorm in many situations.
  - Diginorm favors retention of unmappable reads, so it is much better to run diginorm on the mapped reads only, which means adding extra steps: extracting the mapped reads into FASTQ files and then re-mapping after diginorm. We need diginorm to read BAM (or SAM) files to save these two steps. I remember discussions about making a reference-based diginorm, which could be another option, but one more reason to encourage developing a BAM parser is that it would allow selective retention of exon-junction reads (which can be recognized by the CIGAR string in the BAM file; a rough sketch is included after this list). This would solve the problem seen in use cases 1 and 2 and might allow diginorm to perform otherwise completely unfeasible integrative analyses.
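
To make the first two use cases concrete, here is a minimal sketch of the pooled pipeline, written as a thin Python wrapper around the command-line tools. It assumes khmer's `normalize-by-median.py`, TopHat and Cufflinks are on the PATH; the sample file names, the `genome_index` base and the table sizes are placeholders, not the actual settings from the runs described above.

```python
#!/usr/bin/env python
"""Pool-normalize several samples, then build a single Cufflinks assembly.

Minimal sketch for use cases 1 and 2; the tool flags are standard, but all
file names, "genome_index" and the table sizes are placeholders.
"""
import shutil
import subprocess

SAMPLES = ["sample1.fq", "sample2.fq", "sample3.fq"]  # one experiment

def run(cmd):
    print(" ".join(cmd))
    subprocess.check_call(cmd)

# 1. Digital normalization of all samples together (k=20, C=20 for use case 1).
#    For use case 2, normalize each experiment separately with -C 10 and then
#    concatenate the .keep files from all eight experiments.
run(["normalize-by-median.py", "-k", "20", "-C", "20", "-N", "4", "-x", "1e9"]
    + SAMPLES)
keep_files = [s + ".keep" for s in SAMPLES]  # diginorm's default output names

# 2. Pool the normalized reads into one pseudo-sample.
with open("pooled.keep.fq", "wb") as out:
    for keep in keep_files:
        with open(keep, "rb") as fh:
            shutil.copyfileobj(fh, out)

# 3. Map the pooled reads once and assemble once with Cufflinks, instead of
#    one Cufflinks run per sample followed by cuffmerge.
run(["tophat", "-p", "8", "-o", "tophat_pooled", "genome_index", "pooled.keep.fq"])
run(["cufflinks", "-p", "8", "-o", "cufflinks_pooled",
     "tophat_pooled/accepted_hits.bam"])
```

For comparison, the classical pipeline would run TopHat and Cufflinks once per sample and then merge the resulting GTFs with cuffmerge.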
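For the third use case, here is a similar sketch of the mapped-reads workflow: extract the mapped reads, normalize at C=200, re-map and assemble. The samtools/TopHat/Cufflinks invocations and all file names are illustrative assumptions (modern samtools subcommands), not the exact commands used for the 20 GB sample.

```python
#!/usr/bin/env python
"""Normalize only the mapped reads of one deep sample at C=200, then remap.

Minimal sketch for use case 3; "deep_sample.bam", "genome_index" and the
thread/table sizes are placeholders.
"""
import subprocess

def run(cmd, stdout=None):
    print(" ".join(cmd))
    subprocess.check_call(cmd, stdout=stdout)

# 1. Keep only mapped reads (-F 4 drops unmapped records), since diginorm
#    otherwise preferentially retains the unmappable ones.
with open("mapped.bam", "wb") as out:
    run(["samtools", "view", "-b", "-F", "4", "deep_sample.bam"], stdout=out)

# 2. Dump the mapped reads back to FASTQ -- the extra step complained about
#    above; a SAM/BAM-aware diginorm would remove steps 1, 2 and 4 entirely.
with open("mapped.fq", "wb") as out:
    run(["samtools", "fastq", "mapped.bam"], stdout=out)

# 3. Digital normalization at k=20, C=200; writes mapped.fq.keep by default.
run(["normalize-by-median.py", "-k", "20", "-C", "200", "-N", "4", "-x", "4e9",
     "mapped.fq"])

# 4. Re-map the surviving reads (~40% in the run described above) and assemble.
run(["tophat", "-p", "8", "-o", "tophat_norm", "genome_index", "mapped.fq.keep"])
run(["cufflinks", "-p", "8", "-o", "cufflinks_norm",
     "tophat_norm/accepted_hits.bam"])
```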
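Finally, a rough sketch of what a BAM-aware, junction-preserving diginorm could look like. This is not an existing khmer feature; it assumes pysam for BAM parsing and khmer's `Countgraph`/`get_median_count` Python API for the coverage estimate, and the file names and table size are placeholders. It keeps every read whose CIGAR contains an `N` (spliced) operation and applies the usual median-coverage cutoff to everything else.

```python
#!/usr/bin/env python
"""Sketch of a BAM-aware diginorm that always keeps splice-junction reads."""
import khmer
import pysam

K, CUTOFF = 20, 20
BAM_CREF_SKIP = 3  # CIGAR 'N': skipped reference region, i.e. a splice junction

graph = khmer.Countgraph(K, int(1e8), 4)  # k-mer counting table (size is a guess)
inbam = pysam.AlignmentFile("accepted_hits.bam", "rb")
outbam = pysam.AlignmentFile("normalized.bam", "wb", template=inbam)

for read in inbam:
    seq = read.query_sequence
    if read.is_unmapped or seq is None or len(seq) < K or "N" in seq:
        continue  # a real tool would handle these cases properly

    spliced = any(op == BAM_CREF_SKIP for op, _ in read.cigartuples)
    median, _, _ = graph.get_median_count(seq)

    # Keep every junction-spanning read; keep other reads only while their
    # estimated k-mer coverage is still below the cutoff (the diginorm rule).
    if spliced or median < CUTOFF:
        graph.consume(seq)
        outbam.write(read)

outbam.close()
inbam.close()
```

The point of the sketch is only that the CIGAR string gives the splice information for free once diginorm can read alignments, so the junction-loss problem from use cases 1 and 2 could be avoided without a separate filtering step.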