Update binning tutorial #6409
Open
vinisalazar wants to merge 7 commits into galaxyproject:main from vinisalazar:binning-tutorial
+56
−35
Changes from 5 commits
Commits
- 703df11 Update CONTRIBUTORS.md (vinisalazar)
- 7f4c380 tutorials/metagenomic-binning: small improvements (vinisalazar)
- 4b9afb2 Add vamb to binning tools (vinisalazar)
- 9594867 Fix references to MetaBAT2 (vinisalazar)
- 3c4c3db Add requirements for metagenomics binning tutorial (vinisalazar)
- 5806cfc Update topics/microbiome/tutorials/metagenomics-binning/tutorial.md (paulzierep)
- ab11f56 Merge branch 'main' into binning-tutorial (paulzierep)
File renamed without changes
11 changes: 11 additions & 0 deletions
topics/microbiome/tutorials/metagenomics-binning/tutorial.bib
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -4,17 +4,16 @@ title: Binning of metagenomic sequencing data | |||||
| zenodo_link: https://zenodo.org/record/7818827 | ||||||
| extra: | ||||||
| zenodo_link_results: https://zenodo.org/record/7845138 | ||||||
| level: Introductory | ||||||
| level: Intermediate | ||||||
| questions: | ||||||
| - What does metagenomic binning refer to? | ||||||
| - Which tools should be used for metagenomic binning? | ||||||
| - How to assess the quality of metagenomic data binning? | ||||||
| - Which tools may be used for metagenomic binning? | ||||||
| - How to assess the quality of metagenomic binning? | ||||||
| objectives: | ||||||
| - Describe what metagenomics binning is | ||||||
| - Describe common problems in metagenomics binning | ||||||
| - What software tools are available for metagenomics binning | ||||||
| - Binning of contigs into metagenome-assembled genomes (MAGs) using MetaBAT 2 software | ||||||
| - Evaluation of MAG quality and completeness using CheckM software | ||||||
| - Describe what metagenomics binning is. | ||||||
| - Describe common challenges in metagenomics binning. | ||||||
| - Perform metagenomic binning using MetaBAT 2 software. | ||||||
| - Evaluate MAG quality and completeness using CheckM software. | ||||||
| time_estimation: 2H | ||||||
| key_points: | ||||||
| - Metagenomics binning is a computational approach to grouping together DNA sequences | ||||||
|
|
@@ -32,6 +31,11 @@ contributions: | |||||
| authorship: | ||||||
| - npechl | ||||||
| - fpsom | ||||||
paulzierep marked this conversation as resolved.
|
||||||
| requirements: | ||||||
| - type: internal | ||||||
| topic: metagenomics | ||||||
| tutorials: | ||||||
| - metagenomics-assembly | ||||||
| subtopic: metagenomics | ||||||
| tags: | ||||||
| - binning | ||||||
|
|
@@ -56,11 +60,14 @@ recordings: | |||||
|
|
||||||
| --- | ||||||
|
|
||||||
|
|
||||||
| Metagenomics is the study of genetic material recovered directly from environmental samples, such as soil, water, or gut contents, without the need for isolation or cultivation of individual organisms. Metagenomics binning is a process used to classify DNA sequences obtained from metagenomic sequencing into discrete groups, or bins, based on their similarity to each other. | ||||||
|
|
||||||
| The goal of metagenomics binning is to assign the DNA sequences to the organisms or taxonomic groups that they originate from, allowing for a better understanding of the diversity and functions of the microbial communities present in the sample. This is typically achieved through computational methods that include sequence similarity, composition, and other features to group the sequences into bins. | ||||||
|
|
||||||
| > <comment-title></comment-title> | ||||||
| > Before starting this tutorial, we recommend completing the [**Metagenomics Assembly Tutorial**]({% link topics/microbiome/tutorials/metagenomics-assembly/tutorial.md %}). | ||||||
vinisalazar marked this conversation as resolved.
|
||||||
| {: .comment} | ||||||
|
|
||||||
| There are several approaches to metagenomics binning, including: | ||||||
|
|
||||||
| - **Sequence composition-based binning**: This method is based on the observation that different genomes have distinct sequence composition patterns, such as GC content or codon usage bias. By analyzing these patterns in metagenomic data, sequence fragments can be assigned to individual genomes or groups of genomes. | ||||||
|
|
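The composition signal described in the first approach above can be illustrated with a small sketch. It computes GC content and tetranucleotide frequencies (the feature MetaBAT is described as using) for a single contig. This is a simplified illustration and not how any particular binner is implemented: the function names are hypothetical, and real tools usually also merge reverse-complementary k-mers and combine composition with coverage.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def tetranucleotide_frequencies(seq: str) -> dict[str, float]:
    """Relative frequency of each of the 256 possible 4-mers over A, C, G, T.

    Windows containing other characters (for example N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # Slide a window of length 4 over the sequence and count every 4-mer.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}
```

Contigs from the same genome tend to produce similar vectors, so a binner can cluster contigs whose profiles lie close together.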
@@ -76,7 +83,7 @@ There are several approaches to metagenomics binning, including: | |||||
| Each of these methods has its strengths and limitations, and the choice of binning method depends on the specific characteristics of the metagenomic data set and the research question being addressed. | ||||||
|
|
||||||
|
|
||||||
| **Metagenomics binning is a complex process that involves many steps and can be challenging due to several problems that can occur during the process**. Some of the most common problems encountered in metagenomics binning include: | ||||||
| **Metagenomic binning is a complex process that involves many steps and can be challenging due to several problems that can occur during the process**. Some of the most common problems encountered in metagenomic binning include: | ||||||
|
|
||||||
| - **High complexity**: Metagenomic samples contain DNA from multiple organisms, which can lead to high complexity in the data. | ||||||
| - **Fragmented sequences**: Metagenomic sequencing often generates fragmented sequences, which can make it difficult to assign reads to the correct bin. | ||||||
|
|
@@ -86,20 +93,23 @@ Each of these methods has its strengths and limitations, and the choice of binni | |||||
| - **Chimeric sequences**: Sequences that are the result of sequencing errors or contamination can lead to chimeric sequences, which can make it difficult to accurately bin reads. | ||||||
| - **Strain variation**: Organisms within a species can exhibit significant genetic variation, which can make it difficult to distinguish between different strains in a metagenomic sample. | ||||||
|
|
||||||
| There are plenty of computational tools to perform metafenomics binning. Some of the most widely used include: | ||||||
| There are plenty of algorithms that perform metagenomic binning. Some of the most widely used include: | ||||||
|
|
||||||
| - **MaxBin** ({%cite maxbin2015%}): A popular de novo binning algorithm that uses a combination of sequence features and marker genes to cluster contigs into genome bins. | ||||||
| - **MetaBAT** ({%cite Kang2019%}): Another widely used de novo binning algorithm that employs a hierarchical clustering approach based on tetranucleotide frequency and coverage information. | ||||||
| - **CONCOCT** ({%cite Alneberg2014%}): A de novo binning tool that uses a clustering algorithm based on sequence composition and coverage information to group contigs into genome bins. | ||||||
| - **MyCC** ({%cite Lin2016%}): A reference-based binning tool that uses sequence alignment to identify contigs belonging to the same genome or taxonomic group. | ||||||
| - **GroopM** ({%cite Imelfort2014%}): A hybrid binning tool that combines reference-based and de novo approaches to achieve high binning accuracy. | ||||||
| - **MetaWRAP** ({%cite Uritskiy2018%}): A comprehensive metagenomic analysis pipeline that includes various modules for quality control, assembly, binning, and annotation. | ||||||
| - **Anvi'o** ({%cite Eren2015%}): A platform for visualizing and analyzing metagenomic data, including features for binning, annotation, and comparative genomics. | ||||||
| - **SemiBin** ({%cite Pan2022%}): A command-line tool for metagenomic binning with deep learning that handles both short and long reads. | ||||||
| - **Vamb** ({%cite nissen2021improved%}): An algorithm that uses variational autoencoders (VAEs) to encode sequence composition and coverage information. | ||||||
|
|
||||||
| Other tools also include: | ||||||
| - **MetaWRAP** ({%cite Uritskiy2018%}): A comprehensive metagenomic analysis pipeline that includes various modules for quality control, assembly, binning, and annotation. | ||||||
| - **Anvi'o** ({%cite Eren2015%}): A platform for visualizing and analyzing metagenomic data, including features for binning, annotation, and comparative genomics. Uses CONCOCT as the default binning backend. | ||||||
|
|
||||||
| A benchmark study of metagenomics software can be found at {%cite Sczyrba2017%}. MetaBAT 2 outperforms previous MetaBAT and other alternatives in both accuracy and computational efficiency. All comparisons used default parameters ({%cite Sczyrba2017%}). | ||||||
|
|
||||||
| **In this tutorial, we will learn how to run metagenomic binning tools and evaluate the quality of the results**. In order to do that, we will use data from the study: [Temporal shotgun metagenomic dissection of the coffee fermentation ecosystem](https://www.ebi.ac.uk/metagenomics/studies/MGYS00005630#overview) and MetaBAT 2 algorithm. MetaBAT is a popular software tool for metagenomics binning, and there are several reasons why it is often used: | ||||||
| **In this tutorial, we will learn how to run metagenomic binning tools and evaluate the quality of the results**. In order to do that, we will use data from the study: [Temporal shotgun metagenomic dissection of the coffee fermentation ecosystem](https://www.ebi.ac.uk/metagenomics/studies/MGYS00005630#overview) and the MetaBAT 2 algorithm. MetaBAT is a popular software tool for metagenomics binning, and there are several reasons why it is often used: | ||||||
| - *High accuracy*: MetaBAT uses a combination of tetranucleotide frequency, coverage depth, and read linkage information to bin contigs, which has been shown to be highly accurate and efficient. | ||||||
| - *Easy to use*: MetaBAT has a user-friendly interface and can be run on a standard desktop computer, making it accessible to a wide range of researchers with varying levels of computational expertise. | ||||||
| - *Flexibility*: MetaBAT can be used with a variety of sequencing technologies, including Illumina, PacBio, and Nanopore, and can be applied to both microbial and viral metagenomes. | ||||||
|
|
@@ -186,7 +196,7 @@ As explained before, there are many challenges to metagenomics binning. The most | |||||
| - Chimeric sequences. | ||||||
| - Strain variation. | ||||||
|
|
||||||
| {:width="60%"} | ||||||
| {:width="60%"} | ||||||
|
|
||||||
| In this tutorial we will learn how to use **MetaBAT 2** {%cite Kang2019%} tool through Galaxy. **MetaBAT** stands for "Metagenome Binning based on Abundance and Tetranucleotide frequency". It is: | ||||||
|
|
||||||
|
|
@@ -196,21 +206,11 @@ In this tutorial we will learn how to use **MetaBAT 2** {%cite Kang2019%} tool t | |||||
| We will use the uploaded assembled FASTA files as input to the algorithm (for simplicity, all other parameters keep their default values). | ||||||
|
|
||||||
| > <hands-on-title>Individual binning of short-reads with MetaBAT 2</hands-on-title> | ||||||
| > 1. {% tool [MetaBAT 2](toolshed.g2.bx.psu.edu/repos/iuc/megahit/megahit/1.2.9+galaxy0) %} with parameters: | ||||||
| > 1. {% tool [MetaBAT 2](https://toolshed.g2.bx.psu.edu/view/iuc/metabat2/01f02c5ddff8) %} with parameters: | ||||||
|
🚫 [GTN Lint] <GTN:009> reported by reviewdog 🐶
|
||||||
| > - *"Fasta file containing contigs"*: `assembly fasta files` | ||||||
| > <!-- - In *Advanced options* | ||||||
| > - *"Percentage of good contigs considered for binning decided by connection among contigs"*: `default 95` | ||||||
| > - *"Minimum score of an edge for binning"*: `default 60` | ||||||
| > - *"Maximum number of edges per node"*: `default 200` | ||||||
| > - *"TNF probability cutoff for building TNF graph"*: `default 0` | ||||||
| > - *"Turn off additional binning for lost or small contigs?"*: `default "No"` | ||||||
| > - *"Minimum mean coverage of a contig in each library for binning"*: `default 1` | ||||||
| > - *"Minimum total effective mean coverage of a contig for binning "*: `default 1` | ||||||
| > - *"For exact reproducibility"*: `default 0` | ||||||
| > - In *Output options* | ||||||
| > - *"Minimum size of a bin as the output"*: `default 200000` | ||||||
| > - *"Output only sequence labels as a list in a column without sequences?"*: `default "No"` | ||||||
| > - *"Save cluster memberships as a matrix format?"*: `"Yes"` --> | ||||||
| > - In **Advanced options**, keep all as **default**. | ||||||
| > - In **Output options:** | ||||||
| > - *"Save cluster memberships as a matrix format?"*: `"Yes"` | ||||||
| > | ||||||
| {: .hands_on} | ||||||
|
|
||||||
|
|
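For readers who want to reproduce the hands-on step outside Galaxy, the sketch below builds a roughly equivalent command line. It assumes the `metabat2` executable is installed and that its `-i` (input contigs), `-o` (output prefix) and `--saveCls` (save cluster memberships as a matrix) options behave as in the upstream help text; check `metabat2 --help` before relying on it. The file names and the helper function are hypothetical, and all other parameters are left at their defaults, as in the tutorial.

```python
import shlex


def build_metabat2_command(contigs_fasta, out_prefix, save_cls=True):
    """Return the argument list for a MetaBAT 2 run with default binning parameters."""
    cmd = ["metabat2", "-i", str(contigs_fasta), "-o", str(out_prefix)]
    if save_cls:
        # Mirrors "Save cluster memberships as a matrix format?": Yes
        cmd.append("--saveCls")
    return cmd


# Hypothetical input and output names, for illustration only.
command = build_metabat2_command("ERR2231567_contigs.fasta", "bins/ERR2231567")
print(shlex.join(command))
```

Building the argument list separately from executing it makes the command easy to inspect; it can then be passed to `subprocess.run(command, check=True)` once the paths are correct.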
@@ -244,20 +244,20 @@ These output files can be further analyzed and used for downstream applications | |||||
| > > ``` | ||||||
| > > | ||||||
| > > | ||||||
| > > 2. Create a collection named `MEGAHIT Contig`, rename your pairs with the sample name | ||||||
| > > 2. Create a collection named `MetaBAT2 Bins` and add the zip files to it. | ||||||
| > > | ||||||
| > {: .hands_on} | ||||||
| {: .comment} | ||||||
|
|
||||||
| > <question-title></question-title> | ||||||
| > <question-title>Binning metrics</question-title> | ||||||
| > | ||||||
| > 1. How many bins have been identified for the ERR2231567 sample? | ||||||
| > 2. How many sequences are contained in the second bin? | ||||||
| > 2. How many contigs are in the bin with the most contigs? What about the one with the least? | ||||||
| > | ||||||
| > > <solution-title></solution-title> | ||||||
| > > | ||||||
| > > 1. There are 6 bins identified | ||||||
| > > 2. 167 sequences are classified into the second bin. | ||||||
| > > 1. There are 6 bins identified. | ||||||
| > > 2. 7170 in the one with the most contigs, and 140 in the one with the least (these numbers may differ slightly depending on the version of MetaBAT2). | ||||||
| > > | ||||||
| > {: .solution} | ||||||
| > | ||||||
|
|
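The binning metrics asked about in the question above can also be checked with a short script once the bins are available as FASTA files. The sketch assumes the bins have been extracted into one directory with a `.fa` extension; both the directory layout and the function names are assumptions made for illustration.

```python
from pathlib import Path


def count_contigs(fasta_path) -> int:
    """Count FASTA records by counting header lines, which start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def contigs_per_bin(bin_dir, pattern="*.fa") -> dict[str, int]:
    """Map each bin file name in bin_dir to its number of contigs."""
    return {
        path.name: count_contigs(path)
        for path in sorted(Path(bin_dir).glob(pattern))
    }
```

With `counts = contigs_per_bin("bins")`, `len(counts)` gives the number of bins, and `max(counts, key=counts.get)` and `min(counts, key=counts.get)` name the bins with the most and the least contigs.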
@@ -269,7 +269,7 @@ De-replication is the process of identifying sets of genomes that are the "same" | |||||
|
|
||||||
| A common use for genome de-replication is the case of individual assembly of metagenomic data. If metagenomic samples are collected in a series, a common way to assemble the short reads is with a “co-assembly”. That is, combining the reads from all samples and assembling them together. The problem with this is assembling similar strains together can severely fragment assemblies, precluding recovery of a good genome bin. An alternative option is to assemble each sample separately, and then “de-replicate” the bins from each assembly to make a final genome set. | ||||||
|
|
||||||
| {:width="80%"} | ||||||
| {:width="80%"} | ||||||
|
|
||||||
| MetaBAT 2 does not explicitly perform dereplication in the sense of identifying groups of identical or highly similar genomes in a given dataset. Instead, MetaBAT 2 focuses on improving the accuracy of binning by leveraging various features such as read coverage, differential coverage across samples, and sequence composition. It aims to distinguish between different genomes present in the metagenomic dataset and assign contigs to the appropriate bins. | ||||||
|
|
||||||
|
|
||||||
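The de-replication idea described above, where bins from separately assembled samples are grouped when they represent the "same" genome and one representative is kept, can be sketched as a toy greedy clustering. This is only an illustration under strong assumptions: similarity is approximated by the Jaccard index of k-mer sets, the threshold and function names are hypothetical, and dedicated tools rely on more robust genome comparisons.

```python
def kmer_set(seq: str, k: int = 4) -> set[str]:
    """All distinct k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared k-mers divided by all k-mers seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dereplicate(genomes: dict[str, str], threshold: float = 0.9, k: int = 4):
    """Greedily group genomes; the first member of each group is its representative."""
    representatives: dict[str, set[str]] = {}
    clusters: dict[str, list[str]] = {}
    for name, seq in genomes.items():
        kmers = kmer_set(seq, k)
        for rep, rep_kmers in representatives.items():
            if jaccard(kmers, rep_kmers) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new group.
            representatives[name] = kmers
            clusters[name] = [name]
    return clusters
```

The keys of the returned dictionary form the de-replicated genome set, and each value lists the bins that were collapsed into that representative.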
🚫 [GTN Lint] <GTN:012> reported by reviewdog 🐶
Missing a DOI, URL or ISBN. Please add one of the three.