Commit 6f2c5d3

Updated split & group
1 parent 0335b25 commit 6f2c5d3

2 files changed: +76 −61 lines

docs/side_quests/metadata.md

Lines changed: 3 additions & 3 deletions
@@ -76,13 +76,13 @@ You'll find a main workflow file and a `data` directory containing a samplesheet
 └── nextflow.config
 ```

-The datasheet list the paths to the data files and some associated metadata, organized in 3 columns:
+The samplesheet lists the paths to the data files and some associated metadata, organized in 3 columns:

 - `id`: self-explanatory, an ID given to the file
 - `character`: a character name, that we will use later to draw different creatures
 - `data`: paths to `.txt` files that contain greetings in different languages

-```console title="datasheet.csv"
+```console title="samplesheet.csv"
 id,character,recording
 sampleA,squirrel,/workspaces/training/side-quests/metadata/data/bonjour.txt
 sampleB,tux,/workspaces/training/side-quests/metadata/data/guten_tag.txt
@@ -868,7 +868,7 @@ This pattern of keeping metadata explicit and attached to the data is a core bes

 Applying this pattern in your own work will enable you to build robust, maintainable bioinformatics workflows.

-### Key learnings
+### Key patterns

 1. **Reading and Structuring Metadata:** Reading CSV files and creating organized metadata maps that stay associated with your data files.

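For reference, the reading pattern that bullet describes can be sketched roughly as follows. This is a minimal, hypothetical sketch based on the samplesheet columns shown above (`id`, `character`, `recording`); the exact snippet in the doc may differ:

```groovy
// Minimal sketch (not the doc's exact code): read the samplesheet and
// attach a metadata map to each data file path.
ch_greetings = channel.fromPath('./data/samplesheet.csv')
    .splitCsv(header: true)
    .map { row ->
        [[id: row.id, character: row.character], file(row.recording)]
    }
```

Each channel element is then a `[meta, file]` tuple, which is what keeps the metadata attached to the data as it flows through the pipeline.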
docs/side_quests/splitting_and_grouping.md

Lines changed: 73 additions & 58 deletions
@@ -37,41 +37,38 @@ That covers the fundamentals of reading CSV files with `splitCsv` and creating m

 ## 0. Get started

-Before we dive in, let's make sure you have everything you need.
-
-### 0.1. Open the training codespace
+#### Open the training codespace

 If you haven't yet done so, make sure to open the training environment as described in the [Environment Setup](../envsetup/index.md).

 [![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/nextflow-io/training?quickstart=1&ref=master)

-### 0.2. Move into the project directory
+#### Move into the project directory

-Let's move into the project directory.
+Let's move into the directory where the files for this tutorial are located.

 ```bash
 cd side-quests/splitting_and_grouping
 ```

 You can set VSCode to focus on this directory:

-```bash title="Open VSCode in current directory"
+```bash
 code .
 ```

-### 0.3. Explore the materials
+#### Explore the materials

-You'll find a main workflow file and a `data` directory containing a samplesheet, `samplesheet.csv`.
+You'll find a main workflow file and a `data` directory containing a samplesheet named `samplesheet.csv`.

 ```console title="Directory contents"
-> tree
 .
 ├── data
 │   └── samplesheet.csv
 └── main.nf
 ```

-The samplesheet contains information about samples from different patients, including the patient ID, sample repeat number, type (normal or tumor), and paths to BAM files (which don't actually exist, but we will pretend they do).
+The samplesheet contains information about samples from different patients, including the patient ID, sample repeat number, type (normal or tumor), and paths to hypothetical data files (which don't actually exist, but we will pretend they do).

 ```console title="samplesheet.csv"
 id,repeat,type,bam
@@ -85,15 +82,17 @@ patientC,1,normal,patientC_rep1_normal.bam
 patientC,1,tumor,patientC_rep1_tumor.bam
 ```

-Note there are 8 samples in total from 3 patients (patientA has 2 repeats), 4 normal and 4 tumor.
+Note there are 8 samples in total from 3 patients (patientA has 2 technical repeats), consisting of 4 `tumor` samples (typically originating from tumor biopsies) and 4 `normal` samples (taken from healthy tissue or blood).
+
+If you're not familiar with cancer analysis, just know that this refers to an experimental model that uses paired tumor/normal samples to perform contrastive analyses.

-### 0.4. Scenario
+#### Scenario

 Your challenge is to write a Nextflow workflow that will group and split the samples based on the associated metadata.

 <!-- TODO: give a bit more details, similar to how it's done in the Metadata side quest -->

-### 0.5. Readiness checklist
+#### Readiness checklist

 Think you're ready to dive in?

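To make the paired tumor/normal design concrete, the matching this challenge builds toward can be sketched along the following lines. This is a hypothetical sketch: it assumes a `ch_samples` channel of `[meta, bam]` tuples like the one built in the Key patterns summary further down, and the channel names are illustrative:

```groovy
// Split the samples into independent streams by type.
ch_tumor  = ch_samples.filter { meta, bam -> meta.type == 'tumor' }
ch_normal = ch_samples.filter { meta, bam -> meta.type == 'normal' }

// Key each sample on [id, repeat] so every tumor sample lines up with the
// normal sample from the same patient and technical repeat, then join.
ch_pairs = ch_tumor
    .map { meta, bam -> [meta.subMap(['id', 'repeat']), bam] }
    .join(ch_normal.map { meta, bam -> [meta.subMap(['id', 'repeat']), bam] })
// e.g. [[id:'patientA', repeat:'1'], 'patientA_rep1_tumor.bam', 'patientA_rep1_normal.bam']
```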
@@ -1097,85 +1096,101 @@ In this section, you've learned:

 ## Summary

-In this side quest, you've learned how to split and group data using channels. By modifying the data as it flows through the pipeline, you can construct a pipeline that handles as many items as possible with no loops or while statements. It gracefully scales to large numbers of items. Here's what we achieved:
-
-1. **Created structured input data**: Starting from a CSV file with meta maps (building on patterns from [Metadata in workflows](./metadata.md))
-
-2. **Split data into separate channels**: We used `filter` to divide data into independent streams based on the `type` field
-
-3. **Joined matched samples**: We used `join` to recombine related samples based on `id` and `repeat` fields
+In this side quest, you've learned how to split and group data using channels.

-4. **Distributed across intervals**: We used `combine` to create Cartesian products of samples with genomic intervals for parallel processing
-
-5. **Aggregated by grouping keys**: We used `groupTuple` to collect samples sharing `id` and `interval` fields, merging technical replicates
-
-This approach offers several advantages over writing a pipeline as more standard code, such as using for and while loops:
+By modifying the data as it flows through the pipeline, you can construct a scalable pipeline without using loops or while statements, offering several advantages over more traditional approaches:

 - We can scale to as many or as few inputs as we want with no additional code
 - We focus on handling the flow of data through the pipeline, instead of iteration
 - We can be as complex or simple as required
 - The pipeline becomes more declarative, focusing on what should happen rather than how it should happen
 - Nextflow will optimize execution for us by running independent operations in parallel

-By mastering these channel operations, you can build flexible, scalable pipelines that handle complex data relationships without resorting to loops or iterative programming. This declarative approach allows Nextflow to optimize execution and parallelize independent operations automatically.
+Mastering these channel operations will enable you to build flexible, scalable pipelines that handle complex data relationships without resorting to loops or iterative programming, allowing Nextflow to optimize execution and parallelize independent operations automatically.

-### Key Patterns
+### Key patterns

-- **Filtering**
+1. **Creating structured input data**: Starting from a CSV file with meta maps (building on patterns from [Metadata in workflows](./metadata.md))

-    ```nextflow
-    // Filter channel based on condition
-    channel.filter { it.type == 'tumor' }
-    ```
+    ```groovy
+    ch_samples = channel.fromPath("./data/samplesheet.csv")
+        .splitCsv(header: true)
+        .map { row ->
+            [[id: row.id, repeat: row.repeat, type: row.type], row.bam]
+        }
+    ```
+
+2. **Splitting data into separate channels**: We used `filter` to divide data into independent streams based on the `type` field
+
+    ```groovy
+    // Filter channel based on condition
+    channel.filter { it.type == 'tumor' }
+    ```
+
+3. **Joining matched samples**: We used `join` to recombine related samples based on `id` and `repeat` fields

-- **Joining Channels**
+    - Join two channels by key (first element of tuple)

-    ```nextflow
-    // Join two channels by key (first element of tuple)
-    tumor_ch.join(normal_ch)
+    ```groovy
+    tumor_ch.join(normal_ch)
+    ```

-    // Extract joining key and join by this value
-    tumor_ch.map { meta, file -> [meta.id, meta, file] }
-        .join(
+    - Extract joining key and join by this value
+
+    ```groovy
+    tumor_ch.map { meta, file -> [meta.id, meta, file] }
+        .join(
             normal_ch.map { meta, file -> [meta.id, meta, file] }
         )
+    ```
+
+    - Join on multiple fields using subMap

-    // Join on multiple fields using subMap
-    tumor_ch.map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }
-        .join(
+    ```groovy
+    tumor_ch.map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }
+        .join(
             normal_ch.map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }
         )
-    ```
+    ```

-- **Grouping Data**
+4. **Distributing across intervals**: We used `combine` to create Cartesian products of samples with genomic intervals for parallel processing

-    ```nextflow
-    // Group by the first element in each tuple
-    channel.groupTuple()
-    ```
+    ```groovy
+    samples_ch.combine(intervals_ch)
+    ```

-- **Combining channels**
+5. **Aggregating by grouping keys**: We used `groupTuple` to group by the first element in each tuple, thereby collecting samples sharing `id` and `interval` fields and merging technical replicates

-    ```nextflow
-    // Combine with Cartesian product
-    samples_ch.combine(intervals_ch)
-    ```
+    ```groovy
+    // Group by the first element in each tuple
+    channel.groupTuple()
+    ```
+
+6. **Optimizing the data structure**: We used `subMap` to extract specific fields and created a named closure for making transformations reusable

-- **Data Structure Optimization**
+    - Extract specific fields from a map

-    ```nextflow
-    // Extract specific fields using subMap
+    ```groovy
     meta.subMap(['id', 'repeat'])
+    ```
+
+    - Named closure for reusable transformations

-    // Named closures for reusable transformations
+    ```groovy
     getSampleIdAndReplicate = { meta, file -> [meta.subMap(['id', 'repeat']), file] }
     channel.map(getSampleIdAndReplicate)
     ```

-## Resources
+## Additional resources

 - [filter](https://www.nextflow.io/docs/latest/operator.html#filter)
 - [map](https://www.nextflow.io/docs/latest/operator.html#map)
 - [join](https://www.nextflow.io/docs/latest/operator.html#join)
 - [groupTuple](https://www.nextflow.io/docs/latest/operator.html#grouptuple)
 - [combine](https://www.nextflow.io/docs/latest/operator.html#combine)
+
+---
+
+## What's next?
+
+Return to the [menu of Side Quests](./index.md) or click the button in the bottom right of the page to move on to the next topic in the list.
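Taken together, and assuming the `ch_pairs` channel from the earlier sketch, patterns 4 and 5 of the summary chain into a scatter/gather flow along these lines (the interval values are hypothetical):

```groovy
// Hypothetical end-to-end sketch: scatter each tumor/normal pair across
// genomic intervals, then regroup technical repeats per patient and interval.
ch_intervals = channel.of('chr1', 'chr2')

ch_pairs
    .combine(ch_intervals)                 // Cartesian product: one tuple per pair x interval
    .map { key, tumor, normal, interval ->
        // Re-key on [id, interval], dropping `repeat` so repeats share a key
        [key.subMap(['id']) + [interval: interval], tumor, normal]
    }
    .groupTuple()                          // collect all repeats sharing id and interval
    .view()
```

After `groupTuple`, each element carries lists of tumor and normal files for one patient/interval combination, matching the "merging technical replicates" step described in pattern 5.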
