That covers the fundamentals of reading CSV files with `splitCsv` and creating meta maps.

## 0. Get started

#### Open the training codespace

If you haven't yet done so, make sure to open the training environment as described in the [Environment Setup](../envsetup/index.md).

[Open in GitHub Codespaces](https://codespaces.new/nextflow-io/training?quickstart=1&ref=master)

#### Move into the project directory

Let's move into the directory where the files for this tutorial are located.

```bash
cd side-quests/splitting_and_grouping
```

You can set VSCode to focus on this directory:

```bash
code .
```

#### Explore the materials

You'll find a main workflow file and a `data` directory containing a samplesheet named `samplesheet.csv`.

```console title="Directory contents"
.
├── data
│   └── samplesheet.csv
└── main.nf
```

The samplesheet contains information about samples from different patients, including the patient ID, sample repeat number, type (normal or tumor), and paths to hypothetical data files (which don't actually exist, but we will pretend they do).
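
To make that structure concrete, here is a sketch of what such a samplesheet could look like. This is not the real file contents: the IDs, column names, and file paths are all illustrative.

```csv title="Illustrative samplesheet layout (not the actual file)"
id,repeat,type,bam
patientA,1,normal,patientA_rep1_normal.bam
patientA,1,tumor,patientA_rep1_tumor.bam
patientA,2,normal,patientA_rep2_normal.bam
patientA,2,tumor,patientA_rep2_tumor.bam
patientB,1,normal,patientB_normal.bam
patientB,1,tumor,patientB_tumor.bam
patientC,1,normal,patientC_normal.bam
patientC,1,tumor,patientC_tumor.bam
```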

Note there are 8 samples in total from 3 patients (patientA has 2 technical repeats), consisting of 4 `tumor` samples (typically originating from tumor biopsies) and 4 `normal` samples (taken from healthy tissue or blood).

If you're not familiar with cancer analysis, just know that this refers to an experimental model that uses paired tumor/normal samples to perform contrastive analyses.

#### Scenario

Your challenge is to write a Nextflow workflow that will group and split the samples based on the associated metadata.

<!-- TODO: give a bit more details, similar to how it's done in the Metadata side quest -->

#### Readiness checklist

Think you're ready to dive in?

## Summary

In this side quest, you've learned how to split and group data using channels.

By modifying the data as it flows through the pipeline, you can construct a scalable pipeline without using loops or `while` statements, offering several advantages over more traditional approaches:

- We can scale to as many or as few inputs as we want with no additional code
- We focus on handling the flow of data through the pipeline, instead of iteration
- The pipeline can be as complex or as simple as required
- The pipeline becomes more declarative, focusing on what should happen rather than how it should happen
- Nextflow will optimize execution for us by running independent operations in parallel

Mastering these channel operations will enable you to build flexible, scalable pipelines that handle complex data relationships without resorting to loops or iterative programming, allowing Nextflow to optimize execution and parallelize independent operations automatically.

### Key patterns

1. **Creating structured input data**: Starting from a CSV file with meta maps (building on patterns from [Metadata in workflows](./metadata.md))
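
    A minimal sketch of this pattern (the file path and column names are assumptions for illustration):

    ```groovy
    // Read the samplesheet and emit [meta, file] tuples
    channel
        .fromPath('data/samplesheet.csv')
        .splitCsv(header: true)
        .map { row -> [[id: row.id, repeat: row.repeat, type: row.type], file(row.bam)] }
    ```

2. **Splitting data into separate channels**: We used `filter` to divide data into independent streams based on the `type` field

    A minimal sketch (the channel names and tuple shape are assumptions):

    ```groovy
    // Split one channel into two based on the meta.type field
    normal_ch = samples_ch.filter { meta, bam -> meta.type == 'normal' }
    tumor_ch  = samples_ch.filter { meta, bam -> meta.type == 'tumor' }
    ```

3. **Joining matched samples**: We used `join` to recombine related samples based on `id` and `repeat` fields

    A minimal sketch (assumes both channels lead with a shared key, e.g. `[id: ..., repeat: ...]`):

    ```groovy
    // join matches tuples from both channels by their first element
    normal_ch.join(tumor_ch)
    ```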

4. **Distributing across intervals**: We used `combine` to create Cartesian products of samples with genomic intervals for parallel processing

    ```groovy
    samples_ch.combine(intervals_ch)
    ```

5. **Aggregating by grouping keys**: We used `groupTuple` to group by the first element in each tuple, thereby collecting samples sharing `id` and `interval` fields and merging technical replicates

    ```groovy
    // Group by the first element in each tuple
    samples_ch.groupTuple()
    ```

- **Optimizing the data structure**: We used `subMap` to extract specific fields and created a named closure for making transformations reusable
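
    A minimal sketch (the field names, closure name, and channel names are assumptions):

    ```groovy
    // Named closure: trim the meta map down to the id field, reusable across channels
    def simplifyMeta = { meta, bam -> [meta.subMap(['id']), bam] }

    normal_ch.map(simplifyMeta)
    tumor_ch.map(simplifyMeta)
    ```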