Skip to content

Commit c77a166

Browse files
authored
Merge pull request #91 from ArcInstitute/dev
Add low count detection within pair-wise comparisons
2 parents 66c6b15 + 88ed081 commit c77a166

File tree

5 files changed

+177
-106
lines changed

5 files changed

+177
-106
lines changed

README.md

Lines changed: 98 additions & 85 deletions
Original file line numberDiff line numberDiff line change
@@ -36,28 +36,6 @@ ___
3636

3737
___
3838

39-
<details>
40-
<summary>Benchmarking</summary>
41-
<br>
42-
43-
Benchmarking ScreenPro2 with other CRISPR screen analysis tools
44-
45-
### More thoughtful NGS read trimming recovers more sgRNA counts
46-
47-
### ScreenPro2 statistical analysis is more accurate than ScreenProcessing
48-
49-
### ScreenPro2 is more flexible than ScreenProcessing
50-
51-
Not only does ScreenPro2 have more features than ScreenProcessing, but it is also more flexible. ScreenPro2 can process data from diverse CRISPR screen platforms and is designed to be modular to enable easy extension to custom CRISPR screen platforms or other commonly used platforms in addition to the ones currently implemented.
52-
53-
### ScreenPro2 is faster than ScreenProcessing
54-
55-
Last but not least, ScreenPro2 runs faster than ScreenProcessing (thanks to [biobear](https://github.com/wheretrue/biobear)) for processing FASTQ files.
56-
57-
</details>
58-
59-
___
60-
6139
## Installation
6240
ScreenPro2 is available on [PyPI](https://pypi.org/project/ScreenPro2/) and can be installed with pip:
6341
```bash
@@ -97,8 +75,6 @@ Data analysis for CRISPR screens with NGS readouts can be broken down into three
9775

9876
The first step in analyzing CRISPR screens with deep sequencing readouts is to process the FASTQ files and generate counts for each guide RNA element in the library. ScreenPro2 has built-in functionalities to process FASTQ files and generate counts for different types of CRISPR screens platforms (see [Supported CRISPR Screen Platforms](#supported-crispr-screen-platforms)).
9977

100-
___
101-
10278
<details>
10379
<summary>Command Line Interface (CLI)</summary>
10480
<br>
@@ -133,11 +109,10 @@ ___
133109
-o <output-directory>
134110
--write-count-matrix
135111
```
112+
___
136113

137114
</details>
138115

139-
___
140-
141116
<details>
142117
<summary>Python Package Usage</summary>
143118
<br>
@@ -205,85 +180,123 @@ ___
205180
```python
206181
adata = counter.build_counts_anndata()
207182
```
183+
184+
___
208185

209186
</details>
210187

211188
### Step 2: Phenotype calculation
212189

213190
Once you have the counts, you can use ScreenPro2 `phenoscore` and `phenostats` modules to calculate the phenotype scores and statistics between screen arms.
214191

215-
#### Load Data
216-
First, load your data into an `AnnData` object (see [anndata](https://anndata.readthedocs.io/en/latest/index.html) for more information).
192+
<details>
193+
<summary>Load Data</summary>
194+
<br>
217195

218-
The `AnnData` object must have the following contents:
219-
- `adata.X` – counts matrix (samples x targets) where each value represents the sequencing count from NGS data.
220-
- `adata.obs` – a pandas dataframe of samples metadata including "condition" and "replicate" columns.
221-
- "condition": the condition for each sample in the experiment.
222-
- "replicate": the replicate number for each sample in the experiment.
223-
- `adata.var` – a pandas dataframe of targets in sgRNA library including "target" and "targetType" columns.
224-
- "target": the target for each entry in reference sgRNA library. For single sgRNA libraries, this column can be
225-
used to store gene names. For dual or multiple targeting sgRNA libraries, this column can be used to store gene pairs
226-
or any other relevant information about the target.
227-
- "targetType": the type of target for each entry in reference sgRNA library. Note that this column is used to
228-
distinguish between different types of sgRNAs in the library and negative control sgRNAs can be defined as `"targetType" == "negative_control"`.
229-
This is important for the phenotype calculation step.
196+
First, load your data into an `AnnData` object (see [anndata](https://anndata.readthedocs.io/en/latest/index.html) for more information).
230197

198+
The `AnnData` object must have the following contents:
199+
- `adata.X` – counts matrix (samples x targets) where each value represents the sequencing count from NGS data.
200+
- `adata.obs` – a pandas dataframe of samples metadata including "condition" and "replicate" columns.
201+
- "condition": the condition for each sample in the experiment.
202+
- "replicate": the replicate number for each sample in the experiment.
203+
- `adata.var` – a pandas dataframe of targets in sgRNA library including "target" and "targetType" columns.
204+
- "target": the target for each entry in reference sgRNA library. For single sgRNA libraries, this column can be
205+
used to store gene names. For dual or multiple targeting sgRNA libraries, this column can be used to store gene pairs
206+
or any other relevant information about the target.
207+
- "targetType": the type of target for each entry in reference sgRNA library. Note that this column is used to
208+
distinguish between different types of sgRNAs in the library and negative control sgRNAs can be defined as `"targetType" == "negative_control"`.
209+
This is important for the phenotype calculation step.
231210

232-
ScreenPro2 has a built-in class for different types of CRISPR screen assays. Currently, there is a class called `PooledScreens`
233-
that can be used to process data from pooled CRISPR screens. To create a `PooledScreens` object from an `AnnData` object,
234-
you can use the following example code:
235211

236-
```python
237-
import pandas as pd
238-
import anndata as ad
239-
from screenpro.assays import PooledScreens
212+
ScreenPro2 has a built-in class for different types of CRISPR screen assays. Currently, there is a class called `PooledScreens`
213+
that can be used to process data from pooled CRISPR screens. To create a `PooledScreens` object from an `AnnData` object,
214+
you can use the following example code:
240215

241-
adata = ad.AnnData(
242-
X = counts_df, # pandas dataframe of counts (samples x targets)
243-
obs = meta_df, # pandas dataframe of samples metadata including "condition" and "replicate" columns
244-
var = target_df # pandas dataframe of targets metadata including "target" and "targetType" columns
245-
)
216+
```python
217+
import pandas as pd
218+
import anndata as ad
219+
from screenpro.assays import PooledScreens
220+
221+
adata = ad.AnnData(
222+
X = counts_df, # pandas dataframe of counts (samples x targets)
223+
obs = meta_df, # pandas dataframe of samples metadata including "condition" and "replicate" columns
224+
var = target_df # pandas dataframe of targets metadata including "target" and "targetType" columns
225+
)
246226

247-
screen = PooledScreens(adata)
248-
```
227+
screen = PooledScreens(adata)
228+
```
249229

250-
<img width="600" alt="image" src="https://github.com/ArcInstitute/ScreenPro2/assets/53412130/bb38d119-8f24-44fa-98ab-7ef4457ef8d2">
230+
<img width="600" alt="image" src="https://github.com/ArcInstitute/ScreenPro2/assets/53412130/bb38d119-8f24-44fa-98ab-7ef4457ef8d2">
251231

252-
#### Perform Screen Processing Analysis
253-
Once the screen object is created, you can use several available workflows to calculate the phenotype scores and statisitics by comparing each entry in reference sgRNA library between screen arms. Then, these scores and statistics are used to nominate hits.
232+
___
254233

255-
##### Drug Screen Workflow: calculate `gamma`, `rho`, and `tau` scores
256-
`.calculateDrugScreen` method can be used to calculate the enrichment of each gene between screen arms for a drug
257-
screen experiment. This method calculates `gamma`, `rho`, and `tau` scores for each gene and adds them to the
258-
`.phenotypes` attribute of the `PooledScreens` object.
234+
</details>
259235

260-
Here is an example for running the workflow on a [CRISPRi-dual-sgRNA-screens](#dcas9-crisprai-dual-sgrna-screens) dataset:
236+
<details>
237+
<summary>Run workflows</summary>
238+
<br>
261239

262-
```python
263-
# Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
264-
screen.calculateDrugScreen(
265-
t0='T0',
266-
untreated='DMSO', # replace with the label for untreated condition
267-
treated='Drug', # replace with the label for treated condition
268-
score_level='compare_reps'
269-
)
270-
```
271-
___
272-
For example, in a Decitabine CRISPRi drug screen (see Figure 1B-C in [this bioRxiv paper](https://www.biorxiv.org/content/10.1101/2022.12.14.518457v2.full)), each phenotype score represents a comparison between different arms of the screen and `rho` scores shows the main drug phenotype as illustrated here:
273-
<img width="800" alt="image" src="https://github.com/abearab/ScreenPro2/assets/53412130/b84b3e1f-e049-4da6-b63d-d4c72bc97cda">
240+
Once the screen object is created, you can use several available workflows to calculate the phenotype scores and statisitics by comparing each entry in reference sgRNA library between screen arms. Then, these scores and statistics are used to nominate hits.
274241

275-
##### Flow cytometry based screen workflow: calculate phenotype score to compare high and low bins
276-
`.calculateFlowBasedScreen` method can be used to calculate the enrichment of each target between high bin vs. low bin
277-
of a flow cytometry-based screen experiment. This method calculates `PhenoScore` for each target and adds them to the
278-
`.phenotypes` attribute of the `PooledScreens` object.
242+
##### Drug Screen Workflow: calculate `gamma`, `rho`, and `tau` scores
243+
`.calculateDrugScreen` method can be used to calculate the enrichment of each gene between screen arms for a drug
244+
screen experiment. This method calculates `gamma`, `rho`, and `tau` scores for each gene and adds them to the
245+
`.phenotypes` attribute of the `PooledScreens` object.
246+
247+
Here is an example for running the workflow on a [CRISPRi-dual-sgRNA-screens](#dcas9-crisprai-dual-sgrna-screens) dataset:
248+
249+
```python
250+
# Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
251+
screen.calculateDrugScreen(
252+
t0='T0',
253+
untreated='DMSO', # replace with the label for untreated condition
254+
treated='Drug', # replace with the label for treated condition
255+
score_level='compare_reps'
256+
)
257+
```
258+
___
259+
For example, in a Decitabine CRISPRi drug screen (see Figure 1B-C in [this bioRxiv paper](https://www.biorxiv.org/content/10.1101/2022.12.14.518457v2.full)), each phenotype score represents a comparison between different arms of the screen and `rho` scores shows the main drug phenotype as illustrated here:
260+
<img width="800" alt="image" src="https://github.com/abearab/ScreenPro2/assets/53412130/b84b3e1f-e049-4da6-b63d-d4c72bc97cda">
261+
262+
##### Flow cytometry based screen workflow: calculate phenotype score to compare high and low bins
263+
`.calculateFlowBasedScreen` method can be used to calculate the enrichment of each target between high bin vs. low bin
264+
of a flow cytometry-based screen experiment. This method calculates `PhenoScore` for each target and adds them to the
265+
`.phenotypes` attribute of the `PooledScreens` object.
266+
267+
```python
268+
# Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
269+
screen.calculateFlowBasedScreen(
270+
low_bin='low_bin', high_bin='high_bin',
271+
score_level='compare_reps'
272+
)
273+
```
274+
___
275+
276+
</details>
277+
278+
<details>
279+
<summary>Benchmarking ScreenPro2 vs other CRISPR screen processing tools</summary>
280+
<br>
281+
282+
Coming soon...
283+
284+
</details>
285+
286+
<!-- Benchmarking ScreenPro2 with other CRISPR screen analysis tools
287+
288+
### More thoughtful NGS read trimming recovers more sgRNA counts
289+
290+
### ScreenPro2 statistical analysis is more accurate than ScreenProcessing
291+
292+
### ScreenPro2 is more flexible than ScreenProcessing
293+
294+
Not only does ScreenPro2 have more features than ScreenProcessing, but it is also more flexible. ScreenPro2 can process data from diverse CRISPR screen platforms and is designed to be modular to enable easy extension to custom CRISPR screen platforms or other commonly used platforms in addition to the ones currently implemented.
295+
296+
### ScreenPro2 is faster than ScreenProcessing
297+
298+
Last but not least, ScreenPro2 runs faster than ScreenProcessing (thanks to [biobear](https://github.com/wheretrue/biobear)) for processing FASTQ files. -->
279299

280-
```python
281-
# Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
282-
screen.calculateFlowBasedScreen(
283-
low_bin='low_bin', high_bin='high_bin',
284-
score_level='compare_reps'
285-
)
286-
```
287300

288301
### Step 3: Data visualization
289302

screenpro/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,6 @@
3131
from .dashboard import DrugScreenDashboard
3232

3333

34-
__version__ = "0.4.10"
34+
__version__ = "0.4.11"
3535
__author__ = "Abe Arab"
3636

screenpro/assays/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -110,7 +110,7 @@ def _auto_run_name(self):
110110
)
111111
return run_name
112112

113-
def filterLowCounts(self, filter_type='all', minimum_reads=50):
113+
def filterLowCounts(self, filter_type='all', minimum_reads=1):
114114
"""
115115
Filter low counts in adata.X
116116
"""

screenpro/phenoscore/__init__.py

Lines changed: 10 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,9 @@
2828
def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', test='ttest',
2929
growth_rate=1, n_reps='auto', keep_top_n = None, collapse_var=False,
3030
num_pseudogenes='auto', pseudogene_size='auto',
31-
count_layer=None, ctrl_label='negative_control'):
31+
count_layer=None, count_filter_type='mean', count_filter_threshold=40,
32+
ctrl_label='negative_control'
33+
):
3234
"""Calculate phenotype score and p-values when comparing `cond_test` vs `cond_ref`.
3335
3436
Args:
@@ -44,6 +46,8 @@ def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', t
4446
num_pseudogenes (int): number of pseudogenes to generate
4547
pseudogene_size (int): number of sgRNA elements in each pseudogene
4648
count_layer (str): count layer to use for calculating score, default is None (use default count layer in adata.X)
49+
count_filter_type (str): filter type for counts, default is 'mean'
50+
count_filter_threshold (int): filter threshold for counts, default is 40
4751
ctrl_label (str): control label, default is 'negative_control'
4852
4953
Returns:
@@ -85,7 +89,9 @@ def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', t
8589
var_names=var_names,
8690
test=test,
8791
ctrl_label=ctrl_label,
88-
growth_rate=growth_rate
92+
growth_rate=growth_rate,
93+
filter_type=count_filter_type,
94+
filter_threshold=count_filter_threshold
8995
)
9096

9197
elif score_level in ['compare_guides']:
@@ -114,6 +120,8 @@ def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', t
114120
test=test,
115121
ctrl_label=ctrl_label,
116122
growth_rate=growth_rate,
123+
filter_type=count_filter_type,
124+
filter_threshold=count_filter_threshold
117125
)
118126

119127
# get best best transcript as lowest p-value for each target

0 commit comments

Comments
 (0)