The tool is used to:
- Generate the jobs for ega-file-transfer-to-collab
- Audit and report the files to be staged to or removed from ega Aspera server
The tool needs to talk to two different kinds of git repository to gather the information to realize the above tasks.
- The ega auditing git repository(https://github.com/icgc-dcc/ega-file-transfer.git)
- Job repositories which are tracking ega-file-transfer-to-collab job status(e.g., http://142.1.177.124/jt-hub/ega-file-transfer-to-collab-jtracker)
Before you can run the tool, you need to configure the tool. The configuration file locates ega_script/conf/conf.yaml. You may need to change the following two base_path for ega_audit and ega_job respectively.
ega_audit_base_path: "../ega-file-transfer"
ega_job_base_path: ".."
The above default configuration will assume:
- All the job repos which are set up for ega-file-transfer-to-collab, e.g.,
ega-file-transfer-to-collab-jtracker(http://142.1.177.124/jt-hub/ega-file-transfer-to-collab-jtracker), aregit cloneto the same folder as the tool script - The ega auditing git repository:
ega-file-transfer(https://github.com/icgc-dcc/ega-file-transfer) is alsogit cloneto the same folder as the tool script
├── ega-file-transfer
├── ega-file-transfer-to-collab-2-jtracker
├── ega-file-transfer-to-collab-3-jtracker
├── ega-file-transfer-to-collab-4-jtracker
├── ega-file-transfer-to-collab-5-jtracker
├── ega-file-transfer-to-collab-jtracker
└── ega_script
Since ega auditing git repository is version controlled, before we can run the tool to generate the jobs and report to_stage or to_remove files, we also need to set the version of the ega auditing reports in the conf/conf.yaml, e.g.,
file_version: "v20170630"
Get the source script of the tool
git clone [email protected]:lindaxiang/ega_script.git
Then you can run ./main.py -h to get the usage of the tool
usage: main.py [-h] [-c CONF] -t TASK [-p [PROJECT [PROJECT ...]]]
[-s [SEQ_STRATEGY [SEQ_STRATEGY ...]]]
EGA-file-to-colllab job generator and auditor
optional arguments:
-h, --help show this help message and exit
-c CONF, --setting CONF
Specify ega setting file
-t TASK, --task TASK Specify the task
-p [PROJECT [PROJECT ...]], --project [PROJECT [PROJECT ...]]
Specify the project
-s [SEQ_STRATEGY [SEQ_STRATEGY ...]], --seq_strategy [SEQ_STRATEGY [SEQ_STRATEGY ...]]
Specify the sequencing strategy
For example generating jobs for RNA-Seq data of project CLLE-ES, do this:
cd ega_script
./main.py -t job -p CLLE-ES -s RNA-Seq
- If no
projectis specified, the tool will generate the eligible jobs for all the projects which have auditing reports available. - If no
seq_strategyis specified, the tool will generate the eligible jobs for all kinds of seq_strategy which are included in the related auditing reports. - The generated jobs locates in
job_state.backlogof one of the job repositories which is defined in theconf/conf.yaml, you can change thejob folderif needed:
job_folder: "ega-file-transfer-to-collab-2-jtracker/ega-file-transfer-to-collab.0.6.jtracker/job_state.backlog"
In order to get the list of files which are to be staged to Aspera server by EGA, do this:
cd ega_script
./main.py -t stage
You can specify the project and seq_strategy in order to get the list of files which are only for given sequence strategies and belong to given projects.
The tool will generate to_stage_*.tsv files under each project. For example:
ega_operation/
├── BRCA-KR
│ └── to_stage_run.tsv
├── CLLE-ES
│ └── to_stage_run.tsv
├── LICA-FR
│ └── to_stage_run.tsv
├── MALY-DE
│ └── to_stage_analysis.tsv
├── OV-AU
│ └── to_stage_analysis.tsv
├── PACA-AU
│ ├── to_stage_analysis.tsv
│ └── to_stage_run.tsv
├── PAEN-AU
│ └── to_stage_analysis.tsv
└── to_remove.tsv
In order to list all files which can be removed from Aspera server by EGA, do this:
cd ega_script
./main.py -t remove
The tool will generate to_remove.txt file locating at: ega-file-transfer/ega_operation/to_remove.tsv
When using the tool to generate the jobs or report the to_stage or to_remove files, the tool did many QC checks based on the auditing reports, the QC results are logged into the *.log files locates:
ega_script/log/
├── error.log
├── info.log
└── warn.log
Here are some sample log messages:
2017-07-25 15:12:45,689 - audit.stage - WARNING - LICA-FR::EGAF00000483937 has the same file_md5sum and encrypted_file_md5sum: set(['772febc5f8fea25a9b09e43dd51e43bd'])
2017-07-25 15:12:45,690 - audit.stage - WARNING - LICA-FR::EGAF00000483938 has the same file_md5sum and encrypted_file_md5sum: set(['170588f8a583c2d4fee882fdfcb6133b'])
2017-07-25 15:12:45,690 - audit.stage - WARNING - LICA-FR::EGAF00000483899 has the same file_md5sum and encrypted_file_md5sum: set(['83aed772452945dc994bcfad7edebc3a'])
2017-07-25 15:12:49,248 - audit.stage - WARNING - MALY-DE::EGAF00001592148 has the id inconsistent: ega_analysis_id in audit report version v20170630
2017-07-25 15:12:49,248 - audit.stage - WARNING - MALY-DE::EGAF00001592148 has the id inconsistent: file_name in audit report version v20170630
2017-07-25 15:12:49,248 - audit.stage - WARNING - MALY-DE::EGAF00001592148 has the id inconsistent: encrypted_file_md5sum in audit report version v20170630
- Linda Xiang - Initial work