BinAC/Software/Nextflow: Difference between revisions
F Bartusch (talk | contribs) (Created page with "= Description = = Usage =") |
= Description =

Nextflow is a scientific workflow system predominantly used for bioinformatics data analysis. This documentation also covers nf-core, a community-driven initiative to curate a collection of analysis pipelines built using Nextflow.

The documentation in the bwHPC Wiki serves as a 'getting started' guide for installing and using Nextflow with nf-core on BinAC.
The [https://nf-co.re/pipelines/ nf-core documentation] provides detailed information for each pipeline.

This documentation does not cover how to write your own pipelines. This information is available in the [https://www.nextflow.io/docs/latest/index.html Nextflow documentation].

= Installation =

We recommend installing Nextflow via Miniconda.
Since Nextflow is often used with nf-core pipelines, we also recommend installing the nf-core tools.

The following commands create a new Conda environment that provides Nextflow and the nf-core tools.

They also set a shared Singularity cache directory in your <code>~/.bashrc</code>, where all Singularity containers are stored.

<pre>
conda create --name nf-core python=3.12 nf-core nextflow
echo "export NXF_SINGULARITY_CACHEDIR=/beegfs/work/container/apptainer_cache/$USER" >> ~/.bashrc
echo "export SINGULARITY_CACHEDIR=/beegfs/work/container/apptainer_cache/$USER" >> ~/.bashrc
source ~/.bashrc
conda activate nf-core
</pre>
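
Note that every run of <code>echo ... >> ~/.bashrc</code> appends another copy of the line. The following is a minimal sketch of an idempotent alternative. The helper name <code>append_once</code> is hypothetical (it is not part of Conda, Nextflow or nf-core), and the demonstration writes to a scratch file instead of your real <code>~/.bashrc</code>.

```shell
# append_once FILE LINE
# Append LINE to FILE only if FILE does not already contain exactly that line.
# (Hypothetical helper, for illustration only.)
append_once() {
    touch "$1"
    # -q: quiet, -x: match whole lines only, -F: treat the pattern as a fixed string
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Demonstration on a scratch file; the user name falls back to a placeholder.
demo_rc=$(mktemp)
cache_dir="/beegfs/work/container/apptainer_cache/${USER:-someuser}"

append_once "$demo_rc" "export NXF_SINGULARITY_CACHEDIR=${cache_dir}"
append_once "$demo_rc" "export NXF_SINGULARITY_CACHEDIR=${cache_dir}"   # second call changes nothing
append_once "$demo_rc" "export SINGULARITY_CACHEDIR=${cache_dir}"

cat "$demo_rc"   # shows two lines, one per variable
```

To apply this to your own setup, you would pass <code>~/.bashrc</code> as the first argument.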

= Usage =

== Install an nf-core pipeline ==

You can start and run pipelines now, and Nextflow will pull all containers automatically.
<b>However</b>, we encountered issues when a pipeline starts more than one job that pulls the same image simultaneously.
Therefore, we recommend downloading the pipeline and its containers first using the nf-core tools.

In this guide, we will use the <code>rnaseq</code> pipeline in revision <code>3.14.0</code>. To make the code examples more readable and broadly applicable, we first define some environment variables.

If you use another pipeline and/or another revision, simply change the <code>pipeline</code> and <code>revision</code> environment variables.

The current working directory should be one of your workspaces under <code>/beegfs/work</code>.

<pre>
cd /beegfs/work/<path to your workspace>
export pipeline=rnaseq
export revision=3.14.0
export pipeline_dir=${PWD}/nf-core-${pipeline}/$(echo $revision | tr . _)
export nxf_work_dir=${PWD}/work
export nxf_output_dir=${PWD}/output
echo "Pipeline will be downloaded to: ${pipeline_dir}"
</pre>
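
To see what the <code>pipeline_dir</code> line evaluates to, the following sketch isolates the <code>tr . _</code> step, which replaces every dot in the revision with an underscore. The function name <code>revision_to_dirname</code> and the workspace path are made up for this illustration.

```shell
# Turn a revision string such as 3.14.0 into the directory-friendly form 3_14_0.
# tr replaces every "." with "_" and leaves all other characters untouched.
revision_to_dirname() {
    printf '%s\n' "$1" | tr . _
}

base_dir=/beegfs/work/example_workspace   # hypothetical workspace path
pipeline=rnaseq
revision=3.14.0

pipeline_dir="${base_dir}/nf-core-${pipeline}/$(revision_to_dirname "$revision")"
echo "$pipeline_dir"
# prints /beegfs/work/example_workspace/nf-core-rnaseq/3_14_0
```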

The following command will download the pipeline into your current working directory and pull any Singularity containers that aren't yet in the cache. This can take some time, so grab a coffee.

<pre>
nf-core download -o ${pipeline_dir} -x none -d -u amend --container-system singularity -r ${revision} ${pipeline}
</pre>

If there are errors during this step, contact [https://wiki.bwhpc.de/e/BinAC/Support BinAC support] and provide the commands you used along with the error message.

== Test nf-core pipeline ==

The first thing you should do after downloading the pipeline is to perform a test run. nf-core pipelines come with a test profile that should work right out of the box. Additionally, there is a BinAC profile for nf-core, which includes settings for BinAC's job scheduler and queue configurations.

Nextflow pipelines do not run in the background by default, so it is best to use a terminal multiplexer (like <code>screen</code> or <code>tmux</code>) when running a long pipeline. Terminal multiplexers allow you to have multiple windows within a single terminal. The advantage of using these for running Nextflow pipelines is that you can detach from the session and reattach to it later (even through a new SSH connection) to check on the pipeline's progress.
This ensures that the pipeline continues to run even if you disconnect from the cluster, because the detached session keeps running.

Start a screen session:

<pre>
screen
</pre>

Since this is a new terminal, you will need to activate the Conda environment again.
Note that environment variables like <code>pipeline</code> are already set because we defined them using the <code>export</code> keyword, which makes them available to child processes.

<pre>
conda activate nf-core
</pre>
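
The effect of <code>export</code> can be checked without <code>screen</code>. The sketch below starts a child shell with <code>sh -c</code>, which inherits the environment in the same way as the shell inside a new screen session. The two variable names are chosen only for this example.

```shell
# A plain assignment stays local to the current shell ...
unexported_var="only visible here"
# ... while an exported variable is copied into the environment of child processes.
export exported_var="visible in child processes"

# sh -c starts a child shell. ${name:-<unset>} prints the placeholder
# when the variable is not defined in that child.
child_output=$(sh -c 'echo "${unexported_var:-<unset>} / ${exported_var:-<unset>}"')
echo "$child_output"
# prints: <unset> / visible in child processes
```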

Now you can run the pipeline test.

You should always specify two directories when running the pipeline, so that you know exactly where the results are stored.
One is the work directory (<code>-work-dir</code>), where Nextflow stores intermediate results.
The other is the output directory (<code>--outdir</code>), where the pipeline stores its final results.

<!-- TODO: Difference - and --arguments -->

<pre>
nextflow run ${pipeline_dir} \
    -profile binac,test \
    -work-dir ${nxf_work_dir} \
    --outdir ${nxf_output_dir}
</pre>
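
The command mixes two kinds of arguments: those with a single dash (<code>-profile</code>, <code>-work-dir</code>) are options of Nextflow itself, while those with a double dash (<code>--outdir</code>) are parameters passed on to the pipeline. The following sketch only labels the tokens of a command line according to this rule. <code>classify_args</code> is a hypothetical helper that does not call Nextflow, and the paths are placeholders.

```shell
# Label each command-line token the way the nextflow launcher treats it:
#   --name  -> parameter handed to the pipeline
#   -name   -> option of Nextflow itself
# Every other token is reported as a value belonging to the preceding argument.
classify_args() {
    for arg in "$@"; do
        case $arg in
            --*) echo "pipeline parameter: $arg" ;;
            -*)  echo "nextflow option: $arg" ;;
            *)   echo "value: $arg" ;;
        esac
    done
}

classify_args -profile binac,test -work-dir /path/to/work --outdir /path/to/output
```

A common mistake is to write <code>--profile</code> or <code>-outdir</code>; with the rule above it becomes clear why these end up in the wrong place.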

<!-- TODO add screenshot -->

As mentioned, the pipeline runs in a screen session.
You can detach from the screen session and the pipeline will continue to run.
The keyboard shortcut for detaching is <code>CTRL+a</code> followed by <code>d</code>.
That means you press the <code>CTRL</code> and <code>a</code> keys at the same time, then release them and press <code>d</code>. Do not press <code>CTRL+c</code>, as this interrupts the running pipeline.
You should now be detached from the screen session and back in your login terminal.

<!--TODO add screenshot-->

While in your login terminal (or another window in your screen session), you can observe that Nextflow has submitted a job to the cluster for each pipeline process execution.
Your output will differ, but it should show some pipeline jobs whose job names begin with <code>nf-NFCORE</code>.

<pre>
(base) [tu_iioba01@login03 ~]$ qstat -u $USER

mgmt02:
                                                                                      Req'd     Req'd        Elap
Job ID                  Username    Queue    Jobname          SessID   NDS    TSK    Memory      Time S      Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
11626226                tu_iioba01  short    nf-NFCORE_RNASE   19779    --     --       6gb  04:00:00 C        --
11626227                tu_iioba01  short    nf-NFCORE_RNASE   19788     1      2       6gb  06:00:00 C        --
11626228                tu_iioba01  short    nf-NFCORE_RNASE   19805     1      2       6gb  06:00:00 C        --
11626229                tu_iioba01  short    nf-NFCORE_RNASE   19819    --     --       6gb  04:00:00 C        --
11626230                tu_iioba01  short    nf-NFCORE_RNASE   19839    --     --       6gb  04:00:00 C        --
</pre>
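
If the job list gets long, a small filter can summarise it. The following sketch counts the Nextflow jobs per job state. It assumes that the job name is the 4th and the state the 10th whitespace-separated column, as in the listing above; <code>summarize_nf_jobs</code> is a hypothetical helper name, and the two input rows are copied from that listing.

```shell
# Read `qstat -u` style output from stdin and print "<state> <count>" for all
# jobs whose name starts with nf-NFCORE.
# Assumption: column 4 is the job name and column 10 the job state.
summarize_nf_jobs() {
    awk '$4 ~ /^nf-NFCORE/ { count[$10]++ }
         END { for (s in count) printf "%s %d\n", s, count[s] }' | sort
}

# Demonstration with two rows taken from the listing above
summarize_nf_jobs <<'EOF'
11626226 tu_iioba01 short nf-NFCORE_RNASE 19779 -- -- 6gb 04:00:00 C --
11626227 tu_iioba01 short nf-NFCORE_RNASE 19788 1 2 6gb 06:00:00 C --
EOF
# prints: C 2
```

On the cluster, you would feed it the real output with <code>qstat -u $USER | summarize_nf_jobs</code>.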

Now return to the screen session where the pipeline is running.
You can list your screen sessions and their IDs with <code>screen -ls</code>:

<pre>
(nf-core) [tu_iioba01@login03 nextflow_tests]$ screen -ls
There is a screen on:
        <screen session ID>.pts-2.login03       (Detached)
1 Socket in /var/run/screen/S-tu_iioba01.
</pre>

If there is only one screen session, you can reattach with:

<pre>
screen -r
</pre>

Otherwise, you will need to specify the screen session ID:

<pre>
screen -r <screen session ID>
</pre>

You can now observe the pipeline's execution progress. At the end, the output should look like this:

<pre>
-[nf-core/rnaseq] Pipeline completed successfully -
Completed at: 13-Aug-2024 16:35:24
Duration    : 17m 37s
CPU hours   : 0.6
Succeeded   : 194
</pre>

The test run was successful. Now you can run the pipeline with your own data.

== Run pipeline with your own data ==

Usually, you specify your input files for nf-core pipelines in a samplesheet.
A typical samplesheet for a pipeline is located at <code>assets/samplesheet.csv</code> in the pipeline directory.
You can use this as a template for specifying your own datasets.

<pre>
$ cat ${pipeline_dir}/assets/samplesheet.csv
sample,fastq_1,fastq_2,strandedness
control_REP1,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz,forward
control_REP2,/path/to/fastq/files/AEG588A2_S2_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A2_S2_L002_R2_001.fastq.gz,forward
control_REP3,/path/to/fastq/files/AEG588A3_S3_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A3_S3_L002_R2_001.fastq.gz,forward
treatment_REP1,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,,forward
treatment_REP2,/path/to/fastq/files/AEG588A5_S5_L003_R1_001.fastq.gz,,forward
treatment_REP3,/path/to/fastq/files/AEG588A6_S6_L003_R1_001.fastq.gz,,forward
treatment_REP3,/path/to/fastq/files/AEG588A6_S6_L004_R1_001.fastq.gz,,forward
</pre>
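
Before starting a long run, it can help to check the basic structure of your own samplesheet. The following sketch only mirrors the four columns of the rnaseq template shown above; it does not replace the validation done by the pipeline itself, and <code>check_samplesheet</code> is a hypothetical helper name.

```shell
# check_samplesheet FILE
# Basic structural checks; problems are printed and the exit status is 1.
# Assumption: the header and column count of the rnaseq template shown above.
check_samplesheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2,strandedness") {
                print "line 1: unexpected header: " $0
                bad = 1
            }
            next
        }
        /^$/                 { next }   # ignore empty lines
        NF != 4              { print "line " NR ": expected 4 columns, found " NF; bad = 1; next }
        $1 == "" || $2 == "" { print "line " NR ": sample and fastq_1 must not be empty"; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Demonstration with two rows taken from the template
sheet=$(mktemp)
cat > "$sheet" <<'EOF'
sample,fastq_1,fastq_2,strandedness
control_REP1,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz,forward
treatment_REP1,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,,forward
EOF
check_samplesheet "$sheet" && echo "samplesheet structure looks fine"
```

The second data row has an empty <code>fastq_2</code> field (single-end data), which still counts as four columns.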

You can run the pipeline with your own samplesheet using:

<pre>
nextflow run ${pipeline_dir} \
    -profile binac \
    -work-dir ${nxf_work_dir} \
    --input <path to your samplesheet.csv> \
    --outdir ${nxf_output_dir}
</pre>
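
A frequent cause of failed runs is a mistyped path in the samplesheet. The following sketch lists all FASTQ files from the <code>fastq_1</code> and <code>fastq_2</code> columns that do not exist. <code>missing_fastqs</code> is a hypothetical helper, and the demonstration uses scratch files with made-up sample names.

```shell
# missing_fastqs FILE
# Print one line for every FASTQ path in columns 2 and 3 that does not exist.
missing_fastqs() {
    # tail -n +2 skips the header line; "|| [ -n ... ]" also handles a last
    # line that lacks a trailing newline.
    tail -n +2 "$1" | while IFS=, read -r sample fastq_1 fastq_2 strandedness || [ -n "$sample" ]; do
        for f in "$fastq_1" "$fastq_2"; do
            # fastq_2 is empty for single-end samples, so skip empty fields
            if [ -n "$f" ] && [ ! -e "$f" ]; then
                echo "$sample: missing $f"
            fi
        done
    done
}

# Demonstration: one existing and one missing file in a scratch directory
demo_dir=$(mktemp -d)
touch "$demo_dir/a_R1.fastq.gz"
cat > "$demo_dir/samplesheet.csv" <<EOF
sample,fastq_1,fastq_2,strandedness
sample_a,$demo_dir/a_R1.fastq.gz,,forward
sample_b,$demo_dir/b_R1.fastq.gz,,forward
EOF
missing_fastqs "$demo_dir/samplesheet.csv"   # reports only sample_b
```

If the function prints nothing, all listed files were found.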

Please note that we cannot cover every possible parameter for each nf-core pipeline here. For detailed information, check the pipeline documentation before using a pipeline productively for the first time.

As usual, you can contact [https://wiki.bwhpc.de/e/BinAC/Support BinAC support] if you have any problems or questions.

<!-- TODO Add hints to potential useful tools like nf-core creating parameter file or creating parameter file on the web. -->
Latest revision as of 17:49, 22 August 2024