Basic Count workflow¶
This example runs the count workflow on 5’/5’ WT MPRA data from the HEPG2 cell line, published in Klein J., Agarwal, V., Keith, A., et al. 2019.
Prerequisites¶
This example depends on the following data and software:
Installation of MPRAflow¶
Please install conda and the MPRAflow environment, and clone the current MPRAflow master branch. You will find more help under Installation.
Producing an association pickle¶
This workflow requires a Python dictionary that maps each candidate regulatory sequence (CRS) to its barcodes, stored in pickle format. For this example, the file can be generated using the Basic association workflow, or it can be found in the example folder of the GitHub repository.
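Before starting a long run, it can help to check that the association pickle loads and has the expected shape. The following is a minimal sketch, not part of MPRAflow: the function name `summarize_association` is hypothetical, and it assumes the pickle holds a dictionary whose values are collections (for example sets or lists) of barcode strings, as the description above suggests.

```python
import pickle
from pathlib import Path


def summarize_association(pickle_path):
    """Return (number of CRS, total number of barcodes) for an association pickle.

    Assumption: the pickle contains a dict that maps each CRS name to a
    collection (set, list, ...) of barcode strings.
    """
    with Path(pickle_path).open("rb") as handle:
        association = pickle.load(handle)
    if not isinstance(association, dict):
        raise TypeError(f"expected a dict, got {type(association).__name__}")
    # Sum the barcode collection sizes over all CRS entries.
    n_barcodes = sum(len(barcodes) for barcodes in association.values())
    return len(association), n_barcodes
```

For example, `summarize_association("SRR10800986_filtered_coords_to_barcodes.pickle")` would return the two counts for the example file. Only unpickle files from a source you trust, because loading a pickle can execute arbitrary code.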
Design file¶
The design file can be generated using the Basic association workflow or downloaded from the example folder.
Reads¶
There is one condition (HEPG2) with three technical replicates. For both DNA and RNA, each replicate contains a forward (barcode-forward), a reverse (barcode-reverse), and an index (unique molecular identifier) read. These data must be downloaded. All data are publicly available on the Sequence Read Archive (SRA). We will use the SRA Toolkit to obtain the data.
Note
You need 9 GB of disk space to download the data and upwards of 50 GB to process it!
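The disk space requirement in the note can be checked before starting the download. This is a small sketch using only the Python standard library; the helper names and the 59 GB default (which simply adds the 9 GB and 50 GB figures from the note) are assumptions, so adjust the threshold to your setup.

```python
import shutil


def free_space_gb(path="."):
    """Free space in GB (10**9 bytes) on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path=".", required_gb=59):
    """True if at least `required_gb` GB are free at `path`.

    The default of 59 GB is an assumption: 9 GB for the download plus
    50 GB for processing, as stated in the note above.
    """
    return free_space_gb(path) >= required_gb
```

Calling `has_enough_space(".")` from the directory that will hold `Count_Basic` gives a quick yes/no answer.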
conda install sra-tools
mkdir -p Count_Basic/data
cd Count_Basic/data
fastq-dump --gzip --split-files SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886
cd ..
For large files or an unstable internet connection, we recommend running the prefetch command from the SRA Toolkit before fastq-dump. It reports problems more clearly when a download goes wrong.
conda install sra-tools
mkdir -p Count_Basic/data
cd Count_Basic/data
prefetch SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886
fastq-dump --gzip --split-files SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886
cd ..
Note
Please make sure that all files are downloaded completely and without errors! Depending on your internet connection this can take a while. If you only want some data to try out MPRAflow, you can limit yourself to one condition and/or one replicate.
Running tree data should then show a folder that looks like this:
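data
├── SRR10800881_1.fastq.gz
├── SRR10800881_2.fastq.gz
├── SRR10800881_3.fastq.gz
├── SRR10800882_1.fastq.gz
├── SRR10800882_2.fastq.gz
├── SRR10800882_3.fastq.gz
├── SRR10800883_1.fastq.gz
├── SRR10800883_2.fastq.gz
├── SRR10800883_3.fastq.gz
├── SRR10800884_1.fastq.gz
├── SRR10800884_2.fastq.gz
├── SRR10800884_3.fastq.gz
├── SRR10800885_1.fastq.gz
├── SRR10800885_2.fastq.gz
├── SRR10800885_3.fastq.gz
├── SRR10800886_1.fastq.gz
├── SRR10800886_2.fastq.gz
└── SRR10800886_3.fastq.gz

0 directories, 18 files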
Here is an overview of the files:
Condition | GEO Accession | SRA Accession | SRA Runs |
---|---|---|---|
HEPG2-DNA-1: HEPG2 DNA replicate 1 | GSM4237863 | SRX7474781 | SRR10800881 |
HEPG2-RNA-1: HEPG2 RNA replicate 1 | GSM4237864 | SRX7474782 | SRR10800882 |
HEPG2-DNA-2: HEPG2 DNA replicate 2 | GSM4237865 | SRX7474783 | SRR10800883 |
HEPG2-RNA-2: HEPG2 RNA replicate 2 | GSM4237866 | SRX7474784 | SRR10800884 |
HEPG2-DNA-3: HEPG2 DNA replicate 3 | GSM4237867 | SRX7474785 | SRR10800885 |
HEPG2-RNA-3: HEPG2 RNA replicate 3 | GSM4237868 | SRX7474786 | SRR10800886 |
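Given the six runs in the table, the download can be sanity-checked by looking for the three FASTQ files that each run should produce. The sketch below is an illustration only: the function name `missing_fastq_files` is hypothetical, and the `<run>_<read>.fastq.gz` naming is taken from the experiment file used in this example.

```python
from pathlib import Path

# SRA runs from the table above (DNA and RNA, replicates 1 to 3).
RUNS = [
    "SRR10800881", "SRR10800882", "SRR10800883",
    "SRR10800884", "SRR10800885", "SRR10800886",
]


def missing_fastq_files(data_dir, runs=RUNS, reads=(1, 2, 3)):
    """List expected FASTQ file names that are absent or empty in `data_dir`."""
    data_dir = Path(data_dir)
    missing = []
    for run in runs:
        for read in reads:
            path = data_dir / f"{run}_{read}.fastq.gz"
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(path.name)
    return missing
```

An empty list from `missing_fastq_files("Count_Basic/data")` means all 18 files are present and non-empty. It does not prove that a file is complete; `gzip -t <file>` can additionally test the integrity of each archive.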
MPRAflow¶
We are now close to starting MPRAflow and counting the barcodes. First, however, we need to create an experiment CSV file that tells nextflow the conditions, the replicates, and the corresponding reads.
Create experiment.csv¶
Our experiment file looks exactly like this:
Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz
Save it in the Count_Basic/data folder as experiment.csv.
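The experiment file follows a regular pattern, so it can also be generated instead of typed by hand. The sketch below is one possible way to do this with the standard `csv` module; the function names are hypothetical, and the mapping of read `_1` to `BC_F`, `_2` to `UMI`, and `_3` to `BC_R` is inferred from the file shown above.

```python
import csv
import io

HEADER = [
    "Condition", "Replicate",
    "DNA_BC_F", "DNA_UMI", "DNA_BC_R",
    "RNA_BC_F", "RNA_UMI", "RNA_BC_R",
]


def experiment_rows(condition, replicate_runs):
    """Build one row per replicate.

    `replicate_runs` is a list of (dna_run, rna_run) pairs, in replicate order.
    """
    rows = []
    for replicate, (dna_run, rna_run) in enumerate(replicate_runs, start=1):
        # Reads 1, 2, 3 of each run fill the BC_F, UMI, BC_R columns.
        files = [
            f"{run}_{read}.fastq.gz"
            for run in (dna_run, rna_run)
            for read in (1, 2, 3)
        ]
        rows.append([condition, str(replicate), *files])
    return rows


def experiment_csv(condition, replicate_runs):
    """Return the experiment file content as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(experiment_rows(condition, replicate_runs))
    return buffer.getvalue()
```

For this example, `experiment_csv("HEPG2", [("SRR10800881", "SRR10800882"), ("SRR10800883", "SRR10800884"), ("SRR10800885", "SRR10800886")])` yields the content to write to `Count_Basic/data/experiment.csv`.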
Run nextflow¶
Now we have everything at hand to run the MPRAflow count pipeline. The command has to be run from the cloned MPRAflow folder, but we will set the working and output directories to the Count_Basic folder. The MPRAflow count command is:
cd <path/to/MPRAflow>/MPRAflow
conda activate MPRAflow
nextflow run count.nf -w <path/to/Basic>/Count_Basic/work --experiment-file "<path/to/Basic>/Count_Basic/data/experiment.csv" --dir "<path/to/Basic>/Count_Basic/data" --outdir "<path/to/Basic>/Count_Basic/output" --design "<path/to/design/fasta>/design.fa" --association "<path/to/association/pickle>/SRR10800986_filtered_coords_to_barcodes.pickle"
Note
Please check that your conf/cluster.config file is configured correctly (e.g. with your SGE cluster commands).
If everything works fine, the following processes will run: create_BAM (make idx), raw_counts, filter_counts, final_counts, dna_rna_merge_counts, dna_rna_merge, calc_correlations, and make_master_tables.
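Because the count command repeats the same base folder several times, it can be convenient to assemble it from a few paths. The following sketch builds the argument list for the command shown above; the function name `count_command` is hypothetical, and only the options used in this example are included.

```python
import shlex
from pathlib import Path


def count_command(base_dir, design, association):
    """Return the nextflow count command as an argument list.

    `base_dir` is the Count_Basic folder; work, data and output
    are placed below it, as in the command above.
    """
    base = Path(base_dir)
    return [
        "nextflow", "run", "count.nf",
        "-w", str(base / "work"),
        "--experiment-file", str(base / "data" / "experiment.csv"),
        "--dir", str(base / "data"),
        "--outdir", str(base / "output"),
        "--design", str(design),
        "--association", str(association),
    ]


def as_shell_line(command):
    """Quote the argument list so it can be pasted into a shell."""
    return shlex.join(command)
```

The list can be passed to `subprocess.run(command, check=True, cwd=...)` with `cwd` set to the cloned MPRAflow folder and the MPRAflow conda environment active, or printed with `as_shell_line` for manual use.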
Results¶
All output files will be in the Count_Basic/output folder.
We expect the program to output the following status when complete:
start analysis
executor > sge (32)
[23/09474b] process > create_BAM (make idx) [100%] 6 of 6 ✔
[0f/4ee034] process > raw_counts (6) [100%] 6 of 6 ✔
[01/6ac02f] process > filter_counts (6) [100%] 6 of 6 ✔
[4f/b23748] process > final_counts (6) [100%] 6 of 6 ✔
[86/4ded79] process > dna_rna_merge_counts (3) [100%] 3 of 3 ✔
[29/0813f8] process > dna_rna_merge (3) [100%] 3 of 3 ✔
[1d/4e7d56] process > calc_correlations (1) [100%] 1 of 1 ✔
[9c/4714cb] process > make_master_tables (1) [100%] 1 of 1 ✔
Completed at: 07-Jan-2020 04:29:07
Duration : 11h 28m 5s
CPU hours : 41.5
Succeeded : 32
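The status output can be checked mechanically: every process should report as many finished tasks as it has in total, and the per-process counts should add up to the `Succeeded` value (here 6 + 6 + 6 + 6 + 3 + 3 + 1 + 1 = 32). The sketch below parses a saved copy of the status text; the regular expression is an assumption based on the lines shown above and may need adjusting for other nextflow versions.

```python
import re

# Matches lines such as:
# [0f/4ee034] process > raw_counts (6) [100%] 6 of 6 ✔
PROCESS_LINE = re.compile(r"process > (\w+).*\[\s*(\d+)%\]\s+(\d+) of (\d+)")


def completed_tasks(status_text):
    """Map each process name to (tasks finished, tasks total)."""
    tasks = {}
    for match in PROCESS_LINE.finditer(status_text):
        name, _percent, done, total = match.groups()
        tasks[name] = (int(done), int(total))
    return tasks


def all_finished(status_text):
    """True if at least one process was found and none has unfinished tasks."""
    tasks = completed_tasks(status_text)
    return bool(tasks) and all(done == total for done, total in tasks.values())
```

Summing the first element of each value in `completed_tasks(...)` gives the number to compare against the `Succeeded` line.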