Basic Count workflow

This example runs the count workflow on 5’/5’ WT MRPA data in the HEPG2 cell line from Klein J., Agarwal, V., Keith, A., et al. 2019.

Prerequirements

This example depends on the following data and software:

Installation of MPRAflow

Please install conda, the MPRAflow environment and clone the actual MPRAflow master branch. You will find more help under Installation.

Producing an association pickle

This workflow requires a python dictionary of candidate regulatory sequence (CRS) mapped to their barcodes in a pickle format. For this example the file can be generated using Basic association workflow or it can be found in the example folder of the GitHub repository.

Design file

File can be generated using the Basic association workflow or downloaded from the example folder.

Reads

There is one condition (HEPG2) with three technical replicates. Each replicate contains a forward (barcode-forward), reverse (barcode-reverse), and index (unique molecular identifier) read for DNA and RNA. These data must be downloaded. All data is publically available on the short read archive (SRA). We will use SRA-toolkit to obtain the data.

Note

You need 9 GB disk space to download the data and upwards of 50 GB to proccess it!

conda install sra-tools
mkdir -p Count_Basic/data
cd Count_Basic/data
fastq-dump --gzip --split-files SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886
cd ..

For large files and unstable internet connection we reccommend the comand prefetch from SRA tools before running fastq-dump. This command is much smarter in warnings when something went wrong.

conda install sra-tools cd Count_Basic/data prefetch SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886 fastq-dump –gzip –split-files SRR10800986 cd ..

Note

Please be sure that all files are downloaded completely without errors! Depending on your internet connection this can take a while. If you just want some data to run MPRAflow you can just limit yourself to one condition and/or just one replicate.

With

tree data

the folder should look like this:

data

Here is an overview of the files:

HEPG2 data
Condition GEO Accession SRA Accession SRA Runs
HEPG2-DNA-1: HEPG2 DNA replicate 1 GSM4237863 SRX7474781 SRR10800881
HEPG2-RNA-1: HEPG2 RNA replicate 1 GSM4237864 SRX7474782 SRR10800882
HEPG2-DNA-2: HEPG2 DNA replicate 2 GSM4237865 SRX7474783 SRR10800883
HEPG2-RNA-2: HEPG2 RNA replicate 2 GSM4237866 SRX7474784 SRR10800884
HEPG2-DNA-3: HEPG2 DNA replicate 3 GSM4237867 SRX7474785 SRR10800885
HEPG2-RNA-3: HEPG2 RNA replicate 3 GSM4237868 SRX7474786 SRR10800886

MPRAflow

Now we are close to starting MPRAflow and count the number of barcodes. But before we need to generate an environment csv file to tell nextflow the conditions, replicates and the corresponding reads.

Create experiment.csv

Our experiment file looks exactly like this:

Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz

Save it into the Count_Basic/data folder under experiment.csv.

Run nextflow

Now we have everything at hand to run the count MPRAflow pipeline. Therefore we have to be in the cloned MPRAflow folder. But we will change the working and output directory to the Count_Basic folder. The MPRAflow count command is:

cd <path/to/MPRAflow>/MPRAflow
conda activate MPRAflow
nextflow run count.nf -w <path/to/Basic>/Count_Basic/work --experiment-file "<path/to/Basic>/Count_Basic/data/experiment.csv" --dir "<path/to/Basic>/Count_Basic/data" --outdir "<path/to/Basic>/Count_Basic/output" --design "<path/to/design/fasta>/design.fa" --association "<path/to/association/pickle>/SRR10800986_filtered_coords_to_barcodes.pickle"

Note

Please check your conf/cluster.config file if it is correctly configured (e.g. with your SGE cluster commands).

If everything works fine the following 5 processes will run: create_BAM (make idx) raw_counts, filter_counts, final_counts, dna_rna_merge_counts, calc_correlations, make_master_tables.

Results

All output files will be in the Count_Basic/output folder.

We expect the program to output the following status when complete:

start analysis
executor >  sge (32)
[23/09474b] process > create_BAM (make idx)    [100%] 6 of 6 ✔
[0f/4ee034] process > raw_counts (6)           [100%] 6 of 6 ✔
[01/6ac02f] process > filter_counts (6)        [100%] 6 of 6 ✔
[4f/b23748] process > final_counts (6)         [100%] 6 of 6 ✔
[86/4ded79] process > dna_rna_merge_counts (3) [100%] 3 of 3 ✔
[29/0813f8] process > dna_rna_merge (3)        [100%] 3 of 3 ✔
[1d/4e7d56] process > calc_correlations (1)    [100%] 1 of 1 ✔
[9c/4714cb] process > make_master_tables (1)   [100%] 1 of 1 ✔
Completed at: 07-Jan-2020 04:29:07
Duration    : 11h 28m 5s
CPU hours   : 41.5
Succeeded   : 32