Basic association workflow

This example runs the association workflow on 5’/5’ WT MRPA data in the HEPG2 cell line from Klein J., Agarwal, V., Keith, A., et al. 2019.

Prerequirements

This example depends on the following data and software:

Installation of MPRAflow

Please install conda, the MPRAflow environment and clone the actual MPRAflow master branch. You will find more help under Installation.

Meta Data

It is necessary to get the ordered oligo array so that each enhancer sequence can be labeled in the analysis and to trim any adaptors still in the sequence, in this case we trim off 15bp from the end of each sequence

mkdir -p Assoc_Basic/data
cd Assoc_Basic/data
wget ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4237nnn/GSM4237954/suppl/GSM4237954_9MPRA_elements.fa.gz

zcat GSM4237954_9MPRA_elements.fa.gz |awk '{ count+=1; if (count == 1) { print } else { print substr($1,1,171)}; if (count == 2) { count=0 } }' > design.fa

Reads

There is one set of association sequencing for this data, which contains a forward (CRS-forward), reverse (CRS-reverse), and index (barcode) read for DNA and RNA. These data must be downloaded. All data is publically available on the short read archive (SRA). We will use SRA-toolkit to obtain the data.

Note

You need 10 GB disk space to download the data!

conda install sra-tools
cd Assoc_Basic/data
fastq-dump --gzip --split-files SRR10800986
cd ..

For large files and unstable internet connection we reccommend the comand prefetch from SRA tools before running fastq-dump. This command is much smarter in warnings when something went wrong.

conda install sra-tools
cd Assoc_Basic/data
prefetch SRR10800986
fastq-dump --gzip --split-files SRR10800986
cd ..

Note

Please be sure that all files are downloaded completely without errors! Depending on your internet connection this can take a while. If you just want some data to run MPRAflow you can just limit yourself to one condition and/or just one replicate.

With

tree data

the folder should look like this:

data

Here is an overview of the files:

HEPG2 association data
Condition GEO Accession SRA Accession SRA Runs
HEPG2-association: HEPG2 library association GSM4237954 SRX7474872 SRR10800986

MPRAflow

Now we are ready to run MPRAflow and create CRS-barcode mappings.

Run nextflow

Now we have everything at hand to run the count MPRAflow pipeline. Therefore we have to be in the cloned MPRAflow folder. But we will change the working and output directory to the Assoc_Basic folder. The MPRAflow count command is:

cd <path/to/MPRAflow>/MPRAflow
conda activate MPRAflow
nextflow run association.nf -w <path/to/Basic>/Assoc_Basic/work --fastq-insert "<path/to/Basic>/Assoc_Basic/data/SRR10800986_1.fastq.gz" --fastq-insertPE "<path/to/Basic>/Assoc_Basic/data/SRR10800986_3.fastq.gz" --fastq-bc "<path/to/Basic>/Assoc_Basic/data/SRR10800986_2.fastq.gz" --design "<path/to/Basic>/Assoc_Basic/data/design.fa" --name assoc_basic --outdir <path/to/Basic>/Assoc_Basic/output

Note

Please check your conf/cluster.config file if it is correctly configured (e.g. with your SGE cluster commands).

If everything works fine the following 7 processes will run: count_bc_nolab create_BWA_ref, PE_merge, align_BWA_PE, collect_chunks, map_element_barcodes, filter_barcodes.

Results

All needed output files will be in the Assoc_Basic/output folder.