Quickstart¶

Create an experiment.csv in the format below, including the header. DNA_R1 or RNA_R1 is name of the gzipped fastq of the forward read of the DNA or RNA from the defined condition and replicate. DNA_R2 or RNA_R2 is the corresponding index read with UMIs (excluding sample barcodes) and DNA_R3 or RNA_R3 of the reverse read. If you do not have UMIs remove the columns DNA_R2 and RNA_R2 or leave them empty.

Condition,Replicate,DNA_R1,DNA_R2,DNA_R3,RNA_R1,RNA_R2,RNA_R3
condition1,1,cond1_rep1_DNA_FWD_reads.fastq.gz,cond1_rep1_DNA_IDX_reads.fastq.gz,cond1_rep1_DNA_REV_reads.fastq.gz,cond1_rep1_RNA_FWD_reads.fastq.gz,cond1_rep1_RNA_IDX_reads.fastq.gz,cond1_rep1_RNA_REV_reads.fastq.gz
condition1,2,cond1_rep2_DNA_FWD_reads.fastq.gz,cond1_rep2_DNA_IDX_reads.fastq.gz,cond1_rep2_DNA_REV_reads.fastq.gz,cond1_rep2_RNA_FWD_reads.fastq.gz,cond1_rep2_RNA_IDX_reads.fastq.gz,cond1_rep2_RNA_REV_reads.fastq.gz
condition2,1,cond2_rep1_DNA_FWD_reads.fastq.gz,cond2_rep1_DNA_IDX_reads.fastq.gz,cond2_rep1_DNA_REV_reads.fastq.gz,cond2_rep1_RNA_FWD_reads.fastq.gz,cond2_rep1_RNA_IDX_reads.fastq.gz,cond2_rep1_RNA_REV_reads.fastq.gz
condition2,2,cond2_rep2_DNA_FWD_reads.fastq.gz,cond2_rep2_DNA_IDX_reads.fastq.gz,cond2_rep2_DNA_REV_reads.fastq.gz,cond2_rep2_RNA_FWD_reads.fastq.gz,cond2_rep2_RNA_IDX_reads.fastq.gz,cond2_rep2_RNA_REV_reads.fastq.gz

If you would like each insert to be colored based on different user-specified categories, such as “positive control”, “negative control”, “shuffled control”, and “putative enhancer”, to assess the overall quality the user can create a ‘label’ tsv in the format below that maps the name to category:

insert1_name insert1_label
insert2_name insert2_label
The insert names must exactly match the names in the design FASTA file.

Run Association if using a design with randomly paired candidate sequences and barcodes

conda activate MPRAflow
nextflow run association.nf --fastq-insert "${fastq_prefix}_R1_001.fastq.gz" --design "ordered_candidate_sequences.fa" --fastq-bc "${fastq_prefix}_R2_001.fastq.gz"
Note

This will run in local mode, please submit this command to your cluster’s queue if you would like to run a parallelized version.

Run Count

conda activate MPRAflow
nextflow run count.nf --dir "bulk_FASTQ_directory" --e "experiment.csv" --design "ordered_candidate_sequences.fa" --association "dictionary_of_candidate_sequences_to_barcodes.p"
Be sure that the experiment.csv is correct. All fastq files must be in the same folder given by the --dir option. If you do not have UMIs please use the option --no-umi. Please specify your barcode length and umi-length with --bc-length and --umi-length.

Run saturation mutagenesis

conda activate MPRAflow
nextflow run saturationMutagenesis.nf --dir "directory_of_DNA/RNA_counts" --e "satMutexperiment.csv" --assignment "yourSpecificAssignmentFile.variants.txt.gz"
Note

The experiment file is different from the count workflow. It should contain the condition, replicate and filename of the counts, like:
Condition,Replicate,COUNTS
condition1,1,cond1_1_counts.tsv.gz
condition1,2,cond1_2_counts.tsv.gz
condition1,3,cond1_3_counts.tsv.gz
condition2,1,cond2_1_counts.tsv.gz
condition2,2,cond2_2_counts.tsv.gz
condition2,3,cond2_3_counts.tsv.gz
The count files can be generated by the count workflow, are named: <condition>_<replicate>_counts.tsv.gz and can be found in the outs/<condition>/<replicate> folder. They have to be copied or linked into the --dir folder.