Count for Saturation Mutagenesis of the TERT promoter

This example runs the count workflow on saturation mutagenesis data of the TERT promoter from Kircher et al. 2019. It will only do the count part. For unraveling single varaint effects please go to the example: Saturation mutagenesis of the TERT promoter. The same saturation mutagenesis library was used in four different experiments. We will focus only on the experiments in HEK293T and in glioblastoma SF7996 (GBM) cells (two different conditions).

Prerequirements

This example depends on the following data and software:

Installation of MPRAflow

Please install conda, the MPRAflow environment and clone the actual MPRAflow master branch. You will find more help under Installation.

Reads

We have two conditions (HEK293T and GBM cells). Each of them has three technical replicates and one replicate exists of forward, reverse and index reads for DNA and RNA. These data has to be downloaded. All data is public available on the short read archive (SRA). We will use the SRA-tools to download the reads.

Note

You need 16 GB disk space to download the data!

conda install sra-tools
mkdir -p Count_TERT/data
cd Count_TERT/data
fastq-dump --gzip --split-files SRR8647059 SRR8647060 SRR8647061 SRR8647062 SRR8647063 SRR8647064 SRR8647119 SRR8647120 SRR8647121 SRR8647122 SRR8647123 SRR8647124 SRR8647125 SRR8647126 SRR8647127 SRR8647128 SRR8647129 SRR8647130
cd ..

Note

Please be sure that all files are downloaded completely without errors! Depending on your internet connection this can take a while. If you just want some data to run MPRAflow you can just limit yourself to one condition and/or just one replicate.

With

tree data

the folder should look like this:

data
├── SRR8647059_1.fastq.gz
├── SRR8647059_2.fastq.gz
├── SRR8647059_3.fastq.gz
├── SRR8647060_1.fastq.gz
├── SRR8647060_2.fastq.gz
├── SRR8647060_3.fastq.gz
├── SRR8647061_1.fastq.gz
├── SRR8647061_2.fastq.gz
├── SRR8647061_3.fastq.gz
├── SRR8647062_1.fastq.gz
├── SRR8647062_2.fastq.gz
├── SRR8647062_3.fastq.gz
├── SRR8647063_1.fastq.gz
├── SRR8647063_2.fastq.gz
├── SRR8647063_3.fastq.gz
├── SRR8647064_1.fastq.gz
├── SRR8647064_2.fastq.gz
├── SRR8647064_3.fastq.gz
├── SRR8647119_1.fastq.gz
├── SRR8647119_2.fastq.gz
├── SRR8647119_3.fastq.gz
├── SRR8647120_1.fastq.gz
├── SRR8647120_2.fastq.gz
├── SRR8647120_3.fastq.gz
├── SRR8647120_4.fastq.gz
├── SRR8647121_1.fastq.gz
├── SRR8647121_2.fastq.gz
├── SRR8647121_3.fastq.gz
├── SRR8647122_1.fastq.gz
├── SRR8647122_2.fastq.gz
├── SRR8647122_3.fastq.gz
├── SRR8647122_4.fastq.gz
├── SRR8647123_1.fastq.gz
├── SRR8647123_2.fastq.gz
├── SRR8647123_3.fastq.gz
├── SRR8647124_1.fastq.gz
├── SRR8647124_2.fastq.gz
├── SRR8647124_3.fastq.gz
├── SRR8647124_4.fastq.gz
├── SRR8647125_1.fastq.gz
├── SRR8647125_2.fastq.gz
├── SRR8647125_3.fastq.gz
├── SRR8647126_1.fastq.gz
├── SRR8647126_2.fastq.gz
├── SRR8647126_3.fastq.gz
├── SRR8647126_4.fastq.gz
├── SRR8647127_1.fastq.gz
├── SRR8647127_2.fastq.gz
├── SRR8647127_3.fastq.gz
├── SRR8647128_1.fastq.gz
├── SRR8647128_2.fastq.gz
├── SRR8647128_3.fastq.gz
├── SRR8647128_4.fastq.gz
├── SRR8647129_1.fastq.gz
├── SRR8647129_2.fastq.gz
├── SRR8647129_3.fastq.gz
├── SRR8647130_1.fastq.gz
├── SRR8647130_2.fastq.gz
├── SRR8647130_3.fastq.gz
└── SRR8647130_4.fastq.gz

0 directories, 60 files

Here is an overview of the files:

TERT data
Condition GEO Accession SRA Accession SRA Runs
TERT-GBM-DNA-1: TERT-GBM transfection DNA replicate 1 GSM3604284 SRX5444854 SRR8647059
TERT-GBM-DNA-2: TERT-GBM transfection DNA replicate 2 GSM3604285 SRX5444855 SRR8647060
TERT-GBM-DNA-3: TERT-GBM transfection DNA replicate 3 GSM3604286 SRX5444856 SRR8647061
TERT-GBM-RNA-1: TERT-GBM transfection RNA replicate 1 GSM3604287 SRX5444857 SRR8647062
TERT-GBM-RNA-2: TERT-GBM transfection RNA replicate 2 GSM3604288 SRX5444858 SRR8647063
TERT-GBM-RNA-3: TERT-GBM transfection RNA replicate 3 GSM3604289 SRX5444859 SRR8647064
TERT-HEK-DNA-1: TERT-HEK transfection DNA replicate 1 GSM3604302 SRX5444888 SRR8647119, SRR8647120
TERT-HEK-DNA-2: TERT-HEK transfection DNA replicate 2 GSM3604303 SRX5444889 SRR8647121, SRR8647122
TERT-HEK-DNA-3: TERT-HEK transfection DNA replicate 3 GSM3604304 SRX5444890 SRR8647123, SRR8647124
TERT-HEK-RNA-1: TERT-HEK transfection RNA replicate 1 GSM3604305 SRX5444891 SRR8647125, SRR8647126
TERT-HEK-RNA-2: TERT-HEK transfection RNA replicate 2 GSM3604306 SRX5444892 SRR8647127, SRR8647128
TERT-HEK-RNA-3: TERT-HEK transfection RNA replicate 3 GSM3604307 SRX5444893 SRR8647129, SRR8647130

Note

The runs SRR8647120, SRR8647122, SRR8647124, SRR8647126, SRR8647128, and SRR8647130 have two index reads (forward and reverse). In the publication by Kircher et al. 2019 merging and trimming is used to combine them. For simplification we will discard the reverse index reads: rm data/*_4.fastq.gz


Also two different sequencing runs where made in condition TERT-HEK using the same library. Therefore, we have to process both runs together. So we will combine the reads. But in this csase bot sequencing runs have different read lengths (20 and 50 bp). Different read length will make false assumption when merging paired-end reads. Therefore we cut them down to 20 bp.

for i in 1 2 3; do
   zcat data/{SRR8647119,SRR8647120}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647119_SRR8647120_$i.fastq.gz;
   zcat data/{SRR8647121,SRR8647122}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647121_SRR8647122_$i.fastq.gz;
   zcat data/{SRR8647123,SRR8647124}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647123_SRR8647124_$i.fastq.gz;
   zcat data/{SRR8647125,SRR8647126}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647125_SRR8647126_$i.fastq.gz;
   zcat data/{SRR8647127,SRR8647128}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647127_SRR8647128_$i.fastq.gz;
   zcat data/{SRR8647129,SRR8647130}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647129_SRR8647130_$i.fastq.gz;
done

Note

If you combine multiple sequence runs (e.g. you need more reads) you have to combine the reads before. Otherwise barcodes with the same UMI can be count twice. But it is important that all read lengths are the same. The easiest workaround is to cut them down to the minimum length. If you have a different library (but with the same barcodes) you should run the count utility with both runs separately using different conditions. Later you have to combine the final counts of both conditions.

MPRAflow

Now we are close to start MPRAflow and count the number of barcodes. But before we need to generate an environment.csv file to tell nextflow the conditions, replicates and the corresponding reads.

Create experiment.csv

Our experiment file looks exactly like this:

Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
TERT-GBM,1,SRR8647059_1.fastq.gz,SRR8647059_3.fastq.gz,SRR8647059_2.fastq.gz,SRR8647062_1.fastq.gz,SRR8647062_3.fastq.gz,SRR8647062_2.fastq.gz
TERT-GBM,2,SRR8647060_1.fastq.gz,SRR8647060_3.fastq.gz,SRR8647060_2.fastq.gz,SRR8647063_1.fastq.gz,SRR8647063_3.fastq.gz,SRR8647063_2.fastq.gz
TERT-GBM,3,SRR8647061_1.fastq.gz,SRR8647061_3.fastq.gz,SRR8647061_2.fastq.gz,SRR8647064_1.fastq.gz,SRR8647064_3.fastq.gz,SRR8647064_2.fastq.gz
TERT-HEK,1,SRR8647119_SRR8647120_1.fastq.gz,SRR8647119_SRR8647120_3.fastq.gz,SRR8647119_SRR8647120_2.fastq.gz,SRR8647125_SRR8647126_1.fastq.gz,SRR8647125_SRR8647126_3.fastq.gz,SRR8647125_SRR8647126_2.fastq.gz
TERT-HEK,2,SRR8647121_SRR8647122_1.fastq.gz,SRR8647121_SRR8647122_3.fastq.gz,SRR8647121_SRR8647122_2.fastq.gz,SRR8647127_SRR8647128_1.fastq.gz,SRR8647127_SRR8647128_3.fastq.gz,SRR8647127_SRR8647128_2.fastq.gz
TERT-HEK,3,SRR8647123_SRR8647124_1.fastq.gz,SRR8647123_SRR8647124_3.fastq.gz,SRR8647123_SRR8647124_2.fastq.gz,SRR8647129_SRR8647130_1.fastq.gz,SRR8647129_SRR8647130_3.fastq.gz,SRR8647129_SRR8647130_2.fastq.gz

Save it into the Count_TERT/data folder under experiment.csv.

Run nextflow

Now we have everything at hand to run the count MPRAflow pipeline. Therefore we have to be in the cloned MPRAflow folder. But we will change the working and output directory to the Count_TERT folder. For the TERT example the barcode length is 20 bp and the UMI length 10 bp. The MPRAflow count command is:

cd <path/to/MPRAflow>/MPRAflow
conda activate MPRAflow
nextflow run -resume -w <path/to/TERT>/Count_TERT/work  count.nf --experiment-file "<path/to/TERT>/Count_TERT/data/experiment.csv" --dir "<path/to/TERT>/Count_TERT/data" --outdir "<path/to/TERT>/Count_TERT/output" --bc-length 20 --umi-length 10

Note

Please check your conf/cluster.config file if it is correctly configured (e.g. with your SGE cluster commands).

If everything works fine the following 5 processes will run: create_BAM (make idx) raw_counts, filter_counts, final_counts, dna_rna_merge_counts.

[fe/d8ac14] process > create_BAM (make idx)    [100%] 12 of 12 ✔
[7d/b56129] process > raw_counts (12)          [100%] 12 of 12 ✔
[06/2c938d] process > filter_counts (12)       [100%] 12 of 12 ✔
[2d/ce1afe] process > final_counts (12)        [100%] 12 of 12 ✔
[68/df8db0] process > dna_rna_merge_counts (6) [100%] 6 of 6 ✔
Completed at: 09-Jan-2020 15:38:32
Duration    : 3h 45m 17s
CPU hours   : 21.8
Succeeded   : 54

Results

All needed output files will be in the Count_TERT/output folder. In this tutorial we are only interested in the counts per barcode, because we can use these outputs in the Saturation mutagenesis of the TERT promoter tutorial.

cd <path/to/TERT>/Count_TERT
tree output -P "*[123]_counts.tsv.gz"
output
├── TERT-GBM
│   ├── 1
│   │   └── TERT-GBM_1_counts.tsv.gz
│   ├── 2
│   │   └── TERT-GBM_2_counts.tsv.gz
│   └── 3
│       └── TERT-GBM_3_counts.tsv.gz
└── TERT-HEK
    ├── 1
    │   └── TERT-HEK_1_counts.tsv.gz
    ├── 2
    │   └── TERT-HEK_2_counts.tsv.gz
    └── 3
        └── TERT-HEK_3_counts.tsv.gz

8 directories, 6 files

The count files are tab separated and contain the barcode, the number number of unique UMI DNA counts and the umber of unique RNA counts. E.g. this is an example count file:

zcat output/TERT-GBM/1/TERT-GBM_1_counts.tsv.gz | head
AAAAAAAAAAAAAAA 2       76
AAAAAAAAAAAAAAC 1       2
AAAAAAAAAAAAAAT 1       7
AAAAAAAAAAAAAGA 1       5
AAAAAAAAAAAAATA 2       7
AAAAAAAAAAAAATG 2       6
AAAAAAAAAAAAATT 1       2
AAAAAAAAAAAATGA 3       1
AAAAAAAAAAAATGG 1       3
AAAAAAAAAAAATTA 1       2