Count for Saturation Mutagenesis of the TERT promoter¶

This example runs the count workflow on saturation mutagenesis data of the TERT promoter from Kircher et al. 2019. It will only do the count part. For unraveling single varaint effects please go to the example: Saturation mutagenesis of the TERT promoter. The same saturation mutagenesis library was used in four different experiments. We will focus only on the experiments in HEK293T and in glioblastoma SF7996 (GBM) cells (two different conditions).

Prerequirements¶

This example depends on the following data and software:

Installation of MPRAflow¶

Please install conda, the MPRAflow environment and clone the actual MPRAflow master branch. You will find more help under Installation.

Reads¶

We have two conditions (HEK293T and GBM cells). Each of them has three technical replicates and one replicate exists of forward, reverse and index reads for DNA and RNA. These data has to be downloaded. All data is public available on the short read archive (SRA). We will use the SRA-tools to download the reads.

Note

You need 16 GB disk space to download the data!

conda install sra-tools
mkdir -p Count_TERT/data
cd Count_TERT/data
fastq-dump --gzip --split-files SRR8647059 SRR8647060 SRR8647061 SRR8647062 SRR8647063 SRR8647064 SRR8647119 SRR8647120 SRR8647121 SRR8647122 SRR8647123 SRR8647124 SRR8647125 SRR8647126 SRR8647127 SRR8647128 SRR8647129 SRR8647130
cd ..

Note

Please be sure that all files are downloaded completely without errors! Depending on your internet connection this can take a while. If you just want some data to run MPRAflow you can just limit yourself to one condition and/or just one replicate.

With

tree data

the folder should look like this:

data
├── SRR8647059_1.fastq.gz
├── SRR8647059_2.fastq.gz
├── SRR8647059_3.fastq.gz
├── SRR8647060_1.fastq.gz
├── SRR8647060_2.fastq.gz
├── SRR8647060_3.fastq.gz
├── SRR8647061_1.fastq.gz
├── SRR8647061_2.fastq.gz
├── SRR8647061_3.fastq.gz
├── SRR8647062_1.fastq.gz
├── SRR8647062_2.fastq.gz
├── SRR8647062_3.fastq.gz
├── SRR8647063_1.fastq.gz
├── SRR8647063_2.fastq.gz
├── SRR8647063_3.fastq.gz
├── SRR8647064_1.fastq.gz
├── SRR8647064_2.fastq.gz
├── SRR8647064_3.fastq.gz
├── SRR8647119_1.fastq.gz
├── SRR8647119_2.fastq.gz
├── SRR8647119_3.fastq.gz
├── SRR8647120_1.fastq.gz
├── SRR8647120_2.fastq.gz
├── SRR8647120_3.fastq.gz
├── SRR8647120_4.fastq.gz
├── SRR8647121_1.fastq.gz
├── SRR8647121_2.fastq.gz
├── SRR8647121_3.fastq.gz
├── SRR8647122_1.fastq.gz
├── SRR8647122_2.fastq.gz
├── SRR8647122_3.fastq.gz
├── SRR8647122_4.fastq.gz
├── SRR8647123_1.fastq.gz
├── SRR8647123_2.fastq.gz
├── SRR8647123_3.fastq.gz
├── SRR8647124_1.fastq.gz
├── SRR8647124_2.fastq.gz
├── SRR8647124_3.fastq.gz
├── SRR8647124_4.fastq.gz
├── SRR8647125_1.fastq.gz
├── SRR8647125_2.fastq.gz
├── SRR8647125_3.fastq.gz
├── SRR8647126_1.fastq.gz
├── SRR8647126_2.fastq.gz
├── SRR8647126_3.fastq.gz
├── SRR8647126_4.fastq.gz
├── SRR8647127_1.fastq.gz
├── SRR8647127_2.fastq.gz
├── SRR8647127_3.fastq.gz
├── SRR8647128_1.fastq.gz
├── SRR8647128_2.fastq.gz
├── SRR8647128_3.fastq.gz
├── SRR8647128_4.fastq.gz
├── SRR8647129_1.fastq.gz
├── SRR8647129_2.fastq.gz
├── SRR8647129_3.fastq.gz
├── SRR8647130_1.fastq.gz
├── SRR8647130_2.fastq.gz
├── SRR8647130_3.fastq.gz
└── SRR8647130_4.fastq.gz

0 directories, 60 files

Here is an overview of the files:

TERT data¶
Condition	GEO Accession	SRA Accession	SRA Runs
TERT-GBM-DNA-1: TERT-GBM transfection DNA replicate 1	GSM3604284	SRX5444854	SRR8647059
TERT-GBM-DNA-2: TERT-GBM transfection DNA replicate 2	GSM3604285	SRX5444855	SRR8647060
TERT-GBM-DNA-3: TERT-GBM transfection DNA replicate 3	GSM3604286	SRX5444856	SRR8647061
TERT-GBM-RNA-1: TERT-GBM transfection RNA replicate 1	GSM3604287	SRX5444857	SRR8647062
TERT-GBM-RNA-2: TERT-GBM transfection RNA replicate 2	GSM3604288	SRX5444858	SRR8647063
TERT-GBM-RNA-3: TERT-GBM transfection RNA replicate 3	GSM3604289	SRX5444859	SRR8647064
TERT-HEK-DNA-1: TERT-HEK transfection DNA replicate 1	GSM3604302	SRX5444888	SRR8647119, SRR8647120
TERT-HEK-DNA-2: TERT-HEK transfection DNA replicate 2	GSM3604303	SRX5444889	SRR8647121, SRR8647122
TERT-HEK-DNA-3: TERT-HEK transfection DNA replicate 3	GSM3604304	SRX5444890	SRR8647123, SRR8647124
TERT-HEK-RNA-1: TERT-HEK transfection RNA replicate 1	GSM3604305	SRX5444891	SRR8647125, SRR8647126
TERT-HEK-RNA-2: TERT-HEK transfection RNA replicate 2	GSM3604306	SRX5444892	SRR8647127, SRR8647128
TERT-HEK-RNA-3: TERT-HEK transfection RNA replicate 3	GSM3604307	SRX5444893	SRR8647129, SRR8647130

Note

The runs SRR8647120, SRR8647122, SRR8647124, SRR8647126, SRR8647128, and SRR8647130 have two index reads (forward and reverse). In the publication by Kircher et al. 2019 merging and trimming is used to combine them. For simplification we will discard the reverse index reads: rm data/*_4.fastq.gz

Also two different sequencing runs where made in condition TERT-HEK using the same library. Therefore, we have to process both runs together. So we will combine the reads. But in this csase bot sequencing runs have different read lengths (20 and 50 bp). Different read length will make false assumption when merging paired-end reads. Therefore we cut them down to 20 bp.

for i in 1 2 3; do
   zcat data/{SRR8647119,SRR8647120}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647119_SRR8647120_$i.fastq.gz;
   zcat data/{SRR8647121,SRR8647122}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647121_SRR8647122_$i.fastq.gz;
   zcat data/{SRR8647123,SRR8647124}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647123_SRR8647124_$i.fastq.gz;
   zcat data/{SRR8647125,SRR8647126}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647125_SRR8647126_$i.fastq.gz;
   zcat data/{SRR8647127,SRR8647128}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647127_SRR8647128_$i.fastq.gz;
   zcat data/{SRR8647129,SRR8647130}_$i.fastq.gz | cut -c 1-20 | gzip -c > data/SRR8647129_SRR8647130_$i.fastq.gz;
done

Note

If you combine multiple sequence runs (e.g. you need more reads) you have to combine the reads before. Otherwise barcodes with the same UMI can be count twice. But it is important that all read lengths are the same. The easiest workaround is to cut them down to the minimum length. If you have a different library (but with the same barcodes) you should run the count utility with both runs separately using different conditions. Later you have to combine the final counts of both conditions.

MPRAflow¶

Now we are close to start MPRAflow and count the number of barcodes. But before we need to generate an environment.csv file to tell nextflow the conditions, replicates and the corresponding reads.

Create experiment.csv¶

Our experiment file looks exactly like this:

Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
TERT-GBM,1,SRR8647059_1.fastq.gz,SRR8647059_3.fastq.gz,SRR8647059_2.fastq.gz,SRR8647062_1.fastq.gz,SRR8647062_3.fastq.gz,SRR8647062_2.fastq.gz
TERT-GBM,2,SRR8647060_1.fastq.gz,SRR8647060_3.fastq.gz,SRR8647060_2.fastq.gz,SRR8647063_1.fastq.gz,SRR8647063_3.fastq.gz,SRR8647063_2.fastq.gz
TERT-GBM,3,SRR8647061_1.fastq.gz,SRR8647061_3.fastq.gz,SRR8647061_2.fastq.gz,SRR8647064_1.fastq.gz,SRR8647064_3.fastq.gz,SRR8647064_2.fastq.gz
TERT-HEK,1,SRR8647119_SRR8647120_1.fastq.gz,SRR8647119_SRR8647120_3.fastq.gz,SRR8647119_SRR8647120_2.fastq.gz,SRR8647125_SRR8647126_1.fastq.gz,SRR8647125_SRR8647126_3.fastq.gz,SRR8647125_SRR8647126_2.fastq.gz
TERT-HEK,2,SRR8647121_SRR8647122_1.fastq.gz,SRR8647121_SRR8647122_3.fastq.gz,SRR8647121_SRR8647122_2.fastq.gz,SRR8647127_SRR8647128_1.fastq.gz,SRR8647127_SRR8647128_3.fastq.gz,SRR8647127_SRR8647128_2.fastq.gz
TERT-HEK,3,SRR8647123_SRR8647124_1.fastq.gz,SRR8647123_SRR8647124_3.fastq.gz,SRR8647123_SRR8647124_2.fastq.gz,SRR8647129_SRR8647130_1.fastq.gz,SRR8647129_SRR8647130_3.fastq.gz,SRR8647129_SRR8647130_2.fastq.gz

Save it into the Count_TERT/data folder under experiment.csv.

Run nextflow¶

Now we have everything at hand to run the count MPRAflow pipeline. Therefore we have to be in the cloned MPRAflow folder. But we will change the working and output directory to the Count_TERT folder. For the TERT example the barcode length is 20 bp and the UMI length 10 bp. The MPRAflow count command is:

cd <path/to/MPRAflow>/MPRAflow
conda activate MPRAflow
nextflow run -resume -w <path/to/TERT>/Count_TERT/work  count.nf --experiment-file "<path/to/TERT>/Count_TERT/data/experiment.csv" --dir "<path/to/TERT>/Count_TERT/data" --outdir "<path/to/TERT>/Count_TERT/output" --bc-length 20 --umi-length 10

Note

Please check your conf/cluster.config file if it is correctly configured (e.g. with your SGE cluster commands).

If everything works fine the following 5 processes will run: create_BAM (make idx) raw_counts, filter_counts, final_counts, dna_rna_merge_counts.

[fe/d8ac14] process > create_BAM (make idx)    [100%] 12 of 12 ✔
[7d/b56129] process > raw_counts (12)          [100%] 12 of 12 ✔
[06/2c938d] process > filter_counts (12)       [100%] 12 of 12 ✔
[2d/ce1afe] process > final_counts (12)        [100%] 12 of 12 ✔
[68/df8db0] process > dna_rna_merge_counts (6) [100%] 6 of 6 ✔
Completed at: 09-Jan-2020 15:38:32
Duration    : 3h 45m 17s
CPU hours   : 21.8
Succeeded   : 54

Results¶

All needed output files will be in the Count_TERT/output folder. In this tutorial we are only interested in the counts per barcode, because we can use these outputs in the Saturation mutagenesis of the TERT promoter tutorial.

cd <path/to/TERT>/Count_TERT
tree output -P "*[123]_counts.tsv.gz"

output
├── TERT-GBM
│   ├── 1
│   │   └── TERT-GBM_1_counts.tsv.gz
│   ├── 2
│   │   └── TERT-GBM_2_counts.tsv.gz
│   └── 3
│       └── TERT-GBM_3_counts.tsv.gz
└── TERT-HEK
    ├── 1
    │   └── TERT-HEK_1_counts.tsv.gz
    ├── 2
    │   └── TERT-HEK_2_counts.tsv.gz
    └── 3
        └── TERT-HEK_3_counts.tsv.gz

8 directories, 6 files

The count files are tab separated and contain the barcode, the number number of unique UMI DNA counts and the umber of unique RNA counts. E.g. this is an example count file:

zcat output/TERT-GBM/1/TERT-GBM_1_counts.tsv.gz | head

AAAAAAAAAAAAAAA 2       76
AAAAAAAAAAAAAAC 1       2
AAAAAAAAAAAAAAT 1       7
AAAAAAAAAAAAAGA 1       5
AAAAAAAAAAAAATA 2       7
AAAAAAAAAAAAATG 2       6
AAAAAAAAAAAAATT 1       2
AAAAAAAAAAAATGA 3       1
AAAAAAAAAAAATGG 1       3
AAAAAAAAAAAATTA 1       2