Usecase2

TSS data processing and differential analysis using refTSS

This tutorial provides step by step workflow to achive gold standard expression analysis using refTSS, and obtains TSS expression table, input TSSID list for refTSS-LD, quality assessment of TSS sequencing and statistical testing results. These scripts are based on the scripts written in Morioka MS et al..

For hands on tutorial, refTSS provides an example data-set published by Yasukawa et al. at use-case file folder:

https://reftss.riken.jp/usecase_files/ .

Software Requirements

NCBI tools (for fastq file download)
bigWigAverageOverBed, bedGraphToBigWig in Jim Kent Utility tools

 http://genome-test.cse.ucsc.edu/~kent/exe/

fastqc program
cutadapt
STAR
R (version 4.2.2 or later)

 https://www.r-project.org/

BEDtools (Confirmed to work with version 2.28 and version 2.29)

 http://code.google.com/p/bedtools/

1. FASTQ preparation

Download genome reference files

GRCh38 human genome assembly files

hg38 assembly

hg38 gene annotation

  = .genome file preparation =
  pip install --upgrade pip
  pip install pyfaidx
  gzip -d GRCh38.p14.genome.fa.gz
  faidx GRCh38.p14.genome.fa -i chromsizes >GRCh38.p14.genome


1.2 Download fastq files from NCBI SRA
Dfafadafds

*   fasterq-dump DRR159241 -o mutA_rep1.fq -e 12 -p
  fasterq-dump DRR159242 -o mutA_rep2.fq -e 12 -p
  fasterq-dump DRR159243 -o mutA_rep3.fq -e 12 -p
  fasterq-dump DRR159244 -o wt_rep1.fq -e 12 -p
  fasterq-dump DRR159245 -o wt_rep2.fq -e 12 -p
  fasterq-dump DRR159246 -o wt_rep3.fq -e 12 -p
  
Sequencing quality check by Phread scale
fastqc -o qc_mutA_rep1 mutA_rep1.fq
fastqc -o qc_mutA_rep2 mutA_rep2.fq
fastqc -o qc_mutA_rep3 mutA_rep3.fq
fastqc -o qc_wt_rep1 wt_rep1.fq
fastqc -o qc_wt_rep2 wt_rep2.fq
fastqc -o qc_wt_rep3 wt_rep3.fq



[Optional] Adaptor trimiming step (e.g., two color illumina chemistries)
One of the major sequecing techs are both NextSeq and the NovaSeq which uses two-colored chemisty. In this case, 3'-end of `G` was derived from end of sequencing with good sequence quality.  cutadapt with `--nextseq-trim=20` option will work as regular quality trimming (where one would use `-q 20` instead), except that the qualities of 3'-end `G` bases are ignored. 

-m: minimum output read length
-a: adaptor trimming
Trueseq read1 adaptor sequence:AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Nextra adaptor sequence: CTGTCTCTTATACACATCT
--nextseq-trim=20
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o mutA_rep1.tm.fq mutA_rep1.fq
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o mutA_rep2.tm.fq mutA_rep2.fq
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o mutA_rep3.tm.fq mutA_rep3.fq
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o wt_rep1.tm.fq wt_rep1.fq
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o wt_rep2.tm.fq wt_rep2.fq
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o wt_rep3.tm.fq wt_rep3.fq

2. Mapping TSS seq reads
STAR genome index creation
STAR --runMode genomeGenerate \
	--genomeDir GRCh38p14_star_index \
	--runThreadN 16 \
	--genomeFastaFiles GRCh38.p14.genome.fa \
	--sjdbGTFfile gencode.v46.annotation.gtf


STAR Mapping with example options
GTF=/home/masaki.morioka/genome/gencode.v46.annotation.gtf
SGENOME=/home/masaki.m/genome/GRCh38p14_star_index
_naming=mutA_rep1

STAR --runThreadN 12 --genomeDir $SGENOME \
		--readFilesIn ${_naming}.tm.fq \
		--outSAMtype BAM SortedByCoordinate \
		--outTmpKeep None \
	  --outFileNamePrefix ${_naming}





3. CTSS file generation and quality assessment of TSS sequencing
3.1 Download and excuting the Shell Script for TSS tag count table
Download the shellscript for counting TSS tags and the reference of canonical TSS coordinates (GENCODE)

Download scripts from Github
git clone https://github.com/suimye/cage_tutorial.git
chmod 755 ./cage_tutorial/*

Download canonical promoter coordinates (e.g., GENCODE)
wget https:://refTSS.riken.jp/tutorial/gencode.v40.promoter300-100.bed.gz 
gzip -d gencode.v40.promoter300-100.bed.gz 

Making a directory for tag count result
mkdir gencode_count
cd gencode_count/

Excuting the script for tag counting

Example: bash cage.counting.pipeline.b0.01.sh {sample.bam} {mapping quality (integer)} {referece.bed} {genome size file}
sh ../cage_tutorial/cage.counting.pipeline.b0.01.sh ../mutA_rep1Aligned.sortedByCoord.out.bam 20 gencode.v40.promoter300-100.bed ../GRCh38.p14.genome


Following six tag count files are generated.

wt_rep1Aligned.sortedByCoord.out.240925.ctss.txt
wt_rep2Aligned.sortedByCoord.out.240925.ctss.txt
wt_rep3Aligned.sortedByCoord.out.240925.ctss.txt
mutA_rep1Aligned.sortedByCoord.out.240925.ctss.txt
mutA_rep2Aligned.sortedByCoord.out.240925.ctss.txt
mutA_rep3Aligned.sortedByCoord.out.240925.ctss.txt



3.2 TSS Quality Evaluation and Create Expression Tables Using R
TSS sequencing technologies are expected to enrich TSS infomation from sequencing reads. Yield of tags should be overlapped with canonical TSS annotation such as 5 prime end of GENCODE mRNA annotation.

The script `promoter_mapping_rate.R` reads the files ending with `ctss.txt` generated in the previous process and creates an expression data table. Be careful not to have any unnecessary `ctss.txt` files in your working folder as it may cause errors. After the args option, enter the labels for the comparison groups (in this case, WT and Mutant) separated by spaces. Thease input labels will use for regular expression of file names.  If you specify "none" for the `--args` option, it is used when you want to simply create an expression data table without considering replicate experiments.

Generating normalized data for two groups using RLE normalization in edgeR
R --slave --vanilla --args wt mutA < ../cage_tutorial/promoter_mapping_rate.b0.2.R
simply generating the TSS tag count matrix
R --slave --vanilla --args none  < ../cage_tutorial/promoter_mapping_rate.b0.2.R 

Here is the example output `TSS.qcbarplot.pdf` from result of test data set.

File:Https://github.com/user-attachments/assets/b2e24f78-4d05-4135-bf30-957ab7389178  width=1020px



4. Differential expression analysis of TSS sequencing using refTSS
TSS tags will be useful for gene expression analysis. Caculating TSS tag counts on refTSS as well as GENCODE promoter, will provide TSS-wide comprehensive view of expression changes in the genome.

Let's start to make directory for this analysis

make new directory
cd ../
mkdir reftss_count
cd reftss_count
download refTSS human corrdinate file
wget https://reftss.riken.jp/datafiles/4.1/human/refTSS_v4.1_human_coordinate.hg38.bed.txt.gz
gzip -d refTSS_v4.1_human_coordinate.hg38.bed.txt.gz

Calculation of TSS tag counts on refTSS

Example: bash cage.counting.pipeline.b0.02.sh {sample.bam} {mapping quality (integer)} {referece.bed} {genome size file}
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../mutA_rep1Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../mutA_rep2Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../mutA_rep3Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../wt_rep1Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../wt_rep2Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../wt_rep3Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome

Making TSS expression matrix in R environment

R --slave --vanilla --args wt mutA < ../cage_tutorial/promoter_mapping_rate.b0.2.R

After excustion of previous command, R data file was created as `cage_count{date}.RData`

4.1 Statistical testing and MA-plot
Processing data and statistical testing are conductedon R environment. 

library(edgeR)
load("cage_count{date}.RData")

R object list
> ls()
 [1] "args"       "cname"      "d"          "df"         "dgeL"      
 [6] "fl"         "flag"       "gp"         "gp1"        "gp2"       
[11] "i"          "log2cpm"    "read_d"     "rname"      "size"      
[16] "time_stamp"

statistical testing by exact test
et_res <* exactTest(dgeL)

Extraction of top 100 TSSs by statistics (p-value, FDR)
detss <* topTags(et_res,n=100,adjust.method="BH",sort.by="Pvalue",p.value=0.0001)

dividing up* and down-regulated TSS list
de_up <* subset(detss$table,logFC>0)
de_down <* subset(detss$table,logFC<0)

4.2 TSS ID list preparation for GWAS-LD enrichment analysis
To prepare input for GWAS-LD enrichment analysis, TSS ID lists were restored in working directory. GWAS-LD enrichment analysis introduced in here.

This process removes dobule quatation and line numbers from rowname vectors.
upregulated list
sink("up_det.list.txt")
  cat(rownames(de_up), sep = "\n") 
sink() 

downregulated list
sink("down_det.list.txt")
  cat(rownames(de_down), sep = "\n") 
sink() 



4.3 Visualization of differential expression by MA-plot
One of the standard approaches on visualization of expression value is MA-plot showing relationship between average expression and fold difference in each genes (or TSSs). Here is an example of MA-plot using test data.

pdf("maplot.pdf")
  plot(et_res$table$logCPM,et_res$table$logFC,cex=0.3,pch=16,col="darkgrey",xlab="Average log-expression",ylab="Expression log-ratio (MutA vs WT)")
  points(de_up$logCPM,de_up$logFC,cex=0.7,pch=16,col="deeppink")
  points(de_down$logCPM,de_down$logFC,cex=0.7,pch=16,col="steelblue")
  abline(h=0,col="tan4")
dev.off()

File:Https://github.com/user-attachments/assets/7c3aa90f-dd60-4520-a514-57a32501c828  width=880px



Reference
* Script: Morioka MS et al. Methods Mol Biol. 2020:2120:277-301. doi:10.1007/978-1-0716-0327-7_20.
* Example dataset: Yasukawa et al. Nat Commun. 2020:11:1557, doi:10.1038/s41457-020. 15289-7 
* edgeR DOI:10.18129/B9.bioc.edgeR 
* Kent Util tool: https://hgdownload.soe.ucsc.edu/admin/exe
* BEDtools: Quinlan AR and Hall IM Bioinformatics. 2010:26, 6, pp. 841–842.
* FASTQC: Wingett SW, Andrews S. et al. 2018 doi:10.12688/f1000research.15931.2. eCollection
* STAR: Dobin A et al. Bioinformatics  2013:29(1):15-21. doi:10.1093/bioinformatics/bts635. Epub 2012 Oct 25.
* Cutadapt: Marcel Martin doi:10.14806/ej.17.1.200

*Usecase2*

TSS data processing and differential analysis using refTSS

Software Requirements

1. FASTQ preparation

Download genome reference files

1.2 Download fastq files from NCBI SRA

Sequencing quality check by Phread scale

[Optional] Adaptor trimiming step (e.g., two color illumina chemistries)

-m: minimum output read length

-a: adaptor trimming

Trueseq read1 adaptor sequence:AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

Nextra adaptor sequence: CTGTCTCTTATACACATCT

--nextseq-trim=20

2. Mapping TSS seq reads

STAR genome index creation

STAR Mapping with example options

3. CTSS file generation and quality assessment of TSS sequencing

3.1 Download and excuting the Shell Script for TSS tag count table

Download scripts from Github

Download canonical promoter coordinates (e.g., GENCODE)

Making a directory for tag count result

Example: bash cage.counting.pipeline.b0.01.sh {sample.bam} {mapping quality (integer)} {referece.bed} {genome size file}

3.2 TSS Quality Evaluation and Create Expression Tables Using R

Generating normalized data for two groups using RLE normalization in edgeR

simply generating the TSS tag count matrix

4. Differential expression analysis of TSS sequencing using refTSS

make new directory

download refTSS human corrdinate file

Example: bash cage.counting.pipeline.b0.02.sh {sample.bam} {mapping quality (integer)} {referece.bed} {genome size file}

4.1 Statistical testing and MA-plot

R object list

statistical testing by exact test

Extraction of top 100 TSSs by statistics (p-value, FDR)

dividing up* and down-regulated TSS list

4.2 TSS ID list preparation for GWAS-LD enrichment analysis

This process removes dobule quatation and line numbers from rowname vectors.

upregulated list

downregulated list

4.3 Visualization of differential expression by MA-plot

Reference

Usecase2