Usecase2: Difference between revisions

Latest revision as of 17:47, 7 October 2024

This usecase provides step by step workflow to achieve gold standard expression analysis using refTSS, and obtains TSS expression table, input TSSID list for refTSS-LD, quality assessment of TSS sequencing and statistical testing results. These scripts are based on the scripts written in Morioka MS et al..

For hands on tutorial, refTSS provides an example data-set published by Yasukawa et al. at use-case file folder:

https://reftss.riken.jp/usecase_files/ .

Software Requirements

SRA Toolkit (for fastq file download)
bigWigAverageOverBed, bedGraphToBigWig programs in Jim Kent Utility tools
fastqc program
cutadapt
STAR aligner
R(version 4.2.2 or later)
bedtools (Confirmed to work with version 2.28 and version 2.29)

1. FASTQ preparation

1.1 Download genome reference files

GRCh38 human genome assembly file is downloaded from GENCODE. After downloading genome fasta file, .genome file which required in several utility tools is generated by faidx function in NCBI tools.

* hg38 assembly

* hg38 gene annotation

   # .genome file generation
   pip install --upgrade pip
   pip install pyfaidx
   gzip -d GRCh38.p14.genome.fa.gz
   faidx GRCh38.p14.genome.fa -i chromsizes >GRCh38.p14.genome

1.2 FASTQ file preparation

Downloading an example data set (CAGE data) from SRA.

  fasterq-dump DRR159241 -o mutA_rep1.fq -e 12 -p
  fasterq-dump DRR159242 -o mutA_rep2.fq -e 12 -p
  fasterq-dump DRR159243 -o mutA_rep3.fq -e 12 -p
  fasterq-dump DRR159244 -o wt_rep1.fq -e 12 -p
  fasterq-dump DRR159245 -o wt_rep2.fq -e 12 -p
  fasterq-dump DRR159246 -o wt_rep3.fq -e 12 -p

Sequencing quality check by fastqc program

fastqc -o qc_mutA_rep1 mutA_rep1.fq
fastqc -o qc_mutA_rep2 mutA_rep2.fq
fastqc -o qc_mutA_rep3 mutA_rep3.fq
fastqc -o qc_wt_rep1 wt_rep1.fq
fastqc -o qc_wt_rep2 wt_rep2.fq
fastqc -o qc_wt_rep3 wt_rep3.fq

1.3 [Optional] Adaptor trimming step (e.g., two color illumina chemistries)

One of the major sequencing techs are both NextSeq and the NovaSeq which uses two-colored chemisty. In this case, 3'-end of G was derived from end of sequencing with good sequence quality. cutadapt with --nextseq-trim=20 option will work as regular quality trimming (where one would use -q 20 instead), except that the qualities of 3'-end G bases are ignored.

# -m: minimum output read length 
# -a: adaptor trimming 
# Trueseq read1 adaptor sequence:AGATCGGAAGAGCACACGTCTGAACTCCAGTCA 
# Nextra adaptor sequence: CTGTCTCTTATACACATCT 
# --nextseq-trim=20  
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o mutA_rep1.tm.fq mutA_rep1.fq
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o mutA_rep2.tm.fq mutA_rep2.fq
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o mutA_rep3.tm.fq mutA_rep3.fq
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o wt_rep1.tm.fq wt_rep1.fq
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o wt_rep2.tm.fq wt_rep2.fq
cutadapt -m 30 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o wt_rep3.tm.fq wt_rep3.fq

2. Mapping TSS-seq reads on reference genome

This example uses STAR program for mapping read on reference genome.

# STAR genome index generation 

STAR --runMode genomeGenerate \
	--genomeDir GRCh38p14_star_index \
	--runThreadN 16 \
	--genomeFastaFiles GRCh38.p14.genome.fa \
	--sjdbGTFfile gencode.v46.annotation.gtf


# STAR Mapping with example options 
GTF=/home/masaki.morioka/genome/gencode.v46.annotation.gtf
SGENOME=/home/masaki.m/genome/GRCh38p14_star_index
_naming=mutA_rep1

STAR --runThreadN 12 --genomeDir $SGENOME \
		--readFilesIn ${_naming}.tm.fq \
		--outSAMtype BAM SortedByCoordinate \
		--outTmpKeep None \
	  --outFileNamePrefix ${_naming}

3. CTSS file generation and quality assessment of TSS sequencing

3.1 Download and running the script for TSS tag count table

Download the our custom script for counting TSS tags and the reference of canonical TSS coordinates (GENCODE).

# Download scripts from Github #
git clone https://github.com/suimye/cage_tutorial.git
chmod 755 ./cage_tutorial/*

# Download canonical promoter coordinates (e.g., GENCODE) 
wget https:://refTSS.riken.jp/tutorial/gencode.v40.promoter300-100.bed.gz 
gzip -d gencode.v40.promoter300-100.bed.gz 

# Making a directory for tag count result 
mkdir gencode_count
cd gencode_count/

Running the script for tag counting

# Example: bash cage.counting.pipeline.b0.01.sh {sample.bam} {mapping quality (integer)} {referece.bed} {genome size file} =
sh ../cage_tutorial/cage.counting.pipeline.b0.01.sh ../mutA_rep1Aligned.sortedByCoord.out.bam 20 gencode.v40.promoter300-100.bed ../GRCh38.p14.genome
-----

Following six tag count files are generated.

- wt_rep1Aligned.sortedByCoord.out.240925.ctss.txt
- wt_rep2Aligned.sortedByCoord.out.240925.ctss.txt
- wt_rep3Aligned.sortedByCoord.out.240925.ctss.txt
- mutA_rep1Aligned.sortedByCoord.out.240925.ctss.txt
- mutA_rep2Aligned.sortedByCoord.out.240925.ctss.txt
- mutA_rep3Aligned.sortedByCoord.out.240925.ctss.txt

3.2 TSS Quality Evaluation and Create Expression Tables Using R

TSS sequencing technologies are expected to enrich TSS information from sequencing reads. Yield of tags should be overlapped with canonical TSS annotation such as 5 prime end of GENCODE mRNA annotation.

The script promoter_mapping_rate.R reads the files ending with ctss.txt generated in the previous process and creates an expression data table. Be careful not to have any unnecessary ctss.txt files in your working folder as it may cause errors. After the args option, enter the labels for the comparison groups (in this case, WT and Mutant) separated by spaces. These input labels will use for regular expression of file names. If you specify "none" for the --args option, it is used when you want to simply create an expression data table without considering replicate experiments.

= Generating normalized data for two groups using RLE normalization in edgeR  =
R --slave --vanilla --args wt mutA < ../cage_tutorial/promoter_mapping_rate.b0.2.R
= simply generating the TSS tag count matrix =
R --slave --vanilla --args none  < ../cage_tutorial/promoter_mapping_rate.b0.2.R

Here is the example output `TSS.qcbarplot.pdf` from result of test data set.

image-tssqcplot

4. Differential expression analysis of TSS sequencing using refTSS

TSS tags will be useful for gene expression analysis. Calculating TSS tag counts on refTSS will provide TSS-wide comprehensive view of expression changes in the genome. This process use shell script and R.

Making directory for this analysis

# make new directory 
cd ../
mkdir reftss_count
cd reftss_count
# download refTSS human corrdinate file 
wget https://reftss.riken.jp/datafiles/4.1/human/refTSS_v4.1_human_coordinate.hg38.bed.txt.gz
gzip -d refTSS_v4.1_human_coordinate.hg38.bed.txt.gz

4.1 Calculation of TSS tag counts on refTSS

# Example: bash cage.counting.pipeline.b0.02.sh {sample.bam} {mapping quality (integer)} {referece.bed} {genome size file} 
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../mutA_rep1Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../mutA_rep2Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../mutA_rep3Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../wt_rep1Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../wt_rep2Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome
sh ../cage_tutorial/cage.counting.pipeline.b0.02.sh ../wt_rep3Aligned.sortedByCoord.out.bam 20 refTSS_v4.1_human_coordinate.hg38.bed.txt ../GRCh38.p14.genome

4.2 Making TSS expression matrix in R environment

R --slave --vanilla --args wt mutA < ../cage_tutorial/promoter_mapping_rate.b0.2.R

After execution of previous command, R data file was generated as cage_count{date}.RData

4.3 Statistical testing

Previous process generate a R data file in working directory. Loading R data file and calculating statistics on R console.

library(edgeR)
load("cage_count{date}.RData")

# R object list 
> ls()
 [1] "args"       "cname"      "d"          "df"         "dgeL"      
 [6] "fl"         "flag"       "gp"         "gp1"        "gp2"       
[11] "i"          "log2cpm"    "read_d"     "rname"      "size"      
[16] "time_stamp"

# Statistical testing by exact test 
et_res <- exactTest(dgeL)

# Extraction of top 100 TSSs by statistics (p-value, FDR) 
detss <- topTags(et_res,n=100,adjust.method="BH",sort.by="Pvalue",p.value=0.0001)

# Dividing up- and down-regulated TSS list 
de_up <- subset(detss$table,logFC>0)
de_down <- subset(detss$table,logFC<0)

4.4 Visualization of differential expression by MA-plot

One of the standard approaches on visualization of expression value is MA-plot showing relationship between average expression and fold difference in each genes (or TSSs). Here is an example of MA-plot using test data.

pdf("maplot.pdf")
  plot(et_res$table$logCPM,et_res$table$logFC,cex=0.3,pch=16,col="darkgrey",xlab="Average log-expression",ylab="Expression log-ratio (MutA vs WT)")
  points(de_up$logCPM,de_up$logFC,cex=0.7,pch=16,col="deeppink")
  points(de_down$logCPM,de_down$logFC,cex=0.7,pch=16,col="steelblue")
  abline(h=0,col="tan4")
dev.off()

4.5 GWAS-LD enrichment analysis using differentially expressed TSS information

One of the standard approaches in downstream analysis of gene expression is enrichment analysis that can be useful for evaluating biological function under condition. To prepare input for GWAS-LD enrichment analysis on refTSS, TSS ID lists were restored in working directory. Algorithm of GWAS-LD enrichment analysis is described in here.

# This process removes double quotation and line numbers from rowname vectors. 

# upregulated list
sink("up_det.list.txt")
  cat(rownames(de_up), sep = "\n") 
sink() 


# downregulated list
sink("down_det.list.txt")
  cat(rownames(de_down), sep = "\n") 
sink()

*Access to refTSS-LD

*input text from up_det.list.txt and down_det.lsit.txt

*Push run botton

Reference

* Script: Morioka MS et al. Methods Mol Biol. 2020:2120:277-301. doi:10.1007/978-1-0716-0327-7_20.

* Example dataset: Yasukawa et al. Nat Commun. 2020:11:1557, doi:10.1038/s41457-020. 15289-7

* edgeR DOI:10.18129/B9.bioc.edgeR

* Kent Util tool: https://hgdownload.soe.ucsc.edu/admin/exe

* BEDtools: Quinlan AR and Hall IM Bioinformatics. 2010:26, 6, pp. 841–842.

* FASTQC: Wingett SW, Andrews S. et al. 2018 doi:10.12688/f1000research.15931.2. eCollection

* STAR: Dobin A et al. Bioinformatics 2013:29(1):15-21. doi:10.1093/bioinformatics/bts635. Epub 2012 Oct 25.

* Cutadapt: Marcel Martin doi:10.14806/ej.17.1.200

@@ Line 25: / Line 25: @@
 ===== 1.1 Download genome reference files =====
-GRCh38 human genome assembly file is downloaded from GENCODE. After downloading genome fasta file, `.genome` file which required in several utility tools is generated by `faidx` function in NCBI tools.
+: GRCh38 human genome assembly file is downloaded from GENCODE. After downloading genome fasta file, '''.genome''' file which required in several utility tools is generated by '''faidx''' function in NCBI tools.
-* [[Source::https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/GRCh38.p14.genome.fa.gz|hg38 assembly]]
+: * [[Source::https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/GRCh38.p14.genome.fa.gz|hg38 assembly]]
+: * [[Source::https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz|hg38 gene annotation]]
-* [[Source::https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz|hg38 gene annotation]]
 <pre>
     # .genome file generation
@@ Line 38: / Line 37: @@
 ===== 1.2 FASTQ file preparation =====
-Downloading an example data set (CAGE data) from SRA.
+: Downloading an example data set (CAGE data) from SRA.
 <pre>
    fasterq-dump DRR159241 -o mutA_rep1.fq -e 12 -p
@@ Line 60: / Line 59: @@
 ===== 1.3 [Optional] Adaptor trimming step (e.g., two color illumina chemistries) =====
-One of the major sequencing techs are both NextSeq and the NovaSeq which uses two-colored chemisty. In this case, 3'-end of `G` was derived from end of sequencing with good sequence quality.  cutadapt with `--nextseq-trim=20` option will work as regular quality trimming (where one would use `-q 20` instead), except that the qualities of 3'-end `G` bases are ignored.
+: One of the major sequencing techs are both NextSeq and the NovaSeq which uses two-colored chemisty. In this case, 3'-end of '''G''' was derived from end of sequencing with good sequence quality.  cutadapt with '''--nextseq-trim=20''' option will work as regular quality trimming (where one would use '''-q 20''' instead), except that the qualities of 3'-end '''G''' bases are ignored.
 <pre>
@@ Line 77: / Line 76: @@
 === 2. Mapping TSS-seq reads on reference genome  ===
+: This example uses STAR program for mapping read on reference genome.
-This example uses STAR program for mapping read on reference genome.
 <pre>
 # STAR genome index generation
@@ Line 104: / Line 102: @@
 ===== 3.1 Download and running the script for TSS tag count table =====
-Download the our custom script for counting TSS tags and the reference of canonical TSS coordinates (GENCODE).
+: Download the our custom script for counting TSS tags and the reference of canonical TSS coordinates (GENCODE).
 <pre>
 # Download scripts from Github #
@@ Line 119: / Line 118: @@
 </pre>
-Running the script for tag counting
+: Running the script for tag counting
 <pre>
@@ Line 127: / Line 126: @@
 </pre>
-Following six tag count files are generated.
+: Following six tag count files are generated.
 <pre>
@@ Line 139: / Line 138: @@
 ===== 3.2 TSS Quality Evaluation and Create Expression Tables Using R =====
-TSS sequencing technologies are expected to enrich TSS information from sequencing reads. Yield of tags should be overlapped with canonical TSS annotation such as 5 prime end of GENCODE mRNA annotation.
+: TSS sequencing technologies are expected to enrich TSS information from sequencing reads. Yield of tags should be overlapped with canonical TSS annotation such as 5 prime end of GENCODE mRNA annotation.
-The script `promoter_mapping_rate.R` reads the files ending with `ctss.txt` generated in the previous process and creates an expression data table. Be careful not to have any unnecessary `ctss.txt` files in your working folder as it may cause errors. After the args option, enter the labels for the comparison groups (in this case, WT and Mutant) separated by spaces. These input labels will use for regular expression of file names.  If you specify "none" for the `--args` option, it is used when you want to simply create an expression data table without considering replicate experiments.
+: The script '''promoter_mapping_rate.R''' reads the files ending with '''ctss.txt''' generated in the previous process and creates an expression data table. Be careful not to have any unnecessary '''ctss.txt''' files in your working folder as it may cause errors. After the args option, enter the labels for the comparison groups (in this case, WT and Mutant) separated by spaces. These input labels will use for regular expression of file names.  If you specify "none" for the '''--args''' option, it is used when you want to simply create an expression data table without considering replicate experiments.
 <pre>
@@ Line 150: / Line 149: @@
 </pre>
-Here is the example output `TSS.qcbarplot.pdf` from result of test data set.
+: Here is the example output `TSS.qcbarplot.pdf` from result of test data set.
 <html>
@@ Line 157: / Line 156: @@
 === 4. Differential expression analysis of TSS sequencing using refTSS ===
+: TSS tags will be useful for gene expression analysis. Calculating TSS tag counts on refTSS will provide TSS-wide comprehensive view of expression changes in the genome. This process use shell script and R.
-TSS tags will be useful for gene expression analysis. Calculating TSS tag counts on refTSS will provide TSS-wide comprehensive view of expression changes in the genome. This process use shell script and R.
+'''Making directory for this analysis'''
-===== Making directory for this analysis =====
 <pre>
 # make new directory
@@ Line 186: / Line 184: @@
 R --slave --vanilla --args wt mutA < ../cage_tutorial/promoter_mapping_rate.b0.2.R
 </pre>
-After execution of previous command, R data file was generated as `cage_count{date}.RData`
+: After execution of previous command, R data file was generated as '''cage_count{date}.RData'''
 ===== 4.3 Statistical testing =====
-Previous process generate a R data file in working directory. Loading R data file and calculating statistics on R console.
+: Previous process generate a R data file in working directory. Loading R data file and calculating statistics on R console.
 <pre>
 library(edgeR)
@@ Line 213: / Line 211: @@
 ===== 4.4 Visualization of differential expression by MA-plot =====
-One of the standard approaches on visualization of expression value is MA-plot showing relationship between average expression and fold difference in each genes (or TSSs). Here is an example of MA-plot using test data.
+: One of the standard approaches on visualization of expression value is '''MA-plot''' showing relationship between average expression and fold difference in each genes (or TSSs). Here is an example of MA-plot using test data.
 <pre>
@@ Line 224: / Line 222: @@
 </pre>
-<html>
+: <html>
 <p style="text-align: center;"><img src="https://github.com/user-attachments/assets/7c3aa90f-dd60-4520-a514-57a32501c828" style="zoom:40%;"></p>
 </html>
@@ Line 230: / Line 228: @@
 ===== 4.5 GWAS-LD enrichment analysis using differentially expressed TSS information =====
-One of the standard approaches in downstream analysis of gene expression is enrichment analysis that can be useful for evaluating biological function under condition. To prepare input for GWAS-LD enrichment analysis on refTSS, TSS ID lists were restored in working directory. Algorithm of GWAS-LD enrichment analysis is described in [[Source::https://reftss.riken.jp/reftss/Manual|here]].
+: One of the standard approaches in downstream analysis of gene expression is enrichment analysis that can be useful for evaluating biological function under condition. To prepare input for GWAS-LD enrichment analysis on refTSS, TSS ID lists were restored in working directory. Algorithm of GWAS-LD enrichment analysis is described in [[Source::https://reftss.riken.jp/reftss/Manual|here]].
 <pre>
 # This process removes double quotation and line numbers from rowname vectors.
@@ Line 246: / Line 245: @@
 </pre>
-# Access to [[GWAS-LD_enrichment|refTSS-LD]]
+: *Access to [[GWAS-LD_enrichment|refTSS-LD]]
-# input text from up_det.list.txt and down_det.lsit.txt
+: *input text from up_det.list.txt and down_det.lsit.txt
-# Push run botton
+: *Push run botton
-<html>
+: <html>
 <p style="text-align: center;"><img src="https://github.com/user-attachments/assets/e35f7c2e-6fbb-4f6c-836d-80f82ce76b1d" style="zoom:80%;"></p>
 </html>
@@ Line 259: / Line 258: @@
 === Reference ===
-<pre>
+: * Script: Morioka MS et al. Methods Mol Biol. 2020:2120:277-301. doi:10.1007/978-1-0716-0327-7_20.
-* Script: Morioka MS et al. Methods Mol Biol. 2020:2120:277-301. doi:10.1007/978-1-0716-0327-7_20.
+: * Example dataset: Yasukawa et al. Nat Commun. 2020:11:1557, doi:10.1038/s41457-020. 15289-7
-* Example dataset: Yasukawa et al. Nat Commun. 2020:11:1557, doi:10.1038/s41457-020. 15289-7
+: * edgeR DOI:10.18129/B9.bioc.edgeR
-* edgeR DOI:10.18129/B9.bioc.edgeR
+: * Kent Util tool: https://hgdownload.soe.ucsc.edu/admin/exe
-* Kent Util tool: https://hgdownload.soe.ucsc.edu/admin/exe
+: * BEDtools: Quinlan AR and Hall IM Bioinformatics. 2010:26, 6, pp. 841–842.
-* BEDtools: Quinlan AR and Hall IM Bioinformatics. 2010:26, 6, pp. 841–842.
+: * FASTQC: Wingett SW, Andrews S. et al. 2018 doi:10.12688/f1000research.15931.2. eCollection
-* FASTQC: Wingett SW, Andrews S. et al. 2018 doi:10.12688/f1000research.15931.2. eCollection
+: * STAR: Dobin A et al. Bioinformatics  2013:29(1):15-21. doi:10.1093/bioinformatics/bts635. Epub 2012 Oct 25.
-* STAR: Dobin A et al. Bioinformatics  2013:29(1):15-21. doi:10.1093/bioinformatics/bts635. Epub 2012 Oct 25.
+: * Cutadapt: Marcel Martin doi:10.14806/ej.17.1.200
-* Cutadapt: Marcel Martin doi:10.14806/ej.17.1.200
-</pre>