gyazgeldi/STRTN: STRT-N
STRTN-NextSeq analysis pipeline
A pipeline for the analysis of STRT-N RNA-sequencing outputs from NextSeq. This pipeline is based on STRT2 pipeline that is developed by Masahito Yoshihara (Ezer et al., 2021).
NOTE: Sequence data processing, visualization on UCSC and visualization using Seurat require about 150GB, 15GB, 50MB and 150MB of memory depending on genome size and raw data size. The installation of conda packages and pipeline, and running the STRT-N pipeline should be run on Linux terminal.
Install
git clone https://github.com/gyazgeldi/STRTN.git
Dependencies
For `STRTN.sh` (The main pipeline with visualization)
- Picard
- HISAT2
- SAMtools
- bedtools
- Subread
- Seqtk
- UCSC-bedGraphToBigWig
For `STRTN-Seurat.sh` (The visualization of results using Seurat package)
- stringr
- dplyr
- ggplot2
- cowplot
- ggbeeswarm
- forcats
- Seurat
For `fastq-fastQC.sh`
- FastQC
- MultiQC
The conda environment is provided as `STRTN-env.yml`. The environment can be created with the followings. In this example, the pipeline will be cloned into the “STRTN-test” folder, but you may change it to any name according to your project.
conda env create -n STRTN-test -f STRTN-env.yml
conda activate STRTN-test
For CSC, these software are available through the `module` command in the scripts (`STRTN-CSC.sh`, `STRTN-Seurat.sh`, `STRTN-UCSC-Allas.sh`, `STRTN-TFE-CSC.sh`, and `fastq-fastqc-CSC.sh` as well as `STRTN-Indexes-Dictionary-CSC.sh`).
Requirements
- Illumina BaseCalls files (.bcl). The number of lanes is determined based on the number of directories in the basecalls directory. Here is an example of 4 lanes:
├── L001
├── L002
├── L003
└── L004
- HISAT2 index built with a reference genome, (ribosomal DNA), and ERCC spike-ins
- See also [How to build HISAT2 index](https://github.com/gyazgeldi/STRTN#how-to-build-hisat2-index-in-csc).
- The HISAT2 index directory should include the followings:
├── [basename].1.ht2
├── [basename].2.ht2
├── [basename].3.ht2
├── [basename].4.ht2
├── [basename].5.ht2
├── [basename].6.ht2
├── [basename].7.ht2
├── [basename].8.ht2
├── [basename].fasta
└── [basename].dict
- Source files (in `src` directory)
- `barcode.txt` : Barcode sequence with barcode name (1–48). __Please modify if you used different (number of) barcodes.__
- `ERCC.bed` : 5'-end 50 nt region of ERCC spike-ins ([SRM2374](https://www-s.nist.gov/srmors/view_detail.cfm?srm=2374)) for annotation and quality check.
- `Example-BarcodesStages` : Sample explanation for data reduction and visualization using PCA, UMAP and violin plots.
Usage:
For general users:
./STRTN.sh -o {OUTPUT_NAME} -g {GENOME_VALUE} -a {ANNO_VALUE} -b {BaseCallsDir_PATH} -i {Index_PATH} -w {WorkingDir_PATH} -c {center_VALUE} -r {run_VALUE} -s {READ_STRUCTURE}
For CSC users:
sbatch -A project_2005262 ./STRTN-CSC.sh -o {OUTPUT_NAME} -g {GENOME_VALUE} -a {ANNO_VALUE} -b {BaseCallsDir_PATH} -i {Index_PATH} -w {WorkingDir_PATH} -c {center_VALUE} -r {run_VALUE} -s {READ_STRUCTURE}
Example usage
For general users:
./STRTN.sh -o STRTN_MOUSE_LIB -g mm39 -a wgEncodeGencodeBasicVM30 -b /mnt/c/Users/gamyaz/STRTN-Pipeline/Data/Intensities/BaseCalls -i /mnt/c/Users/gamyaz/STRTN-Pipeline/mouse_index/mouse_reference -w /mnt/c/Users/gamyaz/STRTN-Pipeline -p /mnt/c/Users/gamyaz/Downloads/ENTER/pkgs/picard-2.27.4-hdfd78af_0/share/picard-2.27.4-0 -c FUGU -r RUNBARCODE -s 8M3S75T6B
For CSC users:
sbatch -A project_2005262 ./STRTN-CSC.sh -o STRTN_MOUSE_LIB -g mm39 -a wgEncodeGencodeBasicVM30 -b /scratch/project_2005262/Data/Intensities/BaseCalls -i /scratch/project_2005262/mouse_index/mouse_reference -w /scratch/project_2005262 -c FUGU -r RUNBARCODE -s 8M3S75T6B
Parameters
Mandatory:
Name
Description
-g, --genome
Reference genome. Choose one hg19/hg38/mm9/mm10/mm39/canFam3/canFam6/bosTau9.
-b, --basecalls
/PATH/to/the Illumina basecalls directory.
-i, --index
/PATH/to/the directory and basename of the HISAT2 index for the reference genome.
-w, --working
/PATH/to/the working directory.
-p, --picardhome
/PATH/to/the picard.jar.
Optional:
Name
Default
Description
-o, --out
OUTPUT
Output file name.
-a, --annotation
ref
Gene annotation for QC and counting. <br> Choose from `ref`(RefSeq)/`ens`(Ensembl)/`kg`(UCSC KnownGenes), or directly input the Gencode annotation file name (eg. `wgEncodeGencodeBasicVM30`) for Gencode. Note that some annotations are unavailable in some cases. Please find the details below.
-c, --center
CENTER
The name of the sequencing center that produced the reads.<br>Required for the the Picard IlluminaBasecallsToSam program.
-r, --run
RUNBARCODE
The barcode of the run. Prefixed to read names. Required for the the Picard IlluminaBasecallsToSam program.
-s, --structure
8M3S75T6B
Read structure. Required for the the Picard IlluminaBasecallsToSam program. Details are described here
-d, --dta
Add `-d, --dta` (downstream-transcriptome-assembly) if you plan to perform TFE-based analysis. Please note that this leads to fewer alignments with short-anchors.
-h, --help
Show usage.
-v, --version
Show version.
`-a, --annotation` availability as of Nov 2022:
RefSeq (ref)
Ensembl (ens)
KnownGenes (kg)
Gencode
hg19 (human)
+
+
+
+
hg38 (human)
+
-
+
+
mm9 (mouse)
+
+
+
-
mm10 (mouse)
+
-
+
+
mm39 (mouse)
+
-
+
+
canFam3 (dog)
+
+
-
-
canFam6 (dog)
+
-
-
-
bosTau9 (bovine)
+
+
-
-
Outputs
Outputs are provided in `out` directory.
Unaligned BAM files generated with Picard IlluminaBasecallsToSam program are found in `tmp/Unaligned_bam`.
OUTPUT-QC.txt
Quality check report for all samples. It is provided in out directory by STRTN.sh.
Column
Value
Barcode
Sample name. `OUTPUT` with numbers
Qualified_reads
Primary aligned read count
Total_reads
Read count without redundant (duplicate) reads
Redundancy
Qualified reads / Total reads
Mapped_reads
Mapped read count (Total reads without unmapped reads)
Mapped_rate
Mapped reads / Total reads
Spikein_reads
Read count mapped to ERCC spike-ins
Spikein-5end_reads
Read count mapped to the 5'-end 50 nt region of ERCC spike-ins
Spikein-5end_rate
Spikein-5end reads / Spikein reads
Coding_reads
Read count aligned within any exon or the 500 bp upstream of coding genes
Coding-5end_reads
Read count aligned the 5′-UTR or 500 bp upstream of coding genes
Coding-5end_rate
Coding-5end reads / Coding reads
OUTPUT-QC-plots.pdf
Quality check report by boxplots. Mapped_reads, Mapped_rate, Spikein_reads, Mapped / Spikein, Spikein-5end_rate, and Coding-5end_rate are shown for all samples. Barcode numbers of outlier samples are marked with red characters. It is provided in out directory by STRTN.sh.
Please consider these outlier samples for the further downstream analysis.
OUTPUT_byGene-counts.txt
Read count table output from. It is provided in out directory by STRTN.sh.
featureCounts
OUTPUT_byGene-counts.txt.summary
Filtering summary from. It is provided in out directory by STRTN.sh.
featureCounts
Output_bam
Resulting BAM files including unmapped, non-primary aligned, and duplicated (marked) reads. Files are provided in out directory by STRTN.sh.
Output_bai
Index files (.bai) of the resulting BAM files in the Output_bam directory. Files are provided in out directory by STRTN.sh.
OUTPUT-QC-plots.pdf
Quality check report by boxplots. Mapped_reads, Mapped_rate, Spikein_reads, Mapped / Spikein, Spikein-5end_rate, and Coding-5end_rate are shown for all samples. Barcode numbers of outlier samples are marked with red characters. It is provided in out directory by STRTN.sh.
Please consider these outlier samples for the further downstream analysis.
OUTPUT_byGene-counts.txt
Read count table output from. It is provided in out directory by STRTN.sh.
featureCounts. https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf
OUTPUT_byGene-counts.txt.summary
Filtering summary from. It is provided in out directory by STRTN.sh.
featureCounts. https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf
Output_bam
Resulting BAM files including unmapped, non-primary aligned, and duplicated (marked) reads. Files are provided in out directory by STRTN.sh.
Output_bai
Index files (.bai) of the resulting BAM files in the Output_bam directory. Files are provided in out directory by STRTN.sh.
OUTPUT.output.bam
BAM files containing reads except for duplicate and non-primary reads. Files are provided in the working directory by STRTN.sh.
OUTPUT.minus.bw and OUTPUT.plus.bw
BigWig files for each strands of each sample. Files are provided in the working directory by STRTN.sh.
coding_5end.bb
BigBed file for coding-5'end annotation file. It is provided in the working directory by STRTN.sh
hub.txt
Parameters for each tracks. It is provided in the working directory by STRTN-UCSC-Allas.sh.
Link of hub.txt file
Provided by STRTN-UCSC-Allas.sh.
OUTPUT-QC-BeeswarmPlots.pdf
Visualization quality check values for each developmental stage using BeeswarmPlots. It is provided in out directory by STRTN-Seurat.sh.
Rplots.pdf
Elbow, JackStraw, PCA, UMAP and violin plots. It is provided in out directory by STRTN-Seurat.sh.
ExtractIlluminaBarcodes_Metrics
Metrics file produced by the Picard ExtractIlluminaBarcodes program. The number of matches/mismatches between the barcode reads and the actual barcodes is shown per lane. https://gatk.broadinstitute.org/hc/en-us/articles/360037426491-ExtractIlluminaBarcodes-Picard-
HISAT2_Metrics
Alignment summary of samples from each lane produced by the HISAT2 program. https://daehwankimlab.github.io/hisat2/manual/
MarkDuplicates_Metrics
Metrics file indicating the numbers of duplicates produced by the Picard MarkDuplicates program. https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-
fastq-fastQC.sh
After running the pipeline above, you can generate fastq files for each sample from the output BAM files in the fastq directory. These fastq files (without duplicated reads) can be submitted to public sequence databases. FastQC files are also generated for each fastq file in the fastqc directory. Based on the FastQC results, MultiQC report (MultiQC_report.html) is generated.
Genome indexing and creating sequence dictionary
Genome indexes for mouse STRT-N library is available in https://doi.org/10.5281/zenodo.7457660 and the procedure is in https://github.com/gyazgeldi/STRTN#how-to-build-hisat2-index-in-csc