gyazgeldi/STRTN: STRT-N

Katayama Shintaro

gyazgeldi/STRTN: STRT-N

2023-01-11, 2023-01-11

software

STRTN-NextSeq analysis pipeline A pipeline for the analysis of STRT-N RNA-sequencing outputs from NextSeq. This pipeline is based on STRT2 pipeline that is developed by Masahito Yoshihara (Ezer et al., 2021). NOTE: Sequence data processing, visualization on UCSC and visualization using Seurat require about 150GB, 15GB, 50MB and 150MB of memory depending on genome size and raw data size. The installation of conda packages and pipeline, and running the STRT-N pipeline should be run on Linux terminal. Install git clone https://github.com/gyazgeldi/STRTN.git  Dependencies For `STRTN.sh` (The main pipeline with visualization) - Picard - HISAT2 - SAMtools - bedtools - Subread - Seqtk - UCSC-bedGraphToBigWig For `STRTN-Seurat.sh` (The visualization of results using Seurat package) - stringr - dplyr - ggplot2 - cowplot - ggbeeswarm - forcats - Seurat For `fastq-fastQC.sh`  - FastQC - MultiQC The conda environment is provided as `STRTN-env.yml`. The environment can be created with the followings. In this example, the pipeline will be cloned into the “STRTN-test” folder, but you may change it to any name according to your project. conda env create -n STRTN-test -f STRTN-env.yml conda activate STRTN-test For CSC, these software are available through the `module` command in the scripts (`STRTN-CSC.sh`, `STRTN-Seurat.sh`, `STRTN-UCSC-Allas.sh`, `STRTN-TFE-CSC.sh`, and `fastq-fastqc-CSC.sh` as well as `STRTN-Indexes-Dictionary-CSC.sh`). Requirements - Illumina BaseCalls files (.bcl). The number of lanes is determined based on the number of directories in the basecalls directory. Here is an example of 4 lanes:    ├── L001   ├── L002   ├── L003   └── L004 - HISAT2 index built with a reference genome, (ribosomal DNA), and ERCC spike-ins    - See also [How to build HISAT2 index](https://github.com/gyazgeldi/STRTN#how-to-build-hisat2-index-in-csc).   - The HISAT2 index directory should include the followings:     ├── [basename].1.ht2     ├── [basename].2.ht2     ├── [basename].3.ht2     ├── [basename].4.ht2     ├── [basename].5.ht2     ├── [basename].6.ht2     ├── [basename].7.ht2     ├── [basename].8.ht2     ├── [basename].fasta     └── [basename].dict - Source files (in `src` directory)   - `barcode.txt` : Barcode sequence with barcode name (1–48). __Please modify if you used different (number of) barcodes.__   - `ERCC.bed` : 5'-end 50 nt region of ERCC spike-ins ([SRM2374](https://www-s.nist.gov/srmors/view_detail.cfm?srm=2374)) for annotation and quality check.   - `Example-BarcodesStages` : Sample explanation for data reduction and visualization using PCA, UMAP and violin plots. Usage: For general users: ./STRTN.sh -o {OUTPUT_NAME} -g {GENOME_VALUE} -a {ANNO_VALUE} -b {BaseCallsDir_PATH} -i {Index_PATH} -w {WorkingDir_PATH} -c {center_VALUE} -r {run_VALUE} -s {READ_STRUCTURE}     For CSC users: sbatch -A project_2005262 ./STRTN-CSC.sh -o {OUTPUT_NAME} -g {GENOME_VALUE} -a {ANNO_VALUE} -b {BaseCallsDir_PATH} -i {Index_PATH} -w {WorkingDir_PATH} -c {center_VALUE} -r {run_VALUE} -s {READ_STRUCTURE}   Example usage For general users: ./STRTN.sh -o STRTN_MOUSE_LIB -g mm39 -a wgEncodeGencodeBasicVM30 -b /mnt/c/Users/gamyaz/STRTN-Pipeline/Data/Intensities/BaseCalls -i /mnt/c/Users/gamyaz/STRTN-Pipeline/mouse_index/mouse_reference -w /mnt/c/Users/gamyaz/STRTN-Pipeline -p /mnt/c/Users/gamyaz/Downloads/ENTER/pkgs/picard-2.27.4-hdfd78af_0/share/picard-2.27.4-0 -c FUGU -r RUNBARCODE -s 8M3S75T6B For CSC users: sbatch -A project_2005262 ./STRTN-CSC.sh -o STRTN_MOUSE_LIB -g mm39 -a wgEncodeGencodeBasicVM30 -b /scratch/project_2005262/Data/Intensities/BaseCalls -i /scratch/project_2005262/mouse_index/mouse_reference -w /scratch/project_2005262 -c FUGU -r RUNBARCODE -s 8M3S75T6B Parameters Mandatory: Name Description -g, --genome Reference genome. Choose one hg19/hg38/mm9/mm10/mm39/canFam3/canFam6/bosTau9. -b, --basecalls /PATH/to/the Illumina basecalls directory. -i, --index /PATH/to/the directory and basename of the HISAT2 index for the reference genome. -w, --working /PATH/to/the working directory. -p, --picardhome /PATH/to/the picard.jar. Optional: Name Default Description -o, --out OUTPUT Output file name. -a, --annotation ref Gene annotation for QC and counting. <br> Choose from `ref`(RefSeq)/`ens`(Ensembl)/`kg`(UCSC KnownGenes), or directly input the Gencode annotation file name (eg. `wgEncodeGencodeBasicVM30`) for Gencode. Note that some annotations are unavailable in some cases. Please find the details below. -c, --center  CENTER The name of the sequencing center that produced the reads.<br>Required for the the Picard IlluminaBasecallsToSam program. -r, --run RUNBARCODE The barcode of the run. Prefixed to read names. Required for the the Picard IlluminaBasecallsToSam program. -s, --structure 8M3S75T6B Read structure. Required for the the Picard IlluminaBasecallsToSam program. Details are described here -d, --dta   Add `-d, --dta` (downstream-transcriptome-assembly) if you plan to perform TFE-based analysis. Please note that this leads to fewer alignments with short-anchors. -h, --help   Show usage. -v, --version   Show version. `-a, --annotation` availability as of Nov 2022:   RefSeq (ref) Ensembl (ens) KnownGenes (kg) Gencode hg19 (human) + + + + hg38 (human) + - + + mm9 (mouse)  + + + - mm10 (mouse) + - + + mm39 (mouse)  + - + + canFam3 (dog) + + - - canFam6 (dog) + - - - bosTau9 (bovine) + + - - Outputs Outputs are provided in `out` directory. Unaligned BAM files generated with Picard IlluminaBasecallsToSam program are found in `tmp/Unaligned_bam`. OUTPUT-QC.txt  Quality check report for all samples. It is provided in out directory by STRTN.sh. Column Value Barcode Sample name. `OUTPUT` with numbers Qualified_reads Primary aligned read count Total_reads Read count without redundant (duplicate) reads Redundancy Qualified reads / Total reads Mapped_reads Mapped read count (Total reads without unmapped reads) Mapped_rate Mapped reads / Total reads Spikein_reads Read count mapped to ERCC spike-ins Spikein-5end_reads Read count mapped to the 5'-end 50 nt region of ERCC spike-ins Spikein-5end_rate Spikein-5end reads / Spikein reads Coding_reads Read count aligned within any exon or the 500 bp upstream of coding genes Coding-5end_reads Read count aligned the 5′-UTR or 500 bp upstream of coding genes Coding-5end_rate Coding-5end reads / Coding reads OUTPUT-QC-plots.pdf Quality check report by boxplots. Mapped_reads, Mapped_rate, Spikein_reads, Mapped / Spikein, Spikein-5end_rate, and Coding-5end_rate are shown for all samples. Barcode numbers of outlier samples are marked with red characters. It is provided in out directory by STRTN.sh. Please consider these outlier samples for the further downstream analysis. OUTPUT_byGene-counts.txt Read count table output from. It is provided in out directory by STRTN.sh. featureCounts OUTPUT_byGene-counts.txt.summary Filtering summary from. It is provided in out directory by STRTN.sh.  featureCounts Output_bam Resulting BAM files including unmapped, non-primary aligned, and duplicated (marked) reads. Files are provided in out directory by STRTN.sh. Output_bai Index files (.bai) of the resulting BAM files in the Output_bam directory. Files are provided in out directory by STRTN.sh. OUTPUT-QC-plots.pdf Quality check report by boxplots. Mapped_reads, Mapped_rate, Spikein_reads, Mapped / Spikein, Spikein-5end_rate, and Coding-5end_rate are shown for all samples. Barcode numbers of outlier samples are marked with red characters. It is provided in out directory by STRTN.sh. Please consider these outlier samples for the further downstream analysis. OUTPUT_byGene-counts.txt Read count table output from. It is provided in out directory by STRTN.sh. featureCounts. https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf OUTPUT_byGene-counts.txt.summary Filtering summary from. It is provided in out directory by STRTN.sh.  featureCounts. https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf Output_bam Resulting BAM files including unmapped, non-primary aligned, and duplicated (marked) reads. Files are provided in out directory by STRTN.sh. Output_bai Index files (.bai) of the resulting BAM files in the Output_bam directory. Files are provided in out directory by STRTN.sh. OUTPUT.output.bam BAM files containing reads except for duplicate and non-primary reads. Files are provided in the working directory by STRTN.sh. OUTPUT.minus.bw and OUTPUT.plus.bw BigWig files for each strands of each sample. Files are provided in the working directory by STRTN.sh. coding_5end.bb BigBed file for coding-5'end annotation file. It is provided in the working directory by STRTN.sh hub.txt Parameters for each tracks. It is provided in the working directory by STRTN-UCSC-Allas.sh. Link of hub.txt file Provided by STRTN-UCSC-Allas.sh. OUTPUT-QC-BeeswarmPlots.pdf Visualization quality check values for each developmental stage using BeeswarmPlots. It is provided in out directory by STRTN-Seurat.sh. Rplots.pdf Elbow, JackStraw, PCA, UMAP and violin plots. It is provided in out directory by STRTN-Seurat.sh. ExtractIlluminaBarcodes_Metrics Metrics file produced by the Picard ExtractIlluminaBarcodes program. The number of matches/mismatches between the barcode reads and the actual barcodes is shown per lane. https://gatk.broadinstitute.org/hc/en-us/articles/360037426491-ExtractIlluminaBarcodes-Picard- HISAT2_Metrics Alignment summary of samples from each lane produced by the HISAT2 program. https://daehwankimlab.github.io/hisat2/manual/ MarkDuplicates_Metrics Metrics file indicating the numbers of duplicates produced by the Picard MarkDuplicates program. https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard- fastq-fastQC.sh After running the pipeline above, you can generate fastq files for each sample from the output BAM files in the fastq directory. These fastq files (without duplicated reads) can be submitted to public sequence databases. FastQC files are also generated for each fastq file in the fastqc directory. Based on the FastQC results, MultiQC report (MultiQC_report.html) is generated. Genome indexing and creating sequence dictionary Genome indexes for mouse STRT-N library is available in https://doi.org/10.5281/zenodo.7457660 and the procedure is in https://github.com/gyazgeldi/STRTN#how-to-build-hisat2-index-in-csc

https://doi.org/10.5281/zenodo.7524239