Genome indexes for Mus musculus (mm39)

License: cc-by-4.0
Author: Katayama Shintaro
Dates: 2025-04-29, 2022-12-20, 2022-12-20
URL: https://datakatalogi.helsinki.fi/handle/123456789/5222

BUILDING HISAT2 INDEXES IN CSC

Here is the case for the house mouse genome (mm39). The genome indexing step requires a large amount of memory, so it may not be possible to carry it out on a laptop. Genome indexes for Mus musculus (mm39) were created using HISAT2 v2.2.1 on CSC (IT Center for Science), thanks to CSC-Puhti.

1. Create a folder for the conda environment, install the required packages into it, and add its bin directory to the path.

    mkdir STRTN-env
    conda-containerize new --prefix STRTN-env STRTN-env.yml
    export PATH="<install_dir>/STRTN-env/bin:$PATH"

2. Load the required modules.

    module load tykky
    export PATH="<install_dir>/STRTN-env/bin:$PATH"
    module load r-env
    if test -f ~/.Renviron; then
        sed -i '/TMPDIR/d' ~/.Renviron
    fi
    echo "TMPDIR=${WorkingDir_PATH}" >> ~/.Renviron

3. Obtain the genome sequences of the reference and the ERCC spike-ins. You may add the ribosomal DNA repetitive unit for human (U13369) and mouse (BK000964).

    wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
    unpigz -c mm39.fa.gz | ruby -ne '$ok = $_ !~ /^>chrUn_/ if $_ =~ /^>/; puts $_ if $ok' > mouse_reference.fasta
    wget https://tsapps.nist.gov/srmext/certificates/documents/SRM2374_putative_T7_products_NoPolyA_v2.FASTA
    cat SRM2374_putative_T7_products_NoPolyA_v2.FASTA >> mouse_reference.fasta

4. Extract splice sites and exons from a GTF file. Here we used wgEncodeGencodeBasicVM30 as the annotation file. You may additionally run `hisat2_extract_snps_haplotypes_UCSC.py` to extract SNPs and haplotypes from a dbSNP file for human and mouse.

    wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/database/wgEncodeGencodeBasicVM30.txt.gz
    unpigz -c wgEncodeGencodeBasicVM30.txt.gz | hisat2_extract_splice_sites.py - | grep -v ^chrUn > splice_sites.txt
    unpigz -c wgEncodeGencodeBasicVM30.txt.gz | hisat2_extract_exons.py - | grep -v ^chrUn > exons.txt

5. Build the HISAT2 index.
This outputs a set of files with numbered suffixes. Here, `mouse_reference.1.ht2`, `mouse_reference.2.ht2`, ..., `mouse_reference.8.ht2` are generated.
In this case, `mouse_reference` is the basename used for `-i, --index`.

    hisat2-build mouse_reference.fasta --ss splice_sites.txt --exon exons.txt mouse_index/mouse_reference

6. Create the sequence dictionary for the reference and spike-in sequences. This is required by the Picard MergeBamAlignment program. Note that the original FASTA file (`mouse_reference.fasta` here) is also required.

    picard CreateSequenceDictionary R=mouse_reference.fasta O=mouse_reference.dict

7. Put the genome indexes, the genome FASTA file, and the sequence dictionary in the same folder (`mouse_index`, where step 5 wrote the indexes).

    mv mouse_reference.dict mouse_index/
    mv mouse_reference.fasta mouse_index/

Keywords: genome indexes, mouse, mm39
Type: dataset
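
As a quick sanity check after step 7, the folder can be inspected for the files the steps above are expected to produce. The following is only a minimal sketch: the function name `check_index_dir` is hypothetical (it is not part of HISAT2, Picard, or the pipeline), and it assumes a small index with exactly eight `.ht2` files plus the `.fasta` and `.dict` files sharing the same basename, as described in steps 5 to 7.

```shell
# check_index_dir DIR BASENAME   (hypothetical helper, not part of any tool)
# Returns 0 only if DIR contains BASENAME.1.ht2 ... BASENAME.8.ht2,
# BASENAME.fasta and BASENAME.dict. Each missing file is reported on stderr.
check_index_dir() {
    dir=$1
    base=$2
    missing=0
    for suffix in 1.ht2 2.ht2 3.ht2 4.ht2 5.ht2 6.ht2 7.ht2 8.ht2 fasta dict; do
        if [ ! -f "$dir/$base.$suffix" ]; then
            echo "missing: $dir/$base.$suffix" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

With the layout assumed here, it could be called as `check_index_dir mouse_index mouse_reference && echo "index folder looks complete"`. It only checks that the files exist, not that their contents are valid.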