Genome indexes for Mus musculus (mm39)

Katayama Shintaro

Date

2022-12-20

Creator/contributor

Katayama Shintaro

Publication Type

dataset

Repositories

Zenodo

Description

BUILDING HISAT2 INDEXES IN CSC

This is the procedure for the house mouse genome (mm39). Genome indexing requires a large amount of memory and may not be feasible on a laptop. The genome indexes for Mus musculus (mm39) were created using HISAT2 v2.2.1 on CSC (IT Center for Science), thanks to CSC-Puhti.

1. Create a folder for the conda environment, install the required packages into it, and add its bin directory to the path.

    mkdir STRTN-env
    conda-containerize new --prefix STRTN-env STRTN-env.yml
    export PATH="<install_dir>/STRTN-env/bin:$PATH"

2. Load the required modules and set TMPDIR for R.

    module load tykky
    export PATH="<install_dir>/STRTN-env/bin:$PATH"
    module load r-env
    if test -f ~/.Renviron; then
        sed -i '/TMPDIR/d' ~/.Renviron
    fi
    echo "TMPDIR=${WorkingDir_PATH}" >> ~/.Renviron

3. Obtain the genome sequences of the reference and the ERCC spike-ins. The unplaced contigs (`chrUn_*`) are dropped from the reference. You may also add the ribosomal DNA repetitive unit for human (U13369) and mouse (BK000964).

    wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
    unpigz -c mm39.fa.gz | ruby -ne '$ok = $_ !~ /^>chrUn_/ if $_ =~ /^>/; puts $_ if $ok' > mouse_reference.fasta
    wget https://tsapps.nist.gov/srmext/certificates/documents/SRM2374_putative_T7_products_NoPolyA_v2.FASTA
    cat SRM2374_putative_T7_products_NoPolyA_v2.FASTA >> mouse_reference.fasta

4. Extract splice sites and exons from a GTF file. Here we used wgEncodeGencodeBasicVM30 as the annotation file. You may additionally run `hisat2_extract_snps_haplotypes_UCSC.py` to extract SNPs and haplotypes from a dbSNP file for human and mouse.

    wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/database/wgEncodeGencodeBasicVM30.txt.gz
    unpigz -c wgEncodeGencodeBasicVM30.txt.gz | hisat2_extract_splice_sites.py - | grep -v ^chrUn > splice_sites.txt
    unpigz -c wgEncodeGencodeBasicVM30.txt.gz | hisat2_extract_exons.py - | grep -v ^chrUn > exons.txt

5. Build the HISAT2 index. This outputs a set of files that share a basename and differ in suffix. Here, `mouse_reference.1.ht2`, `mouse_reference.2.ht2`, ..., `mouse_reference.8.ht2` are generated. In this case, `mouse_reference` is the basename used for `-i, --index`.

    hisat2-build mouse_reference.fasta --ss splice_sites.txt --exon exons.txt mouse_index/mouse_reference

6. Create the sequence dictionary for the reference and spike-in sequences. This is required by the Picard MergeBamAlignment program. Note that the original FASTA file (`mouse_reference.fasta` here) is also required.

    picard CreateSequenceDictionary R=mouse_reference.fasta O=mouse_reference.dict

7. Put the genome indexes, the genome FASTA file, and the sequence dictionary in the same folder.

    mv mouse_reference.dict mouse_index/
    mv mouse_reference.fasta mouse_index/
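
Two points in the procedure are easy to get wrong silently: the header-based filter in step 3 and the final file layout in step 7. The sketch below is not part of the original recipe. `filter_chrun` and `check_index_dir` are hypothetical helper names, the awk filter is a stand-in for the Ruby one-liner (it should behave the same way, keeping a record only while the most recent header does not start with `>chrUn_`), and the check assumes the eight `.ht2` files described in step 5.

```shell
# Sketch only: both helper names are hypothetical, not part of HISAT2 or Picard.

# Read FASTA on stdin and write it to stdout without the records whose
# header starts with ">chrUn_". "keep" is updated on every header line and
# then applied to that header and to the sequence lines that follow it.
filter_chrun() {
    awk '/^>/ { keep = ($0 !~ /^>chrUn_/) } keep'
}

# Check that a folder holds the eight index files, the FASTA file and the
# sequence dictionary for a given basename.
#   $1 = folder, $2 = basename (for example mouse_reference)
# Prints each missing or empty file to stderr; returns non-zero if any is missing.
# The body runs in a subshell so its variables do not leak into the caller.
check_index_dir() (
    dir=$1
    base=$2
    missing=0
    for suffix in 1.ht2 2.ht2 3.ht2 4.ht2 5.ht2 6.ht2 7.ht2 8.ht2 fasta dict; do
        if [ ! -s "$dir/$base.$suffix" ]; then
            echo "missing or empty: $dir/$base.$suffix" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
)
```

With these defined, `unpigz -c mm39.fa.gz | filter_chrun > mouse_reference.fasta` could replace the Ruby filter, and `check_index_dir <index_folder> mouse_reference` could be run once all files are in place, where `<index_folder>` is whichever folder holds the index.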

Link to original dataset

https://doi.org/10.5281/zenodo.7457660

Keyword

genome indexes, mouse, mm39

University of Helsinki Data catalogue
