Genome indexes for Mus musculus (mm39)
Restricted Availability
Date
2022-12-20
Publication Type
dataset
Description
BUILDING HISAT2 INDEXES IN CSC
This record describes how the indexes were built for the house mouse genome (mm39). Genome indexing requires a large amount of memory and may not be feasible on a laptop. The genome indexes for Mus musculus (mm39) were created using HISAT2 v2.2.1 on Puhti at CSC (IT Center for Science).
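1. Create a folder for the conda environment, install the required packages into it with `conda-containerize` (provided by the `tykky` module, which must be loaded first with `module load tykky`), and add the environment's bin directory to the PATH.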
mkdir STRTN-env
conda-containerize new --prefix STRTN-env STRTN-env.yml
export PATH="<install_dir>/STRTN-env/bin:$PATH"
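To confirm that the export took effect, a small helper like the following can report which commands are still not found on the PATH. This is only a sketch: the function name `check_tools` is hypothetical, and the list of tools to check is whatever your `STRTN-env.yml` is expected to install.

```shell
# Report any of the named commands that cannot be found on PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the steps in this description, a call could look like `check_tools hisat2-build hisat2_extract_splice_sites.py hisat2_extract_exons.py picard unpigz ruby`.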
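2. Load the required modules and set `TMPDIR` for R in `~/.Renviron`. Any existing `TMPDIR` line is removed first; `${WorkingDir_PATH}` stands for your working directory and must be set beforehand.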
module load tykky
export PATH="<install_dir>/STRTN-env/bin:$PATH"
module load r-env
if test -f ~/.Renviron; then
sed -i '/TMPDIR/d' ~/.Renviron
fi
echo "TMPDIR=${WorkingDir_PATH}" >> ~/.Renviron
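The `~/.Renviron` update above can also be wrapped in a function so that it can be re-run safely and tried on a scratch file first. This is a sketch under assumptions: the function name `set_renviron_tmpdir` is hypothetical, and it uses `grep -v` with a temporary file instead of `sed -i`.

```shell
# Replace any existing TMPDIR line in an Renviron-style file with a new value.
# $1 = file to edit, $2 = directory to use as TMPDIR.
set_renviron_tmpdir() {
    local file="$1" dir="$2"
    touch "$file"
    # Keep every line that does not mention TMPDIR, then append the new setting.
    grep -v 'TMPDIR' "$file" > "$file.tmp" || true
    echo "TMPDIR=$dir" >> "$file.tmp"
    mv "$file.tmp" "$file"
}
```

A call such as `set_renviron_tmpdir ~/.Renviron "${WorkingDir_PATH}"` leaves exactly one `TMPDIR` line in the file, however many times it is run.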
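3. Obtain the reference genome sequence and the ERCC spike-in sequences. The commands below download mm39, drop the unplaced `chrUn_` contigs, and append the ERCC spike-in sequences (NIST SRM 2374). You may also add the ribosomal DNA repetitive unit for human (U13369) or mouse (BK000964).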
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
unpigz -c mm39.fa.gz | ruby -ne '$ok = $_ !~ /^>chrUn_/ if $_ =~ /^>/; puts $_ if $ok' > mouse_reference.fasta
wget https://tsapps.nist.gov/srmext/certificates/documents/SRM2374_putative_T7_products_NoPolyA_v2.FASTA
cat SRM2374_putative_T7_products_NoPolyA_v2.FASTA >> mouse_reference.fasta
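If Ruby is not available, the `chrUn_` filter can be sketched with awk instead. The function name `drop_chrUn` is hypothetical; the logic mirrors the Ruby one-liner: each header line decides whether the record that follows is kept.

```shell
# Drop every FASTA record whose header starts with ">chrUn_".
# Reads FASTA on stdin and writes the filtered FASTA to stdout.
drop_chrUn() {
    awk '/^>/ { keep = ($0 !~ /^>chrUn_/) } keep'
}
```

It would be used as `unpigz -c mm39.fa.gz | drop_chrUn > mouse_reference.fasta`. Only headers beginning with `>chrUn_` are matched, so any other sequences in the file are kept.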
4. Extract splice sites and exons from the annotation file. Here we used wgEncodeGencodeBasicVM30 as the annotation file; entries on `chrUn` contigs are removed to match the reference built in step 3. Optionally, you can also run `hisat2_extract_snps_haplotypes_UCSC.py` to extract SNPs and haplotypes from a dbSNP file for human and mouse.
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/database/wgEncodeGencodeBasicVM30.txt.gz
unpigz -c wgEncodeGencodeBasicVM30.txt.gz | hisat2_extract_splice_sites.py - | grep -v ^chrUn > splice_sites.txt
unpigz -c wgEncodeGencodeBasicVM30.txt.gz | hisat2_extract_exons.py - | grep -v ^chrUn > exons.txt
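Before building the index, it may be worth checking that both output files actually contain entries and that no `chrUn` lines slipped through. The helper below is a sketch; the name `check_sites_file` is hypothetical.

```shell
# Fail if the given file is missing, empty, or still has lines starting with chrUn.
check_sites_file() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "$file is missing or empty" >&2
        return 1
    fi
    if grep -q '^chrUn' "$file"; then
        echo "$file still contains chrUn entries" >&2
        return 1
    fi
    # Report the number of entries found.
    echo "$file: $(wc -l < "$file") lines"
}
```

For example: `check_sites_file splice_sites.txt && check_sites_file exons.txt`.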
5. Build the HISAT2 index. This writes a set of files that share one basename and differ in their suffixes. Here, `mouse_reference.1.ht2`, `mouse_reference.2.ht2`, ..., `mouse_reference.8.ht2` are generated inside `mouse_index/`. In this case, `mouse_reference` is the basename used for `-i, --index`.
mkdir -p mouse_index
hisat2-build mouse_reference.fasta --ss splice_sites.txt --exon exons.txt mouse_index/mouse_reference
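Once `hisat2-build` has finished, the eight index files named above can be checked with a short loop. The function name `check_ht2_index` is hypothetical; it takes the index basename including its folder.

```shell
# Succeed only if <basename>.1.ht2 ... <basename>.8.ht2 all exist.
check_ht2_index() {
    local base="$1" i
    for i in 1 2 3 4 5 6 7 8; do
        if [ ! -e "$base.$i.ht2" ]; then
            echo "missing: $base.$i.ht2" >&2
            return 1
        fi
    done
}
```

For the command above, the call would be `check_ht2_index mouse_index/mouse_reference`.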
6. Create the sequence dictionary for the reference and spike-in sequences. It is required by the Picard MergeBamAlignment program, which also needs the original FASTA file (`mouse_reference.fasta` here).
picard CreateSequenceDictionary R=mouse_reference.fasta O=mouse_reference.dict
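A quick way to sanity-check the dictionary is to compare the number of `@SQ` lines it contains with the number of records in the FASTA file, since there should be one `@SQ` line per sequence. The sketch below assumes the hypothetical function name `dict_matches_fasta`.

```shell
# Compare the number of FASTA records with the number of @SQ lines in the dictionary.
# $1 = FASTA file, $2 = sequence dictionary. Succeeds only if the counts agree.
dict_matches_fasta() {
    local fasta="$1" dict="$2" n_fasta n_dict
    n_fasta=$(grep -c '^>' "$fasta")
    n_dict=$(grep -c '^@SQ' "$dict")
    echo "FASTA records: $n_fasta, dictionary @SQ lines: $n_dict"
    [ "$n_fasta" -eq "$n_dict" ]
}
```

For example: `dict_matches_fasta mouse_reference.fasta mouse_reference.dict`.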
7. Put the genome indexes, the genome FASTA file, and the sequence dictionary in the same folder (`mouse_index`, where the indexes were written in step 5).
mv mouse_reference.dict mouse_index/
mv mouse_reference.fasta mouse_index/