45,671 Escherichia coli genomes
Restricted Availability
Date
2024-12-05, 2024-12-05
Publication Type
dataset
Description
This upload contains 45,671 high-quality E. coli assemblies collected from multiple sources with emphasis on improving coverage of commensal E. coli and adding more sequences from underrepresented countries.
The assemblies are compressed using the Assembled Genomes Compressor (AGC), which achieves a more than 10x better compression ratio than plain gzip and allows quick retrieval of one or more sequences. Instructions for decompressing the files are provided below.
Usage
Extracting the AGC archive
Install AGC version 3.0 or newer from bioconda or by downloading a precompiled binary from https://github.com/refresh-bio/agc/releases.
Run `agc listset 45k_E_coli_genomes.agc > 45k_E_coli_genomes_filenames.txt` to write the names of the assemblies in the archive to a file.
To extract a single assembly, run `agc getset -p -o assembly_name.fa 45k_E_coli_genomes.agc assembly_name.fa`.
To extract all assemblies, run `cat 45k_E_coli_genomes_filenames.txt | xargs -I {} agc getset -o {} -p 45k_E_coli_genomes.agc {}`.
Warning: extracting the whole archive this way will take ~240 GB of disk space.
Note: remember to use the `-p` toggle with agc to reduce the runtime.
To parallelise the extraction, run `parallel -j <number of threads> 'agc getset -o {} -t 1 -p 45k_E_coli_genomes.agc {}' < 45k_E_coli_genomes_filenames.txt`.
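When only some of the assemblies are needed, extracting the whole archive is unnecessary. The following is a minimal sketch, assuming a plain-text file with one assembly name per line (for example a hand-picked subset of `45k_E_coli_genomes_filenames.txt`). The function name `extract_subset` and the output directory argument are illustrative and not part of AGC; the `agc getset` call uses the same options as the commands above.

```shell
# Hypothetical helper: extract only the assemblies named in a list file.
# Usage: extract_subset ARCHIVE LIST_FILE OUTPUT_DIR
extract_subset() {
    local archive="$1" list="$2" outdir="$3"
    mkdir -p "$outdir"
    # Read one assembly name per line; also handle a last line
    # without a trailing newline, and skip blank lines.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -z "$name" ] && continue
        # Same invocation as the single-assembly command above;
        # stdin is redirected so agc cannot consume the list.
        agc getset -p -o "$outdir/$name" "$archive" "$name" < /dev/null || return 1
    done < "$list"
}
```

For example, `extract_subset 45k_E_coli_genomes.agc my_subset.txt extracted/` would write one file per listed name into `extracted/`, where `my_subset.txt` is a placeholder for your own list.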
Metadata
The file `45k_E_coli_metadata.tsv` records the following for each assembly: the multilocus sequence type (ST) and clonal complex (CC), assigned using the ecoli#1 scheme in fastmlst v0.0.15; the phylogroup the ST belongs to according to Horesh et al. (2021); and the presence of at least 17 of the 19 pks island genes (pks+) and/or the clbS gene encoding the colibactin self-resistance protein ClbS, both determined using the clbtype scripts.
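The metadata can be used to select assemblies before extracting them. The sketch below filters a tab-separated file by the value of one named column. The column names in `45k_E_coli_metadata.tsv` are not listed here, so the helper looks the column up in the header row rather than hard-coding it. The function name `filter_tsv_by_column` is illustrative, and the column name `ST` and value `131` in the usage example are assumptions to be checked against the actual header.

```shell
# Hypothetical helper: print the header plus all rows of a TSV file whose
# named column equals a given value.
# Usage: filter_tsv_by_column FILE COLUMN VALUE
filter_tsv_by_column() {
    awk -F'\t' -v col="$2" -v val="$3" '
        NR == 1 {
            # Locate the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) {
                print "column not found: " col > "/dev/stderr"
                exit 2
            }
            print
            next
        }
        # Keep data rows where the selected field matches as a string.
        ($idx "") == val
    ' "$1"
}
```

For example, `filter_tsv_by_column 45k_E_coli_metadata.tsv ST 131` would keep the rows for one sequence type, provided the file has a column with that exact name.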
Distribution
You are free to use the assemblies, provided that the sources are cited accordingly.
Citation
This collection was first published as part of the study "Geographical variation in the incidence of colorectal cancer and urinary tract cancer is associated with population exposure to colibactin-producing Escherichia coli" in Lancet Microbe on 5 December 2024, doi: 10.1016/j.lanmic.2024.101015.
Methods
Definition of Escherichia coli
In the context of these files, E. coli is defined as all assemblies belonging to the same 97% pangenome average nucleotide identity (panANI) cluster, which is comparable to the more traditional definition of a bacterial species as a 95% ANI cluster. This definition is consistent with the one provided in GTDB Release 09-RS220 (24th April 2024).
The 97%-panANI clusters were defined by running panaani v0.1.0 on the quality-filtered source data referenced below.
Source data
Assemblies from the following studies were considered for inclusion:
661k genomes collection from Blackwell et al. 2021
Avian E. coli genomes
Bin-assembled E. coli from Bangladesh, Finland, Pakistan, UK, Zimbabwe
Carbapenemase producing E. coli from Norwegian travellers
E. coli genomes from a previous index
E. coli isolated from inpatients in Tanzania
E. coli from various sources in Kenya
E. coli from a one health study in Ghana
E. coli from bloodstream infections in Nigeria
Enteropathogenic E. coli from Sub-Saharan Africa and South Asia
Enteropathogenic E. coli isolated from cattle in South Africa
ESBL positive E. coli from Benin
ESBL producing E. coli and Klebsiella from rats in Guinea
ESBL producing E. coli from European soldiers deployed in Mali
ESBL producing E. coli from Malawi
GTDB representative genomes from Release 214
Livestock E. coli from England
Metagenome-assembled genome from Hadza hunter-gatherers
MGnify chicken gut v1.0
MGnify cow rumen v1.0
MGnify human gut v2.0.1
MGnify human oral v1.0
MGnify human vaginal v1.0
MGnify pig gut v1.0
One health study in the US
Salmonella genomes from water in Lake Victoria
Quality filtering
An assembly was considered high quality if it met all of the following conditions:
checkm v1.2.2 completeness >= 90%
checkm v1.2.2 contamination <= 10%
gunc v1.0.5 chimerism analysis score "pass"
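The three conditions combine as a logical AND. The sketch below illustrates that rule only and is not the pipeline used to build this collection. The function name `passes_quality_filter` is hypothetical, and it assumes the completeness and contamination percentages and the chimerism result have already been read from the checkm and gunc outputs, with the chimerism result given as the literal string `pass`.

```shell
# Hypothetical helper: succeed (exit status 0) only if all three
# quality conditions listed above hold.
# Usage: passes_quality_filter COMPLETENESS CONTAMINATION GUNC_RESULT
passes_quality_filter() {
    awk -v comp="$1" -v cont="$2" -v gunc="$3" 'BEGIN {
        # Adding 0 forces numeric comparison of the percentages.
        ok = (comp + 0 >= 90) && (cont + 0 <= 10) && (gunc == "pass")
        exit (ok ? 0 : 1)
    }'
}
```

For example, `passes_quality_filter 95.2 1.3 pass && echo keep` prints `keep`, while an assembly with 89.9% completeness is rejected.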