45,671 Escherichia coli genomes

dc.contributor.affiliationUniversity of Helsinki-Mäklin, Tommi
dc.contributor.authorMäklin, Tommi
dc.date.accessioned2025-04-29T14:01:05Z
dc.date.issued2024-12-05
dc.description45,671 Escherichia coli genomes

This upload contains 45,671 high-quality E. coli assemblies collected from multiple sources, with emphasis on improving coverage of commensal E. coli and adding more sequences from underrepresented countries. The assemblies are compressed using the Assembled Genomes Compressor (AGC), which achieves a >10x better compression ratio than plain gzip and allows quick retrieval of one or more sequences. Instructions for decompressing the files are provided below.

Usage

Extracting the AGC archive

- Install AGC version 3.0 or newer from bioconda or by downloading a precompiled binary from https://github.com/refresh-bio/agc/releases.
- Run `agc listset 45k_E_coli_genomes.agc > 45k_E_coli_genomes_filenames.txt` to list the assemblies in the archive.
- To extract a single assembly, run `agc getset -p -o assembly_name.fa 45k_E_coli_genomes.agc assembly_name.fa`.
- To extract all assemblies, run `cat 45k_E_coli_genomes_filenames.txt | xargs -I {} agc getset -o {} -p 45k_E_coli_genomes.agc {}`.
  Warning: extracting the whole archive this way will take ~240GB of disk space.
  Note: remember to use the `-p` toggle with agc to reduce the runtime.
- To parallelise the extraction, run `parallel -j <number of threads> 'agc getset -o {} -t 1 -p 45k_E_coli_genomes.agc {}' < 45k_E_coli_genomes_filenames.txt`.

Metadata

The file `45k_E_coli_metadata.tsv` details which multilocus sequence type (ST) and clonal complex (CC) the assemblies belong to, assigned using the ecoli#1 scheme in fastmlst v0.0.15; the phylogroup the ST belongs to according to Horesh et al. (2021); and the presence of at least 17 of the 19 pks island genes (pks+) and/or the clbS gene encoding the colibactin self-resistance protein ClbS, both determined using the clbtype scripts.

Distribution

You are free to use the assemblies, provided that the sources are cited accordingly.

Citation

This collection was first published as part of the study "Geographical variation in the incidence of colorectal cancer and urinary tract cancer is associated with population exposure to colibactin-producing Escherichia coli" in Lancet Microbe on 5 December 2024, doi: 10.1016/j.lanmic.2024.101015.

Methods

Definition of Escherichia coli

In the context of these files, E. coli is defined as all assemblies belonging to the same 97% pangenome average nucleotide identity (panANI) cluster, which is comparable to the more traditional definition of a bacterial species as a 95% ANI cluster. This definition is consistent with the one provided in GTDB Release 09-RS220 (24th April 2024). The 97%-panANI clusters were defined by running panaani v0.1.0 on the quality filtered source data referenced below.

Source data

Assemblies from the following studies were considered for inclusion:
- 661k genomes collection from Blackwell et al. 2021
- Avian E. coli genomes
- Bin-assembled E. coli from Bangladesh, Finland, Pakistan, UK, Zimbabwe
- Carbapenemase producing E. coli from Norwegian travellers
- E. coli genomes from a previous index
- E. coli isolated from inpatients in Tanzania
- E. coli from various sources in Kenya
- E. coli from a one health study in Ghana
- E. coli from bloodstream infections in Nigeria
- Enteropathogenic E. coli from Sub-Saharan Africa and South Asia
- Enteropathogenic E. coli isolated from cattle in South Africa
- ESBL positive E. coli from Benin
- ESBL producing E. coli and Klebsiella from rats in Guinea
- ESBL producing E. coli from European soldiers deployed in Mali
- ESBL producing E. coli from Malawi
- GTDB representative genomes from Release 214
- Livestock E. coli from England
- Metagenome-assembled genome from Hadza hunter-gatherers
- MGnify chicken gut v1.0
- MGnify cow rumen v1.0
- MGnify human gut v2.0.1
- MGnify human oral v1.0
- MGnify human vaginal v1.0
- MGnify pig gut v1.0
- One health study in the US
- Salmonella genomes from water in Lake Victoria

Quality filtering

An assembly was considered high-quality if all of the following conditions were met:
- checkm v1.2.2 completeness >= 90%
- checkm v1.2.2 contamination <= 10%
- gunc v1.0.5 chimerism analysis score "pass"
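
The extraction commands in the Usage part of the description can be adapted to pull out only a subset of assemblies, which avoids the ~240GB needed for the full archive. The following is a minimal sketch, assuming `agc` version 3.0 or newer is on the PATH. The function name `extract_subset`, the list file, and the output directory are hypothetical and not part of the original record; only the `agc getset -p -o` invocation is taken from the description.

```shell
# Sketch: extract a chosen subset of assemblies from the AGC archive.
# Assumptions (not from the original record): the function name, the
# list file of wanted names, and the output directory are illustrative.

extract_subset() {
    local archive="$1"   # e.g. 45k_E_coli_genomes.agc
    local wanted="$2"    # text file with one assembly name per line
    local outdir="$3"    # directory that receives the extracted files
    local name

    mkdir -p "$outdir" || return 1

    # Read names line by line; the extra test keeps a final line that
    # lacks a trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        # Skip blank lines in the list.
        [ -n "$name" ] || continue
        # Same call as in the description: -p and -o <output file>.
        # stdin is redirected so the loop's input list is left untouched.
        agc getset -p -o "$outdir/$name" "$archive" "$name" < /dev/null || {
            echo "failed to extract: $name" >&2
            return 1
        }
    done < "$wanted"
}
```

For example, after copying a few names from `45k_E_coli_genomes_filenames.txt` into a file such as `wanted.txt` (a hypothetical name), `extract_subset 45k_E_coli_genomes.agc wanted.txt subset` would write one file per listed name into `subset/`.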
dc.identifierhttps://doi.org/10.5281/zenodo.13374348
dc.identifier.urihttps://datakatalogi.helsinki.fi/handle/123456789/4723
dc.rights.licensecc-by-4.0
dc.title45,671 Escherichia coli genomes
dc.typedataset