BBS phase 1 & phase 2 high quality E. coli bin assembled genomes
Restricted Availability
Date
2024-11-07
Publication Type
dataset
Description
1,402 Escherichia coli bin assembled genomes (BAGs) derived from metagenome data collected as part of the BabyBiome study (BBS) phase 1 & phase 2.
The data in this upload was first published as part of "Group 2 and 3 ABC-transporter dependant K-antigen loci contribute significantly to variation in the invasive potential of Escherichia coli" (Gladstone et al. 2024, to be released).
Files
Assembly data:
BBS_E_coli_BAGs.tar: Archive containing sequences of the 1,402 bin assembled genomes.
BBS_E_coli_metadata.tsv: Table linking the sequence assemblies to the subject data.
Capsule predictions:
BBS_E_coli_Kaptive_output.csv: Capsule predictions for all sequence data.
BBS_E_coli_deduplicated_sequences_IDs.txt: Filenames for assemblies that constitute the 873 deduplicated sequences analysed in Gladstone et al. 2024.
Quality control data:
BBS_E_coli_demix_check_scores.tsv: Output from demix_check for the sequence assemblies.
BBS_E_coli_checkm_results.tsv: Output from checkm.
BBS_E_coli_gunc_results.tsv: Output from gunc.
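As a rough sketch of how the files above might be used together, the following Python snippet extracts only the assemblies listed in BBS_E_coli_deduplicated_sequences_IDs.txt from BBS_E_coli_BAGs.tar. It assumes that the ID file holds one assembly filename per line and that these names match the base names of the archive members; neither detail is confirmed by this record, so inspect the files before relying on it. The function names `read_ids` and `extract_subset` are illustrative only.

```python
import os
import tarfile


def read_ids(path):
    """Return the set of non-empty, whitespace-stripped lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def extract_subset(tar_path, wanted, out_dir):
    """Extract regular files whose base name is in `wanted`.

    Assumption: the names in `wanted` match the base names of the archive
    members. Returns the sorted list of names that were actually found.
    """
    os.makedirs(out_dir, exist_ok=True)
    found = []
    with tarfile.open(tar_path) as archive:
        for member in archive:
            name = os.path.basename(member.name)
            if member.isfile() and name in wanted:
                # Write under the base name only, so that paths stored in
                # the archive cannot place files outside out_dir.
                source = archive.extractfile(member)
                with open(os.path.join(out_dir, name), "wb") as target:
                    target.write(source.read())
                found.append(name)
    return found if found == sorted(found) else sorted(found)
```

A call such as `extract_subset("BBS_E_coli_BAGs.tar", read_ids("BBS_E_coli_deduplicated_sequences_IDs.txt"), "deduplicated")` would then be expected to yield the 873 deduplicated assemblies if the assumptions hold; comparing the length of the returned list with the number of listed IDs is a simple way to check.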
Methods
Bin assembled genomes
Source data:
BBS phase 1: Shao et al. 2019
BBS phase 2: Shao et al. 2024
The data was produced using the mSWEEP and mGEMS pipeline (Mäklin et al. 2020 & Mäklin et al. 2021) following the steps described in Khawaja, Mäklin, Kallonen, et al. 2024.
Quality control
The BAGs in this upload were filtered with demix_check (https://github.com/harry-thorpe/demix_check) and only those with a quality score of 1 or 2 are included. For the capsule type annotations, contigs shorter than 5,000 bp were removed (the short contigs are still present in the uploaded files). Further QC data is available from the checkm (Parks et al. 2015) and gunc (Orakov et al. 2022) results.
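Because the uploaded assemblies still contain the short contigs, anyone wishing to reproduce the input used for the capsule type annotations would need to apply the length filter themselves. The snippet below is a minimal sketch of such a filter, not the tool used to prepare this dataset (which is not named here). It assumes uncompressed FASTA files and writes each retained sequence on a single line; `read_fasta`, `filter_short_contigs` and the `min_length` parameter are hypothetical names chosen for illustration.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from an uncompressed FASTA file."""
    header, chunks = None, []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_short_contigs(in_path, out_path, min_length=5000):
    """Copy contigs of at least `min_length` bp to `out_path`.

    Contigs shorter than `min_length` are dropped, matching the
    "shorter than 5,000 bp were removed" rule. Returns (kept, dropped).
    """
    kept = dropped = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for header, sequence in read_fasta(in_path):
            if len(sequence) >= min_length:
                out.write(f">{header}\n{sequence}\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

The function returns the number of contigs kept and dropped, which can be recorded per assembly to document how much sequence the filter removed.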
Multilocus sequence typing
Sequence type (ST) was determined using fastmlst (Guerrero-Araya et al. 2021) with the `ecoli#1` database.
PopPUNK clustering
Sequence clusters (SC) correspond to the database available from https://zenodo.org/records/12528310 and were created using PopPUNK (Lees et al. 2019). Construction is described in Khawaja, Mäklin, Kallonen, et al. 2024.
Capsule type annotations
The capsule type annotations were created using Kaptive (Lam et al. 2022) with an E. coli-specific database, available from https://github.com/rgladstone/EC-K-typing and described in Gladstone et al. 2024.