Digital Microbe: A Genome-Informed Data Integration Framework for Collaborative Research on Emerging Model Organisms

Iva Veseli, Zachary S. Cooper, Michelle A. DeMers, Matthew S. Schechter, Samuel Miller, Laura Weber, Christa B. Smith, Lidimarie T. Rodriguez, William F. Schroer, Matthew R. McIlvin, Paloma Z. Lopez, Makoto Saito, Sonya Dyhrman, A. Murat Eren, Mary Ann Moran, Rogier Braakman

Preprint posted on 17 January 2024 https://www.biorxiv.org/content/10.1101/2024.01.16.575828v1

Enabling team science by sharing and curating model organism data more efficiently. “Digital Microbes” pave the way for interoperability, reproducibility, and collaborative microbial science.

Selected by Benjamin Dominik Maier, Jennifer Ann Black

Background

The ease with which whole genomes can now be sequenced has led to an exponential increase in the number of microbes developed as model organisms in microbiological research. This surge, coupled with insights from experimental, modeling, and field studies, has yielded a vast amount of knowledge relevant to microbial physiology, ecology, and biogeochemistry. However, this wealth of data has also brought forth a need for innovative solutions to manage and collaborate effectively.

While various digital resources for model microbe data analysis and exchange exist, they often fall short in fostering a team science approach. Existing solutions lack intermediate data products essential for the coordination of downstream analyses and tend to be centralised, posing vulnerabilities to loss of funding and a lack of researcher control.

This preprint proposes the concept of “Digital Microbes” as a solution to address the shortcomings of existing platforms. The Digital Microbe concept revolves around a curated and versioned public data package designed for collaborative research. The authors use two model organisms, Ruegeria pomeroyi DSS-3 and Alteromonas, to demonstrate the relevance and potential of their framework. Overall, Digital Microbes can provide an architecture for reproducible, open, and extensible collaborative work in microbiology.

What we like about this preprint

  • Any (as the authors put it) “framework for integrative collaborative team science” is highly desirable as a constructive way to move science forward.
  • The framework allows new data and data types to be added iteratively, with changes tracked under version control, while maintaining self-containment, flexibility and reproducibility.
  • The ability of this framework to also incorporate data from any organism, including non-model organisms, is valuable, allowing researchers the flexibility to investigate their data irrespective of their organism of study.
  • The platform contains flexible workflows, and the authors are working on expanding the toolbox, for example by adding a toolkit for metabolic modeling.
  • Since the framework isn’t dependent on any particular software platform, its overall maintenance should remain viable over an extended period.

 

Posted on: 1 March 2024 , updated on: 5 March 2024

doi: https://doi.org/10.1242/prelights.36514

Author's response

The author team shared

Dear Benjamin Maier, Jennifer Black, and Reinier Prosee:

First, thank you very much for your detailed report on our recent preprint. We’re excited to hear from others who appreciate the challenges of team science in the age of big data (and of scientific reproducibility in general). We see the Digital Microbe framework as a decentralized solution to some of these challenges that is hopefully accessible to microbiologists of multiple backgrounds who work with ‘omics data.

One key point that seems relevant to all of the questions you posed is the fact that this framework is platform-agnostic, both in terms of the software used to generate and work with the Digital Microbe databases, and in terms of how these databases are shared. For practical purposes, our preprint described the implementation of Digital Microbes used by our team (the Center for Chemical Currencies of a Microbial Planet, or C-CoMP; https://ccomp-stc.org), which relies upon the existing open-source software platform anvi’o (https://anvio.org) for database creation and the associated toolkit of analysis programs, as well as the data sharing website Zenodo (https://zenodo.org) for making Digital Microbe data products accessible. Anvi’o is software that many of us already work with and/or develop, so it was a natural choice for our use case, and may be a convenient option for anyone wishing to create and use a Digital Microbe without having to write too much of their own code. However, anvi’o does not represent the only possible implementation of a Digital Microbe, nor is it the only way to interact with these databases (which can be accessed via any programming language that supports SQL). Similarly, a Digital Microbe can be shared in multiple ways: uploaded to a data-sharing site like Zenodo or FigShare, sent via email, or hosted on a personal webpage. We hope that the framework’s inherent flexibility will encourage many researchers to adopt it, and that eventually the community will coalesce around a few robust and highly functional implementations to facilitate efficient collaboration and diverse research workflows (similar to how many people use Microsoft Word or Google Docs for collaborative writing).
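
To illustrate the point that these databases can be read from any language with SQL support, here is a minimal Python sketch using only the standard library. It assumes the data package contains an ordinary SQLite file; the function names and any file path passed in are hypothetical, and nothing here relies on anvi’o itself or on a particular table layout.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open a SQLite file read-only so a shared data package is not modified."""
    # SQLite URI filenames accept a mode=ro query parameter.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def summarize_tables(conn: sqlite3.Connection) -> dict[str, int]:
    """Return {table_name: row_count} for every table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Table names are read from the database itself; quote them defensively.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }
```

A reader could call `summarize_tables(open_read_only("path/to/database.db"))` (placeholder path) to get a first overview of what a downloaded database contains before writing more specific queries.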

The dichotomy between data and the software used to store, access and/or analyze that data can be confusing (as we realized for ourselves while writing the paper about it). We therefore want to clarify the distinction between the following: 1) The Digital Microbe framework, which is a conceptual framework for storing (pan)genome-linked ‘omics data in a consolidated package suitable for collaborative research; 2) the implementation of a Digital Microbe, which could be different across research teams and consists of the software tools used to put the data package together as well as the medium or mechanism used to share it; and 3) the integrated toolkit for analyzing the data within a Digital Microbe, which is a perk of our specific implementation using anvi’o. In less technical terms: 1) The Digital Microbe is a strategy for storing related ‘omics data with clearly-defined, required features that support collaborative work on these data; 2) there are many ways to create and share a Digital Microbe; and 3) we’ve proposed one particular way of making Digital Microbes that includes tools for directly analyzing the data therein. Figure 2 in our preprint demonstrates how these concepts relate to each other. The preprint also describes the required features of a Digital Microbe, which may help to clarify any confusion.

On a related note, another important consideration is that Digital Microbes are meant to be decentralized; that is, any researcher can create and share a Digital Microbe, and there is no central authority responsible for hosting and maintaining these datasets. We recognize the essential role that centralized databases play in supporting science and hosting publicly available data. Yet, in developing this framework, we also considered the limitations of these existing platforms, especially with regard to expanding researcher control of their data, avoiding restrictions on file formats and data types, and avoiding the inherent risk of discontinued funding of centralized resources.

We hope this provides some general context to our answers to your open questions, which you can find below.

1. How will you promote the use of this tool among microbiologists? Will you provide training materials and seek community feedback to refine the Digital Microbe tool? If yes, how will you gather user feedback?

Scientists do not necessarily excel at promoting things, and we are not any different. Thus, our best hope is that this concept will promote itself through peer-reviewed publications that will showcase its utility, from which we already benefit as a group of scientists working together on the same organisms. That said, we did write a blog post on the C-CoMP website that showcases our Ruegeria pomeroyi Digital Microbe and explains how to access the data stored in it, with the hope that it would offer additional means for others to familiarize themselves with this concept. We’ve also discussed the framework in presentations at scientific meetings, and will continue to use that avenue to spread the word about it.

The Digital Microbe framework is platform-agnostic, but we do provide training materials and collect user feedback for our particular implementation that is based on the anvi’o software ecosystem. The anvi’o website hosts many tutorials (https://anvio.org/learn/) and documentation pages (https://anvio.org/help/main/) describing how to work with anvi’o databases, and the aforementioned blog post doubles as a more specific tutorial for working with our R. pomeroyi Digital Microbe. A reproducible workflow describing how each of our Digital Microbes was generated is published alongside the data itself. We would encourage future creators and users of Digital Microbes to publish similar posts and workflows to increase the visibility and reproducibility of their science with this framework.

Additionally, the anvi’o platform has an extensive online community that not only provides support to users via answering questions and offering advice on our Discord channel, but also identifies issues and suggests improvements to the software. We plan to leverage this existing community to gather feedback for refining our implementation of the Digital Microbe framework.

2. How will you ensure that researchers of diverse backgrounds and expertise, engaged in field, experimental, and computational studies can all benefit from this platform?

In general, the flexibility inherent to the implementation and sharing of Digital Microbes means that there can be a variety of ways to utilize the framework, which will hopefully increase its accessibility to researchers of various levels of computational training. Our specific implementation of this framework could be a good starting point for those without the time or skills to implement a Digital Microbe from scratch. Although using anvi’o requires basic knowledge of the command line, the platform also contains an interactive interface for data visualization that is approachable to researchers irrespective of their computational training.

Furthermore, our implementation was designed with input from individuals with primarily field and experimental expertise to ensure that we can accommodate those needs, and the developers of the anvi’o platform welcome feedback and feature requests from researchers who need more support.

3. Despite the clear differences with existing databases (as detailed in Table 1), wouldn’t it be better to merge this tool with existing platforms rather than introducing a new platform?

One of the main arguments against choosing a single platform to host Digital Microbe data and tools for its creation and use is the intention for Digital Microbes to be a decentralized solution for data sharing and collaborative analysis – one that ensures researchers have full control over how the data are formatted and shared, how the data can be analyzed (for instance, via custom analysis pipelines with mutable parameters) or visualized, and how the data can be updated as it evolves through collaborative research over time. This strategy lends itself to the reality of collaborative research on public data, which often requires more flexibility than centralized platforms typically support. For example, a single Digital Microbe could be expanded in different ways as various research groups take unique approaches to analyze it, and these groups could independently host their versions of the original dataset without necessarily updating one centralized version. In cases such as this, the association of Digital Object Identifiers (DOIs) to Digital Microbe data products would ensure that the related Digital Microbes could be linked to the original one. This decentralization also prevents widespread data loss in the event that an existing platform runs out of funding or is deprecated, since the responsibility for ensuring the data remains shareable lies with the various researchers who create Digital Microbes rather than a single centralized database or entity.

Of course, there are also advantages for an existing platform to support the sharing and use of Digital Microbes; in particular, facilitated access to a large variety of Digital Microbes all hosted in the same place, lower barriers to their creation and use via a standardized (and already implemented) infrastructure, and familiarity of users with the existing platform. It would therefore be excellent if some of the current tools would be willing to adapt their systems to accommodate an implementation of the Digital Microbe framework. Some existing platforms that are already designed for collaborative science (for example, KBase) would need only minor changes to be able to do this. We’d be happy to work with existing platforms if they would like to support our implementation of the framework.

4. How will the digital microbes be shared and curated?

How to share a Digital Microbe is up to the research team that creates it. In C-CoMP, we’ve been hosting our Digital Microbe data packages on Zenodo, but there are many other ways to share research data. The research community could curate the data over time with the help of those who are hosting the data. For example, suppose someone makes a Digital Microbe and shares it on GitHub. Someone else could download those data, do some experiments (say, gene knockouts to manually curate functional annotations) and submit a pull request to update the Digital Microbe with the new annotations. As another example, suppose someone sequences new Alteromonas genomes and wants to include them in the C-CoMP Alteromonas pangenome. They could either 1) download our Digital Microbe from Zenodo and combine our data with theirs to regenerate the pangenome, thereby creating a new Digital Microbe that they host elsewhere (hopefully with a reference to the DOI of the original data product), or 2) get in touch with us to update our Digital Microbe to a new version including their newly sequenced genomes.
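
One way to picture the bookkeeping behind such derived or updated versions is a small manifest that records a version label, the DOI of the parent data product, and a checksum per file. The Python sketch below is purely illustrative: the `Manifest` structure, its field names, and the helper functions are assumptions made for this example, not part of the Digital Microbe framework, anvi’o, or Zenodo.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Manifest:
    """Hypothetical provenance record for one version of a data package."""
    name: str
    version: str
    parent_doi: Optional[str]  # DOI of the data product this one derives from
    checksums: dict[str, str] = field(default_factory=dict)  # file name -> SHA-256


def build_manifest(name: str, version: str, parent_doi: Optional[str],
                   files: dict[str, bytes]) -> Manifest:
    """Create a manifest by hashing the content of every file in the package."""
    checksums = {fname: hashlib.sha256(data).hexdigest()
                 for fname, data in files.items()}
    return Manifest(name, version, parent_doi, checksums)


def changed_files(old: Manifest, new: Manifest) -> list[str]:
    """List files that were added, removed, or modified between two versions."""
    all_names = set(old.checksums) | set(new.checksums)
    return sorted(n for n in all_names
                  if old.checksums.get(n) != new.checksums.get(n))
```

With something like this, a group that regenerates a pangenome could publish the new package with `parent_doi` set to the DOI of the original, and collaborators could use `changed_files` to see which files differ between versions.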

a. When would you like researchers to share their data – once it has been peer-reviewed and published or rather before? If the latter, will there be quality checks on the shared datasets?

Considering that the Digital Microbe framework is meant as a tool for collaborative research, these datasets should be shared whenever it is best to facilitate collaboration on the data. Of course, that sharing can initially be limited to collaborators rather than the general public, if desired. And just like most research datasets today, the creators of a Digital Microbe can decide if (and when, and how) to share these data with a broader audience. But in an ideal world, and in the spirit of open science, we would hope that researchers would make their Digital Microbes publicly available, accessible via the internet and referenced via a DOI or URL in any related scientific publications (at or before the time of publication). Given the decentralization and implementation flexibility of Digital Microbes, standardized and/or automated quality checks are not possible unless the platform(s) hosting them create and enforce them.

b. Will there be a need to continue storing and exchanging intermediate data products, or can we rely solely on sharing raw data and standardized pipelines/building blocks, considering the increasing adoption of containerized workflows (e.g. Snakemake/Nextflow with Docker/Singularity containers), which promote reproducibility?

This is a great question. There is immense value in sharing reproducible and/or containerized workflows, and that is a practice that should continue to be encouraged. We offer the following reasons for why sharing intermediate data products is valuable in addition to sharing reproducible workflows:

1) Not everyone has access to the know-how (or computational resources) to run reproducible/containerized workflows, especially on very large datasets.

2) The ability to share intermediate data products is valuable for those whose research workflows go beyond standard pipelines (especially in exploratory research).

3) Intermediate data products provide quick access to results that can be further analyzed by individuals with different research questions.

In short, making intermediate data accessible saves time and resources for everyone at the expense of storing slightly larger data products, so we think it is worth it.

5. How does the digital microbe handle very large data sets and can the graphical interface support scaling for e.g. single-cell and multi-omics studies?

This question seems to be asking about our particular implementation of the framework, which relies on anvi’o. We’ll therefore answer based on our experience working with SQL databases and the anvi’o interactive interface, but note that other implementations of Digital Microbes may be more or less scalable.

Digital Microbe data products, as we have implemented them (i.e., with SQLite), should scale to very large datasets. The databases remain programmatically accessible with extremely large datasets, but the interactive interface can be slow to non-responsive in these cases. It is worth noting, however, that some of the interface issues can be ameliorated when it’s run on an HPC system with a higher memory allocation, and that the interface isn’t crucial for developing insights from integrated genome-linked ‘omics data. So far we haven’t run into any scaling issues that made research with large datasets intractable with the anvi’o platform. However, we will certainly keep an eye on others’ experiences with our implementation, and do our best to address any performance bottlenecks that appear over time.
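
As a rough illustration of programmatic access that stays workable on large tables, the Python sketch below streams query results from a SQLite database in fixed-size chunks instead of loading everything into memory. The helper names and the example query are hypothetical and do not reflect the actual anvi’o table layout.

```python
import sqlite3
from collections.abc import Iterator


def iter_rows(conn: sqlite3.Connection, query: str,
              chunk_size: int = 10_000) -> Iterator[tuple]:
    """Yield rows one by one while fetching them from SQLite in chunks."""
    cursor = conn.execute(query)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        yield from chunk


def streaming_mean(conn: sqlite3.Connection, query: str,
                   chunk_size: int = 10_000) -> float:
    """Average the single numeric column returned by `query` without
    holding the full result set in memory."""
    total, count = 0.0, 0
    for (value,) in iter_rows(conn, query, chunk_size):
        total += value
        count += 1
    if count == 0:
        raise ValueError("query returned no rows")
    return total / count
```

A plain average could of course be computed in SQL directly; the chunked iterator matters when each row needs custom processing in Python that SQL cannot easily express.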

6. In the conclusion, you mention that you currently work on an integrated toolkit for metabolic modeling.

This question also pertains to our particular implementation of Digital Microbes, and more specifically to the associated collection of analysis programs that are part of the anvi’o platform. Other implementations won’t necessarily have an integrated toolkit for metabolic modeling – although we have proposed embedded analysis tools as one of the features of a Digital Microbe, we don’t intend for this to be a rigid, predefined set of analysis capabilities since analytical needs will vary across different research teams.

a. Can you already tell more about it and will it be an encapsulated stand-alone tool or based on established frameworks such as COBREXA?

Until recently, our Digital Microbes (as created using anvi’o) were unequipped to track molecular metabolic data, both from gene functional annotations, such as KEGG reaction predictions associated with orthologs, and metabolomics. Given the flexibility and scope of Digital Microbes, the inclusion of this new data facilitates the study of biochemical networks not only for a single organism in the context of its genome, but also for the shared and differential gene clusters among groups of related organisms in a pangenome, and for a microbial community in a metagenome. Ecological and evolutionary investigations can be extended to the molecular level with the benefit of this integrated framework. The constellation of information that could already be included in our Digital Microbes – such as genomic and pangenomic organization, taxonomic abundances in metagenomes, and transcriptomic and proteomic expression – can thereby be analyzed in the context of biochemical pathways, including KEGG reference pathway maps, and molecular signatures, such as the classes of compounds predicted to be processed by genes given their orthology annotations. Here are a few types of questions we hope this integration can help answer, especially using the new molecular metabolic API in anvi’o. Which organisms in a metagenome produce and consume different classes of compounds, and what are the potential cross-feeding relationships? Which biochemical pathways are shared throughout a pangenome – and can these be used to define a core metabolic model for the clade – as opposed to differing between subclades in the pangenome? Leveraging scalable workflows in anvi’o, how are the metabolic capabilities predicted from genomes distributed across the tree of life?

Digital Microbes with molecular resolution can be applied in many ways to the construction and refinement of metabolic models usable by numerical modeling software such as COBRApy. Anvi’o can translate Digital Microbe data into a metabolic model file in a standard format that can be read by modeling software. Such draft models – which anvi’o can automatically create from any genome, including novel metagenome-assembled genomes – invariably require significant refinement, largely due to missing and overbroad orthology annotations. The “gap-filling” of missing reactions in a model, a key, labor-intensive stage in the construction of any model, can be aided by data integration via Digital Microbes. The contextualization of reactions in KEGG reference modules and pathways facilitates the determination of reaction “gaps” – which pathways are truly encoded by the genome due to overwhelming evidence for reactions versus incomplete pathways that are only represented by enzymes with promiscuous activity; soon anvi’o will be able to print KEGG pathway maps with customizable visual representations of various data, such as the presence and absence of enzyme orthologs in the genome, and proteomic and metabolomic abundances. A variety of other data available in our Digital Microbes can be used to corroborate and gap-fill pathways, including gene synteny, codon usage consistency, and transcriptomic, proteomic, and metabolomic abundances.
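
To make the idea of reaction “gaps” concrete, here is a deliberately simplified Python sketch that compares the steps of a pathway against the set of reactions annotated in a genome and reports which steps are missing. It is a toy presence/absence check under stated assumptions, not the method used by anvi’o or by any modeling tool: the function name is hypothetical, the identifiers in any example input are placeholders, and real gap-filling weighs additional evidence such as synteny and expression data.

```python
def find_pathway_gaps(pathway_steps: list[str],
                      annotated_reactions: set[str]) -> tuple[float, list[str]]:
    """Return (completeness, missing_steps) for one pathway.

    completeness is the fraction of pathway steps with an annotated reaction;
    missing_steps lists the unannotated steps in pathway order.
    """
    if not pathway_steps:
        raise ValueError("pathway has no steps")
    missing = [step for step in pathway_steps if step not in annotated_reactions]
    completeness = 1 - len(missing) / len(pathway_steps)
    return completeness, missing
```

For a four-step pathway in which only two steps have annotated reactions, the function returns a completeness of 0.5 together with the two missing steps, which would then be candidates for manual review against other data in the Digital Microbe.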

We do not aspire to supplant or recapitulate established numerical metabolic modeling tools in anvi’o, but instead to use the novel and deep data integration capabilities of Digital Microbes to produce accurate models and empower complex scientific inquiries.

b. Are users able to embed their own workflows in the platform? If so, how?

Workflows are crucial in reproducible high-throughput bioinformatic analyses, which is why a number of them have been built into anvi’o. The modularity of anvi’o commands also allows the software to be integrated into custom workflows. However, the straightforward interoperability of disparate anvi’o modules hinges on consolidated data storage in platform-agnostic Digital Microbes like the implementation that we have developed. Therefore, while it is certainly possible and encouraged for contributors to add workflows via the anvi’o Snakemake framework, Digital Microbes are a data storage medium that can be plugged into any type of reproducible workflow.
