The author team shared
Dear Benjamin Maier, Jennifer Black, and Reinier Prosee:
First, thank you very much for your detailed report on our recent preprint. We’re excited to hear from others who appreciate the challenges of team science in the age of big data (and of scientific reproducibility in general). We see the Digital Microbe framework as a decentralized solution to some of these challenges that is hopefully accessible to microbiologists of multiple backgrounds who work with ‘omics data.
One key point that seems relevant to all of the questions you posed is the fact that this framework is platform-agnostic, both in terms of the software used to generate and work with the Digital Microbe databases, and in terms of how these databases are shared. For practical purposes, our preprint described the implementation of Digital Microbes used by our team (the Center for Chemical Currencies of a Microbial Planet, or C-CoMP; https://ccomp-stc.org), which relies upon the existing open-source software platform anvi’o (https://anvio.org) for database creation and the associated toolkit of analysis programs, as well as the data sharing website Zenodo (https://zenodo.org) for making Digital Microbe data products accessible. Anvi’o is a software platform that many of us already work with and/or develop, so it was a natural choice for our use case, and may be a convenient option for anyone wishing to create and use a Digital Microbe without having to write too much of their own code. However, anvi’o does not represent the only possible implementation of a Digital Microbe, nor is it the only way to interact with these databases (which can be accessed via any programming language that supports SQL). Similarly, Digital Microbes can be shared in multiple ways: uploaded to a data-sharing site like Zenodo or FigShare, sent via email, or hosted on a personal webpage. We hope that the framework’s inherent flexibility will encourage many researchers to adopt it, and that eventually the community will coalesce around a few robust and highly functional implementations to facilitate efficient collaboration and diverse research workflows (similar to how many people use Microsoft Word or Google Docs for collaborative writing).
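As an illustration of the point that these databases can be read from any language that supports SQL, the following is a minimal sketch using Python's standard `sqlite3` module. It assumes the database is an SQLite file. The table `gene_annotations`, its columns, and its rows are hypothetical stand-ins created only for the demo; they are not the actual anvi’o schema, and the helper names are illustrative.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables visible through an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def preview_table(conn, table, limit=5):
    """Return (column names, first rows) for a table that exists in the database."""
    if table not in list_tables(conn):
        raise ValueError(f"no such table: {table}")
    # The name was checked against the schema above; escape quotes before quoting it.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (limit,))
    columns = [col[0] for col in cursor.description]
    return columns, cursor.fetchall()


def build_demo_database():
    """Create an in-memory stand-in database (hypothetical table and values)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_annotations (gene_id INTEGER, source TEXT, function TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene_annotations VALUES (?, ?, ?)",
        [
            (1, "demo_source", "hypothetical transporter"),
            (2, "demo_source", "hypothetical kinase"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_demo_database()
    print(list_tables(conn))
    print(preview_table(conn, "gene_annotations", limit=2))
    conn.close()
```

With a real data package, `sqlite3.connect(path_to_database_file)` would replace the in-memory demo connection, and `list_tables` is a reasonable first step for discovering what a given implementation actually stores.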
The dichotomy between data and the software used to store, access and/or analyze that data can be confusing (as we realized for ourselves while writing the paper about it). We therefore want to clarify the distinction between the following: 1) The Digital Microbe framework, which is a conceptual framework for storing (pan)genome-linked ‘omics data in a consolidated package suitable for collaborative research; 2) the implementation of a Digital Microbe, which could be different across research teams and consists of the software tools used to put the data package together as well as the medium or mechanism used to share it; and 3) the integrated toolkit for analyzing the data within a Digital Microbe, which is a perk of our specific implementation using anvi’o. In less technical terms: 1) The Digital Microbe is a strategy for storing related ‘omics data with clearly-defined, required features that support collaborative work on these data; 2) there are many ways to create and share a Digital Microbe; and 3) we’ve proposed one particular way of making Digital Microbes that includes tools for directly analyzing the data therein. Figure 2 in our preprint demonstrates how these concepts relate to each other. The preprint also describes the required features of a Digital Microbe, which may help to clarify any confusion.
On a related note, another important consideration is that Digital Microbes are meant to be decentralized; that is, any researcher can create and share a Digital Microbe, and there is no central authority responsible for hosting and maintaining these datasets. We recognize the essential role that centralized databases play in supporting science and hosting publicly available data. Yet, in developing this framework, we also considered the limitations of these existing platforms, especially with regard to expanding researchers’ control of their data, avoiding restrictions on file formats and data types, and avoiding the inherent risk of discontinued funding of centralized resources.
We hope this provides some general context to our answers to your open questions, which you can find below.
1. How will you promote the use of this tool among microbiologists? Will you provide training materials and seek community feedback to refine the Digital Microbe tool? If yes, how will you gather user feedback?
Scientists do not necessarily excel at promoting things, and we are not any different. Thus, our best hope is that this concept will promote itself through peer-reviewed publications that will showcase its utility, from which we already benefit as a group of scientists working together on the same organisms. That said, we did write a blog post on the C-CoMP website that showcases our Ruegeria pomeroyi Digital Microbe and explains how to access the data stored in it, with the hope that it would offer additional means for others to familiarize themselves with this concept. We’ve also discussed the framework in presentations at scientific meetings, and will continue to use that avenue to spread the word about it.
The Digital Microbe framework is platform-agnostic, but we do provide training materials and collect user feedback for our particular implementation that is based on the anvi’o software ecosystem. The anvi’o website hosts many tutorials (https://anvio.org/learn/) and documentation pages (https://anvio.org/help/main/) describing how to work with anvi’o databases, and the aforementioned blog post doubles as a more specific tutorial for working with our R. pomeroyi Digital Microbe. A reproducible workflow describing how each of our Digital Microbes was generated is published alongside the data itself. We would encourage future creators and users of Digital Microbes to publish similar posts and workflows to increase the visibility and reproducibility of their science with this framework.
Additionally, the anvi’o platform has an extensive online community that not only provides support to users via answering questions and offering advice on our Discord channel, but also identifies issues and suggests improvements to the software. We plan to leverage this existing community to gather feedback for refining our implementation of the Digital Microbe framework.
2. How will you ensure that researchers of diverse backgrounds and expertise, engaged in field, experimental, and computational studies can all benefit from this platform?
In general, the flexibility inherent to the implementation and sharing of Digital Microbes means that there can be a variety of ways to utilize the framework, which will hopefully increase its accessibility to researchers of various levels of computational training. Our specific implementation of this framework could be a good starting point for those without the time or skills to implement a Digital Microbe from scratch. Although using anvi’o requires basic knowledge of the command line, the platform also contains an interactive interface for data visualization that is approachable to researchers irrespective of their computational training.
Furthermore, our implementation was designed with input from individuals with primarily field and experimental expertise to ensure that we can accommodate those needs, and the developers of the anvi’o platform welcome feedback and feature requests from researchers who need more support.
3. Despite the clear differences with existing databases (as detailed in Table 1), wouldn’t it be better to merge this tool with existing platforms rather than introducing a new platform?
One of the main arguments against choosing a single platform to host Digital Microbe data and tools for its creation and use is the intention for Digital Microbes to be a decentralized solution for data sharing and collaborative analysis – one that ensures researchers have full control over how the data are formatted and shared, how the data can be analyzed (for instance, via custom analysis pipelines with mutable parameters) or visualized, and how the data can be updated as it evolves through collaborative research over time. This strategy lends itself to the reality of collaborative research on public data, which often requires more flexibility than centralized platforms typically support. For example, a single Digital Microbe could be expanded in different ways as various research groups take unique approaches to analyze it, and these groups could independently host their versions of the original dataset without necessarily updating one centralized version. In cases such as this, the association of Digital Object Identifiers (DOIs) to Digital Microbe data products would ensure that the related Digital Microbes could be linked to the original one. This decentralization also prevents widespread data loss in the event that an existing platform runs out of funding or is deprecated, since the responsibility for ensuring the data remains shareable lies with the various researchers who create Digital Microbes rather than a single centralized database or entity.
Of course, there are also advantages for an existing platform to support the sharing and use of Digital Microbes; in particular, facilitated access to a large variety of Digital Microbes all hosted in the same place, lower barriers to their creation and use via a standardized (and already implemented) infrastructure, and familiarity of users with the existing platform. It would therefore be excellent if some of the current tools would be willing to adapt their systems to accommodate an implementation of the Digital Microbe framework. Some existing platforms that are already designed for collaborative science (for example, KBase) would need only minor changes to be able to do this. We’d be happy to work with existing platforms if they would like to support our implementation of the framework.
4. How will the digital microbes be shared and curated?
How to share a Digital Microbe is up to the research team that creates it. In C-CoMP, we’ve been hosting our Digital Microbe data packages on Zenodo, but there are many other ways to share research data. The research community could curate the data over time with the help of those who are hosting the data. For example, suppose someone makes a Digital Microbe and shares it on GitHub. Someone else could download those data, do some experiments (say, gene knockouts to manually curate functional annotations), and submit a pull request to update the Digital Microbe with the new annotations. As another example, suppose someone sequences new Alteromonas genomes and wants to include them in the C-CoMP Alteromonas pangenome. They could either 1) download our Digital Microbe from Zenodo and combine our data with theirs to regenerate the pangenome, thereby creating a new Digital Microbe that they host elsewhere (hopefully with a reference to the DOI of the original data product), or 2) get in touch with us to update our Digital Microbe to a new version including their newly sequenced genomes.
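To make the idea of linking a derived Digital Microbe back to the original via its DOI more concrete, here is a small sketch of one way such provenance could be recorded and traced. The `DigitalMicrobeRecord` class, its `derived_from` field, and the DOIs are hypothetical placeholders chosen for illustration; this is not a format defined by the framework or by Zenodo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DigitalMicrobeRecord:
    """Minimal provenance record for one shared data package (hypothetical format)."""
    doi: str
    title: str
    derived_from: Optional[str] = None  # DOI of the parent package, if any


def trace_lineage(records, doi):
    """Follow derived_from links from `doi` back to the original data product."""
    by_doi = {record.doi: record for record in records}
    lineage = []
    seen = set()
    current = doi
    while current is not None:
        if current in seen:
            raise ValueError(f"cyclic provenance at {current}")
        if current not in by_doi:
            raise KeyError(f"unknown DOI: {current}")
        seen.add(current)
        lineage.append(current)
        current = by_doi[current].derived_from
    return lineage


if __name__ == "__main__":
    # Placeholder DOIs; none of these refer to real records.
    records = [
        DigitalMicrobeRecord("10.0000/example.original", "Original pangenome"),
        DigitalMicrobeRecord(
            "10.0000/example.expanded",
            "Pangenome with newly sequenced genomes",
            derived_from="10.0000/example.original",
        ),
    ]
    print(trace_lineage(records, "10.0000/example.expanded"))
```

Any group hosting its own version would only need to publish the parent DOI alongside its data for this kind of tracing to work, which fits the decentralized model described above.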
a. When would you like researchers to share their data – once it has been peer-reviewed and published or rather before? If the latter, will there be quality checks on the shared datasets?
Considering that the Digital Microbe framework is meant as a tool for collaborative research, these datasets should be shared whenever it is best to facilitate collaboration on the data. Of course, that sharing can initially be limited to collaborators rather than the general public, if desired. And just like most research datasets today, the creators of a Digital Microbe can decide if (and when, and how) to share these data with a broader audience. But in an ideal world, and in the spirit of open science, we would hope that researchers would make their Digital Microbes publicly available, accessible via the internet and referenced via a DOI or URL in any related scientific publications (at or before the time of publication). Given the decentralization and implementation flexibility of Digital Microbes, standardized and/or automated quality checks are not possible unless the platform(s) hosting them create and enforce them.
b. Will there be a need to continue storing and exchanging intermediate data products, or can we rely solely on sharing raw data and standardized pipelines/building blocks, considering the increasing adoption of containerized workflows (e.g. Snakemake/Nextflow with Docker/Singularity containers), which promote reproducibility?
This is a great question. There is immense value in sharing reproducible and/or containerized workflows, and that is a practice that should continue to be encouraged. We offer the following reasons for why sharing intermediate data products is valuable in addition to sharing reproducible workflows:
1) Not everyone has access to the know-how (or computational resources) to run reproducible/containerized workflows, especially on very large datasets.
2) The ability to share intermediate data products is valuable for those whose research workflows go beyond standard pipelines (especially in exploratory research).
3) Intermediate data products provide quick access to results that can be further analyzed by individuals with different research questions.
In short, making intermediate data accessible saves time and resources for everyone at the expense of storing slightly larger data products, so we think it is worth it.
5. How does the digital microbe handle very large data sets and can the graphical interface support scaling for e.g. single-cell and multi-omics studies?
This question seems to be asking about our particular implementation of the framework, which relies on anvi’o. We’ll therefore answer based on our experience working with SQL databases and the anvi’o interactive interface, but note that other implementations of Digital Microbes may be more or less scalable.
Digital Microbe data products, as we have implemented them (i.e., with SQLite), should scale to very large datasets. The databases remain programmatically accessible with extremely large datasets, but the interactive interface can be slow or even unresponsive in these cases. It is worth noting, however, that some of the interface issues can be ameliorated when it’s run on an HPC with a higher memory allocation, and that the interface isn’t crucial for developing insights from integrated genome-linked ‘omics data. So far we haven’t run into any scaling issues that made research with large datasets intractable with the anvi’o platform. However, we will certainly keep an eye on others’ experiences with our implementation, and do our best to address any performance bottlenecks that appear over time.
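One common way to keep programmatic access tractable on very large SQLite tables is to stream rows in fixed-size chunks instead of loading a whole table into memory. The sketch below shows that pattern with Python's standard `sqlite3` module. The `coverages` table, its columns, and its values are hypothetical and exist only for the demo; they do not reflect the actual anvi’o schema.

```python
import sqlite3


def iter_rows(conn, query, params=(), chunk_size=10_000):
    """Yield rows one at a time while fetching them from SQLite in chunks."""
    cursor = conn.execute(query, params)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        yield from chunk


def mean_coverage_per_sample(conn, chunk_size=10_000):
    """Compute per-sample mean coverage without holding the table in memory."""
    totals, counts = {}, {}
    query = "SELECT sample, coverage FROM coverages"  # hypothetical table
    for sample, coverage in iter_rows(conn, query, chunk_size=chunk_size):
        totals[sample] = totals.get(sample, 0.0) + coverage
        counts[sample] = counts.get(sample, 0) + 1
    return {sample: totals[sample] / counts[sample] for sample in totals}


def build_demo_coverage_db():
    """Create a tiny in-memory stand-in for a large coverage table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE coverages (sample TEXT, gene_id INTEGER, coverage REAL)")
    conn.executemany(
        "INSERT INTO coverages VALUES (?, ?, ?)",
        [("sample_A", 1, 1.0), ("sample_A", 2, 3.0), ("sample_B", 1, 10.0)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_demo_coverage_db()
    print(mean_coverage_per_sample(conn, chunk_size=2))
    conn.close()
```

For a simple aggregate like this one, `SELECT sample, AVG(coverage) ... GROUP BY sample` would push the work into SQLite itself; the streaming pattern is more useful when the per-row logic cannot easily be expressed in SQL.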
6. In the conclusion, you mention that you are currently working on an integrated toolkit for metabolic modeling.
This question also pertains to our particular implementation of Digital Microbes, and more specifically to the associated collection of analysis programs that are part of the anvi’o platform. Other implementations won’t necessarily have an integrated toolkit for metabolic modeling – although we have proposed embedded analysis tools as one of the features of a Digital Microbe, we don’t intend for this to be a rigid, predefined set of analysis capabilities since analytical needs will vary across different research teams.
a. Can you already tell more about it and will it be an encapsulated stand-alone tool or based on established frameworks such as COBREXA?
Until recently, our Digital Microbes (as created using anvi’o) were unequipped to track molecular metabolic data, whether from gene functional annotations (such as KEGG reaction predictions associated with orthologs) or from metabolomics. Given the flexibility and scope of Digital Microbes, the inclusion of these new data facilitates the study of biochemical networks not only for a single organism in the context of its genome, but also for the shared and differential gene clusters among groups of related organisms in a pangenome, and for a microbial community in a metagenome. Ecological and evolutionary investigations can be extended to the molecular level with the benefit of this integrated framework. The constellation of information that could already be included in our Digital Microbes (such as genomic and pangenomic organization, taxonomic abundances in metagenomes, and transcriptomic and proteomic expression) can thereby be analyzed in the context of biochemical pathways, including KEGG reference pathway maps, and molecular signatures, such as the classes of compounds predicted to be processed by genes given their orthology annotations. Here are a few types of questions we hope this integration can help answer, especially using the new molecular metabolic API in anvi’o. Which organisms in a metagenome produce and consume different classes of compounds, and what are the potential cross-feeding relationships? Which biochemical pathways are shared throughout a pangenome (and can these be used to define a core metabolic model for the clade), as opposed to differing between subclades in the pangenome? Leveraging scalable workflows in anvi’o, how are the metabolic capabilities predicted from genomes distributed across the tree of life?
Digital Microbes with molecular resolution can be applied in many ways to the construction and refinement of metabolic models usable by numerical modeling software such as COBRApy. Anvi’o can translate Digital Microbe data into a metabolic model file in a standard format that can be read by modeling software. Such draft models – which anvi’o can automatically create from any genome, including novel metagenome-assembled genomes – invariably require significant refinement, largely due to missing and overbroad orthology annotations. The “gap-filling” of missing reactions in a model, a key, labor-intensive stage in the construction of any model, can be aided by data integration via Digital Microbes. The contextualization of reactions in KEGG reference modules and pathways facilitates the determination of reaction “gaps” – which pathways are truly encoded by the genome due to overwhelming evidence for reactions versus incomplete pathways that are only represented by enzymes with promiscuous activity; soon anvi’o will be able to print KEGG pathway maps with customizable visual representations of various data, such as the presence and absence of enzyme orthologs in the genome, and proteomic and metabolomic abundances. A variety of other data available in our Digital Microbes can be used to corroborate and gap-fill pathways, including gene synteny, codon usage consistency, and transcriptomic, proteomic, and metabolomic abundances.
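The notion of identifying reaction "gaps" can be illustrated with a deliberately simplified sketch: treat each pathway as a flat set of required steps, compare it against the steps annotated in a genome, and report completeness and missing steps. This is an assumption-laden toy, not the anvi’o algorithm. Real KEGG module definitions allow alternative enzymes and optional steps, and the pathway names and step identifiers below are hypothetical placeholders.

```python
def pathway_gaps(pathways, annotated_steps):
    """Report completeness and missing steps for each pathway.

    pathways: dict mapping a pathway name to an iterable of required step IDs.
    annotated_steps: iterable of step IDs annotated in the genome.
    """
    annotated = set(annotated_steps)
    report = {}
    for name, steps in pathways.items():
        required = set(steps)
        if not required:
            raise ValueError(f"pathway {name!r} has no steps")
        present = required & annotated
        report[name] = {
            "completeness": len(present) / len(required),
            "missing": sorted(required - annotated),  # candidate gaps to investigate
        }
    return report


if __name__ == "__main__":
    # Hypothetical pathways and annotations, for illustration only.
    pathways = {
        "pathway_1": ["step_a", "step_b", "step_c", "step_d"],
        "pathway_2": ["step_e"],
    }
    annotated = ["step_a", "step_b", "step_c", "step_e"]
    for name, result in pathway_gaps(pathways, annotated).items():
        print(name, result)
```

In practice, a nearly complete pathway with one missing step is a stronger gap-filling candidate than a pathway represented by a single promiscuous enzyme, and additional evidence such as gene synteny or expression data could then be used to decide whether to fill the gap.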
We do not aspire to supplant or recapitulate established numerical metabolic modeling tools in anvi’o, but instead to use the novel and deep data integration capabilities of Digital Microbes to produce accurate models and empower complex scientific inquiries.
b. Are users able to embed their own workflows in the platform? If so, how?
Workflows are crucial in reproducible high-throughput bioinformatic analyses, which is why a number of them have been built into anvi’o. The modularity of anvi’o commands also allows the software to be integrated into custom workflows. However, the straightforward interoperability of disparate anvi’o modules hinges on consolidated data storage in platform-agnostic Digital Microbes like the implementation that we have developed. Therefore, while it is certainly possible and encouraged for contributors to add workflows via the anvi’o Snakemake framework, Digital Microbes are a data storage medium that can be plugged into any type of reproducible workflow.