
Large-scale analyses of human microbiomes reveal thousands of small, novel genes and their predicted functions

Hila Sberro, Nicholas Greenfield, Georgios Pavlopoulos, Nikos Kyrpides, Ami S Bhatt

Posted on: 23 January 2019, updated on: 24 January 2019

Preprint posted on 13 December 2018

Article now published in Cell at http://dx.doi.org/10.1016/j.cell.2019.07.016

The power of a computational magnifying lens – peering into the diversity of proteins, large and small, encoded by microbes in our bodies reveals many novel small ORFs, their putative functions, and hints about their evolution and diversity

Selected by Ganesh Kadamur

Context: The coding regions of genomes, commonly called Open Reading Frames (ORFs), have long been defined using a seemingly arbitrary minimum length cutoff (typically 50aa). Challenging this paradigm, studies over the past decade have shown widespread translation of stable products that fall below this threshold in both prokaryotes and eukaryotes, with functions in quorum sensing, development and calcium signaling, to name just a few. These proteins are encoded not only in intergenic regions but also within annotated ORFs, and their coding sequences have been named small ORFs (sORFs). However, these studies have mostly relied on well-studied model organisms, so extensive, large-scale surveys have been notably missing.

Methodology: In this preprint, Sberro et al. tackle the question using large publicly available datasets and computational approaches. They mine metagenomes isolated from >250 human subjects as part of the Human Microbiome Project (NIH HMP I-II), running them through a battery of analyses and computational pipelines. Using MetaProdigal, a gene prediction tool adapted to look for sORFs, they identify >2.5 million sORFs. Combining sequence-based and domain analyses, they group these into ~400,000 clusters. To benchmark their methods, they search this set for ~30 well-characterized sORFs from model organisms and, surprisingly, find that almost half of these are absent from the human microbiome. To prune this list of ~400k clusters and increase confidence in their results, the authors use RNAcode, a program that draws on evolutionary and mutational signatures among homologs to narrow the set down to bona fide sORFs. The authors then proceed to functional prediction and categorization of these sORF families using information about taxonomic specialization, sequence-based prediction of subcellular location and comparison to other environmentally sampled metagenomes. In addition, because functionally related prokaryotic genes are commonly clustered together in operons, the genomic neighbourhood of sORF families was also analyzed for functional annotation. Together, this computational tour de force has unearthed many novel sORF families, indicated putative functions and generated a vast body of hypotheses that can now be experimentally tested.

Pipeline for identification and prediction of function of sORFs from human microbiome metagenomic data (taken from Sberro et al., bioRxiv, 2018)
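
To make the effect of a length cutoff concrete, here is a minimal Python sketch. It is not the authors' pipeline (which relies on MetaProdigal and RNAcode); it only scans the forward strand of a DNA string for ATG-to-stop ORFs and then separates those shorter than 50aa from the rest. The function names `find_orfs` and `split_by_cutoff`, the `min_aa` floor and the toy sequence are all assumptions made for illustration.

```python
# Illustrative only: a naive forward-strand ORF scan, NOT the MetaProdigal-based
# pipeline used in the preprint. All names and parameters are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_aa=10):
    """Return (start, end, aa_length) tuples for ATG-to-stop ORFs.

    Only the three forward-strand frames are scanned; `end` is exclusive and
    includes the stop codon. `aa_length` counts codons from ATG up to, but not
    including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                aa_len = (i - start) // 3
                if aa_len >= min_aa:
                    orfs.append((start, i + 3, aa_len))
                start = None  # look for the next ORF in this frame
    return orfs


def split_by_cutoff(orfs, cutoff_aa=50):
    """Separate ORFs shorter than the cutoff (sORF candidates) from the rest."""
    small = [orf for orf in orfs if orf[2] < cutoff_aa]
    large = [orf for orf in orfs if orf[2] >= cutoff_aa]
    return small, large


# Toy sequence (made up): one 12-codon ORF followed by one 60-codon ORF.
toy_dna = "ATG" + "GCT" * 11 + "TAA" + "ATG" + "GCT" * 59 + "TGA"
small_orfs, large_orfs = split_by_cutoff(find_orfs(toy_dna))
print("below cutoff:", small_orfs)
print("at or above cutoff:", large_orfs)
```

An annotation workflow that keeps only the second list would never report the first ORF, which is the blind spot this preprint sets out to address. A real gene finder also models ribosome binding sites, alternative start codons, both strands and coding potential, none of which is attempted here.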

Key Findings:

  • The human microbiome has >4000 sORF families, of which almost 50% are not detected in species from other sequenced microbiomes (soil, water, mouse etc.), highlighting the uniqueness of the microbiota that call our bodies home. In the process, the authors show that ~2400 of the families identified here are present in genomes included in the RefSeq database. However, more than 1000 such families had remained unannotated because of the arbitrary 50aa length cutoff.
  • Some sORF families are more conserved than others. About 20 families are present in >50 species, whereas ~3000 families are found in only 10 or fewer species, suggesting that rapid evolutionary change and specialization are widespread amongst sORFs.
  • A mere 4% of all identified sORF families possess annotated domains, underscoring the breadth of unexplored sequence and structure space amongst sORFs.
  • 13 novel families are highly conserved across microbiomes isolated from different human niches (gut, mouth, skin) and thus likely encode essential housekeeping proteins. Almost half of these families are ribosome-associated proteins. Homologs of these families are also present in non-human microbiome species, supporting the prediction that they play a critical role across phyla.
  • No single protein family is present across all human niches sampled, implying niche-specific evolution of protein sequences and families. It is pertinent to note here that undersampling of donors per niche might bias this interpretation.
  • About a third of novel sORFs are predicted to generate transmembrane and/or secreted proteins. Analysis of their genomic context suggests roles in quorum sensing, toxin-antitoxin systems and inter-cellular communication.
  • Clues from the genomic neighbourhood identify ~200 families as potential phage defense genes with roles in the CRISPR response, and ~600 families that might mediate horizontal gene transfer events.
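
The genomic-neighbourhood reasoning behind the last finding can be sketched in a few lines of Python. This is a simplified illustration of "guilt by association", not the procedure used in the preprint: it assumes each contig is available as an ordered list of `(gene_id, annotation)` pairs, tallies the annotations of the genes flanking an unannotated sORF, and proposes the most frequent one as a hypothesis. The function names, the `window` parameter and the toy contig are hypothetical.

```python
# Illustrative only: a toy "guilt by association" vote over neighbouring genes.
# The data layout, function names and window size are assumptions, not the
# authors' method.
from collections import Counter


def neighbourhood_annotations(genes, target_id, window=2):
    """Count annotations of genes within `window` positions of the target.

    `genes` is a list of (gene_id, annotation_or_None) tuples in the order the
    genes appear on one contig. The target itself and unannotated neighbours
    are skipped.
    """
    ids = [gene_id for gene_id, _ in genes]
    idx = ids.index(target_id)  # raises ValueError if the target is absent
    lo = max(0, idx - window)
    hi = min(len(genes), idx + window + 1)
    return Counter(
        annotation
        for pos, (_, annotation) in enumerate(genes[lo:hi], start=lo)
        if pos != idx and annotation is not None
    )


def suggest_function(genes, target_id, window=2):
    """Return the most common neighbouring annotation, or None if there is none."""
    counts = neighbourhood_annotations(genes, target_id, window)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy contig (made up): an unannotated sORF flanked by CRISPR-associated genes.
toy_contig = [
    ("g1", "transposase"),
    ("g2", "CRISPR-associated"),
    ("sorf1", None),
    ("g3", "CRISPR-associated"),
    ("g4", "ABC transporter"),
]
print(suggest_function(toy_contig, "sorf1"))
```

A vote like this only generates a hypothesis. Strand, intergenic distance and conservation of the neighbourhood across genomes would all need to be taken into account before a prediction could be trusted.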

Why I like the work: This work extends previous findings that show the widespread yet unappreciated translation of small proteins, defined as <50aa in length. While earlier work focused on a small number of well-studied model organisms, this study expands our knowledge of sORF families by orders of magnitude, uncovering uncharacterized protein domains (and thus folds) in the process. It has generated a rich resource ripe for future exploration, with the promise of new antibiotics, tools to interrogate a plethora of cellular processes and possibly innovative ways to design cell-permeable proteins for drug delivery. The section where the authors clearly spell out potential pitfalls of their methods and conclusions is also particularly commendable. Finally, this work is a great showcase of how combining different computational tools and pipelines can yield important insights into novel biology.

Future directions:

  • Mass spectrometry to validate expression of predicted sORFs at the protein level. This could also be complemented by techniques such as ribosome profiling.
  • Re-analysis of the data to recognize co-occurring species in which one partner carries a sORF with a predicted quorum sensing role and the other carries its counterpart: for example, a secreted sORF product that acts as a signal in one species, and a signal-receiving receptor, typically not a sORF, in the other. This could shed light on the communication pathways these species use in specific niches to regulate inter-cellular crosstalk.
  • Genetic studies in a wide range of species to test function. Development of a high-throughput platform to study some families could be especially useful, for example sORFs predicted to play roles in quorum sensing. This would be contingent on the ability to culture the species outside the body and on the development of tools for genetic manipulation.
  • Exploration of the diversity and specialization of sORF families across individuals from different ethnic backgrounds, as research increasingly shows that microbiomes vary widely with diet, environmental factors etc. Such analyses could particularly help identify rapidly evolving sequences that are most important for local adaptation.

Tags: function prediction, microbiome, proteins

doi: https://doi.org/10.1242/prelights.7731

