Large-scale analyses of human microbiomes reveal thousands of small, novel genes and their predicted functions

Hila Sberro, Nicholas Greenfield, Georgios Pavlopoulos, Nikos Kyrpides, Ami S Bhatt

Preprint posted on December 13, 2018 https://www.biorxiv.org/content/early/2018/12/13/494179

The power of a computational magnifying lens – peering into diversity of proteins, large and small, encoded by microbes in our bodies reveals many novel small ORFs, their putative functions and hints on their evolution and diversity

Selected by Ganesh Kadamur

Context: The coding regions of genomes, commonly called Open Reading Frames (ORFs), have long been defined using a seemingly arbitrary minimum length cutoff (typically 50aa). Challenging this paradigm, studies over the past decade have shown widespread translation of stable products that fall below this threshold in both prokaryotes and eukaryotes, with functions in quorum sensing, development and calcium signaling, to name just a few. These proteins, named small ORFs (sORFs), are encoded not only in intergenic regions but also within annotated ORFs. However, these studies have mostly relied on well-studied model organisms, so extensive, large-scale surveys are notably missing.

Methodology: In this preprint, Sberro et al. tackle the question using large publicly available datasets and computational approaches. They mine metagenomes isolated from >250 human subjects as part of the Human Microbiome Project (NIH HMP I-II), running them through a battery of analyses and computational pipelines. Using MetaProdigal, a prediction tool optimized to look for sORFs, they identify >2.5 million sORFs. Combining sequence-based and domain analysis, they group these into ~400,000 clusters. To benchmark their methods, they search this set for ~30 well-characterized sORFs from model organisms and, surprisingly, find that almost half of these are absent from the human microbiome. To prune the list of ~400k clusters and increase confidence in their results, the authors use RNAcode, a program that incorporates evolutionary and mutational signatures amongst homologs, to narrow the set down to bona fide sORFs. The authors then proceed to functional prediction and categorization of these sORF families using taxonomic specialization, sequence-based prediction of intracellular location, and comparison to other environmentally sampled metagenomes. Because functionally related prokaryotic genes are commonly clustered together in operons, the genomic neighbourhood of each sORF family was also analyzed for functional annotation. Together, this computational tour de force has unearthed many novel sORF families, indicated putative functions and generated a vast body of hypotheses that can now be experimentally tested.
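
To make the first step of this pipeline concrete, here is a minimal sketch of what "looking for sORFs" means at the sequence level. It is not MetaProdigal and not the authors' method: it is a naive six-frame scan for ATG-to-stop stretches whose product is at most 50aa long. The function names and the `min_aa` lower bound are assumptions made purely for illustration.

```python
# Naive small-ORF scan; an illustration only, not a substitute for a real gene caller.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_small_orfs(dna, max_aa=50, min_aa=10):
    """Scan both strands in all three frames for ATG...stop stretches.

    Returns a list of (strand, start, end, n_aa) tuples, where start/end are
    0-based, end-exclusive coordinates on the scanned strand and n_aa counts
    codons from the start codon up to, but not including, the stop codon.
    min_aa is a hypothetical lower bound used here to suppress trivial hits.
    """
    dna = dna.upper()
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    n_aa = (i - start) // 3
                    if min_aa <= n_aa <= max_aa:
                        hits.append((strand, start, i + 3, n_aa))
                    start = None  # look for the next ORF in this frame
    return hits


# A toy 12-codon ORF is reported; a 61-codon ORF exceeds the 50aa cap and is not.
short_orf = "ATG" + "GCT" * 11 + "TAA"
long_orf = "ATG" + "GCT" * 60 + "TAA"
print(find_small_orfs(short_orf))
print(find_small_orfs(long_orf))
```

A scan like this produces enormous numbers of spurious candidates on real metagenomes, which is why the downstream clustering and conservation-based filtering described above matter so much.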

Pipeline for identification and prediction of function of sORFs from human microbiome metagenomic data (taken from Sberro et al., bioRxiv, 2018)

Key Findings:

  • The human microbiome has >4000 sORF families, of which almost 50% are not detected in other sequenced microbiomes (soil, water, mouse, etc.), highlighting the uniqueness of the microbiota that call our bodies home. In the process, the authors show that ~2400 of the families identified here are present in genomes included in the RefSeq database. However, more than 1000 of these families had remained unannotated because of the arbitrary 50aa length cutoff.
  • Some sORF families are more conserved than others. About 20 families are present in >50 species, whereas ~3000 families are found in only 10 or fewer species, suggesting that rapid evolutionary change and specialization are widespread amongst sORFs.
  • A mere 4% of all identified sORF families possess annotated domains, underscoring the breadth of unexplored sequence and structure space amongst sORFs.
  • 13 novel families are highly conserved across microbiomes isolated from different human niches (gut, mouth, skin) and thus likely encode essential housekeeping proteins. Almost half of these families are ribosome-associated proteins. Their homologs are also present in non-human microbiome species, supporting the prediction that they play a critical role across phyla.
  • No single protein family is present across all human niches sampled, implying niche-specific evolution of protein sequences and families. It is pertinent to note that undersampling of donors per niche might bias this interpretation.
  • About a third of novel sORFs are predicted to generate transmembrane and/or secreted proteins. Analysis of their genomic context suggests roles in quorum sensing, toxin-antitoxin systems and inter-cellular communication.
  • Clues from the genomic neighbourhood identify ~200 families as potential phage defense genes with roles in the CRISPR response, and ~600 families that might mediate horizontal gene transfer events.
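
The neighbourhood-based reasoning behind the last two findings can be sketched in a few lines. This is a toy illustration of "guilt by association", not the authors' pipeline: the input format (an ordered list of gene IDs with optional annotations for one contig), the `window` size and the function name are all assumptions.

```python
from collections import Counter


def neighbourhood_annotations(genes, sorf_id, window=3):
    """Tally the annotations of genes flanking a given sORF.

    genes:   list of (gene_id, annotation) tuples in genomic order along one
             contig; annotation is None for unannotated genes (assumed format).
    sorf_id: ID of the sORF of interest; must occur in genes.
    window:  number of genes to inspect on each side (hypothetical default).
    """
    ids = [gene_id for gene_id, _ in genes]
    idx = ids.index(sorf_id)
    upstream = genes[max(0, idx - window):idx]
    downstream = genes[idx + 1:idx + 1 + window]
    # Unannotated neighbours carry no functional hint, so they are skipped.
    return Counter(ann for _, ann in upstream + downstream if ann)


# Toy contig: an unannotated sORF sitting between CRISPR-associated genes.
contig = [
    ("g1", "CRISPR-associated"),
    ("g2", "CRISPR-associated"),
    ("sorf1", None),
    ("g3", "CRISPR-associated"),
    ("g4", None),
    ("g5", "transposase"),
]
print(neighbourhood_annotations(contig, "sorf1", window=2).most_common(1))
```

In practice such a tally would be aggregated across all members of a family and all genomes in which they occur, so that a recurring neighbourhood, rather than a single chance adjacency, drives the functional hypothesis.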

Why I like the work: This work extends previous findings that show the widespread yet underappreciated translation of small proteins, defined as <50aa in length. While earlier work focused on a small number of well-studied model organisms, this study expands our knowledge of sORF families by orders of magnitude, uncovering uncharacterized protein domains (and thus folds) in the process. It has generated a rich resource ripe for future exploration, with the promise of new antibiotics, tools to interrogate a plethora of cellular processes and possibly also innovative ways to design cell-permeable proteins for drug delivery. The section where the authors clearly spell out potential pitfalls of their methods and conclusions is also particularly commendable. Finally, this work is a great showcase of how combining different computational tools and pipelines can yield important insights into novel biology.

Future directions:

  • Mass spectrometry to validate expression of predicted sORFs at the protein level. This could be complemented by techniques such as ribosome profiling.
  • Re-analysis of the data to identify co-occurring species in which both partners carry genes with predicted quorum sensing roles: for example, a secreted sORF in one species that acts as a signal, and a receptor for that signal, typically not a sORF, in the other. This could shed light on the communication pathways these species use in specific niches to regulate inter-cellular crosstalk.
  • Genetic studies in a wide range of species to test function. Development of a high-throughput platform to study some families could be especially useful, for example sORFs predicted to play roles in quorum sensing. This would be contingent on the ability to culture the species outside the body and on the development of tools for genetic manipulation.
  • Exploration of the diversity and specialization of sORF families across individuals from different ethnic backgrounds, as research increasingly shows that microbiomes vary widely with diet, environmental factors, etc. Such analyses could particularly help identify rapidly evolving sequences that are most important for local adaptation.

Tags: function prediction, microbiome, proteins

Posted on: 23rd January 2019, updated on: 24th January 2019
