Large-scale analyses of human microbiomes reveal thousands of small, novel genes and their predicted functions
Preprint posted on December 13, 2018 https://www.biorxiv.org/content/early/2018/12/13/494179
The power of a computational magnifying lens: peering into the diversity of proteins, large and small, encoded by microbes in our bodies reveals many novel small ORFs, their putative functions and hints about their evolution and diversity
Ganesh Kadamur
Context: The coding regions of genomes, commonly called Open Reading Frames (ORFs), have long been defined using a seemingly arbitrary minimum length cutoff (typically 50aa). Challenging this paradigm, studies over the past decade have shown widespread translation of stable products that fall below this threshold in both prokaryotes and eukaryotes, with functions in quorum sensing, development and calcium signaling, to name just a few. These proteins are encoded not only in intergenic regions but also within annotated ORFs, and their coding sequences have been named small ORFs (sORFs). However, these studies have mostly relied on well-studied model organisms, so extensive, large-scale studies have been notably missing.
Methodology: In this preprint, Sberro et al. tackle the question using large publicly available datasets and computational approaches. They mine metagenomes isolated from >250 human subjects as part of the Human Microbiome Project (NIH HMP I-II), running them through a battery of analyses and computational pipelines. Using MetaProdigal, a prediction tool optimized to look for sORFs, they identify >2.5 million sORFs. Combining sequence-based and domain analysis, they group these into ~400,000 clusters. To benchmark their methods, they search this set for ~30 well-characterized sORFs from model organisms and, surprisingly, find that almost half of these are absent from the human microbiome. To prune this list of ~400k clusters and increase confidence in their results, the authors use RNAcode, a program that incorporates evolutionary and mutational signatures amongst homologs, to narrow the set down to bona fide sORFs. The authors then proceed to functional prediction and categorization of these sORF families using information about taxonomic specialization, intracellular location predicted from sequence analysis, and comparison to other environmentally sampled metagenomes. In addition, because functionally related prokaryotic genes are commonly clustered together in operons, the genomic neighbourhood of each sORF family was also analyzed for functional annotation. Together, this computational tour de force has unearthed many novel sORF families, indicated putative functions and generated a vast body of hypotheses that can now be experimentally tested.
Pipeline for identification and prediction of function of sORFs from human microbiome metagenomic data (taken from Sberro et al., bioRxiv, 2018)
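To make the idea of calling ORFs below the usual length cutoff concrete, here is a minimal Python sketch. It is a naive six-frame ATG-to-stop scan written for illustration only; it is not MetaProdigal and not the authors' pipeline, and the function names and the `min_aa`/`max_aa` parameters are assumptions introduced here.

```python
# Illustrative only: a naive six-frame scan, NOT the gene caller used in the preprint.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_small_orfs(seq, min_aa=5, max_aa=50):
    """Return (strand, start, end, length_aa) tuples for short ORFs.

    length_aa counts codons from the start codon up to, but excluding, the stop.
    Coordinates are 0-based and relative to the strand that was scanned
    (for "-" hits they refer to the reverse-complemented sequence).
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first in-frame ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    length_aa = (i - start) // 3
                    if min_aa <= length_aa <= max_aa:
                        hits.append((strand, start, i + 3, length_aa))
                    start = None  # look for the next ORF in this frame
    return hits


# Toy sequence (hypothetical): ATG + 6 alanine codons + stop = 7 residues.
toy = "ATG" + "GCT" * 6 + "TAA"
print(find_small_orfs(toy))  # [('+', 0, 24, 7)]
```

A real gene caller additionally models codon usage, alternative start codons and ribosome binding sites, which is why short ORFs called this naively would contain many false positives and need downstream filtering of the kind the authors apply.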
- The human microbiome has >4000 sORF families of which almost 50% are not detected in species from other sequenced microbiomes (soil, water, mouse etc), thus highlighting the uniqueness of the microbiota that call our bodies home. In the process, the authors show that ~2400 families identified here are present in genomes included in the RefSeq database. However, more than 1000 such families had remained unannotated because of the arbitrary 50aa length cutoff.
- Some sORF families are more conserved than others. About 20 families are present in >50 species, whereas ~3000 families are found in only 10 or fewer species, suggesting that rapid evolutionary mutation and specialization are widespread amongst sORFs.
- A mere 4% of all identified sORF families possess annotated domains, underscoring the breadth of unexplored sequence and structure space amongst sORFs.
- 13 novel families are highly conserved across microbiomes isolated from different human niches (gut, mouth, skin) and thus likely encode essential housekeeping proteins. Almost half of these families are ribosome associated proteins – homologs of these families are also present in non-human microbiome species, supporting the prediction of them playing a critical role across phyla.
- No single protein family is present across all human niches sampled, implying niche-specific evolution of protein sequences and families. It is pertinent to note that undersampling of donors per niche might bias this interpretation.
- About a third of novel sORFs are predicted to generate transmembrane and/or secreted proteins. Analysis of their genomic context suggests roles in quorum sensing, toxin-antitoxin systems and inter-cellular communication.
- Clues from the genomic neighbourhood identify ~200 families as potential phage defense genes with roles in CRISPR response, and ~600 families that might mediate horizontal gene transfer events.
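The conservation comparison in the findings above (families seen in >50 species versus those seen in 10 or fewer) amounts to a simple tally once each family is mapped to the species it occurs in. The sketch below assumes such a mapping is already available as a dictionary; the function name, the bin labels and the toy data are hypothetical and not taken from the preprint.

```python
from collections import Counter


def bin_families_by_species_count(family_to_species, narrow_max=10, broad_min=50):
    """Count families that are narrow (<= narrow_max species),
    broad (> broad_min species) or intermediate (everything in between)."""
    bins = Counter()
    for species in family_to_species.values():
        n = len(set(species))  # ignore duplicate species entries
        if n <= narrow_max:
            bins["narrow"] += 1
        elif n > broad_min:
            bins["broad"] += 1
        else:
            bins["intermediate"] += 1
    return dict(bins)


# Hypothetical toy input: family identifiers and species labels are made up.
toy = {
    "famA": [f"sp{i}" for i in range(60)],
    "famB": ["sp1", "sp2"],
    "famC": [f"sp{i}" for i in range(25)],
}
print(bin_families_by_species_count(toy))
```

The thresholds are exposed as parameters because the cutoffs of 10 and 50 species are reporting choices, and the picture may shift as more genomes are sampled.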
Why I like the work: This work extends previous findings that show the widespread yet unappreciated translation of small proteins, defined as <50aa in length. While earlier work was focused on a small number of well studied model organisms, this expands our knowledge of sORF families by orders of magnitude, uncovering uncharacterized protein domains (and thus folds) in the process. This work has generated a rich resource ripe for future exploration, with promise of discovery of new antibiotics, tools that could be developed to interrogate a plethora of cellular processes and possibly also innovative ways to design cell permeable proteins for drug delivery. The section where the authors clearly spell out potential pitfalls of the methods and conclusions of their work is also particularly commendable. Finally, this work is a great showcase of how combining different computational tools and pipelines can yield important insights into novel biology.
- Mass spectrometry to validate expression of predicted sORFs at protein level. This could also be complemented by techniques such as ribosome profiling.
- Re-analysis of the data to identify co-occurring species in which both members carry components with predicted quorum sensing roles: for example, a secreted sORF in one species that acts as a signal transducer, and a signal-receiving receptor (typically not a sORF) in the other. This could shed light on the communication pathways these species use in specific niches to regulate inter-cellular crosstalk.
- Genetic studies in a wide range of species to test function. Development of a high throughput platform to study some families could be especially useful, for example sORFs predicted to play roles in quorum sensing. This would be contingent on the ability to culture the species outside the body and on the development of tools for genetic manipulation.
- Explore diversity and specialization of sORF families across individuals from different ethnic backgrounds, as research increasingly shows microbiomes widely vary based on diet, environmental factors etc. Such analyses could particularly help identify rapidly evolving sequences that are most important for local adaptation.
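As a starting point for the co-occurrence re-analysis suggested above, one simple option is to score every species pair by the Jaccard index of the samples in which each is detected. This is only a sketch under assumptions: the input format (sample identifier mapped to a list of detected species), the function name and the choice of Jaccard as the co-occurrence measure are all introduced here and are not part of the preprint.

```python
from itertools import combinations


def jaccard_cooccurrence(sample_to_species):
    """Return {(species_a, species_b): Jaccard index} over shared samples."""
    # Invert the mapping: species -> set of samples where it was detected.
    species_to_samples = {}
    for sample, species_list in sample_to_species.items():
        for sp in set(species_list):
            species_to_samples.setdefault(sp, set()).add(sample)

    scores = {}
    for a, b in combinations(sorted(species_to_samples), 2):
        sa, sb = species_to_samples[a], species_to_samples[b]
        scores[(a, b)] = len(sa & sb) / len(sa | sb)
    return scores


# Hypothetical toy data: three samples, three species labels.
toy = {"s1": ["A", "B"], "s2": ["A", "B"], "s3": ["A", "C"]}
for pair, score in jaccard_cooccurrence(toy).items():
    print(pair, round(score, 2))
```

High-scoring pairs could then be screened for a predicted secreted sORF in one species and a candidate receptor in the other. Presence/absence co-occurrence alone does not establish interaction, so such pairs are hypotheses for experimental follow-up.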
Posted on: 23rd January 2019, updated on: 24th January 2019