Classification of non-coding variants with high pathogenic impact

Lambert Moyon, Camille Berthelot, Alexandra Louis, Nga Thi Thuy Nguyen, Hugues Roest Crollius

Preprint posted on 3 May 2021 https://www.biorxiv.org/content/10.1101/2021.05.03.442347v1

Article now published in PLOS Genetics at http://dx.doi.org/10.1371/journal.pgen.1010191

A novel machine learning method for prioritizing candidate pathogenic non-coding variants from short read genome sequencing

Selected by Jeffrey Calhoun

Categories: bioinformatics, genetics, genomics

Background to the preprint:

One of the current challenges in clinical genetics is how to interpret genetic variants outside the protein-coding portion of the genome. Previously, the non-coding genome was largely inaccessible because the cost of short read genome sequencing was a significant barrier. The reduced cost of sequencing has now made this feasible, though the infrastructure for data storage and analysis is still mostly limited to gene panels and exome sequencing. For many genetic disorders, non-coding variants are hypothesized to contribute to disease susceptibility, but our relative inability to decode the non-coding genome currently limits the utility of short read genome sequencing. The goal of this preprint is to use machine learning to develop a new bioinformatic pipeline that prioritizes non-coding variants which cause disease or contribute to disease susceptibility, starting from short read genome sequencing data.

Key findings of the preprint:

The authors developed three machine learning models using a positive training set of validated non-coding regulatory variants from the Human Gene Mutation Database (HGMD) and a negative training set of non-coding variants with no clinical significance from the ClinVar database. Importantly, these models include deep annotation of variants, including evolutionary conservation, sequence features such as epigenetic marks, and predicted genic interactions. The authors first developed two extreme models: (1) a ‘Random’ model using random subsampling of negative training set variants and (2) a ‘Local’ model using a subsample of negative training set variants within a short distance (1 kb) of positive training variants. They also developed an intermediate (‘Adjusted’) model using a subsample of negative control variants from the same cytogenetic band as variants present in the positive training set, a method also used by Genomiser (Smedley et al., 2016). Based on ‘10×10’ cross-validation and additional testing, the ‘Adjusted’ model performs well and may generalize, that is, perform well across varied datasets: its classifier performance was strong both on its own negative training set and when randomly subsampled negative training set variants were substituted. The authors renamed the ‘Adjusted’ model FINSURF, or Functional Interpretation of Non-coding Sequences Using Random Forests. In head-to-head comparisons, FINSURF performed well relative to other available methods, including Genomiser and NCBoost.
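To make the difference between the three negative-set strategies concrete, here is a minimal sketch in Python. It is not the authors' code: the `Variant` record, the band labels, the toy coordinates, and the choice to draw one negative per positive within each band are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    band: str  # cytogenetic band label (hypothetical annotation)


def sample_random(negatives, n, rng):
    """'Random': draw n negatives uniformly, ignoring genomic location."""
    return rng.sample(negatives, min(n, len(negatives)))


def sample_local(negatives, positives, window=1000):
    """'Local': keep negatives within `window` bp of a positive on the same chromosome."""
    return [
        neg for neg in negatives
        if any(neg.chrom == p.chrom and abs(neg.pos - p.pos) <= window for p in positives)
    ]


def sample_adjusted(negatives, positives, rng):
    """'Adjusted': draw negatives from the cytogenetic bands that contain positives.

    Assumption: one negative per positive in each band, or fewer if the band
    does not hold enough negatives.
    """
    wanted = Counter((p.chrom, p.band) for p in positives)
    chosen = []
    for key, count in sorted(wanted.items()):
        pool = [neg for neg in negatives if (neg.chrom, neg.band) == key]
        chosen.extend(rng.sample(pool, min(count, len(pool))))
    return chosen


# Toy data, invented for this example.
positives = [Variant("chr1", 10_000, "p36"), Variant("chr2", 50_000, "q11")]
negatives = [
    Variant("chr1", 10_400, "p36"),  # 400 bp from a positive
    Variant("chr1", 90_000, "p36"),  # same band, but far away
    Variant("chr2", 50_900, "q11"),  # 900 bp from a positive
    Variant("chr3", 5_000, "p21"),   # band with no positives
]

rng = random.Random(0)
random_set = sample_random(negatives, 3, rng)
local_set = sample_local(negatives, positives)
adjusted_set = sample_adjusted(negatives, positives, rng)

print("Random:  ", random_set)
print("Local:   ", local_set)
print("Adjusted:", adjusted_set)
```

In this toy example the ‘Local’ set keeps only the two negatives lying within 1 kb of a positive, while the ‘Adjusted’ set can also pick the distant chr1 variant because it shares a band with a positive. Neither can select the chr3 variant, which only the ‘Random’ strategy can reach.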

The authors used K-means clustering to identify multiple clusters in the positive training set and investigated which underlying features contributed to this clustering. In the most prominent cluster, it was clear that evolutionary conservation was the primary contributor, and this cluster contained the highest percentage of true positives. Transcription factor binding site (TFBS) clustering was another important feature driving clustering, sometimes in combination with evolutionary conservation. Other features, including epigenetic marks, gene associations, and CpG island annotation also contributed to clustering.
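As a rough illustration of this kind of analysis, the sketch below runs a plain K-means (Lloyd's algorithm) on a handful of invented feature vectors and reports which feature has the highest mean in each cluster. The feature names, the values, the choice of two clusters, and the fixed starting centroids are assumptions; the preprint's actual feature set and clustering settings are not reproduced here.

```python
from math import dist


def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm: assign points to the nearest centroid, then update centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # Mean of each feature over the cluster members.
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid unchanged
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical, pre-scaled annotations per variant: (conservation, tfbs)
feature_names = ["conservation", "tfbs"]
points = [
    (0.90, 0.10), (0.80, 0.20), (0.95, 0.15),  # mostly conserved
    (0.10, 0.90), (0.20, 0.80), (0.15, 0.85),  # mostly TFBS-driven
]

labels, centroids = kmeans(points, centroids=[points[0], points[3]])

# For each cluster, name the feature with the highest mean value.
dominant = [feature_names[c.index(max(c))] for c in centroids]
print(labels)
print(dominant)
```

Inspecting the cluster centroids in this way is one simple route to asking which annotation characterizes a cluster; it is a simplification of the feature-contribution analysis described above.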

To assess the practicality of using FINSURF in identification of pathogenic non-coding variants in genome sequencing, the authors generated synthetic data including known pathogenic non-coding variants and a typical number of benign variants from a reference donor. Importantly, the pathogenic non-coding variants used here were independent from those used in the training set. The authors built a prioritization pipeline which focused on a subset of the genome (16%) which is either evolutionarily conserved or predicted to be regulatory. They then filtered for variants annotated to have genic interactions with known disease-related genes from the Online Mendelian Inheritance in Man (OMIM) database. For each discrete disease (n=30), there were on average 115 variants present after filtering. The known pathogenic variant was the highest FINSURF scoring variant among the list in 11 instances. In 12 additional instances, the pathogenic variant was present in the top 5 (n=8) or top 10 (n=4) of the variant list sorted by FINSURF score. This analysis suggests that it is feasible within normal clinical genetics workflows to generate a reasonable list of candidate non-coding variants which is highly likely to contain a pathogenic non-coding variant if it is present in the genome.
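The filter-then-rank logic described above can be sketched as follows. This is an illustration under stated assumptions, not the FINSURF pipeline itself: the `Candidate` fields, the gene names, and the scores are hypothetical, and only the counts in the final tally (11, 8, 4, and 30) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    variant_id: str
    in_prioritized_region: bool  # conserved or predicted regulatory subset of the genome
    target_gene: str             # gene the variant is annotated to interact with
    score: float                 # classifier score; higher means more likely pathogenic


def prioritize(candidates, disease_genes):
    """Keep variants in prioritized regions linked to disease genes, best score first."""
    kept = [
        c for c in candidates
        if c.in_prioritized_region and c.target_gene in disease_genes
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)


def rank_of(ranked, variant_id):
    """1-based rank of a variant in the sorted list, or None if it was filtered out."""
    for i, c in enumerate(ranked, start=1):
        if c.variant_id == variant_id:
            return i
    return None


# Toy genome, invented for this example.
candidates = [
    Candidate("benign_a", True, "GENE1", 0.40),
    Candidate("known_pathogenic", True, "GENE1", 0.92),
    Candidate("benign_b", False, "GENE1", 0.99),  # outside the prioritized regions
    Candidate("benign_c", True, "GENE9", 0.95),   # gene not in the disease list
    Candidate("benign_d", True, "GENE2", 0.10),
]
disease_genes = {"GENE1", "GENE2"}  # stand-in for an OMIM-derived gene list

ranked = prioritize(candidates, disease_genes)
rank = rank_of(ranked, "known_pathogenic")
print([c.variant_id for c in ranked], "rank:", rank)

# Tally of the results reported above: rank 1, ranks 2-5, ranks 6-10.
top1, ranks_2_to_5, ranks_6_to_10, n_diseases = 11, 8, 4, 30
within_top10 = top1 + ranks_2_to_5 + ranks_6_to_10
print(f"{within_top10} of {n_diseases} in the top 10 ({within_top10 / n_diseases:.0%})")
```

Read this way, the reported numbers mean the known pathogenic variant landed in the top 10 of the filtered list for 23 of the 30 diseases.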

What you like about the preprint/why you think this new work is important:

Analysis of genome sequencing is easy until it becomes hard. For some individuals, genome sequencing identifies a pathogenic de novo variant that likely would have also been identified by gene panel or exome sequencing. In other cases, it is possible to identify copy number variants (CNVs) overlapping at least one exon of a disease-associated gene, which may have been identified by an array. However, when you finish screening for coding variants and come up empty, you start to wonder if there is a pathogenic non-coding variant present. Finding these variants is one of the major goals of medical genetics in the era of short read genome sequencing but has remained a significant challenge. This preprint is an important next step in identifying candidate or pathogenic non-coding variants from short read genomes. The authors have shared this tool through both a web server and GitHub for installation on local machines, which makes this work accessible to others. It will be very interesting to see over the next few years whether labs can successfully use this approach to identify and validate novel pathogenic non-coding variants. If so, this could provide more incentive to leave the exome era behind and fully take the leap into the genome era.

Questions for the authors:

-What do the final lists of variants by OMIM disease class look like? Is it relatively easy to tell the pathogenic variant apart from the benign variants in those lists? Or is that an additional challenge that will need to be addressed?

-Have you used FINSURF on any ‘real-life’ genomes in addition to the synthetic dataset in the preprint? If so, are you finding that it makes distinguishing relevant non-coding variants from benign ones easier?

-Do you have any advice for individuals like myself who are interested in using FINSURF to prioritize non-coding variants in a short read genome dataset?

Tags: genome, machine learning, non-coding variant, variant interpretation

Posted on: 22 June 2021

doi: https://doi.org/10.1242/prelights.29696
