Classification of non-coding variants with high pathogenic impact

Lambert Moyon, Camille Berthelot, Alexandra Louis, Nga Thi Thuy Nguyen, Hugues Roest Crollius

Preprint posted on 3 May 2021

Article now published in PLOS Genetics at

A novel machine learning method for prioritizing candidate pathogenic non-coding variants from short read genome sequencing

Selected by Jeffrey Calhoun

Background to the preprint:

One of the current challenges in clinical genetics is how to interpret genetic variants outside the protein-coding portion of the genome. Previously, the non-coding genome was generally inaccessible, as the cost of short read genome sequencing was a significant barrier. The reduced cost of sequencing has now made genome-wide analysis feasible, though the infrastructure for data storage and analysis is still mostly limited to gene panels and exome sequencing. For many genetic disorders, non-coding genetic variants are hypothesized to contribute to disease susceptibility, but our relative inability to decode the non-coding genome currently limits the utility of short read genome sequencing. The goal of this preprint is to use machine learning to develop a new bioinformatic pipeline that prioritizes non-coding variants causing disease or contributing to disease susceptibility in short read genome sequencing data.

Key findings of the preprint:

The authors developed three machine learning models using a positive training set of validated non-coding regulatory variants from the Human Gene Mutation Database (HGMD) and a negative training set of non-coding variants with no clinical significance from the ClinVar database. Importantly, these models rely on deep annotation of variants, including evolutionary conservation, sequence features such as epigenetic marks, and predicted genic interactions. The authors first developed two extreme models: (1) a ‘Random’ model using a random subsample of negative training set variants and (2) a ‘Local’ model using a subsample of negative training set variants within a short distance (1 kb) of positive training variants. They also developed an intermediate (‘Adjusted’) model using a subsample of negative control variants from the same cytogenetic band as variants present in the positive training set, a method also used by Genomiser (Smedley et al., 2016). Based on ‘10×10’ cross-validation and additional testing, the ‘Adjusted’ model performs well and appears to generalize, that is, to perform well across different datasets: its classification performance remained strong both on its own negative training set and when that set was replaced by a random subsample of negative training set variants. The authors renamed the ‘Adjusted’ model FINSURF, for Functional Interpretation of Non-coding Sequences Using Random Forests. In head-to-head comparisons, FINSURF performed well relative to other available methods, including Genomiser and NCBoost.
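To make the three negative-set sampling strategies more concrete, here is a minimal Python sketch. It is not the authors' code: the `Variant` record, the function names, and the toy coordinates are all assumptions made for illustration, and the real pipeline may match negatives to positives in more detail than shown here.

```python
import random
from collections import namedtuple

# Hypothetical minimal variant record; 'band' is the cytogenetic band label.
Variant = namedtuple("Variant", ["chrom", "pos", "band"])

def sample_random(negatives, n, rng):
    """'Random' strategy: draw up to n negatives uniformly, ignoring the positives."""
    return rng.sample(negatives, min(n, len(negatives)))

def sample_local(positives, negatives, n, rng, max_dist=1000):
    """'Local' strategy: only negatives within max_dist bp (1 kb) of a positive."""
    near = [
        v for v in negatives
        if any(v.chrom == p.chrom and abs(v.pos - p.pos) <= max_dist
               for p in positives)
    ]
    return rng.sample(near, min(n, len(near)))

def sample_adjusted(positives, negatives, n, rng):
    """'Adjusted' strategy: only negatives in a cytogenetic band that holds a positive."""
    bands = {(p.chrom, p.band) for p in positives}
    same_band = [v for v in negatives if (v.chrom, v.band) in bands]
    return rng.sample(same_band, min(n, len(same_band)))

# Toy data (made up): one positive and three candidate negatives.
positives = [Variant("chr1", 10_000, "p36.33")]
negatives = [
    Variant("chr1", 10_400, "p36.33"),   # within 1 kb and in the same band
    Variant("chr1", 500_000, "p36.33"),  # same band, but far from the positive
    Variant("chr2", 10_200, "p25.3"),    # different chromosome
]

rng = random.Random(0)
local = sample_local(positives, negatives, 10, rng)
adjusted = sample_adjusted(positives, negatives, 10, rng)
```

In this toy example the ‘Local’ sample keeps only the first negative, while the ‘Adjusted’ sample also keeps the distant variant in the same band, which illustrates why the ‘Adjusted’ set sits between the two extremes.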

The authors used K-means clustering to identify multiple clusters in the positive training set and investigated which underlying features contributed to this clustering. In the most prominent cluster, evolutionary conservation was clearly the primary contributor, and this cluster contained the highest percentage of true positives. Transcription factor binding site (TFBS) clustering was another important feature driving clustering, sometimes in combination with evolutionary conservation. Other features, including epigenetic marks, gene associations, and CpG island annotation, also contributed to clustering.

To assess the practicality of using FINSURF to identify pathogenic non-coding variants in genome sequencing data, the authors generated synthetic data combining known pathogenic non-coding variants with a typical number of benign variants from a reference donor. Importantly, the pathogenic non-coding variants used here were independent from those used in the training set. The authors built a prioritization pipeline focused on the subset of the genome (16%) that is either evolutionarily conserved or predicted to be regulatory. They then filtered for variants annotated to have genic interactions with known disease-related genes from the Online Mendelian Inheritance in Man (OMIM) database. For each discrete disease (n=30), an average of 115 variants remained after filtering. The known pathogenic variant was the highest FINSURF-scoring variant in the list in 11 instances. In 12 additional instances, the pathogenic variant was in the top 5 (n=8) or top 10 (n=4) of the variant list sorted by FINSURF score, so it ranked in the top 10 for 23 of the 30 diseases. This analysis suggests that it is feasible within normal clinical genetics workflows to generate a reasonable list of candidate non-coding variants that is highly likely to contain a pathogenic non-coding variant if one is present in the genome.
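The filter-then-rank logic of this prioritization step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys (`in_search_space`, `target_genes`, `score`), the gene names, and the scores are hypothetical and do not reflect FINSURF's actual input or output format.

```python
def prioritize(variants, disease_genes):
    """Keep variants inside the searched regions that are linked to at least
    one disease gene, then sort them by score, highest first."""
    kept = [
        v for v in variants
        if v["in_search_space"] and v["target_genes"] & disease_genes
    ]
    return sorted(kept, key=lambda v: v["score"], reverse=True)

def rank_of(ranked, variant_id):
    """Return the 1-based rank of a variant, or None if it was filtered out."""
    for i, v in enumerate(ranked, start=1):
        if v["id"] == variant_id:
            return i
    return None

# Toy candidates (made up). 'in_search_space' stands for "conserved or
# predicted regulatory"; 'target_genes' for the predicted genic interactions.
variants = [
    {"id": "v1", "score": 0.91, "in_search_space": True,  "target_genes": {"GENE_A"}},
    {"id": "v2", "score": 0.40, "in_search_space": True,  "target_genes": {"GENE_A", "GENE_B"}},
    {"id": "v3", "score": 0.99, "in_search_space": False, "target_genes": {"GENE_A"}},
    {"id": "v4", "score": 0.95, "in_search_space": True,  "target_genes": {"GENE_C"}},
]

ranked = prioritize(variants, disease_genes={"GENE_A"})
print([v["id"] for v in ranked])   # ids of the surviving variants, best first
print(rank_of(ranked, "v1"))       # rank of the hypothetical causal variant
```

Note that the two highest-scoring toy variants are dropped before ranking, one for lying outside the search space and one for having no link to a relevant disease gene. In this sketch, as in the pipeline described above, the filters shape the final list as much as the score does.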

What you like about the preprint/why you think this new work is important:

Analysis of genome sequencing is easy until it becomes hard. For some individuals, genome sequencing identifies a pathogenic de novo variant that likely would have also been identified by gene panel or exome sequencing. In other cases, it is possible to identify copy number variants (CNVs) overlapping at least one exon of a disease-associated gene, which may have been identified by an array. However, when you finish screening for coding variants and come up empty, you start to wonder if there is a pathogenic non-coding variant present. Finding these variants is one of the major goals of medical genetics in the era of short read genome sequencing but has remained a significant challenge. This preprint is an important next step in identifying candidate pathogenic non-coding variants from short read genomes. The authors have shared this tool through both a web server and GitHub for installation on local machines, which makes this work accessible to others. It will be very interesting to see over the next few years whether labs can successfully use this approach to identify and validate novel pathogenic non-coding variants. If so, this could provide more incentive to leave the exome era behind and fully take the leap into the genome era.

Questions for the authors:

-What do the final lists of variants by OMIM disease class look like? Is it relatively easy to tell the pathogenic variant apart from the benign variants in those lists? Or is that an additional challenge that will need to be addressed?

-Have you used FINSURF on any ‘real-life’ genomes in addition to the synthetic dataset in the preprint? If so, are you finding that it makes it easier to distinguish relevant non-coding variants from benign ones?

-Do you have any advice for individuals like myself who are interested in using FINSURF to prioritize non-coding variants in a short read genome dataset?

Tags: genome, machine learning, non-coding variant, variant interpretation

Posted on: 22 June 2021


