Classification of non-coding variants with high pathogenic impact

Lambert Moyon, Camille Berthelot, Alexandra Louis, Nga Thi Thuy Nguyen, Hugues Roest Crollius

Preprint posted on May 03, 2021

A novel machine learning method for prioritizing candidate pathogenic non-coding variants from short read genome sequencing

Selected by Jeffrey Calhoun

Background to the preprint:

One of the current challenges in clinical genetics is how to interpret genetic variants outside of the protein-coding portion of the genome. Previously, the non-coding genome was generally inaccessible because the cost of short-read genome sequencing was a significant barrier. The reduced cost of sequencing has now made genome sequencing feasible, though the infrastructure for data storage and analysis is still mostly limited to gene panels and exome sequencing. For many genetic disorders, non-coding variants are hypothesized to contribute to disease susceptibility, but our relative inability to decode the non-coding genome currently limits the utility of short-read genome sequencing. The goal of this preprint is to use machine learning to develop a new bioinformatic pipeline for prioritizing non-coding variants that cause disease or contribute to disease susceptibility from short-read genome sequencing.

Key findings of the preprint:

The authors developed three machine learning models using a positive training set of validated non-coding regulatory variants from the Human Gene Mutation Database (HGMD) and a negative training set of non-coding variants with no clinical significance from the ClinVar database. Importantly, these models include deep annotation of variants, including evolutionary conservation, sequence features such as epigenetic marks, and predicted genic interactions. The authors first developed two extreme models: (1) a ‘Random’ model using random subsampling of negative training set variants and (2) a ‘Local’ model using a subsample of negative training set variants within a short distance (1 kb) of positive training variants. They also developed an intermediate (‘Adjusted’) model using a subsample of negative control variants from the same cytogenetic band as variants present in the positive training set, a method also used by Genomiser (Smedley et al., 2016). Based on ‘10×10’ cross-validation and additional testing, the ‘Adjusted’ model performs well and appears to generalize, that is, to perform well across different datasets: the classifier performed strongly both against its own negative training set and when randomly subsampled negative variants were substituted. The authors renamed the ‘Adjusted’ model FINSURF, or Functional Interpretation of Non-coding Sequences Using Random Forests. In head-to-head comparisons, FINSURF performed well relative to other available methods such as Genomiser and NCBoost.
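
The three ways of choosing negative training variants can be illustrated with a minimal sketch. This is not the authors' code: the `Variant` record, the `sample_negatives` function, and the toy coordinates are all hypothetical, and the only details taken from the text above are the 1 kb window for the ‘Local’ model and the shared cytogenetic band for the ‘Adjusted’ model.

```python
import random
from collections import namedtuple

# Hypothetical minimal variant record; a real pipeline would carry many more annotations.
Variant = namedtuple("Variant", ["chrom", "pos", "band"])


def sample_negatives(positives, negatives, strategy, n, seed=0, window=1000):
    """Return up to n negative variants chosen under one of three strategies."""
    rng = random.Random(seed)
    if strategy == "random":
        # 'Random': any negative variant is eligible.
        pool = list(negatives)
    elif strategy == "local":
        # 'Local': negatives within `window` bp of some positive on the same chromosome.
        pool = [
            neg for neg in negatives
            if any(neg.chrom == p.chrom and abs(neg.pos - p.pos) <= window
                   for p in positives)
        ]
    elif strategy == "adjusted":
        # 'Adjusted': negatives in the same cytogenetic band as some positive.
        bands = {(p.chrom, p.band) for p in positives}
        pool = [neg for neg in negatives if (neg.chrom, neg.band) in bands]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return rng.sample(pool, min(n, len(pool)))


# Toy data, invented for illustration only.
positives = [Variant("chr1", 10_000, "p36.33")]
negatives = [
    Variant("chr1", 10_500, "p36.33"),   # within 1 kb of the positive, same band
    Variant("chr1", 500_000, "p36.33"),  # same band, but far away
    Variant("chr2", 10_200, "p25.3"),    # different chromosome
]

local = sample_negatives(positives, negatives, "local", n=10)
adjusted = sample_negatives(positives, negatives, "adjusted", n=10)
rand = sample_negatives(positives, negatives, "random", n=2)
```

In this toy case the ‘Local’ pool holds one variant, the ‘Adjusted’ pool holds two, and the ‘Random’ pool holds all three, which shows why the ‘Adjusted’ strategy sits between the two extremes.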

The authors used K-means clustering to identify multiple clusters in the positive training set and investigated which underlying features contributed to this clustering. In the most prominent cluster, it was clear that evolutionary conservation was the primary contributor, and this cluster contained the highest percentage of true positives. Transcription factor binding site (TFBS) clustering was another important feature driving clustering, sometimes in combination with evolutionary conservation. Other features, including epigenetic marks, gene associations, and CpG island annotation also contributed to clustering.
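
As a rough illustration of how K-means can separate variants by the feature that characterizes them, here is a small, self-contained sketch of Lloyd's algorithm. The two features (a conservation score and a TFBS score), the toy values, and the deterministic seeding from the first `k` points are assumptions made for this example; the preprint's actual feature set and clustering setup are much richer.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical (conservation, TFBS) feature pairs for six positive variants.
feature_names = ["conservation", "tfbs"]
points = [
    (0.95, 0.10), (0.10, 0.90), (0.90, 0.20),
    (0.15, 0.80), (0.92, 0.15), (0.05, 0.85),
]

labels, centroids = kmeans(points, k=2)

# The feature with the largest centroid value gives a crude label for each cluster.
dominant = [feature_names[max(range(2), key=lambda i: c[i])] for c in centroids]
```

Reading off the largest coordinate of each centroid is only a crude stand-in for the feature-contribution analysis described above, but it conveys the idea of asking which annotation drives each cluster.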

To assess the practicality of using FINSURF in identification of pathogenic non-coding variants in genome sequencing, the authors generated synthetic data including known pathogenic non-coding variants and a typical number of benign variants from a reference donor. Importantly, the pathogenic non-coding variants used here were independent from those used in the training set. The authors built a prioritization pipeline which focused on a subset of the genome (16%) which is either evolutionarily conserved or predicted to be regulatory. They then filtered for variants annotated to have genic interactions with known disease-related genes from the Online Mendelian Inheritance in Man (OMIM) database. For each discrete disease (n=30), there were on average 115 variants present after filtering. The known pathogenic variant was the highest FINSURF scoring variant among the list in 11 instances. In 12 additional instances, the pathogenic variant was present in the top 5 (n=8) or top 10 (n=4) of the variant list sorted by FINSURF score. This analysis suggests that it is feasible within normal clinical genetics workflows to generate a reasonable list of candidate non-coding variants which is highly likely to contain a pathogenic non-coding variant if it is present in the genome.
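
The filter-then-rank logic of this pipeline can be sketched as follows. Everything named here (`Candidate`, `prioritize`, `rank_of`, the gene labels, and the scores) is hypothetical and chosen for illustration; the scores are made-up numbers, not FINSURF output. Only the counts in the final lines (11, 8, and 4 out of 30 diseases) come from the text above.

```python
from collections import namedtuple

# Hypothetical candidate record: one row per variant in a genome.
Candidate = namedtuple(
    "Candidate", ["variant_id", "in_functional_region", "target_gene", "score"]
)


def prioritize(candidates, disease_genes):
    """Keep variants in conserved/regulatory regions linked to a disease gene, best score first."""
    kept = [
        c for c in candidates
        if c.in_functional_region and c.target_gene in disease_genes
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)


def rank_of(ranked, variant_id):
    """1-based rank of a variant in the sorted list, or None if it was filtered out."""
    for rank, c in enumerate(ranked, start=1):
        if c.variant_id == variant_id:
            return rank
    return None


# Toy genome: one spiked-in pathogenic variant among benign ones (all values invented).
disease_genes = {"GENE_A", "GENE_B"}
candidates = [
    Candidate("benign_1", True, "GENE_A", 0.21),
    Candidate("pathogenic", True, "GENE_B", 0.93),
    Candidate("benign_2", False, "GENE_A", 0.88),  # outside the retained regions
    Candidate("benign_3", True, "GENE_C", 0.75),   # target is not a disease gene
    Candidate("benign_4", True, "GENE_B", 0.40),
]

ranked = prioritize(candidates, disease_genes)
pathogenic_rank = rank_of(ranked, "pathogenic")

# Summary of the results reported above: rank 1 (11), ranks 2-5 (8), ranks 6-10 (4).
n_diseases = 30
within_top10 = 11 + 8 + 4
fraction_top10 = within_top10 / n_diseases
```

Taken together, the reported counts mean that the known pathogenic variant fell within the top 10 candidates for 23 of the 30 diseases, or roughly 77%.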

What you like about the preprint/why you think this new work is important:

Analysis of genome sequencing is easy until it becomes hard. For some individuals, genome sequencing identifies a pathogenic de novo variant that likely would have also been identified by gene panel or exome sequencing. In other cases, it is possible to identify copy number variants (CNVs) overlapping at least one exon of a disease-associated gene, which may have been identified by an array. However, when you finish screening for coding variants and come up empty, you start to wonder whether a pathogenic non-coding variant is present. Finding these variants is one of the major goals of medical genetics in the era of short-read genome sequencing, but it has remained a significant challenge. This preprint is an important next step in identifying candidate or pathogenic non-coding variants from short-read genomes. The authors have shared this tool both as a web server and on GitHub for installation on local machines, which makes this work accessible to others. It will be very interesting to see over the next few years whether labs can successfully use this approach to identify and validate novel pathogenic non-coding variants. If so, this could provide more incentive to leave the exome era behind and fully take the leap into the genome era.

Questions for the authors:

-What do the final lists of variants by OMIM disease class look like? Is it relatively easy to tell the pathogenic variant apart from the benign variants in those lists? Or is that an additional challenge that will need to be addressed?

-Have you used FINSURF on any ‘real-life’ genomes in addition to the synthetic dataset in the preprint? If so, are you finding that it makes distinguishing relevant non-coding variants from benign non-coding variants easier?

-Do you have any advice for individuals like myself who are interested in using FINSURF to prioritize non-coding variants in a short read genome dataset?

Tags: genome, machine learning, non-coding variant, variant interpretation

Posted on: 22nd June 2021

