Classification of non-coding variants with high pathogenic impact
Posted on: 22 June 2021
Preprint posted on 3 May 2021
Article now published in PLOS Genetics at http://dx.doi.org/10.1371/journal.pgen.1010191
A novel machine learning method for prioritizing candidate pathogenic non-coding variants from short read genome sequencing
Selected by Jeffrey Calhoun
Categories: bioinformatics, genetics, genomics
Background to the preprint:
One of the current challenges in clinical genetics is how to interpret genetic variants outside of the protein-coding portion of the genome. Previously, the non-coding genome was generally inaccessible, as the cost of short read genome sequencing was a significant barrier. The reduced cost of sequencing has now made this feasible, though the infrastructure for data storage and analysis is still mostly limited to gene panels and exome sequencing. For many genetic disorders, non-coding genetic variants are hypothesized to contribute to disease susceptibility, but our relative inability to decode the non-coding genome currently limits the utility of short read genome sequencing. The goal of this preprint is to use machine learning to develop a new bioinformatic pipeline for prioritizing non-coding variants that cause disease or contribute to disease susceptibility from short read genome sequencing.
Key findings of the preprint:
The authors developed three machine learning models using a positive training set of validated non-coding regulatory variants from the Human Gene Mutation Database (HGMD) and a negative training set of non-coding variants with no clinical significance from the ClinVar database. Importantly, these models include deep annotation of variants, including evolutionary conservation, sequence features such as epigenetic marks, and predicted genic interactions. The authors first developed two extreme models: (1) a ‘Random’ model using random subsampling of negative training set variants and (2) a ‘Local’ model using a subsample of negative training set variants within a short distance (1 kb) of positive training variants. They also developed an intermediate (‘Adjusted’) model using a subsample of negative control variants from the same cytogenetic band as variants present in the positive training set, a method also used by Genomiser (Smedley et al., 2016). In ‘10×10’ cross-validation and additional testing, the ‘Adjusted’ model performed well and appears to generalize, that is, to perform well across different datasets: classifier performance was strong both on its own negative training set and when randomly subsampled negative training set variants were substituted. The authors renamed the ‘Adjusted’ model FINSURF, or Functional Interpretation of Non-coding Sequences Using Random Forests. In head-to-head comparisons, FINSURF performed well relative to other available methods such as Genomiser and NCBoost.
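The three ways of choosing negative training variants can be illustrated with a minimal sketch. This is not the authors' code: the `Variant` record, the function names, and the toy coordinates and band labels are all invented for illustration, and the ‘Local’ and ‘Adjusted’ functions simply filter negatives rather than subsampling them as the preprint describes.

```python
import random
from collections import namedtuple

# Hypothetical minimal variant record; real pipelines carry many more annotations.
Variant = namedtuple("Variant", ["chrom", "pos", "band"])

def sample_random(negatives, n, seed=0):
    """'Random' strategy: draw n negatives uniformly, ignoring genomic position."""
    rng = random.Random(seed)
    return rng.sample(negatives, min(n, len(negatives)))

def sample_local(positives, negatives, window=1000):
    """'Local' strategy: keep negatives within `window` bp of any positive variant."""
    return [
        neg for neg in negatives
        if any(neg.chrom == pos.chrom and abs(neg.pos - pos.pos) <= window
               for pos in positives)
    ]

def sample_adjusted(positives, negatives):
    """'Adjusted' strategy: keep negatives sharing a cytogenetic band with a positive."""
    positive_bands = {(p.chrom, p.band) for p in positives}
    return [neg for neg in negatives if (neg.chrom, neg.band) in positive_bands]

# Toy values only; these are not real variant coordinates or band assignments.
positives = [Variant("chr1", 10_000, "bandA"), Variant("chr2", 50_000, "bandB")]
negatives = [
    Variant("chr1", 10_400, "bandA"),   # within 1 kb of the first positive
    Variant("chr1", 900_000, "bandA"),  # same band, but far away
    Variant("chr2", 49_500, "bandB"),   # within 1 kb of the second positive
    Variant("chr3", 10_000, "bandC"),   # unrelated chromosome and band
]

random_set = sample_random(negatives, 2)
local_set = sample_local(positives, negatives)
adjusted_set = sample_adjusted(positives, negatives)
print(len(random_set), len(local_set), len(adjusted_set))
```

In this toy example the ‘Adjusted’ set is a superset of the ‘Local’ set, which reflects why it sits between the two extremes: negatives are matched to the genomic neighbourhood of the positives without being restricted to their immediate vicinity.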
The authors used K-means clustering to identify multiple clusters in the positive training set and investigated which underlying features contributed to this clustering. In the most prominent cluster, evolutionary conservation was clearly the primary contributor, and this cluster contained the highest percentage of true positives. Transcription factor binding site (TFBS) clustering was another important feature driving the clusters, sometimes in combination with evolutionary conservation. Other features, including epigenetic marks, gene associations, and CpG island annotation, also contributed.
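To make the clustering step concrete, here is a small sketch of K-means (Lloyd's algorithm) on two invented features, followed by a check of which feature dominates each cluster centroid. The feature names, values, and starting centroids are assumptions for illustration; the preprint's analysis used many more annotations, and real features would normally be scaled before clustering.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on tuples of floats, starting from given centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        centroids = new_centroids
    return labels, centroids

# Hypothetical per-variant features: (conservation score, TFBS overlap score).
feature_names = ["conservation", "tfbs_overlap"]
points = [
    (0.95, 0.10), (0.90, 0.20), (0.92, 0.00),  # conservation-driven variants
    (0.10, 0.90), (0.15, 0.80), (0.05, 1.00),  # TFBS-driven variants
]

# Fixed starting centroids keep the toy example deterministic.
labels, centroids = kmeans(points, centroids=[points[0], points[3]])

# For each cluster, report the feature with the largest centroid value.
dominant = [feature_names[max(range(len(c)), key=lambda i: c[i])] for c in centroids]
print(labels, dominant)
```

Inspecting centroids in this way is one simple route to asking which features drive a cluster; it is offered as an analogy to the analysis described above, not as a reconstruction of it.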
To assess the practicality of using FINSURF in identification of pathogenic non-coding variants in genome sequencing, the authors generated synthetic data including known pathogenic non-coding variants and a typical number of benign variants from a reference donor. Importantly, the pathogenic non-coding variants used here were independent from those used in the training set. The authors built a prioritization pipeline which focused on a subset of the genome (16%) which is either evolutionarily conserved or predicted to be regulatory. They then filtered for variants annotated to have genic interactions with known disease-related genes from the Online Mendelian Inheritance in Man (OMIM) database. For each discrete disease (n=30), there were on average 115 variants present after filtering. The known pathogenic variant was the highest FINSURF scoring variant among the list in 11 instances. In 12 additional instances, the pathogenic variant was present in the top 5 (n=8) or top 10 (n=4) of the variant list sorted by FINSURF score. This analysis suggests that it is feasible within normal clinical genetics workflows to generate a reasonable list of candidate non-coding variants which is highly likely to contain a pathogenic non-coding variant if it is present in the genome.
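The filter-then-rank logic described above can be sketched as follows. The dictionary keys, variant identifiers, and scores are hypothetical and do not reflect FINSURF's actual output format; only the counts in `reported` come from the text.

```python
def rank_of_causal(candidates, causal_id):
    """Return the 1-based rank of causal_id after filtering and sorting by score."""
    # Keep variants in the scored subset of the genome that are also linked
    # to a known disease gene (both flags are hypothetical annotations).
    kept = [c for c in candidates if c["in_scored_region"] and c["omim_gene"]]
    kept.sort(key=lambda c: c["score"], reverse=True)
    for rank, c in enumerate(kept, start=1):
        if c["id"] == causal_id:
            return rank
    return None  # the causal variant was removed by the filters

# Toy candidate list with invented scores.
candidates = [
    {"id": "v1", "score": 0.91, "in_scored_region": True, "omim_gene": True},
    {"id": "v2", "score": 0.97, "in_scored_region": True, "omim_gene": False},
    {"id": "causal", "score": 0.88, "in_scored_region": True, "omim_gene": True},
    {"id": "v3", "score": 0.40, "in_scored_region": False, "omim_gene": True},
    {"id": "v4", "score": 0.35, "in_scored_region": True, "omim_gene": True},
]
causal_rank = rank_of_causal(candidates, "causal")

# Counts reported in the text for the 30 simulated diseases.
reported = {"rank 1": 11, "ranks 2-5": 8, "ranks 6-10": 4}
n_diseases = 30
in_top10 = sum(reported.values())
fraction_top10 = in_top10 / n_diseases
print(causal_rank, in_top10, round(fraction_top10, 2))
```

Summing the reported counts shows that the known pathogenic variant was ranked in the top 10 for 23 of the 30 simulated diseases, or roughly 77%.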
What you like about the preprint/why you think this new work is important:
Analysis of genome sequencing is easy until it becomes hard. For some individuals, genome sequencing identifies a pathogenic de novo variant that likely would have also been identified by gene panel or exome sequencing. In other cases, it is possible to identify copy number variants (CNVs) overlapping at least one exon of a disease-associated gene, which may have been identified by an array. However, when you finish screening for coding variants and come up empty, you start to wonder whether a pathogenic non-coding variant is present. Finding these variants is one of the major goals of medical genetics in the era of short read genome sequencing, but it has remained a significant challenge. This preprint is an important next step in identifying candidate pathogenic non-coding variants from short read genomes. The authors have shared this tool through both a web server and GitHub for installation on local machines, which makes the work accessible to others. It will be very interesting to see over the next few years whether labs can successfully use this approach to identify and validate novel pathogenic non-coding variants. If so, this could provide more incentive to leave the exome era behind and fully take the leap into the genome era.
Questions for the authors:
-What do the final lists of variants by OMIM disease class look like? Is it relatively easy to tell the pathogenic variant apart from the benign variants in those lists? Or is that an additional challenge that will need to be addressed?
-Have you used FINSURF on any ‘real-life’ genomes in addition to the synthetic dataset in the preprint? If so, are you finding that it makes it easier to distinguish relevant non-coding variants from benign ones?
-Do you have any advice for individuals like myself who are interested in using FINSURF to prioritize non-coding variants in a short read genome dataset?
doi: https://doi.org/10.1242/prelights.29696