GeneWalk identifies relevant gene functions for a biological context using network representation learning

Robert Ietswaart, Benjamin M. Gyori, John A. Bachman, Peter K. Sorger, L. Stirling Churchman

Preprint posted on September 05, 2019

Fear The Walking Gene: knowledge-based machine learning is able to highlight gene functions relevant for a distinct biological context

Selected by Ramona Jühlen

Background and GeneWalk methodology

High-throughput functional genomics can provide scientists with a long list of candidate genes that could play a role in the biological context under study. But how should one narrow down such a list to the candidate genes that matter most in that context? Gene functions are commonly interrogated using Gene Ontology (GO) annotations, and GO coupled to gene set enrichment analysis (GSEA) can reveal enriched biological functions in a gene set. This analysis, however, does not address the context-specific functions of individual genes in the dataset. To overcome this shortcoming, the authors developed GeneWalk, a novel approach using knowledge-based machine learning and statistical modelling.

First, GeneWalk assembles a context-specific gene network from a knowledge base (e.g. Pathway Commons, INDRA) starting with a list of input genes obtained from a specific experiment. This gene network is added to a GO network resulting in a full GeneWalk network (GWN) (Figure 1).
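
To make the assembly step concrete, here is a minimal sketch of how a gene network and a GO network could be merged into one graph. It is not GeneWalk's own code: the function name build_network, the node names and all edges are hypothetical placeholders, and a real run would take its gene-gene edges from a knowledge base such as Pathway Commons or INDRA.

```python
from collections import defaultdict


def build_network(gene_edges, annotation_edges, ontology_edges):
    """Merge three undirected edge lists into one adjacency map."""
    adjacency = defaultdict(set)
    for source, target in [*gene_edges, *annotation_edges, *ontology_edges]:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency)


# Placeholder data (hypothetical, for illustration only).
# Gene-gene edges: interactions between the input genes.
gene_edges = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")]
# Gene-GO edges: annotations linking each gene to its GO terms.
annotation_edges = [
    ("GENE_A", "GO:child"),
    ("GENE_B", "GO:child"),
    ("GENE_C", "GO:other"),
]
# GO-GO edges: parent/child relations within the ontology.
ontology_edges = [("GO:child", "GO:parent"), ("GO:other", "GO:parent")]

gwn = build_network(gene_edges, annotation_edges, ontology_edges)
for node in sorted(gwn):
    print(node, sorted(gwn[node]))
```

In this toy network, genes and GO terms are nodes of the same graph, which is what later allows a gene and a GO term to be compared directly.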

Next, the GWN structure is learned by an unsupervised network representation learning algorithm, termed DeepWalk (1). Briefly, random walks scan the local neighbourhood of each node (representing a gene or a GO term); the walks are summarised as a collection of neighbouring node pairs, which serve as the training set for a neural network with one hidden layer (the layer of artificial neurons between input and output) (Figure 1). After training, each input node in the GWN is represented as a vector by the resultant hidden layer weights.
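
The random-walk step can be sketched in a few lines. This is an illustration under assumptions rather than the DeepWalk or GeneWalk implementation: the toy network, the walk length of 5 and the window size of 1 are arbitrary choices, and the neural network training that would consume the node pairs is left out.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Walk up to `length` nodes, choosing a uniformly random neighbour each step."""
    walk = [start]
    for _ in range(length - 1):
        neighbours = sorted(adjacency[walk[-1]])
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk


def node_pairs(walk, window):
    """Pair each node with the nodes at most `window` steps away in the walk."""
    pairs = []
    for i, centre in enumerate(walk):
        start = max(0, i - window)
        stop = min(len(walk), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((centre, walk[j]))
    return pairs


# Hypothetical toy network: two genes, two GO terms.
adjacency = {
    "GENE_A": {"GENE_B", "GO:child"},
    "GENE_B": {"GENE_A", "GO:child"},
    "GO:child": {"GENE_A", "GENE_B", "GO:parent"},
    "GO:parent": {"GO:child"},
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
walks = [random_walk(adjacency, node, 5, rng) for node in sorted(adjacency)]
pairs = [pair for walk in walks for pair in node_pairs(walk, window=1)]
print(len(walks), "walks,", len(pairs), "training pairs")
```

Nodes that often appear close together in walks end up in many shared pairs, which is what pushes their learned vectors towards each other during training.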

Finally, GeneWalk uses significance testing to determine whether the similarity value between a gene and a GO term is higher than expected from a generated null distribution of similarity values (Figure 1). The resulting adjusted p-values rank the context-specific GO terms that are relevant for a gene of interest.
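
The logic of this test can be sketched as follows. Several details are my assumptions, not taken from the preprint highlight: cosine similarity as the similarity measure, an empirical p-value with a +1 correction, and Benjamini-Hochberg as the multiple-testing adjustment. All vectors and null values are made-up placeholders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def empirical_p_value(observed, null_values):
    """Fraction of null similarities at least as large as the observed one."""
    exceed = sum(1 for value in null_values if value >= observed)
    return (exceed + 1) / (len(null_values) + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical gene and GO term vectors plus a small null distribution.
gene_vector = [0.9, 0.1, 0.0]
go_vector = [0.8, 0.2, 0.1]
null_similarities = [0.10, 0.35, 0.20, 0.05, 0.40, 0.15, 0.30, 0.25, 0.00]

observed = cosine_similarity(gene_vector, go_vector)
p_value = empirical_p_value(observed, null_similarities)
print(round(observed, 3), p_value)
```

In practice one p-value would be computed per gene and GO term pair, and benjamini_hochberg would then be applied to the whole list before ranking the GO terms of a gene.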


Figure 1. Scheme of the GeneWalk methodology, outlining network representation learning and significance testing.

Example applications of GeneWalk

To test GeneWalk, the authors first applied it to an already characterised experimental context. Oligodendrocytes myelinate neurons in the brain through a QKI-dependent mechanism, where the gene QKI codes for an RNA-binding protein involved in alternative splicing. RNA-sequencing data of QkI-deficient murine oligodendrocytes revealed 1899 differentially expressed genes, and several of the strongly down-regulated genes have been linked to neuron myelination (e.g. Mal, Pllp, Plp1) (2). In the RNA-sequencing data of QkI-deficient murine brains, GeneWalk, using the knowledge base INDRA, identified GO terms linked to neuron myelination as the most similar to the differentially expressed genes Mal, Pllp and Plp1. GSEA using PANTHER also identified myelination-related processes as enriched; however, specific gene functions in this biological context could not be recovered. Additionally, the authors show that GeneWalk is not biased by whether a gene has a high or low number of GO annotations, or by the degree of connectivity of a gene’s GO annotations.

Next, to apply GeneWalk in a different experimental set-up, the authors reanalysed published native elongating transcript sequencing (NET-seq) data of a human T-cell acute lymphoblastic leukaemia (ALL) cell line responding to treatment with JQ1 (3). JQ1 is a small-molecule drug that targets BRD4 and other BET family members, which are involved in haematologic cancers like ALL. NET-seq provides a quantitative read-out of nascent transcription. Starting from the differentially transcribed protein-coding genes, GeneWalk identified similar GO terms for 28% of these genes, whereas conventional GSEA only identified five high-level functions with low fold enrichment. These results reveal the advantage of GeneWalk (and the disadvantage of GSEA) when a multitude of functionally unrelated genes are mis-regulated. Furthermore, in this experiment GeneWalk was able to systematically prioritise the context-specific functions of genes with a multitude of GO annotations (e.g. MYC or BRCA1), not all of which are relevant for this specific biological context.

As a third application of GeneWalk, the authors generated NET-seq data from HeLa cells treated with the biflavonoid isoginkgetin (IsoG). IsoG is a plant-derived compound with possible anticarcinogenic properties. It has been shown that IsoG inhibits pre-mRNA splicing in vitro and in vivo and causes Pol II accumulation at the 5’-end of genes (4); however, its exact mode of action remains to be elucidated. NET-seq revealed 2940 genes as differentially transcribed upon IsoG treatment, and GeneWalk found that 24% of these genes had at least one similar GO term. In contrast to GSEA, GeneWalk found HES1, EGR1 and IRF1 as plausible candidate genes for inhibiting Pol II transcriptional elongation after IsoG treatment.

In summary, the authors provide a novel computational tool that is able to identify context-specific gene functions in gene sets from experimental assays. The input data are not limited to RNA-sequencing or NET-seq, but could also come from e.g. CRISPR screens or mass spectrometry approaches.

What I like about this work and open questions

GeneWalk complements over-representation tests and GSEA of GO annotations. I am currently doing GSEA using the R package clusterProfiler (5), and I will now additionally analyse my data using GeneWalk in order to complement my results. Both tools seem to be a great combination (they are also both open source)!

It would be good to know whether it will be possible in the future to add another genome-wide annotation parameter by mapping Entrez Gene identifiers, so that data from more species can be analysed (Bioconductor provides OrgDb for 20 species).

Additional references

1. B. Perozzi, R. Al-Rfou, S. Skiena, Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining – KDD ’14, 701–710 (2014).
2. L. Darbelli, K. Choquet, S. Richard, C. L. Kleinman, Sci Rep. 7, 1–13 (2017).
3. G. E. Winter et al., Mol. Cell. 67, 5-18.e19 (2017).
4. K. O’Brien, A. J. Matlin, A. M. Lowell, M. J. Moore, J. Biol. Chem. 283, 33147–33154 (2008).
5. G. Yu, L.-G. Wang, Y. Han, Q.-Y. He, OMICS. 16, 284–287 (2012).

More information

Tags: crispr screen, genomics, gsea, proteomics, python

Posted on: 4th October 2019


  • Author's response

    Robert Ietswaart and L. Stirling Churchman shared

    Dear Ramona,

    Thank you for the exciting preLight on GeneWalk! We hope it will help to get more insight into your functional genomics data. We agree with you that GeneWalk is complementary to GSEA, as GeneWalk is more focused on getting insight into the functional roles of individual genes, whilst GSEA provides more global information on which processes are relevant to the biological context.

    As you pointed out, GeneWalk currently works for human and mouse, but we are still looking into the best way to extend GeneWalk to other model organisms. The feasibility of extending GeneWalk depends mostly on whether there are open source knowledge bases available that contain mechanistic reaction statements such as “CDK9 phosphorylates RNA Polymerase II” that have previously been reported in the scientific literature. Such reactions are slightly different from gene annotations, which serve more as curated function labels for genes. GeneWalk makes use of these reactions (besides annotations) as they provide an understanding of how the input genes interact with each other. A gene that interacts with many functionally related input genes is then found to be central to the biological context and those shared functions are ranked as most relevant to that gene.

    Thank you for suggesting OrgDb at Bioconductor! As far as we understand, it provides a gene annotation database for many organisms, so we would also still look for reaction knowledge bases for those organisms to complement the information needed for GeneWalk. Budding yeast, for instance, seems like a model organism that GeneWalk could be extended to in the future by using reactions from the Saccharomyces Genome Database (SGD). We’ve recently added support for human Ensembl gene IDs as input and some data visualization code in Python and R in the tutorial. We are open to hearing more suggestions on useful features from the community.

    Kind regards,

    Robert and Stirling
