Modeling transcriptional profiles of gene perturbation with a deep neural network

Wenke Liu, Xuya Wang, D R Mani, David Fenyö

Posted on: 13 February 2024, updated on: 14 February 2024

Preprint posted on 16 July 2021

Can ChatGPT predict DNA mutations from RNA expression values? This exciting manuscript finds that training a deep learning model with 978 transcript levels is sufficient to predict which gene got silenced following shRNA knockdown!

Selected by Raquel Moya

Categories: bioinformatics, cancer biology

Context

The ability to predict genetic perturbation(s) from the gene expression profile of a cell line or tissue can provide valuable genomic information to researchers, clinicians, and patients, while avoiding expensive sequencing and time-consuming genomic analyses. Before such a prediction tool can be used in a clinical setting, however, researchers must extensively validate that it can correctly infer known genomic perturbations in existing datasets. Deep learning models have recently been developed and applied to the task of accurately predicting genomic data (e.g., Xpresso, Basenji, Enformer, and Borzoi).¹⁻⁴ In line with this, the authors of this preprint have trained a deep learning model using the Connectivity Map (CMap) dataset, a large collection of transcriptomic profiles measured in response to many different compounds and genetic modifications. One goal of this deep learning model is to improve on the conventional algorithm developed by CMap, which determines a perturbation target gene by comparing a gene expression profile against other profiles using a similarity metric.

Methods

The dataset used in this study derives from short hairpin RNA (shRNA) knockdown experiments in 9 cell lines (GEO accession GSE106127), conducted as part of the NIH Library of Integrated Network-Based Cellular Signatures (LINCS) initiative.⁵ Gene expression profiles for these experiments were generated using the L1000 platform, a multiplexed gene expression assay that measures transcript levels with fluorescent beads using a flow-cytometry scanner. Each bead was analyzed both for its color (indicating gene identity) and for the fluorescence intensity of the phycoerythrin signal (denoting transcript abundance).⁵ The authors developed a deep learning model with an input layer of 978 nodes (one for each gene), 5 hidden layers of different sizes, and an output layer that returns a vector of probabilities, one for each possible perturbation target gene (n = 4,313). The model was trained on 80% of the 341,336 gene expression profiles, with 10% reserved for model validation and 10% for testing. The aim of the model was to correctly classify the perturbation target gene of each shRNA knockdown profile.
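To make the shape of such a classifier concrete, here is a minimal sketch of a forward pass through a fully connected network with 978 inputs, five hidden layers, and 4,313 softmax outputs. Only those three numbers come from the text above. The hidden-layer sizes, the ReLU activation, and the random (untrained) weights are assumptions for illustration and are not taken from the preprint.

```python
import math
import random

N_GENES = 978                    # landmark transcripts per profile (from the text)
N_TARGETS = 4313                 # possible perturbation target genes (from the text)
HIDDEN = [64, 64, 32, 32, 16]    # ASSUMPTION: five hidden layers, sizes invented here


def init_layers(sizes, seed=0):
    """Return a list of (weights, biases) pairs with small random weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(profile, layers):
    """Map one expression profile to a probability for every target gene."""
    activations = profile
    for i, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activations)) + b
               for row, b in zip(weights, biases)]
        is_last = i == len(layers) - 1
        # ASSUMPTION: ReLU on hidden layers; the last layer keeps raw logits.
        activations = pre if is_last else [max(0.0, v) for v in pre]
    return softmax(activations)


layers = init_layers([N_GENES] + HIDDEN + [N_TARGETS])
rng = random.Random(1)
profile = [rng.gauss(0.0, 1.0) for _ in range(N_GENES)]   # synthetic profile
probs = forward(profile, layers)
predicted = max(range(N_TARGETS), key=probs.__getitem__)  # index of the top class
```

With untrained weights the predicted index is meaningless. The point is only that one profile of 978 values goes in and one probability per candidate target gene comes out.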

Key Insights

After training, the model achieved an average AUROC across all classes of 0.99 on the testing data. The authors then applied the trained model to a similar, but inherently different, CRISPR knockdown dataset, on which it performed considerably worse (average AUROC across all classes = 0.6078). Attempts to improve performance on the CRISPR knockdown dataset did not raise it above this initial level. Taken together, this manuscript demonstrates a deep learning adaptation of a more conventional algorithm that performs well at predicting an shRNA target gene. Generally, it is exciting to see the application of deep learning frameworks to gene expression data because the complexity of the transcriptome is potentially matched by the power and scalability of a large network.

Questions and suggestions for the authors

Some questions remain regarding the appropriate application and generalizability of the presented modeling approach. While reading this preprint, there were a few specific things – detailed below – which I hope the authors could comment on:

Landmark genes:
- From the text it’s unclear what “landmark genes” are. Based on the original L1000 paper⁵ I found that the ~1,000 “landmark” transcripts were selected in an unbiased manner such that their gene expression patterns were orthogonal.
- The authors could touch on how the original L1000 paper used the measurement of the ~1,000 “landmark” transcripts to infer the remainder of the transcriptome. Only the expression values of the 978 chosen “landmark” genes were involved in training the deep learning model in this manuscript. However, it may contextualize some flaws in the model’s performance to know that inference of the remainder of the transcriptome using the “landmark” set was accurate for only 81% (n = 9,196/11,350) of inferred genes. Thus, 19% of genes can’t be accurately inferred from the L1000 transcript set. Gene expression patterns that the 978 “landmark” genes fail to capture could affect model performance and make a certain perturbation target gene appear as a different one. A larger or different set of “landmark” transcripts that recapitulates the transcriptome better may facilitate improved model performance.
- The authors claim that “landmark” genes have higher accuracy, yet the bar plots in Fig. 2B don’t show much difference in accuracy.
Off-target effects: The original L1000 paper discusses the possibility of their platform being able to analyze off-target effects. They compared “similarity between shRNAs targeting the same gene (“shared gene”) and shRNAs targeting different genes but sharing the 2-8 nucleotide seed sequence known to contribute to off-target effects (“shared seed”)”.⁵ Their conclusion was that the shared gene similarity was only slightly greater than random. In contrast, shared seed pairs were dramatically more similar than expected under a null distribution, indicating that the magnitude of off-target effects exceeds that of on-target effects for shRNA knockdown.
- It seems tricky to train a model to accurately predict the perturbation target gene given the amount of off-target effects captured by the training data. Could the authors of this manuscript adapt the model to mitigate the influence of off-target effects? One idea would be to combine the gene expression profiles from different shRNAs that target the same gene in order to reduce noise. This could improve model performance because two different shRNAs targeting the same gene should have similar on-target effects but different off-target ones.
- On that note, how much variability around a measurement exists in this dataset for different shRNAs targeting the same gene? How concordant are their profiles?
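As a rough illustration of the two questions above, the sketch below measures how concordant the profiles of different shRNAs against the same gene are (mean pairwise Pearson correlation) and averages them into a single profile per gene. The gene names and the four-value profiles are invented toy data, and plain averaging is just one simple way to combine profiles, not a method taken from the preprint.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def same_gene_concordance(profiles_by_gene):
    """Mean pairwise correlation among the shRNA profiles of each gene."""
    result = {}
    for gene, profiles in profiles_by_gene.items():
        pairs = list(combinations(profiles, 2))
        if pairs:  # a gene with a single shRNA has no pair to compare
            result[gene] = sum(pearson(a, b) for a, b in pairs) / len(pairs)
    return result


def mean_profile(profiles):
    """Average several profiles transcript by transcript."""
    return [sum(values) / len(values) for values in zip(*profiles)]


# Hypothetical toy data: one profile (four transcripts) per shRNA.
toy = {
    "GENE_A": [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2]],  # concordant shRNAs
    "GENE_B": [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]],  # discordant shRNAs
    "GENE_C": [[0.5, 0.1, 0.3, 0.9]],                         # only one shRNA
}
concordance = same_gene_concordance(toy)
consensus_a = mean_profile(toy["GENE_A"])
```

A histogram of these per-gene concordance values across the real dataset would show directly how much of the signal is shared between shRNAs that target the same gene.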
Imbalanced classes:
- Genes with low accuracy had a small number of shRNAs. Did the authors try to control for the number of shRNAs per gene in the training data by filtering or down-sampling some target genes?
- How imbalanced are the classes? The authors could visually or numerically describe to what degree this is true (i.e., number of shRNAs per target gene and whether this is uniform across target genes).
- The authors report that the average accuracy of the trained model on the test set is 74.9%, but accuracy is not the best measure for imbalanced classes. One alternative would be AUPRC (area under the precision-recall curve). The authors also report AUROC, which is great. However, AUROC should be interpreted with caution: because negatives vastly outnumber positives in the test set, high recall can be reached at a very low false positive rate, so a high AUROC is easy to obtain even when false positives vastly outnumber true positives (i.e., a high false discovery rate).
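The gap between the two metrics is easy to reproduce on a toy example. The sketch below scores one class with 2 positives and 998 negatives, where 10 negatives outrank both positives. The data are invented, and average precision is used here as a common estimator of AUPRC.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true positive.

    Assumes no tied scores between positives and negatives.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical one-vs-rest scores for a single rare target gene.
labels = [1, 1] + [0] * 998
scores = ([0.90, 0.80]                                   # the two positives
          + [0.99 - 0.001 * i for i in range(10)]        # 10 high-scoring negatives
          + [0.5 * i / 988 for i in range(988)])         # remaining negatives

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
print(f"AUROC = {roc:.3f}, average precision = {ap:.3f}")
```

Here the AUROC is about 0.99 while the average precision is about 0.13, because the first true positive only appears at rank 11. A class that looks almost perfectly separated by AUROC can therefore still yield mostly false positives among its top predictions.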
Generalizing to a CRISPR knockdown dataset:
- It is well documented that a model trained for one context can often perform poorly on another. While the L1000 platform was also used to measure gene expression profiles in the CRISPR knockdown dataset, there are biochemical and experimental differences between shRNA and CRISPR knockdowns. This raises the question of how appropriate it is to apply a model trained on shRNA data to CRISPR data.
- Is it possible that the model has memorized the training data? It would not be ideal if certain profiles in the training data exhibited similar profile-perturbation relationships in the testing data, such as the same shRNA target gene. Predicting the label is easy if a model is trained using features from the same group. One good practice can be to partition entire biological groups into either the training or test set.
- The network is fully connected and such networks tend to memorize training data entirely when given enough time. Could the model architecture be contributing to memorization of the training data and lower generalizability?
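One way to implement the group-wise partitioning suggested above is sketched below: all profiles that share a group label land in the same partition. The record fields and the choice of the shRNA as the grouping unit are assumptions for illustration. The appropriate biological group (shRNA, cell line, batch) depends on the dataset, and each target gene still needs some shRNAs in the training set for the classifier to learn it.

```python
import random
from collections import defaultdict


def group_split(records, group_key, test_fraction=0.1, seed=0):
    """Split records so that no group appears in both partitions.

    Note that `test_fraction` applies to groups, not to individual records.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    group_ids = sorted(groups)                 # sort first for reproducibility
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])
    train = [r for g in group_ids if g not in test_ids for r in groups[g]]
    test = [r for g in group_ids if g in test_ids for r in groups[g]]
    return train, test


# Hypothetical records: 20 shRNAs, 2 per gene, 5 replicate profiles each.
records = [
    {"shrna": f"sh{i:02d}", "gene": f"GENE_{i // 2}", "replicate": r}
    for i in range(20) for r in range(5)
]
train, test = group_split(records, group_key="shrna", test_fraction=0.1)
train_shrnas = {r["shrna"] for r in train}
test_shrnas = {r["shrna"] for r in test}
```

Comparing test performance under a random split and under a grouped split like this one would indicate how much of the reported accuracy depends on near-duplicate profiles.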
CGS query method: The CMap dataset was designed such that a gene perturbation can be inferred from a gene expression profile by entering a query profile that is compared to profiles in the dataset using a similarity metric. Presumably, the authors tried to improve the performance of this query method, which can tell the user the target gene by comparing the query profile to a consensus gene signature (CGS), the weighted sum of raw profiles targeting the same gene. Does the deep learning model outperform this method? It would be best to benchmark the authors’ model by comparing it to the query method’s performance.
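For readers unfamiliar with the query approach, here is a minimal sketch of the idea: build one consensus signature per gene and rank genes by the similarity of a query profile to each signature. Uniform weights, a normalised weighted average, and cosine similarity are simplifying assumptions made here. They stand in for whatever weighting scheme and similarity metric CMap actually uses, and the profiles are invented toy data.

```python
from math import sqrt


def consensus_signature(profiles, weights=None):
    """Weighted average of the profiles that target the same gene."""
    if weights is None:
        weights = [1.0] * len(profiles)    # ASSUMPTION: uniform weights
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, profiles)) / total
            for i in range(len(profiles[0]))]


def cosine(x, y):
    """Cosine similarity between two profiles (assumed similarity metric)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def query(profile, signatures):
    """Rank target genes by similarity between the query and each signature."""
    scored = [(cosine(profile, sig), gene) for gene, sig in signatures.items()]
    return [gene for _, gene in sorted(scored, reverse=True)]


# Hypothetical toy profiles over three transcripts.
profiles_by_gene = {
    "GENE_A": [[2.0, 0.0, -1.0], [1.8, 0.2, -1.2]],
    "GENE_B": [[-1.0, 2.0, 0.0], [-1.2, 1.8, 0.1]],
    "GENE_C": [[0.0, -1.0, 2.0]],
}
signatures = {gene: consensus_signature(profiles)
              for gene, profiles in profiles_by_gene.items()}
ranking = query([1.9, 0.1, -0.9], signatures)
```

Running both this kind of baseline and the neural network on the same held-out profiles, and reporting the rank of the true target gene for each, would give a direct comparison.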
Parameters: The authors describe the parameters selected for the model, but don’t provide a justification for how or why those parameters were selected. A hyperparameter search may be a good thing to try.
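A basic grid search is one option. The sketch below tries every combination of candidate values and keeps the best-scoring one. The hyperparameter names, the candidate values, and the scoring function are all hypothetical. In practice `evaluate` would train the model with the given settings and return a score on the validation set.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid`; return the best (score, params)."""
    names = sorted(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


def toy_validation_score(params):
    """Stand-in for 'train the model and return its validation score'."""
    return (-abs(params["learning_rate"] - 1e-3) * 100
            - abs(params["dropout"] - 0.2))


# Hypothetical search space; not the parameters used in the preprint.
grid = {"learning_rate": [1e-4, 1e-3, 1e-2], "dropout": [0.0, 0.2, 0.5]}
best_score, best_params = grid_search(grid, toy_validation_score)
```

Whatever search strategy is used, the test set should stay untouched until the final configuration has been chosen on the validation set.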

References

1. Agarwal, V. & Shendure, J. Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks. Cell Rep. 31, 107663 (2020).
2. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
3. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
4. Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. 2023.08.30.555582 Preprint at https://doi.org/10.1101/2023.08.30.555582 (2023).
5. Subramanian, A. et al. A Next Generation Connectivity Map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452.e17 (2017).
6. Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).

Tags: cmap, crispr knockdown, deep learning, lincs, llm, machine learning, neural network, shrna

doi: https://doi.org/10.1242/prelights.36495
