Modeling transcriptional profiles of gene perturbation with a deep neural network
Posted on: 13 February 2024 , updated on: 30 September 2024
Preprint posted on 16 July 2021
Can we predict DNA perturbations from RNA expression values? This exciting manuscript finds that training a deep learning model with 978 transcript levels is sufficient to predict which gene was silenced following shRNA knockdown!
Selected by Raquel Moya
Categories: bioinformatics, cancer biology
Context
The ability to predict genetic perturbation(s) from the gene expression profile of a cell line or tissue could provide valuable genomic information to researchers, clinicians, and patients while avoiding expensive sequencing and time-consuming genomic analyses. Before such a prediction tool can be used in a clinical setting, however, researchers must extensively validate that it correctly infers known genomic perturbations in existing datasets. Deep learning models have recently been developed and applied to the task of accurately predicting genomic data (e.g., Xpresso, Basenji, Enformer, and Borzoi).1–4 In line with this, the authors of this preprint trained a deep learning model on the Connectivity Map (CMap) dataset, a large collection of transcriptomic profiles measured in response to many different compounds and genetic modifications. One goal of this deep learning model is to improve on the conventional CMap algorithm, which identifies a perturbation's target gene by comparing a query gene expression profile against reference profiles using a similarity metric.
Methods
The dataset used in this study derives from short hairpin RNA (shRNA) knockdown experiments conducted in 9 cell lines (GEO accession GSE106127) as part of the NIH Library of Integrated Network-Based Cellular Signatures (LINCS) initiative.5 Gene expression profiles for these experiments were generated using the L1000 platform, a multiplexed gene expression assay in which transcript levels are measured with fluorescent beads on a flow-cytometry scanner. Each bead was analyzed both for its color (indicating gene identity) and for the fluorescence intensity of its phycoerythrin signal (denoting transcript abundance).5 The authors developed a deep learning model with a convolutional input layer of 978 nodes (one for each gene), 5 hidden layers of different sizes, and an output layer that produces a vector of probabilities over all possible perturbation target genes (n = 4,313). The model was trained on 80% of the 341,336 gene expression profiles, with 10% reserved for validation and 10% for testing. The classification task was to accurately predict the perturbation target gene of each shRNA knockdown.
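To make the classification setup concrete, here is a minimal sketch of a profile-to-perturbation classifier in PyTorch. It uses plain fully connected layers with arbitrary hidden widths (the authors describe a convolutional input layer and their own layer sizes), so it illustrates the 978-input, 4,313-class framing rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class PerturbationClassifier(nn.Module):
    """Toy classifier: 978 landmark-gene inputs, five hidden layers, and a
    4,313-class output. Layer types and widths are assumptions, not the
    authors' exact design."""
    def __init__(self, n_genes=978, n_targets=4313,
                 hidden=(2048, 1024, 512, 256, 128)):
        super().__init__()
        layers, width = [], n_genes
        for h in hidden:
            layers += [nn.Linear(width, h), nn.ReLU()]
            width = h
        layers.append(nn.Linear(width, n_targets))  # logits; softmax is applied inside the loss
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = PerturbationClassifier()
loss_fn = nn.CrossEntropyLoss()                      # multi-class objective over the 4,313 targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random stand-in data (not the L1000 profiles)
x = torch.randn(32, 978)                             # batch of 32 expression profiles
y = torch.randint(0, 4313, (32,))                    # perturbation target labels
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```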
Key Insights
After training, the model achieved an average AUROC across all classes of 0.99 on the test data. The authors then applied the trained model to a similar, but inherently different, CRISPR knockdown dataset, on which it performed less well (average AUROC across all classes = 0.6078). Attempts to improve performance on the CRISPR knockdown dataset did not improve on this initial result. Taken together, this manuscript demonstrates a deep learning adaptation of a more conventional algorithm that performs well at predicting shRNA target genes. More generally, it is exciting to see deep learning frameworks applied to gene expression data, because the complexity of the transcriptome is potentially matched by the power and scalability of a large network.
Questions for the authors
This intriguing study raised some questions about its application and the generalizability of the presented modeling approach.
- Landmark genes:
- The authors could touch on how the original L1000 paper used measurements of the ~1,000 “landmark” transcripts to infer the remainder of the transcriptome. Only the expression values of the 978 chosen “landmark” genes were used to train the deep learning model in this manuscript; however, it may contextualize the model’s performance to mention that inference of the remainder of the transcriptome from the “landmark” set was accurate for only 81% (n = 9,196/11,350) of inferred genes. Thus, roughly 19% of genes cannot be reliably inferred from the L1000 transcript set. This could affect model performance and make one perturbation target gene appear as a different one. A larger or different set of “landmark” transcripts that better recapitulates the transcriptome may improve model performance.
- Off-target effects: The original L1000 paper discusses the possibility of their platform being able to analyze off-target effects. They compared “similarity between shRNAs targeting the same gene (“shared gene”) and shRNAs targeting different genes but sharing the 2-8 nucleotide seed sequence known to contribute to off-target effects (“shared seed”)”.5 Their conclusion was that shared-gene similarity was only slightly greater than random, whereas shared-seed pairs were dramatically more similar than expected under a null distribution, indicating that the magnitude of off-target effects exceeds that of on-target effects for shRNA knockdown.
- It seems tricky to train a model to accurately predict the perturbation target gene given the amount of off-target signal captured by the training data. Could the model be adapted to mitigate the influence of off-target effects? One idea (sketched below) could be to combine, or otherwise reduce the noise in, the gene expression profiles from different shRNAs that target the same gene. This could improve model performance because two different shRNAs targeting the same gene should share on-target effects but have different off-target ones.
- How much variability around a measurement exists in this dataset for different shRNAs targeting the same gene? How concordant are their profiles?
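As an illustration of the “combine profiles per target gene” idea, one hypothetical pre-processing step would be to average the profiles of all shRNAs that share a target before training; the sketch below uses stand-in data and hypothetical labels rather than anything reported in the preprint.

```python
import numpy as np
import pandas as pd

# Stand-in data: rows = individual shRNA experiments, columns = 978 landmark genes.
rng = np.random.default_rng(0)
profiles = pd.DataFrame(rng.standard_normal((1000, 978)))
# Hypothetical mapping from each experiment to its intended target gene.
target_gene = pd.Series(rng.choice([f"GENE{i}" for i in range(50)], size=1000),
                        name="target_gene")

# Average all profiles that share a target gene: the shared on-target signal is
# retained, while seed-driven off-target effects, which differ between shRNAs,
# should partially cancel out.
consensus = profiles.groupby(target_gene).mean()
print(consensus.shape)  # (number of target genes, 978)
```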
- Imbalanced classes:
- Genes with low accuracy had a small number of shRNAs. Could the authors control for the number of shRNAs per gene in the training data by filtering or down-sampling some target genes?
- How imbalanced are the classes? The authors could show the number of shRNAs per target gene and whether this is uniform across target genes, for example.
- The authors report that the average accuracy of the trained model on the test set is 74.9%. It would be informative to also report the AUPRC (area under the precision-recall curve). The authors report AUROC, which is useful, but some caution is warranted: because each class has a very large number of negatives in the test set, high recall can be achieved at a very low false positive rate, so a high AUROC is attainable even when false positives vastly outnumber true positives (i.e., the false discovery rate is high) (see the sketch below).
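To make the imbalance argument concrete, per-class AUROC and AUPRC can be computed one-vs-rest with scikit-learn. The sketch below uses random stand-in scores rather than the authors' predictions; it simply shows that, for rare positives, a chance-level classifier still sits at AUROC ≈ 0.5 while AUPRC collapses to the positive rate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_samples, n_classes = 10_000, 100               # stand-in sizes, not the real dataset
y_true = rng.integers(0, n_classes, n_samples)
scores = rng.random((n_samples, n_classes))      # stand-in predicted probabilities

aurocs, auprcs = [], []
for c in range(n_classes):
    pos = (y_true == c).astype(int)              # one-vs-rest labels for class c
    aurocs.append(roc_auc_score(pos, scores[:, c]))
    auprcs.append(average_precision_score(pos, scores[:, c]))

# For random scores, AUROC stays near 0.5 regardless of imbalance, whereas AUPRC
# falls to roughly the positive rate (~1% here), so the two metrics tell different stories.
print(f"mean AUROC = {np.mean(aurocs):.3f}, mean AUPRC = {np.mean(auprcs):.3f}")
```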
- Generalizing to a CRISPR knockdown dataset:
- A model trained in one context can often perform poorly in another. While the L1000 platform was also used to measure gene expression profiles in the CRISPR knockdown dataset, there are biochemical and experimental differences between shRNA and CRISPR knockdowns.
- Is it possible that the model has memorized the training data? Profiles in the training data could share profile-perturbation relationships with profiles in the testing data, for example when they derive from the same shRNA target gene. I wonder how partitioning entire biological groups into either the training or the test set would affect the results (see the sketch below).
- The network is fully connected and such networks tend to memorize training data entirely when given enough time. Could the model architecture be contributing to memorization of the training data and lower generalizability?
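One way to probe memorization is a grouped split, in which all profiles from the same shRNA clone are assigned to a single partition: the test set then contains only hairpins the model has never seen, while their target genes can still appear in training through other hairpins. A minimal sketch with scikit-learn's GroupShuffleSplit and stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Stand-in arrays: X = expression profiles, y = target-gene labels,
# groups = hypothetical shRNA clone identifiers used as the hold-out unit.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 978))
y = rng.integers(0, 50, 1000)
groups = rng.integers(0, 300, 1000)

# Every group is placed wholly in train or test, so no test profile shares its
# shRNA clone with any training profile.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```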
- CGS query method: The CMap dataset can be used to infer a gene perturbation from a gene expression profile by comparing a query profile to the profiles in the dataset using a similarity metric. Does the deep learning model outperform this method? A head-to-head comparison between the new model and the query method would be informative (a simplified version of such a similarity query is sketched below).
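For context, the query-style baseline amounts to scoring a query profile against reference signatures with a similarity metric and ranking the associated target genes. The sketch below uses plain cosine similarity on stand-in data; this is a simplification of CMap's weighted connectivity scoring, included only to illustrate the kind of baseline the deep learning model would be compared against.

```python
import numpy as np

def query_by_similarity(query_profile, reference_profiles, reference_targets, top_k=5):
    """Rank candidate perturbation targets by cosine similarity of a query profile
    to reference signatures. A simplified stand-in for the CMap query method, which
    uses a weighted connectivity score rather than plain cosine similarity."""
    ref = reference_profiles / np.linalg.norm(reference_profiles, axis=1, keepdims=True)
    q = query_profile / np.linalg.norm(query_profile)
    sims = ref @ q                                   # cosine similarity to every reference profile
    order = np.argsort(sims)[::-1][:top_k]
    return [(reference_targets[i], float(sims[i])) for i in order]

# Stand-in data: 500 reference profiles over the 978 landmark genes
rng = np.random.default_rng(0)
refs = rng.standard_normal((500, 978))
targets = [f"GENE{i % 50}" for i in range(500)]
print(query_by_similarity(rng.standard_normal(978), refs, targets))
```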
References
- Agarwal, V. & Shendure, J. Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks. Cell Rep. 31, 107663 (2020).
- Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
- Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
- Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Preprint at bioRxiv https://doi.org/10.1101/2023.08.30.555582 (2023).
- Subramanian, A. et al. A Next Generation Connectivity Map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437-1452.e17 (2017).
- Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).
doi: https://doi.org/10.1242/prelights.36495