Modeling transcriptional profiles of gene perturbation with a deep neural network

Wenke Liu, Xuya Wang, D R Mani, David Fenyö

Preprint posted on 16 July 2021

Can a deep neural network predict gene perturbations from RNA expression values? This exciting preprint finds that training a deep learning model on just 978 transcript levels is sufficient to predict which gene was silenced following shRNA knockdown!

Selected by Raquel Moya


The ability to predict genetic perturbations from the gene expression profile of a cell line or tissue could provide valuable genomic information to researchers, clinicians, and patients while avoiding expensive sequencing and time-consuming genomic analyses. Before such a prediction tool can be used in a clinical setting, however, researchers must extensively validate that it correctly infers known genomic perturbations in existing datasets. Deep learning models have recently been developed and applied to the task of predicting genomic data (e.g., Xpresso, Basenji, Enformer, and Borzoi).1–4 In line with this, the authors of this preprint trained a deep learning model on the Connectivity Map (CMap) dataset, a large collection of transcriptomic profiles measured in response to many different compounds and genetic modifications. One goal of this model is to improve on the conventional algorithm developed by CMap, which determines a perturbation's target gene by comparing its gene expression profile against other profiles using a similarity metric.
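The conventional CMap-style query can be sketched as a similarity-ranked lookup: compare a query expression profile against stored reference profiles and return the best-matching perturbation. The function, gene names, and random data below are purely illustrative, not CMap's actual implementation:

```python
# Hypothetical sketch of a similarity-based query: rank candidate perturbations
# by cosine similarity between a query profile and reference profiles.
import numpy as np

def query_by_similarity(query, reference_profiles, labels):
    """Return perturbation labels sorted by cosine similarity to `query`."""
    q = query / np.linalg.norm(query)
    refs = reference_profiles / np.linalg.norm(reference_profiles, axis=1, keepdims=True)
    sims = refs @ q                       # cosine similarity per reference profile
    order = np.argsort(sims)[::-1]        # most similar first
    return [labels[i] for i in order], sims[order]

rng = np.random.default_rng(0)
refs = rng.normal(size=(3, 978))          # three reference profiles, 978 landmark genes
labels = ["GENE_A", "GENE_B", "GENE_C"]
query = refs[1] + 0.1 * rng.normal(size=978)  # noisy copy of GENE_B's profile
ranked, sims = query_by_similarity(query, refs, labels)
print(ranked[0])                          # expected: "GENE_B"
```

A deep classifier replaces this explicit profile-by-profile comparison with a learned mapping from profile to target gene.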


The dataset used in this study derives from short hairpin RNA (shRNA) knockdown experiments in 9 cell lines (GEO accession GSE106127), conducted as part of the NIH Library of Integrated Network-Based Cellular Signatures (LINCS) initiative.5 Gene expression profiles for these experiments were generated using the L1000 platform, a multiplexed assay that measures transcript levels with fluorescent beads on a flow-cytometry scanner. Each bead is analyzed both for its color (indicating gene identity) and the fluorescence intensity of its phycoerythrin signal (denoting transcript abundance).5 The authors developed a deep learning model with an input layer of 978 nodes (one for each gene), 5 hidden layers of different sizes, and an output layer that produces a vector of probabilities over all possible perturbation target genes (n = 4,313). The model was trained on 80% of the total 341,336 gene expression profiles, with 10% reserved for validation and 10% for testing. The classification task was to predict the perturbation target gene for each shRNA knockdown.
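The classification setup described above can be sketched as a forward pass through a small fully connected network: a 978-dimensional expression profile in, a softmax over 4,313 candidate target genes out. The layer sizes and random weights here are illustrative, not the authors' exact architecture:

```python
# Minimal numpy sketch of the classifier: 978 landmark-gene inputs mapped
# through ReLU hidden layers to a softmax over 4,313 target-gene classes.
# Hidden-layer sizes are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
sizes = [978, 512, 256, 4313]            # input, two hidden layers, output classes
weights = [rng.normal(scale=0.01, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

def predict(profile):
    """Forward pass: ReLU hidden layers, softmax output over target genes."""
    h = profile
    for w in weights[:-1]:
        h = np.maximum(0.0, h @ w)       # ReLU
    return softmax(h @ weights[-1])

probs = predict(rng.normal(size=978))
print(probs.shape)                       # (4313,); entries sum to ~1
```

The predicted target gene is simply the class with the highest probability.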


Key Insights

After training, the model achieved an average AUROC across all classes of 0.99 on the test data. The authors then applied the trained model to a similar, but inherently different, CRISPR knockdown dataset, on which it performed considerably worse (average AUROC across all classes = 0.6078). Attempts to improve performance on the CRISPR knockdown dataset did not improve on this initial result. Taken together, this manuscript demonstrates a deep learning adaptation of a more conventional algorithm that performs well at predicting an shRNA target gene. More generally, it is exciting to see deep learning frameworks applied to gene expression data, because the complexity of the transcriptome is potentially matched by the power and scalability of a large network.
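For readers less familiar with the metric, the per-class AUROC reported above can be computed one-vs-rest with the Mann-Whitney rank formulation; the toy scores below are illustrative, not the authors' results:

```python
# AUROC as the probability that a randomly chosen positive scores higher than
# a randomly chosen negative (ties count half); averaged over classes this
# gives the macro AUROC reported in the preprint.
def auroc(scores_pos, scores_neg):
    """Pairwise rank statistic over positive vs negative scores."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# One-vs-rest example for a single class: positives mostly score higher.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]
print(auroc(pos, neg))  # 11 of 12 pairs correctly ordered -> 0.9166...
```

An AUROC of 0.5 corresponds to random ranking, which is why the 0.6078 on CRISPR data signals only weak transfer.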


Questions and suggestions for the authors

Some questions remain regarding the appropriate application and generalizability of the presented modeling approach. While reading this preprint, I noted a few specific points, detailed below, that I hope the authors can comment on:

  • Landmark genes:
    • From the text it’s unclear what “landmark genes” are. Based on the original L1000 paper5 I found that the ~1,000 “landmark” transcripts were selected in an unbiased manner such that their gene expression patterns were orthogonal.
    • The authors could touch on how the original L1000 paper used the measurement of the ~1,000 “landmark” transcripts to infer the remainder of the transcriptome. Only the expression values of the 978 chosen “landmark” genes were used to train the deep learning model in this manuscript; however, it may contextualize some flaws in the model’s performance to know that inference of the remainder of the transcriptome from the “landmark” set was accurate for only 81% (n = 9,196/11,350) of inferred genes. Thus, roughly 19% of genes cannot be inferred from the L1000 transcript set. Gene expression patterns not captured by the 978 “landmark” genes could hurt model performance and make one perturbation target gene appear as another. A larger or different set of “landmark” transcripts that better recapitulates the transcriptome may improve model performance.
    • The authors claim that “landmark” genes have higher accuracy, yet the bar plots in Fig. 2B don’t show much difference in accuracy.
  • Off-target effects: The original L1000 paper discusses the possibility of their platform being able to analyze off-target effects. They compared “similarity between shRNAs targeting the same gene (“shared gene”) and shRNAs targeting different genes but sharing the 2-8 nucleotide seed sequence known to contribute to off-target effects (“shared seed”)”.5 Their conclusion was that shared-gene similarity was only slightly greater than random, whereas shared-seed pairs were dramatically more similar than expected under a null distribution, indicating that the magnitude of off-target effects exceeds that of on-target effects for shRNA knockdown.
    • It seems tricky to train a model to accurately predict the perturbation target gene given the amount of off-target effects captured by the training data. Could the authors of this manuscript adapt the model to mitigate the influence of off-target effects? One idea would be to somehow combine or reduce the noise in the gene expression profiles from different shRNAs that target the same gene. The reason why this could improve model performance is because two different shRNAs targeting the same gene should have similar on-target effects and different off-target ones.
    • On that note, how much variability around a measurement exists in this dataset for different shRNAs targeting the same gene? How concordant are their profiles?
  • Imbalanced classes:
    • Genes with low accuracy had a small number of shRNAs. Did the authors try to control for the number of shRNAs per gene in the training data by filtering or down-sampling some target genes?
    • How imbalanced are the classes? The authors could visually or numerically describe to what degree this is true (i.e., number of shRNAs per target gene and whether this is uniform across target genes).
    • The authors report that the average accuracy of the trained model on the test set is 74.9%, but accuracy is not the best measure for imbalanced classes. One option would be AUPRC (area under the precision-recall curve). The authors also report AUROC, which is great; however, caution is warranted because the large number of negatives per class means high recall can be achieved at a very low false positive rate, making it easy to obtain a high AUROC even when false positives vastly outnumber true positives (i.e., a high false discovery rate).
  • Generalizing to a CRISPR knockdown dataset:
    • It is well documented that a model trained for one context can often perform poorly on another. While the L1000 platform was also used to measure gene expression profiles in the CRISPR knockdown dataset, there are biochemical and experimental differences between shRNA and CRISPR knockdowns. This begs the question about how appropriate it is to apply a model trained on shRNA data to CRISPR data.
    • Is it possible that the model has memorized the training data? It would be problematic if profiles in the training data shared profile-perturbation relationships with profiles in the test data, for example replicate profiles generated with the same shRNA hairpin. Predicting the label is easy if a model is tested on samples from a biological group it was trained on. One good practice is to partition entire biological groups into either the training or the test set.
    • The network is fully connected and such networks tend to memorize training data entirely when given enough time. Could the model architecture be contributing to memorization of the training data and lower generalizability?
  • CGS query method: The CMap dataset was designed such that a gene perturbation can be inferred from a gene expression profile by entering a query profile that is compared to profiles in the dataset using a similarity metric. Presumably, the authors tried to improve the performance of this query method, which can tell the user the target gene by comparing the query profile to a consensus gene signature (CGS), the weighted sum of raw profiles targeting the same gene. Does the deep learning model outperform this method? It would be best to benchmark the authors’ model by comparing it to the query method’s performance.
  • Parameters: The authors describe the parameters selected for the model, but don’t provide a justification for how or why those parameters are selected. A hyperparameter search may be a good thing to try.
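The group-aware partitioning suggested in the memorization bullet above could be sketched as follows; the hairpin IDs and the split function are illustrative, not the authors' code:

```python
# Hedged sketch of a group-aware train/test split: all replicate profiles
# from the same shRNA hairpin are kept in a single partition, so the test
# set contains only hairpins the model never saw during training.
import random

def group_split(groups, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) keeping whole groups together."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_frac))
    held_out = set(unique[:n_test])
    train = [i for i, g in enumerate(groups) if g not in held_out]
    test = [i for i, g in enumerate(groups) if g in held_out]
    return train, test

# Two hairpins per target gene, two replicate profiles per hairpin.
shRNAs = ["sh_TP53_1", "sh_TP53_1", "sh_TP53_2", "sh_TP53_2",
          "sh_MYC_1", "sh_MYC_1", "sh_MYC_2", "sh_MYC_2"]
train, test = group_split(shRNAs)
# No hairpin appears in both partitions:
print(set(shRNAs[i] for i in train) & set(shRNAs[i] for i in test))  # set()
```

Under such a split, good test performance would require generalizing across hairpins rather than recognizing replicates.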



  1. Agarwal, V. & Shendure, J. Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks. Cell Rep. 31, 107663 (2020).
  2. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
  3. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
  4. Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Preprint at bioRxiv, 2023.08.30.555582 (2023).
  5. Subramanian, A. et al. A Next Generation Connectivity Map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437-1452.e17 (2017).
  6. Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).

Tags: cmap, crispr knockdown, deep learning, lincs, machine learning, neural network, shrna

Posted on: 13 February 2024 , updated on: 14 February 2024

