Similarity metric learning on perturbational datasets improves functional identification of perturbations

Ian Smith, Petr Smirnov, Benjamin Haibe-Kains

Preprint posted on 11 June 2023

Weak supervision - Strong results! Smith and colleagues introduce Perturbational Metric Learning (PeML), a weakly supervised similarity metric learning method to extract biological relationships from noisy high-throughput perturbational datasets.

Selected by Benjamin Dominik Maier, Anna Foix Romero


Similarity metric learning

Similarity metric learning is a technique for learning a function that measures how similar or different things are from each other. Well-known traditional similarity functions include the Pearson and Spearman correlation for omics modalities (Urbanczyk-Wochniak et al., 2003) and Gene Set Enrichment Analysis (GSEA) for gene expression analysis (Subramanian et al., 2005).
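As a rough illustration of the fixed, off-the-shelf similarity functions mentioned above, the following sketch computes Pearson and Spearman correlation with NumPy. The toy vectors and function names are hypothetical, and the rank helper assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def ranks(x):
    """Rank values from 0 to n-1 (assumes no ties, to keep the sketch short)."""
    return np.argsort(np.argsort(x)).astype(float)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy profiles: y is a monotone but non-linear function of x.
x = np.array([0.1, 0.5, 1.0, 2.0, 4.0])
y = x ** 3

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Because Spearman depends only on the ordering of the values, it treats this monotone relationship as perfect agreement, whereas Pearson is lowered by the non-linearity. Neither function is adapted to the dataset at hand, which is the gap that learned similarity metrics aim to fill.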

Machine learning-based similarity metric learning works in a weakly supervised manner (Duffner et al., 2021). This means that it doesn’t try to categorise things into classes, making it suitable when labels are unknown or hard to obtain (Hernández-González et al., 2016). Instead, it integrates known similarities from repeated measurements to create a high-dimensional space (embedding) where similar things are grouped together. Thus, it learns how to determine if new, unseen examples belong to the same class or exhibit similarity.

In the context of biology, similarity metric learning has proven particularly valuable for analysing large biological datasets. Biological measurements, such as gene expression or cell morphology, are often complex, exhibiting multimodal characteristics, susceptibility to confounding factors, and cell-to-cell variability (Eling et al., 2019). This complexity makes interpretation challenging, especially with sparse single-cell data and low signal-to-noise ratios. However, employing a similarity function tailored to the specific dataset transforms the data into a meaningful context-specific representation, enabling us to identify patterns and relationships within the dataset. For instance, it may help us to identify the mechanism of action, which is the specific way a treatment or substance affects a biological system.

High-throughput perturbational datasets

Recent advances in cost-effective transcriptomics and image-based profiling technologies have made it possible to create extensive public datasets, allowing researchers to study the effects of chemical or genetic perturbations on cells in an automated, high-throughput manner. Notably, the Next Generation L1000 Connectivity Map (Subramanian et al., 2017) and the JUMP Cell Painting project (Chandrasekaran et al., 2023), developed through collaborations between pharmaceutical companies and research institutes, contain profiles of cells exposed to more than 100,000 unique compounds and genetic manipulations. These collaborations provide a unique opportunity to explore genetic patterns and similarities to a) identify drug mechanisms of action, b) nominate therapeutics for a particular disease, and c) construct biological networks among perturbations and genes.

Key Findings


Smith and colleagues introduce PeML, a weakly supervised similarity metric learning method that transforms biological measurements into an intrinsic, dataset-specific basis. Thus, biological relationships and mechanisms can be extracted from noisy high-throughput perturbational datasets. To measure the performance of the new method, the authors use the L1000 dataset comprising gene expression signatures of compounds in cancer and immortalised cell lines, as well as the CDRP Cell Painting dataset containing cellular morphology and function data from a single cell line. The authors show that PeML maximises the discrimination of replicate signatures, improves recall in biological data and yields better prediction of compound mechanisms of action. Recall (also known as sensitivity or true positive rate) is calculated as the ratio of true positive (TP) predictions to the total number of actual positive instances in the dataset, that is, TP / (TP + FN), where FN is the number of false negatives. PeML can be trained on moderately sized datasets and goes beyond traditional approaches by capturing a more profound notion of similarity. Therefore, it might improve data classification, clustering, and subsequent analyses.
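The definition of recall given above can be written out as a short sketch. The labels below are made up for illustration (1 could stand for a true replicate pair and 0 for a non-replicate pair), and the function name is an assumption rather than anything taken from the preprint.

```python
import numpy as np

def recall(actual, predicted):
    """Recall = TP / (TP + FN), computed from binary labels."""
    actual = np.asarray(actual, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    true_pos = np.sum(actual & predicted)    # positives that were found
    false_neg = np.sum(actual & ~predicted)  # positives that were missed
    if true_pos + false_neg == 0:
        raise ValueError("recall is undefined without actual positives")
    return float(true_pos / (true_pos + false_neg))

# Hypothetical example: 4 actual positives, 3 of which are recovered.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
value = recall(actual, predicted)  # 3 / (3 + 1)
```

Note that the false positive in the fifth position does not affect recall; it would lower precision instead.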

Fig. 1 Schematic of the weakly supervised ML similarity metric learning method Perturbational Metric Learning (PeML). Figure taken from Smith et al. (2023), bioRxiv, published under the CC-BY-NC-ND 4.0 International licence.

Perturbational Metric Learning (PeML)

PeML is a weakly supervised machine learning framework that learns a similarity function between samples. It uses replicates of experiments as ground truth for similar pairs to train a data-driven similarity function. Unlike methods that learn new features from raw data, PeML is a feature transformation technique that works directly on processed data, such as gene expression or morphological profiles, eliminating the need to extract new features from the original raw data.
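To make the idea of learning a feature transformation from replicates more concrete, here is a minimal sketch. It is not PeML's actual model or loss, which this post does not spell out. Instead it substitutes a simple closed-form technique in the style of relevant component analysis: the within-replicate covariance is estimated and the data are whitened by it, so that directions dominated by replicate-to-replicate noise are down-weighted. All names, the ridge term and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_replicate_whitening(X, groups, ridge=1e-3):
    """Learn a linear map W from replicate groups (RCA-style stand-in).

    X: (n_samples, n_features) processed profiles.
    groups: length-n labels; samples sharing a label are replicates.
    Returns the symmetric matrix W = C^(-1/2), where C is the
    ridge-regularised within-replicate covariance.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    centred = np.empty_like(X)
    for g in np.unique(groups):
        mask = groups == g
        # Remove each group's mean so only replicate noise remains.
        centred[mask] = X[mask] - X[mask].mean(axis=0)
    C = centred.T @ centred / len(X) + ridge * np.eye(X.shape[1])
    evals, evecs = np.linalg.eigh(C)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

def learned_similarity(x, y, W):
    """Cosine similarity measured in the transformed basis."""
    u, v = x @ W, y @ W
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic data: 50 conditions with 4 replicates each; feature 0 is very noisy.
rng = np.random.default_rng(0)
n_groups, n_reps, n_feat = 50, 4, 5
signal = rng.normal(size=(n_groups, n_feat))
noise_scale = np.array([5.0, 1.0, 1.0, 1.0, 1.0])
groups = np.repeat(np.arange(n_groups), n_reps)
X = signal[groups] + rng.normal(size=(n_groups * n_reps, n_feat)) * noise_scale

W = fit_replicate_whitening(X, groups)
sim = learned_similarity(X[0], X[1], W)  # two replicates of condition 0
```

The point of the sketch is only that replicate structure alone, without class labels, is enough to learn a dataset-specific basis in which a standard similarity such as cosine can then be computed.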

PeML improves replicate recall in biological data

First, the authors conducted a replicate recall analysis to quantify the model’s ability to capture biologically relevant relationships in the data. To account for differences between cell lines, separate context-specific models were trained for each cell line. Training was performed on small batches of data rather than on the entire dataset at once (mini-batch stochastic gradient descent), making it more efficient. Signatures representing the same compound treatment were grouped together, regardless of dosage or time point. Balanced AUC, which adjusts for some classes having more examples than others, was used as the evaluation metric. AUC is a metric for evaluating machine learning models in binary classification tasks. It measures the area under the Receiver Operating Characteristic curve, where the true positive rate (TPR, the fraction of positive samples correctly classified as positive) is plotted against the false positive rate (FPR, the fraction of negative samples incorrectly classified as positive), providing insight into the model’s ability to distinguish between classes.
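As an illustration of the AUC described above, the sketch below uses the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive scores higher than a randomly chosen negative. The scores and labels are invented, and the class-balancing adjustment used in the preprint is not reproduced here.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Hypothetical similarity scores for pairs of signatures.
# True = replicate pair (same compound), False = non-replicate pair.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
labels = [True, True, True, False, False, False, False]
auc = roc_auc(scores, labels)  # 11 of the 12 positive/negative comparisons are won
```

An AUC of 0.5 corresponds to a similarity function that ranks replicate pairs no better than chance, and 1.0 to one that ranks every replicate pair above every non-replicate pair.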

Subsequently, the model’s generalizability and performance across different compounds were evaluated using 5-fold compound-wise cross-validation. This means that the dataset was split into five parts based on the compounds, and each part was used as a validation set once while the other four parts were used for training. Thus, the authors demonstrated that PeML outperformed the baseline cosine similarity, yielding higher replicate rank and improving recall for replicate pairs in various cell lines, as well as achieving better results for previously unseen compounds.
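The compound-wise splitting described above can be sketched as follows. The key property is that all signatures of a compound land in the same fold, so validation compounds are never seen during training. The function name, the seed and the toy compound labels are assumptions for illustration only.

```python
import numpy as np

def compound_wise_folds(compounds, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs so that no compound spans both sets."""
    compounds = np.asarray(compounds)
    rng = np.random.default_rng(seed)
    # Shuffle the unique compounds, then cut them into n_folds groups.
    shuffled = rng.permutation(np.unique(compounds))
    for held_out in np.array_split(shuffled, n_folds):
        val_mask = np.isin(compounds, held_out)
        yield np.flatnonzero(~val_mask), np.flatnonzero(val_mask)

# Hypothetical labels: 10 compounds with 3 signatures each.
compounds = np.repeat([f"cpd_{i}" for i in range(10)], 3)
folds = list(compound_wise_folds(compounds))
```

Splitting by signature instead of by compound would let replicates of the same compound appear in both training and validation sets, which would overstate how well the learned metric generalises to unseen compounds.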

PeML improves prediction of compound mechanism of action from perturbational signatures

Next, the authors benchmarked PeML’s ability to identify drugs’ mechanisms of action. In each cell line of the L1000 and Cell Painting datasets, they found that PeML recovers a greater proportion of biologically relevant mechanisms of action. Furthermore, a signal-to-noise ratio analysis revealed that PeML better discriminates similar pairs from the background than standard similarity metrics.

Generalizability of PeML

While the previous analyses demonstrated promising results for large high-quality datasets, the performance on smaller training datasets remained unknown. Hence, the authors assessed the minimal training data required for a well-generalised model by downsampling the original datasets. Their results indicate that a few hundred conditions with replicates are sufficient to identify and retrieve biologically relevant associations from a given dataset.

Finally, the authors tested their initial hypothesis that context-specific models tailored to a specific cell line perform better than pan-models trained on all cell lines. The results demonstrated that learning context-specific models for different cancer cell lines improved performance on similarity retrieval tasks compared with both models trained on all contexts and the cosine baseline.

Fig. 2 Cell line-specific metric learning functions outperform a pan-dataset function and a baseline cosine function in predicting Mechanism of Action. Figure taken from Smith et al. (2023), bioRxiv, published under the CC-BY-NC-ND 4.0 International licence.

Further Material

GitHub Repository

R package (not released yet)

Conclusion and Perspective

As the volume of large-scale biological datasets continues to grow, the increasing relevance of weakly supervised learning algorithms becomes evident, offering data-driven and scalable analysis while minimising the dependency on costly and time-consuming expert annotations and training data. In this preprint, Smith and colleagues present Perturbational Metric Learning (PeML), a powerful tool for the analysis of large biological datasets. PeML learns a data-driven similarity function by transforming biological measurements into an intrinsic, dataset-specific basis to extract meaningful biological associations such as compound mechanisms of action from noisy datasets. In addition to capturing a more meaningful notion of similarity, data in the transformed basis can be used for other analysis tasks, such as classification and clustering.

The idea of integrating large-scale imaging data into our pipelines has emerged as a pressing challenge. This led us to consider featuring a preprint that offers valuable insights into bridging multi-omics data analysis with imaging and machine learning. Given Benjamin’s expertise in integrating and analysing large-scale multi-omics data, along with Anna’s background in computer vision, bioimage analysis and machine learning, this preLight post presents an exciting opportunity for interdisciplinary collaboration.


References

Chandrasekaran, S. N., Ackerman, J., Alix, E., Ando, D. M., Arevalo, J., Bennion, M., Boisseau, N., Borowa, A., Boyd, J. D., Brino, L., Byrne, P. J., Ceulemans, H., Ch’ng, C., Cimini, B. A., Clevert, D.-A., Deflaux, N., Doench, J. G., Dorval, T., Doyonnas, R., … & Carpenter, A. E. (2023). JUMP Cell Painting dataset: morphological impact of 136,000 chemical and genetic perturbations. bioRxiv.

Duffner, S., Garcia, C., Idrissi, K., & Baskurt, A. (2021). Similarity Metric Learning. Multi-faceted Deep Learning – Models and Data. ⟨hal-03465119⟩

Eling, N., Morgan, M. D., & Marioni, J. C. (2019). Challenges in measuring and understanding biological noise. Nature Reviews Genetics, 20(9), 536–548.

Hernández-González, J., Inza, I., & Lozano, J. A. (2016). Weak supervision and other non-standard classification problems: A taxonomy. Pattern Recognition Letters, 69, 49–55. https://doi.org/10.1016/j.patrec.2015.10.008

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., … & Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550.

Subramanian, A., Narayan, R., Corsello, S. M., Peck, D. D., Natoli, T. E., Lu, X., Gould, J., Davis, J. F., Tubelli, A. A., Asiedu, J. K., Lahr, D. L., Hirschman, J. E., Liu, Z., Donahue, M., Julian, B., Khan, M., Wadden, D., Smith, I. C., Lam, D., Liberzon, A., … & Golub, T. R. (2017). A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell, 171(6), 1437–1452.e17.

Urbanczyk-Wochniak, E., Luedemann, A., Kopka, J., Selbig, J., Roessner-Tunali, U., Willmitzer, L., & Fernie, A. R. (2003). Parallel analysis of transcript and metabolic profiles: a new approach in systems biology. EMBO Reports, 4, 989–993.

Tags: cell painting, classification, clustering, correlations, data-driven, gsea, machine learning, next generation connectivity map (l1000), perturbations, similarity metric learning

Posted on: 17 August 2023



Author's response

The author team shared

Thanks very much for your interest in my PeML manuscript.

Q1: We are curious to see how the performance of PeML compares with other state-of-the-art similarity metric learning methods providing a more comprehensive evaluation of the proposed method. Have you run any benchmarks to compare PeML to other algorithms?

Despite advances in representation learning – the umbrella under which self-supervised learning (SSL) and weakly supervised learning (WSL) fall – the most commonly used metrics remain off-the-shelf methods like correlation and gene set approaches. The recent revolution in self-supervised learning has most affected computer vision and NLP, as with SimCLR and LLMs. The properties of these spaces are quite different from biological data – in particular that identity preserving transformations are easier to define. Differential perturbational signatures, measuring changes in some feature space, must be analyzed with a scale invariant metric, like cosine: doubling the dose of a drug should produce a similar signature. For both of these reasons, there are relatively few metric learning methods that can be intelligently applied to the perturbational domain. There have been some interesting recent developments in biological SSL. A preprint by Moshkov 2022 on a weakly supervised CNN-based method for cell painting perturbational similarity is promising; CLEAR from Han 2022 develops an scRNA-specific method; scGPT from the Bo Wang lab uses an attention mask on scRNA data to do SSL. However, these are not perturbational datasets. In short, I am not aware of perturbational WSL metric learning methods that are platform-agnostic like PeML. The key advantage of PeML is its simplicity: it is domain agnostic and does not make assumptions about identity-preserving transformations beyond experimental reproducibility. I have not compared PeML’s encoding to a more complex approach, like Moshkov, but it is a worthwhile question.
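The scale-invariance argument in this response can be illustrated with a small sketch. It makes the idealised assumption that a stronger response simply multiplies a hypothetical differential signature by a constant, which real dose responses only approximate.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity: depends on direction only, not on magnitude."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

signature = np.array([1.5, -0.5, 2.0, 0.0])  # hypothetical differential signature
doubled = 2.0 * signature                    # idealised stronger response

cos_sim = cosine(signature, doubled)                 # direction is unchanged
euclid = float(np.linalg.norm(signature - doubled))  # grows with the scaling
```

Under this assumption the cosine similarity stays at 1 while the Euclidean distance equals the norm of the original signature, so a distance-based metric would treat the two doses as different even though they point in the same direction.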

Q2: PeML requires biological replicates of experiments as ground truth of similar signatures, which may not always be available or feasible to obtain. Is there an alternative way to obtain or infer ground truth signatures?

This is an excellent question, as in domains like scRNA, obtaining some form of replicate experiments isn’t feasible. At present, SSL methods require a ground truth either from identity-preserving transformations or replicates. I speculate that for spaces with local convexity, it might be possible to learn the properties of the space from a particular set of replicated experiments and extrapolate in general. For instance, for scRNA, it might be possible to calibrate the space with repeated measurements of a spike-in control, learn a metric, then apply it to data where replicated experimentation is impossible. An approach like this seems more satisfying than SSL with imputation via an attention mask. Another possible method for defining prior similarity would be to use annotation from another source, such as a phenotypic readout or label of data points, but this coarse approach risks oversimplifying important differences. Ultimately, I don’t know of a better way to learn the properties of a space other than repeated (synthetic or otherwise) measurements.

Q3: Considering that PeML may not be well-suited for datasets with a small number of replicates or features, and its applicability varying based on the specific characteristics of the biological data; when would you recommend using PeML to identify relationships in a dataset and when should one use alternative methods?

Basically any form of representation learning should be evaluated with some benchmark. One of the simplest ways of doing representation learning is to do PCA on a dataset and discard some number of components. But to validate that this is useful, it’s necessary to have some ground truth task on which it can be shown the representation helps extract meaningful information before applying it to new analysis. As we have shown with PeML, you don’t necessarily need a colossal dataset to extract a useful WSL representation. My recommendation would be to first identify a representative benchmark task, then compare any number of representation learning methods to determine which method performs best on that benchmark. Much like a cross-validated R² value for regression, it’s necessary to have some quantifiable evidence of performance before blithely applying representation methods, especially when all these methods have the risk of failing to generalize to a new domain or dataset.

Q4: In your discussion, you mention that “The space of transformations on transcriptomic data, for example, that leave the identity of the biological state unchanged is unknown.” What approaches/ideas are currently discussed by the community to address and overcome this challenge?

Self-supervised learning has unlocked an entire world of label-free learning, exemplified by revolutions in computer vision classification and Large Language Models in NLP. Biological space is sufficiently complex that Weakly Supervised Learning, the poor relation of SSL, has been needed to learn biological relationships. The presence of these relationships and the relevance of a lower dimensional manifold has been known for decades; pathways are a great example of structure on gene expression data. It may be that attention masks as from NLP and in scGPT are sufficient to learn this manifold, but I believe that we must identify the neighbourhoods of a particular class or data point. Apart from the challenge of understanding the properties of the biological manifold, each domain is different. Cellular morphology imaging has proved useful due to its high-throughput, and much of the knowledge from computer vision can be translated to that space. Proteomics, transcriptomics, chromatin measurements, and DNA are all significantly trickier, and the space of cancer has a vast mutational landscape and relatively few samples from which to learn. We can leverage existing compendia, like TCGA, GTEX, ICGC, and even L1000. Ultimately, I suspect each data modality will require its own insights and tricks, analogous to the image transformations from computer vision, to learn the properties of the data manifold.
