Similarity metric learning on perturbational datasets improves functional identification of perturbations

Ian Smith, Petr Smirnov, Benjamin Haibe-Kains

Posted on: 17 August 2023

Preprint posted on 11 June 2023

Weak supervision - Strong results! Smith and colleagues introduce Perturbational Metric Learning (PeML), a weakly supervised similarity metric learning method to extract biological relationships from noisy high-throughput perturbational datasets.

Selected by Benjamin Dominik Maier, Anna Foix Romero

Categories: bioinformatics, molecular biology, systems biology

Background

Similarity metric learning

Similarity metric learning is a technique for learning a function that measures how similar or different two things are. Well-known traditional similarity functions include the Pearson and Spearman correlations for omics modalities (Urbanczyk-Wochniak et al., 2003) and Gene Set Enrichment Analysis (GSEA) for gene expression analysis (Subramanian et al., 2005).
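To make the traditional similarity functions concrete, here is a minimal sketch of the Pearson and Spearman correlations between two profiles. The gene-level values are invented for illustration, and the Spearman helper assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: strength of the linear association between two profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (assumes no ties)."""
    def rank(v):
        return np.argsort(np.argsort(np.asarray(v))).astype(float)
    return pearson(rank(x), rank(y))

# Hypothetical expression changes for five genes under two treatments.
profile_a = [1.0, 2.0, 3.0, 4.0, 5.0]
profile_b = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotone but non-linear in profile_a

r_pearson = pearson(profile_a, profile_b)
r_spearman = spearman(profile_a, profile_b)
print(f"Pearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}")
```

Because the second profile increases monotonically with the first, the rank-based Spearman correlation is exactly 1, while the Pearson correlation is slightly lower because the relationship is not linear.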

Machine learning-based similarity metric learning works in a weakly supervised manner (Duffner et al., 2021). This means that it doesn’t try to categorise things into classes, making it suitable when labels are unknown or hard to obtain (Hernández-González et al., 2016). Instead, it integrates known similarities from repeated measurements to create a high-dimensional space (embedding) where similar things are grouped together. In this way, it learns to determine whether new, unseen examples belong to the same class or are similar to one another.

In the context of biology, similarity metric learning has proven particularly valuable for analysing large biological datasets. Biological measurements, such as gene expression or cell morphology, are often complex, exhibiting multimodal characteristics, susceptibility to confounding factors, and cell-to-cell variability (Eling et al., 2019). This complexity makes interpretation challenging, especially with sparse single-cell data and low signal-to-noise ratios. However, employing a similarity function tailored to the specific dataset transforms the data into a meaningful context-specific representation, enabling us to identify patterns and relationships within the dataset. For instance, it may help us to identify the mechanism of action, which is the specific way a treatment or substance affects a biological system.

High-throughput perturbational datasets

Recent advances in cost-effective transcriptomics and image-based profiling technologies have made it possible to create extensive public datasets, allowing researchers to study the effects of chemical or genetic perturbations on cells in an automated, high-throughput manner. Notably, the Next Generation L1000 Connectivity Map (Subramanian et al., 2017) and the JUMP Cell Painting project (Chandrasekaran et al., 2023), developed through collaborations between pharmaceutical companies and research institutes, contain profiles of cells exposed to more than 100,000 unique compounds and genetic manipulations. These datasets provide a unique opportunity to explore genetic patterns and similarities to a) identify drug mechanisms of action, b) nominate therapeutics for a particular disease, and c) construct biological networks among perturbations and genes.

Key Findings

Overview

Smith and colleagues introduce PeML, a weakly supervised similarity metric learning method that transforms biological measurements into an intrinsic, dataset-specific basis. This allows biological relationships and mechanisms to be extracted from noisy high-throughput perturbational datasets. To measure the performance of the new method, the authors use the L1000 dataset comprising gene expression signatures of compounds in cancer and immortalised cell lines, as well as the CDRP Cell Painting dataset containing cellular morphology and function data from a single cell line. The authors show that PeML maximises the discrimination of replicate signatures, improves recall in biological data and yields better prediction of compound mechanisms of action. Recall (also known as sensitivity or true positive rate) is calculated as the ratio of the true positive (TP) predictions to the total number of actual positive instances in the dataset. PeML can be trained on moderately sized datasets and goes beyond traditional approaches by capturing a more profound notion of similarity. Therefore, it might improve data classification, clustering, and subsequent analyses.
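The definition of recall given above can be written as a formula. FN (false negatives, meaning actual positives that the method fails to retrieve) is a symbol introduced here for illustration, so that TP + FN is the total number of actual positives.

```latex
\[
\text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
\]
```

As a made-up example, if a dataset contains 10 true replicate pairs and a similarity function retrieves 8 of them, then TP = 8, FN = 2 and recall is 8 / 10 = 0.8.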

Fig. 1 Schematic of the weakly supervised similarity metric learning method Perturbational Metric Learning (PeML). Figure taken from Smith et al. (2023), bioRxiv, published under the CC-BY-NC-ND 4.0 International licence.

Perturbational Metric Learning (PeML)

PeML is a weakly supervised machine learning framework that learns a similarity function between samples. This method uses replicates of experiments as ground truth to train a data-driven similarity function. Unlike traditional methods, PeML is a feature transformation technique that works directly on processed genetic or physical characteristics data, eliminating the need to extract new features from the original raw data.
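The idea of using replicates as ground truth can be sketched as follows. This is an illustration of weak supervision in general, not the authors' implementation: the function name and the compound identifiers are hypothetical. Every pair of signatures receives a label of 1 if both come from the same compound and 0 otherwise, so no class annotation is required.

```python
from itertools import combinations

import numpy as np

def replicate_pair_labels(compound_ids):
    """Return all signature index pairs and a 0/1 label for each pair.

    A pair is labelled 1 when both signatures come from the same compound
    (treated as replicates) and 0 otherwise.
    """
    ids = np.asarray(compound_ids)
    pairs = np.array(list(combinations(range(len(ids)), 2)))
    labels = (ids[pairs[:, 0]] == ids[pairs[:, 1]]).astype(int)
    return pairs, labels

# Hypothetical compound identifiers, one per signature.
compounds = ["drugA", "drugA", "drugB", "drugB", "drugC"]
pairs, labels = replicate_pair_labels(compounds)
print(f"{labels.sum()} replicate pairs out of {len(labels)} pairs")
```

In a typical screen, replicate pairs are a small minority of all pairs, which is one reason why evaluation metrics that account for class imbalance are useful.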

PeML improves replicate recall in biological data

First, the authors conducted a replicate recall analysis to quantify the model’s ability to capture biologically relevant relationships in the data. To account for differences between cell lines, a separate context-specific model was trained for each cell line. Training was performed on small batches of data rather than on the entire dataset at once (mini-batch stochastic gradient descent), making it more efficient. Signatures representing the same compound treatment were grouped together, regardless of dosage or time point. Performance was measured with balanced AUC, which adjusts for some classes having more examples than others. AUC is a metric for evaluating machine learning models in binary classification tasks. It measures the area under the Receiver Operating Characteristic (ROC) curve, in which the true positive rate (TPR, the fraction of positive samples that are correctly classified) is plotted against the false positive rate (FPR, the fraction of negative samples that are incorrectly classified as positive), providing insight into the model’s ability to distinguish between classes.
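A replicate recall evaluation of this kind can be sketched with a baseline cosine similarity and a plain ROC AUC. This is a simplified illustration under stated assumptions: the signatures and compound labels are invented, and the AUC below is the standard (not the balanced) variant, since the exact balancing procedure used in the preprint is not described here.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive pair outscores a negative pair.

    Ties count as one half. This is the plain AUC, not a balanced variant.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Hypothetical signatures: rows 0-1 are replicates of one compound, rows 2-3 of another.
X = np.array([[1.0, 0.1, 0.0],
              [0.9, 0.2, 0.1],
              [0.0, 1.0, 0.2],
              [0.1, 0.8, 0.3]])
compound = np.array([0, 0, 1, 1])

S = cosine_similarity_matrix(X)
i, j = np.triu_indices(len(X), k=1)      # visit each unordered pair once
is_replicate = compound[i] == compound[j]
auc = roc_auc(S[i, j], is_replicate)
print(f"Replicate AUC: {auc:.2f}")
```

An AUC of 1 means every replicate pair is more similar than every non-replicate pair, while 0.5 corresponds to a similarity function that ranks pairs no better than chance.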

Subsequently, the model’s generalizability and performance across different compounds were evaluated using 5-fold compound-wise cross-validation. This means that the dataset was split into five parts based on the compounds, and each part was used as a validation set once while the other four parts were used for training. In this way, the authors demonstrated that PeML outperformed the baseline cosine similarity, yielding higher replicate rank and improving recall for replicate pairs in various cell lines, as well as achieving better results for previously unseen compounds.
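The compound-wise split can be sketched as below. The key point is that folds are formed over compounds rather than over individual signatures, so a validation compound is never seen during training. The function name, the random seed and the toy layout of 10 compounds with 3 signatures each are assumptions made for illustration.

```python
import numpy as np

def compound_wise_folds(compound_ids, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) so that no compound appears in both sets."""
    ids = np.asarray(compound_ids)
    rng = np.random.default_rng(seed)
    unique = rng.permutation(np.unique(ids))   # shuffle compounds, not signatures
    for held_out in np.array_split(unique, n_folds):
        val_mask = np.isin(ids, held_out)
        yield np.flatnonzero(~val_mask), np.flatnonzero(val_mask)

# Hypothetical layout: 10 compounds with 3 signatures each.
compounds = np.repeat([f"cpd{i}" for i in range(10)], 3)
folds = list(compound_wise_folds(compounds))

for k, (train_idx, val_idx) in enumerate(folds):
    held_out = sorted(set(compounds[val_idx]))
    print(f"fold {k}: {len(train_idx)} training signatures, held-out compounds {held_out}")
```

Splitting at the level of individual signatures instead would place replicates of the same compound in both training and validation sets, which would overestimate performance on unseen compounds.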

PeML improves prediction of compound mechanism of action from perturbational signatures

Next, the authors benchmarked PeML’s ability to identify drugs’ mechanisms of action. For each cell line in the L1000 and Cell Painting datasets, they found that PeML recovers a greater proportion of biologically relevant mechanisms of action. Furthermore, a signal-to-noise ratio analysis revealed that PeML better discriminates similar pairs from the background than standard similarity metrics.

Generalizability of PeML

While the previous analyses demonstrated promising results for large high-quality datasets, the performance on smaller training datasets remained unknown. Hence, the authors assessed the minimal training data required for a well-generalised model by downsampling the original datasets. Their results indicate that a few hundred conditions with replicates are sufficient to identify and retrieve biologically relevant associations from a given dataset.

Finally, the authors tested their initial hypothesis that context-specific models tailored to a specific cell line perform better than pan-models trained on all cell lines. The results demonstrated that learning context-specific models for different cancer cell lines improved performance on similarity retrieval tasks compared with both models trained on all contexts and the cosine baseline.

Fig. 2 Cell line-specific metric learning functions outperform a pan-dataset function and a baseline cosine function in predicting mechanism of action. Figure taken from Smith et al. (2023), bioRxiv, published under the CC-BY-NC-ND 4.0 International licence.

Further Material

GitHub Repository

R package (not released yet)

Conclusion and Perspective

As the volume of large-scale biological datasets continues to grow, weakly supervised learning algorithms are becoming increasingly relevant, offering data-driven and scalable analysis while minimising the dependency on costly and time-consuming expert annotations and training data. In this preprint, Smith and colleagues present Perturbational Metric Learning (PeML), a powerful tool for the analysis of large biological datasets. PeML learns a data-driven similarity function by transforming biological measurements into an intrinsic, dataset-specific basis to extract meaningful biological associations, such as compound mechanisms of action, from noisy datasets. In addition to capturing a more meaningful notion of similarity, data in the transformed basis can be used for other analysis tasks, such as classification and clustering.

The idea of integrating large-scale imaging data into our pipelines has emerged as a pressing challenge. This led us to consider featuring a preprint that offers valuable insights into bridging multi-omics data analysis with imaging and machine learning. Given Benjamin’s expertise in integrating and analysing large-scale multi-omics data, along with Anna’s background in computer vision, bioimage analysis and machine learning, this preLight post presents an exciting opportunity for interdisciplinary collaboration.

References

Chandrasekaran, S. N., Ackerman, J., Alix, E., Ando, D. M., Arevalo, J., Bennion, M., Boisseau, N., Borowa, A., Boyd, J. D., Brino, L., Byrne, P. J., Ceulemans, H., Ch’ng, C., Cimini, B. A., Clevert, D.-A., Deflaux, N., Doench, J. G., Dorval, T., Doyonnas, R., … & Carpenter, A. E. (2023). JUMP Cell Painting dataset: morphological impact of 136,000 chemical and genetic perturbations. bioRxiv. https://doi.org/10.1101/2023.03.23.534023

Eling, N., Morgan, M. D., & Marioni, J. C. (2019). Challenges in measuring and understanding biological noise. Nature Reviews Genetics, 20(9), 536–548. https://doi.org/10.1038/s41576-019-0130-6

Hernández-González, J., Inza, I., & Lozano, J. A. (2016). Weak supervision and other non-standard classification problems: A taxonomy. Pattern Recognition Letters, 69, 49–55. https://doi.org/10.1016/j.patrec.2015.10.008

Duffner, S., Garcia, C., Idrissi, K., & Baskurt, A. (2021). Similarity Metric Learning. Multi-faceted Deep Learning – Models and Data. ⟨hal-03465119⟩ https://hal.science/hal-03465119

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., … & Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545-15550. https://doi.org/10.1073/pnas.0506580102

Subramanian, A., Narayan, R., Corsello, S. M., Peck, D. D., Natoli, T. E., Lu, X., Gould, J., Davis, J. F., Tubelli, A. A., Asiedu, J. K., Lahr, D. L., Hirschman, J. E., Liu, Z., Donahue, M., Julian, B., Khan, M., Wadden, D., Smith, I. C., Lam, D., Liberzon, A., … & Golub, T. R. (2017). A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell, 171(6), 1437–1452.e17. https://doi.org/10.1016/j.cell.2017.10.049

Urbanczyk-Wochniak, E., Luedemann, A., Kopka, J., Selbig, J., Roessner-Tunali, U., Willmitzer, L., & Fernie, A. R. (2003). Parallel analysis of transcript and metabolic profiles: a new approach in systems biology. EMBO reports, 4, 989–993. https://doi.org/10.1038/sj.embor.embor944

Tags: cell painting, classification, clustering, correlations, data-driven, gsea, machine learning, next generation connectivity map (l1000), perturbations, similarity metric learning

doi: https://doi.org/10.1242/prelights.35232

Author's response

The author team shared

Thanks very much for your interest in my PeML manuscript.

Q1: We are curious to see how the performance of PeML compares with other state-of-the-art similarity metric learning methods providing a more comprehensive evaluation of the proposed method. Have you run any benchmarks to compare PeML to other algorithms?

Despite advances in representation learning – the umbrella under which self-supervised learning (SSL) and weakly supervised learning (WSL) fall – the most commonly used metrics remain off-the-shelf methods like correlation and gene set approaches. The recent revolution in self-supervised learning has most affected computer vision and NLP, as with SimCLR and LLMs. The properties of these spaces are quite different from biological data – in particular that identity preserving transformations are easier to define. Differential perturbational signatures, measuring changes in some feature space, must be analyzed with a scale invariant metric, like cosine: doubling the dose of a drug should produce a similar signature. For both of these reasons, there are relatively few metric learning methods that can be intelligently applied to the perturbational domain. There have been some interesting recent developments in biological SSL. A preprint by Moshkov 2022 on a weakly supervised CNN-based method for cell painting perturbational similarity is promising; CLEAR from Han 2022 develops an scRNA-specific method; scGPT from the Bo Wang lab uses an attention mask on scRNA data to do SSL. However, these are not perturbational datasets. In short, I am not aware of perturbational WSL metric learning methods that are platform-agnostic like PeML. The key advantage of PeML is its simplicity: it is domain agnostic and does not make assumptions about identity-preserving transformations beyond experimental reproducibility. I have not compared PeML’s encoding to a more complex approach, like Moshkov, but it is a worthwhile question.

Q2: PeML requires biological replicates of experiments as ground truth of similar signatures, which may not always be available or feasible to obtain. Is there an alternative way to obtain or infer ground truth signatures?

This is an excellent question, as in domains like scRNA, obtaining some form of replicate experiments isn’t feasible. At present, SSL methods require a ground truth either from identity-preserving transformations or replicates. I speculate that for spaces with local convexity, it might be possible to learn the properties of the space from a particular set of replicated experiments and extrapolate in general. For instance, for scRNA, it might be possible to calibrate the space with repeated measurements of a spike-in control, learn a metric, then apply it to data where replicated experimentation is impossible. An approach like this seems more satisfying than SSL with imputation via an attention mask. Another possible method for defining prior similarity would be to use annotation from another source, such as a phenotypic readout or label of data points, but this coarse approach risks oversimplifying important differences. Ultimately, I don’t know of a better way to learn the properties of a space other than repeated (synthetic or otherwise) measurements.

Q3: Considering that PeML may not be well-suited for datasets with a small number of replicates or features, and its applicability varying based on the specific characteristics of the biological data; when would you recommend using PeML to identify relationships in a dataset and when should one use alternative methods?

Basically any form of representation learning should be evaluated with some benchmark. One of the simplest ways of doing representation learning is to do PCA on a dataset and discard some number of components. But to validate that this is useful, it’s necessary to have some ground truth task on which it can be shown the representation helps extract meaningful information before applying it to new analysis. As we have shown with PeML, you don’t necessarily need a colossal dataset to extract a useful WSL representation. My recommendation would be to first identify a representative benchmark task, then compare any number of representation learning methods to determine which method performs best on that benchmark. Much like a cross-validated R² value for regression, it’s necessary to have some quantifiable evidence of performance before blithely applying representation methods, especially when all these methods have the risk of failing to generalize to a new domain or dataset.

Q4: In your discussion, you mention that “The space of transformations on transcriptomic data, for example, that leave the identity of the biological state unchanged is unknown.” What approaches/ideas are currently discussed by the community to address and overcome this challenge?

Self-supervised learning has unlocked an entire world of label-free learning, exemplified by revolutions in computer vision classification and Large Language Models in NLP. Biological space is sufficiently complex that Weakly Supervised Learning, the poor relation of SSL, has been needed to learn biological relationships. The presence of these relationships and the relevance of a lower dimensional manifold has been known for decades; pathways are a great example of structure on gene expression data. It may be that attention masks as from NLP and in scGPT are sufficient to learn this manifold, but I believe that we must identify the neighbourhoods of a particular class or data point. Apart from the challenge of understanding the properties of the biological manifold, each domain is different. Cellular morphology imaging has proved useful due to its high throughput, and much of the knowledge from computer vision can be translated to that space. Proteomics, transcriptomics, chromatin measurements, and DNA are all significantly trickier, and the space of cancer has a vast mutational landscape and relatively few samples from which to learn. We can leverage existing compendia, like TCGA, GTEx, ICGC, and even L1000. Ultimately, I suspect each data modality will require its own insights and tricks, analogous to the image transformations from computer vision, to learn the properties of the data manifold.
