A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-Seq data

Shamus M. Cooley, Timothy Hamilton, Eric J. Deeds, J. Christian J. Ray

Posted on: 16 August 2019

Preprint posted on 2 July 2019

Your scRNA-seq analysis pipeline may be warping your data, due to dimensionality reduction.

Selected by Suraj Kannan

What I like about this study:

I would like to highlight two aspects of this study. Firstly, the topic is critical to anyone who has ever analyzed scRNA-seq data (which is becoming increasingly ubiquitous in cellular biology). Dimensionality reduction is the first step in almost every algorithm and analysis pipeline, and a fundamental assumption is that this step preserves important (biologically relevant) information from the original high-dimensional data. If in fact this step distorts the data, as the authors convincingly argue, biological conclusions from scRNA-seq data would need to be scrutinized. Secondly, this paper is exceptionally well written. I appreciate that the authors use analogies and toy cases that are both illustrative and clear, even to those without a mathematical or statistical background.

Background

scRNA-seq data is inherently high-dimensional, with increasingly sensitive methods capable of detecting thousands of genes per cell. Higher-dimensional data, while potentially providing more information, is more difficult to analyze; for example, many algorithms fail to scale to higher dimensions [1, 2]. A large number of methods exist to transform high-dimensional data to low-dimensional data while preserving key aspects of the structure (see below for a toy example on several methods) [3]. These methods, including the frequently used t-SNE and UMAP algorithms, underlie nearly all scRNA-seq analysis pipelines, including commonly used algorithms for clustering and trajectory analysis [1]. A fundamental assumption is that dimensionality reduction preserves important structure in the high-dimensional data or, at the very worst, does not skew the structure significantly.

**Figure 1:** Example of different dimensionality reduction techniques transforming a 3-dimensional Swiss roll to 2 dimensions. Taken from [3].

Key Findings

The authors challenge this assumption using several illustrative cases. As a toy geometrical example, the authors first generated hyperspheres (generalizations of spheres to higher dimensions). In a clever approach, the authors generated lower-dimensional hyperspheres in higher dimensions. For example, they could construct a 3-dimensional hypersphere in 5 dimensions by taking a vector of 3 numbers (the 3-dimensional hypersphere) and appending 2 zeros to make it 5-dimensional. It is trivial to reduce the dimensions of this sphere to 3 or 4 dimensions, e.g. transform the point [1 1 1 0 0] → [1 1 1 0] (4 dimensions) or [1 1 1 0 0] → [1 1 1] (3 dimensions), because the dropped coordinates are always zero and carry no information. Thus, the authors expected that standard dimensionality reduction techniques should readily succeed in transforming these hyperspheres to lower dimensions, or at the very least preserve local neighbors between different points. Instead, the authors found that even in this simple toy case, all of the methods introduced huge distortions, such that most points had very different neighboring points in low dimensions compared to high dimensions. Increasing the number of points sampled did not improve the mapping but in fact made it worse.

**Figure 2:** Example of transforming a sphere to 2 dimensions via t-SNE. Local neighborhoods are distorted by this transformation. Taken from Figure 1C of manuscript.
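
To make the hypersphere construction concrete, here is a minimal sketch using only the Python standard library. It is not the authors' code or their metric: the helper names (`sample_sphere`, `embed`, `truncate`, `mean_neighbor_overlap`) are hypothetical, the reduction step simply drops coordinates rather than running t-SNE or UMAP, and the score is a plain Jaccard overlap of k-nearest-neighbor sets, chosen here as an illustrative stand-in for "how many neighbors survive the reduction".

```python
import math
import random

def sample_sphere(n, dim, seed=0):
    """Sample n points on the unit sphere described by `dim` coordinates."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        points.append([x / norm for x in v])
    return points

def embed(points, ambient_dim):
    """Pad each point with trailing zeros so it lives in `ambient_dim` dimensions."""
    return [p + [0.0] * (ambient_dim - len(p)) for p in points]

def truncate(points, target_dim):
    """Naive 'reduction' (an assumption for this sketch): keep the first coordinates."""
    return [p[:target_dim] for p in points]

def knn(points, k):
    """Brute-force k nearest neighbors: one set of neighbor indices per point."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def mean_neighbor_overlap(high, low, k=10):
    """Average Jaccard overlap between each point's neighbors before and after reduction."""
    nh, nl = knn(high, k), knn(low, k)
    return sum(len(a & b) / len(a | b) for a, b in zip(nh, nl)) / len(high)

# A sphere described by 3 coordinates, embedded in 5 dimensions by zero-padding.
high = embed(sample_sphere(200, 3), 5)

score_3d = mean_neighbor_overlap(high, truncate(high, 3))  # drops only the zeros
score_2d = mean_neighbor_overlap(high, truncate(high, 2))  # drops a real coordinate

print(f"5D -> 3D neighbor overlap: {score_3d:.2f}")
print(f"5D -> 2D neighbor overlap: {score_2d:.2f}")
```

Dropping only the padded zeros leaves every neighborhood intact, whereas going below the 3 coordinates that actually describe the sphere folds its two hemispheres onto each other and scrambles neighbors. The same kind of before-and-after neighbor comparison could, in principle, be applied to the output of any reduction method.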

The authors then used their approach to analyze dimensionality reduction of real scRNA-seq data. Consistently, they found huge distortions in data even when reducing to relatively high dimensions. This is particularly problematic as most pipelines reduce data to 2 or 3 dimensions (as these are easily visualized). Indeed, the authors tested several commonly used pipelines for clustering and trajectory generation and found that they are affected by dimensionality reduction.

My thoughts

This manuscript affects any research group using scRNA-seq techniques. As an example, in developmental biology a common approach is to reconstruct developmental trajectories of various lineages to study how cells differentiate and specify. If dimensionality reduction inherently skews the data, then the results of these analyses are questionable.

There is no question that these results are disturbing, and they motivate the need to develop improved scRNA-seq pipelines (either by developing better dimensionality reduction methods or eliminating the need for them). In the meantime, however, I do wonder if current techniques are good enough for now, particularly for clustering. While local neighborhoods may be distorted, t-SNE and UMAP plots do place cells with similar gene expression close to one another – this can be readily seen by plotting marker genes, for example. Particularly for smaller studies or studies where differences between cell types are clearly defined, t-SNE and UMAP may suffice despite the distortions they may introduce. Likewise, while dimensionality reduction clearly does affect cell-to-cell distances, current trajectory reconstruction methods do at least partially correlate with other biological parameters (for example, developmental age). While we should be cautious about interpretation, I suspect that computational methods combined with biological intuition can at least be passable until better methods are developed.

Citations

[1] Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet (2015), 16(3):133-45.

[2] Friedman JH. On Bias, Variance, 0/1 – Loss, and the Curse-of-Dimensionality. Data Mining and Knowledge Discovery (1997), 1:55-77.

[3] Manifold Learning, scikit-learn documentation. Link: https://scikit-learn.org/stable/modules/manifold.html

Tags: dimensionality reduction, rna-seq, single cell, tsne, umap

doi: https://doi.org/10.1242/prelights.13389


Author's response

Shamus Cooley, J. Christian J. Ray shared

I had the chance to talk with the authors of the manuscript to get some quotes as well as feedback on some of my comments. Summarized below:

From Shamus Cooley about the future of scRNA-seq: “The good news is that the experimental data can always be analyzed again, once we have better tools for dimensionality reduction, without having to do another experiment.”

From Dr. Ray: “Our results arise from the conception that there must be a lower dimensional representation of high-dimensional single cell data. For example, immunologists have long been successful identifying important cohorts of cells with a small set of markers. We were surprised to find that patterns of mRNA expression do not allow such easy classification with current methods.”

Additionally, Dr. Ray added the following regarding my comment of “Is dimensionality reduction good enough for now?”

“We found that using our own parameters in t-SNE or UMAP can make the cell types either appear to mostly cluster together or be much more mixed up. We suspect that some researchers tweak the dimensionality reduction parameters to make the cell clusters consistent with their preconceptions based on older studies or their biological intuition. Practitioners I have talked to essentially say as much: they often confirm the clustering through orthogonal methods. This type of argument seems to be circular: you can modify the 2-D (or 3-D) representation until it matches your expectations, but then what have you learned from the new experiment? Overall, the shared opinions of myself and my co-authors is that current low-dimensional analysis methods are not good enough.”
