FAIR enough? A perspective on the status of nucleotide sequence data and metadata on public archives

Christiane Hassenrück, Tobias Poprick, Véronique Helfer, Massimiliano Molari, Raissa Meyer, Ivaylo Kostadinov

Posted on: 3 November 2021

Preprint posted on 24 September 2021

Digging for data: How FAIR is nucleotide sequencing data storage and how can it be improved to facilitate data mining?

Selected by Kristina Kuhbandner

Categories: bioinformatics, ecology, genomics

Background

Nucleotide sequencing data are used in all areas of the life sciences, and owing to technical progress the amount of data is growing exponentially (1). There is an emerging trend of “recycling” and reanalyzing existing data archived in open-access repositories. This “data mining” approach can help to answer scientific questions beyond those tackled in the study that originally produced the data. However, to ensure proper reuse of existing data, it is not enough to provide the primary data, such as a nucleotide sequence; it is also essential to specify the associated metadata – features which describe the primary data – such as the experimental conditions and the methods used to generate them. To this end, the FAIR guidelines were introduced, which offer directions for making data Findable, Accessible, Interoperable and Reusable (2).

To facilitate data standardization, the Genomic Standards Consortium (GSC) established the so-called MIxS (Minimum Information about any (x) Sequence) checklists, which define mandatory parameters and suggest a uniform vocabulary for describing the sampled environment and experimental settings, for example through the “Environment Ontology” (ENVO) (3). ENA (European Nucleotide Archive), one of the three main data archives, strongly encourages submitters to follow these MIxS guidelines. Additionally, professional support for simple and sustainable data deposition is offered by brokerage services such as the German Federation for Biological Data (GFBio) or the China Nucleotide Sequence Archive (CNSA). Despite all the measures implemented to promote FAIRness and to further standardize data storage, interoperability and reusability are still hampered by incorrect and insufficient metadata (4).

Research question and approach

In their study, Hassenrück and colleagues examined the metadata status of raw read Illumina amplicon and whole genome shotgun sequencing data from ecological material. Specifically, they aimed to assess whether the primary sequence data comply with data submission standards. To this end, the authors searched for raw read data from ecological metagenomes (NCBI taxid 410657) available at ENA. They then reviewed all “cases” for metadata on i) geographic coordinates, ii) target gene, subfragment or primers, iii) length of the amplified fragment (nominal length) and iv) use of standard vocabulary to describe the sampled environment according to ENVO.
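
The four checks listed above can be pictured as a small completeness audit over metadata records. The following is only an illustrative sketch, not the authors' pipeline: the record layout and the field names (`lat`, `lon`, `target_gene`, `nominal_length`, `env_biome`) are hypothetical stand-ins for whatever columns a real metadata export contains, and the list of placeholder values is an assumption.

```python
from typing import Dict, Iterable, Optional

Record = Dict[str, Optional[str]]

# Hypothetical field names; a real metadata export will use its own column names.
CRITERIA = {
    "coordinates": ("lat", "lon"),
    "target_gene": ("target_gene",),
    "nominal_length": ("nominal_length",),
    "environment": ("env_biome",),
}


def is_filled(value: Optional[str]) -> bool:
    """Treat None, empty strings and common placeholders as missing."""
    if value is None:
        return False
    return value.strip().lower() not in {"", "na", "n/a", "not provided", "missing"}


def check_record(record: Record) -> Dict[str, bool]:
    """Report, per criterion, whether every required field is filled in."""
    return {
        name: all(is_filled(record.get(field)) for field in fields)
        for name, fields in CRITERIA.items()
    }


def summarize(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of records that satisfy each criterion (0.0 if there are none)."""
    checks = [check_record(r) for r in records]
    if not checks:
        return {name: 0.0 for name in CRITERIA}
    return {name: sum(c[name] for c in checks) / len(checks) for name in CRITERIA}


# Two made-up records, for illustration only.
example_records = [
    {"lat": "54.18", "lon": "7.90", "target_gene": "16S rRNA",
     "nominal_length": "450", "env_biome": "ENVO:00000447"},
    {"lat": "54.18", "lon": "7.90", "target_gene": "", "nominal_length": None},
]
summary = summarize(example_records)
```

A presence check like this says nothing about whether a filled-in value is correct, which is why the study also distinguishes correct from merely present target gene information.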

Furthermore, the format of the submitted raw data is of major importance for the automated reuse of nucleotide sequencing data. The authors therefore performed a data mining case study on amplicon studies of the V3-V4 hypervariable region of the bacterial 16S gene, investigating, among other things, whether the data were filed correctly according to ENA guidelines and properly declared as environmental samples.

Main Findings

Overall, the number of cases has increased steadily in recent years, peaking in 2020 with more than 120 000 submitted sequences. In total, however, only 6.5% of the analyzed sequences complied with the MIxS checklist, and since 2018 this proportion has clearly decreased.

General metadata, such as geographic coordinates, were provided in nearly all cases. In contrast, mandatory information about the targeted DNA region – critical for data interpretation and reuse – was inadequate in most cases. Only 7% of all examined cases contained correct target gene details, whereas about one third of the sequences submitted according to the MIxS checklist had this information readily available. Nominal length, another mandatory parameter, was specified in only 14% of all cases; notably, almost all cases submitted with a MIxS checklist provided this value.

Evaluation of the description of environmental characteristics using ENVO revealed that around 70% of all cases did not include any information about these parameters. In contrast, although the use of ENVO terms was sometimes inconsistent, nearly all cases using the MIxS checklist provided those values. Of note, the use of a brokerage service substantially improved metadata quality, especially accessibility and interoperability. Compared to amplicon sequencing data, the quality of whole genome shotgun sequencing (WGS) data was slightly higher.
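
One kind of inconsistency, free text where an ontology term is expected, can be caught with a simple format check. The sketch below assumes that ENVO terms are written as compact identifiers of the form `ENVO:` followed by eight digits (for example `ENVO:00000447`), possibly alongside a label. It checks syntax only and does not confirm that a term actually exists in the ontology; the function names are hypothetical.

```python
import re
from typing import List

# Assumed identifier shape: "ENVO:" followed by exactly eight digits.
ENVO_ID = re.compile(r"\bENVO:\d{8}\b")


def find_envo_ids(value: str) -> List[str]:
    """Return all ENVO-style identifiers found in a metadata value."""
    return ENVO_ID.findall(value or "")


def uses_envo(value: str) -> bool:
    """True if the value contains at least one ENVO-style identifier."""
    return bool(find_envo_ids(value))
```

For instance, `uses_envo("marine biome [ENVO:00000447]")` is true, while a bare free-text entry such as `"seawater"` is not recognized.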

Lastly, as part of a data mining case study, the authors analyzed raw reads from 39 studies stored on ENA for compliance with ENA submission requirements. They report that only eight studies were submitted as required, and they point out that the interoperability and reusability of nucleotide sequencing data in particular are still limited.

Overall, these results reveal an alarming decline in the use of proper standards in data submission and storage, which negatively impacts metadata quality. To overcome these problems, the authors give recommendations for the different parties involved, including researchers, research institutions and funding agencies.

Why I chose this preprint

In my opinion, data sharing and communication are the basis for successful, reliable and sustainable research. Recent technological advances allow us to generate an overwhelming amount of data, but in most cases only a small fraction is used in the original study. To fully exploit the potential of these buried “data corpses”, which could massively facilitate scientific progress without additional benchwork, FAIRness and proper data management are key. The present study by Dr. Hassenrück and colleagues calls attention to the existing deficits in nucleotide sequencing data storage and gives helpful and easy-to-implement suggestions for the different parties involved in the data sharing process. Moreover, it was very exciting for me to gain deeper insights into global data management processes, regulations, and institutions.

Questions to the authors

Why do you think WGS data are more frequently submitted in compliance with MIxS?
Which of your recommendations do you consider the most important?
Your study focused on data derived from ENA. Are the same problems present in other data archives?

References

1) Harrison, Peter W., et al. “The European Nucleotide Archive in 2020.” Nucleic Acids Research 49.D1 (2021): D82-D85.

2) Wilkinson, Mark D., et al. “The FAIR Guiding Principles for scientific data management and stewardship.” Scientific Data 3.1 (2016): 1-9.

3) Yilmaz, Pelin, et al. “Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.” Nature Biotechnology 29.5 (2011): 415-420.

4) Eckert, Ester M., et al. “Every fifth published metagenome is not available to science.” PLoS Biology 18.4 (2020): e3000698.

Tags: availability, data management, fairness, interoperability, nucleotide sequencing, reusability

doi: https://doi.org/10.1242/prelights.30958

Author's response

Christiane Hassenrueck shared

1) Why do you think WGS data are more frequently submitted in compliance
with MIxS?

Keeping in mind that this is only my personal opinion: I think WGS
studies are more frequently submitted according to MIxS because they
require more resources (i.e. are more expensive) and may therefore be
only feasible to conduct for a smaller community, presumably more
experienced in their field and also better trained in data management.
I would also like to point out that our study evaluated metadata on the
run level. As sequencing data are usually submitted by one submitter per
study (and metadata quality strongly depends on the submitter), the
percentage of runs of a particular metadata quality may also depend on
the number of runs per study (assuming that the submitter took equal
care to enter the metadata for all runs in a study).

2) Which of your recommendations do you consider the most important?

Theoretically, the list of suggestions is already boiled down to the
most important, but if I had to choose again, I would pick:
@reviewers: review (meta)data as thoroughly as the manuscript text
@research institutions: capacity development and training
@researchers: diligent use of checklists beyond mandatory parameters
@databases: if feasible, implement further automated checkpoints for
data consistency

3) Your study focused on data derived from ENA. Are the same problems
present in other data archives?

We looked specifically at the data on the Sequence Read Archive (SRA) of
the INSDC databases (ENA, NCBI, and DDBJ). As these databases are
mirrored, we only accessed them through ENA, since the data themselves are
expected to be the same regardless of which portal is used for access.
Evaluating data access through any of the other database portals was
beyond the scope of our study.
For sequence data the INSDC databases are globally the biggest
resources, which often constitute the basis for further derived
databases, which may then suffer from the same problems. While we did
not investigate other (derived) sequence data repositories, I find it
likely that the discussed issues are widespread.
