FAIR enough? A perspective on the status of nucleotide sequence data and metadata on public archives
Posted on: 3 November 2021
Preprint posted on 24 September 2021
Digging for data: How FAIR is nucleotide sequencing data storage and how can it be improved to facilitate data mining?
Selected by Kristina Kuhbandner
Categories: bioinformatics, ecology, genomics
Background
Nucleotide sequencing data are used in all areas of the life sciences, and owing to technical progress the amount of data is growing exponentially (1). A recent trend is to “recycle” and reanalyze existing data archived in open-access repositories. This “data mining” approach can help to answer scientific questions beyond those tackled in the study that originally produced the data. However, proper reuse requires more than the primary data, such as a nucleotide sequence; it also requires the associated metadata, that is, features that describe the primary data, such as the experimental conditions and methods used to generate them. For this reason, the FAIR guidelines were introduced, which offer directions for making data Findable, Accessible, Interoperable and Reusable (2).
To facilitate data standardization, the Genomic Standards Consortium (GSC) established the so-called MIxS (Minimum Information about any (x) Sequence) checklists, which define mandatory parameters and recommend a uniform vocabulary for describing the sampled environment and experimental settings, for example through the “Environment Ontology” (ENVO) (3). ENA (European Nucleotide Archive), one of the three main data archives, strongly encourages submitters to follow these MIxS guidelines. Additionally, brokerage services such as the German Federation for Biological Data (GFBio) or the China Nucleotide Sequence Archive (CNSA) offer professional support for simple and sustainable data deposition. Despite all measures implemented to promote FAIRness and to further standardize data storage, interoperability and reusability are still hampered by incorrect and insufficient metadata (4).
Research question and approach
In their study, Hassenrück and colleagues examined the metadata status of raw read Illumina amplicon and whole genome shotgun sequencing data from ecological material. Specifically, they aimed to assess whether the submitted data comply with data submission standards. To this end, the authors searched for raw read data from ecological metagenomes (NCBI taxid 410657) available at ENA. They then reviewed all “cases” for metadata information about i) geographic coordinates, ii) target gene, subfragment or primers, iii) length of the amplified fragment (nominal length) and iv) use of standard vocabulary to describe the sampled environment according to ENVO.
Furthermore, the format of the submitted raw data is of major importance for the automated reuse of nucleotide sequencing data. The authors therefore performed a data mining case study on amplicon studies of the V3-V4 hypervariable region of the bacterial 16S rRNA gene to investigate, among other things, whether the data were filed correctly according to ENA guidelines and properly declared as environmental samples.
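The four-item review described above can be pictured as a simple per-record check. The following is a minimal sketch, not the authors' actual pipeline: the field names (`lat`, `lon`, `target_gene`, `env_biome`, and so on), the helper names and the example record are hypothetical, and recognising ENVO usage by an `ENVO:` identifier pattern is a simplifying assumption.

```python
import re

# Hypothetical field names chosen for illustration; real archive exports
# may label these attributes differently.
TARGET_KEYS = ("target_gene", "target_subfragment", "pcr_primers")
ENV_KEYS = ("env_biome", "env_feature", "env_material")

# Assumption: an ENVO term is recognised by an identifier such as ENVO:00000447.
ENVO_ID = re.compile(r"ENVO:\d+")


def is_filled(record, key):
    """True if the record has a non-empty value for the given key."""
    value = record.get(key)
    return value is not None and str(value).strip() != ""


def audit_record(record):
    """Report which of the four reviewed metadata items a record provides."""
    env_text = " ".join(str(record.get(key, "")) for key in ENV_KEYS)
    return {
        "coordinates": is_filled(record, "lat") and is_filled(record, "lon"),
        "target": any(is_filled(record, key) for key in TARGET_KEYS),
        "nominal_length": is_filled(record, "nominal_length"),
        "envo_terms": bool(ENVO_ID.search(env_text)),
    }


# Invented example record, not taken from the preprint.
example = {
    "lat": "54.18",
    "lon": "7.90",
    "target_gene": "16S rRNA",
    "nominal_length": "",
    "env_biome": "marine biome [ENVO:00000447]",
}
print(audit_record(example))
```

In this toy record the nominal length is left empty, so the check flags exactly the kind of gap that the review was designed to detect.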
Main Findings
Collectively, the number of cases increased steadily over recent years and peaked in 2020 with more than 120 000 submitted sequences. In total, however, only 6.5% of the analyzed sequences complied with the MIxS checklist, and this proportion has clearly decreased since 2018.
General metadata, such as geographic coordinates, were provided in nearly all cases. In contrast, mandatory information about the targeted DNA region, which is critical for data interpretation and reuse, was inadequate in most cases. Only 7% of all examined cases contained correct target gene details, whereas about one third of the sequences submitted according to the MIxS checklist had this information readily available. Nominal length, another mandatory parameter, was specified in only 14% of all cases; notably, almost all cases that used a MIxS checklist provided this value.
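To make the notion of a yearly compliance share concrete, here is a minimal sketch of how such a proportion could be computed from a list of cases. The function name and the toy data are invented for illustration and do not reproduce the preprint's numbers or method.

```python
from collections import defaultdict


def compliance_by_year(cases):
    """Return the share of checklist-compliant cases per submission year.

    `cases` is an iterable of (year, uses_checklist) pairs.
    """
    totals = defaultdict(int)
    compliant = defaultdict(int)
    for year, uses_checklist in cases:
        totals[year] += 1
        if uses_checklist:
            compliant[year] += 1
    return {year: compliant[year] / totals[year] for year in sorted(totals)}


# Invented toy data: two cases from 2018 and four from 2020.
toy_cases = [
    (2018, True), (2018, False),
    (2020, False), (2020, False), (2020, True), (2020, False),
]
print(compliance_by_year(toy_cases))  # {2018: 0.5, 2020: 0.25}
```

The toy data illustrate how the absolute number of submissions can rise while the compliant share falls.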
Evaluation of the description of environmental characteristics using ENVO revealed that around 70% of all cases did not include any information about these parameters. In contrast, although the use of ENVO terms was sometimes inconsistent, nearly all cases using the MIxS checklist provided those values. Of note, the use of a brokerage service substantially improved metadata quality, especially accessibility and interoperability. Compared to amplicon sequencing data, the quality of whole genome shotgun sequencing (WGS) data was slightly higher.
Lastly, as part of a data mining case study, the authors analyzed raw reads from 39 studies stored on ENA for compliance with ENA submission requirements. They report that only eight studies were submitted as required, and thus point out that the interoperability and reusability of nucleotide sequencing data in particular are still limited.
Overall, these results reveal an alarming trend towards declining use of proper standards in data submission and storage, which negatively impacts metadata quality. To overcome these problems, the authors give recommendations for the different parties involved, including researchers, research institutions and funding agencies.
Why I chose this preprint
In my opinion, data sharing and communication are the basis for successful, reliable and sustainable research. Recent technological advances allow us to generate an overwhelming amount of data, but in most cases only a small fraction is used in the original study. To fully exploit the potential of these buried “data corpses”, which could massively facilitate scientific progress without additional benchwork, FAIRness and proper data management are key. The present study by Dr. Hassenrück and colleagues calls attention to the existing deficits in nucleotide sequencing data storage and gives helpful, easy-to-implement suggestions for the different parties involved in the data sharing process. Moreover, it was very exciting for me to gain deeper insights into global data management processes, regulations and institutions.
Questions to the authors
- Why do you think WGS data are more frequently submitted in compliance with MIxS?
- Which of your recommendations do you consider the most important?
- Your study focused on data derived from ENA. Are the same problems present in other data archives?
References
1) Harrison, Peter W., et al. “The European Nucleotide Archive in 2020.” Nucleic acids research 49.D1 (2021): D82-D85.
2) Wilkinson, Mark D., et al. “The FAIR Guiding Principles for scientific data management and stewardship.” Scientific data 3.1 (2016): 1-9.
3) Yilmaz, Pelin, et al. “Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.” Nature biotechnology 29.5 (2011): 415-420.
4) Eckert, Ester M., et al. “Every fifth published metagenome is not available to science.” PLoS biology 18.4 (2020): e3000698.
doi: https://doi.org/10.1242/prelights.30958