
Finding the right words to evaluate research: An empirical appraisal of eLife’s assessment vocabulary

Tom E. Hardwicke, Sarah R. Schiavone, Beth Clarke, Simine Vazire

Posted on: 13 May 2024, updated on: 14 May 2024

Preprint posted on 30 April 2024

How can the editor and reviewer assessment be conveyed to readers clearly and consistently?

Selected by Benjamin Dominik Maier

Background

In 2022, the not-for-profit, open-access science publisher eLife announced a new publishing process without the traditional editorial accept/reject decisions after peer review. Instead, all papers that are sent out for peer review are published as “reviewed preprints”. These include expert reviews summarising the findings and highlighting strengths and weaknesses of the study, as well as a short consensus evaluation summary from the editor (see eLife announcement). In the eLife assessment, the editor summarises a) the significance of the findings and b) the strength of the evidence reported in the preprint on an ordinal scale (Table 1; check out this link for more information), based on their own and the reviewers’ subjective appraisal of the study. Following peer review, authors can revise their manuscript before declaring a final version. Readers can browse the different versions of the manuscript, read the reviewers’ comments, and consult the summary assessment.

Table 1. eLife significance of findings and strength of support vocabulary. Table taken from Hardwicke et al. (2024), bioRxiv, published under the CC-BY 4.0 International licence.

Study Overview

In this featured preprint, Tom Hardwicke and his colleagues from the Melbourne School of Psychological Sciences designed an empirical online questionnaire to evaluate the clarity and consistency of the vocabulary used in the new eLife consensus assessment (as detailed in Table 1). They focussed on determining a) whether diverse readers rank the significance and strength of support labels in a consistent order, b) whether these rankings correspond to the intended scale, and c) how clearly the various labels can be distinguished from one another. Additionally, the authors proposed an alternative five-level scale covering the full range of measurement (very strong – strong – moderate – weak – very weak) and compared its perception to eLife’s scale.

The study’s research question, methods, and analysis plan underwent peer review and were pre-registered as a Stage One Registered Report (link). Study participants were recruited globally via an online platform, with 301 individuals meeting the inclusion criteria (English proficiency, aged 18-70, and holding a doctoral degree) and passing the attention checks. About one third of the participants indicated that their academic background aligned most closely with disciplines in Life Sciences and Biomedicine. Participants were presented with 21 brief statements describing the significance and strength of support of hypothetical scientific studies, using either the eLife or the alternative proposed vocabulary. They were then asked to rate the significance and strength of support using 0-100% sliders.
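To make the comparison at the heart of the study concrete, here is a minimal sketch (not the authors’ analysis code, which is publicly available with their materials) of how hypothetical slider ratings could be turned into an implied ranking and checked against the intended order. The phrase list follows eLife’s published significance scale; the ratings, and the per-participant framing, are made up for illustration.

```python
# Minimal illustrative sketch (not the authors' pipeline): derive an implied
# ranking of the eLife significance phrases from hypothetical 0-100% slider
# ratings and compare it to the intended ordinal ranking.

# Intended order of the eLife significance phrases, from highest to lowest.
intended_order = ["landmark", "fundamental", "important", "valuable", "useful"]

# Hypothetical ratings for one respondent (percent); all values are made up.
ratings = {
    "landmark": 92,
    "important": 83,   # rated above "fundamental", so the implied order deviates
    "fundamental": 75,
    "valuable": 60,
    "useful": 55,
}

# Implied ranking: phrases sorted by rating, highest first.
implied_order = sorted(ratings, key=ratings.get, reverse=True)

# Strict agreement: does the implied order exactly reproduce the intended one?
exact_match = implied_order == intended_order

# A simple per-phrase check: how many phrases land at their intended position?
n_correct = sum(i == j for i, j in zip(implied_order, intended_order))

print(f"Implied order: {implied_order}")
print(f"Exact match with intended order: {exact_match}")
print(f"Phrases at intended position: {n_correct}/{len(intended_order)}")
```

Under this (assumed) framing, the agreement percentages reported below correspond to the share of participants whose implied order fully matched the intended one.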

Key Findings

  • Only 20% of participants ranked the eLife significance statements in the intended order, while over 60% matched the intended rankings when using the alternative vocabulary.
  • For the strength of support statements, agreement between the implied and intended rankings was 15% (eLife) and 67% (alternative vocabulary), respectively.
  • eLife phrases in the middle of the scale (e.g., “fundamental,” “important,” and “valuable” on the significance dimension) were most frequently misranked by participants (see Figure 1).
  • For the alternative scales and the eLife strength of support dimension, participants often misranked the lower and upper ends of the scale by one rank (e.g., very strong vs. strong), which the authors attribute to the difficulty of judging phrases in isolation without knowledge of the underlying scale.

Fig. 1 Comparison between implied and intended rankings of significance and strength of support statements drawn from the eLife and alternative vocabulary. Figure taken from Hardwicke et al. (2024), bioRxiv, published under the CC-BY 4.0 International licence.

Conclusion and Perspective

The way scientific knowledge is distributed and discussed is currently undergoing many changes. With the advent of preprint servers, open-access models, transparent review processes, mandates for data and code sharing, Creative Commons licensing, ORCID recognition, and initiatives like the TARA project under DORA, the aim is to enhance accessibility, inclusivity, transparency, reproducibility, and fairness in scientific publishing. One notable transformation is the introduction of summaries at the top of research articles (e.g. AI-generated summaries on bioRxiv, author summaries for PLOS articles, and eLife’s assessment summary). Personally, I am quite critical of relying on manuscript summaries and summary evaluations to assess the importance and quality of a research article and to decide whether it is worth my time. Instead, I usually skim the abstract, read the section titles, and glance at the key figures to make a decision.

In this featured preprint, Hardwicke and colleagues evaluate whether the standardised vocabulary used for the eLife assessment statements is clearly and consistently perceived by potential readers. Moreover, they propose an alternative vocabulary which they found to convey the assessment of the reviewers and editors more accurately. Their discussion section features potential approaches to improve the interpretation of these summary statements and outlines concepts and proposals from other researchers. Overall, the preprint stood out to me as an inspiring example of transparent and reproducible open science. Tom Hardwicke and colleagues a) pre-registered a peer-reviewed report with the research question, study design, and analysis plan (link); b) made all their data, materials, and analysis scripts publicly available (link); and c) created a containerised computational environment (link) for easy reproducibility.

Tags: open science, peer review, publishing, research assessment

doi: Pending

Read preprint

Author's response

The author team shared

Q1: Would you expect to see differences between native, bilingual and non-native speakers, or early career and senior academics when it comes to classifying these statements? 

Our data don’t help to answer those questions, but we speculate that having less familiarity with English would probably exacerbate the mismatch between the intended meaning of the phrases and people’s intuitions. We’re not sure if there would be a difference between academics at different career stages.

Q2: Did you consider excluding the 14 participants who reported conflicting information regarding their highest completed education level, and what was the reasoning behind including them in the analysis despite this discrepancy? Do you observe any statistically significant differences between these individuals and those who stated that they hold a doctoral degree?

The online platform that we recruited participants from (Prolific) asks users a series of prescreening questions, one of which is about their highest completed education level. We recruited only participants who had responded to this question with “doctoral degree”. When participants did our survey, we asked them the question again to confirm, and 14 responded with “graduate degree” rather than “doctoral degree”. We don’t know why; presumably either their prescreening response or their response to our question was a mistake. But we’re not concerned about this issue — participants with a graduate degree could still be potential readers of eLife papers, and it’s only 14 of 301 participants — if there is some relevant difference between these participants and the others, it would only have a trivial impact on the results.

Q3: Considering that eLife’s ordinal significance scale includes solely positive phrases while your vocabulary encompasses both positive and negative ones, is it even possible to fairly assess whether participants’ implied rankings match the intended rankings and to quantify their deviations?

We included negative phrases because that seems to be a limitation of the eLife vocabulary on the significance dimension (note that the strength of support dimension does include negative phrases). It’s not clear to us why including negative phrases would make comparison of the vocabularies’ rankings “unfair”.

Q4: I was surprised by the high number of participants misranking the bottom two ranks for the alternative vocabulary. While I understand that it is really hard to assign a score without knowing the underlying scale, I would have expected participants to “update” their understanding of the scale with every new observation and adjust their assessment (e.g. if I have seen “very weak” before, I would give a higher value to “weak” and vice versa). Is there an explanation of why this is seemingly not the case?

An individual participant saw each phrase only one time, so there was little opportunity to learn and adjust. We expect this approximates how most eLife readers would actually encounter the vocabulary, i.e., they would read an eLife paper and see one phrase from the vocabulary. Of course, if someone were reading a lot of eLife papers and being repeatedly exposed to different phrases from the vocabulary, it’s possible they might start to learn and adjust their interpretations. But it seems inefficient and error-prone to rely on that potential learning process, rather than, e.g., making the scale explicit for everybody.

Q5: Developing “practical and robust approaches to research assessment globally and across all scholarly disciplines” is a key mission of the San Francisco Declaration on Research Assessment (DORA) and its associated organisation. Do you think a unified language/framework would help to assess the quality of research articles and should such summary assessments by reviewers and editors be taken into account when assessing research?

We’re not sure. Our primary motivation for this study was that our intuitions about the eLife vocabulary did not seem to map onto how eLife intended the vocabulary to be perceived, and we were concerned that other people might have the same intuitions and therefore be misinformed by the eLife summaries. The broader question is whether summaries of the reviewers’ and editors’ opinions are actually useful — we don’t know the answer to that, but it may be worth exploring. A unified language/framework would probably be difficult to achieve because research quality is a multidimensional construct and highly context-dependent.

