Decoding the Molecular Language of Proteins with Evolla

Xibin Zhou, Chenchen Han, Yingqi Zhang, Jin Su, Kai Zhuang, Shiyu Jiang, Zichen Yuan, Wei Zheng, Fengyuan Dai, Yuyang Zhou, Yuyang Tao, Dan Wu, Fajie Yuan

Posted on: 14 March 2025

Preprint posted on 9 January 2025

Say hi to the new ChatGPT... for proteins!

Selected by Jawdat Sandakly

Categories: bioinformatics, genomics

Schematic overview of the Evolla model (adapted from the preprint). Image made available under a CC-BY 4.0 International license.

Background

Unravelling a cellular mechanism, dissecting a signalling pathway, or developing a new drug (and the list goes on) all rely on understanding the function of a key player: the protein. Deciphering protein function is challenging, and a lot of effort has been dedicated to this task over the years. A recent breakthrough came with AlphaFold, which uses artificial intelligence to predict the 3D structure of a protein from its sequence and earned its developers a share of the 2024 Nobel Prize in Chemistry.

But where do we stand now? Several high-performance computational methods based on sequence similarity and deep learning have advanced function prediction. However, they struggle to make full use of the vast amount of unannotated protein data, as annotation still lags behind sequencing.

The introduction of protein language models (PLMs) has revolutionized structural biology by automatically extracting features from massive datasets and fine-tuning them for downstream tasks like protein function prediction, sequence generation, and structure modelling. Yet they don’t fully capture the complexity and diversity of protein biology, since most datasets are limited to just 500,000 to 3 million protein-text pairs with fewer than 100 million word tokens.

In this preprint, the authors introduce Evolla, a large-scale model designed to answer functional questions about proteins. What distinguishes Evolla is the remarkable size of its training dataset, which represents a significant advance in understanding protein biology.

How does Evolla work?

The demo webserver of the model is available at www.chat-protein.com

When using Evolla, you will:

Be asked to input a protein sequence, structure or UniProt ID.

The UniProt ID of the human BRCA1 protein was used as an input example.

Be able to ask specific questions regarding the submitted protein’s properties or functions.

Be presented with relevant answers generated by Evolla.
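
Before pasting input into the webserver, it can help to check what kind of input you have. The sketch below is a small, hypothetical pre-check and is not part of Evolla: it distinguishes a UniProt accession from a raw amino-acid sequence, using the accession pattern given in the UniProt documentation. The function name `classify_input` and the returned labels are assumptions made for illustration, and how the webserver itself parses input is not described in this highlight.

```python
import re

# Accession pattern as documented by UniProt (6- or 10-character accessions).
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def classify_input(text: str) -> str:
    """Return 'uniprot_id', 'sequence', or 'unknown' for a user-supplied string."""
    cleaned = "".join(text.split()).upper()  # drop whitespace and line breaks
    if UNIPROT_ACCESSION.match(cleaned):
        return "uniprot_id"
    if cleaned and set(cleaned) <= AMINO_ACIDS:
        return "sequence"
    return "unknown"
```

For example, `classify_input("P38398")` (the UniProt accession of human BRCA1) is recognised as an accession, while a string made only of one-letter amino-acid codes is treated as a sequence.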

Key features

An encoder that extracts high-quality representations from diverse protein data.
A decoder that transforms these representations into accurate and contextually relevant responses.
Intermediate compression and alignment modules that enhance its ability to provide biologically meaningful insights for protein functional question-answering.

Implementation of direct preference optimization (DPO) training

To overcome the limitations of traditional evaluation metrics and to further fine-tune Evolla’s responses, the authors applied Direct Preference Optimization (DPO), a preference learning technique designed to improve training stability and the reliability of Evolla’s outputs. Training with DPO significantly improved the GPT score.
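
To make the idea concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, following the general published formulation of the technique. It is not taken from the Evolla code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions for illustration, and the highlight does not state which hyperparameters the authors used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, rejected) answer pair.

    Each argument is the summed log-probability of a full answer under the
    model being trained (policy) or the frozen reference model.
    """
    # How much more the policy favours each answer than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the trained model assigns relatively more probability to the preferred answer than the reference model does, which is how preference data steers the model without a separate reward model.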

Enhancement of response quality with Retrieval-Augmented Generation (RAG)

Despite their remarkable capabilities across a wide range of tasks, large language models (LLMs) are still prone to generating misleading responses. To overcome this, the authors applied two Retrieval-Augmented Generation (RAG) strategies to Evolla: direct query selection (DQS) and question-answer guided selection (QGS). These strategies enhance performance by integrating external knowledge retrieval mechanisms, sourcing the most relevant information from trusted databases. Compared with the base model, Evolla enhanced with DQS and QGS achieved a significantly higher GPT score, which highlights the efficacy of both strategies in producing more accurate responses.
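
The general retrieve-then-generate pattern behind RAG can be sketched as follows. This is a toy illustration, not the authors' DQS or QGS procedure, whose details are not covered in this highlight: the word-overlap scoring, the function names `tokenize`, `retrieve` and `build_prompt`, and the two-entry knowledge base are all assumptions made for illustration. A real system would retrieve from curated databases and pass the resulting prompt to the language model.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def retrieve(question, knowledge_base, top_k=1):
    """Rank knowledge-base entries by word overlap with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(
        knowledge_base,
        key=lambda entry: len(q_tokens & tokenize(entry)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, retrieved):
    """Prepend retrieved context so the model can ground its answer."""
    context = "\n".join(f"- {entry}" for entry in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy knowledge base (placeholder text, not taken from any real database).
KNOWLEDGE_BASE = [
    "Protein A is a kinase that phosphorylates protein B.",
    "Protein C transports glucose across the membrane.",
]

question = "What does protein A phosphorylate?"
prompt = build_prompt(question, retrieve(question, KNOWLEDGE_BASE))
```

The point of the pattern is that the model answers with the retrieved entry in front of it, rather than relying only on what it memorised during training.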

Improvement in performance with increased model size and training data

The authors investigated the scaling effects associated with increasing both model size and training data volume. They evaluated training on four progressively larger datasets and found that the mean GPT score improved with data volume, reaching its highest value on the largest dataset.

Evolla vs general-purpose models

To assess Evolla’s overall capability, the authors compared its performance against two state-of-the-art general-purpose language models, DeepSeek-V3 and gpt-4o-2024-11-20, using carefully designed prompts. Evolla demonstrated nearly double the effectiveness in protein function generation.

(Left) Example of the prompts generated (adapted from Supplementary Table S15). (Right) Performance comparison of Evolla and advanced general-purpose LLMs (adapted from Fig. 3G). Image made available under a CC-BY 4.0 International license.

Effectiveness in protein annotation and prediction tasks

The authors designed a novel framework, the “Instructional Response Space” (IRS), to evaluate Evolla’s understanding of specific protein properties through its generated responses. They showed that, given task-specific instructions, Evolla generated responses that effectively captured the catalytic characteristics of proteins, and that its outputs were meaningful and biologically relevant.

What I like about the preprint

I think that Evolla holds great promise for advancing protein biology. I very much appreciate the easily accessible demo version, which allows researchers who are not Python experts, like myself, to apply it to the biological questions we’re interested in. I also liked that the authors used different datasets and evaluation metrics and presented the training data in detail. This helps the reader understand how Evolla works and boosts confidence when implementing AI tools alongside benchwork.

Questions for the authors

Over a third of our genes are categorized as Tdark, consisting of protein-coding genes with limited or unknown function in the literature. How effective is Evolla in ‘decoding’ these proteins?
What ensures that Evolla’s responses don’t become generic or overly broad? How can users refine their queries to avoid this?
It’s really interesting to read how Evolla outperforms both DeepSeek-V3 and gpt-4o-2024-11-20 when it comes to protein biology. Will Evolla’s performance continue to be tested with the release of newer versions of ChatGPT and DeepSeek?

References:

Chen, J., Wang, J., Hu, Y., Li, X., Qian, Y., & Song, C. (2025). Evaluating the advancements in protein language models for encoding strategies in protein function prediction: a comprehensive review. Frontiers in Bioengineering and Biotechnology, 13. https://doi.org/10.3389/fbioe.2025.1506508
Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A. W. R., Bridgland, A., Penedones, H., Petersen, S., Simonyan, K., Crossan, S., Kohli, P., Jones, D. T., Silver, D., Kavukcuoglu, K., & Hassabis, D. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577(7792), 706–710. https://doi.org/10.1038/s41586-019-1923-7

Tags: artificial intelligence, proteomics

Author's response

Prof. Fajie Yuan shared

Over a third of our genes are categorized as Tdark, consisting of protein-coding genes with limited or unknown function in the literature. How effective is Evolla in ‘decoding’ these proteins?

Evolla has some difficulties in understanding these proteins with very low sequence identity (e.g., less than 20-30). In our paper, we showed that for some hard proteins, Evolla’s prediction accuracy is low. I believe this is true for all models or methods. We are still working on some more advanced optimizations for Evolla.
