Decoding the Molecular Language of Proteins with Evolla

Xibin Zhou, Chenchen Han, Yingqi Zhang, Jin Su, Kai Zhuang, Shiyu Jiang, Zichen Yuan, Wei Zheng, Fengyuan Dai, Yuyang Zhou, Yuyang Tao, Dan Wu, Fajie Yuan

Posted on: 14 March 2025

Preprint posted on 9 January 2025

Say hi to the new ChatGPT... for proteins!

Selected by Jawdat Sandakly

Categories: bioinformatics, genomics

Schematic overview of the Evolla model (adapted from the preprint). Image made available under a CC-BY 4.0 International license.

 

Background

Unravelling a cellular mechanism, dissecting a signalling pathway, or developing a new drug (and the list goes on) all rely on understanding the function of a key player: the protein. Deciphering protein function is challenging, and a lot of effort has been dedicated to this task over the years. A recent breakthrough came with AlphaFold, which uses artificial intelligence to predict the 3D structure of a protein from its sequence and whose developers shared the 2024 Nobel Prize in Chemistry.

But where do we stand now? Several high-performance computational methods based on sequence similarity and deep learning have advanced function prediction. However, they struggle to make full use of the vast amount of unannotated protein data, as there is still a gap between sequencing and annotation.

The introduction of protein language models (PLMs) has revolutionized structural biology: these models automatically extract features from massive datasets and can be fine-tuned for downstream tasks like protein function prediction, sequence generation, and structure modelling. Yet they don’t fully capture the complexity and diversity of protein biology, since most datasets are limited to just 500,000 to 3 million protein-text pairs with fewer than 100 million word tokens.

In this preprint, the authors introduce Evolla, a large-scale model designed to answer questions about protein function. What distinguishes Evolla is the remarkable size of its training dataset, which represents a significant advancement in understanding protein biology.

 

How does Evolla work?

The demo webserver of the model is available at www.chat-protein.com

When using Evolla, you will:

  • Be asked to input a protein sequence, structure or UniProt ID.
The UniProt ID of the human BRCA1 protein was used as an input example.

 

  • Be able to ask specific questions regarding the inserted protein’s properties or functions.

  • Be presented with relevant answers generated by Evolla.
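For readers who would rather paste a raw sequence than an ID, the input step above can be prepared with a few lines of Python. This is only a sketch and not part of the preprint or the Evolla webserver: it assumes the public UniProt REST endpoint for FASTA downloads, and the accession P38398 (human BRCA1) is used purely as an example. The function names are made up for illustration.

```python
from urllib.request import urlopen

# Assumption: the public UniProt REST endpoint that serves one entry as FASTA.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_url(accession: str) -> str:
    """Build the download URL for one accession (e.g. P38398 for human BRCA1)."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def parse_fasta(text: str) -> tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    # Drop the leading '>' from the header and join the wrapped sequence lines.
    return lines[0][1:], "".join(lines[1:])


def fetch_sequence(accession: str) -> str:
    """Download the amino-acid sequence for an accession (needs network access)."""
    with urlopen(fasta_url(accession), timeout=30) as response:
        _, sequence = parse_fasta(response.read().decode("utf-8"))
    return sequence
```

Calling `fetch_sequence("P38398")` should return the sequence as one string, ready to be pasted into the input box.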

 

Key features

  • An encoder that extracts high quality representations from diverse protein data.
  • A decoder that transforms these representations into accurate and contextually relevant responses.
  • Intermediate compression and alignment modules that enhance its ability to provide biologically meaningful insights for protein functional question-answering.

 

Implementation of direct preference optimization (DPO) training

To overcome the limitations of traditional evaluation metrics and to further fine-tune Evolla’s responses, the authors explored Direct Preference Optimization (DPO), a preference learning technique designed to improve training stability and the reliability of Evolla’s outputs. Training with DPO significantly improved the GPT score.
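To make the idea behind DPO more concrete, here is a minimal sketch of the standard DPO loss for a single pair of answers, one preferred and one rejected. It follows the general formulation of DPO, not the authors' training code; the function name, argument names and the default `beta` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) answer pair.

    Each *_logp is the summed log-probability of a full answer, either under
    the model being trained (policy) or under a frozen reference model.
    beta (an illustrative default) controls how far the policy may drift
    from the reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), computed in a numerically stable way
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss shrinks when the trained model raises the probability of the preferred answer relative to the reference model more than it raises that of the rejected answer, which is how the preference signal enters training.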

 

Enhancement of response quality with Retrieval-Augmented Generation (RAG)

Despite their remarkable capabilities across a wide range of tasks, large language models (LLMs) are still prone to generating misleading responses. To overcome this, the authors applied two Retrieval-Augmented Generation (RAG) strategies to Evolla: direct query selection (DQS) and question-answer guided selection (QGS). These strategies enhance performance by integrating external knowledge retrieval mechanisms, sourcing the most relevant information from trusted databases. Compared with the base model, Evolla enhanced with DQS or QGS achieved a significantly higher GPT score, which highlights the efficacy of both strategies in producing more accurate responses.
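The general principle of RAG can be sketched in a few lines. The example below is a generic toy, not the authors' DQS or QGS strategy: it ranks passages by bag-of-words cosine similarity (a stand-in for a real embedding model) and prepends the best matches to the question. The function names and the prompt layout are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lower-case bag of words; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: cosine(q, tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str], k: int = 2) -> str:
    """Prepend the retrieved context to the question before it reaches the model."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

In a real system the passages would come from a curated database and the similarity from learned embeddings, but the flow is the same: retrieve first, then let the model answer with the retrieved text in view.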

 

Improvement in performance with increased model size and training data

The authors investigated the scaling effects of increasing both model size and training data volume. They trained on four progressively larger datasets and found that the mean GPT score improved with data volume, with the highest score achieved on the largest dataset.

 

Evolla vs general-purpose models

To assess Evolla’s overall capability, the authors compared its performance against two state-of-the-art general-purpose language models, DeepSeek-V3 and gpt-4o-2024-11-20, using carefully designed prompts. Evolla demonstrated nearly double the effectiveness in protein function generation.

(Left) Example of the prompts generated (adapted from Supplementary Table S15). (Right) Performance comparison of Evolla and advanced general-purpose LLMs (adapted from Fig. 3G). Image made available under a CC-BY 4.0 International license.

 

Effectiveness in protein annotation and prediction tasks

The authors designed a novel framework, the “Instructional response space” (IRS), to evaluate Evolla’s understanding of specific protein properties through its generated responses. They showed that Evolla generated responses that effectively captured the catalytic characteristics of proteins based on task-specific instructions, and that its outputs were meaningful and biologically relevant.

 

What I like about the preprint

I think that Evolla holds great promise for advancing protein biology. I very much appreciate the easily accessible demo version, which allows researchers who are not Python experts, like myself, to apply it to the biological questions we’re interested in. I also liked that the authors used different datasets and evaluation metrics and presented the training data in detail. This helps the reader understand how Evolla works and boosts confidence when implementing AI tools alongside benchwork.

 

Questions for the authors

  • Over a third of our genes are categorized as Tdark, consisting of protein-coding genes with limited or unknown function in the literature. How effective is Evolla in ‘decoding’ these proteins?
  • What ensures that Evolla’s responses don’t become generic or overly broad? How can users refine their queries to avoid this?
  • It’s really interesting to read how Evolla outperforms both DeepSeek-V3 and gpt-4o-2024-11-20 when it comes to protein biology. Will Evolla’s performance continue to be tested with the release of newer versions of ChatGPT and DeepSeek?

 

References:

  • Chen, J., Wang, J., Hu, Y., Li, X., Qian, Y., & Song, C. (2025). Evaluating the advancements in protein language models for encoding strategies in protein function prediction: a comprehensive review. Frontiers in Bioengineering and Biotechnology, 13. https://doi.org/10.3389/fbioe.2025.1506508
  • Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A. W. R., Bridgland, A., Penedones, H., Petersen, S., Simonyan, K., Crossan, S., Kohli, P., Jones, D. T., Silver, D., Kavukcuoglu, K., & Hassabis, D. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577(7792), 706–710. https://doi.org/10.1038/s41586-019-1923-7

Tags: artificial intelligence, proteomics

Author's response

Prof. Fajie Yuan shared

Over a third of our genes are categorized as Tdark, consisting of protein-coding genes with limited or unknown function in the literature. How effective is Evolla in ‘decoding’ these proteins?

Evolla has some difficulties in understanding these proteins with very low sequence identity (e.g., less than 20-30). In our paper, we showed that for some hard proteins, Evolla’s prediction accuracy is low. I believe this is true for all models or methods. We are still working on some more advanced optimizations for Evolla.
