The Language of Proteins

Researchers from the Technion and Tel Aviv University have developed a pioneering AI system that translates protein sequences into natural language

In a paper published in PNAS, researchers from the Technion and Tel Aviv University present BetaDescribe, an AI system that translates protein sequences into natural-language descriptions, opening a new path toward understanding protein functions and accelerating drug development and material design.

Protein analysis is essential in medicine and biotechnology, as demonstrated by breakthroughs such as Ozempic, a drug used to treat obesity, diabetes, and other conditions, whose development was inspired by a peptide found in the saliva of a rare desert lizard. However, experimental protein characterization remains a lengthy and expensive process, and even large language models (LLMs) have had limited success in performing this task.

Image: Diagram illustrating the system’s operation (generated using ChatGPT)

This challenge inspired the development of BetaDescribe, an AI system that converts protein sequences into detailed textual descriptions of their functions and other characteristics. In doing so, the system helps bridge the vast gap between the hundreds of thousands of proteins characterized in the lab and the billions or even trillions that actually exist in nature.

Unlike traditional approaches that rely on similarity to known protein sequences, BetaDescribe combines a generative model with verification mechanisms and evaluation processes. This approach enables the system to infer the functions of proteins even when they are not closely related to previously characterized proteins. The technology also provides detailed information on a protein’s functional properties, catalytic activity, involvement in metabolic processes, and potential binding sites relevant to medical and other applications. The researchers demonstrated the effectiveness of the new system by successfully describing six previously uncharacterized proteins.
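For readers who think in code, the general idea of pairing a generator with verification steps can be sketched roughly as follows. This is a toy illustration only, not the BetaDescribe implementation: the names `Candidate`, `describe_protein`, `toy_generate`, and both validators are hypothetical, the scores are made up, and a real system would use trained models in place of these stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    description: str  # free-text description proposed for the protein
    score: float      # generator's confidence (made-up values in this toy)


def describe_protein(
    sequence: str,
    generate: Callable[[str], List[Candidate]],
    validators: List[Callable[[str, str], bool]],
    top_k: int = 3,
) -> List[Candidate]:
    """Generate candidates, keep those every validator accepts, rank by score."""
    candidates = generate(sequence)
    accepted = [
        c for c in candidates
        if all(check(sequence, c.description) for check in validators)
    ]
    accepted.sort(key=lambda c: c.score, reverse=True)
    return accepted[:top_k]


# Hypothetical stand-in for a generative model: returns fixed candidates.
def toy_generate(sequence: str) -> List[Candidate]:
    return [
        Candidate("Membrane transport protein.", 0.62),
        Candidate("ATP-binding kinase involved in signaling.", 0.81),
        Candidate("", 0.90),  # degenerate output that verification should reject
    ]


# Hypothetical validator: reject empty descriptions.
def non_empty(sequence: str, description: str) -> bool:
    return bool(description.strip())


# Hypothetical validator: require at least one function-related keyword.
def mentions_function(sequence: str, description: str) -> bool:
    keywords = ("kinase", "transport", "binding")
    return any(k in description.lower() for k in keywords)


results = describe_protein(
    "MKTAYIAKQR", toy_generate, [non_empty, mentions_function], top_k=2
)
for candidate in results:
    print(f"{candidate.score:.2f}  {candidate.description}")
```

In this sketch the empty candidate is discarded despite having the highest score, which is the point of adding verification on top of generation: a confident but unusable output never reaches the user.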

Image: Example proteins (created using ChatGPT)

This technology is expected to accelerate medical research, drug discovery, and developments in biotechnology and agriculture. The ability to rapidly generate evidence-based hypotheses regarding the functions of unknown proteins could significantly shorten the path from basic discovery to medical and industrial applications, making biological analysis more focused and efficient.

The paper was led by doctoral student Edo Dotan under the joint supervision of Prof. Yonatan Belinkov from the Technion’s Henry and Marilyn Taub Faculty of Computer Science and Prof. Tal Pupko from the School of Life Sciences at Tel Aviv University. Additional co-authors include Prof. Eran Bacharach, Prof. Marcelo Ehrlich, and doctoral student Iris Lyubman from the School of Life Sciences at Tel Aviv University.

The research was supported by the Israel Science Foundation (ISF).
