The unsung heroes of computational biology

Combining AI with established techniques for better enzyme engineering

By Malin Lüking and Camilla Gustafsson

Artificial Intelligence has been playing an increasingly important role in almost every field over the last few years, and biology is no exception. The 2024 Nobel Prize in Chemistry was awarded to the team behind the AI-based protein structure prediction tool AlphaFold, the first computational tool to reliably predict protein structures.

David Baker of the University of Washington shared the same prize for computational protein design, work that builds on Rosetta, the protein design package he developed. Rosetta, first developed in the 90s, before the AI boom, relies mostly on the unsung heroes of computational biology: physics- and knowledge-based models.

Defining the differences between methods

To understand the importance of physics- and knowledge-based models in protein structure prediction and engineering, one must look closely at the main differences between classical methods and machine learning/AI methods, as well as the data types that inform them. Classical methods predict how matter behaves based on physical laws. Knowledge-based methods do the same, but their predictive laws are derived from statistics on experimental measurements. AI and machine learning can derive much more complicated laws by learning from measured data with deep neural networks. These models require large amounts of high-quality data to work well, and such data is not easy to generate or obtain.
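As an illustration of how a knowledge-based law can be derived from statistics, the sketch below applies the inverse-Boltzmann relation, E = -kT ln(p_observed / p_reference), to contact counts. The counts, the three contact classes and the function name are hypothetical and chosen only for illustration; real scoring functions use far finer residue and distance classes.

```python
import math

def statistical_potential(observed, reference, kT=1.0, pseudocount=1.0):
    """Inverse-Boltzmann score per contact type: E = -kT * ln(p_obs / p_ref).

    Contact types seen more often than expected get a negative
    (favourable) energy; rarer ones get a positive energy.
    """
    types = set(observed) | set(reference)
    # Pseudocounts keep the logarithm finite for contact types never observed.
    obs_total = sum(observed.get(t, 0) + pseudocount for t in types)
    ref_total = sum(reference.get(t, 0) + pseudocount for t in types)
    potential = {}
    for t in types:
        p_obs = (observed.get(t, 0) + pseudocount) / obs_total
        p_ref = (reference.get(t, 0) + pseudocount) / ref_total
        potential[t] = -kT * math.log(p_obs / p_ref)
    return potential

# Hypothetical toy counts, NOT taken from the PDB.
observed = {"hydrophobic-hydrophobic": 600, "hydrophobic-polar": 250, "polar-polar": 150}
reference = {"hydrophobic-hydrophobic": 333, "hydrophobic-polar": 334, "polar-polar": 333}

potential = statistical_potential(observed, reference)
for pair, energy in sorted(potential.items(), key=lambda kv: kv[1]):
    print(f"{pair}: {energy:+.2f}")
```

With these toy numbers, the over-represented hydrophobic contacts come out as favourable and the under-represented polar contacts as unfavourable, which is the kind of rule a knowledge-based scoring function encodes.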

Both knowledge-based and AI models for structure prediction and protein engineering get their information from experimentally determined structures. These are laborious to obtain and therefore limited. So far, there are about 200,000 protein structures available in the Protein Data Bank (PDB). This data is somewhat biased toward structures that are easier to solve or have been studied more extensively because of scientific or medical interest. This is why structural biology and protein engineering surged ahead when scientists turned to a different, far more abundant biological data type: sequence data. Databases contain about 3 billion DNA sequences and 200 million protein sequences.

AlphaFold and similarly successful structure prediction models rely not only on AI training on the PDB, but also on sequence alignments, physics-based models, and statistical terms.

The long history of computational protein engineering, beginning in the 1970s and 80s, shows that the classical models alone can go a long way and have, from the start, aimed at answering questions beyond structure prediction. The 2013 Nobel Prize in Chemistry went to Martin Karplus, Michael Levitt, and Arieh Warshel for "the development of multiscale models for complex chemical systems." The multiscale models referred to here are built from physics- or knowledge-based terms that are often called force fields or scoring functions.

What happens when proteins fold

Enzymes were among the “complex chemical systems” these scientists studied most. The basis of enzyme engineering lies in the application of classical methods that have greatly improved our understanding not only of protein structure, but also of protein dynamics: the small breathing motions and shape shifts that proteins undergo to carry out their specific functions. Classical methods of structural biology allowed scientists to approach the origins of complex processes such as enzymes' catalytic power, molecular recognition or protein folding.

Protein folding happens when a 1-dimensional protein sequence assembles into its functional, 3D shape. This can be illustrated by a marble rolling down a bumpy slope. The marble at the top represents the extended, 1D protein sequence. Because interactions between protein residues are favoured over interactions with water, the protein quickly folds into a more globular, but rather wobbly and undefined protein structure. We can imagine that with this initial “collapse,” the marble starts rolling down the slope under the effect of gravity, a.k.a. the force field. The marble is slowed down on the way by bumps and wells in the slope. The resulting zig-zag line down the slope represents the folding pathway.

This analogy of the folding pathway resembling a twisted meander from the top to the bottom of a bumpy slope is important when it comes to understanding one of the problems we are currently facing in enzyme engineering.

Combining techniques to better understand enzyme functionality

At enginzyme, we combine the power of sequence data, structural data, and physics in our engineering approach. We regularly improve the stability, activity and precision of enzymes, but some of our variants show reduced solubility. When this happens, enzymes either cannot be extracted from cells to be used as catalysts, or they do not carry out their function properly.

We have started developing a method to address this problem. Our hypothesis is that by passing the native, folded states of the proteins to the scoring function, we neglect an important part of the protein lifecycle: protein folding. This is a common problem faced by protein engineers: the native state is tangible, but the folding pathways remain obscure.

This is especially a problem for AI-based structure prediction methods, which currently train on the native structures in the PDB. Integrating sequence data into protein engineering pipelines is therefore crucial. During evolution, sequences that are not viable will be selected against. By remaining in the sequence space of present-day proteins, we avoid non-viable sequences to a large extent. But biology is complex, and not all information inferred from sequence alignments will save us from introducing mutations that are detrimental to protein folding. Additionally, we would sometimes like to explore activities outside of the known sequence space.
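One simple way to picture "remaining in the sequence space of present-day proteins" is to check how often a proposed residue already occurs at the same position in an alignment of related sequences. The sketch below does this for a tiny, made-up alignment. The sequences and function names are hypothetical, and a real pipeline would use many more sequences plus sequence weighting.

```python
from collections import Counter

def column_frequencies(alignment):
    """Per-column amino acid frequencies of an alignment, ignoring gaps ('-')."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        freqs.append({aa: n / total for aa, n in counts.items()} if total else {})
    return freqs

def mutation_frequency(freqs, position, new_residue):
    """Frequency of new_residue at a 1-based alignment position (0.0 if never seen)."""
    return freqs[position - 1].get(new_residue, 0.0)

# Hypothetical toy alignment of four related sequences.
alignment = ["MKLVA", "MKIVA", "MRLVG", "MKLIA"]
freqs = column_frequencies(alignment)

seen = mutation_frequency(freqs, 3, "I")    # isoleucine occurs at position 3
unseen = mutation_frequency(freqs, 3, "P")  # proline never occurs there
print(f"L3I frequency: {seen:.2f}, L3P frequency: {unseen:.2f}")
```

A mutation with frequency zero is not necessarily harmful, and a frequent one is not guaranteed to be safe. As noted above, alignments reduce the risk without removing it.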

So, how can we understand if a given mutation is detrimental to protein folding? Go back to the marble on the bumpy slope, and imagine that the marble can temporarily get stuck in one of the wells. Such a case would describe a folding intermediate, a rather stable (metastable) protein structure. In contrast to the metastable state, the native state can be seen as one particular point at the bottom of the slope. If the marble arrives anywhere else, or gets permanently stuck in one of the intermediate wells, we would have a misfolded or aggregated protein.
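The marble picture can be sketched in a few lines of code. The example below defines a one-dimensional "bumpy slope", lets a marble roll strictly downhill from the top, and lists the wells along the way. The energy function and all its parameters are invented for illustration and do not describe any real protein.

```python
import math

def energy(x, slope=0.5, bump=1.0, freq=3.0):
    """Toy landscape: an overall downhill slope with periodic bumps and wells."""
    return -slope * x + bump * math.cos(freq * x)

def gradient(x, slope=0.5, bump=1.0, freq=3.0):
    """Analytic derivative of energy() with respect to x."""
    return -slope - bump * freq * math.sin(freq * x)

def find_wells(xs, energies):
    """Indices of grid points that lie lower than both of their neighbours."""
    return [i for i in range(1, len(xs) - 1)
            if energies[i] < energies[i - 1] and energies[i] < energies[i + 1]]

def roll_downhill(x0, step=0.01, n_steps=10_000):
    """Move the marble against the gradient until it settles (no thermal noise)."""
    x = x0
    for _ in range(n_steps):
        x -= step * gradient(x)
    return x

# x = 0 is the top of the slope (unfolded), x = 10 the bottom.
xs = [i * 0.01 for i in range(1001)]
energies = [energy(x) for x in xs]
wells = find_wells(xs, energies)

# The deepest well plays the role of the native state.
native = xs[min(wells, key=lambda i: energies[i])]
stuck = roll_downhill(0.0)

print("wells at x =", [round(xs[i], 2) for i in wells])
print(f"marble settles at x = {stuck:.2f}; native state at x = {native:.2f}")
```

Without any thermal motion, the marble stops in the first well it meets rather than in the deepest one. That is the toy counterpart of a protein trapped in a folding intermediate.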

Rather than only checking to see if the marble is sitting in its final position, which we traditionally do when testing enzyme designs computationally, we should also pay attention to the bumps and wells on the way down the slope.

Enginzyme’s modeling approach

To do exactly this, we have been working on a modelling approach using a coarse-grained protein representation. This method models the interaction between protein residues based on their chemical properties. Under the effect of a virtual “heating” process, the simplified representations of the protein variants unfold. The unfolding patterns that we see in our simulations, equivalent to the bumps and wells in the slope, are a result of the protein sequence. By analysing them, we can identify an increased misfolding risk for a given new protein sequence, which we can then filter out.
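To give a feel for what a "heating" experiment on a simplified protein can reveal, here is a deliberately minimal toy model. It is not the method described above. Each native contact is treated as an independent two-state switch whose strength depends on whether the two residues are hydrophobic, and the fraction of intact contacts is tracked as the temperature rises. The contact list, energy values, entropy cost and sequences are all hypothetical.

```python
import math

HYDROPHOBIC = set("AVILMFWC")

def contact_energy(a, b):
    """Assumed contact strength from residue chemistry (arbitrary units)."""
    n_hydrophobic = (a in HYDROPHOBIC) + (b in HYDROPHOBIC)
    return {2: 2.0, 1: 1.0, 0: 0.5}[n_hydrophobic]

def fraction_formed(eps, temperature, entropy_cost=2.0):
    """Two-state Boltzmann probability that a contact of strength eps is intact.

    Units are chosen so that the Boltzmann constant equals 1; the contact is
    half broken at temperature eps / entropy_cost.
    """
    return 1.0 / (1.0 + math.exp(entropy_cost - eps / temperature))

def unfolding_curve(sequence, contacts, temperatures):
    """Average fraction of native contacts still formed at each temperature."""
    curve = []
    for t in temperatures:
        probs = [fraction_formed(contact_energy(sequence[i], sequence[j]), t)
                 for i, j in contacts]
        curve.append(sum(probs) / len(probs))
    return curve

def melting_temperature(temperatures, curve):
    """First temperature at which fewer than half of the contacts remain."""
    for t, q in zip(temperatures, curve):
        if q < 0.5:
            return t
    return None

# Hypothetical eight-residue "protein" with four native contacts (0-based indices).
contacts = [(0, 4), (2, 5), (3, 7), (1, 6)]
wild_type = "MKVLAIEF"
variant = "MKVLAIEK"  # F8K: a hydrophobic residue replaced by a charged one

temperatures = [0.2 + 0.1 * k for k in range(29)]
wt_curve = unfolding_curve(wild_type, contacts, temperatures)
var_curve = unfolding_curve(variant, contacts, temperatures)

print("wild type melts near T =", melting_temperature(temperatures, wt_curve))
print("variant melts near T =", melting_temperature(temperatures, var_curve))
```

In this toy, weakening a single contact shifts the whole unfolding curve to lower temperature. A simulation-based approach would look at which parts of the structure come apart first, but the idea of comparing unfolding behaviour between variants is the same.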

Overall, our current portfolio relies heavily on physics- and knowledge-based methods. We use structure and sequence data to understand how our proteins evolve, and we use physics-based models to study both their native states and folding pathways.

Applied during our workflows, AI will significantly enhance the power of these more traditional methods, just as was the case for AlphaFold.

Innovation is not just about using the latest high-tech tools or flinging buzzwords around. At enginzyme, we stay on the cutting edge of our field by applying the lessons learned in about 40 years of traditional enzyme modelling, and combining these with new, powerful AI methods.