Q&A with Martin Engqvist, EnginZyme’s Group Leader for Computational Enzyme Design
EnginZyme is on a mission to change chemistry for good by providing reliable, cost-effective and scalable biomanufacturing solutions. Traditional chemical manufacturing demands a lot of energy and creates too much waste. Using biobased starting materials and enzyme catalysts is a big step in the right direction for the industry.
As the head of EnginZyme’s computational enzyme design group, Martin Engqvist leads efforts to figure out how AI, machine learning and other cutting-edge computational techniques can improve the performance of EnginZyme’s biocatalysis solutions.
A paper he co-authored was published recently (May 2023) in Nature Communications. Martin and his co-authors set out to develop a single machine learning model capable of predicting enzyme-substrate relationships across all proteins — a tool to help chemists narrow their search to the enzyme-substrate pairs most likely to work. The model, known as ESP, achieved an accuracy of more than 91% and can be applied across a wide range of different enzymes.
Martin sat for an interview with James Connell, a former journalist with The New York Times, to explain the Enzyme Substrate Prediction model and where it fits into the fast-paced world of AI-driven scientific discovery.
J.C.: Can you describe the enzyme-substrate prediction model and tell me the story behind how it came about?
M.E.: I started the collaboration with my co-authors about two and a half years ago, when I was running my own research group at Chalmers University of Technology. After joining EnginZyme I kept the collaboration running, since the resulting AI model is not only of academic interest but has clear use cases in industry as well.
Here’s a little background. What we have in the enzyme engineering world is essentially a search problem. Enzymes are proteins, and proteins are built from 20 kinds of building blocks. Those blocks are amino acids, and they are put together in a certain order — the amino acid sequence. And the thing is, there are more ways of putting these blocks together than there are atoms in the visible universe.
Now most of the combinations just don’t have any function at all. So the quest in protein engineering is to find the sequences that have interesting properties that we can use for chemistry. To sum it up, there are lots of useless sequences, and some very promising ones, but we don’t know which ones they are.
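To make the scale concrete, here is a back-of-the-envelope calculation in Python. The 300-residue protein length is an illustrative assumption; many enzymes are a few hundred amino acids long.

```python
import math

# Back-of-the-envelope: the size of sequence space for a single protein.
n_amino_acids = 20    # the 20 standard amino acids
protein_length = 300  # illustrative; many enzymes are a few hundred residues

# 20^300 overflows a float, so work in log10 space.
log10_sequences = protein_length * math.log10(n_amino_acids)
print(f"~10^{log10_sequences:.0f} possible sequences")  # ~10^390

# For comparison, the observable universe is commonly estimated
# to contain on the order of 10^80 atoms.
```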
That’s of course where the lab comes in: we test sequences. Fortunately, nature has provided a wealth of sequences through evolution; it has engineered lots of very useful enzymes that we can test. And that’s where machine learning and AI come in: we use these tools to help us predict which sequences will be useful for us.
And that’s what this model does. Enzymes need to be paired with substrates. A substrate in this context is a small molecule (you could call it a chemical) that binds to an enzyme and undergoes a chemical reaction catalyzed by that enzyme. There are different properties you could be interested in, and with this model we ask which enzyme uses which substrate.
What we want to ask is, which enzyme might use this molecule and do something interesting with it? And that’s what the AI model tries to answer.
J.C.: So what was the breakthrough that led to you and your co-authors creating this model and writing the paper?
M.E.: I would say there were two main ones. One is kind of technical, but let’s start with that anyway. Facebook, or Meta, created these big AI models, trained on hundreds of millions of protein sequences. These models are really good, and anyone can use them.
So we took one of these models off the shelf and added a new component that lets us predict these enzyme-substrate pairs. The approach was novel; we were the first to show that this is a good way to use such a model.
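In rough code terms, the shape of that approach might look like the sketch below: embed the enzyme sequence with a pretrained protein language model, featurize the substrate, and train a classifier head on the combined representation. This is a minimal illustration of the idea, not the paper’s exact pipeline; the choice of Meta’s ESM-1b (via the fair-esm package), Morgan fingerprints, and a gradient-boosted classifier are assumptions, and the training pairs shown are toy placeholders.

```python
# Sketch: enzyme-substrate pair prediction on top of a pretrained
# protein language model. Model and feature choices are illustrative.
import numpy as np
import torch
import esm  # pip install fair-esm
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import GradientBoostingClassifier

# 1. Load a pretrained protein language model (ESM-1b).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

def embed_enzyme(sequence: str) -> np.ndarray:
    """Mean-pooled final-layer representation of the enzyme sequence."""
    _, _, tokens = batch_converter([("enzyme", sequence)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33][0, 1 : len(sequence) + 1]  # skip BOS/EOS
    return reps.mean(dim=0).numpy()

def embed_substrate(smiles: str) -> np.ndarray:
    """Morgan (ECFP-like) fingerprint of the small molecule."""
    fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), radius=2, nBits=1024
    )
    arr = np.zeros(1024)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def featurize(sequence: str, smiles: str) -> np.ndarray:
    return np.concatenate([embed_enzyme(sequence), embed_substrate(smiles)])

# 2. Train the new component: a classifier head over labeled pairs.
#    Toy data; a real training set has many thousands of examples.
pairs = [("MKTAYIAKQR", "c1ccccc1"), ("MVLSPADKTN", "CCO")]  # (sequence, SMILES)
labels = [1, 0]  # 1 = known substrate, 0 = presumed non-substrate
X = np.stack([featurize(seq, smi) for seq, smi in pairs])
clf = GradientBoostingClassifier().fit(X, labels)
```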
The other breakthrough was more conceptual.
There are these databases where people store their experimental results — scientists have been working with enzymes for many decades, and the heyday of biochemistry was in the 1960s. They did really beautiful work back then. All these results are stored in big databases, but the problem is that you only store positive results.
So you find that enzyme A uses a particular substrate, benzene for instance, and you store that in the database. But you don’t store all the things you tried that didn’t work; you don’t have negative results. And when you train an AI model, you usually really want those negative examples.
You want the model to learn to discriminate between what works and what doesn’t. But that data doesn’t exist. It’s sort of lost forever. Maybe if you read all the papers again, you might find some, but usually even in the papers, they might not include all the substrates they tested.
The conceptual thing we did was to assume that a substrate that is not similar to one that works is probably not going to work. And that’s based on what we know about enzymes: you can think of a substrate fitting into an enzyme like a puzzle piece.
If a puzzle piece looks very different, it’s probably not going to fit, so it’s not going to do anything interesting. We sampled examples that should not work, on the assumption that these guesses were reasonable. That gave us a large quantity of negative examples, which allowed us to train the model.
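A minimal sketch of that sampling idea is below, assuming RDKit fingerprints and a Tanimoto-similarity cutoff. The 0.3 threshold, the fingerprint settings, and the candidate pool are illustrative assumptions, not the paper’s exact procedure.

```python
# Sketch of negative-example sampling: molecules dissimilar to every
# known substrate of an enzyme are presumed non-substrates.
import random
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles: str):
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), radius=2, nBits=1024
    )

def sample_negatives(known_substrates, candidate_pool, n=10, cutoff=0.3):
    """Pick candidates whose best Tanimoto similarity to any known
    substrate stays below `cutoff`, and treat them as presumed negatives."""
    known_fps = [fingerprint(s) for s in known_substrates]
    negatives = []
    for smiles in random.sample(candidate_pool, len(candidate_pool)):
        fp = fingerprint(smiles)
        if max(DataStructs.TanimotoSimilarity(fp, k) for k in known_fps) < cutoff:
            negatives.append(smiles)
        if len(negatives) == n:
            break
    return negatives

# Toy usage: benzene is a known substrate; dissimilar molecules become negatives.
print(sample_negatives(["c1ccccc1"], ["CCO", "CC(=O)O", "Cc1ccccc1"], n=2))
```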
J.C.: So it’s kind of like how you need a bunch of images that are “not a cat” if you want to train a computer-vision AI to recognize cats?
M.E.: It’s funny. I was just going to talk about cats. Yeah, exactly. Say you want to make a model that tells the difference between cats and dogs, or cats and non-cats. If you only have cat pictures, you can’t train it.
J.C.: How is EnginZyme implementing the ESP model?
M.E.: If you go back to the search problem, you could confine your search to the proteins or enzymes made by nature. There are lots of them, of course, and that’s where we will apply the model first. Again, there are databases with lots of sequenced genomes, and from these genomes we can extract which proteins are made.
Let’s say you have a question about which enzyme might use benzene. We can run the model and get predictions for which ones are likely to use this substrate. And then we test them.
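In code terms, that screening step might look like the following sketch, reusing the hypothetical featurize helper and clf classifier from the earlier training sketch; the workflow and names are illustrative.

```python
# Sketch: rank candidate enzymes by the predicted probability that they
# act on a query substrate, then send the top hits to the lab.
def rank_enzymes(candidate_sequences, substrate_smiles, top_k=10):
    scored = []
    for seq in candidate_sequences:
        x = featurize(seq, substrate_smiles).reshape(1, -1)
        prob = clf.predict_proba(x)[0, 1]  # probability the pair works
        scored.append((prob, seq))
    scored.sort(reverse=True)
    return scored[:top_k]

# e.g. rank_enzymes(proteome_sequences, "c1ccccc1")  # benzene as the query
```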
The model will help with one of the big challenges we have: simply to find a good starting point.
We do enzyme engineering, and a starting point for us is an enzyme that takes our substrate of choice and does something with it. But it might not be good enough, so we engineer it to be more stable or more active. Finding that starting point where something happens is quite challenging. The model can predict which candidates are likely to work, so you can narrow your search.
J.C.: And then once you find a pair that does something interesting, then you begin the process of optimizing it?
M.E.: Yes, exactly. And for us, that’s mainly stability, but also activity.
J.C.: What’s next?
M.E.: Putting it to good use out in the wild is really the next step: using it to explore the great quantity of enzymes that nature has provided us with, and to find good starting points for making new chemicals. The model is ready for this.
J.C.: So I understand you’re putting it out there for other people to use, or perhaps even improve on it?
M.E.: Yes, definitely. We live in an ecosystem of ideas, right? We depend a lot on other people’s advances, and we also recognize that other people could make improvements that we haven’t thought of yet. I believe the best way to make sure you reap all the benefits is to just put the model out there, wait a few years, then hopefully someone publishes something that’s even better than what you did.
J.C.: Let’s zoom out and talk more generally about how AI, machine learning and related computational tools are making things better for chemists and other scientists.
M.E.: Yes, if you will allow me a slightly extravagant extrapolation here, the way I see it is that all these technologies are dematerializing things, right? Digital cameras dematerialized photos; the cellphone dematerialized a great number of things — cameras, compasses, calculators, whatever. You have these movie services now, so nobody buys or rents DVDs anymore, these kinds of things.
I think of these models as dematerializing research, if that makes sense. Of course we’ll always need the lab, and we heavily depend on the lab, so I don’t want to downplay the heroic effort put in by the labs to provide solutions today, but I think in 10 years, maybe 15 at most, we’ll be much, much better at just hitting on the solution very quickly. So for any given product or any given enzyme that you sell, you’re going to need much less research 10 years from now. By research I mean hands in the lab, consumables, and the capital needed for all that. Tools like this are going to provide many more solutions to chemists much more quickly and with less work.
It was only 10 or 11 years ago that deep learning had its big breakthrough, when it began outperforming humans at certain tasks, and every year there is progress. Ten years from now, things are going to look vastly different.
J.C.: When I think about technology, I always think in terms of aviation, with the Wright brothers’ first flight in 1903 and the de Havilland Comet jet airliner entering service just 50 years later.
M.E.: That’s what it will be like with AI. We’re really just past the stage where we have shown that it works.
J.C.: Do you think that generative AI will live up to all the buzz that is currently surrounding it? Or do you think it’s just going to be one little part of a much larger whole?
M.E.: It’s revolutionary in so many ways. I completely buy into the hype. People work with mental models, and most people think of large language models like ChatGPT as sort of an evolution of Google, more like a search engine.
But that’s not the right mental model. There’s one project, ChemCrow, where they take a large language model like ChatGPT and augment it with chemistry tools. They use the LLM as a sort of brain that decides which actions to take for a given task and describes what to do in natural language.
So you can tell this thing, “synthesize jet fuel,” and it will come up with the steps to do that using the chemistry tools you hooked up to it. I think that’s the right mental model for imagining what LLMs could do for science.
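The pattern behind tools like ChemCrow can be sketched as a simple loop in which the model chooses a tool, sees the result, and decides the next step. Everything here is a hypothetical stand-in: call_llm is a placeholder for any chat-model API, and the two tools are not ChemCrow’s actual toolset.

```python
# Minimal sketch of the "LLM as decision center" pattern.
from typing import Callable

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-model API call."""
    raise NotImplementedError

# Placeholder tools; a real system would wrap genuine chemistry software.
TOOLS: dict[str, Callable[[str], str]] = {
    "plan_synthesis": lambda target: f"synthesis route for {target}",
    "lookup_safety": lambda compound: f"safety data for {compound}",
}

def run_agent(task: str, max_steps: int = 5) -> str:
    history = f"Task: {task}\nAvailable tools: {', '.join(TOOLS)}\n"
    for _ in range(max_steps):
        # The model replies either "TOOL <name> <input>" or "DONE <answer>".
        reply = call_llm(history)
        if reply.startswith("DONE"):
            return reply.removeprefix("DONE").strip()
        _, name, arg = reply.split(" ", 2)
        history += f"{reply}\nResult: {TOOLS[name](arg)}\n"
    return history
```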
J.C.: You’re basically giving an LLM a different language to work with, right? Like the language of chemistry.
M.E.: Exactly. And there are many versions of this; people are experimenting with having the LLM be the decision center that can take fuzzy information, fuzzy instructions, and turn them into something usable.
Generative AI is also being used for proteins. David Baker’s Institute for Protein Design at the University of Washington is really driving this development — a start-up spun out of the Institute, A-Alpha Bio, just raised more than $20 million.
There is a lot of activity around using generative AI for making completely novel proteins, for changing the sequences.
So far we have spoken about natural proteins, products of natural evolution, and there are lots of those. That is the space we search to find good starting points, as I said earlier. But with generative AI tools, you can search the space outside of nature: an enormous space of possible sequences that probably don’t exist anywhere in nature. You can now have generative AI come up with good sequences that are likely to do something for you, sequences that nature never made.
The space of possibilities is so huge that even calling the sequences we know a drop in the ocean would be an overstatement.
J.C.: Do you think that EnginZyme, as a small, nimble company, has an advantage compared to other companies when it comes to using these tools?
M.E.: Definitely, I think so. I see EnginZyme as an analog of Tesla in a way, because they’re vertically integrated, and we are too. Today you and I have been speaking about enzyme engineering. But after we engineer the enzyme, we can immobilize it with our patented material, put it in columns, produce the desired chemicals, and purify and sell those. We have a kind of full-stack vertical integration, which is a huge benefit because we can really tailor the enzymes to fit the application.
In some ways our processes are very simple. We create a tailored, stable, and active enzyme, and then put it to work in the same way catalysts are used in traditional chemistry.
AI could be a bigger part of the solution by giving us more starting points to create these stable and active enzymes, because once you get there, a lot of the uncertainty is removed already.
Biobased chemistry using microorganisms can be tricky. We bypass the organisms and go cell-free by getting straight to the enzyme. One of our advantages is that as AI improves, we’re going to be even more competitive since the enzyme is a bigger part of the solution.
J.C.: When the AI improves, you’ll be able to start from a better place and finish optimizing the enzyme faster. Is that right?
M.E.: That’s right. Finish faster, and then attack more chemicals.
Going back to the search analogy, there are many sequences that can provide a solution to a given problem: many millions, if not billions. But finding them is still hard. And depending on the chemical reaction you want to carry out, there will be more or fewer of these sequences.
Inherently, some things are harder to do, so you need a more refined protein. So as AI gets better, we will be able to do more things faster, but we’ll also be able to access these more difficult chemistries where you need a really finely tuned enzyme to make it happen.
J.C.: Processes where today you would probably look at the problem and say, “well, let’s not attack that because it’s too complicated”?
M.E.: Yeah, there are some of those for sure. Biobased chemistry cannot yet compete on every chemical process, but if we can make more products using enzymes and biocatalysis, we will make the world a better place.
Illustration by Klingit