The Prose of Proteins - A Lesson in Taste and Vision through the Work of Brian Hie
Cade Gordon, ML@Berkeley Researcher Highlight
In research, there is something captivating about witnessing a scholar's purposeful stride. Instead of their papers stumbling onto arXiv, their work carries a story. Over the past few decades, we've been fortunate enough to see individuals pen their own cohesive message into a body of work. In modern machine learning, Lucas Beyer and crew unified the modeling language spoken by different modalities. Transformers aren't just for machine translation: they have a place in vision too. It got scientists speaking the same language. Frances Arnold's work on directed evolution gave chemists more than novel catalysts. Her work told us that we didn't have to wait millions of years for nature to manifest the proteins of our dreams. We could do that ourselves over the course of weeks or months.
Richard Hamming encapsulates the lesson of research vision through an allegory on random walks. Haphazard life choices carry an individual only a distance of about $\sqrt{n}$ (were life a 1-D random walk). In contrast, deliberate choices over many small life decisions can take an individual a distance of $O(n)$. In Hamming's words, "the main difference between those who go far and those who do not is some people have a vision and the others do not."
To offer an antidote to Hamming's $\sqrt{n}$ lifestyle, we'll examine how taste can inspire vision. Using Brian Hie as our lens, we'll see how his unique background in both poetry and biology led to a story that nobody else could have told.
Act I: Prose
During his undergraduate years at Stanford, Brian opted for an uncommon duo: he studied both English literature and computer science. As we'll see, his time spent understanding the mechanics of language gave him a unique flair for solving biological problems.
From within the walls of Google, Devlin et al. 2018 made a strong case for a general unsupervised pre-training scheme. Their work, BERT, constructed a framework for pretraining representation models on text: mask out pieces of the input sequence and train the model to predict the missing text as output.
Though BERT was a paradigm for natural language, Brian and the protein team at Facebook saw it as more than a method enabling text classification or retrieval. They embraced the generality of the framework and noticed the mirror between English prose and biological language.
A protein sequence, written in a canonical alphabet of 20 amino acids, forms a string in the same way that English text does. This insight enabled a BERT-like method for pretraining called ESM (Rives et al. 2021). The models trained using this scheme rediscovered both the biochemical properties of the amino acids and elements of protein structure without direct guidance.
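To make the analogy concrete, here's a minimal sketch of BERT-style masked prediction on a protein sequence. It assumes the open-source fair-esm package and a small ESM-2 checkpoint (a later descendant of the Rives et al. models); the sequence and masked position are arbitrary examples, not ones from the papers discussed.

```python
# pip install fair-esm torch
import torch
import esm

# A small ESM-2 checkpoint: a BERT-style masked language model for proteins.
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# An arbitrary sequence over the 20 canonical amino acids.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
_, _, tokens = batch_converter([("example_protein", sequence)])

# Mask one residue, just as BERT masks a word, and ask the model to fill it in.
pos = 10                                 # 0-based index within the sequence
masked = tokens.clone()
masked[0, pos + 1] = alphabet.mask_idx   # +1 skips the prepended BOS token

with torch.no_grad():
    logits = model(masked)["logits"]

# Distribution over the vocabulary at the masked position.
probs = torch.softmax(logits[0, pos + 1], dim=-1)
top = torch.topk(probs, k=3)
for p, idx in zip(top.values, top.indices):
    print(alphabet.get_tok(idx.item()), round(float(p), 3))
```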

ESM wasn't the only time Brian read proteins through the vocabulary of human language. In "Learning the language of viral evolution and escape" (Hie et al. 2021), Brian and his team reasoned about viral escape using the framing of grammaticality and semantic change.
Protein language models allow us to define a sense of grammar and semantic shift. The grammaticality of a mutation can be thought of as how likely it is in a given context. In math terms, it is the model's conditional probability

$$p(\hat{x}_i \mid x)$$

for some mutation $\hat{x}_i$ to the $i$-th residue given the rest of the starter sequence $x$. Similarly, the work formalized semantic shift as the $\ell_1$ distance between the model's embeddings of the mutant and the wildtype.
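As a rough illustration of those two quantities, here's a sketch that reuses the small ESM-2 model from the previous snippet as the protein language model. Grammaticality is read off the model's probability for the substituted residue, and semantic change is the $\ell_1$ distance between mean-pooled embeddings of mutant and wildtype. This is a simplified stand-in for the paper's actual pipeline (Hie et al. 2021 trained a BiLSTM on viral sequences), meant only to make the definitions concrete.

```python
import torch
import esm

# Same small ESM-2 model as in the previous sketch.
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()


def grammaticality(sequence, pos, new_aa):
    """p(x_hat_i | x): probability of placing new_aa at position pos,
    with the rest of the sequence as context (masked-marginal style)."""
    _, _, tokens = batch_converter([("wt", sequence)])
    masked = tokens.clone()
    masked[0, pos + 1] = alphabet.mask_idx        # +1 for the BOS token
    with torch.no_grad():
        logits = model(masked)["logits"]
    probs = torch.softmax(logits[0, pos + 1], dim=-1)
    return float(probs[alphabet.get_idx(new_aa)])


def semantic_change(wildtype, mutant, repr_layer=6):
    """l1 distance between mean-pooled embeddings of mutant and wildtype."""
    _, _, tokens = batch_converter([("wt", wildtype), ("mut", mutant)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[repr_layer])
    reps = out["representations"][repr_layer]
    z_wt = reps[0, 1:len(wildtype) + 1].mean(dim=0)
    z_mut = reps[1, 1:len(mutant) + 1].mean(dim=0)
    return float(torch.abs(z_wt - z_mut).sum())
```

In this framing, a worrying escape mutation is one that scores well on both counts: grammatical enough to remain viable, yet semantically shifted enough to look different to the immune system.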

All of the above shows the early signs of a research taste that lets Brian study biological problems in a unique light. Over the next few acts, we'll see how this taste enables a research vision that yields not only a new lens on evolution, but also tangible prescriptions for protein design.
Act II: Observation
Grammar helps explain viral escape in proteins, but what else can it explain? It turns out that protein fitness in general mirrors the notion of grammar in language.
People use sentences that obey the laws of grammar, and organisms produce proteins that exhibit a natural fitness. A sentence's grammaticality increases the odds of it being used in the future, just like a protein's fitness increases the likelihood of its expression at a later time. In turn, grammar and fitness can be thought of as similar notions that power evolution. For language, it's how sentences change as they age, and for proteins it's how they mutate as time trots along.
Evolocity (Hie et al. 2022) formalizes the interplay between grammar, fitness, and evolution into a mathematical model that estimates the trajectory of protein mutations. By looking at a set of proteins, like influenza A's nucleoprotein, we can use protein language models to predict a rough ordering of which proteins came earlier or later. Looking at two proteins that differ by a single amino acid, we can consult a protein language model to predict which variant is more likely. The more probable variant can be seen as more grammatical, meaning it's fitter. Since evolution encourages the existence of fitter proteins, we can argue that the improved variant came after the less fit one. After probing the variants of a protein, an estimate of evolutionary time can be constructed by building a directed graph with arrows pointing from less grammatical mutants to more grammatical mutants. Using this graph, the roots (the starting evolutionary sequences) can be identified and then analyzed to assign a pseudotime to each protein variant in the graph. This pseudotime is our evolutionary time step prediction.
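Here's a toy sketch of that graph construction, not the actual evolocity package (the real method works with a velocity field over a sequence-similarity network): connect sequences that differ at a single residue, point each edge toward the more grammatical variant, and read pseudotime off distance from the roots. The `sequence_likelihood` argument is a hypothetical scorer, e.g. a protein language model's log-likelihood of the full sequence.

```python
import networkx as nx


def hamming_neighbors(a, b):
    """True if two equal-length sequences differ at exactly one residue."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def build_evolocity_graph(sequences, sequence_likelihood):
    """sequence_likelihood: hypothetical callable giving a PLM-style score
    for a full sequence; higher = more grammatical = fitter."""
    scores = {s: sequence_likelihood(s) for s in sequences}
    g = nx.DiGraph()
    g.add_nodes_from(sequences)
    for i, a in enumerate(sequences):
        for b in sequences[i + 1:]:
            if hamming_neighbors(a, b):
                # Edge points from the less grammatical to the more grammatical variant.
                src, dst = (a, b) if scores[a] < scores[b] else (b, a)
                g.add_edge(src, dst)
    return g


def pseudotime(g):
    """Approximate pseudotime as distance from the roots
    (nodes with no incoming edges, i.e. the starting sequences)."""
    roots = [n for n in g.nodes if g.in_degree(n) == 0]
    times = {n: 0 for n in g.nodes}
    for root in roots:
        for node, dist in nx.single_source_shortest_path_length(g, root).items():
            times[node] = max(times[node], dist)
    return times
```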

Predicting evolution on the basis of grammar seems like a far-fetched claim. Beyond the beautiful formulas and reminiscence of our algorithms courses, what predictive power does Evolocity have? In practice, quite a lot. Examining the test case mentioned before, the nucleoprotein of influenza A, offers a good playground. Our human immune systems place evolutionary pressure on influenza. Furthermore, influenza's evolution is well cataloged, enabling estimation of mutations over time. Comparing Evolocity's pseudotime against the underlying sampling year of a nucleoprotein results in a Spearman correlation of 0.49. The natural question is whether this could be due to noise. A two-sided t-test gives $P = 4 \times 10^{-197}$, making a spurious correlation of this magnitude highly unlikely.
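That kind of sanity check takes only a few lines; the arrays below are placeholders for illustration, not the paper's data.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder data: in the real analysis there is one entry per
# nucleoprotein sequence, pairing predicted pseudotime with the
# year the sequence was actually sampled.
pseudotime = np.array([0.10, 0.42, 0.23, 0.81, 0.95, 0.55])
sampling_year = np.array([1968, 1985, 1977, 2005, 2014, 1999])

rho, p_value = spearmanr(pseudotime, sampling_year)  # two-sided by default
print(f"Spearman rho = {rho:.2f}, P = {p_value:.2e}")
```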

The framing of proteins as language has created great work thus far. The continued stylization of Brian's work through language has led to a non-trivial way to think about evolution. But the work doesn't end there; in the final act, we'll discuss how his sustained linguistic flair gave rise to knowledge that will soon impact the clinic.
Act III: Application
Evolocity alone generated curiosity about what problems could be solved by charting evolution. Naturally, given the influenza A nucleoprotein studies, investigating other patterns of viral escape became a plausible next step. A team out of BioNTech published work building on the ideas of Evolocity, developing a method for predicting high-risk variants of SARS-CoV-2 (Beguir et al. 2023).
Evolocity could have been the end of the line, but grammaticality had one more fruit to bear. As the final act in the trilogy, we'll discuss another insight enabled by language.
Predicting evolution doesn't stop at sorting the trajectory of known proteins; Evolocity could also be used as a tool to further evolve proteins. Antibody therapeutics have proven themselves potent tools in the clinic. Most importantly, they form a perfect testing ground for protein evolution.
Building upon Evolocity, Hie et al. 2023 used a voting ensemble of different PLMs to determine whether point mutations had higher likelihood than the wild-type residues. If sufficiently many of the PLMs voted a mutation as more probable, or grammatical, than the starting sequence, the mutation was adopted. Those sequences were sent to the wet lab, and any fitter variants were combined into a second round of evolution.
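Here's a rough sketch of that voting scheme, in the spirit of the paper rather than a reproduction of it. The `residue_probability` argument is a stand-in for a masked-marginal query against a single PLM, such as the `grammaticality` helper sketched back in Act I, and the vote threshold is a free parameter.

```python
def vote_on_mutations(models, wildtype, candidate_mutations,
                      residue_probability, min_votes=2):
    """Keep a point mutation only if enough models in the ensemble assign it
    a higher probability than the wildtype residue at that position.

    models              -- ensemble of protein language models
    wildtype            -- starting amino acid sequence (string)
    candidate_mutations -- list of (position, new_amino_acid) pairs
    residue_probability -- callable (model, sequence, pos, aa) -> probability
    """
    accepted = []
    for pos, new_aa in candidate_mutations:
        wt_aa = wildtype[pos]
        votes = sum(
            residue_probability(m, wildtype, pos, new_aa)
            > residue_probability(m, wildtype, pos, wt_aa)
            for m in models
        )
        if votes >= min_votes:
            accepted.append((pos, new_aa))
    return accepted
```

Requiring agreement across models keeps the recommendations conservative: no single over-confident model can push through an implausible substitution on its own.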

It works! Yet there's something more surprising. Along the way, we never baked the identity of the antigen into the prior. The model is never told what the antibody should bind to, yet it consistently matures the antibody all the same. Further statistical testing against a hypergeometric null argues that the method is significantly better than random mutagenesis.
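To give a sense of what testing against a hypergeometric null means here, the sketch below asks: if only a small fraction of all possible point mutations actually improve binding, how surprising is it that so many of the model's few recommendations were improvements? The numbers are made up for illustration and are not the paper's.

```python
from scipy.stats import hypergeom

total_mutations = 2000   # all possible single point mutations (hypothetical)
improving = 100          # of those, how many actually improve binding (hypothetical)
recommended = 10         # mutations the language models recommended
hits = 4                 # recommendations that turned out to improve binding

# P(X >= hits) if the recommendations were random draws
# rather than guided by grammaticality.
p_value = hypergeom.sf(hits - 1, total_mutations, improving, recommended)
print(f"P(X >= {hits} hits by chance) = {p_value:.2e}")
```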
Taking a step back from the technical details, we've witnessed something special. Brian took a simple protein language model and showed that it could both predict evolution and design therapeutics. The continued vision shown by this lineage of work took it far beyond a simple paper or two. His taste in language pushed us to rethink what these models can do, and it might power the therapeutics we take.
Today, as both sequence- and structure-guided models become more and more performant, our in silico design capabilities will continue to improve. Work led by Varun Shanker and Brian Hie (Shanker et al. 2023) recently highlighted how structure-conditioned models (known as inverse-folding models) can take a protein backbone structure and propose novel mutants that retain function even more efficiently than PLMs. Perhaps, as structure and sequence models improve and larger models come out (such as Brian's recent work on Evo in Nguyen et al. 2024), our ability to engineer evolution may become even more prescient. This culmination of skill could one day enable genome-scale in silico evolution.
Conclusion
Science rarely celebrates a young PI, so we took a look at one that's blossoming. Brian made a dent because he was unabashedly himself in creating a research vision. One of the greatest things he's done as a scientist was read poetry. It enabled him to tell a story that only he could. His intertwining of semantics and grammar in biology may very well have ramifications for how we design biologics for the clinic. As we turn to our own research, maybe the most important thing in the lab is neither the pipette nor the reagent, but the copy of Murakami witnessing each of our experiments and shaping us as individuals.