Ten Years (give or take) in the Evolution of a Protein

Written by: Christoph Adami

Primary Source:  Spherical Harmonics

How do proteins evolve? Generally the answer is “Very slowly!”. But sometimes, protein evolution can be blazingly fast. How fast, you ask? Ask instead the lizards of the South Adriatic Sea!
OK, where is the South Adriatic Sea? you ask. You should really be asking “What about those lizards?”, but here we go. The Adriatic Sea separates Italy from the Balkan peninsula, as in the picture below (upper left corner). So in 1971, researchers decided to take a species of lizards (known as Podarcis sicula, the Italian wall lizard) found on the small island Kopiste, and transplant them to the neighboring small island Mrcaru.
Adriatic Sea (top left). Pod Kopiste is the tiny island on the left, and Pod Mrcaru is to its right. The larger island is the inhabited Lastovo (credit: Google World)
I don’t know why they did it. They transplanted five adult breeding pairs, so they were intent on creating havoc, no doubt. Or an experiment, perhaps? But the Croatian War of Independence intervened, and the lizards were all but forgotten until a team returned in 2004 to Mrcaru to look at the local lizards there. And they found that the offspring of the ten had essentially overrun the island, and changed in profound ways. On Kopiste, the lizards ate mostly insects. On Mrcaru, instead, there was an abundance of plants for food, and comparatively fewer insects. The insect-eating lizards, however, were not adapted to digest plants, something that requires a different stomach structure that ensures that the plants stay in the intestine long enough to digest the plant cellulose. If the plants do not stay there long enough, you can’t get the energy from them. It turns out that the lineage on Mrcaru evolved so-called cecal valves, something that does not usually occur in lizards. The cecal valves close off parts of the stomach, so that some types of bacteria can ferment the cellulose in there. This is stunning only because this adaptation took just over thirty years. It turns out that other body characteristics had changed too: longer, wider, and taller heads that translate into larger forces to bite down on the tough fibrous plants. The lizards needed to survive: this is how they did it.
Can proteins really evolve that fast? It seems that the answer is: “If you really really have to, then yes”. What a pity that we haven’t been able to sample the sequences of the proteins involved over the thirty some years. Wouldn’t that give us a fantastic window on protein evolution? But how can you know that a protein is about to undergo fundamental changes?
It turns out that you can, if you modify the environment in such a way that it becomes unlivable for the organism involved, and you then look for those types that survive the slaughter. Sounds immoral? But we do it all the time, when we give drugs to fight viral infections! The example I will use is the evolution of drug resistance in a protein of the Human Immunodeficiency Virus (HIV), the virus that causes AIDS.
AIDS broke out into the Western population in 1981, but it took fourteen years to develop the first effective anti-viral treatment: a drug that inhibits a crucial piece of the HIV machinery: the protease. To understand the drug and what the protease does, we have to spend some time with the somewhat unusual lifecycle of HIV. It is a retrovirus, which means that its genetic material is RNA, not DNA. The virus infects cells that are crucial in people’s ability to fight infections, which explains to a large extent why it is so deadly: it attacks precisely the system that is supposed to save you. The figure below gives you an idea of the virus’s life cycle.
HIV life cycle. Source: Wikimedia
After the virus capsid (the shell that encapsulates the virus RNA along with a few necessary molecules) binds to the cell (here, a T-cell, which is a type of white blood cell that plays a central role in the immune system), the virus injects the capsid’s material into the cell. Along with the RNA in the capsid comes an enzyme called the “reverse transcriptase”, which is able to make a DNA copy from the RNA material, and this DNA copy is subsequently inserted (“integrated”) into the host cell’s DNA. Now, the DNA of every cell is constantly transcribed and then translated into proteins, and the same is going to happen to the foreign DNA that was inserted into the host cell. Willy-nilly, the cell makes proteins from the virus’s information: it is making virus parts. But it turns out that unlike your own genes, which have stop codons to indicate where a protein ends, the foreign DNA (made from virus RNA) does not have those. As a consequence, the cellular machinery produces one long long protein, called a “polyprotein”. It is, of course, totally unusable in this form. It must be cleaved (meaning “cut”) into the functional pieces with a knife. Where can the virus find such a knife? Well, it makes it itself, and it carries a copy with it in the capsid. With this knife, the polyprotein is cut into all the pieces that are needed to assemble another functional capsid (including the protease and the reverse transcriptase), and these pieces are packaged with copies of the RNA genetic code (which the cell helpfully made for free) into new capsids. The action of the knife (called a “protease”) is shown in the lower left corner of the life cycle diagram above.
“If I could just blunt this knife”, is what HIV researchers were asking themselves, and they found just the way to do it. Take a look at the molecular structure of the protease in the figure below.
The HIV protease is a dimer (meaning it is made out of two copies of the same protein that bind to each other, here in cyan and green). Two particular amino acids that are important in the activity of the molecule are colored red and purple.
See the hole in the middle, surrounded by the red and purple amino acids? That’s where the polyprotein fits in, and the protease cuts it like a cigar cutter at specific points that are recognized by the red and purple residues. How do you inactivate the cigar cutter? You stick something in there to block the hole! Indeed, this is how all protease inhibitors (that is, drugs that inhibit the activity of the protease) work.
When these drugs hit the market, they were replacing older drugs that had nasty side effects. And these new drugs worked like magic! The only trouble was that the virus was not going to capitulate that easily. Indeed, researchers had created just the scenario that we were calling for above: change the environment in such a manner that makes it unlivable for an organism, and see how it can cope.
HIV protease inhibitors work really well (in particular if combined with another drug, the reverse transcriptase inhibitor), which means that the virus population all but goes extinct. The important modifier here is “all but”. Instead of going extinct, it goes into hiding, and researchers don’t really know where. As you can imagine, finding this hiding spot (and how to coerce the virus to leave it) is a major effort of HIV research today. A problem arises if a patient forgets to take their antiviral drugs. The virus comes out, starts replicating (slowly), and the high mutation rate of the virus creates the opportunity to evolve quickly. HIV can evolve resistance to a protease inhibitor within two weeks. This is not altogether surprising, as when unchecked the virus creates an enormous number of copies (correct and flawed) of itself every day, so that every single mutation of the nearly 10,000 nucleotide genome is tried multiple times every day, and every pair of mutations a few times. This is enough to cause rapid evolution, and if a single virus finds a way to survive the massacre the drug unleashes, that virus will grow in numbers and create the seeds of a new destructive force that the inhibitor is unprepared for. When resistance emerges, researchers go back to the lab to develop a new type of protease inhibitor, a new way to dull the knife. While the new drug is effective for a while, evolution ultimately keeps up, and finds a way to evade it. How do we stop this maddening race?
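To get a feeling for why single mutations and pairs of mutations are “tried” so often, here is a back-of-the-envelope sketch. Only the genome length comes from the text above. The mutation rate and the number of genomes produced per day are order-of-magnitude assumptions that I plug in for illustration, so treat the output as a rough estimate and not as a measurement.

```python
# Back-of-the-envelope estimate of the daily mutation supply.
# ASSUMPTIONS (not from this post): mutation_rate and new_genomes_per_day
# are illustrative order-of-magnitude values.
genome_length = 10_000        # nucleotides (roughly, as stated in the text)
mutation_rate = 3e-5          # assumed substitutions per site per replication
new_genomes_per_day = 1e10    # assumed viral genomes produced per day, unchecked

# Chance that one new genome carries one *specific* substitution:
# a given site must mutate, and to a given one of the 3 other bases.
p_specific = mutation_rate / 3

# Expected number of new genomes per day carrying a specific single
# substitution, or a specific pair of substitutions (treated as independent).
expected_single = new_genomes_per_day * p_specific
expected_double = new_genomes_per_day * p_specific ** 2

# How many distinct single substitutions exist at all.
n_single_mutants = 3 * genome_length

print(f"distinct single substitutions: {n_single_mutants}")
print(f"copies per day of a given single substitution: {expected_single:.3g}")
print(f"copies per day of a given pair of substitutions: {expected_double:.3g}")
```

Under these assumed numbers, any given single substitution shows up many thousands of times a day, and any given pair about once a day. Change the assumptions and the numbers shift, but the qualitative point survives: singles are cheap, pairs are within reach.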


The history of this fight between the virus and the drugs that attempt to keep it at bay is documented, as it occurred after we had figured out how to sequence stuff. Every paper that relied on patient data, and every drug trial, was asked to deposit its sequence data (namely the sequence of the virus extracted from the patients) in publicly accessible databases. This sequence data became the “fossils” of this evolutionary history, and it is made from the viral RNA of patients that fought this fight, on the frontline. Many of those did not survive the fight, but they bequeathed their virus’s sequence data to us for posterity so that we can, perhaps, save the next generation.
Patients that were enrolled in a multitude of drug trials would have the virus’s information sequenced, and these records ultimately found their way into Stanford University’s HIV resistance database (HIVdb). All sequence data is usually deposited in central repositories such as Genbank, but Stanford’s HIVdb provides an enormous service by curating the HIV data on a single site, and developing tools and algorithms to investigate that data. In my lab, I decided that we should mine this “fossil record” to understand how HIV is adapting to, and attempting to evade, the drugs thrown at it. The evolution of drug resistance in HIV can thus be seen as a long-term evolution experiment (LTEE), only that compared to the LTEE it is short, and we do not have frozen isolates. The Stanford database is a compendium that allows users to query all sorts of information about sequence, type, and resistance profile. For our purposes, namely to study how the sequence evolves, we need only two things: sequences, and whether the patients who donated the sequence were receiving anti-viral drugs.
To understand how evolution is affecting a protein, we have to discuss the concept of the “fitness landscape”. Entire series of blog posts can be written about this concept, but we don’t have that kind of space here. Broadly speaking, a fitness landscape is an idealized picture of how the fitness of an organism depends on either the traits or the genome that determine the organism. Here, we will focus on the mapping between sequence and fitness, not traits and fitness. In such a picture, the fitness is the “elevation”, and the sequence is the coordinate. If you search for “fitness landscape” you will almost invariably end up with a picture that originates from my lab. Give it a try! You might for example find this:


A rugged fitness landscape with different evolutionary paths. Credit: Randal S. Olson
This is a rendering of a rugged fitness landscape that my student (at the time) Randy Olson created for a manuscript that we ended up not finishing.  The general idea depicted there is that mutation-by-mutation you could move peak-to-peak, or if this is not possible, you might choose a path that tries to maximize fitness, even though you may have to walk in the valleys between peaks for a (short) while.
If you consider a protein landscape (the z-axis values in the landscape represent how well a protein is doing its job) then most proteins occupy a peak, because if they did not, then mutations would move them closer to the peak until there are no more ways to improve the protein. Drugs that attack the function of a protein (such as the protease inhibitor blunting the protease as described above) change the landscape profoundly: you can imagine that they simply erase the peak. You might think that this would kill the organism (if the protein is essential). Due to the high mutation rate of the HIV virus, there are actually a lot of variants that exist in the population. Many of them are completely defective, but some of them “live” at the edges of the fitness peak that the un-mutated protein occupies. Because they are barely functional they usually do not play a role. But when the main peak is eliminated, the sequences at the fringes may be the only ones to survive. They make a virus that replicates very slowly, but replicate it does. And thus evolution can continue: if there is any way to improve the function of the protein, that path will be taken. The protein will find a distant peak to climb, and the virus is resurrected: it has evolved resistance to the drug.
Even though research has discovered more and more potent anti-viral drugs, which attack different proteins and are thus more effective than any single drug can be, the virus ultimately will evade them, in particular if the patient forgets to take the drug so that the virus can replicate faster and thus accumulate mutations faster. Is there no way to stop this?
In research that has just appeared in the journal PLoS Genetics, my colleague Aditi Gupta (now a postdoctoral researcher at the New Jersey Medical School of Rutgers University) and I studied how the virus adapts to more and more complex drug environments over a span of almost 10 years. We studied the evolution of the HIV protease (the molecule you encountered above) using sequences deposited in the Stanford database. We found two things. First, in patients that did not receive drugs, the protease molecule was not evolving. Second, in patients that did receive drugs, the protease molecule was evolving quickly, but it evolved in a peculiar way: by storing information in epistatic interactions, rather than in residue changes.
Ok Ok, I realize that this was a mouthful. First, what was that bit about information? You see, for a protein (as well as all life, in the end) everything is about information. A protein that “does its job” has information about the environment within which it is active. Its sequence encodes that information, but it is information about that environment. You change the environment, and what used to be information may not be information anymore. Information is contextual (as I argue in a series of blog posts that starts here). The evolution of drug resistance, in the light of information theory, is then just the quest to “learn” (that is acquire information) about that new world, the new context.
And it so happens that you can store information in different ways in a sequence. You can certainly store it in the individual symbols that make up the sequence. That is how we usually think of storing information. It is less well-known that you can also store information in the correlations between symbols. I don’t know of a good way to make this intuitive. Information is something that allows you to make predictions (as I argue in the above-mentioned series). A single site being an ‘A’ (instead of a ‘C’, ‘G’, or ‘T’) might be predictive of a particular environmental state. But you can imagine that a site being an ‘A’ as long as a very particular other site is a ‘G’ can also be predictive, as long as the only pairs that are allowed are ‘AG’ and ‘GA’. This kind of “dependence” between sites is known as “epistasis” in genetics. There is an enormous amount of literature about epistasis in genetics (as there should be, as I believe it to be the central concept in evolutionary biology) but this post is already too long, so I must refer you to the wiki pages to learn more.
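If the ‘AG’/‘GA’ example is still too abstract, a few lines of code may help. This is a minimal sketch with two made-up populations of two-site sequences (the populations and the function names `entropy` and `mutual_information` are mine, purely for illustration). In both populations each site is an ‘A’ half the time and a ‘G’ half the time, so looking at one site alone you cannot tell them apart. The difference is entirely in the correlation between the sites, which the mutual information picks up.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Hypothetical populations, for illustration only.
coupled = ["AG", "GA", "AG", "GA"]      # only 'AG' and 'GA' are allowed
uncoupled = ["AA", "AG", "GA", "GG"]    # the two sites vary independently

for name, population in [("coupled", coupled), ("uncoupled", uncoupled)]:
    site1 = [seq[0] for seq in population]
    site2 = [seq[1] for seq in population]
    print(name,
          "H(site1) =", entropy(site1),
          "H(site2) =", entropy(site2),
          "I(site1;site2) =", mutual_information(site1, site2))

mi_coupled = mutual_information([s[0] for s in coupled], [s[1] for s in coupled])
mi_uncoupled = mutual_information([s[0] for s in uncoupled], [s[1] for s in uncoupled])
```

In the coupled population, knowing one site tells you the other exactly, which is one full bit of mutual information. In the uncoupled population the same single-site frequencies carry no information in the linkage at all.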
What I argue thus, in a nutshell, is that you can store information in substitutions (of residues) or you can store it in epistatic interactions between residues. What Aditi Gupta and I found by analyzing the “fossil record” of almost ten years of protein evolution is that the protease mostly stored information in the linkages between residues.
I know what you are asking: “Why would a protein do that, and what are the consequences?” These are good questions. Let’s investigate them one by one.
Storing information in “correlated changes” (epistatic interactions) is a necessity if you are rushed. The reason is technical, and you are forgiven if you don’t grasp the entirety of the argument. Single substitutions (the “simple” way to store information) have serious repercussions for a protein, as substitutions (on average) destabilize the protein. Yes, you do remember that a protein has to fold into its structural conformation, and it doesn’t just do that willy-nilly (that’s the second time I used that construction, isn’t it?). This fold has to be energetically favored, and changes in the residues usually make things worse for those energetics. This isn’t a problem if a substitution makes it just a little harder to fold, and if at the same time you have enough time to correct for that problem, by making a compensating substitution somewhere else, later. But if time is of the essence (as when the protein just found its peak utterly annihilated) you can’t just substitute a residue, because you probably have to substitute another too, and that would make the protein not fold. A non-folded protein is a dead protein. It cannot wait for a substitution that will save it.
But as I pointed out, there is another way to “learn” (that is, acquire information) by changing the way residues interact. Such changes affect the folding free energy of the protein very little, and as a consequence this is the favored mode of information acquisition if time is of the essence. What we find in the fossil record is that, indeed, this is how evolution proceeds.
What are the consequences? Well, they are likely to be profound. If a protein evolves to store information in linkages between residues, that implies that the protein becomes more and more constrained. After doing this for a while, there aren’t that many residues anymore that are free to vary, as there are so many relative states that need to be satisfied. In theory, this means that the protein is evolving itself into a corner from which there may be no escape. What it means is that the protein inhabits a fitness landscape that becomes more and more rugged the more interactions are being locked in between residues.
Let me show you some of the technical evidence that appears in the paper. In the figure below, you see something we call “sum of pairwise MI”, where MI stands for “mutual information”. You can think of that measure as representing the amount of information stored in the linkages between residues in the protein. As a matter of fact, you shouldn’t just think of it in those terms, it is precisely that. This measure is increasing in patients that respond to drug treatment (blue triangles), but does not change in patients that are not receiving those drugs (but really are wishing they would).
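For those who like to see how such a number comes about, here is a minimal sketch of a “sum of pairwise MI” computed from a set of aligned sequences. It is not the analysis pipeline from the paper: the toy alignments and the function names are made up for illustration, and a real analysis of finite patient samples would also have to deal with sampling bias, which this sketch ignores.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def sum_pairwise_mi(alignment):
    """Sum of I(i;j) over all pairs of positions i < j.

    `alignment` is a list of equal-length sequences; each position
    (column) is treated as one random variable.
    """
    columns = list(zip(*alignment))
    total = 0.0
    for col_i, col_j in combinations(columns, 2):
        joint = list(zip(col_i, col_j))
        total += entropy(col_i) + entropy(col_j) - entropy(joint)
    return total

# Hypothetical toy alignments (three positions each), NOT real HIVdb data.
no_linkage = ["AAC", "AGC", "GAC", "GGC"]    # positions 1 and 2 vary independently
with_linkage = ["AGC", "GAC", "AGC", "GAC"]  # positions 1 and 2 covary

print("no linkage:  ", sum_pairwise_mi(no_linkage))
print("with linkage:", sum_pairwise_mi(with_linkage))
```

Both toy alignments have the same per-position variability, but only the second one has a nonzero sum, because only there does one position tell you something about another. A rising sum over time is what “storing information in linkages” looks like in this kind of measure.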
Pairwise epistasis, measured in terms of mutual information, as a function of time in the HIV-1 protease. Triangles: patients taking anti-viral drugs. Circles: patients not taking any anti-viral drugs.
What this plot shows is that the proteins that are adapting to drugs do so by creating functional links between residues, and this evolution persists as more and more sophisticated drugs are introduced. But the trend seems to be stalling within the last three years. Could it be that the virus is becoming so constrained that further adaptation is impossible?
I wish I knew the answer to this question, but I don’t. At least from the time course we investigated in this paper, there is no evidence that the protein has slowed its evolution. But I must caution that we only investigated the evolution of the HIV protease for the years 1998-2006. There is sequence data for the years after 2006, of course, but our study was explicitly comparing the response of patients that took anti-viral drugs to those that did not. And after 2006, you could not find enough sequences from patients not taking anti-viral drugs in the database to make statements that were statistically sound. We understand the reason for this, of course, as the anti-viral drugs had become so potent that it would be morally reprehensible to withhold them from a control group.
It is possible that a slow-down of evolution can be discerned in the sequences of patients that were exposed to anti-viral drugs post 2006. That would be a stunning development, which would have profound implications for the evolution of drug resistance in HIV. The data is there. Who wants to analyze it?
The study I discuss was published as:
A. Gupta and C. Adami, “Strong selection significantly increases epistatic interactions in the long-term evolution of a protein”. PLoS Genetics 12 (2016) e1005960.
Dr. Adami is Professor for Microbiology and Molecular Genetics & Physics and Astronomy at Michigan State University in East Lansing, Michigan. As a computational biologist, Dr. Adami’s main focus is Darwinian evolution, which he studies theoretically, experimentally, and computationally, at different levels of organization (from simple molecules to brains).