Protein Folding

CS 8803, Simulation of Biological Systems

Kent Czechowski

Scott McManus

Presentation
Source code

Overview

Protein folding presents an enormous problem with respect to searching for optimal protein folds. The Levinthal paradox comes in to play, where there may be 10300 possible protein folds to evaluate, yet a protein chooses the optimal configuration every time. Even for shorter protein chains, the search space is simply too large. In Ken Dill's model, where hydrophobic (H) and hydrophilic (P) amino acids are laid out on a grid, there are over 268 million grid layouts (not necessarily self-avoiding) that are possible. Moreover, there are very few self-avoiding walks to be found. Approximately .0164% of a 24 amino-acid chain's walks are self-avoiding, and it is the self-avoiding walks that are the most interesting. Since there is no algorithm known for efficiently generating these self-avoiding walks, the goal of searching for optimally-folded proteins is even more difficult.

We have chosen simplified models for expressing proteins and have explored search algorithms for finding optimal protein folds. In particular, we have chosen Ken Dill's HP model for representing amino acids on a rectangular lattice. The hydrophobic amino acids tend to cluster together, but there is no actual force that is necessarily measured between them. Instead, this gives a function for evaluating the energy between hydrophobic amino acids: each pair neighboring amino acids gives an energy of +1 between each amino acid, for a total of +2 energy per pair.


Fig 1: Example of the HP Model We have chosen three techniques for searching the space of possible protein folds:

Simulated Annealing

In simulated annealing, the goal is to perform many runs - possibly on the order of millions or even more - of starting from a base or randomized configuration and then making a randomized walk through the configuration space. Each successive configuration is associated with an energy which, in our case, we wished to maximized. The basis for simulated annealing comes from physics, where a high-temperature material will cool or anneal into a crystalline or ordered structure. As the material cools, the likelihood of breaking away from a crystalline structure goes down.

With this in mind, simulated annealing makes a random walk through a configuration space (e.g., folding a protein on a lattice). If making a step in the walk results in a higher energy, then the step is always taken, which is the same as climbing a local maximum. However, suppose that making a step results in a lower energy. In this case, the step is taken cautiously and with a probability that depends on the temperature of the experiment. Early in the experiment, when the temperature is high, the probability of taking a random step (even if that step results in a lower-energy configuration) is high. However, later in the experiment, when the temperature is low, the probability is also low. This is modeled by a cooling schedule, such as e(Energy difference)/f(Temperature), where the function f is typically a constant times the temperature. So at a high temperature, the exponent is 0 and e is close to 1, and at a low temperature the exponent is large and, for a negative energy difference, is close to 0.

We have followed this technique in our simulated annealer. Our energy is, of course, based on the energy of a protein fold, which is based on the number of adjacent hydrophobic amino acids. We targeted a total of 1 million total steps, with results for running with 100 different proteins over 10,000 timesteps up to 100,000 different proteins over 10 timesteps. Since we did not model a temperature explicitly, we tried two cooling schedules - f(T) = 1, and f(T) = square root of the number of iterations made so far.

One of the problems mentioned before is that there is no good way to choose a random, nonintersecting walk. Therefore, we always started with a straight chain. Our perturbations consisted of randomly choosing an amino acid and changing the direction in which an amino acid connected to its neighbor. For example, suppose that we started with a straight chain. Each amino acid in the chain is directed "forward" towards the end. If we change the direction for one amino acid to be "left" or "right" instead of forward, we get a bent shape. If we choose another amino acid later in the chain and bend it in the same direction, then the protein will fold back on itself.

This is repeated between 10 and 10,000 times for a protein. After a number of iterations, the protein typically stops yielding new, self-avoiding walks and becomes unviable. Therefore, there is a question as to how many runs should be made.

Simulated Annealing Results

In our implementation, we varied this number and were not surprised by the results, which are listed below.
Total number of trials Number of iterations per protein Maximum energy Percentage of folds that are self-colliding
100,000 10 34 - 48 19.2%
10,000 100 34 68.5 to 68.7%
1000 1000 34 76.7 to 76.7%
100 10,000 34 68.4%

Table 1: Collisions and Energy for Tested 20-AA Proteins

The same results were found for the square-root-cooling schedule.

One observation should be clear: after a few iterations on a single protein, most of the chosen perturbations are useless. Another that is also striking is that, when running with only 10 timesteps in a single trial over 100,000 trials, the best possible result was found for at least one protein. There are a few explanations for this, and the most likely is that a small set of maximal energy states are determined early in the folding. By running over many more iterations, we have the possibility of finding initial folds which will yield the optimal energy.

Genetic Algorithms

In genetic algorithms, we start with a population of proteins and then mate them with one another. The mating consists of first computing the fitness of each population member. We use a weighted probability for choosing those who will mate together. Once the population members are chosen for mating, mating proceeds by combining the first half of one parent and combining it with the second half of the other parent. Note that there is a fixed orientation to the protein chains. There is also a possibility of mutations - the direction along three monomers has a 1/100 chance of mutating during mating. We then repeat the previous steps many times for a given population and output the result.


Fig. 2: Example of Mating Between Two Proteins



Fig. 3: Example of Mating in Genetic Algorithm Based on Fitness

Also note that the mating technique is similar to the dimerization approach, where two protein chains of half length are combined together to create what is hopefully a chain with good energy characteristics. However, since the energy of a folded protein is dependent on the actual combination of the two (and not just the fitness of the two halves), this does not necessarily yield good results.

Genetic Algorithms - Results

This approach yielded the worst energy of the three methods. One possible explanation is for the aforementioned reason - although the quality of each half of the protein fold is good, the number of contacts between the two proteins, when they are combined, is still relatively low. What is surprising is that, even with populations of 100 and 1000 generations, the maximum energy is still only half that of the proteins found with simulated annealing.

Constrained Hydrophobic Core Construction (CHCC)

The previous two methods focused on searching through the entire search space while focusing on configurations with high energy. However, one question is whether the search space could be further pruned. For example, it is known that a straight protein chain will never yield the highest energy configuration. For that matter, when there are a lot of hydrophobic amino acids in the protein chain, folds of long chains which are only 2-deep (in other words, there are two rows side-by-side) won't yield an optimal protein fold, either.

Constrained Hydrophobic Core Construction (CHCC), developed by Kaizhi Yue and Ken Dill, solves this for a 3D lattice by first noting that a vast majority of the hydrophobic amino acids will reside within a parallelpiped "frame" that is nearly cubic in shape. There may be "barnacle" hydrophobes on the outside of this frame, but for the most part all of the hydrophilic amino acids will reside on the outside of this frame and all of the hydrophobic amino acids will reside inside the frame.


Fig. 4: Example of a Hydrophobic Core

We examined a simplified version of this model in 2D. We used a frame size similar to the one developed by Yue and Dill, where we first calculated the square root of the number of hydrophobic amino acids and then rounded up to find the size of a square. So if a 20-monomer chain has 17 hydrophobic amino acids, then sqrt(17) ~= 4.123 and the frame is 5x5.

Next, we setup the search space. Note that we will get all possible results by only scanning 1/8 of the space, from the top-left corner to the middle and up to the center of the top edge. This eliminates reflected folds that would be evaluated to the same energy.

This algorithm is different in that we do not start with protein chains and then modify them to find the best energy. Instead, the protein folds are "grown" starting from one monomer and increased in size. We prune the possible search tree with the following rules adapted from CHCC:
Since we are growing the protein, we are able to prune out candidate folds that would be self-crossing. So although self-crossings are not a problem, dead ends are.

Another item to note is that the size of the bounding box actually limits obviously-bad configurations. For example, a straight chain would never be considered. This should also be taken into account for the results.

CHCC - Results

In our tests of the same proteins used for genetic algorithms and simulated annealing, we found that between 5 and 50% of all walks considered were dead ends. We found a weak correlation between the number of hydrophilic AAs in the chain and the number of dead ends explored. However, significantly fewer folds were ever considered in full - 50,000 for the proteins explored.

It should also be noted that we found energies for folds that were at least as good as those for simulated annealing.

However, this technique did not work well for proteins with large numbers of hydrophilic monomers. Initially, we started with proteins that had a configuration with an energy that was known to be good. Then we tried random 20-monomer proteins with 7 or more hydrophilic monomers. The result was that all of the hydrophilic monomers were kept to the inside or were stuck on the inside when there were no other search choices. This meant that the resulting folds were not very good.

This led us to explore different frame construction algorithms. The one that we tried in earnest was to change the core/frame size by deducting attached hydrophobic monomer counts from hydrophilic "runs". For example, in the chain PHHPPPHHHH, there is an initial "P" that must go on the outside, and it may be followed by H within the core. However, for the "run" of PPP monomers, the H following it must be on the outside as well, so it does not need to be included in the count of interior "hydrophobic" monomers. The idea was that this would lead to a smaller core and therefore limit the affinity for the side of the frame. Instead, this method did not offer any improvements.

Future Work

It would be interesting to take the simulated annealing approach and allow self-crossing walks, just as in the genetic algorithms approach. In the genetic algorithms approach, we have considered rotating and reorienting the two smaller protein chains that are brought together, as in the dimerization approach [7].

It should be noted that, as in [4], we started with proteins that folded to a known high-energy state. However, there is work, as in [9], that focuses on generating the maximum number of contacts between hydrophobic and polar (hydrophilic) monomers.

CHCC also focuses on globular proteins in which the hydrophobic cores form almost the entirety of the interior. We would be interested in following up other models, such as those where monomers are placed on a helix [5], or other constraint-based models, such as the hydrophobic zippers approach [2].

The CHCC implementation should definitely be extended to the original 3D case and include more of the inlier and "slicing" techniques explored there [1].

The original upshot of this work is that we would combine it with our implementation of Salsa, a replica exchange framework [10]. The synthesis of these two projects, especially with respect to embarrasingly parallel algorithms like simulated annealing, would be interesting to explore.

References

1. Forces of tertiary structures in globular proteins K. Yue and K.A. Dill, Proc. Natl. Acad. Sci. USA, Vol 92, pp 146-150, January 1995, Biophysics

2.Cooperativity in Protein-Folding Kinetics K.A. Dill, K.M. Fiebig, H.S. Chan, Proc. Natl. Acad. Sci. USA, Vol. 90, pp. 1942-1946, March 1993 Proc. Natl. Acad. Sci. USA, Vol 92, pp 146-150, January 1995, Biophysics

3. How to Avoid Yourself B. Hayes, American Scientist Vol 86, Number 4, pp 314-319, July-August 1998

4. A test of lattic protein folding algorithms K. Yue et al, Proc. Natl. Acad. Sci. USA Vol 92, pp. 325-329, January 1995, Chemistry

5. Monte Carlo Simulations of Protein Folding. I. Lattice Model and Interaction Scheme A. Kolinski, J. Skolnick, Proteins: Structure, Function, and Genetics, Vol. 18, Issue 4, p. 338-352, 1994

6. Kinetics of Protein Folding: A Lattice Model Study of the Requirements for Folding to the Native State, A. Sali, E. Shakhnovich, M. Karplus, Journal of Molecular Biology, Vol. 235, Issue 5, pp. 1614-1638, Feb 3, 1994

7. Prototeins, B. Hayes, American Scientist, Vol. 86, Number 3, p. 216-221, May-June 1998 8. How to Fold Graciously, C. Levinthal, Mossbaun Spectroscopy in Biological Systems Proceedings, University of Illinois Bulletin 67, No. 41, pp. 22-24, 1969

9. Engineering of stable and fast-folding sequences of model proteins, E. I. Shakhnovich and A. M. Gutin, Proc. Natl. Acad. Sci. USA, Vol.90, pp. 7195-7199, August 1993, Biophysics

10. Salsa: Scalable Asynchronous Replica Exchange for Parallel Molecular Dynamics Simulations, L. Zhang, M. Parashar, E. Gallicchio, R. Levy, Proceedings of the 35th International Conference on Parallel Processing (ICPP 2006), Columbus, OH, USA, IEEE Computer Society Press, pp. 127 - 134, August 2006.

11. Principles of protein folding - A perspective from simple exact models, K. A. Dill et al, Protein Science Vol 4, pp. 561-602, April 1995