Protein Folding
CS 8803, Simulation of Biological Systems
Kent Czechowski
Scott McManus
Presentation
Source code
Overview
Protein folding presents an enormous problem with respect to searching
for optimal protein folds. The Levinthal paradox comes in to play, where
there may be 10300 possible protein folds to evaluate, yet a
protein chooses the optimal configuration every time. Even for shorter
protein chains, the search space is simply too large. In Ken Dill's
model, where hydrophobic (H) and hydrophilic (P) amino acids are laid
out on a grid, there are over 268 million grid layouts (not necessarily
self-avoiding) that are possible.
Moreover, there are very few self-avoiding walks to be found.
Approximately .0164% of a 24 amino-acid chain's walks are self-avoiding,
and it is the self-avoiding walks that are the most interesting.
Since there is no algorithm known for efficiently generating these
self-avoiding walks, the goal of searching for optimally-folded proteins
is even more difficult.
We have chosen simplified models for expressing proteins and have explored
search algorithms for finding optimal protein folds. In particular, we
have chosen Ken Dill's HP model for representing amino acids on a
rectangular lattice.
The hydrophobic amino acids tend to cluster together, but there is no
actual force that is necessarily measured
between them. Instead, this gives a function for evaluating the energy
between hydrophobic amino acids: each pair neighboring amino acids gives
an energy of +1 between each amino acid, for a total of +2 energy per pair.
Fig 1: Example of the HP Model
We have chosen three techniques for searching the space of possible
protein folds:
- Simulated Annealing
- Genetic Algorithms
- Constrained Hydrophobic Core Construction
Simulated Annealing
In simulated annealing, the goal is to perform many runs - possibly on the order
of millions or even more - of starting from a base or randomized configuration
and then making a randomized walk through the configuration space.
Each successive configuration is associated with an energy which, in
our case, we wished to maximized. The basis for simulated annealing comes
from physics, where a high-temperature material will cool or anneal
into a crystalline or ordered structure. As the material cools, the
likelihood of breaking away from a crystalline structure goes down.
With this in mind, simulated annealing makes a random walk through
a configuration space (e.g., folding a protein on a lattice).
If making a step in the walk results in a higher energy,
then the step is always taken, which is the same as climbing a local
maximum. However, suppose that making a step results in a lower energy.
In this case, the step is taken cautiously and with a probability that
depends on the temperature of the experiment. Early in the experiment,
when the temperature is high, the probability of taking a random step
(even if that step results in a lower-energy configuration) is high.
However, later in the experiment, when the temperature is low, the
probability is also low. This is modeled by a cooling schedule,
such as e(Energy difference)/f(Temperature), where the
function f is typically a constant times the temperature. So at a high
temperature, the exponent is 0 and e is close to 1, and at a low temperature
the exponent is large and, for a negative energy difference, is close
to 0.
We have followed this technique in our simulated annealer. Our energy
is, of course, based on the energy of a protein fold, which is based
on the number of adjacent hydrophobic amino acids. We targeted a total
of 1 million total steps, with results for running with 100 different
proteins over 10,000 timesteps up to 100,000 different proteins over
10 timesteps. Since we did not model a temperature explicitly,
we tried two cooling schedules - f(T) = 1, and f(T) = square root of
the number of iterations made so far.
One of the problems mentioned before is that there is no good way
to choose a random, nonintersecting walk. Therefore, we always
started with a straight chain. Our perturbations consisted of
randomly choosing an amino acid and changing
the direction in which an amino acid connected to its neighbor.
For example, suppose that we started with a straight chain.
Each amino acid in the chain is directed "forward" towards the
end. If we change the direction for one amino acid to be "left"
or "right" instead of forward, we get a bent shape. If we choose
another amino acid later in the chain and bend it in the same
direction, then the protein will fold back on itself.
This is repeated between 10 and 10,000 times for a protein.
After a number of iterations, the protein typically stops
yielding new, self-avoiding walks and becomes unviable. Therefore,
there is a question as to how many runs should be made.
Simulated Annealing Results
In our implementation, we varied this number and were not surprised
by the results, which are listed below.
| Total number of trials |
Number of iterations per protein |
Maximum energy |
Percentage of folds that are self-colliding |
| 100,000 |
10 |
34 - 48 |
19.2% |
| 10,000 |
100 |
34 |
68.5 to 68.7% |
| 1000 |
1000 |
34 |
76.7 to 76.7% |
| 100 |
10,000 |
34 |
68.4% |
Table 1: Collisions and Energy for Tested 20-AA Proteins
The same results were found for the square-root-cooling schedule.
One observation should be clear: after a few iterations on a single
protein, most of the chosen perturbations are useless. Another
that is also striking is that, when running with only 10 timesteps
in a single trial over 100,000 trials, the best possible result was
found for at least one protein. There are a few explanations for this,
and the most likely is that a small set of maximal energy states are
determined early in the folding. By running over many more iterations,
we have the possibility of finding initial folds which will yield the
optimal energy.
Genetic Algorithms
In genetic algorithms, we start with a population of proteins and
then mate them with one another. The mating consists of first computing
the fitness of each population member. We use a weighted probability
for choosing those who will mate together. Once the population members
are chosen for mating, mating proceeds by combining the first half of
one parent and combining it with the second half of the other parent.
Note that there is a fixed orientation to the protein chains.
There is also a possibility of mutations - the direction along
three monomers has a 1/100 chance of mutating during mating.
We then repeat the previous steps many times for a given population
and output the result.

Fig. 2: Example of Mating Between Two Proteins

Fig. 3: Example of Mating in Genetic Algorithm Based on Fitness
Also note that the mating technique is similar to the dimerization
approach, where two protein chains of half length are combined together
to create what is hopefully a chain with good energy characteristics.
However, since the energy of a folded protein is dependent on
the actual combination of the two (and not just the fitness of the
two halves), this does not necessarily yield good results.
Genetic Algorithms - Results
This approach yielded the worst energy of the three methods. One possible
explanation is for the aforementioned reason - although the quality
of each half of the protein fold is good, the number of contacts
between the two proteins, when they are combined, is still relatively
low. What is surprising is that, even with populations of 100 and
1000 generations, the maximum energy is still only half that of the
proteins found with simulated annealing.
Constrained Hydrophobic Core Construction (CHCC)
The previous two methods focused on searching through the entire search
space while focusing on configurations with high energy. However,
one question is whether the search space could be further pruned.
For example, it is known that a straight protein chain will never
yield the highest energy configuration. For that matter, when
there are a lot of hydrophobic amino acids in the protein chain,
folds of long chains which are only 2-deep (in other words, there
are two rows side-by-side) won't yield an optimal protein fold, either.
Constrained Hydrophobic Core Construction (CHCC), developed by Kaizhi Yue
and Ken Dill, solves this for a 3D lattice by first noting that a vast
majority of the hydrophobic amino acids will reside within a
parallelpiped "frame" that is nearly cubic in shape. There may be "barnacle"
hydrophobes on the outside of this frame, but for the most part all
of the hydrophilic amino acids will reside on the outside of this frame
and all of the hydrophobic amino acids will reside inside the frame.

Fig. 4: Example of a Hydrophobic Core
We examined a simplified version of this model in 2D. We used a frame
size similar to the one developed by Yue and Dill, where we first
calculated the square root of the number of hydrophobic amino acids
and then rounded up to find the size of a square. So if a 20-monomer
chain has 17 hydrophobic amino acids, then sqrt(17) ~= 4.123 and
the frame is 5x5.
Next, we setup the search space. Note that we will get all possible
results by only scanning 1/8 of the space, from the top-left corner
to the middle and up to the center of the top edge. This eliminates
reflected folds that would be evaluated to the same energy.
This algorithm is different in that we do not start with protein
chains and then modify them to find the best energy. Instead, the
protein folds are "grown" starting from one monomer and increased
in size. We prune the possible search tree with the following
rules adapted from CHCC:
- If the monomer is hydrophilic, then it must either be on
the outside of the frame or there must be only one branch
available.
- If the monomer is hydrophobic, it must go in the core unless
it was preceded by hydrophilic monomer (in which case the
hydrophilic monomer must be followed by a hydrophobic monomer
in order to get back into the core).
Since we are growing the protein, we are able to prune out
candidate folds that would be self-crossing. So although self-crossings
are not a problem, dead ends are.
Another item to note is that the size of the bounding box actually limits
obviously-bad configurations. For example, a straight chain would never
be considered. This should also be taken into account for the results.
CHCC - Results
In our tests of the same proteins
used for genetic algorithms and simulated annealing, we found that between
5 and 50% of all walks considered were dead ends. We found a weak
correlation between the number of hydrophilic AAs in the chain and the
number of dead ends explored. However, significantly fewer folds were
ever considered in full - 50,000 for the proteins explored.
It should also be noted that we found energies for folds that were
at least as good as those for simulated annealing.
However, this technique did not work well for proteins with large numbers
of hydrophilic monomers. Initially, we started with proteins that
had a configuration with an energy that was known to be good.
Then we tried random 20-monomer proteins with 7 or more hydrophilic
monomers. The result was that all of the hydrophilic monomers were
kept to the inside or were stuck on the inside when there were no
other search choices. This meant that the resulting folds were not
very good.
This led us to explore different frame construction algorithms.
The one that we tried in earnest was to change the core/frame size
by deducting attached hydrophobic monomer counts from hydrophilic
"runs". For example, in the chain PHHPPPHHHH, there is an initial "P"
that must go on the outside, and it may be followed by H within the
core. However, for the "run" of PPP monomers, the H following it must
be on the outside as well, so it does not need to be included in the
count of interior "hydrophobic" monomers. The idea was that this
would lead to a smaller core and therefore limit the affinity
for the side of the frame.
Instead, this method did not offer any improvements.
Future Work
It would be interesting to take the simulated annealing approach
and allow self-crossing walks, just as in the genetic algorithms
approach. In the genetic algorithms approach, we have considered
rotating and reorienting the two smaller protein chains that
are brought together, as in the dimerization approach [7].
It should be noted that, as in [4], we started with proteins
that folded to a known high-energy state. However, there is
work, as in [9], that focuses on generating the maximum number
of contacts between hydrophobic and polar (hydrophilic) monomers.
CHCC also focuses on globular proteins in which the hydrophobic cores
form almost the entirety of the interior. We would be interested in
following up other models, such as those where monomers are placed
on a helix [5], or other constraint-based models, such as the
hydrophobic zippers approach [2].
The CHCC implementation should definitely be extended to the
original 3D case and include more of the inlier and "slicing"
techniques explored there [1].
The original upshot of this work is that we would combine it
with our implementation of Salsa, a replica exchange framework [10].
The synthesis of these two projects, especially with respect
to embarrasingly parallel algorithms like simulated annealing,
would be interesting to explore.
References
1. Forces of tertiary structures in globular proteins K. Yue and K.A. Dill,
Proc. Natl. Acad. Sci. USA, Vol 92, pp 146-150, January 1995, Biophysics
2.Cooperativity in Protein-Folding Kinetics K.A. Dill, K.M. Fiebig, H.S. Chan, Proc. Natl. Acad. Sci. USA, Vol. 90, pp. 1942-1946, March 1993
Proc. Natl. Acad. Sci. USA, Vol 92, pp 146-150, January 1995, Biophysics
3. How to Avoid Yourself B. Hayes, American Scientist
Vol 86, Number 4, pp 314-319, July-August 1998
4. A test of lattic protein folding algorithms K. Yue et al, Proc. Natl. Acad. Sci. USA Vol 92, pp. 325-329, January 1995, Chemistry
5. Monte Carlo Simulations of Protein Folding. I. Lattice Model
and Interaction Scheme A. Kolinski, J. Skolnick, Proteins:
Structure, Function, and Genetics, Vol. 18, Issue 4, p. 338-352,
1994
6. Kinetics of Protein Folding: A Lattice Model Study of the Requirements
for Folding to the Native State, A. Sali, E. Shakhnovich, M. Karplus,
Journal of Molecular Biology, Vol. 235, Issue 5, pp. 1614-1638,
Feb 3, 1994
7. Prototeins, B. Hayes, American Scientist, Vol. 86, Number 3,
p. 216-221, May-June 1998
8. How to Fold Graciously, C. Levinthal, Mossbaun Spectroscopy
in Biological Systems Proceedings, University of Illinois Bulletin
67, No. 41, pp. 22-24, 1969
9. Engineering of stable and fast-folding sequences of model proteins,
E. I. Shakhnovich and A. M. Gutin, Proc. Natl. Acad. Sci. USA,
Vol.90, pp. 7195-7199, August 1993, Biophysics
10. Salsa: Scalable Asynchronous Replica Exchange for Parallel Molecular Dynamics Simulations, L. Zhang, M. Parashar, E. Gallicchio, R. Levy,
Proceedings of the 35th International Conference on Parallel Processing
(ICPP 2006), Columbus, OH, USA, IEEE Computer Society Press,
pp. 127 - 134, August 2006.
11. Principles of protein folding - A perspective from simple exact
models, K. A. Dill et al, Protein Science Vol 4, pp. 561-602,
April 1995