Skip to content

A hybrid Monte Carlo algorithm to determine the structures of proteins

Notifications You must be signed in to change notification settings

lonelyneutrin0/Protein-Structure-Prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

80 Commits
 
 
 
 
 
 

Repository files navigation

Protein Structure Prediction

This project aims to visualize protein conformations by optimizing Irbäck's off lattice energy equation. An example of such an implementation can be found here. A hybrid optimization technique known as genetic annealing is employed to reduce computation times. All data from 10/6/2024 onwards was collected using High Performance Computing.
The conformation of proteins is generally expressed using bond and torsion angles. These will serve as the independent variables of our annealing problem. The output will be a tuple of amino acid residue coordinates. The conformation of a given protein is most likely to be the one where energy is minimum. Hence, finding the bond and torsion angles for which the energy function is minimum gives the optimal conformation.

Simulated Annealing

Simulated annealing is an optimization technique that searches solution space for the global minimum through random sampling. In this implementation, the annealing will be carried out using the metropolis condition and Boltzmann values.

Implementation

The objective function:

$$f(\alpha, \beta, \chi) = -k_1 \sum_{i=1}^{N-2} \cos \alpha_i - k_2\sum_{i=1}^{N-3} \cos \beta_i + \sum_{i=1}^{N-2}\sum_{j=i+2}^N 4C(\xi_i, \xi_j)(\frac{1}{r^6} - \frac{1}{r^{12}})$$
In my code, this is implemented as

$$f(\alpha, \beta, \chi) = \langle -k_1, \cos \mathbf{\alpha} \rangle + \langle -k_2, \cos \mathbf{\beta} \rangle + \sum 4\mathbf{C_{ij}}\mathbf{D_{ij}}$$
where $\alpha$ is the bond angle vector, $\beta$ is the torsion angle vector for a given conformation $\chi$ and the third term is the Lennard Jones potential between any two residues. C gives the coefficient depending on the hydrophobicity of any two residues. The matrix- vector representation of this equation allows for usage of NumPy's vectorization improving computation times.

Neighbor method

A random component of either the bond or torsion angle vector is chosen and altered by a small value. The variation is controlled by a heterogeneous degree parameter $\lambda$ as well as the progress of the algorithm.

Parameters

The condition to determine the favorability of a neighbor is the Metropolis condition. The starting and ending temperature are 1.0 and 1e-12 respectively. The cooling coefficient $\gamma = 0.99$ and the heterogeneous degree used for choosing new neighbors $\lambda = 3$ The AnnealingOutput class serves as a container for the algorithm output. run.py utilizes the algorithm results and diagnostic data provided by objects of this class.

v1.1

This version is primitive and produces inaccurate results. One issue to fix in the next version is the energy difference sometimes causes OverFlowError. The annealing schedule must be optimized.

v1.2

This version of the algorithm solves the abovementioned OverFlowError. The algorithm is now O(ml*n), where ml is the markov chain length and n is the number of iterations. For artificial proteins, the markov chain length is set to 50000. For real proteins, it's set to 10000. The number of iterations depends on the initial and final temperature, as well as the cooling coefficient.

v1.3

The file src_np.py utilizing NumPy delivers results within ~0.5 energy units of the values in the research paper for fibonacci artificial proteins of size 13, 21, and 55, although not consistently. Further versions will aim to improve the frequency of accurate modeling.

v1.4

A PyTorch version was written for the future to take advantage of GPU acceleration. The NumPy version was tested on the protein 4RXN and yielded an energy value -172.059076, close to the optimal value of -174.612. The annealing took just over 15 hours.

v1.5

Access to high performance computing allows for parallel annealing runs. The annealer was run 10 times independently for artificial proteins [ml=50000].

LAMMPS Dump Files

From now on, conformation data will be stored in LAMMPS (Large Scale Atomic/Molecular Massively Parallel Simulator) DUMP files. This allows conformation visualization in softwares like OVITO.

Genetic Annealing

The implemented hybrid optimization algorithm aims to maximize the efficiency of the simulated annealing algorithm by patching some of its limitations, namely the high time cost. The algorithm runs a population of solutions in parallel. The relative fitness is determined after each Markov chain step. Fit individuals are chosen to be parents of the next generation. The annealer continues until a fixed number of iterations are achieved.

v1.1

The first working version was implemented in PyTorch. It had faster convergence than serial simulated annealing but always converged to suboptimal local minima. Further versions will be invested in to adjust the cooling schedule and fitness determination.

v1.2

A NumPy version was implemented to take advantage of np.savetxt() making LAMMP DUMP file creation easier.

LAMMPS Dump Files

From now on, conformation data will be stored in LAMMPS (Large Scale Atomic/Molecular Massively Parallel Simulator) DUMP files. This allows conformation visualization in softwares like OVITO.

v1.3

The final version of GA demonstrated faster convergence and higher precision. It outperformed serial simulated annealing, even for higher length proteins.

About

A hybrid Monte Carlo algorithm to determine the structures of proteins

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages