In molecular simulations---especially simulations of complex systems like biomolecules---it's incredibly difficult to start the simulation close enough to equilibrium to avoid initial transients in properties of interest. As a result, it is almost universally recommended that some initial portion of the simulation be discarded to "equilibration". Unfortunately, there hasn't been a simple, automated, and generally applicable way to do this that is standard practice in the field.

In a new manuscript draft posted to bioRxiv this morning, I show how an amazingly simple approach---simply maximizing the number of statistically uncorrelated samples in the latter part of the simulation---can lead to a surprisingly robust and useful algorithm for equilibration detection. This is very much a work in progress, so comments and feedback are greatly appreciated!

DOI: http://dx.doi.org/10.1101/021659
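
To make the idea concrete, here is a minimal NumPy sketch of what "maximizing the number of statistically uncorrelated samples" could look like. It is an illustration under stated assumptions, not the code used for the manuscript: the function names, the `stride` parameter, the rule of truncating the autocorrelation sum at the first non-positive lag, and the synthetic test series are all my own choices for this example. For each candidate start time `t0`, it estimates the statistical inefficiency `g` of the remaining data and keeps the `t0` that maximizes the effective sample count `(T - t0) / g`.

```python
import numpy as np


def statistical_inefficiency(x):
    """Estimate g = 1 + 2 * sum_t (1 - t/N) * C(t) for a 1D time series.

    C(t) is the normalized autocorrelation at lag t. The sum is cut off at
    the first lag where C(t) is not positive (an assumed, simple truncation
    rule that keeps noise in the tail from dominating the estimate).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    dx = x - x.mean()
    var = np.mean(dx * dx)
    if var == 0.0:
        # Constant series: treat samples as uncorrelated (a simplification).
        return 1.0
    g = 1.0
    for lag in range(1, n - 1):
        c = np.mean(dx[:-lag] * dx[lag:]) / var
        if c <= 0.0:
            break
        g += 2.0 * c * (1.0 - lag / n)
    return max(g, 1.0)


def detect_equilibration(x, stride=1):
    """Return (t0, g, n_eff) maximizing n_eff = (T - t0) / g over candidate t0.

    `stride` (hypothetical convenience parameter) thins the candidate start
    times to reduce cost on long trajectories.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    best_t0, best_g, best_neff = 0, 1.0, 0.0
    for t0 in range(0, n - 1, stride):
        g = statistical_inefficiency(x[t0:])
        neff = (n - t0) / g
        if neff > best_neff:
            best_t0, best_g, best_neff = t0, g, neff
    return best_t0, best_g, best_neff


if __name__ == "__main__":
    # Synthetic example: a decaying initial transient plus white noise.
    rng = np.random.default_rng(0)
    t = np.arange(500)
    series = 10.0 * np.exp(-t / 20.0) + rng.normal(size=t.size)
    t0, g, neff = detect_equilibration(series)
    print(f"discard first {t0} samples; g = {g:.2f}; N_eff = {neff:.1f}")
```

Discarding too little leaves the transient in the data, which inflates `g` and shrinks the effective sample count; discarding too much throws away good samples. The maximum of `(T - t0) / g` balances the two, which is why no hand-tuned threshold is needed in this sketch.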

All code needed to grab the exact versions of the tools I used (using the conda package installer and the omnia molecular simulation suite), generate the simulation data, analyze it, and generate the figures for the paper is available on GitHub. You simply need to run

./reproduce.sh

to regenerate everything---which is exactly what I did to generate the figures in the posted version of the manuscript. There are still a few improvements I hope to make so that the scripts are easier to read and the data is easier to deal with, but hopefully we can try to attain this level of ultra-simple reproducibility in future work as well.

Update [5 July 2015]: The manuscript has been updated based on valuable feedback I've already received! Thanks to everyone who has made comments!

Molecular mechanics forcefields are an integral part of molecular simulation. The quality of any properties computed from molecular simulations is wholly dependent on the quality of the underlying forcefield. Quantifying how well the forcefields we use can reproduce various physical properties provides insight into expected accuracy in other properties of interest, deficiencies in the forcefield parameters or functional form, and strategies for making systematic improvements.

In a new manuscript posted to arXiv ahead of submission, postdoc Kyle Beauchamp tackles one of the most critical issues in forcefield validation: most of the physical property information one would like to benchmark against is tied up, inaccessible, in paper databases (also known as "books" or "journal articles"). Using the ThermoML Archive from NIST TRC headed by Kenneth Kroenlein (a coauthor on the paper), Kyle is able to show that the computer-readable data stored in this archive in the IUPAC-standard XML-based ThermoML format contains a wealth of information useful for automated validation (and eventually parameterization) of molecular mechanics forcefields.

As usual, all code used in the production of this manuscript is made available through GitHub. The code makes use of the excellent OpenEye Toolkit, which is available free of charge for academic work that will generate data for the public domain; the GPU-accelerated OpenMM toolkit; and the free AmberTools distribution.

Kyle A. Beauchamp, Julie M. Behr, Ariën S. Rustenburg, Christopher I. Bayly, Kenneth Kroenlein, and John D. Chodera.
Preprint ahead of submission: [arXiv] [PDF] [GitHub]

Chodera lab // MSKCC

Automatic equilibration detection manuscript on bioRxiv

Automated forcefield benchmarking manuscript on arXiv
