A divide-and-conquer distance geometry method to model protein structures from NMR spectroscopy
Abstract
Nuclear Magnetic Resonance (NMR) spectroscopy provides insights into the dynamic behavior of proteins in solution, eliminating the need for crystallization. NMR experiments can spatially probe nuclei within 6 Angstroms of each other. The NMR signals are used to obtain geometric constraints consisting of distances and dihedral angles between atoms. Combined with the covalent-bond geometry, these can be used to determine the three-dimensional structure of a protein, which is essential for understanding its chemical and physiological functions. The technical challenge comes from the imprecise and sparse nature of the data and from the number of possible configurations, which scales exponentially with protein size. This complexity renders the structure determination problem computationally intractable. In the absence of a direct approach, various heuristics are used. Protein structures are typically determined iteratively to account for errors from approximations; the errors are eliminated by cycling between structure computation and other stages in the NMR pipeline. This necessitates fast and robust structure computing algorithms. State-of-the-art techniques use molecular dynamics along with simulated annealing for structure calculation. However, they have a high computational overhead, as multiple configurations are employed to avoid getting trapped in local minima of the potential energy landscape. We propose an alternative approach that emphasizes complying with the available experimental information, independent of external factors such as initial conformations and force-field parameters. Our protocol, dubbed Distance Restraints and Energy Assisted Modeling (DREAM), works primarily with the available distance and angle bounds. Although distance-geometry techniques were originally introduced for structure determination from NMR, they were not widely adopted due to factors such as computational overhead, lack of scalability, and intolerance to missing or ambiguous data.
On the other hand, molecular dynamics-based methods can introduce force-field artifacts into the final results. We use the following innovations to address these drawbacks. Instead of depending on random starting structures (as in molecular dynamics), DREAM leverages the natural partitioning of the experimental constraints into regions of dense data coverage (cores) and sparse data coverage (gaps). A divide-and-conquer framework models the cores and gaps separately, facilitating the parallel computation of the substructures. We use nonlinear optimization to compute structures for the cores in parallel, align the core substructures in a single step to avoid the errors introduced by pairwise alignments, and then model the structure of the gaps. This is particularly effective for proteins with sparse coverage of experimental data. The distance-geometry approach removes the reliance on external factors such as force-field parameters to arrive at native structures before the post-processing stages. DREAM was found to be robust to erroneous and missing data and scales to large proteins, having been tested on proteins of 52–271 amino acid residues. The bottom-up strategy of DREAM is closer to the protein folding paradigm, where smaller substructures are modeled first before being assimilated into the global structure. This approach successfully modeled folds and tertiary structures, whereas other contemporary distance geometry-based methods failed to yield a protein conformation. DREAM was successfully tested across more than 100 protein folds. Notably, DREAM is available as an offline package or through a web portal, and accepts input files in a widely used format. This compatibility enables potential future integration of DREAM into the NMR framework for automated or semi-supervised protein structure derivation. Comparing the protein structures from DREAM with publicly available conformations reveals notable differences, particularly in fluctuations within mobile regions.
These variations offer valuable insights into protein dynamics and facilitate the investigation of protein functionality.
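
To illustrate the distance-geometry idea that the protocol builds on, the following minimal sketch recovers three-dimensional coordinates from pairwise distances alone. It is not the DREAM protocol. It assumes a complete and exact distance matrix and uses classical multidimensional scaling, whereas DREAM works with sparse, imprecise distance and angle bounds and relies on nonlinear optimization. The function names and the toy coordinates are hypothetical and serve only as an example.

```python
import numpy as np


def embed_from_distances(dist, dim=3):
    """Classical multidimensional scaling (an illustrative stand-in, not DREAM).

    Recovers coordinates, up to rotation, translation and reflection,
    from a complete matrix of exact pairwise distances.
    """
    n = dist.shape[0]
    # Double-centre the squared distances to get the Gram matrix
    # of the (centred) coordinates.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the `dim` largest.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dim]
    # Clip tiny negative eigenvalues caused by floating-point round-off.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def pairwise_distances(coords):
    """Euclidean distance between every pair of rows in `coords`."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


# Hypothetical toy data: eight "atoms" at random positions.
rng = np.random.default_rng(0)
true_coords = rng.normal(size=(8, 3)) * 3.0
dist = pairwise_distances(true_coords)

recovered = embed_from_distances(dist)
# Largest disagreement between the input distances and those of the embedding.
max_error = np.abs(pairwise_distances(recovered) - dist).max()
print(max_error)
```

With exact and complete distances the embedding reproduces the input distances to within floating-point round-off. The difficulty addressed in this work arises because NMR provides only bounds on a sparse subset of these distances.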