Algorithmic Approaches For Protein-Protein Docking And Quaternary Structure Inference
Molecular interactions among proteins drive cellular processes through the formation of complexes that perform the requisite biochemical functions. While some complexes are obligate (i.e., their subunits fold together during complexation), others are non-obligate and are formed through macromolecular recognition. Macromolecular recognition in proteins is highly specific, yet it can be either permanent or non-permanent in nature. Hallmarks of permanent recognition complexes include a large interaction surface and stabilization by hydrophobic interactions and other noncovalent forces. Amino acids that contribute critically to the free energy of binding at these interfaces are called “hot spot” residues. Non-permanent recognition complexes, on the other hand, usually show a small interaction interface with limited stabilization from noncovalent forces. For both permanent and non-permanent complexes, the specificity of molecular interaction is governed by the geometric compatibility of the interacting surfaces and the noncovalent forces that anchor them. A great many studies have already been performed to understand the basis of protein macromolecular recognition [1, 2]. Based on these studies, efforts have been made to develop protein-protein docking algorithms that can predict the geometric orientation of the interacting molecules from their individual unbound states. Despite advances in docking methodologies, several significant difficulties remain [1]. Therefore, in this thesis, we start with a literature review to understand the individual merits and demerits of the existing approaches (Chapter 1) [3]. We then attempt to address some of the problems by developing methods to infer protein quaternary structure from the crystalline state, and to improve the structural and chemical understanding of protein-protein interactions through biological complex prediction.
Understanding the interaction geometry is the first step in a protein-protein interaction study. Yet no consistent method exists to assess the geometric compatibility of the interacting interface, because of its highly rugged nature. This suggests that new, sensitive measures and methods are needed to tackle the problem. We therefore developed two new and conceptually different measures, using Delaunay tessellation and interface slice selection, to compute the surface complementarity and atom packing at the protein-protein interface (Chapter 2) [4]. We called these Normalized Surface Complementarity (NSc) and Normalized Interface Packing (NIP). We rigorously benchmarked the measures on the non-redundant protein complexes available in the Protein Data Bank (PDB) and found that they efficiently segregate biological protein-protein contacts from non-biological ones, especially those derived from X-ray crystallography. Sensitive algorithms for recognizing surface packing and complementarity are usually computationally expensive, which limits their application in high-throughput screening. Special emphasis was therefore given to making our measures computationally efficient as well. Our final evaluation showed that NSc and NIP correlate very strongly with each other and with the interface-area-normalized values available from the Surface Complementarity program (CCP4 Suite: <http://smb.slac.stanford.edu/facilities/software/ccp4/html/sc.html>), but at a fraction of the computing cost. Having built geometry-based methods to assess surface complementarity and packing on the rugged protein surface, we turned to determining the stabilities of the geometrically compatible interfaces formed. To do so, we needed to survey the quaternary structures of proteins with various affinities. The emphasis on affinity arose from its strong relationship with the permanent or non-permanent lifetime of the complex.
We therefore set up data mining studies on two databases, PQS (Protein Quaternary Structure database: http://pqs.ebi.ac.uk) and PISA (Protein Interfaces, Surfaces and Assemblies: www.ebi.ac.uk/pdbe/prot_int/pistart.html), which offer downloads of quaternary structure data for protein complexes derived from X-ray crystallographic methods. To our surprise, we found that these databases provided valid quaternary structures mostly for moderate to strong affinity complexes. This limitation could be ascertained by browsing annotations from another curated database of protein quaternary structure (PiQSi [5]: supfam.mrc-lmb.cam.ac.uk/elevy/piqsi/piqsi_home.cgi) and through literature surveys. It was therefore necessary to first develop a more robust method to infer the quaternary structures of complexes of all affinities available from the PDB. We developed a new scheme focused on covering complexes of all affinity categories, especially the weak and very weak ones, as well as heteromeric quaternary structures (Chapter 3) [6]. Our scheme combined a naïve Bayes classifier and point-group symmetry under a Boolean framework to detect all categories of protein quaternary structures in the crystal lattice. We tested it on a standard benchmark consisting of 112 recognition heteromeric complexes and obtained correct recall in 95% of the cases, which is significantly better than the 53% achieved by PISA [7], a state-of-the-art quaternary structure detection method hosted at the European Bioinformatics Institute, Hinxton, UK. The few cases that our scheme failed to detect correctly offered interesting insights into the intriguing nature of protein contacts in the lattice. The findings have implications for the accurate inference of the quaternary states of proteins, especially weak affinity complexes, where biological protein contacts tend to be sacrificed for energetically optimal ones that favor the formation and stabilization of the crystal lattice.
We expect our method to be widely useful to researchers interested in protein quaternary structure and interaction. Having developed a method that allows us to sample all categories of quaternary structures in the PDB, we set out to address the next problem: accurately determining the stabilities of the geometrically compatible protein surfaces involved in interaction. Reformulating the question in terms of protein-protein docking, we asked how we could reliably infer the stability of any arbitrary interface that is formed when two protein molecules are brought into steric proximity. In a real protein docking exercise this question is asked innumerable times during energy-based screening of the thousands of decoys geometrically sampled (through rotation and translation) from the unbound subunits. Current docking methods face problems on two counts: (i) the number of decoy interfaces whose energies must be evaluated is rather large (64320 for a 9° rotation and translation for a dimeric complex), and (ii) energy-based screening is not efficient enough, such that decoys with native-like quaternary structure are rarely selected at high ranks. We addressed both problems, with interesting results. Intricate decoy filtering approaches have been developed by others, which are applied during the search stage, the sampling stage, or both. Filtering usually relies on statistical information, such as 3D conservation of the interfacial residues or similar facts; more expensive approaches screen for orientation, shape complementarity and electrostatics. We developed an interface-area-based decoy filter for the sampling stage, exploiting the assumption that native-like decoys must have the largest, or close to the largest, interface (Chapter 4) [8]. Implementation of this assumption and standard benchmarking showed that in 91% of the cases we could recover native-like decoys of bound and unbound binary docking targets of both strong and weak affinity.
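The interface-area filter described above can be illustrated with a minimal sketch. It assumes the interface area of each decoy has already been computed by some other tool, and it keeps only decoys whose area is within a chosen fraction of the largest one. The function name, the decoy identifiers, the example areas and the `keep_fraction` value of 0.9 are hypothetical choices for illustration; the thesis does not state them here.

```python
from typing import Dict, List


def filter_decoys_by_interface_area(
    areas: Dict[str, float], keep_fraction: float = 0.9
) -> List[str]:
    """Return decoy ids whose interface area is close to the largest one.

    areas maps a decoy id to its precomputed interface area (e.g. in A^2).
    keep_fraction is a hypothetical cutoff: a decoy is kept when its area
    is at least keep_fraction times the largest area in the set.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not areas:
        return []
    cutoff = keep_fraction * max(areas.values())
    kept = [decoy for decoy, area in areas.items() if area >= cutoff]
    # Largest interface first, so downstream scoring sees the best candidates early.
    return sorted(kept, key=lambda decoy: areas[decoy], reverse=True)


# Illustrative (made-up) interface areas for four sampled decoys.
decoy_areas = {
    "decoy_001": 1620.0,
    "decoy_002": 410.5,
    "decoy_003": 1555.0,
    "decoy_004": 980.0,
}
survivors = filter_decoys_by_interface_area(decoy_areas, keep_fraction=0.9)
print(survivors)
```

A stricter `keep_fraction` discards more of the sampled set at the risk of losing a native-like decoy whose interface is slightly smaller than the largest one, so the cutoff would need to be tuned on a benchmark.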
This allowed us to propose that “native-like decoys must have the largest, or close to the largest, interface” can be used as a rule to exclude non-native decoys efficiently during docking sampling. This rule can dramatically reduce the needle-in-a-haystack problem faced in a docking study by removing >95% of the decoy set produced by the sampling search. We incorporated the rule as a central part of our protein docking strategy. While addressing the question of energy-based screening to place native-like decoys at high ranks during docking, we came across a large volume of previously published work. Most energy-based screening methods that avoid statistical potentials rely on some form of the Coulomb potential, the Lennard-Jones potential and solvation energy. Different flavors of these energy functions are used, with diverse preferences and weights for the individual terms. Interestingly, in all cases the energy functions were of the unnormalized form: individual energy terms were simply added to arrive at a final score that was used for ranking. Proteins, being large molecules, offer limited scope for applying semi-empirical or quantum mechanical methods to large-scale energy evaluation. We therefore developed a de novo empirical scoring function in normalized form. As already stated, we found NSc and NIP to be highly discriminatory in segregating biological from non-biological interfaces, and we incorporated them as parameters of our scoring function. Our data mining study revealed a reasonable correlation of -0.73 between the normalized solvation energy and the normalized nonbonding energy (Coulomb + van der Waals) at the interface. Using this information, we extended our scoring function by combining the geometric measures with the normalized interaction energies.
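To make the idea of normalized energy terms and their correlation concrete, the following is a small sketch. It assumes, purely for illustration, that normalization means dividing each interface energy by the interface area; the actual normalization and the form of the scoring function are described in Chapter 5 and are not reproduced here. The helper names and all numerical values are hypothetical.

```python
import math
from typing import List, Sequence


def normalize_by_area(energies: Sequence[float], areas: Sequence[float]) -> List[float]:
    """Divide each energy by its interface area (an assumed normalization)."""
    if len(energies) != len(areas):
        raise ValueError("energies and areas must have the same length")
    return [energy / area for energy, area in zip(energies, areas)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Made-up interface data for five complexes (areas in A^2, energies in kcal/mol).
areas = [800.0, 1200.0, 1600.0, 2000.0, 2400.0]
solvation = [-6.0, -14.0, -10.0, -26.0, -20.0]
nonbonding = [-40.0, -45.0, -80.0, -60.0, -95.0]

norm_solvation = normalize_by_area(solvation, areas)
norm_nonbonding = normalize_by_area(nonbonding, areas)
print(pearson(norm_solvation, norm_nonbonding))
```

Dividing by area puts interfaces of very different sizes on a common scale, which is one plausible motivation for preferring normalized terms over a plain sum of energies.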
Tests on 30 unbound binary protein-protein complexes showed that in 16 cases we could identify at least one decoy within the top three ranks with ≤10 Å backbone root-mean-square deviation (RMSD) from the true binding geometry. The scoring results were compared with those of other state-of-the-art methods, which returned inferior results. The salient feature of our scoring function was the exclusion of any experiment-guided restraints, evolutionary information, statistical propensities or modified interaction energy equations, which are commonly used by others. Tests on 118 less difficult bound binary protein-protein complexes with ≤35% sequence redundancy at the interface gave first rank in 77% of the cases, where the native-like decoy was chosen from among 10,000 decoys and had ≤5 Å backbone RMSD from the true geometry. The scoring function, the results and the comparison with other methods are discussed extensively in Chapter 5 [9]. The method has been implemented and made available for public use as a web server, PROBE (http://pallab.serc.iisc.ernet.in/probe). The development and use of PROBE are elaborated in Chapter 7 [10]. In the course of this work, we generated large amounts of data that could be useful to others, especially “protein dockers”. We therefore developed dockYard (http://pallab.serc.iisc.ernet.in/dockYard), a repository for protein-protein docking decoys (Chapter 6) [11]. dockYard offers four categories of docking decoys: Bound (native dimer co-crystallized), Unbound (individual subunits as well as the target are crystallized), Variants (match the previous two categories in at least one subunit with 100% sequence identity), and Interlogs (match the previous categories in at least one subunit with ≥90% or ≥50% sequence identity). Full or selective download based on search parameters is supported. The portal also serves as a repository for modelers who may want to share their decoy sets with the community.
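The success criterion used for the unbound benchmark above (at least one decoy within the top three ranks at ≤10 Å backbone RMSD) can be sketched as follows. The sketch assumes the decoy and the reference structure are already superimposed and that their backbone atoms are listed in the same order; no fitting step is performed. The function names and example coordinates are illustrative only.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root-mean-square deviation between two matched coordinate lists.

    Assumes the structures are already superimposed and that a[i] and b[i]
    refer to the same backbone atom.
    """
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(squared / len(a))


def has_hit_in_top_n(
    ranked_rmsds: Sequence[float], top_n: int = 3, cutoff: float = 10.0
) -> bool:
    """True if any of the top_n ranked decoys is within cutoff (in angstroms)."""
    return any(value <= cutoff for value in ranked_rmsds[:top_n])


# Two matched backbone atoms displaced by 3 A along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
decoy = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
print(rmsd(reference, decoy))

# RMSD values of decoys listed in the order given by a scoring function.
print(has_hit_in_top_n([15.2, 9.8, 22.0, 3.1]))
```

For the bound benchmark the same check would use `top_n=1` and `cutoff=5.0`, matching the first-rank, ≤5 Å criterion quoted in the text.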
In conclusion, although we made several contributions to the development of algorithms for improved protein-protein docking and quaternary structure inference, many challenges remain (Chapter 8). The principal challenge arises from treating proteins as flexible bodies whose conformational states may change on quaternary structure formation. In addition, solvent plays a major role in the free energy of binding, but its exact contribution is not straightforward to estimate. Apart from the need for good energy functions to evaluate docking decoys, the cost of computation is undoubtedly one of the limiting factors. The next generation of algorithms must therefore focus on improved docking studies that realistically incorporate flexibility and the solvent environment in all their evaluations.