Coarse-grained dynamics derived structural ensemble for prediction of metal binding sites of protein and phenotypic effects of variants
Protein structure plays a key role in determining function. Knowledge of structure, especially the details of specific sites, helps us understand how those sites contribute to overall activity. However, proteins are flexible molecules that adopt multiple conformations under physiological conditions. Consequently, to better understand or predict the function of a protein, we must incorporate a spectrum of structural conformations into our analyses. The availability of the true conformation of a protein is a key requirement for reliable results in such an analysis. In this thesis, we explore two aspects of a protein that directly affect its conformation and thereby its function. First, we establish a comprehensive approach for predicting metal ion binding sites in apo protein structures. The apo and metal-bound structures of a protein can differ significantly at key sites, which can mislead functional analysis. Second, we study the effect of point mutations on protein conformation, which can lead to loss of function and thereby alter the organism's phenotype. For metal binding site prediction, we employ geometric hashing to create a template library of functionally important binding sites and to match new protein structures against the available templates. The matching is performed on multiple structures of a protein obtained from coarse-grained molecular dynamics simulation. The matched residues are then checked for ligand-specific amino acids to arrive at the final prediction of the binding site(s). We have developed this method for the five most common metal ions: Zn2+, Mg2+, Cu2+, Fe3+ and Ca2+. We perform 5-fold cross-validation of our method on 1,347 protein structures with 2,343 binding sites.
Our method shows a noteworthy advantage for Ca2+ and Zn2+ in comparison with the IonCom and MIB methods, while for Mg2+, Cu2+ and Fe3+ the performance is comparable. IonCom is a machine learning based method that has been shown to outperform state-of-the-art metal ion binding site predictors, including COACH and TargetS. MIB is a template-based method (TBM) that also performs better than many machine learning based methods and other TBMs. Notably, almost all previous studies leverage only structural or sequence similarity information for predicting metal binding sites. Our use of dynamics information through an ensemble of structures provides a generalized approach to improving metal binding site prediction. The utility of this method is wide, given the integral contribution of metal binding sites to protein structure. To understand how mutations change conformation and thereby alter function and unfolding free energy, we estimate the change in flexibility between the wild-type and mutated protein. Different scoring methods are used for the change in function and the change in unfolding free energy. For the change in function, we measure a similarity score between flexible regions whose correlation coefficient exceeds a given threshold; for the change in unfolding free energy, we calculate the free energies of the wild-type and mutated proteins. The alteration in the protein's physicochemical properties is then used to predict phenotypic alteration through models built on experimental data. We obtain a significant advantage in overall accuracy over PolyPhen in predicting functional changes for 1,719 mutants of the CALM1 protein, and over I-Mutant and FoldX in predicting unfolding free energy change for 8 Frataxin mutants. We obtain results comparable to PolyPhen for classifying 34 variants of CHEK2 into deleterious, benign and neutral classes.
Our simulation-based approach contrasts sharply with the common trait assessments that rely on statistical genotype-phenotype linkage data, and it can also be applied to completely new proteins. The phenotype assessment work was done as part of the CAGI5 competition. The Critical Assessment of Genome Interpretation (CAGI) offers an exceptional platform for blind-testing algorithms and methods that assess genotype-phenotype linkage. In summary, both our metal binding site prediction method and our mutation effect prediction method give results comparable to or better than state-of-the-art methods. This competitive performance is notable given that our approach, unlike heavily trained machine learning methods, does not depend on extensive training. We expect our work to open new avenues for rationally improving conformation analysis and function interpretation.
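To make the template matching step concrete, the following is a minimal sketch of 3D geometric hashing applied to several conformations of one protein. It is an illustration under stated assumptions and not the implementation used in this thesis: the function names (local_frame, hash_keys, build_table, best_match, match_ensemble), the bin size of 1.0, and the use of bare point coordinates in place of residue atoms are all hypothetical choices. The ligand-specific amino acid check that follows matching is not shown.

```python
import itertools
from collections import Counter, defaultdict

import numpy as np


def local_frame(p0, p1, p2):
    """Build an orthonormal frame from three points; None if they are degenerate."""
    e1 = p1 - p0
    e3 = np.cross(e1, p2 - p0)
    n1, n3 = np.linalg.norm(e1), np.linalg.norm(e3)
    if n1 < 1e-6 or n3 < 1e-6:  # coincident or collinear points
        return None
    e1, e3 = e1 / n1, e3 / n3
    e2 = np.cross(e3, e1)
    return p0, np.stack([e1, e2, e3])


def hash_keys(points, basis, bin_size):
    """Quantized coordinates of the non-basis points in the frame of `basis`."""
    frame = local_frame(*points[list(basis)])
    if frame is None:
        return []
    origin, axes = frame
    keys = []
    for i, p in enumerate(points):
        if i in basis:
            continue
        local = axes @ (p - origin)  # invariant to rotation and translation
        keys.append(tuple(np.round(local / bin_size).astype(int)))
    return keys


def build_table(templates, bin_size=1.0):
    """Hash every template under every ordered basis triplet (assumed layout)."""
    table = defaultdict(set)
    for name, pts in templates.items():
        pts = np.asarray(pts, dtype=float)
        for basis in itertools.permutations(range(len(pts)), 3):
            for key in hash_keys(pts, basis, bin_size):
                table[key].add((name, basis))
    return table


def best_match(table, query, bin_size=1.0):
    """Return (template name, votes) for the best-supported template basis."""
    query = np.asarray(query, dtype=float)
    best = (None, 0)
    for basis in itertools.permutations(range(len(query)), 3):
        votes = Counter()
        for key in hash_keys(query, basis, bin_size):
            votes.update(table.get(key, ()))
        if votes:
            (name, _), count = votes.most_common(1)[0]
            if count > best[1]:
                best = (name, count)
    return best


def match_ensemble(table, conformations, bin_size=1.0):
    """Count how many conformations of the ensemble match each template."""
    hits = Counter()
    for conf in conformations:
        name, _ = best_match(table, conf, bin_size)
        if name is not None:
            hits[name] += 1
    return hits
```

In this sketch, a template that is matched in many conformations of the ensemble would be treated as a stronger candidate than one matched in a single structure; how the thesis actually combines matches across conformations is not specified here.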