Coarse-grained dynamics derived structural ensemble for prediction of metal binding sites of protein and phenotypic effects of variants
Abstract
Structures of proteins play a key role in determining their functions. Knowledge of structure,
especially the details of specific sites of a protein, can help us understand their contribution to
the overall activity. However, proteins are flexible molecules and adopt a range of
conformations under physiological conditions. Consequently, to be able to better understand
or predict the function of a protein we must incorporate a spectrum of structural
conformations in our analyses. The availability of the true conformation of a protein in such an analysis
is a key requirement for reliable results. In this thesis, we explore two aspects of a protein
that directly affect its conformation and thereby its function. First, we establish a
comprehensive approach for predicting metal ion binding sites in apo protein structures. The apo
and metal-bound structures of a protein can differ significantly at key sites, which can mislead the
analysis of the protein's function. Second, we study the effect of point mutations
on protein conformation that can lead to loss of function and thereby alter the organism's phenotypic
properties. For the metal binding site prediction task, we employ a geometric hashing
technique to create a template library of the functionally important binding sites, and to match
new protein structures with available templates. The matching is done on more than one structure
of a protein obtained from coarse-grained molecular dynamics simulation. The matched
residues are then checked for ligand-specific amino acids to arrive at the final prediction of
the binding site(s). We have developed this method for the five most common metal ions: Zn2+,
Mg2+, Cu2+, Fe3+ and Ca2+. We perform 5-fold cross-validation of our method on 1,347
protein structures containing 2,343 binding sites. The method shows a noteworthy advantage for Ca2+
and Zn2+ in comparison with the IonCom and MIB methods, while for the other metal
ions, Mg2+, Cu2+ and Fe3+, the performance is comparable. IonCom is a machine learning
based method that has been shown to outperform state-of-the-art metal ion binding site
prediction methods, including COACH and TargetS. MIB is a template-based method (TBM) that also
performs better than many machine learning based methods and other TBMs. Notably, almost
all previous studies leverage only structural or sequence similarity information for predicting
metal binding sites. Our use of dynamics information through an ensemble of structures provides
a generalized approach to improving metal binding site prediction. The method is widely
applicable, given the integral contribution of metal binding sites to protein structure.
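
The template matching over an ensemble described above can be illustrated with a small sketch. This is not the implementation used in the thesis: it is a simplified, distance-invariant variant of geometric hashing in which each triplet of points (for example, residue coordinates) is keyed by its binned side lengths, and the names `triangle_key`, `build_library` and `vote`, the 1.0 Å bin size, and the toy coordinates are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def triangle_key(p, q, r, bin_size=1.0):
    """Rotation- and translation-invariant key: sorted, binned side lengths."""
    sides = sorted((dist(p, q), dist(q, r), dist(p, r)))
    return tuple(int(s // bin_size) for s in sides)


def build_library(templates, bin_size=1.0):
    """Map each triangle key to the names of the templates that contain it."""
    table = defaultdict(set)
    for name, points in templates.items():
        for p, q, r in combinations(points, 3):
            table[triangle_key(p, q, r, bin_size)].add(name)
    return table


def vote(table, conformations, bin_size=1.0):
    """Count, per template, how many conformations contain a matching triangle."""
    votes = defaultdict(int)
    for points in conformations:
        hits = set()
        for p, q, r in combinations(points, 3):
            hits |= table.get(triangle_key(p, q, r, bin_size), set())
        for name in hits:
            votes[name] += 1
    return dict(votes)


# Hypothetical template: three site atoms forming a 3.2 / 4.3 / 5.36 triangle.
templates = {"site_a": [(0.0, 0.0, 0.0), (3.2, 0.0, 0.0), (0.0, 4.3, 0.0)]}
table = build_library(templates)

# Hypothetical ensemble: the first conformation holds a moved copy of the
# template geometry; in the second, the site is distorted and does not match.
ensemble = [
    [(10.0, 10.0, 10.0), (10.0, 13.2, 10.0), (10.0, 10.0, 14.3)],
    [(0.0, 0.0, 0.0), (6.5, 0.0, 0.0), (0.0, 7.5, 0.0)],
]
result = vote(table, ensemble)
print(result)  # {'site_a': 1}
```

Matching against several conformations rather than a single structure lets a site be detected even when only some members of the ensemble adopt the template geometry. A practical version would also need to handle bin-boundary effects and filter matches by residue type.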
To understand how a mutation changes conformation and thereby alters
function and unfolding free energy, we estimate the change in flexibility between the
wild-type and mutated protein. Different scoring methods are used for the change
in function and the change in unfolding free energy. For the change in function, we measure the similarity
score between flexible regions with a correlation coefficient greater than a given threshold,
while for the change in unfolding free energy, we calculate the free energy of the wild-type and mutated
proteins. The alteration in the protein's physicochemical properties is used to predict
phenotypic alteration through models that use experimental data. We obtain a significant advantage
in overall accuracy in predicting functional changes for 1,719 mutants of the CALM1 protein
compared to PolyPhen, and in predicting unfolding free energy change for 8 Frataxin mutants
compared to I-Mutant and FoldX. We obtain results comparable to PolyPhen
for classifying 34 variants of CHEK2 into deleterious, benign and neutral classes. Our simulation-based
approach contrasts sharply with the common trait assessments based on statistical
phenotype-genotype linkage data. Our method can also work with completely new proteins.
The phenotype assessment work was done as a part of the CAGI5 competition. Critical
Assessment of Genome Interpretation (CAGI) offers an exceptional platform to blind-test algorithms
and methods for assessing genotype-phenotype linkage.
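
The comparison of wild-type and mutant flexibility described above can be sketched as follows. The abstract does not specify the flexibility measure or the scoring details, so this example makes several assumptions: flexibility is taken to be the per-residue root-mean-square fluctuation (RMSF) over an already superimposed ensemble, similarity is the Pearson correlation between the two RMSF profiles, and the function names, the threshold of 0.8, and the toy coordinates are hypothetical.

```python
from math import sqrt
from statistics import mean


def rmsf(ensemble):
    """Per-residue RMSF; ensemble is a list of conformations, each a list
    of (x, y, z) tuples, assumed to be superimposed beforehand."""
    profile = []
    for i in range(len(ensemble[0])):
        coords = [conf[i] for conf in ensemble]
        centre = tuple(mean(c[k] for c in coords) for k in range(3))
        msf = mean(sum((c[k] - centre[k]) ** 2 for k in range(3)) for c in coords)
        profile.append(sqrt(msf))
    return profile


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (undefined, and
    raising ZeroDivisionError here, if either profile is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Toy three-residue ensembles (hypothetical data, two conformations each).
wild_type = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 2.0)],
]
mutant = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 0.0, 4.0)],
]

THRESHOLD = 0.8  # hypothetical cut-off, not taken from the thesis
score = pearson(rmsf(wild_type), rmsf(mutant))
flexibility_preserved = score > THRESHOLD
```

In this toy case the mutant fluctuates twice as much at every residue, so the profiles are perfectly correlated even though the amplitudes differ. A correlation-based score captures the pattern of flexibility only, and a complementary measure would be needed to capture changes in magnitude.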
In summary, both our metal binding site prediction and mutation effect prediction methods
give comparable or better results than state-of-the-art methods. Our approach achieves this
competitive performance while differing markedly from heavily trained machine learning methods.
We expect our work to open new avenues to rationally improve conformation analysis and
function interpretation.