Algorithms for Investigating, Decoding and Designing ligand recognition sites in proteins - A Structural Bioinformatics Approach to Studying Protein Function

Sankar, Santhosh

Abstract

All physical processes in living organisms are driven by specific biomolecular interactions. Elucidating the characteristics of interactions between proteins and their small-molecule ligands can provide functional insight into why cells behave the way they do, which ligands a protein can bind, and which proteins a particular ligand can bind. Such questions have gained significant attention because answering them helps uncover novel protein targets and cellular pathways that are currently unknown. This is equally important when the ligand being investigated is a drug molecule or a probable lead candidate. Often, precise identification of small-molecule binding sites provides direct insight into the mechanism by which proteins recognize ligands. The PDB currently hosts coordinates for 202,467 structures, whose binding sites are readily accessible and can be efficiently mined. This thesis employs a structural bioinformatics approach to investigate the characteristics of protein-ligand interactions on a large scale. Specifically, it describes the development of three novel algorithms that systematically investigate key features of protein-ligand recognition in different protein families, enable large-scale 3D searches for similar binding sites across large structural databases, and identify possible receptors for a given ligand using a design approach. The first of them, SiteMotif, is a new algorithm that compares binding sites from multiple proteins and derives sequence-order-independent structural site motifs. Studying similarities between protein molecules has become a fundamental activity in much of biology and biomedical research, for which methods such as multiple sequence alignment are widely used.
Most methods available for such comparisons cater to proteins with clearly recognizable evolutionary relationships, but not to proteins that recognize the same or similar ligands without sharing similarity in sequence or structural fold. In many cases, proteins in the latter class share structural similarities only in their binding sites. While several algorithms are available for comparing binding sites, none derives structural motifs of the binding sites independently of the whole proteins. Hence we developed this method. SiteMotif decomposes each binding site into a 3D distance matrix encoding the Cα-Cα, Cβ-Cβ and CN-CN centroid distances of all residue pairs. It then uses depth-first search traversals to find paths common to two distance matrices. The similarity between sites is established using three metrics, M-Distmax, M-Distmin and the BLOSUM score, which respectively capture local similarity, global similarity and the residue-residue alignment between sites. An all-vs-all alignment of the input sites yields pairwise similarity scores, from which a representative site is identified by mapping the scores onto a projection network. Finally, all other binding sites are aligned onto the representative, which constitutes the multiple alignment. The algorithm is validated at multiple levels of complexity and its performance is demonstrated in different scenarios. We show that SiteMotif identifies new structural motifs of spatially conserved residues in proteins even when there is no sequence or fold-level similarity, as illustrated by a case study of glutathione-binding proteins. The second algorithm, FLAPP, is a new state-of-the-art tool for aligning two binding sites and obtaining atomic-level alignments in about 1 millisecond. Protein function is a direct consequence of its sequence, structure and the arrangement at the binding site.
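As an illustration of the distance-matrix encoding described for SiteMotif above, the following is a minimal sketch and not the thesis implementation. The dictionary keys `ca`, `cb` and `cn`, the helper names and the coordinates are all assumptions made for this example. It shows why such an encoding is convenient: pairwise distances do not change when a site is rotated or translated, and they do not depend on the order of residues in the sequence.

```python
import math

def distance_matrix(points):
    """Return the symmetric matrix of Euclidean distances between 3D points."""
    return [[math.dist(p, q) for q in points] for p in points]

def encode_site(residues):
    """Encode a site as three distance matrices, one per representative point.

    `residues` is a list of dicts with hypothetical keys "ca", "cb" and "cn",
    each holding an (x, y, z) coordinate for that residue.
    """
    return {key: distance_matrix([r[key] for r in residues])
            for key in ("ca", "cb", "cn")}

# Toy three-residue site with made-up coordinates (in angstroms).
site = [
    {"ca": (0.0, 0.0, 0.0), "cb": (1.5, 0.0, 0.0), "cn": (2.0, 0.5, 0.0)},
    {"ca": (3.0, 4.0, 0.0), "cb": (3.0, 5.5, 0.0), "cn": (3.5, 6.0, 0.0)},
    {"ca": (0.0, 0.0, 6.0), "cb": (0.0, 1.5, 6.0), "cn": (0.5, 2.0, 6.0)},
]
matrices = encode_site(site)

# C-alpha distance between residue 0 and residue 1.
print(matrices["ca"][0][1])  # prints 5.0
```

Each matrix is symmetric with a zero diagonal, so only the upper triangle carries information when two sites are compared.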
Sequence analysis is typically used to gain a first insight into protein function. Protein structures, on the other hand, provide a higher-resolution platform for understanding function. As protein structural information becomes increasingly available through experimental structure determination and through advances in computational structure prediction, the opportunity to utilize this data is also increasing. Structural analysis of small-molecule ligand binding sites in particular provides a direct and more accurate window into protein function. However, it remains a poorly utilized resource because the huge computational cost of existing methods makes large-scale structural comparison of binding sites prohibitive. Hence we developed an algorithm called FLAPP (Fast Local Alignment of Protein Pockets) that produces very rapid atomic-level alignments. By combining clique matching in graphs with the power of modern CPU architectures, FLAPP aligns a typical pair of binding sites in ~12.5 milliseconds using a single CPU core and ~1 millisecond using 12 cores on a standard desktop machine, and performs a PDB-wide scan in 1-2 minutes. The algorithm is rigorously evaluated at multiple levels of complexity and shows the capability to detect faint alignments. We also present a case study involving vitamin B12 binding sites to showcase the usefulness of FLAPP for performing an exhaustive alignment-based PDB-wide scan. The application of these two algorithms is described in a case study of SAM-binding proteins. This work has led to the derivation of binding-site structural motifs of SAM-recognising proteins and the identification of a previously unknown resemblance to the ATP-binding Walker motifs. S-adenosylmethionine (SAM) is a ubiquitous cofactor that serves as a donor for methylation reactions and additionally serves as a donor of other functional groups, such as amino and ribosyl moieties, in a variety of other biochemical reactions.
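The clique-matching idea mentioned for FLAPP above can be sketched generically as follows. This is an assumption-laden illustration of the textbook technique, not FLAPP's actual code: the residue representation (a type label plus one coordinate), the tolerance `tol`, the function names and the toy coordinates are all hypothetical. Candidate residue pairings become graph nodes, two pairings are connected when they preserve the inter-residue distance, and the largest clique gives a geometrically consistent alignment.

```python
import math
from itertools import combinations

def correspondence_graph(site_a, site_b, tol=1.0):
    """Build a product graph whose nodes pair up same-type residues.

    Nodes (i, j) and (k, l) are adjacent when the distance i-k in site A
    agrees with the distance j-l in site B to within `tol` angstroms.
    """
    nodes = [(i, j)
             for i, (type_a, _) in enumerate(site_a)
             for j, (type_b, _) in enumerate(site_b)
             if type_a == type_b]
    adj = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # a residue may be matched at most once
        d_a = math.dist(site_a[i][1], site_a[k][1])
        d_b = math.dist(site_b[j][1], site_b[l][1])
        if abs(d_a - d_b) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj

def max_clique(adj):
    """Return one maximum clique using the basic Bron-Kerbosch recursion."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(p):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return sorted(best)

# Toy sites: the same three residues, reordered and translated, plus one extra.
site_a = [("ASP", (0, 0, 0)), ("GLY", (3, 0, 0)), ("LYS", (0, 4, 0))]
site_b = [("LYS", (10, 4, 0)), ("ASP", (10, 0, 0)),
          ("GLY", (13, 0, 0)), ("SER", (20, 20, 20))]

alignment = max_clique(correspondence_graph(site_a, site_b))
print(alignment)  # prints [(0, 1), (1, 2), (2, 0)]
```

The basic recursion is exponential in the worst case, so a production tool would need pruning and parallelism to reach the timings quoted above; this sketch makes no attempt at either.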
Such versatility in function is enabled by the ability of SAM to be recognized by a wide variety of protein molecules that vary in their sequences and structural folds. To understand what gives rise to specific SAM binding in diverse proteins, we set out to study whether there are any structural patterns at their binding sites. A comprehensive analysis of the structures of SAM binding sites by all-pair comparison and clustering indicated the presence of four different site types, only one of which is well studied. For each site type, we decipher the common minimum principle involved in SAM recognition by diverse proteins and derive structural motifs that are characteristic of SAM binding. The presence of structural motifs with a precise three-dimensional arrangement of amino acids in SAM sites that appear to have evolved independently indicates that these are winning arrangements of residues for SAM recognition. Further, we find high similarity between one of the SAM site types and a well-known ATP binding site type. We demonstrate using in vitro experiments that a known SAM-binding protein, HpyAII.M1, a type 2 methyltransferase, can bind and hydrolyze ATP. We find common structural motifs that explain this, a finding further supported by site-directed mutagenesis. The observation of similar motifs for binding two of the most ubiquitous ligands in multiple protein families with diverse sequences and structural folds presents compelling evidence at the molecular level in favour of convergent evolution. The final part of the thesis describes a novel de novo design algorithm, CRD (Cognate Receptor Discovery), which predicts cognate receptors for small-molecule ligands via a design approach. While predicting a new ligand to bind to a protein is possible with current methods, the converse, predicting a protein for a ligand, is highly challenging except for very closely related known protein-ligand complexes.
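The all-pair comparison and clustering step used in the SAM analysis above could look roughly like the sketch below. The abstract does not say which clustering method was used, so this example assumes a simple threshold-based, single-linkage style grouping. The similarity values, the threshold and the function name are invented for illustration.

```python
def cluster_sites(similarity, threshold):
    """Group sites into clusters by linking pairs scoring >= threshold.

    `similarity` is a square, symmetric matrix of pairwise site scores.
    Sites end up in the same cluster when a chain of above-threshold
    pairs connects them (connected components of the threshold graph).
    """
    unvisited = set(range(len(similarity)))
    clusters = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack, members = [start], [start]
        while stack:
            i = stack.pop()
            linked = [j for j in unvisited if similarity[i][j] >= threshold]
            for j in linked:
                unvisited.remove(j)
                members.append(j)
                stack.append(j)
        clusters.append(sorted(members))
    return clusters

# Made-up all-pair similarity scores for five hypothetical binding sites.
scores = [
    [1.00, 0.90, 0.40, 0.10, 0.10],
    [0.90, 1.00, 0.80, 0.10, 0.10],
    [0.40, 0.80, 1.00, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.85],
    [0.10, 0.10, 0.10, 0.85, 1.00],
]
site_types = cluster_sites(scores, threshold=0.7)
print(site_types)  # prints [[0, 1, 2], [3, 4]]
```

Sites 0 and 2 score only 0.40 against each other but still share a cluster because both are similar to site 1, which is the defining behaviour of single-linkage grouping.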
Predicting a receptor for any given ligand would be path-breaking for understanding protein function, mapping sequence-structure-function relationships, and several aspects of drug discovery, including studying the mechanism of action of phenotypically discovered drugs, off-target effects and drug repurposing. CRD comprises multiple modules, each addressing a complex problem such as ligand fragmentation, residue library generation or the design of a starting seed site. The source code was written in Python 3.9 using the Anaconda distribution and comprises 34 Python classes and 12 functions, totalling 14k lines of code. The generated sites are pruned using a triplet fitness function, an improved site-ranking scheme introduced in this study, and are then optimized using a genetic algorithm. CRD partially recovers the receptor for known ligands such as ATP, SAM, FAD and glucose, which is remarkable given that no prior methods exist to tackle this problem. In conclusion, this work presents an algorithmic viewpoint on the ligand binding sites of proteins and their importance for annotating protein function. The first two algorithms developed in this work together provide new capabilities for fast, accurate, large-scale binding-site comparison and for identification of 3D site motifs that describe key residues at the binding sites. The third, the first of its kind, designs binding sites for a given ligand and, together with the first two, finds matching proteins to predict receptors for that ligand. The application of these algorithms, as in the case study of SAM-binding proteins, helps uncover deeper aspects of protein biology systematically. Specifically, the derivation of structural motifs in SAM-binding proteins and their similarity to ATP-binding motifs led to the identification of an ATP binding and hydrolysis ability in a methyltransferase. The algorithms are made publicly available for use by the community.
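For readers unfamiliar with genetic algorithms, the optimization step mentioned for CRD above follows a select, recombine and mutate loop of the kind sketched here. This is a generic toy, not CRD: the bit-string genome, the `toy_fitness` function (which merely counts matches to an arbitrary target pattern) and every parameter value are assumptions for illustration, and the thesis's triplet fitness function is not reproduced.

```python
import random

def genetic_search(fitness, genome_len, pop_size=30, generations=40,
                   mutation_rate=0.05, seed=0):
    """Maximize `fitness` over bit-string genomes with a simple elitist GA."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the fitter half survives unchanged.
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_len)  # one-point crossover
            child = mum[:cut] + dad[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Hypothetical stand-in for a site-scoring function.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]

def toy_fitness(genome):
    """Count positions where the genome matches the arbitrary target."""
    return sum(g == t for g, t in zip(genome, TARGET))

best = genetic_search(toy_fitness, genome_len=len(TARGET))
print(best, toy_fitness(best))
```

Because survivors are carried over unchanged, the best score in the population never decreases from one generation to the next. In a design setting the genome would encode a candidate binding site rather than a bit string.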
It is envisaged that many more such insights can be gained through the use of these algorithms, which are useful for a fundamental understanding of protein function and evolutionary origins, and for applications in drug discovery and biotechnology.

URI

https://etd.iisc.ac.in/handle/2005/6375

Collections

Biochemistry (BC) [260]