Algorithms for Investigating, Decoding and Designing ligand recognition sites in proteins - A Structural Bioinformatics Approach to Studying Protein Function
Abstract
All physical processes in living organisms are driven by specific biomolecular interactions.
Elucidating the characteristics of interactions between proteins and their respective
small molecule ligands can provide functional insights into why cells behave the way they do,
which ligands a protein can bind, and which proteins a particular ligand can bind. Addressing
such questions has gained significant attention because it helps uncover novel protein targets
and cellular pathways that are currently unknown. This is equally important
if the ligand being investigated is a drug molecule or a probable lead candidate. Often, precise
identification of small molecule binding sites provides direct answers about the
mechanism by which proteins recognize ligands. The PDB currently hosts coordinates for 202,467
structures whose binding sites are readily accessible and can be efficiently mined and analyzed.
This thesis employs a structural bioinformatics approach to investigate characteristics of
protein-ligand interactions on a large scale. Specifically, it describes the development of three
novel algorithms that systematically investigate key features of protein-ligand recognition in
different protein families, enable large-scale 3D searches for similar binding sites across large
structural databases, and identify possible receptors for a given ligand using a design approach.
The first of them, called SiteMotif, is a new algorithm that compares binding sites from
multiple proteins and derives sequence-order independent structural site motifs. Studying
similarities in protein molecules has become a fundamental activity in much of biology and
biomedical research, for which methods such as multiple sequence alignments are widely used.
Most methods available for such comparisons cater to studying proteins which have clearly
recognizable evolutionary relationships but not to proteins that recognize the same or similar
ligands but do not share similarities in their sequence or structural folds. In many cases, proteins
in the latter class share structural similarities only in their binding sites. While several algorithms
are available for comparing binding sites, there are none for deriving structural motifs of the
binding sites, independent of the whole proteins. Hence we developed this method. SiteMotif
decomposes each binding site into a 3D distance matrix encoding Cα-Cα, Cβ-Cβ and cn-cn
centroid distances of all residue pairs. It then uses depth-first search traversals
to find paths common to two distance matrices. The similarity between sites is established
using three metrics, M-Distmax, M-Distmin and the BLOSUM score, which respectively capture local
similarity, global similarity and the residue-residue alignment between sites. An All-vs-All
alignment of all pairs of input sites yields all-pair similarity scores, from which a
representative is identified by mapping the similarity scores into a projection network. Finally,
all other binding sites are aligned onto the representative, which constitutes the
multiple alignment. The algorithm is validated at multiple levels of complexity and its
performance is demonstrated in different scenarios. We show that SiteMotif identifies new structural motifs of
spatially conserved residues in proteins, even when there is no sequence or fold-level similarity,
with a case study on glutathione binding proteins.
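The first steps described above can be illustrated with a minimal sketch. It is not SiteMotif's implementation: it only shows how one site could be turned into a pairwise distance matrix, and it swaps the projection-network step for a simpler stand-in that picks the site with the highest summed similarity to all others. The function names, coordinates and similarity values are hypothetical.

```python
import math


def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def pick_representative(similarity):
    """Return the index of the site with the highest total similarity to the others.

    Assumption: this is a simple stand-in, not the projection-network
    procedure mentioned in the abstract.
    """
    n = len(similarity)
    totals = [sum(similarity[i][j] for j in range(n) if j != i) for i in range(n)]
    return max(range(n), key=totals.__getitem__)


# Hypothetical C-alpha coordinates (in angstroms) for a three-residue site.
site = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
ca_matrix = distance_matrix(site)

# Hypothetical all-pair similarity scores for three sites.
scores = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.8], [0.2, 0.8, 1.0]]
representative = pick_representative(scores)
```

The same `distance_matrix` helper could be applied separately to Cβ and centroid coordinates to obtain the other two layers of distances.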
The second algorithm is FLAPP, which is a new state-of-the-art tool for aligning two
binding sites and obtaining atomic level alignments in 1 millisecond. Protein function is a direct
consequence of its sequence, structure and the arrangement at the binding site. Bioinformatics
using sequence analysis is typically used to gain a first insight into protein function. Protein
structures, on the other hand, provide a higher resolution platform into understanding functions.
As protein structural information is increasingly becoming available through experimental
structure determination and through advances in computational methods for structure prediction,
the opportunity to utilize these data is also increasing. Structural analysis of small molecule ligand
binding sites in particular provides a more direct and accurate window to infer protein function.
However, it remains a poorly utilized resource because the huge computational cost of existing
methods makes large-scale structural comparisons of binding sites prohibitive. Hence we
developed an algorithm called FLAPP (Fast Local Alignment of Protein Pockets) that produces
very rapid atomic level alignments. By combining clique matching in graphs and the power of
modern CPU architectures, FLAPP aligns a typical pair of binding sites in ~12.5
milliseconds using a single CPU core and ~1 millisecond using 12 cores on a standard desktop
machine, and performs a PDB-wide scan in 1-2 minutes. The algorithm is rigorously evaluated at
multiple levels of complexity and shown to be capable of detecting faint alignments. We also
present a case study involving vitamin B12 binding sites to showcase the usefulness of FLAPP
for performing an exhaustive alignment based PDB-wide scan.
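Clique matching for site alignment can be sketched generically as follows. This is a textbook correspondence-graph formulation with a plain Bron-Kerbosch search, not FLAPP's code: how FLAPP builds its graph, which tolerance it uses and how it exploits the CPU are not stated in this abstract. The residue labels, coordinates and the `tol` parameter are assumptions for illustration.

```python
import math
from itertools import combinations


def align_sites(site_a, site_b, tol=1.0):
    """Align two sites given as lists of (label, (x, y, z)).

    Nodes of the correspondence graph pair residues with equal labels;
    two nodes are connected when the intra-site distances agree within tol.
    Returns the largest clique as a sorted list of (index_a, index_b).
    """
    nodes = [(i, j)
             for i, (label_a, _) in enumerate(site_a)
             for j, (label_b, _) in enumerate(site_b)
             if label_a == label_b]
    adj = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i != k and j != l:
            d_a = math.dist(site_a[i][1], site_a[k][1])
            d_b = math.dist(site_b[j][1], site_b[l][1])
            if abs(d_a - d_b) <= tol:
                adj[(i, j)].add((k, l))
                adj[(k, l)].add((i, j))

    best = []

    def expand(clique, candidates, excluded):
        # Basic Bron-Kerbosch enumeration of maximal cliques (no pivoting).
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = list(clique)
            return
        for node in list(candidates):
            expand(clique + [node], candidates & adj[node], excluded & adj[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand([], set(nodes), set())
    return sorted(best)


# Hypothetical sites: the same three residues, translated and listed in a different order.
site_a = [("ASP", (0.0, 0.0, 0.0)), ("HIS", (4.0, 0.0, 0.0)), ("SER", (0.0, 3.0, 0.0))]
site_b = [("SER", (10.0, 3.0, 0.0)), ("ASP", (10.0, 0.0, 0.0)), ("HIS", (14.0, 0.0, 0.0))]
matches = align_sites(site_a, site_b)
```

Because the matching depends only on labels and internal distances, the result does not depend on the order of residues in either site, which is what makes the alignment sequence-order independent.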
The application of these two algorithms is described in a case study of SAM binding
proteins. This work has led to deriving binding site structural motifs of SAM recognising
proteins and identifying previously unknown resemblance to the ATP binding Walker motifs.
S-adenosylmethionine (SAM) is a ubiquitous co-factor that serves as a donor for methylation
reactions and additionally serves as a donor of other functional groups such as amino and ribosyl
moieties in a variety of other biochemical reactions. Such versatility in function is enabled by the
ability of SAM to be recognized by a wide variety of protein molecules that vary in their
sequences and structural folds. To understand what gives rise to specific SAM binding in diverse
proteins, we set out to study whether there are any structural patterns at their binding sites. A
comprehensive analysis of the structures of SAM binding sites by all-pair comparison and
clustering indicated the presence of four different site-types, only one of which is well
studied. For each site-type we decipher the common minimum principle involved in SAM
recognition by diverse proteins and derive structural motifs that are characteristic of SAM
binding. The presence of structural motifs with a precise three-dimensional arrangement of
amino acids in SAM sites that appear to have evolved independently indicates that these are
winning arrangements of residues to bring about SAM recognition. Further, we find high
similarity between one of the SAM site types and a well known ATP binding site type. We
demonstrate using in vitro experiments that a known SAM binding protein, HpyAII.M1, a type 2
methyltransferase, can bind and hydrolyze ATP. We find common structural motifs that explain
this, further supported through site-directed mutagenesis. Observation of similar motifs for
binding two of the most ubiquitous ligands in multiple protein families with diverse sequences
and structural folds presents compelling evidence at the molecular level in favour of convergent
evolution.
The final part of the thesis describes a novel de novo design algorithm, ‘CRD’ (Cognate
Receptor Discovery), to predict cognate receptors for small molecule ligands via a design
approach. While predicting a new ligand to bind to a protein is possible with current methods, the
converse of predicting a protein for a ligand is highly challenging, except for very closely-related
known protein-ligand complexes. Predicting a receptor for any given ligand will be path-breaking
in understanding protein function, mapping sequence-structure-function relationships and for
several aspects of drug discovery including studying the mechanism of action of phenotypically
discovered drugs, off-target effects and drug repurposing. CRD comprises multiple modules,
each addressing a complex problem such as ligand fragmentation, residue library generation and
the design of a starting seed site. The source code is written in Python 3.9 using the Anaconda
distribution and comprises 34 Python classes and 12 functions, totalling 14k lines of code. The
generated sites are pruned using a triplet fitness function, an improved site ranking scheme
introduced in this study, and are then optimized using a genetic algorithm. CRD partially
recovers the receptors for known ligands such as ATP, SAM, FAD and glucose, which is
remarkable given that no prior methods exist to tackle this problem.
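The optimization step can be pictured with a generic genetic algorithm sketch. It is not CRD's code: a candidate site is reduced here to a string of one-letter residue codes, and `toy_fitness` with its `TOY_TARGET` is a hypothetical placeholder for the triplet fitness function, whose definition is not given in this abstract. Selection, crossover and mutation settings are likewise assumptions.

```python
import random

RESIDUES = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
TOY_TARGET = "GDGSK"               # hypothetical target, for illustration only


def toy_fitness(site):
    """Placeholder fitness: number of positions that match TOY_TARGET."""
    return sum(a == b for a, b in zip(site, TOY_TARGET))


def evolve(fitness, length, pop_size=30, generations=40, mutation_rate=0.1, seed=0):
    """Evolve residue strings; return the best one and the best score per generation."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(RESIDUES) for _ in range(length))
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # truncation selection keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)     # single-point crossover
            child = a[:cut] + b[cut:]
            child = "".join(rng.choice(RESIDUES) if rng.random() < mutation_rate else c
                            for c in child)
            children.append(child)
        population = parents + children
    return max(population, key=fitness), history


best_site, best_per_generation = evolve(toy_fitness, len(TOY_TARGET), seed=1)
```

Because the top half of each generation is carried over unchanged, the best score recorded in `best_per_generation` never decreases from one generation to the next.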
In conclusion, this work presents an algorithmic viewpoint on the ligand binding
sites of proteins and their importance for annotating protein function. The first two algorithms developed in
this work together provide new capabilities for fast and accurate large-scale binding site
comparisons and for identifying 3D site motifs that describe key residues at the binding sites.
The third, the first of its kind, designs binding sites for a given ligand and, together with the first
two, finds matching proteins to predict receptors for that ligand. The application of these
algorithms, as in the case study of SAM binding proteins, helps uncover deeper aspects
of protein biology systematically. Specifically, the derivation of structural motifs in SAM binding
proteins and their similarity with ATP binding motifs led to the identification of an ATP binding
and hydrolysis ability in a methyltransferase. The algorithms are made publicly available for use
by the community. It is envisaged that many more such insights can be gained by using
these algorithms, which are useful for the fundamental understanding of protein function and its evolutionary
origins, and for drug discovery and biotechnology applications.