|dc.description.abstract||Biological processes are governed by specific interactions between macromolecules. Three-dimensional structural information on these macromolecules is necessary to understand the basis of molecular recognition. A large number of protein structures have been determined at high resolution using experimental techniques such as X-ray crystallography, NMR and electron microscopy, and made publicly available through the Protein Data Bank (PDB). In recent years, comprehending function by studying a large number of related proteins has proved very fruitful for understanding their biological roles and gaining mechanistic insights into molecular recognition. The availability of large-scale structural data has indeed made the task of predicting protein function from three-dimensional structure feasible. Structural bioinformatics, a branch of bioinformatics, has evolved into a separate discipline to rationalize and classify the information present in three-dimensional structures and derive meaningful biological insights. This has provided a better understanding of biological processes at a higher resolution in several cases. Most structural bioinformatics approaches so far have focused on fold-level analysis of proteins and their relationship to sequences. It has long been recognized that sequence-fold and fold-function relationships are highly complex: information on one aspect cannot be readily extrapolated to the other. To a significant extent, this can be overcome by understanding similarities between proteins through comparison of their binding site structures. In this thesis, the primary focus is on analyzing small-molecule ligand binding sites in protein structures, as most biological processes, ranging from enzyme catalysis to complex signaling cascades, are mediated through protein-ligand interactions. 
Moreover, given that the precise geometry and chemical properties of the residues at ligand binding sites dictate molecular recognition capabilities, focusing on these sites at the structural level is likely to yield more direct insights into protein function.
The study of binding sites at the structural level poses several problems, mainly because the residues at a site may be sequentially discontinuous but spatially proximal. Further, the order of the binding site residues in the primary sequence has, in most cases, no significance for ligand binding. Compounding these difficulties are additional factors such as the non-uniform contribution of different residues to binding and size variations in binding sites, even across closely related proteins. As a result, the methods available to study ligand-binding sites in proteins, especially on a large scale, are limited, warranting exploration of new approaches. In the present work, new methods and tools have been developed to address some of these challenges in binding site analysis. First, a novel tool for site-based function annotation of protein structures, called PocketAnnotate, was developed (http://proline.biochem.iisc.ernet.in/pocketannotate/). PocketAnnotate detects the putative binding sites in a given protein structure and compares them to known binding sites in PDB to derive functional annotation in terms of ligand association. Since the tool derives functional annotation at the level of binding sites, it has an advantage over methods that rely solely on fold or sequence information. This becomes even more important in cases where there is no detectable homology with entries in existing databases, as PocketAnnotate does not depend on evolutionary information for annotation.
Second, a web-accessible tool for in silico alanine-scanning mutagenesis of binding site residues, called ABS-Scan, has been developed (http://proline.biochem.iisc.ernet.in/abscan/). This tool helps in assessing the contribution of individual binding site residues in the protein towards ligand recognition. Each residue in a binding site is systematically mutated, one at a time, to alanine, and the ability of the corresponding mutant to bind a given ligand is analyzed. The contribution of each residue towards ligand binding is calculated as a ΔΔG value, derived by comparing the binding affinity of the mutant with that of the wild-type protein-ligand complex.
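The alanine-scanning calculation described above can be illustrated with a minimal sketch. It is not the ABS-Scan implementation: the function names, the residue labels and the scores are hypothetical, and the sign convention (a positive value means the mutation weakens binding) is an assumption made for this example.

```python
from typing import Dict, List, Tuple


def alanine_scan_ddg(wild_type_score: float,
                     mutant_scores: Dict[str, float]) -> Dict[str, float]:
    """Return score(mutant) - score(wild type) for each mutated residue.

    Scores are assumed to be binding energies in kcal/mol, where a more
    negative value means tighter binding (an assumption of this sketch).
    """
    return {residue: score - wild_type_score
            for residue, score in mutant_scores.items()}


def rank_by_contribution(ddg: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort residues so that the largest loss of binding comes first."""
    return sorted(ddg.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for a wild-type complex and three alanine mutants.
wild_type = -9.0
mutants = {"ASP45A": -6.5, "TYR102A": -8.75, "GLY77A": -9.0}

ddg = alanine_scan_ddg(wild_type, mutants)
ranking = rank_by_contribution(ddg)
```

In this toy input, the mutant that loses the most binding energy relative to the wild type is ranked first, which is the sense in which a residue is said to contribute most to ligand recognition.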
Third, a web-accessible database called Protein-Ligand Interaction Clusters (PLIC) has been developed to identify and analyze similarities across binding sites in PDB (http://proline.biochem.iisc.ernet/PLIC). Protein-ligand interactions are explored primarily using three different computational approaches: (i) binding site characteristics, including pocket shape, nature of residues and interaction profiles with different kinds of chemical probes, (ii) atomic contacts between protein and ligands, and (iii) binding energetics of the interactions, derived from scoring functions developed for docking. Information on variations in these features, derived from different computational tools, is also included in the database to enable characterization of the binding sites. As a case study to demonstrate the usefulness of these tools, they have been applied to decipher the complexity of S-adenosyl methionine (SAM) interactions with proteins. Around 1,213 binding sites of SAM or SAM-like compounds could be extracted from the PLIC database. The SAM or SAM-like compounds were observed to interact with ∼18 different protein-fold types. The variations in protein-ligand contacts across fold types were analyzed, and the fold-specific interaction properties and the contributions of individual residues towards SAM binding were identified. The tools developed and example analyses using them are described in Chapter 2.
Chapter 3 describes a large-scale pocketome analysis of structural complexes in PDB, in an effort to characterize the known pocket space of protein-ligand interactions. The tools developed as described in Chapter 2 are used for this. A set of 84,846 binding sites compiled from PDB has been comprehensively analyzed with the objective of obtaining (a) a classification of binding sites, (b) sequence-fold-site relationships among proteins, (c) a minimal set of physicochemical attributes sufficient to explain ligand recognition specificity and (d) site-type specific signatures in terms of physicochemical features. A new method to describe binding sites was developed in the form of BScIds, such that the structural fold information is well captured. Binding sites and the similarities among them were abstracted in the form of networks, where each node represents a binding site and an edge between two nodes represents significant similarity between the sites at the structural level. Pocketome networks were constructed from the large-scale information on protein-ligand interactions in the PLIC database. The large pocketome network was then studied to derive relationships between protein folds and the chemical entities they interact with. A classification of the binding pockets was achieved by analyzing the pocketome network using graph theoretical approaches combined with clustering methods. In all, 10,858 clusters were identified from the network, each indicating a site-type, suggesting that about 10,858 site-types exist among the sites analyzed. Classification of ligand associations into specific site-types helps greatly in resolving the complex relationships by yielding specific site-type ligand associations. The observed classification was further probed to understand the basis of ligand recognition by representing the pockets through feature vectors. 
These features capture a wide range of physicochemical properties that can be used to derive site-type specific signatures and explore the pocket space of protein-ligand interactions. A principal component analysis of these features reveals that the binding site feature space is continuous across the entire PDB and that minor changes in specific features can give rise to significant differences in ligand specificity, consequently defining distinct functional roles. Weights were also derived for these features, using different information theoretic approaches, to explain the multiple specificity of protein-ligand interactions. Analysis of binding sites formed by residues contributed from different protein fold-types revealed increasing diversity of physicochemical properties at the site, supporting the hypothesis that combinations of folds could give rise to new binding sites.
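The network abstraction used for the pocketome can be sketched as follows. This is only an illustration under stated assumptions: the similarity scores, the threshold of 0.7 and the site names are hypothetical, and connected components stand in for the graph theoretical and clustering methods of the thesis, which are not specified in this abstract.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_site_network(pairs: Iterable[Tuple[str, str, float]],
                       threshold: float) -> Dict[str, Set[str]]:
    """Build an undirected graph with one node per binding site.

    An edge is added only when the pairwise similarity score reaches the
    threshold; sites without any significant neighbour stay as isolated nodes.
    """
    graph: Dict[str, Set[str]] = {}
    for site_a, site_b, score in pairs:
        graph.setdefault(site_a, set())
        graph.setdefault(site_b, set())
        if score >= threshold:
            graph[site_a].add(site_b)
            graph[site_b].add(site_a)
    return graph


def connected_components(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group sites into clusters by depth-first traversal of the graph."""
    seen: Set[str] = set()
    clusters: List[List[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        component: List[str] = []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(graph[node] - seen)
        clusters.append(sorted(component))
    return clusters


# Hypothetical pairwise similarity scores between five binding sites.
pairs = [
    ("siteA", "siteB", 0.9),
    ("siteB", "siteC", 0.8),
    ("siteC", "siteD", 0.3),
    ("siteD", "siteE", 0.75),
]
network = build_site_network(pairs, threshold=0.7)
site_type_clusters = connected_components(network)
```

With these toy scores the weak siteC-siteD pair does not form an edge, so the five sites fall into two clusters, each of which would correspond to one candidate site-type in this simplified picture.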
Given that a finer appreciation of the molecular mechanisms within the cell is possible only with structural information, the next objective was to explore whether a structural view of an entire proteome could be obtained and whether a pocketome could be constructed and analyzed. With this in mind, the causative agent of tuberculosis, Mycobacterium tuberculosis (Mtb), was chosen. Mtb is also being studied in the laboratory from a systems biology perspective, which enabled exploration of how the systems and structural perspectives could be combined and applied to drug discovery. Chapters 4 to 6 describe this effort.
The genome sequence of Mycobacterium tuberculosis (Mtb) H37Rv indicates the presence of ∼4,000 protein-coding genes, of which experimentally determined structures are available for ∼300 proteins. Further, advances in homology modeling methods have made it feasible to obtain structural models for many more proteins in the proteome. Chapter 4 describes the efforts to obtain the Mtb structural proteome, through which three-dimensional structures were derived for ∼70% of the proteins in the genome. Functional annotation of each protein was derived from fold-based functional assignments, binding-site comparisons and consequent ligand associations. PocketAnnotate, the site-based function annotation pipeline described in Chapter 2, was utilized for this purpose. Besides these, the annotation covers detection of various sequence and sub-structural motifs and quaternary structure predictions based on the corresponding templates. The study provides a unique opportunity to obtain a global perspective of the fold distribution in the genome. The annotation indicates that cellular metabolism can be achieved with only 219 unique folds. New insights into the folds that predominate in the genome, as well as the fold combinations that make up multi-domain proteins, are also obtained. In all, 1,728 binding pockets have been associated with ligands through binding site identification and sub-structure similarity analyses, yielding a list of ligands that can participate in various biochemical events in the mycobacterial cell. A web-accessible database, MtbStructuralproteome, has been developed to make the data and the analyses available to the community (http://proline.physics.iisc.ernet.in/Tbstructuralannotation). The resource, being one of the first to be based on structure-based functional annotations at a genome scale, is expected to be useful for a better understanding of tuberculosis and for application in drug discovery. 
The reported annotation pipeline is fairly generic and can be applied to other genomes as well.
Chapter 5 describes the characterization of the Mtb pocketome. For the structural models of the Mtb proteome described in Chapter 4, a genome-scale binding site prediction exercise was carried out using three different computational methods, from which consensus predictions were subsequently obtained. The three methods were independent and were based on geometry, inter-molecular energies with probes, and sequence conservation in evolutionarily related proteins, respectively. In all, 13,858 consensus binding pockets were predicted in 2,877 proteins. The pocket space within Mtb was then explored through systematic all-pair comparisons of binding sites. The number of site-types within Mtb was found to be 6,584, as compared to ∼400 structural folds and 1,831 unique sequence families. This reveals that the pocket space is larger than the sequence or fold space, suggesting that variations at the site level contribute significantly to the functional repertoire of the organism. On comparing the pockets with PDB sites enclosing known ligands, around 6,906 binding sites were observed to exhibit significant similarity, over the entire pocket, to at least one known binding site in PDB. In all, 1,213 metabolites could be mapped onto 665 enzymes covering most of the metabolic pathways. The identified ligands serve as a predicted metabolome for unit abundances of the proteins. A list of proteins containing unique pockets is also identified. The binding pockets, the similarities they share within Mtb and the ligands mapped onto them are all made available in a web-accessible database at http://proline.biochem.iisc.ernet.in/mtbpocketome/.
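One simple way to combine predictions from three independent methods is a residue-level majority vote, sketched below. The abstract does not state how the consensus was actually derived, so the voting rule, the `min_votes` parameter, the method labels and the residue names are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def consensus_residues(predictions: Dict[str, Iterable[str]],
                       min_votes: int = 2) -> Set[str]:
    """Keep residues predicted as binding-site residues by enough methods.

    Each method votes at most once per residue, so duplicates within a
    single method's output do not inflate the count.
    """
    votes: Counter = Counter()
    for residues in predictions.values():
        votes.update(set(residues))
    return {residue for residue, count in votes.items() if count >= min_votes}


# Hypothetical per-method predictions for one protein.
predictions = {
    "geometry": ["ASP45", "TYR102", "GLY77"],
    "probe_energy": ["ASP45", "TYR102", "LYS12"],
    "conservation": ["ASP45", "SER88"],
}
consensus = consensus_residues(predictions)
```

Raising `min_votes` to 3 would keep only residues on which all three methods agree, trading coverage for confidence.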
The availability of structural information on the pocketome at a genome scale opens up several opportunities in drug discovery. This information can be directly applied to understanding the mechanism of drug action and to predicting the adverse effects and pharmacodynamics of a drug. Moreover, it enables exploration of new ideas in drug discovery. Polypharmacology is a new concept that aims at modulating multiple drug targets through a single chemical entity. Currently, there are no established approaches to either select appropriate target sets or design polypharmacological drugs. In this study, a structural-proteomics approach is explored to first characterize the pocketome and then utilize it to identify similar binding sites. Knowledge of the similarity relationships between binding sites within the genome can be used to identify possible polypharmacological drug targets. A pocket-similarity-based clustering of binding sites resulted in the identification of binding site sets, each having a theoretical potential to interact with a common ligand. A polypharmacological index was formulated to rank targets by incorporating a measure of druggability and similarity to other pockets within the proteome. By comparison with known drug binding sites from databases such as DrugBank, the study has yielded a ready shortlist that includes sets of promising drug targets with polypharmacological possibilities. At the same time, it has identified possible drug candidates, either directly for repurposing or at least as significant lead clues that can be used to design new drug molecules against the entire group of proteins in each set. This analysis presents a rational approach to identifying targets with polypharmacological potential, clues about lead compounds and a list of candidates for drug repurposing.
This thesis demonstrates the feasibility of utilizing structural bioinformatics approaches at a genome scale. The tools developed for analyzing large-scale data on protein-ligand interactions could be applied to characterize the pocket space of protein-ligand interactions. The network theory approaches applied in this work make large-scale data tractable and enable binding-site typing. The binding site analysis at a genome scale for Mtb is the first of its kind and has provided novel insights into the pocket space. It also provided an opportunity to rationalize polypharmacological target selection and explore drugs for repurposing in TB. In the larger context, structural modelling of a proteome, mapping the small-molecule binding space in it and understanding the determinants of small-molecule recognition together form a major step in defining a proteome at higher resolution. This in turn will serve as a valuable input to the emerging field of structural-systems biology, which seeks to understand biological models at a systems level without compromising on the resolution of the study.||en_US