Comparative analyses of homologous protein sequences and structures : Inference on protein evolution

Balaji, S

dc.contributor.advisor	Srinivasan, N
dc.contributor.author	Balaji, S
dc.date.accessioned	2026-03-10T10:27:28Z
dc.date.available	2026-03-10T10:27:28Z
dc.date.submitted	2003
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/9016
dc.description.abstract	Proteins are linear polymers composed of 20 different amino acid residues and can fold into complex three-dimensional (3D) structures. For a given protein, the information required to fold into a precise 3D structure is encoded in its amino acid sequence. Furthermore, the intricacies of the 3D structure are related to the biochemical activity (function) of the protein. The exact relationship between the one-dimensional (1D) sequence of a protein and its 3D structure remains an unsolved problem in structural biology. The description of 3D protein structures at the atomic level comes essentially from two experimental methods: X-ray diffraction and nuclear magnetic resonance (NMR) spectroscopy. The coordinate datasets of a large number of proteins are available in the Protein Data Bank. The vast amount of information available at atomic detail on protein structures has enabled researchers to classify them based on various kinds of similarities. Such classification, in turn, has been utilized in comparative studies to understand their properties and evolution. The results of such studies have profound implications for structure prediction and modeling studies, which may be viewed as useful steps prior to the determination of 3D structure using X-ray diffraction or NMR spectroscopy. The aim of the research projects reported in this thesis is to understand the relationship between sequence and structural variation among homologous proteins and its implications for understanding the evolution of protein structures and modeling.
Chapter 1 of the thesis provides an introduction to protein structures and evolution. The remaining contents of the thesis can be classified into three major groups: (a) Investigation of structural divergence and sequence variability among homologous proteins (Chapters 2, 3, and 4). (b) Investigations of the evolutionary relationships within families of protein structures and among distantly related protein structures (Chapters 5 and 6). (c) Application of information available in protein sequence and structural families to add value to human proteome data (Chapter 7). Chapter 8 reports an attempt to model the structure of the cGMP-binding GAF domain of human phosphodiesterase 5A (PDE5A) based on known related structures. The use of this model in identifying critical residues of PDE5A involved in binding to cGMP is also reported in Chapter 8. Chapter 2 describes the steps involved in setting up a database of phylogeny and alignment of homologous protein structures, as well as investigations into the relationship between sequence and structural variability among homologous proteins. In order to facilitate the analysis of homologous protein structures, a database of Phylogeny and ALIgnment (PALI) of protein structures has been constructed. The PALI database contains three-dimensional (3D) structure-dependent sequence alignments as well as structure-based phylogenetic trees of protein domains in various families derived from SCOP, a protein structural classification database. Several updates to PALI have been made to keep pace with the growing number of known protein structures. The current version (Release 2.1) comprises 844 families of homologous proteins involving 3,863 protein domain structures, with each family having at least two members. Each member within a family has been structurally aligned through rigid-body superposition with every other member in the same family using pairwise comparisons. In addition, multiple structural alignments have been performed using all members within a family. Every family with at least three members is associated with two dendrograms: one based on a structural dissimilarity metric and the other based on similarity of topologically equivalent residues for every pairwise alignment.
Apart from these multi-member families, there are 817 single-member families in the current version of PALI. A new feature in the current release of PALI is the integration of sequences of homologues from sequence databases with the 3D structural families. Alignments between homologous proteins of known 3D structure and those without experimentally determined structures are also provided for every family in the enhanced version of PALI. The database, along with several web-interfaced utilities, can be accessed at http://pauling.mbu.iisc.ernet.in/~pali. The quality of the multiple rigid-body structural alignments in PALI was compared with those obtained from the COMPARER software (developed by Sali and Blundell), which encodes a procedure based on properties and relationships at every residue position. The alignments from the two procedures agreed very well, and variations were observed only in low sequence similarity cases, often in loop regions. Validation of Direct Pairwise Alignment (DPA) between two proteins was carried out by comparing it with pairwise alignments extracted from multiple structural alignments of all members in the family (PMA). In general, DPA and PMA were found to differ only rarely. The structural distance metric used in the analysis combines root mean square deviation (RMSD) and the number of equivalences. It has been shown that the structural distance metric varies similarly to RMSD as a function of sequence identity. The correlation between sequence similarity and structural similarity has been observed to be poor in protein pairs with low sequence similarity.
Chapter 3 reports a comparative analysis of structure-dependent distribution of chemical groups in homologous protein structures. The chemical groups that constitute a protein can be classified into the following five categories: (1) apolar, (2) side-chain polar, (3) sulfur, (4) main-chain polar, and (5) spacers (atoms not involved in hydrogen bonding, hydrophobic interactions, or disulfide bond formation). The extent of representation of atoms belonging to each of these five groups within a sphere of 10 Å radius centered at equivalent Cα atoms has been compared in pairs of structurally aligned homologous proteins. Subsequently, their relationship to structural divergence has been investigated. The dataset comprised 1,441 pairs of structurally aligned homologous proteins extracted from the PALI database, such that both proteins in each pair have a crystallographic resolution of 2 Å or better. Average correlation coefficients greater than 0.6 in the number of representations of hydrophobic groups, main-chain polar groups, and spacers among the homologues have been observed for most sequence identities. This might imply that the hydrophobic environment, main-chain polarity, and spacers are well retained within homologous proteins and hence are critical for the integrity of the structure. It is interesting to note that spacer atoms are also important, in addition to main-chain polar atoms and hydrophobic groups, in the 3D structures. Poor average correlation in side-chain polar atomic groups and sulfur groups at low sequence similarities suggests that polar side-chain conformations could differ significantly, especially for low sequence similarity pairs, while the hydrophobic environment is maintained in conserved core regions. The quantitative relationship, modeled as a cubic polynomial between the RMSD and the corresponding deviation in the number of representations in the chemical groups in protein pairs, could be used in the gross validation of comparative models.
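The Chapter 3 comparison can be illustrated with a minimal sketch, which is not the thesis's own code. It counts atoms of each chemical group within 10 Å of each Cα position and then correlates the per-position counts between two aligned homologues. The group labels, the input layout, and the function names `neighbourhood_counts`, `pearson`, and `group_correlation` are assumptions made for illustration; the abstract does not specify how atoms are assigned to groups.

```python
import math

# Hypothetical group labels: the abstract names five categories, but the
# atom-to-group assignment is assumed to be supplied by the caller.
GROUPS = ("apolar", "sidechain_polar", "sulfur", "mainchain_polar", "spacer")


def neighbourhood_counts(ca_coords, atoms, radius=10.0):
    """Count atoms of each group within `radius` angstroms of each C-alpha.

    ca_coords: list of (x, y, z) tuples, one per aligned residue.
    atoms: list of ((x, y, z), group) tuples for all atoms in the protein.
    Returns one dict per C-alpha mapping group name to atom count.
    """
    counts = []
    for ca in ca_coords:
        per_group = dict.fromkeys(GROUPS, 0)
        for xyz, group in atoms:
            if math.dist(ca, xyz) <= radius:
                per_group[group] += 1
        counts.append(per_group)
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def group_correlation(counts_a, counts_b, group):
    """Correlate per-position counts of one group across two aligned proteins."""
    return pearson([c[group] for c in counts_a], [c[group] for c in counts_b])
```

Both count lists are assumed to follow the same order of structurally equivalent positions, so that entry i in one protein corresponds to entry i in the other.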
It is also noteworthy that the correlation in the extent of representations in each of the five groups among homologues, when considering only residues in the linear amino acid sequence, is lower than the correlation obtained when contributions from spatially proximal (within 10 Å) atomic groups are included. Chapter 4 deals with investigations into tolerance to the substitution of buried apolar residues by charged residues in homologous protein structures. The substitution of buried apolar groups by charged groups is generally expected to destabilize protein structures. In this chapter, the occurrence and accommodation of charged amino acid residues in proteins that are structurally equivalent to buried non-polar residues in homologues are investigated. A dataset of 1,852 homologous pairs of crystal structures of proteins, available at 2 Å or better resolution and identified from the PALI database, has been used to identify 14,024 examples of apolar residues in structurally conserved regions replaced by charged residues in homologues. Out of 2,530 cases of buried apolar residues, 1,677 of the equivalent charged residues in homologues are exposed, and the remaining charged residues are buried. These drastic substitutions are most often observed in homologous protein pairs with low sequence identity (<30%) and in large protein domains (>300 residues). Such buried charged residues in large proteins are often located at the interface of sub-domains or structural repeats. Beyond 7 Å of residue depth of buried apolar residues, or less than 4% solvent accessibility, almost all substituting charged residues are buried. It is also observed that acidic side-chains have a higher preference to be buried than positively charged residues. There is a preference for buried charged residues to be accommodated in the interior by forming hydrogen bonds with another side-chain rather than with the main-chain. 
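The kind of substitution scan described for Chapter 4 might look like the following sketch, which is illustrative only. It walks a pairwise alignment and reports columns where a buried apolar residue in one protein is matched by a charged residue in the homologue, noting whether that charged residue is itself buried. The residue sets, the `buried_cutoff` value, and the function name `apolar_to_charged` are assumptions; the abstract does not state the exact definitions used.

```python
# Illustrative residue classes (one-letter codes). These sets are assumptions,
# not the definitions used in the thesis.
APOLAR = set("AVLIMFW")
CHARGED = set("DEKR")


def apolar_to_charged(aligned_a, aligned_b, access_a, access_b,
                      buried_cutoff=7.0):
    """Find columns where a buried apolar residue in A aligns with a charged
    residue in B.

    aligned_a, aligned_b: equal-length aligned sequences, '-' marks a gap.
    access_a, access_b: percent relative solvent accessibility per alignment
        column (None at gap positions).
    buried_cutoff: hypothetical threshold (percent) below which a residue is
        treated as buried.
    Returns a list of (column, residue_a, residue_b, state_b) tuples, where
    state_b is 'buried' or 'exposed'.
    """
    lengths = {len(aligned_a), len(aligned_b), len(access_a), len(access_b)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same number of columns")
    hits = []
    for col, (ra, rb) in enumerate(zip(aligned_a, aligned_b)):
        if ra == "-" or rb == "-":
            continue  # only structurally equivalent positions are compared
        if ra in APOLAR and rb in CHARGED and access_a[col] < buried_cutoff:
            state = "buried" if access_b[col] < buried_cutoff else "exposed"
            hits.append((col, ra, rb, state))
    return hits
```

Calling the function a second time with the two proteins swapped covers substitutions in the other direction.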
The side-chains interacting with a buried charged residue are most often located in structurally conserved regions of the alignment. About 50% of observations involving hydrogen bonds between buried charged side-chains and another side-chain correspond to salt bridges. Among buried charged residues interacting with the main-chain, positively charged side-chains commonly form hydrogen bonds with main-chain carbonyls, while negatively charged residues are accommodated by hydrogen bonding with main-chain amides. These carbonyls and amides are usually located in loops that are structurally variable among homologous proteins. In a comparative modeling exercise, given the alignment of a target sequence to the template structure(s), if there is a buried non-polar residue in the template aligned with a charged residue in the target sequence, it is not straightforward to decide if the charged residue should be exposed or buried. Even if the decision is made to bury the charged residue, the question of how to accommodate the charge in the interior remains. Hence, this analysis provides valuable clues for decision-making in such circumstances. Chapter 5 describes a comparison of sequence-based and structure-based phylogenetic trees of homologous proteins and the inferences deduced regarding protein evolution. Based on the available known three-dimensional (3D) structures of proteins, it is clear that two potentially homologous proteins with insignificant sequence similarity could adopt a common fold and may perform the same or similar biochemical functions. As the extent of sequence similarity among such proteins could be comparable to that between two unrelated (non-homologous) proteins, it is appropriate to use similarities in 3D structures of proteins rather than base or amino acid sequence similarities when modeling protein evolution. 
This argument is supported by the observation that the extent of conservation improves progressively when moving from base sequences of genes to amino acid sequences of proteins to the 3D structures of proteins. This analysis presents an assessment of the usefulness of 3D structures in modeling the evolution of homologous proteins. A dataset of 108 protein domain families of known structure, with at least 10 members per family, extracted from the PALI database, has been used for this analysis. A comparison of the extent of structural dissimilarity and the extent of sequence dissimilarity among pairs of proteins, which serve as inputs for the construction of phylogenetic trees, has been performed using correlation coefficients. One observation is that the correlation between structure-based dissimilarity measures and sequence-based dissimilarity measures is usually good if the sequence similarity among homologues is about 30% or higher. For protein families with low sequence similarity among the members, the correlation coefficient between sequence-based dissimilarity and structure-based dissimilarity is poor (less than 0.2). In such cases, it has been found that the structure-based dendrogram clusters proteins with the most similar biochemical functional properties better than the sequence-similarity-based dendrogram. In the cases of multi-structural domain protein families, disulfide-rich protein families, and proteins with non-protein moieties (such as metal atoms and ATP-bound molecules), the correlation coefficient between sequence-based and structure-based dissimilarity measures is poor, even though sequence identity in some cases is higher than 30%. Chapter 6 deals with an analysis to derive inferences about the common evolutionary origin of proteins with a common fold but insignificant sequence similarity, using structure-based phylogenetic relationships. The evolutionary classifications of protein structures are best described in databases such as SCOP. 
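The Chapter 5 comparison of sequence-based and structure-based dissimilarity can be sketched as a correlation over the unique protein pairs of a family. The code below is a minimal illustration under stated assumptions, not the procedure used in the thesis: both inputs are taken to be square symmetric matrices over the same members in the same order, and the names `upper_triangle` and `matrix_correlation` are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def matrix_correlation(seq_dissim, struct_dissim):
    """Pearson correlation between two pairwise dissimilarity matrices.

    Each unique pair (i, j) with i < j contributes one data point, so a
    family of n members yields n * (n - 1) / 2 points.
    """
    xs = upper_triangle(seq_dissim)
    ys = upper_triangle(struct_dissim)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("matrices must match in size and hold at least 3 members")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant dissimilarities")
    return sxy / math.sqrt(sxx * syy)
```

Because the pairwise entries of a distance matrix are not independent of one another, a coefficient computed this way is best read as a descriptive summary rather than as a significance test.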
In this analysis, an attempt has been made to make an objective evolutionary classification of proteins within a fold. A dataset of 10 folds from the SCOP database, each containing three or more superfamilies, has been used in this analysis. All domains within a fold are aligned on the basis of their 3D structures using COMPARER. Structural distance measures have been calculated based on various structural parameters, such as gross structural divergence in length, insertion of secondary structures, and the relative orientation and displacement of equivalent secondary structures. The extent of correlation between the structural distance metric matrix (SDM matrix) between protein pairs within a fold and the corresponding SCOP matrix, which represents the SCOP classification, has been calculated. It has been found that 8 out of 10 folds show good correlation between the SDM matrix and the SCOP matrix, reflected in correlation coefficients greater than 0.6. In dendrograms obtained using structural features, families within a superfamily are generally clustered more closely than families from different superfamilies within the fold. This is notable because the extent of sequence similarity is very low both within and across superfamilies. Members of families within a SCOP superfamily share common functional features and are suggested to have an evolutionary relationship. Hence, classification based solely on structural similarities in secondary structural regions is generally consistent with SCOP classification. However, there are situations in which objective classification tends to deviate significantly from manual SCOP classifications. Average SDM values in some superfamilies could be as low as 45.3 units, while in others they could be as high as 175.7 units. This suggests that different superfamilies have diverged to varying extents, indicating limitations in employing a universal scoring scheme to deduce evolutionary relationships. 
Deviations from SCOP classification suggest that objective evolutionary classification could be influenced by factors such as the nature of fold architecture, how biochemical function is determined by such architecture, and the extent of divergence among superfamilies. Chapter 7 reports structural and functional domain assignments to gene products encoded in the human genome. A dataset of 21,318 gene products in the human genome, with greater than 90% match between NCBI (www.ncbi.nlm.nih.gov) and ENSEMBL (www.ensembl.org) databases, has been used to assign functional and structural domains. This has been accomplished using information about protein sequence families available in Pfam (developed by Bateman and coworkers) and structural families available in the PALI database. Approximately 50% of gene products in the proteome could be assigned at least one structural or functional domain using the sequence-to-profile matching procedure encoded in IMPALA (developed at NCBI, USA). Protein kinases are the most frequently occurring domains, with 543 domains assigned to various gene products, while the P-loop-containing nucleotide triphosphate hydrolases superfamily is the most represented, with 871 domains. More than 200 domains belonging to functional families with apparently no structural information have been assigned to 35 structural families using family profile matching techniques. New assignments of functional domains have been made to 33 human proteins potentially linked to genetically inherited diseases (http://www.ncbi.nlm.nih.gov/Omim). The analysis also revealed that 10 families, previously known to occur only in bacteria, archaea, or viruses, are found in the human genome. These families include Ferrous iron transport protein B and UPF0118 (a domain of unknown function), which are currently known to occur only in prokaryotes, and TT viral ORF1, which is a viral protein. 
Chapter 8 reports sequence analysis and structural modeling of the cGMP-binding GAF domain of cGMP-specific phosphodiesterase PDE5A. GAF domains are ubiquitous signaling domains found in all three kingdoms of life. A well-established function of GAF domains is cGMP binding. The crystal structure of GAFb of mouse PDE2A has been used to construct a model of GAFa of human PDE5A complexed with cGMP, as both share a high sequence identity of 48%. The other available crystal structure of a GAF domain is from yeast, which has less than 30% sequence identity with GAFa of human PDE5A. The yeast variant is shown not to bind cGMP; hence, the yeast GAF domain structure has not been considered as a template to construct the model of GAFa of human PDE5A. A model for GAFa of human PDE5A complexed with cGMP has been generated. The model suggests that the structure of the GAFa domain of PDE5A is likely similar to that of PDE2A, with critical interactions with cGMP being retained in GAFa. Based on the model, it can be deduced that a particular phenylalanine residue (Phe205) is involved in interaction with cGMP and may play an important role in binding to cGMP by GAFa. Mutation of Phe205 to either Ala or Gln leads to complete loss of binding to cGMP by the GAFa mutants, consistent with predictions based on modeling. A survey for GAF-like sequences has been conducted across the complete genomes of 43 prokaryotes using a profile-matching procedure encoded in PSI-BLAST software. There are 98 gene products that could be assigned GAF-like domains. Further analysis suggests that GAF-like sequences in prokaryotes do not conserve residues equivalent to cGMP-binding residues in eukaryotes. Hence, prokaryotic GAF-like domains may show variation in the nature of small molecules or proteins that bind to prokaryotic GAF domain homologues.
dc.language.iso	en_US
dc.relation.ispartofseries	T05460
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject	Structure factor
dc.subject	Stereochemical validation
dc.subject	NMR spectroscopy
dc.title	Comparative analyses of homologous protein sequences and structures : Inference on protein evolution
dc.type	Thesis
dc.degree.name	PhD
dc.degree.level	Doctoral
dc.degree.grantor	Indian Institute of Science
dc.degree.discipline	Science

Files in this item

Name: T05460.pdf
Size: 26.40 MB
Format: PDF

This item appears in the following Collection(s)

Molecular Biophysics Unit (MBU) [381]
