Insights into Substrate Specificity in Sortase Enzymes from Structural Studies on a Novel Class of Housekeeping Sortase (SrtE) Identifying Functionally Important Cis-Peptide Containing Segments in Proteins and their utility in Molecular Function Annotation
Understanding protein function is fundamental to the fields of protein engineering and drug design. While most of the previous efforts in this direction have focused on the sequence-structure-function paradigm, recent studies have pointed to protein dynamics as being integral to its activity. The work in the current thesis follows this overall theme of obtaining insights into protein function from its structure and dynamics. It can be broadly divided into two sections. In the first section, the thesis candidate has tried to elucidate the residues modulating the substrate specificity of a particular family of enzymes, known as sortases, through structural and computational studies (including dynamics simulations) on a novel member in the family. This work has been carried out in collaboration with Dr. R.P. Roy, National Institute of Immunology, New Delhi (biochemical characterization was performed by Mr. Vijay Pawale at Dr. Roy's laboratory). In the second half of this thesis, the candidate has described a structure-based method involving the use of cis-peptide containing segments for the function annotation of proteins. The incorporation of dynamics information leads to an improvement of our annotation approach, which is also demonstrated. This part of the work has been carried out in collaboration with Dr. Debnath Pal, Department of Computational and Data Sciences, Indian Institute of Science. Following is a chapter-wise description of the overall layout of the thesis.

Section I: Insight into substrate specificity in sortase enzymes from structural studies on a novel housekeeping sortase of class E (SrtE)

Chapter 1| A brief account of sortases: This chapter provides a brief survey of the literature on sortases and the scope of the work presented in the thesis. Many surface proteins in Gram-positive bacteria are incorporated into the cell wall through covalent ligation by a class of cysteine transpeptidases known as Sortase.
These surface proteins contain a cell wall sorting signal (CWSS) which is recognized by sortase, enzymatically cleaved and subsequently joined covalently to the pentaglycine branch of lipid II (a peptidoglycan precursor) in general, which is finally incorporated into the peptidoglycan cell wall. Six classes of sortases have been identified on the basis of their sequence. These sortases differ in the substrate motif that they recognize and the function performed. The class A sortase (SrtA) is expressed ubiquitously in Gram-positive bacteria. It is involved in the cell surface anchoring of a large number of functionally distinct proteins which contain an LPXTG recognition motif in their CWSS, and is referred to as the 'house-keeping' sortase. Sortases of other types are not ubiquitous and are meant to perform specialized functions. Sortase B is involved in iron acquisition, sortase C in pilus formation and sortase D in sporulation. The substrate motifs recognized by these sortases are, in general, different from the recognition motif in SrtA substrates. Several Gram-positive bacteria with a high GC content in their genome have been suggested to use a sortase E (SrtE) instead of SrtA to perform the housekeeping activity. These sortase sequences share low identity with sortases of classes A-D. The substrates of SrtE have been proposed to contain an LAXTG recognition motif instead of LPXTG based on genomic analyses. Class F consists of sortases from several Actinobacteria. However, the biological function of these sortases is not well understood. To date, structures of sortases from classes A-D have been determined, all of which display an eight-stranded beta barrel fold (termed the sortase fold), a conserved catalytic triad of His-Cys-Arg and a TLXTC motif at the active site (C: catalytic Cysteine; X varies across the different classes of sortases). Sortase B and C are augmented by additional secondary structure features which are absent in sortase A.
SrtA from Staphylococcus aureus is the most well studied among sortases of known structure. Several of the surface proteins attached by sortases are responsible for bacterial virulence. SrtA deletion mutants have been found to exhibit reduced virulence without affecting cell viability. Moreover, the localization of sortase in the cell membrane and the absence of eukaryotic homologs have made sortase an attractive target for the development of novel therapeutics. In addition, the transpeptidase activity of sortase has found extensive applications in biotechnology. The prototype SrtA from Staphylococcus aureus is commonly used for these applications; however, its use is limited by its obligate Ca2+ ion-dependent activity and the stringent preference for an LPXTG motif. Hence, characterization of new sortases with altered substrate recognition profiles and rational modification of known sortases has tremendous potential for biotechnological applications and advancements. While sortases of classes A-D have been studied extensively to date and their structures determined, no structural data is available for a class E sortase. The thesis candidate has solved the first high resolution crystal structure of a putative housekeeping Sortase E in Streptomyces avermitilis (SavSrtE), a bacterium with a GC rich genome. Biochemical experiments performed by our collaborator on this protein have demonstrated Ca2+ independent transpeptidase activity and a preference for LAXTG-containing peptides as its cognate substrate over the LPXTG motif that is recognized by sortase A. Moreover, the protein exhibits a preference for small uncharged residues in the position succeeding the penta-peptide motif. This thesis documents the results of crystal structure analyses, molecular docking studies and dynamics simulations to understand the structural basis for these experimental findings. Finally, sequence analyses were performed to detect possible residues which modulate substrate specificity. 
Based on these analyses, mutations were performed. The thesis also documents the crystal structure solution and analysis of an active site mutant (residue T196 at the position X in the TLXTC motif).

Chapter 2| Methods for the analyses of Sortase E from S. avermitilis (SavSrtE): This chapter provides a description of the procedures used to carry out the thesis work. An N-terminus truncated construct (∆N50) of wild type SavSrtE and its mutant T196V were cloned, expressed and purified in the laboratory of our collaborator, Dr. R.P. Roy (NII, New Delhi), and provided to us for structure and sequence analyses. Initially, crystallization trials were carried out on the wild type protein using commercially available screening kits and the sitting drop vapor diffusion method. The condition which gave crystals was optimized further. Finally, diffraction quality crystals were obtained in a drop containing 1 μL of protein (4 mg/mL in 10 mM Tris-HCl buffer pH 7.2, 100 mM NaCl and 2 mM beta-mercaptoethanol) mixed with 1 μL solution of the crystallization condition containing 1.6 M ammonium sulfate, 0.1 M citric acid at pH 3.75 using the hanging drop vapor diffusion method. The crystals were cryo-protected in a 10% sucrose solution and diffraction data collected at the European Synchrotron Radiation Facility (BM-14, ESRF). The crystals diffracted to 1.65 Å. The protein crystallized in the P3₂21 space group with unit cell parameters a = b = 85.84 Å, c = 48.20 Å, α = β = 90°, γ = 120°. Calculation of Matthews coefficient indicated the presence of one molecule in the asymmetric unit. T196V mutant protein yielded diffraction quality crystals in the same condition as the wild type protein. The crystals were cryo-protected using sucrose and diffraction data were collected at the BM-14 beamline. The mutant crystals diffracted to 1.70 Å.
The protein crystallized in the P3₂21 space group with unit cell parameters a = b = 84.98 Å, c = 48.00 Å, α = β = 90°, γ = 120° and one molecule in the asymmetric unit. The quality of the datasets was assessed by SFCHECK and data were found to be of appropriate quality for structure solution. SavSrtE has low sequence identity (25 – 34%) to other class A sortases of known structure. Hence the scaled data, sequence information and model coordinates (sortase A from Streptococcus agalactiae, PDB ID: 3rcc) were submitted to the MR (molecular replacement) phasing option in the EMBL-Hamburg AutoRickshaw pipeline. The model generated from the server was used as input to PHASER for MR. The MR solution was subjected to one cycle of rigid body refinement followed by several cycles of restrained refinement using REFMAC from the CCP4 suite, with alternate rounds of inspection and manual model building in COOT for model improvement. The convergence of the refinement procedure was checked from the reduction in R-factors. The most essential refinement statistics for the final models of the wild type protein and T196V mutant are tabulated below.

Table 1
                                   Wild type (5GO5)   T196V (5GO6)
Resolution (Å)                     1.65               1.70
Rwork / Rfree (%)                  16.11 / 19.05      17.31 / 20.82
R.M.S. bond lengths (Å)            0.012              0.019
R.M.S. bond angles (°)             1.53               1.89
Average B-factors (Å²)
  Protein                          19.1               32.5
  Water                            32.6               42.4
  SO₄²⁻                            58.7               60.8
  Gly                              36.0               -
Ramachandran map statistics
  Most favoured region (%)         86.8               89.8
  Additional allowed region (%)    13.2               10.2
  Generously allowed region (%)    0.0                0.0
  Outliers (%)                     0.0                0.0

The genome of S. avermitilis was searched using the ScanProsite tool to identify putative substrates, details of which are also documented in this chapter. Additionally, the thesis candidate performed Mutual Information analysis on an alignment of 1569 sortase sequences from different classes to identify the residues possibly regulating substrate specificity.
Based on this analysis, mutations were performed of which the T196V mutant has been studied in this thesis. Finally, this chapter describes the protocol used to perform protein peptide docking and subsequent molecular dynamics simulations to understand how dynamics may influence substrate specificity.

Chapter 3| Analyses of SavSrtE sequence and structure: This chapter provides a description of the analyses on the wild type SavSrtE and the T196V mutant. The overall fold of SavSrtE is very similar to that observed in the structures of other sortases, although the sequence similarity to other classes is low. Variations are observed in the loop regions (longer β1/β2 and β6/β7 loops). The active site is composed of residues from the β2/H1 loop, β3/β4 loop, β4 strand, β6/β7 loop, β7 strand, β7/β8 loop and β8 strand. It also does not carry any cluster of electronegative residues close to the active site and therefore, is expected to have Ca2+ ion independent activity, which is observed in biochemical experiments (Dr. R.P. Roy's lab). Comparison with other housekeeping sortases showed that the β6/β7 loop in SavSrtE is in a closed conformation, indicating the presence of a preformed binding pocket for the LAXTG substrate binding, contrary to the prototype SrtA from Staphylococcus aureus which requires a Ca2+ ion to stabilize the closed conformation. Moreover, a small pocket is observed adjacent to the catalytic triad which contained electron density fitting a Gly molecule. This pocket is proposed to be the binding site for the second substrate that resolves the protein-peptide intermediate through a nucleophilic attack. Our docking simulations showed that a Gly of a triglycine moiety can be positioned in this pocket. Biochemical experiments established that SavSrtE recognizes the substrate motif LAXTG instead of LPXTG which is preferred by class A sortases. It also prefers Gly based nucleophiles as the second substrate.
Additionally, the protein is found to prefer neutral residues over charged residues in the position succeeding the Gly of the LAXTG motif. Structure analyses showed the presence of a bulky Tyr residue (Y112) at the active site pocket which, according to molecular docking studies, hinders the productive binding of Pro-containing peptides (LPXTG) over Ala-containing ones (LAXTG). The OH group of Y112 is involved in a hydrogen bond with the backbone nitrogen of the second Ala in the ALANT peptide but not in the Pro-containing peptide. Y112 is held rigidly in place via interactions with neighbouring residues and a network of hydrogen-bonded water molecules in the crystal structure. A Tyr residue is found to be present in an equivalent position in several sortase sequences of Class E, and may be a general feature responsible for the specificity of sortase Es to putative LAXTG-containing substrates in their genomes. It may be mentioned that class D sortases, which contain a Phe residue at the equivalent position, recognize the LPXTA substrate motif. The side chain of this Phe displays different rotamers in the NMR structure of Bacillus anthracis SrtD, pointing to its flexibility, whereas Y112 in S. avermitilis SrtE is rigid. In addition, molecular dynamics simulations on the models of protein-peptide complex (obtained from docking) showed that the two peptides have similar backbone dynamics, unlike the case of S. aureus SrtA where the Ala-containing peptide does not maintain a kinked conformation similar to the Pro-containing cognate peptide. Hence the Tyr at the active site appears to be the main factor behind the discrimination of the two peptides. Substrate sequences in the S. avermitilis genome contain small neutral residues in the position succeeding the Thr-Gly peptide bond in the substrate. This preference is also observed in biochemical assays. 
Docking calculations showed that the protein cannot accommodate large side chains in the site where this residue is positioned. To detect the residues involved in altering the substrate specificity of SavSrtE, we performed a multiple sequence alignment using 1569 sortase sequences and carried out mutual information (MI) analysis on this data. Our analysis implicated several residue pairs lining the active site pocket in modulating substrate specificity. These included the aforementioned Tyr residue as well as the position X (T196 in SavSrtE) in the TLXTC motif at the active site. Mutations were performed at these positions and crystallization trials performed. We could successfully crystallize and solve the structure of the T196V mutant, which has been documented in this thesis. The mutant protein has the same overall structure as the wild type. Moreover, the catalytic Cys residue was observed to be unmodified in this structure, compared to the wild type which was presumably altered by β-mercaptoethanol added during protein purification. The mutated residue (Val) was found to have a different side chain rotamer than T196. Moreover, the absence of any polar atom in the side chain of V196 disrupted the hydrogen-bonded network of water molecules observed at the active site in the wild type structure. Experiments on the mutant showed a reduction in activity, implying that T196 is important for substrate recognition. The altered side chain orientation of V196 is expected to be responsible for the reduction in activity, though a peptide-bound crystal structure would be necessary to clearly understand the mechanism. In this respect, future crystallization trials may be performed with modified peptides that bind covalently to the active site Cys residue, similar to the strategy employed for S. aureus SrtA and Bacillus anthracis SrtA. Our structure and sequence analyses have pointed to some residue positions responsible for the modified substrate specificity. 
While only one mutant has been characterized, the other mutants also need to be studied (through biochemical assays and structure analysis) to understand how they contribute to substrate recognition. In this context, double mutants may also be generated to understand the combined effect. For example, single mutations of E105 and E108 were found to reduce the activity of Staphylococcus aureus SrtA, while the double mutant resulted in Ca2+ ion independent activity. Additional structure and sequence analysis coupled with experiments are necessary to detect residues which may be mutated to enhance the activity of SavSrtE, similar to what has been performed for S. aureus SrtA. To summarize, our studies show that the substrate specificity of SavSrtE is different from that of class A sortases, and provide an explanation for it using structure analyses and computation. This altered specificity profile, orthogonal to that of S. aureus SrtA, and Ca2+ ion independent activity make it a potential candidate for use in simultaneous conjugation of multiple peptide substrates to their target. Moreover, this structure may be used firstly as a model to design inhibitors for housekeeping SrtEs from pathogenic organisms like Corynebacterium diphtheriae and Tropheryma whipplei. Secondly, most of the previous studies on inhibitor design for sortases documented small molecules or peptidomimetics binding to the pocket of the first substrate. Since distinct binding pockets have been observed for the two substrates in SavSrtE, this information may be used to build inhibitors targeting the second pocket or spanning both the pockets.
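The mutual information (MI) analysis mentioned in Section I can be illustrated with a minimal sketch. It computes plain Shannon MI between two columns of a multiple sequence alignment; the toy alignment, the function name and the gap handling are assumptions made for illustration, and the thesis may well use a corrected or normalized MI variant that is not reproduced here.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j, gap="-"):
    """Shannon MI (in bits) between alignment columns i and j.

    Sequences with a gap in either column are skipped (an assumption;
    other gap treatments are possible).
    """
    pairs = [(s[i], s[j]) for s in alignment if s[i] != gap and s[j] != gap]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) written in terms of raw counts
        mi += p_ab * math.log2(count * n / (left[a] * right[b]))
    return mi


# Hypothetical toy alignment: columns 2 and 3 co-vary, column 0 is invariant.
toy_alignment = ["TLATC", "TLATC", "TLVSC", "TLVSC"]
covarying = mutual_information(toy_alignment, 2, 3)
invariant = mutual_information(toy_alignment, 0, 2)
```

Column pairs with high MI change together across the alignment, which is why such pairs near the active site are natural candidates for mutagenesis; a fully conserved column carries no MI with any other column.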
Section II: Identifying functionally important cis-peptide containing segments in proteins and their utility in molecular function annotation

Chapter 4| Functionally important cis-peptide fragments in proteins: detection and relevance: This chapter describes the relevance of cis-peptides to protein function and a method to detect such functionally important cis-peptides in proteins. Cis-peptide bonds are comparatively rare in proteins due to the steric strain associated with the 1,4-atomic clash in the peptide chain. Consequently, only about 0.03% of Xaa-Xnp (Xaa: any amino acid; Xnp: any amino acid other than Pro) peptide bonds occur in the cis conformation; the occurrence is somewhat higher (5%) for imino peptide-containing Xaa-Pro cases. Despite their low occurrence, cis-peptides have been found to be evolutionarily conserved, pointing to their important role in structure and function. Cis-Xnp peptide bonds exhibit a significant disposition towards ligand-binding sites and dimerization interfaces, whereas cis-Pro bonds have been found to occur in a rare 'touch-turn' motif at functional sites. Cis-trans isomerization is expected to play a regulatory role in many cellular processes. Non-conservation of these peptides is implicated in the evolution of different function among similar protein folds. Hence, there has been a renewed interest in detecting cis-peptides from residue patterns and linking them to molecular function. The importance of proteins as molecular 'workhorses' makes it imperative to understand how they function. However, a vast majority of the proteins catalogued in public sequence and structure databases do not have experimentally verified functional annotation. Experimental approaches are inadequate to manually curate these large numbers of un-annotated proteins. This necessitates the use of computational function prediction tools.
The simplest prediction methods involve the assessment of similarity in sequence and three-dimensional structure with homologous proteins of known function. The presence of high overall similarity, however, does not predict function unambiguously since certain protein folds are associated with multiple functions while proteins with different folds may share functional traits. Often proteins with different global structure are found to have structural similarity at the local level of segments of residues that are responsible for the similarity in function. This has given rise to fragment-based (FB) function annotation methods. FB methods may involve locating functionally relevant surface patches or cavities formed by sequentially distant residues, or the presence of structurally conserved, contiguous residue fragments with proven relevance to function. The direct relevance of the cis-peptide bond to protein function suggests its use for the purpose of function annotation in a FB approach, yet no method exists to exploit it. This chapter describes a method using geometric clustering and level-specific Gene Ontology (GO) molecular-function (MF) terms to identify, in a statistically significant manner, cis-peptide embedded fragments (henceforth referred to as cis-fragments) in a protein linked to its molecular function. Such fragments were associated with GO MF based propensity value ≥ 20 at p-value ≤ 0.05, indicating the statistical significance of our results. The relevance of the identified cis-fragments to protein function was further verified through a literature survey. The features of these fragments are discussed in this chapter. Some of these fragments do not overlap with known PROSITE patterns, depicting the utility of these fragments as sequence patterns. Moreover, the thesis candidate identified contiguous stretches of functionally important trans-peptide fragments and cis-fragments forming extended structure-based functional signatures. 
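As background to the cis-fragments discussed in this chapter, a peptide bond is conventionally classified from its omega dihedral angle (Cα(i), C(i), N(i+1), Cα(i+1)): values near 0° are cis and values near ±180° are trans. The sketch below shows that classification; the 30° cutoff, the function names and the toy coordinates are assumptions for illustration and are not taken from the thesis.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of b0 and b2 perpendicular to the central bond
    v = _sub(b0, tuple(x * _dot(b0, b1) for x in b1))
    w = _sub(b2, tuple(x * _dot(b2, b1) for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def is_cis(ca_i, c_i, n_next, ca_next, cutoff=30.0):
    """True when |omega| is below the cutoff (assumed value, in degrees)."""
    return abs(dihedral(ca_i, c_i, n_next, ca_next)) < cutoff


# Hypothetical planar geometries, not real protein coordinates.
cis_example = is_cis((0.0, 1.0, 0.0), (0.0, 0.0, 0.0),
                     (1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
trans_omega = dihedral((0.0, 1.0, 0.0), (0.0, 0.0, 0.0),
                       (1.0, 0.0, 0.0), (1.0, -1.0, 0.0))
```

In practice the four atom positions would be read from a PDB entry for each consecutive residue pair, and an error in the deposited omega angle would propagate directly into this classification.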
Chapter 5| Use of functionally important cis-fragments in annotation: In this chapter, the candidate describes how a library of cis-peptide embedded fragments with proven association to molecular function can be useful for annotating proteins with known structure (and having cis-peptide) but unknown function. The functionally important fragments detected in the previous chapter were searched for exact matches in sequence and cis-peptide in a test set of PDB entries of known function at different thresholds of sequence redundancy and p-value. Additionally, the match or mis-match in GO MF term between the functionally important fragment and the test protein was also evaluated. To assess the efficiency of our method in annotation, true positive rate (TPR) and false positive rate (FPR) were calculated at each threshold as follows:

TPR = TP / (TP + FN) and FPR = FP / (FP + TN)

The following table explains how the numbers of cases with TP, FP, etc. were assigned.

Cases with match in sequence
                      Match in cis-peptide   No match in cis-peptide
Match in GO MF        TP                     FN
No match in GO MF     FP                     TN

The cis-fragments alone were sufficient to identify other proteins with similar function. Over different thresholds, TPR >0.91 and FPR <0.23 were observed. Annotation recall benchmarks interpreted using receiver-operator-characteristic-plot returned >0.9 area-under-curve, corroborating the utility of the annotation method. Further, the applicability of our method in fragment-based function annotation is illustrated for cases where homology-based annotation transfer is not possible. The work presented here adds to the repertoire of function annotation approaches and also facilitates engineering, design and allied studies around the cis-peptide neighbourhood of proteins. The results presented in chapters 4 and 5 have already been published (reprint enclosed) with the thesis candidate as the first author.
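The assignment of TP, FN, FP and TN described in this chapter, and the two rates derived from them, can be sketched as follows. The function names and the example counts are hypothetical and only illustrate the bookkeeping; they are not results from the thesis.

```python
def outcome(cis_match, go_match):
    """Label one sequence-matched case following the assignment table.

    cis_match: the test fragment also carries the cis-peptide.
    go_match:  the GO molecular-function term agrees.
    """
    if go_match:
        return "TP" if cis_match else "FN"
    return "FP" if cis_match else "TN"


def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) with TPR = TP/(TP+FN) and FPR = FP/(FP+TN)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical cases: (cis_match, go_match) for sequence-matched entries.
cases = [(True, True), (True, True), (False, True), (True, False),
         (False, False), (False, False), (False, False)]
counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
for cis_match, go_match in cases:
    counts[outcome(cis_match, go_match)] += 1

tpr, fpr = rates(counts["TP"], counts["FN"], counts["FP"], counts["TN"])
```

Repeating this calculation at each threshold of sequence redundancy and p-value gives the (FPR, TPR) points from which a receiver-operator-characteristic plot can be drawn.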
Chapter 6| Molecular dynamics information improves cis-peptide based function annotation of proteins: The preceding chapters have demonstrated the use of functionally relevant cis-peptide segments in a homology-independent, fragment match-based protein function annotation method. However, proteins are not static molecules; their dynamics is integral to their activity. Hence we have incorporated the dynamics (obtained using an in-house coarse-grained forcefield) of functionally important cis-peptide segments in our annotation method. This is the first study to include both static and dynamics information to improve the prediction of protein molecular function. To ascertain the improvement upon incorporating dynamics, the ACV-based dynamics profiles (details in chapter) were compared in a dataset consisting of 102 pairs each of positive data (PDB entries with match in fragment sequence and cis-peptide) and negative data (PDB entries with match in fragment sequence but no match in cis-peptide). Our analyses showed that using only cis-peptide information gave fewer false positives and a low FPR (0.11), which is desirable, but also a relatively low TPR (0.72). This is due to the large number of FN cases (trans-peptide with matching GO MF), which can arise when the cis-fragment undergoes cis-trans isomerization to accomplish its function and coordinates have been obtained for the segment in the test data in the trans-state, or if there is an error in assignment of the omega angle during structure solution. On the other hand, using only dynamics information increases the numbers of both true and false positives and hence the TPR (0.95) and FPR (0.51). This is due to false-positive matches for cases where fragments with similar secondary structure show similar dynamics, but the proteins do not share a common function.
Combining the predictions from the two methods reduces errors while detecting the true matches, thereby enhancing the utility of our method in function annotation (TPR: 0.95 and FPR: 0.07). Subsequently, we have combined static and dynamics information to annotate proteins of unknown function. A combined approach, therefore, opens up new avenues of improving existing automated function annotation methodologies. The work described in this chapter has been submitted to a peer-reviewed journal. Future prospects include the development of a web server to facilitate the application of our method by a wide research community. A possible improvement includes identification and comparison of the dynamics of additional sites close to the identified cis-fragment, in an automated manner, to improve the accuracy of our annotation.

Appendix 1 gives a description of the results of biochemical experiments performed in the laboratory of our collaborator Dr. R.P. Roy, NII, New Delhi. Appendix 2 contains additional data supplementary to chapter 4. Appendix 3 provides additional data supplementary to chapter 5. Appendix 4 provides additional data supplementary to chapter 6. Appendix 5 contains reprints of publications.