Use of strategically designed protein-like sequences in structure and function recognition
Abstract
The advent of high-fidelity protein sequencing techniques has led to a considerable wealth
of sequence data. However, the number of proteins for which 3-D structure and
functional features are known is considerably lower. In spite of improvements in structural
and functional genomics initiatives, most experimental procedures in use are
time-consuming. This has led to a formidable gap between sequence space and structure space,
which continues to widen. The structural coverage of the proteome of most organisms is
incomplete, which limits the information available on function and the implied biological
roles. Computational approaches could provide preliminary ideas on the structure and
function of proteins. Protein structures are far more conserved than sequences as a
consequence of the evolutionary pressure to maintain the structure and thereby its function.
Therefore, recognition of evolutionary relationships among proteins could serve as an
important step towards inferences on shared structural and functional features between
related proteins. Detailed comparative analysis of evolutionarily related proteins could
provide clues to protein structure and, consequently, its function. However, a notorious
problem is the detection of relationships between proteins characterized by low sequence
similarity (less than about 20%), because unrelated proteins also share poor sequence similarity.
The detection of relatedness between proteins that are distant in sequence serves as a nodal point
in structure and function recognition. Hence, most sequence search algorithms rely on
deriving these non-trivial relationships between distant homologues to further functional
annotation.
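To make the threshold above concrete, the following is a minimal sketch of how a pairwise score of this kind can be computed from two pre-aligned sequences. It uses percent identity as a simple stand-in for sequence similarity; the function name `percent_identity`, the convention of skipping gapped columns, and the toy sequences are assumptions made for illustration and are not taken from the thesis.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: both inputs are rows of the same pairwise alignment,
    so they must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns with a residue in both sequences
    matches = 0   # columns where the two residues are identical
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns (one of several possible conventions)
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy, made-up aligned sequences (hypothetical, for illustration only).
seq_a = "MKT-AYIAKQR"
seq_b = "MRSLAHVG-QK"
identity = percent_identity(seq_a, seq_b)
```

Other conventions exist, for example dividing by the full alignment length or by the shorter sequence, and they give different values for the same pair, so any threshold should be read together with the convention used.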
It has been observed that the limitation in identifying distant relatives is due to the
sparseness of the protein sequence space; i.e., if sequences intermediately related to the two
proteins (or two protein families) are unavailable, then the recognition of such relationships
purely using sequence data becomes challenging. The paucity of natural intermediate
sequences to direct profile or sequence search methods undermines even rigorous and
powerful search algorithms. In a protocol developed earlier in the group, protein-like
sequences, referred to as offspring, were computationally designed using the sequence
profiles of domain family pairs, referred to as parents, which are known to be distantly related.
These sequences were shown to serve as stepping stones for search methods to link
distant relatives. Plugging these intermediately related sequences into the database of
natural protein sequences addressed the challenges of the void and sparse regions of the
protein sequence space. Use of designed sequences showed a marked improvement in
structural fold coverage and augmented the ability of search protocols to link distant
relatives. Therefore, use of designed sequences in homology detection could enable
recognition of the structure and function of proteins that are so far uncharacterized.
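The abstract does not spell out how offspring are generated from the parent profiles, so the following is only a minimal sketch of one plausible scheme, not the protocol used in the thesis. It assumes the two parent profiles are already aligned column by column, blends their residue frequencies with a weight, and samples one residue per column. The names `blend_profiles`, `sample_offspring`, and `weight_a`, as well as the toy profiles, are hypothetical.

```python
import random
from typing import Dict, List

# A profile is a list of columns; each column maps residue -> frequency.
Profile = List[Dict[str, float]]


def blend_profiles(parent_a: Profile, parent_b: Profile, weight_a: float = 0.5) -> Profile:
    """Column-wise weighted average of two pre-aligned parent profiles (assumed scheme)."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parent profiles must have the same number of columns")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")

    blended: Profile = []
    for col_a, col_b in zip(parent_a, parent_b):
        residues = set(col_a) | set(col_b)
        blended.append({
            res: weight_a * col_a.get(res, 0.0) + (1.0 - weight_a) * col_b.get(res, 0.0)
            for res in residues
        })
    return blended


def sample_offspring(profile: Profile, rng: random.Random) -> str:
    """Draw one residue per column with probability proportional to its frequency."""
    sequence = []
    for column in profile:
        residues = sorted(column)  # fixed order keeps sampling reproducible for a given seed
        weights = [column[res] for res in residues]
        sequence.append(rng.choices(residues, weights=weights, k=1)[0])
    return "".join(sequence)


# Toy three-column parent profiles (made-up frequencies, for illustration only).
parent_a: Profile = [{"A": 0.8, "G": 0.2}, {"L": 1.0}, {"K": 0.6, "R": 0.4}]
parent_b: Profile = [{"S": 0.7, "A": 0.3}, {"I": 0.5, "L": 0.5}, {"R": 0.9, "K": 0.1}]

rng = random.Random(0)  # seeded so repeated runs give the same offspring
intermediate = blend_profiles(parent_a, parent_b, weight_a=0.5)
offspring = [sample_offspring(intermediate, rng) for _ in range(5)]
```

Varying `weight_a` between 0 and 1 would yield sequences lying closer to one parent family or the other, which is one simple way to picture the "stepping stone" role described above.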
The questions raised in this thesis start with exploring whether the designed
sequences can fold into the parent structural fold. Once it was seen that these designed proteins are likely
to adopt the structural fold of the parent families, they were employed in recognizing the
structure of protein families for which no structural information is yet available. Further,
an improvement in the approach was put forth to make homology-driven searches faster
and more sensitive by representing the sequences, both natural and designed, as hidden
Markov models. The use of intermediately related artificial sequences in probing functional
relationships between protein families was explored. The associations made through
designed sequences were examined for biological relevance by exploring the
conservation of putative functional residues. To strengthen the ability of the designed
intermediates in homology detection, the protein sequence space around
protein families was artificially expanded.