Current Protein & Peptide
Science
ISSN: 1389-2042

Volume 8, Number 2, April 2007
Contents
Workshop on the Definition of Protein Domains and
their Likelihood of Crystallization
Guest Editors: Oliviero Carugo, Kristina Djinovic, Sasha
Gorbalenya and Paul Tucker

Editorial Pp. 119-120
Predicting Experimental Properties of Proteins from
Sequence by Machine Learning Techniques Pp. 121-133
Pawel Smialowski, Antonio J. Martin-Galiano, Jürgen
Cox and Dmitrij Frishman
Predicting Protein Disorder and Induced Folding: From
Theoretical Principles to Practical Applications
Pp. 135-149
Jean M. Bourhis, Bruno Canard and Sonia Longhi
Production and Crystallization of Protein Domains:
How Useful are Disorder Predictions? Pp. 151-160
S. Quevillon-Cheruel, Nicolas Leulliot, Lucie Gentils,
Herman van Tilbeurgh and Anne Poupon
Prediction of Protein Disorder at the Domain Level
Pp. 161-171
Zsuzsanna Dosztányi, Márk Sándor,
Peter Tompa and István Simon
Towards Proteomic Approaches for the Identification
of Structural Disorder Pp. 173-179
Veronika Csizmók, Zsuzsanna Dosztányi, István
Simon and Peter Tompa
Computer-Assisted Protein Domain Boundary Prediction
Using the DomPred Server Pp. 181-188
Kevin Bryson, Domenico Cozzetto and David T. Jones
Prediction of Number and Position of Domain Boundaries
in Multi-Domain Proteins by Use of Amino Acid Sequence Alone
Pp. 189-195
Nikita V. Dovidchenko, Michail Yu Lobanov and Oxana V.
Galzitskaya
Posttranslational Modifications and Subcellular Localization
Signals: Indicators of Sequence Regions without Inherent 3D
Structure? Pp. 197-203
Birgit Eisenhaber and Frank Eisenhaber
Pipelines, Robots, Crystals and Biology: What Use
High Throughput Solving Structures of Challenging Targets?
Pp. 205-217
Christian Kambach
Abstracts
Editorial
This issue of Current Protein and Peptide Science is
devoted to the emerging field of likelihood of protein crystallization
and is related to the seminars and lectures presented recently
at the Workshop on the definition of protein domains and their
likelihood of crystallization, held in Vienna at the end of
June 2006 (http://www.embl-hamburg.de/workshops/2006/domains/),
where a number of scientists addressed these questions by
presenting and debating both experimental and computational
approaches.
The likelihood of crystallization must be predicted computationally
and/or determined experimentally in order to avoid time-consuming
experiments on samples whose three-dimensional structure
cannot be determined experimentally because of a series
of possible obstacles. For example, if a protein is natively
disordered, in the sense that it is not characterized by a
unique, well-defined conformation, its three-dimensional structure
cannot be determined experimentally, since it does not exist.
Moreover, a sequence construct that does not correspond to
a protein domain might be difficult to express because of
misfolding or reduced solubility.
This is particularly important in the structural genomics
era, in which high throughput approaches are applied to the
determination of three-dimensional structures of proteins,
the biochemical, biophysical, and biological features of which
were not previously studied. However, the preliminary analysis
and estimation of the likelihood of crystallization is not
relevant to proteomics studies only; it is also important
for traditional hypothesis-driven projects, in which
the optimization of the protein sample is equally important,
allowing one to generate samples suitable for structural studies
and/or improve the diffraction quality of crystals and obtain,
as a consequence, more reliable final results.
The first review, written by Dmitrij Frishman
and co-workers (Technische Universität München,
Germany), deals with the general problem of predicting, with
computational and bioinformatics methods, experimental success
in cloning, expression, soluble expression, purification and
crystallization of proteins. On the basis of publicly available
resources, sophisticated machine learning algorithms allow
one to make reasonable predictions. For example, solubility
predictions are reaching accuracies of over 70%.
The following four reviews are devoted to the prediction, determination,
and analysis of conformational disorder.
Sonia Longhi and co-workers (CNRS
and Universités Aix-Marseille I et II, France) present
an overview of several methods currently employed for predicting
protein conformational disorder and present some practical
examples of how they can be combined in order to achieve more
reliable predictions.
Anne Poupon and co-workers (Université
Paris-Sud, France) report the high throughput application
of disorder predictions in a structural genomics project on
soluble yeast proteins and focus their attention on strategies
for tailoring proteins into crystallizable domains.
Predictions of conformational disorder are analyzed also by
Zsuzsanna Dosztányi
and co-workers (Hungarian Academy of Sciences, Hungary),
though from a different perspective. The primary focus of
this review is the systematic interpretation of the scores
of different predictors.
Experimental approaches for the detection of protein disorder
are reviewed by Peter Tompa and
co-workers (Hungarian Academy of Sciences, Hungary), with
special emphasis on proteomic-scale methods, like heat- or
acid treatments with a subsequent two-dimensional electrophoresis/mass
spectrometry characterization.
Furthermore, the problem of defining domain boundaries on
the basis of the amino acid sequences is analyzed in the
next two reviews.
David Jones and co-workers (University
College London, United Kingdom) compare completely automatic
and computer-assisted methods and discuss the problem of benchmarking
different predictors. Furthermore, the DomPred server, which
includes predictors based on sequence comparisons and on secondary
structure predictors, is critically analyzed in order to allow
its optimal use.
A completely different prediction strategy is illustrated
by Oxana Galzitskaya and co-workers
(Russian Academy of Sciences, Pushchino, Russia), based on
the statistics of appearance of amino acid residues at domain
boundaries. Such an approach, which allows very fast computations,
is carefully compared to other, more demanding computational
methods.
Frank and Birgit Eisenhaber (Institute
of Molecular Pathology, Vienna, Austria) summarize their results
about bioinformatic predictions of post-translational modifications
and translocation signals. This is particularly important
because these chemical processes are often related to conformational
transitions, for example from conformational disorder to order,
and are also strictly interconnected with the expression systems
in which proteins can be produced and characterized.
Finally, Christian Kambach (Paul
Scherrer Institut, Switzerland) surveys the state of the art
in high throughput methods for production, purification, crystallization,
and characterization of complexes, membrane proteins and other
challenging targets. Particular attention is devoted not only
to structural genomics approaches but also to the applications
of such a powerful technology to hypothesis driven projects
focused on specific biological systems.
The Workshop on the definition of protein domains and their
likelihood of crystallization, held in Vienna at the end of
June 2006 (http://www.embl-hamburg.de/workshops/2006/domains/),
was organized by the Department of Biomolecular Structural
Chemistry of the University of Vienna in its capacity as
a Training and Dissemination Centre for the EU 6th Framework
VIZIER project (http://www.vizier-europe.org/). Support from
the Bioinformatics Integration Network II of GEN-AU (Austria)
is also gratefully acknowledged.
The VIZIER project (http://www.vizier-europe.org/) is aimed
at the comprehensive structural characterization of the replicative
apparatus of RNA viruses. Although exceptionally small,
RNA viruses encode large multi-domain polyproteins whose dissection
has presented a formidable challenge. Studies of many RNA
viruses, including coronaviruses, which have the largest RNA
genomes, revealed that bioinformatics, particularly comparative
sequence analysis, could empower researchers in meeting this
challenge. This experience is broadly utilized by VIZIER,
and it provided the impetus for the Workshop, whose scope, like
that of this issue, was however not restricted by the
origin of the proteins.
Oliviero Carugo
Department of General Chemistry
University of Pavia, Italy, and
Department of Biomolecular Structural Chemistry
University of Vienna
Austria
Kristina Djinovic-Carugo
Department of Biomolecular Structural Chemistry
University of Vienna
Austria
Alexander E. Gorbalenya
Department of Medical Microbiology
Leiden University Medical Center
The Netherlands
Paul Tucker
European Molecular Biology Laboratory, Hamburg
Germany

Predicting Experimental Properties of Proteins
from Sequence by Machine Learning Techniques
Pawel Smialowski, Antonio J. Martin-Galiano, Jürgen
Cox and Dmitrij Frishman
Efficient target selection methods are an important prerequisite
for increasing the success rate and reducing the cost of high-throughput
structural genomics efforts. There is a high demand for sequence-based
methods capable of predicting experimentally tractable proteins
and filtering out potentially difficult targets at different
stages of the structural genomic pipeline. Simple empirical
rules based on anecdotal evidence are being increasingly superseded
by rigorous machine-learning algorithms. Although the simplicity
of less advanced methods makes them easier for humans to understand,
more sophisticated formalized algorithms possess superior
classification power. The quickly growing corpus of experimental
success and failure data gathered by structural genomics consortia
creates a unique opportunity for retrospective data mining
using machine learning techniques and results in increased
quality of classifiers. For example, current solubility
prediction methods are reaching accuracies of over 70%.
Furthermore, automated feature selection leads to better insight
into the nature of the correlation between amino acid sequence
and experimental outcome. In this review we summarize methods
for predicting experimental success in cloning, expression,
soluble expression, purification and crystallization of proteins
with a special focus on publicly available resources. We also
describe experimental data repositories and machine learning
techniques used for classification and feature selection.

Predicting Protein Disorder and Induced Folding: From
Theoretical Principles to Practical Applications
Jean M. Bourhis, Bruno Canard and Sonia Longhi
In recent years there has been an increasing amount of experimental
evidence indicating that a large number of proteins are
either fully or partially disordered (unstructured). Intrinsically
disordered proteins are ubiquitous proteins that fulfil essential
biological functions while lacking highly populated and uniform
secondary and tertiary structure under physiological conditions.
Despite the large abundance of disorder, disordered regions
are still poorly detected. Recognition of disordered regions
in a protein is instrumental for reducing spurious sequence
similarity between disordered regions and ordered ones, and
for delineating boundaries of protein domains amenable to
crystallization. As presently none of the available automated
methods for prediction of protein disorder can be taken as
fully reliable on its own, we present a brief overview of
the methods currently employed highlighting their philosophy.
We show a few practical examples of how they can be combined
to avoid pitfalls and to achieve more reliable predictions.
We also describe the currently available methods for the identification
of regions involved in induced folding and provide a few practical
examples in which the accuracy of predictions was experimentally
confirmed.

Production and Crystallization of Protein Domains:
How Useful are Disorder Predictions?
S. Quevillon-Cheruel, Nicolas Leulliot, Lucie Gentils,
Herman van Tilbeurgh and Anne Poupon
The failure to produce and/or crystallize proteins is often
due to their modular structure. There is therefore considerable
interest in developing strategies for tailoring proteins into
crystallizable domains. In the framework of a Structural Genomics
Project on soluble yeast proteins, we have tested the expression
of numerous genetic constructs of our targets in order to
produce and crystallize proteins and protein domains and solve
their three-dimensional structure. In some cases, the choice
of the domain boundaries was guided by prediction from sequence
using various software packages, including Prelink, a home-made
prediction method for detecting unfolded regions. In other
cases, large numbers of constructs were generated using molecular
biology or biochemical methods.
In this paper, we analyze the results of the over-expression
in E. coli and crystallization of these constructs,
and compare these with the predictions that can be obtained
from our software and from others.

Prediction of Protein Disorder at the Domain Level
Zsuzsanna Dosztányi, Márk Sándor,
Peter Tompa and István Simon
Intrinsically disordered/unstructured proteins exist in a
highly flexible conformational state largely devoid of secondary
structural elements and tertiary contacts. Despite their lack
of a well defined structure, these proteins often fulfill
essential regulatory functions. The intrinsic lack of structure
confers functional advantages on these proteins, allowing
them to adopt multiple conformations and to bind to different
binding partners. The structural flexibility of disordered
regions hampers efforts to solve structures at high resolution
by X-ray crystallography and/or NMR. Removing such proteins/regions
from high-throughput structural genomics pipelines would be
of significant benefit in terms of cost and success rate.
In this paper we outline the theoretical background of structural
disorder, and review bioinformatic predictors that can be
used to delineate regions most likely to be amenable to structure
determination. The primary focus of our review is the interpretation
of prediction results in a way that enables segmentation of
proteins to separate ordered domains from disordered regions.

Towards Proteomic Approaches for the Identification
of Structural Disorder
Veronika Csizmók, Zsuzsanna Dosztányi, István
Simon and Peter Tompa
Intrinsically unstructured/disordered proteins (IUPs) and
protein domains lack a well-defined three-dimensional structure
under physiological conditions. Structural disorder imparts
advantages in many non-conventional functions, which poses
a significant challenge to our understanding of the structure-function
relationship of proteins. The general appreciation of this
fact, however, is hampered by the large gap in our knowledge
of IUPs, as we have biophysical data on fewer than 500 of them,
whereas bioinformatic predictions suggest at least several
thousand such proteins in the human proteome alone. Thus,
proteomic-scale identification and characterization of IUPs
will need to be implemented to fill this gap and advance our
knowledge in this important field. In this review we give
an insight into the various rationales of proteomic efforts
of identifying IUPs, and survey the handful of attempts that
combined enrichment of extracts for IUPs by heat- or acid
treatment with a subsequent two-dimensional electrophoresis/mass
spectrometry identification. Advantages and drawbacks of the
various approaches are outlined in anticipation of future
inventions in the field that will hopefully elevate IUP research
to the truly proteomic level.

Computer-Assisted Protein Domain Boundary Prediction
Using the DomPred Server
Kevin Bryson, Domenico Cozzetto and David T. Jones
Domain prediction from sequence is a particularly challenging
task, and currently, a large variety of different methodologies
are employed to tackle the task. Here we try to classify these
diverse approaches into a number of broad categories. Completely
automatic domain prediction from sequence alone is currently
fraught with problems, but this should not be so surprising
since human experts currently have significant disagreement
on domain assignment even when given the structures. It can
be argued that we should only test the domain prediction methods
on benchmark data that human experts agree upon and this is
the approach we take in this paper. Even for the data sets
on which human experts agree, automatic structure-based domain
assignment still cannot always agree, and so again it is still
unlikely that domain prediction methods will reliably obtain
correct results completely automatically. We make the argument
that computer-assisted domain prediction is a more
achievable goal. With this aim in mind, we present the DomPred
server. This server provides the user with the results from
two completely different categories of method (DPS and DomSSEA).
In this paper, each method is individually benchmarked against
one of the latest domain prediction benchmarks to provide
information about their respective reliabilities. A variety
of different benchmark scores are employed since the accuracy
of a domain prediction method depends critically on what types
of results one wishes to obtain (single/multi-domain classification,
domain number, residue linker positions, etc.). Also both
of these methods, implemented within the DomPred server, can
suggest alternative domain predictions, allowing the user
to make the final decision based on these results and applying
their own background knowledge to the problem. The DomPred
server is available from the URL: http://bioinf.cs.ucl.ac.uk/software.html.

Prediction of Number and Position of Domain Boundaries
in Multi-Domain Proteins by Use of Amino Acid Sequence Alone
Nikita V. Dovidchenko, Michail Yu Lobanov and Oxana V.
Galzitskaya
Prediction of protein domain boundaries is an important step
for the prediction of three-dimensional structure. The simple
method PDP has been developed for prediction of the number
and position of domain boundaries in multi-domain proteins
by use of amino acid sequence alone. The method uses an optimized
scale based on the statistics of appearance of amino acid
residues at domain boundaries. Our method demonstrates promising
results in comparison to other methods that do not use homologous
sequences. From the database of proteins that are targets
from CASP6 (Critical Assessment of Techniques for Protein
Structure Prediction) our program correctly assigned the number
of domains for ~80% of one-domain proteins and ~50% of two-domain
proteins. Our method offers three main advantages: it is very
simple, it is fast, and it uses a minimal number of parameters
in comparison with other methods.

Posttranslational Modifications and Subcellular Localization
Signals: Indicators of Sequence Regions without Inherent 3D
Structure?
Birgit Eisenhaber and Frank Eisenhaber
Given the huge number of otherwise uncharacterized
protein sequences, computer-aided prediction of posttranslational
modifications (PTMs) and translocation signals from amino
acid sequence becomes a necessity. We have contributed to
this multi-faceted, worldwide effort with the development
of predictors for GPI lipid anchor sites, for N-terminal N-myristoylation
sites, for farnesyl and geranylgeranyl anchor attachment as
well as for the PTS1 peroxisomal signal. Although the substrate
protein sequence signals for various PTMs or translocation
systems vary dramatically, we found that their principal architecture
is similar for all the cases studied. Typically, a small stretch
of the amino acid residues is buried in the catalytic cleft
of the protein-modifying enzyme (or the binding site of the
transporter). This piece most intensely interacts with the
enzyme and its sequence variability is most restricted. This
stretch is surrounded by linker segments that connect the
part bound by the enzyme with the rest of the substrate protein.
These residues are, as a trend, small with a flexible backbone
and polar. Due to the mechanistic requirements of binding
to the enzyme, we suggest that most PTM sites are necessarily
embedded into intrinsically disordered regions (except for
cases of autocatalytic PTMs, PTMs executed in the unfolded
state or non-enzymatic PTMs) and this issue requires consideration
in structural studies of proteins with complex architecture.
Surprisingly, some proteins carry sequence signals for posttranslational
modification or translocation that remain hidden in the normal
biological context but can become fully functional in certain
conditions.

Pipelines, Robots, Crystals and Biology: What Use
High Throughput Solving Structures of Challenging Targets?
Christian Kambach
With recent advances in the technology and software
underlying crystallographic structure solution, demands on
both output and functional significance of X-ray structures
are soaring. To achieve the required speed and quality also
with ever larger and more difficult targets, combining HTP
screening methods (robotics based or not) adopted from structural
genomics initiatives with thorough expertise and dedicated
characterization effort for each individual target is almost
a must. I present concepts, practical considerations, and
experiences on implementing an HTP technology platform for
structural and functional studies on complexes, membrane proteins
and other challenging targets. Emphasis lies on the environment
of small academic groups engaged exclusively in hypothesis
driven projects focused on specific biological systems. Suitability
of given HTP protocols for particular target classes, benchmarking
and quality control for procedures, and project management
issues at the interface between extensive, broad parameter
screening and intensive individual target work required by
non-SG amenable targets are discussed.