Title: Bioinformatics - Genes, Proteins and Computers - C. Orengo, et al., (BIOS, 2003) WW
File Size: 10.1 MB
Total Pages: 322
Table of Contents
Book Cover
1 Molecular evolution
	1.1 Molecular evolution is a fundamental part of bioinformatics
		1.1.1 A brief history of the gene
		1.1.2 What is information?
		1.1.3 Molecular evolution
			The algorithmic nature of molecular evolution
			Causes of genetic variation
			Classification of mutations
			Substitutional mutation
			Insertion and deletion
	1.2 Evolution of protein families
		1.2.1 Protein families in eukaryotic genomes
		1.2.2 Gene duplication
		1.2.3 Mechanisms of gene duplication
		1.2.4 The concept of homology
		1.2.5 The modularity of proteins
	1.3 Outlook: Evolution takes place at all levels of biological organization
	References and further reading
2 Gene finding
	2.1 Concepts
	2.2 Finding genes in bacterial genomes
		2.2.1 Mapping and sequencing bacterial genomes
		2.2.2 Detecting open reading frames in bacteria
	2.3 Finding genes in higher eukaryotes
		2.3.1 Approaches to gene finding in eukaryotes
			Genetic and physical maps
			Whole genome sequencing
			Finding individual genes
		2.3.2 Computational gene finding in higher eukaryotes
			The role of bioinformatics
			Recognition of eukaryotic genes
			Detection of exons and introns
	2.4 Detecting non-coding RNA genes
	References and further reading
3 Sequence comparison methods
	3.1 Concepts
	3.2 Data resources
		3.2.1 Databases
		3.2.2 Substitution matrices
	3.3 Algorithms for pairwise sequence comparison
		3.3.1 Challenges faced when aligning sequences
			Dot plots and similarity matrices for comparing protein sequences
			Dynamic programming
			Gap penalties
			Local implementation of dynamic programming
			Calculating pair-wise sequence identity
	3.4 Fast database search methods
		3.4.1 FASTA
		3.4.2 BLAST
	3.5 Assessing the statistical significance of sequence similarity
	3.6 Intermediate sequence searching
	3.7 Validation of sequence alignment methods by structural data
	3.8 Multiple sequence alignment
		3.8.1 Sequence weighting
		3.8.2 Deriving the consensus sequence and aligning other sequences against it
		3.8.3 Gap penalties
	References and further reading
4 Amino acid residue conservation
	4.1 Concepts
	4.2 Models of molecular evolution
		4.2.1 Neutralist model
		4.2.2 Selectionist model
	4.3 Substitution matrices
		4.3.1 Conservative substitutions
		4.3.2 Scoring amino acid similarity
		4.3.3 Elements of a substitution matrix
			Log odds ratio
			Log odds ratio of substitution
			Diagonal elements
		4.3.4 Constructing a Dayhoff matrix
			Construction of the raw PAM matrix
			Calculation of relative mutabilities
			The mutation probability matrix
			Calculating the log-odds matrix
		4.3.5 BLOSUM matrices
	4.4 Scoring residue conservation
		4.4.1 Exercises for a conservation score
		4.4.2 Guidelines for making a conservation score
	4.5 Methods for scoring conservation
		4.5.1 Simple scores
		4.5.2 Stereochemical property scores
		4.5.3 Mutation data scores
			Sum of pairs scores
			Performance
		4.5.4 Sequence-weighted scores
			A simple weighting scheme
			Incorporating sequence weights into conservation scores
	4.6 Insights and conclusions
	References and further reading
5 Function prediction from protein sequence
	5.1 Overview
	5.2 The similar sequence-similar structure-similar function paradigm
	5.3 Functional annotation of biological sequences
		5.3.1 Identification of sequence homologs
		5.3.2 Identification of conserved domains and functional sites
		5.3.3 Methods for building macromolecular motif representations
			Single motif methods
				Regular expressions
				Fuzzy regular expressions
			Multiple motif methods
				BLOCKS
			Full domain alignment methods
				Profiles
				Hidden Markov Models
				Statistical significance of profile and HMM hits
		5.3.4 Secondary database searching: a worked example
	5.4 Outlook: Context-dependence of protein function
	References and further reading
6 Protein structure comparison
	6.1 Concepts
		6.1.1 Single domains and multidomain proteins
		6.1.2 Reasons for comparing protein structures
			Analysis of conformational changes on ligand binding
			Detection of distant evolutionary relationships
			Analysis of structural variation in protein families
			Identification of common structural motifs
		6.1.3 Expansion of the Protein Structure Databank
	6.2 Data resources
	6.3 Algorithms
		6.3.1 Approaches for comparing 3-D structures
		6.3.2 Intermolecular approaches which compare geometric properties (rigid body superposition methods)
			Challenges faced in comparing distantly related structures
			Superposition methods: coping with indels by comparing secondary structures
			Superposition methods: coping with indels by comparing fragments
			Superposition methods: coping with indels by using dynamic programming
		6.3.3 Intramolecular methods which compare geometric relationships
			Distance plots
			Comparing intramolecular relationships: coping with indels by comparing secondary structures: Graph theory methods
			Comparing intramolecular relationships: coping with indels by comparing fragments
			Comparing intramolecular relationships: coping with indels by applying dynamic programming techniques
		6.3.4 Combining intermolecular superposition and comparison of intramolecular relationships
		6.3.5 Searching for common structural domains and motifs
	6.4 Statistical methods for assessing structural similarity
	6.5 Multiple structure comparison and 3-D templates for structural families
		6.5.1 3-D templates
	6.6 Conclusions
7 Protein structure classifications
	7.1 Concepts
	7.2 Data resources
	7.3 Protocols used in classifying structures
		7.3.1 Removing the redundancy and identifying close relatives by sequence-based methods
		7.3.2 Identifying domain boundaries
		7.3.3 Detecting structural similarity between distant homologs
			Pairwise structure alignment methods
			Multiple structure alignment and profile-based methods
	7.4 Descriptions of the structural classification hierarchy
		7.4.1 Neighborhood lists
		7.4.2 Comparisons between the different classification resources
		7.4.3 Distinguishing between homologs and analogs
		7.4.4 Recognizing homologs to structural families in the genomes: Intermediate sequence libraries
	7.5 Overview of the populations in the different structural classifications and insights provided by the classifications
		7.5.1 Populations of different levels in the classification hierarchy
8 Comparative modeling
	8.1 Concepts
	8.2 Why do comparative modeling?
	8.3 Experimental methods
		8.3.1 Building a model: Traditional method
			Identifying parent structures
			Align the target sequence with the parent(s)
			Identify the structurally conserved and structurally variable regions
			Inherit the SCRs from the parent(s)
			Build the SVRs
			Build the sidechains
			Refining the model
			Evaluating the model
		8.3.2 Building a model: Using MODELLER
		8.3.3 Building a model: Other methods
	8.4 Evaluation of model quality
	8.5 Factors influencing model quality
	8.6 Insights and conclusions
9 Protein structure prediction
	9.1 Concepts
	9.2 Strategies for protein structure prediction
		9.2.1 Comparative modeling
		9.2.2 Ab initio prediction
			Knowledge-based methods
			Simulation methods
		9.2.3 Fold recognition or threading
	9.3 Secondary structure prediction
		9.3.1 Principles of secondary structure prediction
		9.3.2 Intrinsic propensities for secondary structure formation of the amino acids
		9.3.3 Hydropathy methods and transmembrane helix prediction
		9.3.4 Predicting secondary structure from multiple sequence alignments
		9.3.5 Predicting secondary structure with neural networks
		9.3.6 Secondary structure prediction using ANNs
		9.3.7 Training the network
		9.3.8 Cross-validation
		9.3.9 Using sequence profiles to predict secondary structure
		9.3.10 Measures of accuracy in secondary structure prediction
	9.4 Fold recognition methods
		9.4.1 The goal of fold recognition
		9.4.2 Profile methods
		9.4.3 Threading methods
		9.4.4 Potentials of mean force
		9.4.5 Threading algorithms
		9.4.6 How well do threading methods work?
	9.5 Ab initio prediction methods
	9.6 Critically assessing protein structure prediction
	9.7 Conclusions
10 From protein structure to function
	10.1 Introduction
	10.2 What is function?
	10.3 Challenges of inferring function from structure
	10.4 Methods of functional evolution
		10.4.1 Gene duplication
		10.4.2 Gene fusion
		10.4.3 One gene, two or more functions
	10.5 Functional classifications
		10.5.1 Limitations of functional schemes
			Hierarchical structure
			Apples and oranges
	10.6 From structure to function
		10.6.1 Basic structure
			PDBsum
		10.6.2 Structural class
		10.6.3 Global or local structural similarity
			Homologous relationship
			Fold similarity and structural analogs
			Structural motifs and functional analogs
		10.6.4 Ab initio prediction
	10.7 Evolution of protein function from a structural perspective
		10.7.1 Substrate specificity
		10.7.2 Reaction chemistry
			Conserved chemistry
			Semi-conserved chemistry
			Poorly conserved chemistry
			Variation in chemistry
		10.7.3 Catalytic residues
		10.7.4 Domain enlargement
		10.7.5 Domain organization and subunit assembly
		10.7.6 Summary
	10.8 Structural genomics
		10.8.1 From structure to function: specific examples
			Mj0577: putative ATP-mediated molecular switch
			Mj0266: putative pyrophosphate-releasing XTPase
			M.jannaschii IMPase: bifunctional protein
			E.coli ycaC: bacterial hydrolase of unknown specificity
			E.coli yjgF
		10.8.2 Implications for target discovery and drug design
	10.9 Conclusions
	References and further reading
11 From structure-based genome annotation to understanding genes and proteins
	11.1 Concepts
	11.2 Computational structural genomics: structural assignment of genome sequences
	11.3 Methods and data resources for computational structural genomics
		11.3.1 Methods of assignment of structures to protein sequences
		11.3.2 Databases of structural assignments to proteomes
		11.3.3 Computational and experimental structural genomics: Target selection
	11.4 Proteome and protein evolution by computational structural genomics
		11.4.1 Protein families in complete genomes
			Common and specific domain families in the three kingdoms of life
			The power law distribution of domain family sizes
		11.4.2 Domain combinations in multidomain proteins
			Selection for a small proportion of all possible combinations of domains
			Large families have many types of neighboring domains
			Conservation of N-to-C terminal orientation of domain pairs
			Many combinations are specific to one phylogenetic group
			Multidomain proteins involved in cell adhesion
	11.5 Evolution of enzymes and metabolic pathways by structural annotation of genomes
		11.5.1 Small molecule metabolism in E.coli: an enzyme mosaic
		11.5.2 Types of conservation of domain duplications
		11.5.3 Duplications within and across pathways
		11.5.4 Conclusions from structural assignments to E.coli enzymes
	11.6 Summary and outlook
12 Global approaches for studying protein-protein interactions
	12.1 Concepts
	12.2 Protein-protein interactions
	12.3 Experimental approaches for large-scale determination of protein-protein interactions
		12.3.1 Yeast-two-hybrid screens
		12.3.2 Purification of protein complexes followed by mass spectrometry
	12.4 Structural analyses of domain interactions
		12.4.1 Interaction map of domain families
		12.4.2 The geometry of domain combinations
	12.5 The use of gene order to predict protein-protein interactions
		12.5.1 Conservation of gene order
		12.5.2 Gene fusions
	12.6 The use of phylogeny to predict protein-protein interactions
	12.7 Summary and outlook
	References and further reading
13 Predicting the structure of protein-biomolecular interactions
	13.1 Concepts
	13.2 Why predict molecular interactions?
	13.3 Practical considerations
		13.3.1 Protein-protein docking
		13.3.2 Protein-ligand docking
		13.3.3 What is needed?
	13.4 Molecular complementarity
		13.4.1 Shape complementarity
			Grid representation
		13.4.2 Property-based measures
			Hydrophobicity
			Electrostatic complementarity
			Amino acid conservation
		13.4.3 Molecular mechanics and knowledge-based force fields
		13.4.4 Experimental and knowledge-based constraints
	13.5 The search problem
		13.5.1 Constraint-based methods
			DOCK algorithm
			Matching
			Applications
		13.5.2 Complete search of space
			The Fourier transform method
			Scoring
			Applications
	13.6 Conformational flexibility
		13.6.1 Protein-ligand docking
			Multiple conformation rigid-body method
			Stochastic search methods
			Combinatorial search methods
		13.6.2 Protein flexibility
	13.7 Evaluation of models
	13.8 Visualization methods
	References and further reading
14 Experimental use of DNA arrays
	14.1 Concepts
		14.1.1 The cellular transcriptome
	14.2 Methods for large-scale analysis of gene expression
	14.3 Using microarrays
		14.3.1 Performing a microarray experiment
		14.3.2 Scanning of microarrays
	14.4 Properties and Processing of array data
	14.5 Data normalization
		14.5.1 Normalizing arrays
		14.5.2 Normalizing genes
	14.6 Microarray standards and databases
15 Mining gene expression data
	15.1 Concepts
		15.1.1 Data analysis
		15.1.2 New challenges and opportunities
	15.2 Data mining methods for gene expression analysis
	15.3 Clustering
		15.3.1 Hierarchical clustering
			Single linkage clustering
			Complete linkage clustering
			Clustering gene expression data
		15.3.2 K-means
		15.3.3 Self-organizing maps
		15.3.4 Discussion
	15.4 Classification
		15.4.1 Support vector machines (SVM)
		15.4.2 Discussion
	15.5 Conclusion and future research
	References and further reading
16 Proteomics
	16.1 The proteome
	16.2 Proteomics
		16.2.1 Why study the proteome?
		16.2.2 How to study the proteome
		16.2.3 The role of bioinformatics in proteomics
	16.3 Technology platforms in proteomics
		16.3.1 Protein separation technology
			2-D gel electrophoresis
			Liquid chromatography methods
			Affinity chromatography techniques for cell-map proteomics
		16.3.2 Protein annotation technology
			Protein annotation by mass spectrometry
			Combined HPLC and MS
			Protein quantification by mass spectrometry
		16.3.3 Protein chips
	16.4 Case studies
		16.4.1 Case studies in expression proteomics
		16.4.2 Case studies in cell-map proteomics
	16.5 Summary
	References and further reading
17 Data management of biological information
	17.1 Concepts
	17.2 Data management concepts
		17.2.1 Databases and database software
		17.2.2 Why are DBMSs useful?
		17.2.3 Relational and other databases
	17.3 Data management techniques
		17.3.1 Accessing a database
		17.3.2 Designing a database
			Logical design
			Physical design
		17.3.3 Overcoming performance problems
		17.3.4 Accessing data from remote sites
	17.4 Challenges arising from biological data
	17.5 Conclusions
	References and further reading
18 Internet technologies for bioinformatics
	18.1 Concepts
	18.2 Methods and standards
		18.2.1 HTML and CSS
		18.2.2 XML
		18.2.3 XSL and XSLT
		18.2.4 Remote procedure invocation
		18.2.5 Supporting standards
	18.3 Insights and conclusions
	References and further reading
Document Text Contents
Page 2


genes, proteins and computers

Page 161

Measures of accuracy in secondary structure prediction

Papers describing methods for secondary structure prediction will always quote estimates of their accuracy, hopefully
based on cross-validated testing. The most common measure of accuracy is known as a Q3 score, and is stated as a
simple percentage. This score is the percentage of a protein that is expected to be correctly predicted based on a three-
state classification, i.e. helix, strand or coil. Typical methods which work on a single protein sequence will have an
accuracy of about 60%. What does this mean? It means that on average 60% of the residues you try to assign to helix,
strand or other will be correctly assigned and 40% will be wrong (a residue predicted to be in a helix when it is really in
a β-strand for example). However, this is just an average over many test cases. For an individual case the accuracy might
be as low as 40% or as high as 80%. The Q3 score can be misleading. For example, if you predicted every residue in
myoglobin to be in a helix you would get a Q3 score of 80%—which is a good score. However, predicting every
residue to be in a helix is obviously nonsense and will not give you any useful information. Thus, the main problem with
Q3 as a measure is that it fails to penalize the network for over-predictions (e.g. non-helix residues predicted to be helix)
or under-predictions (e.g. helix residues predicted to be non-helix). Hence, given the relative frequencies of helix,
strand and coil residues in a typical set of proteins, it is possible to achieve a Q3 of around 50% merely by predicting
everything to be coil.
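
As a minimal sketch of the Q3 calculation described above (the function name and the toy ten-residue strings are illustrative assumptions, not taken from the book), the score is simply the percentage of positions where the predicted three-state label matches the observed one:

```python
def q3_score(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state label (H, E or C) is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("cannot score an empty protein")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical protein that is 80% helix, loosely mirroring the myoglobin example.
observed = "CHHHHHHHHC"
all_helix = "H" * len(observed)

print(q3_score(all_helix, observed))  # 80.0
```

The all-helix "prediction" scores 80% here even though it carries no information, which is the weakness of Q3 that the text points out.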

A more rigorous measure than Q3, introduced by Matthews in 1975, involves calculating the correlation coefficient for
each target class, e.g. the correlation coefficient for helices can be calculated as follows:

$$C_h = \frac{ab - cd}{\sqrt{(a+c)(a+d)(b+c)(b+d)}}$$

where
a is the number of residues correctly assigned to helix,
b is the number of residues correctly assigned to non-helix,
c is the number of residues incorrectly assigned to helix,
d is the number of residues incorrectly assigned to non-helix.

The correlation coefficients for helix (Ch), strand (Ce) and coil (Cc) are in the range +1 (totally correlated) to −1
(totally anti-correlated). Ch, Ce and Cc can be combined in a single figure (C3) by calculating the geometric mean of the
individual coefficients.
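
The per-class coefficient can be sketched as follows, using the standard Matthews formula with a and b as the correctly assigned class and non-class residues and c and d as the over- and under-predictions. The function name is illustrative, and returning 0.0 when the denominator vanishes is a convention assumed here rather than something the text specifies.

```python
import math


def class_correlation(predicted: str, observed: str, label: str) -> float:
    """Matthews correlation coefficient for one secondary structure class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    a = b = c = d = 0
    for p, o in zip(predicted, observed):
        if p == label and o == label:
            a += 1  # correctly assigned to the class
        elif p != label and o != label:
            b += 1  # correctly assigned to "not the class"
        elif p == label:
            c += 1  # over-prediction
        else:
            d += 1  # under-prediction
    denominator = math.sqrt((a + c) * (a + d) * (b + c) * (b + d))
    # Assumed convention: an undefined coefficient is reported as 0.0.
    return (a * b - c * d) / denominator if denominator else 0.0


observed = "CHHHHHHHHC"
print(class_correlation("H" * 10, observed, "H"))  # 0.0
print(class_correlation(observed, observed, "H"))  # 1.0
```

Under that convention, the all-helix prediction that earns a Q3 of 80% on this toy string gets a helix coefficient of 0.0, while a perfect prediction gets 1.0.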

Although the correlation coefficient is a useful measure of prediction accuracy, it does not assess how protein-like the
prediction is. How realistic a prediction is depends (to some extent) on the lengths of the predicted secondary
structural segments; a prediction of a single residue helix, for example, is clearly not desirable. However, just as
important is that the correct order of secondary structure elements is predicted for a given protein structure. Taking
myoglobin again as an example, in this case we would expect a good prediction method to predict six helices and no
strands. Predicting eight helices or even one long helix might give a good Q3 score, but clearly these predictions would
not be considered good. A measure which does take the location and lengths of predicted secondary structure segments
into account is the so-called segment overlap score (Sov score) proposed by Rost and, like the Q3 score, this is
expressed as a percentage. The definition of the Sov score is rather complicated, however, and so has not been as widely
used as the more intuitive Q3 score. Furthermore, at high accuracy levels (e.g. >70%) the Q3 scores and Sov scores are
highly correlated, and so to compare current prediction methods, Q3 scores are usually sufficient.


Page 162

Fold recognition methods

As we have seen, ab initio prediction of protein 3-D structures is not possible at present, and thus a general solution to
the protein-folding problem is not likely to be found in the near future. However, it has long been recognized that
proteins often adopt similar folds despite no significant sequence or functional similarity and that there appears to be a
limited number of protein folds in nature. It has been estimated that for ~70% of new proteins with no obvious
common ancestry to proteins with a known fold there will be a suitable structure in the database from which to build a
3-D model. Unfortunately, the lack of sequence similarity will mean that many of these will go undetected until after the 3-D
structure of the new protein is solved by X-ray crystallography or NMR spectroscopy. Until recently this situation
appeared hopeless, but in the early 1990s, several new methods were proposed for finding these similar folds.
Essentially this meant that results similar to those from comparative modeling techniques could be achieved without
requiring there to be homology. These methods are known as fold recognition methods or sometimes as threading methods.

The goal of fold recognition

Methods for protein fold recognition attempt to detect similarities between protein 3-D structures that are not
necessarily accompanied by any significant sequence similarity. There are many approaches, but the unifying theme is to
try to find folds that are compatible with a particular sequence (see Figure 9.3). Unlike methods which compare
proteins based on sequences only, these methods take advantage of the extra information made available by 3-D
structure. This now provides us with three ways of comparing two proteins: sequence against sequence, sequence against structure, and structure against structure.

In effect, fold recognition methods turn the protein-folding problem on its head: rather than predicting how a sequence
will fold, they predict how well a fold will fit a given sequence. There are three main components of a fold recognition
method (as illustrated in Figure 9.3): a library of folds (i.e. a set of proteins with known folds), a scoring function which
measures how likely it is that the target sequence adopts a given fold and some method for aligning the target sequence
with the fold so as to best fit as judged by the scoring function. The best scoring alignment on the best scoring fold is
used to generate a 3-D model for the target protein in the same way as simple sequence alignments are used to guide
comparative modeling techniques.
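
The three components named above (fold library, scoring function, alignment method) can be illustrated with a deliberately simplified sketch. Everything in it is a hypothetical stand-in: the two-state buried/exposed environment strings, the +1/-1 hydrophobicity score and the ungapped sliding placement are toy assumptions, whereas real fold recognition methods use far richer scoring functions and gapped alignment.

```python
# Toy assumption: these residues prefer buried positions, all others prefer exposed ones.
HYDROPHOBIC = set("AVLIMFWC")


def position_score(residue: str, environment: str) -> int:
    """Scoring function: +1 if the residue suits the position ('B' buried, 'E' exposed), else -1."""
    buried = environment == "B"
    return 1 if (residue in HYDROPHOBIC) == buried else -1


def best_ungapped_fit(sequence: str, fold_environments: str):
    """Alignment step: slide the sequence along the fold without gaps, keep the best (score, offset)."""
    best = None
    for offset in range(len(fold_environments) - len(sequence) + 1):
        score = sum(
            position_score(residue, fold_environments[offset + i])
            for i, residue in enumerate(sequence)
        )
        if best is None or score > best[0]:
            best = (score, offset)
    return best  # None when the sequence is longer than the fold


def recognize_fold(sequence: str, fold_library: dict):
    """Return the best-scoring fold name together with its (score, offset)."""
    fits = {}
    for name, environments in fold_library.items():
        fit = best_ungapped_fit(sequence, environments)
        if fit is not None:
            fits[name] = fit
    if not fits:
        raise ValueError("no fold in the library is long enough for this sequence")
    best_name = max(fits, key=lambda name: fits[name][0])
    return best_name, fits[best_name]


# Hypothetical two-fold library described by per-position burial.
fold_library = {"fold_A": "BEBEBEBE", "fold_B": "BBBBEEEE"}
print(recognize_fold("LSLSLSLS", fold_library))  # ('fold_A', (8, 0))
```

The alternating hydrophobic/polar target fits the alternating buried/exposed pattern of the hypothetical fold_A perfectly, so that fold would be the one passed on to model building.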

Profile methods

One of the earliest methods for protein fold recognition was proposed by Bowie, Lüthy and Eisenberg in 1991. Here
sequences are matched to folds by describing each fold in terms of the environment of each residue in the structure. The
environment of a residue can be described in many different ways, though Bowie et al. opted to describe it in terms of
local secondary structure (three states: strand, helix and coil), solvent accessibility (three states: buried, partially buried
and exposed), and the degree of burial by polar rather than apolar atoms. Propensities for the 20 amino acids to be


Page 321

secondary structure prediction 138–45
	using artificial neural networks 141–3
	cross-validation 143
	hydropathy methods 140
	intrinsic propensities of the amino acids 138–9
	measures of accuracy in 144–5
	from multiple sequence alignments 140–1
	with neural networks 141
	principles 138
	using sequence profiles 144
	transmembrane helix prediction 140
	training the network 143

segment overlap score (Sov score) 145
selectionist model of molecular evolution 50
selenocysteine 21
selenoproteins 21
self-organizing maps 239–41
semantic markup 281–2
semantic web 271, 274
semi-conserved chemistry 167
sequence alignment 29, 66
sequence-based methods in classifying protein structures 104–5
sequence comparisons 66
sequence databases 156
sequence homologs 66–7
sequence identity 30
	pair-wise 38
sequence similarity, assessing statistical significance of 42–4
sequence tagged sites (STSs) 23
sequence weighting 46, 63–4
Serial Analysis of Gene Expression (SAGE) 219
serial recruitment 189
SGML (Standard Generalized Markup Language) 274
shadow genes 21
Shannon’s entropy 60, 61
shape complementarity 205–6
short single motifs 78
short tandem repeats 7–8
sickle cell anemia 8

building 127
modeling 128

signals 26
silent mutations 8
silent pseudogene 11
similarity matrices 33–5
similar-sequence-similar structure-similar function paradigm 64–

simple scores 59–60

simple sequence length polymorphisms (SSLPs) 22
simulated annealing 98, 213
single domain proteins 81–2
single-linkage clustering algorithm 234–5, 237
single motif methods 69, 70–2
single-nucleotide polymorphisms (SNPs) 22
six-frame translation 20
SLOOP database 126
Smith-Waterman algorithm 38, 39, 41, 43, 44
SOAP 280–1
soft-ionization methods 249, 252
solution arrays 255
Sov score 145
spacefill model 205
Spearman’s rank correlation 231
specific window size 26
SPINE database 180
splice donor and acceptor sites 26
splice variants 218
SQL 264–5
SRS 269–70
SSAP method 96, 97, 98, 110
SSEARCH 38, 43
STAMP method 91, 92, 100, 110
steepest descents 129
stereochemical properties 60–2
stochastic context-free grammars (SCFGs) 27
stochastic search methods 213
strictly conserved (invariant) position 57
structural-based genome annotation 175–91
structural class 158
structural classification hierarchy 111–18
structural conserved regions 123, 124–5
structural embellishments, protein 90
structural genomics 171–4
structural plasticity 82
structurally variable regions 123, 124, 125–7
substitution matrices 31–2, 47, 50–7
substitutional mutation 8–9
substitutions 8
substrate specificity 166
subunit assembly 170
sum of pairs (SP) score 63
summary score matrix 97
superfamily 10


Page 322

SUPERFAMILY database 178, 180
superfolds 161, 208
supersecondary structures 83
supersites 161, 208
support vector machines (SVM) 242–3
SwissModel 123
SWISS-PROT 30, 68, 73, 104, 117, 250
symbol emission probability distribution 75
synonymous substitution 9

Tabu search 213
tandem mass spectrometry 252
tandem repeats 10
TATA box 26
TEmplate Search and Superposition (TESS) 165
termination codon 20
tertiary structure 81
TeX 273
threading 118, 137–8, 147, 148, 177
TIM barrel 161, 162
time of flight analyzer 252
tissue plasminogen activator (TPA) 15, 77, 78
training set 143
transcriptome 218
transfer RNA (tRNA) genes 27
transition state probability 75
transitions 9
transposition 12, 13
transposons 12, 13
transversions 9
tree-drawing method of Saitou 46
TrEMBL databases 30, 44, 123
tryptic peptides 249
tuples 32–3
twilight zone of sequence similarity 30

ubiquitin 49
unequal crossing-over 12, 13
unequal sister chromatid exchange 12, 13
unitary matrix 31
universal protein arrays 255
UNIX pattern matching 71

variable position 57
VAST 110, 115
virtual cells 246
Virtual Reality Modeling Language (VRML) 215

Viterbi algorithm 76

Watson, James 2
weighting sequences 63–4
whole domains 78
whole gene model 27
whole genome annotation 20

sequencing 23
window 38
within-cluster variation 238
World Wide Web 273, 281–2
World Wide Web Consortium (W3C) 274
WSDL (Web Services Description Language) 81
WU-BLAST implementation 40

xenologs 14
XML 270, 274, 277–9
X-ray crystallography 121–2
XSL 274, 279–80
XSLT 274, 279–80

yeast artificial chromosomes (YACs) 23
yeast-two-hybrid screens 195–6

zoo blotting 25
Z-score 42, 100, 115, 116

