Abstract:
We describe a relatively simple dynamic programming algorithm for the special case of binary trees. We then show that the general case of multifurcating trees can be treated by interleaving solutions to certain auxiliary Maximum Weighted Matching problems with an extension of this dynamic programming approach, resulting in an overall polynomial-time solution of complexity $O(n^4 \log n)$ in the number n of leaves. The source code of a C implementation can be obtained under the GNU General Public License from http://www.bioinf.uni-leipzig.de/Software/Targeting. For binary trees, we furthermore discuss several constrained variants of the MPP as well as a partition function approach to the probabilistic version of the MPP.

The algorithms introduced here make it possible to solve the MPP also for large trees with high-degree vertices. This has practical relevance in the field of comparative phylogenetics, for example in the context of phylogenetic targeting, i.e., data collection under resource limitations.

Comparisons among species are fundamental for elucidating evolutionary history. In evolutionary biology, for example, they can be used to detect character associations [1-3]. In this context, it is important to use statistically independent comparisons, i.e., any two comparisons must have disjoint evolutionary histories (phylogenetic independence). The Maximal Pairing Problem (MPP) is the prototype of a class of combinatorial optimization problems that models this situation: given an arbitrary phylogenetic tree T and weights ωxy for the paths between any pair of leaves (x, y) (representing a particular comparison), what is the collection of leaf pairs with maximum total weight such that the connecting paths do not intersect in edges?

Algorithms for special cases of the MPP that are restricted to binary trees and equal weights (which thus simply maximize the number of pairs) have been described, but not implemented [2]. Since different pairs of taxa may contribute
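The dynamic programming idea for the binary-tree case can be sketched as follows. This is a minimal illustration, not the released C implementation: the nested-tuple tree representation, the function name `solve`, and the `frozenset` weight keys are all choices made for this sketch. Each vertex carries a "closed" optimum (all chosen paths stay inside its subtree) and one "open" optimum per leaf that is kept exposed, i.e. whose path up to the current vertex is reserved so it may still be paired higher in the tree.

```python
def solve(tree, w):
    """tree: leaf name (str) or a pair (left, right); w: dict mapping
    frozenset({x, y}) to the path weight of leaf pair (x, y).
    Returns (closed, open_) for the subtree rooted at this vertex:
      closed      -- best total weight of edge-disjoint paths inside it
      open_[x]    -- best total weight with leaf x left exposed
    """
    if isinstance(tree, str):                 # a leaf is trivially exposed
        return 0, {tree: 0}
    lc, lo = solve(tree[0], w)
    rc, ro = solve(tree[1], w)
    # Close the vertex: keep the subtrees independent, or join one
    # exposed leaf from each side into a path through this vertex.
    closed = lc + rc
    for x, vx in lo.items():
        for y, vy in ro.items():
            closed = max(closed, vx + vy + w[frozenset((x, y))])
    # Keep one leaf exposed: the other subtree must then be closed.
    open_ = {}
    for x, vx in lo.items():
        open_[x] = vx + rc
    for y, vy in ro.items():
        open_[y] = vy + lc
    return closed, open_
```

At the root, `closed` is the MPP optimum; the recursion visits each vertex once and each (vertex, leaf) combination a constant number of times, giving the quadratic behavior typical of this style of tree DP on binary trees.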

Abstract:
While the folding energies of an RNA and its reverse complement are similar, the differences are sufficient, at least in conjunction with substitution patterns, to discriminate between structured RNAs and their complements. We present here a support vector machine that reliably classifies the reading direction of a structured RNA from a multiple sequence alignment and provides a considerable improvement in classification accuracy over previous approaches.

RNAstrand is freely available as a stand-alone tool from http://www.bioinf.uni-leipzig.de/Software/RNAstrand and is also included in the latest release of RNAz, a part of the Vienna RNA Package.

Genome-wide computational screens for structured ncRNA genes in mammals [1-3], urochordates [4], nematodes [5], and drosophilids [6] have resulted in tens of thousands of putative structured ncRNAs. Functional and structural annotation of these predictions thus becomes a pressing problem. Evidence for evolutionary conservation of RNA structure alone usually does not distinguish well between the two possible reading directions. This information, however, is crucial even for the most basic annotation. Direction information is needed, e.g., to determine whether a conserved RNA motif is intronic, located within a coding sequence or an untranslated exon, an independent ncRNA, or one of the many classes of small RNAs associated with other transcripts [7].

The RNAstrand tool is designed specifically to predict the reading direction of a multiple sequence alignment under the assumption that the alignment contains an evolutionarily conserved RNA secondary structure. The task at hand is a conceptually simple two-class prediction problem, for which we employ a support vector machine (SVM) [8]. The basic idea is to devise descriptors that exploit both the small asymmetry in the energy rules [9] and the asymmetric effect of GU base pairs.

Small differences in the measured folding energies between an RNA molecule and its reverse complement
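To make the notion of a direction-asymmetric descriptor concrete, here is a deliberately simple toy example. This is not the actual RNAstrand feature set (which is built from folding energies and GU-pair statistics of the whole alignment); it only illustrates why reverse complementation leaves a detectable signature: G and U can pair, but their complements C and A cannot.

```python
def revcomp(seq):
    """Reverse complement of an RNA sequence (A-U, G-C)."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def gu_descriptor(seq):
    """Toy descriptor: (#G + #U) - (#A + #C).
    Reverse complementation maps G<->C and A<->U, so this quantity
    flips its sign -- it carries reading-direction information that
    an SVM can combine with energy-based features."""
    return sum(1 if b in "GU" else -1 for b in seq)
```

Any descriptor with this antisymmetry property separates the two reading directions in feature space, which is the property the actual energy- and GU-based RNAstrand descriptors are designed to have.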

Abstract:
This editorial is the first article published in our new journal Algorithms for Molecular Biology. By starting this journal, we aim to provide an online and open access resource for the growing research community in the field of algorithmic bioinformatics. Bioinformatics or computational biology is a very broad and heterogeneous field of research, ranging from applied data analysis and IT support for life-science projects to probabilistic modelling, algorithm development and complexity analysis. Today, there is a variety of established and newly founded bioinformatics journals covering these diverse areas. Some of these journals are general-purpose journals covering the whole range of research topics in computational biology, for example Bioinformatics or BMC Bioinformatics. Other journals specialise in applied bioinformatics, where software tools are used as a means to obtain biological insights, e.g. In Silico Biology and PLoS Computational Biology. There are also some existing journals that focus on algorithmic topics in bioinformatics, e.g. Journal of Computational Biology, Journal of Bioinformatics and Computational Biology or IEEE/ACM Transactions on Computational Biology and Bioinformatics. These algorithmic journals are run in the traditional way, where publishing is free of charge but readers or their libraries have to pay subscription fees to obtain access to published research results.

During the last few years, online open access journals have become popular in many areas of research. In contrast to established publishing models, these journals provide free and unlimited access to research articles for everyone connected to the internet. Online publishing offers a rapid way of publishing research results, since every article is ready to be published immediately after formal acceptance. Above all, articles in open access journals are highly visible, since access is not limited to those whose libraries can afford increasingly expensive subscription fees. Pu

Abstract:
Here we present a modified variant of progressive sequence alignment that addresses both issues. Instead of pairwise alignments, we use exact dynamic programming to align sequence or profile triples. This avoids a large fraction of the ambiguities arising in pairwise alignments. In the subsequent aggregation steps we follow the logic of the Neighbor-Net algorithm, which constructs a phylogenetic network by stepwise replacing triples by pairs instead of combining pairs to singletons. To this end, the three-way alignments are subdivided into two partial alignments, at which stage all-gap columns are naturally removed. This alleviates the "once a gap, always a gap" problem of progressive alignment procedures.

The three-way Neighbor-Net based alignment program aln3nn is shown to compare favorably to other progressive alignment tools on both protein sequences and nucleic acid sequences. In the latter case one can easily include scoring terms that consider secondary structure features. Overall, the quality of the resulting alignments in general exceeds that of clustalw and other multiple alignment tools, even though our software does not include heuristics for context-dependent (mis)match scores. (The software is freely available for download from reference [1].)

High-quality multiple sequence alignments (MSAs) are a prerequisite for many applications in bioinformatics, from the reconstruction of phylogenies and the assessment of evolutionary rate variations to gene finding and phylogenetic footprinting. A large part of comparative genomics thus hinges on our ability to construct accurate MSAs. Since the multiple sequence alignment problem is NP-hard [2], with the computational cost growing exponentially with the number of sequences, it has been a long-standing challenge to devise approximation algorithms that are both efficient and accurate. These approaches can be classified into progressive, iterative, and stochastic alignment algorithms. The most widely used tools such as
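The exact three-way dynamic programming step can be sketched as follows. This is a bare-bones score-only sketch under simple sum-of-pairs scoring with linear gap costs; the actual aln3nn recursion works on profiles and can include structure-aware terms, and the parameter names here are illustrative choices. Each cell considers the seven possible gap patterns of an alignment column over three sequences.

```python
def align3(a, b, c, match=1, mismatch=-1, gap=-2):
    """Exact three-way alignment score by 3D dynamic programming,
    with sum-of-pairs scoring over the three induced pairwise
    alignments. O(|a|*|b|*|c|) time and space."""
    def sub(x, y):
        return match if x == y else mismatch
    n, m, l = len(a), len(b), len(c)
    NEG = float("-inf")
    dp = [[[NEG] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    dp[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(l + 1):
                best = dp[i][j][k]
                if i and j and k:     # column with all three residues
                    best = max(best, dp[i-1][j-1][k-1]
                               + sub(a[i-1], b[j-1]) + sub(a[i-1], c[k-1])
                               + sub(b[j-1], c[k-1]))
                if i and j:           # residue-residue-gap columns
                    best = max(best, dp[i-1][j-1][k] + sub(a[i-1], b[j-1]) + 2*gap)
                if i and k:
                    best = max(best, dp[i-1][j][k-1] + sub(a[i-1], c[k-1]) + 2*gap)
                if j and k:
                    best = max(best, dp[i][j-1][k-1] + sub(b[j-1], c[k-1]) + 2*gap)
                if i:                 # residue-gap-gap columns
                    best = max(best, dp[i-1][j][k] + 2*gap)
                if j:
                    best = max(best, dp[i][j-1][k] + 2*gap)
                if k:
                    best = max(best, dp[i][j][k-1] + 2*gap)
                dp[i][j][k] = best
    return dp[n][m][l]
```

The cubic cost per triple is what restricts the exact step to triples rather than larger groups, and is the reason the Neighbor-Net style aggregation over triples is attractive.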

Abstract:
We present a Markov Chain Monte Carlo method for sampling cycle length in large graphs. Cycles are treated as microstates of a system with many degrees of freedom. Cycle length corresponds to energy such that the length histogram is obtained as the density of states from Metropolis sampling. In many growing networks, mean cycle length increases algebraically with system size. The cycle exponent $\alpha$ is characteristic of the local growth rules and not determined by the degree exponent $\gamma$. For example, $\alpha=0.76(4)$ for the Internet at the Autonomous Systems level.
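The Metropolis scheme described above can be illustrated on a toy scale. The sketch below brute-force enumerates the simple cycles of a small graph so that the proposal is a uniform draw from the cycle set; this is only for illustration, since the point of the MCMC method is precisely to sample cycles via local moves without enumeration. Energy is cycle length, and the biased length histogram is reweighted by $e^{\beta E}$ to recover the density of states up to normalization.

```python
import math
import random
from itertools import combinations, permutations

def all_cycles(n, edges):
    """Enumerate the simple cycles of a small graph as frozensets of
    edges (brute force -- toy graphs only)."""
    es = set(map(frozenset, edges))
    found = set()
    for k in range(3, n + 1):
        for verts in combinations(range(n), k):
            for tail in permutations(verts[1:]):
                cyc = (verts[0],) + tail
                ce = [frozenset((cyc[i], cyc[(i + 1) % k])) for i in range(k)]
                if all(e in es for e in ce):
                    found.add(frozenset(ce))
    return list(found)

def density_of_states(cycles, beta=1.0, steps=20000, seed=0):
    """Metropolis sampling of cycles with energy E = cycle length.
    The proposal (uniform over all cycles) is symmetric, so the
    acceptance rule min(1, exp(-beta*dE)) samples P(c) ~ exp(-beta*E);
    reweighting the histogram by exp(beta*E) yields g(E) up to a
    constant."""
    rng = random.Random(seed)
    cur = rng.choice(cycles)
    hist = {}
    for _ in range(steps):
        prop = rng.choice(cycles)
        dE = len(prop) - len(cur)
        if dE <= 0 or rng.random() < math.exp(-beta * dE):
            cur = prop
        hist[len(cur)] = hist.get(len(cur), 0) + 1
    return {E: h * math.exp(beta * E) for E, h in hist.items()}
```

For the complete graph K4 there are four triangles and three 4-cycles, so the recovered density of states should satisfy g(3)/g(4) ≈ 4/3.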

Abstract:
Hard combinatorial optimization problems deal with the search for minimum-cost solutions (ground states) of discrete systems under strong constraints. A transformation of the state variables may enhance computational tractability. It has been argued that such state encodings should be chosen invertible so as to retain the original size of the state space. Here we show how redundant, non-invertible encodings enhance optimization by enriching the density of low-energy states. In addition, smooth landscapes may be established on the encoded state spaces to guide local search dynamics towards the ground state.

Abstract:
Runs of three or more consecutive G along the probe sequence, and in particular triply degenerated G at its solution end (the (GGG)1-effect), are associated with exceptionally large probe intensities on GeneChip expression arrays. This intensity bias is related to non-specific hybridization and affects both perfect match and mismatch probes. The (GGG)1-effect tends to increase gradually for microarrays of later GeneChip generations. It was found for DNA/RNA as well as for DNA/DNA probe/target hybridization chemistries. Amplification of sample RNA using T7 primers is associated with strong positive amplitudes of the G-bias, whereas alternative amplification protocols using random primers give rise to much smaller and partly even negative amplitudes.

We applied position-dependent sensitivity models to analyze the specifics of probe intensities in the context of all possible short sequence motifs of one to four adjacent nucleotides along the 25meric probe sequence. Most of the longer motifs are adequately described using a nearest-neighbor (NN) model. In contrast, runs of degenerated guanines require explicit consideration of next-nearest neighbors (GGG terms). Preprocessing methods such as vsn, RMA, dChip, MAS5 and gcRMA only insufficiently remove the G-bias from the data.

Position- and motif-dependent sensitivity models account for sequence effects on oligonucleotide probe intensities. We propose a position-dependent NN+GGG hybrid model to correct the intensity bias associated with probes containing poly-G motifs. It is implemented as a single-chip based calibration algorithm for GeneChips which can be applied in a pre-correction step prior to standard preprocessing.

Fig. 1a shows the surface image of a hybridized Affymetrix GeneChip expression array. Its area of about 1.6 cm2 divides into a grid of nearly one million probe spots of size (11 × 11) μm2. Each of them is covered by a 'turf' of 25meric oligonucleotides attached to the chip surface. Their sequence is chosen
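The feature construction behind a position-dependent NN+GGG model can be sketched as follows. This is only the indicator-feature step; the per-feature sensitivities would be fitted to observed log-intensities (e.g. by regression) in the actual single-chip calibration, and the function names and the sensitivity dictionary here are hypothetical.

```python
def nn_ggg_features(probe):
    """Indicator features for a position-dependent NN+GGG model:
    one feature per (position, dinucleotide) pair, plus explicit
    (position, 'GGG') next-nearest-neighbor terms for runs of
    guanines (sketch only)."""
    feats = {}
    for i in range(len(probe) - 1):          # nearest-neighbor terms
        feats[(i, probe[i:i + 2])] = 1
    for i in range(len(probe) - 2):          # GGG terms
        if probe[i:i + 3] == "GGG":
            feats[(i, "GGG")] = 1
    return feats

def predicted_log_intensity(probe, sensitivities):
    """Model prediction: sum of fitted per-feature sensitivities
    (the dict of fitted values is assumed to exist; unseen features
    contribute zero)."""
    return sum(sensitivities.get(f, 0.0) for f in nn_ggg_features(probe))
```

For a 25-mer this yields 24 positional NN features plus one GGG term per guanine triple, which is exactly the extra degree of freedom that lets the hybrid model capture the poly-G bias a pure NN model misses.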

Abstract:
Equivalence relations on the edge set of a graph $G$ that satisfy restrictive conditions on chordless squares play a crucial role in the theory of Cartesian graph products and graph bundles. We show here that such relations in a natural way induce equitable partitions on the vertex set of $G$, which in turn give rise to quotient graphs that can have a rich product structure even if $G$ itself is prime.
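For reference, the notion of an equitable partition used here can be stated as follows (this is the standard definition from algebraic graph theory; the notation $N(v)$ for the neighborhood of $v$ is assumed):

```latex
A partition $\pi=\{V_1,\dots,V_k\}$ of the vertex set of $G$ is
\emph{equitable} if for all $i,j$ there is a constant $b_{ij}$ such that
\[
  |N(v)\cap V_j| \;=\; b_{ij} \qquad \text{for every } v\in V_i,
\]
i.e., the number of neighbors in $V_j$ does not depend on the choice of
$v\in V_i$. The quotient graph $G/\pi$ has the classes $V_i$ as vertices,
with (weighted) adjacencies given by the $b_{ij}$.
```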

Abstract:
Local minima and the saddle points separating them in the energy landscape are known to dominate the dynamics of biopolymer folding. Here we introduce a notion of a "folding funnel" that is concisely defined in terms of energy minima and saddle points, while at the same time conforming to the informal concept of a folding funnel as it is discussed in the protein folding literature.

Abstract:
Background: Several sources of noise obfuscate the identification of single nucleotide variation (SNV) in next-generation sequencing data. For instance, errors may be introduced during library construction and sequencing steps. In addition, the reference genome and the algorithms used for the alignment of the reads are further critical factors determining the efficacy of variant calling methods. It is crucial to account for these factors in individual sequencing experiments. Results: We introduce a simple data-adaptive model for variant calling. This model automatically adjusts to specific factors such as alignment errors. To achieve this, several characteristics are sampled from sites with low mismatch rates, and these are used to estimate empirical log-likelihoods. These likelihoods are then combined into a score that typically gives rise to a mixture distribution. From this we determine a decision threshold to separate potentially variant sites from the noisy background. Conclusions: In simulations we show that our simple proposed model is competitive with frequently used, much more complex SNV calling algorithms in terms of sensitivity and specificity. It performs particularly well in cases with low allele frequencies. The application to next-generation sequencing data reveals stark differences in the score distributions, indicating a strong influence of data-specific sources of noise. The proposed model is specifically designed to adjust to these differences.
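The core of the data-adaptive scoring idea can be sketched as follows. This is a deliberately simplified illustration: it estimates an empirical log-likelihood for one site characteristic via a histogram over background (low-mismatch) sites and sums such terms into a site score; the actual model's choice of characteristics, smoothing, and threshold determination are not reproduced here, and all names are illustrative.

```python
import math

def empirical_loglik(background_values, bins=10):
    """Build an empirical log-likelihood function for one site
    characteristic from values observed at low-mismatch
    ('background') sites, using a simple smoothed histogram."""
    lo, hi = min(background_values), max(background_values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in background_values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    total = len(background_values)
    # add-one smoothing keeps unseen bins at a finite log-likelihood
    probs = [(c + 1) / (total + bins) for c in counts]

    def loglik(v):
        i = min(max(int((v - lo) / width), 0), bins - 1)
        return math.log(probs[i])
    return loglik

def site_score(features, logliks):
    """Combine per-characteristic empirical log-likelihoods into one
    score; sites scoring far below typical background values are
    candidates for being variant."""
    return sum(ll(v) for ll, v in zip(logliks, features))
```

Because the log-likelihoods are estimated from the experiment's own background sites, the score automatically adapts to experiment-specific noise such as alignment artifacts, which is the behavior the abstract describes.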