Abstract:
Background: In Illumina high-throughput short-read data, the genotypes inferred from the positive strand and the negative strand are sometimes significantly different, with one homozygous and the other heterozygous. This phenomenon is known as strand bias. In this study, we used Illumina short-read sequencing data to evaluate the effect of strand bias on genotyping quality, and to explore the possible causes of strand bias. Results: We collected 22 breast cancer samples from 22 patients and sequenced their exomes using the Illumina GAIIx machine. By comparing the consistency between the genotypes inferred from this sequencing data and the genotypes inferred from SNP chip data, we found that, when using sequencing data, SNPs with extreme strand bias did not have significantly lower consistency rates than SNPs with low or no strand bias. However, this result may be limited by the small subset of SNPs present in both the exome sequencing and the SNP chip data. We further compared the transition/transversion ratio and the number of novel non-synonymous SNPs between the SNPs with low or no strand bias and those with extreme strand bias, and found that SNPs with low or no strand bias have better overall quality. We also discovered that strand bias occurs randomly at genomic positions across these samples, and observed no consistent pattern of strand bias location across samples. By comparing results from two different aligners, BWA and Bowtie, we found very consistent strand bias patterns. Thus strand bias is unlikely to be caused by alignment artifacts. We successfully replicated our results using two additional independent datasets with different capturing methods and Illumina sequencers. Conclusion: Extreme strand bias indicates a potentially high false-positive rate for SNPs.
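The transition-to-transversion (Ti/Tv) ratio used above as a quality indicator can be illustrated with a minimal Python sketch. The function name `titv_ratio` and the input format (a list of `(ref, alt)` base pairs) are assumptions made for illustration; this is not the authors' pipeline.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
# substitutions; every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def titv_ratio(snps):
    """Return the Ti/Tv ratio for an iterable of (ref, alt) base pairs.

    Pairs that are not single-base substitutions between A, C, G, T
    are ignored. Returns float('inf') if there are no transversions.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, ambiguous bases, and non-variants
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")
```

Under this sketch, one would compute the ratio separately for the low-bias and extreme-bias SNP sets and compare the two values.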

Abstract:
If a sequence of functions diverges almost everywhere, then the set of subsequences which diverge almost everywhere is a residual set of subsequences.

Abstract:
We examined the frequencies of initiation- and termination-codons in the two phases, and found that termination codons do not significantly differ between the two phases, whereas initiation codons are more abundant in phase 1. We found that the primary factors explaining the phase inequality are the frequencies of amino acids whose codons may combine to form start codons in the two phases. We show that the frequencies of start codons in each of the two phases, and, hence, the potential for the creation of overlapping genes, are determined by a universal amino-acid frequency and species-specific codon usage, leading to a correlation between long phase-1 overlaps and genomic GC content. Our model explains the phase bias in same-strand overlapping genes by compositional factors without invoking selection. Therefore, it can be used as a null model of neutral evolution to test selection hypotheses concerning the evolution of overlapping genes. This article was reviewed by Bill Martin, Itai Yanai, and Mikhail Gelfand. Overlapping genes were found in all cellular domains of life, as well as in viruses [1-3]. Overlapping genes are thought to have unique evolutionary constraints [4,5] and regulatory properties [6,7]. Genes can overlap on the same strand (→ →) or on the complementary strand ("tail-to-tail" → ←, or "head-to-head" ← →, Figure 1). Different nomenclatures have been used in the literature to denote "same-strand" ("unidirectional," "codirected," "parallel," and "tandem"), "tail-to-tail" ("convergent," "anti-parallel," and "end-on"), and "head-to-head" ("divergent" and "head-on") overlapping genes [8-11]. Here, we use the self-explanatory terms "same-strand" and "opposite-strand" overlapping genes. In bacteria, overlaps on the same strand are by far the most abundant [10,11], most likely because, on average, 70% of the genes in bacterial genomes are located on one strand [9]. Same-strand overlaps occur in frameshifts of one nucleotide (phase 1) or two nucleotides (phas

Abstract:
In 1997 S\'ark\"ozy and Mauduit introduced the well-distribution measure ($W$) and the correlation measure of order $\ell$ ($C_{\ell}$) of binary sequences as measures of their pseudorandomness. For a truly random binary sequence these measures are small ($\ll N^{1/2} (\log N)^c$ for a sequence of length $N$). Several constructions have been given for which these measures are small, namely they are $\ll N^{1/2} (\log N)^c$, so the sequence $E_N$ has strong pseudorandom properties. But in certain applications, e.g. in cryptography, it is not enough to know that the sequence has strong pseudorandom properties; it is also important that the subsequences $E_M$ (where $E_M$ is of the form $\{e_x,e_{x+1},...,e_{x+M-1}\}$) also have strong pseudorandom properties for values $M$ possibly small in terms of $N$. In this paper I will deal with this problem in the case of values $M \gg N^{1/4+ \varepsilon}$.
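For readers unfamiliar with the two measures, the following brute-force Python sketch computes them for a short $\pm 1$ sequence. It assumes the standard definitions, $W(E_N) = \max_{a,b,t} |\sum_{j=0}^{t-1} e_{a+jb}|$ over arithmetic progressions inside $\{1,\dots,N\}$ and $C_{\ell}(E_N) = \max_{M,D} |\sum_{n=1}^{M} e_{n+d_1} \cdots e_{n+d_{\ell}}|$ over $0 \le d_1 < \dots < d_{\ell}$ with $M + d_{\ell} \le N$. The function names are illustrative, and the code is only practical for small $N$.

```python
from itertools import combinations


def well_distribution(e):
    """Brute-force W(E_N) for a list e of +1/-1 values (0-based indices)."""
    n = len(e)
    best = 0
    for b in range(1, n + 1):          # common difference of the progression
        for a in range(n):             # starting index
            s = 0
            for i in range(a, n, b):   # extend the progression one term at a time
                s += e[i]
                best = max(best, abs(s))
    return best


def correlation(e, ell):
    """Brute-force C_ell(E_N) for a list e of +1/-1 values."""
    n = len(e)
    best = 0
    for d in combinations(range(n), ell):   # shifts 0 <= d_1 < ... < d_ell
        s = 0
        for i in range(n - d[-1]):          # all M with M + d_ell <= N
            term = 1
            for dj in d:
                term *= e[i + dj]
            s += term
            best = max(best, abs(s))
    return best
```

The alternating sequence `[1, -1, 1, -1]` shows why both measures matter: its plain partial sums stay bounded by 1, yet `well_distribution` returns 2 and `correlation(..., 2)` returns 3, exposing the obvious structure.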

Abstract:
We determine the average number of distinct subsequences in a random binary string, and derive an estimate for the average number of distinct subsequences of a particular length.
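As a small illustration of the quantity being averaged, the sketch below counts the distinct subsequences of a string with a standard dynamic program and then averages over all binary strings of a given length by exhaustive enumeration. Counting the empty subsequence is a convention assumed here, which the abstract does not specify; the function names are illustrative, and the enumeration is not the paper's analytic derivation.

```python
from itertools import product


def count_distinct_subsequences(s):
    """Number of distinct subsequences of s, including the empty one."""
    total = 1      # only the empty subsequence before reading any character
    last = {}      # value of `total` just before the previous occurrence of each char
    for ch in s:
        # Appending ch to every existing subsequence doubles the count;
        # subtract those already produced at the previous occurrence of ch.
        new_total = 2 * total - last.get(ch, 0)
        last[ch] = total
        total = new_total
    return total


def average_distinct_subsequences(n):
    """Exact average over all 2**n binary strings of length n (small n only)."""
    strings = product("01", repeat=n)
    return sum(count_distinct_subsequences(s) for s in strings) / 2 ** n
```

For example, `"aba"` has the seven distinct subsequences `""`, `a`, `b`, `aa`, `ab`, `ba`, `aba`.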

Abstract:
This note provides very simple, efficient algorithms for computing the number of distinct longest common subsequences of two input strings and for computing the number of LCS embeddings.
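One way to count LCS embeddings is sketched below in Python. It is a textbook-style dynamic program written for illustration and is not necessarily the algorithm given in the note. Here an embedding is assumed to mean a pair of strictly increasing index sequences, one per string, that match character by character and have the LCS length; the function name is hypothetical.

```python
def count_lcs_embeddings(a, b):
    """Return (lcs_length, number_of_embeddings) for strings a and b."""
    n, m = len(a), len(b)
    # L[i][j]: LCS length of a[:i] and b[:j]
    # E[i][j]: number of embeddings of that length within a[:i], b[:j]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[1] * (m + 1) for _ in range(n + 1)]  # borders hold only the empty embedding
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = a[i - 1] == b[j - 1]
            L[i][j] = max(L[i - 1][j], L[i][j - 1],
                          L[i - 1][j - 1] + 1 if match else 0)
            count = 0
            if L[i - 1][j] == L[i][j]:
                count += E[i - 1][j]          # embeddings that avoid a[i-1]
            if L[i][j - 1] == L[i][j]:
                count += E[i][j - 1]          # embeddings that avoid b[j-1]
            if L[i - 1][j - 1] == L[i][j]:
                count -= E[i - 1][j - 1]      # counted twice: they avoid both
            if match and L[i - 1][j - 1] + 1 == L[i][j]:
                count += E[i - 1][j - 1]      # embeddings pairing a[i-1] with b[j-1]
            E[i][j] = count
    return L[n][m], E[n][m]
```

The inclusion-exclusion step matters: embeddings that use neither final character appear in both of the first two terms and must be subtracted once.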

Abstract:
We present a number of results relating partial Cauchy-Littlewood sums, integrals over the compact classical groups, and increasing subsequences of permutations. These include: integral formulae for the distribution of the longest increasing subsequence of a random involution with constrained number of fixed points; new formulae for partial Cauchy-Littlewood sums, as well as new proofs of old formulae; relations of these expressions to orthogonal polynomials on the unit circle; and explicit bases for invariant spaces of the classical groups, together with appropriate generalizations of the straightening algorithm.

Abstract:
Given two rooted, labeled trees $P$ and $T$ the tree path subsequence problem is to determine which paths in $P$ are subsequences of which paths in $T$. Here a path begins at the root and ends at a leaf. In this paper we propose this problem as a useful query primitive for XML data, and provide new algorithms improving the previously best known time and space bounds.
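To make the problem statement concrete, here is a naive Python baseline that enumerates every root-to-leaf path and tests each pair directly. It is only a sketch of the problem definition, not the improved algorithms the paper proposes. The tree encoding, a `(label, children)` tuple, and the function names are assumptions for illustration.

```python
def root_to_leaf_paths(tree):
    """Return all root-to-leaf label paths of a (label, children) tree."""
    label, children = tree
    if not children:
        return [[label]]
    return [[label] + path
            for child in children
            for path in root_to_leaf_paths(child)]


def is_subsequence(p, t):
    """True if list p is a (not necessarily contiguous) subsequence of list t."""
    it = iter(t)
    # `x in it` advances the iterator, so matches must occur in order.
    return all(x in it for x in p)


def tree_path_subsequence(P, T):
    """Map each root-to-leaf path of P to the paths of T that contain it.

    Paths of P with identical label sequences share one dictionary key.
    """
    t_paths = root_to_leaf_paths(T)
    return {
        tuple(p): [tuple(t) for t in t_paths if is_subsequence(p, t)]
        for p in root_to_leaf_paths(P)
    }
```

This baseline takes time proportional to the total length of all path pairs, which is the kind of cost the paper's algorithms aim to avoid.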

Abstract:
Dysregulation of miRNA expression plays a critical role in the pathogenesis of genetic and multifactorial disorders and of human cancers. We exploited sequence, genomic and expression information to investigate two main aspects of post-transcriptional regulation in miRNA biogenesis, namely strand selection regulation and expression relationships between intragenic miRNAs and host genes. We considered miRNA expression profiles, measured in five sizeable microarray datasets, including samples from different normal cell types and tissues, as well as different tumours and disease states. First, the study of expression profiles of “sister” miRNA pairs (miRNA/miRNA*, 5′ and 3′ strands of the same hairpin precursor) showed that strand selection is highly regulated, since it shows tissue-, cell- and condition-specific modulation. We used information about the direction and strength of the strand selection bias to perform an unsupervised cluster analysis for sample classification, showing that this information can distinguish among different tissues, and sometimes between normal and malignant cells. Then, considering a minimum expression threshold, only a few miRNA pairs had just one mature miRNA present in all considered cell types, whereas the majority of pairs were concurrently expressed in some cell types and alternatively in others. In a significant fraction of concurrently expressed pairs, the major and the minor forms, found at comparable levels, may contribute to post-transcriptional gene silencing, possibly in a coordinated way. In the second part of the study, the believed tendency toward co-expression of intragenic miRNAs and their “host” mRNA genes was refuted by examination of expression profiles, suggesting that the expression profile of a given host gene is unlikely to be a good estimator of that of its co-transcribed miRNA(s) when inferring post-transcriptional regulatory networks.
Our results point out the regulatory importance of the post-transcriptional phases of miRNA biogenesis, reinforcing the role of this layer of miRNA biogenesis in miRNA-based regulation of cell activities.