BMC Bioinformatics 2006

Querying the public databases for sequences using complex keywords contained in the feature lines

DOI: 10.1186/1471-2105-7-45

Olivier Croce, Michaël Lamarre, Richard Christen

Abstract:

We show that Entrez has severe limitations with respect to retrieving subsequences. SRS works well with simple keywords but not with keywords composed of several terms, and has problems with complex queries. ACNUC works well, but does not allow precise queries in the Feature qualifiers. We developed specific Perl scripts to precisely retrieve subsequences as defined by complex descriptors in the Feature qualifiers of the EMBL entries. We improved parts of the BioPerl library to allow parsing of large data files, and we embedded these scripts in a user-friendly, OS-independent interface for easy use.

Although not as fast as the public tools that use prebuilt indexes, parsing the complete entries using a script is often necessary in order to retrieve the exact data searched for. Embedding the scripts in a user-friendly interface allows biologists to use them, and bioinformaticians can easily modify them, if necessary, for unforeseen needs.

The quantity of biological information available in the public databases is now very large and doubling nearly every year [1]. Projects involving high-throughput data from approaches such as the new transcriptomic or genomic technologies require large quantities of data to be dealt with. For example, in order to design a DNA chip for bacterial identification, one may want to retrieve every available sequence for a universal gene such as the 16S rRNA gene (nearly 200,000 sequences expected by the end of 2005).

Retrieval of such sequences is not trivial. Retrieval by sequence similarity is not feasible, since some of these genes are very variable (ITS regions, for example). It is also difficult, if not impossible, to determine a cutoff level that excludes non-homologous gene sequences. Blast [2], for example, allows a cutoff according to the "E value", while Exonerate [3] also allows a cutoff on percentage of similarity. The "E value" cutoff depends on the database size, which changes every day. The percentage of similarity is be
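
As an illustration of the feature-based retrieval described in the abstract, the following is a minimal Python sketch of the general idea: scan the feature table of an EMBL-style entry, match a multi-word phrase inside a given qualifier, and cut out the corresponding subsequence. It is not the authors' Perl scripts and does not use BioPerl. The record in SAMPLE_ENTRY is made up for the example, and the names parse_entry, extract and find_subsequences are hypothetical. The sketch assumes single-line qualifiers and handles only simple locations of the form start..end or complement(start..end); joins and fuzzy boundaries are skipped.

```python
import re

# Made-up EMBL-style record, used only for illustration.
SAMPLE_ENTRY = """\
ID   X00001; SV 1; linear; genomic DNA; STD; PRO; 40 BP.
FT   source          1..40
FT                   /organism="Example bacterium"
FT   rRNA            5..20
FT                   /product="16S ribosomal RNA"
FT   CDS             complement(25..36)
FT                   /product="hypothetical protein"
SQ   Sequence 40 BP;
     aaaacccccg ggggtttttc ccccaaaagg ggttttaaaa        40
//
"""

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def parse_entry(text):
    """Return (features, sequence) for one entry.

    Simplification: qualifiers and locations are assumed to fit on one line.
    """
    features = []
    seq_parts = []
    in_seq = False
    for line in text.splitlines():
        if line.startswith("FT"):
            rest = line[2:].strip()
            if rest.startswith("/"):
                # Qualifier line: attach it to the most recent feature.
                if features:
                    name, _, value = rest[1:].partition("=")
                    features[-1]["qualifiers"][name] = value.strip('"')
            else:
                parts = rest.split(None, 1)
                if len(parts) == 2:
                    features.append(
                        {"key": parts[0], "location": parts[1], "qualifiers": {}}
                    )
        elif line.startswith("SQ"):
            in_seq = True
        elif line.startswith("//"):
            in_seq = False
        elif in_seq:
            # Drop spacing and the trailing position counter.
            seq_parts.append(re.sub(r"[^a-zA-Z]", "", line))
    return features, "".join(seq_parts)


def extract(sequence, location):
    """Cut a subsequence for a simple location, or return None if unsupported."""
    m = re.fullmatch(r"(complement\()?(\d+)\.\.(\d+)\)?", location)
    if m is None:
        return None
    start, end = int(m.group(2)), int(m.group(3))
    sub = sequence[start - 1:end]  # EMBL positions are 1-based, inclusive
    if m.group(1):
        sub = sub.translate(_COMPLEMENT)[::-1]  # reverse complement
    return sub


def find_subsequences(text, key, qualifier, phrase):
    """Return (location, subsequence) pairs for features of type `key`
    whose `qualifier` value contains `phrase` (case-insensitive)."""
    features, sequence = parse_entry(text)
    hits = []
    for feature in features:
        value = feature["qualifiers"].get(qualifier, "")
        if feature["key"] == key and phrase.lower() in value.lower():
            sub = extract(sequence, feature["location"])
            if sub is not None:
                hits.append((feature["location"], sub))
    return hits
```

Calling find_subsequences(SAMPLE_ENTRY, "rRNA", "product", "16S ribosomal RNA") restricts the match to one feature type and one qualifier, so the whole phrase is tested against the qualifier value rather than being split into independent keywords.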
