Genome Biology 2000
Arabidopsis chromosome 4 sequence
DOI: 10.1186/gb-2000-1-1-reports030
Abstract: The paper summarizes years of work by hundreds (if not thousands) of people in dozens of labs spread over three continents. The key features of chromosome 4 are as follows. The long arm of chromosome 4 is 14.5 Mb and the short arm is 3.0 Mb (plus nearly 3.5 Mb of ribosomal DNA repeats). Nearly 50% of the sequence encodes protein, for a total of 3,744 predicted proteins. Each gene is about 4.6 kb in length and contains an average of 5.2 exons. The actual or potential cellular function of approximately 60% of the genes can be predicted on the basis of similarity to other characterized proteins. Only 33% of the predicted genes are represented among the 45,000 available Arabidopsis expressed sequence tags (ESTs). Of these, 6% of the genes match 75% of the ESTs. It is not clear whether the authors are referring at this point only to the chromosome 4 sequence or to all available Arabidopsis sequence. It is clearly important to sequence normalized EST libraries in order to maximize the amount of non-redundant sequence gathered. Almost 8% of the predicted genes have no matching ESTs and no similarity to other proteins; these may represent spurious gene predictions or plant-specific genes expressed at low levels.

The authors give some statistics on various motifs and structural topologies found in the predicted proteins. They also attempt to classify the proteins into major functional categories (such as metabolism and transcription). The only major surprise is the large number of genes involved in disease and defense responses. This is largely due to several large clusters of leucine-rich repeat genes, including one family of 15 contiguous genes. A surprisingly large number of genes are arranged in tandem copies.
Of genes with products that have significant similarity to other proteins in Arabidopsis, 12% are arrayed in tandem clusters, ranging from pairs of genes to the 15 leucine-rich repeat genes. This hints at the underlying mechanism of how plants generate sequence diversi