BMC Genomics 2010

Most transcription factor binding sites are in a few mosaic classes of the human genome

DOI: 10.1186/1471-2164-11-286

We find that the human genome may be described by 19 pairs of mosaic classes, each defined by its base frequencies (or, more precisely, by the frequencies of doublets), so that typically a run of 10 to 100 bases belongs to the same class. Most experimentally verified binding sites are in the same four pairs of classes. In our sample of seventeen transcription factors, taken from different families of transcription factors, the average proportion of sites in this subset of classes was 75%, with values for individual factors ranging from 48% to 98%. By contrast, these same classes contain only 26% of the bases of the genome and only 31% of occurrences of the motifs of these factors, that is, places where one might expect the factors to bind. These results are not a consequence of the class composition in promoter regions.

This method of analysis will help to find transcription factor binding sites and assist with the problem of false positives. These results also imply a profound difference between the mosaic classes.

The DNA sequence has no landmarks to guide the search for transcription factor binding sites: these binding sites may be near the transcription start site but may also be far from it [1,2]. Many papers have examined how these sites might be found computationally [3]. Some methods use a comparison between orthologous regions of different species [4], often treating the problem as one of multiple alignment [5,6]. Other algorithms use a collection of subsequences containing a binding site (for example, the promoter regions of coregulated genes or subsequences derived from ChIP-chip experiments) to deduce the form or motif of the binding site, which is then used to identify sites in other sequences; reviews of these methods are given in [7,8]. These methods include Weeder [9], MEME [10], ANN-SPEC [11], MORPH [12] and GLAM [13]. Some authors have proposed a statistical test to decide whether a region of DNA is a regulatory region: two methods [14,15] tested on f


