SNPFile – A software library and file format for large scale association mapping and population genetics studies

DOI: 10.1186/1471-2105-9-526

We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be IO efficient for the access patterns of most multi-locus analysis methods.

The new file format has been very useful for our own studies, where it has significantly reduced the informatics burden of keeping track of various secondary data, and where the memory and IO efficiency has greatly simplified analysis runs. A main limitation of the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. This is somewhat alleviated by a scripting interface that makes it easy to write converters to and from the format.

High-throughput genotyping technology has enabled cost-effective typing of thousands of individuals in hundreds of thousands of markers for use in genome-wide studies [1], in particular genome disease association studies [2-7].

There is currently no standard file format for storing such genotype data, and most major analysis tools define their own textual input and output formats. Only a few tools support several input formats, and often several conversion scripts need to be implemented in a study. The file formats of analysis tools usually represent only a restricted set of the data collected for the study – only the data necessary for the computations provided by the program – so a study either needs a secondary format for storing all data, with converter programs for import/export to analysis tools, or needs several files for storing various types of data.

While spreadsheets and plain text files were adequate, if not optimal, solutions earlier, the increased data size mandates more efficient solutions. While plain text file formats have the advantage that they are human readable and can be edited in any text editor to correct mistakes, they have two maj


