ADMIXMAP

a program to model admixture using marker genotype data  


Input File Formats

genotypesfile
The first row of the file is a header row listing locus names, enclosed in quotes and separated by spaces. Locus names should be exactly the same as in the file locusfile. Loci must be ordered by their map positions on the genome. Each subsequent row contains genotype data for a single individual: the individual ID, then the individual's sex (coded as 1 for male, 2 for female, 0 for missing), followed by the observed genotypes at each locus, optionally enclosed in quotes. The sex column may be omitted if none of the loci are on the X chromosome. Haploid genotypes (including X chromosome genotypes for males) are coded as single integers. Diploid genotypes are coded as pairs of integers separated by a comma. Where there are a alleles at a locus, the alleles should be coded as numbers from 1 to a. Missing genotypes are coded as "0,0" (or "0" for haploid genotypes).
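As an illustration of the genotype coding described above, the following Python sketch parses a single genotype field. The function name parse_genotype is hypothetical and is not part of ADMIXMAP; treating a field as missing only when every allele is 0 is an assumption made for this example.

```python
def parse_genotype(field):
    """Parse one genotype field from a genotypesfile row.

    Returns a tuple of allele numbers: one for a haploid genotype,
    two for a diploid genotype. Returns None for a missing genotype.
    (Hypothetical helper, not part of ADMIXMAP.)
    """
    text = field.strip().strip('"')  # quotes around genotypes are optional
    alleles = tuple(int(a) for a in text.split(","))
    if len(alleles) not in (1, 2):
        raise ValueError(f"unexpected genotype field: {field!r}")
    if all(a == 0 for a in alleles):
        return None  # "0,0" (diploid) or "0" (haploid) means missing
    return alleles


print(parse_genotype('"1,2"'))  # diploid genotype
print(parse_genotype("2"))      # haploid genotype
print(parse_genotype("0,0"))    # missing
```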

For compatibility with existing datasets, we plan to change this file format to one more similar to the PEDFILE format used with LINKAGE. 

testgenotypesfile

This file contains genotypes for each individual in the genotypesfile at diallelic loci that are not included in the model because large haplotypes are not modelled. The format is the same as for the genotypesfile above, except that genotypes should be coded as 0 for "1,1", 1 for "1,2" and 2 for "2,2". Missing genotypes should be coded as NA. The file is not used by the program itself; specifying it indicates that, provided there is a regression model, "offline" score tests are to be carried out in the R script.
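The recoding from the genotypesfile coding to the testgenotypesfile coding can be sketched as follows. The function name is hypothetical, and the sketch assumes that "2,1" should be treated the same as "1,2" (unordered genotypes), which the text above does not state explicitly.

```python
def recode_for_test_file(genotype):
    """Recode a diallelic genotype such as "1,2" as 0, 1, 2 or NA.

    The result is the number of copies of allele 2, returned as a string
    so that it can be written straight to a file.
    (Hypothetical helper, not part of ADMIXMAP.)
    """
    text = genotype.strip().strip('"')
    if text == "0,0":
        return "NA"  # missing genotype
    first, second = (int(a) for a in text.split(","))
    if first not in (1, 2) or second not in (1, 2):
        raise ValueError(f"not a diallelic genotype: {genotype!r}")
    return str(first + second - 2)


for g in ["1,1", "1,2", "2,2", "0,0"]:
    print(g, "->", recode_for_test_file(g))
```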

locusfile
This file contains information about each simple locus: that is, each locus that is typed. The first row of the file is ignored by the program, and can be used as a header. Each subsequent row contains values of four variables: locus name; number of alleles at this locus; genetic map distance in Morgans, centimorgans or megabases between this locus and the previous locus; and the name of the chromosome where the locus is located. The last column is optional if none of the loci lie on the X chromosome. If distances are supplied in centimorgans, the header of the distance column must contain "cm" or "cM". If the distances are supplied in megabases, the header should contain "mb" or "Mb". Loci must be ordered by their map positions on the genome. Locus names should contain only alphanumeric characters (no spaces, dots or hyphens). If the previous locus is unlinked, the genetic map distance should be coded as "NA", "#" or ".". Loci considered too far apart to be linked may also be treated as unlinked. For two or more loci that are so close together that they should be analysed as a single compound locus (as with DRD2Bcl and DRD2Taqd in the tutorial), map distance should be coded as 0.
The website http://actin.ucd.ie/cgi-bin/rs2cm.cgi can be used to obtain the genetic map positions (in cM)
of a list of SNPs, which, once converted to distances, may be specified in the locusfile.
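Converting map positions to the distances expected in the locusfile might look like the sketch below. The function name and the input layout (name, chromosome, position) are hypothetical, and the sketch assumes that the first locus on each chromosome is coded as unlinked ("NA").

```python
def positions_to_distances(loci):
    """Convert map positions to distances from the previous locus.

    loci: list of (name, chromosome, position) tuples, already ordered
    by map position on the genome. Returns a list of (name, distance),
    where distance is "NA" for the first locus on each chromosome.
    (Hypothetical helper, not part of ADMIXMAP.)
    """
    rows = []
    prev_chrom = None
    prev_pos = None
    for name, chrom, pos in loci:
        if chrom != prev_chrom:
            distance = "NA"  # assumed unlinked to the previous locus
        else:
            distance = pos - prev_pos
            if distance < 0:
                raise ValueError(f"locus {name} is out of map order")
        rows.append((name, distance))
        prev_chrom, prev_pos = chrom, pos
    return rows


example = [("A", "1", 10.0), ("B", "1", 12.5), ("C", "2", 3.0)]
for name, distance in positions_to_distances(example):
    print(name, distance)
```

Loci that map to the same position receive a distance of 0, which matches the coding for loci to be analysed as a single compound locus.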

allelefreqfile
This file contains the ancestry-specific allele frequencies at each compound locus in each ancestral subpopulation. The first row contains headers in quotes, separated by spaces. The first string in this row is ignored. Subsequent strings in the first row specify the names of the ancestral subpopulations contributing to the admixed population under study. Subsequent rows specify the ancestry-specific allele frequencies (usually estimated by sampling modern descendants of the subpopulations that underwent admixture). The first column in each row gives the name of the compound locus, in quotes.
For diallelic loci, only the frequency of allele 1 in each population is specified. For each locus with a alleles, there are (a - 1) rows specifying frequencies of alleles 1 to (a - 1).
Where two or more loci are to be analysed as a single haplotype, the ancestry-specific frequency of each haplotype must be specified. Thus in the example files below, there are two SNPs in the DRD2 gene, giving four possible haplotypes, and four lines specifying the ancestry-specific frequencies of haplotypes 11, 12, 21, 22. The loci in the haplotype are ordered by their map position on the genome, and the haplotypes are ordered by incrementing a counter from right to left. For instance, if there were three loci A, B, C, with 4, 2 and 3 alleles respectively, the haplotypes would be listed in the following order: 111, 112, 113, 121, 122, 123, 211, ..., 422, 423.
Note: Use of this file is not recommended and is supported only for backward compatibility with previous versions. Instead, you can specify the allele frequencies in priorallelefreqfile or historicallelefreqfile as fixed, with option fixedallelefreqs = 1.

priorallelefreqfile
This file contains parameter values for the Dirichlet prior distribution of the allele or haplotype frequencies at each compound locus in each subpopulation.  At each compound locus with k alleles or k possible haplotypes, a Dirichlet prior distribution is specified by a vector of k positive numbers.   Where these alleles or haplotypes have been counted directly in samples from an unadmixed subpopulation, the parameter values should be specified as 0.5 plus the observed counts of each allele.   Where no information is available about allele or haplotype frequencies at a compound locus in a subpopulation, or no copies of the allele have been observed in the sample from that subpopulation, specify 0.5 in the corresponding cells.   Specifying 0.5 in all cells, with columns for b subpopulations, is equivalent to specifying the option populations = b .
Where haplotype frequencies at a compound locus have been estimated from unordered genotypes, the user should supply the parameters of the Dirichlet distribution that most closely approximates the posterior distribution of haplotype frequencies given the observed genotypes and a reference prior, as described above. The first row is a header row, consisting of strings in quotes, separated by spaces. The first string in this row is ignored, and the subsequent strings specify the names of the ancestral subpopulations contributing to the admixed population.
After the header row, there is one row for each allele (or haplotype) at each compound locus.  The first column in each row gives the name of the compound locus in quotes.  Subsequent columns give the prior parameters for the frequency of the allele (or haplotype) in each subpopulation, separated by a single space. 
If the compound locus consists of two or more simple loci (see notes above), the rows list prior parameters for the haplotypes in the order defined by incrementing a counter from right to left. For instance, if there were three loci A, B, C, with 4, 2 and 3 alleles respectively, the haplotypes would be listed in the following order: 1-1-1, 1-1-2, 1-1-3, 1-2-1, 1-2-2, 1-2-3, 2-1-1, ..., 4-2-2, 4-2-3. Estimated counts should be given for all possible haplotypes, however rare. The program will include all possible haplotypes in the model, but will omit rare haplotypes when constructing test statistics.
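The haplotype ordering described above (the rightmost locus changes fastest) can be generated with a short Python sketch. The function name haplotype_order is hypothetical; itertools.product happens to iterate in exactly this order.

```python
from itertools import product


def haplotype_order(allele_counts, sep="-"):
    """List haplotype labels in the order expected in the file.

    allele_counts: number of alleles at each simple locus, with loci in
    map order. The counter is incremented from right to left, so the
    last locus varies fastest.
    (Hypothetical helper, not part of ADMIXMAP.)
    """
    ranges = [range(1, n + 1) for n in allele_counts]
    return [sep.join(str(a) for a in hap) for hap in product(*ranges)]


# Three loci A, B, C with 4, 2 and 3 alleles: 4 * 2 * 3 = 24 haplotypes.
order = haplotype_order([4, 2, 3])
print(order[:7])
print(order[-2:])
```

Passing sep="" gives the unhyphenated labels (111, 112, ...) used in the allelefreqfile description.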

historicallelefreqfile
This file contains observed counts of alleles or haplotypes at each compound locus in samples from unadmixed subpopulations.  The format of this file is exactly the same as the format of priorallelefreqfile described above.  The only difference between the two files is that in historicallelefreqfile, 0.5 is not added to the observed counts.  
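The relation between the two files can be sketched in Python: given observed counts as they would appear in historicallelefreqfile, adding 0.5 to each count gives the corresponding rows of priorallelefreqfile. The function name, the locus name D1 and the counts are made up for illustration.

```python
def prior_rows(locus, counts_by_allele):
    """Build priorallelefreqfile rows from observed allele counts.

    counts_by_allele: one list per allele (or haplotype), each holding
    the observed count in each subpopulation, in header order.
    Returns one text row per allele: the locus name in quotes followed
    by 0.5 plus each count, separated by single spaces.
    (Hypothetical helper, not part of ADMIXMAP.)
    """
    rows = []
    for counts in counts_by_allele:
        params = " ".join(str(count + 0.5) for count in counts)
        rows.append(f'"{locus}" {params}')
    return rows


# Made-up counts for a diallelic locus in two subpopulations.
for row in prior_rows("D1", [[10, 0], [3, 7]]):
    print(row)
```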

outcomevarfile
This file contains values of one or more outcome variables.  After the header row, the file has one row per individual.  Binary variables should be coded as 1 = affected, 0 = unaffected. Missing values are coded as #. The header row contains the variable labels in quotes separated by spaces. If the file contains more than one outcome variable, the column containing the first variable of interest should be specified by the command-line option targetindicator. The number of columns to be used (1 or 2) can be specified with the outcomes option. 

covariatesfile
This file contains values of covariates to be included in the regression model. It is used only if an outcomevarfile has been specified, and is optional even then.  The header row contains covariate names in quotes, separated by spaces. Subsequent rows contain the observed values of these variables.  For computational reasons, the values of the covariates should be centred about their sample means. Missing values are coded as #.
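Centring a covariate about its sample mean, while leaving missing values coded as #, might be done as in the sketch below. The function name is hypothetical, and computing the mean over the observed values only is an assumption of this example.

```python
def centre_column(values, missing="#"):
    """Centre one covariate column about its sample mean.

    values: the column as read from the file (strings or numbers).
    Missing entries (coded as "#") are left unchanged and are excluded
    from the mean.
    (Hypothetical helper, not part of ADMIXMAP.)
    """
    observed = [float(v) for v in values if v != missing]
    if not observed:
        return list(values)  # nothing to centre
    mean = sum(observed) / len(observed)
    return [v if v == missing else float(v) - mean for v in values]


print(centre_column(["1", "2", "#", "3"]))
```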

coxoutcomevarfile
This file contains survival data for a Cox regression model. After the header row, the file has one row per individual and there are three columns. The first contains the times when each individual began to be observed; the second contains the times when each individual ceased being observed; and the last contains the number of events that occurred during the observed period (usually 0 or 1). The start and finish times must be numeric and relative to the same point in time (usually the first start time).
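Expressing all times relative to the first start time could be done as in this sketch. The function name and the example values are hypothetical.

```python
def shift_to_first_start(rows):
    """Re-express start and finish times relative to the first start time.

    rows: list of (start, finish, events) tuples, one per individual.
    The earliest start time becomes 0; event counts are unchanged.
    (Hypothetical helper, not part of ADMIXMAP.)
    """
    origin = min(start for start, _, _ in rows)
    return [(start - origin, finish - origin, events)
            for start, finish, events in rows]


# Made-up observation periods for two individuals.
print(shift_to_first_start([(5.0, 9.0, 1), (3.0, 10.0, 0)]))
```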
