ADMIXMAP

a program to model admixture using marker genotype data

General Options
Allele / Haplotype Frequency Model
Data Files
Model Specification
Prior Specification
Output Files
Tests and Diagnostics

User options

The program requires a list of options to be specified by the user, either as command-line arguments or in a text file whose name is given as a single argument to the program. As explained above, the most convenient way to specify these arguments is to use a Perl script (see "admixmap.pl"). A list of these options is given in the following tables. Required arguments are in bold.

General Options

*samples*	Integer specifying total number of iterations of the Markov chain, including burn-in. With strong priors and informative markers, a run of about 500 should suffice for inference. Otherwise, a run of at least 20 000 iterations may be necessary. See here for how to determine if the run has been long enough.
*burnin*	Integer specifying number of iterations for burn-in of the Markov chain, before posterior samples are output. A burn-in of at least 50 iterations is recommended for inference. For analyses requiring long runs, a burn-in of up to 500 may be required.
*every*	Integer specifying the "thinning" of samples from the posterior distribution that are written to the output files, after the burn-in period. For example, if every=10, sampled values are written to the output files every 10 iterations. We recommend using a value of 5 to keep down the size of the output files. Sampling more frequently than this does not much improve the precision of results, because successive draws are not independent. Thinning the output samples does not affect the calculation of ergodic averages or test statistics, which are based on all sampled values. Note that every must be no greater than (samples - burnin) / 10, or some output files may be empty.
numannealedruns	If thermo=0, this specifies the number of "annealing" runs during burnin. This usually improves mixing. If thermo=1, this specifies the number of "temperatures" at which to run in order to estimate the marginal likelihood by thermodynamic integration. Default is 20.
displaylevel	0 - silent mode: only start and finish times output to screen. 1 - quiet mode: model specification, priors, test results and diagnostics written to screen. 2 - normal mode: more verbose information and an iteration counter output to screen. >2 - monitor mode: population-level parameters also written to screen with frequency specified by every.
resultsdir	Path of directory for output files. Default is 'results'.
logfile	Name of log file written by the program. Default is 'logfile.txt'.
seed	Integer that can be used to specify a seed for the random number generator.
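
The constraint on every stated above can be checked before starting a long run. The following is a minimal sketch in Python, not part of ADMIXMAP itself; the function name check_thinning is hypothetical, and the returned count of written draws assumes that one draw is written every `every` iterations after burn-in.

```python
def check_thinning(samples: int, burnin: int, every: int) -> int:
    """Check the run-length options and return the approximate number
    of thinned draws written to the output files.

    Assumption: `samples` includes the burn-in (as the table states),
    and one draw is written every `every` iterations after burn-in.
    """
    if burnin >= samples:
        raise ValueError("burnin must be smaller than samples")
    if every < 1:
        raise ValueError("every must be at least 1")
    post_burnin = samples - burnin
    # Rule from the options table: every <= (samples - burnin) / 10
    if every > post_burnin / 10:
        raise ValueError(
            f"every={every} exceeds (samples - burnin) / 10 = {post_burnin / 10}; "
            "some output files may be empty"
        )
    return post_burnin // every


print(check_thinning(samples=20000, burnin=500, every=5))  # 3900
```

With samples=550 and burnin=50, any value of every above 50 would be rejected by this check.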

Allele / Haplotype Frequency Model

The program requires one of the following four options, any one of which specifies the number of subpopulations in the model: populations, allelefreqfile, priorallelefreqfile, or historicallelefreqfile. These options are mutually exclusive.
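
As an illustration of this rule, the sketch below checks a set of options held in a Python dictionary and reports which of the four options defines the number of subpopulations. The dictionary representation and the function name frequency_option are assumptions made for this example; they are not part of ADMIXMAP.

```python
# The four options named in the text, any one of which fixes the
# number of subpopulations in the model.
FREQ_OPTIONS = (
    "populations",
    "allelefreqfile",
    "priorallelefreqfile",
    "historicallelefreqfile",
)


def frequency_option(options: dict) -> str:
    """Return the single allele-frequency option present in `options`.

    Raises ValueError if none, or more than one, of the four options
    is given (they are described as mutually exclusive).
    """
    chosen = [name for name in FREQ_OPTIONS if options.get(name) is not None]
    if len(chosen) != 1:
        raise ValueError(
            f"expected exactly one of {FREQ_OPTIONS}, found {chosen or 'none'}"
        )
    return chosen[0]


print(frequency_option({"samples": 1100, "burnin": 100, "populations": 2}))
```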

*populations*	Integer specifying number of subpopulations that have contributed to the admixed population under study. If specified as 1, the program fits a model based on a single homogeneous population. This option is not required (and is ignored) if information about allele frequencies is supplied in allelefreqfile, priorallelefreqfile, or historicallelefreqfile, as the number of columns in any of these files defines the number of subpopulations in the model. If none of these files are specified, the parameters of the Dirichlet priors for allele or haplotype frequencies default to 1/n, where n is the number of alleles or haplotypes at each compound locus.
*allelefreqfile*	Pathname of file containing the allele frequencies of the genotyped loci for each subpopulation. When this option is specified, the model treats the allele frequencies as fixed constants. This option is obsolete, and retained only for backward compatibility. Instead, use option priorallelefreqfile to specify the allele frequencies, and specify option fixedallelefreqs=1. This allows you to use the same format for the allele frequency file, whether the allele frequencies are fixed, have a prior distribution with no dispersion, or are specified with a dispersion model.
*priorallelefreqfile*	Pathname of file containing parameters of the Dirichlet prior distributions for allele frequencies (or haplotype frequencies) at each compound locus in each subpopulation. Where allele frequencies have been estimated from a sample of unadmixed individuals, the prior distribution parameters for the corresponding subpopulation should be specified as the observed allele counts plus 0.5. Where no allele frequency data are available, specify the prior parameters as 0.5 for each allele ("reference" prior). When this option is specified, the program fits a model in which the allele frequencies in each subpopulation are estimated simultaneously from the unadmixed samples and the admixed sample under study.
*historicallelefreqfile*	Pathname of file containing observed allele counts at the genotyped loci from samples of unadmixed individuals in each subpopulation. When this option is specified, the program fits a model that allows the "historic" allele frequencies in the unadmixed population to vary from the corresponding ancestry-specific allele frequencies in the admixed population under study.
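
The rule for the values in priorallelefreqfile (observed counts plus 0.5, or 0.5 per allele when no data are available) can be sketched as follows. This only illustrates the arithmetic for one locus in one subpopulation; the function name and the example counts are hypothetical, and the layout of the file itself is described under Input files.

```python
def dirichlet_prior_params(allele_counts=None, n_alleles=None):
    """Dirichlet prior parameters for one locus in one subpopulation.

    - With observed counts from unadmixed individuals: counts + 0.5.
    - With no frequency data: 0.5 for each allele ("reference" prior).
    """
    if allele_counts is not None:
        return [count + 0.5 for count in allele_counts]
    if n_alleles is None:
        raise ValueError("give either allele_counts or n_alleles")
    return [0.5] * n_alleles


# Hypothetical diallelic locus: 12 and 30 copies of the two alleles observed
print(dirichlet_prior_params(allele_counts=[12, 30]))  # [12.5, 30.5]
# No data for a locus with three alleles
print(dirichlet_prior_params(n_alleles=3))             # [0.5, 0.5, 0.5]
```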

Data Files

Details of file formats are given under Input files.

*locusfile*	path to file containing information about each locus typed
*genotypesfile*	path to file containing genotypes for each individual typed
outcomevarfile	path to file containing values of outcome variables
coxoutcomevarfile	path to file containing data for a Cox regression
covariatesfile	path to file containing covariates for a regression model
targetindicator	Integer specifying column in outcomevarfile that contains the first outcome variable to be modelled. This column number should be specified as an offset from column 1: thus to select the variable in column 1, specify targetindicator=0. The default is 0.
outcomes	valid only with outcomevarfile. Integer specifying the number of columns of the outcomevarfile to use, starting with targetindicator.
reportedancestry	not fully tested or documented: allows prior information about each individual’s ancestry to be specified in the model
testgenotypesfile	specifies genotypes for offline score tests at loci that have not been included in the model.
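
Because targetindicator is an offset from column 1 rather than a column number, it is easy to select the wrong outcome variable. The sketch below shows which columns of outcomevarfile would be used for given values of targetindicator and outcomes, as described in the table; the column names and the function outcome_columns are hypothetical and are not part of ADMIXMAP.

```python
def outcome_columns(header, targetindicator=0, outcomes=1):
    """Return the outcome columns selected from the outcomevarfile header.

    targetindicator is a zero-based offset from column 1, so
    targetindicator=0 selects the first column. `outcomes` is the
    number of columns used, starting at that offset.
    """
    if targetindicator < 0 or outcomes < 1:
        raise ValueError("targetindicator must be >= 0 and outcomes >= 1")
    end = targetindicator + outcomes
    if end > len(header):
        raise ValueError("selection runs past the last column of the file")
    return header[targetindicator:end]


# Hypothetical outcome file with three columns
header = ["diabetes", "bmi", "sbp"]
print(outcome_columns(header))                                 # ['diabetes']
print(outcome_columns(header, targetindicator=1, outcomes=2))  # ['bmi', 'sbp']
```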

Model Specification

indadmixhiermodel	0 - Model for a collection of individuals in which the admixture proportions of each individual's parents, and the sum of intensities on each parental gamete, are statistically independent given the priors on these parameters. This option is useful in two situations: (1) when you already have strong prior information about the distribution of admixture in the population from which the individuals have been sampled, and want to specify a Dirichlet prior for each individual's parental admixture proportions using the option initalpha0; or (2) when you want to calculate the marginal likelihood of the model given the genotype data on each individual. 1 - Hierarchical model on individual admixture. The default is 1.
randommatingmodel	0 - assortative mating model (admixture proportions the same in both parents). 1 - random mating model. The default is 0.
globalrho	0 - the sum of intensities parameter r is allowed to vary between individuals (or between gametes, if a random mating model is specified). This specifies a hierarchical model, with a gamma distribution for the variation of r between individuals specified as below. 1 - the sum of intensities r is modelled as a global parameter, set to be the same on all parental gametes. The default is 1.
fixedallelefreqs	1 - priorallelefreqfile contains fixed allele frequencies. 0 - otherwise. The default is 0.
correlatedallelefreqs	Valid only with the populations or priorallelefreqfile options. 1 - specifies a correlated allele frequency model. 0 - otherwise. The default is 0.

Prior Specification

sumintensitiesprior, globalsumintensitiesprior	In a model with a global sum of intensities, or without a hierarchical model of individual admixture, the sum of intensities parameter has a Gamma(a, b) prior specified as globalsumintensitiesprior="a,b". Default values for a and b are 3 and 0.5, giving a prior mean of 6 and prior variance of 12. Otherwise (indadmixhiermodel=1 and globalrho=0), the sum of intensities parameter r has a Gamma(a, b) prior distribution and the scale parameter b has a gamma hyperprior with parameters b₀ and b₁. This specifies a "GammaGamma" prior, which has mean E(r) = ab₁ / (b₀ - 1) and variance E(r)(E(r) + 1) / (b₀ - 2). The three parameters of this prior are specified with sumintensitiesprior. The three values must be enclosed in quotes and separated by commas, e.g. sumintensitiesprior="2,3,4". Thus, for instance, to model an African-American population, for which we have prior information that the sum of intensities parameter is about 6 per morgan, we could specify sumintensitiesprior="6,40,39". This specifies the prior for the sum of intensities parameter r as Gamma(6, 1), which has mean 6 and variance 1. "0,1,0" specifies a flat prior on log r; "1,1,0" specifies a flat prior on r. The default, if this option is not specified, is "4,3,3". Where there are not enough data for reliable inference of the sum of intensities parameter, it is often useful to specify that the prior distribution should be truncated at some upper limit of plausible values, using the option truncationpoint.
etapriormean, etapriorvar	Specify the prior mean and variance of the dispersion parameter(s), h, in a dispersion or correlated allele frequency model.
etapriorfile	File containing parameters of the gamma prior distribution specified for the allele frequency dispersion parameter h in each subpopulation. This option can be used only when a dispersion model has been specified with the option historicallelefreqfile. This is useful when there are not enough data for the dispersion parameter to be inferred from the data, and we want to use prior information from population genetics. This file has one row for each subpopulation (in the same order as the order of subpopulations by columns in historicallelefreqfile), and two columns specifying the shape and location parameters of the gamma distribution. Thus, for a sample from an African-American population, in which historicallelefreqfile contains counts of alleles in samples of modern west Africans (in the first column) and Europeans (in the second column), we might specify an etaprior file containing these two lines: "50 1" and "500 1". This specifies a prior with mean 50 for the parameter for dispersion of allele frequencies between modern unadmixed west Africans and the African gene pool in African-Americans, and a prior with mean 500 and variance 500 for the parameter for dispersion of allele frequencies between modern unadmixed Europeans and the European gene pool in African-Americans. The dispersion parameter h is related to the fixation index F_ST by h = (1 + F_ST) / F_ST, so values of 50 and 500 for h correspond roughly to values of 0.02 and 0.002 for F_ST.
admixtureprior, admixtureprior1	When indadmixhiermodel=0, each of these two options can be used to specify a Dirichlet parameter vector for parental admixture proportions. The parameter vector is specified as a string of numbers separated by commas. For instance, with a model based on 3 subpopulations, admixtureprior="2, 8, 3.5" would specify the prior for parental admixture proportions (or for the maternal gamete, if option randommatingmodel=1 has been specified) with parameter vector c(2, 8, 3.5). admixtureprior1 can be used similarly to specify the prior for paternal admixture proportions if option randommatingmodel=1 has been specified. For example, admixtureprior="1,1,0" and admixtureprior1="1,1,1" would specify that one parent has 2-way admixture (between subpopulations 1 and 2) and the other has 3-way admixture between all three subpopulations. If indadmixhiermodel=1, admixtureprior can be used to specify initial values for the population admixture Dirichlet parameters.
regressionpriorprecision	Prior precision (1 / variance) of regression parameters
popadmixproportionsequal	Specifies that the population-level admixture proportions are to be kept equal
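
The prior means quoted in this table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not code from ADMIXMAP. It assumes that Gamma(a, b) has mean a / b and variance a / b² (which matches the stated mean of 6 and variance of 12 for the default "3,0.5"), that the three values of sumintensitiesprior are given in the order a, b₀, b₁, and that the dispersion parameter is related to F_ST as stated in the etapriorfile entry. The function names are hypothetical.

```python
def gamma_mean_var(a, b):
    """Mean and variance of Gamma(a, b), with mean a / b (assumed convention)."""
    return a / b, a / b ** 2


def gammagamma_mean(a, b0, b1):
    """Prior mean E(r) = a * b1 / (b0 - 1) of the "GammaGamma" prior.

    The mean is defined only for b0 > 1, so the flat priors
    "0,1,0" and "1,1,0" are rejected here.
    """
    if b0 <= 1:
        raise ValueError("the prior mean is undefined unless b0 > 1")
    return a * b1 / (b0 - 1)


def fst_from_dispersion(dispersion):
    """Invert dispersion = (1 + F_ST) / F_ST, i.e. F_ST = 1 / (dispersion - 1)."""
    return 1.0 / (dispersion - 1.0)


print(gamma_mean_var(3, 0.5))               # (6.0, 12.0) default globalsumintensitiesprior
print(gammagamma_mean(4, 3, 3))             # 6.0, default sumintensitiesprior "4,3,3"
print(gammagamma_mean(6, 40, 39))           # 6.0, the "6,40,39" example
print(round(fst_from_dispersion(50), 4))    # 0.0204
print(round(fst_from_dispersion(500), 4))   # 0.002
```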

Output Files

Pathnames of output files. Details of file formats are given under Output files.

paramfile	Population-level admixture and sum-of-intensities
regparamfile	Regression parameters
dispparamfile	Allele/haplotype frequency dispersion in historicallelefreqs model
indadmixturefile	Individual-level admixture proportions and sum-of-intensities
allelefreqoutputfile	Name of output file containing samples from the posterior distribution of ancestry-specific allele frequencies. Valid only when the allele frequencies are specified as random variables, i.e. when one of the two options priorallelefreqfile or historicallelefreqfile is specified and fixedallelefreqs is 0.
ergodicaveragefile	Ergodic averages of population-level parameters and of the mean and variance of the deviance.

Tests and Diagnostics

The options below specify additional tests or output, but do not change the model itself.

chib	1 - Calculate marginal likelihood for the first individual using Chib algorithm. 0 - default
thermo	1 - Use thermodynamic integration to compute marginal likelihood. 0 - default
testoneindiv	1 - compute marginal likelihood for the first individual listed in the genotypes file. This individual will not be included as part of the sample and should not be included in an outcomevarfile or covariatesfile. 0 - default
indadmixmodefile	Name of output file containing posterior estimates of the modes of individual admixture proportions and individual-level sumintensities (if globalrho=0).
admixturescorefile	Pathname of file to which results of a score test for the association of the trait with individual admixture will be written. This option is valid only if an outcome variable has been specified. This option is used only to obtain a formal test of the null hypothesis of no association between the trait and individual admixture. If admixturescorefile is specified, the regression model will not include individual admixture proportions as explanatory variables, and tests for allelic association or linkage will not be adjusted for the effect of individual admixture. Provided that an outcomevarfile is specified, and unless option admixturescorefile is specified, the program will fit a regression model with the outcome variable as dependent variable and individual admixture proportions (plus any covariates specified in inputfile) as explanatory variables.
allelicassociationscorefile	Name of output file containing score tests for association of the outcome variable with alleles at each simple locus, adjusting for individual admixture.
residualallelicassocscorefile	Name of output file containing score tests for residual allelic association between pairs of unlinked loci.
haplotypeassociationscorefile	Name of output file containing score tests for association of the outcome variable with haplotypes for all compound loci containing haplotypes, adjusting for individual admixture.
ancestryassociationscorefile	Name of output file containing score tests at each compound locus for linkage with genes underlying ethnic variation in the trait. This is a test for association of the trait with locus ancestry, adjusting for individual admixture and covariates. This test should be used in a cross-sectional or cohort study design. For a case-control study of a rare disease, the affected-only test below has greater statistical power.
affectedonlyscorefile	Name of output file containing score tests at each compound locus for linkage with ancestry, based on comparing the observed and expected proportions of gene copies at this locus that have ancestry from each subpopulation. This test is calculated from affected individuals only: individuals are their own controls. Even when the sample includes both cases and controls, this test is more powerful than the regression model score test in ancestryassociationscorefile if the disease is rare.
likratiofile	Name of output file containing likelihood ratios for the affecteds-only score test at values of 0.5 and 2 for the ancestry risk ratio.
allelefreqscorefile	Name of output file containing score tests of mis-specified ancestry specific allele frequencies. This option is valid only when the allele frequencies are fixed, i.e. when option allelefreqfile is specified or fixedallelefreqs is 1. There is a test for each population at each locus as well as a summary chi-squared test across populations.
hwscoretestfile	Name of output file containing score tests for heterozygosity across loci, as a test for departure from Hardy-Weinberg equilibrium. These can be used to detect genotyping errors.
stratificationtestfile	Name of output file containing test for residual population stratification (stratification not accounted for by the fitted model).
dispersiontestfile	Name of output file containing test for dispersion of allele frequencies between the unadmixed populations sampled and the corresponding ancestry-specific allele frequencies in the admixed population under study. This is evaluated for each subpopulation at each locus, and as a global test over all loci. This option is valid only if option priorallelefreqfile is specified. The results are "Bayesian p-values", as above.
fstoutputfile	This option is used only with option historicallelefreqfile (which specifies a dispersion model for allele frequencies). Under a dispersion model, the allele frequencies in unadmixed modern descendants are allowed to vary from the corresponding ancestry-specific allele frequencies in the admixed population. The variance of allele frequencies at a locus can be measured by Wright's "fixation index subpopulation-total" (F_st). In Wright's terminology, the unadmixed modern descendants and the pool of genes of corresponding ancestry in the admixed population are "subpopulations", and the "historic" population from which both these gene pools are derived is the "total" population. This differs from the terminology used in this manual, in which K "subpopulations" are specified in the model as ancestors of the admixed population. For each locus, and each subpopulation, specifying the option fstoutputfile will make the program output the ergodic average of the F_st value. These values can be examined as a diagnostic: a locus with an unusually large F_st value may indicate errors in coding, errors in typing, or possibly that allele frequencies in unadmixed modern descendants have diverged from the corresponding allele frequencies in the admixed population as a result of recent selection pressure.