ADMIXMAP: a program to model admixture using marker genotype data
User options
The program requires a list of options, specified either as command-line arguments or in a text file whose name is given as a single argument to the program. As explained above, the most convenient way to specify these arguments is to use a Perl script (see "admixmap.pl"). A list of these options is given in the following table. Required arguments are in bold.
samples |
Integer specifying total number of iterations of the Markov chain, including burn-in. With strong priors and informative markers, a run of about 500 should suffice for inference. Otherwise, a run of at least 20 000 iterations may be necessary. See here for how to determine if the run has been long enough. |
burnin |
Integer specifying number of iterations for burn-in of the Markov chain, before posterior samples are output. A burn-in of at least 50 iterations is recommended for inference. For analyses requiring long runs, a burn-in of up to 500 may be required. |
every |
Integer specifying the "thinning" of samples from the posterior distribution that are written to the output files, after the burn-in period. For example, if every=10, sampled values are written to the output files every 10 iterations. We recommend using a value of 5 to keep down the size of the output files. Sampling more frequently than this does not much improve the precision of results, because successive draws are not independent. Thinning the output samples does not affect the calculation of ergodic averages or test statistics, which are based on all sampled values. Note that every must be no greater than (samples - burnin) / 10, or some output files may be empty. |
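The constraint linking samples, burnin and every can be checked before starting a long run. The following is a minimal sketch of such a check; the function name is hypothetical and not part of ADMIXMAP, and the count of written draws assumes that one draw is written every `every` iterations after burn-in.

```python
def written_draws(samples: int, burnin: int, every: int) -> int:
    """Check the thinning interval and count post-burn-in draws written out.

    Hypothetical helper, not part of ADMIXMAP. The count assumes one draw is
    written every `every` iterations once burn-in has finished.
    """
    if not 0 <= burnin < samples:
        raise ValueError("burnin must be non-negative and smaller than samples")
    if every < 1:
        raise ValueError("every must be a positive integer")
    post_burnin = samples - burnin
    # Limit quoted in the description of `every`.
    if every > post_burnin / 10:
        raise ValueError(
            f"every={every} exceeds (samples - burnin) / 10 = {post_burnin / 10}"
        )
    return post_burnin // every


if __name__ == "__main__":
    print(written_draws(samples=20000, burnin=500, every=5))
```

With a short run such as samples=500 and burnin=50, the same check rejects every=50, because the limit is 45.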
numannealedruns |
If thermo=0, this specifies the number of "annealing" runs during burn-in. This usually improves mixing.
If thermo=1, this specifies the number of "temperatures" at which to run in order to estimate the marginal likelihood by thermodynamic integration. Default is 20. |
displaylevel |
0 - silent mode: only start and finish times are output to screen.
1 - quiet mode: model specification, priors, test results and diagnostics are written to screen.
2 - normal mode: more verbose information and an iteration counter are output to screen.
>2 - monitor mode: population-level parameters are also written to screen, with frequency specified by every. |
resultsdir | Path of directory for output files. Default is 'results'. |
logfile |
Name of log file written by the program. Default is 'logfile.txt'. |
seed |
Can be used to specify a seed for the random number generator. |
Allele / Haplotype Frequency Model
The program requires one of the following four options, any one of which specifies the number of subpopulations in the model: populations, allelefreqfile, priorallelefreqfile, or historicallelefreqfile. These options are mutually exclusive.
populations |
Integer specifying number of subpopulations that have contributed to the admixed population under study. If specified as 1, the program fits a model based on a single homogeneous population. This option is not required (and is ignored) if information about allele frequencies is supplied in allelefreqfile, priorallelefreqfile, or historicallelefreqfile, as the number of columns in any of these files defines the number of subpopulations in the model. If none of these files are specified, the parameters of the Dirichlet priors for allele or haplotype frequencies default to 1/n, where n is the number of alleles or haplotypes at each compound locus. |
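The rules above can be sketched as a small pre-flight check. This is an illustrative sketch only: the function name and the dictionary representation of the options are assumptions, and it follows the description of populations by treating that option as ignored, rather than as an error, when a frequency file is also given.

```python
FREQ_FILE_OPTIONS = (
    "allelefreqfile",
    "priorallelefreqfile",
    "historicallelefreqfile",
)


def effective_model_option(options: dict) -> str:
    """Return the option that determines the number of subpopulations.

    Hypothetical helper, not part of ADMIXMAP. `options` maps option names
    to their values, as they might be read from an options file.
    """
    files = [name for name in FREQ_FILE_OPTIONS if name in options]
    if len(files) > 1:
        raise ValueError("mutually exclusive options given: " + ", ".join(files))
    if files:
        # Any `populations` value is ignored: the file's columns define
        # the number of subpopulations.
        return files[0]
    if "populations" in options:
        return "populations"
    raise ValueError(
        "one of populations, allelefreqfile, priorallelefreqfile or "
        "historicallelefreqfile is required"
    )


if __name__ == "__main__":
    print(effective_model_option({"populations": 2}))
```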
allelefreqfile |
Pathname of file containing the allele
frequencies of the genotyped loci for each subpopulation. When this option
is specified, the model treats the allele frequencies as fixed constants. This option is obsolete, and retained only for backward compatibility. Instead, use option priorallelefreqfile to specify the allele frequencies, and specify option fixedallelefreqs=1. This allows you to use the same format for the allele frequency file, whether the allele frequencies are fixed, have a prior distribution with no dispersion, or are specified with a dispersion model. |
priorallelefreqfile |
Pathname of file containing parameters of the Dirichlet prior distributions for allele frequencies (or haplotype frequencies) at each compound locus in each subpopulation. Where allele frequencies have been estimated from a sample of unadmixed individuals, the prior distribution parameters for the corresponding subpopulation should be specified as the observed allele counts plus 0.5. Where no allele frequency data are available, specify the prior parameters as 0.5 for each allele ("reference" prior). When this option is specified, the program fits a model in which the allele frequencies in each subpopulation are estimated simultaneously from the unadmixed samples and the admixed sample under study. |
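The rule for building the prior parameters at one locus in one subpopulation can be written out as a short sketch. The function name is hypothetical, and the sketch covers only the arithmetic, not the layout of priorallelefreqfile (see Input files for that).

```python
from typing import Optional, Sequence


def dirichlet_prior_parameters(
    allele_counts: Optional[Sequence[float]] = None,
    n_alleles: Optional[int] = None,
) -> list:
    """Dirichlet prior parameters for one locus in one subpopulation.

    Hypothetical helper, not part of ADMIXMAP.
    """
    if allele_counts is not None:
        # Observed counts from unadmixed individuals, plus 0.5 per allele.
        return [count + 0.5 for count in allele_counts]
    if n_alleles is None or n_alleles < 2:
        raise ValueError("give allele_counts, or n_alleles >= 2")
    # No frequency data: "reference" prior of 0.5 for each allele.
    return [0.5] * n_alleles


if __name__ == "__main__":
    print(dirichlet_prior_parameters(allele_counts=[12, 30]))
    print(dirichlet_prior_parameters(n_alleles=3))
```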
historicallelefreqfile |
Pathname of file containing observed allele counts at the genotyped loci from samples of unadmixed individuals in each subpopulation. When this option is specified, the program fits a model that allows the "historic" allele frequencies in the unadmixed population to vary from the corresponding ancestry-specific allele frequencies in the admixed population under study. |
Details of file formats are under Input files
locusfile | path to file containing information about each locus typed |
genotypesfile | path to file containing genotypes for each individual typed |
outcomevarfile | path to file containing values of outcome variables |
coxoutcomevarfile | path to file containing data for a Cox regression |
covariatesfile | path to file containing covariates for a regression model |
targetindicator | Integer specifying column in outcomevarfile that contains the first outcome variable to be modelled. This column number should be specified as an offset from column 1: thus to select the variable in column 1, specify targetindicator=0. The default is 0. |
outcomes |
Valid only with outcomevarfile. Integer specifying the number of columns of outcomevarfile to use, starting with the column selected by targetindicator. |
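Because targetindicator is an offset from column 1, the combination of targetindicator and outcomes behaves like a zero-based slice over the columns of outcomevarfile. The sketch below illustrates this; the function name and the example column names are hypothetical, and the default of 1 for outcomes is an assumption made for the example, not a documented default.

```python
def outcome_columns(column_names, targetindicator=0, outcomes=1):
    """Return the outcome columns selected by targetindicator and outcomes.

    Hypothetical helper, not part of ADMIXMAP. `targetindicator` is an
    offset from the first column, so 0 selects column 1.
    """
    first = targetindicator
    last = targetindicator + outcomes
    if first < 0 or outcomes < 1 or last > len(column_names):
        raise ValueError("selection falls outside the columns of the file")
    return list(column_names[first:last])


if __name__ == "__main__":
    # Hypothetical header of an outcome variable file.
    header = ["diabetes", "bmi", "sbp"]
    print(outcome_columns(header, targetindicator=1, outcomes=2))
```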
reportedancestry |
not fully tested or documented: allows prior information about each individual’s ancestry to be specified in the model |
testgenotypesfile | specifies genotypes for offline score tests at loci that have not been included in the model. |
indadmixhiermodel |
0 - Model for a collection of individuals in which the admixture proportions of each individual’s parents, and the sum of intensities on each parental gamete, are statistically independent given the priors on these parameters. This option is useful in two situations: (1) when you already have strong prior information about the distribution of admixture in the population from which the individuals have been sampled, and want to specify a Dirichlet prior for each individual’s parental admixture proportions using the option initalpha0; or (2) when you want to calculate the marginal likelihood of the model given the genotype data on each individual.
1 - Hierarchical model on individual admixture.
The default is 1. |
randommatingmodel |
0 - assortative mating model (admixture proportions the same in both parents).
1 - random mating model.
The default is 0. |
globalrho |
0 - the sum of intensities parameter ρ is allowed to vary between individuals (or between gametes if a random mating model is specified). This specifies a hierarchical model, with a gamma distribution for the variation of ρ between individuals specified as below.
1 - the sum of intensities ρ is modelled as a global parameter, set to be the same on all parental gametes.
The default is 1. |
fixedallelefreqs |
1 - priorallelefreqfile contains fixed allele frequencies.
0 - otherwise.
The default is 0. |
correlatedallelefreqs |
Valid only with the populations or priorallelefreqfile options.
1 - specifies a correlated allele frequency model.
0 - otherwise.
The default is 0. |
sumintensitiesprior, globalsumintensitiesprior |
In a model with global sumintensities, or without a hierarchical model of individual admixture, the sum of intensities parameter has a Gamma(a, b) prior specified as globalsumintensitiesprior="a,b". Default values for a and b are 3 and 0.5, giving a prior mean of 6 and prior variance of 12.
Otherwise (indadmixhiermodel=1 and globalrho=0), the sum of intensities parameter ρ has a Gamma(a, b) prior distribution and the scale parameter b has a beta hyperprior with parameters b0 and b1. This specifies a "GammaGamma" prior, which has mean E(ρ) = a × b1 / (b0 - 1) and variance E(ρ)(E(ρ) + 1) / (b0 - 2). The three parameters of this prior are specified with sumintensitiesprior. The three values must be enclosed in quotes and separated by commas, e.g. sumintensitiesprior="2,3,4".
Thus, for instance, to model an African-American population, for which we have prior information that the sum of intensities parameter is about 6 per morgan, we could specify sumintensitiesprior="6,40,39". This specifies the prior for the sum of intensities parameter ρ as Gamma(6, 1) which has mean 6 and variance 1.
"0,1,0" specifies a flat prior on log ρ.
"1,1,0" specifies a flat prior on ρ.
The default, if this option is not specified, is "4,3,3".
Where there are not enough data for reliable inference of the sum of intensities parameter, it is often useful to specify that the prior distribution should be truncated at some upper limit of plausible values, using the option truncationpoint. |
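The prior means quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and its function names are hypothetical. It assumes that Gamma(a, b) is parameterised by shape a and rate b, which is the reading consistent with the quoted mean of 6 and variance of 12 for the defaults 3 and 0.5, and it uses the GammaGamma mean formula exactly as stated in the text.

```python
def parse_prior(text: str) -> list:
    """Split a comma-separated prior string such as "6,40,39" into floats."""
    return [float(value) for value in text.split(",")]


def gamma_moments(shape: float, rate: float) -> tuple:
    """Mean and variance of a gamma distribution (shape/rate form, assumed)."""
    return shape / rate, shape / rate ** 2


def gammagamma_mean(a: float, b0: float, b1: float) -> float:
    """Prior mean a * b1 / (b0 - 1), as quoted in the option description."""
    if b0 <= 1:
        # Covers the flat priors "0,1,0" and "1,1,0", which have no finite mean.
        raise ValueError("the mean is defined only for b0 > 1")
    return a * b1 / (b0 - 1)


if __name__ == "__main__":
    print(gamma_moments(*parse_prior("3,0.5")))    # globalsumintensitiesprior default
    print(gammagamma_mean(*parse_prior("6,40,39")))
    print(gammagamma_mean(*parse_prior("4,3,3")))  # sumintensitiesprior default
```

Under these assumptions the default "4,3,3" and the example "6,40,39" both give a prior mean of 6.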
etapriormean, etapriorvar | Specify the prior mean and variance of the dispersion parameter(s), η, in a dispersion or correlated allele frequency model. |
etapriorfile |
File containing parameters of the gamma prior distribution specified for the allele frequency dispersion parameter η in each subpopulation. This option can be used only when a dispersion model has been specified with the option historicallelefreqfile. It is useful when there are not enough data for the dispersion parameter to be inferred, and we want to use prior information from population genetics. The file has one row for each subpopulation (in the same order as the subpopulations appear by column in historicallelefreqfile) and two columns specifying the shape and location parameters of the gamma distribution.
Thus, for a sample from an African-American population, in which historicallelefreqfile contains counts of alleles in samples of modern west Africans (in the first column) and Europeans (in the second column), we might specify an etapriorfile containing these two lines:
50 1
500 1
This specifies a prior with mean 50 for the parameter for dispersion of allele frequencies between modern unadmixed west Africans and the African gene pool in African-Americans, and a prior with mean 500 and variance 500 for the parameter for dispersion of allele frequencies between modern unadmixed Europeans and the European gene pool in African-Americans. The dispersion parameter is related to the fixation index FST by x = (1 + FST) / FST, so values of 50 and 500 for x correspond roughly to values of 0.02 and 0.002 for FST. |
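The numbers in this example can be checked with a short sketch. The function names are hypothetical. The second column of each row is treated here as a rate, which is an assumption: it is the reading under which the row "500 1" has mean 500 and variance 500, as stated above. The FST conversion simply inverts the relation x = (1 + FST) / FST given in the text.

```python
def parse_etaprior(text: str) -> list:
    """Read one (first column, second column) pair per non-empty line."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            shape, second = line.split()
            rows.append((float(shape), float(second)))
    return rows


def gamma_mean_variance(shape: float, rate: float) -> tuple:
    """Mean and variance, assuming the second column acts as a rate."""
    return shape / rate, shape / rate ** 2


def fst_from_dispersion(x: float) -> float:
    """Invert x = (1 + FST) / FST, which gives FST = 1 / (x - 1)."""
    if x <= 1:
        raise ValueError("x must be greater than 1")
    return 1.0 / (x - 1.0)


if __name__ == "__main__":
    for shape, rate in parse_etaprior("50 1\n500 1\n"):
        mean, variance = gamma_mean_variance(shape, rate)
        print(mean, variance, round(fst_from_dispersion(mean), 4))
```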
admixtureprior, admixtureprior1 |
When indadmixhiermodel=0, each of these two options can be used to specify a Dirichlet parameter vector for parental admixture proportions. The parameter vector is specified as a string of numbers separated by commas. For instance, with a model based on 3 subpopulations, admixtureprior = “2, 8, 3.5” would specify the prior for parental admixture proportions (or for the maternal gamete if option randommatingmodel=1 has been specified) with parameter vector c(2, 8, 3.5). admixtureprior1 can be used similarly to specify the prior for paternal admixture proportions if option randommatingmodel=1 has been specified. For example, "admixtureprior = 1,1,0" and "admixtureprior1 = 1,1,1" would specify that one parent has 2-way admixture (between subpopulations 1 and 2) and the other has 3-way admixture between subpopulations 1, 2 and 3.
If indadmixhiermodel=1, admixtureprior can be used to specify initial values for the population admixture Dirichlet parameters. |
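To see what a given parameter vector implies, it can help to compute the prior mean of the admixture proportions, which for a Dirichlet distribution is each parameter divided by the sum of the parameters. The following sketch uses hypothetical function names and is not part of ADMIXMAP.

```python
def parse_admixture_prior(text: str) -> list:
    """Split a comma-separated parameter string such as "2, 8, 3.5"."""
    return [float(value) for value in text.split(",")]


def dirichlet_mean(alpha: list) -> list:
    """Prior mean admixture proportions: each parameter over their sum."""
    total = sum(alpha)
    if total <= 0 or any(a < 0 for a in alpha):
        raise ValueError("parameters must be non-negative with a positive sum")
    # A zero parameter gives a zero proportion for that subpopulation.
    return [a / total for a in alpha]


if __name__ == "__main__":
    for text in ("2, 8, 3.5", "1,1,0", "1,1,1"):
        proportions = dirichlet_mean(parse_admixture_prior(text))
        print(text, [round(p, 3) for p in proportions])
```

A zero entry, as in "1,1,0", gives that subpopulation a proportion of zero, which is how the 2-way admixture example excludes subpopulation 3.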
regressionpriorprecision | Prior precision (1 / variance) of regression parameters |
popadmixproportionsequal | Specifies that the population-level admixture proportions are to be kept equal |
Pathnames of output files, details of file formats in Output files.
paramfile | Population-level admixture and sum-of-intensities |
regparamfile | Regression parameters |
dispparamfile | Allele/haplotype frequency dispersion in historicallelefreqs model |
indadmixturefile | Individual-level admixture proportions and sum-of-intensities |
allelefreqoutputfile | Name of output file containing samples from the posterior distribution of ancestry-specific allele frequencies. Valid only when the allele frequencies are specified as random variables, i.e. when one of the two options priorallelefreqfile or historicallelefreqfile is specified and fixedallelefreqs is 0. |
ergodicaveragefile | Ergodic averages of population-level parameters and of the mean and variance of the deviance. |
The options below specify additional tests or output, but do not change the model itself.
chib |
1 - calculate the marginal likelihood for the first individual using the Chib algorithm.
0 - default. |
thermo |
1 - use thermodynamic integration to compute the marginal likelihood.
0 - default. |
testoneindiv |
1 - compute the marginal likelihood for the first individual listed in the genotypes file. This individual will not be included as part of the sample and should not be included in an outcomevarfile or covariatesfile.
0 - default. |
indadmixmodefile | Name of output file containing posterior estimates of the modes of individual admixture proportions and individual-level sumintensities (if globalrho=0). |
admixturescorefile |
Pathname of file to which results of a score test for the association of the trait with individual admixture will be written. This option is valid only if an outcome variable has been specified. It is used only to obtain a formal test of the null hypothesis of no association between the trait and individual admixture. If admixturescorefile is specified, the regression model will not include individual admixture proportions as explanatory variables, and tests for allelic association or linkage will not be adjusted for the effect of individual admixture. If an outcomevarfile is specified and admixturescorefile is not, the program fits a regression model with the outcome variable as dependent variable and individual admixture proportions (plus any covariates specified in inputfile) as explanatory variables. |
allelicassociationscorefile |
Name of output file containing score tests for association of the outcome variable with alleles at each simple locus, adjusting for individual admixture. |
residualellelicassocscorefile | Name of output file containing score tests for residual allelic association between pairs of unlinked loci. |
haplotypeassociationscorefile |
Name of output file containing score tests for association of the outcome variable with haplotypes for all compound loci containing haplotypes, adjusting for individual admixture. |
ancestryassociationscorefile |
Name of output file containing score
tests at each compound locus for linkage with genes underlying ethnic
variation in the trait. This is a test for association of the trait with
locus ancestry, adjusting for individual admixture and covariates. This
test should be used in a cross-sectional or cohort study design. For a case-control
study of a rare disease, the affected-only test below has greater
statistical power. |
affectedonlyscorefile |
Name of output file containing score tests at each compound locus for linkage with ancestry, based on comparing the observed and expected proportions of gene copies at this locus that have ancestry from each subpopulation. This test is calculated from affected individuals only: individuals are their own controls. Even when the sample includes both cases and controls, this test is more powerful than the regression model score test in ancestryassociationscorefile if the disease is rare. |
likratiofile | Name of output file containing likelihood ratios for the affecteds-only score test at values of 0.5 and 2 for the ancestry risk ratio. |
allelefreqscorefile |
Name of output file containing score tests of mis-specified ancestry specific allele frequencies. This option is valid only when the allele frequencies are fixed, i.e. when option allelefreqfile is specified or fixedallelefreqs is 1. There is a test for each population at each locus as well as a summary chi-squared test across populations. |
hwscoretestfile | Name of output file containing score tests for heterozygosity across loci, as a test for departure from Hardy-Weinberg equilibrium. These can be used to detect genotyping errors. |
Name of output file containing test for residual population stratification (stratification not accounted for by the fitted model). |
|
Name of output file containing test for dispersion of allele frequencies between the unadmixed populations sampled and the corresponding ancestry-specific allele frequencies in the admixed population under study. This is evaluated for each subpopulation at each locus, and as a global test over all loci. This option is valid only if option priorallelefreqfile is specified. The results are "Bayesian p-values", as above. |
|
fstoutputfile |
This option is used only with option
historicallelefreqfile |