ADMIXMAP

a program to model admixture using marker genotype data

Output files are formatted as either tab-delimited tables with a header line (for 2-way arrays) or as R objects (for 3- or 4-way arrays). Output files are written to the directory specified by resultsdir.
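As an illustration of the tab-delimited format described above, the following Python sketch parses a table with a header line. The column names and values in the sample text are hypothetical, and the sketch assumes every column after the header is numeric.

```python
import csv
import io

def read_table(text):
    """Parse tab-delimited text with a header line into (header, rows of floats)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    # Skip blank lines; assume all remaining cells are numeric.
    rows = [[float(x) for x in row] for row in reader if row]
    return header, rows

# Hypothetical sample; real column names depend on the model and options.
sample = "alpha1\talpha2\tsumintensities\n1.2\t0.8\t5.5\n1.3\t0.7\t6.1\n"
header, rows = read_table(sample)
```

To read an actual output file, its contents can be passed to read_table after opening the file from resultsdir.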

paramfile - Posterior draws of the following, at intervals determined by option every:

Parameters of the Dirichlet distribution for parental admixture: one for each subpopulation
Sum of intensities for the stochastic process of transitions of ancestry on hybrid chromosomes

regparamfile - Posterior draws of intercept, slope and precision (the inverse of the residual variance) parameters in the regression model, at intervals determined by option every.

dispparamfile - Posterior draws of allele frequency dispersion parameters, one for each subpopulation, at intervals determined by option every. These are written only if option historicallelefreqfile has been specified or correlatedallelefreqs = 1.

Median and 95% credible intervals for these parameters are written to the file PosteriorQuantiles.txt.
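A minimal sketch of how a median and a 95% credible interval can be obtained from a list of posterior draws is given below. It uses linear interpolation between order statistics; this is an assumption for illustration, and the program's own quantile calculation may differ in detail.

```python
def quantile(sorted_xs, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1 - frac) + sorted_xs[hi] * frac

def posterior_summary(draws):
    """Return (median, lower 2.5% bound, upper 97.5% bound) of the draws."""
    xs = sorted(draws)
    return quantile(xs, 0.5), quantile(xs, 0.025), quantile(xs, 0.975)
```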

indadmixturefile - Posterior draws of individual/gamete level variables, at intervals determined by option every, written as an R object. The outputs to this file are, in the following order:

gamete admixture proportions, ordered by subpopulations and then by gamete if a random mating model is specified. If an assortative mating model is specified only individual admixture proportions will be output.
gamete/individual sum-of-intensities if globalrhoindicator is false
predicted value of the outcome variable in the regression model
paternal and maternal haplotypes at this locus.

These values are written out for every individual at every iteration. This file is formatted to be read into R as a three-way array (indexed by variables, individuals, draws).

allelefreqoutputfile - Posterior draws of the ancestry-specific allele or haplotype frequencies for each state of ancestry at each compound locus, at intervals determined by option every. These results can be used to compute new parameters for the prior distributions specified in priorallelefreqfile, which can be used in subsequent studies with independent samples.

ergodicaveragefile - Cumulative posterior means over all iterations ("ergodic averages") for the variables in paramfile, regparamfile and dispparamfile, output at intervals of 10 × every iterations. Monitoring these ergodic averages allows the user to determine whether the sampler has been run long enough for the posterior means to have been estimated accurately.
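The ergodic average at a given iteration is simply the running mean of all draws up to that point. A minimal Python sketch of the calculation, with a hypothetical function name, is:

```python
def ergodic_averages(draws):
    """Return the cumulative mean after each draw."""
    total = 0.0
    averages = []
    for count, value in enumerate(draws, start=1):
        total += value
        averages.append(total / count)
    return averages
```

If the later entries of the returned list have levelled off, the posterior mean has probably been estimated to reasonable accuracy.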

The output files admixturescorefile, allelicassociationscorefile, ancestryassociationscorefile, affectedsonlyscorefile contain results of score tests obtained by averaging over the posterior distribution. Each table of score test results, based on cumulative averages for the score and information over all posterior samples obtained after the burn-in period, is output at intervals of 10 × every. Monitoring these repeated updates allows the user to determine when the sampler has been run long enough for the test results to be computed accurately. For inference, only the last table, which is output separately and which is based on the entire posterior sample, is used. All these files are formatted to be read into R as a three-way array (indexed by loci, test statistics, output number).

For univariate null hypotheses (testing the effect of one allele, one haplotype, or one subpopulation against all others) the test statistic is the score divided by the square root of the observed information, which has a standard normal distribution under the null hypothesis. The percent of information extracted (the ratio of observed information to complete information) measures the information obtained about the parameter under test, in comparison with the information that would be obtained if individual admixture, haplotypes at each locus, and gamete ancestry at each locus were measured without error.
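The univariate calculation can be sketched in Python as follows. The function and argument names are hypothetical, and the two-sided p-value is an addition for illustration that follows from the standard normal null distribution.

```python
import math

def univariate_score_test(score, observed_info, complete_info):
    """Return (z statistic, two-sided p-value, percent information extracted)."""
    z = score / math.sqrt(observed_info)
    # Two-sided tail probability of a standard normal variate.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    percent_info = 100.0 * observed_info / complete_info
    return z, p_value, percent_info
```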

For the affected-only and ancestry association score tests, the missing information can be partitioned into two components: missing information about locus ancestry, and missing information about model parameters (parental admixture). These components are tabulated separately.

For composite null hypotheses, the score U is a vector, the observed information V is a matrix, and the test statistic U^T V^{-1} U (where U^T denotes the transpose of U) has a chi-squared distribution under the null hypothesis.
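A minimal pure-Python sketch of the composite statistic follows. Rather than inverting V explicitly, it solves V x = U by Gaussian elimination and returns the inner product of U and x. The function names are hypothetical, and V is assumed to be non-singular.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def chi_squared_statistic(u, v):
    """Return U^T V^{-1} U for score vector u and information matrix v."""
    x = solve(v, u)
    return sum(ui * xi for ui, xi in zip(u, x))
```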

admixturescorefile - test for association of trait with individual admixture. The null hypothesis is no effect of individual admixture in a regression model, with covariates as explanatory variables if specified. The test statistic is computed for the effect of each subpopulation separately, with a summary chi-square test over all subpopulations if there are more than two subpopulations.

allelicassociationscorefile - tests for allelic association at each locus. The null hypothesis is no effect of the alleles or haplotypes in a regression analysis with individual admixture (and covariates if specified) as explanatory variables. The test statistic is computed for each allele or haplotype separately, with a summary chi-square statistic over all alleles or haplotypes at each locus if there are more than two alleles or haplotypes. Rare alleles or haplotypes are grouped together.
This test is appropriate when testing for association of the trait with alleles or haplotypes in a candidate gene.

ancestryassociationscorefile - tests for linkage of each locus with genes underlying ethnic variation in disease risk or trait values. This is a test for association of the trait with ancestry at each compound locus, conditional on parental admixture. The null hypothesis is no effect of locus ancestry in a regression analysis with individual admixture (and covariates if specified) as explanatory variables. The test statistic is computed for the effect of each subpopulation separately, with a summary chi-square statistic over all subpopulations at each locus if there are more than two subpopulations. The proportion of information extracted depends upon the information content for ancestry of the marker locus and other nearby loci. This test is appropriate when the objective of the study is to exploit admixture to localize genes underlying ethnic variation in the trait value, using ancestry-informative markers rather than candidate gene polymorphisms.

affectedsonlyscorefile - tests for linkage of each locus with genes underlying the ethnic difference in disease risk, using only the affected individuals. The null hypothesis is that the risk ratio between populations that the locus accounts for is 1. This test statistic is computed for the effect of each subpopulation at each locus. The test compares at each locus the observed and expected proportion of gene copies that have ancestry from the high-risk subpopulation. This is the only test that can be used if the sample consists only of affected individuals. Even if a control group has been typed, for a rare disease the affected-only test is more efficient than the test given in ancestryassociationscorefile based on a regression model. This is because for a rare disease, the observed and expected proportion of gene copies that have ancestry from the high-risk subpopulation will not differ by very much in unaffected individuals.

allelefreqscorefile - tests for mis-specification of ancestry-specific allele frequencies.
This test is computed only if allele frequencies have been specified as fixed with option allelefreqfile. For each compound locus and each subpopulation, a score test is computed for the null hypothesis that the frequencies of all alleles have been specified correctly. A summary test over all k subpopulations is also computed at each locus.

args.txt - a list of the options used by the program. This is used by the R script to identify output files and other information. This is written to resultsdir.

An R script (AdmixmapOutput.R) is supplied that processes these output files to produce tables of posterior quantiles, frequency plots of the posterior distribution, and plots of the cumulative posterior means for the variables that are output to paramfile. The R script also calculates a summary slope parameter for the effect of admixture from each subpopulation, versus the others. This R script is run automatically from the Perl script (admixmap.pl) that is supplied as a wrapper for the program.

Interpretation of output from the program

These notes are based on the output produced by using the Perl script admixmap.pl to run the main program. Output files produced by the main program are processed by the R script AdmixmapOutput.R. This produces several text files, and a file plots.ps containing graphs in postscript format.

Evaluating the sampler

The adequacy of the burn-in period can be evaluated by the Geweke diagnostics in the R output. If the burn-in period is adequate, the numbers in this table should have approximately a standard normal distribution.
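The idea behind the Geweke diagnostic can be sketched as below: the mean of an early segment of the chain is compared with the mean of a late segment. This is a simplified version that assumes roughly independent draws and uses ordinary sample variances. The full diagnostic uses spectral density estimates to allow for autocorrelation, so the numbers in the R output will generally differ from those produced by this sketch.

```python
import math
import statistics

def geweke_z(draws, first=0.1, last=0.5):
    """Simplified Geweke z-score comparing early and late segments of a chain."""
    n = len(draws)
    early = draws[: int(first * n)]
    late = draws[n - int(last * n):]
    # Standard error of the difference in means, ignoring autocorrelation.
    se = math.sqrt(
        statistics.variance(early) / len(early)
        + statistics.variance(late) / len(late)
    )
    return (statistics.mean(early) - statistics.mean(late)) / se
```

A chain that is still drifting after the burn-in gives a z-score far from zero.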

The mixing of the MCMC sampler can be evaluated by examining the autocorrelation plots. Autocorrelation extending beyond 20 iterations (2 thinned draws if every = 10) indicates slow mixing.
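The quantity shown in an autocorrelation plot is the sample autocorrelation of the draws at each lag. A minimal sketch of the calculation for a single lag, with a hypothetical function name, is:

```python
def autocorrelation(draws, lag):
    """Sample autocorrelation of a sequence of draws at the given lag."""
    n = len(draws)
    mean = sum(draws) / n
    total_ss = sum((x - mean) ** 2 for x in draws)
    cross = sum(
        (draws[i] - mean) * (draws[i + lag] - mean) for i in range(n - lag)
    )
    return cross / total_ss
```

Here the lag counts thinned draws, so with every = 10 a lag of 2 corresponds to 20 iterations.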

Acceptance rates for the Metropolis-Hastings samplers used by the program are printed to screen and logfile.

The adequacy of the total number of iterations can be evaluated by examining a plot of the statistic of interest calculated from all iterations since the end of the burn-in period, against the iteration number. Where inference is based on the mean of a parameter, this statistic is an ergodic (cumulative) average over all iterations to that point. Plots of ergodic averages of the population-level parameters are given in file ErgodicAveragePlots.ps.

Evaluating the fit of the model

The file stratificationtestfile contains results of a diagnostic test for residual population stratification that is not explained by the fitted model. For details of how this test is calculated, and a discussion of how to interpret it, see Hoggart (2003). The test is based on testing for allelic association between unlinked loci that is not explained by the model. The result is a "Bayesian p-value": p < 0.5 indicates lack of fit. The "Bayesian p-value" calculated by this test is more conservative than a classical p-value. Our experience has been that a test p-value of 0.3 or less is fairly strong evidence for residual stratification. Where this statistic yields evidence of lack of fit, the model should be specified with more subpopulations, unless there is some other reason for lack of fit such as mis-specified allele frequencies.
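For readers unfamiliar with the term, a "Bayesian p-value" in a generic posterior predictive check is the proportion of posterior draws in which a test statistic computed from replicated data is at least as large as the statistic computed from the observed data. The sketch below shows only this generic calculation. It is not taken from the program, and the specific statistic and direction of comparison used for the stratification test are those described in Hoggart (2003).

```python
def bayesian_p_value(observed_stats, replicated_stats):
    """Proportion of draws where the replicated statistic >= the observed one.

    Both arguments hold one value per posterior draw (hypothetical inputs).
    """
    exceed = sum(
        1 for obs, rep in zip(observed_stats, replicated_stats) if rep >= obs
    )
    return exceed / len(observed_stats)
```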

The file dispersiontestfile contains results of a diagnostic test for variation between the allele frequencies in the unadmixed populations that have been sampled to calculate the prior parameter values in priorallelefreqfile and the corresponding ancestry-specific allele frequencies in the admixed population under study. Again the results are "Bayesian p-values", for which the deviation of the test p-value from its expected value of 0.5 does not provide an absolute measure of the strength of evidence for lack of fit. For each subpopulation, the test statistic is calculated as a summary test over all loci and for each locus separately. Examination of the test statistic for each locus may reveal errors in coding, or errors in specifying the prior allele frequencies.

The option dispersiontestfile is valid only where option priorallelefreqfile has been specified. Where allele frequencies have been specified as fixed, option allelefreqscorefile should be specified and the output file should be examined.

No diagnostic test for lack of fit of the distribution of individual admixture proportions to the model is yet implemented. However, the plots in file Plots.ps can be examined to compare the estimated distribution of individual admixture proportions (based on the posterior means for individual admixture) with an estimate for the distribution of individual admixture values in the population (based on the posterior means for the Dirichlet parameters of this distribution).

The deviance and Deviance Information Criterion (DIC) are computed each time.
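As a reminder of the standard definition, the DIC is the posterior mean deviance plus the effective number of parameters pD, where pD is the posterior mean deviance minus the deviance evaluated at the posterior means of the parameters. The sketch below assumes this standard definition and uses hypothetical argument names; the details of the program's calculation are not shown here.

```python
def deviance_information_criterion(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, pD) from sampled deviances and the plug-in deviance."""
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_posterior_mean  # effective parameters
    return mean_deviance + p_d, p_d
```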

For an analysis of a single individual, with option chib, the log marginal likelihood, also known as the log evidence, is computed.

With option thermo=1, the marginal likelihood is approximated for any model. The greater the value of numannealedruns, the more accurate will be the approximation, but the longer the program will take to run.
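The trade-off between accuracy and run time can be illustrated with a sketch of thermodynamic integration, in which the log marginal likelihood is approximated by integrating the mean log-likelihood over a ladder of temperatures from 0 to 1. The sketch assumes that each annealed run supplies one temperature and one mean log-likelihood and that the trapezium rule is used. These are assumptions for illustration and may not match the program's implementation.

```python
def thermodynamic_integral(temperatures, mean_log_likelihoods):
    """Trapezium-rule estimate of the integral of mean log-likelihood over temperature.

    temperatures must be increasing from 0 to 1, with one mean log-likelihood
    per temperature (hypothetical inputs).
    """
    total = 0.0
    for i in range(1, len(temperatures)):
        width = temperatures[i] - temperatures[i - 1]
        height = 0.5 * (mean_log_likelihoods[i] + mean_log_likelihoods[i - 1])
        total += width * height
    return total
```

Under these assumptions, more annealed runs give a finer temperature ladder and hence a smaller integration error, at the cost of a longer run.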