ADMIXMAP - a program to model admixture using marker genotype data
Output files are formatted as either tab-delimited tables with a header line (for 2-way arrays) or as R objects (for 3- or 4-way arrays). Output files are written to the directory specified by resultsdir.
paramfile - Posterior draws of the following at intervals determined by option every: -
regparamfile - Posterior draws of intercept, slope and precision (the inverse of the residual variance) parameters in the regression model, at intervals determined by option every.
dispparamfile - Posterior draws of allele frequency dispersion parameters, one for each subpopulation, at intervals determined by option every. These are written only if option historicallelefreqfile has been specified or correlatedallelefreqs = 1.
Median and 95% credible intervals for these parameters are written to the file PosteriorQuantiles.txt.
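As an illustration of how such summaries can be derived from a tab-delimited table of posterior draws with a header line, the following Python sketch computes the median and the 2.5% and 97.5% quantiles of each column. The column names and values are hypothetical, the linear-interpolation quantile rule is an assumption, and this is not the code the program uses to write PosteriorQuantiles.txt.

```python
import csv
import io


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    if not sorted_values:
        raise ValueError("no values supplied")
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def posterior_summary(table_text):
    """Map each column of a tab-delimited table (with header) to
    (2.5% quantile, median, 97.5% quantile)."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            columns[name].append(float(row[name]))
    return {
        name: tuple(quantile(sorted(values), q) for q in (0.025, 0.5, 0.975))
        for name, values in columns.items()
    }


# Hypothetical table: real column names depend on the analysis being run.
example = "param.a\tparam.b\n1.0\t10.0\n2.0\t20.0\n3.0\t30.0\n4.0\t40.0\n5.0\t50.0\n"
summary = posterior_summary(example)
print(summary["param.a"])
```

To apply this to a real file, the text of the file would be read and passed to posterior_summary in place of the example string.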
indadmixturefile - Posterior draws of individual- or gamete-level
variables, at intervals determined by option every, written as an R object.
The outputs to this file are, in the following order;
These values are written out for every individual at every iteration. This file is formatted to be read into R as a three-way array (indexed by variables, individuals, draws).
allelefreqoutputfile - Posterior
draws of the ancestry-specific allele or haplotype frequencies for each state
of ancestry at each compound locus, at intervals determined by option
every. These results can be used to
compute new parameters for the prior distributions specified in
priorallelefreqfile, which can be used in subsequent studies with independent
samples.
ergodicaveragefile - Cumulative
posterior means over all iterations ("ergodic averages") for the
variables in paramfile, regparamfile and dispparamfile, output at intervals of 10 × every iterations. Monitoring these ergodic averages allows the user to determine whether the sampler
has been run long enough for the posterior means to have been estimated
accurately.
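The idea of an ergodic average can be sketched in a few lines of Python: a running mean of the draws, reported at a fixed interval. The function name and the reporting interval argument are hypothetical and chosen for illustration only; the program's own bookkeeping may differ.

```python
def ergodic_averages(draws, report_interval=10):
    """Cumulative means of `draws`, reported after every `report_interval` draws."""
    if report_interval < 1:
        raise ValueError("report_interval must be at least 1")
    averages = []
    running_total = 0.0
    for count, value in enumerate(draws, start=1):
        running_total += value
        if count % report_interval == 0:
            averages.append(running_total / count)
    return averages


# Illustrative draws 1, 2, ..., 30 (not real sampler output).
draws = [float(i) for i in range(1, 31)]
averages = ergodic_averages(draws, report_interval=10)
print(averages)
```

If successive values in the returned list are still drifting, the posterior mean has probably not yet been estimated accurately.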
The output
files admixturescorefile, allelicassociationscorefile,
ancestryassociationscorefile, affectedsonlyscorefile contain results of
score tests obtained by averaging over the posterior distribution. Each table of score test results, based on
cumulative averages for the score and information over all posterior samples
obtained after the burn-in period, is output at intervals of 10 × every. Monitoring these repeated updates allows
the user to determine when the sampler has been run long enough for the test
results to be computed accurately. For
inference, only the last table, which is output separately and which is based on the entire
posterior sample, is used. All these files are formatted to be read into R as a
three-way array (indexed by loci, test statistics, output number).
For univariate
null hypotheses (testing the effect of one allele, one haplotype, or one
subpopulation against all others) the test statistic is the score divided by
the square root of the observed information, which has a standard normal
distribution under the null hypothesis. The percent of
information extracted (the ratio of observed information to complete
information) measures the information obtained about the parameter under
test, in comparison with the information that would be obtained if individual
admixture, haplotypes at each locus, and gamete ancestry at each locus were
measured without error.
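The univariate calculation described above can be sketched as follows. The function and argument names are hypothetical, the input numbers are invented, and the two-sided p-value is an assumption about how the statistic would typically be reported.

```python
import math


def univariate_score_test(score, observed_info, complete_info):
    """Return (z statistic, two-sided p-value, percent of information extracted)."""
    if observed_info <= 0 or complete_info <= 0:
        raise ValueError("information values must be positive")
    z = score / math.sqrt(observed_info)
    # Two-sided tail probability of a standard normal variate.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    percent_info = 100.0 * observed_info / complete_info
    return z, p_value, percent_info


# Invented values for illustration only.
z, p_value, percent_info = univariate_score_test(
    score=6.0, observed_info=9.0, complete_info=12.0
)
print(z, p_value, percent_info)
```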
For the
affected-only and ancestry association score tests, the missing information can be partitioned into two
components: missing information about locus ancestry, and missing information
about model parameters (parental admixture). These components are tabulated
separately.
For composite
null hypotheses, the score U is a vector, the observed information V
is a matrix, and the test statistic $U^{\top} V^{-1} U$ has
a chi-squared distribution under the null hypothesis.
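For a composite hypothesis the quadratic form in the score vector and the inverse of the observed information matrix can be computed by solving a linear system rather than inverting the matrix. The sketch below uses plain Python with Gaussian elimination; the helper names and the numbers are hypothetical, and it assumes the information matrix is non-singular.

```python
def solve_linear_system(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    a = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("information matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (a[row][n] - tail) / a[row][row]
    return solution


def composite_score_statistic(score, information):
    """Quadratic form U' V^-1 U for score vector U and information matrix V."""
    x = solve_linear_system(information, score)
    return sum(u * xi for u, xi in zip(score, x))


# Invented score vector and information matrix.
U = [2.0, 1.0]
V = [[4.0, 1.0], [1.0, 2.0]]
statistic = composite_score_statistic(U, V)
print(statistic)
```

The statistic would then be referred to a chi-squared distribution with the appropriate degrees of freedom for the hypothesis under test.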
admixturescorefile - test for association
of trait with individual admixture. The null hypothesis is no effect of individual admixture in a
regression model, with covariates as explanatory variables if specified. The test statistic is computed for the
effect of each subpopulation separately, with a summary chi-square test over
all subpopulations if there are more than two subpopulations.
allelicassociationscorefile - tests for allelic
association at each locus. The null hypothesis is no effect of the
alleles or haplotypes in a regression analysis with individual admixture (and
covariates if specified) as explanatory variables. The test statistic is
computed for each allele or haplotype separately, with a summary chi-square
statistic over all alleles or haplotypes at each locus if there are more than
two alleles or haplotypes. Rare alleles or haplotypes are grouped
together.
This test is appropriate when testing for association of the trait with
alleles or haplotypes in a candidate gene.
ancestryassociationscorefile - tests for linkage of each locus with
genes underlying ethnic variation in disease risk or trait values. This
is a test for association of the trait with ancestry at each compound locus,
conditional on parental admixture. The
null hypothesis is no effect of locus ancestry in a regression analysis with
individual admixture (and covariates if specified) as explanatory
variables. The test statistic is computed for the effect of each
subpopulation separately, with a summary chi-square statistic over all
subpopulations at each locus if there are more than two subpopulations.
The proportion of information extracted depends upon the information content
for ancestry of the marker locus and other nearby loci. This test is appropriate when the objective
of the study is to exploit admixture to localize genes underlying ethnic
variation in the trait value, using ancestry-informative markers rather than
candidate gene polymorphisms.
affectedsonlyscorefile - tests for linkage of
each locus with genes underlying the ethnic difference in disease risk, using
only the affected individuals. The null hypothesis is that
the risk ratio between populations that the locus accounts for is 1.
This test statistic is computed for the effect of each subpopulation at each
locus. The test compares at each locus the observed and expected
proportion of gene copies that have ancestry from the high-risk
subpopulation. This is the only test that can be used if the sample consists
only of affected individuals. Even if a control group has been typed,
for a rare disease the affected-only test is more efficient than the test
given in ancestryassociationscorefile based on a regression model. This is because for a rare disease, the
observed and expected proportion of gene copies that have ancestry from the high-risk
subpopulation will not differ by very much in unaffected individuals.
allelefreqscorefile - tests for
mis-specification of ancestry-specific allele frequencies.
This test is computed only if allele frequencies have been specified as fixed
with option allelefreqfile. For each
compound locus and each subpopulation, a score test is computed for the null
hypothesis that the frequencies of all alleles have been specified correctly. A summary test over all k subpopulations is
also computed at each locus.
args.txt
- a list of the options used by the program. This is used by the R script
to identify output
files and other information. This is written
to resultsdir.
An R script (AdmixmapOutput.R) is supplied that processes these output files to produce tables of posterior quantiles, frequency plots of the posterior distribution, and plots of the cumulative posterior means for the variables that are output to paramfile. The R script also calculates a summary slope parameter for the effect of admixture from each subpopulation, versus the others. This R script is run automatically from the Perl script (admixmap.pl) that is supplied as a wrapper for the program.
Interpretation of output from the program
These notes are based on
the output produced by using the Perl script admixmap.pl to run the main
program. Output files produced by the
main program are processed by the R script AdmixmapOutput.R. This produces several text files, and a
file plots.ps containing graphs in postscript format.
The adequacy of the
burn-in period can be evaluated by the Geweke diagnostics in the R output. If the burn-in period is
adequate, the numbers in this table should have approximately a standard
normal distribution.
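The logic of a Geweke-type diagnostic can be illustrated with a simplified z-score that compares the mean of an early segment of the chain with the mean of a late segment. This sketch is an assumption-laden simplification: it uses naive sample variances, whereas the standard Geweke diagnostic uses spectral density estimates that allow for autocorrelation, and the segment fractions (first 10%, last 50%) are conventional defaults rather than values taken from the R script.

```python
import math
import statistics


def simplified_geweke_z(draws, first=0.1, last=0.5):
    """z-score comparing the mean of the first and last parts of a chain.

    Simplification: variances are naive sample variances, which ignore
    autocorrelation between successive draws.
    """
    n = len(draws)
    early = draws[: int(first * n)]
    late = draws[n - int(last * n):]
    if len(early) < 2 or len(late) < 2:
        raise ValueError("chain too short for the requested segments")
    variance_of_difference = (
        statistics.variance(early) / len(early)
        + statistics.variance(late) / len(late)
    )
    if variance_of_difference == 0:
        raise ValueError("segments have zero variance")
    return (statistics.mean(early) - statistics.mean(late)) / math.sqrt(
        variance_of_difference
    )


# A stationary-looking chain and a chain that is still drifting (both invented).
stationary_z = simplified_geweke_z([0.0, 1.0] * 50)
drifting_z = simplified_geweke_z([float(i) for i in range(100)])
print(stationary_z, drifting_z)
```

A value far outside the range expected for a standard normal variate, as for the drifting chain here, suggests that the burn-in was too short.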
The mixing of the MCMC sampler can be evaluated by examining the autocorrelation plots. Autocorrelation extending beyond 20 iterations (2 thinned draws if every = 10) indicates slow mixing.
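The quantity shown in an autocorrelation plot can be sketched as the sample autocorrelation of the thinned draws at a given lag. The function name and the example chain are hypothetical; the R script may compute and plot this differently.

```python
def autocorrelation(draws, lag):
    """Sample autocorrelation of `draws` at the given non-negative lag."""
    n = len(draws)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < number of draws")
    mean = sum(draws) / n
    denominator = sum((x - mean) ** 2 for x in draws)
    if denominator == 0:
        raise ValueError("draws have zero variance")
    numerator = sum(
        (draws[i] - mean) * (draws[i + lag] - mean) for i in range(n - lag)
    )
    return numerator / denominator


# Invented chain that alternates between two values.
chain = [1.0, -1.0] * 10
lag0 = autocorrelation(chain, 0)
lag1 = autocorrelation(chain, 1)
print(lag0, lag1)
```

With every = 10, lag 2 in the thinned draws corresponds to 20 iterations of the sampler.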
Acceptance
rates for the Metropolis-Hastings samplers used by the program are printed to
screen and logfile.
The adequacy of
the total number of iterations can be evaluated by examining a plot of the
statistic of interest calculated from all iterations since the end of the
burn-in period, against the iteration number. Where inference is based on the mean of a
parameter, this statistic is an ergodic (cumulative) average over all iterations to that point.
Plots of ergodic averages of the population-level parameters are given
in file ErgodicAveragePlots.ps.
The file stratificationtestfile
contains results of a diagnostic test for residual population stratification
that is not explained by the fitted model.
For details of how this test is calculated, and a discussion of how to
interpret it, see Hoggart (2003). The
test is based on testing for allelic association between unlinked loci that
is not explained by the model. The result is a "Bayesian p-value": p < 0.5 indicates lack of fit. The "Bayesian p-value" calculated
by this test is more conservative than a classical p-value. Our experience has been that a test
p-value of 0.3 or less is fairly strong evidence for residual
stratification. Where this statistic
yields evidence of lack of fit, the model should be specified with more
subpopulations, unless there is some other reason for lack of fit such as
mis-specified allele frequencies.
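In general terms, a posterior predictive ("Bayesian") p-value is the proportion of posterior draws in which a test statistic computed on data replicated from the model exceeds the same statistic computed on the observed data. The sketch below shows only this generic calculation, with invented numbers; the statistic actually used by the stratification test, and the direction of the comparison, are described in Hoggart (2003) and are not reproduced here.

```python
def bayesian_p_value(observed_statistics, replicated_statistics):
    """Proportion of draws in which the replicated statistic exceeds the observed one."""
    if len(observed_statistics) != len(replicated_statistics):
        raise ValueError("one observed and one replicated value are needed per draw")
    if not observed_statistics:
        raise ValueError("no draws supplied")
    exceed = sum(
        1
        for observed, replicated in zip(observed_statistics, replicated_statistics)
        if replicated > observed
    )
    return exceed / len(observed_statistics)


# Invented per-draw statistics for illustration.
p_fit = bayesian_p_value([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0])
p_misfit = bayesian_p_value([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 9.0, 3.0])
print(p_fit, p_misfit)
```

A value near 0.5 means that the observed data look like typical data generated by the fitted model.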
The file dispersiontestfile
contains results of a diagnostic test for variation between the allele
frequencies in the unadmixed populations that have been sampled to calculate
the prior parameter values in priorallelefreqfile and the corresponding
ancestry-specific allele frequencies in the admixed population under
study. Again the results are
"Bayesian p-values", for which the deviation of the test p-value
from its expected value of 0.5 does not provide an absolute measure of the
strength of evidence for lack of fit.
For each subpopulation, the test statistic is calculated as a summary
test over all loci and for each locus separately. Examination of the test statistic for each
locus may reveal errors in coding, or errors in specifying the prior allele
frequencies.
The option dispersiontestfile is valid only where option priorallelefreqfile has been specified. Where allele frequencies have been specified as fixed, option allelefreqscorefile should be specified and the output file should be examined.
No diagnostic test for lack of fit of the distribution of individual admixture proportions to the model is yet implemented. However, the plots in file Plots.ps can be examined to compare the estimated distribution of individual admixture proportions (based on the posterior means for individual admixture) with an estimate for the distribution of individual admixture values in the population (based on the posterior means for the Dirichlet parameters of this distribution).
The deviance and Deviance Information Criterion (DIC) are computed each time.
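For reference, the usual definition of DIC is the posterior mean deviance plus an effective number of parameters, where the latter is the posterior mean deviance minus the deviance evaluated at the posterior means of the parameters. The sketch below assumes that definition and uses invented numbers; it is not taken from the program's source.

```python
def deviance_information_criterion(deviance_draws, deviance_at_posterior_mean):
    """Return (mean deviance, effective number of parameters pD, DIC)."""
    if not deviance_draws:
        raise ValueError("no deviance draws supplied")
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    effective_parameters = mean_deviance - deviance_at_posterior_mean
    dic = mean_deviance + effective_parameters
    return mean_deviance, effective_parameters, dic


# Invented deviance values for illustration.
mean_deviance, p_d, dic = deviance_information_criterion(
    [100.0, 102.0, 104.0], deviance_at_posterior_mean=99.0
)
print(mean_deviance, p_d, dic)
```

When comparing models fitted to the same data, a lower DIC indicates a better trade-off between fit and complexity.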
For an analysis of a single individual, with option chib, the log marginal likelihood, also known as the log evidence, is computed.
With
option thermo=1, the marginal likelihood is approximated for any model.
The greater the value of numannealedruns, the more accurate will be the
approximation, but the longer the program will take to run.
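The trade-off between accuracy and run time can be understood from the thermodynamic integration identity: the log marginal likelihood equals the integral, over a "coolness" value running from 0 to 1, of the expected log-likelihood under the posterior tempered by that coolness. The sketch below assumes that each annealed run supplies one such expectation and that the integral is approximated with the trapezoid rule; both the names and the integration rule are assumptions, not a description of the program's internals.

```python
def log_marginal_likelihood_trapezoid(coolnesses, mean_log_likelihoods):
    """Trapezoid-rule estimate of the integral of mean log-likelihood over coolness."""
    if len(coolnesses) != len(mean_log_likelihoods):
        raise ValueError("one mean log-likelihood is needed per coolness value")
    if len(coolnesses) < 2:
        raise ValueError("at least two coolness values are required")
    pairs = sorted(zip(coolnesses, mean_log_likelihoods))
    total = 0.0
    for (t0, e0), (t1, e1) in zip(pairs, pairs[1:]):
        total += (t1 - t0) * (e0 + e1) / 2.0
    return total


# Invented expectations at three coolness values between 0 and 1.
estimate = log_marginal_likelihood_trapezoid([0.0, 0.5, 1.0], [-10.0, -8.0, -6.0])
print(estimate)
```

A finer grid of coolness values gives a better approximation to the integral, at the cost of one additional run per grid point.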