ADMIXMAP: a program to model admixture using marker genotype data
Introduction
ADMIXMAP is a general-purpose program for modelling admixture, using marker genotypes and trait data on a sample of individuals from an admixed population (such as African-Americans), where the markers have been chosen to have extreme differentials in allele frequencies between two or more of the ancestral populations between which admixture has occurred. The main difference between ADMIXMAP and classical programs for estimation of admixture such as ADMIX is that ADMIXMAP is based on a multilevel model for the distribution of individual admixture in the population and the stochastic variation of ancestry on hybrid chromosomes. This makes it possible to model the associations of ancestry between linked marker loci, and the association of a trait with individual admixture or with ancestry at a linked marker locus.
Possible uses of the ADMIXMAP program
Modelling the distribution of individual admixture values and the history of admixture (inferred by modelling the stochastic variation of ancestry along chromosomes).
Case-control, cross-sectional or cohort studies that test for a relationship between disease risk and individual admixture
Localizing genes underlying ethnic differences in disease risk by admixture mapping
Controlling for population structure (variation in individual admixture) in genetic association studies so as to eliminate associations with unlinked genes
Reconstructing the genetic structure of an ancestral population where unadmixed modern descendants are not available for study
ADMIXMAP can model admixture between more than two populations, and can use data from multi-allelic or biallelic marker polymorphisms. The program has been developed for application to admixed human populations, but can also be used to model admixture in livestock or for fine mapping of quantitative trait loci in outbred stocks of mice.
A manual for the program is available which describes the statistical model in more detail. Downloads of the program compiled for various platforms are also available. We recommend that before trying to run the program, you consult us first about your requirements.
ADMIXMAP is designed to analyse datasets that consist of trait measurements and genotype data on a sample of individuals from an admixed or stratified population. Although the name of the program reflects its origins as a program designed for admixture mapping, it has wider uses, especially in genetic association studies. The study design can be a cross-sectional survey of a quantitative trait or binary outcome, a case-control study or a cohort study. For admixture to be modelled efficiently, at least some of the loci typed should be "ancestry-informative markers": markers chosen to have large allele frequency differentials between the ancestral subpopulations that underwent admixture. The program can deal with any number of ancestral subpopulations and any number of linked marker loci. In its present version, the program handles only data from samples of unrelated individuals.
The program is written in C++, and is freely available with source code under the GPL. Offers to help with development of the program are welcome. The current version runs only on a single processor, and computation time is a serious limitation on large datasets. We are developing a parallel version that will be able to run on a computing cluster, using the MPICH implementation of the MPI message-passing standard.
The program is based on a hybrid of Bayesian and classical approaches. A Bayesian full probability model is specified, assigning vague prior distributions to parameters for the distribution of admixture in the population and the stochastic variation of ancestry along hybrid chromosomes. The posterior distribution of all unobserved variables, given the observed genotype and trait data, is generated by Markov chain Monte Carlo simulation.
McKeigue, P.M., Carpenter, J., Parra, E.J. and Shriver, M.D. Estimation of admixture and detection of linkage in admixed populations by a Bayesian approach: application to African-American populations. Annals of Human Genetics. 2000; 64:171-186.
Hoggart, C.J., Parra, E.J., Shriver, M.D., Bonilla, C., Kittles, R.A., Clayton, D.G. and McKeigue, P.M. Control of confounding of genetic associations in stratified populations. Am J Hum Genet. 2003; 72:1492-1504.
Hoggart, C.J., Shriver, M.D., Kittles, R.A., Clayton, D.G. and McKeigue, P.M. Design and analysis of admixture mapping studies. Am J Hum Genet. 2004; 74:965-978.
McKeigue, P.M. Prospects for admixture mapping of complex traits. Am J Hum Genet. 2005; in press.
1. Modelling the dependence of a disease or quantitative trait upon individual admixture
For a binary trait, such as presence of disease, the program fits a logistic regression model of the trait upon individual admixture, defined as the mean of the admixture proportions of both parents. For a continuous trait, such as skin pigmentation, the program fits a linear regression model of the trait value on individual admixture. Covariates such as age, sex and socioeconomic status can be included in the regression model. The program output includes posterior means and 95% credible intervals for the regression coefficients. Alternatively, the program can be used to test a null hypothesis of no association of disease risk or trait level with individual admixture, as described below.
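As an illustration of the regression described above, the following Python sketch computes individual admixture as the mean of the two parental admixture vectors and evaluates a logistic model for a binary trait. The function names, the choice of the first subpopulation as the baseline, and the coefficient values are assumptions made for this example; they are not taken from the ADMIXMAP source code.

```python
import math

def individual_admixture(mother, father):
    """Individual admixture: mean of the two parental admixture vectors."""
    return [(m + f) / 2.0 for m, f in zip(mother, father)]

def disease_probability(admixture, intercept, admixture_coefs,
                        covariates=(), covariate_coefs=()):
    """Logistic model for a binary trait.

    Assumption for this sketch: the first subpopulation is the baseline, so
    its proportion is dropped (the proportions sum to 1) and there is one
    coefficient for each remaining subpopulation.
    """
    eta = intercept
    eta += sum(b * a for b, a in zip(admixture_coefs, admixture[1:]))
    eta += sum(g * x for g, x in zip(covariate_coefs, covariates))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical two-way admixture: proportions are (population 1, population 2).
theta = individual_admixture([0.75, 0.25], [0.25, 0.75])
risk = disease_probability(theta, intercept=-2.0, admixture_coefs=[1.5],
                           covariates=[1.0], covariate_coefs=[0.3])
print(theta, risk)
```

For a continuous trait the same linear predictor would be used directly as the expected trait value, without the logistic transformation.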
2. Controlling for confounding of genetic associations in stratified populations
For more details of this application, see Hoggart et
al (2003). The program calculates a score test for association of the disease
or trait with alleles or haplotypes at each locus, adjusting for individual
admixture and other covariates in a regression model. Where there is evidence
for association of a trait with individual admixture, the posterior
distribution of the regression coefficient can be estimated in a further
analysis. For this application, the dataset should include at least 30
markers informative for ancestry.
3. Admixture mapping: localizing genes that underlie ethnic differences in disease risk
For more details of this application, see Hoggart et al
(2004). Where
differences in disease risk have a genetic basis, testing for association of
the disease with locus ancestry by conditioning on parental admixture can
localize genes underlying these differences. This approach is an extension of
the principles underlying linkage analysis of an experimental cross. To
exploit the full power of admixture mapping, 1000 or more markers informative
for ancestry across the genome are required.
4. Detecting population stratification and identifying admixed individuals
Where no information about the demographic background of
the population under study is available, ADMIXMAP can be used to test for
population stratification, to determine how many subpopulations are required
to model this stratification, and to identify admixed individuals. This
is useful when assembling panels of unadmixed individuals to be used for
estimating allele frequencies. We emphasize that when the program is
run without supplying prior information about allele frequencies in each
subpopulation, the subpopulations are not identifiable in the model. Thus inference should be based only on the
posterior distribution of variables that are unaffected by permuting the
labels of the subpopulations.
5. Testing for associations of a trait with haplotypes and estimating haplotype frequencies from a sample of unrelated individuals
Where two or more loci in the same gene have been typed,
ADMIXMAP will model the unobserved haplotypes, conditional on the observed
unordered genotypes. Score tests for association of haplotypes
with the trait can be obtained, and samples from the posterior distribution
of haplotype frequencies can be obtained.
This application of the program is not limited to admixed or stratified populations: for a population that is not stratified, the user can simply specify the option populations=1.
For each of these applications, score tests of the appropriate null hypotheses are built into the program.
Modelling admixture and trait values:
Any run of two or more loci for which the distance between loci is specified as zero is modelled as a single "compound locus". Thus if L "simple loci" (SNPs, insertion/deletion polymorphisms or microsatellites) have been typed, and three of these simple loci are in the same gene, the model will have L - 2 compound loci. The program assumes that on any gamete, the ancestry state is the same at all loci within a compound locus. The program allows for allelic association within any compound locus that contains two or more simple loci, and models the unobserved haplotypes at this compound locus.
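The grouping rule above can be sketched in a few lines of Python. The input encoding (a list in which each entry is the distance from the previous locus, with a placeholder for the first locus) is an assumption made for illustration and is not the ADMIXMAP input file format; chromosome boundaries are ignored.

```python
def group_compound_loci(distances):
    """Group simple loci into compound loci.

    distances[i] is the distance from locus i-1 to locus i; distances[0] is
    ignored because the first locus has no predecessor. A distance of exactly
    zero places locus i in the same compound locus as locus i-1.
    Returns a list of lists of simple-locus indices.
    """
    groups = []
    for i, d in enumerate(distances):
        if i > 0 and d == 0:
            groups[-1].append(i)   # extend the current compound locus
        else:
            groups.append([i])     # start a new compound locus
    return groups

# Six simple loci; loci 2, 3 and 4 are in the same gene (distance zero).
distances = [None, 0.05, 0.10, 0.0, 0.0, 0.20]
compound = group_compound_loci(distances)
print(compound)       # [[0], [1], [2, 3, 4], [5]]
print(len(compound))  # 4, that is L - 2 with L = 6
```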
For each parent of each individual,
admixture proportions are defined by a vector with K co-ordinates, where K
is the number of ancestral subpopulations that contributed to the admixed
population under study. For instance,
in a
1. The population distribution from which the parental admixture proportions are drawn is modelled by a Dirichlet distribution with parameter vector of length K.
2. The allele or haplotype frequencies at each compound locus are modelled by a Dirichlet distribution, with prior parameters specified by the user.
3. Locus ancestry is modelled by a multinomial distribution, with cell probabilities specified by the admixture of both parents.
4. The probabilities of observing each allele or haplotype at each locus on each gamete, given the ancestry of the gamete at that locus, are modelled by a multinomial distribution, with parameters given by the ancestry-specific allele (or haplotype) frequencies.
5. The stochastic variation of ancestral states along the chromosomes transmitted from each parent is modelled as a mixture of independent Poisson arrival processes with intensities α, β, γ per Morgan (for three-way admixture). For given values of parental admixture, it is only necessary to specify a single parameter ρ for the sum of intensities: ρ = α + β + γ.
If an outcome variable is supplied, ADMIXMAP fits a regression model (logistic regression for a binary trait, linear regression for a quantitative trait) with individual admixture proportions and any covariates supplied by the user as explanatory variables.
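To make items 1, 3 and 5 of the model concrete, here is a minimal simulation sketch in Python using only the standard library. It reads item 5 as follows: arrivals occur along the gamete at a total rate equal to the sum of intensities (called rho in the code), and at each arrival the ancestry state is redrawn with probabilities given by the parental admixture proportions. That reading, the function names, and all parameter values are assumptions made for illustration, not the program's implementation.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw admixture proportions from a Dirichlet distribution (item 1)."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_gamete_ancestry(theta, positions, rho, rng):
    """Simulate ancestry states at ordered map positions (in Morgans).

    theta: admixture proportions of the transmitting parent.
    rho:   sum of arrival intensities per Morgan.
    """
    pops = range(len(theta))
    # Ancestry at the first locus is drawn from theta (item 3).
    states = [rng.choices(pops, weights=theta)[0]]
    for prev, cur in zip(positions, positions[1:]):
        # Probability of no arrival between two adjacent loci (item 5).
        p_no_arrival = math.exp(-rho * (cur - prev))
        if rng.random() < p_no_arrival:
            states.append(states[-1])
        else:
            states.append(rng.choices(pops, weights=theta)[0])
    return states

rng = random.Random(1)
theta = sample_dirichlet([1.0, 1.0, 1.0], rng)   # three-way admixture
positions = [0.0, 0.01, 0.02, 0.5, 0.51]         # hypothetical map positions
states = sample_gamete_ancestry(theta, positions, rho=6.0, rng=rng)
print(theta, states)
```

With rho set to zero no arrivals occur, so the whole gamete carries a single ancestry state; larger values of rho break the gamete into shorter ancestry blocks.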
Modelling allele/haplotype frequencies:
The program can be run with ancestry-specific allele / haplotype frequencies specified either as fixed or as random variables. If one of the two options allelefreqfile or fixedallelefreqs is specified, the allele frequencies are specified as fixed at the values supplied. If option populations is specified, the allele frequencies are specified as random variables with reference (uninformative) prior distributions. If option priorallelefreqfile is specified, the allele frequencies are specified as random variables with a prior distribution given by the values in this file. This option is used where allele frequencies have been estimated from samples of unadmixed modern descendants of the ancestral subpopulations that contributed to the admixed population under study. For instance, in a study of a population of mixed European and west African ancestry, allele frequencies at some or all of the loci typed may have been estimated in samples from modern unadmixed west African and European populations. The program will use this information to estimate the ancestry-specific allele frequencies from the unadmixed and admixed population samples simultaneously, allowing for sampling error.
If no information about allele frequencies in the ancestral subpopulations is provided, the ancestry-specific allele frequencies are estimated only from the admixed population under study. If no information about allele frequencies is provided at any locus, the subpopulations are not identifiable in the model. This does not matter when the program is used only to control for confounding by hidden population stratification, as described in Hoggart et al. (2003).
The file priorallelefreqfile specifies the parameters of a Dirichlet prior distribution for the allele frequencies at each locus in each subpopulation. Where the alleles or haplotypes have been counted directly in samples from unadmixed modern descendants, these parameter values should be specified by adding 0.5 to the observed counts of each allele or haplotype in each subpopulation. These parameter values specify the Dirichlet posterior distribution that we would obtain by combining a reference prior with the observed counts. Using this as a prior distribution when analysing data from the admixed population is equivalent to estimating the allele frequencies simultaneously from the admixed and unadmixed population samples, with a reference prior.
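The rule of adding 0.5 to the observed counts can be illustrated with a short Python sketch. The allele counts and population labels below are invented for the example, and the layout of the priorallelefreqfile itself is not shown here.

```python
def dirichlet_prior_from_counts(counts, reference=0.5):
    """Dirichlet parameters obtained by adding the reference prior (0.5 per
    allele or haplotype) to counts observed in an unadmixed sample."""
    return [c + reference for c in counts]

# Hypothetical counts of two alleles at one locus in two unadmixed samples.
counts = {"European": [38, 2], "West African": [5, 35]}
prior = {pop: dirichlet_prior_from_counts(c) for pop, c in counts.items()}
print(prior)  # {'European': [38.5, 2.5], 'West African': [5.5, 35.5]}

# The prior mean of each allele frequency is its parameter divided by the sum.
for pop, params in prior.items():
    total = sum(params)
    print(pop, [p / total for p in params])
```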
For compound loci, where haplotype frequencies have
been estimated from unordered genotypes rather than by counting phase-known
gametes, the user cannot specify the prior distribution simply by adding 0.5
to the observed counts of each haplotype in a sample of unadmixed modern
descendants, because the counts have not been observed. Instead, we can
compute the posterior distribution of haplotype frequencies in the unadmixed
population, given a reference prior and the observed unordered genotype data. In accordance with the principles of
Bayesian inference, we can then use this posterior distribution to specify the
prior distribution when modelling data from the admixed population. To simplify the computation, we generate a
large sample from the posterior distribution of haplotype frequencies in the
unadmixed population and calculate the parameters of the Dirichlet
distribution that most closely approximates this posterior distribution. The parameters of this Dirichlet
distribution are then entered in file priorallelefreqfile.
This can be implemented by running ADMIXMAP with genotype data from each unadmixed population sample, specifying options populations=1 and allelefreqoutputfile to sample the posterior distribution of the haplotype frequencies. The parameters of the Dirichlet distribution that most closely approximates this posterior distribution can then be computed, and substituted into the file priorallelefreqfile as input to the program when modelling data from the admixed population. This is straightforward: for each locus and each subpopulation, the posterior expectations and the posterior covariance matrix of the allele frequencies are evaluated using the samples from the posterior distribution in allelefreqoutputfile. The Dirichlet parameters α_i that approximate the posterior distribution are then computed by equating the posterior expectations of the allele frequencies to the ratios α_i / Σ_j α_j, and equating the determinant of the posterior covariance matrix to the determinant of the covariance matrix of the Dirichlet distribution. If ADMIXMAP is run with the option allelefreqoutputfile and the R script AdmixmapOutput.R is run to process the output files, the Dirichlet parameters will be computed and written to a file in the correct format for use in subsequent analyses as priorallelefreqfile.
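The moment-matching step can be sketched in Python as follows. This is not the code in AdmixmapOutput.R. It assumes that the determinant is taken over the first K - 1 frequencies, because the full covariance matrix is singular when the frequencies sum to 1. For a Dirichlet distribution with means p_1..p_K and parameter sum S, that determinant equals (p_1 × ... × p_K) / (S + 1)^(K-1), which can be solved for S. All function names are hypothetical, and the simulated draws stand in for the contents of allelefreqoutputfile.

```python
import random

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= factor * a[c][j]
    return det

def dirichlet_from_moments(mean, cov):
    """Match Dirichlet parameters to a mean vector and covariance matrix.

    Requires at least two alleles. Uses the covariance of the first K - 1
    frequencies only (an assumption of this sketch).
    """
    k = len(mean)
    reduced = [row[:k - 1] for row in cov[:k - 1]]
    prod = 1.0
    for p in mean:
        prod *= p
    total = (prod / determinant(reduced)) ** (1.0 / (k - 1)) - 1.0
    return [total * p for p in mean]

def sample_moments(samples):
    """Sample mean vector and covariance matrix of a list of draws."""
    n, k = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(k)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(k)] for i in range(k)]
    return mean, cov

def draw_dirichlet(alpha, rng):
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    return [x / sum(g) for x in g]

# Simulated stand-in for posterior draws of frequencies at one locus.
rng = random.Random(0)
samples = [draw_dirichlet([8.0, 2.0, 10.0], rng) for _ in range(5000)]
mean, cov = sample_moments(samples)
print(dirichlet_from_moments(mean, cov))  # should be close to [8, 2, 10]
```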
With options populations, allelefreqfile or priorallelefreqfile, the program fits a model in which the allele frequencies in modern unadmixed descendants of the ancestral subpopulations are identical to the corresponding ancestry-specific allele frequencies in the admixed population under study. The option dispersiontestfile will generate a diagnostic test of this assumption.
With option historicallelefreqfile, the program fits a more general model in which there is dispersion of allele frequencies between the unadmixed and admixed populations.
With option correlatedallelefreqs a correlated allele frequency model is fitted, in which the allele frequency prior parameters are the same across subpopulations and specified as vectors of proportions and a sum, common to all loci.
There are various approaches to statistical inference and hypothesis testing using the ADMIXMAP program:
(1) A model based on the null hypothesis can be fitted, and this null hypothesis can be tested against alternatives with a score test computed by averaging over the posterior distribution of the missing data. For a description of the theory underlying this approach, see Hoggart et al. (2003). Several score tests are built into the program, and are described below. Additional score tests can be constructed by the user.
(2) The effect under study can be included in the regression
model, so that the posterior distribution of the effect parameter is
estimated. In large samples, the
posterior mean and 95% credible interval for the effect parameter are
asymptotically equivalent to the maximum likelihood estimate and 95%
confidence interval that would be obtained by classical methods.
(3) The log-likelihood function for the effect of a parameter can be computed (not yet implemented).
(4) The marginal likelihood of the model can be evaluated using Chib's algorithm or thermodynamic integration. Chib's algorithm is implemented only for a single individual (the first listed in the genotypes file) and for a model with no outcome variable, but thermodynamic integration is implemented for all models.
In addition to these methods for formal inference, model diagnostics based on the posterior predictive check probability are provided to detect population stratification not accounted for in the model, lack of fit of the allele frequencies to the specified model, and departure from Hardy-Weinberg equilibrium.
Comparison with other programs for modelling admixture:
The program STRUCTURE (available from http://pritch.bsd.uchicago.edu/) fits a similar hierarchical model for population admixture, given genotype data on admixed and unadmixed individuals, if you specify the "popalphas" option (see documentation for this program at http://pritch.bsd.uchicago.edu/software/readme_structure2.pdf).
The main differences between ADMIXMAP and STRUCTURE are:-
The program ANCESTRYMAP (available from http://genepath.med.harvard.edu/~reich/ ) is similar to ADMIXMAP but is restricted to diallelic simple loci and two populations.
Tools for converting data from ANCESTRYMAP format to ADMIXMAP format and vice-versa as well as from STRUCTURE format to ADMIXMAP format are available here.