ADMIXMAP

a program to model admixture using marker genotype data

Tutorial on using ADMIXMAP to model genotype and phenotype data from an admixed or stratified population

If you have problems getting the program to run, email david.odonnell (please supply any error messages and logfiles if available)

To feed back comments on this tutorial, email paul.mckeigue

Append @ucd.ie to the email addresses given above.

Getting started

This tutorial is based on data from a sample of 446 Hispanic-Americans resident in Colorado, typed at 32 loci. ADMIXMAP is designed to analyse larger datasets with more markers, but the analysis of this small dataset illustrates all the methods. The following data files have been prepared for input to ADMIXMAP. Before starting, open these files to view them. They are most easily viewed with a program such as Excel that will interpret tabs as column separators.

Filename	Contents
outcomevars.txt	column 1: diabetes, coded as 0=unaffected, 1=affected. column 2: skin reflectance, scored as a quantitative trait
covariates2std.txt	age and sex, standardized about their sample means.
covariates3std.txt	age, sex and income group, standardized about their sample means.
genotypes.txt	genotypes at 32 SNP loci. These include 2-4 SNPs in each of three candidate genes for diabetes (CAPN10, PPARG, and SUR1), and one SNP that is in a candidate gene for skin pigmentation (TYR). The first column is the ID number. The names of the other columns (given in the header row) must match the first column of the locusfile (loci.txt). For each genotype, the two alleles are separated by a comma. The alleles must be numbered 1, 2, ... N, and this numbering must correspond to the sequence of rows in priorallelefreqfile.
loci.txt	locus description file, with locus name, number of alleles, and map distance from last locus. This file was generated from the file LociChr.txt, which gives the chromosome number and estimated genetic map position (in cM) for each locus. Map distances are given in morgans. If the locus is not linked to the last locus, the distance from last locus is coded as 100. If the locus is very close to the last locus (< 100 kb), the distance from last locus is coded as 0. Thus, for instance, the four SNPs (simple loci) in the CAPN10 gene are modelled as a single compound locus, with 16 possible haplotypes. Thus the 32 SNPs ("simple loci") will be grouped into 24 compound loci.
priorallelefreqs.txt	parameters for Dirichlet prior distributions of ancestry-specific allele frequencies (European, Native American, west African) at each compound locus. This file has one row for each possible haplotype at each compound locus. If the compound locus contains only one SNP, the number of possible haplotypes is of course just 2. Haplotypes are ordered by incrementing a counter from the right: for instance the 16 possible haplotypes at a compound locus consisting of four SNPs are ordered 1-1-1-1, 1-1-1-2, 1-1-2-1, ..., 2-2-2-2. Where the compound locus contains only one simple locus, the prior parameters are calculated simply by adding 0.5 to the observed allele counts in samples of unadmixed individuals. Where no data from unadmixed individuals are available, the prior parameters are specified simply as 0.5 (a "reference" prior). Where data from unadmixed individuals are available and the compound locus contains two or more simple loci, the parameters for the prior on haplotype frequencies are obtained by using ADMIXMAP with a single population model to generate the posterior distribution of haplotype frequencies from data on unadmixed individuals. For an example of how to do this, run the script HapFreqs.pl in the folder HapFreqs. This will generate the tables of parameter values that were used to specify the prior parameters for the compound loci DRD2, CAPN10, PPARG, and SUR1. These runs will take only a few seconds.
etapriors.txt	Parameters for gamma prior distribution on allele frequency dispersion parameters. See below for explanation.
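
The conventions described in the table above (compound loci formed from zero distances, haplotype ordering, and count-based prior parameters) can be sketched in Python. This is only an illustration, not part of ADMIXMAP: the function names and the locus names in the usage note are hypothetical.

```python
from itertools import product

def group_compound_loci(loci):
    """Group (name, distance) pairs: a distance of 0 joins the previous locus."""
    groups = []
    for name, distance in loci:
        if groups and distance == 0:
            groups[-1].append(name)   # same compound locus as the last one
        else:
            groups.append([name])     # start a new compound locus
    return groups

def haplotype_order(num_snps, alleles=(1, 2)):
    """List haplotypes by incrementing a counter from the right."""
    return ["-".join(map(str, h)) for h in product(alleles, repeat=num_snps)]

def prior_from_counts(allele_counts):
    """Dirichlet parameters: observed counts plus 0.5 (all 0.5 if counts are zero)."""
    return [count + 0.5 for count in allele_counts]
```

For example, haplotype_order(4) yields the 16 haplotypes in the order 1-1-1-1, 1-1-1-2, 1-1-2-1, ..., 2-2-2-2, and prior_from_counts([0, 0]) gives the reference prior [0.5, 0.5].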

These notes assume that you have installed Perl (ActivePerl is the Windows version), the R statistical package, and a viewer for postscript files (such as Ghostview).

First, download and install ADMIXMAP. The Perl script tutorial.pl has been provided for this tutorial. Edit the Perl script where indicated to specify the location of the ADMIXMAP executable, the R executable (Rcmd.exe on a Windows platform) and the R script (AdmixmapOutput.R) that processes the output from the ADMIXMAP executable.

You are now ready to start the Perl script from your working directory. To do this, open the console shortcut and type "perl tutorial.pl". This script will run the program six times with different models, calling the main ADMIXMAP program each time with appropriate command-line options. The command-line options are stored by Perl in an array or "hash". The Perl script provides a convenient means of running several analyses with different options in batch mode. After each run of the ADMIXMAP program, the Perl script will run the R script AdmixmapOutput.R to analyse the output files, and move all output files to a folder (subdirectory) named for the type of analysis that has been run. Five new subdirectories will be created, each containing files output by ADMIXMAP and the R script. On an ordinary PC, these analyses will take about 20 minutes. You can inspect the results of each analysis, as described below, to determine if a longer run (with more samples from the posterior distribution) is needed. The results quoted in the tutorial below are from a long run, so may not correspond exactly to those obtained with a shorter run.

We now examine the results of each ADMIXMAP analysis.

SinglePopResults

This folder contains results of analysis with command-line option populations=1. This analysis does not exploit ADMIXMAP's ability to model population structure, but does exploit its ability to fit regression models and to model association with haplotypes given unphased genotype data. For this purpose, ADMIXMAP is an alternative to HAPLOSCORE, a program which uses a similar score test for association with haplotypes in a regression model. Other uses of the single population model are to test for population stratification, and to estimate haplotype frequencies in samples from unadmixed populations. These haplotype frequency estimates can be used to specify priors for the analysis of data from admixed populations.

To specify that the second column of outcomevarfile contains the outcome variable of interest (skin reflectance), we specify the option targetindicator=1 (i.e. offset by 1 from column 0). The program automatically determines that this is a continuous variable. No information about allele frequencies or haplotype frequencies is supplied, so the program generates the posterior distribution of haplotype frequencies from the data, given a reference prior. The program fits a linear regression model with skin reflectance as outcome variable, and age, sex and income as explanatory variables. As the program does not have to model population structure, it requires only a short run: a total of 1100 iterations, including a burn-in of 100, is ample for all test statistics to be computed.

The file RegressionParamConvergenceDiagnostics.txt contains results of a simple test for the adequacy of the burn-in period, attributed to Geweke (1992). If burn-in is adequate and the sampling run after burn-in is long enough, these test statistics should have a standard normal distribution. Extreme values of the test statistics (indicated by small p-values) imply that a longer burn-in, and a longer sampling run after the burn-in, should be used. For this simple model, there is no evidence of lack of convergence.
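
As a rough illustration of the idea behind this diagnostic, the sketch below compares the mean of the first 10% of a chain with the mean of the last 50%. It is a simplified version and not ADMIXMAP's implementation: Geweke's test estimates the standard errors from spectral densities to allow for autocorrelation, whereas this sketch uses plain sample variances, so it will overstate significance for strongly autocorrelated chains. The function names are hypothetical.

```python
import math
from statistics import mean, variance

def geweke_z(chain):
    """Z-score comparing the first 10% and the last 50% of a chain."""
    n = len(chain)
    early = chain[: n // 10]
    late = chain[n - n // 2:]
    # Naive standard error (assumes independent draws within each segment).
    se = math.sqrt(variance(early) / len(early) + variance(late) / len(late))
    return (mean(early) - mean(late)) / se

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))
```

A chain that is still drifting, such as list(range(1000)), gives a very large absolute z-score, whereas a chain whose early and late segments have the same mean gives a z-score near zero.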

Now examine the log file. This contains the result of a test for residual population stratification, based on testing for allelic association between unlinked loci. See the main program documentation for more details of how this test statistic is calculated. Only 14 loci are used in this calculation, as only unlinked loci can be included. The result is reported as a posterior predictive check probability or "Bayesian p-value". This is the frequency with which, over the posterior distribution of model parameters, a simulated dataset gives results more extreme than the observed dataset. We can interpret the "p-value" of 0.04 as evidence for stratification: there are more associations between unlinked loci than expected by chance. In general, posterior predictive check probabilities are more conservative than classical p-values: a p-value of 0.05 is fairly strong evidence of lack of fit.
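
The definition of the posterior predictive check probability given above can be written as a short Python sketch. It assumes that, for each posterior draw, a discrepancy statistic has already been computed for both the observed dataset and a dataset simulated from that draw. The function name and inputs are hypothetical, and the sketch does not show how ADMIXMAP computes the statistic itself.

```python
def posterior_predictive_p(observed_stats, replicate_stats):
    """Fraction of posterior draws in which the simulated dataset's statistic
    exceeds the observed dataset's statistic (one pair per posterior draw)."""
    pairs = list(zip(observed_stats, replicate_stats))
    return sum(replicate > observed for observed, replicate in pairs) / len(pairs)
```

With this convention, a value close to 0 means that simulated datasets rarely look as extreme as the observed data, which indicates lack of fit.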

The file args.txt contains one line for each of the options specified on the command-line. This includes default values for some of the command-line options that were not specified in the Perl script. An alternative way to invoke the program with the options specified in this file is simply to type "<path to admixmap executable>admixmap <path to args.txt>args.txt".
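
A minimal Python sketch of preparing such an options file is shown below. The option names are ones used in this tutorial, but the name=value line format is an assumption made for illustration: inspect an args.txt written by the program to confirm the exact syntax. The function names are hypothetical, and the sketch only builds the command without running it.

```python
def format_args(options):
    """Render options as one line per option (name=value is an assumed format)."""
    return "".join(f"{name}={value}\n" for name, value in options.items())

def build_command(admixmap_path, args_path):
    """Command as described above: the executable followed by the options file."""
    return [admixmap_path, args_path]

# Options named in this tutorial, with the tutorial's data files as values.
single_pop_options = {
    "populations": 1,
    "targetindicator": 1,
    "outcomevarfile": "outcomevars.txt",
    "covariatesfile": "covariates3std.txt",
}

args_text = format_args(single_pop_options)
command = build_command("admixmap", "args.txt")
```

Keeping the options in a dictionary mirrors the Perl hash used by tutorial.pl and makes it easy to vary one option between runs.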

The file HardyWeinbergTest.txt contains results of tests for Hardy-Weinberg equilibrium at each of the 32 simple loci. This test is based on averaging over the posterior distribution of allele frequencies, whereas the classic test for Hardy-Weinberg equilibrium conditions on the observed counts of alleles. For a single population model, we expect the score test to give very similar results to the classical test. Positive scores indicate that the proportion of homozygotes is higher than expected given the allele frequencies. Two of the loci - FY and MID52 - show evidence of departure from Hardy-Weinberg equilibrium. For locus FY there are too few copies of allele 2 for the test to be valid. Possible explanations for deviation from Hardy-Weinberg equilibrium are genotyping error, a higher proportion of missing genotypes (failure to call the genotype) in heterozygotes, or population stratification accompanied by non-random mating. If population stratification accounts for the departure from Hardy-Weinberg equilibrium, we expect that there will be no evidence of departure from Hardy-Weinberg equilibrium when the test is repeated with a model of admixture between two or more subpopulations.
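
For comparison, the classical test that conditions on the observed allele counts can be sketched as follows for a locus with two alleles. This is the textbook chi-square test with one degree of freedom, not the score test that ADMIXMAP reports. The function name and the example counts are hypothetical, and the sketch assumes both alleles are observed (otherwise an expected count is zero).

```python
import math

def hwe_chi_square(n11, n12, n22):
    """Classical chi-square test for Hardy-Weinberg equilibrium at a diallelic locus.

    Returns (chi-square statistic, p-value, whether homozygotes are in excess)."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)      # frequency of allele 1
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n11, n12, n22)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square, 1 df
    excess_homozygotes = n12 < expected[1]
    return chi2, p_value, excess_homozygotes
```

For instance, genotype counts of 40, 20 and 40 give a statistic of 36 with an excess of homozygotes, while counts of 25, 50 and 25 fit Hardy-Weinberg proportions exactly.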

The file PosteriorQuantiles.txt contains summary statistics for the posterior distribution of the parameters of the regression model. Ignore the rows for Dirichlet parameter and sumIntensities, which are irrelevant to a single population model. The posterior means and 95% credible intervals for the regression coefficients for age and sex are given. The "precision" parameter is the inverse of the residual variance. All these estimates will be very similar to those that would be obtained with a standard regression program: classical 95% confidence intervals are equivalent to Bayesian 95% credible intervals where (as in this application) the sample size is large and the priors on the regression coefficients are non-informative.

Each simple locus is tested for allelic association with the outcome variable (skin reflectance), and the haplotypes at each compound locus are tested for association with the same outcome variable. These score tests test the null hypothesis β=0, where β is the regression coefficient for the effect of number of copies of the allele (or haplotype) on the outcome variable, in a model with the covariates (age, sex and income in this case). The test statistic is calculated by dividing the score (gradient of the log-likelihood at β=0) by the square root of the observed information (curvature of the log-likelihood at β=0). This statistic has a standard normal distribution under the null hypothesis. Positive score values indicate that the most likely value of β is greater than zero. The proportion of information extracted is a measure of how much information about β we have, in comparison with a dataset in which all variables were observed directly.
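
The arithmetic of the score test described above is simple enough to sketch in Python. The function name is hypothetical, and the score and observed information are assumed to have been computed already (for example, read from one of the output tables).

```python
import math

def score_test(score, observed_information):
    """Standard normal test statistic and two-sided p-value for H0: beta = 0."""
    if observed_information <= 0:
        # The normal approximation is not usable without positive information.
        raise ValueError("observed information must be positive")
    z = score / math.sqrt(observed_information)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, a score of 6 with observed information 9 gives z = 2 and a two-sided p-value of about 0.046.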

First examine the file TestsAllelicAssociationFinal.txt, which contains the results of score tests for association of the outcome variable with allele 1 at each simple locus. For ease of viewing, open this file with a program such as Excel that will interpret the tabs as column separators. The p-values in this table will be practically identical to those that would be obtained by testing for association with number of copies of each allele in a classical linear regression analysis, adjusting for age, sex and income. Note that there are three loci for which the p-values are less than 0.01. Of these three, TYR192 is in a candidate gene for skin pigmentation, and CYP19e2 is closely linked to a gene (SLC24a5) that has recently been shown to account for some of the ethnic variation in skin pigmentation.

However we expect these test results to be confounded by population stratification, as this has not been accounted for in the statistical model.

The file TestsHaplotypeAssociationFinal.txt contains tests for association with haplotypes at each compound locus. All haplotypes with frequency < 1% are grouped together as "others". The other haplotypes are tested one at a time for association, and also with a summary chi-square test of the null hypothesis that all haplotype effects are equal. A more appropriate chi-square test would exclude the "others" category: this will be fixed in a later release.

Note that the proportion of information extracted is typically at least 70% for each haplotype - this is what we would expect when inferring haplotypes from unphased genotype data in the presence of strong allelic association between the simple loci within each compound locus.

TwoPopsResults

This folder contains results of analysis with command-line option populations=2. This specifies a model with admixture between two subpopulations. For this analysis we specify the allele frequencies in these two subpopulations as unknown, with (uninformative) reference priors. The program fits a linear regression model with individual admixture proportion, together with age, sex, and income category as explanatory variables. The program attempts to infer the structure of the population from the allelic associations between markers and from the association of marker alleles with the outcome variable. Even if your objective is only to study population structure, including an outcome variable (such as skin reflectance) that is strongly related to individual admixture proportions helps the program to learn about individual admixture proportions and population stratification.

With no information about allele frequencies, and only 21 ancestry-informative marker loci in this dataset, the program requires very long runs to explore the posterior distribution of the population admixture parameters. Examine the tests for convergence in the file PopAdmixParamConvergenceDiags.txt: unless you have specified a very long run (samples = 10000, or more), the p-values will indicate that the sampler has not run long enough. To see why long runs are required, view the postscript file PopAdmixAutocorrelations.ps. This shows that the sampling of the Dirichlet parameters for the distribution of admixture proportions in the population mixes slowly, with autocorrelation of 0.5 or more up to a lag of 50 iterations. Mixing is much faster when prior information about allele frequencies is supplied, and when larger numbers of ancestry-informative marker loci have been typed. Larger datasets with hundreds of ancestry-informative markers usually require only a few hundred iterations for reliable inference.
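
To make the notion of autocorrelation at a given lag concrete, here is a minimal Python sketch of the usual sample autocorrelation of a chain of sampled values. It is an illustration only: the function name is hypothetical and the R script that produces the plot may compute it differently.

```python
from statistics import mean

def autocorrelation(chain, lag):
    """Sample autocorrelation of a chain of sampled values at the given lag."""
    m = mean(chain)
    denominator = sum((x - m) ** 2 for x in chain)
    numerator = sum(
        (chain[i] - m) * (chain[i + lag] - m) for i in range(len(chain) - lag)
    )
    return numerator / denominator
```

A chain whose autocorrelation is still 0.5 or more at a lag of 50 iterations contributes far fewer effectively independent draws than its nominal length suggests, which is why long runs are needed.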

The log file shows that the posterior predictive check probability in the test for residual stratification is 0.51 - there is no evidence of residual stratification not accounted for by a model with two subpopulations.

The model now contains four extra parameters: two parameters for the (Dirichlet) distribution of admixture proportions in the population, a "sum-intensities" parameter that is equivalent to the effective number of generations since admixture, and a coefficient for the effect of admixture on the outcome variable in the regression model. The file regparams.txt contains samples from the posterior distribution of the regression coefficients. The file PosteriorQuantiles.txt contains summary statistics for the posterior distribution of the model parameters. The program infers the population admixture proportions as about 2 to 1, and that skin reflectance is inversely related to proportionate admixture from the subpopulation that contributes less to the ancestry of the admixed population. As the next analysis shows, these results from a model with no information about allele frequencies are close to those obtained when we supply information about allele frequencies in Europeans, Native Americans and west Africans. As the two subpopulations are not identifiable in the model, their labelling as 1 and 2 is arbitrary, and the labels of the subpopulations that are associated with higher and lower skin reflectance might be reversed when the program is run with a different seed on the random number generator. In principle, the labelling could even reverse during a single run of the sampler. However the inverse association of skin reflectance with proportionate admixture from the subpopulation that makes the smaller contribution to the admixed population should be consistent.

To determine whether the sampler has been run long enough to estimate the p-value accurately for the tests that we are interested in, we can examine the postscript files TestsAllelicAssociation.ps and TestsHaplotypeAssociation.ps. These show the results of successive evaluations of the score test, evaluated over all posterior samples obtained since the end of the burn-in period and written to file every 50 iterations. After a few hundred iterations, most of the p-values are stable.

As the regression model includes individual admixture proportion as a predictor variable, tests of allelic association will be adjusted for whatever population stratification is inferred by the model. The p-values for association of skin reflectance with TYR192 and CYP19e2 are still statistically significant (at p = 0.007 and p = 0.00005 respectively), from which we can infer that the associations seen in the single-population model are unlikely to be accounted for by confounding effects of population stratification. Note that the proportion of information extracted is now only about 80% at most loci, compared with nearly 100% in the single population model. This is because there are only 21 ancestry-informative markers in the study, and the score statistic (which is based on adjusting for individual admixture in the regression model) consequently varies over the posterior distribution. The proportion of information extracted can be interpreted as a measure of the efficiency of the test. With more markers, the proportion of information extracted in the score test would be higher.

Estimates of individual admixture proportions obtained in this analysis are not meaningful because unless at least some information about allele frequencies or individual ancestry is supplied, the subpopulations are not identifiable in the model. This does not matter if we simply want to adjust for population stratification, because the score tests for allelic association are valid even if the labelling of the subpopulations is permuted (thus reversing the sign of the corresponding regression coefficient). To rank individuals by their "degree of admixture", we can examine the posterior mean of the "ancestry diversity" for each individual, in the file IndAdmixPosteriorMeans.txt. This is calculated as the probability that locus ancestry differs between two gene copies drawn at random from unlinked loci in the individual under study. It does not depend upon the labelling of the subpopulations, and so can be computed even when the subpopulations are not identifiable. If you are using the program to identify admixed individuals, without using prior information about allele frequencies, we recommend that you use this statistic.
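
A minimal sketch of why this statistic is unaffected by relabelling is given below. It assumes, for illustration only, that a single vector of admixture proportions applies to both gene copies in the individual, in which case the probability that two independently drawn gene copies differ in ancestry is one minus the sum of squared proportions. The program's own calculation may differ in detail, and the function name is hypothetical.

```python
def ancestry_diversity(admixture_proportions):
    """Probability that two gene copies at unlinked loci differ in ancestry,
    assuming both copies share the same admixture proportions."""
    return 1.0 - sum(m * m for m in admixture_proportions)
```

Because the sum of squares is the same for (0.7, 0.3) and (0.3, 0.7), permuting the subpopulation labels leaves the value unchanged; an unadmixed individual scores 0.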

PriorFreqResultsSkin

This folder contains results of analysis with command-line option priorallelefreqfile=priorallelefreqs.txt. With this option, the program fits a model in which allele frequencies in the unadmixed populations sampled are assumed to be identical to the corresponding ancestry-specific allele frequencies in the admixed population. The file priorallelefreqs.txt has three columns specifying priors on the allele frequencies in Europeans, Native Americans and west Africans. Specifying a prior distribution for the allele frequencies, rather than specifying them as fixed constants, allows for the uncertainty in estimates of allele frequencies that are based on samples of finite size. At loci where no prior information about allele frequencies is available for a given subpopulation, a reference prior (all parameters equal to 0.5) is specified. Where haplotype frequencies have been estimated from phase-unknown genotypes, the uncertainty in these estimates can also be accounted for in the parameters of the prior distribution.

The file PosteriorQuantiles.txt contains the posterior mean, posterior median and central 95% credible interval for population-level variables: Dirichlet parameters for the distribution of admixture, sum-intensities, regression coefficients, and the population admixture proportions. If the study sample size is large, the posterior mode and 95% credible interval are asymptotically equivalent to the maximum likelihood estimate and 95% confidence interval. If the sample size is large, the posterior distribution will be approximately normal and thus the posterior mode will be approximately equal to the mean (or median). This approximation is closest if the variable has been transformed to lie on the real line (between minus infinity and plus infinity). Plots of the posterior densities of all population-level parameters are given in the postscript file PosteriorDensities.ps. The table below briefly explains what the variables tabulated in this file mean.

Variable	Explanation
Dirichlet parameters for distribution of admixture (Eur, NAm, Afr) in the population	The ratios between the Dirichlet parameters determine the average admixture proportions in the population, and the sum of the Dirichlet parameters determines the variance of admixture proportions between individuals. A large value of the sum of Dirichlet parameters implies that the variance of admixture proportions between individuals is small.
Sum of intensities	This parameter measures the frequency with which transitions between states of European, African and Native American ancestry occur along the chromosomes in this population. This parameter is assumed to be the same in all individuals unless the option globalrho=0 is specified. The parameter can be interpreted as the average number of generations back to unadmixed ancestors. Note that in this dataset, with only a few linked markers, we don't have much information from which to infer the sum of intensities parameter: the 95% credible interval is from about 5 to about 19.
Regression coefficients for intercept, covariates, and individual admixture	These are linear regression coefficients. For a model with K subpopulations, K - 1 regression coefficients for individual admixture proportions are displayed (the population given in the first column of the allele frequency file is taken as the baseline category).
Population admixture proportions	Population admixture proportions are calculated by dividing the Dirichlet parameters by their sum.
Precision	Inverse of residual variance in regression model
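
The relationships between the Dirichlet parameters, the population admixture proportions and the between-individual variance described in the table can be sketched in Python using the standard formulas for the mean and variance of a Dirichlet distribution. The function name and the parameter values in the usage note are hypothetical.

```python
def dirichlet_summary(alpha):
    """Mean admixture proportions and their between-individual variances
    implied by a vector of Dirichlet parameters."""
    total = sum(alpha)
    means = [a / total for a in alpha]
    variances = [m * (1 - m) / (total + 1) for m in means]
    return means, variances
```

For example, parameters (2, 1) and (20, 10) give the same mean proportions of 2/3 and 1/3, but the second set has a much larger sum and therefore a much smaller variance between individuals.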

Posterior means of individual admixture are in the file IndAdmixPosteriorMeans.txt. These values can be plugged into other types of genetic analysis, but should not be used to test for a relationship with skin pigmentation because the association between individual admixture and skin pigmentation has already been used by the program to learn about individual admixture. If you want estimates of individual admixture that you can plug into a regression model to test for association with the outcome variable, run the program without an outcomevarfile.

The file DistributionIndividualAdmixture.ps contains histograms of the distribution of posterior means of individual admixture. For comparison, the distribution of individual admixture specified by the posterior means of the Dirichlet parameters is shown as a curve on the same plot. This file can be examined to test if there is any obvious lack of fit of the distribution of individual admixture proportions to the model: for instance a bimodal distribution of admixture proportions, which is not compatible with a Dirichlet distribution.

The file TestsForDispersion.txt contains cumulative results of a test for variation of allele frequencies between the unadmixed populations (in this example European, Native American and west African) that were sampled to obtain prior parameters in priorallelefreqfile, and the corresponding ancestry-specific allele frequencies in the admixed population (in this case Hispanic-Americans in Colorado). The numbers given in the file are "posterior predictive check probabilities" or "Bayesian p-values". Small posterior predictive check probabilities indicate lack of fit. There are no loci at which the p-values are small, indicating good fit of the data to the allele frequencies given in priorallelefreqfile. In this small dataset, we do not have enough information to detect small departures from the "no dispersion" model. However the unadmixed Native American populations that were sampled (from North and South America) are unlikely to be exactly representative of the ancestral Native American subpopulation that contributed genes to this admixed population in Colorado. If there is evidence of dispersion of allele frequencies, we can deal with this by fitting a dispersion model as described in the next section.

The results of tests of association of skin reflectance with alleles at each locus are similar to those obtained in the TwoPopsResults analysis, in which no prior information about allele frequencies was supplied.

In this analysis, where the subpopulations are identifiable, we can test also for association of the outcome variable with ancestry at each locus. This is the basis of admixture mapping, an approach that exploits admixture to localize genes underlying ethnic variation in disease risk. The results of these tests are in the file TestsAncestryAssociationFinal.txt, with plots of successive evaluations in the postscript file TestsAncestryAssociation.ps. The program runs more slowly when these tests are specified. We can ignore the tests for association with African ancestry, as the proportion of African admixture in this population is too low for such tests to be meaningful. We can also ignore any tests for which the observed information is very small, as where there is not enough information in the data, the asymptotic properties of the score test in large samples (approximation of the log-likelihood to a quadratic function) will not hold. At some loci, such as GNB3, the observed information from the test for association with Native American ancestry is evaluated as negative - probably because the true value is close to zero and the sampler has not been run long enough to estimate it accurately, although it is formally possible for the information to be negative (log-likelihood not concave downwards) in small samples.

Note that the efficiency of this test (proportion of information extracted) is much lower than the efficiency of the test for allelic association. For association with Native American ancestry, the highest proportion of information extracted is at marker locus MID52. This marker locus is highly informative for Native American ancestry. There is evidence of linkage to genes underlying the ethnic differences in skin reflectance at TYR192 and CYP19e2. The scores for association with Native American ancestry are negative, implying that average skin reflectance is inversely related to the proportion of gene copies that are of Native American ancestry at these loci, as we would expect if the trait loci account for some of the ethnic difference in skin reflectance. The tests for linkage with ancestry, unlike the test for allelic association, use the genotype data from all loci on each chromosome to extract information about ancestry at each locus. With these tests, we expect any evidence of linkage to be detectable over a broad region: 20 cM or more. To establish whether linkage with ancestry at TYR192 and CYP19e2 can be confirmed, the next step would be to type more markers informative for Native American versus European ancestry in these regions.

The file TestsAncestryAssociation.ps also contains a plot of the proportion of information extracted at each locus, across all loci. With larger marker sets, this plot can be used to evaluate the coverage of each chromosome by the marker set.

PriorFreqResultsDiabetes

This folder contains results of an analysis with diabetes as outcome variable. As diabetes is a binary variable, the program fits a logistic regression model. The logistic regression coefficients are log odds ratios, and can be transformed to odds ratios by taking exponents. With age and sex as the only other covariates in the regression model, the posterior mean for the log odds ratio for the effect of unit change in Native American admixture is 3.0, with a 95% credible interval from 0.15 to 7.0. As this interval does not overlap 0, we can interpret this as a statistically significant (p<0.05) association of diabetes with Native American admixture, adjusted for age and sex. We cannot, however, exclude the possibility that the association is accounted for by an unmeasured confounder that is associated with the proportion of Native American admixture and is independently associated with diabetes. If we run the analysis again with age, sex and income in the model as covariates (edit the Perl script tutorial.pl or the file args.txt to specify option covariatesfile='covariates3std.txt'), the posterior mean for the adjusted log odds ratio associated with Native American admixture falls to 2.0, with a 95% credible interval that overlaps 0. This is because income is associated both with Native American admixture and diabetes. The results are thus compatible with an environmental explanation for the association of diabetes with Native American admixture in this population. A larger sample size, more ancestry-informative markers, and more extensive measurements of environmental covariates would be required to investigate this.
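
The transformation from log odds ratios to odds ratios mentioned above can be checked with a few lines of Python, using the values quoted in this section for the model with age and sex as covariates. The function name is hypothetical. Note that exponentiating the interval limits gives the corresponding interval for the odds ratio, whereas exponentiating the posterior mean of the log odds ratio does not give the posterior mean of the odds ratio.

```python
import math

def to_odds_ratios(log_odds_ratios):
    """Exponentiate log odds ratios to obtain odds ratios."""
    return [math.exp(value) for value in log_odds_ratios]

# Posterior mean and 95% credible limits quoted above (log odds ratio scale).
estimate, lower, upper = to_odds_ratios([3.0, 0.15, 7.0])
```

On the odds ratio scale this corresponds to a point value of about 20, with an interval from about 1.2 to about 1100: the interval excludes 1 but is very wide.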

These concerns about confounding by environmental factors do not apply to the tests for allelic association or to the tests for linkage with locus ancestry. With these tests, it is sufficient to adjust for individual admixture proportions to guarantee that confounding by environmental factors will be eliminated. There is weak evidence of association with two candidate genes: PPARG (one of two SNPs) and SUR1 (one of the sixteen possible haplotypes). These results are consistent with associations described in other studies. As none of the markers in candidate genes for diabetes are informative for ancestry, and no other nearby ancestry-informative markers have been typed, we cannot assess the possible contribution of these candidate genes to ethnic variation in diabetes risk.

As diabetes is a binary trait, we can evaluate both affected-only and case-control tests for linkage with locus ancestry. The fit of these test results to the distribution expected under the null is given by the QQ plots. These show that the affected-only and case-control test statistics are a good fit to a standard normal distribution. The affected-only test is more powerful than the case-control test, but assumes that ancestry state frequencies do not vary systematically across the genome within the admixed population. The case-control test is robust to violation of this assumption.

HistoricAlleleFreqResults

This folder contains results of analysis with command-line option historicallelefreqfile=priorallelefreqs.txt. With this option, the program fits a "dispersion" model for the allele frequencies, which allows the allele frequencies in the unadmixed populations to vary from the corresponding ancestry-specific allele frequencies in the admixed population under study. A single allele frequency dispersion parameter η (eta) is estimated for each subpopulation. The option outcomes=2 specifies that regression models should be fitted for both outcome variables simultaneously. Thus if we are interested in evaluating associations with diabetes, we can still use the skin reflectance values to provide additional information about individual admixture.

The dispersion parameter can be interpreted as follows. Imagine that the variation between allele frequencies in the modern unadmixed west African populations sampled and the allele frequencies in the pool of genes of African ancestry in the African-American population of Philadelphia had been generated by drawing two independent equal-sized samples from an ancestral total population. The allele frequencies in the two samples would differ as a result of sampling variation, and the variance of the sample frequencies between these two samples would depend upon the size of the sample that was drawn. The parameter η (eta) is equivalent to this sample size. Small values of η (<100) indicate dispersion of allele frequencies. η is related to Wright's F_ST (fixation index subpopulation-total) by the relation F_ST = 1/(1 + η).
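
As a small worked illustration of the relation between the dispersion parameter and F_ST given above, the following sketch converts between the two. The function names are hypothetical and are not part of ADMIXMAP; the eta values are chosen only for illustration.

```python
def fst_from_eta(eta):
    """F_ST implied by a dispersion parameter eta, using F_ST = 1 / (1 + eta)."""
    if eta < 0:
        raise ValueError("eta must be non-negative")
    return 1.0 / (1.0 + eta)


def eta_from_fst(fst):
    """Inverse relation: eta = 1 / F_ST - 1."""
    if not 0 < fst <= 1:
        raise ValueError("F_ST must lie in (0, 1]")
    return 1.0 / fst - 1.0


# Illustrative values only: larger eta means less dispersion (smaller F_ST).
for eta in (50, 100, 500):
    print(f"eta = {eta:4d}  ->  F_ST = {fst_from_eta(eta):.4f}")
```

On this scale, the threshold of eta < 100 mentioned above corresponds to F_ST greater than about 0.01.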

As the program does not have much information from which to estimate these parameters, we have to specify strong priors. These are specified in the file data/etapriors.txt. There is one row for each subpopulation, with rows ordered as in the columns of historicallelefreqfile. Each row specifies a shape parameter and a rate parameter for a gamma prior distribution. The prior mean is shape/rate, and the prior variance is shape/rate^2. We have specified the prior means on the dispersion parameters as 500, 50 and 50 for European, Native American and African allele frequencies respectively, based on estimates of F_ST between subpopulations within these continental groups. The prior variances are specified as 5, 0.5, and 0.5 respectively. These very small prior variances effectively constrain the dispersion parameters to be close to their prior means: there is not enough information in the dataset for us to be able to estimate the dispersion parameters from the data. With this dataset, the point of running a dispersion model is simply to examine whether relaxing the assumption of no dispersion of allele frequencies alters the results.
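
Because the prior mean is shape/rate and the prior variance is shape/rate^2, the shape and rate can be recovered from a chosen mean and variance as rate = mean/variance and shape = mean^2/variance. The sketch below applies this to the prior means and variances quoted above. The function name is hypothetical, and the printed values are derived from the figures in the text rather than read from data/etapriors.txt, so they should be checked against that file.

```python
def gamma_shape_rate(mean, variance):
    """Shape and rate of a gamma distribution with the given mean and variance.

    Uses mean = shape / rate and variance = shape / rate**2.
    """
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    rate = mean / variance
    shape = mean * rate
    return shape, rate


# Prior means and variances as quoted in the tutorial text.
priors = {
    "European": (500.0, 5.0),
    "Native American": (50.0, 0.5),
    "African": (50.0, 0.5),
}

for population, (mean, variance) in priors.items():
    shape, rate = gamma_shape_rate(mean, variance)
    print(f"{population}: shape = {shape:g}, rate = {rate:g}")
```

The large shape values show how tightly these priors concentrate the dispersion parameters around their prior means.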

The posterior distributions of the allele frequency dispersion parameters (labelled as eta.Afr and eta.Eur for African and European subpopulations respectively) are plotted in the file ParameterPosteriorDensities.ps, and the means and medians are given in the file PosteriorQuantiles.txt. Posterior means of the standardized variance of allele frequencies (expressed as F_ST) between unadmixed and admixed subpopulations within each continental group are in the file lociFst.txt.

Estimates of the dispersion parameter have implications for the sample size required when estimating allele frequencies in unadmixed populations in order to model admixture. In general, there is no point in using a sample size that is an order of magnitude larger than the value of eta for a given subpopulation, because the allele frequencies in the unadmixed populations will not accurately predict the corresponding ancestry-specific allele frequencies in the admixed population.

The results of tests for allelic association and linkage with ancestry are similar to those obtained with the previous analysis using a model with no dispersion, suggesting that these analyses are fairly robust to assumptions about allele frequencies.

Other possible models

You can edit the Perl script or the args.txt file to specify other options.

randommatingmodel=1 would specify that the admixture proportions on the two parental gametes are independent draws from the Dirichlet distribution in the population. In general this option is only useful if you have at least 100 ancestry-informative markers, allowing the program to infer the admixture proportions of each parental gamete separately. The default is randommatingmodel=0.

globalrho=0 would specify a hierarchical model for the sum-intensities parameter. Again this option is only useful if you have at least 100 ancestry-informative markers, allowing the program to infer the sum-intensities parameter for each gamete. The default is globalrho=1.

indadmixhiermodel=0 would eliminate the hierarchical model for individual admixture, allowing you to specify a Dirichlet prior on individual or gamete admixture directly with initalpha0 and initalpha1. This option is occasionally useful when you do not want the "shrinkage" effect of a hierarchical model, in which outlying observations are pulled towards the population mean.