## ADMIXMAP: a program to model admixture using marker genotype data

Tutorial on using ADMIXMAP to model genotype and phenotype data from an admixed or stratified population

If you have problems getting the program to run, email david.odonnell (please supply any error messages and logfiles if available)

To feed back comments on this tutorial, email paul.mckeigue

Append @ucd.ie to the email addresses given above.

Getting started

This tutorial is based on data from a sample of 446 Hispanic-Americans resident in Colorado, typed at 32 loci. ADMIXMAP is designed to analyse larger datasets with more markers, but the analysis of this small dataset illustrates all the methods. The following data files have been prepared for input to ADMIXMAP. Before starting, open these files to view them. They are most easily viewed with a program such as Excel that will interpret tabs as column separators.

Filename |
Contents |

outcomevars.txt |
column 1: diabetes, coded as 0=unaffected,
1=affected. column 2: skin reflectance, scored as a quantitative trait |

covariates2std.txt |
age and sex, standardized about their sample means. |

covariates3std.txt |
age, sex and income group, standardized about their sample means. |

genotypes.txt |
genotypes at 32 SNP loci. These
include 2-4 SNPs in each of three candidate genes for diabetes (CAPN10,
PPARG, and SUR1), and one SNP that is in a candidate gene for skin
pigmentation (TYR). The first column is the ID number. The
names of the other columns (given in the header row) must match the
first column of the locusfile (loci.txt). For each genotype, the
two alleles are separated by a comma. The alleles must be numbered
1, 2, ... N, and this numbering must correspond to the sequence of rows
in priorallelefreqfile. |

loci.txt |
locus description file, with locus
name, number of alleles, and map distance from the previous locus. This file was generated from the file LociChr.txt, which
gives the chromosome number and estimated genetic map position (in cM)
for each locus. In loci.txt, map
distances are given in morgans. If the locus is not linked to the
previous locus, the distance is coded as 100. If the
locus is very close to the previous locus (< 100 kb), the distance is coded as 0,
and the two loci are treated as part of the same compound locus. Thus, for instance, the four SNPs (simple
loci) in the CAPN10 gene are modelled as a single compound
locus, with 16 possible haplotypes. In this way the 32 SNPs ("simple loci")
are grouped into 24 compound
loci. |

priorallelefreqs.txt |
parameters for Dirichlet prior distributions of
ancestry-specific allele frequencies (European, Native American, west
African) at each compound locus. This file has one row for each
possible haplotype at each compound locus. If the compound locus
contains only one SNP, the number of possible haplotypes is of course
just 2. Haplotypes are ordered by incrementing a counter from
the right: for instance the 16 possible haplotypes at a compound
locus consisting of four SNPs are ordered 1-1-1-1, 1-1-1-2, 1-1-2-1,
..., 2-2-2-2.
Where the compound locus contains only one simple
locus, the prior parameters are calculated simply by adding 0.5 to the
observed allele counts in samples of unadmixed individuals. Where
no data from unadmixed individuals are available, the prior parameters
are specified simply as 0.5 (a "reference" prior). Where
data from unadmixed individuals are available and the compound locus
contains two or more simple loci, the parameters for the prior on
haplotype frequencies are obtained by using ADMIXMAP with a single
population model to generate the posterior distribution of
haplotype frequencies from data on unadmixed individuals.
For an example of how to do this, run the script |

etapriors.txt | Parameters for gamma prior distribution on allele frequency dispersion parameters. See below for explanation |
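
The haplotype ordering and the "add 0.5 to the observed counts" rule described for priorallelefreqs.txt can be sketched in a few lines of Python. This is an illustration only and not part of ADMIXMAP: the function names `haplotype_order` and `dirichlet_prior_from_counts`, and the allele counts used in the example, are hypothetical.

```python
from itertools import product

def haplotype_order(n_snps, n_alleles=2):
    """List haplotypes in the order obtained by incrementing a counter from the right."""
    # product() varies the rightmost position fastest, which gives
    # 1-1-1-1, 1-1-1-2, 1-1-2-1, ..., 2-2-2-2 for four diallelic SNPs.
    alleles = range(1, n_alleles + 1)
    return ["-".join(str(a) for a in hap) for hap in product(alleles, repeat=n_snps)]

def dirichlet_prior_from_counts(counts=None, n_categories=2):
    """Dirichlet prior parameters for one subpopulation at a single-SNP locus."""
    if counts is None:
        # No data from unadmixed individuals: reference prior, all parameters 0.5.
        return [0.5] * n_categories
    # Data available: add 0.5 to each observed allele count.
    return [c + 0.5 for c in counts]

haps = haplotype_order(4)
print(len(haps), haps[:3], haps[-1])
print(dirichlet_prior_from_counts([37, 3]))  # made-up counts, for illustration
print(dirichlet_prior_from_counts())
```

Each row of priorallelefreqs.txt would then correspond to one entry of `haplotype_order(...)` for the compound locus concerned, with one column of prior parameters per subpopulation.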

These notes assume that you have installed Perl (ActivePerl is the Windows version), the R statistical package, and a viewer for postscript files (such as Ghostview).

First, download and install ADMIXMAP. The Perl script *tutorial.pl*
has been provided for this tutorial. Edit the Perl
script where indicated to specify the location of the ADMIXMAP executable, the R
executable (*Rcmd.exe* on a Windows platform) and the R script (AdmixmapOutput.R) that processes the
output from the ADMIXMAP executable.

You are now ready to start the Perl script from your
working directory. To do this, open the console shortcut and type
"*perl tutorial.pl*". This script will run the program six times with different
models, calling the main ADMIXMAP program each time with appropriate
command-line options. The command-line options are stored by Perl in an associative array, or
"hash". The Perl script provides a convenient means of running
several analyses with different options in batch mode. After each run of the ADMIXMAP program, the Perl
script will run the R script *AdmixmapOutput.R* to
analyse the output files, and move all output files to a folder (subdirectory) named for
the type of analysis that has been run. Five new subdirectories will be created, each containing
files output by ADMIXMAP and the R script. On
an ordinary PC, these analyses will take about 20 minutes. You can inspect
the results of each analysis, as described below, to determine if a longer run
(with more samples from the posterior distribution) is needed. The
results quoted in the tutorial below are from a long run, so may not correspond
exactly to those obtained with a shorter run.

We now examine the results of each ADMIXMAP analysis.

SinglePopResults

This folder contains results of analysis with
command-line option *populations=1*. This analysis does not exploit ADMIXMAP's ability to
model population structure, but does exploit its ability to fit regression
models and to model association with haplotypes given unphased genotype
data. For this purpose, ADMIXMAP is an alternative to HAPLOSCORE, a
program which uses a similar score test for association with haplotypes in a
regression model. Other uses of the single population model are to
test for population stratification, and to estimate haplotype frequencies in
samples from unadmixed populations. These haplotype frequency estimates
can be used to specify priors for the analysis of data from admixed
populations.

To specify that the
second column of *outcomevarfile* contains the outcome variable of interest
(skin reflectance), we specify the option *targetindicator=1* (i.e. offset
by 1 from column 0). The program automatically determines that this is a
continuous variable.
No information about allele
frequencies or haplotype frequencies is supplied, so the program generates the
posterior distribution of haplotype frequencies from the data, given a reference
prior. The program fits a linear regression model with skin reflectance as outcome variable, and age, sex and income as
explanatory variables. As the program does not have to model population
structure, it requires only a short run: a total of 1100 iterations, including a
burn-in of 100, is ample for all test statistics to be computed.

The file *RegressionParamConvergenceDiagnostics.txt* contains results of a simple
test for the adequacy of the burn-in period, attributed to Geweke
(1992). If
burn-in is adequate and the sampling run after burn-in is long enough, these test statistics should have a standard normal
distribution. Extreme values of the test statistics (indicated by small p-values)
imply that a longer burn-in, and a longer
sampling run after the burn-in, should be used. For this simple model,
there is no evidence of lack of convergence.
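
The idea behind this kind of burn-in check can be sketched as follows. This is a simplified illustration and not ADMIXMAP's implementation: it compares the mean of the early part of a chain with the mean of the late part, using a naive standard error that treats the draws as independent, whereas Geweke's diagnostic uses spectral density estimates to allow for autocorrelation. The function names and the 10%/50% window fractions are assumptions.

```python
import math
import statistics

def geweke_z(chain, first=0.1, last=0.5):
    """z-score comparing the mean of the early draws with the mean of the late draws."""
    n = len(chain)
    early = chain[: int(first * n)]
    late = chain[n - int(last * n):]
    # Naive standard error (assumes independent draws); a full Geweke
    # diagnostic would replace these variances by spectral estimates.
    se = math.sqrt(statistics.variance(early) / len(early)
                   + statistics.variance(late) / len(late))
    return (statistics.mean(early) - statistics.mean(late)) / se

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# A chain that is still drifting gives an extreme z-score and a tiny p-value.
drifting = [i / 1000 for i in range(1000)]
z = geweke_z(drifting)
print(z, two_sided_p(z))
```

A chain sampled from its stationary distribution should give z-scores that look like draws from a standard normal distribution, which is the property the diagnostics file reports on.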

Now examine the log file. This contains the result of a test for residual
population stratification, based on testing for allelic association between
unlinked loci. See the main program
documentation for more details of how this test statistic is calculated.
Only 14 loci are used in this calculation, as only unlinked loci can be
included. The result is reported as a **posterior predictive check probability** or
"Bayesian p-value". This is the frequency with which, over the
posterior distribution of model parameters, a simulated dataset gives
results more extreme than the observed dataset. We can interpret the
"p-value"
of 0.04 as evidence for stratification: there are more associations between
unlinked loci than expected by chance. In general, posterior predictive check
probabilities are more conservative than classical p-values: a p-value of 0.05
is fairly strong evidence of lack of fit.
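
As a rough sketch of how such a probability is computed: at each posterior draw a test statistic is evaluated on the observed data and on a dataset simulated under that draw, and the check probability is the fraction of draws in which the simulated statistic is the more extreme. The function name and the toy numbers below are hypothetical, and the direction of "more extreme" is assumed to be "larger".

```python
def posterior_predictive_p(observed_stats, replicate_stats):
    """Fraction of posterior draws in which the replicate exceeds the observed statistic."""
    pairs = list(zip(observed_stats, replicate_stats))
    if not pairs:
        raise ValueError("need at least one posterior draw")
    # One comparison per posterior draw of the model parameters.
    return sum(rep > obs for obs, rep in pairs) / len(pairs)

# Toy example: the replicate is more extreme in 1 of 4 draws.
print(posterior_predictive_p([1.0, 1.0, 1.0, 1.0], [0.2, 2.5, 0.7, 0.9]))
```

A value near 0.5 means that the observed data look typical of data simulated from the fitted model; values near 0 indicate lack of fit.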

The file *args.txt* contains one line for each of the
options specified on the command-line. This includes default values for
some of the command-line options that were not specified in the Perl
script. An alternative way to invoke the program with the options
specified in this file is simply to type "*<path to admixmap
executable>admixmap <path to args.txt>args.txt*".

The file HardyWeinbergTest.txt contains results of tests for Hardy-Weinberg equilibrium at each of the 32 simple loci. This test is based on averaging over the posterior distribution of allele frequencies, whereas the classic test for Hardy-Weinberg equilibrium conditions on the observed counts of alleles. For a single population model, we expect the score test to give very similar results to the classical test. Positive scores indicate that the proportion of homozygotes is higher than expected given the allele frequencies. Two of the loci - FY and MID52 - show evidence of departure from Hardy-Weinberg equilibrium. For locus FY there are too few copies of allele 2 for the test to be valid. Possible explanations for deviation from Hardy-Weinberg equilibrium are genotyping error, a higher proportion of missing genotypes (failure to call the genotype) in heterozygotes, or population stratification accompanied by non-random mating. If population stratification accounts for the departure from Hardy-Weinberg equilibrium, we expect that there will be no evidence of departure from Hardy-Weinberg equilibrium when the test is repeated with a model of admixture between two or more subpopulations.
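
For comparison, the classical test that conditions on the observed allele counts can be sketched as below for a diallelic locus. This is the textbook chi-square test, not the score test that ADMIXMAP reports; the function name and the genotype counts are made up for illustration.

```python
import math

def hwe_chi_square(n11, n12, n22):
    """Classical 1-df chi-square test of Hardy-Weinberg equilibrium at a diallelic locus."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)  # observed frequency of allele 1
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("locus is monomorphic; the test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n11, n12, n22)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # Positive values mean more homozygotes than expected.
    homozygote_excess = (n11 + n22) - (expected[0] + expected[2])
    return chi2, p_value, homozygote_excess

print(hwe_chi_square(25, 50, 25))  # genotype counts exactly in HWE proportions
print(hwe_chi_square(40, 20, 40))  # marked excess of homozygotes
```

As with locus FY in the tutorial data, the chi-square approximation is unreliable when one allele is rare and the expected count of a genotype is small.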

The file *PosteriorQuantiles.txt* contains summary
statistics for the posterior distribution of the parameters of the regression
model. Ignore the rows for Dirichlet parameter and sumIntensities, which
are irrelevant to a single population model. The posterior means and 95%
credible intervals for the regression coefficients for age and sex are
given. The "precision" parameter is the inverse of the residual
variance. All these estimates will be very similar to those that
would be obtained with a standard regression program: classical 95% confidence
intervals are equivalent to Bayesian 95% credible intervals where (as in this
application) the sample size is large and the priors on the regression
coefficients are non-informative.

Each simple locus is tested for allelic association with the outcome variable (skin reflectance), and the haplotypes at each compound locus are tested for association with the same outcome variable. These score tests test the null hypothesis β=0, where β is the regression coefficient for the effect of number of copies of the allele (or haplotype) on the outcome variable, in a model with the covariates (age, sex and income in this case). The test statistic is calculated by dividing the score (gradient of the log-likelihood at β=0) by the square root of the observed information (curvature of the log-likelihood at β=0). This statistic has a standard normal distribution under the null hypothesis. Positive score values indicate that the most likely value of β is greater than zero. The proportion of information extracted is a measure of how much information about β we have, in comparison with a dataset in which all variables were observed directly.
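
The text does not spell out how the score and information are accumulated over the posterior samples. The sketch below shows one standard way to do it, under the assumption that the missing-information identity is used: the score is averaged over draws, the complete-data information is averaged over draws, and the variance of the score over draws is subtracted as "missing" information. The function name and input values are hypothetical.

```python
import math
import statistics

def score_test_summary(scores, complete_infos):
    """Combine per-draw scores and complete-data information into a score test.

    Assumes observed information = complete information - variance of the score.
    """
    score = statistics.mean(scores)
    complete = statistics.mean(complete_infos)
    missing = statistics.pvariance(scores)
    observed = complete - missing
    if observed <= 0:
        # Too little information (or too short a run) for the test to be usable.
        z = p_value = float("nan")
    else:
        z = score / math.sqrt(observed)
        p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "score": score,
        "observed_info": observed,
        "proportion_extracted": observed / complete,
        "z": z,
        "p_value": p_value,
    }

print(score_test_summary([2.0, 4.0], [10.0, 10.0]))
```

Under this assumption the proportion of information extracted is 1 when the score does not vary over the posterior (everything is effectively observed), and falls as uncertainty about unobserved quantities such as haplotypes or individual admixture makes the score vary from draw to draw.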

First examine the file *TestsAllelicAssociationFinal.txt*,
which contains the results of score tests for association of the outcome
variable with allele 1 at each simple locus. For ease of viewing, open this file
with a program such as Excel that will interpret the tabs as column
separators. The p-values in this table will be practically identical
to those that would be obtained by testing for association with number of copies
of each allele in a classical linear regression analysis, adjusting for age,
sex and income. Note that there are three loci for which the p-values are
less than 0.01. Of these three, TYR192 is in a candidate gene for skin pigmentation, and CYP19e2 is
closely linked to a gene (SLC24A5) that has recently been shown to account for some of the ethnic
variation in skin pigmentation.

However, we expect these test results to be confounded by population stratification, as this has not been accounted for in the statistical model.

The file *TestsHaplotypeAssociationFinal.txt* contains tests
for association with haplotypes at each compound locus. All haplotypes
with frequency < 1% are grouped together as "others". The other
haplotypes are tested one at a time for association, and also with a summary
chi-square test of the null hypothesis that all haplotype effects are
equal. A more appropriate chi-square test would exclude the
"others" category: this will be fixed in a later release.

Note that the proportion of information extracted is typically at least 70% for each haplotype - this is what we would expect when inferring haplotypes from unphased genotype data in the presence of strong allelic association between the simple loci within each compound locus.

TwoPopsResults

This folder contains results of analysis with
command-line option *populations=2*. This specifies a model with
admixture between two subpopulations. For this analysis we specify the
allele frequencies in these two subpopulations as unknown, with (uninformative) reference priors.
The program fits a linear regression model with individual admixture proportion, together with
age, sex, and income category as explanatory variables. The program
attempts to infer the structure of the population from the allelic associations
between markers and from the association of marker alleles with the outcome
variable. Even if your objective
is only to study population structure,
including an outcome variable (such as skin reflectance) that is strongly
related to individual admixture proportions helps the program to learn about
individual admixture proportions and population stratification.

With no information about allele frequencies, and only 21
ancestry-informative marker loci in this dataset, the program requires very long runs to
explore the posterior distribution of the population admixture parameters. Examine the
tests for convergence in the file
*PopAdmixParamConvergenceDiags.txt*: unless you have specified a
very long run (*samples=10000* or more), the p-values will indicate that
the sampler has not run long enough. To see why long runs are required, view the postscript file
*PopAdmixAutocorrelations.ps*. This shows that the sampling of the
Dirichlet parameters for the distribution of admixture proportions in the
population mixes slowly, with autocorrelation of 0.5 or more up to a lag of 50
iterations. Mixing is much faster when prior information about allele frequencies
is supplied, and when larger numbers of ancestry-informative marker loci have
been typed. Larger datasets with hundreds of ancestry-informative
markers usually require only a few hundred iterations for reliable
inference.
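
The quantity plotted in an autocorrelation plot can be computed from a chain of sampled values as in the sketch below. It is a generic illustration rather than the code behind the postscript file, and the function name is hypothetical.

```python
def autocorrelation(chain, lag):
    """Sample autocorrelation of a chain of MCMC draws at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    dev = [x - mean for x in chain]
    denom = sum(d * d for d in dev)
    # Products of deviations that are `lag` iterations apart.
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom

# A chain that flips sign at every iteration is strongly negatively
# correlated at lag 1 and strongly positively correlated at lag 2.
alternating = [1.0, -1.0] * 50
print(autocorrelation(alternating, 1), autocorrelation(alternating, 2))
```

An autocorrelation that is still 0.5 or more at a lag of 50 iterations means that successive draws carry little new information, which is why the run has to be long.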

The log file shows that the posterior predictive check probability in the test for residual stratification is 0.51 - there is no evidence of residual stratification not accounted for by a model with two subpopulations.

The model now contains four extra parameters: two parameters for the (Dirichlet) distribution of admixture proportions in the population, a "sum-intensities" parameter that is equivalent to the effective number of generations since admixture, and a coefficient for the effect of admixture on the outcome variable in the regression model. The file regparams.txt contains samples from the posterior distribution of the regression coefficients. The file PosteriorQuantiles.txt contains summary statistics for the posterior distribution of the model parameters. The program infers the population admixture proportions as about 2 to 1, and that skin reflectance is inversely related to proportionate admixture from the subpopulation that contributes less to the ancestry of the admixed population. As the next analysis shows, these results from a model with no information about allele frequencies are close to those obtained when we supply information about allele frequencies in Europeans, Native Americans and west Africans. As the two subpopulations are not identifiable in the model, their labelling as 1 and 2 is arbitrary, and the labels of the subpopulations that are associated with higher and lower skin reflectance might be reversed when the program is run with a different seed on the random number generator. In principle, the labelling could even reverse during a single run of the sampler. However the inverse association of skin reflectance with proportionate admixture from the subpopulation that makes the smaller contribution to the admixed population should be consistent.

To determine whether the sampler has been run long enough
to estimate the p-value accurately for the tests that we are interested in, we can examine the postscript files
*TestsAllelicAssociation.ps* and *TestsHaplotypeAssociation.ps*.
These show the results of successive evaluations of
the score test, evaluated over all posterior samples obtained since the end of
the burn-in period and written to file every 50 iterations. After a few hundred
iterations, most of the p-values are stable.

As the regression model includes individual admixture proportion as a predictor variable, tests of allelic association will be adjusted for whatever population stratification is inferred by the model. The p-values for association of skin reflectance with TYR192 and CYP19e2 are still statistically significant (at p = 0.007 and p = 0.00005 respectively), from which we can infer that the associations seen in the single-population model are unlikely to be accounted for by confounding effects of population stratification. Note that the proportion of information extracted is now only about 80% at most loci, compared with nearly 100% in the single population model. This is because there are only 21 ancestry-informative markers in the study, and the score statistic (which is based on adjusting for individual admixture in the regression model) consequently varies over the posterior distribution. The proportion of information extracted can be interpreted as a measure of the efficiency of the test. With more markers, the proportion of information extracted in the score test would be higher.

Estimates of individual admixture proportions obtained in
this analysis are not meaningful because unless at least some information about
allele frequencies or individual ancestry is supplied, the subpopulations are
not identifiable in the model. This does not matter if
we simply want to adjust for population stratification, because the score tests
for allelic association are valid even if the labelling of the subpopulations is
permuted (thus reversing the sign of the corresponding regression coefficient). To rank individuals by their "degree of
admixture", we can examine the posterior mean of the "ancestry diversity" for each
individual, in the file *IndAdmixPosteriorMeans.txt*. This is
calculated as the probability that locus ancestry differs between two gene
copies drawn at random from unlinked loci in the individual under study. It does not
depend upon the labelling of the subpopulations, and so can be computed even
when the subpopulations are not identifiable. If you are using the program to
identify admixed individuals, without using prior information about allele
frequencies, we recommend that you use this statistic.
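
To see why this statistic does not depend on the labelling, consider a simplified version in which the two gene copies are drawn independently with ancestry probabilities equal to the individual's admixture proportions. The probability that their ancestries differ is then one minus the sum of squared proportions. This is a sketch under that assumption; the program's own calculation may differ in detail (for example by treating the two parents separately), and the function name is hypothetical.

```python
def ancestry_diversity(admixture):
    """Probability that two independently drawn gene copies differ in ancestry."""
    total = sum(admixture)
    props = [a / total for a in admixture]  # normalise, in case of rounding
    return 1.0 - sum(p * p for p in props)

print(ancestry_diversity([1.0, 0.0]))  # unadmixed individual
print(ancestry_diversity([0.5, 0.5]))  # maximally admixed with two subpopulations
print(ancestry_diversity([0.3, 0.7]) == ancestry_diversity([0.7, 0.3]))  # labels swapped
```

Because the sum of squares is unchanged when the proportions are permuted, swapping the labels of the subpopulations leaves the value unchanged.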

PriorFreqResultsSkin

This folder contains results of analysis with
command-line option *priorallelefreqfile=priorallelefreqs.txt*. With
this option, the program fits a model in which allele frequencies in the
unadmixed populations sampled are assumed to be identical to the corresponding
ancestry-specific allele frequencies in the admixed population. The file
priorallelefreqs.txt has three columns specifying priors on the allele
frequencies in Europeans, Native Americans and west Africans. Specifying a
prior distribution for the allele frequencies, rather than specifying them as
fixed constants, allows for the uncertainty in estimates of allele frequencies
that are based on samples of finite size. At loci where no prior
information about allele frequencies is available for a given subpopulation, a reference prior (all
parameters equal to 0.5) is specified. Where haplotype frequencies have been
estimated from phase-unknown genotypes, the uncertainty in these estimates can
also be accounted for in the parameters of the prior distribution.

The file *PosteriorQuantiles.txt* contains the posterior
mean, posterior median and central 95% credible interval for population-level
variables: Dirichlet parameters for the distribution of
admixture, sum-intensities, regression coefficients, and the population admixture
proportions. If the study sample size is large, the posterior mode and 95%
credible interval are asymptotically equivalent to the maximum likelihood
estimate and 95% confidence interval. If the sample size is large, the posterior distribution will be approximately normal and
thus the posterior mode will be approximately equal to the mean (or
median). This approximation is closest if the
variable has been transformed to lie on the real line (between minus infinity
and plus infinity). Plots
of the posterior densities of all population-level parameters are given in the
postscript file PosteriorDensities.ps. The table below briefly explains what the
variables tabulated in this file mean.

Variable |
Explanation |

Dirichlet parameters for distribution of admixture (Eur, NAm, Afr) in the population | The ratios between the Dirichlet parameters determine the average admixture proportions in the population, and the sum of the Dirichlet parameters determines the variance of admixture proportions between individuals. A large value of the sum of Dirichlet parameters implies that the variance of admixture proportions between individuals is small. |

Sum of intensities | This parameter measures the frequency with which transitions between states of European, African and Native American ancestry occur along the chromosomes in this population. This parameter is assumed to be the same in all individuals unless the option globalrho=0 is specified. The parameter can be interpreted as the average number of generations back to unadmixed ancestors. Note that in this dataset, with only a few linked markers, we don't have much information from which to infer the sum of intensities parameter: the 95% credible interval is from about 5 to about 19. |

Regression coefficients for intercept, covariates, and individual admixture | These are linear regression
coefficients. For a model with K subpopulations, K - 1
regression coefficients for
individual admixture proportions are displayed (the population given in
the first column of the allele frequency file is
taken as the baseline category). |

Population admixture proportions | Population admixture proportions are calculated by dividing the Dirichlet parameters by their sum. |

Precision | Inverse of residual variance in regression model |
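
The relationship between the Dirichlet parameters, the population admixture proportions and the between-individual variance follows from standard properties of the Dirichlet distribution, and can be sketched as below. The function name and parameter values are made up for illustration and are not results from this dataset.

```python
def dirichlet_summary(alpha):
    """Mean and variance of each admixture proportion under a Dirichlet(alpha) distribution."""
    total = sum(alpha)
    # Population admixture proportions: each parameter divided by the sum.
    means = [a / total for a in alpha]
    # Variance between individuals shrinks as the sum of the parameters grows.
    variances = [m * (1 - m) / (total + 1) for m in means]
    return means, variances

print(dirichlet_summary([2.0, 1.0, 1.0]))
print(dirichlet_summary([20.0, 10.0, 10.0]))  # same proportions, smaller variance
```

The two calls give the same mean proportions, but the second, with a tenfold larger sum, implies much less variation in admixture between individuals.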

Posterior means of individual admixture are in the file
*IndAdmixPosteriorMeans.txt*. These values can be plugged into other types of
genetic analysis, but should not be used to test for a relationship with skin
pigmentation because the association between individual admixture and skin
pigmentation has already been used by the program to learn about individual
admixture. If you want estimates of individual admixture that you can plug into
a regression model to test for association with the outcome variable, run the
program without an outcomevarfile.

The file *DistributionIndividualAdmixture.ps* contains
histograms of the distribution of posterior means of individual admixture. For
comparison, the distribution of individual admixture specified by the posterior
means of the Dirichlet parameters is shown as a curve on the same plot. This
file can be examined to test if there is any obvious lack of fit of the
distribution of individual admixture proportions to the model: for instance a
bimodal distribution of admixture proportions, which is not compatible with a
Dirichlet distribution.

The file *TestsForDispersion.txt* contains cumulative results
of a test for variation of allele frequencies between the unadmixed
populations (in this example European, Native American and west African) that were sampled to
obtain prior parameters in *priorallelefreqfile*, and the
corresponding ancestry-specific allele frequencies in the admixed population (in
this case Hispanic-Americans in Colorado). The numbers given in the file
are "posterior predictive check probabilities" or "Bayesian
p-values". Small posterior predictive check probabilities indicate lack of
fit. There are no loci at which the p-values are small, indicating good fit of the data to the allele
frequencies given in priorallelefreqfile. In this small dataset, we do not
have enough information to detect small departures from the "no
dispersion" model. However, the unadmixed Native American populations
that were sampled (from North and South America) are unlikely to be exactly
representative of the ancestral Native American subpopulation that contributed
genes to this admixed population in Colorado. If there is evidence of
dispersion of allele frequencies, we can deal with this by fitting a dispersion
model as described in the next section.

The results of tests of association of skin reflectance with alleles at each locus are similar to those obtained in the TwoPopsResults analysis, in which no prior information about allele frequencies was supplied.

In this analysis, where the subpopulations are
identifiable, we can test also for association of the outcome variable with
ancestry at each locus. This is the basis of **admixture mapping**, an
approach that exploits admixture to localize genes underlying ethnic variation in disease
risk. The results of these tests are in the file *TestsAncestryAssociationFinal.txt*, with plots of successive evaluations in the
postscript file *TestsAncestryAssociation.ps*. The program runs more
slowly when these tests are specified. We can ignore the tests for
association with African ancestry, as the proportion of African admixture in
this population is too low for such tests to be meaningful. We can also ignore any tests for which the observed information is very small:
where there is not enough information in the data, the large-sample properties of the score test (which rely on the log-likelihood being approximately
quadratic) will not hold. At some loci, such as
GNB3, the observed information from the test for association with Native
American ancestry is evaluated as negative - probably because the true value is
close to zero and the sampler has not been run long enough to estimate it
accurately, although it is formally possible for the information to be negative
(log-likelihood not concave downwards) in small samples.

Note that the efficiency of this test (proportion of information extracted) is much lower than the efficiency of the test for allelic association. For association with Native American ancestry, the highest proportion of information extracted is at marker locus MID52. This marker locus is highly informative for Native American ancestry. There is evidence of linkage to genes underlying the ethnic differences in skin reflectance at TYR192 and CYP19e2. The scores for association with Native American ancestry are negative, implying that average skin reflectance is inversely related to the proportion of gene copies that are of Native American ancestry at these loci, as we would expect if the trait loci account for some of the ethnic difference in skin reflectance. The tests for linkage with ancestry, unlike the test for allelic association, use the genotype data from all loci on each chromosome to extract information about ancestry at each locus. With these tests, we expect any evidence of linkage to be detectable over a broad region: 20 cM or more. To establish whether linkage with ancestry at TYR192 and CYP19e2 can be confirmed, the next step would be to type more markers informative for Native American versus European ancestry in these regions.

The file *TestsAncestryAssociation.ps* also contains a plot
of the proportion of information extracted at each locus, across all loci. With
larger marker sets, this plot can be used to evaluate the coverage of each chromosome
by the marker set.

PriorFreqResultsDiabetes

This folder contains results of an analysis with
diabetes as outcome variable. As diabetes is a binary variable, the
program fits a logistic regression model. The logistic regression
coefficients are log odds ratios, and can be transformed to odds ratios by
exponentiating them. With age and sex as the only other covariates in the
regression model, the posterior mean for the log odds ratio for the effect of
unit change in Native American admixture is 3.0, with a 95% credible interval
from 0.15 to 7.0. As this interval does not overlap 0, we can interpret
this as a statistically significant (p<0.05) association of diabetes with
Native American admixture, adjusted for age and sex. We cannot, however,
exclude the possibility that the association is accounted for by an unmeasured
confounder that is associated with the proportion of Native American admixture
and is independently associated with diabetes. If we run the analysis
again with age, sex and income in the model as covariates (edit the Perl script *tutorial.pl*
or the file *args.txt* to specify option *covariatesfile='covariates3std.txt'*), the posterior mean
for the adjusted log odds ratio associated with Native American admixture falls
to 2.0, with a 95% credible interval that overlaps 0. This is because
income is associated both with Native American admixture and diabetes. The
results are thus compatible with an environmental explanation for the
association of diabetes with Native American admixture in this population.
A larger sample size, more ancestry-informative markers, and more extensive
measurements of environmental covariates would be required to investigate
this.
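
As a small worked example of the transformation from log odds ratios to odds ratios, the snippet below exponentiates the posterior mean and credible limits quoted above for the model with age and sex as covariates. The helper name `to_odds_ratio` is hypothetical.

```python
import math

def to_odds_ratio(log_odds_ratio):
    """Convert a logistic regression coefficient (log odds ratio) to an odds ratio."""
    return math.exp(log_odds_ratio)

quoted = [
    ("posterior mean", 3.0),
    ("lower 95% limit", 0.15),
    ("upper 95% limit", 7.0),
]
for label, log_or in quoted:
    print(f"{label}: log odds ratio {log_or}, odds ratio {to_odds_ratio(log_or):.2f}")
```

An interval that excludes 0 on the log odds ratio scale excludes 1 on the odds ratio scale. The very wide interval on the odds ratio scale reflects how little information this small dataset holds about the effect of a unit change in admixture proportion.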

These concerns about confounding by environmental factors do not apply to the tests for allelic association or to the tests for linkage with locus ancestry. With these tests, it is sufficient to adjust for individual admixture proportions to guarantee that confounding by environmental factors will be eliminated. There is weak evidence of association with two candidate genes: PPARG (one of two SNPs) and SUR1 (one of the sixteen possible haplotypes). These results are consistent with associations described in other studies. As none of the markers in candidate genes for diabetes are informative for ancestry, and no other nearby ancestry-informative markers have been typed, we cannot assess the possible contribution of these candidate genes to ethnic variation in diabetes risk.

As diabetes is a binary trait, we can evaluate both affected-only and case-control tests for linkage with locus ancestry. The fit of these test results to the distribution expected under the null is given by the QQ plots. These show that the affected-only and case-control test statistics are a good fit to a standard normal distribution. The affected-only test is more powerful than the case-control test, but assumes that ancestry state frequencies do not vary systematically across the genome within the admixed population. The case-control test is robust to violation of this assumption.

HistoricAlleleFreqResults

This folder contains results of analysis with the
command-line option *historicallelefreqfile=priorallelefreqs.txt*.
With this option, the program fits a "dispersion" model for the allele
frequencies, which allows the allele frequencies in the unadmixed populations to
vary from the corresponding ancestry-specific allele frequencies in the admixed
population under study. A single allele frequency dispersion parameter η
(eta) is estimated for
each subpopulation. The option *outcomes=2* specifies that
regression models should be fitted for both outcome variables
simultaneously. Thus if we are interested in evaluating associations
with diabetes, we can still use the skin reflectance values to provide
additional information about individual admixture.

The dispersion parameter can be interpreted as follows.
Imagine that the variation between allele frequencies in the modern unadmixed
west African populations sampled and the allele frequencies in the pool of genes
of African ancestry in the African-American population of Philadelphia had been
generated by drawing two independent equal-sized samples from an ancestral total
population. The allele frequencies in the two samples would differ as a result
of sampling variation, and the variance of the sample frequencies between these
two samples would depend upon the size of the sample that was drawn. The
parameter η is equivalent to this sample size. Small values
of η (<100) indicate dispersion of allele frequencies.
η is related to
Wright's *F_{ST}* (fixation index, subpopulation-total) by the relation

$F_{ST} = 1/(1+\eta)$
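The sampling interpretation of the dispersion parameter can be made concrete with a small calculation. The sketch below treats the dispersion parameter (eta) as the number of gene copies in each of the two imagined samples and uses ordinary binomial sampling variance; both choices are simplifying assumptions for illustration and are not a description of how ADMIXMAP fits the dispersion model. The allele frequency 0.5 is also made up.

```python
def sampling_variance(p, eta):
    """Binomial sampling variance of an allele frequency estimated from eta gene copies."""
    return p * (1.0 - p) / eta

# Illustrative ancestral allele frequency (an assumption, not taken from the data).
p = 0.5
for eta in (50, 500):
    var_one = sampling_variance(p, eta)
    # For two independent samples, the variance of the difference in
    # frequencies is the sum of the two sampling variances.
    sd_diff = (2.0 * var_one) ** 0.5
    print(f"eta={eta}: sd of frequency difference = {sd_diff:.3f}")
```

The smaller the value of eta, the further apart the two sets of allele frequencies are expected to be.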

As the program does not have much information from which to
estimate these parameters, we have to specify strong priors. These are
specified in the file *data/etapriors.txt*. There is one row for each
subpopulation, with rows ordered as in the columns of *historicallelefreqfile*.
Each row specifies a shape
parameter and a rate parameter for a gamma prior distribution. The prior
mean is shape/rate, and the prior variance is shape/rate^2. We have
specified the prior means on the dispersion parameters as 500, 50 and 50 for
European, Native American and African allele frequencies respectively, based on
estimates of *F_{ST}* between subpopulations within these
continental groups. The prior variances are specified as 5, 0.5, and 0.5
respectively. These very small prior variances effectively constrain the
dispersion parameters to be close to their prior means: there is not enough information in the dataset for us to be able to estimate the
dispersion parameters from the data. With this dataset, the point of
running a dispersion model is simply to examine whether relaxing the assumption
of no dispersion of allele frequencies alters the results.
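The shape and rate parameters implied by a given prior mean and variance follow directly from the two formulas above (mean = shape/rate, variance = shape/rate^2). The sketch below derives them for the prior means and variances quoted in the text. The function name `gamma_shape_rate` is hypothetical, and the values it prints are derived here rather than read from *data/etapriors.txt*, so check them against that file before relying on them.

```python
def gamma_shape_rate(mean, variance):
    """Solve mean = shape/rate and variance = shape/rate**2 for (shape, rate)."""
    rate = mean / variance
    shape = mean * rate
    return shape, rate

# Prior means and variances as quoted in the text: (mean, variance).
priors = {
    "European": (500, 5),
    "Native American": (50, 0.5),
    "African": (50, 0.5),
}
for population, (mean, variance) in priors.items():
    shape, rate = gamma_shape_rate(mean, variance)
    print(f"{population}: shape={shape:g}, rate={rate:g}")
```

Holding the mean fixed, a smaller variance means a larger rate and shape, which is what makes these priors so tight around their means.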

The posterior distributions of the allele frequency
dispersion parameters (labelled as eta.Afr and eta.Eur for African and European
subpopulations respectively) are plotted in the file *ParameterPosteriorDensities.ps*, and the means and medians are given in the file
*PosteriorQuantiles.txt*. Posterior means of the standardized variance
of allele frequencies (expressed as
*F_{ST}*) between unadmixed and admixed subpopulations within each
continental group are in the file *lociFst.txt*.

Estimates of the dispersion parameter have implications for the sample size required when estimating allele frequencies in unadmixed populations in order to model admixture. In general, there is no point in using a sample size that is an order of magnitude larger than the value of eta for a given subpopulation, because the allele frequencies in the unadmixed populations will not accurately predict the corresponding ancestry-specific allele frequencies in the admixed population.

The results of tests for allelic association and linkage with ancestry are similar to those obtained with the previous analysis using a model with no dispersion, suggesting that these analyses are fairly robust to assumptions about allele frequencies.

Other possible models

You can edit the Perl script or the *args.txt* file to
specify other options.

*randommatingmodel=1* would specify that the admixture proportions on the
two parental gametes are independent draws from the Dirichlet distribution in
the population. In general this option is only useful if you have at least
100 ancestry-informative markers, allowing the program to infer the admixture proportions of each parental
gamete separately. The default is *randommatingmodel=0*.

*globalrho=0* would specify a hierarchical model for the sum-intensities
parameter. Again this option is only useful if you have at least 100
ancestry-informative markers, allowing the program to infer the sum-intensities
parameter for each gamete. The default is *globalrho=1*.

*indadmixhiermodel=0* would eliminate the
hierarchical model for individual admixture, allowing you to specify a Dirichlet
prior on individual or gamete admixture directly with *initalpha0* and *initalpha1*.
This option is occasionally useful when you do not want the
"shrinkage" effect of a hierarchical model, in which outlying
observations are pulled towards the population mean.