ADMIXMAP: a program to model admixture using marker genotype data
genotypesfile
The first row of the file is a header row
listing locus names, enclosed in quotes and separated by spaces. Locus names
should be exactly the same as in the file locusfile. Loci must be ordered by
their map positions on the genome. Each subsequent row contains genotype data
for a single individual. Each line contains the individual ID, the
individual's sex, coded as 1 for male, 2 for female, 0 for missing, followed by
observed genotypes at each locus, optionally enclosed in quotes. The sex column may be omitted if none of the
loci are on the X chromosome. Haploid genotypes (including X
chromosome genotypes for males) are coded as single integers. Diploid
genotypes are coded as pairs of integers separated by a comma. Where there are a alleles at
a locus, the alleles should be coded as numbers from 1 to a. Missing
genotypes are coded as "0,0" (or "0" for haploid
genotypes).
For compatibility with existing datasets, we plan to change this file format to one more similar to the PEDFILE format used with LINKAGE.
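As an illustration of the row format described above, the following Python sketch parses one data row of a genotypesfile. It is not part of ADMIXMAP; the function names and the example row are hypothetical, and it assumes fields are separated by whitespace with optional double quotes.

```python
import shlex


def parse_genotype(token):
    """Return a tuple of allele numbers, or None for a missing genotype.

    Diploid genotypes look like "1,2"; haploid genotypes are a single integer.
    Missing genotypes are "0,0" (diploid) or "0" (haploid).
    """
    alleles = tuple(int(a) for a in token.split(","))
    if all(a == 0 for a in alleles):
        return None
    return alleles


def parse_genotype_row(line, has_sex_column=True):
    """Split one data row into (individual ID, sex code, list of genotypes).

    The sex code is 1 = male, 2 = female, 0 = missing; it is returned as
    None when the file has no sex column (an assumption of this sketch).
    """
    fields = shlex.split(line)  # strips the optional quotes around fields
    individual_id = fields[0]
    if has_sex_column:
        sex = int(fields[1])
        tokens = fields[2:]
    else:
        sex = None
        tokens = fields[1:]
    return individual_id, sex, [parse_genotype(t) for t in tokens]


# Hypothetical row: a male with three diploid genotypes (one missing)
# followed by one haploid X chromosome genotype.
print(parse_genotype_row('ID001 1 "1,2" "2,2" "0,0" 1'))
```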
testgenotypesfile
This file contains genotypes for each individual in the genotypesfile at diallelic loci not included in the model due to large haplotypes not being modelled. The format is the same as for the genotypesfile above, except that genotypes should be coded as 0 for "1,1", 1 for "1,2" and 2 for "2,2". Missing genotypes should be coded as NA. The file is not used by the program itself, but its presence indicates that, provided there is a regression model, "offline" score tests are to be carried out in the R script.
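The recoding from genotypesfile codes to testgenotypesfile codes can be sketched in Python as follows. The function name is hypothetical, and treating "2,1" as equivalent to "1,2" is an assumption of this sketch rather than something stated above.

```python
def recode_test_genotype(genotype):
    """Map a diallelic genotypesfile code to its testgenotypesfile code.

    "1,1" -> "0", "1,2" -> "1", "2,2" -> "2", missing -> "NA".
    """
    token = genotype.strip('"')
    if token in ("0,0", "0"):
        return "NA"
    # Assumption: allele order within a genotype does not matter.
    alleles = tuple(sorted(int(a) for a in token.split(",")))
    codes = {(1, 1): "0", (1, 2): "1", (2, 2): "2"}
    if alleles not in codes:
        raise ValueError("not a diallelic diploid genotype: " + genotype)
    return codes[alleles]


print([recode_test_genotype(g) for g in ["1,1", "1,2", "2,2", "0,0"]])
```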
locusfile
This file contains information about each simple
locus: that is, each locus that is typed. The first row of the file is
ignored by the program, and can be used as a header. Each subsequent row
contains values of four variables: locus name; number of alleles at this
locus; genetic map distance in Morgans, centimorgans or megabases between this locus and the previous
locus; and the name of the chromosome where the locus is located. The
last column is optional if none of the loci lie on the X chromosome. If
distances are supplied in centimorgans, the header of the distance column must
contain "cm" or "cM". If the distances are supplied in
megabases, the header should contain "mb" or "Mb". Loci must be ordered by their map positions on the genome. Locus
names should contain only alphanumeric characters (no
spaces, dots or hyphens). If the previous locus is unlinked, the genetic
map distance should be coded as "NA", "#" or
"." . Loci considered too far apart to be linked may also be
treated as unlinked. For two or more loci that are
so close together that they should be analysed as a single compound locus (as
with DRD2Bcl and DRD2Taqd in the tutorial), map distance should be coded
as 0.
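To make the distance coding concrete, the Python sketch below groups simple loci into compound loci using the rules just described: a distance of 0 joins a locus to the previous one, while "NA", "#" or "." marks an unlinked locus. The helper names, the locus name D1S1 and the numeric distances are made up for illustration; only DRD2Bcl and DRD2Taqd come from the tutorial.

```python
UNLINKED_CODES = {"NA", "#", "."}


def parse_distance(field):
    """Return the map distance as a float, or None if the locus is unlinked."""
    token = field.strip('"')
    if token in UNLINKED_CODES:
        return None
    return float(token)


def group_compound_loci(rows):
    """Group (locus name, distance field) pairs into compound loci.

    A locus at distance 0 from the previous locus joins that locus's group;
    any other locus starts a new group.
    """
    groups = []
    for name, distance_field in rows:
        distance = parse_distance(distance_field)
        if groups and distance is not None and distance == 0.0:
            groups[-1].append(name)
        else:
            groups.append([name])
    return groups


# Hypothetical locusfile rows, reduced to name and distance.
rows = [("D1S1", "NA"), ("DRD2Bcl", "#"), ("DRD2Taqd", "0")]
print(group_compound_loci(rows))
```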
The website http://actin.ucd.ie/cgi-bin/rs2cm.cgi
can be used to obtain the genetic map positions (in cM)
of a list of SNPs, which, once converted to distances, may be specified in the locusfile.
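Converting map positions to the between-locus distances expected in the locusfile might look like the Python sketch below. It assumes the loci are already sorted by chromosome and position, and it codes the first locus on each chromosome as "NA" (unlinked). The function name and the example positions are hypothetical.

```python
def positions_to_distances(loci):
    """Convert (name, chromosome, position in cM) into (name, distance) pairs.

    The distance is measured from the previous locus on the same chromosome;
    the first locus on each chromosome is coded "NA".
    """
    result = []
    prev_chrom = None
    prev_pos = None
    for name, chrom, pos in loci:
        if chrom != prev_chrom:
            result.append((name, "NA"))
        else:
            result.append((name, pos - prev_pos))
        prev_chrom, prev_pos = chrom, pos
    return result


# Made-up positions, for illustration only.
loci = [("snpA", "1", 10.0), ("snpB", "1", 12.5), ("snpC", "2", 3.0)]
print(positions_to_distances(loci))
```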
allelefreqfile
This file contains the ancestry-specific
allele frequencies at each compound locus in each ancestral subpopulation. The
first row contains headers in quotes, separated by spaces. The first string in
this row is ignored. Subsequent strings in the first row specify the names of
the ancestral subpopulations contributing to the admixed population under
study. Subsequent rows specify the ancestry-specific allele frequencies
(usually estimated by sampling modern descendants of the subpopulations
that underwent admixture). The first column in each row gives the name of the
compound locus, in quotes.
For diallelic loci, only the frequency of allele 1 in each population is
specified. For each locus with a alleles, there are (a - 1) rows specifying
frequencies of alleles 1 to (a - 1).
Where two or more loci are to be analysed as a single haplotype, the
ancestry-specific frequency of each haplotype must be specified. Thus in the
example files below, there are two SNPs in the DRD2 gene (giving four possible
haplotypes) and four lines specifying the ancestry-specific frequencies of
haplotypes 11, 12, 21, 22. The loci in the haplotype are ordered by their map
position on the genome, and the haplotypes are ordered by incrementing a
counter from right to left. For instance if there were three loci A, B, C,
with 4, 2 and 3 alleles respectively, the haplotypes would be listed in the
following order: 111, 112, 113, 121, 122, 123, 211, ..., 422, 423.
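The ordering rule (the rightmost locus changes fastest) can be reproduced with a short Python sketch. The function name and the `sep` parameter are hypothetical; `itertools.product` happens to iterate in exactly this order.

```python
from itertools import product


def haplotype_order(allele_counts, sep=""):
    """List haplotype labels in the order produced by incrementing a
    counter from right to left.

    allele_counts gives the number of alleles at each locus, in map order.
    """
    ranges = [range(1, count + 1) for count in allele_counts]
    return [sep.join(str(allele) for allele in hap) for hap in product(*ranges)]


# Three loci A, B, C with 4, 2 and 3 alleles: 4 * 2 * 3 = 24 haplotypes.
order = haplotype_order([4, 2, 3])
print(order[:7], "...", order[-2:])
```

Passing `sep="-"` gives labels such as 1-1-1, which avoids ambiguity when a locus has ten or more alleles.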
Note: Use of this file is not recommended and is supported only for backward
compatibility with previous versions. Instead you can specify the allele
frequencies in priorallelefreqfile or historicallelefreqfile as fixed, with
option fixedallelefreqs = 1
priorallelefreqfile
This file contains parameter values for the
Dirichlet prior distribution of the allele or haplotype frequencies at each
compound locus in each subpopulation. At each compound locus with k
alleles or k possible haplotypes, a Dirichlet prior distribution is specified
by a vector of k positive numbers. Where these alleles or
haplotypes have been counted directly in samples from an unadmixed
subpopulation, the parameter values should be specified as 0.5 plus the
observed counts of each allele. Where no information is available
about allele or haplotype frequencies at a compound locus in a subpopulation,
or no copies of the allele have been observed in the sample from that
subpopulation, specify 0.5 in the corresponding cells. Specifying
0.5 in all cells, with columns for b subpopulations, is equivalent to
specifying the option populations = b .
Where haplotype frequencies at a compound locus have been estimated from
unordered genotypes, the user should supply the parameters of the Dirichlet
distribution that most closely approximates the posterior distribution of
haplotype frequencies given the observed genotypes and a reference prior, as
described above. The first row is a header row, consisting of strings in
quotes, separated by spaces. The first string in this row is ignored, and the
subsequent strings specify the names of the ancestral subpopulations
contributing to the admixed population.
After the header row, there is one row for each allele (or haplotype) at each
compound locus. The first column in each row gives the name of the
compound locus in quotes. Subsequent columns give the prior parameters
for the frequency of the allele (or haplotype) in each subpopulation,
separated by a single space.
If the compound locus consists of two or more simple loci (see notes above),
the rows list prior parameters for the haplotypes in the order defined by
incrementing a counter from right to left. For instance if there were three
loci A, B, C, with 4, 2 and 3 alleles respectively, the haplotypes would be
listed in the following order: 1-1-1, 1-1-2, 1-1-3, 1-2-1, 1-2-2, 1-2-3,
2-1-1, ..., 4-2-2, 4-2-3. Estimated counts should be given for all
possible haplotypes, however rare. The program will include all possible
haplotypes in the model, but will omit rare haplotypes when constructing test
statistics.
historicallelefreqfile
This file contains observed counts of alleles
or haplotypes at each compound locus in samples from unadmixed
subpopulations. The format of this file is exactly the same as the
format of priorallelefreqfile described above. The only difference
between the two files is that in historicallelefreqfile, 0.5 is not added to
the observed counts.
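The relationship between the two files can be sketched in Python: adding 0.5 to each observed count in a historicallelefreqfile row gives the corresponding priorallelefreqfile parameters. The function name and the counts are made up for illustration.

```python
def counts_to_prior_parameters(count_rows, pseudocount=0.5):
    """Add 0.5 to every observed count to obtain Dirichlet prior parameters.

    count_rows has one row per allele (or haplotype) and one column per
    subpopulation.
    """
    return [[count + pseudocount for count in row] for row in count_rows]


# Hypothetical counts for a diallelic locus in two subpopulations.
counts = [[12, 0], [3, 7]]
print(counts_to_prior_parameters(counts))
```

An allele never observed in a subpopulation (count 0) ends up with a parameter of 0.5, which matches the rule given for the priorallelefreqfile.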
outcomevarfile
This file contains values of one or more outcome variables. After the header row,
the file has one row per individual. Binary variables should be coded as
1 = affected, 0 = unaffected. Missing values are coded as #. The header row contains the variable labels in
quotes separated by spaces. If the file contains more than one outcome variable,
the column containing the first variable of interest should be specified by the
command-line option targetindicator. The number of columns to be used (1 or 2)
can be specified with the outcomes option.
covariatesfile
This file contains values of covariates to be
included in the regression model. It is used only if an outcomevarfile
has been specified, and is optional even then. The header row
contains covariate names in quotes, separated by spaces. Subsequent rows
contain the observed values of these variables. For computational
reasons, the values of the covariates should be centred about their sample
means. Missing values are coded as #.
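Centring a covariate about its sample mean while leaving missing values coded as # might be done as in the Python sketch below. The function name is hypothetical, and computing the mean from the non-missing values only is an assumption of this sketch.

```python
def centre_covariate(values, missing="#"):
    """Subtract the sample mean from each observed value.

    Missing values (coded "#") are passed through unchanged and are
    excluded from the mean.
    """
    observed = [float(v) for v in values if v != missing]
    if not observed:
        return list(values)
    mean = sum(observed) / len(observed)
    return [v if v == missing else float(v) - mean for v in values]


# Made-up column of ages with one missing value.
print(centre_covariate(["40", "#", "50", "60"]))
```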
This file contains survival data for a Cox
regression model. After the header row,
the file has one row per individual and there are three columns. The first
contains the times when each individual began to be observed; the second
contains the times the individuals ceased being observed and the last column
contains the number of events that occurred during the observed period (usually
0 or 1). The start and finish times must be numeric and relative to the same
point in time (usually the first start time).
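Expressing all times relative to the first start time could be done as in the Python sketch below. The function name and the example values are hypothetical; each row is a (start, finish, number of events) triple.

```python
def rebase_times(rows):
    """Shift start and finish times so that the earliest start time is 0."""
    if not rows:
        return []
    origin = min(start for start, _, _ in rows)
    return [(start - origin, finish - origin, events)
            for start, finish, events in rows]


# Made-up calendar years of follow-up for two individuals.
print(rebase_times([(1990.0, 1995.0, 1), (1992.0, 2000.0, 0)]))
```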