UNSUPPORTED SOFTWARE: USE AT OWN RISK
NullHWE - Test for HWE When One Allele is Null
A null allele is one that cannot be detected except when
it is homozygous. For example, in the ABO blood-antigen system, O
is a null allele because a simple blood reaction cannot distinguish
the genotypes AO and AA.
In the presence of a null allele, the observed distribution of genotypes will appear
skewed in favour of homozygotes, because, for example, AO heterozygotes
are counted as AA homozygotes. Therefore, when a standard
test for Hardy-Weinberg Equilibrium (HWE) (e.g. Guo & Thompson's randomization program
HWE, available at http://gause.biology.ualberta.ca/jbrzusto/hwenj.html)
indicates lack of equilibrium, it is useful to see to what extent this apparent disequilibrium might
be explained by the presence of a null allele.
The standard method for estimating the frequency of a null allele is the
Expectation-Maximization (EM) Algorithm, which finds those allele frequencies
that maximize the probability of the observed results under the assumption
of HWE. (This algorithm, for the case of a null allele, is available at
http://gause.biology.ualberta.ca/jbrzusto/nullele.html)
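To make the EM step concrete, here is a minimal Python sketch of gene-counting EM for the A-B-O case (two visible alleles plus one null allele). It is not the code behind the page linked above; the function name em_abo, the starting frequencies, and the stopping rule are assumptions chosen for this illustration, and degenerate inputs (for example, no A, AB, or O individuals at all) are not handled.

```python
def em_abo(n_a, n_b, n_ab, n_o, max_iters=1000, tol=1e-12):
    """Estimate allele frequencies (p, q, r) for A, B and the null allele O
    from phenotype counts, assuming HWE. Hypothetical helper, not the
    author's implementation."""
    total = n_a + n_b + n_ab + n_o
    p = q = r = 1.0 / 3.0  # arbitrary starting point (assumption)
    for _ in range(max_iters):
        # E-step: split each ambiguous phenotype into its expected genotypes.
        # P(AA | phenotype A) = p^2 / (p^2 + 2pr) = p / (p + 2r)
        aa = n_a * p / (p + 2 * r)
        ao = n_a - aa
        bb = n_b * q / (q + 2 * r)
        bo = n_b - bb
        # M-step: count alleles in the expected genotypes.
        new_p = (2 * aa + ao + n_ab) / (2 * total)
        new_q = (2 * bb + bo + n_ab) / (2 * total)
        new_r = (ao + bo + 2 * n_o) / (2 * total)
        change = abs(new_p - p) + abs(new_q - q) + abs(new_r - r)
        p, q, r = new_p, new_q, new_r
        if change < tol:
            break
    return p, q, r


if __name__ == "__main__":
    # Phenotype counts taken from the compatibility example on this page.
    print(em_abo(n_a=10, n_b=15, n_ab=20, n_o=10))
```

Each pass reassigns the A and B phenotypes to homozygous and heterozygous genotypes in the proportions implied by the current frequencies, then recounts the alleles.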
What is left to do is to determine to what extent, even assuming the existence of a null allele,
there is still evidence of disequilibrium.
This web page offers one approach. In what follows, assume there is a null allele.
Two observable genotype distributions are compatible if there
is an allele distribution (including null alleles) that can produce
both of them, each through some arrangement of those alleles into genotypes.
For example, in the A-B-O blood type system,
A:10 B:15 AB:20 O:10
is compatible with
A:5 B:10 AB:25 O:15
because they are both compatible with the allele counts:
a:35 b:45 o:30
under the arrangements:
AA:5 BB:10 AB:20 OO:10 AO:5 BO:5
and
AA:5 BB:10 AB:25 OO:15 AO:0 BO:0,
respectively.
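The arithmetic in this example can be checked with a few lines of Python. The helper names allele_counts and observable are made up for this sketch, and the dictionaries simply restate the two arrangements given above.

```python
def allele_counts(arrangement):
    """Count alleles in a full genotype arrangement such as {'AA': 5, 'AO': 5, ...}."""
    counts = {"a": 0, "b": 0, "o": 0}
    for genotype, n in arrangement.items():
        for allele in genotype:  # each genotype name is two allele letters
            counts[allele.lower()] += n
    return counts


def observable(arrangement):
    """Collapse a full arrangement to what a blood test can see:
    AO is read as A, and BO is read as B."""
    return {
        "A": arrangement["AA"] + arrangement["AO"],
        "B": arrangement["BB"] + arrangement["BO"],
        "AB": arrangement["AB"],
        "O": arrangement["OO"],
    }


first = {"AA": 5, "BB": 10, "AB": 20, "OO": 10, "AO": 5, "BO": 5}
second = {"AA": 5, "BB": 10, "AB": 25, "OO": 15, "AO": 0, "BO": 0}

if __name__ == "__main__":
    for arrangement in (first, second):
        print(observable(arrangement), allele_counts(arrangement))
```

Both arrangements contain 55 individuals and the same 110 alleles, which is what makes the two observable distributions compatible.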
This is a straightforward extension of the notion of compatible genotype distributions
used by Guo & Thompson's HWE randomization test. There, allele
counts are fixed, because all alleles are visible. The randomization
test presented below randomizes over the collection of all genotype
distributions compatible with the observed one.
The test proceeds in the usual randomizing way:
- see how far from HWE the observed data are
- set a counter m to zero
- repeat n (i.e. many) times:
  - randomly generate an element, X, of the universe (i.e. compatible genotype distributions)
  - see if X is at least as far from HWE as the observed data are
  - if so, increase m by one
- the number m / n is an estimate of the probability of getting a genotype distribution at least as
far from HWE as the one observed.
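The steps above can be sketched as a generic Python loop. This is only a skeleton: the callables distance and draw_compatible are hypothetical placeholders supplied by the caller, and the toy example at the bottom uses plain integers rather than genotype distributions, because this page does not spell out how compatible distributions are drawn.

```python
import random


def randomization_p_value(observed, distance, draw_compatible, n, rng):
    """Estimate the probability of drawing an element at least as far
    from HWE as `observed`.

    distance(x)             -> number, larger means further from HWE
    draw_compatible(x, rng) -> a random element of the universe of x
    Both callables are placeholders (assumptions for this sketch).
    """
    d_obs = distance(observed)
    m = 0
    for _ in range(n):
        x = draw_compatible(observed, rng)
        if distance(x) >= d_obs:
            m += 1
    return m / n


if __name__ == "__main__":
    # Toy stand-in: the "universe" is the integers 0..9 and the distance
    # is the value itself. This is NOT the genotype sampler.
    rng = random.Random(1)
    estimate = randomization_p_value(
        observed=3,
        distance=lambda x: x,
        draw_compatible=lambda obs, r: r.randint(0, 9),
        n=10000,
        rng=rng,
    )
    print(estimate)
```

In the actual test, a larger distance corresponds to a lower probability under HWE, so the comparison can equally be written in terms of log probabilities.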
At a general level, this algorithm is identical to that of Guo & Thompson. The differences are:
- the universe of compatible genotype distributions is not just those with the same allele counts as the observed data, but includes any
whose allele counts (null allele included) match those of some full genotype arrangement that could underlie the observed data.
- each compatible genotype distribution is drawn completely at random,
rather than using the kind of partial shuffling used in Guo & Thompson.
This makes it safer to draw conclusions from the output.
- the measure of distance from HWE is not simply the probability of the
observed genotype distribution under HWE and the observed allele count, but rather, the probability
of the observed genotype distribution under HWE and the best possible allele distribution,
obtained by EM. In other words, for both the original and each random, compatible genotype distribution, we
run EM to find the best fit to HWE, then measure just how good that fit is.
Moreover, the probability is calculated using allele frequencies and a multinomial distribution,
rather than using the fixed allele count randomization used by Guo & Thompson (i.e. the
probability used here is the one maximized by EM).
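For the A-B-O case, a multinomial log probability of this kind might be computed as in the Python sketch below. The function name hwe_log_prob is invented for the example, and including the multinomial coefficient is an assumption: it does not change which frequencies EM picks, but it does matter when different genotype distributions are compared, and this page does not say whether the program includes it.

```python
import math


def hwe_log_prob(n_a, n_ab, n_b, n_o, p, q, r):
    """Log multinomial probability of the phenotype counts under HWE,
    given allele frequencies p (A), q (B) and r (null allele O).
    Hypothetical helper; assumes p + q + r == 1."""
    counts = [n_a, n_ab, n_b, n_o]
    # Phenotype probabilities under HWE: A = AA + AO, B = BB + BO.
    probs = [p * p + 2 * p * r, 2 * p * q, q * q + 2 * q * r, r * r]
    total = sum(counts)
    # log of N! / (n_A! n_AB! n_B! n_O!), via the log-gamma function
    log_coef = math.lgamma(total + 1) - sum(math.lgamma(c + 1) for c in counts)
    # Classes with zero individuals contribute nothing (and avoid log(0)).
    log_terms = sum(c * math.log(pr) for c, pr in zip(counts, probs) if c > 0)
    return log_coef + log_terms


if __name__ == "__main__":
    # Counts from the compatibility example, with arbitrary trial frequencies.
    print(hwe_log_prob(10, 20, 15, 10, p=0.3, q=0.4, r=0.3))
```

Working in logs avoids underflow for samples of realistic size, which is presumably why the output reports log probabilities.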
So the final output answers the question, "How often would a compatible genotype
distribution yield as bad or worse a fit to HWE, after using EM to improve the
fit with a null allele, as the observed genotype distribution?"
The sample data are blood types of 2060 Croatians, from Mourant et al. 1976.
The first number is 2, the number of non-null alleles. Then come the numbers
of people with the phenotypes A, AB, and B. Last comes the number
of people with phenotype O, which implies genotype OO.
If you click on "Calculate!", the program uses EM to produce estimates of
all 3 allele frequencies, and the corresponding expected genotype frequencies
under Hardy-Weinberg equilibrium. It then generates the given number of random
compatible distributions, and uses EM on each to obtain a best HWE fit, keeping
track of those that fit no better than the observed data.
Input format:
- the number of non-null allele flavours, n
- the genotype counts: A1,1 A2,1 A2,2 A3,1 A3,2 A3,3 ... An,n
- the number of null homozygotes (assumed to be zero if not supplied)
Numbers must be separated by spaces, tabs, and/or newlines.
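A reader for this input format could look like the Python sketch below. The name parse_input and the returned structure are assumptions made for illustration, and counts are assumed to be whole numbers.

```python
def parse_input(text):
    """Parse 'n A1,1 A2,1 A2,2 ... An,n [null homozygotes]'.

    Returns (n, counts, null_homozygotes), where counts maps (i, j) with
    j <= i to the number of individuals of genotype i,j.
    Hypothetical helper, not the program's own parser."""
    tokens = text.split()  # splits on spaces, tabs and newlines alike
    if not tokens:
        raise ValueError("empty input")
    n = int(tokens[0])
    n_genotypes = n * (n + 1) // 2
    values = [int(t) for t in tokens[1:]]
    if len(values) not in (n_genotypes, n_genotypes + 1):
        raise ValueError(
            f"expected {n_genotypes} genotype counts, "
            f"optionally followed by the null homozygote count"
        )
    counts = {}
    k = 0
    for i in range(1, n + 1):
        for j in range(1, i + 1):  # order: 1,1  2,1  2,2  3,1 ...
            counts[(i, j)] = values[k]
            k += 1
    null_homozygotes = values[k] if len(values) > n_genotypes else 0
    return n, counts, null_homozygotes


if __name__ == "__main__":
    # A:10 AB:20 B:15 O:10, the first distribution from the example above.
    print(parse_input("2 10 20 15 10"))
```

With n = 2 the three genotype counts correspond to the phenotypes A, AB, and B, in that order.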
Output:
- the EM estimates of allele and genotype frequencies for the observed data
- log probability for the EM run of each randomly drawn compatible genotype distribution
- log probability for the EM run of the observed data
- proportion of random runs whose probability (under HWE) is less than or equal to that of the observed data
Output is rounded to four decimal places for allele frequencies, and two decimal places for genotype counts.
Comments, complaints, questions to John Brzustowski. This is
free software, with source available in this .zip archive.