nullhwe - HWE test with null alleles

UNSUPPORTED SOFTWARE: USE AT OWN RISK

NullHWE - Test for HWE When One Allele is Null

A null allele is one that isn't detected except when it occurs homozygously. For example, in the ABO blood-antigen system, O is a null allele because a simple blood-typing reaction cannot distinguish the genotypes AO and AA.

In the presence of a null allele, the apparent distribution of genotypes will appear skewed in favour of homozygotes, because, for example, AO heterozygotes are counted as AA homozygotes. Therefore, when a standard test for Hardy-Weinberg Equilibrium (HWE) (e.g. Guo & Thompson's randomization program HWE, available at http://gause.biology.ualberta.ca/jbrzusto/hwenj.html) indicates lack of equilibrium, it is useful to see to what extent this apparent disequilibrium might be explained by the presence of a null allele.

The standard method for estimating the frequency of a null allele is the Expectation-Maximization (EM) Algorithm, which finds those allele frequencies that maximize the probability of the observed results under the assumption of HWE. (This algorithm, for the case of a null allele, is available at http://gause.biology.ualberta.ca/jbrzusto/nullele.html) What remains is to determine to what extent, even assuming the existence of a null allele, there is still evidence of disequilibrium.
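As an illustration of the EM step described above, here is a minimal sketch in Python. It is not the code behind this page: the function name em_null_allele, its argument layout, the uniform starting point, and the fixed iteration count are all assumptions made for the example.

```python
def em_null_allele(diag, het, null_homo, iterations=200):
    """EM estimate of allele frequencies when one allele is null.

    diag[i]     -- individuals showing only allele i (genotype i/i or i/null)
    het[(i, j)] -- individuals showing both alleles i and j
    null_homo   -- individuals showing no allele (null/null)
    Returns (p, p0): visible-allele frequencies and the null frequency.
    """
    n = len(diag)
    total = sum(diag) + sum(het.values()) + null_homo
    p = [1.0 / (n + 1)] * n      # assumed starting point: all alleles equal
    p0 = 1.0 / (n + 1)
    for _ in range(iterations):
        allele = [0.0] * n
        null = 2.0 * null_homo   # each null homozygote carries two nulls
        # E-step: split each "only i" count between i/i and i/null
        for i, c in enumerate(diag):
            if c == 0:
                continue
            denom = p[i] + 2.0 * p0
            allele[i] += c * (2.0 * p[i] + 2.0 * p0) / denom
            null += c * 2.0 * p0 / denom
        # visible heterozygotes contribute one copy of each allele
        for (i, j), c in het.items():
            allele[i] += c
            allele[j] += c
        # M-step: expected allele counts become the new frequencies
        p = [a / (2.0 * total) for a in allele]
        p0 = null / (2.0 * total)
    return p, p0


# Counts built exactly from frequencies 0.3, 0.2 and 0.5 (null)
p, p0 = em_null_allele([3900, 2400], {(0, 1): 1200}, 2500)
```

With counts constructed exactly from HWE proportions, as in the call above, the estimates should come back close to the frequencies used to build them.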

This web page offers one approach. In what follows, assume there is a null allele.
Two observable genotype distributions are compatible if there is an allele distribution (including null alleles) with which they are both compatible. For example, in the A-B-O blood type system,
A:10 B:15 AB:20 O:10
is compatible with
A:5 B:10 AB:25 O:15
because they are both compatible with the allele counts:
a:35 b:45 o:30
under the arrangements:
AA:5 BB:10 AB:20 OO:10 AO:5 BO:5
and
AA:5 BB:10 AB:25 OO:15 AO:0 BO:0,
respectively.
This is a straightforward extension of the notion of compatible genotype distributions used by Guo & Thompson's HWE randomization test. There, allele counts are fixed, because all alleles are visible. The randomization test presented below randomizes over the collection of all genotype distributions compatible with the observed one.
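The A-B-O example above can be checked mechanically. The sketch below is only an illustration; the helper names allele_counts and observable, and the choice of "o" as the null label, are assumptions made here and are not part of the program.

```python
NULL = "o"  # label chosen here for the null allele


def allele_counts(genotypes):
    """Count allele copies in a full genotype table {(x, y): individuals}."""
    counts = {}
    for (x, y), c in genotypes.items():
        if c:
            counts[x] = counts.get(x, 0) + c
            counts[y] = counts.get(y, 0) + c
    return counts


def observable(genotypes):
    """Collapse a full genotype table to the distribution that can be seen."""
    seen = {}
    for (x, y), c in genotypes.items():
        visible = sorted({x, y} - {NULL})   # the null allele never shows
        label = "".join(visible).upper() if visible else NULL.upper()
        if c:
            seen[label] = seen.get(label, 0) + c
    return seen


# The two arrangements given in the text
first = {("a", "a"): 5, ("b", "b"): 10, ("a", "b"): 20,
         ("o", "o"): 10, ("a", "o"): 5, ("b", "o"): 5}
second = {("a", "a"): 5, ("b", "b"): 10, ("a", "b"): 25,
          ("o", "o"): 15, ("a", "o"): 0, ("b", "o"): 0}

# Compatible: both arrangements use the same allele counts
same_alleles = allele_counts(first) == allele_counts(second)
```

Both arrangements reduce to the allele counts a:35 b:45 o:30, and observable() recovers the two observable distributions quoted in the example.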

The test proceeds in the usual randomizing way:

set m to 0
see how far from HWE the observed data are
repeat n (i.e. many) times:
- randomly generate an element, X, of the universe (i.e. compatible genotype distributions)
- see if X is at least as far from HWE as the observed data are
- if so, increase m by one
the number m / n is an estimate of the probability of getting a genotype distribution at least as far from HWE as the one observed.
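The loop above can be sketched in Python as follows. This shows only the generic skeleton: draw_compatible and log_prob are hypothetical callables standing in for the sampler and the fit measure, neither of which is specified at this point in the text. "At least as far from HWE" is taken here to mean a log probability less than or equal to that of the observed data.

```python
import random


def randomization_p_value(observed, draw_compatible, log_prob, n, rng):
    """Estimate how often a random compatible distribution fits HWE
    no better than the observed one.

    draw_compatible(observed, rng) -- hypothetical sampler returning one
                                      compatible genotype distribution
    log_prob(x)                    -- hypothetical fit measure; lower
                                      values mean further from HWE
    """
    observed_fit = log_prob(observed)
    m = 0
    for _ in range(n):
        x = draw_compatible(observed, rng)
        if log_prob(x) <= observed_fit:   # X is at least as far from HWE
            m += 1
    return m / n


# Toy usage: the "distributions" are plain numbers and the fit measure
# is the number itself, so the result is the share of draws <= 3.
rng = random.Random(1)
estimate = randomization_p_value(
    3,
    lambda obs, r: r.randint(1, 10),
    lambda x: x,
    1000,
    rng,
)
```

Passing the sampler and the fit measure as arguments keeps the loop independent of how compatible distributions are drawn and how the fit is scored.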

At a general level, this algorithm is identical to Guo & Thompson. The differences are:

the universe of compatible genotype distributions is not just those with the same allele counts, but includes any with the same allele counts as a genotype distribution compatible with the observed data.
each compatible genotype distribution is drawn completely at random, rather than using the kind of partial shuffling used in Guo & Thompson. This makes it safer to draw conclusions from the output.
the measure of distance from HWE is not simply the probability of the observed genotype distribution under HWE and the observed allele count, but rather the probability of the observed genotype distribution under HWE and the best possible allele distribution, obtained by EM. In other words, for both the original and each random, compatible genotype distribution, we run EM to find the best fit to HWE, then measure just how good that fit is. Moreover, the probability is calculated using allele frequencies and a multinomial distribution, rather than using the fixed allele count randomization used by Guo & Thompson (i.e. the probability used here is the one maximized by EM).
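To make the fit measure in the last point concrete, the following Python sketch computes a multinomial log probability of the observable counts under HWE for given allele frequencies. The function name and argument layout are assumptions made for this example, and including the multinomial coefficient is also an assumption; the program itself may drop constant terms.

```python
import math


def log_multinomial_prob(diag, het, null_homo, p, p0):
    """Log probability of observable counts under HWE with a null allele.

    diag[i]     -- individuals showing only allele i (i/i or i/null)
    het[(i, j)] -- individuals showing both alleles i and j
    null_homo   -- individuals showing no allele
    p, p0       -- visible-allele frequencies and the null frequency
    """
    cells = []  # (count, probability) for each observable class
    for i, c in enumerate(diag):
        cells.append((c, p[i] * p[i] + 2.0 * p[i] * p0))
    for (i, j), c in het.items():
        cells.append((c, 2.0 * p[i] * p[j]))
    cells.append((null_homo, p0 * p0))

    total = sum(c for c, _ in cells)
    logp = math.lgamma(total + 1)          # log of N!
    for c, q in cells:
        logp -= math.lgamma(c + 1)         # minus log of c!
        if c > 0:
            if q <= 0.0:
                return float("-inf")       # observed class is impossible
            logp += c * math.log(q)
    return logp


# One individual showing only allele 0, with p = 0.5 and null p0 = 0.5
single = log_multinomial_prob([1], {}, 0, [0.5], 0.5)
```

For a single individual the coefficient is 1, so the value in the usage line is simply the log of the class probability p^2 + 2 p p0.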

So the final output answers the question, "How often would a compatible genotype distribution yield as bad or worse a fit to HWE, after using EM to improve the fit with a null allele, as the observed genotype distribution?"

The sample data is for blood types of 2060 Croatians, from Mourant et al. 1976. The first number is 2, the number of non-null alleles. Then come the numbers of people with the phenotypes A, AB, and B. Last comes the number of people with phenotype O, which implies genotype OO. If you click on "Calculate!", the program uses EM to produce estimates of all 3 allele frequencies, and the corresponding expected genotype frequencies under Hardy-Weinberg equilibrium. It then generates the given number of random compatible distributions, and uses EM on each to obtain a best HWE fit, keeping track of those that fit no better than the observed data.

Input format:

the number of non-null allele flavours, n
the observed genotype counts, row by row through the lower triangle: A_1,1 A_2,1 A_2,2 A_3,1 A_3,2 A_3,3 ... A_n,n (A_i,i counts everyone who shows only allele i)
the number of null homozygotes (assumed to be zero if not supplied)

Numbers must be separated by spaces, tabs, and/or newlines.
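A reader for this input format might look like the Python sketch below. The function name parse_input and the returned layout are assumptions made for illustration, as is the restriction to whole-number counts.

```python
def parse_input(text):
    """Parse 'n, lower-triangular counts, optional null homozygotes'.

    Returns (n, rows, null_homo), where rows[i] holds the counts
    A_(i+1),1 ... A_(i+1),(i+1).
    """
    numbers = [int(tok) for tok in text.split()]  # any whitespace separates
    if not numbers:
        raise ValueError("empty input")
    n = numbers[0]
    n_cells = n * (n + 1) // 2                    # size of the lower triangle
    body = numbers[1:1 + n_cells]
    rest = numbers[1 + n_cells:]
    if len(body) != n_cells or len(rest) > 1:
        raise ValueError("expected %d counts plus an optional "
                         "null-homozygote count" % n_cells)
    null_homo = rest[0] if rest else 0            # zero if not supplied
    rows, k = [], 0
    for i in range(n):
        rows.append(body[k:k + i + 1])
        k += i + 1
    return n, rows, null_homo


# Two visible alleles: A, AB, B counts, then the O count
n, rows, null_homo = parse_input("2 10 20 15\n10")
```

The usage line reads the small A-B-O example (A:10 AB:20 B:15 O:10) with the null-homozygote count on its own line.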

Output:

the EM estimates of allele and genotype frequencies for the observed data
log probability for the EM run of each randomly drawn compatible genotype distribution
log probability for the EM run of the observed data
proportion of random runs with a probability (under HWE) less than or equal to that of the observed data

Output is rounded to 4 decimal places for allele frequencies, and 2 decimal places for genotype counts.

Comments, complaints, questions to John Brzustowski. This is free software, with source available in this .zip archive.