Doh

UNSUPPORTED SOFTWARE: USE AT OWN RISK

Assignment Calculator

New: You should now be able to cut/paste large datasets directly into/out of text boxes in this calculator, at least if your web browser is not too old. This means you don't have to use the Virtual Clipboard and should eliminate the need for this calculator to read and write directly to your machine. You may have to use the Paste key, which is often Ctrl-V, to get data into a window.

Note: until I get time to update it, you will need to use a Java 1.1 browser to run this program. Netscape version 4.X should work. Versions 6 and later will probably not work.

New: Finding likely migrants

New: Subsampling to Equalize Population Sizes

Long Overdue: Citing this program

Use the Calculator

About Assignment

Randomization for confidence intervals

About Cumulative Heterozygosity

Using Dominant Marker data

About the Input Format

About the Output

Assigning Individuals from Unknown Populations

How to deal with missing data

Doh vs. WHICHRUN - read if you are new to Doh!

Reading data from files on your machine.

The Virtual Clipboard for large datasets

What are page buttons for?

Running this program without the network

Creating a dataset from a POPGEN file

Running assignment on a batch of files

Bugs, Fixes, and Changes

Credits

About Assignment

This calculator takes genotypes of individuals from several populations and determines from which population each individual is most likely to have come, by using the assignment index, the highest probability of an individual's genotype in any of the populations. Calculations are as described in

Paetkau, D., W. Calvert, I. Sterling, and C. Strobeck. 1995. Microsatellite analysis of population structure in Canadian polar bears. Molecular Ecology 4:347-354

and

Paetkau, D., L. P. Waits, P. L. Clarkson, L. Craighead, and C. Strobeck. 1997. An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics 147:1943-1957
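
As a rough illustration of the idea (not the calculator's actual source), here is a minimal Python sketch. It assumes loci are independent and each population is in Hardy-Weinberg equilibrium, and it takes allele frequencies as given; the zero-frequency replacement and the removal of an individual from its own population are left out. The names genotype_probability and assign are made up for this example.

```python
from collections import Counter
from math import factorial, prod

def genotype_probability(genotype, locus_freqs):
    """Probability of a multilocus genotype in one population.

    genotype:    list of loci, each a tuple of allele copies, e.g. [(184, 198)]
    locus_freqs: list of dicts (one per locus) mapping allele -> frequency
    """
    prob = 1.0
    for alleles, freqs in zip(genotype, locus_freqs):
        # Multinomial coefficient: 2 for a diploid heterozygote, 1 for a homozygote.
        coeff = factorial(len(alleles))
        for copies in Counter(alleles).values():
            coeff //= factorial(copies)
        prob *= coeff * prod(freqs[a] for a in alleles)
    return prob

def assign(genotype, pop_freqs):
    """Return (most likely population, {population: genotype probability})."""
    probs = {pop: genotype_probability(genotype, f) for pop, f in pop_freqs.items()}
    return max(probs, key=probs.get), probs
```

The assignment index is then the largest value in the returned dictionary.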

About the Input Format

Hit the Create Dataset button to pop up an input datasheet. Don't forget to fill in the number of loci. Duplicate individual names will automatically be suffixed with "-2", "-3" and so on. Names must not include spaces or TAB characters. Make sure the options page corresponds to the format of your data. The Individual Genotypes window should be filled with blank/TAB/newline separated items like this:

GB01  Admiralty 184 198 152 158 105 113 184 184
SIT01 Baranof   184 184 152 154 105 113 178 182

where the first item is an individual name, the second is a population name, and the rest of the items are the number_of_loci * ploidy allele copies for that individual, in order by locus. Alleles can be represented by numbers or words. Space is not significant in the input, as long as at least one space, TAB, or newline separates each item. The Options page of the input datasheet allows you to specify whether individual and/or population names are present, and if so, in what order. If names are not present, individuals and/or populations are numbered sequentially.
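
If you want to check a dataset before pasting it in, the layout above can be read with a few lines of Python. This is only a sketch of the format as described, assuming both the individual and population names are present in that order; parse_genotypes is a hypothetical helper, and it does not imitate the automatic renaming of duplicate individuals.

```python
def parse_genotypes(text, num_loci, ploidy=2):
    """Split whitespace-separated items into (individual, population, loci) records."""
    items = text.split()  # blanks, TABs and newlines are all just separators
    width = 2 + num_loci * ploidy  # two names plus the allele copies
    if len(items) % width != 0:
        raise ValueError("item count is not a multiple of %d" % width)
    records = []
    for start in range(0, len(items), width):
        name, population, *alleles = items[start:start + width]
        # Group the allele copies locus by locus.
        loci = [tuple(alleles[j:j + ploidy]) for j in range(0, len(alleles), ploidy)]
        records.append((name, population, loci))
    return records
```

Alleles are kept as strings, since the calculator accepts either numbers or words as allele labels.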

Individuals From Unknown Populations

If you have individuals whose source population is unknown and you wish to find the population in which their genotypes are most likely, include them in your dataset with the symbol ? (question mark) as the population name. You'll still need to assign an individual name. Any individuals with unknown population will be listed at the end of the Individual Assignment output after you run the test. The results from this analysis should be the same as you'd get from WHICHRUN.

Dealing with Zeroes

Double click on the Options for Assignment Test listbox entry to pop up the option datasheet. The main option is what value to use in place of a zero allele frequency, i.e. when the allele has not actually been detected in a population, or when the only individual in which it has been found is removed from that population. You can choose either to always use f = 0.01 or to use f = 1 / C where C = 1 + ploidy * size_of_the_appropriate_population. The latter choice amounts to pretending that the next allele sampled from the population would have been of the given type. In both cases, frequencies of the other alleles for the given locus and population are not changed, and so together, they will sum to slightly more than one.

New: there is now another option for dealing with zero frequencies: you can choose to have all frequencies adjusted in every population using the formula p' = (f + 1/a) / (n + 1), where p' is the adjusted probability estimate for a given allele at a locus in a population, f is the number of allele copies of the given type, a is the number of allele flavours for that locus, and n is the number of gene copies for that locus in the given population, not including missing copies. For comparison, the usual, unadjusted probability estimate is just p = f / n. This zero-avoidance device is described in Titterington et al. (1981) J. R. Statist. Soc. A. 144:145-175. Results from using this method can differ considerably from those using the other two.
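
The three choices can be written out as follows. This is a sketch built directly from the formulas above, not the program's code; the function names are invented for the example, count stands for f, n for the number of non-missing gene copies at the locus, and pop_size for the size of the appropriate population.

```python
def freq_fixed(count, n, value=0.01):
    """Use a fixed value in place of a zero frequency; other alleles keep f / n."""
    return count / n if count > 0 else value

def freq_one_over_c(count, n, ploidy, pop_size):
    """Use 1 / C, with C = 1 + ploidy * pop_size, in place of a zero frequency."""
    return count / n if count > 0 else 1.0 / (1 + ploidy * pop_size)

def freq_adjusted(count, n, num_alleles):
    """Adjust every frequency: p' = (f + 1/a) / (n + 1)."""
    return (count + 1.0 / num_alleles) / (n + 1)
```

Note that the adjusted frequencies still sum to one over the a alleles at a locus, whereas the first two methods leave the other alleles untouched and so sum to slightly more than one.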

Once you have hit OK on both the input and option datasheets, hit the Calculate button to run the assignment test. If the datasheet complains instead, there is a problem with the way your data is entered; if you are lucky, the complaint will help you figure out what it is.

(back to top)

About the Output

Running the assignment test produces two datasheets:

The Assignment Datasheet contains:

for each individual:
- its nominal population (where it was sampled)
- its assigned population (where its genotype has highest probability)
- the probability of its genotype in the assigned population
- the probability of its genotype in every population
a population-level assignment matrix, giving, for each pair of populations, the number of individuals sampled in the first but assigned to the second.
a matrix A defined by:
A_x,y = 1/n_x Sum over i in x of [ log₁₀ ( Pr_x(g_i) / Pr_y(g_i) ) ]
where:
- x, y are populations
- n_x is the size of population x
- g_i is the genotype of individual i
- Pr_x is the genotype probability calculated in population x

So A_x,y is a measure of how much more likely genotypes of individuals sampled in population x are in population x than in population y. A is not symmetric.

The second datasheet contains a (lower triangular) matrix of distances between each pair of populations, calculated as:

d_x,y = ( A_x,y + A_y,x ) / 2

These are not necessarily distances in the metric sense. (Separate datasheets are used in anticipation of better integration between this program and others in the near future.)
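
To make the two definitions concrete, here is a small Python sketch that computes A and d from a table of genotype probabilities. It assumes you already have, for each individual, its nominal population and its genotype probability in every population; a_matrix and distance are hypothetical names, not part of the calculator.

```python
from math import log10

def a_matrix(nominal, probs, pops):
    """A[x][y] = mean over individuals i sampled in x of log10(Pr_x(g_i) / Pr_y(g_i)).

    nominal: list giving each individual's sampled population
    probs:   list of dicts, probs[i][y] = Pr_y(g_i)
    pops:    list of population names (each must have at least one individual)
    """
    A = {x: {} for x in pops}
    for x in pops:
        members = [i for i, pop in enumerate(nominal) if pop == x]
        for y in pops:
            total = sum(log10(probs[i][x] / probs[i][y]) for i in members)
            A[x][y] = total / len(members)
    return A

def distance(A, x, y):
    """d_x,y = (A_x,y + A_y,x) / 2, which is symmetric even though A is not."""
    return (A[x][y] + A[y][x]) / 2
```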

Output looks ugly, but is meant to be pasted into a spreadsheet.

The distance matrix is suitable for use in programs like Phylip Neighbor and a web-based tree-building program. You can copy/paste the distance matrix from here using either the system clipboard, or the virtual clipboard.

(back to top)

Assignment Calculator

This program requires a Java-enabled browser. Perhaps your browser has Java disabled?

(back to top)

Missing Data

Use a '-' (minus sign without the quotes) in the place of each missing allele copy (making sure it has space before and after, just like any other allele label); e.g. a missing locus in a diploid organism is entered as ' - - '. The program simply drops the probabilities for missing allele copies from the calculation of the assignment index. Note: the assignment index for an individual with missing data cannot be directly compared to the index for an individual with all data, or with data missing in different locations, because each allele added to the calculation can only decrease (or at least not increase) the index value. However, it may be legitimate to compare the ratio (probability of an individual in its assigned population / probability in its nominal (input) population) between such individuals.

(back to top)

Virtual Clipboard (for large files)

The Virtual Clipboard lets you paste data into or copy data out of multi-page windows. The ---> VC and <--- VC buttons move data to and from the virtual clipboard. If you see a "Browse" button below, then your browser will also let you copy a file on your machine to the Virtual Clipboard. You can then paste it from the V.C. into the calculator's input windows.


The Virtual Clipboard is Brainless

Is your window full of garbage after pasting? Unfortunately, the Virtual Clipboard knows nothing about file formats, so cutting and pasting directly from an Excel file will fill your window with junk (i.e. raw Excel file bytes). You can, however, copy from the open Excel worksheet to a text editor, then save as text in a new file which you can then copy to the Virtual Clipboard. You can also just save a copy of your Excel file in tab-delimited text format, and copy that file to the VC. The point of the VC is just to help work around the limitations on window size in Netscape, MS Explorer, and probably other browsers.

(back to top)

(go to Calculator)

Multi-Page TextBoxes (for medium-sized datasets)

Some browsers limit the size of text windows, so this program allows for bigger input and output using multiple pages. Each page is limited to around 10000 characters. To see if you have filled a page, try typing more characters at the end of the window. When you use your system's copy/paste functions to get text in and out of this program, you have to do so one page at a time; the program has no control over this.

(go to top)

Running, Reading, and Writing on Your Machine

You can download this web page and calculator so that you can run it from your own machine, without using the network. The only disadvantage of this is that you won't necessarily be running the latest version. If you are using Netscape Navigator/Communicator V4.0 or higher, you can load data into a text box directly from a file on your machine, without going through the virtual clipboard (and thus avoid sending your data over the network). The Load and Save buttons below a data entry window will pop up a file dialog, allowing you to load (save) the contents of the window from (to) a file. Netscape will ask your permission for these operations. Unfortunately, other browsers do not (yet?) support the same security mechanism. Netscape Communicator is now open source, so this is the only security mechanism I will support for now. To use these features, make a local copy of Doh:

save this html/php page as Doh.html to your machine using your browser's "File/Save As..."
save the file Doh.zip in the same directory by clicking here
close your browser (this is necessary)
run your browser and open Doh from e.g. "c:\Doh\Doh.html", or wherever you saved it

(go to top)

Cumulative Heterozygosity and Allele Count

Doh can calculate cumulative expected and observed heterozygosity, cumulative allele count, and cumulative probability of identity (P(ID)) as individuals are considered one by one. To use this feature:

create a Genotyped Individuals dataset as for the assignment test
select appropriate options by editing the options for cumulative heterozygosity... dataset
select the Genotyped Individuals dataset in the list
click on the Calculate Cumulative Heterozygosity... button

Options are:

missing alleles:
- treat every missing allele as unique, so that e.g. 'X -' at a diploid locus counts as two alleles and as a heterozygote
- or ignore missing alleles, so that e.g. 'X -' at a diploid locus counts as one allele (and as a homozygote), and '- -' is not counted at all
order in which to add individuals:
- consider individuals in the order they appear in the input
- or add individuals in random order, repeating the process however many times you specify in the Number of Runs option, and report average values of the heterozygosities and allele counts
population: by default, all individuals are used, but you can select to use only those from a population by giving its name

Output for this command is rounded to three digits after the decimal.

Probability of Identity P(ID)

This is the probability that two randomly drawn diploid genotypes will be identical, assuming observed allele frequencies and random assortment. The program uses the unbiased estimator of P(ID) found in Paetkau et al. (1998) Conservation Biology 12(2):418-29, unless the number of alleles found at a locus is less than 4, in which case the standard biased estimator is used. In the unusual case where no non-missing alleles have been found for a locus, and you have chosen to have Doh ignore missing alleles, P(ID) is set to one.
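
For reference, the usual textbook form of the biased single-locus estimator is the sum of p_i^4 over alleles plus the sum of (2 p_i p_j)^2 over allele pairs. The sketch below assumes that this is the "standard biased estimator" meant above; the unbiased estimator from Paetkau et al. is not reproduced here, and pid_biased is a made-up name.

```python
from itertools import combinations

def pid_biased(freqs):
    """Single-locus P(ID) from allele frequencies that sum to one."""
    # Two random genotypes are both the same homozygote ii with probability (p_i^2)^2 ...
    homozygous = sum(p ** 4 for p in freqs)
    # ... and both the same heterozygote ij with probability (2 p_i p_j)^2.
    heterozygous = sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return homozygous + heterozygous
```

With a single allele at frequency one this gives P(ID) = 1, and the value drops as alleles become more numerous and more even in frequency.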

Output Format for Cumulative Calculations

Output from these commands is a matrix of numbers. Each row represents a locus, except that the last row represents the mean for all loci. The first column has results from considering a single individual, the second column, from two individuals, and so on, up to the number of individuals chosen for analysis.
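
Here is a sketch of how one row of that matrix could be built for a single diploid locus under the "ignore missing alleles" option, for allele count and observed heterozygosity only. It follows the option description above ('X -' counts as one allele and a homozygote, '- -' is not counted), but the function name and the use of None before any individual has been typed are my own choices for the example.

```python
def cumulative_row(genotypes, missing="-"):
    """Return [(allele_count, observed_heterozygosity), ...], one entry per individual added.

    genotypes: list of (copy1, copy2) pairs at one locus, in the order they are added
    """
    seen = set()   # distinct alleles found so far
    typed = 0      # individuals with at least one non-missing copy
    hets = 0       # individuals with two different non-missing copies
    row = []
    for pair in genotypes:
        present = [a for a in pair if a != missing]
        seen.update(present)
        if present:
            typed += 1
            if len(set(present)) == 2:
                hets += 1
        row.append((len(seen), hets / typed if typed else None))
    return row
```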

(go to top)

Randomizing for Confidence Intervals

Doh can repeatedly randomize your data and re-calculate the assignment test. This allows you to test several null hypotheses about the numbers of individuals cross-assigned between populations. The options are:

1: draw existing individuals within each population. Each population is randomly redrawn, with replacement, from its own individuals. Basically useless!!!
2: draw existing individuals from combined populations. Each population is randomly redrawn, with replacement, from the combined pool of all individuals from all populations. (H0: Populations are actually one well-mixed population.)
3: draw new individuals from each population gene pool. For each population, every individual's genotype is re-drawn from the gene pool for that population, assuming Hardy-Weinberg equilibrium, and sampling with replacement. (H0: Each population is in HWE, but populations are distinct.)
4: draw new individuals from the total gene pool for all populations. For each population, every individual's genotype is re-drawn from the combined gene pool of all populations, assuming HWE and sampling with replacement (H0: Populations are actually one well-mixed population in HWE.)

If randomization (4) suggests rejection of H0, then it's likely that one or both of randomizations (2) and (3) will too. If randomization (2) rejects H0, but randomization (3) doesn't, then the cross assignments you observed are partly due to HW disequilibrium, which can make individuals with multiple rare alleles seem unlikely within their own populations.
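
The four schemes differ only in what is redrawn (whole individuals or allele copies) and from where (each population or the combined pool). The Python sketch below illustrates that; it is not the applet's code. Individuals are represented as tuples of loci, each locus a tuple of allele copies, missing data is not handled, and both function names are hypothetical.

```python
import random

def redraw_individuals(pops, combined=False, rng=random):
    """Schemes 1 (combined=False) and 2 (combined=True): resample whole individuals."""
    everyone = [ind for inds in pops.values() for ind in inds]
    return {name: [rng.choice(everyone if combined else inds) for _ in inds]
            for name, inds in pops.items()}

def redraw_from_gene_pool(pops, combined=False, rng=random):
    """Schemes 3 (combined=False) and 4 (combined=True): redraw allele copies under HWE."""
    def gene_pools(inds):
        # One list of allele copies per locus.
        return [[a for ind in inds for a in ind[locus]] for locus in range(len(inds[0]))]
    everyone = [ind for inds in pops.values() for ind in inds]
    result = {}
    for name, inds in pops.items():
        pools = gene_pools(everyone if combined else inds)
        ploidy = len(inds[0][0])
        # Each copy is drawn independently, with replacement, from the locus pool.
        result[name] = [tuple(tuple(rng.choice(pool) for _ in range(ploidy)) for pool in pools)
                        for _ in inds]
    return result
```

In every scheme the randomized populations keep their original sizes, so the cross-assignment counts stay comparable with those from the observed data.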

While randomization proceeds, which might take some time, a histogram of numbers of population to population cross-assignments for each pair of populations is displayed for your amusement. The pink bar is the number of cross-assignments for the observed data, while the black bars accumulate a histogram of cross-assignments for the randomized populations. Again, for now, this graph is ephemeral.

The Randomization Output

The output consists of three new panes in the assignment test output:

Random Assign Mean: a matrix giving the mean numbers of individuals assigned from the row population to the column population in the randomized datasets
Random Assign Var: a matrix giving the variance of numbers of individuals assigned from the row population to the column population in the randomized datasets
Random Assign Num: a matrix telling you how often (i.e. in how many randomized datasets) the cross assignment from row to column was at least as large as in the assignment test run on your data without randomization. This is the size of the tail end of the cross-assignment distribution beyond your observed value. If you divide this number by the number of randomizations, you get an estimate of the probability of observing at least as many cross assignments as you saw if the appropriate null hypothesis is true.
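
For one row/column pair, the Random Assign Num entry and the probability estimate derived from it amount to the following. This is a sketch; tail_probability is a made-up name, and the list of randomized counts is assumed to come from your own bookkeeping, since the applet reports only the summary matrices.

```python
def tail_probability(observed, randomized):
    """Return (count, fraction) of randomized cross-assignment counts >= the observed one."""
    num = sum(1 for value in randomized if value >= observed)
    return num, num / len(randomized)
```

For example, if the observed cross-assignment count is 3 and five randomizations gave 0, 1, 3, 4 and 2, then two of the five are at least as large, for an estimated probability of 0.4.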

Subsampling to Equalize Population Sizes

As a special case, Doh can run a randomization as follows:

Let M be the size of the smallest population
Repeat numRand times:
- randomly select M individuals from each population without replacement
- compute the assignment index on these new equal-sized populations
- record the number of individuals assigned between each pair of populations i -> j
report the mean and variance of numbers of individuals assigned for each pair of populations i -> j

This (or at least my implementation of it) is a quick and dirty technique. Martin Carlsson (martin.carlsson@ebc.uu.se) suggested it might be useful for getting around possible problems with highly unequal population sizes.
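
The procedure can be sketched in Python as below. This is an illustration of the steps listed above, not the program's code: count_assignments is a hypothetical callback that runs the assignment test on a set of populations and returns a dictionary of counts keyed by (i, j), and the variance shown is the population variance, which is an assumption about what Doh reports.

```python
import random

def equalize(pops, rng=random):
    """Draw M individuals from each population without replacement, M = smallest size."""
    m = min(len(inds) for inds in pops.values())
    return {name: rng.sample(inds, m) for name, inds in pops.items()}

def subsample_runs(pops, count_assignments, num_rand, rng=random):
    """Mean and variance of the i -> j assignment counts over num_rand subsamples."""
    totals, squares = {}, {}
    for _ in range(num_rand):
        counts = count_assignments(equalize(pops, rng))
        for pair, c in counts.items():
            totals[pair] = totals.get(pair, 0) + c
            squares[pair] = squares.get(pair, 0) + c * c
    mean = {pair: t / num_rand for pair, t in totals.items()}
    var = {pair: squares[pair] / num_rand - mean[pair] ** 2 for pair in totals}
    return mean, var
```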

In the Assignment Output panel, you will only see values for these items:

numPop: the number of populations
randMeanAssigned: the mean number of individuals assigned from population i to population j
randVarAssigned: the variance in numbers of individuals assigned from population i to population j

In particular, no summaries are provided for assignments of individuals, and no A or distance matrix is computed.

(go to top)

Finding Likely Migrants

Here's one approach to looking for individuals who might have come from a population other than the one from which they were sampled:

view the Options for Assignment Test panel
- click the "Options" tab and turn on "Collect extremal stats..."
- click the "Rand. Method" tab and select "draw new individuals from each population gene pool; assumes HWE"
- click the "Num. Rand." tab and fill in a reasonable number.
run the program using the "Calculate Assignment" button
the Assignment Test Extremal Statistics dataset which is generated has two tabs:
- Individual Log Likelihoods: for each individual, this lists that individual's nominal population, whether it was cross-assigned (i.e. had a higher L in a different population), L (log of probability in its nominal population), and P, the fraction of individuals randomly drawn from the same population which had equal or smaller L values. A small P value indicates that such an unusual individual would only rarely occur due to random sampling under HWE.
- Population Log Likelihoods: for each population K, consider the values log(probability of I in K) for individuals(I) sampled from there. For a population with N individuals, this will be a set of N numbers. The variance and range (max - min) of those numbers are measures of the spread of likelihoods for that population. You might expect that the spread would be larger if a population consisted of individuals coming from separate groups (each group under its own HWE) than if they all came from one well-mixed population under HWE. To see whether a given variance or spread is unusually large, we compare it to the values for a large number of randomly drawn HWE populations. The P value is the fraction of random populations which had equal or greater values of the variance or range (respectively). A small P value indicates the observed population has an unusually large variance or range of individual L values. This may indicate the population is not a homogeneous group, but contains individuals from several sources. Unlike the numbers of cross-assignments, this value is computed using only the data from within each population. This may be useful in situations where you suspect some individuals may be coming from a population which you have not sampled directly.
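
Both kinds of P value are simple tail fractions, one in each direction. A minimal sketch, with invented function names and assuming you have the log likelihoods from the randomized populations at hand:

```python
def individual_p(observed_l, random_ls):
    """Fraction of randomly drawn individuals whose L is equal to or smaller than the observed L."""
    return sum(1 for l in random_ls if l <= observed_l) / len(random_ls)

def likelihood_range(ls):
    """Range (max - min) of the individual L values within one population."""
    return max(ls) - min(ls)

def spread_p(observed_spread, random_spreads):
    """Fraction of random populations whose variance or range is equal to or greater than observed."""
    return sum(1 for s in random_spreads if s >= observed_spread) / len(random_spreads)
```

A small individual_p flags an individual that is unusually unlikely in its nominal population; a small spread_p flags a population whose individuals are unusually heterogeneous.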

(go to top)

PopGen Files

Data files in PopGen format can be read and turned directly into Doh Genotyped Individuals datasets. There is a button for this on the applet.

Run a Batch of Assignments

You can run the assignment test on multiple PopGen files using the applet available at Boh.html. It allows you to choose assignment test options just as this program does, and further allows you to specify what part(s) of the output from each run are retained.

(go to top)

Using Dominant Markers

You can perform the assignment test on data where each locus represents presence or absence of a dominant marker (e.g. data from RAPD). To do this:

Enter 1 (one) as the Ploidy
Enter the number of marker sites as Num Loci
Enter your data as usual for the assignment test. All individual "genotypes" are now presence/absence data, rather than true alleles. You can use whatever symbol you like to represent presence and absence.
All three zero-frequency adjustment methods are available. It is the frequency of presence/absence of markers which is adjusted, not the frequency of dominant/recessive alleles themselves.

Doh vs. WHICHRUN

Will Eichert (wfeichert@ucdavis.edu) and Michael Banks (mabanks@ucdavis.edu) created the program WHICHRUN, available at http://www-bml.ucdavis.edu/whichrun.htm. John Brzustowski wrote this assignment test calculator (http://www2.biology.ualberta.ca/jbrzusto/Doh.php).

WHICHRUN uses the same calculations as Doh to find the population in which a given genotype is most likely. Here's a brief comparison:

WHICHRUN is faster and easier to use with a better interface, but it requires MS Windows.
Doh does the same as WHICHRUN's jackknifing procedure: if individual X was sampled from population Y, then X's genotype is removed from Y's gene pool before the likelihood of X's genotype in Y is calculated. As of 09/02/2001, scores computed by both programs should agree exactly, provided you use the 1 / C option for zero allele frequency correction in Doh.
until 30/01/2001, Doh's 1 / C method for replacing zero allele frequencies used C = ploidy * popSize. To make the results more easily comparable to WHICHRUN, Doh now uses C = 1 + (ploidy * popSize). This will only rarely make any difference to the assignments.
WHICHRUN can simultaneously use haploid and diploid markers. This behaviour can be imitated in Doh by using missing allele markers ('-') for the second allele copy at all haploid loci.
for WHICHRUN, individuals whose population is unknown are put in a separate file. For Doh, these individuals are placed in the same input DataSet, but their population name is given as '?' (no quotes). When Doh calculates the likelihood of the genotype of an individual whose population is unknown, it uses the full gene pool from each known population. This should give exactly the same results as does WHICHRUN, provided the 1 / C zero allele frequency method is used in Doh.
until 30/01/2001, Doh's handling of PopGen (GenPop?) datasets was very crude. It is now better, although probably still not as complete as WHICHRUN's. You can put the unknown individuals in the same file, provided they are given only an individual identifier, e.g.:
```
Title line: this is a bogus dataset
Locus1, Locus2, Locus3
stray1      ,     122124   192185   221220  
stray2      ,     125124   182195   221219  
pop
pop1 i1     ,     121124   192185   221228  
pop1 i2     ,     123123   182185   221227  
pop
pop2 i1     ,     123124   191183   220227  
pop2 i2     ,     122125   189190   219225  
. . .
```
Doh takes the population name as the first word before the comma, and the individual name as the second. If no second word is present, the first word is treated as the individual name and the population is treated as unknown. The pop keyword is ignored. This is still stupid, but I haven't had a chance to do it properly...
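
The naming rule just described can be sketched like this. It is an illustration only: parse_popgen_line is a hypothetical helper, it expects the caller to skip the title and locus-name lines first, and it leaves each genotype as the raw string from the file rather than guessing how the digits split into alleles.

```python
def parse_popgen_line(line):
    """Return (population, individual, genotypes), or None for a 'pop' separator line."""
    if line.strip().lower() == "pop":
        return None  # the pop keyword carries no information here
    names, _, data = line.partition(",")
    words = names.split()
    if len(words) >= 2:
        population, individual = words[0], words[1]
    else:
        # Only one word before the comma: the population is unknown.
        population, individual = "?", words[0]
    return population, individual, data.split()
```

Applied to the example file, "stray1" comes back with population "?", while "pop1 i1" comes back with population "pop1" and individual "i1".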

My thanks to Vince Buonacorsi, Will Eichert, Peter Wimberger, and Michael Banks for helping to sort this out.

(back to top)

Fixed Bugs and Changes

Here are some problems you might encounter, some fixed, some not. (there are doubtless many others - please let me know!):

Date Reported	Date Fixed	Bug or Change	Details
1-Nov-2002	3-Nov-2002	Subsampling to equal size	A silly mistake caused Doh to hang if you tried to use the "subsample to equal size" randomization on a dataset with individuals from unknown populations.
15-Jan-2001	30-Jan-2001	WHICHRUN issues	Doh now uses 1 / (1 + ploidy * popSize) to correct for zero allele frequencies, improving agreement with WHICHRUN. PopGENE file loading has been improved, and the unknown population marker "?" has been documented. A comparison of WHICHRUN and Doh appears above.
15-Nov-2000	15-Nov-2000	Dominant Marker data	Nothing like screwing up basic arithmetic! The correct way to treat dominant markers is to let "Present" be one allele, and "Absent" the other allele of a monoploid locus. It's not necessary to indicate which is which.
10-Nov-2000	10-Nov-2000	Y2K compliance	Having received no complaints to the contrary in over 10 months of post-millennial operation, I hereby pronounce Doh to be Y2K-compliant. I'm sure you're all relieved.
10-Nov-2000	10-Nov-2000	Dominant Marker data	Dominant Marker presence/absence data can now be used for the assignment test. Allele frequencies are estimated from marker presence/absence assuming HWE.
21-July-99	21-July-99	Read PopGen files; new randomization	PopGen files can be turned directly into datasets. More sensible randomizations have been added: they reshuffle alleles at each locus, either within or across populations.
6-May-99	9-May-99	Randomization bug; symptom: running histogram disappears; no output generated	For some datasets, randomization would fail without reporting an error, due to a memory allocation bug.
4-Aug-98	6-Aug-98	Prob of Identity with no alleles	When missing alleles are ignored, and a locus has no non-missing alleles, the probability of identity is now defined to be one. This condition used to generate an error value.
	4-Jun-98	Cumulative Heterozygosity and Allele Count	These have been added and are described above.
	13-Feb-98	Load and save data locally	This calculator is now digitally signed, and makes use of Netscape's security manager classes to allow, with your permission, reading and writing of files on your machine (Netscape V4.0 and higher only).
10-Nov-97	10-Nov-97	weird field delimiter from FileMaker Pro	If you export your data from FileMaker Pro, you might have ascii 29 characters separating repeated fields. This program now converts those to spaces when pasting from the virtual clipboard.
(not rep.)	16-Sept-97	wrong probability for locus with missing allele	the probability of a locus where some BUT NOT ALL alleles are missing is wrong; the heterozygosity factor is computed treating 'missing' as just another allele flavour. This can affect assignment for datasets with such individuals.
1-Aug-97	28-Aug-97	probabilities of heterozygotes are wrong: the multinomial coefficient is omitted.	for diploids, heterozygote genotype probabilities are half of what they should be. Fortunately, this doesn't change the population to which an organism is assigned.
1-Aug-97	28-Aug-97	the choice of 0.01 or 1 / (ploidy * popSize) as the probability for a "vanishing" allele is applied differently to an individual's own population than to others	assignments could be wrong for individuals with rare alleles
June-97		horizontal scrollbars don't appear: some versions of Netscape for the PC don't draw horizontal scrollbars on text windows.	You can move around the window by clicking inside it and using cursor keys.
June-97		scroll bars don't appear on text windows in some versions of Netscape for the Mac	Resizing the browser window, or causing it to be redrawn makes the scrollbars appear
June-97		the virtual clipboard doesn't work on some browsers	a limitation of Microsoft Explorer (no uploading to the clipboard from a file on your machine) and possibly other browsers. Eventually, there will be a version of this program that allows direct access to files on your machine, making the virtual clipboard unnecessary.

(back to top)

Citing Doh:

Doh implements the assignment test first described in

Paetkau, D., W. Calvert, I. Sterling, and C. Strobeck. 1995. Microsatellite analysis of population structure in Canadian polar bears. Molecular Ecology 4:347-354

Paetkau and Strobeck developed the test, so you should probably cite this paper as the source of the assignment test.

The Doh program itself can also be cited, but the rules will vary among journals. Here is an example:

Brzustowski, J. "Doh assignment test calculator". Online. Available: <http://www2.biology.ualberta.ca/jbrzusto/Doh.php>. (12 March, 2002)

That date is the last update; some journals will want the date you accessed it instead. I haven't published anything referring to this program, but I would like people who want to use it to be able to find it.

Credits:

Thanks for motivation and feedback to Curtis Strobeck, Greg Wilson, Peter Waser, David Paetkau, Catherine Mossman, Corey Davis, and Linsey Mutch.

The DRAWTREE textual tree format comes from the Phylip package by Joe Felsenstein et al. My thanks to them for making it free.

The HTML and Java source code are free.

This Java and HTML web page is by John Brzustowski. I appreciate any comments or criticisms. Although I try to correct any known errors in these programs, you use them entirely at your own risk. There is no warranty!

(back to top)