UNSUPPORTED SOFTWARE: USE AT OWN RISK

Assignment Calculator

New: You should now be able to cut/paste large datasets directly into/out of text boxes in this calculator, at least if your web browser is not too old. This means you don't have to use the Virtual Clipboard and should eliminate the need for this calculator to read and write directly to your machine. You may have to use the Paste key, which is often Ctrl-V, to get data into a window.

Note: until I get time to update it, you will need to use a Java 1.1 browser to run this program. Netscape version 4.X should work. Versions 6 and later will probably not work.

New: Finding likely migrants

New: Subsampling to Equalize Population Sizes

Long Overdue: Citing this program

Use the Calculator

About Assignment

Randomization for confidence intervals

About Cumulative Heterozygosity

Using Dominant Marker data

About the Input Format

About the Output

Assigning Individuals from Unknown Populations

How to deal with missing data

Doh vs. WHICHRUN - read if you are new to Doh!

Reading data from files on your machine.

The Virtual Clipboard for large datasets

What are page buttons for?

Running this program without the network

Creating a dataset from a POPGEN file

Running assignment on a batch of files

Bugs, Fixes, and Changes

Credits

About Assignment

This calculator takes genotypes of individuals from several populations and determines the population from which each individual is most likely to have come. It does so using the assignment index: the highest probability of an individual's genotype in any of the populations. Calculations are as described in

Paetkau, D., W. Calvert, I. Stirling, and C. Strobeck. 1995. Microsatellite analysis of population structure in Canadian polar bears. Molecular Ecology 4:347-354

and

Paetkau, D., L. P. Waits, P. L. Clarkson, L. Craighead, and C. Strobeck. 1997. An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics 147:1943-1957

About the Input Format

Hit the Create Dataset button to pop up an input datasheet. Don't forget to fill in the number of loci. Duplicate individual names will automatically be suffixed with "-2", "-3" and so on. Names must not include spaces or TAB characters. Make sure the options page corresponds to the format of your data. The Individual Genotypes window should be filled with blank/TAB/newline separated items like this:

GB01  Admiralty 184 198 152 158 105 113 184 184
SIT01 Baranof   184 184 152 154 105 113 178 182

where the first item is an individual name, the second is a population name, and the rest of the items are the number_of_loci * ploidy allele copies for that individual, in order by locus. Alleles can be represented by numbers or words. Space is not significant in the input, as long as at least one space, TAB, or newline separates each item. The Options page of the input datasheet allows you to specify whether individual and/or population names are present, and if so, in what order. If names are not present, individuals and/or populations are numbered sequentially.
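As a rough illustration of this format (not Doh's own code), here is a minimal Python sketch of a reader for the default layout: individual name, then population name, then the allele copies. The function name `parse_genotypes` and its parameters are hypothetical, and it assumes both names are present.

```python
from collections import Counter


def parse_genotypes(text, n_loci, ploidy=2):
    """Parse whitespace-separated items into (name, population, loci) records.

    Assumes every individual has a name and a population name, in that order.
    """
    items = text.split()  # blanks, TABs and newlines all separate items
    per_individual = 2 + n_loci * ploidy  # name + population + allele copies
    if len(items) % per_individual != 0:
        raise ValueError("item count does not match number of loci and ploidy")
    seen = Counter()
    records = []
    for start in range(0, len(items), per_individual):
        name, population, *alleles = items[start:start + per_individual]
        seen[name] += 1
        if seen[name] > 1:  # duplicate names get "-2", "-3", ...
            name = f"{name}-{seen[name]}"
        # group the allele copies locus by locus
        loci = [tuple(alleles[i:i + ploidy]) for i in range(0, len(alleles), ploidy)]
        records.append((name, population, loci))
    return records


sample = """GB01  Admiralty 184 198 152 158 105 113 184 184
SIT01 Baranof   184 184 152 154 105 113 178 182"""

for record in parse_genotypes(sample, n_loci=4):
    print(record)
```

Alleles are kept as strings, since the format allows them to be numbers or words.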

Individuals From Unknown Populations

If you have individuals whose source population is unknown and you wish to find the population in which their genotypes are most likely, include them in your dataset with the symbol ? (question mark) as the population name. You'll still need to assign an individual name. Any individuals with unknown population will be listed at the end of the Individual Assignment output after you run the test. The results from this analysis should be the same as you'd get from WhichRun.

Dealing with Zeroes

Double click on the Options for Assignment Test listbox entry to pop up the option datasheet. The main option is what value to use in place of a zero allele frequency, i.e. when the allele has not actually been detected in a population, or when the only individual in which it has been found is removed from that population. You can choose either to always use f = 0.01 or to use f = 1 / C where C = 1 + ploidy * size_of_the_appropriate_population. The latter choice amounts to pretending that the next allele sampled from the population would have been of the given type. In both cases, frequencies of the other alleles for the given locus and population are not changed, and so together, they will sum to slightly more than one.
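The two replacement rules can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names and the `fixed` flag are hypothetical and are not options of the program itself.

```python
def replacement_frequency(pop_size, ploidy=2, fixed=False):
    """Value used in place of a zero allele frequency."""
    if fixed:
        return 0.01  # first choice: always use f = 0.01
    return 1.0 / (1 + ploidy * pop_size)  # second choice: f = 1 / C


def allele_frequency(count, n_copies, pop_size, ploidy=2, fixed=False):
    """Observed frequency, or the replacement value when the count is zero."""
    if count == 0:
        return replacement_frequency(pop_size, ploidy, fixed)
    return count / n_copies


# A diploid population of 20 individuals has 40 gene copies, so C = 41.
print(replacement_frequency(20))
print(allele_frequency(10, 40, 20))
```

Because the other frequencies are left alone, a locus with one unseen allele in this example sums to 1 + 1/41.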

New: there is now another option for dealing with zero frequencies: you can choose to have all frequencies adjusted in every population using the formula p' = (f + 1/a) / (n + 1), where p' is the adjusted probability estimate for a given allele at a locus in a population, f is the number of allele copies of the given type, a is the number of allele flavours for that locus, and n is the number of gene copies for that locus in the given population, not including missing copies. For comparison, the usual, unadjusted probability estimate is just p = f / n. This zero-avoidance device is described in Titterington et al. (1981) J. R. Statist. Soc. A. 144:145-175. Results from using this method can differ considerably from those using the other two.
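A small Python sketch of this adjustment follows, assuming the allele counts for one locus in one population are held in a dictionary that also lists alleles seen only in other populations, with a count of zero. The function name `adjusted_frequencies` is hypothetical.

```python
def adjusted_frequencies(counts):
    """Apply p' = (f + 1/a) / (n + 1) to every allele at one locus.

    counts maps each allele flavour known at the locus to the number of
    copies observed in this population (zero for unseen alleles).
    """
    a = len(counts)           # number of allele flavours for the locus
    n = sum(counts.values())  # non-missing gene copies in this population
    return {allele: (f + 1.0 / a) / (n + 1) for allele, f in counts.items()}


adjusted = adjusted_frequencies({"184": 6, "198": 2, "178": 0})
for allele, p in adjusted.items():
    print(allele, round(p, 4))
```

Summing the formula over all a alleles gives (n + 1) / (n + 1), so unlike the two replacement rules above, the adjusted estimates for a locus sum to exactly one.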

Once you have hit OK on both the input and option datasheets, hit the Calculate button to run the assignment test. If the datasheet complains instead, there is a problem with the way your data is entered; if you are lucky, the complaint will help you figure out what it is.



About the Output

Running the assignment test produces two datasheets:

The Assignment Datasheet contains:

The second datasheet contains a (lower triangular) matrix of distances between each pair of populations, calculated as:

d_{x,y} = ( A_{x,y} + A_{y,x} ) / 2

These are not necessarily distances in the metric sense. (Separate datasheets are used in anticipation of better integration between this program and others in the near future.)
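The averaging step can be sketched in Python as below. The sketch assumes the matrix A is already available as a square list of lists indexed by population; how A itself is computed is not covered here. It also assumes the diagonal is left out of the lower triangle. The function name `distance_matrix` is hypothetical.

```python
def distance_matrix(A):
    """Lower-triangular distances d[x][y] = (A[x][y] + A[y][x]) / 2 for y < x."""
    n = len(A)
    return [[(A[x][y] + A[y][x]) / 2 for y in range(x)] for x in range(n)]


# Made-up values for three populations, purely to show the layout.
A = [[0, 2, 4],
     [6, 0, 8],
     [10, 12, 0]]

for row in distance_matrix(A):
    print("\t".join(str(d) for d in row))
```

Tab-separated rows like these paste cleanly into a spreadsheet.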

Output looks ugly, but is meant to be pasted into a spreadsheet.

The distance matrix is suitable for use in programs like Phylip Neighbor and a web-based tree-building program. You can copy/paste the distance matrix from here using either the system clipboard or the virtual clipboard.



Assignment Calculator

This program requires a Java-enabled browser. Perhaps your browser has Java disabled?



Missing Data

Use a '-' (minus sign without the quotes) in the place of each missing allele copy (making sure it has space before and after, just like any other allele label); e.g. a missing locus in a diploid organism is entered as ' - - '. The program simply drops the probabilities for missing allele copies from the calculation of the assignment index. Note: the assignment index for an individual with missing data cannot be directly compared to the index for an individual with all data, or with data missing in different locations, because each allele added to the calculation can only decrease (or at least not increase) the index value. However, it may be legitimate to compare the ratio (probability of an individual in its assigned population / probability in its nominal (input) population) between such individuals.
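To show what "dropping" missing copies means, here is a hedged Python sketch of a log genotype probability. It is not Doh's code: it assumes independent allele copies within a locus, includes the multinomial coefficient (2 for a diploid heterozygote), and expects frequencies that have already been corrected for zeroes. All names are hypothetical.

```python
import math
from collections import Counter

MISSING = "-"


def log_genotype_probability(loci, freqs):
    """Log probability of a genotype in one population, skipping missing copies.

    loci  : list of allele tuples, one tuple per locus
    freqs : list of dicts, one per locus, mapping allele -> frequency (> 0)
    """
    total = 0.0
    for alleles, locus_freqs in zip(loci, freqs):
        present = [a for a in alleles if a != MISSING]  # drop missing copies
        if not present:
            continue  # a fully missing locus contributes nothing
        # multinomial coefficient: orderings of the observed copies
        coefficient = math.factorial(len(present))
        for c in Counter(present).values():
            coefficient //= math.factorial(c)
        total += math.log(coefficient)
        total += sum(math.log(locus_freqs[a]) for a in present)
    return total


freqs = [{"184": 0.5, "198": 0.5}, {"152": 0.25, "158": 0.75}]
complete = [("184", "198"), ("152", "152")]
partial = [("184", "198"), ("-", "-")]
print(log_genotype_probability(complete, freqs))
print(log_genotype_probability(partial, freqs))
```

The second individual is scored over one locus only, which is why its value is not directly comparable with the first.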




Virtual Clipboard (for large files)

The Virtual Clipboard lets you paste data into or copy data out of multi-page windows. The ---> VC and <--- VC buttons move data to and from the virtual clipboard. If you see a "Browse" button below, then your browser will also let you copy a file on your machine to the Virtual Clipboard. You can then paste it from the V.C. into the calculator's input windows.

Name of file:

Note: if you get a "Save As..." dialog box when viewing the virtual clipboard, it is because the data there is in a binary format (e.g. a PICT file saved from TreeToy). You should specify an appropriate name (e.g. myplot.pict instead of viewclip.php) before saving it.


The Virtual Clipboard is Brainless

Is your window full of garbage after pasting? Unfortunately, the Virtual Clipboard knows nothing about file formats, so cutting and pasting directly from an Excel file will fill your window with junk (i.e. raw Excel file bytes). You can, however, copy from the open Excel worksheet to a text editor, then save as text in a new file which you can then copy to the Virtual Clipboard. You can also just save a copy of your Excel file in tab-delimited text format, and copy that file to the VC. The point of the VC is just to help work around the limitations on window size in Netscape, MS Explorer, and probably other browsers.




Multi-Page TextBoxes (for medium-sized datasets)

Some browsers limit the size of text windows, so this program allows for bigger input and output using multiple pages. Each page is limited to around 10000 characters. To see if you have filled a page, try typing more characters at the end of the window. When you use your system's copy/paste functions to get text in and out of this program, you have to do so one page at a time; the program has no control over this.



Running, Reading, and Writing on Your Machine

You can download this web page and calculator so that you can run it from your own machine, without using the network. The only disadvantage is that you won't necessarily be running the latest version.

If you are using Netscape Navigator/Communicator V4.0 or higher, you can load data into a text box directly from a file on your machine, without going through the virtual clipboard (and thus avoid sending your data over the network). The Load and Save buttons below a data entry window will pop up a file dialog, allowing you to load (save) the contents of the window from (to) a file. Netscape will ask your permission for these operations. Unfortunately, other browsers do not (yet?) support the same security mechanism. Netscape Communicator is now open source, so this is the only security mechanism I will support for now. To use these features, make a local copy of Doh:



Cumulative Heterozygosity and Allele Count

Doh can calculate cumulative expected and observed heterozygosity, cumulative allele count, and cumulative probability of identity (P(ID)) as individuals are considered one by one. To use this feature:

Options are:

Output for this command is rounded to three digits after the decimal.

Probability of Identity P(ID)

This is the probability that two randomly drawn diploid genotypes will be identical, assuming observed allele frequencies and random assortment. The program uses the unbiased estimator of P(ID) found in Paetkau et al. (1998) Conservation Biology 12(2):418-29, unless the number of alleles found at a locus is less than 4, in which case the standard biased estimator is used. In the unusual case where no non-missing alleles have been found for a locus, and you have chosen to have Doh ignore missing alleles, P(ID) is set to one.
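For orientation, the commonly used biased form of P(ID) for one locus is the sum of p_i^4 over alleles plus the sum of (2 p_i p_j)^2 over distinct allele pairs. The Python sketch below computes only that form, on the assumption that it is what "standard biased estimator" refers to here; the unbiased estimator from Paetkau et al. (1998) is not reproduced. The function name is hypothetical.

```python
from itertools import combinations


def probability_of_identity(freqs):
    """Biased single-locus P(ID) from a list of allele frequencies."""
    # two identical homozygotes: (p_i^2)^2
    homozygotes = sum(p ** 4 for p in freqs)
    # two identical heterozygotes: (2 p_i p_j)^2 for each unordered pair
    heterozygotes = sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return homozygotes + heterozygotes


print(probability_of_identity([0.5, 0.5]))
print(probability_of_identity([0.25, 0.25, 0.25, 0.25]))
```

A locus with a single fixed allele gives a value of one, since every genotype is then identical.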

Output Format for Cumulative Calculations

Output from these commands is a matrix of numbers. Each row represents a locus, except that the last row represents the mean for all loci. The first column has results from considering a single individual, the second column, from two individuals, and so on, up to the number of individuals chosen for analysis.
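The layout of this matrix can be illustrated with the simplest of the cumulative statistics, the allele count. The Python sketch below is an assumption-laden illustration, not the program's code: it takes individuals in the order given, ignores missing copies marked "-", and uses the hypothetical name `cumulative_allele_counts`.

```python
def cumulative_allele_counts(individuals):
    """Rows are loci plus a final mean row; column k covers the first k+1 individuals.

    individuals: list of per-individual loci, each locus a tuple of alleles.
    """
    n_loci = len(individuals[0])
    seen = [set() for _ in range(n_loci)]  # distinct alleles so far, per locus
    rows = [[] for _ in range(n_loci)]
    for loci in individuals:
        for k, alleles in enumerate(loci):
            seen[k].update(a for a in alleles if a != "-")
            rows[k].append(len(seen[k]))
    # last row: mean over loci for each number of individuals
    mean_row = [sum(column) / n_loci for column in zip(*rows)]
    return rows + [mean_row]


individuals = [
    [("184", "198"), ("152", "158"), ("105", "113"), ("184", "184")],
    [("184", "184"), ("152", "154"), ("105", "113"), ("178", "182")],
]
for row in cumulative_allele_counts(individuals):
    print(row)
```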



Randomizing for Confidence Intervals

Doh can repeatedly randomize your data and re-calculate the assignment test. This allows you to test several null hypotheses about the numbers of individuals cross-assigned between populations. The options are:

If randomization (4) suggests rejection of H0, then it's likely that one or both of randomizations (2) and (3) will too. If randomization (2) rejects H0, but randomization (3) doesn't, then the cross assignments you observed are partly due to HW disequilibrium, which can make individuals with multiple rare alleles seem unlikely within their own populations.

While randomization proceeds, which might take some time, a histogram of the numbers of population-to-population cross-assignments for each pair of populations is displayed for your amusement. The pink bar is the number of cross-assignments for the observed data, while the black bars accumulate a histogram of cross-assignments for the randomized populations. Again, for now, this graph is ephemeral.

The Randomization Output

The output consists of three new panes in the assignment test output:

Subsampling to Equalize Population Sizes

As a special case, Doh can run a randomization as follows:

This (or at least my implementation of it) is a quick and dirty technique. Martin Carlsson (martin.carlsson@ebc.uu.se) suggested it might be useful for getting around possible problems with highly unequal population sizes.

In the Assignment Output panel, you will only see values for these items:

In particular, no summaries are provided for assignments of individuals, and no A or distance matrix is computed.


Finding Likely Migrants

Here's one approach to looking for individuals who might have come from a population other than the one from which they were sampled:



PopGen Files

Data files in PopGen format can be read and turned directly into Doh Genotyped Individuals datasets. There is a button for this on the applet.

Run a Batch of Assignments

You can run the assignment test on multiple PopGen files using the applet available at Boh.html. It allows you to choose assignment test options just as this program does, and further allows you to specify what part(s) of the output from each run are retained.


Using Dominant Markers

You can perform the assignment test on data where each locus represents presence or absence of a dominant marker (e.g. data from RAPD). To do this:

Doh vs. WHICHRUN

Will Eichert (wfeichert@ucdavis.edu) and Michael Banks (mabanks@ucdavis.edu) created the program WHICHRUN, available at http://www-bml.ucdavis.edu/whichrun.htm. John Brzustowski wrote this assignment test calculator (http://www2.biology.ualberta.ca/jbrzusto/Doh.php).

WHICHRUN uses the same calculations as Doh to find the population in which a given genotype is most likely. Here's a brief comparison:

My thanks to Vince Buonacorsi, Will Eichert, Peter Wimberger, and Michael Banks for helping to sort this out.



Fixed Bugs and Changes

Here are some problems you might encounter, some fixed, some not. (There are doubtless many others - please let me know!)

| Date Reported | Date Fixed | Bug or Change | Details |
|---|---|---|---|
| 1-Nov-2002 | 3-Nov-2002 | Subsampling to equal size | A silly mistake caused Doh to hang if you tried to use the "subsample to equal size" randomization on a dataset with individuals from unknown populations. |
| 15-Jan-2001 | 30-Jan-2001 | WHICHRUN issues | Doh now uses 1 / (1 + ploidy * popSize) to correct for zero allele frequencies, improving agreement with WHICHRUN. PopGENE file loading has been improved, and the unknown population marker "?" has been documented. A comparison of WHICHRUN and Doh appears above. |
| 15-Nov-2000 | 15-Nov-2000 | Dominant Marker data | Nothing like screwing up basic arithmetic! The correct way to treat dominant markers is to let "Present" be one allele, and "Absent" the other allele of a monoploid locus. It's not necessary to indicate which is which. |
| 10-Nov-2000 | 10-Nov-2000 | Y2K compliance | Having received no complaints to the contrary in over 10 months of post-millennial operation, I hereby pronounce Doh to be Y2K-compliant. I'm sure you're all relieved. |
| 10-Nov-2000 | 10-Nov-2000 | Dominant Marker data | Dominant Marker presence/absence data can now be used for the assignment test. Allele frequencies are estimated from marker presence/absence assuming HWE. |
| 21-July-99 | 21-July-99 | Read PopGen files; new randomization | PopGen files can be turned directly into datasets. More sensible randomizations have been added: they reshuffle alleles at each locus, either within or across populations. |
| 6-May-99 | 9-May-99 | Randomization bug; symptom: running histogram disappears; no output generated | For some datasets, randomization would fail without reporting an error, due to a memory allocation bug. |
| 4-Aug-98 | 6-Aug-98 | Prob of Identity with no alleles | When missing alleles are ignored, and a locus has no non-missing alleles, the probability of identity is now defined to be one. This condition used to generate an error value. |
| 4-Jun-98 | | Cumulative Heterozygosity and Allele Count | These have been added and are described above. |
| 13-Feb-98 | | Load and save data locally | This calculator is now digitally signed, and makes use of Netscape's security manager classes to allow, with your permission, reading and writing of files on your machine (Netscape V4.0 and higher only). |
| 10-Nov-97 | 10-Nov-97 | Weird field delimiter from FileMaker Pro | If you export your data from FileMaker Pro, you might have ASCII 29 characters separating repeated fields. This program now converts those to spaces when pasting from the virtual clipboard. |
| (not rep.) | 16-Sept-97 | Wrong probability for locus with missing allele | The probability of a locus where some BUT NOT ALL alleles are missing is wrong; the heterozygosity factor is computed treating 'missing' as just another allele flavour. This can affect assignment for datasets with such individuals. |
| 1-Aug-97 | 28-Aug-97 | Probabilities of heterozygotes are wrong: the multinomial coefficient is omitted. | For diploids, heterozygote genotype probabilities are half of what they should be. Fortunately, this doesn't change the population to which an organism is assigned. |
| 1-Aug-97 | 28-Aug-97 | The choice of 0.01 or 1 / ploidy * popSize as the probability for a "vanishing" allele is applied differently to an individual's own population than to others | Assignments could be wrong for individuals with rare alleles. |
| June-97 | | Horizontal scrollbars don't appear | Some versions of Netscape for the PC don't draw horizontal scrollbars on text windows. You can move around the window by clicking inside it and using cursor keys. |
| June-97 | | Scroll bars don't appear on text windows in some versions of Netscape for the Mac | Resizing the browser window, or causing it to be redrawn, makes the scrollbars appear. |
| June-97 | | The virtual clipboard doesn't work on some browsers | A limitation of Microsoft Explorer (no uploading to the clipboard from a file on your machine) and possibly other browsers. Eventually, there will be a version of this program that allows direct access to files on your machine, making the virtual clipboard unnecessary. |



Citing Doh:

Doh implements the assignment test first described in
Paetkau, D., W. Calvert, I. Stirling, and C. Strobeck. 1995. Microsatellite analysis of population structure in Canadian polar bears. Molecular Ecology 4:347-354
Paetkau and Strobeck developed the test, so you should probably cite this paper as the source of the assignment test.

The Doh program itself can also be cited, but the rules will vary among journals. Here is an example:

Brzustowski, J. "Doh assignment test calculator". Online. Available: <http://www2.biology.ualberta.ca/jbrzusto/Doh.php>. (12 March, 2002)
That date is the last update; some journals will want the date you accessed it instead. I haven't published anything referring to this program, but I would like people who want to use it to be able to find it.

Credits:

Thanks for motivation and feedback to Curtis Strobeck, Greg Wilson, Peter Waser, David Paetkau, Catherine Mossman, Corey Davis, and Linsey Mutch.

The DRAWTREE textual tree format comes from the Phylip package by Joe Felsenstein et al. My thanks to them for making it free.

The HTML and Java source code are free.

This Java and HTML web page is by John Brzustowski. I appreciate any comments or criticisms. Although I try to correct any known errors in these programs, you use them entirely at your own risk. There is no warranty!
