Assignment Calculator
New: You should now be able to cut/paste large datasets directly into/out of text boxes in this calculator, at least if your web browser is not too old. This means you don't have to use the Virtual Clipboard and should eliminate the need for this calculator to read and write directly to your machine. You may have to use the Paste key, which is often Ctrl-V, to get data into a window.
Note: until I get time to update it, you will need to use a Java 1.1 browser to run this program. Netscape version 4.X should work. Versions 6 and later will probably not work.
New: Subsampling to Equalize Population Sizes
Long Overdue: Citing this program
Randomization for confidence intervals
About Cumulative Heterozygosity
Assigning Individuals from Unknown Populations
Doh vs. WHICHRUN - read if you are new to Doh!
Reading data from files on your machine.
Running this program without the network
Creating a dataset from a POPGEN file
Running assignment on a batch of files
This calculator takes genotypes of individuals from several populations and determines from which population each individual is most likely to have come, by using the assignment index, the highest probability of an individual's genotype in any of the populations. Calculations are as described in
Paetkau, D., W. Calvert, I. Stirling, and C. Strobeck. 1995. Microsatellite analysis of population structure in Canadian polar bears. Molecular Ecology 4:347-354
and
Paetkau, D., L. P. Waits, P. L. Clarkson, L. Craighead, and C. Strobeck. 1997. An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics 147:1943-1957
Hit the Create Dataset button to pop up an input datasheet. Don't forget to fill in the number of loci. Duplicate individual names will automatically be suffixed with "-2", "-3" and so on. Names must not include spaces or TAB characters. Make sure the options page corresponds to the format of your data. The Individual Genotypes window should be filled with blank/TAB/newline separated items like this:
GB01 Admiralty 184 198 152 158 105 113 184 184
SIT01 Baranof 184 184 152 154 105 113 178 182
where the first item is an individual name, the second is a population name, and the rest of the items are the number_of_loci * ploidy allele copies for that individual, in order by locus. Alleles can be represented by numbers or words. Space is not significant in the input, as long as at least one space, TAB, or newline separates each item. The Options page of the input datasheet allows you to specify whether individual and/or population names are present, and if so, in what order. If names are not present, individuals and/or populations are numbered sequentially.
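The input layout described above can be sketched in a few lines of Python. This is only an illustration of the format, not the calculator's own Java code: the function name `parse_genotypes` and its parameters are hypothetical, and it assumes both individual and population names are present, in that order.

```python
from collections import Counter

def parse_genotypes(text, n_loci, ploidy=2):
    """Parse items laid out as: name, population, then n_loci * ploidy alleles.

    Hypothetical helper; assumes names come first and populations second.
    """
    items = text.split()               # blanks, TABs and newlines all separate items
    width = 2 + n_loci * ploidy        # number of items per individual
    if len(items) % width != 0:
        raise ValueError("item count is not a multiple of %d" % width)
    seen = Counter()
    records = []
    for start in range(0, len(items), width):
        name, pop = items[start], items[start + 1]
        seen[name] += 1
        if seen[name] > 1:             # duplicate names are suffixed "-2", "-3", ...
            name = "%s-%d" % (name, seen[name])
        alleles = items[start + 2:start + width]
        # group the flat allele list into one tuple per locus
        loci = [tuple(alleles[k:k + ploidy]) for k in range(0, len(alleles), ploidy)]
        records.append((name, pop, loci))
    return records

sample = """GB01 Admiralty 184 198 152 158 105 113 184 184
SIT01 Baranof 184 184 152 154 105 113 178 182"""
records = parse_genotypes(sample, n_loci=4)
```

Alleles are kept as strings, since the calculator accepts either numbers or words as allele labels.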
Individuals From Unknown Populations
If you have individuals whose source population is unknown and you wish to find the population in which their genotypes are most likely, include them in your dataset with the symbol ? (question mark) as the population name. You'll still need to assign an individual name. Any individuals with unknown population will be listed at the end of the Individual Assignment output after you run the test. The results from this analysis should be the same as you'd get from WHICHRUN.
Double click on the Options for Assignment Test listbox entry to pop up the option datasheet. The main option is what value to use in place of a zero allele frequency, i.e. when the allele has not actually been detected in a population, or when the only individual in which it has been found is removed from that population. You can choose either to always use f = 0.01 or to use f = 1 / C, where C = 1 + ploidy * size_of_the_appropriate_population. The latter choice amounts to pretending that the next allele sampled from the population would have been of the given type. In both cases, frequencies of the other alleles for the given locus and population are not changed, so together they will sum to slightly more than one.
New: there is now another option for dealing with zero frequencies: you can choose to have all frequencies adjusted in every population using the formula p' = (f + 1/a) / (n + 1), where p' is the adjusted probability estimate for a given allele at a locus in a population, f is the number of allele copies of the given type, a is the number of allele flavours for that locus, and n is the number of gene copies for that locus in the given population, not including missing copies. (Note that f here is a count of allele copies, not the replacement frequency f used by the other two options.) For comparison, the usual, unadjusted probability estimate is just p = f / n. This zero-avoidance device is described in Titterington et al. (1981) J. R. Statist. Soc. A. 144:145-175. Results from using this method can differ considerably from those using the other two.
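The three zero-frequency options can be compared side by side with a small sketch. The function name and the method labels (`"constant"`, `"inverse_size"`, `"adjust_all"`) are made up for this illustration, and `pop_size` is assumed to mean the number of individuals in the population. Removing an individual from its own population before estimating frequencies is not modelled here.

```python
def allele_probability(count, n_copies, n_alleles, pop_size, ploidy=2, method="constant"):
    """Probability estimate for one allele at one locus in one population.

    count     -- copies of this allele observed (f in the adjustment formula)
    n_copies  -- non-missing gene copies at this locus (n)
    n_alleles -- number of allele flavours at this locus (a)
    pop_size  -- individuals in the population (assumed meaning of "size")
    """
    if method == "adjust_all":
        # every frequency is adjusted: p' = (f + 1/a) / (n + 1)
        return (count + 1.0 / n_alleles) / (n_copies + 1)
    if count > 0:
        return count / n_copies        # usual estimate p = f / n
    if method == "constant":
        return 0.01                    # fixed replacement value
    if method == "inverse_size":
        return 1.0 / (1 + ploidy * pop_size)   # 1 / C with C = 1 + ploidy * size
    raise ValueError("unknown method: %r" % method)

# Hypothetical locus with four allele flavours and 20 gene copies in 10 diploids.
counts = [5, 15, 0, 0]
adjusted = [allele_probability(c, 20, 4, 10, method="adjust_all") for c in counts]
```

With the first two options only the zero entries change, so the estimates sum to slightly more than one. With the adjustment formula the estimates for a locus sum to exactly one, because the 1/a terms add up to 1.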
Once you have hit OK on both the input and option datasheets, hit the Calculate button to run the assignment test. If the datasheet complains, there is a problem with the way your data is entered; if you are lucky, the complaint will help you figure out what it is.
Running the assignment test produces two datasheets:
The Assignment Datasheet contains:
A_{x,y} = (1/n_x) \sum_{i \in x} \log_{10} [ \Pr_x(g_i) / \Pr_y(g_i) ]
where
x, y are populations
n_x is the size of population x
g_i is the genotype of individual i
\Pr_x is the genotype probability calculated in population x
So A_{x,y} is a measure of how much more likely genotypes of individuals sampled in population x are in population x than in population y. A is not symmetric.
The Assignment Datasheet contains a (lower triangular) matrix of distances between each pair of populations, calculated as:
d_{x,y} = ( A_{x,y} + A_{y,x} ) / 2
These are not necessarily distances in the metric sense. (Separate datasheets are used in anticipation of better integration between this program and others in the near future.)
Output looks ugly, but is meant to be pasted into a spreadsheet.
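The index A and the distance d defined above are easy to check by hand. The sketch below assumes you already have log10 genotype probabilities for every individual in every population. The data structures, function names and numbers are all invented for the example.

```python
def assignment_index(log10_prob, members):
    """Return A[x][y], the mean over individuals i sampled in x of
    log10(Pr_x(g_i)) - log10(Pr_y(g_i)).

    log10_prob[i][p] -- log10 probability of individual i's genotype in population p
    members[p]       -- individuals sampled in population p
    """
    A = {x: {} for x in members}
    for x in members:
        for y in members:
            total = sum(log10_prob[i][x] - log10_prob[i][y] for i in members[x])
            A[x][y] = total / len(members[x])
    return A

def distance(A, x, y):
    """Symmetrised value d = (A[x][y] + A[y][x]) / 2."""
    return (A[x][y] + A[y][x]) / 2.0

# Made-up log10 probabilities for three individuals and two populations.
log10_prob = {
    "i1": {"P": -3.0, "Q": -5.0},
    "i2": {"P": -4.0, "Q": -4.5},
    "j1": {"P": -6.0, "Q": -3.0},
}
members = {"P": ["i1", "i2"], "Q": ["j1"]}
A = assignment_index(log10_prob, members)
d_PQ = distance(A, "P", "Q")
```

Here A["P"]["Q"] is (2.0 + 0.5) / 2 = 1.25 while A["Q"]["P"] is 3.0, which illustrates why A is not symmetric. Their average, 2.125, is the distance.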
The distance matrix is suitable for use in programs like Phylip Neighbor and a web-based tree-building program. You can copy/paste the distance matrix from here using either the system clipboard, or the virtual clipboard.
Use a '-' (minus sign without the quotes) in the place of each missing allele copy (making sure it has space before and after, just like any other allele label); e.g. a missing locus in a diploid organism is entered as ' - - '. The program simply drops the probabilities for missing allele copies from the calculation of the assignment index. Note: the assignment index for an individual with missing data cannot be directly compared to the index for an individual with all data, or with data missing in different locations, because each allele added to the calculation can only decrease (or at least not increase) the index value. However, it may be legitimate to compare the ratio (probability of an individual in its assigned population / probability in its nominal (input) population) between such individuals.
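Here is a rough sketch of how missing allele copies can be dropped from a genotype probability. It is not taken from the Doh source. It assumes allele frequencies have already had zeros replaced, and that the multinomial coefficient is computed over the observed copies only (the changelog on this page says that counting "missing" as another allele flavour was a bug). All names are hypothetical.

```python
import math
from collections import Counter

MISSING = "-"   # marker for a missing allele copy

def locus_log10_prob(alleles, freq):
    """log10 probability of the observed allele copies at one locus."""
    present = [a for a in alleles if a != MISSING]
    if not present:
        return 0.0                      # nothing observed: factor of one
    # multinomial coefficient = number of distinct orderings of the observed copies
    coeff = math.factorial(len(present))
    for c in Counter(present).values():
        coeff //= math.factorial(c)
    logp = math.log10(coeff)
    for a in present:
        logp += math.log10(freq[a])     # freq values are assumed to be non-zero
    return logp

def genotype_log10_prob(loci, freqs):
    """Sum the per-locus values; loci and freqs are parallel lists."""
    return sum(locus_log10_prob(al, fr) for al, fr in zip(loci, freqs))

# Made-up frequencies for one locus.
freq = {"184": 0.5, "198": 0.25}
full = locus_log10_prob(("184", "198"), freq)    # heterozygote: 2 * 0.5 * 0.25
partial = locus_log10_prob(("184", MISSING), freq)
```

Because the two values are built from different numbers of allele copies, they illustrate why indices for individuals with and without missing data are not directly comparable.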
The Virtual Clipboard lets you paste data into or copy data out of multi-page windows. The ---> VC and <--- VC buttons move data to and from the virtual clipboard. If you see a "Browse" button below, then your browser will also let you copy a file on your machine to the Virtual Clipboard. You can then paste it from the V.C. into the calculator's input windows.
The Virtual Clipboard is Brainless
Is your window full of garbage after pasting? Unfortunately, the Virtual Clipboard knows nothing about file formats, so copying an Excel file directly to it will fill your window with junk (i.e. raw Excel file bytes). You can, however, copy from the open Excel worksheet to a text editor, then save as text in a new file which you can then copy to the Virtual Clipboard. You can also just save a copy of your Excel file in tab-delimited text format, and copy that file to the VC. The point of the VC is just to help work around the limitations on window size in Netscape, MS Explorer, and probably other browsers.
Multi-Page TextBoxes (for medium-sized datasets)
Some browsers limit the size of text windows, so this program allows for bigger input and output using multiple pages. Each page is limited to around 10000 characters. To see if you have filled a page, try typing more characters at the end of the window. When you use your system's copy/paste functions to get text in and out of this program, you have to do so one page at a time; the program has no control over this.
Running, Reading, and Writing on Your Machine
You can download this web page and calculator so that you can run it from your own machine, without using the network. The only disadvantage of this is that you won't necessarily be running the latest version.
If you are using Netscape Navigator/Communicator V4.0 or higher, you can load data into a text box directly from a file on your machine, without going through the virtual clipboard (and thus avoid sending your data over the network). The Load and Save buttons below a data entry window will pop up a file dialog, allowing you to load (save) the contents of the window from (to) a file. Netscape will ask your permission for these operations. Unfortunately, other browsers do not (yet?) support the same security mechanism. Netscape Communicator is now open source, so this is the only security mechanism I will support for now. To use these features, make a local copy of Doh:
Cumulative Heterozygosity and Allele Count
Doh can calculate cumulative expected and observed heterozygosity, cumulative allele count, and cumulative probability of identity (P(ID)) as individuals are considered one by one. To use this feature:
Randomizing for Confidence Intervals
Doh can repeatedly randomize your data and re-calculate the assignment test. This allows you to test several null hypotheses about the numbers of individuals cross-assigned between populations. The options are:
While randomization proceeds, which might take some time, a histogram of numbers of population to population cross-assignments for each pair of populations is displayed for your amusement. The pink bar is the number of cross-assignments for the observed data, while the black bars accumulate a histogram of cross-assignments for the randomized populations. Again, for now, this graph is ephemeral.
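The changelog on this page says the randomizations reshuffle alleles at each locus, either within or across populations. One way such a shuffle could look is sketched below. The record layout (name, population, list of allele tuples per locus) and the function name are assumptions for this example, and Doh's own procedure may differ in detail.

```python
import random

def shuffle_alleles(records, within_populations=False, rng=None):
    """Return new records with allele copies permuted locus by locus.

    records -- list of (name, pop, loci); loci is a list of allele tuples
    within_populations -- if True, copies are only exchanged inside a population
    """
    rng = rng or random.Random()
    groups = {}
    for idx, (_, pop, _) in enumerate(records):
        key = pop if within_populations else None   # one big group when shuffling across
        groups.setdefault(key, []).append(idx)
    n_loci = len(records[0][2])
    new_loci = [list(loci) for _, _, loci in records]
    for members in groups.values():
        for locus in range(n_loci):
            # pool all allele copies at this locus, shuffle, then deal them back out
            pool = [a for i in members for a in records[i][2][locus]]
            rng.shuffle(pool)
            pos = 0
            for i in members:
                k = len(records[i][2][locus])
                new_loci[i][locus] = tuple(pool[pos:pos + k])
                pos += k
    return [(name, pop, loci) for (name, pop, _), loci in zip(records, new_loci)]

# Made-up data: two individuals in population P, one in Q, two loci each.
recs = [
    ("a", "P", [("1", "2"), ("5", "6")]),
    ("b", "P", [("3", "4"), ("7", "8")]),
    ("c", "Q", [("9", "9"), ("0", "0")]),
]
within = shuffle_alleles(recs, within_populations=True, rng=random.Random(1))
across = shuffle_alleles(recs, rng=random.Random(2))
```

Each shuffled dataset would then be run through the assignment test again, and the cross-assignment counts accumulated into the histogram described above.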
The Randomization Output
The output consists of three new panes in the assignment test output:
Subsampling to Equalize Population Sizes
As a special case, Doh can run a randomization as follows:
In the Assignment Output panel, you will only see values for these items:
You can perform the assignment test on data where each locus represents presence or absence of a dominant marker (e.g. data from RAPD). To do this:
WHICHRUN uses the same calculations as Doh to find the population in which a given genotype is most likely. Here's a brief comparison:
Title line: this is a bogus dataset
Locus1, Locus2, Locus3
stray1 , 122124 192185 221220
stray2 , 125124 182195 221219
pop
pop1 i1 , 121124 192185 221228
pop1 i2 , 123123 182185 221227
pop
pop2 i1 , 123124 191183 220227
pop2 i2 , 122125 189190 219225
. . .
Doh takes the population name as the first word before the comma, and the individual name as the second. If no second word is present, the first word is treated as the individual name and the population is treated as unknown. The pop keyword is ignored. This is still stupid, but I haven't had a chance to do it properly...
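The naming rule just described can be written down as a short sketch. The function below handles a single data line only, not the title or locus-name lines. Its name is hypothetical, and the three-character allele width is an assumption read off the example above.

```python
def parse_data_line(line, allele_width=3):
    """Return (individual, population, loci), or None for a 'pop' separator line.

    allele_width is an assumption: each genotype such as 121124 is cut into
    fixed-width alleles (121 and 124).
    """
    if line.strip().lower() == "pop":
        return None                        # the pop keyword is ignored
    names, _, genotypes = line.partition(",")
    words = names.split()
    if not words:
        raise ValueError("no name before the comma: %r" % line)
    if len(words) >= 2:
        pop, ind = words[0], words[1]      # first word = population, second = individual
    else:
        pop, ind = "?", words[0]           # lone word = individual, population unknown
    loci = [
        tuple(g[k:k + allele_width] for k in range(0, len(g), allele_width))
        for g in genotypes.split()
    ]
    return ind, pop, loci

known = parse_data_line("pop1 i1 , 121124 192185 221228")
stray = parse_data_line("stray1 , 122124 192185 221220")
```

Unknown populations are returned as "?", the same marker this calculator uses for individuals of unknown origin.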
Here are some problems you might encounter, some fixed, some not. (There are doubtless many others; please let me know!)
Date Reported | Date Fixed | Bug or Change | Details |
1-Nov-2002 | 3-Nov-2002 | Subsampling to equal size | A silly mistake caused Doh to hang if you tried to use the "subsample to equal size" randomization on a dataset with individuals from unknown populations. |
15-Jan-2001 | 30-Jan-2001 | WHICHRUN issues | Doh now uses 1 / (1 + ploidy * popSize) to correct for zero allele frequencies, improving agreement with WHICHRUN. PopGENE file loading has been improved, and the unknown population marker "?" has been documented. A comparison of WHICHRUN and Doh appears above. |
15-Nov-2000 | 15-Nov-2000 | Dominant Marker data | Nothing like screwing up basic arithmetic! The correct way to treat dominant markers is to let "Present" be one allele, and "Absent" the other allele of a monoploid locus. It's not necessary to indicate which is which. |
10-Nov-2000 | 10-Nov-2000 | Y2K compliance | Having received no complaints to the contrary in over 10 months of post-millennial operation, I hereby pronounce Doh to be Y2K-compliant. I'm sure you're all relieved. |
10-Nov-2000 | 10-Nov-2000 | Dominant Marker data | Dominant Marker presence/absence data can now be used for the assignment test. Allele frequencies are estimated from marker presence/absence assuming HWE. |
21-July-99 | 21-July-99 | Read PopGen files; new randomization | PopGen files can be turned directly into datasets. More sensible randomizations have been added: they reshuffle alleles at each locus, either within or across populations. |
6-May-99 | 9-May-99 | Randomization bug; symptom: running histogram disappears; no output generated | For some datasets, randomization would fail without reporting an error, due to a memory allocation bug. |
4-Aug-98 | 6-Aug-98 | Prob of Identity with no alleles | When missing alleles are ignored, and a locus has no non-missing alleles, the probability of identity is now defined to be one. This condition used to generate an error value. |
4-Jun-98 | | Cumulative Heterozygosity and Allele Count | These have been added and are described above. |
13-Feb-98 | | Load and save data locally | This calculator is now digitally signed, and makes use of Netscape's security manager classes to allow, with your permission, reading and writing of files on your machine (Netscape V4.0 and higher only). |
10-Nov-97 | 10-Nov-97 | weird field delimiter from FileMaker Pro | If you export your data from FileMaker Pro, you might have ascii 29 characters separating repeated fields. This program now converts those to spaces when pasting from the virtual clipboard. |
(not rep.) | 16-Sept-97 | wrong probability for locus with missing allele | the probability of a locus where some BUT NOT ALL alleles are missing is wrong; the heterozygosity factor is computed treating 'missing' as just another allele flavour. This can affect assignment for datasets with such individuals. |
1-Aug-97 | 28-Aug-97 | probabilities of heterozygotes are wrong: the multinomial coefficient is omitted. | for diploids, heterozygote genotype probabilities are half of what they should be. Fortunately, this doesn't change the population to which an organism is assigned. |
1-Aug-97 | 28-Aug-97 | the choice of 0.01 or 1 / (ploidy * popSize) as the probability for a "vanishing" allele is applied differently to an individual's own population than to others | assignments could be wrong for individuals with rare alleles |
June-97 | | horizontal scrollbars don't appear: some versions of Netscape for the PC don't draw horizontal scrollbars on text windows. | You can move around the window by clicking inside it and using cursor keys. |
June-97 | | scroll bars don't appear on text windows in some versions of Netscape for the Mac | Resizing the browser window, or causing it to be redrawn makes the scrollbars appear |
June-97 | | the virtual clipboard doesn't work on some browsers | a limitation of Microsoft Explorer (no uploading to the clipboard from a file on your machine) and possibly other browsers. Eventually, there will be a version of this program that allows direct access to files on your machine, making the virtual clipboard unnecessary. |
Paetkau, D., W. Calvert, I. Stirling, and C. Strobeck. 1995. Microsatellite analysis of population structure in Canadian polar bears. Molecular Ecology 4:347-354
Paetkau and Strobeck developed the test, so you should probably cite this paper as the source of the assignment test.
The Doh program itself can also be cited, but the rules will vary among journals. Here is an example:
Brzustowski, J. "Doh assignment test calculator". Online. Available: <http://www2.biology.ualberta.ca/jbrzusto/Doh.php>. (12 March, 2002)
That date is the last update; some journals will want the date you accessed it instead. I haven't published anything referring to this program, but I would like people who want to use it to be able to find it.
Thanks for motivation and feedback to Curtis Strobeck, Greg Wilson, Peter Waser, David Paetkau, Catherine Mossman, Corey Davis, and Linsey Mutch.
The DRAWTREE textual tree format comes from the Phylip package by Joe Felsenstein et al. My thanks to them for making it free.
The HTML and Java source code are free.
This Java and HTML web page is by John Brzustowski. I appreciate any comments or criticisms. Although I try to correct any known errors in these programs, you use them entirely at your own risk. There is no warranty!