UNSUPPORTED SOFTWARE: USE AT OWN RISK

Clustering Calculator


Clustering Calculator

The sample data are counts of insects caught in 4 types of traps from C. Buddle, 1999. Click "Calculate!" to run this example, or "Clear Inputs" to enter your own data.

You need a Java-enabled browser to run this program. Perhaps your browser has Java disabled?

I have copied the Phylip/DRAWTREE output to a clipboard, so take me to TreeToy!

(back to top)


New Features

Bootstrap Consensus Trees

So far, this calculator is not able to print out a Phylip DRAWTREE string with consensus values for randomized trees. However, by selecting these two options from the More Options... menu:

you will get a list of the Phylip tree strings in the output (and the usual text picture of the tree will not appear). You can then run these through Phylip's CONSENSE program. Thanks to Michel Tersac (michel.tersac@ensam.inra.fr) for this idea.

File load/save

If you install a local copy of the applet (see below), then you can load and save the contents of a window from or to a file on your machine, without use of the virtual clipboard. These Load and Save buttons next to a window don't work if you are running cluster directly from the web.

Stability analysis

The applet can assess stability of the tree topology under various resampling scenarios, in order to answer the question If I were to repeat this experiment several times, how often would the observed groupings occur?

There are two factors that determine how this analysis is done:

  1. What other trees might my experiment have come up with?
  2. How do I compare the tree I did see with these others?
There are many approaches possible. Currently, this applet only allows stability analysis when you have a character matrix. Each hypothetical tree is generated by creating a new character matrix from the old one, and then clustering. You can choose one of these methods for generating a new character matrix:

Comparing two trees

This analysis examines only the topology of the two trees, and ignores branch lengths. Consider an interior node (inode) in a binary tree. The samples which are its descendents (its leaf nodes) form two groups: those descending from the left branch, and those descending from the right branch. We treat this inode's partition of its leaf nodes into two sets as the feature of interest (i.e. for a given inode, we examine only how it splits its descendent samples into two groups, and not the structure within those two groups). So every time we generate a new tree, we check it for inodes that divide the same set of samples into the same two subsets as an inode in the original tree does. The output is a number for each inode in the original tree, giving the number of generated trees in which an inode with the same partition was found. Here's an example:

(back to top)


About the Input

The calculator accepts input in several formats:

  • Cornell Condensed
  • Square Distance/Similarity Matrix
  • Upper/Lower Triangular Distance/Similarity Matrix
  • Character Matrix: Samples are Columns
  • Character Matrix: Samples are Rows
  • In all cases, if you select the names in "Names Window" option and that window doesn't contain enough names, or if you select the "No Names" option, sample numbers are used as names for unnamed samples.

    Cornell Condensed Example:

    1     1  1.0 11 4.0 17 7.0 19 4.0 20 2.0
    1     32 4.0
    2     1  3.0 4  2.0 6  3.0 7  4.0 11 4.0
    2     16 5.0 17 5.0 19 4.0 20 7.0 27 5.0
    2     32 3.0
    extra 2  4.0 4  7.0 6  2.0 11 4.0 16 2.0
    extra 32 6.0
        

    Each sample can occupy several consecutive lines. The first number (or name) in a line is the sample number (or name), and the rest of the line is a sequence of (item, abundance) pairs. For example, sample "extra" has an abundance of 4.0 for item 11. Abundances of items not listed for a sample are zero. For example, sample 2 has an abundance of 0.0 for item 5. Distances between samples are calculated using the Euclidean metric. If you select "Data Window" as the location of sample names, then the first "word" in each line is taken as the sample name. The number of spaces or tabs between items is not important, but newline characters are: each line must begin with a sample name or number.

    Square Distance Matrix Example (with names):

    ridge 0.0 0.1 0.3 0.9
    floor 0.1 0.0 0.5 0.6
    banks 0.3 0.5 0.0 0.7
    other 0.9 0.6 0.7 0.0

    The window must contain a number (representing distance or similarity) for each orderered pair of samples. If you select "Data Window" as the location of the sample names, then the name of each sample must appear just before its list of distances to other samples. Here, the distance from sample "banks" to sample "other" is 0.7 If you have similarity data, select "Square Similarity Matrix", and the calculator will obtain distances by subtracting each similari from 1.0 Only the lower triangle of the input is used, and the calculator does not verify that your input matrix is symmetrical and has zeroes (or ones) along the diagonal.

    Names and numbers in the window must be separated by spaces, tabs, and/or newlines, but unlike for the Cornell Condensed format, only the order matters! For example, this input is considered equivalent to the previous example:

    ridge 0.0 0.1
    0.3 0.9
    floor
    0.1 0.0 0.5 0.6
    banks 0.3 0.5 0.0 0.7 other 0.9 0.6 0.7

    Upper Triangular Distance Matrix Example (with names):

    ridge 0.1 0.3 0.9
    floor 0.5 0.6
    banks 0.7
    other

    Lower Triangular Similarity Matrix Example (with names):

    ridge
    floor 0.9
    banks 0.7 0.5
    other 0.1 0.4 0.3

    Lower Triangular Similarity Matrix Example (without names, with diagonal):

    0.0
    0.9 0.0
    0.7 0.5 0.0
    0.1 0.4 0.3 0.0

    The window must contain a number (representing distance or similarity) for each unorderered pair of distinct samples. If you select "Data Window" as the location of the sample names, then the name of each sample must appear just before its list of distances to other samples. Here, the distance from sample "banks" to sample "other" is 0.7 (and their similarity is 0.3) The zeroes along the diagonal of a distance matrix, or the ones along the diagonal of a similarity matrix can be present, but if so, make sure to check the "Diagonal Present" option. If you have similarity data, the calculator will obtain distances by subtracting each similarity from 1.0 Only the lower triangle of a square matrix is used, and the calculator does not verify that such a matrix is symmetrical and has zeroes (or ones) along the diagonal.

    Names and numbers in the window must be separated by spaces, tabs, and/or newlines, but unlike for the Cornell Condensed format, only the order matters! For example, this input is considered equivalent to the previous example:

    0.9 0.7 0.5
    0.1 0.4
    0.3
        

    Character Matrix: Samples as Rows Example (with names):

    7
    A B C D E F G
    Lake 0 4 0 0 5 0 6
    Pond 3 0 0 2 3 3 0
    Lawn 1 0 0 2 1 1 0

    There is a list of characters for which each sample has a value. In this example, there are 7 characters, each representing the abundance of a different species of insect found in sweep samples at different locations. The first item in the window is the number of characters in each sample. The rest of the window must contain a number for each sample-character pair (even if it is zero), and all numbers for a given sample must appear consecutively. To accomodate the natural layout of this kind of data, this applet now assumes row and column names are present. This assumption can be turned off under the More Options... menu.

    If you select "Data Window" as the location of the sample names, then the name of each sample must appear just before its list of character values. The distance between samples is calculated using the distance/similarity measure you choose. Names and numbers in the window must be separated by spaces, tabs, and/or newlines, but unlike for the Cornell Condensed format, only the order matters! For example, this input is considered equivalent to the previous example:

    7 A B C D E F G Lake 0 4
    0 0 5 0 6 Pond 3 0 0 2 3 3 0

    Lawn 1 0
    0 2 1 1

    0


    Character Matrix: Samples As Columns Example (with names):

    7
      Lake Pond Lawn
    A 0 3 1
    B 4 0 0
    C 0 0 0
    D 0 2 2
    E 5 3 1
    F 0 3 1
    G 6 0 0
        

    This is the transpose of the Samples as Rows format: the first item is the number of characters in each sample. This is followed by the sample names, if you selected the "Names in Data Window" option, and then for each character, the values it takes in each sample. In this example there are 7 characters, and three samples, namely "Lake", "Pond", and "Lawn". The value of the fifth character for the Lake sample is 5, and the value of the first character for the Lawn sample is 1. As for the Samples-as-rows format, only the order of the data matters.

    (back to top)


    About the Distance/Similarity Measures

    For the Cornell Condensed, Sample-Character, and Character-Sample input formats, you can choose the method for calculating distances (or similarities) between samples.

    Distance Measures
    The distance between two samples is calculated using their values for each character, in various ways.For the following table, Xik is the value of character k for sample i, and |X| means the absolute value of quantity X.

    Name

    Distance between samples i and j

    Euclidean Square-root of the sum over all characters k of (Xik -Xjk )2
    Mean Censored Euclidean Square-root of (the sum over all characters k of (Xik -Xjk )2/ n ) where n is the number of characters for which at least one of the two samples has a non-zero value
    Bray-Curtis (Sum over all characters k of |Xik -Xjk | ) divided by
    (Sum over all characters k of Xik + Xjk )
    Canberra (Sum over all characters k of |Xik -Xjk | / (Xik + Xjk ) ) divided by
    the number of characters

    (Binary) Similarity Measures
    The similarity of two samples is calculated using only the presence or absence of each character in each sample, where presence means a value greater than zero for the character. For the following table, the variables a, b, c, and d mean the number of characters:

    Name

    Similarity of samples i and j

    Jaccard

    a / (a + b + c)

    Sorensen

    2a / (2a + b + c)

    Simple Matching Coefficient

    (a + d) / (a + b + c + d)

    Baroni-Urbani and Buser (a + ( ad )1/2 ) / (a + b + c + ( ad )1/2 )

    If a similarity measure gives a value of S, then the distance between the two samples is calculated as 1.0 - S
    and this will be the value used in the clustering algorithm.

    (back to top)


    About the Clustering Methods

    The methods used here are all hierarchical and combinatorial:

  • each sample begins in a cluster of its own
  • while there is more than one cluster left:
  • The methods differ only in how they calculate the "distance" from one cluster to another:

    Clustering Method

    Distance Between Clusters A and B

    (where A was formed by combining cluster A1 with cluster A2)

    Single Linkage the smallest distance between a sample in cluster A and a sample in cluster B
    Complete Linkage the largest distance between a sample in cluster A and a sample in cluster B
    Unweighted Arithmetic Average the average distance between a sample in cluster A and a sample in cluster B
    Weighted Arithmetic Average the weighted average distance between a sample in cluster A and a sample in cluster B. The weight assigned to the distance between samples S1 (in A) and S2 (in B) is (1/2)^G where G is the sum of the nesting levels of S1 and S2. (The nesting level of a sample is the number of its enclosing clusters at this stage, in other words, the number of times the sample has participated in a "combine" step.) The effect is to reduce the influence of groups of similar samples, if such are present, on the clustering process.
    Unweighted Centroid Cluster A is treated as the centroid of clusters A1 and A2. The distance from A to B is the average of the distance from A1 to B, and the distance from A2 to B, corrected so that A lies on a line joining A1 and A2, in the abstract "metric" space. This is a recursive definition. The distance between a pair of single-member clusters is just the distance between their member samples.
    Weighted Centroid like the unweighted centroid method, but reduces the influence of large numbers of similar samples
    Ward's Minimum Variance the increase in the mean squared deviation (from a sample to the centroid of its containing cluster) that would occur if clusters A and B were fused

    (back to top)


    About the Output

    Note: Output from this program cannot be printed directly. Use the clipboard (or Virtual Clipboard) to copy and paste the contents of the Output window into your favourite word processor.

    The program always prints out the clustering algorithm chosen, and the number of samples. By checking the appropriate box on the calculator, you can have it output the:

  • Distance vs. Similarity
  • Distance/Similarity Table
  • Distance/Similarity Matrix
  • Clustering Progress
  • Tree as a Text Picture
  • Tree in Phylip/DRAWTREE format
  • Distance vs. Similarity

    You can choose to have the program output either similarities or distances, in both the matrix and table output and in the clustering progress output. This does not affect the computations, the distances output for the tree as a text picture, nor the labels output for the Phylip tree output.

    The Distance/Similarity Table option shows distances/similarities between pairs of samples in a table. This can be used to verify that your input data is being correctly interpreted.

    For example, the sample data produces the following output, if you check the "output similarities instead of distances" option.

    From    To      Similarity
    ========================
    PLI     CH      0.88
    CI      CH      0.99
    CI      PLI     0.88
    NS      CH      0.66
    NS      PLI     0.62
    NS      CI      0.66
    CL      CH      0.77
    CL      PLI     0.7
    CL      CI      0.78
    CL      NS      0.73
    CT      CH      0.75
    CT      PLI     0.71
    CT      CI      0.75
    CT      NS      0.64
    CT      CL      0.76
    SI      CH      0.36
    SI      PLI     0.36
    SI      CI      0.36
    SI      NS      0.28
    SI      CL      0.29
    SI      CT      0.34
    SPI     CH      0.51
    SPI     PLI     0.51
    SPI     CI      0.5
    SPI     NS      0.53
    SPI     CL      0.51
    SPI     CT      0.46
    SPI     SI      0.19
    SGI     CH      0.49
    SGI     PLI     0.49
    SGI     CI      0.48
    SGI     NS      0.5
    SGI     CL      0.49
    SGI     CT      0.45
    SGI     SI      0.2
    SGI     SPI     0.8

    For each sample, its name is followed by the distances between it and the samples preceding it. Here, the distance between sample SPI and sample SI is 0.81, and the distance between CT andCI is 0.256667

    (back to top)


    The Distance/Similarity Matrix option shows the lower triangular matrix of distances or similarities between samples. This can be used to verify that your input data is being correctly interpreted.

    For example, the sample data produces the following output, if the "output similarities instead of distances" option is not checked:

    Distance matrix:
            CH      PLI     CI      NS      CL      CT      SI      SPI  
    PLI     0.12    
    CI      0.01    0.12    
    NS      0.34    0.38    0.353333        
    CL      0.23    0.3     0.25            0.27    
    CT      0.25    0.29    0.256667        0.338   0.24    
    SI      0.64    0.64    0.64            0.72    0.71    0.668333        
    SPI     0.49    0.49    0.493333        0.47    0.49    0.54            0.81    
    SGI     0.51    0.51    0.503333        0.485   0.5     0.506667        0.7025  0.2     
        

    For each sample, its name is followed by the distances between it and the samples preceding it. Here, the distance between sample SPI and sample SI is 0.81, and the distance between CT and CI is 0.256667 The formatting is a bit wonky, but it should allow the text to be cut and pasted directly into TAB-aware programs like spreadsheets.

    (back to top)


    The Clustering Progress option shows the step-by-step progress of the clustering method. The output is a table showing the two clusters joined at each step, and the distance or similarity between them.

    For example, the sample data produces the following output, if the "output similarities instead of distances" option is not checked:

    Clustering procedure:
    Step       A      B    Distance
    ===============================
    1       CI      CH      0.01
    2       1'      PLI     0.12
    3       SGI     SPI     0.2
    4       CT      CL      0.24
    5       4'      2'      0.256667
    6       5'      NS      0.338
    7       3'      6'      0.506667
    8       7'      SI      0.7025
        

    At the first step, the samples CI and CH are joined because the distance separating them, 0.01, is the smallest of any pair of samples. At step 2, the cluster formed in step 1, namely "1'", is joined to sample PLI at a distance of 0.12 The "'" is added to cluster numbers to prevent confusion with sample names which might happen to be numbers.

    (back to top)


    The Tree as Text Picture option draws a tree representing the clustering process. The samples form "leaf" nodes which occur at the right of the picture. Each step in the clustering process is represented by an "interior" node, which has descendents to its right. The calculator also shows a table listing the distances between adjacent nodes.

    For example, the sample data yields the following tree:

    Tree Topology: (branch lengths ignored)
    
              +----SGI
         +----+1
         |    +----SPI
    +----7
    |    |              +----CT
    |    |         +----2
    |    |         |    +----CL
    |    |    +----5
    |    |    |    |         +----CI
    |    |    |    |    +----3
    |    |    |    |    |    +----CH
    |    |    |    +----4
    |    |    |         +----PLI
    |    +----6
    |         +----NS
    8
    +----SI
    
    From    To        Length
    ========================
    1'      SGI     0.1
    1'      SPI     0.1
    7'      1'      0.153333
    2'      CT      0.12
    2'      CL      0.12
    5'      2'      0.00833333
    3'      CI      0.005
    3'      CH      0.005
    4'      3'      0.055
    4'      PLI     0.06
    5'      4'      0.0683333
    6'      5'      0.0406667
    6'      NS      0.169
    7'      6'      0.0843333
    8'      7'      0.0979167
    8'      SI      0.35125
        

    The structure of the tree represents the clustering process: the two descendents of an interior node are the two clusters joined at that point in the process. For example, clusters 2 and 4 were combined to create cluster 5. Although no attempt is made to represent distances in the drawing of the tree, a table is printed listing the actual distances between adjacent nodes. These distances have the property that for any pair of leaf nodes (ie. samples), the distance you get by tracing along the tree from one node to the other, adding appropriate values from the table as you go, is equal to the distance at which the two nodes joined the same cluster. For example, samples NS and CI are connected by the following path:

    From    To      Distance
    =========================
     NS     6'      0.169
     6'     5'      0.0406667
     5'     4'      0.0683333
     4'     3'      0.055
     3'     CI      0.005
    -------------------------
    Total Distance: 0.338
        

    The "tree length" between NS and CI is thus 0.338, and this is the same as the distance at which clusters 5 and NS were joined in step 6, as can be seen in the Clustering Progress table.

    For the Unweighted Arithmetic Mean clustering method used in the example, this fact can be stated in several ways:

    1. the tree length between NS and CI is 0.338
    2. the smallest cluster containing NS and CI was created at a threshold of 0.338
    3. NS and CI are members of maximal disjoint clusters the average distance between whose members is 0.338

    Statements 1 and 2 would be true regardless of which clustering method generated this output. Statement 3 is only true for the Unweighted Arithmetic Mean clustering method.

    This statement is not equivalent and doesn't follow from the previous ones:

  • the distance between samples NS and CI is 0.338
  • For the other clustering methods, similar kinds of statements are true and not true.

    (back to top)


    The Tree in Phylip/DRAWTREE format option outputs a string representing the tree in a way that the Phylip program DRAWTREE can plot the tree in two dimensions. This string can also be used in the TreeToy webpage, as described below. For example, the sample input yields the following Phylip tree:

    Tree in Phylip/DRAWTREE format: 
    (((SGI:0.1,SPI:0.1):0.153333,(((CT:0.12,CL:0.12):0.00833333,
    ((CI:0.005,CH:0.005):0.055,PLI:0.06):0.0683333):0.0406667,NS:0.169):0.0843333):0.0979167,SI:0.35125);


    Visualizing the Tree

    If you want to look at your tree in a two-dimensional graphical form that reflects tree distances, you can use TreeToy as follows:

  • make sure the "Output a Phylip/DRAWTREE..." option is checked on this calculator
  • run the calculator by hitting "Calculate!"
  • from the Output window, select everything after the "Tree in Phylip/DRAWTREE format:" line until the semicolon at the end
  • copy this selected text to your system clipboard
  • return to this spot on the page and follow this link to TreeToy, where you can paste from the clipboard (or Virtual Clipboard) into the Input Window, and hit "Build Tree from Input" to display the tree graphically.
  • Warning: your browser will probably discard the contents of the Clustering output window when you visit TreeToy, so make sure you copy/paste out anything you don't want to lose!

    Phylip is a package of free clustering, phylogeny, and data analysis programs produced by Joel Felsenstein et al, and is available for many platforms. You can visit its homepage. The Phylip DRAWTREE program will take a textual representation of a tree (such as can be produced by this calculator), and render it as a two-dimensional graph in various graphics formats, many of which can be printed. If you want to print your graph, consider downloading a version of Phylip.

    (back to top)



    Virtual Clipboard (for large files)

    The Virtual Clipboard lets you paste data into or copy data out of multi-page windows. The ---> VC and <--- VC buttons move data to and from the virtual clipboard. If you see a "Browse" button below, then your browser will also let you copy a file on your machine to the Virtual Clipboard. You can then paste it from the V.C. into the calculator's input windows.

    Name of file:

    Note: if you get a "Save As..." dialog box when viewing the virtual clipboard, it is because the data there is in a binary format (e.g. a PICT file saved from TreeToy). You should specify an appropriate name (e.g. myplot.pict instead of viewclip.php) before saving it.


    The Virtual Clipboard is Brainless

    Is your window full of garbage after pasting? Unfortunately, the Virtual Clipboard knows nothing about file formats, so cutting and pasting directly from an Excel file will fill your window with junk (ie. raw Excel file bytes). You can, however, copy from the open Excel worksheet to a text editor, then save as text in a new file which you can then copy to the Virtual Clipboard. You can also just save a copy of your Excel file in tab-delimited text format, and copy that file to the VC. The point of the VC is just to help work around the limitations on window size in Netscape, MS Explorer, and probably other browsers.

    (go to Calculator)


    A Note on Pages

    Some browsers limit the size of text windows, so this program allows for bigger input and output using multiple pages. Each page is limited to around 10000 characters. To see if you have filled a page, try typing more characters at the end of the window. When you use your system's copy/paste functions to get text in and out of this program, you have to do so one page at a time; the program has no control over this.

    (back to top)


    Bugs, Bugs, Bugs

    Here are some problems you might encounter (there are doubtless many others - please let me know!):

  • scrollbars don't function: some versions of Netscape for the Mac don't draw scrollbars until you force the browser to refresh the page by, for example, resizing it.
  • horizontal scrollbars don't appear: some versions of Netscape for the PC don't draw horizontal scrollbars on text windows. You can still move around the window by clicking inside it and using cursor keys.
  • the Virtual Clipboard doesn't work: you can't (couldn't used to?) copy a local file to the Virtual Clipboard from Microsoft Internet Explorer. You might not be able to use the VC at all from some browsers. The size of your clipboard might not appear correctly. These are all problems with the current VC, which is only there to get around (sensible!) security limitations imposed by Java.
  • the interface is clunky: I'm working on it.
  • (back to top)


    Credits:

    The clustering functions are slightly-modified Java translations of FORTRAN code by Fionn Murtagh, available from the statlib S-plus archive as multiv. Free use and redistribution are permitted for non-commercial purposes.

    The DRAWTREE textual tree format comes from the Phylip package by Joel Felsenstein et al. My thanks to them for making it free.

    Thanks for motivation and feedback to Jim Hammond, Mark Dale, John Spence, Greg Wilson, Curtis Strobeck, and Charley Krebs.

    This is free software.


    For why software should be free, see http://gnu.linux.ucla.edu/philosophy/why-free.html

    For other free ecology programs, see http://www2.biology.ualberta.ca/jbrzusto

    This Java and HTML/PHP web page is by John Brzustowski (junkjbrzusto@ualberta.cacaca but removing the obvious parts). You can email me comments or criticisms. Although I strive to ensure the correctness of these programs, you use them entirely at your own risk. There is no warranty!

    (back to top)


    Installing CLUSTER on Your Machine

    If your connection to this server is slow, you can get your own copy of the program:

    To run the program, use your browser to open your copy of this HTML file. (For older browsers, you might have to unzip the archive cluster.zip in this directory. Be sure to preserve the directory structure stored in the archive.)

    The virtual clipboard can only be used in conjunction with a web server. See above before deciding to try the following. Your webmaster can set up your server to allow you to use the virtual clipboard as follows:

    (back to top)

    Getting the source code

    The zipped java source for this applet is here.


    John Brzustowski
    Last modified: Wed Jul 3 18:54:17 MST 2002