CL HC - Classification - Controlled

(5/30/09)

PURPOSE

To find clusters of images/elements in factor space (or a selected subspace) by using Diday's method of moving centers, and to apply hierarchical ascendant classification (HAC) using one of several methods to the resulting cluster centers.
Reads input from 'CA S' operation.
Produces a Postscript dendrogram plot and a dendrogram description doc. file. These files can be used to determine the images/elements assigned to each class (cluster).
See: Classification and Clustering Tutorial for further info.   Example.

SEE ALSO

CA S [Correspondence Analysis - Solution]
CL CLA [Classification - Automatic]
CL KM [Classification - K Means ||]
CL HD [Classification - Hierarchical, get number of classes]
CL HE [Classification - Hierarchical, create cluster selection files]

USAGE

.OPERATION: CL HC

.CORAN/PCA FILE (e.g. CORAN_01_IMC) : coran_IMC
[Enter name of the raw image data sequential file (_SEQ), image factor coordinate file (_IMC), or pixel factor coordinate file (_PIX) file containing your data. These files were created by 'CA S. (This operation is usually used with _IMC files).

.FACTOR NUMBERS: 1, 3, 4, 6
[Enter the factors to be included in the hierarchical clustering.]

.FACTOR WEIGHT: 1.5
[Enter a weight for each factor selected. If the answer zero is given at any point, all weights from the current factor onwards are set to one. This question is asked as many times as the number of factors specified, or is terminated by the answer zero.]

.FACTOR WEIGHT: 0
[This question is asked as many times as the number of remaining factors, or is terminated by the answer zero.]

.CLUSTERING CRITERION (1-5): 2
[Enter the number indicating clustering criterion to be used. Possible choices are:
Option 1:   Single linkage
Option 2:   Complete linkage
Option 3:   Average linkage
Option 4:   Centroid method
Option 5:   Ward's method]

.DENDROGRAM POSTSCRIPT OUTPUT FILE: HC_DEND_PLOT.ps
[Enter name of file where the dendrogram will be stored. Enter '*' to skip creating this file and skip the next question.]

If Postscript output was requested SPIDER asks.]

.PLOT CUTOFF (0 ..100): 30
[Enter the scale value value at which the dendrogram will be cut/truncated. Only top portion will be produced.]

.DENDROGRAM DOC. FILE? (Y/N): HC_DEND
[Enter name for document file where the UNTRUNCATED dendrogram information will be stored. This file contains the class numbers and height of all dendrogram branches. Using this information, one can retrieve the images/elements which are present in each of the classes. Enter '*' to skip creating this file.]

NOTES

  1. The RESULTS file contains the following information:

    A) Aggregation history:
    For each of the NP partitions, the NS seeds and sizes of clusters are listed. The non-empty clusters in the crossed partition are given by size and cummulative percentage. A total of NS**NP clusters are possible, but in practice, only 10% of these are non-empty.

    B) Description of hierarchy nodes:
    The nodes are numbered starting from the number of the highest cluster. For each node, the Senior, Junior, size, weight and hierarchy index are given. The hierarchy index is printed out as a histogram.

    C) Description of the classes:
    For each node the constituent classes are listed.

    D) Dendrogram:
    The class relationships are represented in the form of a dendrogram (tree structure). The lengths of the branches (in horizontal print direction) are proportional to the heirarchy indices.

    E) List of class members:
    The members are listed for each of the basic non-empty classes.

    F) List of class center coordinates:
    For each class, the NFAC coordinates of its center is listed. These are contained in the cluster file.

    G) Re-classification lookup table:
    Each cutoff point in the dendrogram, from right to left, defines a new classification scheme with the number of classes increasing by 1 each time. The table gives the new class memberships for any cutoff point selected.

  2. To get the classification on a given "cutting" level use operation: 'CL HD'.

  3. To get the selection doc files corresponding to the given "cutting" use operation: 'CL HE'.

  4. To calculate averages for the classes use operation 'AS DC' and selection doc. files.

  5. Every vertical line at the bottom of the dendrogram (with no cutoff) represents an image that you input into 'CA S'. Each vertical line is an average of the images, or vertical lines, below it.
    The threshold is a scaled value from 0 to 100 that informs 'CL HC' how far "up" the dendrogram you wish it to look. A threshold set at the bottom would result in the number of classes being equal to the number of input images. A median threshold value of 50 results in fewer classes. And a top level threshold gives a single class containing of all the inputs.

  6. With untruncated results and/or very many classes the Postcript plot lines/labels may be overwritten, sorry.

  7. Originally implemented by P. Penczek.

SUBROUTINES

HCLS, HCLP, DIST_P, CHAVA, DENDRO

CALLER

UTIL1