SPIDER: CL CLA (Classification

CL CLA - Classification - Automatic

(4/17/17)

PURPOSE: To find clusters of images/elements in factor space (or a selected subspace) by using Diday's method of moving centers, and to apply hierarchical ascendant classification (HAC) using Ward's method to the resulting cluster centers.
Reads input from 'CA S' operation.
Produces a Postscript dendrogram plot and a dendrogram description doc. file. These files can be used to determine the images/elements assigned to each cluster (class).
See: Classification and Clustering Tutorial for further info. Example.

SEE ALSO

CA S	[Correspondence Analysis - Solution]
CL HC	[Classification - Controlled]
CL KM	[Classification - K Means \|\|]
CL HD	[Classification - Hierarchical, get number of classes]
CL HE	[Classification - Hierarchical, create cluster selection files]

USAGE

.OPERATION: CL CLA

.CORAN/PCA FILE PREFIX: CORAN_01
[Enter prefix of the name of the image factor coordinate file (_IMC) containing the data. This file was created by 'CA S.

.CLUSTER OUTPUT FILE FOR 'CA SM' USE: CLU001
[Enter name for binary cluster file where cluster and classification parameters are placed for later reference. This file must NOT exist currently. The only current use for this file is for labeling images as to their class on a factor map using: 'CA SM'.]

.FACTOR NUMBERS TO BE USED: 1-4
[Enter list of factors to be used in the cluster search. WARNING. INPUT ALTERED 2011. To use factors 1..7 list all factors wanted, i.e. 1-7; Not: 7.]

.NUMBER OF ITERATIONS & CENTERS: 5, 8
[Enter number of iterations (default=5), and number of centers ("seed images") per partition (default=5). Iterations control how many times the center is moved. The centers are randomly chosen images from which the algorithm starts.]

.NUMBER OF PARTITIONS: 4
[Enter number of initial clusters (default=4) that are expected. The number of possible clusters is: "CLUSTERS PER PARTITION" ** "NUMBER OF CLUSTERS" subject to a maximum of 100.]

.OCCUPANCY LEVEL FOR CLASS CUTOFF (0=NO CUTOFF): 2
[Enter the occupancy level at which the dendrogram will be cut. Classes (clusters) having fewer occupants than this level will be removed. This number must be be in occupancy range of the dendrogram tree which was just reported.]

.DENDROGRAM POSTSCRIPT OUTPUT FILE: CLA_DEND_PLOT.ps
[Enter name of file where the dendrogram plot will be stored. Enter '*' to skip creating this file and skip the next question.]

If Postscript output was requested SPIDER asks.]

.PLOT CUTOFF (0 ..100): 30
[Enter the scale value value at which the dendrogram will be cut/truncated. Only top portion will be produced.]

.DENDROGRAM DOC. FILE: CLA_DEND
[Enter the document file name where the UNTRUNCATED dendrogram information will be stored. This file contains the class (cluster) numbers and height of all dendrogram branches. Using this information, one can retrieve the images/elements which are present in each of the classes. Enter '*' to skip creating this file.]

NOTES

The RESULTS file contains the following information:
A) Aggregation history:
For each of the NP partitions, the NS seeds and sizes of clusters are listed. The non-empty clusters in the crossed partition are given by size and cummulative percentage. A total of NS**NP clusters are possible, but in practice, only 10% of these are non-empty.
B) Description of hierarchy nodes:
The nodes are numbered starting from the number of the highest cluster. For each node, the Senior, Junior, size, weight and hierarchy index are given. The hierarchy index is printed out as a histogram.
C) Description of the classes:
For each node the constituent classes are listed.
D) Dendrogram:
The class relationships are represented in the form of a dendrogram (tree structure). The lengths of the branches (in horizontal print direction) are proportional to the heirarchy indices.
E) List of class members:
The members are listed for each of the basic non-empty classes.
F) List of class center coordinates:
For each class, the NFAC coordinates of its center is listed. These are contained in the cluster file.
G) Re-classification lookup table:
Each cutoff point in the dendrogram, from right to left, defines a new classification scheme with the number of classes increasing by 1 each time. The table gives the new class memberships for any cutoff point selected.
The algorithm underlying this set of programs is described by Lebart et al., Multivariate Descriptive Statistical Analysis, J. Wiley and Sons, New York 1984, Chapter V, p.109.
To repeat a run and get the exact same results you must start with the same random number generator seed. Use 'MD' with option 'SET SEED' before invoking 'CA CLA'.
Every vertical line at the bottom of the dendrogram (with no cutoff) represents an image that you input into 'CA S'. Each vertical line is an average of the images, or vertical lines, below it.
The threshold is a scaled value from 0 to 100 that informs 'CL CLA' how far "up" the dendrogram you wish it to look. A threshold set at the bottom would result in the number of classes being equal to the number of input images. A median threshold value of 50 results in fewer classes. And a top level threshold gives a single class containing of all the inputs.
With untruncated results and or very many classes the Postcript plot lines/labels may be overwritten, sorry.

SUBROUTINES: SCLASSI, NOYAU, PARST, RETIR, CLAST, STABK, SHELK, DEUCL, CHAVA, DENDRO, COUPE, RGRI, DENLST, RGRI

CALLER: UTIL1