With the "useful" eigenvectors known, we can determine the representative clusters more efficiently. 'CL CLA' uses only Diday's method for clustering and Ward's criterion for Hierarchical Ascendant Classification (HAC), so the user does not select the clustering criterion. The user also cannot specify the factor weights.
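For readers who want to experiment outside SPIDER, a rough equivalent of the HAC step can be sketched in Python with SciPy. This is a minimal sketch, not SPIDER's own code, and the file coords.txt standing in for the image coordinates from 'CA S' is hypothetical:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    # Hypothetical export of the image coordinates from 'CA S':
    # one row per image, one column per retained factor.
    coords = np.loadtxt("coords.txt")

    Z = linkage(coords, method="ward")                  # HAC with Ward's criterion
    classes = fcluster(Z, t=8, criterion="maxclust")    # cut the tree into e.g. 8 classes
    print(classes)                                      # class label for each image
    # dendrogram(Z) would draw the tree, analogous to the dendrogram plot file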
The procedure: clcla.spi invokes 'CL CLA', which reads the CA/PCA files from 'CA S' and creates a dendrogram document file and a dendrogram plot file.
Hints for using: 'CL CLA'

K-Means is a method of classification that divides the data into a user-specified number of clusters. Two random images ("seeds") are chosen, and their centers of gravity are computed. A partition is drawn down the middle between the centers, the new centers of gravity are computed, and the process is repeated a given number of times. The final result is VERY dependent on which images are chosen as the initial seeds.
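The dependence on the initial seeds is easy to see in a toy implementation. The sketch below only illustrates the algorithm just described; it is not SPIDER's 'CL KM' code, and empty clusters and convergence tests are ignored:

    import numpy as np

    def kmeans(data, n_clusters, n_iter=20, seed=None):
        """Toy K-means: 'data' holds one image per row (e.g. its factor coordinates)."""
        rng = np.random.default_rng(seed)
        # choose random images as the initial "seeds"
        centers = data[rng.choice(len(data), n_clusters, replace=False)]
        for _ in range(n_iter):
            # assign every image to its nearest center
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute the center of gravity of each cluster
            centers = np.array([data[labels == k].mean(axis=0)
                                for k in range(n_clusters)])
        return labels

Running this twice on the same data with different values of 'seed' will generally give different partitions, which is exactly the behaviour discussed below.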
Because our faces data set is synthetic, we know the exact number of classes and exactly which images are identical (except for the random noise). The output discussed here was obtained with 8 classes, using factors 1-3 and a uniform factor weight of 1.0 for all three factors. (NOTE: The pictures were from a data set with somewhat different noise than the other runs discussed in this tutorial.)
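The factor selection and weighting amount to taking the chosen columns of the coordinate matrix and scaling them. A tiny sketch, with coords.txt again a hypothetical stand-in for the exported coordinates (SPIDER handles this internally):

    import numpy as np

    coords = np.loadtxt("coords.txt")     # hypothetical export: one row per image, one column per factor
    selected = coords[:, :3]              # keep factors 1-3
    weights = np.array([1.0, 1.0, 1.0])   # uniform factor weight of 1.0 for all three factors
    weighted = selected * weights         # the coordinates the clustering actually sees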
The procedure: clkm.spi invokes 'CL KM' and creates class averages for the specified number of classes.
Hints for using: 'CL KM'

The doc file: clkm_453_doc was produced by a run of 'CL KM' with the above input values and a random number seed of 453. The third column gives the image number and the fourth column gives the class in which 'CL KM' placed the image. Images 1-10 were all placed in cluster 6, which is what we expect because they are all noisy images of the same protoimage. For the most part, 'CL KM' kept images from the same protoimage clustered together, except for the last ten images. However, it preferred to place images 11-20 and 31-40 in the same cluster, instead of giving each group its own cluster. The average images for images 11-20 and 31-40 are shown below; they differ in mouth size.
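If you would rather tabulate the class membership than read the doc file by eye, a small script can group the images by class. This is a hedged sketch: the column layout is taken from the description above (column 3 = image number, column 4 = class), and the file name clkm_453_doc.dat is assumed:

    from collections import defaultdict

    members = defaultdict(list)
    with open("clkm_453_doc.dat") as f:           # assumed file name
        for line in f:
            if line.lstrip().startswith(";"):     # skip SPIDER comment/header lines
                continue
            cols = line.split()
            if len(cols) < 4:
                continue
            image = int(float(cols[2]))           # third column: image number
            klass = int(float(cols[3]))           # fourth column: class assignment
            members[klass].append(image)

    for klass in sorted(members):
        print(klass, members[klass])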
The doc file: clkm_789_doc was produced by a run of 'CL KM' exactly the same as the previous run, except with a random number seed of 789. Once again, most images were correctly placed into their protoimage cluster, except for the last few images. But in addition to images 11-20 and 31-40 being clustered together, images 1-10 and 21-30 were placed in the same cluster as well. This clearly demonstrates that K-means is highly dependent on the initial images chosen as seeds and should be used with extreme caution. Below are the average images for images 1-10 and 21-30.
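One way to quantify how much two such runs agree is to compare their partitions pair by pair (the Rand index). The sketch below assumes labels_453 and labels_789 are the class columns parsed from the two doc files, in image order; those names are hypothetical:

    import numpy as np

    def rand_index(a, b):
        """Fraction of image pairs that both runs treat the same way (together or apart)."""
        a, b = np.asarray(a), np.asarray(b)
        same_a = a[:, None] == a[None, :]
        same_b = b[:, None] == b[None, :]
        iu = np.triu_indices(len(a), k=1)         # each unordered pair of images once
        return (same_a[iu] == same_b[iu]).mean()

    # e.g. rand_index(labels_453, labels_789); a value well below 1.0
    # reflects the sensitivity to the initial seeds.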
The doc files: clkm_SEQ_453_doc and clkm_PIX_453_doc are outputs from running 'CL KM' on the same data as above, but using the SEQ and PIX files, respectively. The clkm_PIX_453_doc.dat file is 95 KB in size. The results of the SEQ run should be the same as those of the previous runs, because it is still comparing images. However, the PIX results are expected to differ because 'CL KM' is now trying to place the PIXELS into eight classes.
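The SEQ-versus-PIX distinction comes down to whether the rows or the columns of the data matrix are being classified. A conceptual sketch (not SPIDER code, with placeholder data) using SciPy's k-means:

    import numpy as np
    from scipy.cluster.vq import kmeans2

    data = np.random.default_rng(0).random((40, 1024))     # placeholder: 40 images of 1024 pixels

    _, image_labels = kmeans2(data,   8, minit="points")   # group the 40 images (as with the image data)
    _, pixel_labels = kmeans2(data.T, 8, minit="points")   # group the 1024 pixels (as with the PIX data)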
Source: docs/techs/classif/ca-otherops.html Last update: 06 Feb 2013