Clustering and Classification Tutorial

Source Data Set
Correspondence or Principal Component Analysis
Determine what Factors are Attribute or Noise based
Create Factor Maps
Controlled Clustering and Hierarchical Classification
Other Classification Methods
View and Analyse Dendrograms
Re-create certain images from eigenvectors
Create "virtual" images from eigenvectors
View factor map and create class averages using Web
References

Clustering/Classification - Summary of Steps

A note about "cluster" vs "class" terminology. It seems natural to call the group of points seen on a factor map a "cluster". Likewise, it seems natural to call the group of images denoted by those points a "class". However, the boundary between "cluster" and "class" is fuzzy, so both terms have been, and continue to be, used almost interchangeably in the field.

Carry out CORAN or PCA
- Create Orthogonal Vectors
  
  Run: 'CA S'
  
  Reads images.
  
  Carries out either PCA or CORAN analysis.
  
  Creates data files that are input for following steps.

Determine Which Factors Are Important
- Map Factors
  
  Run: 'CA SM'
  
  Uses output from: 'CA S'
  
  Creates factor (vector) map of data: images, pixels, or eigenvectors
- Clustering/Classification
  Uses output from: 'CA S'
  
  Run: 'CL HC'
  Finds clusters/classifies data using single linkage, complete linkage, average linkage, the centroid method, or Ward's method and HAC.
  
  Creates dendrogram doc file, and dendrogram plot.
  
  Run an alternative classification method: 'CL CLA' or 'CL KM'
  
  Find clusters/classes.
  
  Creates a dendrogram or class membership doc. file.
- View and Analyse Dendrograms.
  Uses dendrogram doc files from 'CL HC' or 'CL CLA'.
  
  Run: 'CL HD'
  
  Counts number of classes and images in each class, for a given threshold.
  
  Run: 'CL HE'
  
  Creates doc files describing which images are in each class for a given threshold.
  
  Use: Web to view the dendrogram and average images.
  
  Creates a view containing average images for each class at a specified threshold.
Analyse Eigenimages
Uses output from: 'CA S'
- Run: 'CA SR'
  
  Reconstitute images from eigenvectors & create "virtual" images from arbitrary eigenvector values
- Run: 'CA SRD'
  
  Create difference eigenimages.
- Run: 'CA SRI'
  
  Reconstitute importance images.
View Factor maps
- View the factor map and create arbitrary average images

Source Data Set

SPIDER procedures: makefaces.spi (and the face.spi procedure which it calls) were used to create the eight original faces below. These faces differ in three ways: oval vs. round head, left vs. right eyes, and big vs. small mouth. Ten copies of each face with random noise form the sample data set. These procedure files create four SPIDER image stack files.

Name	Contents
face_tmpl_stk	Face templates
face_nois_stk	Noisy faces
face_aver_stk	Average of the ten noisy face images for each template
face_varn_stk	Variance of the ten noisy face images for each template

Face Templates

Sample Noisy faces

Correspondence Analysis (CA) or Principal Component Analysis (PCA)

CA is the preferred method of finding inter-image variations, and this tutorial principally discusses inter-image variance. PCA measures the distance between data vectors with the Euclidean distance, while CA uses the Chi-squared distance. CA is superior here because it ignores differences in exposure between images, eliminating the need to rescale the images.
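The difference between the two distances can be illustrated with a minimal Python sketch. It is not part of SPIDER and does not reproduce the 'CA S' implementation; the 4-pixel images and the function names are hypothetical, and the Chi-squared distance is written in its usual textbook form (distance between row profiles, weighted by column mass), which assumes non-negative pixel values.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chi_squared(a, b, column_mass):
    """Chi-squared distance between the profiles of two images.

    Each image is first divided by its own pixel sum (its "profile"),
    so a uniform change in exposure cancels out.
    """
    sum_a, sum_b = sum(a), sum(b)
    return math.sqrt(sum(
        (x / sum_a - y / sum_b) ** 2 / m
        for x, y, m in zip(a, b, column_mass)
    ))

# Hypothetical 4-pixel images: `bright` is `dim` at twice the exposure.
dim = [1.0, 2.0, 3.0, 4.0]
bright = [2.0 * v for v in dim]
other = [4.0, 3.0, 2.0, 1.0]

images = [dim, bright, other]
total = sum(sum(img) for img in images)
# Column mass: share of the total intensity that falls on each pixel.
column_mass = [sum(img[j] for img in images) / total for j in range(4)]

d_euclid = euclidean(dim, bright)                  # exposure counts as a difference
d_chi = chi_squared(dim, bright, column_mass)      # exposure cancels out
d_chi_other = chi_squared(dim, other, column_mass) # a genuinely different image

print("Euclidean   dim vs bright:", d_euclid)
print("Chi-squared dim vs bright:", d_chi)
print("Chi-squared dim vs other :", d_chi_other)
```

In this toy case the Euclidean distance treats the brighter copy as a different image, while the Chi-squared distance between the two copies vanishes and stays non-zero only for the image whose pattern really differs.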

The procedure: cas.spi creates a mask and invokes 'CA S' assuming you want to run CA.

Hints for: 'CA S'

The "Number of factors" requested should be larger than the expected number of variations. This is to see the characteristic slope of noise after the "useful" eigenvalues. Typically 10 should be more than enough for our example, but see 'CA S' for more detail.
"Output file prefix" names all output files. A "cas_ca" entry will result in "cas_ca_IMC", "cas_ca_PIX", etc.

Output files from 'CA S' and cas.spi:

Name	Contents	File type
cas_ca_IMC	Coordinates of each IMAGE in the new vector space.	Text
cas_ca_PIX	Coordinates of each PIXEL in the new vector space.	Text
cas_ca_EIG	Eigenvector values.	Text
cas_ca_SEQ	Image pixel values within the mask from all the images. Only useful internally within other 'CA' & 'CL' operations.	Binary
cas_ca_SET	Transposed data. Only useful internally within other 'CA' & 'CL' operations.	Binary

In addition to the 'CA S' output files, cas.spi creates:

Name	Contents	File type
face_mask.spi	User-defined circular mask.	SPIDER image
cas_ca_cor_doc.dat	List of eigenvectors for each image for factors 1 and 2.	SPIDER document

Determining Which Eigenvalues Are Useful, And Which Are From Noise

The best way to determine which eigenvalues are useful and which come from noise is to view a histogram showing the percentage of variance accounted for by each factor. As the histogram shows, the last five eigenvalues are small compared to the first three. Their heights are also level, which is typical of noise. We interpret this to mean that the first three eigenvectors account for more inter-image variance than the last five.

The procedure: caeigpct.spi invokes 'CA NOISE' to create and plot an eigenvalue histogram. Output files:

cas_ca_eigpct_doc: A SPIDER document file listing eigenvalues and the percentage of variance accounted for by each factor.
cas_ca_eigpct.gnu: A Gnuplot file to plot the histogram.
cas_ca_eigpct.ps: A Postscript file containing the histogram.

The Gnuplot Histogram can be visualized using: gnuplot -persist save/cas_ca_eigpct.gnu

The Postscript Histogram can be visualized using: evince save/cas_ca_eigpct.ps
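
The percentage-of-variance calculation behind such a histogram can be sketched in a few lines of Python. This is only an illustration, not the caeigpct.spi procedure or the 'CA NOISE' operation; the eight eigenvalues are hypothetical numbers chosen to mimic the pattern described above (three large values followed by a level tail).

```python
def percent_variance(eigenvalues):
    """Return the percentage of total variance carried by each factor."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]

# Hypothetical eigenvalues: three "useful" factors, then a flat noise tail.
eigenvalues = [0.050, 0.030, 0.012, 0.002, 0.002, 0.002, 0.001, 0.001]
percents = percent_variance(eigenvalues)

# Crude text histogram: one '#' per percentage point (rounded).
for factor, pct in enumerate(percents, start=1):
    print(f"factor {factor}: {pct:5.1f}% " + "#" * round(pct))
```

Reading such a listing, the factors to keep are those that stand clearly above the level tail.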

Creating Factor Maps to View Clusters

Once we know which eigenvectors have some meaning and which are from noise, we can use the output from 'CA S' for clustering. The 'CA SM' operation creates 2D factor maps (graphs) of selected pairs of factors to visualize clustering (if any).

The procedure: casm.spi invokes the SPIDER procedure: factor_casm.spi three times, creating three different factor maps using 'CA SM' for different pairs of factors. 'CA SM' is run under the assumptions that you want image space with "ID's" to create PostScript plots.

Hints for: 'CA SM'

Input for: "IMAGE OR PIXEL COORDINATES" should almost always be "I", unless you are interested in "the images changing around the pixels". This determines if 'CA SM' will read the *_PIX or *_IMC file.
"File prefix" is the name entered for the "Output file prefix" in 'CA S'.
"NUMBER OF HORIZONTAL PATCHES" is usually zero, unless this particular CORAN analysis was done on a set of correlation patch averages.
"(S)YMBOL, (A)SSIGN SYMBOL, (C)LASS, (D)OC, (I)D": "ID" is typically the best choice.
"PREPARE POSTCRIPT FILE FOR MAP" should usually be yes, unless you prefer an ASCII map like ascii12.dat.
"FLIP #1" exchanges the map left-right. "FLIP #2" exchanges the map up-down.
Everything after "Postscript output file" can generally be left as default values.
If you get an: "ERROR: *** MAP ABORTED, MORE THAN 264 POINTS ON FRAME" message, this is due to a large number of points. Overcome this by increasing the scale of the factor map, i.e. by increasing the value given for ".NUMBER OF SD (OR <CR> = 2.3)".
The SPIDER results file holds a large amount of other useful information after a 'CA SM' run.

The three factor maps below were created from image (IMC) factor maps using Gnuplot. Postscript files are similar.

Factor 1 vs. Factor 2	Factor 1 vs. Factor 3	Factor 2 vs. Factor 3

'CA SM' also works for pixel factor maps (_PIX). The pixel factor maps below were created from a screen snapshot. The large number of pixels located in the center shows that most pixels vary little, if at all, between eigenvectors.

Factor 1 vs. Factor 2	Factor 1 vs. Factor 3	Factor 2 vs. Factor 3

Controlled Clustering and Hierarchical Classification - Using 'CL HC'

With the "useful" eigenvectors known, we can more efficiently determine the representative clusters. Classification also allows compression of information with minimal loss. The 'CL HC' operation usually gives robust clustering because the user can control the clustering criterion and can alter the "weights" for each factor.

The procedure: clhc.spi invokes 'CL HC' which reads the CA/PCA files from 'CA S' and creates a dendrogram document file and dendrogram plot file.

Hints for: 'CL HC'

"CORAN/PCA FILE" is the name of the file from 'CA S'. The 'CL HC' operation can accept _IMC, _SEQ, or _PIX files, so this entry must include one of these endings to distinguish the kind of file.
"FACTOR WEIGHTS" influence the clustering and the dendrograms. The larger the weight, the more importance a factor has compared to the other factors. To best represent the "truth", set all factor weights equal to each other. Zero sets all the remaining factors to 1.0.
"CLUSTERING CRITERION". Option 5 is the same method 'CL CLA' uses.
The output "DENDROGRAM DOC FILE" can be used with Web to produce dendrograms with cluster averages. It is also used with the 'CL HD' and 'CL HE' operations to create document files listing the images assigned to each cluster.
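
To make the roles of the clustering criterion and the factor weights more concrete, here is a minimal Python sketch of bottom-up clustering with complete linkage. It is not the 'CL HC' algorithm itself. The six 2-factor coordinates are hypothetical, and the way the weights enter the distance (scaling each squared coordinate difference) is an assumption made for illustration only.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with one weight per factor (assumed weighting scheme)."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def complete_linkage(points, weights):
    """Merge clusters bottom-up; return merges as (members_a, members_b, height)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(weighted_distance(points[p], points[q], weights)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical image coordinates on factors 1 and 2 (two groups along factor 1).
points = [(-0.10, 0.01), (-0.09, -0.01), (-0.11, 0.00),
          (0.10, 0.01), (0.09, -0.02), (0.11, 0.00)]
merges = complete_linkage(points, weights=[1.0, 1.0])

for members_a, members_b, height in merges:
    print(sorted(members_a), "+", sorted(members_b), f"at height {height:.4f}")
```

The merge heights printed here play the role of the vertical bars in a dendrogram: the last merge, which joins the two well-separated groups, is far taller than the merges inside each group.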

Screen snapshots of the dendrograms from 'CL HC' for our face data set using clustering option 2 (complete linkage) and option 5 (Ward's method):

Complete linkage	Ward's method

Dendrogram doc files affiliated with the above dendrograms:

Complete linkage doc	Ward's method doc

Other Classification Alternatives - Using 'CL CLA' or 'CL KM'

Alternative automatic classification methods using 'CL CLA', which is less controllable, and 'CL KM', which may give spurious classes, are discussed in detail elsewhere. These operations may be useful for other classification applications, including classification using the _PIX files.

View and Analyse Dendrograms

In a dendrogram, the junction of a vertical and a horizontal line marks the joining of the two classes below it. A representative reconstruction can be formed for each class with the 'CA SR' operation. Larger vertical bars signify a greater difference between classes. The many small differences at the bottom can be eliminated by increasing the "% cutoff" setting. (The jpg file was obtained using a screen snapshot.)

The operation 'CL HD' uses the dendrogram document file from 'CL HC' or 'CL CLA' to create a document file listing how many classes there are for a given threshold and how many images there are in each class. This operation is similar to viewing the dendrogram in Web, setting a threshold, and recording the number of images and the number above each image.

The operation 'CL HE' uses the dendrogram doc file from 'CL HC' or 'CL CLA' to create a series of class selection doc files (e.g. clhc_den_clas_001), which list the images assigned to each of the classes.
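
The idea of cutting a dendrogram at a threshold, counting the classes, and listing their members can be sketched in Python as follows. This is not the 'CL HD' or 'CL HE' code and does not read SPIDER dendrogram doc files. The merge list is hypothetical, and the threshold is assumed to be a percentage of the tallest merge, which may differ from how SPIDER scales it.

```python
def cut_dendrogram(n_images, merges, threshold_pct):
    """Return the classes that remain when only merges at or below the cutoff are kept.

    merges: list of (image_a, image_b, height); each entry joins the classes
            that currently contain image_a and image_b.
    threshold_pct: cutoff as a percentage of the tallest merge (assumption).
    """
    cutoff = max(height for _, _, height in merges) * threshold_pct / 100.0
    parent = list(range(n_images))

    def find(i):
        # Follow parent links to the class representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, height in merges:
        if height <= cutoff:
            parent[find(a)] = find(b)

    classes = {}
    for i in range(n_images):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())

# Hypothetical dendrogram for six images numbered 0-5.
merges = [(0, 2, 1.0), (3, 5, 1.0), (1, 0, 2.0), (4, 3, 3.0), (0, 3, 20.0)]

classes = cut_dendrogram(6, merges, threshold_pct=60)
sizes = [len(members) for members in classes]
print("number of classes:", len(classes))
print("images per class :", sizes)
print("class membership :", classes)
```

Lowering the threshold splits the data into more and smaller classes, while a threshold of 100% leaves a single class containing every image.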

The procedure: clhe.spi invokes 'CL HD', 'CL HE', and then 'AS R' to threshold the dendrogram produced by 'CL HC' at a threshold of 66%, then average together the 20 images in each of the resulting classes. The class averages are then placed in a SPIDER image montage: clhc_clas_montage.spi

Class Montage (Threshold=60%)

The face images are classified perfectly according to head shape and eye direction.
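
The class-averaging step can be sketched in Python as below. This is an illustration of the arithmetic only, not the 'AS R' operation; the 3-pixel images, the image numbers, and the membership lists (standing in for class selection doc files) are hypothetical.

```python
def class_averages(images, membership):
    """Average the pixel vectors of the images assigned to each class."""
    averages = {}
    for class_id, image_ids in membership.items():
        count = len(image_ids)
        n_pixels = len(images[image_ids[0]])
        averages[class_id] = [
            sum(images[i][p] for i in image_ids) / count
            for p in range(n_pixels)
        ]
    return averages

# Hypothetical 3-pixel images keyed by image number.
images = {
    1: [1.0, 0.0, 2.0],
    2: [3.0, 0.0, 2.0],
    3: [0.0, 4.0, 1.0],
    4: [0.0, 6.0, 3.0],
}
# Hypothetical class membership: class number -> image numbers.
membership = {1: [1, 2], 2: [3, 4]}

averages = class_averages(images, membership)
for class_id, average in averages.items():
    print("class", class_id, "average:", average)
```

Averaging within a class reduces the independent noise of the individual images, which is why the class averages look much cleaner than any single noisy face.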



Web can be used to view a dendrogram from 'CL HC' or 'CL CLA'. Web will display not only a usual dendrogram, but it also can annotate the dendrogram with the average images of all the images below a given threshold, and note the number of images in each average.
The document file: den_doc.dat is an example of a dendrogram document file.

How To Use The Web Dendrogram

Run 'CL HC' or 'CL CLA' and create a dendrogram doc file.
It is easier if your image files and dendrogram doc file have the same extension and are in the same directory.
Start Web with your extension (e.g. "Web dat &") and select Command -> image from the menus to navigate to where the images and cluster files are. Press "OK".
Select Commands -> Dendrogram and then select the cluster file.
When the Dendrogram window pops up, choose the lowest threshold option to see all of what you selected. All the variation that was carried over from the 'CL HC' or 'CL CLA' operation is scaled from the lowest level selected to +1 above.
If the "Show average images" option is selected, Web will show the average of the lowest level shown. This is the reason for having the images in the same folder as the dendrogram doc file.

Screen snapshot of the Web display of the complete linkage dendrogram truncated at the 20% level. Note that all of the images are properly separated by head shape and eye direction. Mouth size is not so clear because its eigenvalue is close to the eigenvalue for noise:

Dendrogram snapshot

Re-Create Images From Eigenvectors - Using 'CA SR'

Re-creating sample images from eigenvectors can eliminate the noise in the reconstituted image. It also results in large data compression.

The procedure: casr.spi invokes 'CA SR' to re-constitute any image, or images, from the eigenvectors.

Hints for: 'CA SR'

"FILE PREFIX" is the response given for the "output file prefix" question from 'CA S'.
"FILE #'S TO BE USED" = file numbers of images.
"FACTORS TO BE USED" specifies which factors are used in the reconstitution.
"OUTPUT FILE TEMPLATE" needs at least one * at the end, or the program aborts.

Below is a single image from our sample data set and four separate reconstructions using the first four eigenvectors. (The "halo" of noise and dark corners is a result of the masking function in the procedures face.spi and makefaces.spi, which were used to create the noisy face sample data.) It is difficult to see what traits the original image has because of noise. With the first three single-eigenvector reconstructions we can see that it has an oval head, eyes looking left, and a small mouth. The fourth eigenvector does not "add" anything to our knowledge of the image because only three attributes carry information. Therefore the fourth carries only noise.

Below is the sample image again, and a reconstruction from the first three eigenvectors combined. The image is from the 69'th prototype image shown in the source data area.

Below we have the same "protoimage" images used to create the source data, a sample image from each protoimage, and the noisy samples re-created using each eigenvector. The first row shows the protoimages used to create the 80 sample images, and the second row consists of a sample image created from the protoimage above it. The third row is each sample image re-created with only the first eigenvector. The next row includes the second eigenvector, and the last includes all relevant eigenvectors.
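
The reason that leaving out the noise factors removes noise can be sketched with a PCA-style reconstitution in Python: the image is rebuilt as the average plus a weighted sum of the selected eigenimages. This is an illustration under simplifying assumptions, not the 'CA SR' implementation. In particular, reconstitution after correspondence analysis involves additional scaling that is not shown, and the 4-pixel average, eigenimages, and coordinates are hypothetical.

```python
def reconstitute(average, eigenimages, coords, factors):
    """Rebuild a pixel vector from the average image plus the selected factors.

    factors: 0-based indices of the factors to include.
    """
    out = list(average)
    for k in factors:
        for p in range(len(out)):
            out[p] += coords[k] * eigenimages[k][p]
    return out

# Hypothetical 4-pixel example with orthonormal eigenimages.
average = [5.0, 5.0, 5.0, 5.0]
eigenimages = [
    [0.5, 0.5, -0.5, -0.5],   # factor 1: treated here as a real feature
    [0.5, -0.5, 0.5, -0.5],   # factor 2: treated here as a real feature
    [0.5, -0.5, -0.5, 0.5],   # factor 3: treated here as noise
]
coords = [2.0, -1.0, 0.25]    # coordinates of one image on the three factors

full = reconstitute(average, eigenimages, coords, factors=[0, 1, 2])
denoised = reconstitute(average, eigenimages, coords, factors=[0, 1])

print("all factors      :", full)
print("useful factors   :", denoised)
```

Storing a few coordinates per image instead of every pixel is also where the data compression mentioned above comes from.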

Create Arbitrary "Virtual" Images From Eigenvectors

The operation: 'CA SRA' can create images or pixels that were not actually captured, using eigenimages. To do this, the eigenvalues must be given. This can be useful to interpolate images in between classes.

To choose what values to use for eigenvalues, use the factor maps. If you try to use a value outside the range of values for an eigenvalue, the results will be difficult to predict and interpret.

The procedure: casra.spi invokes 'CA SRA'.

Hints for: 'CA SRA'

To determine coordinate values, consult a factor map.
Do not assume that a zero will fill the rest of the factor values with zero.
"FILE PREFIX" is the response given for the "output file prefix" question from 'CA S'.
"FILE #'S TO BE USED" = file numbers of images.
"FACTORS TO BE USED" specifies which factors are used in the reconstitution.
"OUTPUT FILE TEMPLATE" needs at least one * at the end, or the program aborts.

The images for eigenvector 1 equal to -0.1 and 0.1 are shown in the first two images. They "make sense" in that this vector controls head shape only. These values were chosen because they are the most extreme values shown on the factor maps. But the images for eigenvector 1 equal to -1 and 1, the last two images, do not represent "extreme" roundness as one might think. This is because these values lie outside the range that actually occurs in the data, i.e. the range shown on the factor maps.

The following virtual images were created by changing only one eigenvalue at a time from -0.2 to 0.2 at regular intervals. The top row is eigenvalue one, and the bottom row eigenvalue three. Along each row all other eigenvalues were held at zero. The center column is identical in both rows because at this point the varying eigenvalues are equal to zero. Reconstitution with all eigenvalues set to zero is equivalent to reconstituting the average of the series. If reconstitution of each individual eigenvalue had progressed to one, the result would be the first three images directly above. Reconstitution of the three eigenvalues together will result in the fourth image above.
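
The sweep described above can be sketched in Python. This is an illustration only, not the 'CA SRA' operation or the casramontage.spi procedure. It uses a PCA-style sum of the average and weighted eigenimages, the 4-pixel average and eigenimages are hypothetical, and every factor value is given explicitly rather than relying on defaults.

```python
def virtual_image(average, eigenimages, coords):
    """Build an image from the average plus each eigenimage times its chosen value."""
    return [
        a + sum(c * e[p] for c, e in zip(coords, eigenimages))
        for p, a in enumerate(average)
    ]

# Hypothetical 4-pixel average and three eigenimages.
average = [5.0, 5.0, 5.0, 5.0]
eigenimages = [
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]

steps = [-0.2, -0.1, 0.0, 0.1, 0.2]
rows = []
for factor in (0, 2):                  # top row: factor 1, bottom row: factor 3
    row = []
    for value in steps:
        coords = [0.0] * len(eigenimages)   # all other factors held at zero
        coords[factor] = value
        row.append(virtual_image(average, eigenimages, coords))
    rows.append(row)

for row in rows:
    print(row)
```

In both rows the center entry, where the swept value is zero, is simply the average image, which matches the identical center column in the montage.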

The procedure: casramontage.spi was used to assemble the montage:

Viewing Eigenimages

The operation: 'CA SRE' creates eigenimages from 'CA S' outputs. With eigenimages, the user can easily see what the computer has determined as the factors to classify images by.

The procedure: casre.spi invokes 'CA SRE'. It assumes that if you want more than one factor, they are contiguous. It also assumes that you are using CORAN output. Both of these assumptions are in the procedure file, not the operation. Edit the procedure file to change them.

Hints for: 'CA SRE'

None, see the manual page.

Sample results:

Show Differences Between Images (Eigenimages)

Use the 'CA SRD' operation to see the different eigenimages that are used to recreate images.

The procedure: casrd.spi invokes 'CA SRD'. Note that it is similar to casr.spi except for minor changes.

Hints for: 'CA SRD'

"PREFIX" is the base name before _SEQ, _IMC, _PIX, etc.
An entry out of bounds has no useful purpose.
See hints for 'CA SR' above.

The first image below is the average of all 80 noisy sample face images. The next three are the difference images for each useful eigenvector for one image. The dark slivers in the first eigenvector image show that in order to obtain the correct head shape for this image, we must add black to either side of the face. Similarly, for the eyes we would make the left side of each eye socket brighter, and the right side darker. The fourth image shows that this image has a wide mouth.

Below are the separate difference eigenimages for another image, as well as the combination of all three. This image has the opposite of every trait of the image represented above. We can tell because of white slivers on a dark background, right-hand side of the eyes light, etc. The last image below is the composite of the first three eigenimages.

Below is the average of all the images as well as the composite difference eigenimage from above. When we add (superimpose) these together we get the third image, a re-creation of the original sample image. However, the re-creation has no noise. This is because we only used the first three eigenimages. If we included the other five, then we would have re-created the sample image, with noise.

Subgrouping Images Using 'CA SMI'

The operation 'CA SMI' can be used to separate a series of images into active/inactive clusters. It appears that this can actually perform operations on a series of images using the 'CA S' files from another series. This has not been tested. If the images used in 'CA SMI' will be used with their 'CA S' run, it is a good idea to create a Postscript map before running 'CA SMI' for comparison later.

The procedure: casmi.spi invokes 'CA SMI'.

Hints for: 'CA SMI'

Only supports image input and _IMC output?
If using images with their "native" 'CA S' run, be sure to compare the Postscript maps from before and after 'CA SMI'. If they are not very similar, you must re-run 'CA S' with "CN" instead of "C" as the fifth input.
After 'CA SMI', if an odd factor is chosen in 'CA SM', its axis may be flipped.
If using this to force certain images, from the same series, to stand out on a factor map (one class inactive, all others active), use 'CA SM' with the (S)ymbol option (it's quicker).

Below are the 'CA SM' maps with no 'CA SMI' input, and below that are 'CA SM' maps with 'CA SMI' input. Note that the labeled images are the same in all three, but the axis switches for an odd-numbered factor. This axis switch is caused by using a non-transposed data set. In order to have 'CA SMI' run on this data, we forced 'CA S' not to transpose the data, with the "CN" entry. Because of this, 'CA SM' reads in the images differently from the transposed data. Be sure to take note of the axes if using 'CA SMI' -> 'CA SM'.

Factor 1 vs. Factor 2	Factor 1 vs. Factor 3	Factor 2 vs. Factor 3

'CA SM' maps with no 'CA SMI' input

Factor 1 vs. Factor 2	Factor 1 vs. Factor 3	Factor 2 vs. Factor 3

'CA SM' maps with 'CA SMI' input

View Factor Map In Web and Compute Average Images

Using Web you can view a factor (correspondence) map and select images which are similar using a "lasso" interface. The average of that lassoed "class" is then computed and can be viewed or stored as a new image. Also, a document file can be created listing what class each image was placed in. I believe that only images can be used, not pixels.

How To Use The Web Factor Map

Run the 'SD C' operation in SPIDER.
Start Web with your data file extension (e.g. "Web dat &") and use the command -> image and "accept" to navigate to where the cluster file is.
If you are going to make image averages, the data files and the 'SD C' doc. file should be in the same folder and have the same extension. It is easier.
In Web, Commands -> Corr-map. Explanation of window options.
When you "accept", a Web window will appear, with either numbers or circles representing the images, depending on the "show image numbers" toggle.
"Lasso" a group of images with the left mouse button, and after completely enclosing the intended images, press the center button. A new window will appear.

"Save images in Doc file" will open another window. From this new window you can name the file, and specify which registers carry the X co-ordinate, the Y co-ordinate, and the order of that particular group created during the run of Corr-map. The first register is always the image number. Please remember that the X and Y co-ordinates are actually the values of the eigenvectors that were specified at the beginning of Corr-map.
"Display avg. image" will show an average of all included images on the screen. It will not save that image. The "overlap" toggle controls where the average is displayed.
"Store avg. image" will save the average of the selected images, but it will not be displayed on the screen.
"New masking" will return you to the image factor map.
"Continue masking" will do the same as new masking, except all the selected image markers will be shadowed in green. This is useful to see what image markers have already been categorized.
"stop mapping" ends the Corr-map function.

Continue selecting image markers with new/continue masking until satisfied. Note that one image can be in many different groups.

The doc. file: sdc_doc.dat is the result of running 'SD C' on a _IMC file. It lists the image number and factor co-ordinates of each image.

The following is a screen-shot of a Web Corr-map run:

Note the placement of the average images. The upper-right image is overlapped with its respective mask. We can also see what two traits were being compared in this factor map: head shape and eye direction.

The file: web_docimg001.dat contains the data created with the "Save images in Doc. file" option. It lists the image number, X and Y co-ordinates and the order of class it was formed from. This particular class was the third formed, but the first image document list.

Other stuff

The procedure: webdendro.spi invokes 'CL CLA' so that a new dendrogram document file is created, with a different lower threshold.

The procedure: rename.spi can be used to change the extensions of a whole series of images. This is not a problem if you follow hint number three above.

References

SPIDER image processing for single-particle reconstruction of biological macromolecules from electron micrographs.
Nature Protocols 3, 1941-1974 (2008)

Three-Dimensional Electron Microscopy of Macromolecular Assemblies
Oxford University Press, (February 2006)

Source: docs/techs/classif/tutorial.html Last update: 10 Sep 2013
