Multivariate Data Analysis
Previously known as multivariate statistical analysis


There are only three essential steps here, and the second one is even optional:

  1. Alignment in two dimensions
  2. Dimension-reduction -- expression of an m×n image using only a few terms, i.e., eigenvectors
  3. Classification

For the classification below to be sensible, the images will need to have been aligned. The alignment step here is optional if the images have been aligned already.

In theory, the dimension-reduction step is optional: one could classify the raw images directly (which is what SPIDER operation 'AP C' does). As an example here, I use correspondence analysis for the dimension-reduction. Principal-component analysis (PCA) is a similar method; to run PCA, change an option under SPIDER operation 'CA S' in the procedure file ca-pca.msa.
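
To make the idea of expressing an image with only a few terms concrete, here is a minimal sketch of PCA in plain Python. It is not correspondence analysis and not what 'CA S' does internally; the function names (pca_first_component, project, reconstruct) and the toy 2x2 images are assumptions made for illustration. It finds only the leading eigenvector, by power iteration.

```python
import random

def pca_first_component(images, n_iter=50, seed=0):
    """Return (mean image, leading eigenvector) for flattened images.

    Assumes the images are not all identical (otherwise the norm is zero).
    """
    n, d = len(images), len(images[0])
    mean = [sum(img[j] for img in images) / n for j in range(d)]
    centered = [[img[j] - mean[j] for j in range(d)] for img in images]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(n_iter):
        # Power iteration: apply the covariance implicitly as X^T (X v).
        proj = [sum(row[j] * v[j] for j in range(d)) for row in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def project(img, mean, v):
    """Coefficient of one image along the eigenvector."""
    return sum((a - m) * b for a, m, b in zip(img, mean, v))

def reconstruct(coeff, mean, v):
    """Rebuild an image from a single coefficient."""
    return [m + coeff * b for m, b in zip(mean, v)]

# Toy data: five flattened 2x2 images that vary along one direction only.
images = [[5 + t, 5, 5 - t, 5] for t in (-2, -1, 0, 1, 2)]
mean, axis = pca_first_component(images)
coeff = project(images[0], mean, axis)
restored = reconstruct(coeff, mean, axis)
```

Because the toy images vary along a single direction, one coefficient per image is enough to restore them; real particle images need more factors, and the classification then works on those coefficients rather than on the pixels.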

For classification, there are three methods illustrated here: Diday's method, Ward's method, and K-means. The individual classification operations are described in more depth in the classification tutorial.


Getting started


Quick version

This quick version is here partly for simplicity, and partly for legacy reasons. The more complete and more up-to-date workflow is presented below.

The overall inputs are simply: If your inputs are not in this form, refer to the full alignment below and then return here.
  1. [Optional] High-pass filtration

    For negatively-stained images, the puddle of stain surrounding the particles often dominates the classification (although perhaps not for deep-stained specimens). In this case, it may be helpful to high-pass filter the images using a filter radius larger than the size of your particle. If you cut off too much of the low-frequency information, however, the aligning power of the images will be lost. You may need to tweak the radii.
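
As a rough illustration of what a high-pass filter removes, the sketch below subtracts a local mean from each pixel. This is a real-space stand-in chosen for simplicity, not the Fourier-space filter that SPIDER applies; the function names and the window radius are assumptions.

```python
def box_mean(image, radius):
    """Mean over a (2*radius + 1)^2 window, clipped at the image edges."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            r0, r1 = max(0, r - radius), min(rows, r + radius + 1)
            c0, c1 = max(0, c - radius), min(cols, c + radius + 1)
            window = [image[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            out[r][c] = sum(window) / len(window)
    return out

def high_pass(image, radius):
    """Subtract the slowly varying background (the local mean)."""
    low = box_mean(image, radius)
    return [[image[r][c] - low[r][c] for c in range(len(image[0]))]
            for r in range(len(image))]

# Toy image: flat background of 10 with one bright pixel in the center.
toy = [[10.0] * 5 for _ in range(5)]
toy[2][2] = 19.0
filtered = high_pass(toy, radius=1)
```

The flat background goes to zero while the local feature survives. A window that is too small would also start removing the particle itself, which is the real-space analogue of cutting off too much low-frequency information.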

  2. Reference-free alignment -- choose one of these two options:
    1. [Recommended] Pairwise alignment
      • PROCEDURE FILE: pairwise.msa
      • INPUT PARAMETER: Object diameter (in pixels, after decimation)
      • INPUTS: Unaligned particles, selection file
      • OUTPUTS: Aligned particles, averages

        Conceptually, this alignment first aligns pairs of images and averages them. Then, it aligns pairs of averages of those pairs and averages them, and so forth. This type of alignment appears to be less random than does 'AP SR', which chooses seed images as alignment references.

        Reference: Marco S, Chagoyen M, de la Fraga LG, Carazo JM, Carrascosa JL (1996) Ultramicroscopy 66: 5-10.
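
The pyramid of pairwise averages described above can be sketched with a toy example. This is not the code in pairwise.msa: it uses 1D cyclic signals in place of 2D images, aligns by circular shift only (no rotation), and all function names are hypothetical.

```python
def shift(sig, s):
    """Circularly shift a signal to the right by s samples."""
    n = len(sig)
    return [sig[(i - s) % n] for i in range(n)]

def best_shift(ref, sig):
    """Shift of sig that maximizes its correlation with ref."""
    n = len(ref)
    def score(s):
        return sum(ref[i] * sig[(i - s) % n] for i in range(n))
    return max(range(n), key=score)

def align_and_average(a, b):
    """Align b to a, then average the pair."""
    b_aligned = shift(b, best_shift(a, b))
    return [(x + y) / 2 for x, y in zip(a, b_aligned)]

def pairwise_average(signals):
    """Average pairs, then pairs of averages, until one average remains."""
    level = list(signals)
    while len(level) > 1:
        nxt = [align_and_average(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd one out is carried to the next round
        level = nxt
    return level[0]

# Four copies of the same template, each with a different shift.
template = [0, 0, 1, 3, 1, 0, 0, 0]
particles = [shift(template, s) for s in (0, 3, 5, 6)]
final_average = pairwise_average(particles)
```

No single image acts as the reference for the whole set; each round only compares neighbors in the pyramid, which is the property that distinguishes this scheme from seed-based alignment.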

    2. Using 'AP SR'
      • PROCEDURE FILE: apsr4class.msa
      • INPUT PARAMETER: Object diameter (pixels, after decimation)
      • INPUTS: Unaligned particle stack, selection file
      • OUTPUTS: Aligned particles, averages

        There may be a memory limit in 'AP SR'. If you get a core dump, truncate the selection file and try again.

  3. Low-pass filtration and downsampling

    The low-pass filtration step will make the particles easier to visualize. The downsampling will speed up the computation.
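
A minimal sketch of why downsampling helps: averaging 2x2 blocks both smooths the image (a crude low-pass) and reduces the pixel count by a factor of four. The source does not say which filter or decimation factor the procedure uses, so the block averaging and the name bin_image are assumptions for illustration.

```python
def bin_image(image, factor=2):
    """Average factor x factor blocks; trailing rows/columns are dropped."""
    rows = len(image) // factor
    cols = len(image[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [image[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# Toy 4x4 image reduced to 2x2.
toy_image = [[1, 3, 5, 7],
             [1, 3, 5, 7],
             [2, 2, 6, 6],
             [2, 2, 6, 6]]
binned = bin_image(toy_image, factor=2)
```

Every later step (dimension-reduction, classification) then handles a quarter of the pixels per image.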

  4. Dimension-reduction
  5. Classification -- choose one of three options:
    1. K-means classification, using 'CL KM'
      The advantage of K-means is that it is simple, requiring primarily the number of classes to divide the particles into. The simplicity is also the disadvantage; there is no relation between the classes generated, in contrast to the hierarchical methods below.

      • PROCEDURE FILE: kmeans.msa
      • INPUT PARAMETERS: Number of factors, number of classes
      • OUTPUT: class averages

        It can be informative to look at the individual particles from a class. You can use WEB/JWEB, or montagefromdoc.
        Usage:
        montagefromdoc   KMeans/docclass001.dat
        If you have requested too many classes, there will be similar-looking class averages. If you have requested too few, there will be dissimilar particles within a class.
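
To show what K-means does with the factor coordinates, here is a minimal sketch in plain Python. It is not the 'CL KM' implementation; the function names, the deterministic seeding, and the toy coordinates are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, n_classes, max_iter=100):
    """Assign each point to the nearest of n_classes centers, iteratively."""
    n = len(points)
    # Deterministic seeding (an assumption, for reproducibility):
    # evenly spaced points serve as the initial class centers.
    centers = [list(points[i * n // n_classes]) for i in range(n_classes)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(n_classes),
                          key=lambda c: squared_distance(pt, centers[c]))
                      for pt in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(n_classes):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a class is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy factor coordinates (two factors per particle), two obvious groups.
factor_coords = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(factor_coords, n_classes=2)
```

The only real input is the number of classes, and the resulting labels carry no information about how the classes relate to one another.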

    2. [Recommended] Ward's method, using 'CL HC' -- The advantage is that, unlike Diday's method below, the dendrogram branches to any desired number of classes, down in size to individual particles. The disadvantage is that the dendrogram is unreadable if there are too many branches. You can truncate the dendrogram in WEB/JWEB as described below.
      • PROCEDURE FILE: hierarchical.msa
        • INPUT PARAMETER: number of eigenfactors to use
        • OUTPUT: dendrograms (PostScript and SPIDER formats)

          After running, decide how many classes to use. The PostScript file may be highly branched, and nodes may be unreadable.
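
The sketch below illustrates Ward's criterion and what cutting the tree at a chosen number of classes means. It is a naive plain-Python version for small toy data, not the 'CL HC' implementation; the function names and data are assumptions.

```python
def ward_merge_cost(a, b):
    """Increase in within-class sum of squares if classes a and b merge."""
    na, nb = len(a["members"]), len(b["members"])
    d2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return na * nb / (na + nb) * d2

def ward_clusters(points, n_classes):
    """Merge the cheapest pair repeatedly until n_classes remain."""
    clusters = [{"members": [i], "centroid": list(p)}
                for i, p in enumerate(points)]
    while len(clusters) > n_classes:
        pairs = [(ward_merge_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": a["members"] + b["members"],
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]

# Toy factor coordinates; cut the tree at two classes.
coords = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
two_classes = ward_clusters(coords, n_classes=2)
```

Stopping the merging earlier or later corresponds to truncating the dendrogram at a different height, so the same tree yields any number of classes from one up to the number of particles.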


          [Figure: Untruncated dendrogram]

          The SPIDER-format dendrogram document can be viewed with WEB/JWEB and truncated. In WEB, go to Commands -> Dendrogram. In JWEB, go to File -> Open SPIDER Document File.


          [Figure: Dendrogram in X-Window WEB]

      • Visualize binary tree
        It is often not clear where to truncate the dendrogram. In X-Window WEB, one only sees the terminal nodes in the dendrogram averaged. (In JWEB, averaged images are not implemented at the time of this writing, although Bill Rice says that it works if the prefix is two characters long.)

        • PROCEDURE FILE: binarytree.msa
        • Visualize the output using binarytree.py. Syntax:
          binarytree    labeled001.dat    4    2    1024
          where:
          • labeled001.dat -- Example filename (without a wild card)
          • 4 -- (optional) Tree depth, i.e., the number of rows (default is 6)
          • 2 -- (optional) Margin width (default is 2)
          • 1024 -- (optional) Canvas width
        • If SPIDER's Python tool, binarytree.py, is not installed, try spidertree.msa. The output is a SPIDER-format image; however, the file size may be very large.
          • INPUT PARAMETERS: Tree depth (number of averages will be (2**depth - 1))
          • INPUTS: Averages from binarytree.msa
          • OUTPUTS: SPIDER-format tree image
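
The relation between tree depth and the number of averages can be checked with a few lines of Python. The function names are hypothetical; only the formula 2**depth - 1 comes from the text above.

```python
def averages_in_tree(depth):
    """Total number of averages in a full binary tree with `depth` rows."""
    return 2 ** depth - 1

def row_sizes(depth):
    """Number of averages in each row, from the root downward."""
    return [2 ** row for row in range(depth)]

total_at_depth_4 = averages_in_tree(4)
rows_at_depth_4 = row_sizes(4)
```

Each extra row roughly doubles the number of averages in the tree, which is one reason the SPIDER-format tree image can become very large.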


          [Figure: binarytree.py, depth=4]

          [Figure: tree.msa, depth=4]

    3. Diday's method, using 'CL CLA' -- I hear that this method works exceedingly well. In practice, though, I find that I have limited control over the number of classes, which may or may not be a problem depending on the application. Also, I sometimes get errors with this method on large data sets.
      • PROCEDURE FILE: cluster.msa
        • INPUT PARAMETER: number of eigenfactors to use
        • OUTPUT: dendrograms (PostScript and SPIDER formats)

          After running, decide how many classes to include, using WEB/JWEB (Commands -> Dendrogram) and clicking on Show averaged images.


        [Figure: Dendrogram, PostScript format]

      • PROCEDURE FILE: classavg.msa
        • INPUT PARAMETER: Desired number of classes
        • OUTPUT: Class averages


Full version

This full version is the 2D alignment scheme which I use routinely. It is a subset of the steps for alignment and classification for untilted-specimen images for random conical tilt.



Recent modifications:


References

  1. Frank J. (2006) Three-Dimensional Electron Microscopy of Macromolecular Assemblies. (Oxford University Press, New York, NY).
  2. Marco S, Chagoyen M, de la Fraga LG, Carazo JM, Carrascosa JL. (1996) A variant to the random approximation of the reference-free algorithm. Ultramicroscopy 66: 5-10.
  3. Shaikh TR, LeBarron JS, Trujillo R, Baxter WT, and Frank J. (2008a) Particle-verification for single-particle reconstruction using multivariate data analysis and classification. J Struct Biol 161: 41-48.
  4. Shaikh TR, Gao H, Baxter WT, Asturias F, Boisset N, Leith A, and Frank J. (2008b) SPIDER image-processing for single-particle reconstruction of biological macromolecules from electron micrographs. Nature Protocols 3: 1941-74.
  5. Huang T, Shaikh TR, Gupta K, Contreras-Martin LM, Grassucci RA, Van Duyne GD, Frank J, Belfort M. (2011) The group II intron ribonucleoprotein precursor is a large, loosely packed structure. Nucleic Acids Res. 39: 2845-54.

Source: spider/docs/techs/MSA/index.html     Page updated: 2018/11/15     Tanvir Shaikh