Description of clustering algorithm

For each kind of amino acid nucleotide interaction an attempt was made to split all the superimposed structures into "families" of interactions for easy classification. So far two types of algorithms were used. One is based on maximum distance clustering algorithm, another is a very simplified version of a density function algorithm. We are also planning to write a program implementing K-means algorithm.

In each case clustering was done on representative atoms from amino acids. An atom making hydrogen bond to nucleotide (ND2 in case of asparagine) was chosen to represent the amino acid. Distance between atoms in 3D space was used as a metric for clustering algorithms.

Maximum distance algorithm consists in finding atoms that are cluster centers and then assigning remaining atoms to the closest cluster. Initially two atoms farthest apart are chosen as cluster centers. Then for each remaining atom distances to all cluster centers are calculated and the shortest distance is saved. Out of these distances the longest is selected. If this distance is more than a certain empirical parameter, the atom is desinated as a new cluster center and the procedure is repeated until no more new cluster centers are found.

Simplified density function algorithm assigns an atom to a cluster if the atom is closer to any member of the cluster than a certain empirical parameter.

For both algorithms sample runs were done to determine optimal parameters and the results were compared to visual family assignments based on chemical intuition. Performance of neither algorithm was ideal. They are both capable of finding well-defined sets of interactions, but not of distinguishing finer differences, obvious to chemist's eye. Sometimes separate families were formed from interactions that should probably be kept together.

For the future we plan to use a combination strategy, where a simple algorithm will be used to find well-defined clusters, which will then be examined more carefully, using chemical knowledge, like conformation of amino acid, etc.

A good reference on clustering algorithms is
J. T. Tou, R. C. Gonzales Pattern recognition principles Addison-Wesley, Reading, MA, 2nd Edition, 1977