TreeSOM tools Author/maintainer: Elena.Samsonova@liacs.nl This is freeware under GPL (see gpl.txt). This project is ALIVE. Please send me an e-mail in case something is unclear or if you found a bug. Please supply your input data so that I could trace and fix it. Major changes are listed in the CHANGES section of this Readme. This program is updated regularly, so please check my web page: http://web.inter.nl.net/users/Elena.Samsonova/resources.shtml Self-Organizing Map software package compiles four programs out of the same source code: matrix, som, clustermap and clustertree. - These utilities require each data element to have a unique label. This should be changed in future. - All visualizations use color to show families, if such information is given. Alternate names for data elements may be loaded. - MATRIX calculates a distance matrix from the given data file. - SOM is meant to train a map from scratch or starting from a pre-trained map. Training is controlled by a set of parameters, see som.conf. A trained SOM may then be analyzed for clusters, visualized or saved as a tree file (in Newick format). - CLUSTERMAP is meant for analyzing SOMs that are already trained. It does the same things as SOM except it cannot train the map. - shell script MKGIF uses ImageMagick to make an animated GIF out of the cluster map visualizations to show cluster development. - CLUSTERTREE does tree analysis and visualization. It can load arbitratry trees in Newick format, also with confidence values. Trees may be aggregated and drawn in a variety of ways, including a circular map (nice for very large trees). - shell script MANYSOM allows to train many SOMs with the same data and configuration file. You should fill in -1 for the random seed, of course. This is used for cluster confidence analysis. - consensus tree tools are not included into this package. You can use for example PHYLIP, a package from bioinformatics, containing several consensus tree methods. It also has the description of the Newick tree file format. For a description of the functionality please see the manual and these papers: - E.V. Samsonova, T. Bäck, J.N. Kok, A.P. IJzerman, Reliable Hierarchical Clustering with the Self-Organizing Map, in A. Fazel Famili, Joost N. Kok, José M. Peña, et al. (eds.), Advances in Intelligent Data Analysis VI: 6th International Symposium on Intelligent Data Analysis, Lecture Notes in Computer Science, vol. 3646, pp. 385-396, Springer Verlag, 2005. - E.V. Samsonova, J. Kok, A.P. IJzerman, TreeSOM: Cluster Analysis in the Self-Organizing Map, in Proceedings of the 5th Workshop on Self-Organizing Maps (WSOM), pp. 429-438, 2005. Both articles are available from the same place where you got this source: http://web.inter.nl.net/users/Elena.Samsonova/resources.shtml File formats are as follows: - SOM data standard format: # comments behind a hash sign # first line gives the cardinality of data vectors # example: 3 # each subsequent line represents one data vector # format: x1 x2 x3 ... xn label # the label may not contain spaces # data must be numeric # missing values should be entered as 'x' # continued example: 0.1 2e-03 44 item1 1 2222.4 3 item2 8.3 x x item3 - SOM map standard format (.cod files): # comments behind the hash sign # first line gives the status information: # cardinality, topology (rectangular), size (x, y), # neighborhood function (gaussian) # example: 3 rect 5 4 gaussian # each subsequent line represents one map node vector # format as in the data file, but labels not obligatory - alternative names file: # names may contain spaces; format: # "vector name" "name to use in visualizations" # example: "item1" "teddy bear" "item2" "lorry" - families file: # names may contain spaces; format: # "vector name" "family name" # example: "item1" "plush toys" "item2" "motorized toys" Options for each program are displayed if the program is started without parameters. See some examples in the end of this Readme. With SOM, calculation precision may be specified as float, double or "long double". The higher the precision, the longer it takes to train the SOM. However, if your data contains many small values (e.g. between 0 and 1), then at least double precision is recommended as it results in better SOMs with much fewer iterations. This implementation of the SOM algorithm has been optimized for speed. It uses a two-dimensional array to store map nodes, and a data distance cache for fast cluster analysis. If there is not enough memory for a distance cache, then the system will be slower. SOM training is also sped up by calculating partial distances when searching for the winning node. The smallest distance found so far is remembered, so that the euclidean formula is only calculated up untill it reaches this remembered value. Since it will only grow if we continue calculating, we may just as well stop, since this is not the winning node. This optimization is especially effective for data with a high cardinality. Cluster analysis is a memory-hungry business since it is optimized for speed. It should be made more memory-efficient though. --------------------------------------------------------------------- Changes --------------------------------------------------------------------- Revision 2.1, April 17, 2005. The first released version. Revision 2.2, September 10, 2005. Extra visualizations. - Many new visualizations added. - A memory leak fixed that was a problem when the SOM calculation was embedded into another program. - A funny behaviour of some compilers was detected when sqrt() was presented with a negative value due to a rounding error. Cluster detection was sometimes affected, SOM training too, potentially, but we never saw any difference. This problem was murdered by checking the argument before it is passed to sqrt(). Revision 2.3, September 12, 2005. Debugging version. - Options of the four programs streamlined. - Manual added. - Further testing is still ongoing, a new release is expected soon. Revision 2.4, September 24, 2005. Tree drawing improved. SOM debugged. - Scaling and cropping of circular trees improved. - More flexibility with printing labels added. - Subtree drawing added: zooming into the minimal subtree containing a given data set. Irrelevant intermediate subtrees are drawn as branches so that tree structure is maintained but the tree is smaller. - Bug fixed: SOM parameters are now loaded properly. Revision 2.5, September 28, 2005. Maintenance release. - SOM initialization improved: it is a lot more heuristic now but also a lot closer to the original Finnish SOM. Random SOM is now initialized with values somewhere between the minimal and the maximal value for each dimension. The linear initialization now smears out the average vector over the SOM grid as if it were axis of the first two principal components. Well, sort of. The algorithm is heuristic. It is using Gram-Schmidt matrix transformation which is apparently not very stable, as well as many random numbers. But it does work in most cases. - Observation: C++ has a heavy overhead. The same algorithm done in C is significantly faster. I am considering converting the whole package into C. Revision 2.6, October 7, 2005. Maintenance release. - SOM now draws any kind of tree rather than only curvograms. It also checks for the output file before training (bug fixing). - Color selection improved. - Zooming improved: it is stricter now producing smaller trees. Revision 2.7, December 7, 2005. Maintenance release. - SOM now accepts missing values in the data file. They should be entered as 'x'. Revision 2.8, February 11, 2006. Maintenance release. - Added chi-square distance. - Pictures can now be drawn in grey shades with -G. - Trees can now be aggregated with -A to hide leaves that fall within the same family and make smaller trees.