Cluster Validity Techniques in Medical Imaging
Assessment of Clustering
For most clustering techniques the number of clusters is considered to be known. This assumption holds only in most cases where the data is slowly time-varying or is very small. In all other cases when we deal with an unknown set of data, such as it is the case with medical imaging data, this becomes a very unnatural hypothesis. Thus, it becomes evident that two fundamental questions are representative for any clustering problem: (a) how many clusters are found in the data and (b) how accurate becomes the clustering itself. The goal is to determine measures for quantitative evaluation of the results of a clustering algorithm. This is accomplished by cluster validity. This topic has been a broad research area in pattern recognition [DH,TK98]. A comparison of several validity indices for data being represented by non-overlapping distinct clusters and classified by hierarchical clustering algorithms is given in [MC85]. Determining the optimal number of clusters represents one of the most crucial classification problems. This task is known as cluster validity.
The chosen validity function enables the validation of an accurate structural representation of the partition obtained by a clustering method. While a visual visualization of the validity is relatively simple for two-dimensional data, in case of multidimensional data sets this becomes very tedious. In this sense, the main objective of cluster validity is to determine the optimal number of clusters that provide the best characterization of a given multi-dimensional data set. An incorrect assignment of values to the parameter of a clustering algorithm results in a data partitioning scheme that is not optimal and thus leading to wrong decisions.
Methods in Biomedical Image Analysis
In the last decades, several advanced noninvasive medical imaging techniques such as magnetic resonance imaging (MRI) and functional MRI (fMRI) have been introduced into clinical practice. These new techniques are not only limited to the mere imaging of morphological structures but also emphasize the biological function. As a consequence, the analysis and visualization of medical image time series data poses a new challenge to both basic research and medical application.
Functional and dynamic MRI time series experiments produce a wealth of data and subsequently information packed into two or three spatial dimensions as a function of time. The fMRI and MRI data are both multivariate by nature since each volume contains information about either brain activation or tissue response at thousands of measured voxels. Multivariate techniques have been successfully and frequently applied to fMRI and MRI studies. They are mostly used to detect brain or tissue response to stimuli, to reduce noise artifacts, to capture the variability of the data in a precise way, and to determine a model to be applied on the data. Another important aspect that can be analyzed by multivariate techniques is both the mapping of activities in the brain to cognitive states and also of enhancement patterns to anatomical, physiological and pathological states. These mappings can be treated as a pattern recognition problem. Thus, algorithms have to be designed that learn and later classify or predict spatial patterns representing multivariate fMRI or MRI experimental data. Such experiments are able to localize brain or tissue response to several stimuli types ranging from pharmacological to artificial. These data are abundant in signal information but poorly characterized in terms of signal and noise structure.
Techniques for the analysis of both functional or dynamic MRI time series are either model-free or model-based. The model-based, supervised data analysis methods rely on knowledge of experimental conditions and model assumptions, and combine problem-specific physiological parameters to a mathematical model of the temporal MRI signal. However, frequently the spatiotemporal behavior of the cerebral or tissue response is not known a priori or is variable and poses a challenge to an evaluation based on model-based methods alone. Therefore, model-based analysis methods impose some limitations on data analysis under complicated experimental conditions. Model-free, on the other hand, are not based on explicit signal models nor on the a priori knowledge of the underlying physiological process. Thus, different signal classes have not to be mathematically described and labeled training data are in general acquired during standard evaluation procedures. Therefore, analysis methods that do not rely on any assumed model are considered more powerful and relevant.
On the other hand, time series clustering represents an emerging technique for temporal data mining research by providing important information in several domains. The core problem addressed by this technique is to determine out of a group of unlabeled time series clusters or groups of similar time series based on a clustering algorithm of choice [MSLW07] .
Thus, the choice of the right cluster validity index is of crucial importance in biomedical image processing. The most established techniques are: Kim's index, Calinski Harabasz (CH) index, and the intraclass index. These indices were successfully applied before in biomedical time--series analysis [GTRNH99].
An overview about these methods is given below:
1. Calinski Harabasz index [CH74]: this index is computed for <math>m</math> data points and <math>K</math> clusters as
where <math>\mathbf{B}</math> and <math>\mathbf{W}</math> represent the between and within cluster scatter matrices. The maximum hierarchy level is used to indicate the correct number of partitions in the data.
2. Intraclass index [GTRNH99]: This index is given as
where <math>n_k</math> is the number of points in cluster <math>k</math> and <math>w_k</math> is a prototype associated with the <math>k</math>-th cluster. <math>I_W</math> is computed for different cluster numbers. The maximum value of the second derivative of <math>I_W</math> as a function of cluster number is taken as an estimate for the optimal partition. This index provides a possible way of assessing the quality of a partition of <math>K</math> clusters.
3. Kim's index [KPP01]: This index equals the sum of the over--partition <math>v_o(K,X,W)</math> and under--partition <math>v_u(K,X,W)</math> function measure
where <math>v_u(K)</math> is the under-partitioned average over the cluster number of the mean intra-cluster distance and measures the structural compactness of each class, <math>v_{umin}</math> is its minimum while <math>v_{umax}</math> the maximum value. <math>v_u(K,X,W)</math> is given by the average of the mean intra-cluster distance over the cluster number <math>K</math> and measures the structural compactness of each and every class. <math>v_o(K,X,W)</math> is given by the ratio between the cluster number <math>K</math> and the minimum distance between cluster centers, describing inter-cluster separation.
<math>X</math> is the matrix of the data points and <math>W</math> is the matrix of the prototype vectors. Similarly, <math>v_o(K)</math> is the over-partitioned measure defined as the ratio between the cluster number and the minimum distance between cluster centers measuring the inter-cluster separation. <math>v_{omin}</math> is its minimum while <math>v_{omax}</math> the maximum value. The goal is to find the optimal cluster number with the smallest value of <math>I_{Kim}</math> for a cluster number <math>K=2</math> to <math>Kmax</math>.