Agglomerative Hierarchical Clustering: An Introduction to Essentials. (3) Standardization, Normalization and Dimensionality Reduction of a Data Matrix

Refat Aljumily

Volume 16 Issue 3

Global Journal of Human-Social Science

In a previous tutorial article I looked at a proximity coefficient and, in the light of that proximity, created a vector distance matrix and used it to construct hierarchical trees using different hierarchical clustering methods, which serve as the basis for exploratory multivariate analysis. The present article deals with three topics: (i) standardization for variation in variable scales, (ii) normalization for variation in sample length, and (iii) dimensionality reduction, or minimization of the data space. The choice of techniques reflects the author's academic background and particular area of interest, but the techniques are not tied to a particular purpose: they are straightforwardly applicable to other kinds of data, and thus to a wide range of analyses in Linguistics. My treatment of these techniques is, necessarily, introductory and brief. I hope that this article will provide practitioners with an introductory overview of these techniques as used for cluster analysis of electronic corpora of linguistic data. The assumption is that the data is in the form of an m x n matrix D, which may need to be transformed in various ways prior to cluster analysis. A standardized data matrix enables practitioners to measure the variation between variables and to cluster the cases they describe on common scales and values, regardless of their original scales and values. A normalized data matrix enables practitioners to eliminate the effect of variation in length among samples and to cluster them as if they were all (about) the same length, regardless of their original length. A dimensionality-reduced data matrix enables practitioners to select and/or extract the most interesting variables relevant to the research question and to visualize an existing pattern, regardless of the original data space. A worked example is given to illustrate the effect each transformation technique has on a given data matrix. These transformation techniques have their own strengths and weaknesses, but these are beyond the scope of