Clustering shapes fits into the general area of Data Mining as applied to non-alpha-numeric data. However, Data Mining englobes a number a disparate subfields and non-alpha-numeric data are obviously extremely diverse in nature. "Data Mining" and "Knowledge Discovery" (cf. http://www.sigkdd.org) are (largely synonomous) generic terms that refer to activities which are intrinsically heterogeneous and (from an historical perspective) developed along independent paths and at different points in time. Clustering first began its existence as a subfield of statistics before Data Mining came to be recognized and covers, even taken alone, a wide variety of approaches to solving the problem that defines it.
The task of clustering SVG shapes involves knowledge in areas with which people competent in the field of vector graphics are not necessarily conversant, such as pattern recognition and machine learning. In-less-than-a-nutshell and not-unduly-formal introductions to relevant topics are thus indispensable, in order to provide a viable minimum of background. The context in which the results reported here are situated can also be brought into sharper focus by briefly reviewing the work most directly related to them.
It is assumed that no introduction to SVG is necessary. Pertinent non-SVG background can be simplified into three major topics: modelling shape, clustering (unsupervised classification), and the discrete Fourier transform. Each is a field in itself. Chapters 4 to 7 of [CosCes00] are a good initial resource for the first topic. [JaiMurFly] is still the standard overview of clustering. Chapter 2.7 of [CosCes00] as well as chapters 5 to 7 of [Bigun] provide compact introductions to pertinent aspects of Fourier analysis.
The first three subsections below introduce each of these background topics, while subsection 2.4 reviews the rare pieces of research that can be seen as our direct forerunners.
Adequate modelling of shape is a central topic in 2D and 3D graphics -- the IEEE International Conference on Shape Modeling and Applications has been taking place since 1997. The problem of how to represent shape forms a leitmotive running through the fields of computer vision and pattern recognition.
More functionally, an operational description is an absolute prerequisite for comparison of any two instances of a shape. Being able to carry out such a comparison and compute a corresponding (dis)similarity value is, in turn, necessary for other operations such as: clustering images, content-based image retrieval, similarity search in image data bases, and identification of objects in a scene. The common, essential desideratum is simple: a good match up between intuitive, perceptual similarity and closeness of the values of the model's parametres, i. e. the features defined and assigned values -- shapes that look (nearly) the same lie at (almost) the same location in the description space, and shapes that are visually very different are far apart. Achieving this goal, however, remains to date an ongoing scientific battle, as is attested to by the variety of approaches that have been and still are being developed.
The entry point into the literature for background information that is the closest to home for SVG-related purposes is probably the book by Da Fontoura Costa and Marcondes Cesar [CosCes00], in particular its two central chapters: "5 Two-Dimensional Shape Representation" and "6 Shape Characterization". The four-level taxonomy of representations embodied in the Figures 5.1 to 5.4 of [CosCes00] provides a schematic overview of the whole matter, beginning with the three-way top-level distinction of contour-based, region-based, and transform-based approaches. The complete tree is much too large to be fully discussed here. The path of greatest interest for understanding the work described in this paper leads from the root "shape representation" down to "contours" to "curve approximation" and finally to "polygonal approximation" (see Figure 5.2). In this method, straight line segments are fitted to the shape outline, giving rise to a sequence of polygonal vertices as shown in Fig. 1 (after Figure 5.22, p. 353 in [CosCes00]).
As to more specifically related work, the contributions most to be mentioned span the last 18 years, ranging from [PJVO] back in 1991, including Zhang's PhD (2002) and (a small) part of MPEG-7 (2004), and continuing up to [CLMMS] in 2008.
[PJVO] is a landmark reference summarising and clarifying information-preserving, parametric contour-based work up to the beginning of the 1990's. It deals not only with this branch of shape representation but also with shape dissimilarity measures defined on this basis. Section 2.1 provides a thorough introduction and gives arguments in favour of the so-called position function z (equation 2.1.4, p. 18) as a shape representation. At the beginning of Section 3.1 van Otterloo notes (p. 87) that "In the literature on shape analysis, it is not the periodic contour representations themselves, but the Fourier coefficients generated by such representations which have been given most attention". This focus of interest has been maintained in later work.
Dengsheng Zhang has worked extensively on the problems of shape modelling, as seen in the context of image retrieval. Chapter 3 of his PhD [DSZ] offers a detailed and more recent (as of 2002) review of shape representation techniques, both contour-based and region-based. Figure 3.1 (p. 32) gives an overview of Zhang's classification, which is organised in a rather different way to that of Figure 5.1 in [CosCes00]. Chapter 4 builds on this basis and reports retrieval results from a comparative study of several contour-based shape representations. [ZhLu01] is one of several publications by Zhang and Lu comparing and evaluating various shape descriptors.
Although the MPEG-7 Standard [MarMPEG7] ignores SVG and vector graphics, it does, of course, include shape descriptors, one of which (section 3.2.4.2) is contour-based and uses the Curvature Scale Space representation, as presented in 7.2.6 of [CosCes00] and studied in 4.3 and 4.4 of [DSZ]. Whether MPEG-7 got it right with this choice was and still is controversial.
Modelling shape continues to be an actively researched topic at present. The most recent relevant major publication we know of, [CLMMS], is only a few months old and examines three modern shape descriptors: Scale-Invariant Feature Transform (SIFT), Maximally Stable External Regions (MSER), and Line Level Descriptor (LLD).
The area of shape similarity measures ( cf. [PJVO] Ch. 4, [DSZ] 2.3.1, [Veltkamp], [VelLat] ) is also obviously relevant here, but even a summary discussion of it would be beyond the scope of this paper.
What does clustering (shapes or other things) consist of ? Clustering, in general, is the process of finding natural groups (alias "clusters") in data. Intuitively, a group is defined by the fact that items that are found to be members of the same group are (very) similar to each other in one or more respects, while items assigned to different groups are as dissimilar to each other as possible. The groups are natural in the sense that the kind of (dis)similarity being evaluated is meaningful for the semantics of the application-domain to which the items belong.
Clustering has now come to be considered -- along with supervised learning, regression, attribute importance, and association rules -- a subfield of Data Mining, the objective of which is, stated intuitively rather than in formal terms, to enable users of data-intensive applications to get more and better information out of their data and to perceive regularities and kinds of coherence that would otherwise go unnoticed. Ultimately, this better understanding of the data overall enables conclusions to be drawn that are impossible to discover from all the data records taken individually.
In the statistics and applied probability literature (as opposed to artificial intelligence and machine learning) clustering is sometimes called "classification". It is easy to understand why: grouping objects that are similar classifies them, in the everyday sense of this term. The English verb "sort" can be used untechnically in the same way: arrange things systematically according to their sort (as a noun, meaning: type, kind, variety). Classifying relevant observations has been and still is a fundamental activity in almost all branches of science. In order to avoid confusion, sometimes the term classification is used only in situations where the number and nature of the groups is well-known in advance, whereas clustering is the preferred term when this information is not available a priori and one relies precisely on the grouping process to discover it.
The human visual perception system is often very good at clustering small 2D plots of simple data. But not always, and in (scientific) practice, data actually occur in a bewildering variety of types, much more complex than the friendly case just mentioned. In order to tackle these kinds of clustering problems, algorithms (implemented as computer programs) that replace -- or imitate -- human perception are required.
Somewhat simplified, the many approaches to clustering fall into two main types: partitional and hierarchical. The first one simply partitions the set of objects into some desired number of clusters, which individually have no internal structuring. This approach can be efficient to implement but is often only useful when the right number of clusters to find is already known in advance (for whatever reasons).
A hierarchical clustering algorithm produces a hierarchy of nested clusters, the number of which ranges from 1 (at the root of the hierarchy) up to the total number of objects being clustered (at the leaves). The level of the hierarchy at which one (i.e. user or algorithm) chooses to "cut" the links between clusters (cf. the horizontal line in Fig. 2 intersecting four parent-cluster / child-cluster links)
A tiny example of hierarchical clustering in action will make this approach clear. The eight items in Fig. 3 are to be processed.
The complete hierarchy can be seen in Fig. 4.
Hierarchical clustering must decide how to calculate a dissimilarity value between two clusters. There are surprisingly many ways to do this, but three may be considered fundamental: single, average, and complete linkage. The single link approach takes the smallest dissimilarity between any one member of the one cluster and any one member of the other as the intercluster dissimilarity. Complete linkage does -- in a loose sense -- the opposite and considers the largest dissimilarity between any one member of the one cluster and any one member of the other. Average linkage is based on the average of the pairwise dissimilarities between all members of the one cluster and all members of the other.
The most well-known and widely consulted compact introduction to clustering is the survey, "Data Clustering: A Review" [JaiMurFly]. [Gordon] also dates from 1999 but provides more in-depth coverage of clustering algorithms, while remaining on an abstract level, without going into implementation details. [KauRou] is not as up-to-date nor as wide-reaching but takes a more practical point of view, closer to actual programming.
Although its beginnings reach back over some decades, clustering is still a very active field, among others with respect to images and other non-alpha-numerical data. Recent work of this kind includes the following contributions. [WaKh] develops a new approach to hierarchical agglomerative clustering as applied to raster images based on their "Dynamic Growing Self-Organizing Tree (DGSOT) algorithm", which continues the branch of clustering using neural nets (cf. section 5.7 in [JaiMurFly]). [LinEtAl] examines the well-known and widely implemented "k-means" clustering algorithm (cf. section 5.2.1 in [JaiMurFly] or section 3.2 in [Gordon]) and proposes ways to alleviate some of the drawbacks inherent in this technique. The work described in [Petru] is based on Kohonen's Self-Organizing Maps (cf. section 6.4.2 in [Gordon]) and performs k-means clustering of map units. [BrechEtAl] builds on two previously developed hierarchical density-based clustering algorithms called DBSCAN and OPTICS and shows how the improvements introduced can be used for similarity search in, for instance, an image database. [FerEtAl] and [EsFaDam] both tackle the problem of conceptually clustering collections of logic formulae, by adapting classical algorithms to this currently still exotic kind of data. Such work testifies to the on-going interest in and never ending relevance of cluster analysis. Chapters 7 and 8 of [CLMMS] extend the work mentioned above (at the end of 2.1) on modelling shapes to clustering them and proposes a new method for solving certain classical clustering problems.
The discrete Fourier transform (DFT) is well-known to be one of the mainstays of digital signal processing. Chapter 11 of [Bracewell99] focuses on it, but the preceding chapters provide the material necessary to put it into its complete context. Put in a very general way, its usefulness is to make a given data set more amenable to some given kind of processing, e. g. filtering as in image processing, or more suitable for certain purposes, e. g. comparison of values from different data items in order to assess their similarity.
The comparisons used during clustering to determine how dissimilar two given shapes are, are usually based not on raw feature values as found in the original shape model, but rather on the results of applying the Fourier Transform to these raw values, thus creating what are called "Fourier coefficients" and, after further normalisation, "Fourier Descriptors" (FDs) (although this terminological distinction is not observed everywhere in the literature). The key insight here is "the idea that look-alike shapes are usually near each other in a space of Fourier Descriptor features endowed with the euclidean metric" [ZahRos]. It is possible, of course, to base similarity comparisons on the raw feature values, but this turns out to be a much less adequate approach. Dimensionality reduction and information distillation are the main advantages of using FDs as a descriptive technique. One can observe that other transforms such as Discrete Cosine and Karhunen-Loeve also exhibit comparable properties.
To get an immediate intuitive glimpse of what FDs are good for, consider the successive red approximations of the letter F in Fig. 6.
The DFT's mathematical origins lie in Fourier's theorem that any periodic function f (x) can be approximated as closely as desired by a sum of cosine and sine functions. These summands form a Fourier series, each member of which is fully determined by its corresponding coefficient. If f (x) is sampled N times over a given interval, then a discrete form of the Fourier transform can be calculated according to the well-known definition given, for instance, on page 260 of [Bracewell99] and page 371 of [Bigun] as well as in the contributions discussed in the following paragraphs. Full details are provided in 4.2.1 below.
The forward transform produces a series of coefficients. If the values of the input function are, for example, coordinates of points on a given contour, then these coefficients characterize all aspects (i. e. location in the coordinate system, size, orientation, etc.) of the original geometrical figure with increasing detail, as more and more coefficients are taken into consideration. If shape -- cf. 3.1 below -- is considered to be a more abstract quality that remains invariant when a figure is translated, scaled, or rotated, then the raw Fourier coefficients can be further normalised to give rise to Fourier Descriptors, that express this more abstract -- and usual -- definition of shape. It is in this sense that all squares, no matter where in the plane, how big, or turned to what angle, are identical in their squareness.
For more background information on the topic of using Fourier analysis for shape description, a good place to start is section 6.5 of [CosCes00]. The first three subsections of Chapter 3 of [PJVO] provide a more detailed introduction.
An overview of more specifically related research begins with [ZahRos] and other pioneering work in the 1970s, e. g. [Granlund] and [PerFu], spans numerous contributions -- notably in the 3-D area -- in the intervening decades and continues up to quite recently, e. g. [BaCiaPa].
[ZahRos] is remarkable in that it describes intuitively and defines formally (in the appendix) the fundamental theme of calculating Fourier Descriptors from some given shape signature (normalized cumulative angular function, in their case) of a polygonal curve, on which so many variations have since been developed by later researchers. Zahn and Roskies also already consider how such descriptive information can be used for clustering purposes (cf. their Figure 6), if only by means of visually inspecting a 2D-plot. [Granlund] was the first of such variations, in the sense that his work uses coordinates of boundary points represented as complex values for the shape signature. The comparative study of various signature options has given rise to a line of reseach in itself (cf. [ZhLu01] and [ZhLu05]).
[RafMen] offers a representative and very readable example of current Fourier-based work. Although emphasis is placed on using FDs in the context of indexing in the database sense, this article includes compact introductions to FD shape representation, comparison, and classification that are generally relevant. The study of the relative usefulness of the information contained in the amplitude and phase parts respectively of FDs continues a major theme of discussion in the FD user community.
[PerFu] and [ArbEtAl] are of particular importance here, in that we use a version of these results adapted to our SVG-related needs (i. e. 2D instead of 3D, only similarity-invariant instead of affine-invariant, and with the algebraic formulae for polygons), as will be explained in detail in 4.2 below. Although [BaCiaPa] does not cite [PerFu] or [ArbEtAl], the work of Bartolini, Ciaccia, and Patella should be understood in relation to these earlier contributions, in that they too demonstrate that simply discarding phase information (for the sake of easy normalisation) significantly reduces the quality of shape comparison results. Their suggestion to use Dynamic Time Warping instead of the usual Euclidean distance to compute the dissimilarity between vectors of FDs is a valuable innovation.
One might well have expected that the Graphics Recognition community (see the series of IAPR workshops, the most recent of which is http://apps.univ-lr.fr/cgi-bin/WebObjects/Colloque.woa/wa/colloque?code=161) would have already tackled some aspects (shape acquisition and description) of the problem addressed here. This has however not been the case to date. SVG as such seems unknown, presumably falling under the heading of "Web graphics", which is known to exist but not actively investigated, as [Tombre] p. 332 observes:
"Surprisingly, whereas there are a lot of high-interest applications in dealing with digital documents (e.g. PDF documents or web graphics, in which to search for and recognize various entities), a workshop like GREC seems not to be deemed the right place to present this, as we have not seen any work in this area."
Vector-graphics images are only beginning to be considered as objects of analysis. To the best of our knowledge, only two published pieces of research have already started to break the ground in this area: [DSDM] initiated content-based retrieval of SVG images in 2004, and [KeyWin99] pioneered grouping shapes, derived from vector-graphics data, by means of visual inspection of plots of descriptive values as far back as 1999. [DSDM] describes the architecture and functionality of an implemented software system, whereas [KeyWin99] rather reports results from experiments carried out with a view to testing the applicability of certain descriptive techniques to certain classes of data.
[DSDM] is similar to the work described in this paper in that it specifically focusses on SVG data (as opposed to other vector-graphics formats or bitmap images). It is however different in several other respects. Its primary functionality is content-based retrieval rather than clustering. It can compute similarity between composite shapes (i. e. groupings of basic shapes) provided that one comparand group subsumes the other, whereas we compare essentially only basic shapes to each other. The subsets of SVG elements that can be processed are also different: [DSDM] deals with shapes represented by the following six SVG elements: circle, rect, ellipse, line, polyline , and polygon, but ignores paths. Our work handles only shapes that are closed contours -- these can be derived in a relatively simple way from all of the above mentioned elements except line and from (many but not all) path elements. Ignoring paths seems to be a major handicap from a practical point of view, in that arbitrary shapes in SVG tend in general to be expressed most frequently with this element.
The group similarities computed in [DSDM] sum weighted contributions from five kinds of primitive similarity: shape, colour, transformation, spatial, and position. Such comparisons are more complex than ours, which are based exclusively on shape information. How [DSDM] actually computes shape similarity is not made clear: it is stated (p. 1042) that "For each shape a set of meaningful points of the shape perimeter is extracted to determine the shape contour (a set of x and y coordinates) and the CTM [Current Transformation Matrix] is applied to the this set returning another set of points". How these coordinate values are obtained from attribute values in SVG elements is not explained. How such sets of points are further pairwise processed to yield a similarity value is also left unstated. This incompleteness of the description of these aspects of [DSDM] makes it impossible to determine more exactly how this work is related to ours.
[KeyWin99], in contrast, answers this last question unequivocally and emphasizes the fact that the Euclidian distance between vectors of Fourier Descriptors computed for closed polygons would form their basis for classifying shapes. To that extent [KeyWin99] ressembles our work, since we too use Fourier Descriptors and clustering is a kind of classification. There are however several differences that are worth pointing out, in order to understand how our contribution extends and improves on the research initiated by [KeyWin99].
Keyes and Winstanley are interested in "cartographic shapes", which are said to be the shapes of objects typically appearing on maps, "for example houses or roads". In fact, their main results involve only two such items: "buildings" (as exemplified in their Figure 3) and "land parcels" (as exemplified in their Figure 4). Their source was data sets representing large-scale maps, "pre-processed to extract closed polygons". Thus [KeyWin99] does not take SVG documents as its starting point and does not need to analyse them from scratch, as we do. The extracted polygons were presumably represented only by their vertices (as is usual), since the authors then apply an interpolation procedure to generate 512 equi-distant points lying on the polygon boundary. It may be questioned whether discarding the already available vertex information in this manner is not in fact contraproductive. The forward Fast Fourier Transform is applied to the 512 points, producing 512 raw coefficients, which are normalised -- in the usual, somewhat unsatisfactory way -- for translation, scaling (in one experiment but not in the other one), and rotation, which leaves 510 Fourier Descriptors. Of these, only 14 are considered necessary for descriptive purposes. And finally the values of only three are actually plotted (see Figures 6 and 7) in order to enable visual inspection as a means of deciding if clusters seem to be present or not. A clustering algorithm is not used. However, the need for a clustering capability is made quite clear.
At the top level of the whole application the overall problem can be broken down into four main subproblems: shape acquisition, description generation, the clustering algorithm, and inspection and manipulation of results by end-users. These subproblems come into play in the order just mentioned, must thus be dealt with sequentially, and naturally give rise to an abstract architecture in four corresponding modules, as shown in Fig. 7.
These problems and modules are handled in their respective subsections, 3.2 to 3.5, below. The very first prerequisite however is to clarify the term "shape". What in general it really means is not at all so obvious and unambiguous as one would think, and what restrictions we place on its usage here have not yet been properly elucidated.
For our purposes here, this problem needs to be broken down into two separate aspects: the general definitional basis in terms of primitives (3.1.1) and the relatively complicated relationship to the contents of SVG documents (3.1.2).
Shape is often taken for granted as an intuitive concept that everyone understands, requiring no explicit definition as such. Indeed, much of the related work mentioned in Section 2 takes this approach. But leaving things purely implicit only promotes misunderstanding, which is what we want to avoid at all costs.
A general definition of the English word "shape" is, of course, provided by the Shorter Oxford English Dictionary:
"External form or contour; that quality of a material object (or geometrical figure) which depends on constant relations of position and proportionate distance among all the points composing its outline or its external surface."
This is, as will be seen, already a useful basis for establishing the restricted sense, in which this term is used later in this paper. It may be noted that this interpretation of what "shape" means corresponds closely to how the term is widely used in the fields of computer vision and raster-graphics CBIR.
If we look to more technical sources, more closely connected with vector graphics, we find that the PostScript Red Book [PSLR], for instance, uses the term but does not define it. Persoon and Fu [PerFu] take the terms "boundary curve" and "shape" to be synonyms. In fact, it seems that the words "shape" and "contour" are often interchangeable -- cf. Longman Dictionary definition of "contour": "the shape of the outer limits of an area". Thus a shape seems to imply the (topological) notion of separating an inside from an outside. The closest the SVG recommendation [SVG11] itself comes to solving our definitional problem is in Section 1.6, where a only a "basic shape" is said to be
"[one of the] Standard shapes which are predefined in SVG as a convenience for common graphical operations. Specifically: 'rect', 'circle', 'ellipse', 'line', 'polyline', 'polygon'.".
It turns out that [SVG11] uses "shape" ambiguously: sometimes in the implicit general geometric sense (as in "Paths represent the outline of a shape . . . ."), sometimes as a technical term referring to one of the six syntactic elements mentioned above.
With this context in mind, we have chosen to follow Chapter 4 of [CosCes00]: a shape as understood here is a "connected set of points" ([CosCes00] p. 267) but further restricted to constitute a closed non-overlapping polygonal path, invariant under translation, rotation, and scaling (but not shearing). Now this characterisation excludes lines and typical polylines from being shapes, which seems to match the common usage investigated above. If, however, this limitation were thought to be too severe, a work-around like the one sometimes used in character recognition could alleviate the problem by additionally (with respect to the source document) tracing a minimal closed contour around the points constituting the line or polyline.
In the case of shapes derived from SVG documents it is indispensable to keep in mind the distinction -- well-known elsewhere (e. g. pattern recognition and computer vision) but not usually applied in a conscious and explicit way to SVG -- of three types of shape from each other: syntactic, perceptual, and semantic.
Syntactic shapes are the basic ones, described in Chapter 9 of [SVG11], plus paths (Chapter 8). They are associated with specific SVG elements, which are trivial to identify in source code. A perceptual shape is a visible, geometric entity -- specifically a polygon as explained in the previous subsection -- that can be derived from many different syntactic shapes. A semantic shape is one that users interpret as having some application-specific meaning attached to it: it is typically a composite of more than one perceptual shape and tends to be considered a representation of some real-world object.
Presenting simple examples is the quickest way to make these distinctions clearer. In the first case (syntactic vs. perceptual), the following five code snippets (among other possible ones) all generate visually identical results -- the same perceptual shape:
<line x1="0" y1="0" x2="150" y2="0"/> <line x1="150" y1="0" x2="150" y2="50"/> <line x1="0" y1="50" x2="150" y2="50"/> <line x1="0" y1="0" x2="0" y2="50"/>
<polyline points="0,0 150,0 150,50 0,50 0,0"/>
<rect x="0" y="0" width="150" height="50"/>
<polygon points="0,0 150,0 150,50 0,50"/>
<path d="M 0,0 L 150,0 L 150,50 L 0,50 Z"/>
End-users of SVG applications are interested in at least perceptual shapes, and one of the problems to solve is recognising that similar perceptual shapes expressed with different syntactic means are indeed similar. As presently implemented, our software would recognize the last four variants correctly but fail to build a shape from the first code snippet. Note that a single syntactic element -- a compound path (i. e. a path with multiple subpaths), in fact -- can often contain more than one perceptual shape. A trivial case in point is the following source code,
<path d="M 209.37500 3.0000000 L 180.62500 9.2500000 L 205.00000 16.125000 L 214.37500 26.750000 L 223.12500 21.750000 L 209.37500 3.0000000 z M 139.37500 19.250000 L 109.37500 34.250000 L 89.375000 50.500000 L 94.375000 60.500000 L 106.87500 53.000000 L 116.25000 56.750000 L 113.12500 68.000000 L 127.50000 73.000000 L 139.37500 78.000000 L 151.87500 64.250000 L 149.37500 59.250000 L 165.62500 44.250000 L 158.12500 29.250000 L 143.75000 36.750000 L 139.37500 31.750000 L 150.00000 20.500000 L 139.37500 19.250000 z M 125.62500 88.000000 L 125.62500 95.500000 L 129.37500 95.500000 L 129.37500 88.000000 L 125.62500 88.000000 z M 25.625000 94.250000 L 10.000000 106.12500 L 6.8750000 125.50000 L 19.375000 126.75000 L 39.375000 101.75000 L 25.625000 94.250000 z M 16.875000 133.00000 L 16.875000 141.75000 L 33.125000 144.25000 L 25.625000 140.50000 L 16.875000 133.00000 z " id="rect1554" style="fill:#0000ff;fill-opacity:1.0000000;fill-rule:evenodd;stroke:#000000;stroke-width:3.1250000; stroke-linecap:round;stroke-linejoin:round;stroke-opacity:1.0000000;stroke-miterlimit:4.0000000; stroke-dasharray:none;"/>
which is perceived to be five separate perceptual shapes, visible in Fig. 8
and recognized successfully by our system.
In the second case (perceptual vs. semantic), the borderline between perceptual shapes and perceived objects tends to become blurred. Consider two illustrations of this phenomenon, the one more symbolic and the other more realistic. The seven perceptual shapes in Fig. 9 were intended to be taken together to represent a lighthouse on a map.
Shape acquisition is the first task at hand. All shapes in question here are perceptual (rather than semantic). A sequence of polygon vertices (rather than equi-distant boundary points) is the target shape representation that must be generated at this stage. Strangely enough, such shapes, that are suitable as input to the next processing step, are not readily apparent and self-evident in SVG documents, as might be naively assumed, given the existence of the six so-called "basic shape" elements of the language (and blithely ignoring paths).
When working with raster-graphics images shape acquisition involves segmenting the raw image to obtain individual regions whose boundaries provide the desired shapes. For simplicity let us assume that a contour (rather than region) -based approach is used. Any survey of these techniques (see [CosCes00] pp. 216-254 for an introduction) is well beyond the scope of this paper. A straightforward method that can be successful in very simple (and untypical) cases is to convert the original colour image to grey scale, threshold it to get a binary image, and then use an 8-connectivity tracing algorithm to produce a contour. The desired shape is thus at first represented by an ordered list of boundary points.
Shape acquisition is obviously entirely different when dealing with SVG. We have seen earlier (in 2.1) that shape descriptions can in general be contour-based or region-based. Of the two, contour-based approaches are the ones that better match vector graphics, in that the boundary contour itself and if-and-how it is explicitly filled are handled as two completely distinct matters.
When taking SVG source data as the starting point one could in principle go one of three ways:
(a) SVG-syntax based -- parse the SVG document (e. g. with SAX or ...) and go on from there: filter out the strictly graphics-related information from the total contents of the document and then "reconstitute" complete actual boundaries from data extracted from various attribute values, some of which may be quite complex in themselves.
(b) rendering-tree based -- use existing functionality (borrowed from ...) to convert raw SVG syntax into some internal representation that exposes in a more convenient form the information needed to acquire shapes.
(c) rasterisation-based -- just convert the SVG code into some raster format and begin from there, doing (some variant of) standard raster-graphics image shape acquisition.
Now it quickly becomes clear that options (a) and (c) are undesirable, if only because of their inefficiency -- doing a lot of work for nothing. With (a) one would have to do oneself a large part of the work that is essentially already carried out by any SVG user agent. Option (c) is even worse, in that it amounts to rasterising a vector representation and then vectorising it back into itself again. This leaves (b) as the only choice. Details of how this approach has been implemented are given in (4.1) below.
A shape acquisition problem of a different nature, but that still needs to be solved, is handling SVG source files that are not strictly standard-conformant but are "close enough" to be successfully rendered, if provision is made for recovering from some types of error.
Sequences of raw vertex coordinates are not however shape descriptions that can be used to compute a pairwise similarity measure between shapes in a way that matches well human intuitive judgements of degree of likeness, since many data sets that look like the same or similar shapes to human eyes would be incorrectly found to be quite different. A second module of the overall architecture is thus required to convert points into something better. Two main questions call for answers here:
(a) what improved description is adequate ?
(b) what is the best -- or at least, a good -- way to generate it ?
We have seen in 2.1 above that a large number of answers have already been given to the first question and that none of them has emerged as preeminent and uncontested. In order to avoid a discussion inappropriate here, we limit the options entertained to Fourier Descriptors (FDs) derived from the coordinates of perimeter points. This is admittedly a conservative choice, but one that is safe and probably the best starting point seen in the perspective of sensible follow-up work. This limitation still leaves open the matter of exactly what properties the FDs should exhibit. In order to capture the intuitive notion of perceptual shape, the desired FDs should be insensitive to variations in position, size, and orientation (i. e. normalized for similarity transformations) as well as to shifts in the starting point on the perimeter. Shearing in either direction applied to a given original shape should still be considered to produce a different shape.
Having answered question (a) makes it possible to deal with (b): how are these FDs to be computed ? We summarize here the standard technique, that is most often applied in the work reported in the literature. It is the one used, for instance, by Keyes and Winstanley (cf. 2.4.2). To a large extent, it adapts the problem (shape description) to the solution (FFT as borrowed from signal processing) instead of doing it the other way round. In 4.2 we describe long known, recently rediscovered but still widely ignored improvements to it.
The common approach can be summarized in the following eight steps:
choose an n power-of-2 number that has to satisfy the contradictory criteria of being not too small and not too large, in order to give a good fit when this number of points are used to describe any shape, be it large and simple, large and complicated, small and simple, small and complicated
resample, i. e. calculate n new perimeter points from raw boundary data, such that these new points: are equi-distant from one another but do not miss out any of the original points with significant curvature
use FFT to compute n Fourier Coefficients from the n new perimeter points
convert these FCs (complex numbers) into FDs by discarding from each one the phase (or argument) and retaining only the amplitude (magnitude), thus producing convenient real values ... and losing information valuable for comparison purposes
discard FD_{0}, in order to normalise for translation
divide all remaining values by FD_{1}, in order to normalise for scaling
discard FD_{1}, which is now equal to 1 in all cases
retain only a "small" number of the next low-order FDs as the final description
The third main module is the clustering software, that uses pairwise similarity measure values (based on e. g. Fourier Descriptors) to do its own job. Here again "clustering" is generic term covering a wide variety of actual algorithms, the most well-known of which were already introduced in 2.2 above.
Three open problems need to be addressed here: the clustering algorithms per se, the two dissimilarity functions, and cluster validity assessment.
All algorithms that require setting of parameters as a precondition to execution are unacceptable here. This rules out the whole family of partitional clustering techniques, since the number of desired clusters must be fixed at the outset. In the application context that we anticipate end-users will have no idea of what this value should be. Some other techniques require setting certain parameter threshold values before any clustering can take place. This is also something our expected end-users would be incapable of doing in any principled way.
The only options that remain available then in practice are hierarchical clustering approaches. Among these, only agglomerative algorithms are computationally efficient. Complexity considerations come into play when attempting to choose among these latter possibilities: single-link is the clear winner in this competition. Intuitive quality considerations are also relevant: average-link seems to be preferred in many but not all cases.
Two kinds of dissimilarity need to be handled: between individual cluster member data items and between complete clusters.
The first kind depends on how the individual data items are represented. In our approach each item is characterised by a vector of FDs. In other words, all the features appearing in the vectors are FDs: they are all continuous, there is no variation of scale among them, and there can be no missing values. This thus simplifies the problem with respect to the general case (cf. [JaiMurFly] section 4 or [Gordon] chapter 2) and makes it at least possible to choose what can well be seen as the standard option: the Euclidian distance, i. e. the square root of the sum of the squares of the individual distances between items at corresponding positions in the vectors. Since our items are in fact FDs, this presupposes that the dissimilarity between two given FDs as such can in turn be computed. As already noted in 2.3 above, [BaCiaPa] argues in favour of replacing the Euclidian distance with the Time Warping Distance.
The problem of dissimilarity between complete clusters, i. e. intercluster dissimilarity, is in our case a matter of the clustering algorithm being applied, as already explained in 2.2 above.
When you run a clustering algorithm on any data set at all, it always produces clusters -- even if a competent human observer, upon inspection of the same data, would judge that no clusters should have been found, i. e. that that data set contains no natural groupings. It can also occur that the clusters created by the algorithm are judged to be wrong or inappropriate. This then poses the question of how to objectively evaluate the output of clustering algorithms. This aspect of clustering as a real-world tool was, for the sake of simplicity, omitted in 2.2 above. Even a short overview of it (cf. [JaDu] chapter 4, [Gordon] 7.1 - 7.2, and [CLMMS] A.1.4) would go well beyond the scope of this paper, but it is too important to be ignored here entirely.
To some extent, the problem of deciding how good or bad results are, is alleviated in our case by the very kind of data -- shapes -- that is being clustered. When working with visual data, one kind of suboptimal or partially erroneous cluster can be detected relatively easily by visual inspection, since the human visual system is comparatively good at detecting dissimilar items within a group. This possibility does not exist when working with traditional strictly numerical values as data to cluster. This kind of direct, local inspection needs to be supported interactively, but how this can be done is rather the topic of the next section 3.5. This approach is useful to detect mistakes that are internal to any given cluster, i. e. members that probably do not properly belong there, but it cannot of course tell you that some correct members are missing there, i. e. should have been included there but actually were assigned to some other cluster(s). Nor does it enable any overall assessment of the results as a whole.
Some assistance in determining how well the complete cluster tree generated does or does not fit the original data was thought to be indispenable for end-users as part of the exploratory process. The most commonly used method to do so is to calculate the so-called cophenetic coefficient, which indicates to some extent how well or how poorly the distances between data items derived from the cluster tree (ultrametric distances) correlate with the original, clustering-independent ones. Section 8.3.4.5 of [CosCes00] gives a minimal, easy-to-follow example. Section 4.3 of [JaDu] discusses this topic in much more depth.
Details of how we calculate cophenetic coefficients are explained in 4.3.3 below.
The fourth main module is the GUI enabling end-users to view clustering results conveniently. Two-dimensional plots (of higher dimensional data), such as those presented in [Gordon] chapter 6 would be inappropriate here. The first problem to solve is displaying all or parts of the cluster tree, keeping in mind the fact that it will generally be quite large for large data sets. The second one is how to best compactly represent a cluster itself in its entirety, i. e. how to "sum up" a cluster. The third one is how to best to show the data item members of a given cluster. The fourth problem is enabling users to instantly recover the context in which individual data items appear, i. e. the complete SVG document from which the item was extracted.
Among other, less widely used options for graphical representations, the standard, classical solution to the first problem is the dendrogram (cf. [Gordon] Fig. 4.2), in the strict sense of the term, i. e. a graphical valued tree, as can be seen in Fig. 2 above. The vertical dimension, the height, as summarily indicated by the scale "h" at the left of the tree itself, is an essential aspect of the representation, in that it accurately and proportionally reflects the relative compactness of the clusters. (In the loose sense, dendrogram is sometimes misused for any kind of graphical cluster tree, including those where compactness information plays no role in determining positions.) Strict dendrograms are frequently used for paedogogical purposes, to illustrate the nature of hierarchical clustering. They work relatively well on (very) small data sets but become impractical to visually inspect on usual computer screens when the number of clustered items is greater than about 100. Since they do not scale up well, they are no option in our context.
Clustering APIs and software systems usually ignore this problem, supposing that it has in fact no general solution: what would be useful and appropriate depends both on the kind of clustering algorithm that produced the results and on the application domain (what sort of "things" have been clustered). With a partitional algorithm, the clusters have no particular relationship to one another, and the only information to be displayed is the membership of any individual cluster. In the case of hierarchical clustering, one would like to see not only the objects in any given cluster but also how clusters themselves relate to each other, i. e. the parent and child relationships. The JDM API (http://jcp.org/en/jsr/detail?id=73), for instance, supports retrieving the Cluster object that is the parent of another one and those Cluster objects that are the children of another one, but neither assistance with displaying such objects to the end-user nor any view of the cluster tree as a whole is provided. The WEKA system (http://www.cs.waikato.ac.nz/~ml/weka) includes a rich set of data mining tools but no support for solving the results display problem. The CLUSS prototype (cf. [BrechEtAl] Fig. 6.12) on the other hand does indeed include functionality that is relevant in this respect but of course not at all adapted to an SVG-based context.
How we tackled these four display and manipulation problems, keeping in particular in mind that SVG documents are both the start and end points in typical usage cycles, is explained briefly in 4.4 and exemplified more fully in Section 5 below.
The four problems and their corresponding modules introduced at the beginning of Section 3 above, depicted together in Fig. 7, and further elaborated on in subsections 3.2 to 3.5, can now be refined into a greater level of detail, describing close-up the solutions actually implemented at present. We shall resort to a combination of tutorial-like walk-through of a small and simple use case and more technical explanations of the algorithms running behind the scenes.
The first thing to do is to select the documents, the shapes of which are to be clustered. In general, this would of course include documents anywhere on the Internet, but the current prototype only supports loading local files. The SVG file loader window is shown in Fig. 11 in its primitive state, when the software has just come up and no files at all have as yet been chosen.
The empty band of space at the bottom of the window could be used to display status messages.
Three buttons and one slider are available for interaction. The slider makes it possible to change the default number (32) of Fourier coefficients that will be computed to create the descriptors for each shape. Non-expert users will not want to change this setting. Of the three buttons, only "add file(s)" should really be enabled at this point, since as long as no files are present, there are no shapes to cluster ("do clustering") and no raw data to examine ("show cci", i. e. open the Clustering Candidate Inspector (see below) on a selected file). The table in the main part of the window is still empty, with only the three column headers being visible.
Our user decides to load the 28 files in the open clip art (NB: version 0.18) "flowchart" directory, using the standard file-chooser. This leads to Fig. 12, where each loaded file occupies a row in the table now.
Upon inspection of the situation for the flowchart files, it can be seen that most of the files contain nothing but shapes (100 %), while others reveal mixed content. In this latter case, users may well be interested in finding out what is going on, i. e. of the complete document contents which items are (clusterable) shapes and which are unclusterable non-shapes. Providing such information is the job of the Clustering Candidate Inspector (CCI), which the user has invoked for "fc09.svg", bringing up the window in Fig. 13.
The CCI offers four views (via the four tabs) of the same document. The view in Fig. 13 includes only (and all) the shapes and shows them as thin black outlines on a white background, regardless of how the original document was styled, positioned as in the original. The "unclusterable non-shapes" tab displays the graphical complement, i. e. all the remaining graphical material. This view of "fc09.svg" is given in Fig. 14.
Now it turns out that four documents (fc01, fc10, fc12, and fc13) contain no shapes at all. So it would have been best, if the software itself had deselected them (check boxes in the leftmost column) from the very beginning. At present, users should do this manually. Note however that shapeless files can remain selected, with no harm being done or resources being wasted later on during the clustering phase.
The user is now ready to do some clustering and clicks on the corresponding button, bringing up a new window, as seen in the centre of Fig. 17.
The process of getting polygon vertices from SVG source syntax is carried out in part by means of the Batik API, in particular the classes in the packages org.apache.batik.bridge and org.apache.batik.gvt. First, the source file is parsed using a SAXSVGDocumentFactory object, giving rise, when all goes well, to an SVGDocument object. The Batik parser insists on strict conformity, while experience shows that many SVG files in circulation in the real world miss the mark to some extent, e. g. one or more required namespaces have not been defined. Some of the exceptions thrown by Batik can be caught by our code and our working copy of the original file slightly modified accordingly to make it acceptable.
Once parsing is completed successfully, the BridgeContext is set to DYNAMIC, ensuring access back to the original DOM elements. The document is rendered off-screen, using an org.apache.batik.gvt.renderer.DynamicRenderer. If the document width or height is not specified in the source file and defaults at first to one pixel, these dimensions are reset to reasonable values, ensuring that the derived graphical objects will have sizes larger than small fractions (which would cause problems later on, when normalising the Fourier coefficients). A GVTBuilder object creates the corresponding GVT tree, which is then walked recursively down from its root, retaining only and all the ShapeNode objects for further processing.
For each ShapeNode thus found its outline java.awt.Shape is recovered, taking into consideration its concatenated global transform as obtained through the GVT tree. The original outline is flattened to provide a list of line segments approximating it. This flattened shape has any duplicate points eliminated and is then checked to see if it is a non-self-intersecting, closed contour. Only shapes that pass these tests are retained, and for each of them a list of java.awt.geom.Point2D objects is constructed. These are the polygon vertices (Fig. 1 above illustrates the general principle involved but shows a much rougher approximation than ours) that constitute the interface between the first and second main modules.
The processing in this module consists of two phases. First, raw Fourier coefficients are computed from the polygon vertices. Then these coefficients are normalised, to finally produce the desired Fourier descriptors. The method described here originated with [PerFu] and follows closely the work of [ArbEtAl]. It is thus rather different to -- and better than, for our purposes -- the more widely used approach already summarised in 3.3 above.
Besides the points themselves the number of coefficients to be generated (cf. Fig. 6 above, where "N" takes on eight progressively larger values, beginning with 2 and ending with 46) is an input parametre here. The value "n" defaults to 32 but can be set with the slider in the File Loader window. It follows the usual Fourier notational convention and thus refers to coefficients C_{1} to C_{n} . The total number will thus be 2 * n + 1, that is to say the coefficients
C_{-n}, C_{-n-1}, . . . , C_{-1}, C_{0}, C_{+1}, C_{+2}, . . . , C_{+n}.
The coefficients themselves will, of course, be complex numbers (org.apache.commons.math.complex.Complex objects).
It is convenient to carry out some preliminary work before directly computing coefficient values. Thus the vertex points (the number of which is called V in the formulae below) are examined to see if they are already in counterclockwise order. If not, the order is reversed to make it counterclockwise. The total arc length of the polygon is ascertained (and called T for use in the formulae below). The cumulative arc lengths between pairs of points are then calculated and retained. The x and y coordinate pairs of the points are now converted to complex numbers. Finally the slopes between two consecutive vertices X, i. e. the normed differences Z, are calculated as
Z_{i} = ( X_{i+1} - X_{i} ) / | X_{i+1} - X_{i} |
and retained.
Computing the coefficients C_{n} can now be done and loops from -n to +n for n as explained above. C_{0} is a special simple case, obtained according to the formula
C_{0} = (T / 2) * Σ _{k=0 .. V-1} ( X_{k} + X_{k+1} ) * | X_{k+1} – X_{k} |
All other coefficients require much more work, as per the following formula
C_{n} = ( T / (2*π*n) ^{2} ) * Σ _{k=0 .. V-1} ( Z_{k-1} + Z_{k} ) * e ^{-i * 2 * π * n * tk / T}
where e is the Euler function and is actually replaced in our Java code by the equivalent
cos (2 * π * n * tk / T) – j * sin (2 * π * n * tk / T)
The raw Fourier coefficients are now available and can be used as the basis for computing the normalised descriptors.
Here again, a little preparatory work is required, namely obtaining the degree of rotation symmetry of the polygon being described. This is the value rs such that an object rotated through an angle of 360° / rs is identical to itself. Squares, for instance, have an rs of 4. Typical irregular figures will have an rs of 1. This value is useful in that only some coefficients C_{n} will have non-zero values, i. e. those C_{n} such that n = 1 + rs * k or n = 1 - rs * k, for all k = 0, 1, 2, ...
The normalisation of the coefficients for translation, scaling, rotation, and start point shift can now be carried out. This presupposes a conversion of the complex values from the Cartesian or rectangular form (real part and imaginary part) to the polar representation (magnitude and phase). To handle translation, it suffices to set the magnitude of C_{0} equal to 0. To handle scaling, the magnitudes of all coefficients are divided by the largest one (which is usually, but not always, C_{1}). Handling of rotation and start point shift modifies the phase of the coefficients and involves the degree of rotation symmetry among other things. Details and formulae would go beyond the scope of this paper.
Each final descriptor value results from multiplying the exponential function of the complex number formed from 0 and the modified phase of the coefficient by the complex number formed from the normalised magnitude of the coefficient and 0.
Each shape is now represented by a vector of Fourier Descriptors. Pairs of such vectors will be compared during clustering by means of a dissimilarity function, thus providing distances between individual shapes, which are the basic input to the clustering algorithms.
There are three mutually complementary components that must be made clear here: the algorithm implementations as such, the dissimilarity functions invoked during clustering, and use of the cophenetic coefficient for cluster validity assessment.
The window entitled "clustering progress dialog", visible in the centre of Fig. 17, offers three capabilities.
With the combo-boxes in the top row the user can select the desired characteristics of the clustering run about to be carried out. These involve three options: the leftmost combo-box determines which algorithm to run (details below in 4.3.1), the middle one decides if a matrix of data item dissimilarities will be constructed or not, and the one at the right sets the dissimilarity function (details below in 4.3.2) to use (although at present there is actually only one option).
The second functionality is provided by the progress bar in the bottom row at the left, which displays in real time how far the clustering run has advanced. This is quite useful for the largest sets of data (about 3000 shapes) that can be handled by the current implementation running with only 1 gigabyte of main memory on a single-core processor, the complete clustering of which can take up to about 10 minutes, in the case of the naive implementations.
The third, obvious use of this window is provided by the two buttons for starting ("run") and aborting ("stop") clustering.
All three hierarchical agglomerative algorithms -- single, complete, and average link -- have been implemented in a naive, straightforward way. Sibson's SLINK algorithm [Sibson] for the single-link variant has also been implemented. An improved version of average linking, following the updating approach explained in Section 4.1 of [KauRou], has begun to be implemented but is not yet complete.
The naive implementations offer two options: (a) first computing a (lower-diagonal) matrix for the pairwise dissimilarities between all the data items to be clustered and simply referring to it later for distance values, and (b) avoiding using up main memory for the matrix and recalculating distance values every time they are needed.
The classes developed here extend the weka.clusterers.Clusterer class and thus could fit into that framework. They all follow the same basic pattern of the following main sequential steps:
create the leaf-level singleton clusters from the individual data items
add all these clusters to the collection of orphan clusters
do any appropriate preprocessing
loop over all orphan clusters as long as there is more than one of them and:
find the pair of orphan clusters that are more similar to each other than any other pair
create a new cluster by merging the two clusters just found (thus building up the tree)
remove these two clusters from the collection of orphans
add the new cluster to the collection of orphans
The differences between the four algorithms concern essentially the steps (c) and (d1). For (c) the three naive approaches do nothing, whereas SLINK builds its so-called "pointer representation", which already does all the real work and can be trivially converted into a dendrogram. Conversely, for (d1) SLINK has already prepared the information required to answer the question, whereas the naive algorithms have to repeatedly rebuild the orphan cluster dissimilarity matrix, applying their respective measures of pairwise cluster dissimilarity in the process.
Running the four algorithms on the same data is usually enlightening. The two versions of the single-link algorithm always produce the same results, as one would expect. Comparing the three kinds of link among themselves is more interesting. Figs. 18, 19, and 20 show the top of the cluster trees that result from analysing the flowchart data, with which we have already been working.
The dissimilarity (i. e. distance) between two individual Fourier Descriptors must be a float or double value rather than a complex number, and it must be greater than or equal to 0 and smaller than or equal to 1. This is currently obtained by taking the difference of the magnitudes of the two FDs (and ignoring their phases).
The dissimilarity (i. e. distance) between the two vectors of FDs describing a given shape is the standard Euclidean distance (cf. 3.4.2).
The value of the cophenetic coefficient gives users some idea of the extent to which the data that they have just clustered actually does consist of natural groups. This value is displayed (with a pointlessly large number of decimal places now) at the top right of the cluster tree window, as already seen in Figs. 18, 19, and 20 which continue the walk-through of our flowchart example.
The cophenetic coefficient is easy to compute in terms of the code required but can be problematic to actually carry out at run time, if a large number of items have been clustered, since it requires a second matrix the same size as the dissimilarity matrix, which means that main memory limitations become twice as severe.
The basic approach is to build the cophenetic matrix and then compute the well-known Pearson Product Moment Correlation Coefficient (PPMCC) between its values and those of the original dissimilarity matrix.
The values in the cophenetic matrix (also referred to in a broader context as the "ultrametric distances" in the literature) are found by retrieving the lowest common ancestor cluster (in the complete cluster tree) of the two singleton clusters containing the two items being compared. The compactness value of this ancestor cluster is then entered into the corresponding cell of the cophenetic matrix
We use the jsc.correlation.LinearCorrelation class of the Java Statistical Classes library (http://www.jsc.nildram.co.uk/index.htm) to produce the final result.
Once clustering has been accomplished, users need to examine the results, in varying degrees of detail. Four main capabilities satisfy this need: top-down browsing of the cluster tree as a whole, cutting the tree at a given threshold (cf. the horizontal line in Fig. 2 intersecting four parent-cluster / child-cluster links) to locate the clusters and (possibly) outliers lying at that level, inspecting individual clusters to find out which data items they contain, and retrieving the SVG document from which a given cluster member item originated.
Fig. 19 showed the cluster tree for our flowchart data in its original, minimal form with only the children of the root already being displayed. The nodes can be interactively expanded or reduced (as is usual with a javax.swing.JTree object) by clicking. Fig. 22 shows the tree in somewhat more detail.
Climbing around in the complete tree is only one of two ways to initially inspect clustering results. The other one is to use the button just mentioned. Doing so opens a pop-up window, enabling users to type in a decimal value between 0 and 1. Values greater than the root cluster compactness (in the current example: 0.24204759401995274) are currently allowed but do not make good sense, in that only the root cluster itself will then be found. Which values are most revealing of useful structure depends on the given data set and must be discovered by browsing on the basis this functionality: the threshold can be repeatedly reset higher or lower, and the new results compared to the previous ones. 0.1 often leads to a result from which a first set of conclusions can be quickly drawn.
Proceeding in this fashion with our running example brings up the windows illustrated in Figs. 23 and 24, with, respectively, the outlier instances -- none in this case -- and the clusters -- five in this case -- that were generated by the requested cut.
The complete absence of outliers often indicates that the cut was not too low.
In both kinds of cluster-level view (i. e. tree and cut results) a cluster is represented by three pieces of information: its medoid member shape, its cardinality, and its compactness (or maximum dissimilarity). The medoid member of a cluster is the one that is nearest its centre and thus representative of it. For relatively homogeneous clusters, the medoid shape gives a good idea of what the cluster actually contains. In the cluster-level views this shape has been scaled up or down, to make it the size of an icon convenient for display. This can turn out to be confusing for novice users, since the same shape as seen in the original SVG document can be (much) larger or smaller.
Once an overview of clustering results has been put to good use, further details will probably be desired, i. e. inspection of individual clusters to find out exactly which data items they contain. This can be done from any three-part summary representation of a cluster, be it in the tree or a window with results of a cut. A click on a cluster summary brings up a window -- cf. Fig. 21 -- with a scrollable, tabular display including all the shapes in that cluster. Each icon is labeled: either with the SVG source code ID of the element that generated it (if it indeed had one) or with a string derived from the kind of element (here always a path) it came from.
In a general, non-SVG-specific clustering context this level of information, i. e. individual cluster membership, is as far down as end-users can dig into results. SVG language elements, however, only ever exist within a given SVG document. In other words, each shape had been extracted from some document and remains closely related to it from the end-user's point of view: once an interesting shape has been located, in order to further exploit it for practical purposes, users need to jump back out of the world of isolated geometrical entities and into the world of SVG and the complete document context. Retrieving the SVG document from which a given cluster member item originated is supported in two ways, as two menu items in the right-button context menu on the shape icons. The first item "name origin URI" pops up a dialogue window simply naming the origin URI (currently a local file), which sometimes is all that users want to know. The second item "open cci" launches the Clustering Candidate Inspector (cf. 4.1 above) on the origin document, making it convenient to exhaustively investigate context details.
Additional examples of all four capabilities in action are given in the use cases described in the following section.
So now that you have built it, does it really work ? An objectively well-founded answer to this question is not currently available and probably never will be, given that there are no generally recognised measures for evaluating the performance of clustering, comparable to recall and precision in the field of Information Retrieval. Nor are there any SVG benchmark data sets, analogous to the MPEG-7 Core Experiment shapes, that could be used as a reference, to at least see what happens on well-known and widely used materials.
The next best thing is no doubt experimenting with data taken from a variety of application areas, that may be thought to be representative in one way or another. Results from such experiments are reported in this section. Clip art (5.1), technical drawings (5.2), and the glyphs of Chinese characters (5.3) are the three areas that were chosen.
Experiments in the clip art area have been generally encouraging, while often pointing up the fact that human users spontaneously convert perceptual shapes into semantic shapes (cf. 3.1.2) -- something which our software cannot do. The trials carried out with technical drawings were neither very extensive nor very successful from a practical point of view, since "line" elements seem to be almost the only syntactic construct used and acquiring a perceptual shape from a combination of syntactic lines is something we cannot do. The tests on glyphs of Chinese characters are still at a preliminary stage but appear to confirm that we have a good basis and shall indeed later be able to build another level of processing on top of it.
Clip art includes a huge variety of contents -- which makes it a very useful testbed. We have selected three cases, representative of three kinds of possible usage: looking for something specific (5.1.1), obtaining an effectively helpful instant overview (5.1.2), and attempting to the same thing on data that in fact do not cluster nicely (5.1.3).
This first example illustrates a use case in which something particular -- a thunderbolt shaped arrow -- is being looked for but cannot be found with a traditional keyword based search, since (as it turns out) neither file names nor included metadata are useful in this case.
You are looking for thunderbolt-style arrows. A keyword-based search has not produced useful results, so you try our software. You go to http://openclipart.org, grab the 105 files in the ".../shapes/arrows" directory, and cluster their contents. Our system finds 236 shape instances in all and immediately shows you the top of the cluster tree, as can be seen in Fig. 25.
Now you have two (non-exclusive) options. (a) You can expand the tree, level by level, and visually inspect the results. In this case, a cluster with four thunderbolts comes into view at the tenth level down (see Fig. 26).
Consider the following scenario. A user needs some jigsaw puzzle pieces for an application she is preparing and knows where to get some (at http://openclipart.org/ in the directory .../shapes/jigsaw), but would prefer not to have to manually open and inspect all 48 similarly named files one by one, in order to find out what kind of pieces are available. Clustering the contents delivers a well organised overview with just a few clicks. Here is how to do it.
Having grabbed the files and run the clusterer, the user notes that the cluster tree (not illustrated here) includes 96 shapes and has a cophenetic coefficient of 0.951, which makes a strong structuring quite likely. Instead of browsing the tree, the user opts for immediately testing a cut at the typical starting value of 0.1. Results: no outliers at all are present, and only six clusters (cf. Fig. 29), all with high relative cardinalities, have been formed.
Selecting any entry in the summary table (Fig. 29) brings up a window containing all the shapes that are members of the corresponding cluster. Fig. 30 shows the 12 items of the type two in and two out opposed.
In this third example we run through a use case, that one suspects -- given the name of the directory and its implication for the kind of files to be found there -- will no doubt not be successful in the sense of producing meaningful clusters. It has been chosen in order to illustrate how such a situation typically unfolds.
Grab an arbitrary small number of files from the directory "unsorted": the first 17 in alphabetical order (going from "aktion.svg" to "break_carlos_katastrofsk_.svg") will do. In Fig. 31 it can be seen that the clusterability of the contents of the respective files varies from 41 up to 100 %.
One can simply push the "do clustering" button, leave the default settings selected, and obtain a first impression of what is going on by looking at the cluster tree window, as shown in Fig. 32.
So the user chooses to see what happens when the tree is cut at 0.1, as a quick first attempt to get an overview of the situation. This opens the two windows that can be seen in the Figures 33 and 34. The first one displays 15 of the over-all 54 outliers at the 0.1 level.
In order to confirm one's negative impression, one can return to the window with the cluster tree and expand it down some number of levels, in order to see if any meaningful structure shows up. Fig. 35 gives a view of the top seven levels of the tree -- no interesting clusters come to light.
The phrase "technical drawings" is a (no doubt suboptimal) generic term, intended to cover application areas that are characterised by the technicity of their content and certainly do not fall under the other two areas investigated here. It encompasses at least CAD and cartography, both of which are highly specialised and in which we ourselves have no personal expertise.
The CAD examples that we have been able to gather bring into focus a possible disadvantage of the definition of the term "shape" as polygon-approximatable, i. e. lines, typical polylines, and open curves are unclusterable non-shapes. This distinction is so fundamental that it is the first item of information that users of our software receive about any file that they are attempting to process: the clusterability factor in % is computed and displayed (cf. Figs. 12 and 31) as soon as a file is loaded and before any actual clustering takes place. The main motivation of the CCI functionality presented in 4.1 above is indeed to enable users to obtain a detailed view of the contents of any SVG document precisely in the light of this distinction. If most of the content is in fact composed of unclusterable non-shapes, the document in question is simply not suitable for processing by our system.
Now the clusterability factor for clip art SVG seldom drops below 50 % and is typically rather in the range of 90 to 100. For CAD data the situation seems to be -- judging from our limited and subjective experience to date -- quite different. Here the clusterability factor averages roughly 5 %, which effectively makes it pointless to attempt clustering such documents. Inspection of source code reveals that SVG "line" elements predominate very strongly, with some polylines and a few paths also being used.
The geometry of items on maps seems to be such that their clusterability factors lie on average midway between the favourable case of clip art and the unfeasability of CAD: some frequent items are mostly clusterable, while others are completely unamenable to clustering. Running through a pertinent use case can illustrate this state of affairs.
The SVG documents used as input here were obtained from http://svg.carto.net and are all maps of Corsica. The unclusterable non-shapes are the rivers, railways, and roads, which are all paths that do not close to form polygons. 348 (clusterable) shapes were found, with a cophenetic coefficient of 0.982, which is of course almost perfect. Cutting at 0.1 produces no outliers and only the seven clusters summarised in Fig. 36, all of which upon individual inspection show themselves to be completely homogeneous (as could be anticipated, in view of a 0.982 value).
While clip art and technical drawings are generally familiar in the SVG world, Chinese characters will certainly come as a surprise, since Unicode and SVG are not particularly germane to each other. The glyphs of modern Chinese characters constitute however a convenient and propitious test domain with several favourable characteristics: a closed world that is still big enough to be challenging, a controlled diversity of shape that makes clustering worthwhile, and a limited set of internal structures that make the problem of dealing with composite shapes relatively tractable and provide a good testbed for attacking it.
Fig. 37 shows a very simple but typical case in point.
Considering only the 200 most frequently used characters is convenient for demonstration purposes. They can be broken down into 959 shapes in all, which, after clustering (cophenetic coefficient 0.851) and a first cut at 0.1, give rise to 187 outliers and 73 clusters, ranging from 2 to 162 in cardinality. Fig. 38 shows information from the summary window, partially scrolled down.
Fig. 39 displays the membership of the cluster at the top of the summary.
For the time being we have only developed a proof-of-concept prototype, but the system demonstrates the feasibility of integrating SVG with Data Mining and Content-Based Image Retrieval, and begins to show what benefits can be gained by doing so. Although it is still early days, it is not too early to step back, put the work done to date into perspective, take stock of progress made, and look towards the future. It is customary to categorise such observations into the three groups: right, wrong, upcoming. This has been done here also. However, in some cases it is hard to tell at present if a given decision was right or wrong. But such choices can be pointed out and tagged as requiring further review before follow-up work is undertaken.
The most important thing that we did right was to do it -- at all. SVG is surely coming of age, but it is also probably confronted with what one might call an identity crisis. As an essentially passive vehicle for 2D graphical data, it is in competition with more application-domain-specific options, for instance in the areas of CAD and GIS. As a means (in conjunction with EcmaScript) of building active software, i. e. Rich Internet Applications (RIAs), it is running up against the likes of Flex, Silverlight (and Moonlight) , and JavaFX. It may thus indeed be a good time to again ask the question, what is SVG good for ? The larger the number and the greater the diversity of valid answers, the rosier the future of SVG.
On a less strategic, more technical level, at least two major choices now still look unequivocally correct: (a) computing Fourier coefficients from polygons according to the approach based on [ArbEtAl] rather than going the more usual FFT-based way, and (b) retaining (as do [ArbEtAl] and [BaCiaPa]) rather than discarding (as per standard practice) phase information, when coefficients are normalised into descriptors.
Two other choices were also probably right but may still have entailed disadvantages. In Section 3.2 we mentioned three possibilities for acquiring shapes from SVG source code, based on: raw SVG syntax, an internal rendering-tree representation (Batik GVT tree), and preliminary rasterisation. The option relying at first exclusively on Batik functionality is still clearly best, but with hindsight it has become clear that it does not facilitate recognising composite shapes. In addition, some analysis of source syntax would probably be helpful in this respect.
The second good but perhaps debatable choice was to avoid reliance on off-the-shelf software components for clustering capabilities. The growing awareness of how useful clustering can be and sinking hardware costs have combined in recent years to make clustering software much less exotic than it once was -- why write it yourself again, when somebody else has already done it ? A very incomplete and heterogeneous list of the resources available would include: the R language (http://www.r-project.org), JMSL Numerical Library (http://www.vni.com), XELOPES library (http://www.prudsys.de), LingPipe library (http://alias-i.com/lingpipe). Investigation of such options reveals that they are either convenient Java but too domain-specific (e. g. genomics) or else general enough but sufficiently hard to interface with that there would be no real net gain.
The final decision to be discussed here -- to define "shape" (cf. 3.1) essentially as a polygonal outline, thus excluding lines and typical polylines, and leaving it at that, i. e. not taking any additional measures to make lines and polylines somehow amenable to processing -- remains hard to evaluate unambiguously. It was no doubt quite reasonable for an initial prototype but may perhaps not be for later phases of development.
Mistakes are less embarrassing when you notice them yourself. Erroneous design is more serious than sloppy implementation. The GUI has, as a whole, remained an unpolished first cut: some of its kinks and wrinkles were already pointed out along the way in Section 4.
It was a mistake to have allocated resources in such a way that the non-naive implementation of the average-link clustering algorithm à la Rousseeuw could not be completed on time.
The dissimilarity functions presented in 4.3.2 are currently quick and dirty. A really good way to compute the distance between two individual Fourier Descriptors cannot be as simple as our implementation.
The cophenetic coefficient of the complete cluster tree and the medoid item of any individual cluster are both currently computed "after the fact", in that these functionalities were added after the basic clustering code had already been written without taking them into consideration. It can be suspected that if they had been thought about from the very beginning, overall efficiency -- both time and space -- could likely be improved.
The CCI functionality (cf. 4.1) works very well in most cases but is implemented in a somewhat ad hoc way. Making it fully general and supporting end-users as conveniently as intended turned out to be a good deal more difficult than expected -- ambitious specifications are relatively easy to write.
Aside, of course, from remedying the shortcomings uncovered in the previous subsection, a lot of other work quickly comes to mind. Three or four perspectives stand out as being the most obvious . . . and hopefully the most promising.
One is "shopping around" in the vast fields of shape modelling and clustering. Our goal at the beginning of the project was to take the best-of-breed results from these areas and apply them to SVG data. As was noted in the Introduction, there is little solid consensus as to what should indeed be considered best. Faced with the necessity of selecting the options to implement, we wanted to reduce risk and thus made the conservative choices reported in this paper. Too conservative ? Arguably not so, in that they provide an indispensable baseline reference, to compare other approaches against. Work in these fields had never paid any attention to SVG, so asking what works best specifically with respect to SVG is still an open question.
The second perspective is moving on from simple to composite shapes. This might be the most valuable practical improvement, in terms of making the software better adapted to end-users' real needs. That would be a big leap. A different angle of attack on the same perceived problem -- too much low-level clutter and not enough high-level organisation -- might be to be able to (automatically) differentiate "content-bearing" shapes from "decorative" shapes. The distinction that we have in mind here seems to be subsumed in the structure vs. presentation opposition, that is one of the basic tenets of XML ideology. An example will help clarify things. One and the same arrow, for instance, often shows up in an SVG document in the form of two shapes: itself, so to speak, and its shadow. Human visual perception differentiates the respective roles of the two shapes and accords more importance to the one than to the other. We have wanted to be able to tell our software, "don't bother with the shadows, just cluster the real stuff", but have no way of doing so.
A third way to follow up on our work would be to lift the restriction that originally excluded colour and texture to the advantage of shape alone, when it comes to selecting the features that are taken into account to determine perceptual similarity. This is already common practice, when working with raster graphics images, and colour (but not texture) in the SVG context was already taken into consideration by [DSDM].
Finally, a rather more strategic perspective should not be forgotten. We have made one small contribution to filling the gap in analytical tools for SVG data. What other progress has already been made in this respect ? What other software exists, that can provide you with valuable and interesting information about SVG documents, that you probably could not obtain otherwise ? Data mining (other than just clustering) SVG contents might tell us worthwhile things about how the language has been used to date and even about the language itself. The whole area of checking (programming language) source code for defects that are not syntax errors but could nonetheless be tidied up or improved on, began long ago with the venerable UNIX processor lint and has grown up into a rich array of invaluable analytical tools, many of which are built into language-specific IDEs. Would SVG benefit from having similar capabilities available for it ?
[ArbEtAl] Arbter, Klaus et al.: "Application of Affine-Invariant Fourier Descriptors to Recognition of 3-D Objects", IEEE Trans. PAMI, Vol. 12, No. 7, July 1990, pp. 640-647
[BaCiaPa] Bartolini, I., Ciaccia, P., and Patella, M.: "WARP: Accurate Retrieval of Shapes Using Phase of Fourier Descriptors and Time Warping Distances", IEEE Trans. Pattern Analysis and Machine Intelligence, 24 (4), pp. 509-522, 2005.
[Bigun] Bigun, Josef: Vision with Direction: A Systematic Introduction to Image Processing and Computer Vision, Springer, 2006, isbn 9783540273226
[Bracewell99] Bracewell, Ronald: The Fourier Transform and Its Applications, McGraw-Hill, 1999, isbn 0073039381
[BrechEtAl] Brecheisen, Stefan et al.: Density-Based Data Analysis and Similarity Search, pp. 94-115 in [PetKhan]
[CLMMS] Cao, Frédéric et al.: A Theory of Shape Identification, Springer, 2008, LNM 1948, isbn 9783540684800
[CosCes00] Costa, Luciano Da Fontoura and Cesar, Roberto Marcondes Jr.: "Shape Analysis and Classification: Theory and Practice", 1st ed., CRC, 2000, isbn 0849334934
[DaLiWa] Datta, Ritendra, Li, Jia, and Wang, James: "Content-Based Image Retrieval -- Approaches and Trends of the New Age", MIR'05, Nov. 11-12, 2005, Singapore, pp. 253-261
[DSDM] Di Sciascio, Eugenio, Donini, Francesco and Mongiello, Marina: "A Knowledge Based System for Content-based Retrieval of Scalable Vector Graphics Documents", Proc. ACM Symposium on Applied Computing, pp. 1040-1043, March 2004, Nicosia, Cyprus
[DSZ] Zhang, Deng Sheng: Image Retrieval Based on Shape, PhD Monash University, March 2002
[EsFaDam] Esposito, Floriana; Fanizzi, Nicola; and d'Amato, Claudia: Conceptual Clustering Applied to Ontologies: A Distance-Based Evolutionary Approach, pp. 42-56 in [RTZ]
[FerEtAl] Ferilli, S. et al.: Generalization-Based Similarity for Conceptual Clustering, pp. 13-26 in [RTZ]
[Gordon] Gordon, A. D.: Classification, 2nd Ed., Chapman and Hall, 1999, isbn 1-58488-013-9
[Granlund] Granlund, Gösta: "Fourier Preprocessing for hand print character recognition", IEEE Trans. Computers, Vol C-21, Febr. 1972, pp. 195-201
[GREC07] Liu, Wenyin, Llados, Josep, Ogier, Jean-Marc (eds.): Graphics Recognition - Recent Advances and New Opportunities, LNCS 5046, Springer-Verlag, 2008.
[JaDu] Jain, Anil and Dubes, Richard: "Algorithms for Clustering Data", Prentice Hall, 1988, isbn 0-13-022278-X
[JaiMurFly] Jain, A. K., Murty, M. N., and Flynn, P. J.: "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No.3, September 1999, pp. 264-322
[KauRou] Kaufman, Leonard and Rousseeuw, Peter: Finding Groups in Data, John Wiley and Sons, 1990, isbn 0-471-87876-6
[KeyWin99] Keyes, L. and Winstanley, A.C. : Fourier Descriptors as a General Classification Tool for Topographic Shapes, IMVIP '99 Proceedings of the Irish Machine Vision and Image Processing Conference, 193-203, Dublin City University, September 1999.
[LewEtAl] Lew, Michael et al.: "Content-Based Multimedia Information Retrieval: State of the Art and Challenges", ACM Trans. Multimedia Computing, Comm. Appli., Vol. 2, No. 1, February 2006, pp. 1-19
[LinEtAl] Lin, Jessica et al.: Multiresolution Clustering of Time Series and Application to Images, pp. 58-79 in [PetKhan]
[MarMPEG7] Martínez, José (ed.): MPEG-7 Overview (version 10), ISO/IEC JTC1/SC29/WG11N6828, Palma de Mallorca, October 2004
[PerFu] Persoon, Eric and Fu, King-Sun: "Shape Discrimination Using Fourier Descriptors", IEEE Trans. Systems Man and Cybernetics, SMC-7, 1977, pp. 170-179
[PetKhan] Petrushin, Valery and Khan, Latifur (eds): Multimedia Data Mining and Knowledge Discovery, Springer, 2007, isbn 1-84628-436-8
[Petru] Petrushin, Valery: Mining Rare and Frequent Events in Multi-camera Surveillance Video, pp. 80-93 in [PetKhan]
[PJVO] Van Otterloo, Peter: "A Contour-Oriented Approach to Shape Analysis", Prentice Hall, 1991, isbn 0131738402
[PSLR] Adobe Systems Incorporated: "PostScript Language Reference", 3rd ed., Addison-Wesley, 1999, isbn 0-201-37922-8
[RafMen] Rafiei, Davood and Mendelzon, Alberto: "Efficient retrieval of similar shapes", VLDB Journal, 11, pp. 17-27, 2002
[RTZ] Ras, Zbigniew; Tsumoto, Shushaku; Zighed, Djamel (eds): Mining Complex Data, Springer, 2008, isbn 3-540-68415-8
[Sibson] Sibson, R.: "SLINK: an optimally efficient algorithm for the single-link cluster method" in Computer Journal, 16, 30-34, 1973.
[SVG11] W3C REC SVG11 20030114 Scalable Vector Graphics (SVG) 1.1 Specification W3C Recommendation 14 January 2003 http://www.w3.org/TR/2003/REC-SVG11-20030114/
[Tombre] Tombre, Karl: "Is Graphics Recognition an Unidentified Scientific Object ?", pp. 329-334 in [GREC07]
[VelLat] Veltkamp, Remo and Latecki, Longin: Properties and Performance of Shape Similarity Measures, Dagstuhl Seminar Proc. Content-Based Retrieval 2006
[Veltkamp] Veltkamp, Remo: Shape Matching: Similarity Measures and Algorithms, in Proc. SMI 2001, pp. 188-197
[WaKh] Wang, Lei and Khan, Latifur: A New Hierarchical Approach for Image Clustering, pp. 41-57 in [PetKhan]
[ZahRos] Zahn, Charles and Roskies, Ralph: "Fourier descriptors for plane closed curves", IEEE Trans. Computers, Vol C-21, March 1972, pp. 269-281
[ZhLu01] Zhang, Dengsheng and Lu, Guojun: A Comparative Study on Shape Retrieval Using Fourier Descriptors with Different Shape Signatures, In Proc. of International Conference on Intelligent Multimedia and Distance Education (ICIMADE01), pp.1-9, Fargo, ND, USA, June 1-3, 2001
[ZhLu05] Zhang, Dengsheng and Lu, Guojun: "Study and Evaluation of Different Fourier Methods for Image Retrieval", Image and Vision Computing, 23(1):33-49, 2005.