SVG for Historical Documents

Layout Encoding and Transcript Rendering


Table of Contents

Introduction
Style information extraction and storage
Definitions
Extraction Method
Result Encoding in SVG
Substance Rendering
A simple JavaScript Function set
Automatically Creating a set of applications for document image valorization
Conclusion and further work
Bibliography

The number of digitized documents has grown steadily over the past ten years. Document enrichment and valorization are key issues: documents have to be easily accessible, and additional information has to be provided to the reader without any alteration of the original document. Moreover, navigation through and within documents has to be intuitive.

Our study is mainly dedicated to handwritten documents, which are hard to render in a way that makes them understandable by any reader, but it can be generalized to any type of document.

A textual document is generally composed of two kinds of information: substance and style. The style comprises all the layout, style and shape information. The substance comprises the text transcription and any additional information.

SVG can be used to encode the style and also to render the substance in relation to the style. Using SVG in combination with TEI (Text Encoding Initiative) for transcription encoding makes it possible to create a real digital document rather than a simple facsimile of the original.

In the first part of this paper we explain how style information is extracted using image processing methods and stored in SVG for further use in on-line applications. In the second part we explain how we use SVG to render the substance information and offer a new way of browsing heritage documents.

For page structure extraction, we chose a structural approach. After extracting the connected components, we build a neighborhood graph. Text lines are extracted from the neighborhood graph and then processed to extract physical structure information. Logical information is then extracted from specific configurations of the neighborhood graph.

To perform structure extraction on a document page, we first extract its connected components. A connected component can be a letter, a fragment of a word or a complete word. Extraction is performed using edge tracking, as described in [Chang]. Figure 3, “Connected component extraction” shows the result of connected component extraction on a sample handwritten page. Each connected component corresponds to a structural entity and is used as a node to build an adjacency graph. Building the graph from connected components makes it possible to use both structural and semantic information, since connected components carry more meaning for document understanding than edges or gradient information. The major limitation of connected-component-based methods is component overlap between lines, which can lead to poor text line extraction. For our purpose this limitation is not critical, since component overlap generally does not occur between different paragraphs or between titles and lines.


For each connected component we search for the nearest neighbor in four directions: top, down, left and right. The exact orientation ranges are provided by an orientation estimation step using the Hough transform: the nearest-neighbor search is performed around the Hough direction, the orthogonal Hough direction, the inverse Hough direction and the inverse orthogonal Hough direction. Once the four neighbors are computed for each connected component, we build a weighted directed graph G=(V,A), where V={v1,v2,...,vn} and each vi represents a connected component of the page. The maximum out-degree of G is 4: each vertex is the tail of up to four arcs e=(vi,vj), each representing the link between a connected component and one of its neighbors. A={e1,e2,...,em} is the arc set of the graph.

The distance used to build the graph is an edge-to-edge distance rather than the classical Euclidean distance between the centers of gravity of the connected components (Figure 4, “Distance measure between connected component edges”).
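As an illustration, the graph construction can be sketched in Python. This is a minimal sketch under several assumptions that are not part of our method: each connected component is reduced to an axis-aligned bounding box (the hypothetical `Box` type), the Hough direction is taken to be horizontal, and the direction of a neighbor is decided from the box centers. The names `edge_distance`, `direction` and `build_graph` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of a connected component (hypothetical simplification)."""
    x0: float
    y0: float
    x1: float
    y1: float


def edge_distance(a, b):
    """Edge-to-edge distance between two boxes (0 if they touch or overlap)."""
    dx = max(b.x0 - a.x1, a.x0 - b.x1, 0.0)
    dy = max(b.y0 - a.y1, a.y0 - b.y1, 0.0)
    return (dx * dx + dy * dy) ** 0.5


def direction(a, b):
    """Coarse direction of b seen from a, assuming a horizontal text direction."""
    dx = (b.x0 + b.x1) / 2 - (a.x0 + a.x1) / 2
    dy = (b.y0 + b.y1) / 2 - (a.y0 + a.y1) / 2
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "top"  # the image y axis points downwards


def build_graph(boxes):
    """Return {i: {direction: (j, weight)}} with at most four arcs per node."""
    graph = {i: {} for i in range(len(boxes))}
    for i, a in enumerate(boxes):
        for j, b in enumerate(boxes):
            if i == j:
                continue
            d = direction(a, b)
            w = edge_distance(a, b)
            # Keep only the nearest neighbor in each direction.
            if d not in graph[i] or w < graph[i][d][1]:
                graph[i][d] = (j, w)
    return graph
```

With a real orientation estimate, the direction test would be made relative to the Hough angle instead of the image axes.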


Arc weights are given by the actual distance between connected components. This graph can be reprojected onto the manuscript image, as shown in Figure 5, “Graph reprojection”. Right arcs (direct Hough direction) are colored in cyan, left arcs (inverse Hough direction) are colored in black, top arcs (orthogonal Hough direction) are colored in red or blue depending on the weight, and down arcs (inverse orthogonal Hough direction) are colored in blue. When two arcs are superposed, only the top and right arcs are drawn.


Once the graph is extracted, nodes are labeled with three local features:

  • Hough orientation

  • Mean distance to linked nodes

  • Mean orientation of links

These labels, together with other information such as the in-degree and out-degree of nodes and the arc weights, are used for graph division in order to extract text lines and fragments.

A graph-based layout representation is an intuitive tool for accessing the topology and neighborhood of any structural component of the document. From connected components up to the global layout there is a hierarchy of information that can easily be represented by a graph. The concept of multi-resolution, which is naturally present in a document page layout, can be embedded directly in a graph. In our study, we take advantage of the ability of a geometrical graph to merge nodes easily. These nodes represent structural elements of a page at different levels of the decomposition (in a bottom-up strategy, moving from a connected-component representation to a fragment-based one). The following sections present our contributions at different levels of page layout and show the ability of our unified graph-based approach to produce a relevant structural decomposition of the page.

Text line extraction can be seen as a graph segmentation task: we have to divide our graph into N sub-graphs, where each sub-graph is a text line. To do so, we begin by extracting the borders of the page. Border extraction is a graph labeling step: we have to label each vertex of G with its corresponding label among the five classes described in Figure 6, “Text Border Extraction”.


In practice, graph labeling is based on a simple evaluation of the neighborhood of each node. If the out-degree of the node is equal to 4, the node is labeled as inner-text. If the out-degree is equal to 0, the node is labeled as isolated. If the out-degree is between 1 and 3, the label is computed from the result of an adjacency function applied to the node. The adjacency function returns 1 (Left) if the node has no left neighbor, 2 (Right) if the node has no right neighbor but has a left neighbor, and 3 (Top) (resp. 5 (Down)) if the node has no top (resp. down) neighbor and an out-degree of 3. We use the following color scale in Figure 6, “Text Border Extraction” to show the results of border extraction: yellow for left components, red for right, blue for top, green for down and black for inner-text.
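The labeling rules can be summarized by the following Python sketch. It assumes that each node is described by the set of directions in which it has a neighbor, and it returns label names rather than numeric codes. The order in which the rules are tested and the fall-through case (a node with only left and right neighbors, which the rules above do not cover) are assumptions made for the illustration.

```python
def border_label(neighbors):
    """Label a node from the directions in which it has a neighbor.

    `neighbors` is a set drawn from {"left", "right", "top", "down"}.
    """
    out_degree = len(neighbors)
    if out_degree == 4:
        return "inner-text"
    if out_degree == 0:
        return "isolated"
    if "left" not in neighbors:
        return "left"
    if "right" not in neighbors:
        return "right"  # a left neighbor necessarily exists at this point
    if out_degree == 3 and "top" not in neighbors:
        return "top"
    if out_degree == 3 and "down" not in neighbors:
        return "down"
    # Assumption: remaining nodes (left and right neighbors only) are
    # treated as inner-text; the text does not specify this case.
    return "inner-text"
```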

The result of border extraction is used as an initialization for text line extraction. To be consistent with the direction of Latin script, a line starts with a left border component and ends with a right border component. For other scripts the inverse methodology can be applied, starting from a right border component. Line extraction is performed by following the right link of each node as long as the node is not a right border. Once a right border is reached, the algorithm stores the sub-graph corresponding to the line and moves on to the next left border node. In order to check the validity of an extracted text line, we compare the length of the path between the first and the last node of the sub-graph with the theoretical length of the extracted text line. If the difference is less than 15%, the text line is validated. Once lines are extracted, paragraphs can be extracted by grouping lines using two criteria: interline space and text orientation variation. The methodology for fragment extraction is described in more detail in [09].
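The line-following step can be sketched as follows in Python. The sketch assumes a dictionary `right` mapping each node to its right neighbor and a dictionary `labels` holding the border labels. Since the exact definitions of the path length and of the theoretical length are not detailed here, both are passed in as hypothetical callbacks (`path_length` and `expected_length`); the cycle guard is also an addition for safety.

```python
def extract_lines(right, labels, path_length, expected_length, tolerance=0.15):
    """Follow right links from each left-border node to a right-border node.

    A line is kept only if its measured length differs from the expected
    length by less than `tolerance` (15% by default).
    """
    lines = []
    for start, label in labels.items():
        if label != "left":
            continue
        path, node, seen = [start], start, {start}
        while labels[node] != "right" and node in right:
            node = right[node]
            if node in seen:  # guard against cycles in the graph
                break
            path.append(node)
            seen.add(node)
        if labels[path[-1]] != "right":
            continue  # the walk never reached a right border
        measured = path_length(path)
        expected = expected_length(path)
        if expected > 0 and abs(measured - expected) / expected < tolerance:
            lines.append(path)
    return lines
```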

In our documents we identify several specific configurations where specific words can be extracted. These keywords are mostly:

  • Margin words. They are the first category of keywords and generally give a first indication of the content of the associated text region. The margin of a note page is extracted using vertical profile projection: the information density inside the margin is lower than in the rest of the page.

  • Isolated words at the top of the page. They give information about the content of the whole page: this type of keyword is associated with several text regions. They are extracted using the adjacency graph.

  • Large words. They generally represent titles in our documents. They are extracted by computing the length-to-height ratio of connected components.

  • Underlined words. These are important keywords: they are usually underlined by the writer to catch the reader's eye and give good prior knowledge of the topic of their container. Extraction is performed by analyzing the height-to-length ratio of connected components: components with a small ratio (less than 0.1) are treated as underline candidates. This method is not fully reliable: extraction thresholds are hard to define because of the high variability in the size and orientation of underlines. Other approaches are currently under investigation in order to improve underlined word extraction.

Sample keywords and their associated structure are shown in Figure 7, “Keywords extraction result” and Figure 8, “Keywords extraction result 2”.
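The ratio test used for underlines is simple enough to be written in a few lines of Python. The function name and the use of bounding-box width and height as "length" and "height" are assumptions made for this sketch.

```python
def is_underline_candidate(width, height, max_ratio=0.1):
    """Return True when the height-to-length ratio is below `max_ratio`."""
    if width <= 0:
        return False
    return height / width < max_ratio
```

A long, thin stroke such as a 200 by 6 pixel component passes the test, while a compact 40 by 30 component does not. As noted above, a fixed threshold is fragile when underlines vary in size and orientation.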



Once extraction is complete, we have to build the SVG file containing the whole set of structural information. Three classes of objects are defined:

  • Lines

  • Fragments

  • Keywords

Lines and keywords are described using paths in order to fit the words as closely as possible. Fragments are described with polygons in order to make correction by corpus specialists easier. Each element of the SVG receives a class name corresponding to its nature. The three class names are "line", "frag" and "keyword". In addition to the class information, each element receives a unique ID indicating the position of the element within the page. For instance, line seven of the document receives the class “line” and the id “filename_line_7”.

The following SVG sample illustrates the principle:


<svg width="1628" height="2426" xmlns="http://www.w3.org/2000/svg"  xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1"> 
  <image xlink:href="frag.JPG" y="0" x="0" id="image2392" height="598" width="1249" /> 
  <polygon points="1228,577 1228,20 20,20 20,577" id="frag.JPG_frag1" class="frag" transform="matrix(0.9999417,0,0,1.0475621,0.03637842,-13.416029)" />

  <path  d="m 73.4375,283.9375 37.5,-14.0625 51.5625,6.25 40.625,-14.0625 145.3125,28.125 139.0625,-14.0625 110.9375,14.0625 207.8125,0 20.3125,-18.75 20.3125,14.0625 367.1875,-1.5625 -25,56.25 L 701.5625,335.5 665.625,366.75 642.1875,348 635.9375,337.0625 246.875,332.375 195.3125,362.0625 190.625,326.125 79.6875,335.5 53.125,333.9375 l 20.3125,-50 z" 
     id="frag.JPG_line1" class="line" /> 
  <path  d="m 89.0625,207.375 -25,40.625 62.5,1.5625 0,15.625 31.25,4.6875 28.125,-15.625 160.9375,7.8125 -9.375,21.875 9.375,4.6875 34.375,-15.625 121.875,-9.375 1.5625,12.5 26.5625,-9.375 164.0625,3.125 L 912.5,273 l 206.25,0 56.25,0 26.5625,-37.5 -46.875,-12.5 -42.1875,1.5625 -48.4375,6.25 -40.625,14.0625 -50,-10.9375 L 740.625,223 662.5,210.5 l -164.0625,6.25 -71.875,18.75 -90.625,-21.875 -37.5,18.75 -59.375,-20.3125 -150,-4.6875 z" 
     id="frag.JPG_line2" class="line"/> 
  <path d="M 96.875,18.3125 64.0625,83.9375 93.75,96.4375 129.98623,94.636942 345.3125,83.9375 385.9375,85.5 396.875,26.125 310.9375,16.75 96.875,18.3125 z" 
     id="frag.JPG_kw1" class="keyword"/> 
</svg>

The resulting SVG file, rendered with a user-defined CSS, is illustrated in Figure 9, “Sample SVG file”.
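A file with this structure could be generated along the following lines. This Python sketch uses the standard `xml.etree.ElementTree` module; the function `build_layout_svg`, its parameters and the `filename_class_n` ID pattern (generalized from the “filename_line_7” example) are assumptions for the illustration, not a description of our actual tool.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"


def build_layout_svg(filename, width, height, lines, fragments, keywords):
    """Return SVG text with one classed, uniquely identified shape per object.

    `lines` and `keywords` are path data strings; `fragments` are lists of
    (x, y) polygon vertices.
    """
    ET.register_namespace("", SVG_NS)
    ET.register_namespace("xlink", XLINK_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg",
                     {"width": str(width), "height": str(height), "version": "1.1"})
    # The page image sits underneath the layout shapes.
    ET.SubElement(svg, f"{{{SVG_NS}}}image",
                  {f"{{{XLINK_NS}}}href": filename, "x": "0", "y": "0",
                   "width": str(width), "height": str(height)})
    for n, points in enumerate(fragments, start=1):
        ET.SubElement(svg, f"{{{SVG_NS}}}polygon",
                      {"points": " ".join(f"{x},{y}" for x, y in points),
                       "id": f"{filename}_frag_{n}", "class": "frag"})
    for cls, items in (("line", lines), ("keyword", keywords)):
        for n, d in enumerate(items, start=1):
            ET.SubElement(svg, f"{{{SVG_NS}}}path",
                          {"d": d, "id": f"{filename}_{cls}_{n}", "class": cls})
    return ET.tostring(svg, encoding="unicode")
```

Because every shape carries a class and an ID, a user-defined stylesheet can style lines, fragments and keywords independently, and scripts can address each element directly.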


In state-of-the-art applications, text zones are generally described by geometric shapes such as rectangles or ellipses. In most cases these shapes do not describe the text zone well. Taking the example of a line skewed by 30 degrees, a rectangle will include parts of the lines above and below the line of interest, whereas an SVG path can follow the edges of the line. Moreover, the vector format makes it easy to adapt the page structure description to the rendering scale.

Once the SVG file has been computed and validated by a human operator who is a specialist of the corpus, we can combine the SVG file, which describes the image structure, with the transcript, encoded according to the TEI standard. Each structural element is designated by a unique ID. Textual elements from the TEI transcription and structural elements from the SVG are matched sequentially. After that step, each line and each fragment is associated with its corresponding zone on the image. On this basis we designed an application to assist the reader, using SVG and JavaScript embedded in the SVG. The principle is quite simple: the SVG file is made reactive to mouse events using JavaScript, and when the user places the mouse over a text zone, the corresponding transcription is displayed in a dedicated zone. This zone can be placed over the pointed line or in a separate box. A working demonstration is available at the following website: http://www.malleron.info/index.php?page=demos
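The sequential matching step can be illustrated with a short Python sketch. It assumes that the transcript lines have already been read from the TEI file into an ordered list of strings (how they are marked up in TEI is left open here), and that SVG line elements are taken in document order. The function name `match_lines` is hypothetical.

```python
import xml.etree.ElementTree as ET


def match_lines(svg_text, transcript_lines):
    """Pair each SVG element of class "line" with a transcript line, in order."""
    root = ET.fromstring(svg_text)
    ids = [el.get("id") for el in root.iter() if el.get("class") == "line"]
    if len(ids) != len(transcript_lines):
        # Sequential matching only makes sense when both sides agree.
        raise ValueError(
            f"{len(ids)} line zones but {len(transcript_lines)} transcript lines")
    return dict(zip(ids, transcript_lines))
```

The resulting ID-to-text mapping is what a mouse event handler needs in order to look up the transcription of the zone under the pointer. The length check reflects the fact that a purely sequential match silently shifts every later line if one zone or one transcript line is missing.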

Using the SVG and JavaScript functions described above, we propose two different ways to render historical documents together with their transcription.