Wendell Piez: Towards Hermeneutic Markup

Abstract

Read the abstract as presented in the conference proceedings (use an XML/XSLT-capable browser).

Students of this topic will recognize that I barely skim the surface of the problem here. The overlap problem, although famously difficult and worthy itself of a more extended treatment, must also be considered within the context of broader problems with the currently dominant architecture of document processing, which is designed to support the goals of publishing (especially publishing at scale in multiple formats), not of scholarly interpretation. In brief, what is called for is a data model and architecture supporting the following:

Arbitrary overlap and multiple concurrent hierarchies (these are related, but not the same thing)
Arbitrarily extensible annotations, capable in themselves of being structured and tagged
A flexible development framework supporting iterative and ad-hoc schema and query design

Each of these three points could be elaborated at length. The first two in particular are the subjects of ongoing research. The third is the focus of this presentation. Truly expressive applications of markup to scholarly text processing will come, it seems to me, from shifting attention from markup as such to document modeling as a research project in its own right, underlying and enabling markup and the applications based on it.

Presentation slides

The presentation slides are in PDF format. They give a very high-level view of the problem and are not intended to be self-explanatory.

Demonstration

The demonstration shows hermeneutic markup not in anything like its full potential, but only as a hint of what will be possible in a markup regimen that does not impose a single unitary hierarchy over a text. The markup here is extremely simple and straightforward, even trivial. The only thing at all remarkable about it (and the fact that this is remarkable is itself somewhat remarkable) is that it identifies phenomena in the texts that overlap, and that therefore cannot be directly represented together, at least at the same time, in XML.

The scholarly intent of this demonstration (such as it is) is to depict the way different examples of the sonnet form (mainly in English, but also with German, French and Spanish cases) have different rhythmic profiles in the interplay between their metrical (verse) structures and rhetorical or grammatical (sentence/phrasing) structures. The thesis is that any particular sonnet, and any moment within a sonnet, is more or less quiet or turbulent, turbulence occurring when the speech rhythms proper to the phrasing of the sonnet interfere with the regular flow of the meter. While these differences within and between sonnets are subtly apparent in reading (subject, of course, to the different interpretations provided to them by different enunciations), they can also be more dramatically represented by a graphical rendition in which the correspondence or interference between the two hierarchies is specifically drawn.

This markup, trivial though it is, is manifestly interpretive in at least two respects:

First, some measure of interpretation is required in order to discern and demarcate the structures themselves. In the case of verse structures (quatrains, sestets and so forth) these are for the most part given clearly enough by the conventions of the sonnet form (though with interesting exceptions: see the examples of Tennyson's Now Sleeps the Crimson Petal, Now the White and Meredith's Modern Love XXX). In the case of sentences and phrasing, perhaps a greater measure of more or less deliberate decision-making by the encoder is required to establish proper boundaries (although punctuation is also a serviceable guide, and indeed in the case of most of these examples, a first cut at the markup was facilitated by an automated routine that marked up phrase and sentence boundaries indicated by punctuation).
Secondly, however, these depictions invite the reader to interpret the poems further. What accounts for the differences between sonnets? This consideration brings us both in to each poem (inasmuch as each poem represents a distinct utterance with a particular relation between sound and sense) and out to sonnets in general and to the languages and periods in which they were written. It is perhaps up for debate what can be made, for example, of the regularity of Baudelaire's Correspondences (nineteenth century, French) as compared to Milton's On His Blindness (seventeenth century, English).

In order to create these representations, a library of sonnets is marked up in XML, with the XML tree structure representing the verse form, namely lines within couplets or quatrains. Another hierarchy, indicating the grammar or phrasing of the poem (its elements are s for sentence and phr for phrase), is marked up using a milestone convention (the LMNL CLIX notation) in which empty XML elements, rather than simple start- or end-tags, indicate the beginnings and ends of structures. This enables pipeline processing to create the following alternative formats and renditions:

ECLIX: extended CLIX: the document sources use CLIX elements interspersed into regular XML to indicate the presence of document structures that violate the clean nesting of the XML elements identified in the document.
CLIX: in which the hierarchy is flattened and all LMNL ranges are represented using CLIX notation in a stream. The stylesheet that creates this will work on any XML document and pick up any CLIX notation already in it.
xLMNL is an ad-hoc format for representing LMNL documents in XML. The stylesheet that generates this works on any (flat) CLIX instance. Students of the overlap problem will recognize xLMNL as a kind of standoff representation of overlapping ranges.
From xLMNL it is relatively straightforward to draw graphic representations of overlapping ranges in documents. The first one offered here is an arcs representation.
A map of a sonnet is a somewhat more comprehensive representation (also in SVG) of overlapping ranges in a sonnet. This is a dynamic view, allowing user interaction (via mouseovers and clicks) that dramatizes the way verse/line and sentence/phrase structures in the sonnets line up or overlap, as the case may be. (SVG's declarative animation features supporting the interactivity are implemented in JavaScript by Doug Schepers.)
LMNL sawtooth syntax (an alternative markup syntax for representing this data model) is also easily generated from xLMNL.
It is also relatively straightforward to perform analytical routines over xLMNL, showing which range types in a given instance overlap. In hermeneutic markup, heuristic operations of this kind will be invaluable, helping to inform schema development.
XML induction is the process of deriving XML hierarchies from flat LMNL instances. Here, inductions of two hierarchies are demonstrated: (a) the verse/line hierarchy (which happens to be the XML we started with) and (b) the sentence/phrase hierarchy, with verse/line structures represented using CLIX notation.
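
The step from milestone-bearing XML to a standoff list of ranges, as described in the list above, can be sketched in Python. This is only an illustration and not the XSLT 2.0 code used in the demonstration. It assumes that milestones are empty elements carrying sID and eID attributes (an assumption about the CLIX convention); the flatten function, the sample text, and the (name, start, end) tuple layout are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical ECLIX-like sample: the verse form is the XML tree, while
# sentences (s) are milestones assumed to use sID/eID attributes.
SAMPLE = (
    '<quatrain>'
    '<line><s sID="s1"/>Short one.<s eID="s1"/> '
    '<s sID="s2"/>Then a long one</line>\n'
    '<line>runs over the line.<s eID="s2"/></line>'
    '</quatrain>'
)

def flatten(xml_text):
    """Return (text, ranges); each range is (name, start, end) in character offsets."""
    root = ET.fromstring(xml_text)
    chunks = []     # text pieces in document order
    ranges = []     # completed ranges
    open_ms = {}    # milestone id -> (name, start offset)
    pos = 0         # running character offset

    def add_text(s):
        nonlocal pos
        if s:
            chunks.append(s)
            pos += len(s)

    def walk(el):
        sid, eid = el.get("sID"), el.get("eID")
        if sid is not None:
            # Start milestone: remember where the range opens.
            open_ms[sid] = (el.tag, pos)
        elif eid is not None:
            # End milestone: close the matching range
            # (an unmatched eID raises KeyError in this sketch).
            name, start = open_ms.pop(eid)
            ranges.append((name, start, pos))
        else:
            # Ordinary element: it becomes a range spanning its content.
            start = pos
            add_text(el.text)
            for child in el:
                walk(child)
                add_text(child.tail)
            ranges.append((el.tag, start, pos))

    walk(root)
    # Order by start offset, longer ranges first when starts coincide.
    ranges.sort(key=lambda r: (r[1], -r[2]))
    return "".join(chunks), ranges

text, ranges = flatten(SAMPLE)
for name, start, end in ranges:
    print(name, start, end, repr(text[start:end]))
```

In this sample the second sentence begins inside the first line and ends with the second, so its range crosses a line boundary, which is exactly the situation the XML tree alone cannot express.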

Of the stylesheets that perform these conversions, the only ones that are not entirely generic are the two that display the sonnet structures, the arcs and map views. These have been tuned for display of documents with ranges of the types given in the sonnets, namely octave, sestet, quatrain, couplet, line, s, and phr. All other stylesheets will work equally well on any document in which the CLIX notation is used to represent structures overlapping the main hierarchy, a format that is easily generated from many common workarounds used to represent overlap in XML.
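
To suggest why an analytical routine of this kind need not know anything about sonnets, here is a minimal Python sketch of an overlap report over a flat list of ranges. It is not the author's stylesheet. The (name, start, end) tuples, the function names, and the strict definition of overlap (two ranges cross, and neither contains the other) are assumptions made for illustration; a real implementation might treat shared boundaries differently.

```python
from itertools import combinations

def overlaps(a, b):
    """True when two (name, start, end) ranges cross and neither contains the other."""
    (_, s1, e1), (_, s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def overlapping_types(ranges):
    """Return the sorted pairs of range names that overlap somewhere in the instance."""
    found = set()
    for a, b in combinations(ranges, 2):
        if overlaps(a, b):
            found.add(tuple(sorted((a[0], b[0]))))
    return sorted(found)

# Hypothetical ranges: two lines in a quatrain, and two sentences,
# the second of which starts mid-line and runs to the end.
RANGES = [
    ("quatrain", 0, 46),
    ("line", 0, 26),
    ("line", 27, 46),
    ("s", 0, 10),
    ("s", 11, 46),
]

print(overlapping_types(RANGES))
```

Because the function only compares offsets and names, the same report could be run over ranges of any type, which is the sense in which such a routine can help inform schema development.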

More information about LMNL, the Layered Markup and Annotation Language, is available at lmnlmarkup.org.

Readers who wish to see or adapt the XSLT 2.0 code that performs these conversions are invited to contact the author at wapiez (at) wendellpiez.com.

Towards Hermeneutic Markup

An architectural outline

Wendell Piez

Digital Humanities 2010

King's College, London

July 9, 2010
