Today's coverage

Conceptual introduction to LMNL
- Background and motivations
- Design principles
Developments and current status
Short demonstration

LMNL via XML (ECLIX and CLIX)

Whence cometh LMNL?

Layered Markup and Annotation Language

Spring 2002: Jeni Tennison faces gnarly XML transformation problem involving pages overlapping paragraphs
Wendell Piez refers her to literature on CONCUR, TexMECS, Core Range Algebra (Gavin Thos. Nicol)
Tennison and Piez cook up LMNL as skunkworks “pure research” project
- Usually pronounced “liminal” (Lat. limen, “threshold” or “doorway”)
First presented at Extreme 2002 (late-breaking)
Current members of the Ad Hoc LMNL Committee: Jeni Tennison, Wendell Piez, John Cowan
Other occasional contributors, lenders and borrowers:

Matt Palmer, Paul Caton, Bert van Elsacker, Alex Czmiel, Gavin Thomas Nicol

(Apologies to anyone inadvertently left off this list)

Wherefore LMNL?

And why not TexMECS, BUVH/JITTs, or other contemporaneous efforts?

“Um, clarify something for me ... are the requirements for TexMECS the same as or different from those of other efforts to deal with overlap such as LMNL, and if they're different, how are they different, and if they're the same, then why aren't you guys collaborating?”

(Paraphrasing Jonathan Robie to Claus Huitfeldt, Extreme 2006)

Glib answer: but we are collaborating
Better answer:
- This is not like application or standards development, where the boundaries of the problem are clearly understood and the technology is designed to fit
- This is more like pure research: experiments intended to identify possibilities and tradeoffs
- Early on, more numerous independent efforts have advantages compared to fewer coordinated efforts
  - Consider and test more ideas and approaches
  - Help address the design problem(s) bottom-up as well as top-down
  - Friendly competition between unaffiliated researchers is a feature not a bug
    
    (Allows for meetings like this one!)

LMNL as materials research

LMNL as exploration (I)

Of an unknown continent on Planet Markup ...

LMNL as exploration (II)

If this region is interesting, it's reasonable to expect we're not the only ones to visit ...

Design goals of LMNL

Data object model supporting
- Overlapping structures
  
  Including arbitrary overlap (“self-overlap”)
- Structured annotations
  
  (richer than XML attributes)
  - Presenting arbitrary structures (markup) of their own
  - May be ordered wrt one another
- A data object model or API, not an abstract (mathematical) model
  
  Analogous to XML DOM, not DAG
- Consonant and compatible with —
An intelligible markup syntax

(Allowing all the advantages we see in XML's plain-text syntax)
A general solution to the “overlap problem”
- Open-ended with respect to processing requirements
  - Caveat: when XML is suitable, use XML!
- Not just static documents but documents under revision
- Not just documents conforming to predefined models, but documents in the midst of the design and modeling process
- Supporting flexible, iterative document model and application design

Informal design principles

LMNL should be discrete and easily distinguished from what it is not
- Subject to concise, clear, formal definition
- Simple enough to learn quickly and program easily (ha)
- Look and feel different enough to avoid confusion
  
  This applies to both its representations (syntax) and its terminology
- Dependencies on other specifications (e.g. Unicode) should be minimal and explicit
Since LMNL is defined as the model, not the syntax, more than one syntax is possible
- LMNL syntax (“sawtooth notation”) may be well-suited, but is not necessary
- LMNL processing is possible without LMNL syntax
  - Standoff markup
  - XML markup
  - RDBMS
  - Ad-hoc grammars and common mappings
LMNL should have a clean relation to XML
- XML is a legitimate and important target for LMNL transformation
- XML is not a problem or impediment, but a tool

What LMNL looks like: overlap

LMNL syntax (“sawtooth” syntax) is designed to work with LMNL:

```
[excerpt [source}The Housekeeper{] [author}Robert Frost{]}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]
```

We use the syntax `[r=r1}over[r=r2}lapping{r=r1] ranges{r=r2]` to disambiguate between self-overlap and enclosure (much like MECS)
Anonymous ranges (and annotations) are allowed:
```
A range can be [}marked{] without a name
```
And so are empty ranges, which have no width and may therefore “slide” with respect to neighbor ranges
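
To make the overlap in the Frost excerpt concrete, here is a minimal Python sketch that reads a tiny subset of sawtooth syntax and reports each range as offsets over the plain text. It is not the LMNL parser: it assumes only `[name}` start tags and `{name]` end tags, and ignores annotations, anonymous ranges, range IDs (`r=r1`) and atoms. The names `Range` and `parse_simple_sawtooth` are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class Range(NamedTuple):
    name: str
    start: int  # offset of the first character covered
    end: int    # offset just past the last character covered


# Simplification: only [name} start tags and {name] end tags are recognized.
TAG = re.compile(r"\[(\w+)\}|\{(\w+)\]")


def parse_simple_sawtooth(source: str) -> Tuple[str, List[Range]]:
    """Strip tags from `source`; return the plain text and the ranges over it."""
    pieces = []
    open_starts = {}   # range name -> stack of start offsets still open
    ranges = []
    offset = 0         # length of plain text emitted so far
    pos = 0            # read position in `source`
    for match in TAG.finditer(source):
        chunk = source[pos:match.start()]
        pieces.append(chunk)
        offset += len(chunk)
        pos = match.end()
        if match.group(1):  # start tag: remember where the range begins
            open_starts.setdefault(match.group(1), []).append(offset)
        else:               # end tag: close the most recent range of that name
            name = match.group(2)
            if not open_starts.get(name):
                raise ValueError(f"end tag without start tag: {name}")
            ranges.append(Range(name, open_starts[name].pop(), offset))
    pieces.append(source[pos:])
    if any(open_starts.values()):
        raise ValueError("unclosed start tag")
    return "".join(pieces), sorted(ranges, key=lambda r: (r.start, -r.end))


text, ranges = parse_simple_sawtooth(
    "[s}[l}He manages{l] [l}On his own farm.{s] [s}He's boss.{s]{l]"
)
for r in ranges:
    print(r.name, r.start, r.end, repr(text[r.start:r.end]))
```

Because an end tag here always closes the most recently opened range of the same name, this subset cannot express self-overlap; that is what the range IDs in the full syntax are for.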

What LMNL looks like: structured annotations

```
[excerpt}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt
  [source
    [title}The Housekeeper{title]
    [loc}lines 144-146{loc]
    [source
      [title}North of Boston{title]
      [date}1915{date]]]
  [author
    [name}[given}Robert{given] [family}Frost{family]{name]
    [date}1874-1963{date]] ]
```

Annotations

Can appear on end tags as well as start tags
Have order but not necessarily unique names
May have annotations
May have ranges over their content
... Are isomorphic to documents

Brief overview of LMNL object model

A LMNL document is based on a text layer

(A sequence of zero or more atoms, which generally correspond to Unicode characters, but which can also be represented otherwise in any notation)
A document may contain an arbitrary number of ranges over its content
Each range has
- A name (generic range name or range type name; analogous to XML element name)
- An identifier (may be implicit)
- Its text content (implied by the range's offset and length over its owner text layer)
- Zero or more annotations
Each annotation has
- A range (its owner)
- A position (in relation to other annotations of its range)
- A name (need not be unique within its range)
- A text layer, which may have ranges of its own
  
  (Annotations are isomorphic to documents)
Incidentally to this design, and especially because they are ordered but need not be named uniquely, it becomes possible to arrange for annotation trees within LMNL (annotations having annotations etc.). Annotations may become an interesting design option for mapping certain kinds of structured data.
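
The object model above can be sketched as a few Python dataclasses. This is an illustration under assumptions, not a LMNL API: the class and field names (`Layer`, `Annotation`, `Range`, `ident`) are hypothetical, and the text layer is modeled as a plain string, so atoms are treated as characters only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Layer:
    """A text layer plus the ranges over it (the shape of a document)."""
    text: str = ""
    ranges: List["Range"] = field(default_factory=list)


@dataclass
class Annotation(Layer):
    """Isomorphic to a document, with a name and annotations of its own."""
    name: str = ""  # need not be unique among its siblings
    annotations: List["Annotation"] = field(default_factory=list)


@dataclass
class Range:
    name: str
    start: int   # offset into the owner text layer
    length: int
    annotations: List[Annotation] = field(default_factory=list)  # list keeps order
    ident: Optional[str] = None  # identifier; may be left implicit

    def content(self, owner: Layer) -> str:
        """Text content implied by offset and length over the owner layer."""
        return owner.text[self.start:self.start + self.length]


# The [author ...] annotation from the structured-annotations example.
name = Annotation(name="name", text="Robert Frost",
                  ranges=[Range("given", 0, 6), Range("family", 7, 5)])
author = Annotation(name="author",
                    annotations=[name, Annotation(name="date", text="1874-1963")])
doc = Layer(text="He's boss.")
doc.ranges.append(Range("excerpt", 0, 10, annotations=[author]))
```

Keeping annotations in a list rather than a mapping preserves both their order and the freedom to repeat names.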

Design principle: less is more (especially for now)

LMNL is minimal

Aims to provide the bare minimum necessary for systematic application of semantics to labels over text (“markup”)
We believe that sustainable complexity and variation requires a simple basis
Remove one central assumption of XML
- Validation requires a context-free grammar (ergo, a single hierarchy)
- (BTW this means you may not be able to validate your tagging immediately at parse time)
... and see where it takes us
Optimization is left for later
- Notation (syntax)
- Implementation
- Various desirable and tempting features
  
  (Some, such as respecting tag ordering or virtual elements, could be supported through extensions or at higher levels)

Developments and current status

Since 2002: ongoing intermittent development
Development wiki at http://www.lmnl.org/wiki/index.php/Main_Page
Noteworthy:
- Introduction of atoms
- LMNL syntax parsing
- Codification of pathways to and from XML
  
  With LMNL processing on that basis (XSLT 2.0)
- Refinement of layers as limina
- Validation of markup patterns in LMNL: CREOLE
What we have not done:

Generalized an abstract data model (such as GODDAG)
- We expect this should (will) emerge in development as XML tree (infoset) emerged from XML
- Meanwhile we benefit from the insights of other researchers

LMNL atoms

A problem: LMNL ranges don't have much “thingness”

May have zero width (empty ranges)
Range order is underspecified (their contents are ordered but they are not)
Resulting problems relate to referring to non-character objects such as images

... and marking them up
```
[a [href}mypage.html{]}[img [src}myicon.jpg{]]{a]
```
or
```
[a [href}mypage.html{]][img [src}myicon.jpg{]]
```
... since range starts and ends are indicated only by character offsets, these are the same in the data model
One solution: support tag ordering
- ... But then tags (not just ranges) must be “things”
- This seems a high price to pay
  
  especially since usually tag order actually shouldn't matter in the data model —
- Will we then try to model when tag order matters and when it doesn't?
  
  (Cf. “spurious overlap”, Huitfeldt & Sperberg-McQueen)
Present solution: introduce atoms
- Atoms have width 1: offsets count the atoms
- All (Unicode) characters map to atoms behind the scenes
- Atoms can also be arbitrary objects (with names and annotations), represented directly in syntax
```
[a [href}mypage.html{]}{{img [src}myicon.jpg{]}}{]
```
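
A minimal Python sketch of why atoms help, assuming ranges are recorded only as (name, start, end) offsets. The token format and the names `Atom` and `build` are hypothetical. With empty ranges, the enclosing and the adjacent tagging produce the same model; with an atom, the `a` range has width 1 and really does cover the image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    name: str  # a non-character atom such as an image (annotations omitted here)


def build(tokens):
    """Turn (kind, value) tokens into (content, ranges).

    content is a list of width-1 items: single characters or Atom objects.
    ranges is a set of (name, start, end) offsets counted in atoms.
    """
    content = []
    open_at = {}
    ranges = set()
    for kind, value in tokens:
        if kind == "text":
            content.extend(value)            # one item per character
        elif kind == "atom":
            content.append(Atom(value))      # one item, whatever it stands for
        elif kind == "start":
            open_at[value] = len(content)
        elif kind == "end":
            ranges.add((value, open_at.pop(value), len(content)))
        elif kind == "empty":
            ranges.add((value, len(content), len(content)))
    return content, ranges


# [a}[img]{a] versus [a][img]: identical once reduced to offsets.
enclosing = build([("start", "a"), ("empty", "img"), ("end", "a")])
adjacent = build([("empty", "a"), ("empty", "img")])

# [a}{{img}}{a]: the image is an atom, so the a range has something to cover.
with_atom = build([("start", "a"), ("atom", "img"), ("end", "a")])
```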

Parsing LMNL syntax

Matt Palmer has been experimenting with parsing LMNL syntax in Python and Java

Specification for LMNL syntax: http://www.lmnl.org/wiki/index.php/Detailed_LMNL_syntax
See http://www.lmnl.org/wiki/index.php/LMNL_Parser_Experiment

Development on this front continues

Using XML syntax instead

Work inspired by Steve DeRose (2004) and Syd Bauman (2005)

CLIX: “Canonical LMNL in XML”

Initially a tagging convention in XML
- Mark XML elements (milestones) as start- and end-range markers
- “Trojan milestones” (suggested at OSIS by Troy Griffiths)
  - Allow any XML element to be repurposed as a milestone marker
  - Putative semantics of the element name would remain
  - Start- and end-points of the range would be indicated by explicit referencing
Refined and dubbed HORSE by Bauman
But adopted (and differently refined) as CLIX by us
- CLIX is LMNL represented in flattened XML with milestones
  
  All text and tagging is directly contained in a CLIX document element
  
  Since XML is otherwise flat, structured annotations can be represented in XML element structures
- ECLIX (extended CLIX) is a convention for allowing fully-structured XML to represent LMNL
  
  Just use LMNL-namespaced attributes to indicate milestone elements
- Given formal specifications, any XML can convert to ECLIX using fairly simple XSLT
  
  ... And an off-the-shelf stylesheet can flatten ECLIX into CLIX
  
  Hence any XML with a consistent convention for representing overlap can map into LMNL
References:
- http://www.lmnl.org/wiki/index.php/CLIX
- http://www.lmnl.org/wiki/index.php/ECLIX
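
The flattening idea behind CLIX can be sketched in Python: every range becomes a pair of empty milestone elements inside a single document element. This is a sketch under assumptions, not the CLIX specification: the `sID`/`eID` attribute names follow the Trojan-milestone convention, the root element name `clix` and the generated IDs are hypothetical, and range names are assumed to be valid XML names. Structured annotations are left out.

```python
from xml.sax.saxutils import escape, quoteattr


def to_flat_xml(text, ranges, root="clix"):
    """ranges: list of (name, start, end) offsets over text."""
    events = []  # (offset, order, markup); order 0 sorts before order 1
    for i, (name, start, end) in enumerate(ranges):
        rid = quoteattr(f"{name}{i}")  # generated ID pairing start with end
        start_tag = f"<{name} sID={rid}/>"
        end_tag = f"<{name} eID={rid}/>"
        if start == end:
            # Empty range: keep its two milestones together.
            events.append((start, 1, start_tag + end_tag))
        else:
            events.append((start, 1, start_tag))
            events.append((end, 0, end_tag))  # ends before starts at one offset
    events.sort(key=lambda e: (e[0], e[1]))   # stable: ties keep input order

    out = [f"<{root}>"]
    pos = 0
    for offset, _, markup in events:
        out.append(escape(text[pos:offset]))
        out.append(markup)
        pos = offset
    out.append(escape(text[pos:]))
    out.append(f"</{root}>")
    return "".join(out)


flat = to_flat_xml(
    "He manages On his own farm. He's boss.",
    [("s", 0, 27), ("l", 0, 10), ("l", 11, 38), ("s", 28, 38)],
)
```

The result is well-formed XML even though `s` and `l` overlap, because no milestone element contains anything.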

A LMNL processing architecture

Demonstration: LMNL via XML

Sonnet demonstration http://xmlshoestring.com/LMNL/Amsterdam2008/clix-sonnets
Various testing: http://xmlshoestring.com/LMNL/Amsterdam2008/testing
Implemented so far:
- All:
  - ECLIX to CLIX
  - CLIX to xLMNL (LMNL compiled into an XML format)
  - XML induction
    - The `pick` parameter lists ranges to be selected into a containment hierarchy
      
      For example:
      - http://xmlshoestring.com/LMNL/Amsterdam2008/clix-sonnets/XMLinduce/blindness.xml?pick=s%20phr induces sentences and phrases (s and phr)
      - http://xmlshoestring.com/LMNL/Amsterdam2008/clix-sonnets/XMLinduce/blindness.xml?pick=octave%20sestet%20quatrain%20couplet%20line induces verse structure
    - The `drop` parameter lists ranges to be excluded
    - remaining ranges are represented as CLIX
  - Overlap analysis: reports which range types overlap each other
- Sonnets
  - Sonnet XML markup to ECLIX
  - Bars diagram
  - Arcs diagram

Notice: no hierarchy

No parents, children, ancestry, dominance, containment
Relations between ranges are implicit in their ordering and extent, not explicit in the model
We have listed a formal terminology of range relations

http://www.lmnl.org/wiki/index.php/Range_relationships
- encloses
- fits within
- precedes
- follows
- overlaps start
- overlaps end
- etc.
Hierarchies can be inferred from range relations, as in the demo

(Implicit hierarchies can even be validated)

But LMNL as a set of ranges over text (flat LMNL) does not represent them directly as such

(Although annotations also arrange in hierarchies, they do not arrange text in the same layer)
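
Since relations are implicit in ordering and extent, they can be computed from offsets alone. The Python sketch below names the relation of one range to another. The definitions are assumptions made for illustration and may differ from the formal terminology on the wiki: in particular, "overlaps start" is taken to mean that the first range begins before the second and ends inside it, and "same extent" is a placeholder label, not a term from the list.

```python
def relation(a, b):
    """Name the relation of range a to range b; each is a (start, end) pair."""
    (s1, e1), (s2, e2) = a, b
    if (s1, e1) == (s2, e2):
        return "same extent"      # placeholder label, not from the source list
    if e1 <= s2:
        return "precedes"
    if e2 <= s1:
        return "follows"
    if s1 <= s2 and e2 <= e1:
        return "encloses"
    if s2 <= s1 and e1 <= e2:
        return "fits within"
    if s1 < s2:
        return "overlaps start"   # a starts first and ends inside b
    return "overlaps end"         # a starts inside b and ends after it


first_sentence, second_line = (0, 27), (11, 38)
print(relation(first_sentence, second_line))
print(relation(second_line, first_sentence))
```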

But ... we have talked about “layers”

Because applications and processing languages will need a systematic way of registering higher-level relationships between ranges ...

We stipulate the existence of an object called a limen (pl. limina)

Replaces 2002 idea of “layer”
A document itself has a limen (its text content and ranges)
Also, each annotation has a limen
But limina may be owned not just by the document or an annotation, but by another limen

In this case, its content (ranges and atomic content or text) can be defined by selection of ranges in its owner
Because limina can be derived from limina, LMNL sneaks in dominance relations “through the back door” (subliminally?), and leaves an application to identify hierarchies (whether sacred or profane)
- Limina or related limina in combination can be considered as “views” of the document

Limina are largely untried to date

They have no representation in LMNL syntax (yet)
It may prove useful to declare them externally
- A schema could declare limina on behalf of LMNL instances valid to it
- Or a transformation could declare them ad hoc

An example limen: a document “view”

```
[excerpt [source}The Housekeeper{] [author}Robert Frost{]}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]
```

Define a limen whose owner is the document. Select the excerpt and l ranges. This limen maps to a clean hierarchy.

The same can be done with any set of ranges that do not overlap (starts or ends). Enclosure implies dominance in the resulting tree.
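
A minimal Python sketch of that inference, assuming ranges are (name, start, end) offsets: given a selection of ranges that do not overlap, enclosure is turned into a tree of (name, children) nodes. The function name `induce_tree` and the `#root` node are hypothetical; this illustrates the principle and is not the XSLT used in the demonstration.

```python
def induce_tree(text, ranges):
    """Build a nested (name, children) tree; children are nodes or strings.

    Raises ValueError if the selected ranges overlap.
    """
    ordered = sorted(ranges, key=lambda r: (r[1], -r[2]))  # outermost first
    root = ("#root", [])
    stack = [(root, len(text))]  # (node, end offset) for each open node
    pos = 0                      # text before pos has already been placed

    def flush(upto):
        """Append text[pos:upto] to the innermost open node."""
        nonlocal pos
        if upto > pos:
            stack[-1][0][1].append(text[pos:upto])
            pos = upto

    for name, start, end in ordered:
        # Close every open node that ends at or before this range starts.
        while len(stack) > 1 and stack[-1][1] <= start:
            flush(stack[-1][1])
            stack.pop()
        if end > stack[-1][1]:
            raise ValueError(f"range {name} overlaps an enclosing range")
        flush(start)
        node = (name, [])
        stack[-1][0][1].append(node)
        stack.append((node, end))
    while len(stack) > 1:
        flush(stack[-1][1])
        stack.pop()
    flush(len(text))
    return root


tree = induce_tree(
    "He manages On his own farm. He's boss.",
    [("excerpt", 0, 38), ("l", 0, 10), ("l", 11, 38)],
)
```

Adding the `s` ranges to the same selection makes the function raise, which is the point: the `excerpt` and `l` ranges form a clean hierarchy, but `s` and `l` together do not.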

An example limen: relating discontinuous ranges

`song` and `stanza` limina

```
[p}The Hatter shook his head mournfully.
[q [sp}Hatter{]}Not I!{q] he replied. [q [cont}Hatter{]}We quarrelled
last March--just before HE went mad, you know--{q] (pointing with his
tea spoon at the March Hare,) [q [cont}Hatter{]}-- it was at the great
concert given by the Queen of Hearts, and I had to sing{p] [song}
[lg [n}1{]}
  [l}Twinkle, twinkle, little bat!{l]
  [l}How I wonder what you're at!{l]{lg]

[p}You know the song, perhaps?{p]{q]

[p}[q [sp}Alice{]}I've heard something like it,{q] said Alice.{p]

[p}[q [sp}Hatter{]}It goes on, you know,{q] the Hatter continued,
[q [cont}Hatter{]}in this way: --{p]

[lg [n}1{]}
  [l}Up above the world you fly,{l]
  [l}Like a tea-tray in the sky.{l]
  [l}Twinkle, twinkle --{l]{lg]{song]{q]
```

`song` limina could select all the `song` ranges from the document (one `song` range per limen). `stanza` limina could select the `lg` ranges with the same `n` annotation within the `song` limina, leaving other ranges (and cosmetic whitespace) behind.

We could then retrieve `/%song/%stanza` (limina) for stanzas and `/%song/%stanza/enclosed::l` (ranges) for lines appearing within stanzas.
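
The selection itself (not the path syntax, which is not implemented here) can be sketched in Python. Assumptions: ranges are (name, start, end, annotations) tuples, annotations are simplified to a dict even though LMNL annotation names need not be unique, and a stanza limen's content is modeled as the concatenated text of its `lg` ranges. The function name is hypothetical.

```python
from collections import defaultdict


def stanza_limina(text, ranges):
    """For each song range, group the lg ranges it encloses by their n annotation.

    Returns one dict per song, mapping each n value to the joined stanza text.
    """
    result = []
    for name, song_start, song_end, _ in ranges:
        if name != "song":
            continue
        groups = defaultdict(list)
        for rname, start, end, ann in sorted(ranges, key=lambda r: r[1]):
            if rname == "lg" and song_start <= start and end <= song_end:
                groups[ann.get("n")].append(text[start:end])
        result.append({n: "".join(parts) for n, parts in groups.items()})
    return result


# A compressed stand-in for the passage: two lg pieces with n=1, interrupted by prose.
sample = "Twinkle, twinkle, little bat! (You know the song?) Up above the world you fly,"
first_end = sample.index(" (")
second_start = sample.index("Up above")
stanzas = stanza_limina(sample, [
    ("song", 0, len(sample), {}),
    ("lg", 0, first_end, {"n": "1"}),
    ("lg", second_start, len(sample), {"n": "1"}),
])
```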

Another example limen

`quote` limina

```
[p}The Hatter shook his head mournfully.
[q [sp}Hatter{]}Not I!{q] he replied. [q [cont}Hatter{]}We quarrelled
last March--just before HE went mad, you know--{q] (pointing with his
tea spoon at the March Hare,) [q [cont}Hatter{]}-- it was at the great
concert given by the Queen of Hearts, and I had to sing{p] [song}
[lg [n}1{]}
  [l}Twinkle, twinkle, little bat!{l]
  [l}How I wonder what you're at!{l]{lg]

[p}You know the song, perhaps?{p]{q]

[p}[q [sp}Alice{]}I've heard something like it,{q] said Alice.{p]

[p}[q [sp}Hatter{]}It goes on, you know,{q] the Hatter continued,
[q [cont}Hatter{]}in this way: --{p]

[lg [n}1{]}
  [l}Up above the world you fly,{l]
  [l}Like a tea-tray in the sky.{l]
  [l}Twinkle, twinkle --{l]{lg]{song]{q]
```

`quote` limina could select each `q` with a `sp` annotation along with any following `q` with `cont` annotations equaling the `sp` on the first, up to the next `q` with that `sp` (and ignoring other ranges over the same text).

We could then retrieve `/%quote` (limina) for quotes and `/%quote/enclosed::q/@sp` (annotations) for their speakers.

Note: these semantics are only implicit in flat LMNL, and will require some sort of apparatus (syntax or declarations) to express.
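
One way such an apparatus might compute `quote` limina is sketched below in Python. It is an illustration under assumptions, not part of LMNL: `q` ranges are (start, end, annotations) tuples, annotations are simplified to a dict, and a `q` with a `cont` annotation but no earlier `sp` for that speaker is simply ignored. The function name is hypothetical.

```python
def quote_limina(q_ranges):
    """Group q ranges into quotes: an sp range plus its following cont ranges.

    Returns a list of (speaker, [(start, end), ...]) in document order.
    """
    limina = []
    current = {}  # speaker -> index in limina of that speaker's latest quote
    for start, end, ann in sorted(q_ranges, key=lambda r: (r[0], r[1])):
        if "sp" in ann:
            # A new sp starts a new quote and closes the speaker's previous one.
            current[ann["sp"]] = len(limina)
            limina.append((ann["sp"], [(start, end)]))
        elif ann.get("cont") in current:
            limina[current[ann["cont"]]][1].append((start, end))
    return limina


# Offsets are made up; the pattern of sp and cont follows the passage above.
quotes = quote_limina([
    (0, 5, {"sp": "Hatter"}),
    (10, 20, {"cont": "Hatter"}),
    (25, 60, {"cont": "Hatter"}),
    (61, 70, {"sp": "Alice"}),
    (80, 90, {"sp": "Hatter"}),
    (95, 120, {"cont": "Hatter"}),
])
```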

Looking for validation

Validation, after syntax and model, is the third “leg” of the tripod (ref. Huitfeldt & Sperberg-McQueen)
It is important enough to get right
- May take time to define and develop
- More than one approach may be worthwhile
- Is related to querying and transformation
  - If we can transform, we can validate
Validation as such is more important than validation at parse time
Validation of a data structure can be distinguished from validation of a markup instance (tagging)
If LMNL is defined as a data model, not a syntax, this means ipso facto, LMNL validation must work on data objects, not any particular serialized markup syntax
Usefulness:
- Classic reasons (QA, application design, process optimization, definition of an interface for data interchange)
- And to determine whether and how a LMNL instance is fit (or can be fitted) to represent or transform as GODDAG, a single XML hierarchy, what have you
- Consonant with “incremental validation” (Arjen De Vries)

Validating range relations: CREOLE

Jeni Tennison is developing CREOLE (“Composable Regular Expressions for Overlapping Languages etc.”)
Builds on RELAX NG, Rabbit-Duck grammars
(Jeni will be presenting on CREOLE separately)

Conclusions and open questions

Some fairly nice things are possible

Possibly surprisingly, even on flat LMNL (no limina, no explicit hierarchy)
Implementation of LMNL concepts is mostly not difficult even in XML
- SVG illustrations are really easy
- XSLT 2.0 grouping methods are a “crowbar” for dealing with overlap in XML
  
  (As demonstrated in generalized XML induction code)
- One sticky area is mapping structured annotations (and their annotations) into XML
But: these demonstrations are on toy data sets
- What kinds of tuning and optimization may be helpful or necessary?
- How will the XML/XSLT platform scale?

LMNL in Miniature

An introduction

Wendell Piez

Amsterdam Goddag Workshop, 1-5 December 2008

3 December 2008

Contact: Wendell Piez [wapiez -at- mulberrytech -dot- com]
