<LML /> 2.0 - Language Description

Author: Dr. O. Hoffmann (German web-page)

Introduction

The big open question in (X)HTML has always been, and still is, how to mark up text and literature with meaningful elements. Because (X)HTML currently provides no elements with a specific semantic meaning for several applications, the XHTML role attribute and a new profile, XHTML+RDFa, were introduced to let authors specify the semantic meaning of an element by referencing another language or specification with a CURIE. Additionally, the use of rel and rev can be extended with CURIEs to specify precisely the relation to, or the functionality of, a hyperlink target. WAI-ARIA defines some predefined role values for some applications. SVG Tiny 1.2 adopts the role attribute and the RDFa attributes from XHTML.
Unfortunately, the XHTML role approach was finally reduced to just a note on 2010-12-16.
Unfortunately, the XHTML CURIE approach was likewise reduced to just a note on 2010-12-16, but it is redefined and available in RDFa.

RDFa provides attributes to map RDF syntax to attribute values, which can be used to enhance the semantic meaning of XHTML documents dramatically.

Unfortunately, neither (X)HTML nor WAI-ARIA nor RDFa defines a complete collection for literature. This is a little surprising, because the very name 'hypertext markup language' indicates that caring about this should be the main domain of (X)HTML. Other resources such as DAISY, DocBook, DITA, FictionBook, XLDL or TEI offer a larger collection of elements with a semantic meaning, and there may be even more resources for more or less specific applications.

Authors cannot really be expected to search many different resources each time for a proper element or role value. Therefore such elements are collected, supplemented and compared here for anyone's convenience, and perhaps as a basis for a proposal to extend the vocabulary of (X)HTML, WAI-ARIA or RDFa into a semantically more useful language. Additionally, using the same approach with RDFa and role in SVG ensures semantic meaning for text in SVG too. Because SVG is a graphics format, it cannot be expected to provide this more conveniently than with such a role attribute or with RDFa.
However, SVG can be quite important for literature with a strong correlation to the visual presentation of the content. With SVG it becomes possible to mark up such 'visual literature' in a meaningful way, which is not possible with non-graphical formats.

LML elements can be referenced as CURIEs for use as roles and with XHTML+RDFa. Typically, the attribute property can be used to define the semantic meaning of an element in more detail than is available in (X)HTML itself. This attribute property also exists in SVG Tiny 1.2. For SVG 1.1, one has to use the property within a namespace, for example that of XHTML or LML, to ensure a defined meaning.
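
As a minimal sketch of the SVG 1.1 case, the following document notes the property attribute in the XHTML namespace and gives it a CURIE that refers to the LML element literature. The prefixes h and l, the coordinates and the sample text are assumptions chosen for illustration; the LML namespace name is the one given in the Namespace section.

```xml
<svg xmlns="http://www.w3.org/2000/svg" version="1.1"
     xmlns:h="http://www.w3.org/1999/xhtml"
     xmlns:l="http://purl.oclc.org/net/hoffmann/lml/"
     viewBox="0 0 400 100">
  <title>Sample of visual literature</title>
  <!-- h:property is the property attribute noted in the XHTML namespace;
       the CURIE l:#literature refers to the LML element literature -->
  <text x="20" y="60" font-size="20" h:property="l:#literature">Sample text of the work</text>
</svg>
```

An SVG 1.1 viewer that does not know the foreign attribute will typically still render the text, so the semantic indication does not disturb the presentation.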

As an alternative to the role or property approach, LML can be used directly, because it is an XML format, or it can be mixed with other formats to reuse their functionality together with the semantic elements of LML.
Therefore it is mainly the author's choice how to use it to get the best markup for the intended purpose. Unfortunately, because typical user-agents have limited capabilities for interpreting LML or specific general-purpose XML formats like XLink or XForms, one still has to mix what typically works instead of working straightforwardly with one format.

The main advantage of using LML as a language of its own is that the document depends on only one or two formats. If roles from many different formats are used, the risk increases that one of the referenced collections gets lost, which is not under the control of the author. And it is much more work to check many dependencies than just one or two. The advantage of a compound of many formats is that it is typically simple to extend the functionality or semantic meaning of a document with already existing formats. LML tries to combine both: most literature-related things are within LML, and those extensions remain available, if required, for other purposes not directly related to literature.

Several samples are available to show the usage of LML. If no other host language is used or required for other reasons, a default (visual) presentation for LML is provided too. It is simulated with CSS as far as possible, including alternative variants that switch the display of meta information on and off, which is somewhat restricted by limitations of CSS. With a proper interactive user interface, the presentation of meta information is expected to be much more effective than with this simple CSS approach. In the default style the interface is a small square icon, which expands when hovered. It shrinks again when the meta area is left in the direction of the border.

About Version 2.0

In the years after the publication of LML 1.0, some new ideas came up on how to improve or extend LML. Additionally, with the advanced working drafts for '(X)HTML 5' it became obvious that this format fails to solve essential and basic semantic tasks. Furthermore, the '(X)HTML 5' working group insisted on refusing any option to indicate that a document follows this recommendation. Authors therefore cannot write '(X)HTML 5' documents that indicate that they are '(X)HTML 5' documents; they can only write tag-soup documents, which can be tested for consistency with the '(X)HTML 5' rules. But to give a document a defined meaning, one needs to indicate the version of XHTML the document uses. Because '(X)HTML 5' is not the only format (version) without a version indication, a mechanism is needed to indicate the version used to write a document. LML 1.0 only provides a mechanism to indicate that LML is used to express semantic meaning in another format like (X)HTML.
LML 2.0 solves this issue by adding a meta information structure derived from Dublin Core to indicate the version of the format that a document or fragment conforms to.
Obviously, within XHTML without LML one can simply use the related Dublin Core term as metadata, if LML is used anyway to mark up the semantics of a document or fragment.
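
A minimal sketch of such a version indication in plain XHTML could look as follows. It uses the DCMI convention of link elements with a schema prefix; that the term conformsTo is the one meant here is an assumption, as are the title and the sample content.

```xml
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>Sample document</title>
    <!-- Declares the prefix DCTERMS for the Dublin Core terms namespace -->
    <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
    <!-- Assumption: conformsTo is the related term; the target names XHTML 1.1 -->
    <link rel="DCTERMS.conformsTo" href="http://www.w3.org/TR/xhtml11/" />
  </head>
  <body>
    <p>Sample content</p>
  </body>
</html>
```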

The EPUB format for digital books provides a recommendation for archives that is more advanced than the manifest approach of (X)HTML 5 or the approach of LML 1.0. However, EPUB has some unfortunate restrictions as well as complexities. Effectively, LML can be used within EPUB only to provide roles for structures, not directly. EPUB 3.0 uses '(X)HTML 5' and SVG for content documents and a format of its own to describe the archive itself. Concerning the content documents, EPUB obviously faces the same problems in marking up content semantically. EPUB has its own vocabulary of roles to solve such problems for some structures, but it turns out to be incomplete in several areas that LML covers. LML 2.0 additionally provides an alternative approach to describing archives with a structured navigation document, avoiding the restrictions of EPUB.

Concept

LML is an XML format; therefore documents have to be at least well-formed. User-agents may render documents that are not well-formed up to the point of an error, to help authors fix those errors, but this is not required. An error message including the line number of the first error should be the minimum information about an erroneous document as a help for authors. Typically, a meaningful fix depends strongly on the content and is only possible for the author; therefore it has to be left to the author. A meaningful error message will help authors fix errors before publication.

Because literature in general does not have a really precise structure, a very specific definition of a structure model would be counterproductive. The model therefore remains intentionally vague, to ensure that this collection is usable in the real world and not just esoteric. Exceptions are a few structures derived from other formats with a precise structure, and the archive module, which has a precise structure to simplify the extraction of a navigation for the complete archive. Archives with many documents are more complex than single documents. The archive module does not describe the literature content of a larger work split into several documents; it only provides markup for the structure of such a more complex project.

Another reason for not having a precise structure for the literature content is the option to use LML structures to indicate roles with CURIEs. Because of this practical usage, the values of role, and therefore the element names of often-used elements, are short. A third important point is that the definitions are human-readable and understandable; otherwise there is little chance that authors will use such a collection.

Namespace

The namespace related to LML is http://purl.oclc.org/net/hoffmann/lml/ (note that this is a PURL, a redirect; the current address may change).
Therefore the namespace declaration may be something like xmlns:l="http://purl.oclc.org/net/hoffmann/lml/"; then, for example, the role containing the CURIE for the element literature is role="l:#literature". If used within a CURIE, it is important to note that this is a PURL for the current version. LML 2.0 has no conflicts with LML 1.0; therefore it uses the same PURL. In case other, incompatible versions follow, there will be additional PURLs for such versions.
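
Put together in an XHTML host document, this could look like the following sketch. The choice of a div as the carrying element, the title and the sample text are assumptions for illustration only; the host language version must of course provide the role attribute.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:l="http://purl.oclc.org/net/hoffmann/lml/"
      xml:lang="en">
  <head>
    <title>Sample</title>
  </head>
  <body>
    <!-- The CURIE l:#literature expands to
         http://purl.oclc.org/net/hoffmann/lml/#literature -->
    <div role="l:#literature">
      <p>Sample text of the work.</p>
    </div>
  </body>
</html>
```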

Structure Model and Schema

Because there is no precise structure model, there is no document type definition or other schema. However, a schema may be added if a schema language is found (or developed) that is flexible enough to cover the needs of a more complex language. The limitations of current schema languages can already be seen with popular narrative formats like XHTML and SVG: the schema languages used can by far not describe all restrictions, variants and possibilities of these languages. Therefore, already for SVG Tiny 1.2, the prose specification is much more relevant and complete than any schema.

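But it might be useful to have some entities defined, for example to reuse content or to simplify the recognition of cryptic glyph numbers. For this a doctype can be provided, containing mainly the defined entities. For example, assuming literature as the document element and plain text as the replacement text of the abbreviations, predefined German umlauts and the letter 'ß' as in XHTML and some abbreviations look like this:

    <!DOCTYPE literature [
      <!ENTITY auml "&#228;">
      <!ENTITY ouml "&#246;">
      <!ENTITY uuml "&#252;">
      <!ENTITY Auml "&#196;">
      <!ENTITY Ouml "&#214;">
      <!ENTITY Uuml "&#220;">
      <!ENTITY szlig "&#223;">
      <!ENTITY l "Literature Markup Language (LML)">
      <!ENTITY h "eXtensible HyperText Markup Language (XHTML)">
      <!ENTITY s "Scalable Vector Graphics (SVG)">
    ]>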

But of course, if the encoding is specified correctly within the XML declaration, or within the HTTP header if the document is sent by a server, there is no need to mask these and other glyphs if they are available directly on the keyboard. The abbreviations, of course, are still helpful if these constructions appear often within the document; they can then be written simply as &l;, &h; and &s;.
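
The following self-contained sketch shows both points: with a declared encoding the glyphs are written directly, while an abbreviation entity is still expanded by the XML parser. Using literature as the document element with direct text content, and plain text as the replacement text of the entity, are assumptions made only for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE literature [
  <!-- Hypothetical abbreviation entity with plain replacement text -->
  <!ENTITY l "Literature Markup Language (LML)">
]>
<literature xmlns="http://purl.oclc.org/net/hoffmann/lml/">
  Words such as Käse or Straße are written directly, and &l; is expanded by the parser.
</literature>
```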

Because there is currently no defined method to specify an attribute or property with a CURIE, only a few specific attributes and a number of common attributes are mentioned here at the moment. However, LML attributes have fragment identifiers, which can therefore be used to declare the role or property of an element containing only simple text content. Even if the allowed value of an attribute is one item of a predefined list, it typically has a fragment identifier so that it can be referenced as a role or property of an element, if applicable. The obvious limitation here is that one has to note such structures as elements in the host language.

Often literature-related issues can be solved without attributes anyway. And of course attributes from other namespaces can simply be noted using the correct namespace. LML attributes are defined in the namespace of LML; therefore they can be used in other formats as well, if it is indicated that they belong to the LML namespace.
If the attribute values are predefined and available with a fragment identifier, it is obviously possible to provide them as CURIEs too.

Currently it is assumed that a host language provides at least some common attributes: an identifier like xml:id, or the id from (X)HTML or SVG; xml:lang; xmlns to declare namespaces and the prefixes for CURIEs; class (see SVG or (X)HTML); role; property; etc.
Attributes often provide meta information about the element or specific functionality. For several applications LML therefore defines a related element for this purpose, to be placed within the element meta. If the host language provides only attributes for the intended functionality, then obviously those attributes should be noted too.

Current or old versions of formats often have no role attribute, for example versions 1.0 and 1.1 of XHTML or SVG, or versions 4 or 5 of HTML. Because these versions, except 'HTML5', depend on a DTD, the author can in general define an extension for those formats containing the attribute; however, this does not imply that the attribute has any meaning. In XML formats it is generally possible to use attributes from other namespaces with a related namespace declaration; however, current validators typically rely on DTDs and are not able to check this different dependency. If validation is considered important, this approach cannot be used either. LML therefore additionally defines 'non-invasive' methods related to the element roles, specified only for the purpose of indicating semantic structures in old (outdated, badly designed) versions of popular formats.

The general concept for the structure of an LML document is derived from (X)HTML and CSS: content is either in block elements or in inline elements. Additionally there are metadata elements containing meta information about the content of other elements. Some meta information is intended to be well separated from the other, so-called flow content, but completely accessible; other meta information may be part of the normal flow without a need for a specific separation.

Typically block elements are well separated from each other in presentation. Inline elements indicate phrases within text fragments inside block elements, within the same line, typically without a spatial or temporal indication. Meta elements contain (meta) information about text and are presented on another logical level. For LML, however, this information is considered to be potentially interesting for any audience and should be accessible to anyone, not just to robots or to specific assistive tools that are not available in every user agent.

Styling

Styling is in general not discussed here. It can be applied to the complete document with an XML stylesheet processing instruction, or additionally with specific meta information. Both reference external stylesheet documents.
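
A minimal sketch of the processing instruction variant is given here. The file name lml-default.css is a placeholder, and using literature as the document element with direct text content is an assumption for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The file name lml-default.css is a placeholder for an external stylesheet -->
<?xml-stylesheet type="text/css" href="lml-default.css"?>
<literature xmlns="http://purl.oclc.org/net/hoffmann/lml/">
  Sample text of the work.
</literature>
```

The processing instruction has to appear in the prolog, before the document element, to apply to the complete document.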

The model for the visual presentation of an LML document is that of a (digital) scroll or spool, comparable to the usual visual presentation of XHTML and in contrast to that of SVG. Of course this does not apply directly if the document is printed on paper sheets; that requires additional segmentation of the content according to the size of the paper.
Refinements of the basic model are desired, for example along the lines of the following description.
The reading direction is either horizontal per text line or vertical.
Usually a digital output device like a screen has a finite horizontal and vertical size.
The area available to a program for displaying an LML document is called the viewport; it is typically only a part of the complete device output area.
A scroll presentation is typically not scrolled in the reading direction, but perpendicular to it. In the reading direction, text content is wrapped automatically according to the size of the viewport in that direction, maybe with some margin. A problem occurs if there are content structures that cannot be wrapped automatically, especially embedded images not intended to be scaled, or complex tables. If they are bigger in the reading direction than the viewport, scrolling in the reading direction ensures that at least all parts of such a structure are accessible. This does not expand the viewport that other content takes into account for automatic text wrapping. This ensures that only the problematic fragment of a document requires horizontal scrolling, and not other parts as well merely because the document contains one problematic fragment.

Animation Considerations

Animation considerations are obviously only relevant if animation is applied to a document, which is not very likely for normal literature. However, just in case, and for more experimental types of literature, this section mainly ensures defined behaviour to be taken into account if animation is applied.

Concerning animation, declarative animation with SMIL (including timesheets) or SVG is possible. For this it is necessary to know which elements, attributes and properties are animatable. In LML everything can be animated except the fragment identifier (xml:id). The target element can be identified with xlink:href if it is not the direct parent. Attribute values in LML itself are typically not numeric, and therefore neither interpolable nor additive. For them obviously only discrete animation is available, neither additive nor cumulative. For numeric attribute values, or for (styling) properties with numeric values, all of this is possible.
animateMotion and animateTransform, and maybe an animation of styling positioning properties, may require the definition of an origin. In general the coordinate system of the styling language is used. Without styling (the default), the top left corner is assumed to be the origin. The x-coordinate runs from left to right, the y-coordinate from top to bottom.
For animateMotion and animateTransform the SMIL attribute origin can be used. The value is either 'default', as described, or absolute coordinates in parentheses, for example (10em, 12em) or (-5ex, 12ex) or (100px, 20%). Alternatively, relative coordinates can be specified: the value is first the fragment identifier of an element within the document, followed by whitespace, followed by coordinates in parentheses, as for absolute coordinates, for a possible offset. If the fragment identifier indicates the animation target element itself, or the fragment cannot be identified within the document, the provided fragment identifier is ignored. It is possible to omit the additional coordinates in parentheses, and then the whitespace too.
Missing coordinates simply indicate (0,0). For other coordinates, units are required. Coordinates without units or with unknown units are ignored. Known units are those from CSS (deviation from the current nonsense in CSS: absolute units like centimetre or millimetre are interpreted according to the international standards for such units, not according to the CSS unit obfuscation).
Percentages for absolute coordinates relate to the size of the complete document; those for relative coordinates relate to the size of the element indicated as the origin. 'Size' here means the corresponding length in the x- or y-direction, independently. If the size is determined automatically, this is done without the influence of animateMotion and animateTransform.
Authors always have to take care that the document interpreted without animation still contains the same (text) information as with animation. If the author's intent is that the document does not contain any (text) information, it might be a good idea to indicate this with some informational text in the element desc. Obviously, for documents without any (text) information this requirement is trivially fulfilled. However, the desc should not contain a lie concerning this issue.

Techniques

It is assumed that there are three quite different techniques to provide text, or three types of text: prose, poetry and code (for example source code of computer programs or of markup languages).

Historically, before writing was invented, there was mainly daily conversation, with no requirement for conservation, and information intended to be conserved for several generations. Still today a lot of conversation is not intended to be conserved, and many people dislike techniques that do this without their knowledge.

Other information turned out to be essential to conserve. The archive mechanism for such information was to memorise it somehow. This was simplified by rhythm, rhyme, repetition and music. This was called lyric or poetry. Later, after writing was invented, this survived, and written poetry was structured too, to better conserve the structure of this poetry.

With the written word it became possible to conserve other information too, and some structure moved from poetry to this typically less structured information, called prose, with different functionality and meaning. For any written word, the methods of structuring developed over the centuries, from whitespace separation between words, the construction of sentences and punctuation up to the markup languages of today.

Later, machines and computers were invented. It was necessary to give those machines commands in a formalised, machine-readable structure or language. Such texts are again quite different from poetry and have some special requirements. If such texts appear within Literature Markup Language documents, they are typically not intended to be interpreted directly by programs; often they are only samples to show the reader how to structure such texts, for example for educational purposes. Typically such code appears with some prose content around it, explaining the intended purpose or functionality if the code is interpreted by a related program.

And it happens that there is poetry in a prose environment and vice versa. Because poetry content has some specific requirements for presentation and prose has not, there has to be an indication in order to meet those requirements. For code, the structure needs to be conserved in the presentation too, but for slightly different reasons than for poetry.

Many elements, however, can be used for both prose and poetry, and their behaviour is derived from their environment, typically from the parent element. Sometimes there may be some text for which it is not important whether it is interpreted as prose or as poetry. Following current practice, it is assumed that without an indication there is no need for a specific poetic presentation; however, there is no requirement either to suppress the appearance of accidental rhythm or rhyme in such fragments, which might otherwise confuse an audience in explicitly prose content.

If an element is explicitly indicated here in LML as poetic, prose or code, this of course applies. If two or more roles are explicitly given for one real element, or the element itself contradicts the given role, the situation is ambiguous. Authors can avoid this in most cases. If it happens anyway or is intended to be meaningful, the last role in the role list applies. If a parent element is explicitly indicated as poetic, prose or code, and for the element itself this is neither explicitly indicated nor identifiable from the element itself, the behaviour of the parent is used.

With this approach the indicated technique is relevant, not the text content itself or its possible interpretation. It indicates the intention of the author in a conceptual manner, nothing else. Therefore there is no need to discuss whether the considerations of the author are right or wrong; the indication is binding for the markup. Any discussion about the considerations of the author has to be held with the author directly, on another level than the markup. Therefore the document might have some quirky indications, but it is not wrong just because it is quirky; it is art.

Many elements can be used for prose, poetry or code content. If they are children of a prose, poetry or code element, the behaviour is derived from the parent element. If there is no parent element, no specific requirements are assumed.

References

  1. XHTML role attribute
  2. CURIE Syntax
  3. WAI ARIA
  4. XHTML vocabulary
  5. XHTML (1, 2), HTML4
  6. 'HTML 5'
  7. RDF, Semantic Web
  8. RDFa syntax
  9. Dublin Core Metadata Initiative terms
  10. DAISY; DAISY Specifications for the Digital Talking Book; DAISY Structure Guidelines for the Digital Talking Book; DAISY Digital Talking Book element and attribute index
  11. DocBook; DocBook: The Definitive Guide
  12. DITA 1.2
  13. FictionBook
  14. XLDL
  15. TEI
  16. EPUB
  17. Atom
  18. RSS 1.0
  19. 'RSS' 2.0
  20. XForms
  21. XLink
  22. SMIL
  23. timesheets
  24. MARC Code Lists for Relators
  25. Poetry Markup Language: PML (de)
  26. Semantic Markup for Poetry (first ideas and proposal for (X)HTML)
  27. Ausgedichtet - Discussion and tutorial about the problem to markup poetry in (X)HTML and SVG (de)
  28. Text in SVG

Glossary

block element
Block elements are well separated from each other in presentation. For aural presentation this may happen, for example, with breaks or annotations. For visual or tactile (Braille) presentation this may happen with margins, paddings, indentations, borders or outlines. Another approach could be marginal notes or glosses. For interactive visual presentation, precise information might be presented only on demand with something like a tooltip.
Some block elements can contain only text or inline elements. Other block elements may only contain block elements. Some elements may contain either block or inline elements (and text), but not a mixture. Typically it is not a good approach to have a mixture of block and inline elements in one element. In general block elements do not appear in inline elements. In some cases LML defines that a block element may switch into an inline element and vice versa, depending on the direct parent element.
flow (of presentation/content)
Information in documents is typically expected to be presented in the order in which it appears in the source code; this is the flow of presentation. Styling like CSS or timesheets may change this flow. The meta element is considered to be always extracted from this flow and presented separately, or presented within the flow only on demand and then with a clear indication.
inline element
Inline elements typically indicate phrases on the same level as text; they appear within block elements within the same line, typically without a spatial or temporal indication. For visual presentation they are often indicated with a styling or appearance different from other inline elements or text. For aural presentation some indication might be available only on demand as annotations, or as a change in the styling of the presentation. Some interactive indications are possible too. If the indication is done with additional symbols or glosses, they appear inline too.
metadata
Elements intended to provide meta information.
meta element target
The element that the meta information in the meta element is intended for. In RDF nomenclature the target is the subject of the meta information.
parents, children, siblings, root
A parent element contains child elements between its start and end tags. The direct parent is the parent element that contains an element directly, without any other parent elements in between. If A is the direct parent of B, there is no other element C that is a child of A and a parent of B. Correspondingly, B is a direct child of A if there is no other element C that is a child of A and a parent of B. Obviously an element can have several direct children, but it cannot have more than one direct parent (which is slightly different from sexual biological systems). Parent elements in general can also be called ancestors, and child elements descendants. Elements with the same direct parent can be called siblings. A root element has no ancestors and no siblings, but typically descendants, because every XML document has exactly one root element containing all content; only processing instructions and comments may appear outside it.
pointer
A method to indicate the target of meta information. In LML the target is either implied as specified, or it is given by the content of a specific element, ll, used inside the element representing the meta information. Another method is to use the attribute about.
styling
Styling or layout determines how content is presented. This is different from the markup of the content itself; the markup indicates the intended functionality or the structural or semantic meaning. Styling can be done, for example, with stylesheets, timesheets or scripting, and can help to indicate the elements and their semantic meaning, or can be used to improve the ergonomics of a document. To be usable, LML documents require some default styling, provided either by the author or by the user-agent, to enable users to identify the semantic meaning of document fragments.