How do you create the line comparisons?


Preparing the XQuery

The TEI files for this site use the <sourceDoc>-based schema from the Text Encoding Initiative, rather than the more common default <text>-based schema. So my code for a single stanza of a poem looks like this:

<surfaceGrp xml:id="f.75" n="folio">
  <surface n="recto">
      <label>Folio 75 Recto</label>
      <graphic url="St_John_56_75r.jpg"/>
      <zone n="EETS.QD.16">
          <line n="l.1"> He myght be called / eleazar the secunde ؛ </line>
          <line n="l.2"> The chamnpyoun / moost myghty and notable ؛ </line>
          <line n="l.3"> That yaff the olyfaunt / hy laste his laste mortall wounde ؛ </line>
          <line n="l.4"> Machabeo<ex>rum</ex> / this story ys no fablee ؛ </line>
          <line n="l.5"> And hercules / in his conquest stable ؛ </line>
          <line n="l.6"> Bar vp the heuenys / in his humanytee ؛ </line>
          <line n="l.7"> Ffor whom ony sorowes / wer maad moost lamentable ؛ </line>
          <line n="l.8"> Whan I be hylde hym / þus nayled to atree ؛ </line>
      </zone>
  </surface>
</surfaceGrp>

One of my concerns about markup is that it can get so busy describing the text for machines that it ceases to be readable by people. So I have tried to use the minimum amount of markup necessary to fit the rules of TEI while still recording the structure and features of the poem as it exists in the witness. That’s what the @xml:id and @n attributes are doing above. They’re letting the machine know that this is folio 75 of the book in question. The <surfaceGrp> is a single page of the book, front and back. The <surface> is the particular side of the page in question—recto or verso. From there, we label the particular surface in a people-friendly way that’ll display on the site and provide the link to the image.

This raises the question of why <zone> and <line> are necessary. <line>, I suspect, is relatively self-evident, and <zone> is the actual stanza in question. Unlike some poems that have extensive critical histories, there’s no canonical numbering system for many of these poems, so what I’ve done here is provide a two-part system. The first part, the zone, is the actual stanza in question based on what’s considered the critical edition of the work – in this case The Minor Poems of John Lydgate (Early English Text Society extra series 107). That gets abbreviated to EETS e.s. 107, which is where I get the EETS in the @n attribute under <zone>. The “QD” refers to the initials of the title of the poem as it appears in that book, and the number is the number of the stanza in the poem. So what the @n attribute in <zone> is doing is explaining where the actual stanza is in relation to the poem in the EETS volume. That doesn’t mean it’s necessarily where it is in the actual book the witness is taken from. That’s handled by the folio reference in the <surfaceGrp> element. Also, once I have finalized images for all the texts, I’ll include code to give the dimensions of the zone on the image, so it can be highlighted, but that’s for the future.

<line> functions in much the same way. There are eight lines per zone, just as there are eight lines per stanza in this poem, so the @n there refers to the particular line of the eight – again using the EETS edition as a signpost (it’s “l.x” rather than just “x” because of a limitation of TEI – you can’t have solely numeric @n attributes in the <line> element).
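Since the @n values are just dotted strings, the two-part address can be pulled apart mechanically. Here is a quick sketch in JavaScript (the parseAddress helper is invented for illustration, not part of the site’s code):

```javascript
// Split the two-part address used in the @n attributes: the zone value
// names the edition, poem, and stanza; the line value names the line.
function parseAddress(zoneN, lineN) {
  const [edition, poem, stanza] = zoneN.split(".");  // e.g. "EETS.QD.16"
  const line = Number(lineN.replace(/^l\./, ""));    // e.g. "l.7" -> 7
  return { edition, poem, stanza: Number(stanza), line };
}
```

Given the stanza above, parseAddress("EETS.QD.16", "l.7") resolves to stanza 16, line 7 of “QD” in the EETS edition.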

XQuery Part I: the Comparison Feature

The underlying structure mentioned above is what makes the line comparison relatively simple. Because of the @n attributes identifying both the verse and the line in question, a script can be written using XQuery that will easily grab the appropriate analogues whenever they exist:

declare namespace tei = "http://www.tei-c.org/ns/1.0";

let $q := collection('file:/users/matt/Documents/tei/Lydgate/Quis_Dabit?select=*.xml')

  for $y in $q 
  let $s := $y//tei:surface 
  let $t := $y//tei:titleStmt/@xml:id 
  let $m := $y//tei:msDesc/@xml:id 
  let $z := $s/tei:zone[@n="EETS.QD.4"] 
  let $l := $z/tei:line[@n="l.1"] 
  let $w := concat($y//tei:msDesc/tei:msIdentifier/tei:settlement/text(), ',', $y//tei:msDesc/tei:msIdentifier/tei:institution/text(), ' ', $y//tei:msDesc/tei:msIdentifier/tei:idno/text()) 
  let $g := concat($t, "/", $m, "/", substring-before($l/../../tei:graphic/@url, "."), ".html") 
  let $o := local:remove-elements($l, $remove-list) 
    where ($z//tei:line/@n = "l.1") 
    return <ref target="{$g}">{$o}</ref> 

I realize that to the uninitiated this may appear to be gibberish, but it's actually quite simple:

let $q := collection('file:/users/matt/Documents/tei/Lydgate/Quis_Dabit?select=*.xml')

This is a variable that invokes XQuery’s collection function. In this case, it is pointing to a folder on my desktop, but in the live version it points to the folder where the XML for the particular texts is located. The *.xml at the end tells it to grab everything with the filename extension “xml” in that folder.

collection() basically puts all the documents together one after the other, so that what was a series of small separate tree structures now has an overarching root connecting them. I need to be able to walk through that root to grab the individual items.

for $y in $q lets me do that. The code states that for each of the items ($y) connected in the collection ($q), some information should be returned. That information is identified via the series of let declarations. These mean exactly what they sound like: let whatever variable ($s, $t, etc.) equal whatever is after the := symbol. So in this case $s invokes all the surfaces in question; $w grabs the holding institution and shelfmark of the volume by combining a number of elements in the TEI; $g grabs the url information from the graphic and generates a hyperlink so that the result can link back to the original item (this time by combining the element with static text); $z is the zone information, limited to a particular stanza; and $l is the particular line. $o is the actual text from the particular witness, run through another function whose purpose I will explain shortly.

Once all the information is defined via the let statements, the text needs to be filtered from the entire poem down to the single line the viewer wishes to compare. This is handled by the where clause, which says that out of all the <line>s in the <zone> (which is already constrained by the @n attribute) we want only the information for line l.1.
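The effect of that filtering can be sketched in JavaScript over plain records standing in for the TEI documents (the record shape and the analogues helper are invented for illustration; the l.1 texts come from the results shown later in this post):

```javascript
// Hypothetical records, one per line per witness.
const lines = [
  { witness: "Harley 2251", zone: "EETS.QD.4", line: "l.1", text: "O alle ye doughtres of Jerusalem" },
  { witness: "Laud 683",    zone: "EETS.QD.4", line: "l.1", text: "O alle ẏe douhtren of jerusaleem" },
  { witness: "Laud 683",    zone: "EETS.QD.4", line: "l.2", text: "(the next line of the stanza)" },
];

// Keep only the records for the requested stanza and line, exactly as
// the where clause narrows the XQuery results.
function analogues(records, zone, line) {
  return records.filter(r => r.zone === zone && r.line === line);
}
```

Calling analogues(lines, "EETS.QD.4", "l.1") returns one record per witness that actually contains that line, which is what makes the comparison box possible.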

XQuery Part II: Functions

Before getting to what is actually returned by this code, a moment to talk about what the local:remove-elements above means. XQuery is what is called a functional programming language. While the link will go into depth on what precisely that means, for practical purposes a function is a piece of code written once, encapsulating multiple lines of code into a single reference that can be called again and again. This can be useful for a number of reasons that you can find more information about here, but in the case of this project the function was written primarily for recursion. The full code of the local:remove-elements function is as follows:

declare function local:remove-elements($input as element(), $remove-names as xs:string*) as element() { 
  element { node-name($input) } { 
     $input/@*, 
     for $child in $input/node()[not(name(.) = $remove-names)] 
     return 
         if ($child instance of element()) 
           then local:remove-elements($child, $remove-names) 
         else $child 
  } 
};

The function is first defined through the declare statement. This lets the engine that's running the XQuery know that this is a function rather than a piece of code to be run immediately. The local that prefaces remove-elements references the local namespace, while remove-elements is the name of the function. You'll note that the function takes two parameters (the items in parentheses): an XML element and a string. The xs prefacing string is another reference to a namespace -- in this case the namespace for XML Schema datatypes.

So far, all of this has simply been defining the function. The actual code begins on the next line. element, here, is what is referred to as a computed constructor. All this means is that it's creating the element to be returned computationally, rather than through a direct declaration of the element. node-name($input) lets the Saxon engine know that we're only interested in the name of the element at this point, rather than the element and its contents. Now that the new element is declared, the block of code in the curly brackets is executed. $input/@* copies over the attributes of the old element; the for clause then goes through each child of that element and returns it as a child of the new element, except for those whose names are included in the remove list (for $child in $input/node()[not(name(.)=$remove-names)]). The comma simply separates the copied attributes from the children being computed.

After it's gone through all the child elements in the original input element, the return statement indicates that the children are to be added to the newly created element, but there's a condition in place. The if/then statement serves as a check on each child. if ($child instance of element()) tells the Saxon engine to check whether the child is itself an element, and so may have children of its own. then local:remove-elements($child, $remove-names) tells it to run the function again on that child if it is. This is the primary reason this bit of code needs to be written as a function – so it can recurse through the various children of an element, catch them all, and apply itself to each until it reaches a child element that has no children in turn. Once it's done that, it attaches each child element to the newly-created element and the whole package is returned to us. This is useful, for example, if I have multiple note elements attached to a single line element, as the code will go through and remove each of them in turn.
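The same recursive pattern can be sketched in JavaScript over a plain object tree (the node shape and the removeElements function here are hypothetical stand-ins for illustration, not the site’s actual code):

```javascript
// Rebuild a node tree, dropping any element whose name is on the
// remove list, at any depth. Text nodes are plain strings and pass
// through untouched, just as text does in the XQuery version.
function removeElements(node, removeNames) {
  if (typeof node === "string") return node; // text node: keep as-is
  return {
    name: node.name,
    children: node.children
      // drop children whose names are on the remove list
      .filter(child => typeof child === "string" || !removeNames.includes(child.name))
      // recurse into whatever survives, so nested notes are caught too
      .map(child => removeElements(child, removeNames)),
  };
}

// A line with an editorial note embedded in it:
const line = {
  name: "line",
  children: ["He myght be called ", { name: "note", children: ["editorial note"] }, " eleazar"],
};
const cleaned = removeElements(line, ["note"]);
// cleaned.children now holds only the two text nodes
```

As in the XQuery, the recursion is what lets one short definition strip every note out of a line no matter how deeply it is nested.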

Getting the results

Much like with the function, the information after return in the larger code sample shows the format of what this code will spit out: a set of TEI-formatted lines, each stating the particular book in question’s location and shelfmark, a hyperlink back to the original page the line can be found on, and the actual text in question. Altogether, running it will look like this:

<item> London, British Library Harley 2251: 
  <ref target=".html">O alle ye doughtres · of Jerusalem</ref> 
</item>
<item> London, British Library Harley 2255: 
  <ref target=".html"> <hi rend="blue_pilcrow">¶</hi>O alle ye douħtren of <hi rend="underline">ierusaleem</hi></ref>
</item>
<item> Long Melford, Holy Trinity Church Clopton Chantry Chapel: 
  <ref target=".html"> <hi>O</hi> alle ye <gap quantity="8" unit="chars" reason="illegible"/>s of ierusaleem</ref> 
</item>
<item> Cambridge, Jesus College Q.G.8: 
  <ref target=".html"> <hi>A</hi>ll the <hi rend="underline">doughtren </hi>of <hi rend="underline"> Ierusalem</hi> . </ref> 
</item>
<item> Oxford, Bodleian Library Laud 683: 
  <ref target=".html">O alle ẏe douhtren of jerusaleem</ref> 
</item>
<item> Oxford, St. John’s College 56: 
  <ref target=".html">O alle the doughtren / of Jerusalem ؛</ref> 
</item>

Which is ready to be styled by XSLT as soon as it’s either embedded in an existing page or has the rest of the TEI wrapper built around it.

Running the XQuery

The actual running of this XQuery has to be done by the Saxon XSLT/XQuery processor. That process needs to be called either on the server or locally on the viewer's machine. Since the viewer may not have the Saxon processor installed, it occurs on the server using a piece of PHP code. Unfortunately, the necessary files to connect that PHP code with the Saxon installation natively are not available due to a corrupt installation file on the Saxon site. A command line version of the program is available, however, and runs with the following command:

java -cp saxon9he.jar net.sf.saxon.Query -t -q:test.xq

This means that a call to the external program from a php page has to be made, requiring this piece of code:

$text = exec("java -cp saxon9he.jar net.sf.saxon.Query -t -q:test.xq line=$line zone=$zone collection=file:$collection");
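For those keyword=value pairs on the command line to reach the query, test.xq needs matching external variable declarations, along these lines (a sketch – the actual declarations in test.xq aren’t shown in this post):

```xquery
declare variable $line external;
declare variable $zone external;
declare variable $collection external;
```

Saxon binds each value from the command line to the external variable of the same name before evaluating the query.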

This works fine, and the results are stored as $text. But they’re still formatted as the XML string shown above, not as HTML that a browser can understand without a lot of extra work. To make it easily readable, it needs to be styled either with XQuery or with XSLT. Of the two, XSLT makes a whole lot more sense – that’s what it’s designed for, whereas XQuery is really designed as a query language that uses XML files as a database.

PHP has its own XSLT parser, which can be invoked like this:

$xml = new DOMDocument; 
$xml->loadXML($text); // load the XML string produced by the XQuery above 
$xsl = new DOMDocument; 
$xsl->load('comparison.xsl'); // Configure the transformer 
$proc = new XSLTProcessor; 
$proc->importStyleSheet($xsl); // attach the xsl rules 
echo $proc->transformToXML($xml); 

What this does is create a new object within php, load it with the xml we just finished creating, load a stylesheet to style said xml, and then attach the stylesheet to the xml. Finally, the result of the transformation is returned to the screen via the echo command. It works really well in most cases. The problem is that my xsl stylesheet has this piece of code in it:

<xsl:variable name="max" select="@quantity"/>
<xsl:for-each select="1 to $max"> 

That code makes sure that any time there’s a gap due to damage, the likely number of characters (taken either from what’s left of the letters or from the critical text, EETS e.s. 107) is rendered as a number of dots, indicating that it’s not just a blank space. The way I do that is through an XML attribute called @quantity and a for-each loop that prints dots until the system’s internal counter matches @quantity. That for-each over a range is an XSLT 2.0 bit of code, since philosophically XSLT generally eschews such loops in favor of its native <xsl:apply-templates> mechanism. The native PHP XSLT parser is 1.0. It will not handle this code.
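Stripped of the XSLT machinery, the loop is doing nothing more than this (a JavaScript sketch; gapDots is an invented name):

```javascript
// Print one dot per missing character, as counted by @quantity --
// the same effect as the XSLT for-each over "1 to $max".
function gapDots(quantity) {
  let dots = "";
  for (let i = 1; i <= quantity; i++) {
    dots += ".";
  }
  return dots;
}
```

For the <gap quantity="8"/> in the Clopton Chantry Chapel witness above, this produces the eight dots that stand in for the illegible characters.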


Good old Saxon will handle XSLT 2.0, but we have no php-native XSLT parser for Saxon. So a second external call to the command line is made:

$transform = exec("java -jar saxon9he.jar -s:$filename -xsl:comparison.xsl -o:$html");

Notice, though, that that command has a $filename variable. The parser won’t easily just take the string we had before, so instead of keeping the result in memory it needs to be written to a file, which is then read by the Saxon XSLT 2.0 parser in the command above. Saxon then transforms the XML into HTML, which in principle could be displayed via a PHP echo statement. In practice that doesn’t work, so it’s written to another file named by the variable $html.

Now, to actually display the information, we need to read that HTML file back into PHP:

$test = file_get_contents($html);

and then display the results:

echo $test;

This is what you actually see when you click on the blue dot to the right of the line and the box opens up – the contents of $test. It’s not as simple as just clicking on the dot, though. Clicking on that dot calls some JavaScript. JavaScript has to be used because clicking the dot changes something locally, and JavaScript is a client-side scripting language.

The JavaScript passes the line, zone (here represented as id), and collection values to the PHP code via this function:

function compare_toggle_visibility(id, line, collection) { 
  var e = document.getElementById(id); 
  e.style.display = ((e.style.display != 'none') ? 'none' : 'block'); 
  $(e).html("Loading Comparison…");
  $.get('/XML/XQuery/test_command_line.php' + '?collection=' + collection + '&zone=' + id + '&line=' + line, 
    function(responseTxt){ $(e).html(responseTxt); }); 
}

What this does is first find the element in the HTML whose id attribute matches the id passed in. It then checks whether that element has the style ‘display:none’ (indicating it should not be displayed) and switches it to ‘display:block’ if it does. That’s what allows the box to “open up” and become visible. Having done that, it puts some text into the box so that you know work is being done, and finally it loads the PHP page and sends that page’s results to the box. Clicking on the dot again will close the box back up (and at this point re-runs the code pointlessly – that’s something I need to fix).
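One possible fix for that pointless re-run is to remember which boxes have already been loaded and skip the fetch on subsequent clicks (a hypothetical sketch, not the site’s current code):

```javascript
// Track which comparison boxes have already been fetched, so a
// close-and-reopen click only toggles visibility.
var fetched = {};
function shouldFetch(id) {
  if (fetched[id]) {
    return false; // already loaded once; just toggle the box
  }
  fetched[id] = true;
  return true;
}
```

The toggle function would then call $.get only when shouldFetch(id) returns true.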

On the PHP side, the three variables are passed to the PHP script through the $_GET superglobal (which grabs the appropriate values from the URI passed to PHP)


and the code is processed as explained above. However, since there’s the possibility of multiple people accessing the same lines at similar times, we can’t have filenames that stay the same – they’d be overwritten. Instead, dynamic filenames need to be created. I do this by generating a random number and attaching it to the machine’s timestamp, then creating two variables based on that number with the extensions .html and .xml.

$unique = microtime(true) . mt_rand(1, 5000000000);
$filename = $unique . ".xml"; 
$html = $unique . ".html"; 

After the code has executed, I then clean up these files so there isn’t a bunch of randomly named files cluttering up my machine:
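In PHP this can be as simple as a pair of unlink() calls on the two temporary files (a sketch of one way to do it):

```php
// Remove the temporary XML and HTML files created for this request.
unlink($filename); // the intermediate XML
unlink($html);     // the transformed HTML
```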


The unfortunate effect of this constant back-and-forth to the command line is a lag in displaying the comparison items, but my hope is that the Saxon PHP installer will be repaired and I can streamline things with the more integrated code.