How do you create the line comparisons?


Preparing the XQuery

The TEI files for this site use the <sourceDoc> based schema from the Text Encoding Initiative, rather than the more common default <text> based schema. So my code for a single stanza of a poem will look like this:

 
    <surfaceGrp xml:id="f.75" n="folio">
        <surface n="recto">
            <label>Folio 75 Recto</label>
            <graphic url="St_John_56_75r.jpg"></graphic>
            <zone n="EETS.QD.16">
                <line n="l.1"> <orig>He myght be called / eleazar the secunde ؛</orig> </line>
                <line n="l.2"> <orig>The chamnpyoun / moost myghty and notable ؛</orig> </line>
                <line n="l.3"> <orig>That yaff the olyfaunt / hy laste his laste mortall wounde ؛</orig> </line>
                <line n="l.4"> <orig>Machabeo<ex>rum</ex> / this story ys no fablee ؛</orig> </line>
                <line n="l.5"> <orig>And hercules / in his conquest stable ؛</orig> </line>
                <line n="l.6"> <orig>Bar vp the heuenys / in his humanytee ؛</orig> </line>
                <line n="l.7"> <orig>Ffor whom ony sorowes / wer maad moost lamentable ؛</orig> </line>
                <line n="l.8"> <orig>Whan I be hylde hym / þus nayled to atree ؛</orig> </line>
            </zone> […]
        </surface>
    </surfaceGrp>
                

One of my concerns about markup of text is that it can get so busy describing the text so machines can read it that it ceases to be readable by people. So I have tried to have the minimum amount of markup necessary to fit the rules of TEI while recording the structure and features of the poem as it exists in the witness. That’s what the @xml:id and @n attributes are doing above. They’re letting the machine know that this is folio 75 of the book in question. The <surfaceGrp> is a single page of the book, front and back. The <surface> is the particular side of the page in question—recto or verso. From there, we label the particular surface in a people-friendly way that’ll display on the site and provide the link to the image.

This raises the question of why <zone>, <line>, and <orig> are necessary. <orig> is an artifact of it being a TEI document. Since the assumption with this particular schema is that you’ll be working from an original manuscript, TEI will actually throw an error at you if you don’t wrap the actual text in an <orig> tag. <line>, I suspect, is relatively self-evident, and <zone> is the actual stanza in question. Unlike some poems that have extensive critical histories, there’s not a canonical numbering system for many of these poems, so what I’ve done here is provide a two-part system. The first part, the zone, is the actual stanza in question based on what’s considered the critical edition of the work – in this case The Minor Poems of John Lydgate (Early English Text Society Extra Series 107). That gets abbreviated to EETS e.s. 107, which is where I get the EETS in the @n attribute under <zone>. The “QD” refers to the initials of the title of the poem as it appears in that book, and the number is the number of the stanza in the poem. So what the @n attribute in <zone> is doing is explaining where the actual stanza is in relation to the poem in the EETS volume. That doesn’t mean it’s necessarily where it is in the actual book the witness is taken from. That’s handled by the folio reference in the <surfaceGrp> element. Also, once I have finalized images for all the texts, I’ll include code to give the dimensions of the zone on the image, so it can be highlighted, but that’s for the future.

<line> functions in much the same way. There are eight lines per zone, just as there are eight lines per stanza in this poem, so the @n there refers to the particular line of the eight – again using the EETS edition as a signpost (it’s “l.x” rather than just “x” because of another limitation of TEI – you can’t have solely numeric @n attributes in the <line> element).
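To make the two-part numbering concrete, here is a minimal JavaScript sketch that splits the @n values back into their parts. The function names parseZoneId and parseLineId are hypothetical helpers invented for this illustration; they are not part of the site's code, and the sketch assumes every zone id has exactly three dot-separated parts and every line id has the form "l.x".

```javascript
// Hypothetical helper: splits a zone @n such as "EETS.QD.16" into
// edition, poem initials, and stanza number.
function parseZoneId(n) {
    const parts = n.split(".");
    if (parts.length !== 3) {
        throw new Error("Unexpected zone id: " + n);
    }
    return { edition: parts[0], poem: parts[1], stanza: Number(parts[2]) };
}

// Hypothetical helper: extracts the line number from a line @n such as "l.1".
function parseLineId(n) {
    const match = /^l\.(\d+)$/.exec(n);
    if (!match) {
        throw new Error("Unexpected line id: " + n);
    }
    return Number(match[1]);
}
```

With the stanza shown earlier, parseZoneId("EETS.QD.16") gives the edition "EETS", the poem "QD", and stanza 16, and parseLineId("l.8") gives 8.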

That underlying structure is what makes the line comparison relatively simple. Because of the @n attributes identifying both the verse and the line in question, an XQuery script can be written that will easily grab the appropriate analogues whenever they exist:

 
    let $q := collection('file:/users/matt/Documents/tei/Lydgate/Quis_Dabit?select=*.xml')

    for $y in $q
    let $s := $y//tei:surface
    let $t := $y//tei:titleStmt/@xml:id
    let $m := $y//tei:msDesc/@xml:id
    let $w := concat($s/../../../..//msDesc/msIdentifier/settlement, ', ',
        $s/../../../..//msDesc/msIdentifier/institution, ' ', $s/../../../..//msDesc/msIdentifier/idno)
    let $z := $s/tei:zone[@n="EETS.QD.4"]
    let $l := $z/tei:line[@n="l.1"]
    let $g := concat($t, "/", $m, "/", substring-before($l/../../tei:graphic/@url, "."), ".html")
    let $o := $l/tei:orig/node()
    where ($z//tei:line/@n = "l.1")
        return
            <item>
                <orig>{$w}:
                    <ref target="{$g}">{$o}</ref>
                </orig>
            </item>
            

I realize that to the uninitiated this may appear to be gibberish, but it's actually quite simple:

    let $q := collection('file:/users/matt/Documents/tei/Lydgate/Quis_Dabit?select=*.xml')

This is a variable that invokes XQuery’s collection function. In this case, it is pointing to a folder on my own machine, but in the live version it points to the folder where the xml for particular texts is located. The *.xml at the end tells it to grab everything with the filename extension “xml” in that folder.

collection() basically puts all the documents together one after the other, so that what was a series of small separate tree structures can now be treated as a single sequence. I need to be able to walk through that sequence to grab the individual items.

for $y in $q lets me do that. The code is stating that for each of the items ($y) that have been connected in the collection ($q), return some information. That information is identified via the series of let declarations. These mean exactly what they sound like: let $x be whatever is after the := symbol. So in this case $s invokes all the surfaces in question, $t and $m grab the @xml:id values of the title statement and the manuscript description, $w grabs the holding institution and shelfmark of the volume, $g combines $t and $m with the url information from the graphic to generate a hyperlink so that the result can link back to the original item, $z is the zone information with the limitation of a particular stanza, and $l is the particular line. $o is the actual text from the particular witnesses.

Once all the information is defined via the let statements, the text needs to be filtered from the entire poem to the single line the viewer wishes to compare. This is handled by the where clause, which is saying that out of all the <line>s in the <zone> (which is already constrained by the @n attribute) we want only the information for line l.1.
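For readers more at home in a scripting language, the same walk-and-filter logic can be sketched in JavaScript over plain objects. This is only an analogy under stated assumptions, not how the site works: the witnesses array is a hypothetical in-memory stand-in for the TEI collection, findAnalogues is an invented name, and the third witness is made up purely to show a manuscript that lacks the stanza.

```javascript
// Hypothetical stand-in for the collection: only the fields the query
// touches are modelled. The first two readings are taken from the article.
const witnesses = [
    { shelfmark: "London, British Library Harley 2251",
      zones: [{ n: "EETS.QD.4", lines: [{ n: "l.1", text: "O alle ye doughtres · of Jerusalem" }] }] },
    { shelfmark: "Oxford, Bodleian Library Laud 683",
      zones: [{ n: "EETS.QD.4", lines: [{ n: "l.1", text: "O alle ẏe douhtren of jerusaleem" }] }] },
    // Invented witness without stanza 4, to show that it is skipped.
    { shelfmark: "Hypothetical witness",
      zones: [{ n: "EETS.QD.1", lines: [{ n: "l.1", text: "(placeholder text)" }] }] }
];

// Returns one result per witness that contains the requested stanza and line.
function findAnalogues(collection, zoneId, lineId) {
    const results = [];
    for (const witness of collection) {             // for $y in $q
        for (const zone of witness.zones) {
            if (zone.n !== zoneId) continue;        // tei:zone[@n = zoneId]
            for (const line of zone.lines) {
                if (line.n !== lineId) continue;    // tei:line[@n = lineId]
                results.push({ shelfmark: witness.shelfmark, text: line.text });
            }
        }
    }
    return results;
}
```

Calling findAnalogues(witnesses, "EETS.QD.4", "l.1") returns the two real readings and silently omits the invented witness, which is the behaviour the @n attributes make possible in the XQuery.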

The information after return shows the format of what this code will spit out: a set of tei-formatted lines, each stating the particular book in question’s location and shelfmark, a hyperlink back to the original page the line can be found on, and the actual text in question. Altogether, running it will look like this:

 
    <item>
        <orig>London, British Library Harley 2251:
            <ref target=".html">O alle ye doughtres · of Jerusalem</ref>
        </orig>
    </item>
    <item>
        <orig>London, British Library Harley 2255:
            <ref target=".html"> <hi rend="blue_pilcrow">¶</hi>O alle ye douħtren of <hi rend="underline">ierusaleem</hi>
            </ref>
        </orig>
    </item>
    <item>
        <orig>Long Melford, Holy Trinity Church Clopton Chantry Chapel:
            <ref target=".html"> <hi>O</hi> alle ye <gap quantity="8" unit="chars" reason="illegible"/>s of ierusaleem
            <note place="bottom" anchored="true" xml:id="explanatory">The “l” in “ierusaleem” in the final word is
                determined from context.</note></ref>
        </orig>
    </item>
    <item>
        <orig>Cambridge, Jesus College Q.G.8:
            <ref target=".html"> <hi>A</hi>ll the <hi rend="underline">doughtren </hi>of <hi rend="underline">
            Ierusalem</hi> . </ref>
        </orig>
    </item>
    <item>
        <orig>Oxford, Bodleian Library Laud 683:
            <ref target=".html">O alle ẏe douhtren of jerusaleem</ref>
        </orig>
    </item>
    <item>
        <orig>Oxford, St. John’s College 56:
            <ref target=".html">O alle the doughtren / of Jerusalem ؛</ref>
        </orig>
    </item>
            

Which is ready to be styled by XSLT as soon as it’s either embedded in an existing page or has the rest of the TEI wrapper built around it.

Running the XQuery

The actual running of this XQuery has to be done by the Saxon XSLT/XQuery processor. That process needs to be called either on the server or locally on the viewer's machine. Since the viewer may not have the Saxon processor installed, it occurs on the server using a piece of PHP code. Unfortunately, the necessary files to connect that PHP code with the Saxon installation natively are not available due to a corrupt installation file on the Saxon site. A command line version of the program is available, however, and runs with the following command:

    java -cp saxon9he.jar net.sf.saxon.Query -t -q:test.xq

This means that a call to the external program from a php page has to be made, requiring this piece of code:

    // Saxon query parameters are passed as name=value pairs, without a leading dash
    $text = exec("java -cp saxon9he.jar net.sf.saxon.Query -t -q:test.xq line=$line zone=$zone "
          . "collection=file:$collection");

This works fine, and the results are stored as $text. But they’re still formatted as the xml string shown above, not as html that can be understood by the web without a lot of extra work. What needs to happen to make it easily readable in a web browser is that it needs to be styled either with XQuery or with XSLT. Of the two, XSLT makes a whole lot more sense – that’s what it’s designed for, whereas XQuery is really designed as a query language to use xml files as a database.

PHP has its own XSLT parser, which can be invoked like this:

 
    $xml = new DOMDocument;
    $xml->loadXML($text);            // the xml string returned by the XQuery
    $xsl = new DOMDocument;
    $xsl->load('comparison.xsl');
    $proc = new XSLTProcessor;       // configure the transformer
    $proc->importStylesheet($xsl);   // attach the xsl rules
    echo $proc->transformToXML($xml);
                

What this does is create a new object within php, load it with the xml we just finished creating, load a stylesheet to style said xml, and then attach the stylesheet to the xml. Finally, the result of the transformation is returned to the screen via the echo command. It works really well in most cases. The problem is that my xsl stylesheet has this piece of code in it:

 
    <xsl:variable name="max" select="@quantity"/>

    <xsl:for-each select="1 to $max">
        <xsl:text>.</xsl:text>
    </xsl:for-each>
            

That code makes sure that any time there’s a gap due to damage, the likely number of characters (taken either from what’s left of the letters or from the critical text, EETS e.s. 107) is rendered as that number of dots, indicating that it’s not just a blank space. The way I do that is through an xml attribute called @quantity and a for-each loop that prints one dot for each number from 1 up to @quantity. Looping over a numeric range like “1 to $max” is an XSLT 2.0 bit of code (the range expression comes from XPath 2.0), since philosophically XSLT generally eschews such loops in favor of its native <xsl:apply-templates> instruction. The native php XSLT parser is 1.0. It will not handle this code.
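To show what that loop produces, and what a loop-free alternative looks like, here is a small JavaScript sketch. Both function names are hypothetical and neither is part of the site. The first mirrors the range loop; the second uses recursion, which is the usual way to express the same thing when a range expression is unavailable, as in XSLT 1.0.

```javascript
// Mirrors the "1 to $max" loop: one dot per missing character.
// Accepts the quantity as a string, since XML attribute values are text.
function gapToDots(quantity) {
    const count = Number.parseInt(quantity, 10);
    if (!Number.isInteger(count) || count < 0) {
        return "";   // a missing or malformed @quantity yields no dots
    }
    return ".".repeat(count);
}

// The same result by recursion: emit one dot, then recurse with one fewer
// remaining. Assumes a non-negative whole number.
function gapToDotsRecursive(remaining) {
    if (remaining <= 0) {
        return "";
    }
    return "." + gapToDotsRecursive(remaining - 1);
}
```

For the Long Melford line, where the gap has a quantity of 8, either function yields eight dots.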

But!

Good old Saxon will handle XSLT 2.0, but we have no php-native XSLT parser for Saxon. So a second external call to the command line is made:

    $transform = exec("java -jar saxon9he.jar -s:$filename -xsl:comparison.xsl -o:$html");

Notice, though, that that command has a $filename variable. The parser won’t easily just take the string we had before, so now instead of keeping the result in memory it needs to be written to a file, which is then read by the Saxon XSLT 2.0 parser in the command above. Once it does so it transforms the xml into html, which should be able to be displayed via a php echo statement. However, that doesn’t work, so it’s written to another file named by the variable $html.

Now, to actually display the information, we need to go back to the file system and grab that html file:

    $test = file_get_contents($html);

and then display the results:

    echo $test;

This is what you actually see when you click on the blue dot to the right of the line and the box opens up – the contents of $test. It’s not as simple as just clicking on the dot, though. Clicking on that dot calls some javascript. Javascript has to be used because you’re changing something locally when you click on the dot and javascript is a client-side scripting language.

The javascript passes the line, zone (here represented as id), and collection characteristics to the php code via this function:

 
    function compare_toggle_visibility(id, line, collection) {
        var e = document.getElementById(id);
        e.style.display = ((e.style.display != 'none') ? 'none' : 'block');
        $(e).html("Loading Comparison…");
        $.get('/XML/XQuery/test_command_line.php' + '?collection=' + collection + '&zone=' + id + '&line=' + line,
            function(responseTxt){ $(e).html(responseTxt); });
    }
            

What this does is first find the element in the html code whose @id attribute matches the id passed in. It then checks to see if it has the style attribute ‘display:none’ (indicating it should not be displayed) and switches it to ‘display:block’ if it does; otherwise it switches it back to ‘display:none’. That’s what allows the box to “open up” and become visible. Having done that, it then puts some text into that box so that you know that work is being done, and finally it loads the php page and sends the results of that php page to the box. Clicking on the dot again will close the box back up (and at this point re-runs the code pointlessly – that’s something I need to fix).
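One possible way to avoid the pointless re-run mentioned above is to request the comparison only when the box is being opened. The following is a minimal sketch under stated assumptions, not the site's actual code: buildComparisonUrl and toggleComparison are hypothetical names, and the loader is passed in as a parameter so the logic does not depend on jQuery. The URL helper also percent-encodes each value, which the plain string concatenation does not.

```javascript
// Hypothetical helper: builds the request URL with every value encoded.
function buildComparisonUrl(base, collection, zone, line) {
    const params = new URLSearchParams({ collection: collection, zone: zone, line: line });
    return base + "?" + params.toString();
}

// Hypothetical replacement for the toggle: flips visibility and calls
// load(url, callback) only when the box goes from hidden to shown.
// Returns true if a request was made, false if the box was just closed.
function toggleComparison(el, url, load) {
    const opening = el.style.display === "none";
    el.style.display = opening ? "block" : "none";
    if (!opening) {
        return false;                      // closing: nothing to fetch
    }
    el.innerHTML = "Loading Comparison…";
    load(url, function (responseTxt) { el.innerHTML = responseTxt; });
    return true;
}

// In the browser this might be wired up as follows (assumes jQuery is loaded):
//   var url = buildComparisonUrl('/XML/XQuery/test_command_line.php', collection, id, line);
//   toggleComparison(document.getElementById(id), url,
//       function (u, done) { $.get(u, done); });
```

Like the original function, this assumes the box starts out with an inline ‘display:none’ style, so the first click counts as opening it.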

On the php side, the three variables are passed to the php through the $_GET array (which grabs the appropriate value from the uri passed to php)

 
    $line = htmlspecialchars($_GET["line"]);
    $zone = htmlspecialchars($_GET["zone"]);
    $collection = htmlspecialchars($_GET["collection"]);
            

and the code is processed as explained above. However, since there’s the possibility of multiple people accessing the same lines at similar times we can’t have filenames that stay the same – they’d be overwritten. Instead, dynamic filenames need to be created. I do this by generating a random number and attaching it to the machine’s timestamp, then creating two variables based on that number with the extensions .html and .xml.

 
    $unique = microtime(true) . mt_rand(1,5000000000);
    $filename = $unique . ".xml";
    $html = $unique . ".html";
            

After the code has executed, I then clean up these files so there isn’t a bunch of randomly named files cluttering up my machine:

 
    unlink($filename); 
    unlink($html);
            

The unfortunate effect of this constant call back and forth to the command line is a lag on the display of the comparison items, but my hope is that the Saxon php installer will be repaired and I can streamline it with the more integrated code.