"Quick-and-dirty XML applications for textual analysis" by Gabriel Egan

The great bibliographer Fredson Bowers was an incorrigible optimist. In 1966 he had a vision of his field being transformed [SLIDE]:

I have some hopes that electronic computers can be put to work to digest and to analyze much information that at present we do not have. It will be a blessed day in the future when one can press a button and give such a lordly command as 'List for me every time compositor B follows his copy in spelling win as win or winne, every time he changes a copy spelling win to winne, or winne to win, and distinguish in each case what he does in setting prose and setting verse. Then give me all the occurrences of win and winne in texts that he set from manuscript'. (Bowers 1966, 136)

Bowers was referring to the act of typesetting in which a compositor read the work he was supposed to set in type (an existing book or manuscript) and picked individual letters and punctuation from a typecase and placed them together word by word and line by line to make a block of printable type [SLIDE]. Quite often the job of typesetting one book would be shared by several compositors and we can tell where each one started and finished his stint because they varied in their habits of spelling and spacing of type. We don't know their names so they are identified as compositor A, compositor B, and so on.

    Bowers's hopes have not yet been realized. The problem is tougher than he anticipated because where two or more scholars have tried to distinguish the compositorial stints in one book they have come to different conclusions about the numbers of compositors at work and where each one's stints start and finish. If we are to computerize the scholarly knowledge, we will have to record it as a set of hypotheses so that our questions take the form "supposing that Paul Werstine is right about his stints, where does compositor B spell win as winne?" and "now show me the same supposing that Gary Taylor is right". That is, we have to computerize the scholarly differences of opinion.

    The currently most popular way to computerize knowledge about a written text is to take an electronic version of the raw words and punctuation and to surround various parts of it with tags that conform to a standard known as Extensible Markup Language (XML). Just what features one records and what names one gives the tags are up to the user, but a typical model for marking up a play would be [SLIDE]:

<play>
  <act n="1">
    <scene n="1">
      <line n="1">Bar. WHose there?</line>
      <line n="2">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
      <line n="3">Bar. Long liue the King,</line>
      . . .
      <line n="156">Where we shall find him most conuenient. Exeunt</line>
    </scene>
  </act>
      . . .
  <act n="5">
      . . .
  </act>
</play>

Notice that each component part of the play is marked by a pair of tags [SLIDE], the opening one naming which kind of component it is, such as play, act, scene or line, and the closing one repeating the name but prefixed by a forward slash meaning "end of" play, act, scene or line. I've coloured the tags just to help show the structure. An important point is the Russian-doll (or Chinese-box) principle [SLIDE]: the lines are nested inside scenes, which are nested inside acts, which are nested inside the outermost box called "play". This nesting is demanded in XML--no line may cross a scene boundary, no scene may cross an act boundary--because XML treats every text as what is called an Ordered Hierarchy of Content Objects. This means that in XML all novels have to consist of chapters that consist of paragraphs, and poems have to consist of lines that are made of words. For that reason XML has trouble with the works of writers such as Laurence Sterne [SLIDE], who had a marbled endpaper printed in the middle of his novel Tristram Shandy, or e e cummings, who frequently broke words across line boundaries, as here at the beginning of his poem "exit a kind of unkindness exit" [SLIDE]:

exit a kind of unkindness exit

little
mr Big
notbusy
Busi
ness notman

(e e cummings "exit a kind of unkindness exit")

You might think such violations of the Russian-doll principle rare outside of literary conceit, but Murray McGillivray recently pointed out a typical example from everyday email [SLIDE]:

Subject: Your parcel
From: Helen Black <heblack@ucalgary.ca>
Date: Fri. October 2, 2009 3:28 p.m.
To: Murray McGillivray <mmcgilli@ucalgary.ca>
Message: went off in the courier this afternoon, HEB
(McGillivray 2010, 131)

[SLIDE] The sentence "Your parcel went off in the courier this afternoon" is split between two containers: the 'Subject' and the 'Message'. The features of a book that a bibliographer is interested in, such as pages, formes and gatherings, cut across the features usually marked up in XML, such as acts, scenes, speeches and lines. A speech may easily cross a page boundary and a scene a forme boundary. There are well established means to reconcile incompatible hierarchies of interest within one XML document, but recording scholarly opinions about compositorial stints is a particularly tough case. Suppose that Werstine thinks that compositor A set the first line of the second quarto of Shakespeare's Hamlet (1604-5) and compositor B the rest of the play. We might mark this up thus [SLIDE]:

<werstine-stint comp="A">
  <line n="1">Bar. WHose there?</line>
</werstine-stint>
<werstine-stint comp="B">
  <line n="2">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
  <line n="3">Bar. Long liue the King,</line>
  . . .
</werstine-stint>

This is valid XML [SLIDE]: each line is wholly contained within one of the two stints as determined by Werstine. [SLIDE] Now let's suppose that Taylor thinks that compositor A set the first two lines of Hamlet and compositor B the rest of the play. We would mark this up thus [SLIDE]:

<taylor-stint comp="A">
  <line n="1">Bar. WHose there?</line>
  <line n="2">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
</taylor-stint>
<taylor-stint comp="B">
  <line n="3">Bar. Long liue the King,</line>
  . . .
</taylor-stint>

This also is valid XML [SLIDE]: each line is wholly contained within one of the two stints as determined by Taylor. Each of these two hierarchies exists perfectly well within its own XML document, but look what happens when we try to make them co-exist in a single document [SLIDE]:

<werstine-stint comp="A">
<taylor-stint comp="A">
  <line n="1">Bar. WHose there?</line>
</werstine-stint>
<werstine-stint comp="B">
  <line n="2">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
</taylor-stint>
<taylor-stint comp="B">
  <line n="3">Bar. Long liue the King,</line>
  . . .
</taylor-stint>
</werstine-stint>

If we highlight the hierarchies [SLIDE x 4] we find we have broken the Russian-doll/Chinese-box principle, or as they say in XML we have created overlapping hierarchies. It appears that we cannot make a single representation of Q2 Hamlet containing at once Werstine's and Taylor's views on its typesetting.
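
The failure can be checked mechanically. What follows is a minimal sketch in Python (not part of the talk's own toolchain) that hands both versions to the standard library's XML parser. The enclosing <play> element and the function name is_well_formed are my assumptions, added only so that each sample has the single root element a parser requires.

```python
import xml.etree.ElementTree as ET

# Werstine's stints alone: every element closes inside its parent.
NESTED = """<play>
<werstine-stint comp="A">
  <line n="1">Bar. WHose there?</line>
</werstine-stint>
<werstine-stint comp="B">
  <line n="2">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
  <line n="3">Bar. Long liue the King,</line>
</werstine-stint>
</play>"""

# Both scholars at once: werstine-stint closes while taylor-stint is still open.
OVERLAPPING = """<play>
<werstine-stint comp="A">
<taylor-stint comp="A">
  <line n="1">Bar. WHose there?</line>
</werstine-stint>
<werstine-stint comp="B">
  <line n="2">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
</taylor-stint>
<taylor-stint comp="B">
  <line n="3">Bar. Long liue the King,</line>
</taylor-stint>
</werstine-stint>
</play>"""


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(NESTED))       # True
print(is_well_formed(OVERLAPPING))  # False
```

The parser gives up at the first closing tag that does not match the most recently opened element, which is exactly the point at which the two hierarchies cross.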

    We are forced, then, to have one document for Werstine's view and one for Taylor's. But we don't want two complete copies of the play itself, not least because if we find a transcription error in the electronic text we don't want to have to correct it in two places. We should instead store in one document the base text, with the markup that everyone agrees upon, and keep the scholars' competing views of it somewhere else. This approach is called stand-off markup. Here are snippets from the five documents needed for Q2 Hamlet [SLIDE]:

<line n="TLN-1">Bar. WHose there?</line>
<line n="TLN-2">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
<line n="TLN-3">Bar. Long liue the King,</line>
(basetext.xml)

<xi:include href="basetext.xml" xpointer="TLN-1"/>
(werstine-on-comp-A.xml)

<xi:include href="basetext.xml" xpointer="TLN-2"/>
<xi:include href="basetext.xml" xpointer="TLN-3"/>
(werstine-on-comp-B.xml)

<xi:include href="basetext.xml" xpointer="TLN-1"/>
<xi:include href="basetext.xml" xpointer="TLN-2"/>
(taylor-on-comp-A.xml)

<xi:include href="basetext.xml" xpointer="TLN-3"/>
(taylor-on-comp-B.xml)

[SLIDE] Notice that the base text contains only the uncontroversial line information for the first three lines. [SLIDE] The document giving Werstine's view on compositor A's setting of those three lines simply picks out the first line, and [SLIDE] the document giving his view of compositor B's setting of those three lines picks out the second two. [SLIDE] The document giving Taylor's view of compositor A's setting picks out the first two lines, and [SLIDE] the document giving his view of compositor B's setting picks out just the third one. In this form, the documents holding the scholars' views can be interrogated by off-the-shelf software running the system called XQuery. During processing, the XInclude statements are replaced with the content they identify, thus [SLIDE]:

BEFORE PROCESSING
<xi:include href="basetext.xml" xpointer="TLN-2"/>
<xi:include href="basetext.xml" xpointer="TLN-3"/>
(werstine-on-comp-B.xml)

[SLIDE] AFTER PROCESSING
<line n="TLN-2">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
<line n="TLN-3">Bar. Long liue the King,</line>
(werstine-on-comp-B.xml)
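
To make the substitution concrete, here is a hand-rolled sketch in Python of what the processor does. It is not a real XInclude implementation and not the software used in the talk: the function name resolve and the enclosing <text> and <stint> elements are hypothetical, the documents are held as strings rather than files, and lines are looked up by their n attribute. The attribute on the include element is written xpointer, the name the XInclude specification gives it.

```python
import xml.etree.ElementTree as ET

XI = "http://www.w3.org/2001/XInclude"

# The agreed base text (hypothetical <text> wrapper added as a root element).
BASETEXT = """<text>
<line n="TLN-1">Bar. WHose there?</line>
<line n="TLN-2">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
<line n="TLN-3">Bar. Long liue the King,</line>
</text>"""

# Werstine's view of compositor B (hypothetical <stint> wrapper).
WERSTINE_COMP_B = f"""<stint xmlns:xi="{XI}">
<xi:include href="basetext.xml" xpointer="TLN-2"/>
<xi:include href="basetext.xml" xpointer="TLN-3"/>
</stint>"""


def resolve(stint_xml: str, base_xml: str) -> ET.Element:
    """Replace each include element with the base-text line it points to."""
    base = ET.fromstring(base_xml)
    lines_by_id = {line.get("n"): line for line in base.iter("line")}
    stint = ET.fromstring(stint_xml)
    resolved = ET.Element(stint.tag)
    for child in stint:
        if child.tag == f"{{{XI}}}include":
            resolved.append(lines_by_id[child.get("xpointer")])
        else:
            resolved.append(child)
    return resolved


result = resolve(WERSTINE_COMP_B, BASETEXT)
print([line.get("n") for line in result])  # ['TLN-2', 'TLN-3']
```

Swapping in Taylor's stint document gives a different selection from the same untouched base text, which is the whole point of keeping the opinions separate from the transcription.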

By running our XQuery question against the document "werstine-on-comp-B.xml" we are running it against just the parts of the play that Werstine thinks compositor B set in type. You might have anticipated that it would be tedious to write an XInclude statement for each line that Werstine thinks compositor B set, but we don't have to specify individual lines: the procedure works just as well for whole pages and even gatherings, so long as we have identified those elements in the base text.

    Let's remind ourselves what Bowers wanted to be able to ask a computer, so we can plan just what we have to mark up in the base text [SLIDE]:

List for me every time compositor B follows his copy in spelling win as win or winne, every time he changes a copy spelling win to winne, or winne to win, and distinguish in each case what he does in setting prose and setting verse. Then give me all the occurrences of win and winne in texts that he set from manuscript.

[SLIDE] This kind of enquiry requires that we record what the compositor was looking at as his copy text [SLIDE] when setting type: not only whether it was an existing printed book or a manuscript, but also exactly what its readings were in the case of every word. This last requirement is a tall order since it means marking up another document, the copy text, and providing a word-level link between every word in the copy text and every word in the book made from it, so that the departures from copy text spelling can be determined. In the case of Shakespeare, Bowers's first lordly command [SLIDE] must refer only to printed copy because there survive no manuscripts used to set the plays and from which we might recover the copy spellings. We can be sure an edition is a reprint of a preceding edition only in cases where the reprinting is so faithful that it repeats the errors in its copy, and luckily in these cases the two editions will be so alike that a computer can identify for us which word in the earlier edition matches which word in the later. However, by definition such a reprint would not be substantive and hence would be of lesser interest than editions printed directly from manuscripts. [SLIDE] The second of Bowers's lordly commands refers to editions set from manuscripts, but he is careful not to ask for copy spellings since in Shakespeare's case these are unknown: all we know is that the copy was a lost manuscript.
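
As a rough illustration of how a computer might match words between a copy and a faithful reprint, here is a sketch using the sequence matcher in Python's standard library. This is one possible technique, not a method described in the talk. The function name spelling_changes is hypothetical, and the two sample lines are invented for the purpose, not taken from any edition.

```python
from difflib import SequenceMatcher


def spelling_changes(copy_words, reprint_words):
    """Pair up words that differ between copy and reprint.

    Only one-for-one substitutions are reported; insertions, deletions
    and uneven replacements are ignored in this sketch.
    """
    matcher = SequenceMatcher(a=copy_words, b=reprint_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            changes.extend(zip(copy_words[i1:i2], reprint_words[j1:j2]))
    return changes


# Invented sample lines, standing in for a copy and its reprint.
copy = "To win the field we must win all".split()
reprint = "To winne the field we must win all".split()
print(spelling_changes(copy, reprint))  # [('win', 'winne')]
```

The matcher anchors itself on the long runs where the two editions agree, so the occasional respelling stands out as a short mismatch between them.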

    It may be that Werstine and Taylor have differing opinions about the nature of the printer's copy for a book, or certain pages or just certain lines of it. The place to store that information is not the agreed base text but the document recording a scholarly opinion about it, like this [SLIDE]:

BEFORE PROCESSING
<xi:include href="basetext.xml" xpointer="B1r" copy="ms"/>
<xi:include href="basetext.xml" xpointer="B1v" copy="print"/>
(werstine-on-comp-B.xml)

[SLIDE] Unfortunately this doesn't work: the resulting pages do not pick up the copy attribute. We can specify the copy at the beginning of the document "werstine-on-comp-B.xml" so that it covers all of what compositor B is supposed by Werstine to have set, but in fact Werstine might reasonably suppose that compositor B used different kinds of copy in different parts of the book. We can specify the copy within the pages or lines of the base text so that it applies equally to Werstine's and Taylor's analyses, but in fact Werstine and Taylor might reasonably disagree on this point. Chalk up one failure for this method.

    Bowers's reference to different spellings in prose and verse arises from the compositors' art of justification [SLIDE]. When setting prose a compositor would insert additional small spaces between words to push the last word to the end of the line and so produce a smooth right edge to the page, whereas when setting verse he would use regular spaces between words and fill the end of the line with larger ones to give the page a jagged right edge. When the adjustment of the small spaces between words failed to fully justify a line of prose the compositor was free to alter the spellings of the words to get a tight fit, whereas in verse the expanse of space at the end of the line made this expedient unnecessary. Thus in prose--or indeed long verse lines that fill the measure--we cannot assume that the compositor's spelling choices reflect his personal preferences, since he might have resorted to them only to justify the line. In studying compositors' habits, then, it is useful to have a record of whether each line is full. This information is not controversial and we can simply add it as a second attribute of each line in the base text, thus [SLIDE]:

<line n="TLN-1" length="not-full">Bar. WHose there?</line>
<line n="TLN-2" length="full">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
<line n="TLN-3" length="not-full">Bar. Long liue the King,</line>
(basetext.xml)
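
With the length attribute in place, picking out the lines where justification cannot have forced a spelling is a one-line selection. The sketch below does it with Python's standard library rather than XQuery; the enclosing <text> element is my assumption, added so the sample has a root.

```python
import xml.etree.ElementTree as ET

# The three base-text lines shown above, inside a hypothetical <text> root.
BASETEXT = """<text>
<line n="TLN-1" length="not-full">Bar. WHose there?</line>
<line n="TLN-2" length="full">Fran. Nay answere me. Stand and vnfolde your selfe.</line>
<line n="TLN-3" length="not-full">Bar. Long liue the King,</line>
</text>"""

base = ET.fromstring(BASETEXT)

# Lines that do not fill the measure, where spelling is more likely
# to reflect the compositor's own preference.
not_full = base.findall(".//line[@length='not-full']")
print([line.get("n") for line in not_full])  # ['TLN-1', 'TLN-3']
```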

Writing in 1966, Bowers confined himself to compositors' spelling preferences and did not consider psycho-mechanical habits--such as failure to insert spaces after commas in short lines (where justification cannot be the cause)--that T. H. Howard-Hill, MacDonald P. Jackson and Gary Taylor later used to distinguish compositors (Howard-Hill 1973; 1976; Jackson 1975; 1982; 1987; 2001; Taylor 1981). Although it is not demonstrated here, such tests can be incorporated into the present methodology by adding to the base text special characters representing, for example, terminally spaced commas.

    We have now provided enough information to ask the computer pertinent questions. Let me show how all this looks in a real-world example. The earliest surviving (although not necessarily the first) edition of Shakespeare's Love's Labour's Lost is a quarto of 1598. George R. Price (Price 1978, 425) identified three compositors at work in this edition, with the following division of labour by pages set [SLIDE]:

Comp I A2r, A2v, A3r, A4r, A4v, B2r, B3v, C1r, C1v, C2r, C2v, F2r, G4r, G4v, H3r, H3v, H4v

Comp II A3v, B1r, B1v, B2v, B3r, C3r, D1v, D2r, D2v, D3v, D4v, E4r, E4v, F1r, F1v, F2v, G1r, G1v, G2r, G3r, G3v, H1r, H2v, H4r, I1r, I2r, I3r, I3v, I4r, I4v, K1r, K1v

Comp III B4r, B4v, C3v, C4r, C4v, D1r, D3r, D4r, E1r, E1v, E2r, E2v, E3r, E3v, F3r, F3v, F4r, F4v, G2v, H1v, H2r, I1v, I2v, K2r, K2v

Paul Werstine (Werstine 1984, 37-38) also found three compositors at work in this edition, but with a quite different division of labour [SLIDE]:

Comp R B1r

Comp S A2r, A2v, A3r, A3v, F1r, F1v, F2r

Comp T A1r, A1v (blank), A4r, A4v, B1v, B2r, B2v, B3r, B3v, B4r, B4v, C1r, C1v, C2r, C2v, C3r, C3v, C4r, C4v, D1r, D1v, D2r, D2v, D3r, D3v, D4r, D4v, E1r, E1v, E2r, E2v, E3r, E3v, E4r, E4v, F2v, F3r, F3v, F4r, F4v, G1r, G1v, G2r, G2v, G3r, G3v, G4r, G4v, H1r, H1v, H2r, H2v, H3r, H3v, H4r, H4v, I1r, I1v, I2r, I2v, I3r, I3v, I4r, I4v, K1r, K1v, K2r, K2v
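
Page lists like these convert mechanically into stand-off documents, one include per page. The following Python sketch shows one way it might be done for Werstine's compositors R and S. The function name stint_document and the enclosing <stint> element are hypothetical, and the sketch assumes the base text identifies each page by its signature (B1r, A2r and so on).

```python
import xml.etree.ElementTree as ET

# Werstine's page assignments for compositors R and S, as listed above.
WERSTINE_PAGES = {
    "R": ["B1r"],
    "S": ["A2r", "A2v", "A3r", "A3v", "F1r", "F1v", "F2r"],
}


def stint_document(pages, base_href="basetext.xml"):
    """Build a stand-off document with one include per page."""
    includes = "\n".join(
        f'  <xi:include href="{base_href}" xpointer="{page}"/>' for page in pages
    )
    return (
        '<stint xmlns:xi="http://www.w3.org/2001/XInclude">\n'
        f"{includes}\n"
        "</stint>"
    )


doc_s = stint_document(WERSTINE_PAGES["S"])
root = ET.fromstring(doc_s)  # parse it back to confirm it is well-formed
print(len(root))  # 7
```

The same function would serve for Price's three compositors, so each scholar's whole analysis reduces to a small table of page signatures.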

The first step is to find an electronic text of this early edition, and luckily [SLIDE] Michael Best's Internet Shakespeare Editions has one. We could get it already marked up with the tags Michael uses, but we want just the raw words. CTRL-a grabs it for us and after trimming a bit of guff off the top and bottom we have a plain text file of lines [SLIDE]. As you can see [SLIDE], there are some unwanted line numbers in it and further down there are some blank lines to delete. A small program (written in the language Perl) [SLIDE] will chop those out and wrap the tags around each line to make it a line element with the attributes linenumber and length, the latter set by default to 'not-full'. [SLIDE] Here's how the text looks after we've done that. All that remains to make this a valid base text is manually to add tags marking the book's sheets and pages and to set the length attribute to 'full' where necessary. [SLIDE] This we can do in an XML editor, here Oxygen, using a facsimile of the edition as a crib. This isn't terribly time consuming: the whole play took me about two hours.
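
The Perl program itself is not reproduced here. As a stand-in, this Python sketch performs the steps just described under some assumptions of mine: that an unwanted line number appears as digits followed by a space at the start of a line, and that lines are numbered by a running count prefixed with an underscore. The function name to_line_elements is hypothetical, and the leading numbers in the sample are invented for illustration. Ampersands are escaped on the way, since XML reserves that character.

```python
import re
from xml.sax.saxutils import escape

# Assumption: an unwanted line number is leading digits plus whitespace.
LEADING_NUMBER = re.compile(r"^\d+\s+")


def to_line_elements(raw_text: str) -> list[str]:
    """Strip line numbers and blank lines, then wrap each line in tags."""
    elements = []
    count = 0
    for raw in raw_text.splitlines():
        text = LEADING_NUMBER.sub("", raw.strip())
        if not text:
            continue  # drop blank lines
        count += 1
        elements.append(
            f'<line linenumber="_{count}" length="not-full">{escape(text)}</line>'
        )
    return elements


sample = (
    "1 Ber. Things hid & bard (you meane) from cammon sense.\n"
    "\n"
    "2 Lon. He weedes the corne, & still lets grow the weeding.\n"
)
for element in to_line_elements(sample):
    print(element)
```

Like any rule this crude, it would also strip a genuine numeral that happens to begin a line, so the output still needs checking against the facsimile.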

    Oxygen has an XQuery processor built in, so directly inside this editor we can interrogate the documents that represent Price's and Werstine's beliefs about which compositor set which part and thus we can give Bowers's lordly commands. Here is an XQuery asking for the full-length lines Werstine thinks were set by compositor S, together with the result it produces [SLIDE]:

QUERY: doc("werstine's-comp-S.xml")//line[@length="full"]

<?xml version="1.0" encoding="UTF-8"?>
<line xmlns:xi="http://www.w3.org/2001/XInclude" linenumber="_13" length="full">LET Fame, that all hunt after in their lyues, </line>
<line xmlns:xi="http://www.w3.org/2001/XInclude" linenumber="_14" length="full">Liue registred vpon our brazen Tombes, </line>
<line xmlns:xi="http://www.w3.org/2001/XInclude" linenumber="_68" length="full">Ferd. Why that to know which else we should not know. </line>
. . .
<line xmlns:xi="http://www.w3.org/2001/XInclude" linenumber="_1569" length="full">Duma. Darke needes no Candles now, for darke is light. </line>
<line xmlns:xi="http://www.w3.org/2001/XInclude" linenumber="_1584" length="full">King. Then leaue this chat, and good Berowne now proue </line>

Because the line numbers are shown, we can check that the query is doing what we think it is doing. Then, the XQuery can be tweaked to get just the words in the lines without the surrounding XML tags [SLIDE]:

QUERY: doc("werstine's-comp-S.xml")//line[@length="full"]/text()

<?xml version="1.0" encoding="UTF-8"?>LET Fame, that all hunt after in their lyues, Liue registred vpon our brazen Tombes, Ferd. Why that to know which else we should not know. Ber. Things hid &amp; bard (you meane) from cammon sense. Lon. He weedes the corne, &amp; still lets grow the weeding. Ber. The Spring is neare when greene geese are a bree- Bero. Well, say I am, why should proude Sommer boast, Bero. No my good Lord, I haue sworne to stay with you. Fer. How well this yeelding rescewes thee from shame. Ber. Item, That no woman shall come within a myle of Ber. Lets see the penaltie. On payne of loosing her tung. Item, Yf any man be seene to talke with a woman within the tearme of three yeeres, he shall indure such publibue Ferd. What say you Lordes? why, this was quite forgot. In pruning mee when shall you heare that I will prayse a hand, a foote, a face, an eye: a gate, a state, a brow, a brest, Ber. A toy my Leedge, a toy: your grace needs not feare it. Long. It did moue him to passion, &amp; therfore lets heare it. Berow. Ah you whoreson loggerhead, you were borne to Ber. That you three fooles, lackt me foole, to make vp the Bero. True true, we are fower: will these turtles be gon? Clow. Walke aside the true folke, and let the traytors stay. King. What, did these rent lines shew some loue of thine? Ber. Did they quoth you? Who sees the heauenly Rosaline, Duma. Darke needes no Candles now, for darke is light. King. Then leaue this chat, and good Berowne now proue

This block of text is suitable for pasting into a word-frequency counter and thence into, say, a spreadsheet for analysis. Repeating the XQuery for full lines set by compositor T, for example, and counting the resulting words' frequencies enables rapid comparison of the kinds of spelling preferences that bibliographers are interested in [SLIDE]. XQuery has a number of powerful features that I don't have time to show, but it is easy, for example, to have it pull out the lines in the order they were set (by specifying in the query the page order of setting) or just the lines on an inner forme.
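
A word-frequency counter of the kind meant here needs only a few lines. This Python sketch is one possible version, not the tool used for the talk; the function name word_frequencies is hypothetical. It folds everything to lower case and treats any run of letters as a word, so speech prefixes are counted along with the dialogue and a word hyphenated at a line-end is counted as a fragment.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count each run of letters, ignoring case and punctuation."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


# The last two full lines from the query result above.
full_lines = (
    "Duma. Darke needes no Candles now, for darke is light. "
    "King. Then leaue this chat, and good Berowne now proue"
)
freq = word_frequencies(full_lines)
print(freq["darke"], freq["now"])  # 2 2
```

Running it once per compositor and setting the counts side by side gives the raw material for comparing spelling preferences.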

    I'll conclude with a survey of the limitations of the above approach [BLANK SLIDE]. The greatest is that one cannot attach fresh attributes (such as statements about the printer's copy) to the individual XInclude lines in the documents representing the bibliographers' opinions: such attributes have to be either global for the compositor stint or encoded (globally or locally) into the base text. My Perl script for removing the base text's line numbers needs improvement: it currently also strips the numbers from the speech prefixes "1 La[dy]" and "2 La[dy]" if they begin a line and it simply misses some of the line numbers. The Internet Shakespeare Editions transcriptions of early editions have certain oddities, such as placing a turned-up line-ending below rather than above the line that it completes. The transcriptions correctly represent the early editions' habit of breaking a word across a line or even a page boundary, such as elegance beginning on E1v and ending on E2r in Q1 Love's Labour's Lost. Even if such a word were begun by one compositor and finished by another (which seems unlikely), for most analyses it would make little difference if we arbitrarily ruled that all words belong with the page and line on which they began. XML documents may not contain lone ampersands as this character is reserved for special purposes; Q1 Love's Labour's Lost has dozens of them (standing for and) and they must be manually altered to XML's code for an ampersand.

    Lastly, an obvious objection is that the present work fails to conform to the guidelines of the Text Encoding Initiative (TEI), which aims to provide an agreed standard for XML markup of literary works. Although TEI is acquiring techniques for representing physical documents--notably in the draft proposals for Genetic Editions--it has long privileged representation of the intellectual content of a work over representation of its material embodiment. The only existing TEI-conformant transcriptions of early editions of Shakespeare are those of the Text Creation Partnership (TCP), and they exhibit this bias against the material form; prose speeches, for example, are encoded as undivided paragraphs within <P> . . . </P> tags. My experiments suggest that one can quickly produce useful results by developing one's own XML conventions and applying them to readily available untagged electronic texts. This gives me hope that alongside the large collaborative projects from which we have all been benefitting--including Internet Shakespeare Editions, the Text Creation Partnership and the Shakespeare Quartos Archive--there remains a place in the digital future for lone scholars 'rolling their own' applications.

Works Cited

Bowers, Fredson. 1966. On Editing Shakespeare. Second edition. Charlottesville. University of Virginia Press.

Howard-Hill, T. H. 1973. "The Compositors of Shakespeare's Folio Comedies." Studies in Bibliography 26. 61-106.

Howard-Hill, T. H. 1976. Compositors B and E in the Shakespeare First Folio and Some Recent Studies. Columbia SC. Published privately by the author.

Jackson, MacDonald P. 1975. "Punctuation and the Compositors of Shakespeare's Sonnets, 1609." The Library (=Transactions of the Bibliographical Society). 5th series (=3rd of the Transactions of the Bibliographical Society) 30. 1-24.

Jackson, MacDonald P. 1982. "Two Shakespeare Quartos: Richard III (1597) and 1 Henry IV (1598)." Studies in Bibliography 35. 173-90.

Jackson, MacDonald P. 1987. "Compositors' Stints and the Spacing of Punctuation in the First Quarto (1609) of Pericles." Papers of the Bibliographical Society of America 81. 17-23.

Jackson, MacDonald P. 2001. "Finding the Pattern: Peter Short's Shakespeare Quartos Revisited." Bibliographical Society of Australia and New Zealand Bulletin 25. 67-86.

McGillivray, Murray. 2010. "Ian Lancashire's Two Muses: A Belated Reply." Electronic Publishing: Politics and Pragmatics. Edited by Gabriel Egan. New Technologies in Medieval and Renaissance Studies. Toronto. Medieval and Renaissance Texts and Studies (MRTS) and ITER. 121-36.

Price, George R. 1978. "The Printing of Love's Labour's Lost (1598)." Papers of the Bibliographical Society of America 72. 405-34.

Taylor, Gary. 1981. "The Shrinking Compositor A of the Shakespeare First Folio." Studies in Bibliography 34. 96-117.

Werstine, Paul. 1984. "The Editorial Usefulness of Printing House and Compositor Studies: Reprinted from Analytical and Enumerative Bibliography 2 (1978): 153-165 with a New Afterword." Play-texts in Old Spelling: Papers from the Glendon Conference. Edited by G. B. Shand and Raymond C. Shady. New York. AMS. 35-64.