The born-XML Shakespeare edition: The view from the New Oxford Shakespeare

"The born-XML Shakespeare edition: The view from the New Oxford Shakespeare" by Gabriel Egan

The New Oxford Shakespeare, of which I am a General Editor, is a six-volume Complete Works edition of the plays and poems of William Shakespeare. Four volumes have appeared so far [SLIDE]:

1) Complete Works: Modern Critical Edition (in modern spelling)

2) Complete Works: Critical Reference Edition Part 1 (in original spelling)

3) Complete Works: Critical Reference Edition Part 2 (in original spelling)

4) Authorship Companion

The four volumes came out in 2016-17 and the aspect that drew most attention at the time was our claims about the size and shape of the Shakespeare canon in relation to co-authorship. We claim that 16 of Shakespeare's plays contain writing by other authors. It is a relief not to talk about that on this occasion and instead discuss the digital aspects of establishing the text of a new critical edition.

In order to create the edited texts of the works for these volumes, editors were required to submit to the publisher Word documents in which their texts were marked up in something called Jowett Markup Language (JML). Here is a typical example [SLIDE]:

#<S1 Mar.><S2 Marcellus> Tis gone and makes no answer.
<S1 2.><S2 Second Sentinel> How now H{oratio}, you <ES>tremble and looke pale,
Is not this</ES> something more than fantasie?
+What thinke you on't?
=<S1 Hor.><S2 Horatio> <EL>Afore my God, I might not this beleeue,
Without the sensible and true auouch of my owne eyes.</EL>
#<S1 Mar.><S2 Marcellus> Is it not like the King?

If you haven't seen a markup language before, the way to understand this is as lines of Shakespeare [SLIDE], that a human reads, interlarded with code that is only meant to be read by a computer [SLIDE]. The idea is that the computer code controls how the Shakespeare words are presented to the reader in the printed version and any digital version.

The codes written for the computer to read constitute the markup language, and this particular language is named Jowett Markup Language (or JML) after the Shakespeare scholar John Jowett. JML evolved over time and it has aspects of an older language, called SGML, such as paired tags in pointy brackets, here [SLIDE] "<EL>" and "</EL>", which enclose a pair of lines that form the lemma for an emendation of the play's lineation. But JML also has elements of an even older convention from the markup language COCOA in which a value that is set by one tag is deemed to remain in force unless altered by a later tag. Thus (although it is not present in this example), an "@" symbol indicates the start of a prose passage and all succeeding text is deemed to be in prose until a "*" character is encountered, at which point all succeeding text is deemed to be in verse.

[SLIDE] In the example shown, unassimilated part-lines in verse are preceded by a hash (or pound) character, the first half of a verse line that is split between two speakers is preceded by a "+" character and the second half is preceded by an "=" character. Aside from the <EL> element that follows the SGML convention, a single pair of pointy brackets is also used to enclose a whole element, so that [SLIDE] "<S1 Mar.>" is not a tag but an entire element, of type S1 (which happens to mean "a speech prefix as it appears in the early printed text"), and with the content "Mar.". [SLIDE] Additionally, curly braces are used to enclose content in italics. It is often complained that markup languages make it hard for humans to read the literary-historical documents that they are used to mark up. I think you'll agree that this particular piece of encoding is especially difficult, not least because the meaning of the symbols is unintuitive.
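The conventions just described can be made concrete with a minimal Python sketch. It is not part of the New Oxford Shakespeare tooling: the function name `parse_jml_line`, the label strings, and the assumption that a line-prefix character is always the first character of a line are illustrative guesses based only on the description above.

```python
import re

# Hypothetical labels for the JML line-prefix characters described in the talk.
LINE_KINDS = {
    "#": "part-line",
    "+": "split-line first half",
    "=": "split-line second half",
}

# Matches whole elements such as <S1 Mar.>: a type (S1 or S2), a space, then content.
WHOLE_ELEMENT = re.compile(r"<(S[12]) ([^>]*)>")


def parse_jml_line(line):
    """Split one JML line into its kind, its whole elements, and the remaining text."""
    kind = "unmarked"
    if line[:1] in LINE_KINDS:
        kind = LINE_KINDS[line[0]]
        line = line[1:]
    # findall returns (type, content) pairs, e.g. ("S1", "Mar.")
    elements = dict(WHOLE_ELEMENT.findall(line))
    # Paired tags such as <EL>...</EL> are deliberately left in the text.
    text = WHOLE_ELEMENT.sub("", line).strip()
    return {"kind": kind, "elements": elements, "text": text}


parsed = parse_jml_line("#<S1 Mar.><S2 Marcellus> Tis gone and makes no answer.")
print(parsed["kind"], parsed["elements"], parsed["text"])
```

Even this small sketch has to hard-code what each symbol means, which illustrates how little of that meaning is visible in the markup itself.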

[SLIDE] For the New Oxford Shakespeare, documents encoded in JML were prepared by editors using Microsoft Word. For a whole play they would at the minimum consist of a text like the one shown, supported by a second document containing the actual notes for the emendations of lineation indicated by the <EL> element and a third document containing the actual notes for the substantive emendations indicated in the main text using an <ES> element. There was no automated linking between these documents, and because they were made in Microsoft Word there was no means of ensuring that editors had in fact conformed to the JML standard or that the markup was internally self-consistent. There was, for example, no mechanism to ensure that an editor did not follow a "switch to prose" tag with another "switch to prose" tag, or designate both parts of a split verse line as the first part.
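The two kinds of inconsistency just mentioned are easy to detect mechanically. The following Python sketch shows the sort of check that was missing. It is hypothetical: the function name `check_jml` is invented, and it assumes that "@", "*", "+" and "=" always appear as the first character of a line, which the description of JML given here does not confirm.

```python
def check_jml(lines):
    """Return (line_number, message) pairs for simple JML inconsistencies."""
    problems = []
    mode = None        # prose/verse state is unknown until the first switch
    open_split = None  # line number of a "+" still waiting for its "="
    for number, line in enumerate(lines, start=1):
        marker = line[:1]
        if marker in ("@", "*"):
            new_mode = "prose" if marker == "@" else "verse"
            if new_mode == mode:
                problems.append((number, f"redundant switch to {new_mode}"))
            mode = new_mode
        elif marker == "+":
            if open_split is not None:
                problems.append((number, "second '+' before the previous split line was completed"))
            open_split = number
        elif marker == "=":
            if open_split is None:
                problems.append((number, "'=' without a preceding '+'"))
            open_split = None
    if open_split is not None:
        problems.append((open_split, "'+' never completed by '='"))
    return problems


consistent = ["#Tis gone and makes no answer.", "+What thinke you on't?", "=Afore my God"]
inconsistent = ["@some prose", "@more prose", "+first half", "+first half again"]
print(check_jml(consistent))
print(check_jml(inconsistent))
```

A check of this kind needs only a few lines of code, but it cannot run inside a word processor, which is why such errors could go unnoticed.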

When I joined the New Oxford Shakespeare project the first four volumes were in an advanced state of preparation, led by experienced general editors who had between them literally decades of experience using Jowett Markup Language and no experience of any other markup language. They were well aware of the problems arising from JML, which may be summarized as [SLIDE]:

JML had to be taught to the associate and assistant editors

JML is used only for Oxford University Press projects

JML supports no kinds of error checking for self-consistency or adherence to a set of rules

JML doesn't allow editors to make their own local proofs to overcome the problem that marked-up texts are almost unreadable by humans

Work on the final two volumes of the New Oxford Shakespeare had not begun when I joined [SLIDE]:

5) Complete Alternative Versions: Modern Critical Edition (in modern spelling)

6) Complete Alternative Versions: Critical Reference Edition (in original spelling)

These volumes were going to be made according to the same editorial principles as the Complete Works volumes but from different copy texts, the ones not used as copy texts for the Complete Works. That is, where Shakespeare works come down to us in competing early editions of roughly equal authority--as with the quarto and Folio editions of King Lear and the first quarto, second quarto, and Folio editions of Hamlet--the Complete Works chose just one of those editions as its basis. (It chose the longest one.) The Complete Alternative Versions volumes will contain the other versions not represented in the Complete Works volumes, again presented in original spelling and in modern spelling.

At my exhortation, it was decided to complete the New Oxford Shakespeare using Text Encoding Initiative eXtensible Markup Language (or TEI-XML) instead of JML. TEI-XML is the world standard, approved by the International Standards Organization, for text markup languages, used on many projects. [SLIDE] TEI-XML should address all four of the limitations of JML: there are many resources for learning it, learning it is worthwhile because the skill is transferable to other projects, it has automated error-checking built in, and it allows editors to make their own proofs.

But TEI-XML is not a standard hitherto used by Oxford University Press, so the Press has had to implement a new workflow for accepting texts from editors. This was a large financial commitment from the publisher, but since they had been thinking about accepting TEI-XML files from editors of scholarly editions they were able to fund this as a pilot project. Not the least expense was paying for an exceptionally complicated piece of software called an eXtensible Stylesheet Language Transformation (XSLT) that allows the editor working on a play to generate from their draft encoded text a proof of how the play will look when finally printed, complete with line-numbering and properly formatted emendation and glossarial notes.

Using TEI-XML meant editors learning new concepts. Whereas a JML document is essentially a flat structure, nesting is used in XML to represent a tree structure. Thus encoding like this [SLIDE]:

    <div n="1" type="act">
            <div n="1" type="scene">
            [ACT ONE SCENE ONE CONTENT HERE]
            </div>
            <div n="2" type="scene">
            [ACT ONE SCENE TWO CONTENT HERE]
            </div>
    </div>
    <div n="2" type="act">
            <div n="1" type="scene">
            [ACT TWO SCENE ONE CONTENT HERE]
            </div>
            <div n="2" type="scene">
            [ACT TWO SCENE TWO CONTENT HERE]
            </div>
    </div>
    . . .
    <div n="5" type="act">
            <div n="1" type="scene">
            [ACT FIVE SCENE ONE CONTENT HERE]
            </div>
    </div>

in which scene elements appear inside act elements is actually a way of expressing this tree structure [SLIDE of DOM]. In XML, every element that is marked up (apart from the single outermost one) must be wholly enclosed inside another element, so that lines are wholly inside speeches, speeches are wholly inside scenes, and scenes are wholly inside acts. The notion of nestedness is one that editors are often ready to accept in principle but find themselves resisting in practice. To see why, let us go back to that sample of JML [SLIDE]:

#<S1 Mar.><S2 Marcellus> Tis gone and makes no answer.
<S1 2.><S2 Second Sentinel> How now H{oratio}, you <ES>tremble and looke pale,
Is not this</ES> something more than fantasie?
+What thinke you on't?
=<S1 Hor.><S2 Horatio> <EL>Afore my God, I might not this beleeue,
Without the sensible and true auouch of my owne eyes.</EL>
#<S1 Mar.><S2 Marcellus> Is it not like the King?

Here an emendation marked by using the element <ES> spans two verse lines and would not ordinarily be permissible in XML if each verse line were also marked up using its own element, such as <l> for line as is usual in TEI [SLIDE]. Those of you familiar with XML will recognize this as the problem of overlapping hierarchies [SLIDE]: the <ES> element here cuts across the line element, and that is not allowed. There are various kinds of workaround for problems like this, but these can strike editors as so intellectually dishonest that they find themselves yearning for the freedom of JML, despite all its flaws, because it allows them to avoid this dishonesty.
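Both the nesting rule and the overlap problem can be demonstrated with Python's standard XML parser. This is a sketch under stated assumptions, not the project's encoding: `<div>`, `<sp>` and `<l>` are TEI element names, but `<ES>` is simply carried over from JML for illustration, and the helper names `is_well_formed` and `outline` are invented. The first sample uses one of the workarounds alluded to above, splitting the emendation into two fragments, one per line.

```python
import xml.etree.ElementTree as ET

# Properly nested: each <ES> fragment sits wholly inside one <l>.
NESTED = """
<div type="act" n="1">
  <div type="scene" n="1">
    <sp>
      <l>How now Horatio, you <ES>tremble and looke pale,</ES></l>
      <l><ES>Is not this</ES> something more than fantasie?</l>
    </sp>
  </div>
</div>
"""

# Overlapping: <ES> opens in one <l> and closes in the next.
OVERLAPPING = """
<sp>
  <l>How now Horatio, you <ES>tremble and looke pale,</l>
  <l>Is not this</ES> something more than fantasie?</l>
</sp>
"""


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def outline(element, depth=0):
    """Return indented labels showing which element contains which."""
    label = " ".join([element.tag, element.get("type", ""), element.get("n", "")]).strip()
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


print(is_well_formed(NESTED))
print(is_well_formed(OVERLAPPING))
print("\n".join(outline(ET.fromstring(NESTED))))
```

The parser rejects the overlapping sample outright because the closing `</l>` arrives while `<ES>` is still open. The outline printed for the nested sample is the tree that the act, scene, speech and line elements express.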

[SLIDE] Let us take a real-world example in which an error in one line of a text seems connected to an error in another line, tempting an editor to insist that they must be treated together as a single error affecting both lines, and hence that the lemma for the emendation note should begin part-way into the first line and end part-way into the second. This is a moment from the 1623 Folio edition of Shakespeare's King Lear [SLIDE read it]:

when a wiseman giue thee better counsell giue me mine
againe, I would hause none but knaues follow it
(King Lear 1623 sig. rr1r)

We would expect the first line to read "when a wiseman giues thee better counsel" and we would expect the second line to read "I would haue none". It is easy to suppose that these errors are connected [SLIDE], as Gary Taylor did when he wrote the textual notes for this moment in the Oxford Complete Works of Shakespeare in 1986-87. [SLIDE] Taylor suggested that perhaps "the intrusive 's' has slipped from the end of F's 'giues' in the line above--which in F's uncorrected state is 'giue'". I think Taylor means that the piece of type for the letter 's' literally slipped down under the force of gravity from the first line to the second.

This is not possible since type is set upside down with each successive line placed on top of the previous one, so this would be a slip upwards against the force of gravity. But that is not the important point here: what matters is that despite his hunch that the errors were connected, Taylor marked these as two distinct errors not one composite error. Taylor rightly reported that two words on different lines are wrong and gave them two independent notes. If he were dealing with this problem in a TEI-XML edition of the play, Taylor would not be tempted to cut across the Orderly Hierarchy that XML imposes, since he is already thinking in terms of independent lines.
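Two independent notes of this kind fit the hierarchy without strain, as the following Python sketch suggests. It is one possible encoding and not necessarily the one used in the New Oxford Shakespeare: `<choice>`, `<sic>` and `<corr>` are TEI elements for recording an error alongside its correction, each Folio line is wrapped in an `<l>` element purely for illustration, and the function name `emendations` is invented.

```python
import xml.etree.ElementTree as ET

# Each correction is wholly contained in its own line element.
LEAR = """
<sp>
  <l n="1">when a wiseman <choice><sic>giue</sic><corr>giues</corr></choice> thee better counsell giue me mine</l>
  <l n="2">againe, I would <choice><sic>hause</sic><corr>haue</corr></choice> none but knaues follow it</l>
</sp>
"""


def emendations(xml_text):
    """List (line number, erroneous reading, corrected reading) for every correction."""
    root = ET.fromstring(xml_text)
    notes = []
    for line in root.iter("l"):
        for choice in line.iter("choice"):
            notes.append((line.get("n"), choice.findtext("sic"), choice.findtext("corr")))
    return notes


for number, wrong, right in emendations(LEAR):
    print(f"line {number}: {wrong} -> {right}")
```

Because neither correction crosses a line boundary, the text parses as well-formed XML and each note can be retrieved together with the line it belongs to.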

What do I conclude from this? It is often objected that the Orderly Hierarchy model of text that XML imposes runs counter to how written language actually works. A common example that is used is the case of an author revising a text and choosing to move a word from one line to another. How, it is asked, can you create markup that describes or discusses such an authorial revision if you are committed to each textual note being contained within its own line and not cutting across a line boundary? The answer I would give is that, at least for handwritten documents, we are wrong to say that "the author moved a word from one line to another".

In a typeset document it is possible for the letters forming a word to move from one line to another, although I don't think that happened in this case in King Lear. But it is literally impossible for an author inscribing ink on paper to move letters and words in this way. What we really mean when we say that an author moved a word is that the author crossed out or erased a word in one line and then wrote another copy of the same word in the second line. And that is two actions, not one. My growing experience with the New Oxford Shakespeare is that being forced to think in terms of lines, being forced to respect the hierarchy of a text, is a useful discipline for the editor.