"Where do/doe we go/goe from here/heere?: Computational methods in compositorial studies of early printed Shakespeare editions" by Brett D. Hirsch and Gabriel Egan

When creating an electronic text of an early printed book for the purpose of studying it with computers, it is difficult to know just what to record and what to ignore. The dominant encoding standard for XML markup of literary works is the Text Encoding Initiative (TEI) guidelines, of which the current version is P5. These guidelines offer methods for marking up a literary work by its units of artistic section and subsection--in the case of a play that would be acts, scenes, speeches and lines--or for marking the work up by the units of its physical embodiment in a document, say as gatherings, sheets, leaves and pages. It is notoriously difficult to combine both kinds of markup in an electronic text because they represent what are known as competing hierarchies. A play's acts consist of scenes, its scenes consist of speeches, and its speeches consist of lines, so that each subsection is wholly contained within its parent section. Likewise, all printed gatherings consist of sheets and all sheets consist of leaves and all leaves consist of pages, so again the Russian doll or Chinese box principle is maintained. But, crucially, these two hierarchies are at odds: a speech may easily sprawl across two pages, a scene may sprawl across two gatherings, and so on. The TEI guidelines offer no simple way to combine competing hierarchies in one document, because the XML standard on which TEI is based offers no simple solution to this problem.

The problem of competing hierarchies gets even worse when one wishes to record competing scholarly opinions about literary works and their physical embodiment. The opinions we are concerned with in this paper are about the working stints of the men called compositors who set the type of the early editions of Shakespeare, and we will take as our example the 1623 First Folio collection. The work of setting this book's 899 pages of type did not fall to one compositor. Thomas Satchell was the first to notice that in the Folio text of Macbeth each of 35 words is spelt one way in the first half of the play (for example, doe and goe) and another in the second half (do and go), and he wondered if this was because each of two compositors imposed his own spelling preferences as he worked (Satchell 1920). Satchell was unable, however, to eliminate the alternative possibility that Folio Macbeth was set from two manuscripts in which these spellings differed. Edwin Eliott Willoughby took the most discriminating five of Satchell's 35 words, added one of his own, and extended the search to Folio plays beyond Macbeth (Willoughby 1932, 54-60) to establish that at least two compositors, A and B, set the Folio, and probably two more as well.

Willoughby's work on the problem was avowedly incomplete, and since then a series of compositorial studies has developed ever more sophisticated tests for distinguishing between compositors not only by their spelling habits, but also by their habits in setting punctuation and the placement of speech prefixes and stage directions. By the latest count there may have been as many as nine compositors working on the Folio (Taylor 1981), although the total is disputed. Subjectivity enters the problem because when one notices a variation in a certain habit in the text--say, a run of several pages where dressed, blessed, and pressed are consistently spelt drest, blest, and prest--one has to decide whether this is evidence of a previously undetected compositor taking over the work or merely the consequence of a known compositor exhibiting a new and perhaps temporary preference. Just how variable should we suppose one compositor might be in his spelling habits? Once scholars got past Satchell's neat distinction between doe/goe and do/go the habits became less clear-cut and the number of supposed compositors rose. According to D. F. McKenzie, compositors could be highly inconsistent in their habits and hence existing distinctions between compositors based on assumptions of consistency are probably wrong (McKenzie 1984).

Suspicion about the methods for distinguishing compositors grew when two scholars independently examining a single early edition of Shakespeare came to different conclusions about its typesetting. George R. Price and Paul Werstine independently studied the first surviving edition of Love's Labour's Lost and agreed on three compositors, but disagreed about which parts each worked on (Price 1978; Werstine 1984). To put the whole topic on a firmer footing, the present authors propose over the next few years to replicate, in chronological order, each of the previous studies of Shakespearian compositorial stints to see if the claimed patterns of spelling, punctuation, and spacing do indeed form distinct clusters that we may confidently attribute to compositors applying their own habits to the texts as they worked. The first step is to represent in a computer not only the early editions and their physical structure, but also the competing scholarly hypotheses about their creation. This paper is about that first step.

Trying to represent in one document the competing scholarly views about compositorial stints inevitably leads to the clash of overlapping hierarchies. We can illustrate why using the first few lines of the first play in the Shakespeare Folio, The Tempest. Let us suppose that Paul Werstine thinks compositor A set the first two lines of The Tempest and compositor B the rest of the scene. We might encode thus:

<werstine-stint comp="A">
  <line n="TLN-1">A tempestuous noise of Thunder and Lightning heard: En-</line>
  <line n="TLN-2">ter a Ship-master, and a Boteswaine.</line>
</werstine-stint>
<werstine-stint comp="B">
  <line n="TLN-3">Master</line>
  <line n="TLN-4">BOte-swaine</line>
  <line n="TLN-5">Botes. Heere Master: What cheere?</line>
      . . .
  <line n="TLN-78">faine dye a dry death. Exit.</line>
</werstine-stint>

Notice that we are recording in the <line> element the bibliographical line of type, not the artistic line. Thus the first line ends "En-" even though this is merely the beginning of a word "Enter" that is split across two physical lines of type. Of the two hierarchies--the linguistic and the physical--we here privilege the physical. The result is acceptable (in the jargon, well-formed) XML: each line is wholly contained within one of the two stints as determined by Werstine. Now suppose that Gary Taylor disagrees with Werstine, thinking that compositor A set the first three (rather than two) lines of the scene and compositor B the rest. We might encode thus:

<taylor-stint comp="A">
  <line n="TLN-1">A tempestuous noise of Thunder and Lightning heard: En-</line>
  <line n="TLN-2">ter a Ship-master, and a Boteswaine.</line>
  <line n="TLN-3">Master</line>
</taylor-stint>
<taylor-stint comp="B">
  <line n="TLN-4">BOte-swaine</line>
  <line n="TLN-5">Botes. Heere Master: What cheere?</line>
      . . .
  <line n="TLN-78">faine dye a dry death. Exit.</line>
</taylor-stint>

This too is well-formed XML: each line is wholly contained within one of the two stints as determined by Taylor. Each of these two hierarchies may exist perfectly well within its own XML document, but if we try to make them co-exist in a single document a problem emerges:

<werstine-stint comp="A">
  <taylor-stint comp="A">
    <line n="TLN-1">A tempestuous noise of Thunder and Lightning heard: En-</line>
    <line n="TLN-2">ter a Ship-master, and a Boteswaine.</line>
</werstine-stint>
<werstine-stint comp="B">
    <line n="TLN-3">Master</line>
  </taylor-stint>
  <taylor-stint comp="B">
    <line n="TLN-4">BOte-swaine</line>
    <line n="TLN-5">Botes. Heere Master: What cheere?</line>
      . . .
    <line n="TLN-78">faine dye a dry death. Exit.</line>
  </taylor-stint>
</werstine-stint>

We have broken the Russian-doll/Chinese-box principle, or as they say in XML we have created overlapping hierarchies: instead of being wholly inside Werstine's compositor A stint, Taylor's compositor A stint is not 'closed' before Werstine's compositor A stint is closed and Werstine's compositor B stint is opened. It appears that we cannot make a single representation of the opening of the Folio that contains at once Werstine's and Taylor's hypotheses about its typesetting.
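The difference between the nested and the overlapping encodings can be checked mechanically. What follows is a minimal sketch in Python, not part of the project's own toolchain: it hands Werstine's encoding and the combined encoding to the standard library's XML parser and reports whether each is well-formed. The `<play>` wrapper element and the function name `is_well_formed` are our assumptions, introduced only so that each snippet has a single root and can be parsed.

```python
import xml.etree.ElementTree as ET

# Werstine's two stints on their own. The <play> wrapper is an assumption,
# added only because an XML document needs a single root element.
WERSTINE_ONLY = """<play>
<werstine-stint comp="A">
  <line n="TLN-1">A tempestuous noise of Thunder and Lightning heard: En-</line>
  <line n="TLN-2">ter a Ship-master, and a Boteswaine.</line>
</werstine-stint>
<werstine-stint comp="B">
  <line n="TLN-3">Master</line>
  <line n="TLN-4">BOte-swaine</line>
  <line n="TLN-5">Botes. Heere Master: What cheere?</line>
</werstine-stint>
</play>"""

# Both scholars' stints forced into one document: Taylor's stint A is still
# open when Werstine's stint A closes, so the elements overlap.
COMBINED = """<play>
<werstine-stint comp="A">
  <taylor-stint comp="A">
    <line n="TLN-1">A tempestuous noise of Thunder and Lightning heard: En-</line>
    <line n="TLN-2">ter a Ship-master, and a Boteswaine.</line>
</werstine-stint>
<werstine-stint comp="B">
    <line n="TLN-3">Master</line>
  </taylor-stint>
  <taylor-stint comp="B">
    <line n="TLN-4">BOte-swaine</line>
    <line n="TLN-5">Botes. Heere Master: What cheere?</line>
  </taylor-stint>
</werstine-stint>
</play>"""


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print("Werstine alone:", is_well_formed(WERSTINE_ONLY))
print("Werstine and Taylor combined:", is_well_formed(COMBINED))
```

The parser gives up on the combined text at the first closing tag that does not match the most recently opened element, which is exactly the point at which the two hierarchies cross.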

The solution is to have separate documents representing Werstine's and Taylor's views, but we do not want to create multiple transcriptions of the Folio since any improvements or corrections to it would have to be made multiple times. We should instead store in one document the base text, with the markup that everyone agrees upon, and keep the scholars' competing views of it elsewhere. This approach is called standoff markup. Here are snippets from the five documents needed for the beginning of The Tempest:

<line n="TLN-1">A tempestuous noise of Thunder and Lightning heard: En-</line>
<line n="TLN-2">ter a Ship-master, and a Boteswaine.</line>
<line n="TLN-3">Master</line>
<line n="TLN-4">BOte-swaine</line>
<line n="TLN-5">Botes. Heere Master: What cheere?</line>
(basetext.xml)

<xi:include href="basetext.xml" xpointer="TLN-1"/>
<xi:include href="basetext.xml" xpointer="TLN-2"/>
(werstine-comp-A.xml)

<xi:include href="basetext.xml" xpointer="TLN-3"/>
<xi:include href="basetext.xml" xpointer="TLN-4"/>
<xi:include href="basetext.xml" xpointer="TLN-5"/>
(werstine-comp-B.xml)

<xi:include href="basetext.xml" xpointer="TLN-1"/>
<xi:include href="basetext.xml" xpointer="TLN-2"/>
<xi:include href="basetext.xml" xpointer="TLN-3"/>
(taylor-comp-A.xml)

<xi:include href="basetext.xml" xpointer="TLN-4"/>
<xi:include href="basetext.xml" xpointer="TLN-5"/>
(taylor-comp-B.xml)

Notice that the base text contains only the uncontroversial line information for the first five lines of the play. The document giving Werstine's view on compositor A's setting of those lines identifies the first two lines of the base text, and the document giving his view of compositor B's setting identifies the third, fourth and fifth lines. The document giving Taylor's view of compositor A's setting picks out the first three lines of the base text, and the document giving his view of compositor B's setting picks out the fourth and fifth.

The point of encoding the scholars' views in this form is that although there exist few tools for working with standoff markup--and the investigators are not sufficiently skilled to build new tools--the above encoding using what are called XInclude statements will work with the software built into one of the cheap off-the-shelf XML editors, SyncRO Soft Limited's Oxygen package, which incorporates the Saxon XQuery processor. This XQuery processor allows us to ask pertinent questions of the encoded XML text based on the elements and attributes within that text, and crucially it also allows us to ask those questions of the standoff markup files so that our questions are confined to just the lines that the standoff markup points to. That is, although the statement of Werstine's view of compositor B's work contains only these pointers

<xi:include href="basetext.xml" xpointer="TLN-3"/>
<xi:include href="basetext.xml" xpointer="TLN-4"/>
<xi:include href="basetext.xml" xpointer="TLN-5"/>

when this document is queried by XQuery the pointers are replaced by the elements they point to in the base text and thus the query is actually asked of this snippet

<line n="TLN-3">Master</line>
<line n="TLN-4">BOte-swaine</line>
<line n="TLN-5">Botes. Heere Master: What cheere?</line>

By running our XQuery against the document "werstine-comp-B.xml" we are running it against just the parts of the play that Werstine thinks compositor B set. If we wanted to test yet another scholar's views of the setting of these lines, we would leave the base text untouched and instead create new standoff markup files showing which lines this scholar thought each of her designated compositors set.
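The expansion step can be imitated in a few lines of Python. This is a hand-rolled sketch of the idea, not the mechanism that Oxygen and Saxon actually use: it looks up each pointer among the `n` attributes of the base text and returns the matching lines. The wrapper elements `<text>` and `<stint>`, the function name `resolve`, and the in-memory strings standing in for the files are our assumptions. The sketch uses the attribute name `xpointer`, which is the spelling XInclude defines. Note also that a real XInclude shorthand pointer resolves against ID-typed attributes (such as `xml:id`), so a production base text would probably carry its identifiers there rather than in `n`.

```python
import xml.etree.ElementTree as ET

XI = "http://www.w3.org/2001/XInclude"  # the XInclude namespace

# Stand-in for basetext.xml; the <text> wrapper is an assumption.
BASETEXT = """<text>
<line n="TLN-1">A tempestuous noise of Thunder and Lightning heard: En-</line>
<line n="TLN-2">ter a Ship-master, and a Boteswaine.</line>
<line n="TLN-3">Master</line>
<line n="TLN-4">BOte-swaine</line>
<line n="TLN-5">Botes. Heere Master: What cheere?</line>
</text>"""

# Stand-ins for werstine-comp-B.xml and taylor-comp-A.xml; <stint> is assumed.
WERSTINE_B = f"""<stint xmlns:xi="{XI}">
<xi:include href="basetext.xml" xpointer="TLN-3"/>
<xi:include href="basetext.xml" xpointer="TLN-4"/>
<xi:include href="basetext.xml" xpointer="TLN-5"/>
</stint>"""

TAYLOR_A = f"""<stint xmlns:xi="{XI}">
<xi:include href="basetext.xml" xpointer="TLN-1"/>
<xi:include href="basetext.xml" xpointer="TLN-2"/>
<xi:include href="basetext.xml" xpointer="TLN-3"/>
</stint>"""


def resolve(stint_xml: str, base_xml: str) -> list:
    """Replace each pointer in a stint with the base-text line it names."""
    base = ET.fromstring(base_xml)
    lines_by_number = {line.get("n"): line for line in base.iter("line")}
    stint = ET.fromstring(stint_xml)
    pointers = [inc.get("xpointer") for inc in stint.iter(f"{{{XI}}}include")]
    return [lines_by_number[p] for p in pointers]


for line in resolve(WERSTINE_B, BASETEXT):
    print(line.get("n"), line.text)
```

Because both stint documents draw on the same base text, a correction made once in the base text is seen by every scholar's view of it.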

The key advantage of this method--the reason we are showing it to you--is that once the base text has been established, each competing scholarly opinion about compositing can be rapidly encoded in standoff markup that points to certain parts of this base text, and tests run on the standoff markup are run against only those parts. If, as not infrequently happens, a scholar changes her mind about which parts of the book each compositor set, a small change can be made to the standoff markup reflecting this reattribution of lines. In the example given we have identified parts of the base text by their line numbers, and you may be thinking that it would be tedious to write an XInclude statement for each line that each scholar thinks each compositor set. It would, but we do not have to specify stints in terms of individual lines. The procedure works just as well for columns, pages, formes or sheets: so long as we have identified these elements in the base text we may refer to them in the document that defines a scholar's view of a stint.
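To illustrate pointing at larger units, here is a small Python sketch under stated assumptions. The base text wraps the five quoted lines in a `<page>` element whose identifier, "page-1", is hypothetical, as are the wrapper `<text>` and the function name `expand`. A pointer to a line yields that line; a pointer to a page yields every line inside it.

```python
import xml.etree.ElementTree as ET

# Hypothetical base text: one <page> wrapping the five lines quoted earlier.
# The identifier "page-1" is invented for illustration.
BASETEXT = """<text>
<page n="page-1">
  <line n="TLN-1">A tempestuous noise of Thunder and Lightning heard: En-</line>
  <line n="TLN-2">ter a Ship-master, and a Boteswaine.</line>
  <line n="TLN-3">Master</line>
  <line n="TLN-4">BOte-swaine</line>
  <line n="TLN-5">Botes. Heere Master: What cheere?</line>
</page>
</text>"""


def expand(pointers: list, base_xml: str) -> list:
    """Return the lines named by the pointers, whatever unit each names."""
    base = ET.fromstring(base_xml)
    units = {el.get("n"): el for el in base.iter() if el.get("n")}
    lines = []
    for pointer in pointers:
        # iter("line") yields the element itself if it is a line,
        # and otherwise every line nested inside it.
        lines.extend(units[pointer].iter("line"))
    return lines


print(len(expand(["page-1"], BASETEXT)))          # one pointer, a whole page
print(len(expand(["TLN-1", "TLN-2"], BASETEXT)))  # two pointers, two lines
```

A stint that covers whole pages therefore needs one pointer per page rather than one per line, and individual lines need be listed only where a stint begins or ends partway down a page.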

What you have just heard reflects how far one of the two authors of this paper, Egan, got in a proof-of-concept project. To do some useful work with the technique required a base text marked up with rather more bibliographical features than those shown in the example. You may have noticed that in the example speech-prefixes, stage-directions, and changes of type from roman to italic and back again were not encoded, and these are just the kinds of things a physical analysis may wish to pick out. To get from proof-of-concept to something useful, the co-author Hirsch first encoded the whole of the Folio text of Macbeth using established principles. A base text may be marked up with any bibliographical features that are agreed upon by all investigators. Most usefully one would record whether a line of type is full, meaning that the last letter or punctuation mark is hard against the right edge of the block of type (as is typical in setting prose), or short in the sense that the end of the line is filled with spaces (as is typical in setting verse). In short lines a compositor was free to follow his own preferences in the spellings of words, but in full lines he might be tempted or even forced to alter spellings to make the type exactly fit the width he was setting to, and for this reason full lines are usually excluded from spelling tests used to identify compositors. The uncontested information about line length may be encoded as an additional attribute of each line in the base text:

<line n="TLN-1" full="yes">A tempestuous noise of Thunder and Lightning heard: En-</line>
<line n="TLN-2" full="no">ter a Ship-master, and a Boteswaine.</line>
<line n="TLN-3" full="no">Master</line>
<line n="TLN-4" full="no">BOte-swaine</line>
<line n="TLN-5" full="no">Botes. Heere Master: What cheere?</line>

This information can be used to exclude from the XQuery results the words in any line identified in the base text as full.

Similarly, the use of spaces around commas, the indentation or turn-over/turn-under of overflowing lines, the indentation or centering of stage directions, and other features counted by investigators may be encoded in the base text. HERE BRETT YOU COULD TALK ABOUT WHAT YOU DECIDED TO ENCODE, PERHAPS MENTIONING THAT YOU STARTED WITH THE PRACTICES OF ISE AND DRE. YOU MIGHT TALK ABOUT THE DIFFERENCE BETWEEN ENCODING PARTICULAR FEATURES AS ELEMENTS (AS WITH YOUR <C> ELEMENT) VERSUS ENCODING THEM AS ATTRIBUTES (AS WITH MY 'POSITION="CENTRED"' PREFERENCE); I OF COURSE WAS THINKING ABOUT ENCODING THAT WOULD MAKE IT EASY FOR ME TO PULL OUT CERTAIN LINES USING XQUERY. YOU MIGHT ALSO TALK ABOUT THE JUDGEMENTS YOU HAD TO MAKE ON HARD BIBLIOGRAPHICAL CASES, INCLUDING JUST WHAT IS THE ONTOLOGICAL STATUS OF, SAY, A LETTER 'u' THAT IS DELIBERATELY INVERTED TO BE USED AS AN 'n' BECAUSE THE COMP'S 'n' SORT-BOX IS DEPLETED. IF SOMEONE ARGUES THAT IT STAYS BEING AN 'u' THEN WOULD THEY SAY THAT ALL INVERTED COMMAS (USED AS QUOTATION MARKS OR HIGHLIGHT MARKS) ARE REALLY COMMAS AND SHOULD BE ENCODED AS SUCH? ALSO, RECORDING LIGATURES--JUST HOW DO WE SPOT THEM?--IS AN INTERESTING CASE. PERHAPS SHOW THE AUDIENCE VARIOUS BITS OF YOUR ENCODING?

Satchell decided that in Folio Macbeth Compositor A set from the first line to the end of scene 3.3 with the exception of scene 1.7 and Compositor B did the rest, including scene 1.7. In bibliographical terms we may for brevity express these stints using a variety of units including whole leaves, pages, columns, and individual lines. For Compositor A's stint that would be all of leaves ll6 and mm1, the lines on page mm2r that are not scene 1.7, page mm2v, all of leaves mm3 and mm4 and the first column of page mm5r. The standoff markup needed to represent this is:

<xi:include href="F.xml" xpointer="ll6"/>
<xi:include href="F.xml" xpointer="mm1"/>
<xi:include href="F.xml" xpointer="line-468"/>
. . .
<xi:include href="F.xml" xpointer="line-481"/>
<xi:include href="F.xml" xpointer="line-579"/>
. . .
<xi:include href="F.xml" xpointer="line-589"/>
<xi:include href="F.xml" xpointer="mm2v"/>
<xi:include href="F.xml" xpointer="mm3"/>
<xi:include href="F.xml" xpointer="mm4"/>
<xi:include href="F.xml" xpointer="mm5ra"/>
(satchell-comp-a.xml)

The XQuery command that draws on Satchell's statement of his view of compositor A's stint is simply:

doc("satchell-comp-a.xml")//line[@full="no"]/text()

This tells the XQuery processor to open the file "satchell-comp-a.xml", to expand its XInclude statements by drawing the necessary elements (leaves, pages, columns, lines) from the file they point to ("F.xml"), to throw away all lines but those for which the 'full' attribute is set to 'no', and to return just the raw words within what remains.¹
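For readers without an XQuery processor to hand, the filtering step can be imitated with the limited XPath support in Python's standard library. This is a sketch of the same selection, not the project's own query: it starts from a snippet that has already been expanded, keeps only the lines whose 'full' attribute is 'no', and returns their text. The `<stint>` wrapper and the function name `short_line_texts` are our assumptions, and the five Tempest lines stand in for the Macbeth stint.

```python
import xml.etree.ElementTree as ET

# An already-expanded stint, using the five lines of the earlier example.
# The <stint> wrapper is an assumption made so the snippet has one root.
EXPANDED = """<stint>
<line n="TLN-1" full="yes">A tempestuous noise of Thunder and Lightning heard: En-</line>
<line n="TLN-2" full="no">ter a Ship-master, and a Boteswaine.</line>
<line n="TLN-3" full="no">Master</line>
<line n="TLN-4" full="no">BOte-swaine</line>
<line n="TLN-5" full="no">Botes. Heere Master: What cheere?</line>
</stint>"""


def short_line_texts(xml_text: str) -> list:
    """Return the text of every line whose 'full' attribute is 'no'."""
    root = ET.fromstring(xml_text)
    # The predicate [@full='no'] plays the part of the XQuery filter.
    return [line.text for line in root.findall(".//line[@full='no']")]


for text in short_line_texts(EXPANDED):
    print(text)
```

The full line TLN-1, which ends with the broken word "En-", is dropped, and the four short lines remain.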

Once we have isolated just which words in Folio Macbeth Satchell thinks were set by Compositor A we can subject them to the tests by which he came to decide on the extents of this stint. We can, for example, put the resulting pool of words through word-frequency counters to separate out the various spellings or make customized software scripts that look for peculiarities of spacing and punctuation. Those tests are the topic for another paper. The key points here are that we can quickly check that Satchell's claimed counts of various features in the stint are correct, and that we can quickly make adjustments to the supposed stint to see what differences they make. For example, it would be only a few moments' work editing the standoff markup to reassign scene 1.7 from Compositor B to Compositor A, and once this is done XQuery will produce a subtly different pool of words whose features may be tested and compared to those of the first pool. These kinds of investigation are the next step.
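As a rough indication of what such a word-frequency count might look like, here is a minimal Python sketch. It is our illustration rather than any test Satchell or his successors actually ran: the function name `count_spellings`, the crude tokenizer, and the two small pools (the short lines of the earlier Tempest example, divided as in the supposed Werstine stints) are all assumptions.

```python
import re
from collections import Counter


def count_spellings(line_texts: list, variants: list) -> dict:
    """Count how often each spelling variant occurs in a pool of lines."""
    words = []
    for text in line_texts:
        # Crude tokenizer: lower-case the line and keep runs of letters.
        words.extend(re.findall(r"[a-z]+", text.lower()))
    counts = Counter(words)
    return {variant: counts[variant] for variant in variants}


# Short lines only, divided as in the supposed Werstine stints above.
pool_a = ["ter a Ship-master, and a Boteswaine."]
pool_b = ["Master", "BOte-swaine", "Botes. Heere Master: What cheere?"]

variants = ["heere", "here"]
print("A:", count_spellings(pool_a, variants))
print("B:", count_spellings(pool_b, variants))
```

Reassigning a line from one stint to the other changes only the pools handed to the counter, so the effect of a reattribution on the counts can be seen by rerunning the same function.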

Notes

¹If the suffix "/text()" is omitted from the end of this command, the processor returns not the words in the lines but the lines themselves complete with their surrounding XML tags showing such things as the line numbers. As a useful check on the methodology this more complete output was first produced and compared with a facsimile of the Folio to make sure that the system was indeed isolating just those parts of the play that Satchell thought were set by each compositor.

Works Cited

McKenzie, D. F. 1984. "Stretching a Point: Or, the Case of the Spaced-out Comps." Studies in Bibliography 37. 106-21.

Price, George R. 1978. "The Printing of Love's Labour's Lost (1598)." Papers of the Bibliographical Society of America 72. 405-34.

Satchell, Thomas. 1920. "'The Spelling of the First Folio': A Letter to the Editor." Times Literary Supplement Number 959 (3 June). 352.

Taylor, Gary. 1981. "The Shrinking Compositor A of the Shakespeare First Folio." Studies in Bibliography 34. 96-117.

Werstine, Paul. 1984. "The Editorial Usefulness of Printing House and Compositor Studies: Reprinted from Analytical and Enumerative Bibliography 2 (1978): 153-165 with a New Afterword." Play-texts in Old Spelling: Papers from the Glendon Conference. Edited by G. B. Shand and Raymond C. Shady. New York. AMS. 35-64.

Willoughby, Edwin Eliott. 1932. The Printing of the First Folio of Shakespeare. Supplements to the Bibliographical Society's Transactions. 8. Oxford. Oxford University Press for the Bibliographical Society.