
"What can computers tell us about Shakespeare's early editions and authorship?" by Gabriel Egan

The simplest kind of evidence for the authorship of a book is the presence of its author's name on the title-page. The earliest editions of Shakespeare's plays in the 1590s did not routinely print his name on the title-page, but this was true of English printed drama generally, so it has no special significance. By the end of Shakespeare's career, editions of plays did routinely print the dramatist's name on the title-page, but of course this is only evidence, not proof, of authorship. Shakespeare's name appeared on the title-pages of the plays The London Prodigal in 1605, A Yorkshire Tragedy in 1608, Sir John Oldcastle in 1619, and 1, 2 The Troublesome Reign in 1622, but no-one today takes these attributions seriously.

    [SLIDE] In 1623 the First Folio edition gave Shakespeare sole credit for 36 plays that have since then formed the core of his accepted canon [SLIDE]. Only one play that was already in print but left out of the First Folio has been universally accepted as part of the Shakespeare canon since the late eighteenth century [SLIDE]: Pericles, which was published in 1609 with his name on the title-page. It is now widely agreed that Shakespeare co-wrote Pericles with George Wilkins. [SLIDE] In 1634 an edition of the play The Two Noble Kinsmen appeared with the names of Shakespeare and John Fletcher on its title-page; by the late twentieth century this had become widely accepted as an accurate attribution. [SLIDE] One seemingly conservative way to define Shakespeare's dramatic canon, then, is to include the 36 First Folio plays plus Pericles and The Two Noble Kinsmen. [SLIDE] These 38 plays are the ones offered in the Royal Shakespeare Company edition Complete Works (Shakespeare 2007). [SLIDE] Unfortunately this conservative definition is certainly wrong: there are undoubtedly more plays to which Shakespeare contributed parts, and substantial parts of plays in the 1623 First Folio are not his [BLANK SLIDE].

    Henry 8 is presented in the First Folio as entirely by Shakespeare, but in the mid-nineteenth century Samuel Hickson and James Spedding independently considered it in the light of the known Shakespeare and Fletcher collaboration in The Two Noble Kinsmen and both agreed that Henry 8 has the same distinctive mix of features. The scholars said that they could just hear the difference in styles, but those of us "less quick in perceiving the finer rhythmical effects" might be more readily convinced, wrote Spedding, by some counts of "lines with a redundant syllable at the end" (Spedding 1850, 121), meaning feminine endings to regular verse lines of iambic pentameter, as in "to BE or NOT to BE that IS the QUEST-yun". This feminine-ending test remains a valuable tool for authorship attribution, as writers are fairly consistent in their rates of feminine-ending use so long as we count across a substantial amount of writing. The rate in any one scene of a play might fluctuate according to the dramatic needs, but for whole acts and especially whole plays the rate of feminine endings to verse lines is stable for one author writing at one particular time.
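    The counting itself is easy to automate. Here is a minimal sketch, assuming a crude vowel-group syllable count and treating any eleven-syllable line as having a feminine ending; it only illustrates the kind of tally Spedding describes, not the careful scansion a real study requires, and the sample lines are simply the familiar opening of the "To be" speech.

import re

def count_syllables(word):
    # Crude heuristic: drop a silent final "e", then count vowel groups.
    # Real studies must make judgement calls for words such as "heaven"
    # or "flower", which may be pronounced as one syllable or two.
    w = word.lower()
    if w.endswith("e") and not w.endswith("le"):
        w = w[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", w)))

def feminine_ending_rate(verse_lines):
    """Share of verse lines with an eleventh (unstressed) syllable."""
    feminine = counted = 0
    for line in verse_lines:
        words = re.findall(r"[a-zA-Z']+", line)
        if not words:
            continue
        counted += 1
        if sum(count_syllables(w) for w in words) == 11:
            feminine += 1
    return feminine / counted if counted else 0.0

sample = [
    "To be or not to be, that is the question",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune",
]
print(f"{feminine_ending_rate(sample):.0%}")   # all three sample lines are feminine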

    Scholars in the nineteenth and twentieth centuries lacked tables of frequency rates for various verse features used by all the major dramatists. Philip Timberlake's PhD thesis, accepted by Princeton University in 1926, went some way towards remedying this deficiency, and despite covering only the drama up to 1595 it remains the most complete tabulation of the frequencies of feminine endings in existence (Timberlake 1931). Timberlake addressed head-on the problem that investigators often disagree on just what is a feminine ending, since words like heaven, even, hour, bower, flower, tower, power, and friar may be pronounced monosyllabically to give a masculine ending or disyllabically to give a feminine one. Towards the end of his study, Timberlake applied his findings to various problems of Shakespearian authorship. In the anonymous play Edward 3 (first published in 1596) the Countess of Salisbury's scenes show a sharp rise in the rate of feminine endings: the rest of the play falls well below Shakespeare's norm at 2.1%, while these scenes fall well within his norm of 4-16%. Timberlake concluded that it is distinctly possible that Shakespeare wrote this part of that play (Timberlake 1931, 78-80, 124).

    The problem of differentiating Shakespeare's writing from that of his co-author Fletcher was revisited in the 1950s and 1960s by Cyrus Hoy as part of a series of seven articles on the collected plays edition called Fifty Comedies and Tragedies by Beaumont and Fletcher (Beaumont & Fletcher 1679). Hoy suspected that many, but not all, of the 50 were by Beaumont and Fletcher, and in his first article Hoy laid out his chief means for detecting Fletcher's writing: "use of such a pronominal form as ye for you, of third person singular verb forms in -th (such as the auxiliaries hath and doth), of contractions like 'em for them, i'th' for in the, o'th' for on/of the, h'as for he has, and 's for his (as in in's, on's, and the like)" (Hoy 1956, 130-31). Hoy acknowledged that such tests had been used before--most innovatively by W. E. Farnham and A. C. Partridge--and claimed only that his was the first study to apply them all systematically to the whole of a substantial body of writing. In large part Hoy's method confirmed earlier divisions of Henry 8 and The Two Noble Kinsmen between Shakespeare and Fletcher. The kinds of tests employed by Hoy have widely varying success with different authors, being particularly effective for distinguishing Massinger from Fletcher but less good for others.
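    Counting markers of this kind is straightforward to mechanize. Here is a minimal sketch, assuming a tiny illustrative subset of the forms Hoy tracked (not his full list) and expressing each as a rate per 1,000 words; the sample sentence is invented purely for illustration.

import re
from collections import Counter

# An illustrative subset of the forms Hoy counted, not his full list.
MARKERS = ["ye", "hath", "doth", "'em", "i'th'", "o'th'"]

def marker_rates(text):
    """Occurrences of each marker form per 1,000 words of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {m: 1000 * counts[m] / total for m in MARKERS}

sample = "I thank ye, sir; she hath been i'th' garden; pray bid 'em attend her."
print(marker_rates(sample))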

    Hoy's success in establishing a series of linguistic-preference tests and applying them to a substantial body of drama was inspirational to others in the field. Essentially the same kind of analysis--counting preferences for different ways of saying the same thing--was applied in the 1970s by David J. Lake and MacDonald P. Jackson to the problem of identifying Thomas Middleton's work, in the course of which the case for Middleton's hand in Shakespeare's Timon of Athens--another of the 36 plays in the 1623 First Folio--emerged most clearly (Lake 1975; Jackson 1979). Lake made no claims for innovation in the kinds of internal evidence he collected, indeed quite the opposite: "the general methods or particular tests I employ", he wrote, have "all been used over the past fifty years in authorship investigations" (Lake 1975, 10).

    But one of Jackson's methods had not previously been applied to Shakespearian authorship attribution: the counting of the frequency of occurrence of so-called function words that express grammatical relationships between other words while carrying little or no lexical value of their own. Their role is to bring together the nouns, verbs, and adjectives in order to give a sentence its foundational structure. Typical function words in the English language are prepositions, conjunctions, articles, particles, auxiliary verbs, and pronouns, although linguists differ on just which words have so little lexical value that they properly belong in this category.

    In his 1982 PhD thesis on distinguishing Middleton and Shakespeare's writing, and especially apportioning their shares of Timon of Athens, R. V. Holdsworth put himself squarely in the tradition of Hoy, Lake and Jackson (Holdsworth 1982; Holdsworth 2012). Like them, he counted various linguistic features such as contractions and the preference for modern (and urban) you over archaic (and rural) thou, but Holdsworth also introduced the innovation of counting the various formulaic phrasings used in stage directions to find author-specific idiosyncrasies (Holdsworth 1982, 181-235). His comprehensive study of the form "Enter A and B, meeting", in which the syntax makes clear that neither character is already on stage, was the first systematic proof that a recurrent form of stage direction could usefully distinguish authorship.

    Without computer automation, the counting of linguistic features was always likely to be incomplete and error-prone. The Textual Companion to the Oxford Complete Works was published in 1987, when such manual methods had taken the subject about as far as they could, and its survey of the Canon and Chronology of Shakespeare's writing was a synthesis of the scholarship up to that point (Taylor 1987). So what has the New Oxford Shakespeare got to show for the 30 years of research on co-authorship since the last Oxford Shakespeare? Let us look first at what was claimed by the 1986-87 Oxford Complete Works [SLIDE]:

Henry 8: a Shakespeare and John Fletcher collaboration

Timon of Athens: a Shakespeare and Thomas Middleton collaboration

Titus Andronicus: a Shakespeare and George Peele collaboration

Pericles: a Shakespeare and George Wilkins collaboration

Macbeth: Thomas Middleton's adaptation of Shakespeare's lost original

Measure for Measure: Middleton's adaptation of Shakespeare's lost original

1 Henry 6: a collaboration by Shakespeare and Thomas Nashe and others

Sir Thomas More: a collaboration between Anthony Munday, Henry Chettle, Thomas Dekker, Thomas Heywood, Shakespeare and others

All these attributions are accepted in the New Oxford Shakespeare that was published in 2016-17 and several of them are strengthened with new evidence, in particular the Middleton adaptations of Macbeth and Measure for Measure. But we also have some new claims of collaboration that substantially change the shape of the canon as the New Oxford Shakespeare will present it [SLIDE]:

Edward 3: a collaboration between Shakespeare and others

2, 3 Henry 6: both are collaborations of Shakespeare with Christopher Marlowe and others

The Spanish Tragedy: Additions 2 through 5 (but not 1) to the play (originally written by Thomas Kyd) that first appeared in the 1602 quarto (the fourth edition) are by Shakespeare

Arden of Faversham: Act 3 (= Scenes 4 through 9) is by Shakespeare

Cardenio: Lewis Theobald's play Double Falsehood is an eighteenth-century adaptation of this lost collaborative play by Shakespeare and John Fletcher

Titus Andronicus: the 'Fly Scene' (present only in F) is by Thomas Middleton

All's Well that Ends Well: the virginity dialogue in 1.1, the King's speech on class in 2.3, and the gulling of Parolles are by Thomas Middleton

The scholarship that has convinced the New Oxford Shakespeare general editors to change the Shakespeare canon in this way was not only done by the general editors themselves. In particular, recent publications by MacDonald P. Jackson, Hugh Craig, John Burrows, R. V. Holdsworth, Marina Tarlinskaya, Giuliano Pascucci, Brett Greatley-Hirsch, Jack Elliott, and Farah Karim-Cooper have shaped our view. I do not have time to go into all that scholarship, but I can briefly sketch the approaches of MacDonald P. Jackson and Hugh Craig, who have contributed most to our view. [SLIDE]

    Jackson's attribution method, now widely known, admired, and emulated, uses the database called Literature Online (LION) that is available to universities by subscription from the ProQuest Corporation and that offers typed-up, searchable texts of the vast majority of all English Literature--novels, poems, plays--published before the twentieth century. Jackson goes searching in LION for phrases and collocations found in the text he is trying to attribute, looking for those that are comparatively rare. When I say "phrases and collocations" I mean that he manually extracts from the passage he is trying to attribute every two-word and three-word string of words (that is, n-grams) and searches for them in LION within a constrained time-period (say, works written between 1590 and 1610) to see which are relatively rare, occurring five or fewer times. As well as strict strings of words, Jackson also looks for the same words occurring near to one another but not necessarily in the same order, in other words collocations.

    Jackson tabulates which authors' canons contain occurrences of these phrases and collocations from the text to be attributed, and counts how many times each author's canon contains such a hit: the one with the most hits is declared to be the author. There are many refinements to Jackson's method that I do not have time to go into, for example the way that he weights the hits according to the size of the canons they occur in. Shakespeare has by some considerable margin the largest dramatic canon in this period, so all else being equal he has, as it were, greater 'opportunity' to produce matches for the phrases and collocations in the work to be attributed simply because he wrote more than anybody else. There is another investigator working with a similar method to Jackson's, called Brian Vickers, but I omit him from this survey because there are fatal flaws in the method and the tools that he uses, which I would be happy to talk about in the Q&A session, and which make his conclusions unreliable.
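    A minimal sketch of the core idea follows. It assumes we have plain-text canons for each candidate author standing in for searches of LION (a subscription database), extracts the passage's two-word and three-word n-grams, keeps those that are comparatively rare across the whole comparison corpus, and tallies hits per author; Jackson's weighting of hits by canon size, mentioned above, is omitted here.

import re
from collections import Counter

def ngrams(text, n):
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def rare_ngram_hits(passage, canons, max_occurrences=5):
    """Tally, per candidate author, the passage's rare 2- and 3-grams
    that also occur in that author's canon (canons: author -> text)."""
    phrases = set(ngrams(passage, 2) + ngrams(passage, 3))
    canon_counts = {author: Counter(ngrams(text, 2) + ngrams(text, 3))
                    for author, text in canons.items()}
    hits = Counter()
    for phrase in phrases:
        total = sum(counts[phrase] for counts in canon_counts.values())
        if 0 < total <= max_occurrences:            # comparatively rare phrase
            for author, counts in canon_counts.items():
                if counts[phrase]:
                    hits[author] += 1
    return hits

# Example usage (the canon texts are placeholders to be supplied):
# hits = rare_ngram_hits(disputed_passage, {"Shakespeare": shakespeare_text, "Peele": peele_text})
# hits.most_common(1) gives the candidate with the most rare-phrase matches.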

    The methods used by Hugh Craig are in large part adaptations of methods developed by his sometime co-investigator John Burrows, called the Zeta and the Delta tests (Burrows 2002; Burrows 2003; Burrows 2007). Instead of counting rare phrases as Jackson does, the Delta test counts the frequencies of the very commonest words, the function words like the, a, on, in, and so on. The rates of usage of these words are demonstrably specific to particular authors--we each have our own unconscious preferences about how often we use each one--and with electronic texts of all our materials we can have computers do the counting. The Delta method's key innovation is that it discounts the importance of words for which a set of authors is demonstrably variable in their rates of usage and weights more heavily the evidence from words that the authors use at consistent rates. Moreover, Delta puts on an even footing words that are used at different rates of frequency, as it measures variations in rates of usage, not the absolute numbers of occurrences. When comparing the text to be attributed to the texts in the comparison set, Delta finds where the unknown text uses certain words more often and other words less often than the average for the comparison set, and it finds where a particular author's contributions to the comparison set also show the same pattern of favouring the same words and disfavouring the same other words.
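    In the form Burrows published (Burrows 2002), this works on z-scores: each of the commonest words is scored by how far a text's rate of use departs from the comparison set's average, measured in standard deviations, and Delta is the mean absolute difference between two texts' z-score profiles (dividing by each word's standard deviation is what downweights the words on which authors vary most). Here is a minimal sketch under those assumptions; the comparison texts and the number of words used are placeholders.

import re
import statistics
from collections import Counter

def word_rates(text, vocab):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

def burrows_delta(disputed, comparison_texts, n_words=30):
    """Mean absolute difference in z-scored rates of the commonest words
    between the disputed text and each text in the comparison set."""
    pooled = " ".join(comparison_texts.values()).lower()
    vocab = [w for w, _ in Counter(re.findall(r"[a-z']+", pooled)).most_common(n_words)]
    profiles = {name: word_rates(t, vocab) for name, t in comparison_texts.items()}
    means = [statistics.mean(p[i] for p in profiles.values()) for i in range(len(vocab))]
    sds = [statistics.pstdev(p[i] for p in profiles.values()) or 1e-9 for i in range(len(vocab))]
    def z(rates):
        return [(r - m) / s for r, m, s in zip(rates, means, sds)]
    dz = z(word_rates(disputed, vocab))
    return {name: statistics.mean(abs(a - b) for a, b in zip(dz, z(p)))
            for name, p in profiles.items()}   # the lowest Delta is the most similar style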

    This principle of identifying on a case-by-case basis the words that are most discriminating between various authors, rather than relying on pre-determined lists of words, also underlies Burrows's second innovation, the Zeta test. As a first step, the investigator establishes two sets of texts, each being the securely attributed works of a single candidate author or a group of authors. Zeta finds for itself the words that most distinguish these two sets, the ones that are especially common in the first set and especially uncommon in the second, and vice versa. The vice versa step means that the investigator has two lists of words, both of which are good discriminators between the two sets of texts. Rather than treat each whole play as one block of writing, this method divides each text into segments of equal size, typically 2,000 words.

    When the numbers of occurrences of the discriminating words in each of the segments in the two text sets are plotted on an x/y graph--x for counts of words favoured by the first set and disfavoured by the second, and y for counts of words disfavoured by the first set and favoured by the second--the segments' scores fall into two distinct clusters: high-x/low-y for texts in the first set and high-y/low-x for texts in the second set. This is just as we would expect, since Zeta was made to find the words that would produce this outcome. Then the investigator has Zeta count the occurrences of the discriminating words in the text to be attributed and plot this on the same x/y graph [SLIDE]. If the text to be attributed shares the word-preferences of one of the two text sets, its x and y values will place it near or within that set's cluster on the graph. Here is a typical Craig scatter-plot showing that the play Coriolanus contains a lot of the words that Shakespeare favours and that other writers tend to avoid, and has few of the words that other writers favour and that Shakespeare tends to avoid. This is just what we would expect, since Shakespeare wrote Coriolanus.
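    There are several published variants of Zeta; the sketch below follows the segment-based version as I understand it, scoring each word by how consistently it appears in the first set's segments and stays out of the second set's, and then giving each segment an (x, y) coordinate from the share of its distinct words drawn from each marker list. The segment size and the number of marker words are illustrative assumptions.

import re

def segment_word_sets(text, size=2000):
    """Split a text into consecutive segments and return each as a set of words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [set(words[i:i + size]) for i in range(0, len(words), size)]

def zeta_markers(set_a_text, set_b_text, n_markers=100):
    """Words scoring high appear in many A segments and few B segments;
    words scoring low do the reverse. Returns (A-markers, B-markers)."""
    segs_a = segment_word_sets(set_a_text)
    segs_b = segment_word_sets(set_b_text)
    vocab = set().union(*segs_a, *segs_b)
    def zeta(word):
        in_a = sum(word in s for s in segs_a) / len(segs_a)
        out_b = sum(word not in s for s in segs_b) / len(segs_b)
        return in_a + out_b
    ranked = sorted(vocab, key=zeta)
    return set(ranked[-n_markers:]), set(ranked[:n_markers])

def zeta_point(segment, a_markers, b_markers):
    """The (x, y) coordinates to plot for one segment: the share of its
    distinct words that are A-markers and the share that are B-markers."""
    distinct = set(re.findall(r"[a-z']+", segment.lower()))
    return (len(distinct & a_markers) / len(distinct),
            len(distinct & b_markers) / len(distinct))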

    If the sets are chosen to be, say, Shakespeare plays on the one hand and Marlowe plays on the other, the Zeta method becomes for that application a good discriminator of these two writers' styles. One of the sets may be a multi-writer collective, so that the test may be, say, Shakespeare versus Marlowe+Greene+Peele+Nashe. As Burrows showed, and Craig confirmed with a great many validation runs for this technique (Craig & Kinney 2009), when the investigator takes a text of known authorship out of one of the sets and reruns the experiment as if this text were of unknown authorship--without letting this text help choose the discriminating word lists--the correct author is identified with an accuracy that typically (depending on who is being tested) exceeds 95%. Zeta is by some way the most powerful general-purpose authorship tool currently available.

Word Adjacency Networks

    [SLIDE] It is possible to imagine a new method in computational stylistics that would be both like the approach of MacDonald P. Jackson in attending to the proximities of particular words to one another, yet without excluding all but the rare collocations, and at the same time like the Craig-Burrows approach in counting every occurrence of even the most frequently occurring words. What if we could take a text and count the proximity of every word to every other word, so that we capture the phenomenon of word-clustering at all levels where it occurs? It so happens that generating such data is technically trivial--the algorithm is not difficult to code--and the trouble arises rather in capturing this vast dataset in a form that enables meaningful comparisons to be made between texts. The technique I will end with is an application to Shakespearian authorship attribution of a mathematical approach to this problem that has been developed in other fields for other purposes, using what are called Markov chains to represent Word Adjacency Networks (WANs).

    To explain the Word Adjacency Networks method one needs to understand a notion called Shannon entropy, which in the limited time available I can best do with a practical exercise. [Do the guess-the-word game with me writing on the board; use a book from their own collection.]

    The first to do the experiment we have just performed was the founder of the science of Information Theory, Claude Shannon. His equations, and in particular the celebrated equation for Shannon Entropy [SLIDE], are now routinely used in scientific work on authorship attribution that appears in journals little read by literary scholars. According to Shannon, the amount of information in any piece of writing or other code that bears meaning can be precisely quantified as its unpredictability. After performing exactly the experiment we just did, Shannon calculated that overall English prose is about 75% redundant: three times out of four the next letter is guessable. This is the reason that today's SMS text-speak and various kinds of shorthand work [SLIDE]. In this context, redundancy means predictability: after the letter t the letter h is much more likely to follow than x is, and directly after q the appearance of u is almost a certainty. Shannon gave us the mathematics with which to quantify these patterns of predictability or unpredictability, borrowing the physicists' term entropy for it, and enabling us to compare the entropy of one text to that of another.
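    For reference, the equation in its standard form is

    H = -\sum_i p_i \log_2 p_i

where p_i is the probability of the i-th symbol (a letter, or a letter given what precedes it). The redundancy Shannon speaks of is then R = 1 - H / H_max, where H_max = \log_2 N for an alphabet of N equally likely symbols; his estimate of roughly 75% redundancy corresponds to an entropy in the region of one bit per letter, against the roughly 4.7 bits of a purely random 26-letter alphabet.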

    The 26 letters of the alphabet do not occur with equal frequency in normal English prose or verse, whether modern or early modern. About one in every eight letters of written English is the letter e, but only one-in-50 of them is the letter p, one-in-100 is the letter v, and one-in-1000 is the letter q. Moreover, certain letters are more likely to follow certain other letters. Following t with h makes the commonest letter bigram in English, comprising about one-in-30 of all letter bigrams, while following u with r makes a bigram that occurs only once in every 200 letter bigrams. Without consciously noticing it, all competent users of the English language have internalized these preferences, as demonstrated in our experiment. 
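    These figures are easy to check against any sizeable electronic text. A minimal sketch follows; the filename is a placeholder and the proportions in the comments are only approximate.

import re
from collections import Counter

# Count letter and within-word letter-bigram frequencies in a plain-text file.
text = open("some_text.txt", encoding="utf-8").read().lower()
words = re.findall(r"[a-z]+", text)
letters = Counter(c for w in words for c in w)
bigrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))

print("e :", letters["e"] / sum(letters.values()))    # roughly 1 in 8 for English
print("th:", bigrams["th"] / sum(bigrams.values()))   # the commonest English bigram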

    The preferences for one letter following another that we see in this experiment are built into the English language rather than belonging to any one writer. They express the fact that letter combinations are not randomly distributed but statistically ordered. If letter combinations were not ordered like this, any combination would be equally likely and crossword puzzles would be impossible. When we shift our attention from combinations of letters to combinations of words, the same statistical principles apply and the same Shannon equation tells us how far from purely random the sequences are. The freedom of the writer is much greater at the level of word combination: the structure of English is far less constraining when choosing words to make a sentence than when choosing letters to make a word. Moreover, it is empirically provable that the preferences for following one word with another are personal preferences of particular authors.

    Instead of individual letters, we can apply this principle to how often certain words follow other words, either following them directly or falling nearby or far away. Let me show how the method works using this extract from Shakespeare's Hamlet [SLIDE]

With one auspicious and one dropping eye,
With mirth in funeral and with dirge in marriage,
In equal scale weighing delight and dole,
(Shakespeare Hamlet 1.2.11-13)

[SLIDE] Let us confine our attention to the proximities, one from another, of the four function words with, and, one, and in. [SLIDE] Starting with with and looking forward five words we find an occurrence of the word one, an occurrence of the word and, and another occurrence of the word one. [SLIDE] We record that in our Markov chain by a line from with to and with a value of 1 and a line from with to one with a value of 2. [SLIDE] We are done with the first word in the extract, With, and we [SLIDE] move to the next occurrence of one of our function words, which is the second word in the extract, one. Again looking forward five words we spot an occurrence of and and an occurrence of one, [SLIDE] so we draw a line from one to and, weighted 1, and a line from one to itself, weighted 1. [SLIDE] Then we move to the next occurrence of one of our function words, and it is and in the middle of the first line. Looking forward five words, we find an occurrence of one and an occurrence of with, so we add these to our Markov chain as two weighted lines emerging from the node for and. We proceed through the extract in the same way, adding fresh weighted lines (called edges) between nodes to indicate how often each word appears within five words of the others [SLIDE x 16]. This is our completed Markov chain.
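The tallying we have just walked through can be expressed in a few lines of code. This is a minimal sketch that reproduces only the raw counts within a five-word forward window; the positional weighting and limit-probability adjustments described below are left out.

import re
from collections import defaultdict

FUNCTION_WORDS = {"with", "and", "one", "in"}
WINDOW = 5   # how many words ahead of each function word we look

extract = ("With one auspicious and one dropping eye, "
           "With mirth in funeral and with dirge in marriage, "
           "In equal scale weighing delight and dole,")

words = re.findall(r"[a-z]+", extract.lower())
edges = defaultdict(int)   # (from_word, to_word) -> weight

for i, word in enumerate(words):
    if word in FUNCTION_WORDS:
        for follower in words[i + 1 : i + 1 + WINDOW]:
            if follower in FUNCTION_WORDS:
                edges[(word, follower)] += 1

for (a, b), weight in sorted(edges.items()):
    print(f"{a} -> {b}: {weight}")   # e.g. with -> one: 2, as in the walkthrough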

We then do the same for the same four function words' appearance in another passage, [SLIDE] this time from Thomas Dekker's Satiromastix; here's the completed Markov chain. [SLIDE] We end up with two Markov chains, each showing the Word Adjacency Network for the four words with, and, one, and in, in each extract. These two chains contain the information about the word proximities in the two extracts, and using Shannon's mathematics for entropy we can compare them. You will see that there are fewer lines in the Satiromastix network, but the absolute number of lines is not the most important point. The key question is, "when this author chooses to follow one of these words with another of these words, which is she most likely to choose?" These networks embody the author's preferences that answer this question. You can see that in the Dekker extract, the word in is never followed (within five words) by the word with: there is no line running from in to with. Dekker instead chooses to follow in by and (one time) and by one (two times). [BLANK SLIDE]

This is only an illustration of the idea, and for authorship attribution we use many more than four function words; 100 would be typical, but the resulting pictures are too complex to show you. And of course rather than short extracts from plays we use whole authorial canons as our samples. And instead of just recording the raw numbers of edges from node to node, there are some weightings of edges and nodes to be applied using Shannon's mathematics for entropy and what is called limit probability. The edge-weightings reflect the fact that we consider words appearing close to one another to be more significant than words that are far apart, so instead of scoring "1" for a word appearing anywhere within our 5-word window, we give a greater score to words appearing near the beginning of the window. The limit probability weighting of nodes reflects the fact that we attach greater significance to words that are used often in the text being tested than to words that are used infrequently.
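To give a flavour of the second of those weightings: the limit probability of a node is the long-run share of time that a random walk along the chain's weighted edges spends at that node, which is one standard way of making often-used words count for more. A minimal sketch follows, continuing from the edge counts built in the sketch above and ignoring the positional edge-weighting, whose details I have not given; it illustrates the concept rather than our exact procedure.

def limit_probabilities(edges, nodes, steps=500):
    """Approximate each node's limit (stationary) probability by power
    iteration on the row-normalised transition matrix of the chain."""
    probs = {}
    for a in nodes:
        row_total = sum(edges.get((a, b), 0) for b in nodes)
        for b in nodes:
            # A node with no outgoing edges passes its probability on evenly.
            probs[(a, b)] = (edges.get((a, b), 0) / row_total
                             if row_total else 1 / len(nodes))
    pi = {n: 1 / len(nodes) for n in nodes}
    for _ in range(steps):
        pi = {b: sum(pi[a] * probs[(a, b)] for a in nodes) for b in nodes}
    return pi

# Using the 'edges' dictionary built in the sketch above:
# print(limit_probabilities(edges, ["with", "and", "one", "in"]))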

Specifically, how do we compare two WANs? The technical answer is that for each edge common to the two WANs we subtract from the natural logarithm of the weight in the first WAN the natural logarithm of the weight in the second WAN and then multiply this difference by the weight of the edge in the first WAN and then by the limit probability of the node from which this edge originates. This calculation is made for each edge and the values summed to express the total difference. In this method, it matters which WAN we designate as 'first' and which 'second', so the procedure is performed twice, switching the designation the second time. (I give you this technical account because the last time I spoke about this an audience member complained that I had concealed the most interesting part of the story; chacun à son goût.)
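Written out as a formula (my notation, simply restating the prose description above), the directed difference from the first WAN to the second is

    D(W_1 \| W_2) = \sum_{(u,v)} \pi_1(u) \, w_1(u,v) \, [ \ln w_1(u,v) - \ln w_2(u,v) ]

where w_k(u,v) is the weight of the edge from node u to node v in WAN k and \pi_1(u) is the limit probability of node u in the first WAN; the calculation is then repeated with the two WANs' roles swapped.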

In this method, authorship attribution is done by calculating the overall difference (measured in Shannon entropy units called centinats) between a Markov chain representing the word-adjacency preferences for an entire canon by one author and the Markov chain representing the word-adjacency preferences of the play we wish to attribute. The author whose canon produces the Markov chain least different from the Markov chain for the play to be attributed is the likeliest author amongst the candidates tested. This method is entirely unlike other recently applied methods of authorship attribution, but it confirms their results: Christopher Marlowe did contribute to parts of all three of Shakespeare's Henry 6 plays. My team's detailed study of this was published in Shakespeare Quarterly in 2016 and it used 100 function words to create profiles for Shakespeare, John Fletcher, Ben Jonson, Christopher Marlowe, Thomas Middleton, George Chapman, George Peele, and Robert Greene. Thomas Kyd is not one of the authors we profile because his dramatic canon is too small even if we accept the play Soliman and Perseda as his.

The relative entropy between different authors' profiles is a measure of how easy they are to tell apart by this method [SLIDE]: the larger the relative entropy, the more distinctive are their habits in clustering function words. Because in these calculations it matters which author we treat as 'first' and which as 'second', each pairing appears here twice and it is the average of the two figures that we are concerned with. [SLIDE] The pairing with the lowest relative entropy is Shakespeare and Ben Jonson at an average 4.4 centinats: they are the hardest to tell apart by this method. [SLIDE] The highest pairing is Marlowe and John Fletcher at over 16 centinats: they are the easiest to tell apart. [SLIDE] Looking at whose profile is most dissimilar from Shakespeare's, it is Marlowe's at an average 9.5 centinats distance; this dispels any possibility that Marlowe actually wrote all the plays we confidently attribute to Shakespeare.

To get a sense of how well this method works, we can test how similar each of the Shakespeare plays is to each author's profile [SLIDE]. Here are the results, and there are three things to observe. First, for almost every play the red circles indicating Shakespeare appear nearest the bottom of the chart, indicating that for the play in that column the Shakespeare profile is the nearest profile to it. Secondly, the highest of the red circles--indicating the greatest distances from the average Shakespeare style--are for (reading from right to left) The Two Noble Kinsmen, Titus Andronicus, Timon of Athens, The Taming of the Shrew, Pericles, Measure for Measure, Macbeth, Henry VIII, and the three Henry VI plays. That is, in every case the plays that our method shows to be most unlike Shakespeare's style in function-word clustering are those plays that entirely different methods have shown to be co-authored. Thirdly, notice that for all authors the coloured markers form roughly horizontal coloured bands at different heights: this could not happen unless the method were indeed capturing something distinct about authorial style.

New Work on King Lear

Most recently I have been working on applying some of these methods to questions other than those of authorship attribution. I have been trying to see if these methods throw any light on the differences between the early editions of Shakespeare for those 18 plays for which we have multiple early editions and do not know why they differ, and hence do not know which editions best reflect what Shakespeare actually wrote. I'll confine my comments mainly to the case of King Lear, which as you probably know comes down to us via two early editions: the 1608 quarto edition and the 1623 Folio edition. These two editions of King Lear are substantially different in that hundreds of lines present in one edition are absent in the other (that works both ways), and these variations cluster so that entire speeches are present or absent in one or other edition and a whole scene in the quarto, Scene 17, has no equivalent scene in the Folio. It is often stated that there are a certain number of lines present in the quarto (typically around 300) that are absent in the Folio, and a certain number of other lines (typically around 100) that are present in the Folio but absent in the quarto. That is, each has some material that the other lacks. This claim is not as simple or straightforward as it seems, because on account of spelling and punctuation variation there are literally thousands of small differences between the quarto and the Folio, and it is not at all clear how many of these should make us say that a line in the quarto is essentially different from a line in the Folio.

To illustrate this problem, we need some kind of tool that can compare texts and highlight their differences. Happily, we all already have such a tool: Microsoft Word is excellent at producing such an analysis using its Compare Documents feature. The Internet Shakespeare Editions website has high-quality transcriptions of the quarto and Folio texts, so let us see what differences Word finds between them if we designate the quarto as the original document and the Folio as a revision of it. [SWITCH TO LIVE WORD DOC "Q1-F-original-spelling-comparison"] In Word's markup, unchanged text is left in black type and changes are in red type, with deletions indicated by the original being struck through and insertions indicated by underlining. So, in this opening line the spelling change Gloster > Gloucester and character renaming of Bastard > Edmond are shown in each case by the first word being struck out and the second word underlined as an insertion. In the seventh line down, we see allwaies struck out and alwayes inserted, which likewise is not what most people would call a substantive change. We have to scroll down to the second page to find what most people consider a real substantive difference: the presence only in the Folio of the lines "strengths, while we . . . preuented now". The problem with this analysis is that the substantive differences are hard to see because they are swamped by the small differences in spelling and punctuation, which Word dutifully marks up as revisions.

To get a better sense of the substantive differences, we can take modernized editions of the quarto and Folio so that merely incidental differences disappear, and for that the 1986-87 Oxford Complete Works versions are useful because they do not conflate the texts as almost all previous editions did. Comparing those [SWITCH TO LIVE WORD DOC "Q1-F-modern-spelling-comparison"], we see more clearly the substantive differences, and just how much the two editions have in common once we regularize the spelling. But look at the difficulty we now have in the second speech: in the quarto, Gloucester refers to "the division of the kingdoms" (plural) but in F he says "the division of the kingdom" (singular). Is that a substantive difference? In the quarto he goes on to say that "equalities" are weighed but in F it is "qualities". These last two words have distinct meanings and etymologies, but they differ only in one letter: does that count? What about cases where the words are the same, but the word order (the syntax) differs, as with "I have, sir, a son by order of law" in the quarto and "I have a son, sir, by order of law" in the Folio? It is easy for the human mind to treat those as semantically equivalent but it is hard to teach a computer to treat them as the same, since in terms of runs of individual characters they are quite distinct. To take a notorious example, Regan says [SLIDE] "Sir, I am made | Of the self-same mettle that my sister is" in the quarto but "I am made of that self mettle as my sister" in the Folio. How much difference is that? Editing the play for the Arden3 Shakespeare series [SLIDE], R. A. Foakes took the view that these are essentially the same line except that the quarto begins it with "Sir" and the Folio does not, which he marked up like this. Another reader might say that the other differences between Q and F are significant too, and that is a subjective judgement.

Is there not some objective measure we might apply to the problem, some way of quantifying the difference between these lines? In Information Theory there are a number of ways we might count these differences, and the simplest to understand is what is called Edit Distance. This is a measure of how many keystrokes it would take to turn one reading into the other. One can imagine turning the quarto reading into the Folio reading by [SLIDE] deleting the "Sir," (four keystrokes), changing "the" into "that" (three keystrokes: a deletion and two insertions), deleting "-same" (five keystrokes), changing "that" to "as" (four keystrokes: three deletions and one insertion), and deleting "is" (two keystrokes). That comes to 18 keystrokes, the Edit Distance between these lines. Notice that we did not count the keystrokes needed to move our insertion/deletion point along the line: we assumed that we could insert and delete anywhere for no cost. There are other ways to calculate the detail of Edit Distance but this is the essential concept.
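For those who want to see the machinery, here is a minimal sketch of the textbook (Levenshtein) form of Edit Distance, computed by dynamic programming. Note that this version counts a substitution as a single operation, whereas the hand tally above counts a change as a deletion plus an insertion, so the two figures need not agree exactly; the quotations are joined into single lines and case-normalized purely for illustration.

def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = curr
    return prev[-1]

quarto = "sir, i am made of the self-same mettle that my sister is"
folio = "i am made of that self mettle as my sister"
print(edit_distance(quarto, folio))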

But Edit Distance does not always match well our subjective judgements of textual difference, as can be seen from our earlier example of [SLIDE] "But I have, sir, a son by order of law" (Q) versus "But I have a son, sir, by order of law" (F). To the human eye all that has happened is that [SLIDE] ", sir," has been shifted two words to the right, but how much editing is that? By a process of deletion and insertion--with no editing manoeuvre called transposition--this is five deletions and five insertions, making a total of 10 keystrokes. And indeed we might not even agree that moving the five characters of ", sir," is what needs to be done. To my surprise when making these slides, [SLIDE] Microsoft Word elected instead to call this a leftwards move of the phrase "a son", which I suppose has the same 'cost' since it also consists of five characters and costs 10 keystrokes. But by either route, giving the difference between these two lines an Edit Distance of 10 seems excessive: it is more than half the Edit Distance of the previous example, which I think you'll agree was much more complicated a difference.

There is an existing branch of science that is concerned with this notion of Edit Distance because it has a practical application in medicine. In bioinformatics, scientists are concerned with sequences of genes or proteins that are encoded as letters of the alphabet [SLIDE], and as with the two texts of King Lear the important question is often "how can we quantify the differences between these sequences of letters?" In my team we have started using methods from bioinformatics to try to quantify the differences between the quarto and Folio texts of King Lear and indeed the other plays for which interestingly different early editions exist, including Hamlet, The Merry Wives of Windsor, 2 & 3 Henry 6, and Henry 5. I don't have much in the way of results yet that I can reveal to you, but I can show this visualization from a technique we are trying out called Dynamic Time Warping applied to the three early editions of Hamlet [SLIDE].

Dynamic Time Warping is a technique used in gene analysis, and elsewhere, that attempts to align two runs of letters using the minimum number of 'jumps', in the sense of lateral movements pulling one of the texts left or right to begin realigning it with the other. As you may know, Q2 and Folio Hamlet are highly similar and Q1 Hamlet is quite unlike either of them. Looking at the chart on the right comparing Q2 and Folio Hamlet, we begin at the origin, which represents the beginning of the play. The numbers along the x and y axes count the words from the beginning of the play, and the white line shows which word in Q2 lines up with which word in F. The line rises steadily at 45 degrees: the 200th word of Q2 aligns with the 200th word of F, the 400th word of Q2 aligns with the 400th word of F, and so on. The only wrinkle was around the 100th word, where some words in Q2 had to be passed over to get the texts to line up again. Now look at the picture on the left: lots of jiggling back and forth of one text against the other is required to get Q1 Hamlet lined up with Folio Hamlet. This is a fairly good visualization of the problem of text alignment.
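For the curious, here is a minimal sketch of the textbook dynamic-programming form of the algorithm, aligning two word sequences with a simple cost of 0 for a matching word and 1 otherwise, and returning the alignment path that such charts plot. Our actual implementation has refinements I have not described; the two toy lines below are simply the familiar variant openings of the "To be" speech.

def dtw_path(seq_a, seq_b):
    """Cheapest alignment path between two word sequences, allowing one
    sequence to pause (repeat a position) while the other moves on."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            cost[i][j] = step + min(cost[i - 1][j],      # advance in a only
                                    cost[i][j - 1],      # advance in b only
                                    cost[i - 1][j - 1])  # advance in both
    # Trace an optimal path back from the end of both sequences.
    path, i, j = [], n, m
    while True:
        path.append((i - 1, j - 1))
        if i == 1 and j == 1:
            break
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda p: cost[p[0]][p[1]])
    return list(reversed(path))

q2 = "to be or not to be that is the question".split()
q1 = "to be or not to be ay there's the point".split()
for i, j in dtw_path(q2, q1):
    print(q2[i], "<->", q1[j])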

We are just starting to get a handle on the problems of comparing the early editions of Shakespeare using computational means. As I say, I'm sorry that I have no results to show you yet for this part of my work. But I hope I have at least given you some sense of the problems we are trying to tackle and some of the ideas we are using to try to solve them. Our aim is to be able to describe in some quasi-objective ways just which early editions of Shakespeare are most like which other early editions, to see if they fall into various categories, for example the category of 'bad quartos' that was so much talked about 30 years ago in Shakespearian textual studies and which is now given little credence. Our goal is to find out how we can best explain the quarto/Folio differences in Shakespeare's plays and, if we can, to determine just where they are caused by textual corruption and where by artistic revision of the plays, either by Shakespeare or else by one of the many people we now know were his co-authors.

Works Cited

Beaumont, Francis and John Fletcher. 1679. Fifty Comedies and Tragedies. Wing B1582. London. J. Macock [and H. Hills] for John Martyn, Henry Herringman, Richard Marriot.

Burrows, John. 2002. "'Delta': A Measure of Stylistic Difference and a Guide to Likely Authorship." Literary and Linguistic Computing 17. 267-87.

Burrows, John. 2003. "Questions of Authorship: Attribution and Beyond." Computers and the Humanities 37. 5-32.

Burrows, John. 2007. "All the Way Through: Testing for Authorship in Different Frequency Strata." Literary and Linguistic Computing 22. 27-47.

Craig, Hugh and Arthur F. Kinney. 2009. Shakespeare, Computers, and the Mystery of Authorship. Cambridge. Cambridge University Press.

Holdsworth, R. V. 1982. "Middleton and Shakespeare: The Case of Middleton's Hand in Timon of Athens." Unpublished PhD thesis. University of Manchester.

Holdsworth, Roger. 2012. "Stage Directions and Authorship: Shakespeare, Middleton, Heywood." On Authorship. Edited by Rosy Colombo and Daniela Guardamagna. Memoria di Shakespeare. 8. Rome. Bulzoni Editore. 185-200.

Hoy, Cyrus. 1956. "The Shares of Fletcher and His Collaborators in the Beaumont and Fletcher Canon ([Part] I [of VII])." Studies in Bibliography 8. 129-46.

Jackson, MacDonald P. 1979. Studies in Attribution: Middleton and Shakespeare. Jacobean Drama Studies. 79. Salzburg. Institut für Anglistik und Amerikanistik, Universität Salzburg.

Lake, David J. 1975. The Canon of Thomas Middleton's Plays: Internal Evidence for the Major Problems of Authorship. Cambridge. Cambridge University Press.

Shakespeare, William. 2007. The Complete Works (= The Royal Shakespeare Company Complete Works). Ed. Jonathan Bate and Eric Rasmussen. Basingstoke. Macmillan.

Spedding, James. 1850. "Who Wrote Shakespere's Henry VIII?" Gentleman's Magazine. ns 34. 115-123, 381-382.

Taylor, Gary. 1987. "The Canon and Chronology of Shakespeare's Plays." William Shakespeare: A Textual Companion. Edited by Stanley Wells, Gary Taylor, John Jowett and William Montgomery. Oxford. Clarendon Press. 69-144.

Timberlake, Philip. 1931. The Feminine Ending in English Blank Verse: A Study of its Use By Early Writers in the Measure and its Development in the Drama Up to the Year 1595. Menasha WI. Banta.