"New light on the King Lear quarto/Folio differences: Computational approaches to problems in textual studies" by Gabriel Egan and Hugh Craig

Licence

My co-author on this talk, Hugh Craig, is Director of the Centre for Literary and Linguistic Computing at the University of Newcastle in Australia. His most important works are the book that he co-wrote with Arthur F. Kinney called Shakespeare, Computers, and the Mystery of Authorship (published by Cambridge University Press in 2009) and the book that he co-wrote with Brett Greatley-Hirsch called Style, Computers, and Early Modern Drama: Beyond Authorship (published by Cambridge University Press in 2017). Hugh Craig and I make this talk's script and slides and any recording of them available under a Creative Commons Attribution Share-Alike (CC-BY-SA) licence. This means that anybody can share copies of them so long as they credit Hugh Craig and Gabriel Egan as the authors, but they may not put upon their copies of this work a more restrictive licence forbidding further copying and sharing. What I will be showing today is a work-in-progress and is rough around the edges. You can help improve it by your comments and questions after the talk.

Preamble

When I arrived as an MA student here at the Institute 25 years ago, I was not much interested in the topic of how the early editions of Shakespeare, the quartos and Folio, differ from one another. I did not think it important. I learned from John Jowett's bibliography course that it does matter because the modern editions we read are based on one or other of the early editions, and which one is chosen affects whether Gertrude in the play Hamlet knows or does not know what her son thinks her husband did. It affects whether Polonius says "Neither a borrower nor a lender, boy" (Shakespeare 1604-5, C4v) or "Neither a borrower, nor a lender be" (Shakespeare 1623, nn6v). It affects whether in Romeo and Juliet the heroine says "a Rose, | By any other name would smell as sweet" (Shakespeare 1597, D1v) or "a rose, | By any other word would smell as sweete" (Shakespeare 1599, D2v). And it affects whether in King Lear there is or is not a Mock Trial episode (Shakespeare 1608, G3v-G4v; Shakespeare 1623, rr4r). If we care about the words in Shakespeare, we care about the differences between the early editions.

    Shakespeare's King Lear was published in a quarto of 1608, which we will call Q1, that was reprinted in 1619 to make what we will call Q2, and published again in the 1623 collected plays edition known as the First Folio. As you probably know, the Q1 and the Folio versions of King Lear differ substantially. [SLIDE] The following sentence attempts to state their differences in the simplest possible terms, but there is something wrong with it:

The two texts differ in their makeup: the Quarto lacks 102 lines (also many smaller phrases and single words) not found in the Folio, whereas the Folio lacks 285 lines (and some phrases and words) not found in the Quarto. (Vickers 2016, ix).

Reviewers of this book were quick to leap upon this sentence in which Brian Vickers became entangled in his own double negatives. [SLIDE] What he obviously meant to write was that the quarto lacks 102 lines found in the Folio and the Folio lacks 285 lines (different lines, obviously) found in the quarto.

    But that is not what I find wrong with this sentence, since it is easy to conjecturally emend it by just deleting two spurious 'nots' that spoil the sense. [SLIDE] What I find troubling about this sentence is the quantifying of lines, phrases, and words that differ between the quarto and the Folio. If you read different accounts of the quarto/Folio variants in King Lear these numbers differ because no one can agree on what objectively counts as textual difference. Consider these two speeches from the two early editions of King Lear [SLIDE]:

Goe to, goe to, mend your speech a little,
Least it may marre your fortunes.
(quarto sig. B2r)

How, how Cordelia? Mend your speech a little,
Least you may marre your Fortunes.
(Folio sig. qq2r)

At the level of the word, the quarto has [SLIDE] "goe to" (twice) and [SLIDE] "it" that are not in the Folio, and the Folio has "how" (twice) and "Cordelia" and "you" that are not in the quarto. If we attend to smaller details, there are variations in punctuation--the speech is one sentence in the quarto, two in the Folio--and of capitalization. Looked at crudely, the above speeches do not contribute to the counts of lines present in one edition and absent in the other, but perhaps they should. How can we decide?

    [SLIDE] If we pay close attention to the verbal and literal details of quarto/Folio King Lear, the differences start to run into the thousands. Can we even find them all? By manual comparison it is practically impossible, but we can instruct a computer to compare two digital texts and count the differences in various ways. One of the simplest to understand is called Edit Distance, in which we imagine trying to turn the quarto text into the Folio text using individual acts of deleting and inserting single characters, counting how many keystrokes you have to make to get from one text to the other [SLIDE animation]. So, using just the actions of 'delete' and 'insert' the Edit Distance from quarto to Folio King Lear for these two lines is 46. We can quantify textual difference. I say we have to quantify textual difference to have meaningful discussions about it.
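
    As a minimal sketch of the insert-and-delete Edit Distance just described (not the program used to make the slide), the count can be derived from the longest run of characters that the two texts share in the same order: every character outside that shared run must be either deleted or inserted. The function name indel_distance is our own label for this illustration. How spaces, line breaks, and punctuation are treated changes the total, so this sketch should not be expected to reproduce exactly the figure given above.

```python
def indel_distance(a: str, b: str) -> int:
    """Count the single-character deletions and insertions that turn a into b."""
    # Length of the longest common subsequence (LCS), found by dynamic
    # programming while keeping only one row of the table at a time.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[len(b)]
    # Characters of a outside the LCS are deleted; those of b are inserted.
    return (len(a) - lcs) + (len(b) - lcs)


quarto = "Goe to, goe to, mend your speech a little,\nLeast it may marre your fortunes."
folio = "How, how Cordelia? Mend your speech a little,\nLeast you may marre your Fortunes."
print(indel_distance(quarto, folio))
```

    Allowing a third action, 'substitute', would give the better-known Levenshtein distance, which is never larger than the insert-and-delete count.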

    There are other ways to quantify textual difference, as we shall see. We must first decide what counts as no difference at all. [SLIDE] For many purposes, we do not count the difference between the long-s and the modern short-s as a difference, nor do we count fancy typefaces such as swash letters. [SLIDE] For many purposes we do not care about spelling differences caused by early modern writers' and printers' freedom to spell one word in multiple ways, [SLIDE] nor differences in capitalization. We might even want to generalize further and not count the differences between the various forms that a verb might take, so we might decide to treat identically, and count as one occurrence of the verb 'to be', all its possible forms such as 'am', 'are', 'is', 'was', 'were', 'been', 'being', 'wast' and 'wert'. There are technical means for doing this that I can go into in the Q&A if people are interested. Suffice to say these things can be done.
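
    One way such levelling might be done is sketched here, on the assumption that we have lookup tables of variant spellings and of inflected forms. The two small tables and the function name normalize are hypothetical and exist only for this illustration; a real study would need far larger tables and some way of handling words whose spelling is ambiguous.

```python
import re

# Hypothetical lookup tables, for illustration only.
SPELLING_VARIANTS = {"goe": "go", "marre": "mar", "sweete": "sweet"}
BE_FORMS = {"be", "am", "are", "is", "was", "were", "been", "being", "wast", "wert"}


def normalize(text, lemmatize=False):
    """Return the words of text with typographic and spelling variation removed."""
    text = text.replace("\u017f", "s")      # long-s becomes short-s
    text = text.lower()                     # capitalization is ignored
    words = re.findall(r"[a-z']+", text)    # punctuation is dropped
    words = [SPELLING_VARIANTS.get(w, w) for w in words]
    if lemmatize:
        # Every listed form of 'to be' is counted as the one verb 'be'.
        words = ["be" if w in BE_FORMS else w for w in words]
    return words


print(normalize("Goe to, goe to, mend your speech a little"))
```

    Each levelling step is a separate decision, so the steps are applied in a fixed order and the most aggressive one, the collapsing of verb forms, is switched off unless requested.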

* * *

    [BLANK SLIDE] Across most of the twentieth century it was assumed by most scholars that Shakespeare only ever wrote a single version of each play and that any differences between the early editions arose mainly or entirely because of variations in how that single version changed before and while being printed. The processes of manuscript copying and printing were held to be the causes of the differences we find in the early editions.

    Then, in 1976, Michael Warren argued that there never was "one primal lost text, an 'ideal King Lear' that Shakespeare wrote" (Warren 1978, 96). Instead the differences between the quarto and Folio regarding the characters of Edgar and Albany point to authorial revision. These artistic changes, according to Warren, make Albany "a weaker character, avoiding responsibility" and turn Edgar from "a young man overwhelmed by his experience" to one who "has learned a great deal, and who is emerging as the new leader of the ravaged society" (Warren 1978, 99). The idea that for King Lear authorial revision is better than textual corruption at explaining the early editions' differences occurred almost simultaneously to two other scholars, Steven Urkowitz and Peter Blayney (Urkowitz 1980; Blayney 1982), and to another, P. W. K. Stone, who thought the revision non-authorial (Stone 1980).

    Like Warren, Steven Urkowitz saw special significance in the way Albany differs in the quarto and Folio (Urkowitz 1980, 80-128). Differences clustering in one or two characters are hard to explain by textual corruption in general since without a special cause general corruption ought to fall more or less evenly across the text. In Urkowitz's analysis, Albany becomes a markedly less sympathetic character in the Folio when compared to the quarto: he is more obviously overwhelmed by the tension between his situation as the ruler of an invaded land and his sympathy towards its former monarch. In the final scene, according to Urkowitz, Albany's zeal to bring on the trial of Edmund for his own reasons fatally distracts him from the imminent danger to Lear and Cordelia.

    Character analysis was not the only evidence brought to bear. A landmark contribution to the debate was Gary Taylor's claim that the treatment of the war in the play, especially the anti-climactic battle in Scene 23/5.2, is so different in the two editions that only authorial revision can reasonably explain it (Taylor 1980). There followed a collection of essays, The Division of the Kingdoms, edited by Taylor and Warren, that remains the definitive survey of the evidence for the authorial-revision hypothesis (Taylor & Warren 1983). The authorial-revision hypothesis for King Lear was met with a series of sceptical responses, which for the most part objected that if Shakespeare had wanted to revise the play to change the characters of Edgar and Albany and the depiction of the war he would have done so more obviously, making greater differences than those we see between the quarto and the Folio (Thomas 1984; Carroll 1988; Foakes 1993; Meyer 1994; Clare 1995; Knowles 1995; Thomas 1995; Shakespeare 1997; Knowles 1999; Knowles 2008; Vickers 2016).

* * *

    Can computational approaches throw any light on the question of whether quarto and Folio King Lear are separated by authorial revision? Can they tell us whether the relationship between the quarto and the Folio text for this play is unlike the quarto/Folio relationship for other plays? At the height of the New Bibliography in the twentieth century, some early editions of Shakespeare were classified as bad quartos because they seemed especially affected by textual corruption and omission when compared to their good-quarto or Folio counterparts. Prime bad-quarto candidates were the 1597 Q1 edition of Romeo and Juliet, the 1602 Q1 edition of The Merry Wives of Windsor, and the 1603 Q1 edition of Hamlet. One theory that arose to account for these early editions' badness was that their underlying manuscripts were created by minor actors in the play attempting to recall their entire script from memory, putting together an essentially pirated text and selling it to an unscrupulous publisher. Another explanation offered for corruption in an early quarto was that an audience member made a surreptitious shorthand record of the script as it was being performed. The most longstanding theory, first suggested by eighteenth-century editors, was that the early short quartos were simply early versions of the plays (containing the work of Shakespeare and others) that Shakespeare expanded upon to make the good quarto and Folio versions.

    In the case of King Lear, it was always generally accepted that the 1608 quarto was too good a text to have been made by its actors' memorial reconstruction of the script. The view of E. K. Chambers (Chambers 1930, 465-66) and W. W. Greg (Greg 1933, 252-57) was that it was made by a shorthand report, but that idea was largely dispelled by George Ian Duthie's study of the shorthand systems available at the time (Duthie 1949). Duthie found that the differences between quarto and Folio King Lear could not result from any conceivable misreading or faulty expansion of symbols that the three available systems of shorthand--Timothy Bright's Charactery (Bright 1588), Peter Bales's Brachygraphy (Bales 1590), and John Willis's Stenography (Willis 1602)--might produce, and moreover the misreadings and faulty expansions to which the systems are prone are generally absent.

    [SLIDE] In the following list are eight plays for which we have quarto and Folio versions to compare, together with a crude summary of the ways these editions have been described, especially in the 1986-87 Oxford Complete Works (Wells et al. 1987):

The Merry Wives of Windsor Q1: bad quarto, memorial reconstruction. F: independent theatrical manuscript

Hamlet Q1: bad quarto, memorial reconstruction (or early version?). Q2: authorial papers. F: independent theatrical manuscript (reflecting authorial revision)

Ben Jonson Every Man in his Humour Q1: authorial papers. F: Q1 marked up to reflect extensive authorial revision

Romeo and Juliet Q1: bad quarto, memorial reconstruction. Q2: authorial papers. F: Q3 (a reprint of Q2) lightly annotated from a theatrical manuscript affecting SDD and SPP but not dialogue

King Lear Q1: authorial papers. F: Q2 (a reprint of Q1) heavily annotated from a theatrical manuscript (reflecting authorial revision).

1 Henry IV Q1 fragment: authorial papers. Q2: Q1. F: Q6 (descended from Q2) annotated from a theatrical manuscript affecting SDD and SPP and religious swearing in dialogue

Love's Labour's Lost Q1: reprint of a lost edition set from authorial papers. F: Q1 lightly annotated by reference to a theatrical manuscript

The Merchant of Venice Q1: authorial papers. F: Q1 sporadically annotated by reference to a theatrical manuscript

For some plays the preceding quarto is highly similar to the Folio because the Folio was in fact typeset from the quarto. For others, most especially King Lear Q1 and Hamlet Q2, the quarto seems to show us the play as Shakespeare first wrote it and the Folio a version after he had revised it. And for others again, including Hamlet Q1 and Romeo and Juliet Q1, the first edition has in the past been classified as a bad quarto because it seems like a garbling of the later, better version. Let us see if computational comparisons tell us anything that might support or undermine these classifications.

    [SLIDE] For our first computational test we take the 100 most common words in English, which include 'the' and 'and' and the other so-called function words. These words are so common that they comprise half of everything we say or write. For each word we count its frequency in each act of our eight plays (as they are divided in modern editions) in the early quarto and in the Folio. For 1 Henry IV we use Q2 because Q1 is only a fragment, and for Hamlet and Romeo and Juliet we repeat the process separately for the Q1 and Q2 editions, giving us results for ten quartos in all.
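
    The counting step can be sketched as follows, assuming each act has already been reduced to a list of lower-case words. The five-word vocabulary stands in for the full list of 100, and the function names are hypothetical. This sketch records raw counts; whether raw counts or rates per thousand words are the better input is a separate decision that it does not settle.

```python
from collections import Counter

# A stand-in for the list of 100 function words, for illustration only.
FUNCTION_WORDS = ["the", "and", "of", "to", "a"]


def function_word_counts(act_words, vocabulary=FUNCTION_WORDS):
    """Return {word: count} for one act, given as a list of lower-case words."""
    tally = Counter(act_words)
    return {w: tally[w] for w in vocabulary}


def counts_by_act(acts, vocabulary=FUNCTION_WORDS):
    """Return {word: [count in act 1, count in act 2, ...]} for one edition."""
    per_act = [function_word_counts(words, vocabulary) for words in acts]
    return {w: [counts[w] for counts in per_act] for w in vocabulary}


# Two toy 'acts', invented to show the shape of the output.
toy_acts = [["the", "king", "and", "the", "fool"], ["to", "the", "heath"]]
print(counts_by_act(toy_acts)["the"])
```

    Running the same function over the quarto and then the Folio gives, for each word, two lists of per-act counts that can be compared with each other.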

    [SLIDE] Here are the counts for just the word 'the' for Q2 and Folio Hamlet. As you can see, they are pretty similar for the two editions. For some quarto/Folio pairs they are more extreme than this, with one edition having considerably more or fewer occurrences of 'the'. Using a statistical method called a t-test we can ask "how often we would get results as extreme as the ones we got"--that is, with results so different for the quarto and the Folio--"when random variation alone is operating?" The t-test gives us a probability, called a p-value, which tells us how often the results we got would come up anyway by chance alone. [SLIDE] For this particular table, the p-value from Welch's t-test is about 0.975, meaning that about 98 times out of 100 we would get numbers as varied as this when only random variation is the cause. That means there is no significant difference between Q2 and F on this measure.
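
    For those who want to see the arithmetic, here is a minimal sketch of Welch's t-test using only Python's standard library. It is an illustration under stated assumptions, not the software used for the slide: the per-act counts are invented, the function names are our own, and the p-value is found by numerically integrating the t distribution rather than by a statistics package. (SciPy users can get the same test from scipy.stats.ttest_ind with equal_var=False.)

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df):
    """Chance of a t value at least this far from zero under random variation."""
    t = abs(t)
    if t == 0:
        return 1.0
    # Simpson's rule for the area under the density between 0 and |t|,
    # using an even number of steps no wider than 0.01.
    steps = 2 * max(1000, math.ceil(t * 50))
    h = t / steps
    area = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return max(0.0, 1.0 - 2.0 * area)


def welch_t_test(xs, ys):
    """Welch's unequal-variance t-test; returns (t, degrees of freedom, p).

    Fails with a division by zero if neither sample varies at all.
    """
    vx, vy = variance(xs) / len(xs), variance(ys) / len(ys)
    t = (mean(xs) - mean(ys)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(xs) - 1) + vy ** 2 / (len(ys) - 1))
    return t, df, two_sided_p(t, df)


# Invented per-act counts of 'the' for two editions (NOT the slide's data).
quarto_the = [212, 198, 240, 175, 231]
folio_the = [209, 201, 236, 178, 229]
print(welch_t_test(quarto_the, folio_the))
```

    Welch's version of the test is the cautious choice here because it does not assume that the two editions vary from act to act by the same amount.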

    If the count for Q2 were very different from the count for F, we would get a low p-value. Where it is, say, 0.01, that means that we would get results as extreme as this just one time in a hundred. We did the counting for all 10 editions for all 100 common words and calculated the resulting p-values. For each play we sorted these p-values, and hence the words from which they are derived, into a list from lowest probability to highest and placed this list along the x-axis of a graph. For each word in the list of 100 words, we can plot on the y-axis its corresponding p-value. This is the result [SLIDE].

    The right-hand side of this picture shows that for every play the words that give us the least surprising difference between the quarto and the Folio, the words for which the p-values are the 97th, 98th, 99th, and 100th in a list sorted from smallest to largest, have p-values close to 1. For these words, the plays are virtually all alike. But for words at the opposite end of the list, the words with low p-values for each play (on the left side of this picture), the words that are 1st, 2nd, 3rd, and 4th (and so on) in our sorted list for each play, the plays are unalike. The word with the lowest p-value in The Merchant of Venice Q1/F pair has a p-value of around 0.6, so the difference between that word's frequency of occurrence in Q1 and its frequency of occurrence in F is the sort of result we would expect random variation to produce quite often. But the word with the lowest p-value in The Merry Wives of Windsor Q1/F pair has a p-value close to zero, meaning that the difference between that word's frequency of occurrence in Q1 and its frequency of occurrence in F is not the sort of result we would expect random variation to produce very often.

    The most interesting feature of this picture is how we get from the lowest p-values (close to zero) on the left to the highest p-values (close to 1) on the right, in the case of each play. For the texts forming the lowest tracks on this graph, there is a steady rise in the p-value so that the track is close to a straight line running south-west to north-east. But the tracks higher up have a distinctive curve, starting off on the left side almost vertical and then--if one imagines driving along the track--taking a gentle right turn to become almost horizontal. What does that tell us about the numbers that the tracks represent? It means that whereas the lower tracks have a steady rate of change, and hence are roughly linear, these upper tracks have a rate of change (from one word in the list to the next) that starts out high (hence their verticality), so that the p-value for Word #2 is considerably higher than the p-value for Word #1, and the p-value for Word #3 is considerably higher than that for Word #2. But for the upper, curved tracks this rate of change slows and ends up low so that the tracks flatten out to become almost horizontal, with their p-values scarcely changing at all once they get close to 1.

    In relation to the words themselves, this means that for a text with one of the upper tracks we find a few words at the beginning of the list showing highly significant differences (represented by low p-values) between their frequencies in the early quarto and the Folio but after those few words the remainder of the list is made of many words that show little significant difference between their frequencies in the early quarto and the Folio. For a text with one of the lower tracks, the differences in frequency between the quarto and the Folio start out highly significant (low p-values) and stay low for many more words, so that overall for such a text many more words have highly significantly different frequencies in the quarto and the Folio.
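
    The sorted list behind each track is simple to build once the p-values exist. The sketch below assumes a dictionary giving a p-value for each word; the five values in it are invented to show the shape of the data and are not results from any play, and the function name is hypothetical.

```python
def p_value_track(p_by_word):
    """Sort (word, p-value) pairs from the lowest p-value to the highest."""
    return sorted(p_by_word.items(), key=lambda pair: pair[1])


# Invented p-values for five words, for illustration only.
example = {"the": 0.975, "and": 0.02, "of": 0.40, "to": 0.003, "a": 0.71}

for rank, (word, p) in enumerate(p_value_track(example), start=1):
    # rank gives the position on the x-axis, p the height on the y-axis
    print(rank, word, p)
```

    The differences between neighbouring p-values in this sorted list are the 'rate of change' discussed above: large early differences followed by small ones give the curved track, and roughly equal differences give the straight one.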

    What does this all tell us? The startling feature of this slide is that it visualizes the distinction that textual scholars had already made by subjective study of the plays. The editions with low tracks in this picture are Hamlet Q1, The Merry Wives of Windsor Q1, Every Man in his Humour Q1, and Romeo and Juliet Q1. These are plays that were already thought to be highly different in their early quarto and Folio versions, being three Shakespeare editions that we formerly called bad quartos and Jonson's play that we know he substantially revised. Above these four low tracks is a clear area above which sit six plays with the characteristic right-turn shape. The lowest of these is King Lear Q1, which was always too good to be a bad quarto and which now is widely thought to differ from its Folio counterpart because of authorial revision. Above that comes Hamlet Q2, always deemed a good text but suspected of differing somewhat from the Folio version because of minor authorial revision. The remaining four texts, with the highest tracks, are Romeo and Juliet Q2, 1 Henry IV Q2, Love's Labour's Lost Q1, and The Merchant of Venice Q1. In each case, these are the plays for which we already thought that the Folio was printed from an exemplar of the quarto (or a direct reprint of the quarto) that was first lightly annotated by reference to a theatrical document that for the most part affected its stage directions and speech prefixes but not its dialogue.

    This result suggests that we can computationally separate the early quartos into the existing categories, from (at the bottom) the ones we used to call bad quartos that differ from their Folio counterparts due to corruption, through (around the middle) those it has been claimed differ from their Folio counterparts due to revision, to (at the top) the ones for which the Folio is for the most part a reprint of the early quarto. In this spectrum of quarto/Folio difference, quarto King Lear is distinct from the corrupt texts formerly called bad quartos in that it is more like its Folio counterpart than they are like theirs, and it is also distinct from the plays that were little changed when they were reprinted in the Folio. Perhaps most significantly of all, this experiment suggests that the category of bad quartos was not simply an invention of the New Bibliography: there is something in common that makes Hamlet Q1, The Merry Wives of Windsor Q1, and Romeo and Juliet Q1 seem alike.

* * *

    There is an entirely different kind of measure we can make to approach the same question of how alike or unalike this set of quartos is. That measure is called Shannon entropy, but unfortunately in this talk I do not have time to go through the full explanation of what Shannon entropy measures and how it is calculated. The short and not very satisfying description is that entropy measures how wide is the variety of word choices in a text. [SLIDE] I have time enough only to show Shannon's celebrated equation for entropy--which I consider to be a profound statement about the nature of language--and the results for the ten quarto editions we have been considering. In this picture, the entropy for each quarto text is deducted from the entropy for the corresponding Folio text, so for the rising blue bars on the right the Folio version has a greater internal variety of language than the quarto.

    The first thing to notice is that the three editions long considered to be bad quartos--The Merry Wives of Windsor Q1, Hamlet Q1, and Romeo and Juliet Q1--are markedly less varied in their language than their Folio counterparts, hence their tall blue bars. Also marked, to a lesser extent, is the fact that Every Man in his Humour Q1 is less varied in its language than the Folio version. For the other editions, the difference is less marked and indeed for two of them the difference is significantly negative, meaning that the quarto is more varied in its language than the Folio. King Lear Q1 and Hamlet Q2 are both markedly more varied than their Folio counterparts, and strictly speaking the same is true of 1 Henry IV Q2 although the effect is small.

    In the case of 1 Henry IV Q2, the standard view is that the Folio reprints a descendant of the quarto that was first somewhat expurgated by the removal or softening of its oaths. For the three editions in the middle of this picture--The Merchant of Venice Q1, Romeo and Juliet Q2, and Love's Labour's Lost Q1--the quarto and the Folio are about equally varied in their language. These are the three plays for which we already thought that the Folio is essentially a reprint of the quarto with only minor changes that hardly affect the dialogue.

    King Lear Q1 and Hamlet Q2 are editions that have long been thought to have derived from authorial papers (representing the play as Shakespeare first wrote it down) while their Folio counterparts have been thought to reflect the plays after they had not only been through the process of readying for the stage but also been altered by authorial revision, which made most difference in King Lear. Again, as with the previous measure of the frequencies of occurrence of 100 function words, we seem here to have a computational measure that makes broadly the same classification that traditional textual scholarship has made. Or, in other words, we can with these measures quantify distinctions that were formerly made only qualitatively.
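
    For readers who want the calculation that there was no time to explain, Shannon's equation is H = -(sum over i of p_i x log2 p_i). The sketch below assumes, in line with the short description given above, that each p_i is the share of the text taken up by one distinct word. The function name is our own, and the sketch makes no adjustment for texts of different lengths, which a real comparison of a short quarto with a long Folio text would have to consider.

```python
import math
from collections import Counter


def shannon_entropy(words):
    """Entropy, in bits, of the word-frequency distribution of a list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    # p * log2(1/p) summed over every distinct word; equal to -sum(p * log2 p).
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Toy texts: the second repeats itself more, so its entropy is lower.
varied = "never never never never never".split() + "pray you undo this button".split()
narrow = "howl howl howl howl".split()
print(shannon_entropy(varied), shannon_entropy(narrow))
```

    A text that uses many different words about equally often scores high, and a text that keeps returning to the same few words scores low; the bars in the picture are then one text's score deducted from another's.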

* * *

    When Michael Warren first proposed that Folio King Lear represents Shakespeare's revision of the play as represented in the 1608 quarto, his argument was held to be especially persuasive because it focussed on characters. In particular, Warren was able to show that the character of Albany seems to have been artistically rethought between the quarto and Folio editions, and Urkowitz independently reached the same conclusion (Warren 1978; Urkowitz 1980, 80-128). Let us see what we can do computationally working with individual characters' speeches.

    Before I show you the results for this experiment, I need to explain what is called Principal Component Analysis, since without that it is impossible to make sense of what we did. [SLIDE] Suppose that we recorded on an x/y scatterplot the heights and weights of the people in this room, starting with me [SLIDE]. This dot represents me and its position on the picture represents my readings for two variables: my height on the x-axis and my weight on the y-axis. [SLIDE] If we start measuring more people's height and weight, we get more dots--one per person--for each of which the dot's place represents both their height and weight. Without measuring him, I am confident that John Jowett is taller than me and lighter than me, so we place him here. We can carry on for all the people in the room [SLIDE x 5] and we end up with a cigar-shaped cloud of dots lying on a south-west-to-north-east axis.

    In general, as height goes up, weight goes up, so we say that these variables are positively correlated. The dots do not form a straight line because some people are heavy for their height and others are light. [SLIDE] We may draw a line through this cigar-shaped cloud and label it 'size'. Each person's 'size' would be found by drawing a line from their personal dot on the scatterplot to this new 'size' axis in order to strike the axis at a right angle (a 'perpendicular'). The point at which that perpendicular strikes the 'size' axis is that person's size, which is their reading on our First Principal Component (PC1).

    This 'size' factor is not a perfect representation of most people's body shape. [SLIDE] It is perfect for those two people whose dots fall exactly on the line, but it is [SLIDE] not perfect for these two people (remember who they were?), one of whom is heavy for his height and one of whom is light for his height. [SLIDE] The collective failure of PC1, the 'size' factor, to describe every person's body shape is given by the sum of these distances to the 'size' line, and we can construct a second axis [SLIDE], perpendicular to the 'size' axis. This new axis is the Second Principal Component (PC2) and each dot will have a reading on this new axis: zero for dots that fall exactly on the 'size' line, positive for those people whose dots fall above the 'size' line (who are heavy for their height) and negative for those people whose dots fall below the 'size' line (who are light for their height).

    Because in this illustration we have only two pieces of data (called 'dimensions') for each person, height and weight, we cannot derive more than two Principal Components. The first Principal Component reduces the two data points for each person down to a single value, 'size', and the second Principal Component tells us how much or how little of the variance was captured by the first. Where we have more than two data points we can derive more Principal Components, each new one recording the variance not captured by the preceding ones. The important point here is that we reduced two pieces of data, height and weight, down to one piece, called size. In doing so, we lost some information but we retained the most salient feature of the data, which is that across a group of people the height and weight go up somewhat in step. [BLANK SLIDE]
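
    The height-and-weight illustration can be worked through in a few lines of Python. This is a sketch for the two-variable case only, with invented measurements and a function name of our own: it finds the direction in which the cloud of dots is most spread out (the 'size' line) and reads off each dot's position along that line (PC1) and its distance from it (PC2). It uses the raw units, whereas in practice variables measured in different units are often rescaled first, and that choice changes where the line falls.

```python
import math
from statistics import mean


def pca_2d(points):
    """Return (pc1_scores, pc2_scores) for a list of (x, y) pairs."""
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    centred = [(x - mx, y - my) for x, y in points]
    n = len(points)
    # The three distinct entries of the 2x2 covariance matrix.
    sxx = sum(x * x for x, _ in centred) / (n - 1)
    syy = sum(y * y for _, y in centred) / (n - 1)
    sxy = sum(x * y for x, y in centred) / (n - 1)
    # Angle of the direction of greatest spread: the 'size' line.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # Readings along the 'size' line and along its perpendicular.
    pc1 = [x * ux + y * uy for x, y in centred]
    pc2 = [-x * uy + y * ux for x, y in centred]
    return pc1, pc2


# Invented (height in cm, weight in kg) pairs, for illustration only.
people = [(170, 65), (180, 80), (160, 55), (175, 90), (185, 70)]
sizes, departures = pca_2d(people)
print(sizes)
print(departures)
```

    With 100 variables instead of two the same idea applies, but the directions have to be found with matrix methods from a numerical library rather than with a single angle.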

    To explore whether the quarto and Folio editions of King Lear differ significantly regarding particular characters, we can repeat our first analysis regarding the frequencies of 100 of the most common words but this time looking only at particular characters' speeches in King Lear. It would be cumbersome and unnecessary to do this for all of the dozens of roles in the play, so we set the cutoff at roles of at least 800 words. For all the roles larger than 800 words we create 800-word segments, adding any words left over from this segmentation process to the last segment.
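
    The segmentation rule just described can be sketched as follows. The function name is hypothetical, and one assumption is made where the description leaves room for doubt: a role of exactly 800 words is treated as meeting the cutoff and yields a single segment.

```python
def segment_role(words, size=800):
    """Split a role's words into segments of `size` words.

    Words left over after the last full segment are added to that last
    segment. A role shorter than `size` produces no segments at all.
    """
    if len(words) < size:
        return []  # the role falls below the cutoff and is left out
    n_segments = len(words) // size
    segments = [words[i * size:(i + 1) * size] for i in range(n_segments)]
    segments[-1] = segments[-1] + words[n_segments * size:]
    return segments


# A stand-in role of 1,700 'words' gives segments of 800 and 900 words.
role = ["word"] * 1700
print([len(segment) for segment in segment_role(role)])
```

    Folding the leftover words into the last segment keeps every word of a role in the analysis, at the cost of making the final segment of each role somewhat longer than the others.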

    [SLIDE] In this picture we see the differences between King Lear Q1 (orange dots) and Folio (blue triangles) regarding the frequencies of 100 common words in each of the 800-word segments for each of the characters who speaks at least 800 words. Where a role produces multiple 800-word segments, these are numbered sequentially as in "Lear (1)", "Lear (2)", and so on, with the last segment also containing the remaining words that do not add up to a whole 800-word segment. The counting of 100 words' frequencies produced 100 data points for each segment, and using Principal Component Analysis we reduced these 100 data points to just two, the First and Second Principal Components, which represent the most salient features of that set of data, just as 'size' represents the most salient feature of the body shapes of a group of people. For each segment in this picture, its position on the x-axis is given by the First Principal Component of the 100 frequency counts for that segment and its position on the y-axis is given by the Second Principal Component of the 100 frequency counts for that segment.

    For the most part, each Folio segment is close to its corresponding quarto segment, but three quarto/Folio differences stand out. The quarto/Folio distances are large for the third segments of Edgar's part (marked with a single asterisk), the single segments for Albany's part (marked with a pair of asterisks), and the second segments of Edgar's part (marked with three asterisks). This appears to suggest what literary analysis has claimed, that Albany's and Edgar's are the parts most affected by the differences between the quarto and the Folio versions. Recall that we are here not simply measuring the amount of rewording between the quarto and Folio versions of a character's speeches, but those speeches' use of 100 common words that in general are not much affected by rewording because they are words such as 'the' and 'and' and so on that writers seem not to consciously choose and whose rates are usually static across an author's writing.

    Perhaps the most noticeable aspect of this picture is that the markers for the Fool's speeches fall a great distance from the tightly clustered mass of markers for the other characters. This measure makes objective what has long been subjectively appreciated: the Fool really does speak like no one else. Although overall the Folio is shorter than the quarto, the Folio Fool's part is larger, most noticeably because of the Folio-only Merlin Prophecy speech at the end of Scene 9/3.2, and the difference is enough to generate two 800-word segments from the Folio to only one from the quarto. This reveals something else new, which is that the Fool not only speaks like no one else, but in the latter half of his Folio part the Fool does not speak like himself in the former half. The two segments of the Fool's part in the Folio, shown by the two blue triangles in the top-left corner, are more unalike on this measure than any quarto/Folio segments are.


    We have found that a computational approach is able to make quantitative distinctions between the editions that mainstream New Bibliography classified as bad quartos and other quartos by systematic comparison with their Folio counterparts. To that extent, the category of bad quartos is objectively real. [SLIDE] This result was achieved by merely counting in these texts the occurrences of the 100 most-common words in English and distinguishing (using a t-test) the more-significant from the less-significant quarto/Folio differences on these counts. In this experiment, the plays for which the 1623 Shakespeare Folio was based on a preceding quarto that was only lightly annotated before being used as the Folio's copy are clearly distinguished from the bad quartos, and for the two plays most strongly suspected of authorial revision between the quarto and the Folio, Hamlet and King Lear, the good early quartos fall squarely between the bad quartos below them and the only-lightly-annotated quartos above them. An anomalous result here is that Q1 of Jonson's Every Man in his Humour tests more like a Shakespearian bad quarto than a Shakespeare play authorially revised between its quarto and Folio editions.

    Taking an entirely unrelated measure from Information Theory, called Shannon entropy, the same broad categorization is reproduced. [SLIDE] The bad quartos of The Merry Wives of Windsor Q1, Hamlet Q1, and Romeo and Juliet Q1 have, of the quartos tested, the largest differences from their Folio counterparts. That is, these bad quartos are less varied in their language use than their Folio counterparts. In contrast, for the plays thought to have been revised by Shakespeare, Hamlet and King Lear, the good quartos are more varied in their language use than their Folio counterparts. The plays for which we already thought that the Folio was simply printed from an existing good quarto with only light annotation fall between these extremes.

    And as we just saw, using our counting of the 100 most-common words in English, the characters for which the quarto and Folio counts are most different are Edgar and Albany, which is what subjective literary criticism has long maintained. There is much more to be done. These results are very recent and my co-author Hugh Craig and I have not yet spent enough time in self-criticism trying to convince ourselves that we have made some mistakes and the results do not tell us what they seem to tell us. That is where you can help, by critiquing what I have just said. The critique of others is always more effective than self-critique, or as a wise man once said: "The first principle is that you must not fool yourself and you are the easiest person to fool" [SLIDE].

Works Cited

