"What we Will Never Know About Authorship: Limits to the Art of Attribution" by Gabriel Egan
According to the New Oxford Shakespeare complete-works edition, there are 43 plays that are wholly or partly by Shakespeare. These are the 36 plays in the 1623 Folio plus Pericles published in 1609 and The Two Noble Kinsmen published in 1634 -- two additions that few people will find controversial -- plus five more that remain disputed. [SLIDE] Those five are Arden of Faversham, The Spanish Tragedy, Edward the Third, Sir Thomas More, and Cardenio. The New Oxford Shakespeare also attributed to other writers parts of plays that we have long thought were solely by Shakespeare, including substantial sections of the three Henry VI plays.
These attributions have been controversial and my topic today is some technical aspects of how we attribute authorship in uncertain cases based on the writings themselves. I want to show why some techniques for authorship attribution are reliable and others not. Most commonly, we attribute text to an author by noticing that certain rare words or phrases from the text appear frequently (or at least once) in the canon of a candidate writer but not in the canons of other writers.
When doing such a comparison, we have to bear in mind that different writers' canons are different sizes. The 8 dramatists from Shakespeare's time who have left us the most plays have these canons of sole-authored works [SLIDE]. Shakespeare has the largest canon: 27 out of the total here of 101 plays. If we imagine this set of 101 plays as a 'target' surface in which we might find any particular phrase we are looking for [SLIDE], it is clear that Shakespeare presents a larger surface area in which to find a match. [SLIDE] If our searches are like darts randomly thrown at this target, then all else being equal our darts will land in the 'Shakespeare' sector more often than in any other, merely by virtue of its being the biggest sector. I am speaking here of the significance we attach to finding or not finding rare words or phrases in authors' canons. As we shall see, the problem of differing canon sizes scarcely affects studies that look for common words and phrases.
To correct for differing canon sizes, we could say that a 'hit' for Greene is weighted as 7 times more significant than a hit for Shakespeare because the Greene part of the target is only one-seventh the size. This is the procedure undertaken in the tables of Pervez Rizvi, whose online dataset 'Collocations and N-Grams' is the primary attribution tool used in the current project to edit the complete works of Thomas Kyd being led by Brian Vickers (Rizvi 2018; Vickers 2019). By this method, the project has expanded the Kyd canon -- which until a few years ago was widely agreed to have just one play in it, The Spanish Tragedy -- so that it now includes Soliman and Perseda, and Cornelia (which for other reasons most people already accepted as his) and also Arden of Faversham, King Leir, Fair Em, 1 Henry 6, and Edward 3.
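The kind of correction just described can be sketched in a few lines of Python. This is only an illustration of simple inverse-size weighting, not Rizvi's actual formula: the function name `weighted_hits` and the raw hit counts are invented for the example, and only the two canon sizes (27 plays and 4 plays) come from the figures given above.

```python
# Canon sizes in plays: the two figures mentioned in the talk.
CANON_SIZES = {"Shakespeare": 27, "Greene": 4}


def weighted_hits(raw_hits, canon_sizes):
    """Scale each author's raw hit count up to the size of the largest canon.

    A hit in a canon one-seventh the size of the largest canon
    counts roughly seven times as much.
    """
    largest = max(canon_sizes.values())
    return {
        author: hits * largest / canon_sizes[author]
        for author, hits in raw_hits.items()
    }


# Hypothetical raw hit counts, invented purely for illustration.
raw = {"Shakespeare": 7, "Greene": 2}
weighted = weighted_hits(raw, CANON_SIZES)

for author, score in weighted.items():
    print(f"{author}: raw={raw[author]}, weighted={score:.2f}")
```

With these invented counts, Greene's 2 raw hits outweigh Shakespeare's 7 once the weighting is applied, which shows how much rests on the assumption that a small canon can safely be scaled up to a large one.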
I want to show that the adjustment made for differing authorial canon sizes that is applied by Rizvi, on which these attributions depend, is unnecessary if one is counting common words and phrases and invalid if one is counting rare words and phrases, as the Kyd project's investigators do.1
Take the uncommon word 'water(s)', which is the 636th most frequently used word in all of Shakespeare. [SLIDE] This is how it is unevenly distributed across the plays, listed in alphabetical order from left to right. [SLIDE] The play with the most occurrences, The Tempest, has 14 times as many occurrences as the play with the fewest occurrences, Much Ado About Nothing. [SLIDE] For contrast, let us see how the word 'in', the 10th most common word in Shakespeare, is evenly distributed across the plays [SLIDE]. Here the spread is far less: Henry 5's 156 per 10,000 words is not even double The Winter's Tale's 87.
What happens, then, if we extrapolate from a small canon to a large one, if we weight our hits so as to scale up the smaller canon to the larger? If we had only 4 Shakespeare plays (as we have only 4 Robert Greene plays), would they give us an accurate sense of how many occurrences for 'in' and 'water' to expect in a larger canon? We can test this directly in the case of Shakespeare, because we do have the larger canon. That is, we can construct a small Shakespeare canon from just one or two plays and then keep adding one play to this canon to see how that affects the frequencies of the words we are interested in. Obviously, a larger canon will have more words of all kinds, but we are interested in frequencies so we are looking at how often the words of interest occur in a 10,000-word sample.
[SLIDE] Here we show along the x axis an increasing canon size, taking the plays in alphabetical order. For both words, 'in' on the left and 'water' on the right, we start with just one scene of 1 Henry 4, then the whole act of 1 Henry 4 containing that scene, then 1 Henry 4 as one play, then 1 Henry 4 plus 2 Henry 4 as a two-play canon, then those two plays plus Much Ado About Nothing as a 3-play canon, and so on adding one new play to the canon each time until we have put all 27 Shakespeare plays together. For each constructed canon we calculate the rate of 'in' and 'water' per 10,000 tokens.
[SLIDE] We can see on the left that once we get to 4 plays, the rate of 'in' remains almost constant: it does not matter how many new plays we add to the canon. We need only a 4-play canon to get a good sense of the rate of usage of 'in' in any larger canon. [SLIDE] But for 'water' the pattern is quite different. We see that the effect of adding the third play Much Ado About Nothing (which is exceptionally low in its use of 'water') is to drag the rate for the 3-play canon down markedly, and then adding the fourth play Antony and Cleopatra (which has an unusually large number of occurrences of 'water') takes the rate for the 4-play canon right up again. Then our 5-play canon is even worse than our 4-play canon for predicting the rate of usage of 'water(s)', since its rate is lower, and our 6-play canon is worse still, being lower still. Not until we have 14 plays in our canon is this collection showing a rate of usage of 'water(s)' that is even as much as three-quarters of the final rate for the full 27 plays.
In growing our Shakespeare canon from one play to 27 plays we here took the plays in alphabetical order, which is effectively random order. In the worst-case scenario -- by which I mean if our 4-play canon happens to be our candidate author's 4 plays that use the word in question the least -- then the problem is far worse for the rare word 'water' but not for the common word 'in'. I have not the time to show this, nor to show that the problem is not confined to the uncommon word 'water': it is a general problem with uncommon words because they are unevenly distributed across authors' canons.
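The growing-canon calculation behind these plots can be sketched as follows. This is a minimal illustration in Python rather than the code used for the slides: the function name `cumulative_rates` is an assumption, and the three 100-token 'plays' are invented toy data in which an uncommon word is spread unevenly.

```python
def cumulative_rates(plays, targets):
    """Return the rate per 10,000 tokens of the target words
    after each successive play is added to the canon.

    plays   -- list of token lists, in the order they are added
    targets -- set of word forms to count, e.g. {"water", "waters"}
    """
    rates = []
    hits = 0
    total = 0
    for tokens in plays:
        hits += sum(1 for token in tokens if token in targets)
        total += len(tokens)
        rates.append(10_000 * hits / total)
    return rates


# Invented toy 'plays' of 100 tokens each, for illustration only.
toy_plays = [
    ["water"] * 1 + ["filler"] * 99,   # one occurrence
    ["filler"] * 100,                  # no occurrences
    ["water"] * 5 + ["filler"] * 95,   # five occurrences
]

rates = cumulative_rates(toy_plays, {"water", "waters"})
print(rates)  # [100.0, 50.0, 200.0]
```

Even in this toy case the rate halves and then quadruples as plays are added, which is the instability seen for 'water' in the real canon; a common word would settle to a steady rate much sooner.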
* * *
There is much more to be said about the methods for distinguishing authorial styles so that we can detect co-authorship. But as a general principle it is safest to look at rates of common words rather than rare words, because of this problem of uneven spread. Of course, one would not base an authorship attribution on the rate of usage of one word. But even counting just two common words gives useful data about authorship. [SLIDE] Here is what we get if we plot on the x axis how often each of our 8 dramatists uses the word 'the' and on the y axis how often he uses 'and'. Each dot represents two counts, one for 'the' and one for 'and', for each of our dramatists' canons.
For Shakespeare we have removed As You Like It from the set of 27 plays and plotted the usage of 'the' and 'and' for just the remaining 26 Shakespeare plays. [SLIDE] Then we count these words in As You Like It and see where it falls on the plot. As you can see, the dramatist whose rates of using 'the' and 'and' are closest to the rates found in As You Like It is Shakespeare: his dot is the nearest to the As You Like It dot. If As You Like It were a play of unknown authorship, this plot would tell us that, regarding these two words at least, Shakespeare's habits are -- amongst those of the 8 dramatists we are considering -- the habits most similar to the habits found in the play.
We can count the rates of usage of more than two words, and typical experiments count the 50 or 100 most frequent words. Of course, we cannot display 100 counts on a two-dimensional plot like this because we would need 100 dimensions. But the mathematical formulas that tell us which dot is nearest to which other dot work exactly the same in 100 dimensions as they do in two, and finding the nearest dot is trivial.
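To make the point about dimensions concrete, here is a minimal Python sketch of finding the nearest dot. It assumes ordinary Euclidean distance, which is one common choice and not necessarily the measure behind the plot; the author labels and rates are invented for illustration, and `nearest_author` is a hypothetical helper name.

```python
import math


def nearest_author(profiles, unknown):
    """Return the author whose profile is closest to the unknown text.

    profiles -- dict mapping author name to a tuple of word rates
    unknown  -- tuple of the same word rates for the disputed text
    The tuples may have 2, 50, or 100 entries: the distance
    calculation is the same in any number of dimensions.
    """
    return min(profiles, key=lambda author: math.dist(profiles[author], unknown))


# Invented rates per 10,000 words for three function words.
# These are NOT real counts for any dramatist.
profiles = {
    "Author A": (300.0, 280.0, 150.0),
    "Author B": (420.0, 250.0, 120.0),
    "Author C": (350.0, 330.0, 100.0),
}
unknown_play = (310.0, 275.0, 140.0)

print(nearest_author(profiles, unknown_play))  # Author A
```

Adding more words simply lengthens each tuple; nothing else in the calculation changes.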
* * *
Even if you have a reliable method for determining authorship, it is not obvious how you find where one author takes over from another in a co-authored text. It may be that you have reason to suppose that the authorial stints were organized by an artistic unit such as the scene or the act in plays or the chapter in novels. Evidence from papers in the archive of the theatre impresario Philip Henslowe gives reason to suppose that the act was sometimes the unit of authorial composition in Shakespeare's time, so that different writers would be given the task of writing different acts of a play. But we cannot assume that this always happened.
One successful approach for detecting authorial stints is called 'rolling windows'. [SLIDE] Suppose we want to be able to detect the presence of a single run of lines, say 1500 words, by a second author (here coloured red) within a play that is otherwise all written by a main author (here coloured blue). Suppose we know that the minimum block of text that we can reliably detect the authorship of is 2000 words, and that our test will always point to the author who wrote the majority of the words in that 2000-word block. We could test each successive 2000-word block in our text. But there is a good chance that, as here, the run of 1500 words by our second author that we are trying to detect will not fall wholly within any one of our 2000-word blocks and hence the verdict given for each block will be that the main author, in blue, wrote it, because the second author's writing never happens to predominate in any one block.
[SLIDE] If we instead make our 2000-word window roll across the text, creeping forwards say 500 words at a time, then as it rolls over the block of 1500 words by our second author there will be several successive windows in which the second author's words predominate and the verdict will change to show our second author. [SLIDE x 34] We will not be able to say precisely where this second author's stint begins and ends, but we will know the approximate location. For this method to succeed, the window size needs to be just a little larger than the smallest block of secondary writing we wish to detect and also it needs to be large enough to bring in enough writing for reliable determination of authorship [BLANK SLIDE].
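The contrast between fixed blocks and rolling windows can be simulated with a short Python sketch. It uses the idealized test supposed above, one that always names whoever wrote the majority of the words in a window, applied to a toy text whose true authorship we already know. The function name `window_verdicts`, the 8,000-word length, and the position of the red stint are all assumptions made for the illustration.

```python
from collections import Counter


def window_verdicts(labels, size, step):
    """Return (start, majority_author) for each window of `size` words,
    moving forward `step` words at a time.

    labels -- one true-author label per word of the text
    """
    verdicts = []
    for start in range(0, len(labels) - size + 1, step):
        window = labels[start:start + size]
        author = Counter(window).most_common(1)[0][0]
        verdicts.append((start, author))
    return verdicts


# Toy text of 8,000 words: a 1,500-word stint by the second author
# ('red') beginning at word 1,250, inside the main author's ('blue') play.
labels = ["blue"] * 1250 + ["red"] * 1500 + ["blue"] * 5250

# Successive non-overlapping 2,000-word blocks.
fixed = window_verdicts(labels, size=2000, step=2000)

# A 2,000-word window creeping forward 500 words at a time.
rolling = window_verdicts(labels, size=2000, step=500)
red_starts = [start for start, author in rolling if author == "red"]

print(fixed)       # every block is attributed to 'blue'
print(red_starts)  # [500, 1000, 1500]
```

In this toy case the red stint straddles a block boundary, so no fixed block has a red majority, whereas three successive rolling windows do, locating the stint approximately but not exactly.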
* * *
In this brief outline I have sketched what I think are two indisputably correct principles to be followed in good authorship attribution investigations. The first is to count the use of frequently occurring words rather than rare words. The second is that, when you do not know where in a text a second hand starts and ends, you should not assume anything about how dramatists divided up the labour of collaboration or adaptation, but instead use the Rolling Windows approach to let the text tell you. A final principle, that I have not the time to cover, would be to guard against confirmation bias. That is, to be wary when you find results that you wanted to find. We know that training people in the avoidance or minimization of Unconscious Bias is pointless: if you were able to have any effect on your Unconscious Biases they would not be Unconscious Biases. But you can at least guard against the biases you know you have, for instance preferring to find that Shakespeare wrote something rather than finding that someone else wrote it. [SLIDE] So I suppose the best I can do is echo what is supposed to have been inscribed upon the Temple of Apollo at Delphi: know thyself.
Works Cited
Rizvi, Pervez. 2018. 'Which N-grams Are the Best [for Authorship Attribution]?': An Essay Self-published on Pervez Rizvi's Website 'Collocations and N-grams' (CAN)
Vickers, Brian. 2019. 'Is EEBO-TCP / LION Suitable for Attribution Studies?' Early Modern Literary Studies. vol. 21.1
Notes
1. Rizvi's website <https://www.shakespearestext.com/can/> is organized as a series of HTML pages that link to large ZIP files that contain files in the proprietary formats Word and Excel from the Microsoft Office software suite. The document containing Rizvi's weighting formula is called "Which-N-grams-are-the-Best.docx" and at the time of writing, 26 May 2024, the way to find it is:
1) Start at the landing page for https://shakespearestext.com/can.
2) Follow the hyperlink attached to the word "experiments" (8 lines from the bottom of the page).
3) Follow the hyperlink attached to the phrase "Browse Results (approx. 173 Mb)" to download the file "results.zip".
4) Expand "results.zip" to create five folders, including one named "1-Which-N-Grams-are-the-Best". In this folder you will find "Which-N-grams-are-the-Best.docx".