"Instructive failures in authorship attribution by shared phrases in large textual corpora" by Gabriel Egan
On 7 May 1996, Dorothy Woods, a retired health worker, was found dead in her home in Huddersfield in the north of England. She had been smothered by a pillow and signs of a break-in made local police pursue the theory of a burglary gone wrong. A window at the point of entry was found to hold the oily impression of a human ear pressed against it. Unfortunately for local burglar Mark Dallagher, Huddersfield police consulted a Dutch police officer, Cornelis van der Lugt, who although he had no forensics training had become convinced that ear-prints are as incriminating as fingerprints. Comparison of Dallagher's ear with the print left at the crime scene led to his conviction for murder, followed six years later by his retrial and exoneration. The Court of Appeal found that the first trial judge misdirected the jury regarding the value of expert testimony and failed to identify fallacious reasoning about statistical probability. The Dallagher case contains several lessons for the study of authorship attribution.
As with finger-print and DNA evidence, the strongest kinds of argument in such cases are those used to exclude suspects rather than include them. If we find a partial human genome or finger-print at a crime scene, we might with certainty declare that it matches no part of the DNA or the fingers of a given suspect. The suspect cannot have left this evidence. But finding that the fragment matches part of a suspect's DNA or finger is not itself proof of guilt since, being only a fragment, it might also match others' DNA or fingers. When evaluating so-called partial matches, we are forced to make statistical speculations about the likelihood that a fragment of a given size might match more than one person. Human beings, including experts, have not always made the correct judgements about such likelihoods. In this brief talk I want to look at recent failures in this area to see what we can learn from them. The key lesson is that finding that the suspect's ear matches the ear-print doesn't mean that he did it.
* * *
Brian Vickers has published a series of articles in the past 10 years that make claims based on automated searching of a collection of electronic texts put together by him and his collaborator Marcus Dahl. Most importantly, the claims rely as much on what is not found in searching this database as on what is found, and this raises an obvious doubt in the mind of the reader. Since no one but Vickers has access to their database, or even knows which works are in it, we cannot tell if his failure to find something--typically the allegedly distinctive phrases used by the dramatist whose canon he was investigating--is due to that something genuinely being absent from various writings of the period, or simply to his database being incomplete and/or his searching of it being ineffective.
In an article in the Times Literary Supplement in 2010, Vickers argued that--contrary to recent claims--Shakespeare's contemporary Thomas Middleton did not adapt a lost version of Macbeth to produce the play we all know (and attribute to Shakespeare) because it appeared in the 1623 First Folio of Shakespeare (Vickers 2010). His method was to find in the 151 lines of Macbeth that have been attributed to Middleton the occurrences of "three consecutive words ('trigrams', as they are known in corpus linguistics)" that appear in Shakespeare's work and not in Middleton's. To check this, I simply went looking for these trigrams in the Oxford Collected Middleton of 2007 (Middleton 2007; Taylor & Lavagnino 2007), which I happen to have in electronic form.
Each item in the following list consists of a phrase, followed by Vickers's assertion about that phrase's absence from, or rareness in, the Middleton canon, and then this author's contrary assertion about its occurrence in the Middleton canon:
Phrase: "Showed like a"
Vickers says "Not Mid[dleton]"
Egan says it is in: A Yorkshire Tragedy 4.71; and Honourable Entertainments 2.8
Phrase: "him till he"
Vickers says "1 Middleton" occurrence
Egan says it is in: Wit at Several Weapons 4.1.23; and Any Thing for a Quiet Life 5.2.192
Phrase: "him from the"
Vickers says "1 Middleton" occurrence
Egan says it is in: The Puritan Widow 1.3.29; Your Five Gallants 2.4.224; The Lady's Tragedy 1.2.86; and The Owl's Almanac 2358-9
Phrase: "and fixed"
Vickers says "Not in Middleton"
Egan says it is in: The Triumphs of Truth 420; The Old Law 2.2.152; The Nice Valour 3.3.43; and The Triumphs of Honour and Virtue 93
Phrase: "the other and"
Vickers says "Not in Middleton"
Egan says it is in: Plato's Cap 264
Phrase: "here's another"
Vickers says "Not in Middleton"
Egan says it is in: The Phoenix 12.128; The Patient Man and the Honest Whore 5.39 and 6.81; The Yorkshire Tragedy 5.58; and Your Five Gallants 1.1.27-28
Phrase: "a crew of"
Vickers says "Not in Middleton"
Egan says it is in: The Owl's Almanac 1804; The World Tossed at Tennis 683; The Spanish Gipsy 3.2.54; and Microcynicon 4.46
Phrase: "I have seen him"
Vickers says "Not Mid[dleton]"
Egan says it is in: The Puritan Widow 1.3.75
Phrase: "to the eye"
Vickers says "Not in Middleton"
Egan says it is in: Sir Robert Sherley 257; and The Sun in Aries 276
I have left out, of course, phrases that appear in the parts of Middleton's collaborative works that were written by his collaborators, as determined in the latest scholarship (Taylor & Lavagnino 2007, 335-443). If we allow for mildly variant forms, there are yet more occurrences in Middleton's works of words and phrases that Vickers asserts are not there, too many to show you now. It is clear, then, that either the collection of electronic texts used by Vickers to try to discredit the claim that Middleton wrote part of Macbeth was missing a great many Middleton works, or else Vickers's means of searching the collection was ineffective and he missed a great deal of the relevant evidence, or both.
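The kind of check described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the procedure Vickers used nor the one used for this talk: the function names (tokens, ngrams, shared_ngrams) and the two toy texts are invented for the example, and the normalisation handles only letter-case and punctuation, not early modern spelling variation. What it shows is why an ineffective search can report "not found" for a phrase that is in fact present.

```python
import re

def tokens(text):
    """Lower-case the text and split it into words, discarding punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(words, n=3):
    """Return the set of all runs of n consecutive words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(passage, corpus, n=3):
    """Map each title in the corpus to the n-grams it shares with the passage."""
    wanted = ngrams(tokens(passage), n)
    hits = {}
    for title, text in corpus.items():
        found = wanted & ngrams(tokens(text), n)
        if found:
            hits[title] = found
    return hits

# Invented toy texts (hypothetical, not quotations from any real play).
passage = "the fire showed like a beacon"
corpus = {
    "Toy Play A": "The moon SHOWED, like a lantern, above the town.",
    "Toy Play B": "Nothing relevant is said here at all.",
}

result = shared_ngrams(passage, corpus)
print(result)  # {'Toy Play A': {('showed', 'like', 'a')}}

# A naive exact-string search misses the same occurrence because of the
# capital letters and the comma, and so would wrongly report absence.
print("showed like a" in corpus["Toy Play A"])  # False
```

The design point is that absence can only be asserted relative to a stated corpus and a stated matching rule; change either and a "Not in Middleton" may become a hit.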
My second example concerns Vickers's claim that the Additions to Kyd's The Spanish Tragedy are by Shakespeare. (These Additions were not in the play as first published in 1592 but appeared for the first time in the fourth edition of 1602.) Vickers used his home-made database of "400 plays and masques dating from the 1580s to the 1640s, and including the complete canons of Marlowe, Lyly, Peele, Kyd, Shakespeare, Dekker, Jonson, Chapman, Middleton, Beaumont, Fletcher, Massinger and Shirley, together with all the anonymously published plays" (Vickers 2012, 29). From searching this collection Vickers claimed that "the uniquely Shakespearian matches amount to 116 in the 320 lines of the Additions, a rate of one every 2.5 lines" (Vickers 2012, 29). Vickers listed these 116 phrases supposedly common to the Additions to The Spanish Tragedy and the Shakespeare canon and nothing else. These we can test by searching in Literature Online (LION) and Early English Books Online Text Creation Partnership (EEBO-TCP) to see i) if they in fact appear in other dramatic canons, and ii) if any of the phrases were simply common in the period and hence are not decisive in ascribing authorship.
Here are the first twelve phrases from Vickers's list of those 116 allegedly unique to Shakespeare's works and the Additions to The Spanish Tragedy, with examples missed by Vickers from the collections of EEBO-TCP and LION (searched March 2015) cited by Short Title Catalogue (STC) number and signature for relatively obscure works (easily found in EEBO) and by title for canonical texts of English Literature (easily found in LION):
1) Phrase "[take] note of it". The square bracket is necessary because in some of Vickers's matches all four words are present, and in others only the last three. The phrase is common in published writing: STC 6553 (published 1606) has "taking note of it" (sig. A4v); Nashe's Have With You to Saffron Walden STC 18369 (published 1596) has "take note of it" (sig. L4v); STC 18639 (published 1607) has "take note of it" (sig. K7r); and STC 18800 (published 1618) has "take note of it" (sig. E1v). Naturally, the three-word string "note of it" is even more common since it includes all these and many more.
2) Phrase "of it [ ] besides". The phrase is common with 432 occurrences in 385 EEBO-TCP books. Confining ourselves to just the period up to the year 1600 we find that: STC 3071 (published 1585) has "of it besides" (sig. Aaa1r); STC 3734 (published 1587) has "of it, besides" (sig. Xxx8r); STC 3802 (published 1580) has "of it? Besides" (sig. Yy4v); STC 4442 (published 1583) has "of it? Besides" (sig. Ccc6v); STC 4470 (published 1562) has "of it, beside" (sig. ***1v); STC 5008 (published 1563) has "of it. Besides" (sig. Q4v); and STC 14842 (published 1535) has "of it. ¶Besyde" (sig. B2r). There are 35 more occurrences in 33 books published before 1600.
3) Phrase "short lived". EEBO-TCP shows 26 occurrences in 14 books before 1600. LION shows occurrences in Robert Burton's Anatomy of Melancholy and in Middleton's The Revenger's Tragedy (first performed 1605-6), the latter of which should be in Vickers's database.
4) Phrase "run to the". LION has dramatic occurrences before 1600: the anonymous Look About You (first performed 1597-99); Thomas Ingelend's The Disobedient Child (first performed 1559-70); Marlowe's The Jew of Malta (first performed 1589-90); and John Marston's Antonio and Mellida (first performed 1599-1600). There are dozens more if one expands one's purview to the period generally, including Shakespeare's Julius Caesar, which should be in Vickers's database, and dozens more again if one looks beyond just drama to poetry and literary prose. EEBO-TCP has 2,269 occurrences from 1,586 books published before 1700.
5) Phrase "presently [ ] and bid". EEBO-TCP shows that Aston Cokayne's play Trappolin Supposed a Prince, first performed in 1633 and printed as Wing C4894, has "presently, and bid" (sig. Gg6r).
6) Phrase "strange dream[s]". Even if we confine ourselves only to drama there are plenty of occurrences in LION including: the anonymous Birth of Hercules (first performed 1597-1610); Marston's Antonio's Revenge (first performed 1600-01); twice in Marston's The Malcontent (first performed 1602-04); Webster's The Duchess of Malfi (first performed 1612-14); and several less well-known plays. EEBO-TCP finds 80 occurrences in 67 books before 1700.
7) Phrase "do . . . hear me sir" where one word fills the gap. LION finds: "do you hear me, sir" in Middleton's The Puritan Widow (first performed 1606); "doe but heare me sir" in Middleton's Michaelmas Term; "do you heare mee, Sir?" in William Percy's The Cuck-Queanes and Cuckolds Errant (no later than 1604 since it is a source for Marston's The Dutch Courtesan); "do you heare me sir?" in Henry Porter's The Two Angry Women of Abingdon (first performed 1598-9); and "Do'y heare me sir" in Edward Sharpham's Cupid's Whirligig (first performed 1607). All of these plays ought to be in Vickers's database.
8) Phrase "Nay blush not". LION finds this phrase in: Fletcher's Love's Pilgrimage (first performed 1616), The Little French Lawyer (first performed 1619-23) and The Island Princess (first performed 1621); William Haughton's Englishmen for My Money (first performed 1598); Thomas Heywood's The Four Prentices of London (first performed 1594); Jonson's The Devil is an Ass (first performed 1616); three times in Massinger's The Great Duke of Florence (first performed 1627); and Francis Quarles's The Virgin Widow (first performed 1641). All of these plays ought to be in Vickers's database.
9) Phrase "Saint James [or Jamy]". LION finds this phrase in: the anonymous King Darius (first performed 1565); the anonymous The Pedlar's Prophecy (first performed 1561-63); the anonymous Free Will (first performed 1565-72); John Heywood's The Four Ps (first performed 1520-22); Jonson's The Gypsies Metamorphosed (first performed 1621); and George Ruggle's Ignoramus (first performed 1615).
10) Phrase "within this hour [ ] that". LION finds: "Within this hour, things that" in Fletcher's Monsieur Thomas (first performed 1610-16). EEBO-TCP finds that STC 22719 (published 1593) has "within this houre, that" (sig. Vv2r).
11) Phrase "hanged up" when said of persons. LION finds this in: the anonymous play Nice Wanton (first performed 1547-53); Lording Barry's Ram-Alley (first performed 1608-10); Dekker and Middleton's 1 Honest Whore (first performed 1604); Fletcher's The Spanish Curate (first performed 1622); Massinger's Believe As You List (first performed 1631); William Stevenson's Gammer Gurton's Needle (first performed 1552-63); and Lewis Wager's Life and Repentance of Mary Magdalene (first performed 1550-66).
12) Phrase "me a taper" used in the imperative, such as "give me a taper" or "lend me a taper". Vickers finds this only in the Additions to The Spanish Tragedy and Othello, but LION finds that Antony and Cleopatra has "Get me a taper".
Thus we have to test the first 10% (12 out of 116 phrases) of Vickers's list of claimed "unique" parallels, meaning trigrams found in the Additions to The Spanish Tragedy and in Shakespeare's works but nowhere else, before we hit one for which this claim is actually true, the imperative "me a taper". And even in the final example, faith in the effectiveness of Vickers's searching is undermined by the fact that he missed a Shakespearian example, "Get me a taper" from Antony and Cleopatra. Clearly, the large number of dramatic occurrences of these phrases listed above and missed by Vickers confirms again that either his database is largely incomplete or his searching of it is ineffective, or both.
Of course, by searching EEBO-TCP too we have included non-dramatic works that Vickers did not claim to check. But Vickers should have looked beyond the drama since if a phrase is simply the common currency in the period then finding it in the Additions to The Spanish Tragedy and Shakespeare's works has no significance for authorship attribution. This is what Muriel St Clare Byrne meant by the necessity to perform a "negative check" to see who else is using a phrase, which advice Vickers has approvingly cited but does not follow (Vickers 2012, 18).
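A negative check of this kind can be sketched as follows. This is a minimal illustration under assumptions, not Byrne's or anyone else's actual procedure: the names normalise, negative_check and survivors, and the three toy reference texts, are hypothetical, and a real check would search LION or EEBO-TCP and allow for spelling variants and gapped phrases. For each candidate phrase the sketch lists the works outside the candidate author's canon that also contain it; only phrases with an empty list remain potentially distinctive.

```python
import re

def normalise(text):
    """Lower-case the text, drop punctuation, and join words with single spaces."""
    return " ".join(re.findall(r"[a-z']+", text.lower()))

def negative_check(phrases, reference_corpus):
    """For each phrase, list the reference works (by other hands) that contain it."""
    # Pad with spaces so that only whole-word matches count.
    bodies = {title: f" {normalise(text)} " for title, text in reference_corpus.items()}
    report = {}
    for phrase in phrases:
        needle = f" {normalise(phrase)} "
        report[phrase] = sorted(t for t, body in bodies.items() if needle in body)
    return report

# Invented toy reference texts (hypothetical, not real quotations).
reference_corpus = {
    "Toy Work One": "I had a strange dream, last night.",
    "Toy Work Two": "Such a Strange Dream it was!",
    "Toy Work Three": "Bring the candle hither.",
}

report = negative_check(["strange dream", "me a taper"], reference_corpus)
print(report)
# {'strange dream': ['Toy Work One', 'Toy Work Two'], 'me a taper': []}

# Only phrases found nowhere else survive as possible evidence of authorship.
survivors = [phrase for phrase, works in report.items() if not works]
print(survivors)  # ['me a taper']
```

A phrase that survives is only as distinctive as the reference corpus is complete, which is why the corpus needs to be one that other scholars can search for themselves.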
It so happens that Shakespeare probably did write at least some of the passages added in the 1602 edition of Kyd's The Spanish Tragedy, but no one should believe that on the basis of Vickers's work. Other scholars such as Hugh Craig, John Burrows, Arthur F. Kinney, and MacDonald P. Jackson--of whom Vickers is scathingly critical (Vickers 2011)--have demonstrated this by multiple independent studies using different means. The underlying flaw in Vickers's approach, in the cases examined here and in others, is that he bases his claims on a private database of electronic texts to which no one else has access. There are no good reasons to do this since LION contains virtually all literary texts of the period and EEBO-TCP virtually all published texts of the period. Scholars should not take seriously any claims made by such methods, or by Vickers in particular on the basis of such techniques. Scholarly publishers and journals should in future not publish claims based on methods, like this, which have been repeatedly shown to be unreliable.
Works Cited
Middleton, Thomas. 2007. The Collected Works. Ed. Gary Taylor and John Lavagnino. Oxford. Clarendon Press.
Taylor, Gary and John Lavagnino, eds. 2007. Thomas Middleton and Early Modern Textual Culture: A Companion to the Collected Works. Oxford. Clarendon Press.
Vickers, Brian. 2010. "Disintegrated: Did Thomas Middleton Really Adapt Macbeth?" Times Literary Supplement Number 5591 (28 May). 14-15.
Vickers, Brian. 2011. "Shakespeare and Authorship Studies in the Twenty-first Century." Shakespeare Quarterly 62. 106-42.
Vickers, Brian. 2012. "Identifying Shakespeare's Additions to The Spanish Tragedy (1602): A New(er) Approach." Shakespeare 8. 13-43.