
"N-gram matching as a test for authorship, genre, and date-of-composition of London's early modern plays" by Paul Brown and Gabriel Egan

For over a century, scholars attempting to discover just who composed certain writings that have come down to us without a named author have used as evidence the appearance and reappearance of unusual phrases. If a certain phrase--that is, a sequence of words, an n-gram--is known to be peculiar to just one writer and also appears in the writing under investigation, this is evidence that the writer who favours it composed the writing under investigation. This principle of authorship attribution applies equally in cases where we know that a piece of writing was co-written by two authors and want to discover just which of them wrote which part. An important aspect of such "parallel hunting" is ensuring that the n-grams used really are peculiar to one writer and are not just commonplaces of the period, which in 1932 Muriel St Clare Byrne called performing the "negative check" (Byrne 1932-33, 24).

In 1932 it was impossible to perform the negative check reliably for works from Shakespeare's time, since printed concordances did not exist for most writers. Now, with virtually all writings from this period in digital form, we can perform the checks to ensure that the n-grams we use truly are peculiar to one writer. But because we can amass large bodies of evidence, we can also somewhat relax this requirement and consider not just the phrases that are unique to one writer but also those for which different writers merely have different preferences. With enough examples of such preferences, the differences in writers' habits start to become statistically significant. Thus MacDonald P. Jackson looks for differential preferences in choices of n-grams using the Literature Online and Early English Books Online databases (Jackson 2003; Jackson 2014). And if we are unconcerned with eliminating the commonplace phrases of the period and are interested only in the different habits among a closed set of writers, we can search within smaller datasets, as Hugh Craig does in searching within a hand-curated set of several hundred plays from Shakespeare's time (Craig & Kinney 2009; Craig & Greatley-Hirsch 2017). This is also the approach taken by Pervez Rizvi, who has published online his collection of texts and analyses called "Collocations and N-grams", the subject of this paper.

Rizvi's dataset comprises 527 English plays from the period 1552-1657, gathered from multiple sources including the Early Print project and the Folger Digital Texts project. The texts used are regularized and lemmatized, so that words that have variant spellings in the originals are found under one modern spelling in the dataset, and all the conjugated and inflected forms of each word are found under a single modern lemma. The precise details of how this was done are not provided by Rizvi, and since the Early Print project uses machine modernization and the Folger Digital Texts project uses modernization by hand, we should assume a certain inconsistency across his dataset. However, since regularization and lemmatization greatly reduce inconsistency overall, it is likely that the inconsistencies arising from these differing provenances are insignificant.

For this talk, we will ignore Rizvi's work on collocations and look only at n-grams. After making his counts, Rizvi offers a Ranking Formula that treats the sharing of phrases as more significant when the phrases are long than when they are short, and when the words in the phrases are relatively rare in his dataset. The first part of this assumption is questionable, and important independent studies by David L. Hoover and Craig's team have suggested that shared long phrases are in fact less significant than shared short phrases (Hoover 2012; Craig, Antonia & Elliott 2014). For this reason, we ignore Rizvi's Ranking Formula and rely solely on his raw data.

This raw data of n-grams shared between plays is still not quite as raw as we would like. Because comparing every n-gram with every other n-gram in 527 plays (typically 20,000-30,000 words each) would produce too many matches, Rizvi ignores certain cases involving these 154 common words [SLIDE]:

 'tis, a, about, after, against, all, am, an, an, an, an, and, another, any, are, as, at, away, bar, be, because, before, both, but, by, can, close, come, could, dare, did, do, down, enough, enter, every, for, from, given, go, good, had, hath, have, he, hence, her, here, him, his, how, i, i'll, if, in, into, is, it, know, let, like, little, lord, love, make, man, many, may, me, might, more, most, much, must, my, need, neither, never, next, no, none, nor, not, nothing, now, o, of, off, on, once, one, or, other, our, out, over, part, past, see, shall, she, should, since, sir, so, some, such, take, than, that, the, thee, their, them, then, there, therefore, these, they, this, those, thou, though, through, thy, till, to, too, until, unto, up, upon, us, was, we, well, were, what, when, where, which, while, who, whom, whose, why, will, with, within, without, would, yet, you, your

Rizvi remarks that in his methods "Bigrams and trigrams are reported only if they contain at least two words which are not among the 154 most common words in all the plays, to avoid thousands of matches for phrases like and the. Tetragrams and above are always reported". (In fact, Rizvi just last week updated his site to include n-grams and collocations for all words, but we have not had time to respond to this new development and are reporting on the data he uploaded for the original announcement of his dataset on 8 October 2017.)
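Rizvi's reporting rule, as quoted above, can be sketched in a few lines. This is our reconstruction of the rule as he describes it, not his code, and the abbreviated word set stands in for his full 154-word list:

```python
# Sketch of Rizvi's reporting rule as described above (our reconstruction,
# not his code): bigrams and trigrams are reported only if at least two of
# their words fall outside the common-word list; longer n-grams are always
# reported.

COMMON_WORDS = {"and", "the", "of", "to", "a", "in"}  # abbreviated stand-in for the 154-word list

def is_reported(ngram):
    """Return True if this n-gram would be reported under the rule."""
    words = ngram.split()
    if len(words) >= 4:          # tetragrams and above: always reported
        return True
    uncommon = sum(1 for w in words if w not in COMMON_WORDS)
    return uncommon >= 2         # bigrams/trigrams need two uncommon words

print(is_reported("and the"))            # False: both words are common
print(is_reported("gentle night come"))  # True: at least two uncommon words
```

Under this rule a commonplace like "and the" is suppressed, while any four-word string is reported regardless of how common its words are.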

To help investigators make sense of the vast amount of data he provides, Rizvi offers what he calls his Attribution Tester, in the form of a Microsoft Excel spreadsheet with live formulas drawing from multiple worksheets. This tool takes as its input a table, provided by Rizvi for each play in the dataset, of counts of the n-grams common to one play (the subject of the table) and each of the other plays in the dataset. These n-grams-in-common are counted by play and categorized into n-grams that appear in the play under scrutiny and one other play, which Rizvi calls "unique matches", and n-grams that appear in the play under scrutiny and two or more of the other plays in the dataset, which Rizvi calls "all matches". Having made the counts, one can sort the table to put at the top (in rank order) the authors whose canons provide the most matches to a play under scrutiny. To compensate for the fact that the different authors in the dataset have vastly different sizes of canon--Shakespeare's being the largest with several dozen plays--Rizvi applies the following weighting formula to his n-gram matches [SLIDE]:

Weighted number of matches between Play X and canon of Author Y =

    (Number of n-grams common to Play X and plays by Author Y)
    ÷ (Number of words in Play X + Number of words in the matching plays by Author Y)

This weighting formula is applied for the "unique matches" and "all matches" cases, and its purpose is to lower the significance of matches found in canons that dominate the dataset. We would argue, however, that rather than using in the divisor the "Number of words in the matching plays by Author Y" one should use the number of words in the entire canon of Author Y, since that represents the 'target' in which the matches were found and it is this target's dominance of the dataset that we are trying to compensate for. (Just before boarding the aeroplane to attend this conference, we heard from Rizvi that he agrees with this refinement and has applied it to the latest iteration of his analysis, although only for n-grams, not for collocations.)
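The weighting formula, and our proposed refinement to its divisor, can be sketched as follows. The variable names and the illustrative word counts are ours, not Rizvi's:

```python
# A minimal sketch of the weighting formula above, with our proposed
# refinement as an option. Variable names and example counts are ours.

def weighted_matches(n_shared, words_in_play, words_in_matching_plays,
                     words_in_whole_canon=None):
    """Weighted number of n-gram matches between Play X and Author Y's canon.

    Rizvi divides by (words in Play X + words in the *matching* plays);
    our refinement divides by (words in Play X + words in the *whole* canon).
    """
    if words_in_whole_canon is None:
        divisor = words_in_play + words_in_matching_plays   # Rizvi's version
    else:
        divisor = words_in_play + words_in_whole_canon      # proposed refinement
    return n_shared / divisor

# Hypothetical counts for illustration:
print(weighted_matches(1200, 25_000, 80_000))            # Rizvi's weighting
print(weighted_matches(1200, 25_000, 80_000, 700_000))   # whole-canon weighting
```

Note that the refinement always yields a smaller score for a given author, and lowers it most for the authors with the largest canons, which is precisely the dominance we are trying to compensate for.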

To see whether Rizvi's Attribution Tester produces plausible results we tried it on 24 sole-authored, well-attributed Shakespeare plays. Using the "unique matches" and the "all matches" counts, we list here for each play, in rank order, the top three authors whose canons provide the greatest number of n-gram matches to that play. Notice that Shakespeare is not always the top author in these lists, and sometimes is not even in the top three:

1 Henry IV UNIQUE MATCHES
Shakespeare, William
Cumber, John (?)
Porter, Henry

1 Henry IV ALL MATCHES
Shakespeare, William
Peele, George
Rowley, Samuel

2 Henry IV UNIQUE MATCHES
Shakespeare, William
Dekker, Thomas; Webster, John
Beaumont, Francis; Fletcher, John

2 Henry IV ALL MATCHES
Shakespeare, William
Beaumont, Francis; Fletcher, John
Anon.

Much Ado About Nothing UNIQUE MATCHES
Shakespeare, William
Dekker, Thomas; Webster, John
Cumber, John (?)

Much Ado About Nothing ALL MATCHES
Beaumont, Francis; Fletcher, John
Shakespeare, William
Fletcher, John; Beaumont, Francis (?)

Antony and Cleopatra UNIQUE MATCHES
May, Thomas
Shakespeare, William
Daniel, Samuel

Antony and Cleopatra ALL MATCHES
Shakespeare, William
Beaumont, Francis; Fletcher, John
May, Thomas

As You Like It UNIQUE MATCHES
Fletcher, John; Beaumont, Francis (?)
Day, John
Ariosto, Ludovico

As You Like It ALL MATCHES
Beaumont, Francis; Fletcher, John
Shakespeare, William
Fletcher, John; Beaumont, Francis (?)

Coriolanus UNIQUE MATCHES
Shakespeare, William
Rojas, Fernando de
Fletcher, John

Coriolanus ALL MATCHES
Shakespeare, William
Beaumont, Francis; Fletcher, John
Fletcher, John

Cymbeline UNIQUE MATCHES
Shakespeare, William
Beaumont, Francis; Fletcher, John; Massinger, Philip
Anon.

Cymbeline ALL MATCHES
Shakespeare, William
Shirley, James
Beaumont, Francis; Fletcher, John

The Comedy of Errors UNIQUE MATCHES
Munday, Anthony; Chettle, Henry
Munday, Anthony; Drayton, Michael; Wilson, Robert; Hathaway, Richard
Shakespeare, William

The Comedy of Errors ALL MATCHES
Fletcher, John; Beaumont, Francis (?)
Beaumont, Francis; Fletcher, John
Shakespeare, William

Henry V UNIQUE MATCHES
Munday, Anthony; Drayton, Michael; Wilson, Robert; Hathaway, Richard
Shakespeare, William
Anon.

Henry V ALL MATCHES
Shakespeare, William
Anon.
Rowley, Samuel

Hamlet UNIQUE MATCHES
Shakespeare, William
Chapman, George
Webster, John

Hamlet ALL MATCHES
Shakespeare, William
Beaumont, Francis; Fletcher, John
Brome, Richard

Julius Caesar UNIQUE MATCHES
May, Thomas
Shakespeare, William
Alexander, William

Julius Caesar ALL MATCHES
Shakespeare, William
Beaumont, Francis; Fletcher, John
May, Thomas

Love's Labour's Lost UNIQUE MATCHES
Holiday, Barten
Wilson, Robert
Marston, John

Love's Labour's Lost ALL MATCHES
Shakespeare, William
Dekker, Thomas
Beaumont, Francis; Fletcher, John

A Midsummer Night's Dream UNIQUE MATCHES
Shakespeare, William
Rojas, Fernando de
May, Thomas

A Midsummer Night's Dream ALL MATCHES
Shakespeare, William
Anon.
Sharpham, Edward

The Merchant of Venice UNIQUE MATCHES
Marlowe, Christopher
Shakespeare, William
Terence

The Merchant of Venice ALL MATCHES
Tomkis, Thomas
Shakespeare, William
Brome, Richard

Othello UNIQUE MATCHES
Shakespeare, William
Sampson, William
Carlell, Lodowick

Othello ALL MATCHES
Hausted, Peter
Beaumont, Francis; Fletcher, John
Brome, Richard

Richard II UNIQUE MATCHES
Shakespeare, William
Munday, Anthony
Marlowe, Christopher

Richard II ALL MATCHES
Anon.
Shakespeare, William
Heywood, Thomas

Richard III UNIQUE MATCHES
Anon.
Shakespeare, William
S., W.

Richard III ALL MATCHES
Anon.
Shakespeare, William
Heywood, Thomas

Romeo and Juliet UNIQUE MATCHES
Davenport, Robert
Shakespeare, William
Haughton, William

Romeo and Juliet ALL MATCHES
Shakespeare, William
Dekker, Thomas
Anon.

The Two Gentlemen of Verona UNIQUE MATCHES
Yarington, Robert
Rojas, Fernando de
Shakespeare, William

The Two Gentlemen of Verona ALL MATCHES
Rojas, Fernando de
Shakespeare, William
Fletcher, John

The Tempest UNIQUE MATCHES
Lower, Sir William
Harding, Samuel
Mayne, Jasper

The Tempest ALL MATCHES
Hausted, Peter
Tomkis, Thomas
Dekker, Thomas

Twelfth Night UNIQUE MATCHES
Shakespeare, William
Beaumont, Francis; Fletcher, John
Middleton, Thomas; Rowley, William

Twelfth Night ALL MATCHES
Beaumont, Francis; Fletcher, John
Fletcher, John; Beaumont, Francis (?)
Hausted, Peter

Troilus and Cressida UNIQUE MATCHES
Heywood, Thomas
Shakespeare, William
Ariosto, Ludovico

Troilus and Cressida ALL MATCHES
Tomkis, Thomas
Hausted, Peter
Dekker, Thomas

The Merry Wives of Windsor UNIQUE MATCHES
Sharpham, Edward
Cooke, Joshua
Shakespeare, William

The Merry Wives of Windsor ALL MATCHES
Ariosto, Ludovico
Middleton, Thomas
Rojas, Fernando de

The Winter's Tale UNIQUE MATCHES
Shakespeare, William
Marmion, Shackerley
Ford, John

The Winter's Tale ALL MATCHES
Brome, Richard
Shakespeare, William
Beaumont, Francis; Fletcher, John

Using the "unique matches" criterion, Shakespeare's canon had the most matches for just 11 of the 24 plays tested, and was not even in the top three authors for three plays (As You Like It, Love's Labour's Lost, and The Tempest). Using the "all matches" criterion, Shakespeare's canon again had the most matches for just 11 of the 24 plays (a different set of 11), and was not even in the top three authors for five of the plays (Othello, The Tempest, Twelfth Night, Troilus and Cressida, and The Merry Wives of Windsor). We are interested in why this method does not confirm what we already know. Is there something wrong with our general assumptions about shared n-grams being a marker of authorship? Perhaps Rizvi's exclusion of some n-grams involving the 154 most common words in his dataset weakens the discriminating power of his method; we should be able to test that using his new data that does not exclude those 154 words.

It is sometimes suggested, especially by those who are sceptical of authorship attribution by computational methods, that authorship is in fact not the strongest determinant of what gets written in early modern plays; Craig conveniently summarizes the objections to authorship attribution that deploy this argument (Craig 2009-10). Craig himself has investigated this idea and found that, across all early modern plays and a range of methods for measuring their styles, it is generally untrue: authorship really is the best predictor of how alike plays will appear, exceeding the clustering by genre, chronology, or theatre company (Craig & Greatley-Hirsch 2017). But Craig also showed that the genre of history plays comes closest to overcoming this tendency of plays to cluster by author: using some measures, the history plays by different authors look more like one another than they look like their respective authors' other plays. We can test whether this phenomenon is detectable in Pervez Rizvi's dataset . . .

The short answer to this is 'no'. The slightly longer version is 'not with the questions we asked it'. What Rizvi's dataset can do, though, it turns out, is find a play's genre more readily than a play's authorship. But just how we test that is a complex question. We have already seen the 'attribution tester' in action--using the weighted number of unique n-gram matches as a way to rank candidate authors--and we return to that metric here. The 'attribution tester' itself is no use for this, since it is concerned solely with authorship; what we wanted was to be able to compare authorship information with genre information. Rizvi provided 'summary' files of his n-gram counts and matches for each play, ranked by 'weighted number of unique matches'. The summary files look like this [SLIDE], with the highest-matching play--based on weighted unique n-grams--appearing at the top, row 2 here, the second-highest-matching play appearing in row 3, and so on. We undertook to test these matching plays for authorship and genre against the authorship and genre of the play with which they matched. That is--for the example on screen--we asked: Your Five Gallants is a comedy by Thomas Middleton, so do comedies dominate the top of the list of plays that share the most n-grams with Your Five Gallants? Likewise with authorship: do plays by Thomas Middleton, the author of Your Five Gallants, dominate the list of plays that share the most n-grams with Your Five Gallants?

We did just this across the whole dataset. Rather than manually matching plays with metadata and counting ourselves, we wrote a program to do it for us and return the results. For the technically interested, we wrote the script in Python, using its pandas data science library to work with Rizvi's .csv files. I'll say no more about the technical side here, but do please feel free to ask should it be of interest. (Though suffice it to say that Google won't be hiring me as a software engineer any time soon.)

Rather than examine every matching play, we arbitrarily chose to look at the top 10 for each play in the dataset. [SLIDE] Of those top 10 matching plays, we counted their authors and their genres and tallied the scores, as shown here. To test how good a metric the weighted number of unique matches is at predicting authorship and genre, we took the highest-scoring genre and highest-scoring author and compared them against the genre and author of the play being examined. Here are the results [SLIDE]. 63% of the time this method got a play's genre correct, and 47% of the time it got a play's author correct. This 47% is about the same as the 'attribution tester's' success rate of 11 out of 24 (46%) in the Shakespeare example. These scores are based on a dataset containing 648 plays or parts of plays--Rizvi breaks some plays suspected of co-authorship into various chunks, alongside a whole-play file.
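The core of this tally can be sketched in a few lines. This is a simplified stand-in for our script (which used pandas on Rizvi's .csv files): the record structure and the example plays below are invented for illustration, not drawn from his data:

```python
# Simplified stand-in for the tally described above. Each record is one of
# the top-N plays matching the play under scrutiny, ranked by weighted
# number of unique n-gram matches. The records here are invented examples.

from collections import Counter

def predict(top_matches, key):
    """Return the most frequent value of `key` among the top matching plays."""
    return Counter(m[key] for m in top_matches).most_common(1)[0][0]

# Hypothetical top matches for a comedy by Thomas Middleton:
top_matches = [
    {"author": "Middleton, Thomas", "genre": "comedy"},
    {"author": "Jonson, Ben",       "genre": "comedy"},
    {"author": "Middleton, Thomas", "genre": "comedy"},
    {"author": "Dekker, Thomas",    "genre": "tragedy"},
    {"author": "Middleton, Thomas", "genre": "comedy"},
]

print(predict(top_matches, "genre"))   # -> comedy
print(predict(top_matches, "author"))  # -> Middleton, Thomas
```

The predicted genre and author are then compared against the known genre and author of the play under scrutiny, and the hits are totalled across all plays in the dataset.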

Testing different numbers of matching plays, taking only the top three most-matching plays, say, or the top 20, we found that the scores vary little: a percentage point here or there. Seven matching plays seems to be the magic number in this test, pushing the genre score up to 64% and the author score up to 49%.

When we isolate a single genre, a different pattern emerges. If the play being looked at is a history, then a prediction of its genre based on which genre's plays dominate the list of plays with the most matches gets it right about 66% of the time. If it is a tragedy, then it gets it right 58% of the time. Comedies, however, are correctly identified 87% of the time. Partly this is down to having more chances to match: there are more comedies in the dataset than tragedies and histories. But these results for comedies are striking nevertheless. When examining plays by genre in this way, one genre at a time, we can also ask whether the plays dominating the top of the list of plays with the most matches to the play under consideration are by the author of the play under consideration. When we isolate genres in this way, the n-gram matching method correctly predicts authorship between 45% and 55% of the time.

With data in this format, we can perform yet another test to assess whether genre or authorship clusters at the top of these lists. That is, when looking at, say, a Ben Jonson comedy, do we see more comedy plays at the top of the list or more Jonson plays? This is a subtly different question from those already addressed. So far we have been looking at authorship and genre in matching plays; this new inquiry pitches genre against authorship to see which is the better predictor of the plays that will dominate the top of the list.

71% of the time the genres of the top 10 plays match more frequently than the authors do. When looking just at histories or just at tragedies, we get more or less the same scores: around the 70% we got without isolating genre. But for comedies, genre is a better predictor than authorship 82% of the time. There are, we should note, considerably fewer genres than authors in the dataset, so correct-genre matches are likely to be more frequent than correct-author matches for that reason alone. Nevertheless, we have already seen that the tool does not reliably detect authorship, so a test in which genre trumps authorship is plausible.

Does any of this matter? Well, it teaches investigators to be wary of these measures as markers of authorship in Rizvi's dataset. This is an important warning, since the easiest tool with which to engage is Rizvi's 'attribution tester'. It also suggests that n-gram research can be fruitful in detecting genre: as Nigel Fabb mentioned yesterday, Jonathan Hope and Michael Witmore have used other linguistic features to detect genre, so we know it can be done (Hope and Witmore 2010). Weighted unique n-grams alone--at least those that do not account for strings containing the commonest words--are not enough. The beauty of approaching problems with digital datasets and computer programs is that with a little recalibration we can ask a different set of questions in the hope of isolating a more effective approach. What is certain is that Rizvi's data are a boon for investigators interested in the attribution of early modern plays, even if they only teach us to be sceptical of present assumptions about shared n-grams being good evidence of shared authorship.

Works Cited

Byrne, Muriel St Clare. 1932-33. "Bibliographical Clues in Collaborate Plays." The Library (=Transactions of the Bibliographical Society), 4th series, 13. 21-48.

Craig, Hugh and Arthur F. Kinney. 2009. Shakespeare, Computers, and the Mystery of Authorship. Cambridge: Cambridge University Press.

Craig, Hugh and Brett Greatley-Hirsch. 2017. Style, Computers, and Early Modern Drama: Beyond Authorship. Cambridge: Cambridge University Press.

Craig, Hugh, Alexis Antonia and Jack Elliott. 2014. "Language Chunking, Data Sparseness, and the Value of a Long Marker List: Explorations with Word N-grams and Authorial Attribution." Literary and Linguistic Computing 29. 147-63.

Craig, Hugh. 2009-10. "Style, Statistics, and New Models of Authorship." Early Modern Literary Studies 15.1. 41 paras.

Hope, Jonathan, and Michael Witmore. 2010. "The Hundredth Psalm to the Tune of 'Green Sleeves': Digital Approaches to Shakespeare’s Language of Genre." Shakespeare Quarterly 61.3. 357–90.

Hoover, David. 2012. "The Rarer They Are, the More There Are, the Less They Matter." Online abstract for a paper delivered on 19 July at the Digital Humanities conference held at the University of Hamburg on 16-20 July 2012. http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/the-rarer-they-are-the-more-there-are-the-less-they-matter

Jackson, MacDonald P. 2003. Defining Shakespeare: Pericles as Test Case. Oxford: Oxford University Press.

Jackson, MacDonald P. 2014. Determining the Shakespeare Canon: Arden of Faversham and A Lover's Complaint. Oxford: Oxford University Press.