"Telling Shakespeare and Marlowe apart by function-word clustering" by Gabriel Egan
The 26 letters of the alphabet do not occur with equal frequency in normal English prose or verse, whether modern or early modern. About one in every eight letters of written English is the letter 'e', but only one-in-50 of them is the letter 'p', one-in-100 is the letter 'v', and one-in-1000 is the letter 'q'. Moreover, certain letters are more likely to follow certain other letters. Following 't' with 'h' makes the commonest letter bigram in English, comprising about one-in-30 of all letter bigrams, while following 'u' with 'r' makes a bigram that occurs only once in every 200 letter bigrams. Without consciously noticing it, all competent users of the English language have internalized these preferences, as can be demonstrated with an experiment. I have here an extract from a 20th-century novel and I would like members of the audience to call out their guesses for the letter, piece of punctuation, or space that comes next. I'll give you the first letter, 'T':
The clock struck half past two. In the little office at the back of Mr McKechnie’s bookshop, Gordon--Gordon Comstock, last member of the Comstock family, aged twenty-nine and rather moth-eaten already--lounged across the table, pushing a four-penny packet of Player's Weights open and shut with his thumb.
[If they guess right, tell them and write the letter with an underscore beneath it. If they guess wrong, tell them and write the letter without an underscore. The underscores show how guessable the language is. After the experiment is finished, SLIDE to show above quotation.] The first to do the experiment we have just performed was the founder of the science of Information Theory, Claude Shannon. His equations, and in particular the celebrated equation for Shannon Entropy [SLIDE], are now routinely used in scientific work on authorship attribution that appears in journals that are little read by Shakespearians and Marlovians.
The preferences for one letter following another that we see in this experiment are built into the English language rather than belonging to any one writer. They express the fact that letter combinations are not randomly distributed but statistically ordered. If letter combinations were not ordered like this, any combination would be equally likely and crossword puzzles would be impossible [SLIDE]. When we shift our attention from combinations of letters to combinations of words, the same statistical principles apply and the same Shannon equation tells us how far from purely random the sequences are. The freedom of the writer is much greater at the level of word combination: the structure of English is far less constraining when choosing words to make a sentence than when choosing letters to make a word. Moreover, it is empirically provable that the preferences for following one word with another are personal preferences of particular authors.
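For anyone who wants to try the arithmetic, here is a minimal sketch in Python of Shannon's entropy calculation applied to single letters. It is only an illustration: the function name `shannon_entropy` and the sample string are my own choices, and counting letters one at a time ignores the letter-follows-letter context that the guessing experiment exploits. What it does show is the gap between the observed entropy and the maximum that 26 equally likely letters would give.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Return the Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over every distinct symbol observed.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Illustrative sample only; a real measurement needs far more text.
text = "the clock struck half past two"
letters = [c for c in text.lower() if c.isalpha()]

observed = shannon_entropy(letters)
maximum = math.log2(26)  # entropy if all 26 letters were equally likely

print(f"observed: {observed:.2f} bits per letter")
print(f"maximum:  {maximum:.2f} bits per letter")
```

The further the observed figure falls below the maximum, the more uneven, and hence the more guessable, the distribution is.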
In the Humanities, investigators of this topic often use word-combination preferences governing the relatively infrequent lexical words, [SLIDE] so they ignore the short, so-called function words--'the', 'and', 'of', and so forth--that make up much of what we say and write. However, there is good empirical evidence that we should attend to those short function words that we hardly notice but which are demonstrably distinctive of an author's style. [SLIDE] When we count how often these function words are used, the frequencies turn out to be different for different writers. What is more, a new technique has been developed by a team to which I belong that counts not the frequencies of these function words, but the preferences for following one with another shortly afterwards. That is, how they cluster. The Word Adjacency Network method captures the distance, measured in words, between each occurrence of a function word and each other occurrence of every other function word.
Let me show how the method works using this extract from Shakespeare's Hamlet: [SLIDE]
With one auspicious and one dropping eye,
With mirth in funeral and with dirge in marriage,
In equal scale weighing delight and dole,
(Shakespeare Hamlet 1.2.11-13)
[SLIDE] Let us confine our attention to the proximities, one from another, of the four function words 'with', 'and', 'one', and 'in'. [SLIDE] Starting with 'with' and looking forward five words we find an occurrence of the word 'one', an occurrence of the word 'and', and another occurrence of the word 'one'. [SLIDE] We record that in our Markov chain by a line from 'with' to 'and' with a value of 1 and a line from 'with' to 'one' with a value of 2. [SLIDE] We are done with the first word in the extract, 'With', and we [SLIDE] move to the next occurrence of one of our function words, which is the second word in the extract, 'one'. Again looking forward five words we spot an occurrence of 'and' and an occurrence of 'one', [SLIDE] so we draw a line from 'one' to 'and', weighted 1, and a line from 'one' to itself, weighted 1. [SLIDE] Then we move to the next occurrence of one of our function words, and it is 'and' in the middle of the first line. Looking forward five words, we find an occurrence of 'one' and an occurrence of 'with', so we add these to our Markov chain as two weighted lines emerging from the node for 'and'. We proceed through the extract in the same way, adding fresh weighted lines (called edges) between nodes to indicate how often each word appears within five words of the others [SLIDE x 16]. This is our completed Markov chain.
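The counting just described can be sketched in a few lines of Python. This is a minimal illustration rather than my team's actual software: the function name `adjacency_counts`, the simple tokenizer, and the choice to ignore line breaks and capitalization are all assumptions made for the sake of the example. It uses the same five-word forward window and the same four function words as the walk-through.

```python
import re
from collections import Counter

def adjacency_counts(text, function_words, window=5):
    """Count how often each function word is followed, within `window`
    words, by each function word (including itself)."""
    # Assumed tokenizer: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    targets = set(function_words)
    counts = Counter()
    for i, source in enumerate(tokens):
        if source not in targets:
            continue
        # Look forward at the next `window` words only.
        for follower in tokens[i + 1 : i + 1 + window]:
            if follower in targets:
                counts[(source, follower)] += 1
    return counts

extract = """With one auspicious and one dropping eye,
With mirth in funeral and with dirge in marriage,
In equal scale weighing delight and dole,"""

counts = adjacency_counts(extract, ["with", "and", "one", "in"])
for (source, follower), n in sorted(counts.items()):
    print(f"{source} -> {follower}: {n}")
```

Each printed line corresponds to one weighted edge in the network; a pair that never occurs, such as 'in' followed by 'one' in this extract, simply has no edge.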
We then do the same for the same four function words' appearances in another passage, [SLIDE] this time from Thomas Dekker's Satiromastix; here's the completed Markov chain. [SLIDE] We end up with two Markov chains, each showing the Word Adjacency Network for the four words 'with', 'and', 'one', and 'in', in each extract. These two chains contain the information about the word proximities in the two extracts, and using Shannon's mathematics for entropy we can compare them. You will see that there are fewer lines in the Satiromastix network, but the absolute number of lines is not the most important point. The key question is, "when this author chooses to follow one of these words with another of these words, which is she most likely to choose?" These networks embody the author's preferences that answer this question. You can see that in the Dekker extract, the word 'in' is never followed (within five words) by the word 'with': there is no line running from 'in' to 'with'. Dekker instead chooses to follow 'in' by 'and' (one time) and by 'one' (two times). [BLANK SLIDE]
This is only an illustration of the idea and for authorship attribution we use many more than four function words; 100 would be typical, but the resulting pictures are too complex to show you. And of course rather than short extracts from plays we use whole authorial canons as our samples. And instead of just recording the raw numbers of edges from node to node there are some weightings of edges and nodes to be applied using Shannon's mathematics for entropy and what is called limit probability. The edge-weightings reflect the fact that we consider words appearing close to one another to be more significant than words that are far apart, so instead of scoring "1" for a word appearing anywhere within our 5-word window, we give a greater score to words appearing near the beginning of the window. The limit probability weighting of nodes reflects the fact that we attach greater significance to words that are used often in the text being tested than to words that are used infrequently.
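To make the edge-weighting concrete, here is a hedged sketch in Python. I have not told you the exact scoring rule, so the geometric discount used here (a score of 1 for the very next word, multiplied by a hypothetical `decay` of 0.75 for each further word of distance) is an assumption chosen purely for illustration, as are the function names. The second function turns the weights leaving each node into proportions that sum to 1, which is what lets us treat the network as a Markov chain.

```python
import re
from collections import defaultdict

def weighted_network(text, function_words, window=5, decay=0.75):
    """Score each function-word pair, giving nearer followers more weight.
    The geometric `decay` is an illustrative assumption."""
    tokens = re.findall(r"[a-z']+", text.lower())
    targets = set(function_words)
    weights = defaultdict(float)
    for i, source in enumerate(tokens):
        if source not in targets:
            continue
        following = tokens[i + 1 : i + 1 + window]
        for distance, follower in enumerate(following, start=1):
            if follower in targets:
                # Distance 1 scores 1, distance 2 scores decay, and so on.
                weights[(source, follower)] += decay ** (distance - 1)
    return dict(weights)

def normalise(weights):
    """Rescale the edges leaving each node so that they sum to 1."""
    totals = defaultdict(float)
    for (source, _), w in weights.items():
        totals[source] += w
    return {edge: w / totals[edge[0]] for edge, w in weights.items()}

w = weighted_network("with one and", ["with", "one", "and"])
p = normalise(w)
print(w)
print(p)
```

In this tiny example 'one' immediately follows 'with' and so scores more than 'and', which follows it at a distance of two words.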
Specifically, how do we compare two WANs? The technical answer is that for each edge common to the two WANs we subtract from the natural logarithm of the weight in the first WAN the natural logarithm of the weight in the second WAN and then multiply this difference by the weight of the edge in the first WAN and then by the limit probability of the node from which this edge originates. This calculation is made for each edge and the values summed to express the total difference. In this method, it matters which WAN we designate as 'first' and which 'second', so the procedure is performed twice, switching the designation the second time. (I give you this technical account because the last time I spoke about this an audience complained that I had concealed the most interesting part of the story; chacun à son goût.)
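For those who share that taste, the comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not our production code: each WAN is represented as a dictionary of rows whose weights already sum to 1, every word that appears as a target also has a row of its own, the limit probabilities are taken from the first WAN and found by simple repeated multiplication, and the two toy chains on the words 'a' and 'b' are invented. Since a centinat is a hundredth of a nat, multiplying the result by 100 converts it to centinats.

```python
import math

def limit_probabilities(chain, iterations=200):
    """Approximate the long-run share of time spent at each node.
    Assumes every target word also has its own row in `chain`."""
    nodes = list(chain)
    pi = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for source, row in chain.items():
            for target, p in row.items():
                nxt[target] += pi[source] * p
        pi = nxt
    return pi

def relative_entropy(first, second):
    """Sum, over edges common to both chains, of
    limit probability x first weight x (ln first - ln second), in nats."""
    pi = limit_probabilities(first)  # assumption: taken from the first WAN
    total = 0.0
    for source, row in first.items():
        for target, p1 in row.items():
            p2 = second.get(source, {}).get(target, 0.0)
            if p1 > 0 and p2 > 0:  # only edges present in both WANs
                total += pi[source] * p1 * (math.log(p1) - math.log(p2))
    return total

# Invented toy chains for illustration only.
chain_1 = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
chain_2 = {"a": {"a": 0.25, "b": 0.75}, "b": {"a": 0.5, "b": 0.5}}

forward = relative_entropy(chain_1, chain_2)
backward = relative_entropy(chain_2, chain_1)
print(f"first vs second: {100 * forward:.2f} centinats")
print(f"second vs first: {100 * backward:.2f} centinats")
```

The two directions give different figures, which is why the procedure has to be run twice with the designations switched.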
In this method, authorship attribution is done by calculating the overall difference (measured in Shannon entropy units called centinats) between a Markov chain representing the word-adjacency preferences for an entire canon by one author and the Markov chain representing the word-adjacency preferences of the play we wish to attribute. The author whose canon produces the Markov chain least different from the Markov chain for the play to be attributed is the likeliest author amongst the candidates tested. This method is entirely unlike other recently applied methods of authorship attribution, but it confirms their results: Christopher Marlowe did contribute to parts of all three of Shakespeare's Henry VI plays. My team's detailed study of this was published last year in Shakespeare Quarterly and it used 100 function words to create profiles for Shakespeare, John Fletcher, Ben Jonson, Christopher Marlowe, Thomas Middleton, George Chapman, George Peele, and Robert Greene. Thomas Kyd is not one of the authors we profile because his canon is too small even if we accept Soliman and Perseda as his.
The relative entropy between different authors' profiles is a measure of how easy they are to tell apart by this method [SLIDE]: the larger the relative entropy, the more distinctive are their habits in clustering function words. Because in these calculations it matters which author we treat as 'first' and which as 'second', each pairing appears here twice and it is the average of the two figures that we are concerned with. [SLIDE] The pairing with the lowest relative entropy is Shakespeare and Ben Jonson at an average 4.4 centinats: they are the hardest to tell apart by this method. [SLIDE] The highest pairing is Marlowe and John Fletcher at over 16 centinats: they are easiest to tell apart. [SLIDE] Looking at whose profile is most dissimilar from Shakespeare's, it is Marlowe's at an average 9.5 centinats distance; this dispels any possibility that Marlowe actually wrote all the plays we confidently attribute to Shakespeare.
To get a sense of how well this method works, we can test how similar to each author's profile is each of the Shakespeare plays [SLIDE]. Here are the results, and there are four things to observe. First, for almost every play the red circles indicating Shakespeare appear nearest the bottom of the chart, indicating that for the play in that column the Shakespeare profile is the nearest profile to it. Secondly, the highest of the red circles--indicating greatest distances from the average Shakespeare style--are for (reading from right to left) The Two Noble Kinsmen, Titus Andronicus, Timon of Athens, The Taming of the Shrew, Pericles, Measure for Measure, Macbeth, Henry VIII, and the three Henry VI plays. That is, in every case the plays that our method shows to be most unlike Shakespeare's style in function-word clustering are those plays that entirely different methods have shown to be co-authored. Thirdly, notice that for all authors the coloured markers form roughly horizontal coloured bands at different heights: this could not happen unless the method were indeed capturing something distinct about authorial style. Fourthly, notice that Marlowe's profile is frequently the most distant from the Shakespeare play in question (his purple triangles are mostly at the top of the picture) except for the Henry VI plays recently attributed to him by entirely independent tests, where they fall to the bottom.
What if we test Marlowe's plays this way? [SLIDE] Here are seven plays believed to have been written by Marlowe, where Dido, Queen of Carthage is the only collaborative work, with Thomas Nashe its known co-author. Our method cannot form a reliable profile for Nashe because his canon of sole-authored dramatic works comprises just one play, Summer's Last Will and Testament, and that is too little to go on. Notice that the markers nearest the bottom of the picture are the purple triangles for Marlowe: his profile is closest to each play for each of the six known Marlowe plays tested. On this evidence, Dido, Queen of Carthage looks closest to the style of Shakespeare with Marlowe a close second. With the sole-authored Marlowe plays, each is attributed to Marlowe by a substantial margin and with relative entropies between -6 and -13 centinats. These large values suggest that the plays are much more similar to Marlowe's profile than they are to the profile of an average playwright. This difference may be a result of the fact that Marlowe's plays were written at least a decade before those of most of the other authors considered, thus indicating a shift in writing style during the one or two decades that separate Marlowe from the rest. In the longer version of this paper we discuss the whole question of whether chronology and/or genre are confounding variables in this method. That is, we investigate whether what we are tracking is not authorship but date of composition or dramatic genre. The short answer that summarizes a lot of careful validation is 'no', date and genre are not confounding variables here: authorship really is the strongest correlation detected by this method. We really can tell Shakespeare and Marlowe apart by function-word clustering.