
"What is Entropy and How Does it Help us Attribute Authorship?" by Gabriel Egan

I start from the assumption that most people in this room do not know what entropy is, let alone what it might have to do with Shakespeare Studies, so I'll begin with a practical exercise to help illustrate it. First please turn to the neighbour on your right or left, smile, and form a two-person partnership. In other words, get into pairs. One of you should hold the printed sheet provided and a pen and the other should hold a blank piece of paper and a pen. The person with the blank sheet is the guesser, trying to guess the letters, words, and sentences that the other person, the guessee, is holding. The guesser will take a guess at what the first letter is and say it, and then is told by the guessee whether they are correct or, if they are incorrect, what the correct letter was. I will repeat that: the guesser will take a guess at what the first letter is and say it, and then is told by the other person either "Yes, that is correct" or "No, the correct letter is ...". The guesser should write down each correct letter as it is discovered, either because they guessed it or because they were told it. The guessee, who has the full text, should for each guess write above each letter in their printed sheet either a dash for a letter that is correctly guessed or, if the letter is wrongly guessed, that wrong guess itself. [SLIDE] As a cheat-sheet, here is a typical exchange.

Allow about 3 minutes for this exercise.

    The guessee should now have, in those dashes, a record of how many letters were correctly predicted by the guesser, and it should be more than half if the guesser is any good. We have just recreated the experiment by which Claude Shannon, the father of information theory (and hence the computer age), calculated that English prose is overall about 75% redundant: three times out of four the next letter is guessable. This is the reason that today's SMS text-speak and various kinds of shorthand work [SLIDE]. In this context, redundancy means predictability: after the letter t the letter h is much more likely to follow than x is, and directly after q the appearance of u is almost a certainty. Shannon gave us the mathematics with which to quantify these patterns of predictability, and borrowed from physics the term entropy for it. With Shannon's mathematics of information we can capture, quantify, and study the patterns of repetition in language that make for its predictability, and we can use the data to compare texts.
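    For those who want to see the arithmetic, here is a minimal sketch in Python (my own illustration, not Shannon's procedure) that estimates entropy from single-letter frequencies alone. Counting single letters understates the redundancy; Shannon's 75% figure comes from conditioning on long runs of preceding context, which is what the guessing game approximates.

from collections import Counter
from math import log2

def letter_entropy(text):
    """Shannon entropy in bits per letter, from single-letter frequencies."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return -sum((n / total) * log2(n / total) for n in counts.values())

sample = "after the letter t the letter h is much more likely to follow than x is"
h = letter_entropy(sample)
h_max = log2(26)  # the maximum, if all 26 letters were equally likely
print(f"{h:.2f} bits per letter; redundancy {1 - h / h_max:.0%}")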

    So much for the predictability of individual letters within words, which relies on your knowledge of common patterns of letters found in English words. If we had more time to complete more of the text, the guesser might also have been able to draw upon her knowledge of the habits of this particular text's author, if she recognized who it was. Anybody know? Knowing that it is Raymond Chandler could help you predict that a word beginning with g-u- is likely to be gun or gum rather than guildhall or gulag, and knowing Chandler's preferred phrasings could help you guess the entire word that is coming next.

    These wider word-order choices can be quantified with the same techniques, and the same precision, as the letter-order choices. Thus we can capture authors' individual phrasing preferences. The most useful words to capture for this kind of analysis are not the relatively rare and distinctive lexical words but the common and meaningless function words. We need, though, a systematic way to record how often certain words appear close to other words to form certain phrases, and how often they appear far away. The solution is to use what mathematicians call a Markov chain. Take this extract from Shakespeare's Hamlet [SLIDE]

With one auspicious and one dropping eye,
With mirth in funeral and with dirge in marriage,
In equal scale weighing delight and dole,
(Shakespeare Hamlet 1.2.11-13)

[SLIDE] Let us confine our attention to the proximities, one from another, of the four function words with, and, one, and in. [SLIDE] Starting with with and looking forward five words we find an occurrence of the word one, an occurrence of the word and, and another occurrence of the word one. [SLIDE] We record that in our Markov chain by a line from with to and with a value of 1 and a line from with to one with a value of 2. [SLIDE] We are done with the first word in the extract, With, and we [SLIDE] move to the next occurrence of one of our function words, which is the second word in the extract, one. Again looking forward five words we spot an occurrence of and and an occurrence of one, [SLIDE] so we draw a line from one to and, weighted 1, and a line from one to itself, weighted 1. [SLIDE] Then we move to the next occurrence of one of our function words, and it is and in the middle of the first line. Looking forward five words, we find an occurrence of one and an occurrence of with, so we add these to our Markov chain as two weighted lines emerging from the node for and. We proceed through the extract in the same way, adding fresh weighted lines (called edges) between nodes to indicate how often each word appears within five words of the others [SLIDE x 16]. This is our completed Markov chain.
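    For those who prefer code to slides, here is a minimal sketch in Python of the counting just performed, using raw counts within a five-word window; the proximity and limit-probability weightings discussed below are left out. Its output should reproduce the edges of the completed chain.

from collections import defaultdict

FUNCTION_WORDS = {"with", "and", "one", "in"}
WINDOW = 5

extract = """With one auspicious and one dropping eye,
With mirth in funeral and with dirge in marriage,
In equal scale weighing delight and dole,"""

# Strip punctuation, lower-case, and split into a flat list of words.
tokens = [w.strip(",.;:").lower() for w in extract.split()]

edges = defaultdict(int)  # (source, target) -> raw count
for i, word in enumerate(tokens):
    if word in FUNCTION_WORDS:
        for following in tokens[i + 1 : i + 1 + WINDOW]:
            if following in FUNCTION_WORDS:
                edges[(word, following)] += 1

for (src, dst), weight in sorted(edges.items()):
    print(f"{src} -> {dst}: {weight}")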

    We then do the same for the same four function words' appearances in another passage, [SLIDE] this time from Thomas Dekker's Satiromastix; here is the completed Markov chain. [SLIDE] We end up with two Markov chains, each showing the Word Adjacency Network (WAN) for the four words with, and, one, and in, in each extract. These two chains contain the information about the word proximities in the two extracts, and using Shannon's mathematics for entropy we can compare them. You will see that there are fewer lines in the Satiromastix network, but the absolute number of lines is not the most important point. The key question is, "when this author chooses to follow one of these words with another of these words, which is she most likely to choose?" These networks embody the author's preferences that answer this question. You can see that in the Dekker extract, the word in is never followed (within five words) by the word with: there is no line running from in to with. Dekker instead chooses to follow in by and (one time) and by one (two times). [BLANK SLIDE]

    This is only an illustration of the idea; for authorship attribution we use many more than four function words; 100 would be typical, but the resulting pictures are too complex to show you. And of course rather than short extracts from plays we use whole authorial canons as our samples. And instead of just recording the raw numbers of edges from node to node, there are weightings of edges and nodes to be applied using Shannon's mathematics for entropy and what is called limit probability. The edge-weightings reflect the fact that we consider words appearing close to one another to be more significant than words that are far apart, so instead of scoring "1" for a word appearing anywhere within our 5-word window, we give a greater score to words appearing near the beginning of the window. The limit-probability weighting of nodes reflects the fact that we attach greater significance to words that are used often in the text being tested than to words that are used infrequently.
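    A sketch of these two weightings follows, with its assumptions flagged loudly: the published method's exact decay schedule and its computation of limit probability may well differ; here a simple linear fall-off stands in for the former, and a node's share of the total outgoing weight stands in for the latter.

from collections import defaultdict

FUNCTION_WORDS = {"with", "and", "one", "in"}
WINDOW = 5

def weighted_wan(tokens):
    """Build edge weights with a distance decay, plus rough limit probabilities."""
    edges = defaultdict(float)
    for i, word in enumerate(tokens):
        if word not in FUNCTION_WORDS:
            continue
        for d, following in enumerate(tokens[i + 1 : i + 1 + WINDOW], start=1):
            if following in FUNCTION_WORDS:
                # Assumed linear decay: the nearest position scores 1.0,
                # the fifth position scores 0.2.
                edges[(word, following)] += (WINDOW - d + 1) / WINDOW
    # Assumed stand-in for limit probability: each node's share of the
    # total outgoing weight, approximating how often the word is used.
    out_weight = defaultdict(float)
    for (src, _dst), w in edges.items():
        out_weight[src] += w
    total = sum(out_weight.values())
    limit_probs = {node: w / total for node, w in out_weight.items()}
    return edges, limit_probs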

    Specifically, how do we compare two WANs? The technical answer is that for each edge common to the two WANs we subtract from the natural logarithm of the weight in the first WAN the natural logarithm of the weight in the second WAN, multiply this difference by the weight of the edge in the first WAN, and then multiply by the limit probability of the node from which this edge originates. This calculation is made for each edge and the values are summed to express the total difference. In this method it matters which WAN we designate as 'first' and which 'second', so the procedure is performed twice, switching the designations the second time. (I give you this technical account because the last time I spoke about this an audience complained that I had concealed the most interesting part of the story; chacun à son goût.)
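    That verbal recipe transcribes almost directly into code. Here is a minimal sketch, assuming each WAN arrives as a dictionary mapping edges to weights, together with a dictionary of limit probabilities for its nodes:

from math import log

def wan_difference(wan1, limits1, wan2):
    """Sum, over edges (u, v) common to both WANs, of
    limits1[u] * wan1[(u, v)] * (ln wan1[(u, v)] - ln wan2[(u, v)])."""
    total = 0.0
    for (u, v), w1 in wan1.items():
        if (u, v) in wan2:
            total += limits1[u] * w1 * (log(w1) - log(wan2[(u, v)]))
    return total

# The measure is asymmetric, so it is computed both ways around:
# difference_a_to_b = wan_difference(wan_a, limits_a, wan_b)
# difference_b_to_a = wan_difference(wan_b, limits_b, wan_a)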

    In this method, authorship attribution is done by calculating the overall difference (measured in Shannon entropy units called centinats) between a Markov chain representing the word-adjacency preferences for an entire canon by one author and the Markov chain representing the word-adjacency preferences of the play we wish to attribute. The author whose canon produces the Markov chain least different from the Markov chain for the play to be attributed is the likeliest author. The first, quite limited, application of this method will be published in the next issue of Shakespeare Quarterly, in which we--that is, my collaborators in the Electrical Engineering Department at the University of Pennsylvania and I--agree with what Hugh Craig and his collaborators have found with their entirely different methods. That is, we agree by this method that Christopher Marlowe had a substantial hand in all three parts of Shakespeare's Henry 6.

* * *

    But my starting question was about the bad quartos, which quite a lot of Shakespearians now think simply don't exist as a category of texts. The New Bibliographers, of course, thought that the editions of Romeo and Juliet (1597), Henry 5 (1600), The Merry Wives of Windsor (1602), Hamlet (1603) and Pericles (1609) were corrupt far beyond the normal vicissitudes of early modern printing, and corrupt in particular, distinctive ways. Of course, for Shakespeare editions we lack anything like an uncorrupted original (say, an authorial manuscript) from which to measure the usual level of textual corruption in the printing house, although we can get some sense of how much corruption occurs when one edition is printed directly from another.

    We can, though, measure the relative entropy between texts, quantifying how far a quarto text differs from its corresponding Folio counterpart in its pattern of repetitions. In her 1994 book Reforming the 'Bad' Quartos, Kathleen O. Irace attempted to quantify these differences, but her method was largely subjective even though she processed her results computationally. Another recent endeavour in this field was, in my view, a complete failure: Lene B. Petersen's book Shakespeare's Errant Texts of 2010. Indeed, the failure of Petersen's work is the stimulus to my own, because she asked exactly the right questions and pointed the way forward.

    To illustrate this, here is a pattern of repetition identified by Petersen [SLIDE read the bold]. Petersen's wider purpose was to compare Q1 Hamlet to texts in the oral tradition--ballads--in order to argue that in repeated oral performance the play script itself gets modified to produce such local patterns of repetition as part of a general process of patterning and the creation of formulaic phrasings. This is precisely the sort of thing that we ought to be able to measure with information theory. But there are repetitions and repetitions, and they are not all alike. I want to end with a complex kind of repetition that I hope Shakespearians will consider typically Shakespearian. [SLIDE] In As You Like It, Corin asks Touchstone how he likes the shepherd's life, and gets the famous reply [SLIDE]:

TOUCHSTONE  Truly, shepherd, in respect of itself, it is a good life; but in respect that it is a shepherd's life, it is naught. In respect that it is solitary, I like it very well; but in respect that it is private, it is a very vile life. Now in respect it is in the fields, it pleaseth me well; but in respect it is not in the court, it is tedious. (As You Like It 3.2.13-18)

This is full of repetitions, but it is not immediately clear how many there are nor how to count and record them. [SLIDE] Do we care most about long phrases being precisely repeated, like the three occurrences of "in respect that it is", [SLIDE] or can we admit two more that are nearly the same in reading "in respect it is"? [SLIDE] Or should we care more about short phrases, so that the six occurrences of "in respect" [SLIDE] and the nine occurrences of "it is" count more heavily? There is no right or wrong way to do this, no one method for weighting the evidence from strings of different lengths. (In a related context, Brian Vickers argues that long strings of words are always inherently better evidence for how the mind works--and for shared authorship--than short strings are, but I believe that Hugh Craig has proven this view to be mistaken.)
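    A minimal sketch in Python makes the counting problem concrete: tally every repeated n-gram in the speech at several lengths, and the six occurrences of "in respect" and the nine of "it is" duly fall out, but nothing in the tally itself tells us how to weight the long strings against the short.

from collections import Counter

speech = ("Truly, shepherd, in respect of itself, it is a good life; "
          "but in respect that it is a shepherd's life, it is naught. "
          "In respect that it is solitary, I like it very well; "
          "but in respect that it is private, it is a very vile life. "
          "Now in respect it is in the fields, it pleaseth me well; "
          "but in respect it is not in the court, it is tedious.")

tokens = [w.strip(",.;:").lower() for w in speech.split()]

for n in (2, 3, 5):
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    repeated = {" ".join(g): c for g, c in ngrams.items() if c > 1}
    print(f"repeated {n}-grams:", repeated)

    The important point with which I will close is that for the first time we are becoming able to approach in an empirical, quantitative, and replicable way the patterns in the language of Shakespeare and his contemporaries. The results are already changing the boundaries of the dramatic canons, and will shortly, I predict, change our notions of authorship altogether.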