"How far is Shakespeare's language from that of his contemporaries?" by Gabriel Egan

I would like first to record the assistance of my PhD student Nathan Dooner in helping me to understand some of the concepts and problems introduced in this talk. Any errors about them are of course mine not his.

In one sense the question in my title -- "How far is Shakespeare's language from that of his contemporaries?" -- is unanswerable. The notion of distance has no direct application to language. The two domains of language and space are incommensurable. Yet the metaphor of distance is one we easily fall into when discussing texts. And it ceases to be a metaphor once we start to count things in texts. We are familiar with the notion of distance between numbers. [SLIDE] How far is it from 4 to 7? Three. By 'distance' we here mean the absolute magnitude that follows from a subtraction, which is absolute in the sense that we discard the sign of the answer. It matters not which of the two terms we put first. In mathematics, a vertical bar before and after a term means that we take its absolute value.

One of the simplest things to count in language is the
number of words.
[SLIDE] We can ask "What is the distance between the number of words in Shakespeare's
dramatic canon and the number of words in Christopher Marlowe's dramatic
canon?" First we must agree that here we mean by "words" the tokens, so that "never, never, never" counts
as three word tokens not one word type. Next we must agree on exactly which
plays Shakespeare and Marlowe wrote. To
assist in the work of the New Oxford Shakespeare editors in 2011, Hugh Craig
kindly made this calculation based on an agreed set of attributions to
Shakespeare -- leaving out the disputed *Arden of Faversham*, *Double Falsehood*, and
the Additions to *The Spanish Tragedy* -- and came up with the
number 740,209 (Taylor 2017, 247). Assisting the same project, Paul Brown calculated
word counts of plays by other dramatists of Shakespeare's time and if we agree that
the Marlowe canon is *Doctor Faustus*, *Edward II*, *The Jew of
Malta*, *The Massacre at Paris*, and Parts One and Two of *Tamburlaine*
then the total from Brown's counts is 101,146 (Brown 2018).

We might well be tempted to visualize this difference with the kind of picture shown here, called a histogram. How much bigger than the Marlowe canon is Shakespeare's canon? It is 639,063 words bigger. We can do the same calculation [SLIDE] for the complete works of the modern writers Stephen King (just over eight million words) and Isaac Asimov (around seven-and-a-half million words). The distance between these numbers is 642,526, almost exactly the same as the distance between the size of the Shakespeare canon and that of the Marlowe canon.

[SLIDE] Here we put the two histograms on the same scale. Can we say that, regarding canon size, Shakespeare is to Marlowe as King is to Asimov, because the distance between the sizes of the two canons is, in each pairing, about the same? No, because the King and Asimov canons are so much bigger than the Shakespeare and Marlowe canons. For every 10 words that Marlowe left us, Shakespeare left us about 70, whereas for every 10 words that Asimov left us, King has given us about 11. Sometimes the correct measure of difference is not distance but proportion, and the correct operation is division not subtraction [SLIDE]. If instead of subtracting one number from another we divide one number by another we get their relative proportions.
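The two operations can be set side by side in a few lines of Python. The Shakespeare and Marlowe counts are those given above; the modern canon sizes are approximations reconstructed from the talk's figures (taking Asimov at roughly 7.5 million words):

```python
# Distance (subtraction) versus proportion (division) for canon sizes.
# Shakespeare and Marlowe counts are from the talk; the modern canon
# sizes are approximations built from the talk's stated difference.
shakespeare, marlowe = 740_209, 101_146
asimov = 7_500_000
king = asimov + 642_526  # the talk gives the distance, not exact totals

print(abs(shakespeare - marlowe))       # distance: 639063
print(abs(king - asimov))               # distance: 642526 -- almost the same
print(round(shakespeare / marlowe, 2))  # proportion: 7.32
print(round(king / asimov, 2))          # proportion: 1.09
```

The two distances are nearly identical, but the proportions (about 7.3 to 1 versus about 1.1 to 1) tell the opposite story.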

[SLIDE] Proportions are meaningful when the numbers arise from counting actual objects in the world. But when our numbers arise from an arbitrary scale that humans have invented, such as temperature, the reverse is true. There is no meaningful way to divide one temperature by another, because in any temperature scale we simply invent the magnitude of the unit and set the zero point arbitrarily. So that is Common Methodological Error #1. A surprising number of studies that quantify aspects of language get this wrong and use a measure of distance where proportion is the right measure or use proportion where distance is the right measure.

* * *

The 740 thousand words in the Shakespeare canon are not 740 thousand different words, of course, since many of these are repetitions of the same word. Shakespeare used 740 thousand word tokens, but many fewer distinct word types. To figure out how many words Shakespeare knew -- his vocabulary -- we can begin by considering how many different word types he used in his plays. To count the types in Shakespeare, I will use the corpuses of sole-authored-well-attributed plays by Shakespeare and seven of his fellow dramatists for whom more than a few plays survive. [SLIDE] For each of these corpuses I used the transcriptions of the plays from the ProQuest One Literature database (formerly known as Literature Online). After each name I have here recorded how many plays are in that dramatist's corpus.

A complicating factor is that in a small sample of writing we will, simply because it is small, find fewer word types than in a longer piece of writing by the same author. In my talk today I will use 3600 word tokens but only 956 word types, which is far fewer types than I know how to use. Word types that I rarely use simply do not get the 'opportunity', as it were, to appear in a sample of my language as short as this talk. To adjust for the different sizes of the corpuses we can divide the number of word types in a writer's corpus (the count of how many different words) by the number of word tokens in their corpus (the count of how large the corpus is) [SLIDE]. In this division, a small corpus that uses many different words will get a result, a quotient, larger than a big corpus that uses few different words. This types-to-tokens ratio is a measure of the richness of variety in a writer's language.
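The token/type distinction and the ratio can be made concrete in a couple of lines of Python, using a toy sample (my own, for illustration):

```python
# Types versus tokens on a toy sample: repeated words add tokens
# but not types, so the ratio falls as repetition rises.
sample = "never never never words words words words are but wind"
tokens = sample.split()
types = set(tokens)

print(len(tokens))               # 10 tokens
print(len(types))                # 5 types
print(len(types) / len(tokens))  # types-to-tokens ratio: 0.5
```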

Here is a table of the types-to-tokens quotients for our eight early modern dramatists:

| | types | tokens | types/tokens | tokens/types |
| --- | --- | --- | --- | --- |
| George Chapman | 17722 | 237604 | 0.075 | 13.41 |
| John Fletcher | 16700 | 339744 | 0.049 | 20.34 |
| Robert Greene | 8187 | 66967 | 0.122 | 8.18 |
| Ben Jonson | 26680 | 441301 | 0.060 | 16.54 |
| Christopher Marlowe | 10663 | 101506 | 0.105 | 9.52 |
| Thomas Middleton | 22025 | 332972 | 0.066 | 15.12 |
| George Peele | 9262 | 70662 | 0.131 | 7.63 |
| Shakespeare | 30216 | 638302 | 0.047 | 21.12 |

[SLIDE] The smaller the types/tokens value, the less the variety in the writing. [SLIDE] In the last column I have flipped these values to give the reciprocal, the ratio of tokens to types, and on this measure the higher the value the less the variety in the writing. [SLIDE] Notice that the highest three values in this column, the dramatists with the least varied writing, are those for Fletcher, Jonson, and Shakespeare: the dramatists for whom we have the most surviving plays. [SLIDE] And notice that the lowest three values in this column, the dramatists with the most varied writing, are those for Greene, Marlowe, and Peele: the dramatists for whom we have the fewest surviving plays.

In fact, this calculation of language variety or richness is quite wrong. [SLIDE] Dividing the number of different word types by the size of the canon measured in tokens overcompensates for the effect of some writers having large canon sizes, making their language seem less varied than that of writers with small canons. Simply scaling one's counts by the size of a dramatist's canon would be effective if the relationship between the two values -- number of types and number of tokens -- were linear. But it is not. This is Common Methodological Error #2: assuming that a relationship is linear when it is not.

Rather than a straight line, the type/token relationship
is a characteristic curve. [SLIDE] To illustrate it, Gilbert Youmans
(Youmans 1990, 588) noted how many different types had been encountered (and recorded
on the *y* axis) as he read through, from first word to last, the 5000
tokens of a particular text (recorded on the *x* axis). He chose as his text
the simplified story of Shakespeare and Middleton's *Macbeth*
as told in Charles Lamb's *Tales from Shakespeare* and rendered into the
Basic English system invented by Charles Kay Ogden, in which only 850 different
word types are allowed. With so few
types at the writer's disposal it soon became necessary, after writing just a few
sentences, to heavily reuse types that had already been used. Thus each new
sentence is increasingly made up of repetitions of previously used word types and the curve
soon starts to plateau [SLIDE x 9]. That is, as the token count rises steadily, the type count --
which
is increased only by the use of new words not previously seen in the text -- goes
up by ever smaller amounts.
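A minimal simulation shows the same curve. The Zipf-like frequency weighting here is my assumption, chosen only because natural language roughly follows it; the 850-type limit mirrors Ogden's Basic English:

```python
import random

# Sketch: cumulative type count as tokens are read, under a small
# fixed vocabulary (like Ogden's 850-word Basic English). The
# Zipf-like sampling weights are an assumption for illustration.
random.seed(1)
vocab_size = 850
weights = [1 / rank for rank in range(1, vocab_size + 1)]  # Zipf-like

seen = set()
curve = []
for token in random.choices(range(vocab_size), weights=weights, k=5000):
    seen.add(token)
    curve.append(len(seen))

# Growth in types slows as tokens accumulate: the plateau.
print(curve[99], curve[999], curve[4999])
```

Plotting `curve` against token position reproduces the characteristic rising-then-flattening shape Youmans describes.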

The same principle of plateauing applies in real-world
language that is not artificially constrained as Ogden's Basic English is. The
limit in the real world is not Ogden's 850 words but the complete set of words
known by the writer, her vocabulary. This plateauing effect is the reason that
large canons such as Shakespeare's, Jonson's, and Fletcher's tend to produce
overall a lower type/token ratio than small canons. In the large canons, the
writers have more fully exhausted their entire vocabulary and are forced to
repeat themselves. These large canons have more types than the smaller ones, but
not proportionally more. This tapering off of new type deployment allows us to estimate the size of a writer's vocabulary from the shape of this
type/token curve. [SLIDE] The trick is to extrapolate the curve until
it becomes perfectly flat and then read off from the *y* axis the
number of types in the vocabulary.

The mathematical calculations for doing this
are complex but the principle is straightforward. [SLIDE] Here I have rescaled the axes into the
tens of thousands
because with a real author's canon and vocabulary the token and type counts are
much larger than in Youmans's illustrative example using Lamb's-*Macbeth*-in-Basic-English.
In real-world examples, the curve simply stops long before it plateaus, because the
writer's surviving canon (which constrains the *x* axis) is nowhere near
large enough to start exhausting her vocabulary. [SLIDE] Thus we must delete the part of the curve where the
plateauing occurs. How is it possible to extrapolate from this beginning
part of the curve? We do this by observing the rate of change of the slope of
the curve as we move along the *x* axis, from steep at the beginning to
less steep as we read more of the canon [SLIDE x 9]. Each tangent to the
curve shows the slope of the curve at a particular point on the *x* axis.
The rate at which these tangents slow down their clockwise rotation is constant,
so we can predict future tangents at higher *x* values by applying decreasing clockwise rotation, and plot the *y* values that each new
projected tangent gives us [SLIDE x 10].

In a landmark study of 2011, Hugh Craig produced this kind
of curve for William Shakespeare and for 12 of his contemporary playwrights
(Craig 2011). What we have depicted as moving our attention along the *x*
axis, taking in more and more writing by the author, was in Craig's study
implemented as considering what is added to the type count by each successive
new Shakespeare play as he added it to the experiment. When comparing what each new
Shakespeare play added to the Shakespeare type-count with what each new play by
one of the other dramatists added to that dramatist's type-count, Craig found
that Shakespeare was in the middle of the pack. Entirely average. If we want to
know what makes Shakespeare's writing extraordinary, we must stop looking in the
area of vocabulary richness because in that he is not unusual. Shakespeare *seems*
to use a greater variety of words than his rival dramatists, but that is an
illusion caused by his leaving us more writing than they did.

Craig pursued his analysis to consider how often in standard-size chunks of his writing Shakespeare used commonplace words versus rare words. Again Shakespeare was absolutely like his peers in this regard, not exceptional. Craig measured how often Shakespeare used the 100 most common words, compared to his rival dramatists. Again Shakespeare came out as utterly ordinary. Indeed, Craig concluded "If anything his linguistic profile is exceptional in being unusually close to the norm of his time" (Craig 2011, 68).

* * *

[SLIDE] To
produce the figures I showed earlier for the various dramatists' types/tokens
ratios, I used a simple computer program in the language called
Python [SLIDE]. Here is the entire program. In the programming classes I have given, it
takes a typical Humanities scholar about half a day of training to get good
enough to write a program like this one. This program not only
produces the types and tokens counts, but also prints a frequency table showing
for any text how often each of the types it contains appears in that text.
[SLIDE] Here is the beginning of that table for George Chapman's plays. We
normally expect the word *the* to be the most frequent in any large body of
writing, but in Chapman's plays the word *and* is more common.
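The program itself appears on the slide rather than on this page, so what follows is only a sketch of what such a short program might look like, not the actual one used. The tokenization rule (runs of letters and apostrophes) is my assumption:

```python
from collections import Counter
import re

# A sketch of a short type/token counting program: count the tokens
# and types in a text and print a frequency table of its word types.
# The tokenization rule below is an assumption, not the talk's own.
def frequency_table(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    print("tokens:", len(tokens))
    print("types:", len(counts))
    print("types/tokens:", round(len(counts) / len(tokens), 3))
    for word, n in counts.most_common(10):
        print(word, n)
    return counts

# For a real corpus one would read the play transcriptions from disk,
# e.g. frequency_table(open("chapman.txt", encoding="utf-8").read()).
```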

The notion of a 'distance' between one writer's language
and another's gets particularly complicated when we work in two dimensions instead of one, meaning that we are
simultaneously counting two features of their writing instead of one. Here we take the counts for *the* and *and* for all
eight of our
playwrights and plot them on an *x*/*y* scatterplot [SLIDE]. In this
picture, each dot represents a corpus of plays by a different dramatist and each
dot has a position in the picture that represents two numbers. [SLIDE] How far
the dot is along the *x* axis shows what proportion of that dramatist's tokens
are the word *the*. [SLIDE] How far the dot is along the *y* axis shows what
proportion of the dramatist's tokens are the word *and*. [SLIDE] So, we can
read off from Fletcher's dot that 0.022 (that is, 2.2%) of his tokens are the word *the*
and 0.033 (that is, 3.3%) of his tokens are the word *and*. Any dot in the top
left corner of the scatterplot has many more *and*s than *the*s and
any dot in the bottom right corner has more *the*s than *and*s. The
dots for Peele and Marlowe in the top-right corner show that these two dramatists
use *the*
and *and* more than the other dramatists do.

We can see that Peele and Marlowe in the top right corner are far from the other dramatists. But how far are they from, say, Jonson? [SLIDE] The answer might seem simple: we draw a line directly from the Peele dot to the Jonson dot and measure its length, and then do the same for the distance from the Marlowe dot to the Jonson dot. This as-the-crow-flies measurement is called the Euclidean distance. But another way to measure the same thing is to imagine how a taxi driver might make the journey from Jonson to Marlowe or Peele if she had to drive along roads laid out in a grid of city blocks [SLIDE]. So long as the driver does not overshoot the destination [SLIDE x 4] either in the north-south or east-west direction, all the different routes have the same total length: [SLIDE] 17 city blocks for Marlowe and 22 city blocks for Peele. Named after a city famed for its grid layout, this measure is known as Manhattan distance.

The Euclidean and Manhattan measurements give different distances for the same journeys. In this example, they at least agree that the Marlowe data point is nearer to the Jonson data point than the Peele data point is. But it is possible for the Euclidean and Manhattan measurements to give different answers about which of two points is nearer a third point. [SLIDE] Consider these three points A, B, and C. Which of B and C is nearer to A? [SLIDE] By Manhattan Distance, the drive from A to B is 10 city blocks south followed by 28 city blocks west for a total distance of 38 blocks, while the drive from A to C is 20 blocks east and 20 blocks north for a total of 40 blocks, [SLIDE] so B is nearer to A than C is. [SLIDE] We calculate the Euclidean Distance as the crow flies using Pythagoras's theorem for right-angled triangles and find that C is nearer to A than B is.
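The A, B, C example can be checked in a few lines of Python, placing A at the origin, B 28 blocks west and 10 south, and C 20 east and 20 north:

```python
import math

# The A/B/C example: Manhattan and Euclidean distance can disagree
# about which of two points is nearer a third.
a, b, c = (0, 0), (-28, -10), (20, 20)

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

print(manhattan(a, b), manhattan(a, c))  # 38 vs 40: B is nearer
# By Euclidean distance (Pythagoras), C is nearer: ~29.7 vs ~28.3.
print(round(euclidean(a, b), 1), round(euclidean(a, c), 1))
```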

[SLIDE] Yet a third way to measure the distance between these
authors' habits regarding the use of *the* and *and* is to say that we
do not care about how often the words are used overall but rather the relative
preferences for one of these words over the other. [SLIDE] Fletcher clearly prefers *and*
over *the*, using more *and*s [SLIDE]. Middleton clearly prefers *the*
over *and*, using more *the*s. [SLIDE] And Shakespeare falls somewhere
between Fletcher and Middleton, but nearer to Middleton. To represent
these preferences, we can use these angles in a measure called Cosine Distance
[SLIDE x 3]. Cosine Distance can easily disagree with Euclidean and Manhattan
Distance about how different two writers' styles are. [SLIDE] Consider the
Cosine Distance from Greene to Middleton, which is smaller [SLIDE] than the
Cosine Distance from Shakespeare to Middleton, although by Euclidean and
Manhattan Distance the Shakespeare data point is closer to the Middleton data
point than the Greene data point is. What this tells us is that although Greene
uses many more *the*s and *and*s than Middleton does, using them about
as liberally as Shakespeare does, Greene's strong preference for *the* over
*and* is much like Middleton's strong preference for *the* over *and*
and is different from Shakespeare's habit, which only slightly prefers *the*
over *and*. This is Common Methodological Error #3: assuming that all
distance measures will agree on how far apart are the data points that we derive
by counting features of various writers' writings.
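Since the actual proportions are on the slide, the following sketch uses hypothetical coordinates chosen only to mirror the pattern just described: Greene sharing Middleton's preference (direction) while Shakespeare sits nearer Middleton in raw position:

```python
import math

# Hypothetical the/and proportions, chosen to mirror the talk's
# pattern: Greene shares Middleton's *preference* (angle) while
# Shakespeare is nearer to Middleton in raw position.
middleton = (1.0, 0.8)
greene = (5.0, 4.0)       # same preference, much heavier usage
shakespeare = (1.2, 1.1)  # close by, but a more balanced preference

def cosine_distance(p, q):
    dot = p[0] * q[0] + p[1] * q[1]
    return 1 - dot / (math.hypot(*p) * math.hypot(*q))

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Euclidean: Shakespeare is nearer to Middleton than Greene is...
print(euclidean(shakespeare, middleton) < euclidean(greene, middleton))   # True
# ...but by cosine distance Greene is nearer, sharing Middleton's angle.
print(cosine_distance(greene, middleton) < cosine_distance(shakespeare, middleton))  # True
```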

The importance of our choice of distance measure increases
as we use more dimensions. [SLIDE] If we count not only the occurrences of *the*
and *and* but also the occurrences of the verb *to be* in segments of
plays by Jonson, Marlowe, and Peele, we end up with three data values for each
play. We can visualize in a three-dimensional scatterplot the results of such
counting, but when I show you such a three-dimensional plot as a two-dimensional picture
it becomes impossible to read off the values.
[SLIDE] Consider this data point. [SLIDE] Perhaps it is floating in space and
shows a value of 73 for the play segment's frequency of *the*, [SLIDE] 40 for its
frequency of *to be*, and zero for its frequency of *and*. [SLIDE] Or
maybe this data point is resting on the bottom plane and shows a value of 41 for
the frequency of *the*, 400 for the frequency of *and*, and zero for
the frequency of *to be*. The only way to see the true distances between
the points is to rotate the scatterplot [SLIDE].

We need not stop at counting three features of a text and treating our counts as the coordinates of points in three-dimensional space. We might count occurrences of the 100 most common words and treat the resulting numbers as coordinates in 100-dimensional space. Obviously I cannot show you a visualization of 100-dimensional space, but the calculations we use for measuring distance also work for space in any number of dimensions. [SLIDE] In experiments with multi-dimensional data we often want to ask just which of several clouds of data points, each derived from one author's works, is the cloud from which a new data point is least distant. The nearest cloud will be for writing that is closest in style to that of the writing that generated the new data point. Unfortunately, as we increase the number of dimensions, something odd happens to the data points: they spread out so that the kind of clustering by author that we see here ceases to appear.

To understand why this happens, let us return to the simple number line we began with [SLIDE]. If our one-dimensional universe has 10 possible places that a data point can fall, then we need only 10 data points to fill that universe [SLIDE]. But if we add one more dimension to make a two-dimensional universe [SLIDE] then we need 100 data points to fill the 100 places in that universe [SLIDE]. And if we have only 10 data points from our experiments -- our counting of features in texts -- then those data points will be more widely spaced out [SLIDE]. If we add a third dimension [SLIDE] then we need 1000 data points to fill that space [SLIDE]. And if we have only 10 data points from our experiment, then they will be widely distant from one another. Every new dimension multiplies by 10 the number of data points we need to fill the space, and if we have only a few data points they become ever more widely separated. [SLIDE] At 100 dimensions, which is the space we are using if we count the 100 most commonly used words, as we often do in authorship studies, we need [SLIDE] 10 to the power of 100 data points to fill the space. For a sense of perspective, this is greater than the number of atoms in the universe. Even if our experiments give us tens of billions of data points -- which rarely happens -- these data points will be hugely distant from one another in 100-dimensional space. The notion of nearness -- the notion of how far Shakespeare's writing is from that of his contemporaries that we started with -- stops making sense when our data points are so far apart.
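A small simulation makes the spreading-out visible: with a fixed number of points drawn uniformly at random, the average pairwise Euclidean distance grows as we add dimensions:

```python
import math
import random

# Sketch: with a fixed number of random points, average pairwise
# Euclidean distance grows with the number of dimensions, so
# "nearness" becomes less and less discriminating.
random.seed(0)

def mean_pairwise_distance(n_points, dims):
    pts = [[random.random() for _ in range(dims)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return sum(dists) / len(dists)

for dims in (1, 2, 10, 100):
    print(dims, round(mean_pairwise_distance(20, dims), 2))
```

The printed averages rise steadily with the dimension count even though the number of points stays at 20.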

This problem is known as the Curse of Dimensionality and
it is a significant obstacle across data science. Just measuring more features
in language does not necessarily give us more knowledge, because our distance
measures are less discriminating as we move into the higher dimensions. This is
Common Methodological Error #4: thinking that the more features of writers'
writings that we count, the more data we derive, the more knowledge we have
about that writing. In important ways, counting more gives us less knowledge.
The problem of measuring distance does not affect all distance measures equally. Euclidean, Manhattan, and Cosine
Distance, and various more esoteric measures, are more or less discriminating of
authorship depending on precisely what we are measuring and how many dimensions
our data have. Unfortunately, most published papers on computational analysis of
writing style pay little or no attention to the choice of distance measure.
Investigators typically accept the default distance measures provided in such
software packages as the popular *Stylo* package for R and neglect to consider how the
choice of distance measure affects their results.

To conclude, then. The metaphor of distance has no *direct*
application to the comparison of texts. Once we start counting features of
texts, the notion of distance has some validity, but it is not simply a matter
of subtracting one number from another. Sometimes division rather than
subtraction gives the more meaningful sense of 'distance'. When we create sets
of numbers arising from our counting of features of writing, we can measure the
'distance' between the sets of numbers in more than one way and none is
obviously the correct way. Investigators must show in each investigation why a
particular distance measure is the one that yields the most useful distinctions
between texts. Finally, our ability to generate more and more numbers from texts
should not lead us to conclude that we are gaining more and more information
about them, on account of the Curse of Dimensionality. Having captured data
along multiple dimensions we can perform dimension-reduction processes such as
Principal Components Analysis. In a longer talk I could discuss how these
methods again make it easy to deceive oneself about the meaning of one's
results. In the field of computational stylistics, self-deception is the easiest
trap to fall into [SLIDE of Feynman].

**Works Cited**

Brown, Paul. 2018. *Play Word Counts*. A contribution to the website 'Shakespeare's Early Editions: Computational Methods for Textual Studies' hosted by De Montfort University and funded by the Arts and Humanities Research Council from grant AH/N007654/1.

Craig, Hugh. 2011. "Shakespeare's Vocabulary: Myth and Reality." *Shakespeare Quarterly* 62: 53-74.

Taylor, Gary. 2017. "Did Shakespeare Write *The Spanish Tragedy* Additions?" *The New Oxford Shakespeare Authorship Companion*. Edited by Gary Taylor and Gabriel Egan. Oxford: Oxford University Press. 246-60.

Youmans, Gilbert. 1990. "Measuring Lexical Style and Competence: The Type-token Vocabulary Curve." *Style* 24: 584-99.