"How far is Shakespeare's language from that of his contemporaries?" by Gabriel Egan

I would like first to record the assistance of my PhD student Nathan Dooner in helping me to understand some of the concepts and problems introduced in this talk. Any errors about them are of course mine not his.

In one sense the question in my title -- "How far is Shakespeare's language from that of his contemporaries?" -- is unanswerable. The notion of distance has no direct application to language. The two domains of language and space are incommensurable. Yet the metaphor of distance is one we easily fall into when discussing texts. And it ceases to be a metaphor once we start to count things in texts. We are familiar with the notion of distance between numbers. [SLIDE] How far is it from 4 to 7? Three. By 'distance' we here mean the absolute magnitude that follows from a subtraction, which is absolute in the sense that we discard the sign of the answer. It matters not which of the two terms we put first. In mathematics, a vertical bar before and after a term means that we take its absolute value.

One of the simplest things to count in language is the
number of words.
[SLIDE] We can ask "What is the distance between the number of words in Shakespeare's
dramatic canon and the number of words in Christopher Marlowe's dramatic
canon?" First we must agree that here we mean by "words" the tokens, so that "never, never, never" counts
as three word tokens not one word type. Next we must agree on exactly which
plays Shakespeare and Marlowe wrote. To
assist in the work of the New Oxford Shakespeare editors in 2011, Hugh Craig
kindly made this calculation based on an agreed set of attributions to
Shakespeare -- leaving out the disputed *Arden of Faversham*, *Double Falsehood*, and
the Additions to *The Spanish Tragedy* -- and came up with the
number 740,209 (Taylor 2017, 247). Assisting the same project, Paul Brown calculated
word counts of plays by other dramatists of Shakespeare's time and if we agree that
the Marlowe canon is *Doctor Faustus*, *Edward II*, *The Jew of
Malta*, *The Massacre at Paris*, and Parts One and Two of *Tamburlaine*
then the total from Brown's counts is 101,146 (Brown 2018).

We might well be tempted to visualize this difference with the kind of picture shown here, called a histogram. How much bigger than the Marlowe canon is Shakespeare's canon? It is 639,063 words bigger. We can do the same calculation [SLIDE] for the complete works of the modern writers Stephen King (just over eight million words) and Isaac Asimov (around seven-and-a-half million words). The distance between these numbers is 642,526, almost exactly the same as the distance between the size of the Shakespeare canon and that of the Marlowe canon.

[SLIDE] Here we put the two histograms on the same scale. Can we say that, regarding canon size, Shakespeare is to Marlowe as King is to Asimov, because the distance between the sizes of the two canons is, in each pairing, about the same? No, because the King and Asimov canons are so much bigger than the Shakespeare and Marlowe canons. For every 10 words that Marlowe left us, Shakespeare left us about 70, whereas for every 10 words that Asimov left us, King has given us about 11. Sometimes the correct measure of difference is not distance but proportion, and the correct operation is division not subtraction [SLIDE]. If instead of subtracting one number from another we divide one number by another we get their relative proportions.
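The two operations can be set side by side in a few lines of Python. The Shakespeare and Marlowe counts are those given above; the modern canon sizes are approximations reconstructed from the talk's figures (taking Asimov at roughly 7.5 million words):

```python
# Distance (subtraction) versus proportion (division) for canon sizes.
# Shakespeare and Marlowe counts are from the talk; the modern canon
# sizes are approximations built from the talk's stated difference.
shakespeare, marlowe = 740_209, 101_146
asimov = 7_500_000
king = asimov + 642_526  # the talk gives the distance, not exact totals

print(abs(shakespeare - marlowe))       # distance: 639063
print(abs(king - asimov))               # distance: 642526 -- almost the same
print(round(shakespeare / marlowe, 2))  # proportion: 7.32
print(round(king / asimov, 2))          # proportion: 1.09
```

The two distances are nearly identical, but the proportions (about 7.3 to 1 versus about 1.1 to 1) tell the opposite story.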

[SLIDE] Proportions are meaningful when the numbers arise from counting actual objects in the world. But when our numbers arise from an arbitrary scale that humans have invented, such as temperature, the reverse is true. There is no meaningful way to divide one temperature by another, because in any temperature scale we simply invent the magnitude of the unit and set the zero point arbitrarily. So that is Common Methodological Error #1. A surprising number of studies that quantify aspects of language get this wrong and use a measure of distance where proportion is the right measure or use proportion where distance is the right measure.

* * *

The 740 thousand words in the Shakespeare canon are not 740 thousand different words, of course, since many of these are repetitions of the same word. Shakespeare used 740 thousand word tokens, but many fewer distinct word types. To figure out how many words Shakespeare knew -- his vocabulary -- we can begin by considering how many different word types he used in his plays. To count the types in Shakespeare, I will use the corpuses of sole-authored-well-attributed plays by Shakespeare and seven of his fellow dramatists for whom more than a few plays survive. [SLIDE] For each of these corpuses I used the transcriptions of the plays from the ProQuest One Literature database (formerly known as Literature Online). After each name I have here recorded how many plays are in that dramatist's corpus.

A complicating factor is that in a small sample of writing we will, simply because it is small, find fewer word types than in a longer piece of writing by the same author. In my talk today I will use 3600 word tokens but only 956 word types, which is far fewer types than I know how to use. Word types that I rarely use simply do not get the 'opportunity', as it were, to appear in a sample of my language as short as this talk. To adjust for the different sizes of the corpuses we can divide the number of word types in a writer's corpus (the count of how many different words) by the number of word tokens in their corpus (the count of how large the corpus is) [SLIDE]. In this division, a small corpus that uses many different words will get a result, a quotient, larger than a big corpus that uses few different words. This types-to-tokens ratio is a measure of the richness of variety in a writer's language.
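The token/type distinction and the ratio can be made concrete in a couple of lines of Python, using a toy sample (my own, for illustration):

```python
# Types versus tokens on a toy sample: repeated words add tokens
# but not types, so the ratio falls as repetition rises.
sample = "never never never words words words words are but wind"
tokens = sample.split()
types = set(tokens)

print(len(tokens))               # 10 tokens
print(len(types))                # 5 types
print(len(types) / len(tokens))  # types-to-tokens ratio: 0.5
```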

Here is a table of the types-to-tokens quotients for our eight early modern dramatists:

| | types | tokens | types/tokens | tokens/types |
| --- | --- | --- | --- | --- |
| George Chapman | 17722 | 237604 | 0.075 | 13.41 |
| John Fletcher | 16700 | 339744 | 0.049 | 20.34 |
| Robert Greene | 8187 | 66967 | 0.122 | 8.18 |
| Ben Jonson | 26680 | 441301 | 0.060 | 16.54 |
| Christopher Marlowe | 10663 | 101506 | 0.105 | 9.52 |
| Thomas Middleton | 22025 | 332972 | 0.066 | 15.12 |
| George Peele | 9262 | 70662 | 0.131 | 7.63 |
| Shakespeare | 30216 | 638302 | 0.047 | 21.12 |

[SLIDE] The smaller the types/tokens value, the less the variety in the writing. [SLIDE] In the last column I have flipped these values to give the reciprocal, the ratio of tokens to types, and on this measure the higher the value the less the variety in the writing. [SLIDE] Notice that the highest three values in this column, the dramatists with the least varied writing, are those for Fletcher, Jonson, and Shakespeare: the dramatists for whom we have the most surviving plays. [SLIDE] And notice that the lowest three values in this column, the dramatists with the most varied writing, are those for Greene, Marlowe, and Peele: the dramatists for whom we have the fewest surviving plays.

In fact, this calculation of language variety or richness is quite wrong. [SLIDE] Dividing the number of different word types by the size of the canon measured in tokens overcompensates for the effect of some writers having large canon sizes, making their language seem less varied than that of writers with small canons. Simply scaling one's counts by the size of a dramatist's canon would be effective if the relationship between the two values -- number of types and number of tokens -- were linear. But it is not. This is Common Methodological Error #2: assuming that a relationship is linear when it is not.

Rather than a straight line, the type/token relationship
is a characteristic curve. [SLIDE] To illustrate it, Gilbert Youmans
(Youmans 1990, 588) noted how many different types had been encountered (and recorded
on the *y* axis) as he read through, from first word to last, the 5000
tokens of a particular text (recorded on the *x* axis). He chose as his text
the simplified story of Shakespeare and Middleton's *Macbeth*
as told in Charles Lamb's *Tales from Shakespeare* and rendered into the
Basic English system invented by Charles Kay Ogden, in which only 850 different
word types are allowed. With so few
types at the writer's disposal it soon became necessary, after writing just a few
sentences, to heavily reuse types that had already been used. Thus each new
sentence is increasingly made up of repetitions of previously used word types and the curve
soon starts to plateau [SLIDE x 9]. That is, as the token count rises steadily, the type count --
which
is increased only by the use of new words not previously seen in the text -- goes
up by ever smaller amounts.
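A minimal simulation shows the same curve. The Zipf-like frequency weighting here is my assumption, chosen only because natural language roughly follows it; the 850-type limit mirrors Ogden's Basic English:

```python
import random

# Sketch: cumulative type count as tokens are read, under a small
# fixed vocabulary (like Ogden's 850-word Basic English). The
# Zipf-like sampling weights are an assumption for illustration.
random.seed(1)
vocab_size = 850
weights = [1 / rank for rank in range(1, vocab_size + 1)]  # Zipf-like

seen = set()
curve = []
for token in random.choices(range(vocab_size), weights=weights, k=5000):
    seen.add(token)
    curve.append(len(seen))

# Growth in types slows as tokens accumulate: the plateau.
print(curve[99], curve[999], curve[4999])
```

Plotting `curve` against token position reproduces the characteristic rising-then-flattening shape Youmans describes.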

The same principle of plateauing applies in real-world
language that is not artificially constrained as Ogden's Basic English is. The
limit in the real world is not Ogden's 850 words but the complete set of words
known by the writer, her vocabulary. This plateauing effect is the reason that
large canons such as Shakespeare's, Jonson's, and Fletcher's tend to produce
overall a lower type/token ratio than small canons. In the large canons, the
writers have more fully exhausted their entire vocabulary and are forced to
repeat themselves. These large canons have more types than the smaller ones, but
not proportionally more. This tapering off of new type deployment allows us to estimate the size of a writer's vocabulary from the shape of this
type/token curve. [SLIDE] The trick is to extrapolate the curve until
it becomes perfectly flat and then read off from the *y* axis the
number of types in the vocabulary.

The mathematical calculations for doing this
are complex but the principle is straightforward. [SLIDE] Here I have rescaled the axes into the
tens of thousands
because with a real author's canon and vocabulary the token and type counts are
much larger than in Youmans's illustrative example using Lamb's-*Macbeth*-in-Basic-English.
In real-world examples, the curve simply stops long before it plateaus, because the
writer's surviving canon (which constrains the *x* axis) is nowhere near
large enough to start exhausting her vocabulary. [SLIDE] Thus we must delete the part of the curve where the
plateauing occurs. How is it possible to extrapolate from this beginning
part of the curve? We do this by observing the rate of change of the slope of
the curve as we move along the *x* axis, from steep at the beginning to
less steep as we read more of the canon [SLIDE x 9]. Each tangent to the
curve shows the slope of the curve at a particular point on the *x* axis.
The rate at which these tangents slow down their clockwise rotation is constant,
so we can predict future tangents at higher *x* values by applying decreasing clockwise rotation, and plot the *y* values that each new
projected tangent gives us [SLIDE x 10].

In a landmark study of 2011, Hugh Craig produced this kind
of curve for William Shakespeare and for 12 of his contemporary playwrights
(Craig 2011). What we have depicted as moving our attention along the *x*
axis, taking in more and more writing by the author, was in Craig's study
implemented as considering what is added to the type count by each successive
new Shakespeare play as he added it to the experiment. When comparing what each new
Shakespeare play added to the Shakespeare type-count with what each new play by
one of the other dramatists added to that dramatist's type-count, Craig found
that Shakespeare was in the middle of the pack. Entirely average. If we want to
know what makes Shakespeare's writing extraordinary, we must stop looking in the
area of vocabulary richness because in that he is not unusual. Shakespeare *seems*
to use a greater variety of words than his rival dramatists, but that is an
illusion caused by his leaving us more writing than they did.

Craig pursued his analysis to consider how often in standard-size chunks of his writing Shakespeare used commonplace words versus rare words. Again Shakespeare was absolutely like his peers in this regard, not exceptional. Craig measured how often Shakespeare used the 100 most common words, compared to his rival dramatists. Again Shakespeare came out as utterly ordinary. Indeed, Craig concluded "If anything his linguistic profile is exceptional in being unusually close to the norm of his time" (Craig 2011, 68).

* * *

[SLIDE] To
produce the figures I showed earlier for the various dramatists' types/tokens
ratios, I used a simple computer program in the language called
Python [SLIDE]. Here is the entire program. In the programming classes I have given, it
takes a typical Humanities scholar about half a day of training to get good
enough to write a program like this one. This program not only
produces the types and tokens counts, but also prints a frequency table showing
for any text how often each of the types it contains appears in that text.
[SLIDE] Here is the beginning of that table for George Chapman's plays. We
normally expect the word *the* to be the most frequent in any large body of
writing, but in Chapman's plays the word *and* is more common.
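The program itself appears on the slide rather than on this page, so what follows is only a sketch of what such a short program might look like, not the actual one used. The tokenization rule (runs of letters and apostrophes) is my assumption:

```python
from collections import Counter
import re

# A sketch of a short type/token counting program: count the tokens
# and types in a text and print a frequency table of its word types.
# The tokenization rule below is an assumption, not the talk's own.
def frequency_table(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    print("tokens:", len(tokens))
    print("types:", len(counts))
    print("types/tokens:", round(len(counts) / len(tokens), 3))
    for word, n in counts.most_common(10):
        print(word, n)
    return counts

# For a real corpus one would read the play transcriptions from disk,
# e.g. frequency_table(open("chapman.txt", encoding="utf-8").read()).
```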

The notion of a 'distance' between one writer's language
and another's gets particularly complicated when we work in two dimensions instead of one, meaning that we are
simultaneously counting two features of their writing instead of one. Here we take the counts for *the* and *and* for all
eight of our
playwrights and plot them on an *x*/*y* scatterplot [SLIDE]. In this
picture, each dot represents a corpus of plays by a different dramatist and each
dot has a position in the picture that represents two numbers. [SLIDE] How far
the dot is along the *x* axis shows what proportion of that dramatist's tokens
are the word *the*. [SLIDE] How far the dot is along the *y* axis shows what
proportion of the dramatist's tokens are the word *and*. [SLIDE] So, we can
read off from Fletcher's dot that 0.022 (that is, 2.2%) of his tokens are the word *the*
and 0.033 (that is, 3.3%) of his tokens are the word *and*. Any dot in the top
left corner of the scatterplot has many more *and*s than *the*s and
any dot in the bottom right corner has more *the*s than *and*s. The
dots for Peele and Marlowe in the top-right corner show that these two dramatists
use *the*
and *and* more than the other dramatists do.

We can see that Peele and Marlowe in the top right corner are far from the other dramatists. But how far are they from, say, Jonson? [SLIDE] The answer might seem simple: we draw a line directly from the Peele dot to the Jonson dot and measure its length, and then do the same for the distance from the Marlowe dot to the Jonson dot. This as-the-crow-flies measurement is called the Euclidean distance. But another way to measure the same thing is to imagine how a taxi driver might make the journey from Jonson to Marlowe or Peele if she had to drive along roads laid out in a grid of city blocks [SLIDE]. So long as the driver does not overshoot the destination [SLIDE x 4] either in the north-south or east-west direction, all the different routes have the same total length: [SLIDE] 17 city blocks for Marlowe and 22 city blocks for Peele. Named after a city famed for its grid layout, this measure is known as Manhattan distance.

The Euclidean and Manhattan measurements give different distances for the same journeys. In this example, they at least agree that the Marlowe data point is nearer to the Jonson data point than the Peele data point is. But it is possible for the Euclidean and Manhattan measurements to give different answers about which of two points is nearer a third point. [SLIDE] Consider these three points A, B, and C. Which of B and C is nearer to A? [SLIDE] By Manhattan Distance, the drive from A to B is 10 city blocks south followed by 28 city blocks west for a total distance of 38 blocks, while the drive from A to C is 20 blocks east and 20 blocks north for a total of 40 blocks, [SLIDE] so B is nearer to A than C is. [SLIDE] We calculate the Euclidean Distance as the crow flies using Pythagoras's theorem for right-angled triangles and find that C is nearer to A than B is.
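The A, B, C example can be checked in a few lines of Python, placing A at the origin, B 28 blocks west and 10 south, and C 20 east and 20 north:

```python
import math

# The A/B/C example: Manhattan and Euclidean distance can disagree
# about which of two points is nearer a third.
a, b, c = (0, 0), (-28, -10), (20, 20)

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

print(manhattan(a, b), manhattan(a, c))  # 38 vs 40: B is nearer
# By Euclidean distance (Pythagoras), C is nearer: ~29.7 vs ~28.3.
print(round(euclidean(a, b), 1), round(euclidean(a, c), 1))
```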

[SLIDE] Yet a third way to measure the distance between these
authors' habits regarding the use of *the* and *and* is to say that we
do not care about how often the words are used overall but rather the relative
preferences for one of these words over the other. [SLIDE] Fletcher clearly prefers *and*
over *the*, using more *and*s [SLIDE]. Middleton clearly prefers *the*
over *and*, using more *the*s. [SLIDE] And Shakespeare falls somewhere
between Fletcher and Middleton, but nearer to Middleton. To represent
these preferences, we can use these angles in a measure called Cosine Distance
[SLIDE x 3]. Cosine Distance can easily disagree with Euclidean and Manhattan
Distance about how different two writers' styles are. [SLIDE] Consider the
Cosine Distance from Greene to Middleton, which is smaller [SLIDE] than the
Cosine Distance from Shakespeare to Middleton, although by Euclidean and
Manhattan Distance the Shakespeare data point is closer to the Middleton data
point than the Greene data point is. What this tells us is that although Greene
uses many more *the*s and *and*s than Middleton does, using them about
as liberally as Shakespeare does, Greene's strong preference for *the* over
*and* is much like Middleton's strong preference for *the* over *and*
and is different from Shakespeare's habit, which only slightly prefers *the*
over *and*. This is Common Methodological Error #3: assuming that all
distance measures will agree on how far apart are the data points that we derive
by counting features of various writers' writings.
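Since the actual proportions are on the slide, the following sketch uses hypothetical coordinates chosen only to mirror the pattern just described: Greene sharing Middleton's preference (direction) while Shakespeare sits nearer Middleton in raw position:

```python
import math

# Hypothetical the/and proportions, chosen to mirror the talk's
# pattern: Greene shares Middleton's *preference* (angle) while
# Shakespeare is nearer to Middleton in raw position.
middleton = (1.0, 0.8)
greene = (5.0, 4.0)       # same preference, much heavier usage
shakespeare = (1.2, 1.1)  # close by, but a more balanced preference

def cosine_distance(p, q):
    dot = p[0] * q[0] + p[1] * q[1]
    return 1 - dot / (math.hypot(*p) * math.hypot(*q))

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Euclidean: Shakespeare is nearer to Middleton than Greene is...
print(euclidean(shakespeare, middleton) < euclidean(greene, middleton))   # True
# ...but by cosine distance Greene is nearer, sharing Middleton's angle.
print(cosine_distance(greene, middleton) < cosine_distance(shakespeare, middleton))  # True
```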

The importance of our choice of distance measure increases
as we use more dimensions. [SLIDE] If we count not only the occurrences of *the*
and *and* but also the occurrences of the verb *to be* in segments of
plays by Jonson, Marlowe, and Peele, we end up with three data values for each
play. We can visualize in a three-dimensional scatterplot the results of such
counting, but when I show you such a three-dimensional plot as a two-dimensional picture
it becomes impossible to read off the values.
[SLIDE] Consider this data point. [SLIDE] Perhaps it is floating in space and
shows a value of 73 for the play segment's frequency of *the*, [SLIDE] 40 for its
frequency of *to be*, and zero for its frequency of *and*. [SLIDE] Or
maybe this data point is resting on the bottom plane and shows a value of 41 for
the frequency of *the*, 400 for the frequency of *and*, and zero for
the frequency of *to be*. The only way to see the true distances between
the points is to rotate the scatterplot [SLIDE].

We need not stop at counting three features of a text and treating our counts as the coordinates of points in three-dimensional space. We might count occurrences of the 100 most common words and treat the resulting numbers as coordinates in 100-dimensional space. Obviously I cannot show you a visualization of 100-dimensional space, but the calculations we use for measuring distance also work for space in any number of dimensions. [SLIDE] In experiments with multi-dimensional data we often want to ask just which of several clouds of data points, each derived from one author's works, is the cloud from which a new data point is least distant. The nearest cloud will be for writing that is closest in style to that of the writing that generated the new data point. Unfortunately, as we increase the number of dimensions, something odd happens to the data points: they spread out so that the kind of clustering by author that we see here ceases to appear.

To understand why this happens, let us return to the simple number line we began with [SLIDE]. If our one-dimensional universe has 10 possible places that a data point can fall, then we need only 10 data points to fill that universe [SLIDE]. But if we add one more dimension to make a two-dimensional universe [SLIDE] then we need 100 data points to fill the 100 places in that universe [SLIDE]. And if we have only 10 data points from our experiments -- our counting of features in texts -- then those data points will be more widely spaced out [SLIDE]. If we add a third dimension [SLIDE] then we need 1000 data points to fill that space [SLIDE]. And if we have only 10 data points from our experiment, then they will be widely distant from one another. Every new dimension multiplies by 10 the number of data points we need to fill the space, and if we have only a few data points they become ever more widely separated. [SLIDE] At 100 dimensions, which is the space we are using if we count the 100 most commonly used words, as we often do in authorship studies, we need [SLIDE] 10 to the power of 100 data points to fill the space. For a sense of perspective, this is greater than the number of atoms in the universe. Even if our experiments give us tens of billions of data points -- which rarely happens -- these data points will be hugely distant from one another in 100-dimensional space. The notion of nearness -- the notion of how far Shakespeare's writing is from that of his contemporaries that we started with -- stops making sense when our data points are so far apart.
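A small simulation makes the spreading-out visible: with a fixed number of points drawn uniformly at random, the average pairwise Euclidean distance grows as we add dimensions:

```python
import math
import random

# Sketch: with a fixed number of random points, average pairwise
# Euclidean distance grows with the number of dimensions, so
# "nearness" becomes less and less discriminating.
random.seed(0)

def mean_pairwise_distance(n_points, dims):
    pts = [[random.random() for _ in range(dims)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return sum(dists) / len(dists)

for dims in (1, 2, 10, 100):
    print(dims, round(mean_pairwise_distance(20, dims), 2))
```

The printed averages rise steadily with the dimension count even though the number of points stays at 20.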

This problem is known as the Curse of Dimensionality and
it is a significant obstacle across data science. Just measuring more features
in language does not necessarily give us more knowledge, because our distance
measures are less discriminating as we move into the higher dimensions. This is
Common Methodological Error #4: thinking that the more features of writers'
writings that we count, the more data we derive, the more knowledge we have
about that writing. In important ways, counting more gives us less knowledge.
The problem of measuring distance does not affect all distance measures equally. Euclidean, Manhattan, and Cosine
Distance, and various more esoteric measures, are more or less discriminating of
authorship depending on precisely what we are measuring and how many dimensions
our data have. Unfortunately, most published papers on computational analysis of
writing style pay little or no attention to the choice of distance measure.
Investigators typically accept the default distance measures provided in such
software packages as the popular *Stylo* package for R and neglect to consider how the
choice of distance measure affects their results.

To conclude, then. The metaphor of distance has no *direct*
application to the comparison of texts. Once we start counting features of
texts, the notion of distance has some validity, but it is not simply a matter
of subtracting one number from another. Sometimes division rather than
subtraction gives the more meaningful sense of 'distance'. When we create sets
of numbers arising from our counting of features of writing, we can measure the
'distance' between the sets of numbers in more than one way and none is
obviously the correct way. Investigators must show in each investigation why a
particular distance measure is the one that yields the most useful distinctions
between texts. Finally, our ability to generate more and more numbers from texts
should not lead us to conclude that we are gaining more and more information
about them, on account of the Curse of Dimensionality. Having captured data
along multiple dimensions we can perform dimension-reduction processes such as
Principal Components Analysis. In a longer talk I could discuss how these
methods again make it easy to deceive oneself about the meaning of one's
results. In the field of computational stylistics, self-deception is the easiest
trap to fall into [SLIDE of Feynman].

**Works Cited**

Brown, Paul. 2018. *Play Word Counts*. A contribution to the website 'Shakespeare's Early Editions: Computational Methods for Textual Studies' hosted by De Montfort University and funded by the Arts and Humanities Research Council from grant AH/N007654/1.

Craig, Hugh. 2011. "Shakespeare's Vocabulary: Myth and Reality." *Shakespeare Quarterly* 62: 53-74.

Taylor, Gary. 2017. "Did Shakespeare Write *The Spanish Tragedy* Additions?" *The New Oxford Shakespeare Authorship Companion*. Edited by Gary Taylor and Gabriel Egan. Oxford: Oxford University Press. 246-60.

Youmans, Gilbert. 1990. "Measuring Lexical Style and Competence: The Type-token Vocabulary Curve." *Style* 24: 584-99.