I am one of the General Editors of the New Oxford Shakespeare Complete Works. The editor's first job is to, as we say, "establish" the text: to figure out what the author wrote. In the case of Shakespeare, this is made complicated by the fact that it was perfectly normal for plays to be published without the author's name on the title page. But in 1623, seven years after Shakespeare died in 1616, a collection of "Master William Shakespeare's Comedies, Histories, and Tragedies" was published in London. This is the table of contents for that book. [SLIDE] This collection of 36 plays is now known as the First Folio of Shakespeare ("folio" refers to the book's large format) and for centuries it has been the foundation of his canon of works. [SLIDE] It has long been known that this First Folio omits at least two plays that it should have included. Everyone agrees that the play Pericles, published in 1609 with a title page saying that it is "By William Shakespeare", is indeed by him and by the playwright George Wilkins. We do not know why it was omitted from the Folio, though its co-authorship might have something to do with it. [SLIDE] The play The Two Noble Kinsmen was published in 1634--18 years after Shakespeare's death--and it says on its title page that it was written by William Shakespeare and John Fletcher, and that seems to be true. This was the first time a play's title page told its readers that Shakespeare co-authored a play with another man, which is something that we now believe Shakespeare did throughout his career. [SLIDE] So far, then, the plays missing from the Folio are ones printed elsewhere with Shakespeare's name on them.

You might think, then, that a Complete Works of Shakespeare should contain the 36 plays published in the 1623 First Folio plus the two collaborative plays that were left out of the First Folio and published separately: Pericles published in 1609 and The Two Noble Kinsmen published in 1634, making 38 plays in all. But in fact there are 5 further plays that are included in the New Oxford Shakespeare because we believe they were written in part by Shakespeare although they were not published with his name on the title page. This is 'found' Shakespeare in the sense that until we discovered that they are in part by Shakespeare they were 'lost' works of his. Perhaps more disturbingly, however, there are newly 'lost' Shakespeare plays in the sense that we now believe that 9 of the plays in the 1623 First Folio were not written wholly by Shakespeare. That is, in each case, we have for many years been treating as Shakespeare's writing parts of these plays that were written by someone else. And this is not one or two isolated cases: the latest scholarship shows that more than a third of the entire Shakespeare canon is plays he co-wrote with one or more dramatists. In this talk I want to tell the stories of how the lost writings by Shakespeare have been found and how writings that we have long thought were Shakespeare's are now 'lost' in the sense of being attributed to someone else.

* * *

Our stories begin in the middle of the nineteenth century. In 1847 Samuel Hickson reviewed three books about The Two Noble Kinsmen, the only Shakespeare play whose first edition proclaimed it to be co-authored, and explored the problems of attributing particular parts to Shakespeare and Fletcher. After some loosely phrased comparisons of character development, "sentiment" and "boldness of metaphor", Hickson turned to "redundant syllables", meaning feminine endings of verse lines, and noted that Shakespeare wrote them far less often than Fletcher (Hickson 1847, 63, 65, 66). [SLIDE] Hickson assigned a speech to Fletcher on the grounds that it does something Fletcher favours, and Shakespeare does not: using "in the plural certain nouns of quality or circumstance commonly used in the singular", such as honours and banishments (Hickson 1847, 68). Another long scene Hickson gave to Shakespeare because it is in prose and he thought Fletcher virtually incapable of long prose scenes (Hickson 1847, 69). We see in these comments the first signs of a rigorous comparison of writing styles based on objective and countable features, but without the extensive listing of evidence that is needed to prove stylistic difference.

    The first time that a play that had been thought to be solely by Shakespeare was seriously investigated for possible co-authorship was when Hickson and James Spedding independently considered the First Folio text of Henry VIII (Hickson 1850; Spedding 1850).

    Across Shakespeare's late plays Spedding found that about 28-38% of lines have feminine endings; tabulating the figure for each scene of Henry VIII he found a marked difference between the scenes his practised ear already told him were Shakespeare's, in which the feminine endings ranged from 28% to 40%, and those he already thought were Fletcher's, in which the rate was 50% to 77%. There is no overlap in this stylistic feature: Shakespeare's maximum is 10 percentage points lower than Fletcher's minimum. Moreover, the range for the scenes in Henry VIII already subjectively attributed to Shakespeare matched the range in his other late plays The Winter's Tale and Cymbeline. Spedding had found a strong marker of the difference between the two men's styles. If it is true that Spedding first made his division on subjective grounds then the agreement of the numbers with his impression is all the more convincing. And since Hickson independently arrived at the same conclusion about the same play, the claim is stronger still. The professional application of objective measures of style to the problem of Shakespearian authorship attribution had begun.
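
The no-overlap claim can be checked mechanically. What follows is a minimal sketch using only the range endpoints quoted above (Spedding's individual scene figures are not reproduced here); the function name `ranges_overlap` is my own, introduced purely for illustration.

```python
# Range endpoints (percent of lines with feminine endings) as quoted above
# for the Henry VIII scenes; only the extremes are given in this talk.
shakespeare_range = (28.0, 40.0)
fletcher_range = (50.0, 77.0)


def ranges_overlap(a, b):
    """Return True if the closed intervals a and b share any value."""
    return a[0] <= b[1] and b[0] <= a[1]


# Gap, in percentage points, between Shakespeare's maximum and Fletcher's minimum.
gap_points = fletcher_range[0] - shakespeare_range[1]

print(ranges_overlap(shakespeare_range, fletcher_range))
print(gap_points)
```

A test of this kind is only as good as the counts fed into it, which is why the later arguments about how to define a feminine ending matter so much.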

    The first editors of Shakespeare to be university-employed professionals were W. G. Clark and W. Aldis Wright, whose Cambridge-Macmillan Complete Works of 1863-66 was by far the most scrupulous investigation of the texts to date. Clark and Wright's edition of Macbeth for Oxford University's Clarendon Press in 1869 revisited the previously observed likenesses between parts of this play and parts of Thomas Middleton's play The Witch. Their conclusion about Macbeth was that ". . . the play was interpolated after Shakespeare's death . . . . The interpolator was, not improbably, Thomas Middleton; who . . . expanded the parts originally assigned by Shakespeare to the weird sisters, and also introduced a new character, Hecate" (Clark and Wright 1869, xii). Not until the late twentieth century would the full implications of this insight be widely accepted in Shakespeare scholarship.

    The hero of the Victorian break-through on authorship problems is F. G. Fleay. In 1873 F. J. Furnivall, who had already founded the Early English Text Society in 1864 and the Chaucer Society in 1868, founded the New Shakspere Society, whose purpose was to study "the metrical and phraseological peculiarities of Shakspere" (Furnivall 1874b, vi). The point was to ascertain the order in which the plays were written and so track the progress of Shakespeare's mind across his career, but looking closely at Shakespeare's versification and phrasing meant counting certain features. Comparisons with other writers' counts were inevitable. The New Shakspere Society did not set out to alter the attribution of plays amongst Shakespeare and his contemporaries, but its philologically influenced focus on countable features--of which its member Fleay was the leading exponent--necessarily led that way.

    1874 was the annus mirabilis for authorship attribution by analysis of internal evidence. In his first paper addressed to the New Shakspere Society, Fleay acknowledged Furnivall's point that metrical tests can help determine the order in which Shakespeare's plays were written, but he saw a "far more important end" in determining the genuineness of the plays traditionally assigned to Shakespeare (Fleay 1874a, 6). It was the act of making his counts that first led Fleay to suspect that The Taming of the Shrew and parts of Timon of Athens, Pericles, Henry VIII / All is True and the Henry VI plays are not by Shakespeare, and as he observed this was largely a new development in the field. Fleay's tests mentioned in this first paper were the rates of rhyming, "double endings" (that is, feminine endings), "incomplete lines" (presumably those with fewer than ten syllables), and "Alexandrines" (that is, lines of iambic hexameters) (Fleay 1874a, 7).

    From these rates, Fleay found reason to suspect the above plays and also--because their rates of these metrical phenomena put them at odds with the chronological order established by other means--he found reason to suppose that Troilus and Cressida and All's Well that Ends Well are Shakespeare's revisions of his earlier works. Fleay acknowledged that subjectivity entered the problem because the "laws of metre" are not "definitely laid down" (Fleay 1874a, 15). That is, there remains room for experts to disagree about how close in sound two words must be to count as a rhyme, about the permissible relineation of verse to regularize meter, and about how tightly to define a term such as Alexandrine (does the caesura have to appear after the third iamb?). The lack of shared definitions of metrical features was to prove an obstacle to the corroboration of findings based on counting them, and in the discussion of Fleay's paper reported in the Society's Transactions the problem was extensively debated.

    Fleay's next paper for the new Society divided Timon of Athens between Shakespeare and an unknown author using the same metrical tests as his first paper (Fleay 1874c). His division is strikingly similar to the modern generally accepted division, in particular in giving scenes 1.2 and 3.1 to 3.5 to the other writer (Jowett 2004a, 202). At the fourth meeting of the Society, Fleay presented his evidence confirming the already widespread suspicion that Acts 1 and 2 of Pericles are by someone other than Shakespeare (Fleay 1874d). The starkest difference is in the number of rhyming lines: Acts 1 and 2 come to about the same length as Acts 3, 4, and 5, but have 195 rhymes to the latter's 14.

    In the discussions of these early papers on authorship attribution, only one new test was added to those devised by Fleay. Spedding proposed what he called the Pause Test, building on what others had called the phenomenon of the stopped line (Spedding 1874). This measures what is now usually called enjambment, which is where the grammatical clauses of the verse run across multiple lines rather than ending at the ends of lines. As Spedding remarked, in early Shakespeare the ends of lines tended also to be the ends of grammatical clauses, while in late Shakespeare--and he rightly identified Cymbeline as an extreme case--enjambment predominates so that clauses run over the ends of lines and in spoken delivery an actor pausing at the line ending would disrupt the sense.

    For his last paper delivered in the first year of the meetings of the New Shakspere Society, Fleay picked up the suggestion by Clark and Wright that Macbeth contains material added to the play by Middleton after Shakespeare's death (Fleay 1875). Unfortunately, he also saw such adaptation at work in Julius Caesar, which opinion found no followers. After a lengthy tour of what he considered the parts of Macbeth too poorly written to be Shakespeare's--a recurrent attitude in early authorship studies--Fleay provided the stylistic evidence for his division of the play.

    The first piece of evidence was that Macbeth is abnormally short, the only comparable plays being The Comedy of Errors, The Two Gentlemen of Verona, and A Midsummer Night's Dream--all of which might be short simply because they are early interlude-style comedies not mature tragedies--and Julius Caesar, Pericles, and Timon of Athens that Fleay had already bracketed off as "finished or altered by some other poet" (Fleay 1875, 355). Much more persuasive was Fleay's second piece of evidence: more scenes end with rhyming couplets in Macbeth than in any other Shakespeare play, and there are many more such couplets overall, and yet by the middle of the first decade of the seventeenth century (around the time Macbeth was written) Shakespeare had largely given up using rhyme.

    The last paper of interest to us that was read in the New Shakspere Society's first year made a subtle distinction between different kinds of line-ending (Ingram 1875). Ordinarily the tenth syllable of a regular iambic pentameter line is stressed, and John K. Ingram was concerned to distinguish two kinds of deviation from this norm by use of a weak monosyllable in this position. In the first kind, which Ingram called a light ending, "the voice can to a certain small extent dwell" at that point, while the other, which he thought properly deserved the name of a weak ending, is "so essentially proclitic" that "we are forced to run" it "into the closest connection with the opening words of the succeeding line" (Ingram 1875, 447).

    Most usefully, Ingram listed the particular monosyllabic words that in his scheme usually fall into each category, and detailed the circumstances--such as emphatic use or being followed by a parenthetical clause--that might on occasion put it in a different category. This was an important development in the formulating of precise rules for metrical analysis since even scholars who disagreed about the validity of the categories could nonetheless check that certain counts were being made according to the stated rules. Indeed, so long as the rules were being followed rigorously the validity of the categories need not be agreed upon if an investigator's purpose were merely to find verifiable discriminators of one writing style from another.

    Ingram was alert to the problem that as more people started counting verse features the freedom to interpret certain rules and to understand certain phonetic features in different ways might result in scholars' raw counts failing to agree. To forestall this he had an idea: "I would strongly advise the appointment by the New Shakspere Society of a 'Counting Committee', to fix beyond doubt the numbers of lines of different sorts in the several plays, and to verify all the figures brought out by the application of the different verse-tests" (Ingram 1875, 449n2). We in the early twenty-first century are no nearer this ideal situation than Ingram was.

    The methods for authorship attribution by the analysis of internal features of the plays remained essentially unchanged for the next 100 years. There were just two methods: counting the frequencies of certain verse features--new studies introduced new countable features--and finding parallel passages showing that a work of known authorship contains the same words and/or phrases and/or sequences of ideas as the work for which the authorship is sought. Of all the things one might count in literary writing, habits of versification had the attraction that they could be counted fairly quickly and recorded quite easily--the key metric was generally expressed as the average number of lines per occurrence (or its inverse, occurrences per line)--and they demonstrably distinguished different writers. A complication was that writers might drift in their habits over time, so that on many tests the loose versification of late Shakespeare scores significantly differently from the metrically more-regular writing of his early career. For Shakespeare we can, to some extent at least, factor this into the calculations since the chronological order of his works is in large part well agreed upon.

    Other than habits of verse, the obvious features of writing that may in principle be counted are the choices of words and the various frequencies of their occurrence. Until the 1960s this was virtually impossible on any substantial scale because without machine-readable texts the counting had to be done by hand and it is extraordinarily laborious. The existence of printed concordances to Shakespeare made it possible to locate his use of interesting lexical words but concordances typically omit the high-frequency function words: the articles, prepositions, and others that serve primarily grammatical rather than lexical purposes. Because they occur at high frequencies that are demonstrably distinctive of authorship, function words are of special interest to attribution investigators. However, without concordances to all the other dramatists of Shakespeare's time, the comparison of Shakespeare's use of language with that of other writers could not be systematic and where it was attempted it relied on scholars' happening upon or recalling parallel passages.

    Just as the lack of concordances to the works of all the other dramatists of Shakespeare's time hindered the task of distinguishing truly significant parallel passages from the merely commonplace, so in versification the lack of tables of frequency rates for all the dramatists hindered the extensive comparisons that would make studies exhaustive. Philip Timberlake's PhD thesis accepted by Princeton University in 1926 went some way towards remedying this deficiency, and despite covering only the drama up to 1595 it remains the most complete tabulation of the frequencies of feminine endings in existence (Timberlake 1931). Timberlake addressed head-on the problem alluded to by Ingram in his suggestion that a committee might standardize the counting of verse features: "there has been no general agreement as to what constitutes a feminine ending" (Timberlake 1931, 1–2). Without standardized definitions, comparisons can be made only within individual studies--in which the investigator was, we hope, at least self-consistent--and not between studies.

    A frequent point of disagreement between investigators was how to count lines ending in the words heaven, even, hour, bower, flower, tower, power, and friar, all of which may be pronounced monosyllabically to give a masculine ending or disyllabically to give a feminine one. Timberlake's solution was to count both ways, keeping separate tallies based on the assumption that they are all monosyllabic to give his "strict" count and on the alternative assumption that they are all disyllabic to give his "loose" count (Timberlake 1931, 5). Likewise, Timberlake separated out--and excluded from his "strict" count--all feminine endings caused by proper nouns appearing at the ends of lines, which he thought might compel a poet to use feminine endings more often than he was otherwise wont; the "loose" count included them.
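
Timberlake's double tally is easy to express as a procedure. The sketch below is my own illustration, not his working method: it assumes each verse line has already been labelled by hand, it assumes both percentages are taken over all lines, and the five sample lines are invented.

```python
# Words that may be scanned as one syllable or two, as listed above.
AMBIGUOUS = {"heaven", "even", "hour", "bower", "flower", "tower", "power", "friar"}


def feminine_rates(lines):
    """Return (strict, loose) feminine-ending percentages.

    Each line is a (last_word, kind) pair, where kind is a hand-made label:
      "masc"   - ordinary masculine ending
      "fem"    - ordinary feminine ending
      "proper" - feminine ending caused by a proper noun
    Lines ending in an AMBIGUOUS word are decided by the tally, not the label.
    Assumption: both percentages use the full line count as denominator.
    """
    strict = loose = 0
    for last_word, kind in lines:
        if last_word.lower() in AMBIGUOUS:
            loose += 1      # treated as disyllabic only in the loose count
        elif kind == "proper":
            loose += 1      # proper nouns are excluded from the strict count
        elif kind == "fem":
            strict += 1
            loose += 1
    total = len(lines)
    return 100 * strict / total, 100 * loose / total


# Invented sample of five line endings, for illustration only.
sample = [("question", "fem"), ("heaven", "masc"), ("die", "masc"),
          ("Titus", "proper"), ("sleep", "masc")]
strict_rate, loose_rate = feminine_rates(sample)
print(strict_rate, loose_rate)
```

Reporting both figures lets a later investigator compare like with like, whichever definition that investigator prefers.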

    Towards the end of his study, Timberlake applied his findings to various problems of Shakespearian authorship. In the anonymous play Edward III the Countess of Salisbury's scenes show a sharp rise in the rate of feminine endings from well below Shakespeare's norm at 2.1% for the rest of the play to well within his norm of 4-16% for these scenes; Timberlake concluded that it is distinctly possible that Shakespeare contributed them (Timberlake 1931, 78–80, 124). Regarding Sir Thomas More Timberlake could find no clear evidence since it consistently uses feminine endings in more than 18% of its lines, and on that basis alone was probably written after 1596 when all writers began to use this feature more frequently (Timberlake 1931, 80). Dividing by scenes, Timberlake found significant variations in the rates of feminine endings in Titus Andronicus, with rates of 1.9% in 1.1, 2.4% in 2.1 and 1.5% in 4.1. No other scenes fell below 4.1% and most tested significantly higher still, leading Timberlake to suspect that Shakespeare's co-author was George Peele or Robert Greene (Timberlake 1931, 114–18).

    The problem of differentiating Shakespeare's writing from that of his co-author Fletcher was revisited by Cyrus Hoy as part of a series of seven articles on the purported Fifty Comedies and Tragedies by Beaumont and Fletcher, as their second folio of 1679 styles itself. In the first of these articles Hoy laid out his chief means for detecting Fletcher's writing: "use of such a pronominal form as ye for you, of third person singular verb forms in ‑th (such as the auxiliaries hath and doth), of contractions like 'em for them, i'th' for in the, o'th' for on/of the, h'as for he has, and 's for his (as in in's, on's, and the like)" (Hoy 1962, 130–31). Hoy acknowledged that such tests had been used before--most innovatively by W. E. Farnham and A. C. Partridge--and claimed only that his was the first study to apply them all systematically to the whole of a substantial body of writing. In large part Hoy's method confirmed earlier divisions of Henry VIII and The Two Noble Kinsmen between Shakespeare and Fletcher. The kinds of tests employed by Hoy have widely varying success with different authors, being particularly effective for distinguishing Massinger from Fletcher but less good for others.

    Hoy's success in establishing a series of linguistic-preference tests and applying them to a substantial body of drama was inspirational to others in the field. Essentially the same kind of analysis--counting preferences for different ways of saying the same thing--was applied in the 1970s by David J. Lake and MacDonald P. Jackson to the problem of identifying Middleton's work, in the course of which the case for his hand in Shakespeare's Timon of Athens emerged most clearly (Lake 1975; Jackson 1979). Lake made no claims for innovation in the kinds of internal evidence he collected, indeed quite the opposite: "the general methods or particular tests I employ", he wrote, have "all been used over the past fifty years in authorship investigations" (Lake 1975, 10).

    One of Jackson's methods had not previously been applied to Shakespearian authorship attribution: the counting of the frequency of occurrence of so-called function words that express grammatical relationships between other words while carrying little or none of their own lexical value. Their role is to bring together the nouns, verbs, and adjectives in order to give a sentence its foundational structure. Typical function words in the English language are prepositions, conjunctions, articles, particles, auxiliary verbs, and pronouns, although linguists differ on just which words have so little lexical value that they properly belong in this category. Jackson counted the frequency of occurrence of each of 13 function words--a/an, and, but, by, for, from, in, it, of, that, the, to, and with--in sample writings by the 20 most prolific dramatists of Shakespeare's time.
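
With machine-readable text, a count of this kind takes only a few lines. This is a minimal sketch, not Jackson's procedure: the tokenizer is crude, and expressing the result as occurrences per 1,000 words is my assumption, chosen only so that samples of different lengths can be compared.

```python
import re
from collections import Counter

# The 13 function words listed above; "a" and "an" are merged into "a/an".
FUNCTION_WORDS = ["a/an", "and", "but", "by", "for", "from", "in", "it",
                  "of", "that", "the", "to", "with"]


def function_word_rates(text):
    """Return occurrences per 1,000 words for each of the 13 function words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter("a/an" if t in ("a", "an") else t for t in tokens)
    return {w: 1000 * counts[w] / len(tokens) for w in FUNCTION_WORDS}


rates = function_word_rates("To be, or not to be, that is the question")
print(rates["to"], rates["that"], rates["and"])
```

A ten-word sample is of course far too short to be meaningful; the point is only that the rate, not the raw count, is what gets compared across dramatists.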

    In his PhD thesis on distinguishing Middleton and Shakespeare's writing, and especially apportioning their shares of Timon of Athens, Roger Holdsworth put himself squarely in the tradition of Hoy, Lake and Jackson (Holdsworth 1982; Holdsworth 2012). Like them, he counted various linguistic features such as contractions and the preference for modern (and urban) you over archaic (and rural) thou, but Holdsworth also introduced the innovation of counting the various formulaic phrasings used in stage directions to find author-specific idiosyncrasies (Holdsworth 1982, 181–235). His comprehensive study of the form "Enter A and B, meeting", in which the placing of meeting makes clear that neither character is already on stage, was the first systematic proof that a recurrent form of stage direction could usefully distinguish authorship.

    Without computer automation, the counting of linguistic features was always likely to be incomplete and error-prone. The first systematic and extensive application of computer counting methods to the authorship problems in Shakespeare was undertaken by Ward E. Y. Elliott and Robert J. Valenza in response, initially, to the unscholarly question of whether William Shakespeare of Stratford-upon-Avon was an author at all. Elliott and Valenza addressed themselves to the problem of just how far and in what ways an author might reasonably be expected in a particular work to deviate from his norms on the various features counted by the methods described above. Their approach, applied first to Shakespeare's poetry, was notable for its comparatively high-level mathematical analysis of the numbers thrown up by their counting, including the calculation of such recondite phenomena as co-variance and eigenvalues.

    In the 1980s and 1990s the Chadwyck-Healey company began to pay for the keyboarding of large quantities of out-of-copyright English literary texts in order to sell them as searchable electronic collections on CD-ROM--under the titles English Poetry, Early English Prose Fiction, English Verse Drama, and English Prose Drama--that were later combined to form a unified web-hosted database called Literature Online (LION). Having effectively all of English literature in one searchable database transformed the field of Shakespearian authorship attribution because it was at last possible to perform rapidly and more or less definitively what we call the "negative check". An investigator could now assert with some confidence just whose writing did and did not contain a particular feature by which she was attributing authorship. The first to put this potential into practice was Jackson in a series of articles (Jackson 1998; Jackson 1999b; Jackson 2001a; Jackson 2001c; Jackson 2001d) and then a ground-breaking book that established the co-authorship of Pericles beyond any reasonable doubt (Jackson 2003a).

   The key feature that characterizes Jackson's approach is that words from the text to be attributed are searched for in the LION database, either as complete phrases (say, "purple mantle torn") or as collocations ("purple NEAR mantle NEAR torn"). So long as LION contains works by all the possible candidates for authorship, every author has, as it were, a chance of using the same phrase or something like it, and the foundational assumption is that the true author of the text to be ascribed is likely to do this more often than other authors because, consciously or unconsciously, he favours that phrase.
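
The logic of a collocation search followed by a tally per author can be sketched as follows. This is not LION's search engine: the `near_match` function, the window of five words, and the tiny two-author corpus are all hypothetical, invented here only to show the shape of the procedure.

```python
import re


def near_match(text, words, window=5):
    """Return True if all the words occur within `window` consecutive tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    wanted = {w.lower() for w in words}
    for start in range(len(tokens)):
        if wanted <= set(tokens[start:start + window]):
            return True
    return False


# Invented mini-corpus: one or two passages per hypothetical author.
corpus = {
    "Author A": ["his purple mantle torn and stained",
                 "the mantle purple once now torn"],
    "Author B": ["a purple sky above a mantle long since torn"],
}
query = ["purple", "mantle", "torn"]

# The negative check: count hits for every candidate, not just the favoured one.
hits = {author: sum(near_match(t, query) for t in texts)
        for author, texts in corpus.items()}
print(hits)
```

The value of the method lies in the completeness of the corpus searched: a phrase counts as evidence only when every candidate has had the same opportunity to use it.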

    Around the turn of the millennium, the Jane Austen scholar John Burrows announced a new way of processing the rates of frequently occurring features such as function words, called Delta (Burrows 2002a; Burrows 2003), and, even more importantly, he went on to develop a new way of selecting just which words to count, called Zeta (Burrows 2007). The Delta method's key innovation is that it discounts the importance of words for which a set of authors is demonstrably variable in their rates of usage and weighs more heavily the evidence from words that they use at consistent rates. 
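
The discounting of variable words can be seen in a short calculation. This is a minimal sketch of a Delta-style distance, assuming the usual formulation in which each word's rate is converted to a z-score across the candidate set and the distances are averaged; the three authors and all the rates are invented, and the use of the population standard deviation is my choice, not something taken from Burrows.

```python
from statistics import mean, pstdev


def delta_scores(candidates, test):
    """candidates: {author: {word: rate}}; test: {word: rate}.
    Returns {author: Delta}; the lowest Delta is the closest match."""
    words = list(test)
    mu = {w: mean([c[w] for c in candidates.values()]) for w in words}
    sd = {w: pstdev([c[w] for c in candidates.values()]) for w in words}
    words = [w for w in words if sd[w] > 0]  # skip words that never vary

    def z(rate, w):
        # Dividing by the spread is what discounts highly variable words.
        return (rate - mu[w]) / sd[w]

    return {author: mean([abs(z(test[w], w) - z(c[w], w)) for w in words])
            for author, c in candidates.items()}


# Invented rates per 1,000 words, for illustration only.
candidates = {
    "A": {"the": 30, "and": 25},
    "B": {"the": 40, "and": 26},
    "C": {"the": 35, "and": 20},
}
test_text = {"the": 31, "and": 25}
scores = delta_scores(candidates, test_text)
print(min(scores, key=scores.get))
```

A gap of one occurrence per thousand counts for little on a word whose rate swings widely between authors, and for much more on a word that they all use at nearly the same rate.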

    This principle of identifying on a case-by-case basis the words that are most discriminating between various authors, rather than relying on pre-determined lists of words, also underlies Burrows's second innovation, the Zeta test. As a first step, the investigator establishes two sets of texts, each being the securely attributed works of a single candidate author or a group of authors. The software of Zeta finds for itself the words that most distinguish these two sets, being especially common in the first set and especially uncommon in the second, and vice versa. The vice versa step means that the investigator has two lists of words, both of which are good discriminators between the two sets of texts.
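
The selection of marker words can be sketched in a few lines. The scoring rule used here (the share of segments in the first set that contain the word, plus the share of segments in the second set that lack it) is a simplified formulation that I am assuming for illustration rather than quoting from Burrows, and the four text segments are invented.

```python
def zeta_markers(set_a, set_b, top=2):
    """set_a, set_b: lists of text segments (strings) of similar length.
    Each word scores (share of A segments containing it) plus
    (share of B segments lacking it), so scores run from 0 to 2.
    Returns (markers of A, markers of B)."""
    seg_a = [set(s.lower().split()) for s in set_a]
    seg_b = [set(s.lower().split()) for s in set_b]
    vocab = set().union(*seg_a, *seg_b)

    def score(word):
        in_a = sum(word in s for s in seg_a) / len(seg_a)
        not_in_b = sum(word not in s for s in seg_b) / len(seg_b)
        return in_a + not_in_b

    a_markers = sorted(vocab, key=lambda w: (-score(w), w))[:top]
    b_markers = sorted(vocab, key=lambda w: (score(w), w))[:top]  # the vice versa step
    return a_markers, b_markers


# Invented segments, for illustration only.
set_a = ["gentle love sweet", "sweet gentle night"]
set_b = ["fierce sword blood", "blood fierce night"]
a_markers, b_markers = zeta_markers(set_a, set_b)
print(a_markers, b_markers)
```

Because the rule asks only whether a word is present in a segment, not how often, a single passage that repeats a word many times cannot dominate the result.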

    If the sets are chosen to be, say, Shakespeare plays on the one hand and Marlowe plays on the other, the Zeta method becomes for that application a good discriminator of these two writers' styles. One of the sets may be a multi-writer collective, so that the test may be, say, Shakespeare versus Marlowe+Greene+Peele+Nashe. As Burrows showed, and Craig confirmed with a great many validation runs for this technique (Craig and Kinney 2009d), when the investigator takes a text of known authorship out of one of the sets and reruns the experiment as if this text were of unknown authorship--without letting this text help choose the discriminating word lists--the correct author is identified with reliability that typically (depending on who is being tested) exceeds 95% accuracy. Zeta is by some way the most powerful general-purpose authorship tool currently available.

    There is just one more authorship attribution test I want to show you, and it is one that I am personally involved in. It is called Word Adjacency Networks and it measures how closely together a writer puts each of the most commonly used words in the language. To explain the method, I'll start with the fact that individual letters tend to appear next to one another rather than being randomly distributed. The 26 letters of the alphabet do not occur with equal frequency in normal English prose or verse, whether modern or early modern. About one in every eight letters of written English is the letter 'e', but only one-in-50 of them is the letter 'p', one-in-100 is the letter 'v', and one-in-1000 is the letter 'q'. Moreover, certain letters are more likely to follow certain other letters. Following 't' with 'h' makes the commonest letter bigram in English, comprising about one-in-30 of all letter bigrams, while following 'u' with 'r' makes a bigram that occurs only once in every 200 letter bigrams. Without consciously noticing it, all competent users of the English language have internalized these preferences, as can be demonstrated with an experiment. I have here an extract from a 20th-century novel and I would like members of the audience to call out their guesses for the letter or piece of punctuation or a space that comes next. I'll give you the first letter, 'T':

The clock struck half past two. In the little office at the back of Mr McKechnie's bookshop, Gordon--Gordon Comstock, last member of the Comstock family, aged twenty-nine and rather moth-eaten already--lounged across the table, pushing a four-penny packet of Player's Weights open and shut with his thumb.

[If they guess right, tell them and write the letter with an underscore beneath it. If they guess wrong, tell them and write the letter without an underscore. The underscores show how guessable the language is. After the experiment is finished, SLIDE to show above quotation.] The first to do the experiment we have just performed was the founder of the science of Information Theory, Claude Shannon. His equations, and in particular the celebrated equation for Shannon Entropy [SLIDE], are now routinely used in scientific work on authorship attribution that appears in journals that are little read by Shakespearians and Marlovians.

The preferences for one letter following another that we see in this experiment are built into the English language rather than belonging to any one writer. They express the fact that letter combinations are not randomly distributed but statistically ordered. If letter combinations were not ordered like this, any combination would be equally likely and crossword puzzles would be impossible [SLIDE]. When we shift our attention from combinations of letters to combinations of words, the same statistical principles apply and the same Shannon equation tells us how far from purely random the sequences are. The freedom of the writer is much greater at the level of word combination: the structure of English is far less constraining when choosing words to make a sentence than when choosing letters to make a word. Moreover, it is empirically provable that the preferences for following one word with another are personal preferences of particular authors.
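
For those who want to see the arithmetic, here is a minimal sketch of the Shannon entropy calculation applied to single letters. It uses the standard formula, the sum over all symbols of -p log2(p), where p is each symbol's observed share; the sample is just the opening words of the extract, which is far too short to estimate the entropy of English and serves only to show the comparison with a uniform alphabet.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the observed frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


text = "the clock struck half past two"
letters = [ch for ch in text if ch.isalpha()]

observed = entropy_bits(letters)
uniform = math.log2(26)  # entropy if all 26 letters were equally likely

print(round(observed, 2), round(uniform, 2))
```

The further the observed figure falls below the uniform one, the more predictable, and so the more guessable, the sequence is.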

In the Humanities, investigators of this topic often use word-combination preferences governing the relatively infrequent lexical words, [SLIDE] so they ignore the short, so-called function words--the, and, of, and so forth--that make up much of what we say and write. However, there is good empirical evidence that we should attend to those short function words that we hardly notice but which are demonstrably distinctive of an author's style. [SLIDE] When we count how often these function words are used, the frequencies turn out to be different for different writers. The Word Adjacency Network method captures the distance, measured in words, between each occurrence of a function word and each other occurrence of every other function word. [SLIDE] Here is an example of the Word Adjacency Networks for two short extracts from two plays: one speech each from Shakespeare's Hamlet and Thomas Dekker's Satiromastix.
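
The construction of such a network can be sketched as follows. This is a simplified illustration and not our published code: the four function words, the window of three words, the decay factor that makes nearer pairs count for more, and the decision to let a word pair with a later occurrence of itself are all assumptions chosen for the example, as is the sample phrase.

```python
import re
from collections import defaultdict

FUNCTION_WORDS = {"the", "and", "of", "to"}  # four words, for illustration only


def adjacency_network(text, window=3, decay=0.5):
    """Weight each ordered pair (w1 -> w2) of function words by how closely
    w2 follows w1: a gap of d words adds decay**(d - 1) to that edge.
    Assumption: a word may also pair with a later occurrence of itself."""
    tokens = re.findall(r"[a-z']+", text.lower())
    edges = defaultdict(float)
    for i, w1 in enumerate(tokens):
        if w1 not in FUNCTION_WORDS:
            continue
        for d in range(1, window + 1):
            if i + d < len(tokens) and tokens[i + d] in FUNCTION_WORDS:
                edges[(w1, tokens[i + d])] += decay ** (d - 1)
    return dict(edges)


network = adjacency_network("the king and the queen of the land")
for pair, weight in sorted(network.items()):
    print(pair, weight)
```

Each edge in the pictures on the slide corresponds to one entry in a table like this, with heavier edges for pairs that the writer habitually places close together.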

This is only an illustration of the idea and for authorship attribution we use many more than four function words; 100 would be typical, but the resulting pictures are too complex to show you. And of course rather than short extracts from plays we use whole authorial canons as our samples. In this method, authorship attribution is done by calculating the overall difference (measured in Shannon entropy units called centinats) between a Markov chain representing the word-adjacency preferences for an entire canon by one author and the Markov chain representing the word-adjacency preferences of the play we wish to attribute. The author whose canon produces the Markov chain least different from the Markov chain for the play to be attributed is the likeliest author amongst the candidates tested. [SLIDE] This method is entirely unlike other recently applied methods of authorship attribution, but it confirms their results: Christopher Marlowe did contribute to parts of all three of Shakespeare's Henry VI plays. My team's detailed study of this was published last year in Shakespeare Quarterly and it used 100 function words to create profiles for Shakespeare, John Fletcher, Ben Jonson, Christopher Marlowe, Thomas Middleton, George Chapman, George Peele, and Robert Greene. Thomas Kyd is not one of the authors we profile because his canon is too small even if we accept Soliman and Perseda as his.
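
The comparison of two Markov chains can also be sketched briefly. Again this is an illustration under stated assumptions and not our published procedure: I average the relative entropy over rows with equal weight, I skip transitions that have zero probability in either chain, and the edge weights for the play and the two candidate canons are invented.

```python
import math


def to_markov(edges, words):
    """Normalise outgoing edge weights so that each word's row sums to 1."""
    chain = {}
    for w1 in words:
        row = {w2: edges.get((w1, w2), 0.0) for w2 in words}
        total = sum(row.values())
        # A word with no outgoing edges gets a uniform row.
        chain[w1] = {w2: (v / total if total else 1 / len(words))
                     for w2, v in row.items()}
    return chain


def relative_entropy_centinats(play, canon, words):
    """Relative entropy from the play's chain to the canon's chain, averaged
    over rows, in centinats (hundredths of a nat). Transitions with zero
    probability in either chain are skipped, which is a simplification."""
    total = 0.0
    for w1 in words:
        for w2 in words:
            p, q = play[w1][w2], canon[w1][w2]
            if p > 0 and q > 0:
                total += p * math.log(p / q)
    return 100 * total / len(words)


words = ["the", "and"]
# Invented edge weights, for illustration only.
play = to_markov({("the", "the"): 1, ("the", "and"): 1,
                  ("and", "the"): 1, ("and", "and"): 1}, words)
canons = {
    "Author A": to_markov({("the", "the"): 2, ("the", "and"): 2,
                           ("and", "the"): 2, ("and", "and"): 2}, words),
    "Author B": to_markov({("the", "the"): 1, ("the", "and"): 3,
                           ("and", "the"): 1, ("and", "and"): 3}, words),
}
distances = {a: relative_entropy_centinats(play, c, words) for a, c in canons.items()}
print(min(distances, key=distances.get))
```

The candidate with the smallest distance is the one whose habits of word placement the play most resembles.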

To get a sense of how well this method works, we can test how similar to each author's profile is each of the Shakespeare plays [SLIDE]. Here are the results, and there are four things to observe. First, for almost every play the red circles indicating Shakespeare appear nearest the bottom of the chart, indicating that for the play in that column the Shakespeare profile is the nearest profile to it. Secondly, the highest of the red circles--indicating greatest distances from the average Shakespeare style--are for (reading from right to left) The Two Noble Kinsmen, Titus Andronicus, Timon of Athens, The Taming of the Shrew, Pericles, Measure for Measure, Macbeth, Henry VIII, and the three Henry VI plays. That is, in every case the plays that our method shows to be most unlike Shakespeare's style in function-word clustering are those plays that entirely different methods have shown to be co-authored. Thirdly, notice that for all authors the coloured markers form roughly horizontal coloured bands at different heights: this could not happen unless the method were indeed capturing something distinct about authorial style. Fourthly, notice that Marlowe's profile is frequently the most distant from the Shakespeare play in question (his purple triangles are mostly at the top of the picture) except for the Henry VI plays recently attributed to him by entirely independent tests, where they fall to the bottom.

