GabrielEgan.com

SHAXICAN Perl scripts working on Shakespeare's language

The SHAXICAN project takes its name from the SHAXICON database developed by Donald Foster and used by him to attribute the poem 'A Funeral Elegy for William Peter' to William Shakespeare and to suggest which roles Shakespeare might have played, among other things.

SHAXICAN ('The idea', below) was begun as an experiment in programming by Gabriel Egan and published on his website.

The idea and the scripts were taken up and extensively refined by Steve Roth ('Roth's refinements', below), to whom is owed most of the credit for the work here. More recently still, Matt Steggle made the excellent suggestion ('Can we tell . . . ?', below) that one of Foster's uses for SHAXICON (deducing which roles Shakespeare played, on the assumption that an actor-dramatist's writing would be influenced by the rare words in the part he'd most recently acted) could be tested on the work of actor-dramatists other than Shakespeare. Matt and Steve collaborated to test the principle on the work of Colley Cibber.

The idea

This page is intended to offer, free to all, Perl scripts [1] that do the sorts of things Donald Foster's SHAXICON database is designed to do. In 1995 SHAXICON was used to support Foster's attribution of the 'Funeral Elegy' poem to Shakespeare, the academic community being asked at the time to wait a short while until the database itself was ready for publication. In the 5 years since these claims were made, SHAXICON's non-appearance has prevented substantiation of the claims made for it, and the resources provided here are intended to assist those interested in the development of public domain tools which do the same analyses. This is work-in-progress and others interested in this area are invited to copy and improve on materials provided here and to share their results. I'll post here anything which moves the academic community towards a set of public-domain tools for this kind of work.

The area of exploration is Shakespeare's 'rare' words: those he used say 12 times or fewer in his entire extant output. To start with one needs an etext of the complete works. In the collection of files offered at the bottom of this page is one ('all.txt') based on the Oxford Shakespeare etext (prepared by Lou Burnard, 1989) to which I hope I've added enough value--by retagging and play-title/line-numbering every line--that Oxford University Press don't claim copyright infringement. If they do, I will of course comply with a 'cease and desist' notice and substitute an inferior text.

Before getting started with the following explanations, you might want to see the practical work ('Steve Roth's refinements' on the left) that has been done by building on the limited foundations sketched here.

Part One: Building the raw files

The first thing one needs is a list of all the characters in the Shakespeare canon, with unique names for the many Antonios and Claudios. One solution is to prepend (in the sense of 'add before') the name of the play before the name of the character, so ADO-ANTONIO is distinct from MV-ANTONIO, and TGV-ANTONIO. The following script will do that:

while (<>) {
   ($PlayName, $RestOfLine) = split(" ", $_, 2);
   if ($RestOfLine =~ m/\W([A-Z|-|\[|\]\^|\@]{2,})(( [A-Z|-|\[|\]|\^|\@]{2,})*)\W/) {
      $PlayAndName = $PlayName."-".$1.$2;
      $CharacterNames{$PlayAndName} = 1;   # remember each play-name pair once
   }
}
foreach $CharacterName (sort keys(%CharacterNames)) {
   print $CharacterName, "\n";
}

This script I call 'findchars.pl' and when the above complete works etext is run through it, the resulting file of character names ('charnames.txt') is produced. The next step is to produce a list of 'rare' words: those Shakespeare used infrequently. The following script lists those words Shakespeare used no more than 12 times in his entire extant dramatic output:

while (<>) {
   @words = split(/\W*\s+\W*/, $_);
   foreach $word (@words) {
      $wordcount{$word}++;                 # tally every string seen
   }
}
foreach $word (sort keys(%wordcount)) {
   if ($wordcount{$word} < 13) {           # 12 occurrences or fewer
      printf "%20s %d\n", $word, $wordcount{$word};
   }
}

This script I call 'rare.pl' and when the above complete works etext is run through it, a file of rare words ('rarewords.txt') is produced. The next step will be to work out which character spoke which rare words, and then we can see if an already-existing character's vocabulary "floods into" Shakespeare's writing with a given new play [2]. This is the claim of Foster's SHAXICON: learning a part for performance brought that part's rare words to Shakespeare's mind and these words are over-represented in the next play he wrote. I've picked '12 or fewer' occurrences as the cut-off for distinguishing what is a rare word because this was Foster's cut-off, but comments are invited on whether '6 or fewer' might be a better borderline.

The observant will have noticed problems in the above. The 'findchars.pl' script produces a few 'ghosts' such as "ABC" in TGV. This is because the underlying etext renders character names using all uppercase letters, which is the distinguishing feature the script looks for. Thus in TGV when Speed likens Valentine to "a schoolboy that had lost his ABC" (2.1.21) a spurious hit occurs because ABC looks like a character's name. Also, the script 'rare.pl' doesn't discard numbers so there's much numerical guff at the top of the rare words file before one gets to the alphabetical section, and the script doesn't deal properly with possessive apostrophes and hyphens.
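One way the last two problems might be tackled is sketched below. It is not part of 'rare.pl': the sub name `tokenize`, the decision to fold everything to lower case, and the decision to drop a possessive 's are all assumptions made for illustration, and the sample sentence is made up. It expects the play code and act.scene.line prefix to have been removed already, since a token such as '66B' contains a letter and would otherwise survive.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into lower-cased word strings,
# keeping internal hyphens and apostrophes, dropping a possessive 's
# and any token that contains no letters (for example line numbers).
sub tokenize {
    my ($line) = @_;
    my @words;
    foreach my $token (split /\s+/, $line) {
        $token =~ s/^[^A-Za-z0-9]+//;      # strip leading punctuation
        $token =~ s/[^A-Za-z0-9]+$//;      # strip trailing punctuation
        $token =~ s/'s$//i;                # drop possessive 's
        next unless $token =~ /[A-Za-z]/;  # discard purely numerical tokens
        push @words, lc $token;
    }
    return @words;
}

# Made-up sample sentence, for illustration only.
print join(" ", tokenize("The king's well-beloved son, 1599.")), "\n";
```

Dropping every final 's also turns contractions such as "he's" into "he", so whether that trade-off is acceptable would need checking against the rare-word list.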

Part Two: Making a 'part' for each character

The 6 steps to reproducing what SHAXICON does, as I understand it, are these:

1) You make a list of all the characters in all the plays (using unique identifiers for the multiple Claudios, Antonios, etc).

2) You make a list of all the rare words in all the plays (rare in the sense that they are words Shakespeare rarely uses).

3) You count how many times each of the characters in (1) uses each of the rare words in (2).

4) You take a sample text (one of the plays) and make a list of all its words and how frequently they appear.

5) You check the list in (4) against each of the lists in (3) to see if there's a character whose rare words turn up much more often in the sample play than they do in the Shakespeare canon as a whole.

6) If (5) yields a good match, that character was played by Shakespeare shortly before he wrote that play. (Hence those rare words were over-represented in the sample play: they were in Shakespeare's head from his having recently memorized them for his part.)

We have achieved, albeit in a rough-and-ready form, steps (1) and (2). Step (3) is difficult, and as an intermediate step we could do with an actor's "part" for each character in the canon. That is to say, we need a separate document containing just the words for a given character in a given play. Once we have ALL the words a character says, we can throw away the ones which are not rare (by using the list from step (2)) and will be left with just the rare words spoken by a particular character.
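The filtering described in the last sentence might look something like the sketch below. The sub name `rare_words_in_part` and its arguments are hypothetical; it assumes the rare-word list from step (2) has been loaded into a hash and that a character's "part" is available as one string.

```perl
use strict;
use warnings;

# Hypothetical helper: given a reference to a hash whose keys are the rare
# words (step 2) and the full text of one character's part, return a hash
# of rare word => number of times the character uses it (step 3).
sub rare_words_in_part {
    my ($rare, $part_text) = @_;
    my %used;
    foreach my $word (split /\W*\s+\W*/, $part_text) {
        next if $word eq '';
        $used{$word}++ if exists $rare->{$word};   # throw away non-rare words
    }
    return %used;
}

# Tiny illustrative rare-word list and a sample line of speech.
my %rare = map { $_ => 1 } qw(a-weary bodkin);
my %used = rare_words_in_part(\%rare, "I am a-weary, give me leave awhile");
print "$_ $used{$_}\n" foreach sort keys %used;
```

The split pattern is the one 'rare.pl' uses, so the strings compared here are cut up in the same way as those in the rare-word list.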

The original etext we started with, 'all.txt', has a feature which makes it not immediately suitable for generating "parts": some lines are shared between two or more characters. Take this example:

KING JOHN Death. HUBERT My lord. KING JOHN A grave. HUBERT He shall not live. KING JOHN Enough.
(King John 3.3.66)

For our purposes, it would be better if each speech began on a new line. The following script 'breaker.pl' does this, and also wraps a tag (<speaker>...</speaker>) around each speaker:

while (<>) {
($PlayName, $RestOfLine) = split(" ", $_, 2);

This is essentially the same search pattern as used in 'findchars.pl' (which looks for strings of uppercase characters, which may be separated by spaces but not by anything else), but instead of just finding them they are tagged for easy identification later. When 'all.txt', the original etext, is run through 'breaker.pl', the result is a file with speeches tagged by speaker ('tagged.txt'). To illustrate the tagging, here is how the above line from King John looks after processing by 'breaker.pl':

JN 3.3. 66B
<speaker>JN-KING JOHN</speaker>
My lord.
<speaker>JN-KING JOHN</speaker>
A grave.
JN 3.3. 66B He shall not live.
<speaker>JN-KING JOHN</speaker>

There are still things here which we don't want in the final "parts" for these two characters: the code representing the play's name (JN), the act, scene, and line numbers, and the line-number suffix 'B' with its associated run-over character '+'. These are all features of the underlying etext which we will remove in the next stage.
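For readers who want to experiment, the kind of tagging described above can be sketched as follows. This is a hypothetical reconstruction and not the 'breaker.pl' script itself: the sub name `tag_speakers` is invented, the pattern is a simplified one that recognizes only runs of two or more plain capitals (so bracketed names are not handled), and the input line format is assumed from the example above.

```perl
use strict;
use warnings;

# Hypothetical reconstruction, not the original 'breaker.pl': put each
# speaker name (a run of two or more capitals, possibly several such
# words separated by single spaces) on its own tagged line, prefixed
# with the play code found at the start of the line.
sub tag_speakers {
    my ($line) = @_;
    my ($play, $rest) = split " ", $line, 2;
    return $line unless defined $rest;     # nothing after the play code
    $rest =~ s/\s*\b([A-Z]{2,}(?: [A-Z]{2,})*)\b\s*/\n<speaker>$play-$1<\/speaker>\n/g;
    return "$play $rest";
}

print tag_speakers("JN 3.3. 66B KING JOHN Death. HUBERT My lord.\n");
```

Like 'findchars.pl', this sketch would also tag 'ghosts' such as "ABC", because it relies on the same all-uppercase convention.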

To turn the file 'tagged.txt' into a collection of characters' "parts" all we need to do is spot those single lines which have the <speaker>...</speaker> tagging and when we find one, start pouring the output into a new file whose name is based on the character name we find between the tags. If this is a new character (one who hasn't yet spoken in the play), the act of OPENing a filehandle for that character will create the output file, but if the file already exists (and hence contains what the character has said so far), the act of OPENing the same filehandle again will cause the new words to be appended to the existing file. This happens because the filehandle is opened in the 'append' mode by putting ">>" before the filename. Here is the script 'makeparts.pl' which takes the output of 'breaker.pl', ie takes the file 'tagged.txt', and creates a character part for each character in the canon:

while (<>) {
$WholeLine = $_;
if ($WholeLine =~ m/\<speaker>(.*)\<\/speaker>/) { # if there's a new speaker
  open (CURRENTSPEAKER, ">>parts/$1.cue")          # open new (or reopen old) handle
    or die "cannot open parts/$1.cue: $!";         # (the 'parts' directory must exist)
  $FirstLineOfNewSpeaker = (<>);  # pull first line (which hasn't got
                                  # playname, act.sc.line number prefix)
  print CURRENTSPEAKER $FirstLineOfNewSpeaker;
                                  # and send line to the current part
}
else {                            # or if it's just a continuing speech....
  ($PlayName, $ActScene, $Line, $Speech) = split(/\s+/,$WholeLine, 4);
                                  # then separate off the playname
                                  # and act.sc.line from speech
  print CURRENTSPEAKER $Speech;   # send this line to the current speaker's part
}
}                                 # go back for next line

In 'tagged.txt' the first line for a new speaker hasn't got the playname and act.scene.line numbers which we want to strip out, so the code for "if there's a new speaker" just pulls the line after the speaker's name and sends it out to that speaker's file. But in the code for "if it's just a continuing speech"--and hence does have the playname and act.scene.line numbers which need to be removed--a 'split' on whitespace (represented by \s+ in Perl) takes away PlayName and ActScene and Line numbers, and then just the Speech itself is sent out to the character's file. The script 'makeparts.pl' creates a new file for each character in the canon. Rather than provide a link for each one (there are 1046 "parts"), I've bundled them into a zip-file, 'parts.zip', available below.

Another outstanding problem: The stage directions should be deleted from the original etext, else they'll end up being attributed as someone's speech.


[1] Perl is a programming language common on Unix systems and also available for the Windows and Macintosh operating systems. Perl is good with textual strings, hence it's suitable for this application.

[2] The poetical output should be considered too, shouldn't it? That goes on the list of 'next time' improvements, together with a consideration of the effects of editorial modernization of spellings found in the early printed texts.


Here are the files referred to above: all.txt | charnames.txt | rarewords.txt | tagged.txt | parts.zip. You can alternatively have all five in one package: everything-egan.zip.


Roth's Refinements

By Steve Roth

In 1991, Donald Foster published a three-part series of articles in Shakespeare Newsletter suggesting that it might be possible to determine, through lexical analysis, what parts Shakespeare played, and when. The idea is that Shakespeare would have remembered words from parts he'd conned, and those words would appear more frequently in plays he wrote (shortly) thereafter--they would exert an inordinate "influence" on those subsequent plays. A table of plays between 1599 and 1607 (below) seemed to support his position.

Don has said that this use of lexical analysis is something of a sideshow and demonstration piece (the uncharitable might call it a parlor trick <g>), but it has generated a great deal of interest from both adherents and detractors. Unfortunately nobody has been able to check or verify Don's findings. He promised over the years to publish the database (which he dubbed "Shaxicon") on which his proposals were based, but it has never been published.

This led Gabriel Egan to propose re-creating the data using Perl scripting, starting with the Oxford electronic texts. He dubbed the enterprise "Shaxican," and posted the work on his site (www.totus.org), which has now moved here. Below is my continuation of Gabriel's work, taking Shaxican to the point that results can be compared to Don's analysis.

Words versus strings

I should begin by pointing out that the base text we're working from does not identify "words" per se, as Don's Shaxicon database apparently does. "Row" (n.) and "row" (v.) are not the same word. Since we're comparing strings of characters here, not words, the results aren't really comparable to Don's work, nor can they serve to support or disqualify that work. When I say "word" in the rest of this writeup, I'm actually referring to strings. The tools presented here could, with some revision, be adapted for use with a "lemmatized" text.

The primary goal and result here is the analysis/reporting architecture conceived, and the tools to build that architecture.

The specific goal of the architecture was to duplicate Don's analysis of playing parts as presented in the Shakespeare Newsletter articles. That goal restricts the wider usability of the architecture to some extent, but the granularity of the data and the database analysis/reporting tools are still fairly flexible for those who want to compare rare words in various plays, or in various parts compared to various plays.

The key to the analysis was building a database of each part/play/rareword correlation--occurrences where a rare word appears in both a play and a part from a different play. viz:

Word        Play  Part                  Occurrences in this play  Occurrences in this part
a-bleeding  ROM   MV_LANCELOT           1                         1
a-bleeding  MV    ROM_PRINCE            1                         1
a-doing     R3    COR_BRUTUS            1                         1
a-doing     COR   R3_SCRIVENER          1                         1
a-hungry    WIV   TN_SIR ANDREW         1                         1
a-hungry    TN    WIV_SLENDER           1                         1
a-making    MAC   HAM_POLONIUS          1                         1
a-nights    JC    2H4_MISTRESS QUICKLY  2                         1
a-nights    2H4   JC_CAESAR             1                         2
a-nights    TIM   JC_CAESAR             1                         2
a-nights    2H4   TIM_APEMANTUS         1                         1
a-nights    JC    TIM_APEMANTUS         2                         1
a-pieces    TNK   AIT_PORTER            1                         1
a-pieces    AIT   TNK_PALAMON           1                         1
a-tilt      CYL   1H6_JOAN              1                         1
a-weary     ROM   1H4_KING HENRY        1                         1
a-weary     1H4   ROM_NURSE             1                         1

For instance, "a-nights" appears once in the 2H4_MISTRESS QUICKLY part, and twice in Julius Caesar (both occurrences in Caesar's part).

By sorting, summarizing, and analyzing those "match" occurrences we can see how many rare words are shared between different plays, and between specific parts and other plays.
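As a rough illustration of how such "match" records can be generated, here is a minimal sketch. It is not 'CorrelationBuilder.pl': the sub name `build_correlations` and the in-memory hash layout are assumptions, and the only data used are two of the "a-nights" rows from the table above.

```perl
use strict;
use warnings;

# Hypothetical sketch, not 'CorrelationBuilder.pl' itself.
# %$plays : play code   => { rare word => count in that play }
# %$parts : "PLAY_PART" => { rare word => count in that part }
# Returns one tab-separated record per word/play/part match, skipping
# parts that belong to the play they are being compared with.
sub build_correlations {
    my ($plays, $parts) = @_;
    my @records;
    foreach my $part (sort keys %$parts) {
        my ($source_play) = split /_/, $part, 2;
        foreach my $play (sort keys %$plays) {
            next if $play eq $source_play;       # a different play only
            foreach my $word (sort keys %{ $parts->{$part} }) {
                next unless exists $plays->{$play}{$word};
                push @records, join("\t", $word, $play, $part,
                    $plays->{$play}{$word}, $parts->{$part}{$word});
            }
        }
    }
    return @records;
}

my %plays = (JC => { 'a-nights' => 2 }, '2H4' => { 'a-nights' => 1 });
my %parts = ('JC_CAESAR'            => { 'a-nights' => 2 },
             '2H4_MISTRESS QUICKLY' => { 'a-nights' => 1 });
print "$_\n" foreach build_correlations(\%plays, \%parts);
```

Every part is compared with every other play, which is why the real 'correlations.txt' file grows so large.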


I started by building the database records in Perl, expanding on Gabriel's work. (All files and scripts mentioned are available in the Downloads section below.)

First we need clean text files of each play, and each part. The script 'MakePartAndPlayFiles.pl' reads 'all.txt' (the whole Oxford Shakespeare in one file, as improved upon and provided by Gabriel) and creates those files, stripping out stage directions, speaker designations, and line-numbering apparatus in the process. This results in 38 play files (both Lear Quarto and Folio are included in the Oxford text, but here I use only the Quarto version) and (you were probably wondering this) 1,754 part files.

Then we need a list of rare words in 'all.txt'. 'MakeRareWordList.pl' does that. You can specify the minimum and maximum number of occurrences that together define a "rare" word. It's currently set to a maximum of 12 (Foster's apparent breakpoint judging from Funeral Elegy, though the Shakespeare Newsletter articles are ambiguous, suggesting 10) and a minimum of 2 (if there's only one occurrence in the corpus, the word can't very well appear in both a part and a different play). I've also excluded words with fewer than three letters. The results with those settings (11,051 words) are here in 'rarewords.txt' (116k). This file also includes, for each word, the total number of occurrences in the corpus.
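The effect of those three settings can be sketched as follows. This is not 'MakeRareWordList.pl': the sub name `rare_word_list`, its argument order, and the lower-casing of the text are assumptions for illustration, and the sample text is made up.

```perl
use strict;
use warnings;

# Hypothetical sketch, not 'MakeRareWordList.pl' itself: count every
# string in the text and keep those that occur between $min and $max
# times and are at least $min_letters characters long.
sub rare_word_list {
    my ($text, $min, $max, $min_letters) = @_;
    my %count;
    foreach my $word (split /\W*\s+\W*/, lc $text) {
        $count{$word}++ if length($word) >= $min_letters;
    }
    my %rare = map  { $_ => $count{$_} }
               grep { $count{$_} >= $min && $count{$_} <= $max } keys %count;
    return %rare;
}

# Settings as described above: minimum 2, maximum 12, three letters or more.
my %rare = rare_word_list("bodkin bodkin a-weary of of it", 2, 12, 3);
printf "%20s %d\n", $_, $rare{$_} foreach sort keys %rare;
```

In this made-up sample only "bodkin" qualifies: "a-weary" occurs once, and "of" and "it" are too short.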

'CorrelationBuilder.pl' reads the part, play, and rareword files and creates 'correlations.txt' (4.6mb!), containing the database records as described above. Thanks and no few kudos go here to my officemate Glenn Fleishman, who wrote this central and quite sophisticated script (and taught me Perl in the process). I've merely fine-tuned, packaged, and debugged a bit.

Glenn suggests that this whole portion of Shaxican should be dubbed "ShakesPerl."

Finally, to calculate the percentage each part constitutes of its play, we need to count the words in each part and play. 'PartAndPlayWordCounter.pl' does that, generating two files of counts--one for the plays ('playwordcounts.txt'), one for the parts ('partwordcounts.txt').


I did the next part of the analysis in FileMaker, so the moniker 'ShakeMaker' is unavoidable. I won't document that whole database here, though interested parties are welcome to look at the field definitions in the Adobe Acrobat file 'ShakeMakerFields.pdf', and contact me if you have any questions or would like a copy of the database.

Some results from the analysis are summarized in the following table, re-creating the table in Shakespeare Newsletter, and comparing the Shaxican results to Shaxicon's. I've highlighted fields where the parts' "influence" is (arbitrarily) at least 50% higher than those parts' share of the source play.

Here are Shakespeare's supposed roles in Henry V - Hamlet with percentages of cross-indexed vocabulary, 1599-1607:

Part(s)                Source    % of Play  H5    AYL   TN    Ham   Tro   AWW   MM    Oth   Mac   Cym   Cor   Ant
H5: Chor & Mont.       Shaxicon  8.1        -     16.8  15.5  17.9  20.2  11.4  14.1  15.4  18.0  17.9  20.5  33.3
H5: Chor & Mont.       Shaxican  7.4        -     6.9   9.4   12.7  11.9  7.2   11.8  11.3  13.3  8.7   12.8  16.7
AYL: Adam & Corin.     Shaxicon  5.0        3.6   -     4.3   8.9   11.8  11.1  10.4  11.9  3.8   12.9  11.1  13.0
AYL: Adam & Corin.     Shaxican  5.0        6.8   -     1.3   8.5   7.2   5.3   7.2   7.0   3.8   7.1   6.1   7.4
TN: Valen. & Anton.    Shaxicon  4.6        5.1   6.4   -     5.9   12.2  8.9   18.2  3.6   8.6   6.7   6.3   22.2
TN: Valen. & Anton.    Shaxican  4.6        4.6   5.0   -     6.3   5.8   4.1   4.0   1.0   7.4   2.9   5.5   7.7
Ham: Ghost & 1 Player  Shaxicon  3.5        4.8   5.3   4.5   -     10.4  3.4   3.2   5.2   11.2  10.0  8.0   11.0
Ham: Ghost & 1 Player  Shaxican  3.7        5.3   4.2   4.7   -     5.4   3.3   4.2   1.7   6.4   8.7   7.2   7.5


You can see the more detailed FileMaker reporting which provided the table data in two Acrobat files: one with all the shared words listed ('FosterAnalysisWithWordDetail.pdf') and one without ('FosterAnalysisNoWordDetail.pdf').

Two other reports look at the same parts and their possible influences on every play in the corpus, using the method that Gabriel proposed: comparing relative word frequency in target plays to the frequency of those words in the whole corpus. (The theory is that words that Shakespeare "remembered" would appear more frequently in plays he wrote thereafter.) The report with full word detail is in the document 'RelFreqAnalWithWordDetail.pdf' (43 pages). The summary report without word detail is in the document 'RelFreqAnalNoWordDetail.pdf' (9 pages). This analysis method needs further scrutiny. These are just two examples of the types of reports that can be generated given some moderately high-level FileMaker skills (which I am happy to impart to interested parties).

This application definitely pushes FileMaker's limits. While generating these reports on a small subset of the data takes very little time, when you start sorting 165,000 "match" records, you need to plan on lunch or a good night's sleep. One goal might be to move this architecture to a speedier platform. MySQL is a likely candidate because it's free, feature-rich, widely used at universities, and awesomely fast. It's not nearly as easy to learn as FileMaker, though.

Miscellaneous issues

One issue is the Oxford editors' uncertainty about who spoke certain speeches. Those speakers are bracketed in the text, so you end up with additional part files containing those questionably attributed speeches (e.g. WIV-[SHALLOW].txt and ADO_[FIRST] WATCHMAN.txt). This requires more care in searching for parts, to be sure there are/are not bracketed versions of the part. These brackets could be removed and the parts combined with the non-bracketed parts, in essence accepting the Oxford editors' suppositions.

The Perl scripts here, when run on the Oxford texts, remove all foreign-language speeches. (Foreign-language speeches are enclosed in {} in the text, which is the same delimiter used to enclose stage directions.) While this is arguably fine for rare-word analysis, it also alters the part-as-a-percentage-of-play calculations some, especially in H5.

ALL of this work requires vetting by others for errors of both design and implementation. The tools are quite complex, and there are many opportunities for missteps.

Please feel free to contact me with any questions, comments, ideas, or suggestions.


Steve Roth


Here are the files referred to above: MakePartAndPlayFiles.pl | MakeRareWordList.pl | CorrelationBuilder.pl | correlations.txt | PartAndPlayWordCounter.pl | playwordcounts.txt | partwordcounts.txt | ShakeMakerFields.pdf | FosterAnalysisWithWordDetail.pdf | FosterAnalysisNoWordDetail.pdf | RelFreqAnalWithWordDetail.pdf | RelFreqAnalNoWordDetail.pdf. You can alternatively have all twelve in one package: everything-roth.zip.

Can we tell from lexical analysis what parts Shakespeare (or anyone else) played?: Testing the Cibber corpus

by Matthew Steggle and Steve Roth

The short answer to the question posed in the title of this article is, "We can't". We tested several analysis methods; none showed any ability to accurately identify the parts played by a player/playwright, or to demonstrate that conned parts influence that player/writer's subsequently written works. The analysis methods are described below.

These are negative results, so of course they can never be definitive. It's possible that some other analysis method would succeed where ours have failed. And there are some caveats to our implementation of the analyses (discussed below) that might disqualify their results.

Testing the methods against known results

To recap: Foster's model proposes that:

1. Shakespeare writes a play, say, A Midsummer Night's Dream.

2. The play is performed, and Shakespeare acts a part in it, say, Egeus. In memorizing all Egeus' speeches, Shakespeare becomes unusually attuned to the rare words included in those speeches.

3. In Shakespeare's next play, rare words from Egeus' part will crop up disproportionately often, because he will remember those words unusually clearly.

Aside from the general problems of rare-word analysis discussed by Steve and Gabriel above, and aside from the persisting problem that Foster's results depend on a database that no other scholar has access to, there are several more particular difficulties with this highly attractive model:

1. The date and sequence of composition of Shakespeare's plays are not known with complete precision.

2. Our knowledge of the date and extent of stage performances of the plays is, at best, patchy.

3. There is little "gold standard" evidence of the parts that Shakespeare did play as an actor: tradition ascribes to him roles including the ghost in Hamlet, and Adam in As You Like It, but beyond this it is hard to go. Therefore, it is hard to check the accuracy of the conclusions of this method.

4. The model has not been demonstrated on any other comparable author.

We set out to address problem 4. Would the Foster method be able to predict the parts played by an author whose dramatic career could be traced with some certainty?

Finding a candidate

The candidate author had to be an actor-writer, as close as possible in time to Shakespeare, with a large corpus of writing, well-documented as to date and solely authored by him. Furthermore, he needed to have enjoyed a well-attested theatrical career: we needed to know what parts he played in what plays at what date.

No pre-1642 actor-dramatist meets these criteria adequately. The best contenders are Robert Armin, whose corpus of accessible work is too small--two full-length plays, and one jest-book--and Ben Jonson, whose writings are certainly extensive, but whose theatrical career cannot be traced beyond his appearances as "mad Hieronimo" in the 1590s.

Moving to the Restoration theatre is less than ideal, since it opens up a gap in many aspects of theatrical culture. This applies most obviously to patterns of line-learning and performance, but in fact affects all aspects of authorship practice as well. Indeed, the very factors that make Restoration theatre better documented make it less like the theatre of Shakespeare. However, it does contain a candidate who meets our initial criteria: Colley Cibber.

Cibber's parts and works

Colley Cibber (1671-1757), actor, playwright, theatre manager, and ultimately poet laureate, is generally remembered as the enemy of Pope and as one of the victims of Pope's Dunciad. But he was also a prolific actor-writer. Furthermore, both his literary and theatrical careers are well documented, thanks partly to his own autobiography. He is therefore a suitable candidate for this project. If the rare-words technique works, it should be possible to deduce the parts he played from the influence that they have on subsequent plays. We can then compare the results of this deduction to the known parts Cibber played to evaluate the rare-words technique's value in determining parts played.

Matt's first step was to build a corpus of available material written by Cibber, and of parts that Cibber played. The backbone of both took the form of Cibber's own plays, sourced in electronic form from Literature Online (LION), whose policy has been to keyboard the texts from the first edition.

The plays involved can usefully be presented in the following table:

Play (date of first publication)                  Cibber's part         Year of first performance
Love's last shift (1696)                          Sir Novelty Fashion   1696
Woman's wit (1697)                                Longville             1697
The rival queans (1729)                           ?Alexander            ?1699
Xerxes (1699)                                     No part               1699
Love makes a man (1701)                           Clodio                1700
The school-boy (1707)                             Mass Johnny           1702
She wou'd, and she wou'd not (1703)               Don Manuel            1702
The careless husband (1705)                       Lord Foppington       1704
Perolla and Izadora (1706)                        Pacuvius              1705
The comical lovers (1707)                         Celadon               1707
The double gallant (1707)                         Atall                 1707
The lady's last stake (1708)                      Sir George Brillant   1707
The rival fools (1709)                            Samuel Simple         1709
The non-juror (1718)                              Dr Wolf               1711
Ximena (1719)                                     Don Alvarez           1712
Cinna's conspiracy (1713)                         No part               1713
The refusal (1721)                                Witling               1721
Caesar in Aegypt (1725)                           Achoreus              1724
Papal tyranny in the reign of King John (1745)    Pandulph              1745



LION has 26 dramatic texts ascribed to Cibber. However, two of these are linked: Cibber completed Vanbrugh's A Journey to London after Vanbrugh's death, performing the resulting play as The Provoked Husband. Both plays were therefore excluded from the building of the database, as they do not represent Cibber's sole authorship.

Richard III is Cibber's adaptation of Shakespeare's play, incorporating large numbers of Shakespeare's lines: again, it is excluded on the grounds that not all the words are by Cibber. Damon and Phillida, listed as Cibber's by LION, is excluded from the Cibber canon by Viator and Burling, and from our corpus (Timothy J. Viator and William J. Burling, eds., The Plays of Colley Cibber, Volume One (Madison, NJ: Fairleigh Dickinson UP, 2001)). In spite of uncertainties about date and text regarding The Rival Queans, Viator and Burling (429) conclude that it is probably all by Cibber. Myrtillo, Venus and Adonis, and Love in a Riddle are all excluded from this study on the grounds that they are musicals rather than conventional plays.

Other texts that do not appear on LION, but should also be mentioned as exclusions are:

The Lottery (1728). Not by Cibber according to Viator and Burling.

Hob in the well, also called Flora (1729). Not by Cibber according to Viator and Burling.

Polypheme (1735). No etext available.

The Hypocrite (1716), a rewrite of The Non-Juror. No etext available.

The Bulls and the Bears (1715). Lost.

Thus, we are left with nineteen dramatic texts. All of them were (according to the title-pages of the first editions) written unaided by Cibber. In all but two of these, Cibber's part can be established either from the cast-list in the publication or from other evidence collected by Viator and Burling.

There is one other text that it is possible to add to the corpus: Cibber's prose autobiography Apology for the life of Colley Cibber (1740). A two-part etext of this is available below ('Cibber Apology 1.html', 'Cibber Apology 2.html'). The resulting corpus comes to about 566,000 words, which is still a little smaller than Shakespeare's (about 711,000 words).

Preparing the Texts

In the case of the LION plays, each text was downloaded from LION and saved as a ".txt" file. Using a combination of automated search and replace routines, and manual editing and checking, Matt adjusted the format so as to be suitable for the Perl program. This involved:

1. stripping out LION's header and footer material

2. stripping out LION's indications of page and line numbers

3. stripping out all those parts of the publication that were not the dialogue itself: prefatory material, cast lists, indications of scene-division, inset songs, and stage-directions.

4. replacing all "VV" with "W" (but spelling itself was left unstandardized, for reasons of repeatability and speed)

5. ensuring that each speech was arranged with a blank line, followed by a line containing only the speech-prefix, followed by the speech itself.

6. ensuring that each speaker always had a consistent and unique speech-prefix.
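A file laid out as in steps 5 and 6 is straightforward to split into parts. The sketch below is only an illustration of that layout and is not the specially modified script Steve used: the sub name `parts_from_text` is hypothetical, and the speech-prefixes "A" and "B" and their words are placeholders.

```perl
use strict;
use warnings;

# Hypothetical sketch of reading the prepared format: a blank line, then
# a line holding only the speech-prefix, then the speech itself.
# Returns speech-prefix => all of that speaker's words joined together.
sub parts_from_text {
    my ($text) = @_;
    my %parts;
    foreach my $block (split /\n\s*\n/, $text) {
        $block =~ s/^\s+//;                       # drop any leading blank line
        my ($prefix, $speech) = split /\n/, $block, 2;
        next unless defined $prefix && defined $speech;
        $speech =~ s/\s+/ /g;                     # join the lines of the speech
        $speech =~ s/ $//;
        if (exists $parts{$prefix}) { $parts{$prefix} .= " $speech" }
        else                        { $parts{$prefix}  = $speech }
    }
    return %parts;
}

# Placeholder text in the prepared layout.
my %parts = parts_from_text("\nA\nfirst line\nsecond line\n\nB\nreply\n\nA\nagain\n");
print "$_: $parts{$_}\n" foreach sort keys %parts;
```

Because everything hangs on the speech-prefix line, an inconsistent prefix silently creates an extra part, which is why checking the generated list of parts is a useful way to catch errors.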

The file was then emailed to Steve, who ran it through a specially modified Perl script to convert it into a list of parts, and processed them through FileMaker as described in the above section 'Roth's refinements'. Matt checked the list of parts generated, and used this to weed out remaining problems with inconsistent speech-prefixes, incorrect positioning of blank lines, etc.

The same procedure was followed with the Apology, except that instead of stage directions to remove, there were the notes of the later editor to be deleted.

By the end of this process, then, we had built a database of 8,186 "rare" words (words used between two and twelve times in the Cibber corpus).

Counting and analysis methods

We tested several analysis methods, not just the one that Foster adopted for his Shakespeare Newsletter articles. We gave them people's names for easy remembrance. The methods are described below. A few definitions of the terms we used will make those methods easier to understand. We refer to a "source play" as a play made up of parts ("source parts") that might be influencing "target plays." Influence means the presumed effect of a source part or play on a target play, as measured by the usage of rare words from the source part/play in the target play. There are two counting methods that may be used as the basis for each analysis method. It's unclear which of these methods Foster used in his Shakespeare Newsletter analysis.

1. How many of the rare words from a source part are used in a target play? We called this a count of "usages."

2. How many instances of the source-part rare words are there in the target play? (A word can obviously be used multiple times in a play.) We called this a count of "instances."

We tested both counting methods for the two analysis methods for which it was appropriate. The results from each counting method varied significantly in one of the analysis methods, but not in a way that affected our overall conclusion. The result of each analysis method (the presumed "influence") was expressed as a ratio.
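The difference between the two counts is easy to see in a small Python sketch. The function and parameter names are hypothetical, and the inputs are assumed to be a set of rare words already extracted from a source part and a list of tokens from a target play.

```python
from collections import Counter

def count_usages_and_instances(source_part_rare, target_tokens):
    """Return (usages, instances) for one source part and one target play.

    usages:    how many distinct source-part rare words occur in the target
    instances: how many times those words occur in the target, in total
    """
    target_counts = Counter(target_tokens)
    shared = [w for w in source_part_rare if w in target_counts]
    usages = len(shared)
    instances = sum(target_counts[w] for w in shared)
    return usages, instances
```

For a source part with rare words "fop", "beau" and "periwig", a target play containing "fop" twice and "beau" once gives 2 usages and 3 instances.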

Foster Ratio. Assume that a source part which constitutes 5% of its play will also exert 5% of that source play's influence on a target play. If that source part's influence is significantly higher, it suggests that the author may have played the source part prior to writing the target play. Foster's (unstated) break-point was an influence at least 50% greater than predicted.

There are at least two difficulties with this method; it does not account for two aspects of the data:

Target play length. A long play has more words than a short play, hence more rare words, hence more likely correlations with any source part. So long plays, in general, would by this method seem to have been more heavily influenced than short plays. Our analysis bore this out.

Random variation. Some parts, statistically, will by sheer chance have lots of rare words. These parts will seem to exert inordinate influence on all target plays.

It's also unclear whether the assumption described in the first sentence of this ratio description is valid, and it's not clear how to test its validity given the random variation.
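As a worked illustration of the Foster Ratio described above, here is a minimal Python sketch. It assumes that a part's share of its play is measured by word count, which is a guess about the unit (lines would work the same way), and all the names are hypothetical.

```python
def foster_ratio(part_shared, play_shared, part_words, play_words):
    """Observed share of influence over projected share of influence.

    part_shared: rare words (usages or instances) the source part shares
                 with the target play
    play_shared: rare words the whole source play shares with the target play
    part_words / play_words: length of the part and of its play
                 (word counts are an assumption about the unit)
    """
    observed = part_shared / play_shared   # part's share of the play's influence
    projected = part_words / play_words    # part's share of its play
    return observed / projected
```

With these definitions, a part that makes up 5% of its play (500 of 10,000 words) and supplies 6 of the 80 rare words its play shares with a target play has an observed share of 7.5%, so its ratio is about 1.5, which is where the break-point mentioned above would fall.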

Egan Ratio. Gabriel suggested this method in his initial SHAXICAN work ('The idea' above). Compare the relative frequency of source-part rare words in a target play to their frequency in the whole corpus. If their frequency is significantly higher in the target play, it suggests that the source part influenced that play, causing the writer to use the words more frequently. Since this method inherently counts the number of times a word is used (in a target play and in the corpus), it relies on only one of the two counting methods, the count of "instances." This method accounts for both of the difficulties inherent in the Foster Ratio, and does not rely on the underlying assumption of that method.
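A minimal Python sketch of the Egan Ratio follows. It assumes that "frequency" means instances divided by total tokens, that the corpus token list includes the target play, and that only the rare words actually shared with the target play are counted; the names are hypothetical.

```python
from collections import Counter

def egan_ratio(source_part_rare, target_tokens, corpus_tokens):
    """Frequency of shared rare words in the target play over their
    frequency in the whole corpus (instances / total tokens in each).

    Assumes corpus_tokens includes the target play, so any word found in
    the target also has a non-zero corpus count.
    """
    target_counts = Counter(target_tokens)
    corpus_counts = Counter(corpus_tokens)
    shared = [w for w in source_part_rare if w in target_counts]
    if not shared:
        return 0.0  # nothing shared, so no measurable influence
    target_freq = sum(target_counts[w] for w in shared) / len(target_tokens)
    corpus_freq = sum(corpus_counts[w] for w in shared) / len(corpus_tokens)
    return target_freq / corpus_freq
```

A value near 1 means the words are about as common in the target play as in the corpus generally; values well above 1 are what the method would read as influence.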

Roth Ratio. Compare the count of rare words shared between source part and target play to the count of rare words in the source part. A high ratio suggests more influence. This method accounts for the number of rare words in the source part, but not for the length of target plays.
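The Roth Ratio can be sketched in a few lines of Python. The names are hypothetical, and the `use_instances` switch is simply one way of covering both counting methods in a single function.

```python
from collections import Counter

def roth_ratio(source_part_rare, target_tokens, use_instances=False):
    """Shared rare words (usages, or instances in the target play)
    over the number of rare words in the source part."""
    target_counts = Counter(target_tokens)
    shared = [w for w in source_part_rare if w in target_counts]
    if use_instances:
        numerator = sum(target_counts[w] for w in shared)
    else:
        numerator = len(shared)
    return numerator / len(source_part_rare)
```

Because the denominator is the source part's own stock of rare words, a part that is merely rich in rare words gains no advantage, but nothing here corrects for a long target play.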

Steggle Ratio. A count of rare words shared between two works as a proportion of the works' combined lengths. This is not so much a test of Foster's approach as a test of the analysis machinery in use. Matthew suggested this analysis early in the process to see if the Shaxican engine would reveal likely similarities between plays (with no analysis of parts or their influence). As expected, it showed that works tend to share rare words with others of the same genre: tragedies with tragedies (the strongest correlation), comedies with comedies, etc. Hence, we conclude that rare-word statistics overall are more likely to reflect associations due to genre than due to other factors.
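A minimal Python sketch of the Steggle Ratio, assuming that "length" means the number of tokens in each work and that shared words are counted once each (usages); the names are hypothetical.

```python
def steggle_ratio(tokens_a, tokens_b, rare):
    """Distinct rare words found in both works, as a proportion of the
    works' combined lengths (token counts are an assumption)."""
    shared = set(tokens_a) & set(tokens_b) & set(rare)
    return len(shared) / (len(tokens_a) + len(tokens_b))
```

Computing this for every pair of works gives a simple similarity table, which is the kind of output in which a clustering by genre would become visible.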

As explained at the beginning of this article, none of the analysis methods showed a discernible (by us) correlation between parts played and works written subsequently. Detailed tables of results are available in PDF format in the Downloads section below.

Now, we are aware that we're not exactly replicating Foster's technique:

As discussed by Gabriel, Foster's counting method depends on words (distinguishing the noun 'row' from the verb 'row') while SHAXICAN's depends on strings (treating 'row' as just three letters in succession, irrespective of the meaning).

There is an extra complication with Cibber which doesn't apply with the work on Shakespeare: the Cibber corpus is built from texts in unmodernized spelling. However, we consider that the overall effect of this is likely to be small, since the eighteenth-century spelling of the Cibber texts is already reasonably consistent.

And there are a number of stones that we have left unturned:

We didn't experiment with altering the threshold of defining a "rare" word, currently set at 12 or fewer appearances in the whole canon.

We didn't experiment with permutations of the initial canon: with including the 'marginal' texts we excluded, or with paring it down by (say) excluding the autobiography. Since generic similarity clearly 'drowns out' other factors in the Steggle Ratio, we could also have tried analyses based solely on a corpus of tragedies, say, or city comedies. Furthermore, we didn't analyze the potential influence of parts known to have been acted by Cibber in plays he did not write.

The reason we have not investigated these avenues is that the initial results were so disappointing as to discourage us. In order to make a convincing case that the technique can meaningfully be applied to Shakespeare, where the correct answer is not known, it would be necessary first to get loud and clear results from Cibber. And at the moment, we can't.

Our reports and how to read them

We provide four reports (in Acrobat PDF format) that show the "influence" of parts played by Cibber on his writings. Each presents the same data (comparing all the analysis methods) but sorted by a different ratio, so it's easy to scan for patterns resulting from each method.

The Foster Ratio is the part's percent of rare-word (RW) influence (how many rare words from the part appear in the target play [or instances of those rare words] over total rare words shared with the source play) over the part's percent of its play (which is Foster's measure of "projected" influence). Only one report, 'FosterRatioSort.pdf', is provided (sorted by usages) because the usages/instances ratios don't vary much.

The Egan Ratio is the frequency of shared rare words in the target play (necessarily usage instances) over the frequency of those words in the corpus (ditto). This method inherently relies on an "instances" count and the report is 'EganRatioSort.pdf'.

The Roth Ratio is the number of rare words shared between source part and target play (or instances of those words in the target play) over the number of rare words in the source part, and the report is 'RothRatioUsagesSort.pdf'. For the 'instances' case, there's another report: 'RothRatioInstancesSort.pdf'.

These reports only show source-part/target-play correlations where the part has a known playing date by Cibber, and the source part and target play share at least 10 rare words. A report on the Steggle Ratio is not included because it compares plays and plays, not parts and plays.
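The two inclusion rules just described amount to a simple filter. In the Python sketch below the row layout (the keys `part`, `target`, `perf_year` and `shared_rare`) is hypothetical and is not the structure of the FileMaker database.

```python
def reportable(rows, min_shared=10):
    """Keep source-part/target-play rows that have a known playing date
    and share at least min_shared rare words.

    Each row is a dict with hypothetical keys: 'part', 'target',
    'perf_year' (None when no playing date is known) and 'shared_rare'.
    """
    return [
        r for r in rows
        if r["perf_year"] is not None and r["shared_rare"] >= min_shared
    ]
```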

The abbreviations used in the reports are as follows.

Play (date of first publication) | Cibber's part | Perf. date | Play title abbrev. | Cibber's part abbrev.
Love's last shift (1696) | Sir Novelty Fashion | 1696 | llsh | llsh_SirNov
Woman's wit (1697) | Longville | 1697 | wowi | wowi_Lon
The rival queans (1729) | ?Alexander | ?1699 | rivq | rivq_Al
Xerxes (1699) | No part | 1699 | xerx | -
Love makes a man (1701) | Clodio | 1700 | lmam | lmam_Clo
The school-boy (1707) | Mass Johnny | 1702 | scho | scho_Maj
She wou'd, and she wou'd not (1703) | Don Manuel | 1702 | shwo | shwo_DMa
The careless husband (1705) | Lord Foppington | 1704 | care | care_LdFop
Perolla and Izadora (1706) | Pacuvius | 1705 | pero | pero_Pac
The comical lovers (1707) | Celadon | 1707 | comi | comi_Cel
The double gallant (1707) | Atall | 1707 | doub | doub_At
The lady's last stake (1708) | Sir George Brillant | 1707 | lady | lady_LdGeo
The rival fools (1709) | Samuel Simple | 1709 | rivf | rivf_Sim
The non-juror (1718) | Dr Wolf | 1711 | nonj | nonj_Doct
Ximena (1719) | Don Alvarez | 1712 | xime | xime_Alv
Cinna's conspiracy (1713) | No part | 1713 | cinn | -
The refusal (1721) | Witling | 1721 | refu | refu_Wit
Caesar in Aegypt (1725) | Achoreus | 1724 | caes | caes_Acho
Papal tyranny in the reign of King John (1745) | Pandulph | 1745 | papa | papa_Pand

Cibber's two volumes of autobiography were identified in the study as APOLVOL1 and APOLVOL2 respectively.

Matt Steggle & Steve Roth, December 2002


Here are the files referred to above: Cibber Apology 1.htm | Cibber Apology 2.htm | FosterRatioSort.pdf | EganRatioSort.pdf | RothRatioUsagesSort.pdf | RothRatioInstancesSort.pdf. You can alternatively have all twelve in one package: everything-cibber.zip.