WAN Python script

Word Adjacency Networks (WANs) for authorship attribution

A Word Adjacency Network (WAN) is a mathematical system, a Markov chain, that represents the proximities of certain words within a text. Given a list of words-of-interest, typically the 100 or so function words that comprise about half of all spoken and written English, and a text to process, the WAN computer algorithm will record in a Markov chain (stored internally as a two-dimensional matrix) the averaged distances at which each of the words-of-interest is found from each of the others. It has been demonstrated that the habit of placing certain function words near to other function words is an authorial characteristic that varies from one writer to another in a distinctive and relatively consistent way and hence that a comparison of WANs derived from different texts can provide evidence for attributing authorship in cases where other evidence is not available.
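To make the idea concrete, here is a minimal sketch in Python 3 of how a WAN-like matrix could be built from a list of tokens. It is not the code of the software provided on this page. The function name build_wan, the look-ahead window of 5 words, and the discount of 0.75 are all assumptions chosen for illustration. The sketch also scores nearness with a simple distance discount, a swapped-in simplification rather than the exact averaging used by the published algorithm.

```python
from collections import defaultdict


def build_wan(tokens, words_of_interest, window=5, discount=0.75):
    """Return a row-normalised proximity matrix as a dict of dicts.

    Illustrative only: `window` and `discount` are assumed values.
    """
    targets = set(words_of_interest)
    raw = {w: defaultdict(float) for w in words_of_interest}

    for i, source in enumerate(tokens):
        if source not in targets:
            continue
        # Look ahead up to `window` tokens after each word-of-interest.
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            target = tokens[i + d]
            if target in targets:
                # Nearer neighbours contribute more than distant ones.
                raw[source][target] += discount ** (d - 1)

    # Normalise each row so that it sums to 1 (a Markov-chain row),
    # leaving all-zero rows for words that never precede another target.
    wan = {}
    for source in words_of_interest:
        total = sum(raw[source].values())
        wan[source] = {
            t: (raw[source][t] / total if total else 0.0)
            for t in words_of_interest
        }
    return wan
```

Calling build_wan("the cat and the dog".split(), ["the", "and"]) returns one row per word-of-interest, each row giving the relative weight with which that word is followed, within the window, by each of the others.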

The comparison of a WAN derived from one text, say a piece of writing of disputed authorship, and the WAN derived from another text, say the securely attributed works of a plausible candidate for authorship of the disputed text, is a comparison of differing probability distributions. For this comparison the standard measure from Information Theory known as Kullback-Leibler divergence (more colloquially, relative entropy) may be used. When comparing the relative entropy between WANs from multiple texts, the lowest values tend to be between WANs from texts by the same author. This fact is exploited in application of the WAN method to the problem of authorship attribution: amongst the candidate authors for a disputed or suspect text we seek the one whose authorial canon is least different in this regard from the text to be attributed.
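The comparison just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the calculation as implemented in the published software. The function name relative_entropy and the weights argument (one weight per row, for example how often each word-of-interest occurs in the first text) are hypothetical, and skipping any pair whose probability is zero in either WAN is an assumed way of avoiding division by zero.

```python
import math


def relative_entropy(p, q, weights):
    """Weighted sum over rows of the Kullback-Leibler divergence, in nats.

    `p` and `q` are dicts of dicts holding row-normalised probabilities.
    `weights` maps each row label to its (assumed) importance.
    """
    total = 0.0
    for i, row in p.items():
        for j, p_ij in row.items():
            q_ij = q.get(i, {}).get(j, 0.0)
            # Assumption: ignore transitions absent from either WAN.
            if p_ij > 0 and q_ij > 0:
                total += weights.get(i, 0.0) * p_ij * math.log(p_ij / q_ij)
    return total
```

Kullback-Leibler divergence is not symmetric, so relative_entropy(p, q, weights) and relative_entropy(q, p, weights) will generally differ. Which text is treated as p is a choice the investigator has to make and apply consistently.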

Six published articles by Gabriel Egan and others describe the WAN method and apply it to problems of authorship attribution in the fields of Shakespeare and early modern drama:

Santiago Segarra, Mark Eisen, Gabriel Egan and Alejandro Ribeiro. "Attributing the Authorship of the Henry VI Plays By Word Adjacency." Shakespeare Quarterly vol. 67 (2016): 232-56

Santiago Segarra, Mark Eisen, Gabriel Egan and Alejandro Ribeiro. "Stylometric Analysis of Early Modern English Plays." Digital Scholarship in the Humanities 33 (2018): 500-28

Santiago Segarra, Mark Eisen, Gabriel Egan, and Alejandro Ribeiro. "A response to Pervez Rizvi's critique of the Word Adjacency Method for authorship attribution" ANQ: A Quarterly Journal of Short Articles, Notes and Reviews 33 (2020): 332-337

Santiago Segarra, Mark Eisen, Gabriel Egan, and Alejandro Ribeiro. "A response to Rosalind Barber's critique of the Word Adjacency Method for authorship attribution" ANQ: A Quarterly Journal of Short Articles, Notes and Reviews 34 (2021): 291-296

Paul Brown, Mark Eisen, Santiago Segarra, Alejandro Ribeiro, and Gabriel Egan. "How the Word Adjacency Network (WAN) Algorithm Works" Digital Scholarship in the Humanities 37 (2022): 321-335

Gabriel Egan, Mark Eisen, Alejandro Ribeiro, and Santiago Segarra. "'I would I had that corporal soundness': Pervez Rizvi's Analysis of the Word Adjacency Network Method of Authorship Attribution" Digital Scholarship in the Humanities 38 (2023): 1494-1507

The purpose of this webpage is to provide supporting materials for those wishing to use the WAN method in their own investigations of authorship.

Materials in support of Paul Brown, Mark Eisen, Santiago Segarra, Alejandro Ribeiro, and Gabriel Egan. "How the Word Adjacency Network (WAN) Algorithm Works" Digital Scholarship in the Humanities 37 (2022): 321-335

The software provided here is written in the language Python (version 3) and takes as its input three ASCII text files: 1) a sample of writing, 2) another sample of writing, and 3) a list of words-of-interest. The software creates two Word Adjacency Networks, each representing the proximities of the words-of-interest for one of the two samples of writing, and then calculates and outputs the relative entropy between the two WANs. The algorithm that the software follows is described in Brown et al. cited above.

Program code

The WAN program code is written in Python 3 and can be found here: WAN.py.

If you click this link, the program code will display in any web-browser, but because it contains long lines some browsers may unhelpfully wrap them around the screen if they don't fit within your display. To get an executable copy of the software, instead of simply clicking on the link "WAN.py" do a right-click on the link -- or left-click if you usually right-click or a click-and-hold if you are using a Macintosh computer -- to bring up the so-called "Context Menu" that gives you options for what to do with a link. The option you want is to download to your computer a copy of the file that the link points to. In the browsers Microsoft Edge, Firefox, and Google Chrome this option is called "Save link as", in Microsoft Internet Explorer it is called "Save target as", and in Safari it is called "Download linked file as". (If you want to start a petition demanding that all the browsers use the same words for this operation, put my name down.) Save the downloaded code as a file called "WAN.py". If you are using a Windows computer, beware that by default Windows hides the part of a filename (called the extension) that comes after the full stop, so first switch off this concealment. You can find out how to switch it off by doing a web search for "How to prevent Windows hiding file extensions". (If you want to start a petition asking Microsoft to switch off the default behaviour of hiding file extensions, put my name down.)

The program "WAN.py" has hard-coded into it the filenames of the three input files it uses: the two text files whose WANs it will generate and the file containing the list of words it will use to generate the two WANs. The program does not save the WANs to your hard disk: it uses them internally and deletes them when the program terminates. The program code begins with several paragraphs of comments explaining its operation, including the kind of encoding it expects to find in its input files -- which is pure ASCII in either ANSI or UTF-8 form -- and indicating which lines of the program refer to the hard-coded filenames. You can either give your files the hard-coded filenames that the program expects or you can edit the program so that it expects the filenames you have chosen for your files.
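Because the program expects pure ASCII input, it can be useful to check a text file before running the program on it. The following is a small sketch of such a check and is not part of "WAN.py". The function name non_ascii_positions and its limit parameter are hypothetical.

```python
from pathlib import Path


def non_ascii_positions(path, limit=10):
    """Return up to `limit` (byte offset, byte value) pairs for bytes above 127.

    An empty list means every byte in the file is plain ASCII.
    """
    data = Path(path).read_bytes()
    found = [(offset, byte) for offset, byte in enumerate(data) if byte > 127]
    return found[:limit]
```

If the function returns an empty list the file contains only ASCII characters. Otherwise the offsets show where to look for accented letters, typographic quotation marks, or other characters that need replacing.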

The output of the program is a log (sent to the terminal screen) detailing its operations as they happen and ending with a statement of the value, in the units called centinats (100ths of the unit of entropy called the nat), of the relative entropy between the two WANs, and hence of the difference, with respect to the clustering of the words-of-interest you specified, between the two texts from which those WANs were derived. The smaller this relative entropy (the lower the number of centinats) the more alike the two texts are in their habits of clustering the words-of-interest you specified.
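The arithmetic of the units is simple and can be sketched as below. The function names are hypothetical and the code is offered only to illustrate the conversion, not to reproduce how the program formats its output. The one fact relied on is that a nat is the unit of entropy obtained with natural logarithms, so that one bit equals ln 2 nats.

```python
import math


def nats_to_centinats(nats):
    """One nat is 100 centinats."""
    return nats * 100.0


def bits_to_centinats(bits):
    """Convert an entropy in bits (base-2 logarithms) to centinats."""
    return bits * math.log(2) * 100.0
```

A relative entropy of 0.25 nats is thus reported as 25 centinats, and a value quoted elsewhere in bits can be brought onto the same scale with bits_to_centinats before comparison.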

This program "WAN.py" is offered here to illustrate the principles by which the Word Adjacency Network method of author attribution works and to enable any interested investigators to try it out for themselves. For a serious investigation of a real-world authorship attribution problem it is likely that the investigator will need to run the "WAN.py" script hundreds or even thousands of times on various texts and sub-divisions of texts. This cannot practically be achieved by manually running the "WAN.py" script at the command line. An article that walks the reader through such a real-world application of the method, using automated invoking of the WAN algorithm multiple times, is currently under review at a scholarly journal and once it is accepted for publication somewhere a new section will be added below this one providing the computer code, text files, and numerical-data files that support this new article. In the meantime, Gabriel Egan <mail@gabrielegan.com> would be happy to talk to anyone interested in this new approach to authorship attribution or wanting help to get started in using it.
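As a rough indication of what automated repetition involves, the sketch below ranks a set of candidate texts against a disputed text by calling a comparison function once per candidate. It is not the code from the article under review. The name rank_candidates is hypothetical, and score stands in for whatever routine performs a single WAN comparison and returns its relative entropy.

```python
def rank_candidates(disputed, candidates, score):
    """Return (score, name) pairs sorted from most alike to least alike.

    `candidates` maps a label (for example an author's name) to a text.
    `score` is any callable taking two texts and returning a number,
    where a lower number means the texts are more alike.
    """
    results = [(score(disputed, text), name) for name, text in candidates.items()]
    return sorted(results)
```

The first entry in the returned list is the candidate whose text is least different from the disputed text according to the scoring function supplied. A real investigation would also need to repeat this over sub-divisions of the texts and to assess how far apart the scores must be before the difference means anything.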

Materials in support of Gabriel Egan, Mark Eisen, Alejandro Ribeiro, and Santiago Segarra. "'I would I had that corporal soundness': Pervez Rizvi's Analysis of the Word Adjacency Network Method of Authorship Attribution" Digital Scholarship in the Humanities 38 (2023): 1494-1507

In this article the authors respond to Pervez Rizvi's two-part article ‘An Analysis of the Word Adjacency Network Method, Part 1: The Evidence of Its Unsoundness’ and ‘Part 2: A True Understanding of the Method’ Digital Scholarship in the Humanities, 38: 347-378 (2022). As part of our response we describe the contents of a ZIP-compressed dataset of supporting materials for his article that Rizvi put online at <http://www.shakespearestext.com/wan.zip>. On 24 October 2022, as we were writing our response, we downloaded Rizvi's ZIP file containing his dataset and we thought it best to deposit a date-stamped copy of that ZIP file here so that anyone reading our article can confirm that it did, on that date, contain the materials that we refer to in our article. Our copy of Rizvi's ZIP file is here: wan.zip. When last checked by us, on the date of publication of our article, 28 April 2023, the copy of the ZIP file on Rizvi's website and the copy here were identical.

In response to our article "'I would I had . . .'", Pervez Rizvi published in Digital Scholarship in the Humanities a letter disputing some of what we wrote. The editor of Digital Scholarship in the Humanities invited us to reply to this letter and in our reply (<https://doi.org/10.1093/llc/fqad107>) we refer to three files that in October 2017 Rizvi published on his website at <https://shakespearestext.com/CAN>. Rizvi has since reorganized his website and the files are not present on the version available on the day that our letter was published, which was 10 January 2024. For the benefit of readers following this exchange who might want to see the three files that Rizvi published in 2017, and that we say contradict his claim in his letter not to have tested for or stated his views on the co-authorship of the Shakespeare plays Titus Andronicus, Pericles, and Timon of Athens, we provide those three files from 2017 here:

summary-collocations-titus_andronicus_excl_act_1_2_1_2_2_3_2_4_1.csv

summary-collocations-pericles_acts_1_to_2.csv

summary-collocations-timon_of_athens_middleton.csv