GabrielEgan.com

Word Adjacency Networks (WANs) for authorship attribution

A Word Adjacency Network (WAN) is a mathematical system, a Markov chain, that represents the proximities of certain words within a text. Given a list of words-of-interest, typically the 100 or so function words that comprise about half of all spoken and written English, and a text to process, the WAN computer algorithm will record in a Markov chain (stored internally as a two-dimensional matrix) the averaged distances at which each of the words-of-interest is found from each of the others. It has been demonstrated that the habit of placing certain function words near to other function words is an authorial characteristic that varies from one writer to another in a distinctive and relatively consistent way and hence that a comparison of WANs derived from different texts can provide evidence for attributing authorship in cases where other evidence is not available.
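The construction described above can be sketched roughly as follows. This is an illustrative sketch only, not the code of WAN.py: the function name `build_wan`, the look-ahead `window`, and the `decay` weighting that makes nearer words count for more are all assumptions chosen for the example, and the algorithm actually used is the one described in the articles cited below.

```python
import re

def build_wan(text, words_of_interest, window=10, decay=0.75):
    """Return a row-normalised proximity matrix as a dict of dicts.

    `window` and `decay` are hypothetical parameters chosen for this
    sketch; they are not taken from WAN.py.
    """
    targets = set(words_of_interest)
    tokens = re.findall(r"[a-z']+", text.lower())
    weights = {w: {v: 0.0 for v in words_of_interest} for w in words_of_interest}

    for i, source in enumerate(tokens):
        if source not in targets:
            continue
        # Look ahead up to `window` tokens; nearer words get larger weights.
        for distance in range(1, window + 1):
            if i + distance >= len(tokens):
                break
            target = tokens[i + distance]
            if target in targets:
                weights[source][target] += decay ** (distance - 1)

    # Normalise each row so that it sums to 1 (or stays all-zero if the
    # source word was never followed by another word-of-interest).
    wan = {}
    for source, row in weights.items():
        total = sum(row.values())
        wan[source] = {t: (v / total if total else 0.0) for t, v in row.items()}
    return wan

wan = build_wan("the cat and the dog", ["the", "and"])
```

Each row of the resulting matrix can be read as a probability distribution over which word-of-interest tends to appear soon after the row's word, which is what makes the structure a Markov chain.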

The comparison of a WAN derived from one text, say a piece of writing of disputed authorship, and the WAN derived from another text, say the securely attributed works of a plausible candidate for authorship of the disputed text, is a comparison of differing probability distributions. For this comparison the standard measure from Information Theory known as Kullback-Leibler divergence (more colloquially, relative entropy) may be used. When comparing the relative entropy between WANs from multiple texts, the lowest values tend to be between WANs from texts by the same author. This fact is exploited in applying the WAN method to the problem of authorship attribution: amongst the candidate authors for a disputed or suspect text we seek the one whose authorial canon is least different in this regard from the text to be attributed.
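As a rough illustration of such a comparison, the sketch below computes a relative entropy between two small Markov chains stored as dicts of dicts. It makes several assumptions that are not taken from this page or from WAN.py: each row's contribution is weighted by the first chain's stationary distribution (estimated here by simple power iteration, which suits only well-behaved chains), and any transition that has zero probability in either chain is skipped.

```python
import math

def stationary_distribution(chain, iterations=200):
    """Estimate the stationary distribution of a chain by power iteration."""
    states = list(chain)
    pi = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        nxt = {s: 0.0 for s in states}
        for s in states:
            for t, prob in chain[s].items():
                nxt[t] += pi[s] * prob
        total = sum(nxt.values())
        if total == 0:
            break
        pi = {s: v / total for s, v in nxt.items()}
    return pi

def relative_entropy(p, q):
    """Relative entropy (in nats) of chain p with respect to chain q.

    Assumption for this sketch: transitions that are zero in either
    chain are ignored rather than treated as infinitely different.
    """
    pi = stationary_distribution(p)
    total = 0.0
    for s in p:
        for t, p_st in p[s].items():
            q_st = q.get(s, {}).get(t, 0.0)
            if p_st > 0 and q_st > 0:
                total += pi[s] * p_st * math.log(p_st / q_st)
    return total

p = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
q = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}
same = relative_entropy(p, p)       # identical chains give 0.0
different = relative_entropy(p, q)  # differing chains give a positive value
```

Note that relative entropy is not symmetric: `relative_entropy(p, q)` and `relative_entropy(q, p)` generally differ, so the direction of comparison has to be chosen and kept consistent.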

Five published articles by Gabriel Egan and others describe the WAN method and apply it to problems of authorship attribution in the fields of Shakespeare and early modern drama:

Segarra, Santiago, Mark Eisen, Gabriel Egan, and Alejandro Ribeiro. 2016. "Attributing the Authorship of the Henry VI Plays By Word Adjacency." Shakespeare Quarterly. vol. 67. pp. 232-56

Segarra, Santiago, Mark Eisen, Gabriel Egan, and Alejandro Ribeiro. 2018. "Stylometric Analysis of Early Modern English Plays." Digital Scholarship in the Humanities. vol. 33. pp. 500-28

Segarra, Santiago, Mark Eisen, Gabriel Egan, and Alejandro Ribeiro. 2020. "A response to Pervez Rizvi's critique of the Word Adjacency Method for authorship attribution." ANQ: A Quarterly Journal of Short Articles, Notes and Reviews. vol. 33. pp. 332-337

Segarra, Santiago, Mark Eisen, Gabriel Egan, and Alejandro Ribeiro. 2020. "A response to Rosalind Barber's critique of the Word Adjacency Method for authorship attribution." ANQ: A Quarterly Journal of Short Articles, Notes and Reviews. Advance online access. pp. 1-6

Brown, Paul, Mark Eisen, Santiago Segarra, Alejandro Ribeiro, and Gabriel Egan. 2021. "How the Word Adjacency Network (WAN) algorithm works." Forthcoming in Digital Scholarship in the Humanities.

The purpose of this webpage is to provide supporting materials for those wishing to use the WAN method in their own investigations of authorship.


Materials in support of Paul Brown, Mark Eisen, Santiago Segarra, Alejandro Ribeiro, and Gabriel Egan. 2020. "How the Word Adjacency Network (WAN) algorithm works" forthcoming in Digital Scholarship in the Humanities.

The software provided here is written in the language Python (version 3) and takes as its input three ASCII text files: 1) a sample of writing, 2) another sample of writing, and 3) a list of words-of-interest. The software creates two Word Adjacency Networks, each representing the proximities of the words-of-interest for one of the two samples of writing, and then calculates and outputs the relative entropy between the two WANs. The algorithm that the software follows is described in Brown et al. cited above.

Program code

The WAN program code is written in the language Python 3 and can be found here: WAN.py.

If you click this link, the program code will display in any web browser, but because it contains long lines some browsers may unhelpfully wrap them around the screen if they don't fit within your display. To get an executable copy of the software, instead of simply clicking on the link "WAN.py" do a right-click on the link -- or left-click if you usually right-click or a click-and-hold if you are using a Macintosh computer -- to bring up the so-called "Context Menu" that gives you options for what to do with a link. The option you want is to download to your computer a copy of the file that the link points to. In the browsers Microsoft Edge, Firefox, and Google Chrome this option is called "Save link as", in Microsoft Internet Explorer it is called "Save target as", and in Safari it is called "Download linked file as". (If you want to start a petition demanding that all the browsers use the same words for this operation, put my name down.) Save the downloaded code as a file called "WAN.py". If you are using a Windows computer, beware that by default Windows hides the part of a filename (called the extension) that comes after the full stop, so first switch off this concealment. You can find out how to switch it off by doing a web search for "How to prevent Windows hiding file extensions". (If you want to start a petition asking Microsoft to switch off the default behaviour of hiding file extensions, put my name down.)

The program "WAN.py" has hard-coded into it the filenames of the three input files it uses: the two text files whose WANs it will generate and the file containing the list of words it will use to generate the two WANs. The program does not save the WANs to your hard disk: it uses them internally and deletes them when the program terminates. The program code begins with several paragraphs of comments explaining its operation, including the kind of encoding it expects to find in its input files -- which is pure ASCII in either ANSI or UTF-8 form -- and indicating which lines of the program refer to the hard-coded filenames. You can either give your files the hard-coded filenames that the program expects or you can edit the program so that it expects the filenames you have chosen for your files.
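Before running the program it can be useful to check that your own input files really are pure ASCII. The sketch below shows one way to do that. The three filenames are hypothetical placeholders (the names WAN.py actually expects are given in its opening comments), and the assumption that the word list holds one word per line is made only for this example.

```python
from pathlib import Path

# Hypothetical filenames for illustration; consult the comments at the
# top of WAN.py for the names the program actually expects.
SAMPLE_A = "sample_a.txt"
SAMPLE_B = "sample_b.txt"
WORD_LIST = "words_of_interest.txt"

def read_ascii_text(path):
    """Read a text file and raise ValueError if it is not pure ASCII."""
    text = Path(path).read_text(encoding="utf-8")
    if not text.isascii():
        raise ValueError(f"{path} contains non-ASCII characters")
    return text

def read_word_list(path):
    """Read a word list, assuming one word per line; blank lines are skipped."""
    lines = read_ascii_text(path).splitlines()
    return [line.strip().lower() for line in lines if line.strip()]

# Example usage, once files with these names exist:
#   text_a = read_ascii_text(SAMPLE_A)
#   text_b = read_ascii_text(SAMPLE_B)
#   words = read_word_list(WORD_LIST)
```

Curly quotation marks, accented letters, and long dashes pasted in from word processors are the usual causes of a failed check, and they need replacing with plain ASCII equivalents before the files are used.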

The output of the program is a log (sent to the terminal screen) detailing its operations as they happen and ending with a statement of the value in the units called centinats (100ths of the unit of entropy called the nat) of the relative entropy between the two WANs, and hence of the difference, with respect to the clustering of the words-of-interest you specified, between the two texts from which those WANs were derived. The smaller this relative entropy (the lower the number of centinats) the more alike the two texts are in their habits of clustering the words-of-interest you specified.
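To make the unit concrete: one nat is 100 centinats, so converting between them is a multiplication or division by 100. The helper names in the sketch below are invented for this illustration and do not come from WAN.py.

```python
def nats_to_centinats(nats):
    """Convert a relative entropy in nats to centinats (1 nat = 100 centinats)."""
    return nats * 100.0

def centinats_to_nats(centinats):
    """Convert a relative entropy in centinats back to nats."""
    return centinats / 100.0

# A relative entropy of 0.25 nats is reported as 25 centinats.
example = nats_to_centinats(0.25)
```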

This program "WAN.py" is offered here to illustrate the principles by which the Word Adjacency Network method of author attribution works and to enable any interested investigators to try it out for themselves. For a serious investigation of a real-world authorship attribution problem it is likely that the investigator will need to run the "WAN.py" script hundreds or even thousands of times on various texts and sub-divisions of texts. This cannot practically be achieved by manually running the "WAN.py" script at the command line. An article that walks the reader through such a real-world application of the method, using automated invoking of the WAN algorithm multiple times, is currently under review at a scholarly journal and once it is accepted for publication somewhere a new section will be added below this one providing the computer code, text files, and numerical-data files that support this new article. In the meantime, Gabriel Egan <mail@gabrielegan.com> would be happy to talk to anyone interested in this new approach to authorship attribution or wanting help to get started in using it.
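For readers wondering what automated, repeated comparison might look like in outline, the following is a minimal sketch and not the code from the article under review. The function names and the `compare` parameter are assumptions made for this example: `compare` stands for any function that takes two texts and returns a difference score, such as a relative entropy between their WANs. The `toy_compare` function is a deliberately meaningless stand-in used only to show the control flow.

```python
def split_into_chunks(text, chunk_size):
    """Split a text into consecutive sub-divisions of `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def rank_candidates(disputed_text, candidate_texts, compare):
    """Score the disputed text against each candidate; lowest score first.

    `candidate_texts` maps a candidate's name to that candidate's writing.
    `compare` is supplied by the caller (hypothetical here).
    """
    scores = {name: compare(disputed_text, text)
              for name, text in candidate_texts.items()}
    return sorted(scores.items(), key=lambda item: item[1])

def toy_compare(text_a, text_b):
    """Stand-in scorer for demonstration only; NOT a WAN comparison."""
    return abs(len(text_a) - len(text_b))

chunks = split_into_chunks("a b c d e", 2)
ranking = rank_candidates("xx", {"A": "xxxx", "B": "xxx"}, toy_compare)
```

In a real investigation each chunk of the disputed text would be ranked against every candidate in turn, which is how the number of comparisons quickly reaches the hundreds or thousands.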