The right markup on Shakespeare’s texts can help algorithms address other texts in the early modern English language.
The object of my research is early modern language:
- specifically texts printed before 1700,
- specifically Shakespeare,
- specifically his rhetorical figures.
The rhetorical figure I just used is called an anaphora, where the same word is repeated at the beginning of a sequence of clauses or sentences.
Here’s a better example of an anaphora, from Shakespeare:
“Some glory in their birth, some in their skill,
Some in their wealth, some in their body’s force”
This quotation is also a zeugma, where one verb (“glory”) serves two or more clauses.
My purpose isn’t to lesson you in rhetorical figures, but to show you two things in one quotation:
- Shakespeare’s language is obscure to the modern-English reader. The verb “to glory” is out of common use, as is the sense of “force” as health or vigour.
- Shakespeare’s language is poetic, to any reader. He’s giving us a list, here using the anaphora, which we could render in bullet points, but instead he uses figurative language, in iambic pentameter.
I chose this example because it’s metrically straightforward, and because it offers two straightforward rhetorical figures, which happen to start with ‘A’ and ‘Z’.
And I chose it because I had to start somewhere, with some piece of evidence that scales outward: from these figures in Sonnet 91, in turn,
- to all of Shakespeare’s figures;
- to all of Shakespeare’s texts;
- to all texts printed before 1700;
- to all archived writing from the period;
- to all lost texts;
- and beyond, to all those “nameless, unremembered, acts” that Wordsworth writes about,
- to Thomas Gray’s “mute inglorious Milton[s],”
- to unrealized ambitions,
- to all of early modern consciousness,
- to —
I could go on. The point is, we are always scaling from parts to wholes, from models to world, from the quantities to the qualities of a text.
My subject is scaling data. This quotation is my data, this and the hierarchies it’s nested within: from this quotation, to full digital editions of Shakespeare, to the EEBO-TCP transcriptions of every book printed before 1700. (That’s where I’ll stop.)
A brief explanation:
The TCP is the Text Creation Partnership, which has transcribed a single edition of nearly 45,000 of the full 70,000 titles in the Early English Books Online (EEBO) collection of texts printed before 1700. Jonathan Hope has called it “the most important humanities project in the anglophone world in a hundred years.”
Again, very briefly:
Martin Mueller, who has done more for the digital analysis of early modern English texts than any other scholar, sums up the TCP’s problems succinctly in a recent blog post: “the TCP is a magnificent but flawed project.” Its errors (which in the aggregate affect just 0.5% of its words) can cast what Mueller calls “a negative halo effect” on these texts.
I, too, want the TCP texts to be better — to be not just more complete and error-free, but more amenable to algorithms. Algorithmic amenability is Martin Mueller’s elegant phrase, in one of his posts about curating the TCP texts:
its capacity for being variously divided or manipulated, combined with other texts for the purposes of cross-corpus analyses, having data derivatives extracted from, or levels of metadata added to it.
How do we make the texts amenable in this way?
Go back to my first quotation, with its metrical regularity, and its rhetorical figures:
Take each of those qualities and model them: quantify them, make them so simple that even a computer can understand them. Count syllables; tag parts of speech with Natural Language Processing trained in the early modern idiom; teach an algorithm to recognize rhetorical figures, starting with those that repeat and reorder words (antimetabole, chiasmus, gradatio) and sentence structures (isocolon, stichomythia).
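To make "teach an algorithm to recognize rhetorical figures" concrete, here is a minimal sketch in Python of an anaphora detector, run on the opening lines of Sonnet 91. It rests on assumptions that are not part of any existing tool: clauses are split crudely on punctuation and line breaks, and a figure is reported whenever two or more consecutive clauses open with the same word. The function names (`split_clauses`, `first_word`, `find_anaphora`) are hypothetical, chosen for illustration only.

```python
import re


def split_clauses(text):
    """Split text into rough clauses on punctuation and line breaks (a crude assumption)."""
    parts = re.split(r"[,;:.!?\n/]+", text)
    return [p.strip() for p in parts if p.strip()]


def first_word(clause):
    """Return the lower-cased first word of a clause, or '' if it has none."""
    match = re.search(r"[A-Za-z']+", clause)
    return match.group(0).lower() if match else ""


def find_anaphora(text, min_run=2):
    """Return (word, run_length) for each run of consecutive clauses opening with the same word."""
    runs = []
    current, length = None, 0
    for clause in split_clauses(text):
        word = first_word(clause)
        if word and word == current:
            length += 1
        else:
            # The run has ended; keep it only if it is long enough.
            if current and length >= min_run:
                runs.append((current, length))
            current, length = word, 1
    if current and length >= min_run:
        runs.append((current, length))
    return runs


lines = (
    "Some glory in their birth, some in their skill,\n"
    "Some in their wealth, some in their body's force"
)
print(find_anaphora(lines))
```

For these two lines the sketch prints `[('some', 4)]`: four consecutive clauses opening with "some". A rule this simple would also flag ordinary repetition (a run of clauses that each begin with "and", say), so its output is better treated as a list of candidates for a reader to confirm than as a finished tagging.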
Why these features? Because they are both noticeable and addressable qualities of a text.
We notice rhetorical figures as we read them (even if we have to look up their names, as I did); and then we can recognize them the next time we see them. We notice a noun used as a verb. Then when we come to lines like “Grace me no grace, nor uncle me no uncle” in Richard II, we recognize what Shakespeare is doing.
I said these are noticeable and addressable features. ‘Addressable’ is Michael Witmore’s term. It means “one can query a position within the text at a certain level of abstraction”:
- act & scene divisions, printed lines, syllables;
- rhetorical structures;
- parts of speech.
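One way to picture these levels of address is as fields on each word of the text, so that a query can name any combination of them. The following Python sketch is an illustration under stated assumptions, not Witmore's model or any edition's actual encoding: the `Token` class and `query` function are hypothetical, the part-of-speech tags are assigned by hand (loosely following the Universal POS labels) and not produced by a tagger, and the line number is a placeholder, not an edition's lineation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    act: int
    scene: int
    line: int       # placeholder line index, not an edition's line number
    position: int   # index of the word within its line
    text: str
    pos: str        # part-of-speech tag, hand-assigned for this example


def query(tokens, **levels):
    """Return the tokens that match every requested level of address."""
    return [
        t for t in tokens
        if all(getattr(t, name) == value for name, value in levels.items())
    ]


# "Grace me no grace, nor uncle me no uncle" (Richard II, Act 2, Scene 3)
WORDS = "grace me no grace nor uncle me no uncle".split()
TAGS = ["VERB", "PRON", "DET", "NOUN", "CCONJ", "VERB", "PRON", "DET", "NOUN"]

tokens = [
    Token(act=2, scene=3, line=1, position=i, text=word, pos=tag)
    for i, (word, tag) in enumerate(zip(WORDS, TAGS))
]

verbs = query(tokens, act=2, scene=3, pos="VERB")
print([t.text for t in verbs])
```

Here the query addresses the text at two levels at once (a scene and a part of speech) and prints `['grace', 'uncle']`, the two nouns pressed into service as verbs. Rhetorical figures could be added as one more field in the same way.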
As Witmore writes, “[R]eading might be described as the continual redisposition of levels of address.” To read is to address a text, to abstract it into what Quentin Meillassoux, whom Witmore quotes, calls “those aspects of the object that can be formulated in mathematical terms.”
But can we formulate blank verse in mathematical terms? What about its figures, from anaphoras to zeugmas?
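As a first, very partial answer for metre, here is a Python sketch that counts syllables per line with a naive spelling heuristic: each group of vowels counts as one syllable, and a final silent "e" is dropped. The heuristic and the function names (`count_syllables`, `syllables_per_line`) are assumptions made for illustration; the rule will miscount many words, it knows nothing about stress, and ten syllables is only a necessary condition for a regular pentameter line.

```python
import re


def count_syllables(word):
    """Estimate syllables by counting vowel groups (a rough heuristic, not a dictionary lookup)."""
    letters = re.sub(r"[^a-z]", "", word.lower())
    if not letters:
        return 0
    count = len(re.findall(r"[aeiouy]+", letters))
    # Treat a final "e" as silent, except in "-le" endings.
    if letters.endswith("e") and not letters.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def syllables_per_line(text):
    """Return the estimated syllable count of each non-empty line."""
    return [
        sum(count_syllables(word) for word in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


lines = (
    "Some glory in their birth, some in their skill,\n"
    "Some in their wealth, some in their body's force"
)
print(syllables_per_line(lines))
```

On the opening lines of Sonnet 91 this prints `[10, 10]`, which is what we expect of a metrically straightforward example. Early modern spelling and elision would need a pronunciation lexicon, not a spelling rule, before counts like these could be trusted at scale.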
And what would we do with it?
- Imagine a new theory of the adverb, the anaphora, or other quantifiable features of language.
- Imagine an alternate, more comprehensive history of the sonnet: one that finds its forms embedded in longer texts (as in famous examples like Romeo and Juliet or Paradise Lost).
- Imagine a new conception of figurative language in printed sermons, or scientific prose.
So Witmore, pairing massive addressability with future tokenizations and new statistical procedures, writes that “something that is arguably true now about a collection of texts can only be known in the future.”
Let’s review my wish list of ‘scalable’ features to help text-analysis algorithms address the noticeable features of Shakespeare’s text. I’ve mentioned a few features like syllables, parts of speech, and rhetorical figures. We could expand these in different directions, but let’s narrow them down by asking two key questions:
1. Why? What problems or questions would the systematic recognition (by detecting, encoding, and visualizing) of these features actually address?
2. How? Which of them could we feasibly ask a program to recognize?
Question 2 raises second-order questions like which dataset to use (the Folger Digital Texts? the New Variorum Shakespeare?); how to regularize and modernize Shakespeare’s texts; how to divide them into sentence-length units despite inconsistent punctuation; and so on.
I hope that readers of this post will be able to help me with Question 2, by pointing me in the right direction. As I gather feedback, I’ll update this post.
I’m more confident about my answers to Question 1. The goal is to find features of Shakespeare’s language that provoke research insights, that are machine-readable, and that are typical of broader usage.
Start with rhetorical figures. I think immediately of the differences between writers who use very few figures, like Christopher Marlowe, and those who use them frequently, like Thomas Kyd. So it’s possible to distinguish between writers based on their usage. I don’t say this to get into attribution studies or stylometrics, as past generations of digital humanists studying early modern texts have done, but to ask whether we could identify influence between writers.
In future posts like this one, I’ll add examples from Kyd and Lyly, and begin compiling lists of figures. That should help me address queries about how to plan and launch this automated process. Thanks for reading.