The Augmented Criticism Lab’s Sonnet Database

This is the text of a short paper I delivered at the Digital Humanities 2019 conference in Utrecht, the Netherlands, on 12 July 2019. The Augmented Criticism Lab’s Sonnet Database is in beta release.

To keep to my 10 minutes, I’ll be as focused as possible. My aim is to raise a research question, and then to describe my methods for answering it.

My question is:

Is the sonnet a form or a genre?

My method for answering it is:

A database of sonnets for text-analysis.

This paper has 5 parts. I’ll move briskly and suggestively through them:

  1. My motive
  2. My method
  3. My constraints
  4. My results
  5. My plans

My research domain is English Renaissance poetry: writers like Shakespeare, Sidney, and Spenser. Before them, in the early C16, Thomas Wyatt translated the C14 Italian poet Petrarch’s ‘sonnetti’ (or little songs) into English for the first time, and the conventions were set: a short complaint of love-sickness.

Later poets took up the sonnet and wrote hundreds upon hundreds of them. Most poets writing in the late C16 wrote at least one, and many wrote many of them. Most famously, Shakespeare wrote a sequence of 154 sonnets, like this famous one.

Definitions of the sonnet tend to start with its form: 14 lines of rhymed ten-syllable (pentameter) verse. Subtypes distinguish the Shakespearean from the Petrarchan sonnet based on their rhyme schemes.

But then the definitions go beyond formal features to generic ones. The sonnet is a first-person reflection or “dialectical self-confrontation,” writes Paul Oppenheimer (1989), often with a volta (or turn) from problem to resolution. Its subjects are conventional, usually unrequited love.

The more sonnets you read, the more nuanced the formal and generic definitions become.

  • The Italian models vary the numbers of syllables and of lines, so English writers do the same.
  • Nor are all English sonnets about love. There are devotional sonnets, for instance; and occasional sonnets, on public events; and dedicatory sonnets at the start of books, to signal the author’s patronage network.

There is no Platonic ideal of the sonnet; we live in a world of shadows. But there is the McCarthyian model of the sonnet, the abstraction arising from systematic observations of different manifestations.

Each sonnet has quantifiable features (meter, rhyme, lemmas, n-grams, parts of speech, sentiments) that are more or less typical. You could score each sonnet using these metrics.

Why would we do that? Because modelling sonnets gives broad impressions statistical weight and precision. It adds nuance to subject definitions, and contextualizes famous examples like this one. Shakespeare’s sonnets are well known, so they exert a disproportionate influence on our understanding; this famous example should be one datum among other data.

So it is, in the database I built. My team has transcribed 1,885 English Renaissance sonnets from recent scholarly editions by about twenty authors, including Shakespeare, Sidney, Spenser, Wroth, Donne, and Milton.

We built it because it didn’t exist before. There is no one corpus of multi-author, well-edited poems of a single type with regularized orthography.

We used Renaissance sonnets because this is the era of the first translations and adaptations of Italian models, when the sonnet’s English conventions were set.

  • (This assumes, falsely, that every English-language sonnet should be compared only with other English-language sonnets: I don’t yet know what to do with translations.)

We used recent editions rather than hoovering up 19th-century editions from Project Gutenberg (say) because they are what scholars use, and because we needed regularized orthography. Where there are no modernized texts (e.g. for Mary Wroth), we have tools to modernize them with stand-off markup.

A problem with using editions is that we are still compiling only the famous examples by canonical authors, rather than the ground truth of uncollected and obscure sonnets.

Taking just the English Renaissance, there are a billion words in the EEBO-TCP corpus of printed books to 1700. We’re exploring how machine learning can help us identify sonnets among their 14-line clusters of poetic texts.
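As a rough illustration of the pre-filtering step that would come before any machine learning, here is a minimal sketch in Python. It is not the lab’s pipeline. It assumes, hypothetically, that a text has already been reduced to plain lines with blank lines between poems or stanzas, and the function name `candidate_sonnets` is my own invention. It only finds 14-line clusters; deciding which of them are sonnets is the harder problem.

```python
def candidate_sonnets(text, target_lines=14):
    """Return blank-line-separated blocks with exactly `target_lines` lines.

    Assumption (not from the database): poems or stanzas are separated
    by at least one blank line in the plain-text input.
    """
    blocks = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the block we were collecting.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    # Keep only the clusters of the expected length.
    return [block for block in blocks if len(block) == target_lines]


if __name__ == "__main__":
    # Toy input: one 14-line block followed by one 3-line block.
    sample = "\n".join(f"line {i}" for i in range(14)) + "\n\na\nb\nc\n"
    print(len(candidate_sonnets(sample)))
```

A filter like this would over-collect (any 14-line stanza qualifies) and under-collect (the 12- and 16-line outliers are dropped), which is why a classifier, or an expert, still has to make the final call.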

But an advantage of using editions is that it trusts the experts to define what is, and is not, a sonnet.

(John Donne poses particular problems in this respect, but he’s hardly unique.)

Let’s return to my initial question: “Is the sonnet a form or a genre?” Most readers, including me, accept the formal definition and move on to its generic qualities.

So a better question is whether the sonnet is more than a form. Accept the formal definition as provisional, in order to collect conventional sonnets. As I said, trust the experts that most sonnets are 14 lines long.

Do those formally conventional sonnets resemble poems in other forms?

Collect them, analyze them, and measure whether other poems resemble them: the outliers, the 10- and 12- and 16- and 25-line poems, or (much later) the prose sonnets.

I haven’t done that yet, but I can see how to do it.

Here’s what I have done. I’ve read sonnets like Wyatt’s and Shakespeare’s and this one, by John Milton. They form my mental model of what kinds of sonnets each poet writes. The database quantifies that model, based on word choices.

Milton began writing his sonnets in the 1620s, at the start of the sonnet’s decline into obsolescence, a decline that lasted until the Romantic poets.

Milton wrote 6 Italian sonnets that feel conventionally Petrarchan, and (as I’ve said before) I don’t yet know how to make multilingual comparisons.

But his 18 English sonnets feel different from those that come before: more topical, with more classical references.

  • So I compared his word-choices to all the words in the corpus. That tells you something about his distance from the typical sonnet, at least using this one measure.
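One way to turn that comparison into a single number is sketched below. The talk does not name a specific measure, so cosine distance over relative lemma frequencies is my assumption here, chosen only because it is simple and common. The function names and the toy lemma lists are hypothetical and are not drawn from the database.

```python
import math
from collections import Counter


def relative_frequencies(lemmas):
    """Map each lemma to its share of all lemma tokens in the list."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    return {lemma: n / total for lemma, n in counts.items()}


def cosine_distance(freq_a, freq_b):
    """Return 1 minus the cosine similarity of two frequency profiles.

    0.0 means identical proportions; 1.0 means no shared lemmas.
    """
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty profile shares nothing with anything.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy lemma lists, invented for illustration only.
    author = ["slaughter", "saint", "love", "rescue"]
    corpus = ["love", "sweet", "beauty", "love", "fair", "desire"]
    distance = cosine_distance(
        relative_frequencies(author), relative_frequencies(corpus)
    )
    print(round(distance, 3))
```

A single distance like this hides which words drive the difference, so it works best alongside the lists of distinctive lemmas rather than in place of them.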

I’m not going to show you dynamic visualizations of those distances, using a list of metrics. That’s just not how literary critics read. We read a sonnet, and compare it to our mental model of all sonnets, and we build a mental model of that author’s sonnets.

It’s not just that we value words; it’s that we value exceptions to the rule over conventional rule-followers. That’s what makes a text interesting or, well, conventional.

When I read Milton’s English sonnets, I want to answer this question: are they typical or atypical?

Here is what we can see. These are the lemmas that Milton uses, that no one else does. (Ignore the numbers: we had duplicates, and the numbers are irrelevant anyway.)

There’s a predominance of proper nouns: people, places, references; and a lot of dramatic violence: maw (mouth), slaughter, rescue.

But even more illuminating are these lemmas that others use, that Milton never does. (Here the numbers of each instance are more reliable and relevant.)

Notice how Milton never uses words that seem associated with the syrupy love-sickness that we sometimes see in Elizabethan sonnets: sweet, beauty, desire, fair, pleasure. Nor with Petrarchan suffering: die, flame, pain, alas, ill, burn.
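Both lists amount to set differences over lemma counts. The sketch below shows the idea in Python; it is a simplified illustration, not the code behind the slides. It assumes the sonnets have already been lemmatized into plain lists of strings. The function name `distinctive_lemmas` is hypothetical, and the toy counts are invented, although the words themselves come from the lists above.

```python
from collections import Counter


def distinctive_lemmas(author_lemmas, other_lemmas):
    """Split lemma counts into two contrasting dictionaries.

    Returns (only_author, never_author):
      only_author  - lemmas the author uses that no one else does
      never_author - lemmas others use that the author never does
    """
    author_counts = Counter(author_lemmas)
    other_counts = Counter(other_lemmas)
    only_author = {
        lemma: n for lemma, n in author_counts.items()
        if lemma not in other_counts
    }
    never_author = {
        lemma: n for lemma, n in other_counts.items()
        if lemma not in author_counts
    }
    return only_author, never_author


if __name__ == "__main__":
    # Toy data: the counts are made up for illustration.
    milton = ["maw", "slaughter", "rescue", "love"]
    others = ["sweet", "beauty", "sweet", "love", "desire"]
    only_milton, never_milton = distinctive_lemmas(milton, others)
    print(sorted(only_milton))
    print(sorted(never_milton))
```

Counting duplicates of the same sonnet would inflate these numbers, which is the problem with the first list: deduplicating the corpus before counting is a precondition for trusting the frequencies.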

This is a glimpse of what the database enables, at present. Using the API in combination with natural-language processing, users can quantify the features of its sonnets.

It has miles to go before it is as thorough, as consistent, and as well-documented as it needs to be; and if its limitations are evident from my talk, I can only beg for your constructive advice on how to address them.
