A few months ago in this space, I wrote about choosing novels to teach in a graduate course I’m offering this fall. I was convinced that novels were necessary, both because it’s a course on digital text analysis (among other topics in the digital humanities) and because my exemplary critics, Stephen Ramsay and Matthew Jockers (required reading for the course), focus on 19th- and 20th-century novels in their work.
Now I’m refocusing on two text types that are (arguably) extra-novelistic, at least in form: Samuel Pepys’s Diary, a daily record of his life between 1660 and 1669; and an anthology of sonnets, those 14-line poems made famous by Shakespeare, Wordsworth, and company.
The whole purpose of this course is to think about how close reading by human readers and ‘distant reading’ by machine readers are complementary, not in conflict, as they’re often positioned. There’s no reason to think that my close reading of the language of Milton’s sonnets (say) is invalidated or otherwise threatened by a machine’s algorithmic analysis of the same language. Rather, my reading is augmented by these tools, which we might think of as a super-fast concordance, only much more powerful.
The choice of Pepys’s long text (1000+ pages even in this abridged version) and a single poetic form stretching across multiple centuries will give us two ways of comparing the close and the distant, of seeing the immediate text in the wider scale of literary history. But they’ll also pose new challenges, such as how to filter and prepare the texts that will fill out the wider-scale corpora that we need to compare them with. There’s also the task of assembling a complete Pepys from the texts on Project Gutenberg, which look to be fragmented by year.
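For what it's worth, stitching year-by-year fragments into a single text could be sketched roughly as follows in Python. This is only an illustration under assumptions: the file names and placeholder contents are hypothetical, not the actual Project Gutenberg files, and I'm assuming each fragment's name contains its four-digit year.

```python
import re


def assemble_diary(parts):
    """Join per-year fragments into one chronologically ordered text.

    `parts` maps a (hypothetical) file name containing a four-digit
    year to the text of that fragment.
    """
    def year_of(name):
        # Assumption: the year appears somewhere in the file name.
        match = re.search(r"1[0-9]{3}", name)
        if match is None:
            raise ValueError(f"no year found in {name!r}")
        return int(match.group(0))

    # Sort by the year in the name, not alphabetically by file name.
    ordered = sorted(parts, key=year_of)
    return "\n\n".join(parts[name].strip() for name in ordered)


# Placeholder fragments standing in for the real downloaded files.
fragments = {
    "pepys_1661.txt": "[entries for 1661]",
    "pepys_1660.txt": "[entries for 1660]",
}
print(assemble_diary(fragments))
```

In practice each fragment would also need its Gutenberg header and footer stripped before joining, which this sketch leaves out.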
So we’ll need well-edited and complete electronic texts of various kinds, including Pepys’s complete text (ideally) and a wide array of sonnets. There are at least three problems I can think of:
1 / Digital editing
We’ll need electronic texts of sonnet collections, and of Pepys’s Diary. For Pepys right now I’m using Project Gutenberg, but there must be a better way. We have some subscription-based resources on campus, like Literature Online, but it has no Pepys, for example.
2 / Corpus-building
We’ll need to compare these editions to reliable electronic texts, with sufficient metadata to make the kinds of distinctions that will make them useful. For instance, we might want just prose from the 17th century, or all the poems that are sonnets. I need to build text corpora that are as large as we can make them, but also as flexible as possible (i.e. they can be filtered by date or other criteria). One of our assignments, by the way, might be for the students to do some of this data-curation and compilation work.
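To make "flexible" a little more concrete, here is a minimal sketch of what filtering a corpus by metadata might look like in Python. The record fields (title, year, form) and the sample entries are my own assumptions for illustration, not an existing corpus or schema.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """One text plus the metadata needed to filter it (assumed fields)."""
    title: str
    year: int
    form: str  # e.g. "prose" or "sonnet"
    body: str


def filter_corpus(corpus, form=None, start=None, end=None):
    """Return the records matching every criterion that was supplied."""
    selected = []
    for record in corpus:
        if form is not None and record.form != form:
            continue
        if start is not None and record.year < start:
            continue
        if end is not None and record.year > end:
            continue
        selected.append(record)
    return selected


# Hypothetical placeholder records, not real corpus entries.
corpus = [
    TextRecord("Text A", 1665, "prose", "[body]"),
    TextRecord("Text B", 1673, "sonnet", "[body]"),
    TextRecord("Text C", 1807, "sonnet", "[body]"),
]

# Just prose from the 17th century, then all the sonnets.
prose_17c = filter_corpus(corpus, form="prose", start=1601, end=1700)
sonnets = filter_corpus(corpus, form="sonnet")
print([r.title for r in prose_17c], [r.title for r in sonnets])
```

The point of the sketch is that the filtering is only as good as the metadata: a text with no reliable date or form label can't be selected this way at all.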
3 / Programming
Finally, of course, we’ll need to run various processes on these texts, like topic modelling or natural-language processing. We need local experts in programming languages like R and Python, preferably with some familiarity with Mallet and Gephi.
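As a very modest example of the kind of process I mean, here is a sketch of counting word frequencies in Python, which is the sort of basic step that comes before anything like topic modelling. It uses only the standard library, not Mallet or any NLP toolkit, and the crude tokenizer and the tiny stopword list are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    # Crude assumption: only a-z and the straight apostrophe count as
    # word characters; real texts would need more careful handling.
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n=10, stopwords=frozenset()):
    """Return the n most frequent words, skipping any stopwords."""
    counts = Counter(w for w in tokenize(text) if w not in stopwords)
    return counts.most_common(n)


# Placeholder sentence, not a quotation from any of the course texts.
sample = "The reader and the machine read the same words"
print(top_words(sample, n=3, stopwords=frozenset({"and"})))
```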