NLP for Literary Critics: An Introduction and Tutorial

Preface: Knowledge and Information

Shall I compare thee, human, to a machine? Thou art more critical and more intemperate (Shakespeare, Sonnet 18).

But seriously: how do human readers compare to machines? I ask because I want to define how literary critics can use machines to augment and extend our readings. Figuring that out depends on an understanding of how our readings compare to the machine’s abilities. Sure, they’re faster: but faster at what, exactly?

Literary critics read words, and notice features of those words in relation to other words. Then we make arguments about those features. Those arguments formulate and posit our knowledge about word sequences, which we call poems or stories or novels or plays or epics or whatever.

Whereas computers treat words like any other string of characters. They ‘notice’ only what we tell them to notice: what letters are in the words; what other words they rhyme with; what proximity words have to each other; words’ relative frequency in a long text; words’ parts of speech (noun, verb, etc.) or other metadata; and so on.

What resemblance does a critic’s knowledge have to that information? The key to answering that question lies in its terminology: what debt does knowledge owe to information? You can assemble all the facts about a subject, but knowledge comes from sorting and filtering and synthesizing facts, and then articulating claims that are more than the sum of their facts. Knowledge transmutes the information on which it relies.

Program or be Programmed

Among the ‘ten commandments’ of Douglas Rushkoff’s Program or be Programmed is its closing injunction for readers to “access … the control panel of civilization” by writing our own computer programs. Why? Not to train ourselves to work in Silicon Valley, but to take more control of our digital lives: to create, rather than (only) to consume.

Rushkoff’s control-panel analogy is good, but here’s a better one. The moment you shift from using tools to writing programs is like the moment you cook your first meal. It’s more customizable, for one; no more must you tolerate other people’s ingredients. But it also forces you to eat your own mistakes. And in the beginning, you make a lot of mistakes.

I spent last week at the Digital Humanities Summer Institute, at the University of Victoria, learning to program and making my share of mistakes. Specifically, I learned how to use the Natural Language Toolkit, or NLTK. It’s a Python package (bear with me) that enables you to work with texts written in non-computer (‘natural’) languages like English, Swahili, or Portuguese. I learned how to take a text and slice it into a list of words, and then to do things with those words: counting, sorting, categorizing, comparing, transforming, substituting, and seeing how they’re arranged relative to each other.

My instructor for the course was Aaron Mauro, director of the Penn State Digital Humanities Lab and Certified NLTK Plenipotentiary. (Ok, I made up that second title.) A gifted and patient teacher, Aaron is the ideal guide to thorny syntax and other impediments for novice programmers like me. He taught me what the toolkit could do, and where I could take my coding practice: into the information-gathering that both informs and challenges my knowledge. It informs my sense of particular texts, yet it challenges what I thought I knew about them.

I’ve also benefitted enormously, and directly, from the advice and guidance of my student Josh Harkema at the University of Calgary. He reviewed all of my code in this post, and made it work when I couldn’t. Needless to say, all the errors and bad code that follow are my own.

That’s enough talking and paying fealty. Let me share some of what I’ve learned. What follows is my introductory tutorial to working with language in Python and the NLTK.

Text Transformations

Start with some basic terminology. The text you’re reading right now is split into paragraphs, sentences, and words. Those are human labels. But programs don’t treat text that way; they treat it as ‘strings’ — or series of letters and spaces and punctuation and other characters.

Take this pangram (or sentence with every letter of the alphabet), which I prefer to the one about the dog and the fox:

Jackdaws love my big sphinx of quartz

In Python you can easily turn a sentence like that into a string, and then change the string’s characters. For instance, here’s how you change them all to uppercase:
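A minimal sketch of that one-liner (Python's built-in .upper() method does the work):

```python
# Print the pangram with every character converted to uppercase
print("Jackdaws love my big sphinx of quartz".upper())
# JACKDAWS LOVE MY BIG SPHINX OF QUARTZ
```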

One line of code and hey presto, you don’t have to retype it in uppercase! But there’s an even better way to do it; instead of writing the sentence in every print() statement, define a variable called pangramString, and do all sorts of things with it, one after another:
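Something like this, say (the particular string methods below are just examples of the "all sorts of things" you can do):

```python
# Store the sentence once in a variable, then reuse it as often as we like
pangramString = "Jackdaws love my big sphinx of quartz"

print(pangramString.upper())   # ALL UPPERCASE
print(pangramString.lower())   # all lowercase
print(pangramString.title())   # Every Word Capitalized
```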

If we really want to have some fun (and why else are we programming, amiright?), let's define a new variable called swapString with mixed upper- and lower-case characters and then swap them all, in an instant:
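A sketch of that swap (the mixed-case string I start from is just an example; .swapcase() does the flipping):

```python
# Define a deliberately mixed-case string, then flip the case of every character
swapString = "jACKDAWS LOVE MY BIG SPHINX OF QUARTZ"
print(swapString.swapcase())
# Jackdaws love my big sphinx of quartz
```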

Good times. The use cases for that text transformation might be hard for human readers to imagine, but it’s an illustration of the simplest kinds of transformations machines can do.

Lists and List Comprehensions

Let’s get more sophisticated, then, with some transformations of the contents of a sentence like our pangram. Consider something simple-sounding like a find-and-replace. Say I’ve misspelled ‘quartz’ as ‘quarts’ and my pangram no longer contains a Z character. What would I do?

For that, you need a list. Why? Because strings are immutable; that means you can't change their contents once you've created them. So you can't ask the machine to swap 'quarts' for 'quartz' in pangramString.

But you can do it if you convert the string into a list, which is mutable: so you can change its contents after you’ve created it.

A list is a series of items (strings, in our case) kept in a defined order. In our pangram, the order of words is important; our sentence would be nonsensical if its words were jumbled. Instead of treating the whole sentence as one string, I'll make it a list of seven strings in order:
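In code, that looks roughly like this:

```python
# The pangram as a list of seven strings, with 'quartz' misspelled as 'quarts'
pangramList = ["Jackdaws", "love", "my", "big", "sphinx", "of", "quarts"]
print(pangramList)
```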

Again, this is the version with the misspelled ‘quartz’ at the end. How do we change that? It’ll take two steps. First we remove ‘quarts’ and then we append ‘quartz’, like this:
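A minimal sketch of those two steps. It works neatly here because the misspelled word happens to be the last one, so appending the correction keeps the word order intact:

```python
# Step 1: remove the misspelled word from the list
pangramList.remove("quarts")

# Step 2: add the correctly spelled word to the end of the list
pangramList.append("quartz")

print(pangramList)
# ['Jackdaws', 'love', 'my', 'big', 'sphinx', 'of', 'quartz']
```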

Let’s say now we want to sort those strings in alphabetical order. We convert them all to lowercase, so Python will treat them as equals, and then we sort them. Like this:

The first step here, the conversion to lowercase, is called a list comprehension. These are compact ways to create new lists, in this case from an old list: we make pangramListLower from pangramList. The [square brackets] around [words.lower() for words in pangramList] follow the pattern newList = [string.change() for string in oldList].

But list comprehensions get really powerful when they filter the lists, not just rearrange them. If you have an enormous list of ten thousand or a hundred thousand words, like a novel, you will want to see only parts of it — that is, only the strings in the list that interest you. List comprehensions will filter the list using your criteria.

For now, we’ll stick to our pangram list. Go back to the lowercase version, and start by reminding ourselves what’s in it.

The second step here is a list comprehension that includes an if statement: that is, a condition. If the length (or number of characters) of words is greater than two, then include them in the new variable pangramLongWords. If it's two or fewer, don't. Another way to say this is newList = [string.change() for string in oldList if conditionMet].

Down the NLTK Rabbit-Hole

Everything I’ve shown you so far requires nothing more than elementary Python; that’s how I make it look easy. So now let’s open the NLTK toolbox and see what we can do.

Let’s extend the capabilities we’ve seen so far: turning strings to lowercase; sorting them by frequency; and filtering them. We’ll add punctuation removal, just to clean up the text. And then we’ll use two tools from the NLTK toolbox: tokenizing and lemmatizing.

Getting started requires a series of import statements and download functions that get our data and software set up to run.
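At a minimum, that setup looks something like this (these two downloads cover everything used in this post; other NLTK tools need other downloads):

```python
import nltk

# Fetch the corpora this tutorial relies on: NLTK's sample of Project
# Gutenberg texts, and its list of English stopwords (used further down)
nltk.download('gutenberg')
nltk.download('stopwords')
```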

Now we import the Project Gutenberg corpus, which contains eighteen plain-text files to start working with right away. (For present purposes, that's all we'll do; working with our own files is for another day.)
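A quick sketch of the import, plus a peek at the file names it gives us:

```python
from nltk.corpus import gutenberg

# List the plain-text files bundled with NLTK's Gutenberg sample
print(gutenberg.fileids())
```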

We’ll start with Lewis Carroll’s Alice in Wonderland. A quick inquiry tells us that there are 34,110 items in this list, and we can peek inside by printing the first 50 of them:

Right away, we can see some features of these list items. The first is that they are all strings: whether upper- or lower-case, chapter titles or dates, words or punctuation. That’s fine when we humans read a file; we know to treat “CHAPTER” differently from “sister” differently from “1865” differently from “.”. But the machine doesn’t know that (and why would it?). So we’ll have to tidy things up if we want good results.

The thing is, though, alice is not a list; it's actually a corpus 'view', an NLTK object that streams the file from disk rather than holding it all in memory. You can determine this by asking Python what type of variable it is:
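Like so (the exact class name may vary between NLTK versions, but it will be a corpus view, not a plain list):

```python
print(type(alice))
# <class 'nltk.corpus.reader.util.StreamBackedCorpusView'>
```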

The details don’t matter to us, but we need to know that while it behaves like a list (of tokens, FWIW), it’s not. But that’s okay; we have functions to run on it that will make it into a list.

Now we’ll remove punctuation from alice. Start by declaring a new variable, alice_cleaned, and giving it an empty ‘sentinel value’ — which ensures that the for-loop we start here won’t run forever.

Next we import punctuation, a ready-made string of punctuation characters from Python's built-in string module, and then run a for-loop that skips those unwanted strings when building our new list. If we print the first 100 strings in alice_cleaned, we can confirm that it's worked:
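A minimal sketch of the whole step, from the empty list to the confirmation print:

```python
from string import punctuation

# Build a new list containing every token from alice except bare punctuation
alice_cleaned = []
for word in alice:
    if word not in punctuation:
        alice_cleaned.append(word)

# Confirm it worked by printing the first 100 strings
print(alice_cleaned[:100])
```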


Now we convert alice_cleaned to lowercase, using a list comprehension:
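In one line, plus a quick peek at the result:

```python
# Another list comprehension: lowercase every string in alice_cleaned
alice_lowered = [word.lower() for word in alice_cleaned]
print(alice_lowered[:50])
```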

The new variable, alice_lowered, is nearly ready to sort its strings by frequency. First, though, we need to filter out all of its stopwords: like “the” and “and” and “or” and “he” and “by” and … you get the idea: all of those articles and pronouns and prepositions that appear everywhere in a text, but arguably just connect more significant words together.

Fortunately, there’s an NLTK function for removing stopwords:

In the first code block I’ve added two more stopwords (‘said’ and ‘alice’) and six punctuation-clusters because I found they vastly outnumbered the other words in the text, throwing off the insights we now gain. If I were a better programmer I would find a way to remove them all, but I’m not. You can see in the new variable alice_unstopped that there are still a few here and there, but they’re infrequent enough to ignore:
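For instance:

```python
# Peek at the filtered list; a few stray clusters survive, but not many
print(alice_unstopped[:50])
```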

And so we come to our long-awaited insights. What are the top ten words repeated in Alice in Wonderland?
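One way to answer that is NLTK's FreqDist class (the variable name alice_freq is my own):

```python
# Count every remaining string and report the ten most frequent
alice_freq = nltk.FreqDist(alice_unstopped)
print(alice_freq.most_common(10))
```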

Unsurprisingly, once 'said' and 'alice' are set aside, 'little' appears more than any other word in this story about growing and shrinking; 'know' and 'thought' also reflect its preoccupation with knowing and thinking.

Very poetic, but how does it look on a graph? NLTK has a nice built-in function for that, too:
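FreqDist objects have a built-in plot() method that draws the counts with matplotlib (plotting the top ten here is my choice; any number works):

```python
# Draw a frequency plot of the ten most common words
alice_freq.plot(10)
```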

And there you have it. In future posts, I’ll reflect further on the next steps I want to take — namely, working with my own files rather than just the gutenberg corpus; and extending my NLTK skills to part-of-speech tagging; rhetorical-figure detecting (which I’ve done before); and beyond.
