Data Curation in the Networked Humanities

Between now and 2015, I’m working to improve the automated encoding of early modern English texts, to enable text analysis.

In October and November 2012 I delivered two papers to launch Encoding Shakespeare, a project funded by the Social Sciences and Humanities Research Council of Canada. The first was presented to Daniel Paul O’Donnell’s Digital Humanities students at the University of Lethbridge (Alberta). The second was delivered to the London Seminar in Digital Text and Scholarship at the School of Advanced Study, University of London. (For hosting/inviting me, thanks to Willard McCarty, Claire Warwick, and Andrew Prescott; and to Dan O’Donnell.)

Here is the video of the London paper:

An audio version is also available.

Here is the text of the London paper, with slides:

Data Curation 001

My goal in this talk is to posit my project’s ultimate goals — and to ask you if my protocols and platform for encoding Shakespeare will help me reach them. I come in the spirit of open-source research, to speak frankly about my doubts and my ignorance about the domains of computational/corpus linguistics and programming that are necessary to my success. I come to ask you what I should address, and how, to answer my questions — and beyond them, to ask you which questions I have forgotten, both in transit, and in setting my destination.

This paper has three parts. First I address the potential of algorithmic text analysis; then the problem of messy data; and finally the protocols for a networked-humanities data curation system.

This third part is the most tentative, as of this writing; Fall 2012 is about defining my protocols and identifying which tags the most text-analysis engines require for the best results — whatever that entails. (So I welcome your comments and resource links.)

But I’ll begin with a preface to set out the queries I’d like my research ultimately to enable — queries that begin with a process that rectifies the problem Charles Babbage once identified, that “machinery has been taught arithmetic instead of poetry.”

I / The Potential

<a> Tools

What does it mean to be a literary scholar in 2012? For one thing, it means getting used to the idea that new tools and new digital surrogates at our disposal enable new modes of scholarship, modes that are only possible in 2012. We can formulate new questions, submit new queries, search both new texts and new digital surrogates for old texts. We can test hypotheses, and create new ones, by applying digital tools like Wordseer, Voyant, and TAPoR to digital texts.

These tools are telescopes for the mind, in Willard McCarty’s homage to Margaret Masterman: “extending our perceptual scope and reach,” “transform[ing] our conception of the human world just as in the seventeenth century the optical telescope set in motion a fundamental rethink of our relation to the physical one” [from Debates in the Digital Humanities, 113]. Yet as Andrew Prescott’s “Electric Current of the Imagination” address reminds us, today’s radio telescopes gather more data in a day than the world’s national libraries contain.

Data Curation 002

Nevertheless, we look through our telescopes at texts as quantitative objects, which can be unfamiliar to the traditionally-trained humanist. McCarty (again) reminds us of “the fundamental dependence of any computing system on an explicit, delimited conception of the world or ‘model’ of it” [cit. Ramsay and Rockwell, Debates in the Digital Humanities, 81].

The models we examine are those familiar to computational and statistical linguists, but can be unfamiliar to traditionally-minded literary critics — who tend to consider all linguistics as “the operating system of language,” as “pre-interpretive and pre-critical,” as a means to higher-order interpretation (Samuels & McGann: 1999).

By ‘quantitative objects’ I mean not just a series of zeros and ones, but a dataset of words that can be numbered, listed, visualized, and otherwise rearranged or deformed by quantitative criteria: linguistic categories, frequencies, relationships/contexts, tenses, speakers, &c. (How easily quantities slip into qualities is another topic I’ve explored elsewhere.) These deformations disrupt our usual progress through texts, revealing relationships between words and other addressable components that may be widely dispersed.

Data Curation 003

‘Addressable’ is Michael Witmore’s term. It means “one can query a position within the text at a certain level of abstraction”: genre, individual lines of print, parts of speech. Statistical text-analysis is simply an automated mode of address: recognizing what the philosopher Quentin Meillassoux, whom Witmore quotes, calls “those aspects of the object that can be formulated in mathematical terms.” These mathematical aspects, Witmore continues, “can be meaningfully conceived as properties of the object in itself.” Pairing massive addressability with future tokenizations and new statistical procedures, Witmore concludes that “something that is arguably true now about a collection of texts can only be known in the future.”

To read a text is to initiate these addresses. As Witmore writes, “[R]eading might be described as the continual redisposition of levels of address.” To read critically is to gather quantities of evidence from texts that sort into our qualitative / interpretive categories. Algorithms accelerate this gathering by simplifying it. We can use algorithmic text-analysis on linguistically encoded texts to reveal evidence for new interpretations, to confirm or deny the “hunches” with which literary scholars often begin their research (Rockwell: 2003).

Data Curation 004

What kind of hunches? When we quantify our qualitative inquiries, we can make counter-intuitive arguments. Consider Shakespeare’s use of genre. Jonathan Hope and Witmore have recently used DocuScope’s statistical procedure of principal component analysis (PCA) to identify 101 language action types (or LATs) in Shakespeare’s plays: uncertainty, disclosure, fear, sadness, reassurance, confrontation, question, denial, aside, and so on. (Martin Mueller describes DocuScope as “a very large dictionary of short phrases or grammatical patterns that are mapped to a taxonomy of about 100 micro-rhetorical acts.”) These 101 LATs are grouped into 17 clusters comprising 51 dimensions. As the documentation puts it, “the First Person [Options] cluster conveys the perspective of a unique entity looking out on the world from the inside.” Its four LATs are First Person, Self-Disclosure, Self-Reluctance, and Autobiography. Here in the opening 1000 words of Shakespeare’s Richard III, this cluster is highlighted in red: words and phrases like “I am”, “my”, “me”, “I’ll”, and “myself”.
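For readers who want the mechanics, here is a minimal sketch of the statistical step: principal component analysis over per-play counts of language action types. It is not DocuScope itself; the LAT names and the counts below are invented for illustration, and scikit-learn’s PCA stands in for the procedure Hope and Witmore actually ran.

```python
# A toy PCA over invented per-play LAT frequencies -- not DocuScope's output.
import numpy as np
from sklearn.decomposition import PCA

lat_names = ["FirstPerson", "SelfDisclosure", "Uncertainty", "Denial"]  # invented
plays = ["Richard III", "Twelfth Night", "Othello"]
counts = np.array([          # hypothetical LAT frequencies per 1,000 words
    [42.0, 11.0, 8.0, 5.0],
    [35.0, 14.0, 6.0, 3.0],
    [39.0, 16.0, 9.0, 7.0],
])

# Project the LAT space onto two principal components, the kind of reduction
# behind the genre scatterplots discussed below.
pca = PCA(n_components=2)
coords = pca.fit_transform(counts)

for play, (x, y) in zip(plays, coords):
    print(f"{play}: PC1={x:.2f}, PC2={y:.2f}")
print("explained variance:", pca.explained_variance_ratio_)
```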

Data Curation 005

By identifying both significant and consistent differences between Shakespeare’s comedies and histories, Hope and Witmore have made “genre visible on the level of the sentence,” a technique that led them to conclude that Othello is, linguistically at least, more comic than tragic.

We are looking at two images from their 2010 Shakespeare Quarterly essay. Compare these scatterplots of the LATs in Shakespeare’s four principal genres and it’s clear that Othello (the blue dots in the right graph) trends toward the upper-right quadrant, as the comedies do (the red dots in the left graph), while the tragedies (the orange dots in the left graph), the genre to which this play’s narrative and centuries of critical consensus assign it, tend toward the upper-left.

What gives a text like Othello its genre? Is it the narrative that unfolds, and its resemblance to other narratives we assign the labels ‘tragedy’ or ‘comedy’? Resemblance is essential to categorization. But resemblance to what? Hope and Witmore offer quantitative evidence that these linguistic categories offer at least another dimension to a text beyond its narrative. “[D]igitally based research in the humanities expands the possibilities of iterative comparison,” they write, “because it is more productively indifferent to linear reading and the powerful directionality of human attention.”

Data Curation 006

Statistical tools can also reconfirm the qualitative judgements of linear reading: that’s one implication of Hope and Witmore’s phrase “productively indifferent.” Say you’ve read a number of Shakespearean comedies and couldn’t help but notice that most of them end in marriages. And so you formulate a broad theory of comedy: “The theme of the comic is the integration of society, which usually takes the form of incorporating a central character into it.” (‘You’, did I mention, are Northrop Frye, writing your seminal Anatomy of Criticism in 1957.) You notice how many characters assert their desires to unite with other characters. That’s one step along the path from conflict to reconciliation that comedies trace, with marriages serving the ultimate end of social harmony.

How, for instance, do characters shift (in the course of a given comedy) from using singular pronouns (I/me/you/he/she &c.) to plural ones (we/they/their &c.)? Jonathan Hope has written about this:

Shakespearean comedies typically involve people arguing about things, striving to arrive at a ‘we’ of agreement, but not being able to until the final scene.

And in an excellent follow-up post, Anupam Basu has charted this quantitatively. [On slide.]

So it is possible to quantify the evidence of emerging communal identities in comedies — by considering all early modern plays; isolating the comedies; and addressing evidence like first-person pronouns. You (the computer) count up these instances and then you (the critic) write a nuanced, qualitative argument from quantitative data.
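A minimal sketch of that counting step, assuming nothing more than plain text for each scene (the scene snippets below are invented stand-ins, and the pronoun lists are deliberately short):

```python
# Tally singular versus plural pronouns scene by scene in a (toy) comedy.
import re
from collections import Counter

SINGULAR = {"i", "me", "my", "mine", "thou", "thee", "he", "she"}
PLURAL = {"we", "us", "our", "ours", "they", "them", "their"}

def pronoun_profile(text):
    words = re.findall(r"[a-z']+", text.lower())
    tally = Counter(w for w in words if w in SINGULAR | PLURAL)
    return sum(tally[w] for w in SINGULAR), sum(tally[w] for w in PLURAL)

scenes = {   # invented snippets standing in for full scene texts
    "1.1": "I will not yield; my cause is mine alone.",
    "5.1": "We are one; our quarrels end, and they shall feast with us.",
}
for scene, text in scenes.items():
    singular, plural = pronoun_profile(text)
    print(f"{scene}: singular={singular}, plural={plural}")
```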

Let me raise one concern about all of this, however. Surely digital humanists are doing more than retreading old ground — by quantitatively “proving” or “testing” the qualitative claims of past criticism. Frye made his assertion in 1957, and it would be a poor measure of our discipline’s progress to say that by 2012 we could only reaffirm his claim by pointing to a chart.

And equally, how are we moving beyond search, however comprehensive, when we find and compare every instance of a word and its synonyms? There is certainly value in the movement from a constellation of instances or citations to a machine-enabled aggregation of every instance of a word and its cognates, say, or of a stanzaic and metrical pattern, or of a topic. I can foresee that a new history of the sonnet could be possible, for instance, with these tools. Each of these queries begins with local readings and extends outward; even Frye must have begun with a single comedy (say, A Midsummer Night’s Dream) before testing his theory in other comedies and ultimately applying it to all comedies. We are only broadening our circumference when we use enhanced search tools.

But let’s look instead to the methods of a scholarship that originates in algorithmic deformations, in (say) heat maps that visualize topics over the careers of a given set of writers. For argument’s sake, make them few enough to fit legibly on a single screen: say, World War I poets, or Jacobean clergymen. Sort those topics by a range of criteria: the writer’s age, the text’s length or its form, or its medium of delivery (print, manuscript, oral recitation). It’s not hard to imagine that in these visualizations we would ‘see’ or recognize some quality that linear readings would not find.

Stephen Ramsay (@sramsay), in Reading Machines, has written about this method, reasserting that deformations and quantifications can be the source of hypotheses and interpretations — beyond their function of testing or substantiating the hunches we get from linear readings. These two functions, of provocation and of expansive rearrangement, should always be in productive tension.

Poetic Constraints on Ordinary Language

So. What kinds of research queries do I aim to provoke or enable by encoding Shakespeare, as a training set for algorithms to grasp early modern language use? As I say, I am here to pose questions that lead to my destination, and help me to clarify what that destination is.

Begin with the premise that poetry distinguishes itself from other kinds of language use by adhering to a given set of constraints: verse forms, meter, rhyme, syntax, and other conventions combine in various ways to distinguish poetry from “ordinary” language, however you define that. These constraints also help readers distinguish poetry from sermons or contracts or other forms of equally constrained language. (Let me clarify that I am referring to early modern poetry, which predates more difficult innovations like found poetry or free verse.) I said these constraints “combine in various ways” to account for the characteristic differences of generic convention, stylistic preference, textual influence or (even) adaptation, linguistic register, subject matter, and so on, in an effectively infinite list. For instance, John Milton’s Latinate syntax often places verbs at the ends of sentences in Paradise Lost both because he is writing an epic, and because of his education.

So if conventions, combined in various ways, make poetry’s language poetic, are these conventions that a machine could recognize and disentangle? And if so, could we quantify what makes Paradise Lost a (1) Miltonic (2) epic written in (3) Latinate (4) blank-verse (5) pentameter? Each of these five descriptive features is an opportune problem for current and future algorithms, and these are just the five I can come up with. So if we isolated these features and overlaid a filter or lens on Milton’s text, we could visualize and quantify what distinguishes his poetry from (say) Spenser’s; and his poetry from (say) his prose pamphlets; and so on.

This function is predicated on a definition of poetry as language constrained or confined by conventions, many of which may defy algorithms for some time. But if Hope and Witmore can assert a linguistic model for genre, can claims about textual influence or adaptation be far behind?

The point is to go beyond what is obvious to a competent critic, to automate base-level detection (“these lines are blank verse”) so that we can set the algorithms loose on texts we have not read. And to identify features that are extensible beyond poetry, like rhetorical figures. What are these figures, from anadiplosis to brachylogia to chiasmus all the way to zeugma, but linguistic patterns and aberrations from ordinary usage? Give a machine enough “ordinary” text and it can flag and catalogue the repetitions, ellipses, and other departures from that baseline — and it can do so in any contemporary string of verse or prose.
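To make that concrete, here is a rough sketch of figure-detection as pattern-matching: a naive flag for possible anadiplosis (the last word of one clause repeated at the head of the next). It is a deliberately crude heuristic, not a claim about how a production system would work.

```python
# Flag possible anadiplosis: a clause's final word reappearing at the start
# (here, within the first two words) of the following clause.
import re

def possible_anadiplosis(text):
    clauses = [c.strip() for c in re.split(r"[,;:.!?]", text) if c.strip()]
    hits = []
    for first, second in zip(clauses, clauses[1:]):
        last_word = first.split()[-1].lower()
        opening = [w.lower() for w in second.split()[:2]]
        if last_word in opening:
            hits.append((first, second))
    return hits

line = ("The love of wicked men converts to fear; that fear to hate, "
        "and hate turns one or both to worthy danger and deserved death")
print(possible_anadiplosis(line))  # flags both links in Richard II's chain
```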

~~

That is what it means to be a literary critic in 2012, or perhaps 2015. You don’t have to do this quantifying if you choose not to, but then you will lack the evidence that turns “linear reading and … directionality” into something less refutable (Hope and Witmore, again).

It means that we can do a new kind of reading: distant reading, or “understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data” (from a New York Times profile of Franco Moretti). It’s also called algorithmic criticism (in Stephen Ramsay’s phrase) or macroanalysis (in Matthew Jockers’). Our training in those “particular texts” gives us the expertise to drill into the data, to identify the broad ideas in narrow evidence, and to reject false positives.

This criticism was possible before 2012, of course, but now we can run algorithms on online text repositories from our laptops. Which brings me to the data itself.

<b> Texts

Data Curation 007

If the capabilities of our tools are enormous, so too are the numbers of digitized texts we can exercise them on. To take just my own field — early modern English literature — the corpus that the Text Creation Partnership is publicly releasing over time (2015 – 2020) is an extraordinary trove of data: searchable transcriptions of some 70,000 texts, or roughly a billion words (Andrew Hardie estimates). They comprise the “Book of English” from the beginning of print in the 1470s to the end of the seventeenth century. Jonathan Hope, at the Renaissance Society of America meeting in 2012, called it “the most important humanities project in the anglophone world in a hundred years.”

Data Curation 008

That term (“Book of English”) is Martin Mueller’s; he has done more for the digital analysis of early modern English texts than any other scholar. He developed the NUPOS tagset for MorphAdorner, which can tag each word’s standard spelling, part of speech (adjective, adverb, conjunction, noun, preposition, pronoun, verb), and lemma (e.g. identifying “hung” as the past participle of “hang”). It is also capable of recognizing some 350,000 spelling variants of early modern English words, a capability harnessed by his text-analysis program WordHoard, which Jonathan Hope recently used to provoke new questions about Hamlet (as my students did earlier this year).

[Aside: You may notice my repeated citations of Mueller’s work. That is because my thinking about the barriers to exploration has been repeatedly shaped by that work, just as my sense of exploration’s potential began with my use of his tools.]

Tools rely not just on a lot of data, but on good data: reliable, accurate, and mostly error-free transcriptions of the texts for which they are surrogates. The TCP’s data is very, very good — but not perfect, as I’ll discuss in a moment. What it needs is data curation, or “managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use” (Lord: 2004, cit. Mueller: 2010). As Mueller wrote in September 2012, “Curation is the servant of exploration.”

II / The Problems

Data Curation 009

There are at least three barriers to distant reading / algorithmic criticism of early modern English texts:

  1. The first is the limitation of primary data — in this case, the TCP-transcribed early modern English texts.
  2. The second is the limitation of metadata, or the linguistic/interpretive encoding overlying the primary data.
  3. The third, and most intractable, is the limited ability of quantifiable data to address qualitative inquiry. There may be something ineffable about literature that can’t be parsed by algorithms alone. I’m going to set this fundamental problem aside, for now — and think about whether it’s a problem of digital inquiry worth addressing, or whether it’s too romantic a worry to be disabling.

I’ll now elaborate on these first two problems in turn.

<a> Primary Data (Texts)

It’s worth reiterating that the TCP’s data is far more enabling than problematic. I’ve repeatedly used the search function to explore the usage of particular words and their cognates. Lateral thinking, citing different occurrences that reveal different valences of a word, gives interpretations weight and subtlety. In my classroom, when a student asks about the word “surfeit” in Doctor Faustus, I cite Orsino’s opening line in Twelfth Night. With the OED and the TCP at hand, a diligent undergraduate can surfeit her own curiosity about both the historical meanings and the usage contexts of particular words.

Data Curation 010

But Mueller’s premise in a recent blog post sums it up well: “the TCP is a magnificent but flawed project.” That’s because its transcriptions of these 70,000 texts are imperfect. How imperfect? Taking early modern drama as a subset, as he does, the median error rate is about 0.5% of individual words:

By my computations the error rate for 284 plays goes all the way from a top of 8.8% for Wealth and Health, a text of 7,645 words from 1554, to 0.01% for 7,619 words of The Old Wives’ Tale from 1590. The median value is 0.5%.

[T]here are 101 texts with an error rate of less than 0.2%. […] On the other hand, there are 90 texts with an error rate of more than 1%[,] and 40 texts with an error rate of 2% or more.

Statistics tell only part of the story, so Mueller offers examples. The broader problem is what he calls “the negative halo effect that spreads from error-ridden pages and undermines trust in the resource.”
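The arithmetic behind such figures is simple enough to sketch; the flagged-token counts below are invented (chosen only to approximate the quoted rates), since the real counts live in Mueller’s data.

```python
# Per-play error rate = flagged tokens / total tokens, plus the corpus median.
from statistics import median

plays = {   # (flagged tokens, total tokens) -- invented approximations
    "Wealth and Health": (673, 7645),
    "The Old Wives' Tale": (1, 7619),
    "Some Other Play": (40, 8000),
}
rates = {title: flagged / total for title, (flagged, total) in plays.items()}
for title, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {rate:.2%}")
print(f"median error rate: {median(rates.values()):.2%}")
```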

[Aside: I need to learn more about the TCP’s own internal protocols for data curation and correction.]

<b> Metadata (Tokens)

Quantifications of qualitative inquiry, like the inquiries into Othello’s genre I mentioned above, depend on thorough and accurate metadata. If we want to analyze all the questions that include proper nouns in The Merry Wives of Windsor, for instance, we need to identify or ‘tag’ every proper noun in the play (e.g. Windsor), and isolate those that appear in the sentences identified as questions. Likewise, if we want to count the neologisms of Hamlet (the character), we need to tag both a set of speeches as Hamlet’s, and a set of words as neologistic.
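Here is a minimal sketch of the kind of query that such metadata makes trivial once it exists. The token records, tag names, and sentence types below are invented for illustration; they are not the TCP’s or MorphAdorner’s actual schema.

```python
# Given word-level and sentence-level metadata, pull proper nouns from questions.
tokens = [   # invented records: word, part-of-speech tag, owning sentence
    {"word": "Windsor",  "pos": "np", "sentence_id": 1},
    {"word": "wives",    "pos": "n",  "sentence_id": 1},
    {"word": "Falstaff", "pos": "np", "sentence_id": 2},
]
sentences = {1: {"type": "question"}, 2: {"type": "statement"}}

proper_nouns_in_questions = [
    t["word"] for t in tokens
    if t["pos"] == "np" and sentences[t["sentence_id"]]["type"] == "question"
]
print(proper_nouns_in_questions)  # ['Windsor']
```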

Data Curation 011

In Mueller’s elegant phrase, our goal is

the algorithmic amenability of the digital surrogate, its capacity for being variously divided or manipulated, combined with other texts for the purposes of cross-corpus analyses, having data derivatives extracted from, or levels of metadata added to it. (my emphasis; full text.)

Data Curation 012

Natural-language processing (NLP) can automate much of this encoding, by identifying grammatical relationships or finding words in a historical dictionary; but its algorithms give far more accurate results for texts composed after 1700, when English spelling, orthography, and usage were more standardized. For example, in the sentence (at left in this slide) “I pawn my credit and mine honour” from Shakespeare’s Henry VI Part 3, the Stanford Parser, an automated part-of-speech tagger trained on modern English, tags “my credit” as possessive [“poss”], but not “mine honour.” [It treats ‘mine’ as a noun: “nn.”] Even the best-trained algorithms have difficulty isolating the proper nouns in sentences that use them interchangeably as nouns and verbs, like this one from Richard II: “Grace me no grace, nor uncle me no uncle.” (I made these discoveries while working with Wordseer, a highly capable tool and visualization engine developed by Aditi Muralidharan at UC Berkeley.)
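The mismatch is easy to reproduce even without the Stanford Parser. Here is a toy stand-in: a naive “modern English” rule for prenominal possessives that, like the parser, has no notion of early modern “mine” before vowel-initial nouns.

```python
# A toy modern-English possessive rule -- not the Stanford Parser -- that
# misses "mine honour" for the same underlying reason: its model of English
# does not include the early modern prenominal "mine"/"thine".
MODERN_POSSESSIVES = {"my", "your", "his", "her", "its", "our", "their"}

def flag_possessives(sentence):
    words = sentence.lower().split()
    return [(w, nxt) for w, nxt in zip(words, words[1:]) if w in MODERN_POSSESSIVES]

print(flag_possessives("I pawn my credit and mine honour"))
# [('my', 'credit')] -- "mine honour" goes unflagged.
```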

III / The Protocols

What?

The distance between early modern and modern English is a large problem, but a tractable one. The solution is to train the NLP algorithms to recognize more linguistic conventions of early modern English than they currently do, so that we can automate the encoding process as much as possible. The Stanford Parser can already handle Chinese, Arabic, and modern English; its training set for the latter was The Wall Street Journal. To extend this to early modern English, I will use Shakespeare’s complete works as a training set.

Data Curation 013

My aim is to achieve for Shakespeare the 95-97% tagging accuracy that the best NLP algorithms can achieve for modern English usage. Rayson et al. have achieved up to 89% tagging accuracy using 1000-word training sets from four of Shakespeare’s comedies, by regularizing and modernizing their spelling, and then manually editing the output of an NLP program. Encoding Shakespeare has proposed to expand the training set to all of Shakespeare’s works. Narrowing the gap between 89% and 97% is achievable with the right protocols, so it’s important to get them right.
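To make the train-and-measure loop concrete, here is a minimal sketch using NLTK’s simple n-gram taggers as stand-ins for the more capable systems the project would actually target. The tagged sentences and the NUPOS-like tag names are invented; the real training set would be a hand-corrected encoding of Shakespeare’s complete works.

```python
# Train a backoff tagger on gold-standard (word, tag) data and measure accuracy
# on held-out sentences. Toy data only; the number it prints is meaningless.
import nltk

tagged_sents = [   # invented gold-standard sentences with NUPOS-like tags
    [("Grace", "vb"), ("me", "pno"), ("no", "dt"), ("grace", "n")],
    [("I", "pns"), ("pawn", "vb"), ("my", "po"), ("credit", "n")],
]
split = max(int(len(tagged_sents) * 0.9), 1)
train, test = tagged_sents[:split], tagged_sents[split:] or tagged_sents

# Bigram tagger backing off to a unigram tagger backing off to a default tag.
t0 = nltk.DefaultTagger("n")
t1 = nltk.UnigramTagger(train, backoff=t0)
t2 = nltk.BigramTagger(train, backoff=t1)

correct = total = 0
for sent in test:
    predictions = t2.tag([word for word, _ in sent])
    for (word, gold), (_, predicted) in zip(sent, predictions):
        correct += gold == predicted
        total += 1
print(f"accuracy on held-out sentences: {correct / total:.0%}")
```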

An important question is exactly which linguistic or other features of Shakespeare’s texts I’m proposing to encode. The answer has two parts: (1) whatever features are readily extensible from Shakespeare’s 865,185 words to the billion words in the TCP’s Book of (early modern) English; and (2) whatever features are the most amenable to text-analysis algorithms now and in the future. How to identify either of these is my most significant barrier right now.

I need to learn more about how to choose a training set, and what to do with that text to make its features scalable to the texts that are my object. Certainly I have proposed to do more than is possible, so the question of how to refine my goals is actually threefold:

  1. Which features are extensible? And to what / to which texts?
  2. Which features do present and future text-analysis tools need to learn from us? That is, from human readers, as domain experts.
  3. Where to begin in the next 2 1/2 years, the first phase of a multi-phase project?

My current thinking about the linguistic constraints on poetry, and its departures from ordinary (contemporary) language, will help to address these three problems. First, narrowing the focus to poetry still offers a good proportion of Shakespeare’s texts — chosen for reasons I’ll explain in a moment. And the linguistic features of poetry are predicated on constraints or limits that may lend themselves more readily to machine reading. Second, Hope and Witmore’s work with Docuscope shows that current algorithms can do a great deal with whole, un-encoded texts — but they can’t (yet) isolate surface-level features like rhetorical figures, meter, or rhyme, let alone more complicated inter-textual features like influence or adaptation. (I may be wrong — I hope so — so please tell me in the comments field if they can, and how I can talk to their developers.) So these are things that the text-analysis tools need to learn from us. The third problem was where to begin in the immediate future, and here I propose the following three steps:

  1. Identify the widest possible range of linguistic features that make poetic language distinct from ordinary language; this will serve as a wish list.
  2. Narrow this list to those few features that are readily quantifiable. For instance, meter is a quantity (of syllables); tone would be more difficult. (See the sketch after this list.)
  3. Choose (from this narrowed list) the features that seem the most scalable, feasible, and necessary. That is:
    1. scalable from Shakespeare to the TCP texts (or rather, TCP poetry);
    2. feasible to be recognized by a machine, and taught using the protocols I’ll outline next; and
    3. necessary to exceed the capabilities of current automated systems, and to enable new processes that spark new queries.
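Here is the promised sketch for the meter example: syllables estimated with a crude vowel-group heuristic, and lines near ten syllables flagged as candidate pentameter. A real system would need far better syllabification (and some handling of elision), so treat this as a statement of the shape of the problem, not a solution.

```python
# Estimate syllables per line and flag rough pentameter candidates.
import re

def estimate_syllables(word):
    groups = re.findall(r"[aeiouy]+", word.lower())   # vowel clusters
    count = len(groups)
    if word.lower().endswith("e") and count > 1:       # very rough silent-e rule
        count -= 1
    return max(count, 1)

def line_syllables(line):
    return sum(estimate_syllables(w) for w in re.findall(r"[a-zA-Z']+", line))

for line in ["Shall I compare thee to a summer's day",
             "Grace me no grace, nor uncle me no uncle"]:
    n = line_syllables(line)
    print(f"{n:2d} syllables | pentameter candidate: {8 <= n <= 12} | {line}")
```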

It all seems very open-ended at present, but that’s the nature of research. Now, on to my protocols.

How?

So what protocols will Encoding Shakespeare follow? How will this project improve the automated encoding of early modern English language? In the final section of my paper, I’ll describe the ideas I’ve had about my protocols, based on the problems I identified with part-of-speech tags in Wordseer’s reliance on the Stanford Parser.

Data Curation 014

Phase 1: Iterative Tweaking for Error Reduction

I have proposed to begin by breaking down the problem of language into constituent problems of words: of spelling, of morphology (word formation), and of grammatical relationships and parts of speech. I proposed this division not only because it makes a large problem addressable, but because it makes the solution a partnership between fast, narrow algorithms and slow, nuanced human readers. I adapted this methodology from recent computational linguistics research, which has shown, for seventeenth-century legal documents, that “an iterative approach to NLP model building, along with the injection of domain-specific knowledge, can produce robust systems for processing non-contemporary corpora and texts” (Sweetnam and Fennell: 2011).

The first step is morphological tagging, using the MorphAdorner program to tag each word’s standard spelling, part of speech, and lemma, as described above. My initial purpose is to encode standard forms, meanings, and dependency relationships. A comparative study has described MorphAdorner as the best tool for tagging parts of speech (Wilkens: 2009).

Manually editing these tags would be very time-consuming if the process were enacted just once, by one reader; I have proposed instead to repeat the process many times and to distribute the editing among online users. I want to balance computational analysis with human readers, a balance that will help improve text encoding until errors (“mine honour”) are vanishingly rare. If MorphAdorner misses Shakespeare using a noun as a verb (e.g. “Destruction straight shall dog them at the heels”), a reader can tag the correct part of speech. When it identifies a standardized spelling (e.g. substitutions of vv for w, or i for j), a reader versed in early modern typographic conventions can readily confirm or correct it. The aim is to make any correction only once: to train the algorithm to recognize words like “dog” as a verb in sentences like this, so that it recognizes the next one (e.g. in “uncle me no uncle”).
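One way to picture the “correct it only once” principle: store each human correction as a word-in-context record and consult those records before trusting the automatic tag next time. This lookup table is only a cartoon of the idea; in practice the corrections would feed back into retraining the statistical tagger itself.

```python
# A cartoon of correction reuse: human fixes are stored per word-in-context
# and override the automatic tag when the same pattern recurs.
corrections = {}   # (word, following word) -> corrected tag

def record_correction(word, following, tag):
    corrections[(word.lower(), following.lower())] = tag

def tag_word(word, following, automatic_tag):
    return corrections.get((word.lower(), following.lower()), automatic_tag)

# A reader corrects "dog" used as a verb in "dog them at the heels"...
record_correction("dog", "them", "verb")

# ...and the correction is reused when the pattern is met again.
print(tag_word("dog", "them", automatic_tag="noun"))    # verb
print(tag_word("dog", "barked", automatic_tag="noun"))  # noun (no correction stored)
```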

Data Curation 015

Phase 2: Crowdsourcing

Even if we limit our dataset to Shakespeare’s complete works, as I will do in Phase 1, the task of codifying his usage and correcting errors is itself susceptible to error. If our human reader misreads “dog” as a noun, or worse, tags it as a variant spelling of “dodge,” the error will proliferate through the algorithm.

The solution is crowdsourcing, or networked human computing: dividing a large task into discrete questions, or a large text into segments; distributing them to a number of readers; and aggregating the results in a way that allows the computer to compare them before accepting them. It’s not as simple as duplicating tasks; as Charles Babbage recognized in the nineteenth century, the system must find different ways of producing the same result (Economist: 2011). So the mistaken “dodge them at the heels” might be rejected after a comparison to “the dogs of war.” I adopt this method from the networked science model of human computation, where it has been used for large problems like galaxy classification and protein folding: problems that require human expertise and judgement to train computers to recognize what we recognize, to think as we do.
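In practice the “compare before accepting” step can be as simple as a vote with a quorum, which is all the sketch below does; real crowdsourcing platforms add reader-reliability weighting and the task-diversification Babbage’s principle suggests.

```python
# Accept a crowdsourced tag only when enough independent readers converge on it.
from collections import Counter

def accept_tag(annotations, minimum=3, agreement=0.75):
    """annotations: tags submitted by different readers for the same token."""
    if len(annotations) < minimum:
        return None                      # not enough independent judgements yet
    tag, votes = Counter(annotations).most_common(1)[0]
    return tag if votes / len(annotations) >= agreement else None

print(accept_tag(["verb", "verb", "verb", "noun"]))  # 'verb'  (3 of 4 agree)
print(accept_tag(["verb", "noun", "noun", "verb"]))  # None    (no consensus)
print(accept_tag(["verb", "verb"]))                  # None    (too few readers)
```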

Encoding Shakespeare has proposed to invite users to examine sections of a text with clearly-visible markup: a sonnet, say, or three sentences from a speech. Users will correct this markup and add their own annotations using an embedded tool. I will investigate tools like CATMA (Computer Aided Textual Markup and Analysis), to find one that is transparent and easy to use, and whose encoding meets Text Encoding Initiative (TEI) and Open Annotation Collaboration specifications so others can use it.
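For a sense of what TEI-conformant word-level markup looks like once corrected, here is a tiny illustration using TEI’s <w> element with lemma and part-of-speech attributes. The attribute values are invented examples, not the output of CATMA or any other particular tool.

```python
# Parse a scrap of TEI-style word-level markup and read off lemma and POS.
import xml.etree.ElementTree as ET

snippet = """
<l xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="hang" pos="vvn">hung</w>
  <w lemma="be" pos="vbd">was</w>
</l>
"""
root = ET.fromstring(snippet)
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
for w in root.findall("tei:w", ns):
    print(w.text, w.get("lemma"), w.get("pos"))
```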

Many users will be students and other non-specialist readers. Studies have shown the effectiveness of non-expert annotations for natural language tasks (Snow et al.: 2008; Kittur et al.: 2008). And Mueller’s catalogue of the “careful and literate readers” who might contribute to collaborative curation is inspirational: “high school students in AP classes, undergraduates, teachers, scholars, other professionals, amateurs from many walks of life, and — a particularly important group — educated retirees with time on their hands and a desire to do something useful.”

This is my rationale for choosing Shakespeare — which I promised to explain: because his texts are the most widely taught in the idiom of early modern English poetry, and because that idiom is my ultimate object. (Why is it my object? Because it’s my area of specialization, and because the TCP texts are going to be released in a few years.)

The crowdsourcing model has been successfully implemented in social science and humanities projects, particularly historical projects that focus on language transcription. At the January 2012 American Historical Association annual meeting, a panel titled “Crowdsourcing History” presented nine projects from historians and software developers, focused on (open-source) tools and techniques for transcribing a range of historical material: naturalists’ field notes, American civil war diaries, and author-specific manuscripts (Landrum: 2012). For instance: since early 2010, the 1,508 registered users of the Transcribe Bentham project at University College, London have transcribed 45% of the 5,580 manuscripts of the philosopher Jeremy Bentham. But transcription does not go far enough for my purposes. Models that may be more analogous are Alexandra Eveleigh (University College London)’s work on collaborative descriptions of material in the UK’s National Archives; the United States’ own National Archives’ “Citizen Archivist Dashboard” model; and the National Archives of Australia’s “The Hive” platform.

Data Curation 016

A key precedent for my proposed system is AnnoLex, the tool that Mueller and Craig Berry developed a few years ago. Thanks to their suggestions and to Twitter, I’ve learned about two more recent projects, each at different phases: TypeWright and eMOP, the Early Modern OCR Project. My goal from now to 2015 is to complement and augment their plans and capabilities, not to repeat their work. Mueller conceives of his project’s future in inspirational terms:

We need a social and technical space of collaborative curation, where individual texts live as curatable objects continually subject to correction, refinement, or enrichment by many hands, and coexisting at different levels of (im)perfection.

To conclude: My ultimate goal is to develop an extensible and intelligent model of the early modern English language: extensible, because it will take linguistic conventions from a limited dataset and learn from the variations that each new text offers; and intelligent, because it will teach machines to encode texts more like human experts read them, so that these machines can provoke us to read, think, and inquire more definitively.

In the comments field below, I hope we can address some of the questions I’ve raised about my aims and methodology — but more importantly, those questions I’ve forgotten, dodged, or distorted. Thanks for reading. 
