In Praise of Small Data

Editor’s Note

This is the seventh and final post in the series “Medieval Studies in the Age of Big Data” – a post that, appropriately enough, questions the premise of the series as a whole! The series was introduced in a post last year. Previous contributions by Martin Foys, Tim Stinson, Bruce Holsinger, Deborah McGrady, Stephen G. Nichols, and Elaine Treharne can be found here, here, here, here, here, and here. Alex Gillespie teaches English at the University of Toronto (links to her blog and bio follow the post).

If you like this post and this series, you can Like Burnable Books on Facebook to receive updates.

***

I can remember the day that I worked out I was a “literature person.” I was 11 or so and I was discussing the universe with a wise older relative. I asked him if, on some distant planet, matter might be made from something other than atoms.

“No,” he said. “Scientists have proved that the universe is made out of atoms.”

“But what if there’s a place where there are things that scientists don’t know about?”

“Then those things are made out of atoms; otherwise, it’s not a place in the universe. Our whole concept of the universe is based on what we know about atoms.”

“Just atoms – nothing else?”

“Just atoms. Anything else is made up.”

Oh yes, I thought. Made up stuff. That’s what I like.

Readers may already know how this anecdote ends. Thirty years ago, astrophysicists did indeed believe that the universe was built from “baryonic matter” or atoms. But since then, big data has suggested the presence of something not-atomic: dark matter, detectable in measures of cosmic microwave background fluctuations. Although my relative was not precisely wrong, not in his historical moment anyway, the end of the story is definitely Alex triumphans.

But that’s not why I remember the conversation, or why I connect it to the decisions I made about my career a few years later. I didn’t know anything about dark matter when I was 11. I still don’t. I remember the conversation because at the time, I intuited something else that turned out to be right. I am a literature person. I prefer the illimitable possibilities of fiction to the circumscribed domain of facts. I am deeply distrustful of any theory that places its holder at the end of history, in possession of some final truth.

All of this subtends the slightly skeptical position I want to take on Medieval Studies in the Age of Big Data. For a start, I come from a family of scarily clever scientists and mathematicians, whose example makes it impossible for me to think of my work as an exercise in “big data.” I do engage in data-driven research. I am working on a project that uses digital tools to annotate images of the marginal notes left by Matthew Parker’s scribes in his manuscripts. I am gathering data for another project on medieval bookbindings, like this one from Auckland Public Libraries: Sir George Grey Special Collections, Med. MS G.132.

APL MS G. 132, medieval bookbinding as digital data; reproduced with kind permission.

But the data I work with isn’t actually big. If I described every extant Western bookbinding predating 1450, the total might be 5,000 or maybe 10,000. Many catalogues ignore bindings so it’s hard to be sure, but even if the figure were 20,000 or 100,000, and even if I collected dozens of data points from each of these books, that’s not “big data” as the term is used in the technology sector or hard sciences. Big data does not look like this: 12,000,000. It looks like this: 5×10²⁰. Big data begins not with an archive of digitized manuscripts but with the observation that “every day we create 2.5 quintillion bytes of data – so much that 90% of the data in the world today has been created in the last two years.”
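To make the gap in scale concrete, here is a back-of-the-envelope sketch in Python. The 100,000 bindings and the 2.5 quintillion bytes per day come from the paragraph above; the figures of 36 data points per binding and 1,000 bytes per data point are assumptions chosen purely for illustration, not measurements from any catalogue.

```python
# Figures taken from the essay
BINDINGS_UPPER_BOUND = 100_000   # the most generous count of surviving bindings
DAILY_BYTES_CREATED = 2.5e18     # "2.5 quintillion bytes" created per day

# Illustrative assumptions (not from the essay)
DATA_POINTS_PER_BINDING = 36     # "dozens", taken here as three dozen
BYTES_PER_DATA_POINT = 1_000     # roughly a short textual description

SECONDS_PER_DAY = 24 * 60 * 60


def bindings_dataset_bytes() -> int:
    """Total size of the hypothetical bookbinding dataset, in bytes."""
    return BINDINGS_UPPER_BOUND * DATA_POINTS_PER_BINDING * BYTES_PER_DATA_POINT


def share_of_daily_output() -> float:
    """Fraction of one day's worldwide data output the dataset represents."""
    return bindings_dataset_bytes() / DAILY_BYTES_CREATED


def seconds_of_world_output() -> float:
    """Time the world would need to generate the same number of bytes."""
    return share_of_daily_output() * SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"Dataset size: {bindings_dataset_bytes():,} bytes")
    print(f"Share of one day's output: {share_of_daily_output():.2e}")
    print(f"Equivalent world output time: {seconds_of_world_output():.2e} seconds")
```

Under these assumed figures the whole corpus comes to 3.6 gigabytes, an amount the world is said to generate in roughly a tenth of a millisecond.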

Granted, those who know their big data insist that size isn’t the whole point. Big data also refers to new forms of data and the fact that “the digital streams that individuals produce are growing apace” (Tom White, Hadoop, p. 2). For every one photograph your grandmother took, you probably take thirty-five or even a hundred. If you think about big data that way, then the digital images and descriptions of medieval bookbindings I have started posting on my blog count as “big data.” So do the tweets of Erik Kwakkel. So will the results of the DNA tests planned for pieces of medieval parchment by Dan Bradley’s Codex project. All this work contributes to the growing “data footprint” of medieval studies.

But I am still really uncomfortable tossing humanist endeavor onto the big data bandwagon. Whatever you take “big data” to mean, it clearly privileges the quantitative over the qualitative. It argues for the utility of all knowable facts over doubts that such facts are all there is to know. It makes a case for numbers against analysis and “the end of theory” (however controversially).

That case is absolutely worth listening to. I am grateful to big data for showing that girls are as good at science as boys; that Mitt Romney was never going to win that election; and that there is some other, dark matter in the universe.

See! I just don’t think I am in the same algorithm-driven business.

And I am troubled by the idea (I think it’s an insidious one) that I should be in that business or should claim that I am. Doesn’t that further erode the already precarious status of more traditional and specialized modes of humanist inquiry? Do researchers in the sciences routinely claim – for example to funding agencies – that they are going to engage in some “intense close reading, using methods learned from collaborators in the Department of English Literature”? Of course they don’t. Which is a shame, because that sort of interdisciplinary inquiry might be quite interesting. And it might reveal just how much humanities researchers can achieve with really-quite-tidgy data.

Which brings me at last to my central point. I am in favour of small data.

First, I think the best work in e.g. history, art history, philosophy, theology, musicology, literary studies, and critical theory, even when it involves the collection of data, even when it contextualizes its findings by larger groupings of facts, is based upon the slow and close and careful reading of small data – whether that’s marginal images from some medieval manuscripts (Michael Camille); Wittgenstein’s claim that “obeying a rule is a practice” (Charles Taylor); or the madwoman hidden in Mr. Rochester’s attic (Sandra M. Gilbert and Susan Gubar).

I think that the best data-driven medieval studies employ smallish data and small data methodologies. New technologies have changed the form of that data. They have made gathering and disseminating it easier and faster. But the born-digital word hoard analyzed so brilliantly by the editors of the Dictionary of Old English; the manuscript descriptions distributed by e-codices, the Biblioteca Medicea Laurenziana, or Erik Kwakkel’s tweets; Paul Needham’s extraordinary Index Possessorum Incunabulorum – all these projects deal with quite modest data sets, modeled on those gathered in a pre-digital era by pencil and note-card wielding philologists. The 2013 version of this data is shinier and zippier, with many more images. It’s easier to get at. But it’s still humanities size, designed for humanist use. It’s small.

Finally, I think it is significant that the medieval period was an age of small data. There was a limited culture of data collection; medieval people were comparatively data poor. It was a long time ago. A lot of medieval data is lost. As a result, scholarly efforts to answer quite simple factual questions are hampered by the quantity and quality of available evidence.

Here’s an example of this problem, from a question I am working on. What did medieval books cost? Diane Booton’s fascinating book on manuscript and print in medieval Brittany describes a Légende dorée that sold for 6 francs d’or in 1371 (Manuscripts, Market and the Transition to Print, p. 32). But no evidence survives of how many leaves that book had, how it was bound, or how it was decorated. The book was sold second hand (it had formerly belonged to a chanter of Rennes Cathedral); it’s unclear whether this increased or diminished its value for the buyer, another cleric. Information about the supply of books in Brittany at this time is scant, and there is no way to gauge how that might have affected price. The priest who purchased the book was probably short on data too. He certainly didn’t have printed catalogues or anything as glorious as AbeBooks to check out what a Legenda Aurea was going for in Paris or Lyon.

I’ve heard scholars argue for an online database of medieval book prices. I think that could be very useful. But I do not think that it would provide a big data answer to my question about the cost of medieval books, or liberate me from the conceptual labours involved in medieval codicological inquiry. To answer my question – what did medieval books cost? – I’ll still need to consider each piece of evidence on its own rather hazy terms. I’ll need to account for the fact that the available data, like all data, is a human invention, a product of the sometimes inscrutable, sometimes precipitate processes by which people arrange the stuff of their existence, past, present, and future.

In short: I am drawn to quantitative methods of analysis, but I am absolutely committed to qualitative ones. And for the latter, small data does very nicely.

I began with an anecdote; I’ll end with another one. I’m not its author, but I have permission to publish it. Here then is Eric Stanley, in praise of small data (private correspondence, 25 October 2012):

I always say that Arts people generalize on the basis of one. That brings me to the Belgrade tram problem. Each tram in Belgrade bears its own number. A passenger on the Orient Express got off the train at Belgrade during a brief halt, got on to the station forecourt, and there saw a tram. It was numbered 19. He got back into the train, took his seat, and asked himself the question. How many trams are there in Belgrade? Statisticians will tell you that no answer is possible to this question. I was at a dinner with one famous mathematical physicist, (Sir) Rudolf Peierls (not yet knighted), a mathematical linguist, Alan S. C. Ross, and a physicist. Alan Ross raised the Belgrade tram problem in conversation. He said there was no possible answer. I said, I am not so sure that you can’t give an answer within a rough range: my guess would make me say, the total number of trams in Belgrade is 67. My reason? It was likely to be a dull number but bigger of course than the dull 19. Alan Ross said that was nonsense. But Peierls said that sometimes in physics projects (and after all he had been involved when in the States the first two atomic bombs were built) one needs a rough but reasoned guess: I had provided a reasoned total number of trams in Belgrade. I was and still am proud of Peierls’s support of my ability to generalize on the basis of one tram.

I’ll leave it to readers to decide whether Peierls, builder of atomic bombs, should have been equally proud to find himself in agreement with a humanist – the future Rawlinson and Bosworth Oxford Professor of Anglo-Saxon, no less – on the subject at hand: the rough but reasoned guesses, the half truths and useful fictions, by which human beings always arrive at the things we think we know.
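As an aside for curious readers: the tram problem has a close statistical cousin, usually called the German tank problem, in which a total is estimated from observed serial numbers. The sketch below is one conventional frequentist estimate and not the reasoning Stanley describes. It assumes, purely for illustration, that trams are numbered consecutively from 1 and that every tram is equally likely to be the one seen; the function name is hypothetical.

```python
def estimate_total(observed_numbers):
    """Estimate how many numbered items exist from the numbers observed.

    Assumes items are numbered 1..N with no gaps and that each observation
    is drawn uniformly without replacement. Returns m + m/k - 1, where m is
    the largest number seen and k is the number of observations.
    """
    k = len(observed_numbers)
    if k == 0:
        raise ValueError("at least one observation is required")
    m = max(observed_numbers)
    return m + m / k - 1


if __name__ == "__main__":
    # One tram, numbered 19, seen from the station forecourt.
    print(estimate_total([19]))  # 37.0
```

With a single sighting the estimate is simply twice the observed number minus one, so it stands or falls with the numbering assumptions: a rough but reasoned guess of a different kind from Stanley’s 67.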

Belgrade Tram

 


Alexandra Gillespie is an associate professor of English and Medieval Studies at the University of Toronto. You can read about her academic work here; she blogs here; and now and then she even tweets (@alexgillespie).

 

One comment on “In Praise of Small Data”

  1. Anonymous said:

    Perhaps the conclusions derived from
    small data are the slow food of scholarship, and terroir,
    the intuition derived from the scholar’s deep familiarity
    with context and comparanda.

    -Mary Garrison
