Bursary Reports 2019
English Faculty, University of Oxford
Workshop: From Text to Tech
During the first-day keynote for DHOxSS, Barbara McGillivray showed a bell curve image of the distribution of technology adopters. On one side, we found the innovators, that small set who drive technological disruption and birth new methods into being, and on the other, we found the sceptics: that equally petite cohort who, even after mainstream embrace, remain uncertain of a product’s added value. My eye was drawn, however, to a different part of the graph: the ‘chasm’ separating the ‘early adopters’ and the ‘early majority.’ For marketers, that ‘chasm’ represents the crucial gap to be crossed to ensure an economically viable range of consumers will purchase your product.
But when it comes to ‘DH,’ my ‘chasm’ takes a different form. It is less a question of disposition than one of technical proficiency. Before the summer school began, I felt I had the research questions and the personal interest to be an ‘early adopter.’ What I needed, before entering my PhD this fall, was to sit down and learn text mining. So I came to the ‘Text to Tech’ workshop with a clear vision in mind. I’d done some work in Java and felt comfortable with the way one thinks in code. Now it was time to master Python: to learn to extract text files from a variety of sources, to assemble them into cleaned corpuses, and to run basic text mining operations. From these points of view, the workshop could not have been more of a success. By the end, I’d calculated type-token ratios, mapped character networks, and even experimented with topic modelling and sentiment analysis. I felt ready to scale up.
Then something unexpected happened. I started playing with my texts, and the methods turned fickle. One of my primary research interests is the reception of French literature in the Victorian period, so I’m fascinated with the way nineteenth-century English periodicals discussed their contemporaries writing across the Channel. One of the old truisms in this field is that the French novel was seen as morally perverse or corruptive compared to its statelier English counterpart, a hypothesis that is ripe for a quantitative method. Instead of citing a few articles on Honoré de Balzac or Émile Zola, we have the ability, through text mining, to develop more robust corpuses and to look for broader patterns of reception over the course of the century. Sentiment analysis, which tracks emotionally coded keywords to assess how positive or negative a text’s perspective is, seemed the perfect tool.
As my control, I tried out George Eliot’s infamous ‘Silly Novels by Lady Novelists.’ Penned anonymously for the Westminster Review in 1856, the piece traded on gendered tropes of romance-writing and romance-reading to give an unabashed takedown. When two respected sentiment analysis tools, Vader and TextBlob, returned positive scores for the article, I knew that something was wrong. So I started feeding in some stock phrases, simplified versions of lines I’d seen in other pieces. When ‘Balzac is monstrous’ and ‘Balzac is prodigious’ both received neutral scores, I realised the issue was simple: the lexicons for these programs were too poor. These tools had been trained on and tested for contemporary texts (movie reviews, political headlines, and tweets), not on the dense stylistics of the Victorian literati.
It was a dispiriting moment, but luckily individual learning was only half the story of DHOxSS. One remarkable quality of the summer school is how it assembles a truly diverse set of thinkers. It is all too rare for me to find myself in a room with literature scholars who study topics outside my period; it is almost unheard of for that room to include a mixture of linguists, heritage experts, and data scientists. At times our queries and struggles were strikingly similar. My table-mates for the week included an Australian studying criminal records and a Hong Konger intrigued by financial news coverage. We were all taken by sentiment analysis, and all of us, for different reasons and to varying degrees, grew sceptical of the method. Odds were that if I grabbed a DHOxSS participant at random, they’d have a strong position, and a few sob stories, on OCR.
Then there were the points where expertise differed to no small degree. Despite being billed as a ‘beginner’ workshop, ‘Text to Tech’ benefitted from the presence of participants who could not, under any reasonable circumstances, qualify for that label. By talking to several skilled data scientists, I came to realise that the deficiencies I’d identified in sentiment analysis were far from insurmountable. I needed not so much to replace TextBlob or Vader as to augment their lexicons, either through manual input or by training them on texts from my corpus. By speaking to a researcher at HathiTrust, meanwhile, I learned there were opportunities available for bespoke corpus creation and, even better, for combining those data with outputs from other digitised periodical sources. I can only hope that the exchange felt bidirectional: that by articulating some of the terms under which literary scholars discuss genre, character, and theories of ‘realism,’ I could inform others’ research as they’d informed my own.
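The idea of augmenting a lexicon by manual input can be made concrete with a toy sketch. What follows is not Vader or TextBlob, nor their actual word lists: the `score` function, the `BASE_LEXICON` entries, and the valences assigned to ‘monstrous’ and ‘prodigious’ are all hypothetical, chosen only to show why a word missing from the lexicon scores as neutral and how a hand-made addition changes the result.

```python
import re

# Toy general-purpose lexicon: word -> valence in [-1, 1].
# Illustrative entries only; not taken from any real tool.
BASE_LEXICON = {"good": 0.6, "bad": -0.6, "great": 0.8, "terrible": -0.8}

# Hypothetical period-specific additions with hand-assigned valences.
VICTORIAN_ADDITIONS = {"monstrous": -0.8, "prodigious": 0.7}


def score(text, lexicon):
    """Return the mean valence of the words found in the lexicon.

    Words absent from the lexicon are ignored, so a sentence with no
    known words comes out as 0.0 (neutral).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


# Merge the base lexicon with the manual additions.
augmented = {**BASE_LEXICON, **VICTORIAN_ADDITIONS}

print(score("Balzac is monstrous", BASE_LEXICON))  # 0.0 (no known words)
print(score("Balzac is monstrous", augmented))     # -0.8
print(score("Balzac is prodigious", augmented))    # 0.7
```

Real tools weigh negation, intensifiers, and punctuation as well, so a sketch like this only illustrates the lexicon gap, not a full analyser.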
Coming away from DHOxSS, my projects have become simultaneously more and less challenging. Less challenging, because I have developed familiarity with Python and with the key text mining methods I sought upon entering. Less, too, because I met new mentors, collaborators, and friends, whose advice on OCR and corpus creation has made assembling coherent data on Victorian periodicals look far more feasible. More challenging, because once you start feeding real data into your systems, things never quite go as planned. But that’s also where things get interesting. I’ve realised that my DH projects are going to involve not just object study but also system design. They will require me to build a better sentiment analyser, one that will read terms like ‘monstrous’ and ‘prodigious’ with a proper eye. I’ll know I’ve crossed a new boundary when I feed my function ‘Silly Novels by Lady Novelists’ and see a starkly negative sentiment score. The result, I hope, will inform more than a project on the ‘Poisonous French Novel’; it will be a tool other researchers can use to better parse the Victorians.
I am so grateful to have had the opportunity to participate in DHOxSS and in particular for the bursary. The week served as the perfect launching pad for me to develop a robust set of DH projects during my PhD. Perhaps next year I’ll be back for Advanced Data Science!