
From Text to Tech

“Python programming for powerful text processing in the Humanities”


Convenors: Barbara McGillivray, Gard Jenset, Mariona Coll Ardanuy

Hashtags: #text2tech and #DHOxSS2019

Computers: Participants are not required to bring their own laptops for this workshop. Laptop computers will be provided by DHOxSS.

 

Abstract

Digitization efforts provide increasingly large amounts of text to answer old and new research questions in the Humanities. The technical skills needed to automatically process and analyse texts at such large scale are in high demand. The workshop will address this need by imparting the basic skills to automatically process and mine textual data, and assumes no prior knowledge of programming.

 

The workshop will build on the experience gathered in previous successful and fully booked editions of the "From Text to Tech" workshop at DHOxSS 2015–2018. It will ensure beginners, as well as people with some previous experience in programming, will be able to take full advantage of the content presented.

 

We will use Python, a very flexible programming language widely used in Humanities research. The workshop will take a hands-on, stepwise approach. It will offer a very basic introduction to Python programming, corpus linguistics, and Natural Language Processing, and will cover the process of cleaning texts and adding automatic linguistic annotation to them (lemmatization, part-of-speech tagging, syntactic parsing). It will also teach the basics of semantic analysis and network analysis.

 

The workshop will also include research talks by invited speakers, which will demonstrate concretely how the skills acquired in the lectures can be applied to answer questions in a range of humanistic disciplines.

 


Experience necessary

No prior technical knowledge necessary

Advanced Reading List

* Piotrowski, Michael. Natural Language Processing for Historical Texts. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2012. http://dx.doi.org/10.2200/S00436ED1V01Y201207HLT017.

* Montfort, Nick. Exploratory Programming for the Arts and Humanities. Cambridge, Massachusetts: The MIT Press, 2016. https://mitpress.mit.edu/books/exploratory-programming-arts-and-humanities.

 

* Karsdorp, Folgert. Python Programming for Humanists, ongoing. http://www.karsdorp.io/python-course/.

 

* Sinclair, Stéfan, and Geoffrey Rockwell. The Art of Literary Text Analysis. Edited by Melissa Mony, 2016. https://github.com/sgsinclair/alta/blob/77b256f7c3ff3ceb6643d53da401096c8cdcc468/ipynb/ArtOfLiteraryTextAnalysis.ipynb.

 

* Turkel, William J., and Alan MacEachern. The Programming Historian. 1st ed. NiCHE: Network in Canadian History & Environment, 2007. http://niche-canada.org/programming-historian.

"The workshop was a nice mix of lecture and practical exercise, the teachers were responsive to questions, left us with a wealth of materials to work on after the workshop and chose useful and interesting guest speakers."

DHOxSS 2018 participant

TIMETABLE

 
Link to overview of the week's timetable including evening events.
Monday 22nd July
08:00-09:00

Registration (Sloane Robinson building)
Tea and coffee (ARCO building)
09:00-10:00

Opening Keynote (Sloane Robinson O'Reilly lecture theatre)

10:00-10:30

Refreshment break (ARCO building)

10:30-12:00

Introduction to programming in Python

The session provides a basic introduction to programming for digital humanities using the Python language. Among the topics covered are assignments and variables, data types, conditional statements, and reading/writing data.

 

Speaker: Gard Jenset
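A minimal illustration of the topics listed above: variables, data types, a conditional statement, and reading/writing a text file. This is a hypothetical sketch, not the workshop's own teaching materials.

```python
title = "From Text to Tech"  # a string variable
year = 2019                  # an integer variable

# Conditional statement
if year >= 2015:
    print(title, "has run since 2015")

# Writing text data to a file, then reading it back
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("Python for text processing\n")

with open("notes.txt", encoding="utf-8") as f:
    content = f.read()

print(content.strip())
```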

12:00-13:30

Lunch (Dining Hall)
13:30-15:30

Introduction to programming in Python (continued)

 

The session provides a basic introduction to programming for digital humanities using the Python language. Among the topics covered are assignments and variables, data types, conditional statements, and reading/writing data.

Speaker: Barbara McGillivray

 

15:30-16:00

 

Refreshment break (ARCO building)
16:00-17:00

Introduction to Corpora

The session will give an introduction to the main concepts of corpus linguistics, including corpus creation and corpus processing for research in Digital Humanities.

Speaker: Barbara McGillivray
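One of the most basic corpus-processing operations is counting word frequencies. The short sketch below (an illustrative toy example, not the session's materials) tokenises a tiny "corpus" by whitespace and tallies the words.

```python
from collections import Counter

corpus = "the cat sat on the mat and the dog sat too"
tokens = corpus.lower().split()  # naive whitespace tokenisation
freq = Counter(tokens)           # maps each word to its frequency

print(freq["the"])  # the most frequent word appears three times
```

Real corpora would need proper tokenisation (punctuation, hyphenation, case), which the session covers in more depth.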

Tuesday 23rd July
09:00-10:30
 
Basic text processing with Python

The session gives an introduction to working with linguistic data in Python. Topics include simple regular expressions and other methods for handling text data.

Speaker: Mariona Coll Ardanuy
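A few hypothetical examples of the kind of regular-expression handling the session describes: finding patterns in a string and stripping punctuation.

```python
import re

text = "DHOxSS 2019 runs 22-26 July; hashtag #text2tech."

years = re.findall(r"\b\d{4}\b", text)   # four-digit numbers
hashtags = re.findall(r"#\w+", text)     # tokens starting with '#'
clean = re.sub(r"[^\w\s]", "", text)     # remove punctuation

print(years, hashtags)
```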

10:30-11:00
 
Refreshment break (ARCO building)

 

11:00-13:00
Data structures in Python

This session will cover basic data structures like lists and dictionaries in Python, with practical examples.

Speaker: Mariona Coll Ardanuy
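A brief text-flavoured sketch of the two data structures named above, a list and a dictionary, using the common pattern of counting items by hand (the example is illustrative, not the session's own).

```python
words = ["corpus", "text", "corpus", "python"]  # a list of strings

# A dictionary mapping each word to how often it occurs
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

print(counts)           # {'corpus': 2, 'text': 1, 'python': 1}
print(sorted(counts))   # dictionary keys in alphabetical order
```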

13:00-14:30
Lunch (Dining Hall)
14:30-15:30
Identifying the author of a document

 

This talk introduces an approach to "forensic stylometry", that is, identifying the author of a text based on a corpus of documents. This field made headlines in 2013 when two professors of computational linguistics showed that JK Rowling was the author of a detective series which she had written under a pseudonym. Traditionally this would have been done with a hand-engineered sequence of components for removing stopwords, lemmatising words, and constructing a bag-of-words model. However, recent advances in deep learning software have made it simple to build text classifiers with almost no feature engineering. In a few hours we can build a classifier to identify authorship. It can be trained in a few minutes and will run on a regular laptop.

Speaker: Tom Wood
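The bag-of-words idea behind traditional stylometry can be sketched in a few lines of pure Python: build a relative word-frequency profile for each author and attribute an unknown text to the most similar profile. This is a deliberately simplified toy (the talk itself uses deep-learning tooling), with made-up example sentences.

```python
from collections import Counter
import math

def profile(texts):
    """Relative word frequencies over an author's documents."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def similarity(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

author_a = profile(["the rain fell softly on the moor"])
author_b = profile(["compute the sum of the series quickly"])
unknown = profile(["rain fell on the quiet moor"])

# Attribute the unknown text to the more similar author profile
guess = "A" if similarity(unknown, author_a) > similarity(unknown, author_b) else "B"
print(guess)
```

Real stylometry would use many documents per author and more robust features (function words, character n-grams), but the comparison step is the same in spirit.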

15:30-16:00
Refreshment break (ARCO building)

 

16:00-17:00
 
Lectures (various venues)
Wednesday 24th July

09:00-10:30  
 
Introduction to Natural Language Processing (NLP) in Python

 

This session introduces the NLTK library and shows how it can be used for tasks such as stemming, part-of-speech tagging, and lemmatization with Python.

Speaker: Barbara McGillivray
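A minimal taste of NLTK, assuming the library is installed (`pip install nltk`). Stemming works out of the box; the part-of-speech tagger and lemmatiser mentioned above additionally require downloading their models with `nltk.download(...)`, so they are only noted in comments here.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["processing", "running", "studies"]
stems = [stemmer.stem(w) for w in words]
print(stems)

# POS tagging and lemmatisation would look like:
#   nltk.pos_tag(tokens)                    (needs the tagger model)
#   WordNetLemmatizer().lemmatize(word)     (needs the WordNet data)
```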

10:30-11:00
 
Refreshment break (ARCO building)
11:00-13:00

Sentiment analysis in Python

 

This session will consist of an exercise in which participants will learn how to perform basic sentiment analysis on textual data.

Speaker: Mariona Coll Ardanuy
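The simplest form of sentiment analysis scores a text against lists of positive and negative words. The toy lexicon below is invented for illustration; the session may use different tools and resources.

```python
POSITIVE = {"good", "great", "wonderful", "interesting"}
NEGATIVE = {"bad", "boring", "terrible", "dull"}

def sentiment(text):
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("a great and interesting talk"))  # positive
print(sentiment("a dull and boring afternoon"))   # negative
```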
13:00-14:30
Lunch (Dining Hall)
14:30-15:30  

 

Word Embeddings and Python in corpus research: the dative alternation in spoken British English

This talk gives an introduction to how studies in corpus linguistics can use Python and available resources like word embeddings in a supporting role to an annotated corpus. Typically, word embeddings have been used more in computational linguistics and natural language processing applications than in corpus linguistics research. Using a study of the so-called dative alternation (“gave Y X” vs. “gave X to Y”) as an example, the talk illustrates how we can better answer quantitative research questions by combining various types of resources, and how combining the different programming languages R and Python offers additional flexibility in corpus research.

Speaker: Gard Jenset
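Word embeddings represent words as numeric vectors so that similar words have similar vectors, usually compared with cosine similarity. The sketch below uses hand-made 3-dimensional toy vectors (real embeddings such as word2vec have hundreds of dimensions and are learned from large corpora).

```python
import math

# Toy vectors invented for illustration
vectors = {
    "give":   [0.9, 0.1, 0.3],
    "donate": [0.8, 0.2, 0.4],
    "table":  [0.1, 0.9, 0.7],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

# 'give' should be closer to 'donate' than to 'table'
print(cosine(vectors["give"], vectors["donate"]))
print(cosine(vectors["give"], vectors["table"]))
```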

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

 

Lectures (various venues)

Thursday 25th July

 

09:00-10:30
Parsing HTML and XML documents

Python can be used to extract text from structured documents like HTML and XML. This session gives an introduction to how this is done.

Speaker: Gard Jenset
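Python's standard library can already parse well-formed XML. The snippet below walks a small made-up document and pulls out its text; real sessions may use other tools (e.g. Beautiful Soup for messy HTML).

```python
import xml.etree.ElementTree as ET

xml = """<doc>
  <title>From Text to Tech</title>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</doc>"""

root = ET.fromstring(xml)
title = root.find("title").text            # text of the <title> element
paragraphs = [p.text for p in root.findall("p")]  # all <p> elements

print(title)
print(paragraphs)
```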

 

10:30-11:00
 
Refreshment break (ARCO building)

11:00-13:00

Extracting semantic information from text

The session gives an introduction to how Python and the NLTK library can be used to extract semantic information from unstructured text and measure similarity between documents.

Speaker: Barbara McGillivray
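A common way to measure similarity between documents is to turn each into a bag-of-words vector and compare the vectors with cosine similarity. The sketch below is pure Python (rather than NLTK) so it stays self-contained, with invented example sentences.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

d1 = Counter("the cat sat on the mat".split())
d2 = Counter("the cat lay on the mat".split())
d3 = Counter("stock markets fell sharply today".split())

print(cosine(d1, d2))  # high: the documents share most words
print(cosine(d1, d3))  # zero: no words in common
```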

13:00-14:30

Lunch (Dining Hall)
14:30-15:30
Using the Oxford English Dictionary to explore linguistic features of texts

This talk uses data derived from the Oxford English Dictionary (OED) to explore aspects of the history of English. I review some of the features of the OED, and discuss some examples of quantitative analysis based on these features. We then look at how to use this data as a component in the study of historical texts, drawing on the OED API (https://developer.oxforddictionaries.com/our-data) as a tool to 'outsource' some of the effort involved in linguistic and textual analysis. I present some sample applications which use the OED API to support tasks like information retrieval, text annotation, and the reading and interpretation of documents. The talk includes practical examples of Python code to connect to and query the OED API, and to process the response.

Speaker: James McCracken
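Querying a dictionary API generally means building a URL for a word and sending an authenticated HTTP request. The sketch below only constructs such a URL; the base endpoint and path are placeholders, not the actual OED API contract, which is specified in the documentation linked above.

```python
from urllib.parse import quote

# Placeholder endpoint, not the real OED API
BASE = "https://example-dictionary-api.example.org/entries/en"

def entry_url(word):
    """Build a lookup URL for a word (lower-cased, percent-encoded)."""
    return f"{BASE}/{quote(word.lower())}"

print(entry_url("parse"))
# An actual request would add authentication headers and be sent with
# urllib.request or the requests library, then the JSON response parsed.
```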

15:30-16:00
Refreshment break (ARCO building)

 

16:00-17:00
 
Lectures (various venues)

Friday 26th July

09:00-10:30
Networks in Python

The session introduces basic concepts of network theory, and shows how structured textual data can be turned into networks.

Speaker: Mariona Coll Ardanuy
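One common way to turn text into a network is to link entities that co-occur in the same sentence. The toy example below builds an undirected co-occurrence graph as a plain adjacency dictionary; the invented character names are assumptions, and a session like this might instead use a library such as NetworkX.

```python
from itertools import combinations

# Each inner list: names appearing together in one sentence (toy data)
sentences = [
    ["Emma", "Harriet"],
    ["Emma", "Knightley"],
    ["Harriet", "Martin"],
]

# Undirected co-occurrence network as an adjacency dictionary
graph = {}
for names in sentences:
    for a, b in combinations(names, 2):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

print(sorted(graph["Emma"]))  # Emma's neighbours in the network
```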

10:30-11:00
 
Refreshment break (ARCO building)
11:00-13:00
Problem solving session

The session will provide an opportunity to apply the skills taught during the week, with instructors present to provide guidance.

Speakers: Mariona Coll Ardanuy, Gard Jenset, Barbara McGillivray

 
13:00-14:30

Lunch (Dining Hall)
14:30-15:30

Group presentations from students

15:30-16:00

Refreshment break (ARCO building)

16:00-17:00
Closing plenary (O'Reilly lecture theatre)
Speaker biographies

​​James McCracken is lead developer for the Oxford English Dictionary, developing resources to support OED's lexicographic research, and exploring new applications of OED data. His work includes computational approaches to linguistics, literary studies, and history, with a particular interest in how advances in natural language processing and machine learning can be applied to the analysis of historical texts.

Barbara McGillivray is a computational linguist, and works as a research fellow at The Alan Turing Institute and the University of Cambridge. She holds a PhD in computational linguistics from the University of Pisa. Her research interests include: Language Technology for Cultural Heritage, Latin computational linguistics, quantitative historical linguistics, and computational lexicography.

Tom Wood studied physics as his first degree and then got interested in natural language processing. He did a Masters at Cambridge University in Computer Speech, Text and Internet Technology, and since then he has worked in machine learning and AI in various companies, including computer vision and designing dialogue systems (think of Siri), in the UK, Spain and Germany. He has worked as a data scientist for CV-Library, one of the UK's largest job boards, developing machine learning algorithms to parse jobseekers' CVs and make smart job recommendations. He works as a freelance data science consultant via his own company Fast Data Science Ltd (www.fastdatascience.com).

Gard B. Jenset has a PhD in English linguistics from the University of Bergen. He currently works with deep learning and language technology in industry. Among his research interests are corpus linguistics and quantitative methods in historical linguistics.

Mariona Coll Ardanuy is a post-doctoral research associate in the Living with Machines project at The Alan Turing Institute. She has a PhD in Computational Linguistics from the University of Göttingen. Her research interests lie at the intersection of the humanities and language technology.

 
 
 
 
 

© 2019 University of Oxford