
From Text to Tech

Python programming for powerful text processing in the Humanities

Convenors: Mariona Coll Ardanuy, Gard Jenset, Kaspar Beelen, Federico Nanni

Hashtags: #text2tech and #DHOxSS2019

Computers: Participants are not required to bring their own laptops for this workshop; laptop computers will be provided by DHOxSS.

 

Abstract

This workshop offers a foundational introduction to Python programming and Natural Language Processing, from cleaning texts to extracting meaning from them, as well as the basics of automated semantic analysis with machine learning.

Intended outcomes

 

At the end of the workshop, participants will have acquired basic practical skills and knowledge on how Python can be used for processing humanities textual data. They will leave with an understanding of key aspects of natural language processing and how these can be applied to their research in the humanities. Participants will be able to extract text and metadata from structured HTML and XML documents, and to perform automated analyses on large corpora.

Experience necessary

 

No prior technical knowledge is necessary.

Computer and software requirements

 

Participants are not required to bring their own laptops for this workshop; laptop computers will be provided by DHOxSS.

 

Advanced Reading List

Convenors

 

Mariona Coll Ardanuy is a post-doctoral research associate in the Living with Machines project at The Alan Turing Institute. She has a PhD in Computational Linguistics from the University of Göttingen, and her research interests lie at the intersection of the humanities and language technology.

Gard B. Jenset has a PhD in English linguistics from the University of Bergen. He currently works with deep learning and language technology in industry. Among his research interests are corpus linguistics and quantitative methods in historical linguistics.

Kaspar Beelen obtained his PhD in History at the University of Antwerp (2014). As a digital historian, Kaspar investigates how artificial intelligence can contribute to historical debates. He has worked as a post-doctoral fellow at the University of Toronto (Computer Science and Political Science Departments) and the University of Amsterdam (Institute of Informatics), where he also served as assistant professor in Digital Humanities (Media Studies Department). Kaspar currently works as a research associate at The Alan Turing Institute, London, where, as part of the "Living with Machines" project, he investigates the lived experience of the industrial revolution using data-driven techniques.

Federico Nanni is a Research Data Scientist at The Alan Turing Institute, working as part of the Research Engineering Group, and a visiting fellow at the School of Advanced Study, University of London. He completed a PhD in History of Technology and Digital Humanities at the University of Bologna focusing on the use of web archives in historical research and has been a post-doc in Computational Social Science at the Data and Web Science Group of the University of Mannheim. He also spent time as a visiting researcher at the Foundation Bruno Kessler and the University of New Hampshire, working on Natural Language Processing and Information Retrieval.

"The workshop was a nice mix of lecture and practical exercise, the teachers were responsive to questions, left us with a wealth of materials to work on after the workshop and chose useful and interesting guest speakers."

DHOxSS 2018 participant

TIMETABLE

 
13-17 JULY 2020 PROGRAMME COMING SHORTLY
BELOW IS 2019 PROGRAMME FOR GENERAL REFERENCE
Link to overview of the week's timetable including evening events.
Monday 22nd July
08:00-09:00

Registration (Sloane Robinson building)
Tea and coffee (ARCO building)
09:00-10:00

Opening Keynote (Sloane Robinson O'Reilly lecture theatre)

10:00-10:30

Refreshment break (ARCO building)

10:30-12:00

Introduction to programming in Python

The session provides a basic introduction to programming for digital humanities using the Python language. Among the topics covered are assignments and variables, data types, conditional statements, and reading/writing data.

 

Speaker: Gard Jenset
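As a taste of the topics listed above, here is a minimal sketch (illustrative only, not the workshop's own materials) covering assignments, data types, a conditional statement, and writing and reading a file:

```python
# Assignments and basic data types
title = "From Text to Tech"   # str
year = 2019                   # int
topics = ["Python", "NLP"]    # list

# A conditional statement
if year >= 2019:
    status = "current"
else:
    status = "archived"

# Writing a text file, then reading it back
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n")

with open("notes.txt", encoding="utf-8") as f:
    first_line = f.readline().strip()

print(first_line)  # From Text to Tech
```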

12:00-13:30

Lunch (Dining Hall)
13:30-15:30

Introduction to programming in Python (continued)

 

The session provides a basic introduction to programming for digital humanities using the Python language. Among the topics covered are assignments and variables, data types, conditional statements, and reading/writing data.

Speaker: Barbara McGillivray

 

15:30-16:00

 

Refreshment break (ARCO building)
16:00-17:00

Introduction to Corpora

The session will give an introduction to the main concepts of corpus linguistics, including corpus creation and corpus processing for research in Digital Humanities.

Speaker: Barbara McGillivray
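By way of illustration (a minimal sketch with an invented two-document corpus, not the session's own materials), a small corpus can be tokenised and its word frequencies counted with Python's standard library:

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenise each document on whitespace and count word frequencies
tokens = [word for doc in corpus for word in doc.split()]
freq = Counter(tokens)

print(freq["the"], freq["sat"])  # 4 2
```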

Tuesday 23rd July
09:00-10:30
 
Basic text processing with Python

The session gives an introduction to working with linguistic data in Python. Topics include simple regular expressions and other methods for handling text data.

Speaker: Mariona Coll Ardanuy
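A small illustration of the kind of regular-expression work this session covers (a sketch over an invented sentence, using only Python's standard `re` module):

```python
import re

text = "Session 1 starts at 09:00; session 2 starts at 10:30."

# Extract clock times of the form HH:MM
times = re.findall(r"\d{2}:\d{2}", text)
print(times)  # ['09:00', '10:30']

# Lowercase and split into alphabetic word tokens
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens[:4])  # ['session', 'starts', 'at', 'session']
```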

10:30-11:00
 
Refreshment break (ARCO building)

 

11:00-13:00
Data structures in Python

This session will cover basic data structures like lists and dictionaries in Python, with practical examples.

Speaker: Mariona Coll Ardanuy
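The two structures this session centres on can be sketched in a few lines (an illustrative example; the session counts are invented):

```python
# A list holds an ordered sequence of items
speakers = ["Mariona Coll Ardanuy", "Gard Jenset", "Barbara McGillivray"]

# A dictionary maps keys to values: here, speaker name -> session count
sessions = {}
for name in ["Gard Jenset", "Mariona Coll Ardanuy", "Gard Jenset"]:
    sessions[name] = sessions.get(name, 0) + 1

print(speakers[0])              # Mariona Coll Ardanuy
print(sessions["Gard Jenset"])  # 2
```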

13:00-14:30
Lunch (Dining Hall)
14:30-15:30
Identifying the author of a document

 

This talk introduces an approach to "forensic stylometry", that is, identifying the author of a text based on a corpus of documents. This field made headlines in 2013 when two professors of computational linguistics showed that JK Rowling was the author of a detective series which she had written under a pseudonym. Traditionally this would have been done with a hand-engineered sequence of components for removing stopwords, lemmatising words, and constructing a bag-of-words model. However, recent advances in deep learning software have made it simple to build text classifiers with almost no feature engineering. In a few hours we can build a classifier to identify authorship; it can be trained in a few minutes and will run on a regular laptop.

Speaker: Tom Wood
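The talk itself uses deep learning, but the traditional bag-of-words approach it contrasts with can be sketched in plain Python. Below, each "author" is reduced to a word-frequency profile and the unknown text is attributed to the closer profile by cosine similarity (all texts are invented; real stylometry uses much larger corpora and function-word features):

```python
from collections import Counter
import math

def profile(text):
    """Relative frequency of each word in a text (a bag-of-words profile)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def similarity(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

author_a = profile("the case of the missing letter was solved by the detective")
author_b = profile("roses and violets and lilies bloomed in the garden all summer")
unknown = profile("the detective solved the case of the stolen letter")

guess = "A" if similarity(unknown, author_a) > similarity(unknown, author_b) else "B"
print(guess)  # A
```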

15:30-16:00
Refreshment break (ARCO building)

 

16:00-17:00
 
Lectures (various venues)
Wednesday 24th July

09:00-10:30  
 
Introduction to Natural Language Processing (NLP) in Python

 

This session introduces the NLTK library and shows how it can be used for tasks such as stemming, part-of-speech tagging, and lemmatization with Python.

Speaker: Barbara McGillivray
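NLTK ships ready-made tools for these tasks (for instance `nltk.stem.PorterStemmer` and `nltk.pos_tag`). As a library-free illustration of what stemming does, here is a toy suffix-stripping stemmer; it is deliberately crude and is not NLTK's algorithm:

```python
def toy_stem(word):
    """Strip a few common English suffixes. A toy illustration only:
    note the over-stemming of 'tagged' below, which NLTK's PorterStemmer
    handles with proper rewrite rules."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["processing", "tagged", "texts", "run"]])
# ['process', 'tagg', 'text', 'run']
```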

10:30-11:00
 
Refreshment break (ARCO building)
11:00-13:00

Sentiment analysis in Python

 

This session will consist of an exercise in which participants will learn how to perform basic sentiment analysis on textual data.

Speaker: Mariona Coll Ardanuy
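The simplest form of sentiment analysis scores a text against a lexicon of positive and negative words. A minimal sketch (the five-word lexicon is invented for illustration; real lexicons contain thousands of scored entries):

```python
# A tiny hand-made sentiment lexicon (illustrative, not a real resource)
LEXICON = {"useful": 1, "interesting": 1, "nice": 1, "boring": -1, "bad": -1}

def sentiment(text):
    """Sum the lexicon scores of the words in a text."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

review = "a nice mix of lecture and practical exercise with interesting speakers"
print(sentiment(review))  # 2
```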
13:00-14:30
Lunch (Dining Hall)
14:30-15:30  

 

Word Embeddings and Python in corpus research: the dative alternation in spoken British English

This talk gives an introduction to how studies in corpus linguistics can use Python and available resources like word embeddings in a supporting role to an annotated corpus. Typically, word embeddings have been used more in computational linguistics and natural language processing applications than in corpus linguistics research. Using a study of the so-called dative alternation (“gave Y X” vs. “gave X to Y”) as an example, the talk illustrates how we can better answer quantitative research questions by combining various types of resources, and how combining the different programming languages R and Python offers additional flexibility in doing corpus research.

Speaker: Gard Jenset
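The core operation behind word embeddings, comparing words by the cosine similarity of their vectors, can be sketched with toy vectors (the three-dimensional vectors below are invented; real embeddings such as word2vec have hundreds of dimensions learned from large corpora):

```python
import math

# Toy 3-dimensional "embeddings" (invented for illustration)
vectors = {
    "gave":    [0.9, 0.1, 0.0],
    "sent":    [0.8, 0.2, 0.1],
    "bloomed": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Semantically close verbs should score higher than unrelated ones
print(cosine(vectors["gave"], vectors["sent"])
      > cosine(vectors["gave"], vectors["bloomed"]))  # True
```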

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

 

Lectures (various venues)

Thursday 25th July

 

09:00-10:30
Parsing HTML and XML documents

Python can be used to extract text from structured documents like HTML and XML. This session gives an introduction to how this is done.

Speaker: Gard Jenset
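A minimal sketch of the kind of extraction this session covers, using Python's standard-library XML parser on an invented snippet (real projects often use libraries such as BeautifulSoup or lxml for messier HTML):

```python
from xml.etree import ElementTree as ET

# A minimal XML document with metadata and body text (invented)
xml = """
<document>
  <metadata><title>Letter to a Friend</title><date>1851</date></metadata>
  <body><p>Dear friend, the weather here is fine.</p></body>
</document>
"""

root = ET.fromstring(xml)
title = root.find("metadata/title").text       # navigate by path
paragraphs = [p.text for p in root.iter("p")]  # collect all <p> elements

print(title)          # Letter to a Friend
print(paragraphs[0])  # Dear friend, the weather here is fine.
```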

 

10:30-11:00
 
Refreshment break (ARCO building)

11:00-13:00

Extracting semantic information from text

The session gives an introduction to how Python and the NLTK library can be used to extract semantic information from unstructured text and measure similarity between documents.

Speaker: Barbara McGillivray
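One simple way to measure similarity between documents is to compare their word sets. A minimal sketch using Jaccard similarity over invented documents (NLTK and related libraries offer richer measures over weighted vectors):

```python
def jaccard(doc_a, doc_b):
    """Jaccard similarity: shared words divided by all words used."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

d1 = "python for text processing in the humanities"
d2 = "text processing with python"
d3 = "networks and graph theory"

print(round(jaccard(d1, d2), 2))  # 0.38
print(jaccard(d1, d3))            # 0.0
```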

13:00-14:30

Lunch (Dining Hall)
14:30-15:30
Using the Oxford English Dictionary to explore linguistic features of texts

This talk uses data derived from the Oxford English Dictionary (OED) to explore aspects of the history of English. I review some of the features of the OED, and discuss some examples of quantitative analysis based on these features. We then look at how to use this data as a component in the study of historical texts, drawing on the OED API (https://developer.oxforddictionaries.com/our-data) as a tool to 'outsource' some of the effort involved in linguistic and textual analysis. I present some sample applications which use the OED API to support tasks like information retrieval, text annotation, and the reading and interpretation of documents. The talk includes practical examples of Python code to connect to and query the OED API, and to process the response.

Speaker: James McCracken
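Connecting to a web API of this kind generally means building an authenticated HTTP request. The sketch below constructs (but does not send) such a request with Python's standard library; the endpoint path and header names follow the Oxford Dictionaries API's documented pattern, but treat them as an assumption and consult the documentation linked above for the real details:

```python
from urllib.request import Request

# Assumed endpoint pattern; verify against the API documentation
BASE = "https://od-api.oxforddictionaries.com/api/v2/entries/en-gb"

def build_request(word, app_id, app_key):
    """Construct (but do not send) an authenticated GET request."""
    req = Request(f"{BASE}/{word.lower()}")
    req.add_header("app_id", app_id)    # credentials issued on registration
    req.add_header("app_key", app_key)
    return req

req = build_request("Humanities", "MY_APP_ID", "MY_APP_KEY")
print(req.full_url)
# https://od-api.oxforddictionaries.com/api/v2/entries/en-gb/humanities
```

Sending the request (for example with `urllib.request.urlopen(req)`) returns a JSON response that can then be processed with Python's `json` module.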

15:30-16:00
Refreshment break (ARCO building)

 

16:00-17:00
 
Lectures (various venues)

Friday 26th July

09:00-10:30
Networks in Python

The session introduces basic concepts of network theory, and shows how structured textual data can be turned into networks.

Speaker: Mariona Coll Ardanuy
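A network can be represented as a mapping from each node to its neighbours. The sketch below builds a co-occurrence network from invented groupings, using only the standard library (libraries such as NetworkX provide this and much more):

```python
from collections import defaultdict
from itertools import combinations

# Invented groupings: which people appear together on each day
days = [
    ["Gard Jenset", "Barbara McGillivray"],
    ["Mariona Coll Ardanuy", "Gard Jenset"],
    ["Barbara McGillivray", "Mariona Coll Ardanuy", "Gard Jenset"],
]

# Build an undirected co-occurrence network as an adjacency mapping
graph = defaultdict(set)
for day in days:
    for a, b in combinations(day, 2):
        graph[a].add(b)
        graph[b].add(a)

# The degree of a node is its number of neighbours
degree = {name: len(neighbours) for name, neighbours in graph.items()}
print(degree["Gard Jenset"])  # 2
```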

10:30-11:00
 
Refreshment break (ARCO building)
11:00-13:00
Problem solving session

The session will provide an opportunity to apply the skills taught during the week, with instructors present to provide guidance.

Speakers: Mariona Coll Ardanuy, Gard Jenset, Barbara McGillivray

 
13:00-14:30

Lunch (Dining Hall)
14:30-15:30

Group presentations from students

15:30-16:00

Refreshment break (ARCO building)

16:00-17:00
Closing plenary (O'Reilly lecture theatre)
Speaker biographies

James McCracken is lead developer for the Oxford English Dictionary, developing resources to support OED's lexicographic research, and exploring new applications of OED data. His work includes computational approaches to linguistics, literary studies, and history, with a particular interest in how advances in natural language processing and machine learning can be applied to the analysis of historical texts.

Barbara McGillivray is a computational linguist, and works as a research fellow at The Alan Turing Institute and the University of Cambridge. She holds a PhD in computational linguistics from the University of Pisa. Her research interests include: Language Technology for Cultural Heritage, Latin computational linguistics, quantitative historical linguistics, and computational lexicography.

Tom Wood studied physics as his first degree and then became interested in natural language processing. He did a Master's at Cambridge University in Computer Speech, Text and Internet Technology, and has since worked in machine learning and AI at various companies in the UK, Spain and Germany, including work on computer vision and dialogue systems (think of Siri). He has worked as a data scientist for CV-Library, one of the UK's largest job boards, developing machine learning algorithms to parse jobseekers' CVs and make smart job recommendations. He now works as a freelance data science consultant via his own company Fast Data Science Ltd (www.fastdatascience.com).

 
 
 
 
 

© 2019 University of Oxford