
From Text to Tech

Python programming for powerful text processing in the Humanities

Convenors: Mariona Coll Ardanuy, Kaspar Beelen, Federico Nanni

Hashtags: #text2tech and #DHOxSS20

Computers: Participants are not required to bring their own laptops for this workshop. Laptop computers will be provided by DHOxSS.

 

Abstract

This workshop offers a foundational introduction to Python programming and Natural Language Processing, from cleaning texts to extracting meaning from them, as well as the basics of automated semantic analysis with machine learning.

Intended outcomes

 

At the end of the workshop, participants will have acquired basic practical skills and knowledge on how Python can be used for processing humanities textual data. They will leave with an understanding of key aspects of natural language processing and how these can be applied to their research in the humanities. Participants will be able to extract text and metadata from structured HTML and XML documents, and to perform automated analyses on large corpora.

Experience necessary

 

No prior technical knowledge is necessary.

Computer and software requirements

 

Participants are not required to bring their own laptops for this workshop. Laptop computers will be provided by DHOxSS.

 

Advanced Reading List

Convenors

 

Mariona Coll Ardanuy is a post-doctoral research associate in the Living with Machines project at The Alan Turing Institute. She holds a PhD in Computational Linguistics from the University of Göttingen, and her research interests lie at the intersection of the humanities and language technology.

Kaspar Beelen obtained his PhD in History at the University of Antwerp (2014). As a digital historian, Kaspar investigates how artificial intelligence can contribute to historical debates. He has worked as a post-doctoral fellow at the University of Toronto (Computer Science and Political Science Departments) and the University of Amsterdam (Institute of Informatics), where he also served as assistant professor in Digital Humanities (Media Studies Department). Kaspar currently works as a research associate at The Alan Turing Institute, London, where, as part of the "Living with Machines" project, he investigates the lived experience of the industrial revolution using data-driven techniques.

Federico Nanni is a Research Data Scientist at The Alan Turing Institute, working as part of the Research Engineering Group, and a visiting fellow at the School of Advanced Study, University of London. He completed a PhD in History of Technology and Digital Humanities at the University of Bologna focusing on the use of web archives in historical research and has been a post-doc in Computational Social Science at the Data and Web Science Group of the University of Mannheim. He also spent time as a visiting researcher at the Foundation Bruno Kessler and the University of New Hampshire, working on Natural Language Processing and Information Retrieval.

"Instructors were amazing and gave really nice sessions with great exercises."

DHOxSS 2010 participant

TIMETABLE

 
Monday, 13th July
08:00-09:00

Registration (Sloane Robinson building)
Tea and coffee (ARCO building)
09:00-10:00

Opening Keynote (O'Reilly lecture theatre)

10:00-10:30

Refreshment break (ARCO building)

10:30-12:00

Introduction to programming in Python

The session provides a basic introduction to programming for digital humanities using the Python language. Among the topics covered are assignments and variables, data types, conditional statements, and reading/writing data.

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni
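A first taste of the concepts this session covers (variables, data types, a conditional, and reading/writing a file) might look like the following sketch; the file name and values are invented for illustration:

```python
# Variables and basic data types: a string and an integer
title = "From Text to Tech"
year = 2020

# A conditional statement choosing between two values
if year >= 2020:
    message = f"{title} ({year})"
else:
    message = title

# Writing data to a text file, then reading it back
with open("workshop.txt", "w", encoding="utf-8") as f:
    f.write(message)

with open("workshop.txt", "r", encoding="utf-8") as f:
    content = f.read()

print(content)
```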

12:00-13:30

Lunch (Dining Hall)
13:30-15:30

Introduction to programming in Python (continued)

 

The session provides a basic introduction to programming for digital humanities using the Python language. Among the topics covered are assignments and variables, data types, conditional statements, and reading/writing data.

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni

 

15:30-16:00

 

Refreshment break (ARCO building)
16:00-17:00

Exercises and catch-up session

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni

Tuesday, 14th July
09:00-10:30
 
Basic text processing with Python

The session gives an introduction to working with linguistic data in Python. Topics include simple regular expressions and other methods for handling text data.

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni
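As a small sketch of the kind of text handling this session introduces, the snippet below tokenises a sentence and pulls out hashtags with a simple regular expression (the example sentence is invented):

```python
import re

text = "DHOxSS 2020 runs 13-17 July; hashtags: #text2tech and #DHOxSS20."

# Lowercase the text and split it into word-like tokens
words = re.findall(r"\w+", text.lower())

# Extract hashtags with a simple regular expression
hashtags = re.findall(r"#\w+", text)

print(words[:4])
print(hashtags)
```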

10:30-11:00
 
Refreshment break (ARCO building)

 

11:00-13:00
Data structures in Python

This session covers basic data structures in Python, such as lists and dictionaries, with practical examples.

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni
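A minimal sketch of the two structures the session focuses on: a list of tokens, and a dictionary used to count how often each token occurs (the tokens are a toy example):

```python
# A list holds an ordered sequence of items
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# A dictionary maps keys to values; here, each word to its frequency
counts = {}
for token in tokens:
    counts[token] = counts.get(token, 0) + 1

print(counts)
```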

13:00-14:30
Lunch (Dining Hall)
14:30-15:30
Invited talk TBC

 

15:30-16:00
Refreshment break (ARCO building)

 

16:00-17:00
 
Additional sessions (various venues)
Wednesday, 15th July

09:00-10:30  
 
Introduction to Natural Language Processing (NLP) in Python

 

This session introduces the spaCy library and shows how it can be used for tasks such as lemmatization, part-of-speech tagging and named entity recognition.

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni

10:30-11:00
 
Refreshment break (ARCO building)
11:00-13:00

Introduction to Natural Language Processing (NLP) in Python (continued)

 

This session introduces the spaCy library and shows how it can be used for tasks such as lemmatization, part-of-speech tagging and named entity recognition.

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni
13:00-14:30
Lunch (Dining Hall)
14:30-15:30  

 

Invited talk TBC

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

 

Additional sessions (various venues)

Thursday, 16th July

 

09:00-10:30
Working with semi-structured and tabular data

This session gives an introduction to working with semi-structured texts, such as XML or HTML documents. It shows how to access data via APIs and analyse the content with Pandas.

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni
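To sketch the XML-to-table workflow this session covers, the snippet below parses a small (invented) XML fragment and loads the extracted records into a Pandas DataFrame:

```python
from xml.etree import ElementTree as ET

import pandas as pd

# A toy XML fragment of the semi-structured kind this session works with
xml = """
<catalogue>
  <book year="1851"><title>Moby-Dick</title></book>
  <book year="1813"><title>Pride and Prejudice</title></book>
</catalogue>
"""

root = ET.fromstring(xml)

# Extract the title and year attribute from each <book> element
records = [
    {"title": book.find("title").text, "year": int(book.get("year"))}
    for book in root.iter("book")
]

# Load the records into a DataFrame for tabular analysis
df = pd.DataFrame(records)
print(df.sort_values("year"))
```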

 

10:30-11:00
 
Refreshment break (ARCO building)

11:00-13:00

Working with semi-structured and tabular data 

This session gives an introduction to working with semi-structured texts, such as XML or HTML documents. It shows how to access data via APIs and analyse the content with Pandas.

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni

13:00-14:30

Lunch (Dining Hall)
14:30-15:30
Word embeddings

We offer a gentle introduction to semantic analysis with word embeddings. We show how this technique can be used for humanities research, such as tracking semantic change or understanding biases in a corpus.

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni
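The core idea can be sketched without any training: each word is represented as a dense vector, and closeness between vectors is measured with cosine similarity. The four-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions, learned from large corpora):

```python
import numpy as np

# Toy "embeddings"; the numbers are invented for illustration
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.7, 0.2, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: the standard closeness measure for embeddings."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Words used in similar contexts should score higher than unrelated ones
print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["apple"]))
```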

15:30-16:00
Refreshment break (ARCO building)

 

16:00-17:00
 
Additional sessions (various venues)

Friday, 17th July

09:00-10:30
Topic modelling

In this session we cover how to apply topic models to understand the content of a corpus and to discover underlying trends.

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni
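A minimal sketch of topic modelling, here using scikit-learn's LDA implementation on a tiny invented corpus with two rough themes; the documents and the choice of two topics are illustrative assumptions:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus with two rough themes (politics vs. machinery)
docs = [
    "parliament debated the election and the vote",
    "the election vote divided parliament",
    "the steam engine powered the factory machines",
    "factory machines and the steam engine",
]

# Turn the documents into word counts, then fit a 2-topic LDA model
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a document's mixture over the two topics
print(doc_topics.shape)
```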

10:30-11:00
 
Refreshment break (ARCO building)
11:00-13:00
Document classification

This session provides simple tools to categorise large collections of documents according to predefined classes (such as topic, genre or author).

Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni
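As a sketch of the document-classification workflow, the snippet below trains a TF-IDF + Naive Bayes classifier on a tiny labelled corpus; the texts and genre labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labelled corpus: texts tagged with an (invented) genre
texts = [
    "the detective examined the crime scene",
    "the murder weapon was never found",
    "the spaceship landed on the red planet",
    "aliens arrived from a distant galaxy",
]
labels = ["crime", "crime", "scifi", "scifi"]

# TF-IDF features feeding a Naive Bayes classifier
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# Classify a new, unseen document
prediction = clf.predict(["a detective investigated the murder"])
print(prediction[0])
```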

 
13:00-14:30

Lunch (Dining Hall)
14:30-15:30

Open discussion
Speakers: Kaspar Beelen, Mariona Coll Ardanuy, Federico Nanni

 

15:30-16:00

Refreshment break (ARCO building)

16:00-17:00
Closing keynote (O'Reilly lecture theatre)

 
 
 
 
 

© 2020 University of Oxford