Applied Data Analysis for the Humanities

Advanced class: Data science methods to analyse humanities data

Convenors: Giovanni Colavizza, Matteo Romanello

Hashtags: #ADA and #DHOxSS20

Computers: Participants should bring their own laptops with Python 3.5+ installed. See below for more details.

Abstract

This workshop offers an introduction to data analysis techniques of practical use to humanities scholars and GLAM professionals. Topics include: reading, cleaning, representing, describing, modelling and visualizing text and network data, following a tidy approach. Attendees will also have the opportunity to work on their own projects in groups.

Intended outcomes

Participants will learn how to carry out the fundamental steps of an applied data analysis project, up to gaining a first understanding of their data. They will become familiar with the Python data analysis stack (Pandas, matplotlib) and acquire skills of practical use in their own work.

Experience necessary

Participants should have a basic familiarity with the Python language and standard library (e.g. via previous attendance at Text2Tech or equivalent). More expert attendees should feel free to use any language of their choice, e.g. R.

Advance (optional) reading list

Computer and software requirements

Participants must bring their own laptops with Python 3.5+ (and, optionally, Anaconda) installed. Minimal time will be devoted to set-up, so participants are asked to plan accordingly. Participants should also verify in advance that they have administrative access to their laptops, so that they can install new software.
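As a quick self-check before the workshop, a short script along the following lines can confirm that the installed interpreter meets the Python 3.5+ requirement. This is only a minimal sketch: the function name `meets_minimum` and the constant `MINIMUM_VERSION` are illustrative choices, not part of any workshop material.

```python
import sys

# Minimum version stated in the requirements (assumed here to mean 3.5 or newer).
MINIMUM_VERSION = (3, 5)


def meets_minimum(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor, ...) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only major and minor numbers, as tuples.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # str.format is used instead of f-strings so the script also runs on Python 3.5.
    current = "{}.{}.{}".format(*sys.version_info[:3])
    if meets_minimum():
        print("Python {} found: OK for the workshop.".format(current))
    else:
        print("Python {} found: please install Python 3.5 or newer.".format(current))
```

Save the script to a file (any name will do) and run it with the same `python` command you intend to use during the workshop.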

A guide on how to set up Python on your machine is given here:

 

https://programminghistorian.org/en/lessons/introduction-and-installation 
Also useful: https://programminghistorian.org/en/lessons/?topic=get-ready
Installing Anaconda: https://docs.continuum.io/anaconda/install 
Creating a Python environment: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands

Convenors

Giovanni Colavizza is assistant professor of digital humanities at the University of Amsterdam and a visiting researcher at The Alan Turing Institute. He did his PhD in Technology Management at the Digital Humanities Laboratory of EPFL, working on methods for text mining and citation analysis of scholarly publications. He was for two years the operations manager of the Venice Time Machine, a large-scale digitisation and indexation project based at the Archives of Venice, and is cofounder of Odoma, a start-up offering customised machine learning techniques in the cultural heritage domain. Giovanni is interested in science studies, applied machine learning and the application of computational methods to conduct research in the humanities.

Matteo Romanello is a research scientist at the Digital Humanities Laboratory of the EPFL, where he works on methods for text mining and information extraction from large collections of texts. He obtained his PhD at the Digital Humanities department at King’s College London, under the supervision of Prof. Willard McCarty, with a thesis on methods for citation mining of Classics publications. Matteo is a Digital Humanities specialist with expertise in the areas of classics, archaeology and history. His main research interests include natural language processing and information extraction, especially their domain-specific applications; citation mining and analysis; and applications of semantic web technologies in the humanities. Before joining the EPFL, Matteo worked as a teaching fellow at the University of Rostock, as a researcher at the German Archaeological Institute, and as a visiting research scholar at Tufts University (Perseus project).

"I left the workshop with a sense of confidence that I could implement what my teachers had shown us during the week. Giovanni and Matteo did a great job of showing how humanistic questions translate into code syntax. I am very excited to hear that they will be returning next summer: a lucky group of students indeed."

DHOxSS 2019 participant

 

TIMETABLE

 
Monday, 13th July 
08:00-09:00
Registration (Sloane Robinson building)
Tea and coffee (ARCO building)
09:00-10:00

Opening Keynote (O'Reilly lecture theatre)
10:00-10:30

Refreshment break (ARCO building)
10:30-12:00

Workshop Introductions

  • Introductions

  • Opening class: presentation of the strand, objectives and schedule, example of a data analysis application

  • Setting up systems

Speakers:  Giovanni Colavizza, Matteo Romanello

12:00-13:30

Lunch (Dining Hall)
13:30-15:30

Data formats and Input/Output 
  • Markup: XML, JSON, CSV; binary

  • Read/Write, store

Speakers:  Giovanni Colavizza, Matteo Romanello

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

Options (attendees choose one of the following):

  • Catching-up classes: assistance is provided to clarify any issue from the previous classes or in setting up your Python environment.

  • Exercises/project: exercises or mini-projects will be provided for practice. Alternatively, attendees can bring their own mini-project to the class and work on it, individually or with others.

Speakers:  Giovanni Colavizza, Matteo Romanello

Tuesday, 14th July

 

09:00-10:30

The Python data analysis stack, part I 
 
Pandas, matplotlib, Seaborn
Speakers:  Giovanni Colavizza, Matteo Romanello

10:30-11:00

Refreshment break (ARCO building)
11:00-13:00

 

Tidy data, part I 

 

  • Basic concepts of (proper/tidy) data modelling

  • Typologies of databases (SQL, NoSQL, graph)

  • Manipulating data with pandas

 

Speakers:  Giovanni Colavizza, Matteo Romanello

13:00-14:30

Lunch (Dining Hall)

 

14:30-15:30

Options (attendees choose one of the following):

  • Catching-up classes: assistance is provided to clarify any issue from the previous classes or in setting up your Python environment.

  • Exercises/project: exercises or mini-projects will be provided for practice. Alternatively, attendees can bring their own mini-project to the class and work on it, individually or with others.

  • Lectures at Text to Tech: attend the invited lectures given as part of the Text to Tech strand

Speakers:  Giovanni Colavizza, Matteo Romanello

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

Additional sessions (various venues)

Wednesday, 15th July
 
09:00-10:30

Tidy data, part II 

 

From messy data to tidy data, step by step

Speakers:  Giovanni Colavizza, Matteo Romanello

10:30-11:00

Refreshment break (ARCO building)
11:00-13:00

The Python data analysis stack, part II 

 

Pandas, matplotlib, Seaborn

Speakers:  Giovanni Colavizza, Matteo Romanello

 

13:00-14:30

Lunch (Dining Hall)
14:30-15:30

Options (attendees choose one of the following):

  • Catching-up classes: assistance is provided to clarify any issue from the previous classes or in setting up your Python environment.

  • Exercises/project: exercises or mini-projects will be provided for practice. Alternatively, attendees can bring their own mini-project to the class and work on it, individually or with others.

  • Lectures at Text to Tech: attend the invited lectures given as part of the Text to Tech strand

Speakers:  Giovanni Colavizza, Matteo Romanello

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

Additional sessions (various venues)

 

Thursday, 16th July 
 
09:00-10:30

Applied Data Analysis, part I

  • Descriptive statistics

  • Sampling and uncertainty

  • Relating two variables

Speakers:  Giovanni Colavizza, Matteo Romanello

10:30-11:00

Refreshment break (ARCO building)
11:00-13:00

Applied Data Analysis, part II

  • Visualization with the Python stack

  • Primer on good and bad dataviz practices

Speakers:  Giovanni Colavizza, Matteo Romanello
13:00-14:30

Lunch (Dining Hall)
14:30-15:30

Options (attendees choose one of the following):

  • Catching-up classes: assistance is provided to clarify any issue from the previous classes or in setting up your Python environment.

  • Exercises/project: exercises or mini-projects will be provided for practice. Alternatively, attendees can bring their own mini-project to the class and work on it, individually or with others.

  • Lectures at Text to Tech: attend the invited lectures given as part of the Text to Tech strand

Speakers:  Giovanni Colavizza, Matteo Romanello

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

Additional sessions (various venues)

 

Friday, 17th July
 
09:00-10:30

 

Applied Data Analysis, part III: Advanced topics

  • Variable selection and preparation

  • Statistical analyses

  • Hypothesis testing

  • Hints of modelling

Speakers:  Giovanni Colavizza, Matteo Romanello
10:30-11:00

Refreshment break (ARCO building)
11:00-13:00

 

Applied Data Analysis, part IV

Analysis of data resulting from text processing, e.g. topic modelling or text reuse

Speakers:  Giovanni Colavizza, Matteo Romanello

13:00-14:30

Lunch (Dining Hall)
14:30-15:30

Closing session
  • Communicating data analysis results

  • Best practices about publishing datasets, licensing issues, reproducibility, data repositories (frontal)

  • Q&A and farewell

Speakers:  Giovanni Colavizza, Matteo Romanello

15:30-16:00

Refreshment break (ARCO building)

16:00-17:00

Closing Keynote (O'Reilly lecture theatre)

 

 
 
 
 

© 2020 University of Oxford