Applied Data Analysis

“Advanced class: Data science methods to analyse humanities data”

Convenors: Giovanni Colavizza, Matteo Romanello

Hashtag: #ADA and #DHOxSS2019

Computers: Participants should bring their own laptops with Python 3.5+ installed. See below for more details.

Abstract

This workshop offers an introduction to data analysis techniques of practical use to humanities scholars and GLAM professionals. Topics include: reading, cleaning, representing, describing and visualizing metadata, as well as text and network data, following a tidy approach. Classes are hands-on and interactive, as we will always work with real-world examples of metadata (e.g. from the Oxford English Dictionary), text (e.g. from historical newspapers) and relational data (e.g. publication citation data). Attendees will also have the opportunity to work on their own projects in groups.

Experience necessary

Participants should have a basic familiarity with the Python language and standard library (e.g. via previous attendance at Text2Tech or equivalent). More expert attendees should feel free to use any language of their choice, e.g. R.

Advance reading list
 

Intended outcomes

Participants will gain an understanding of how to take the first steps in an applied data analysis project, up to getting a descriptive and visual understanding of their data. They will develop a familiarity with the Python data analysis stack (Pandas, matplotlib, Seaborn). Participants will acquire skills of practical applicability in their work.

Participants must bring their own laptops with Python 3.5+ (and, optionally, Anaconda) installed. Minimal time will be devoted to set-up, so participants are asked to plan accordingly. Participants should also verify in advance that they have administrative access to their laptops, in order to be able to install new software.

A guide on how to set up Python on your machine is given here:

 

https://programminghistorian.org/en/lessons/introduction-and-installation 
Also useful: https://programminghistorian.org/en/lessons/?topic=get-ready
Installing Anaconda: https://docs.continuum.io/anaconda/install 
Creating a Python environment: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands
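
Once Python is installed, a quick check along the following lines may help confirm that a laptop is ready before the first class. This is only a minimal sketch, not an official set-up script: the helper names python_ok and missing_packages are made up for illustration, and the package list simply mirrors the libraries named under "Intended outcomes".

```python
import importlib.util
import sys

# Minimum version stated in the course description.
MIN_PYTHON = (3, 5)
# Import names of the libraries named under "Intended outcomes".
STACK = ("pandas", "matplotlib", "seaborn")


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_packages(names=STACK):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    status = "OK" if python_ok() else "too old"
    print("Python version:", sys.version.split()[0], "(" + status + ")")
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run the script with the same interpreter (or conda environment) you plan to use in class; any package reported as missing can then be installed ahead of time.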

Convenors

Giovanni Colavizza is a Senior Data Scientist at The Alan Turing Institute and Research Scientist at the Centre for Science and Technology Studies (CWTS), Leiden University. He did his PhD in Technology Management at the Digital Humanities Laboratory of EPFL, working on methods for text mining and citation analysis of scholarly publications. He was for two years the operations manager of the Venice Time Machine, a large-scale digitisation and indexation project based at the Archives of Venice, and is cofounder of Odoma, a start-up offering customised machine learning techniques in the cultural heritage domain. Giovanni is interested in science studies, applied machine learning and the application of computational methods to conduct research in the humanities.

Matteo Romanello is a research scientist at the Digital Humanities Laboratory of the EPFL, where he works on methods for text mining and information extraction from large collections of texts. He obtained his PhD at the Digital Humanities department at King’s College London, under the supervision of Prof. Willard McCarty, with a thesis on methods for citation mining of Classics publications. Matteo is a Digital Humanities specialist with expertise in the areas of classics, archaeology and history. His main research interests include natural language processing and information extraction, especially their domain-specific applications; citation mining and analysis; and applications of semantic web technologies in the humanities. Before joining the EPFL Matteo worked as a teaching fellow at the University of Rostock, as a researcher at the German Archaeological Institute, and was a visiting research scholar at Tufts University (Perseus project).

"I loved it, learnt a lot. Hope that I can put all I learnt to use in my research."

DHOxSS 2018 participant

TIMETABLE

 
Monday 22nd July
08:00-09:00
Registration (Sloane Robinson building)
Tea and coffee (ARCO building)
09:00-10:00

Opening Keynote (Sloane Robinson lecture theatre)
10:00-10:30

Refreshment break (ARCO building)
10:30-12:00

Workshop Introductions

  • Introductions

  • Opening class: presentation of the strand, objectives and schedule, example of a data analysis application

  • Setting up systems

(Giovanni Colavizza, Matteo Romanello)

12:00-13:30

Lunch (Dining Hall)
13:30-15:30

Data formats and Input/Output (interactive class via notebooks)
  • Markup: XML, JSON, CSV; binary

  • Read/Write, store

(Giovanni Colavizza, Matteo Romanello)
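
To give a flavour of what reading and writing these formats can involve, here is a minimal sketch that round-trips the same small table through CSV, JSON and XML. It is an assumption-laden illustration rather than the class material: it uses only the Python standard library, the records are invented toy data, and the function names and the "records"/"record" element names are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Invented toy records, for illustration only (all values kept as strings).
records = [
    {"title": "First example", "year": "1901"},
    {"title": "Second example", "year": "1902"},
]


def to_csv(rows):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "year"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of plain dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def to_json(rows):
    """Serialise a list of dicts to indented JSON text."""
    return json.dumps(rows, indent=2)


def from_json(text):
    """Parse JSON text back into Python objects."""
    return json.loads(text)


def to_xml(rows):
    """Serialise a list of dicts to XML, one <record> element per row."""
    root = ET.Element("records")
    for row in rows:
        item = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Parse the XML produced by to_xml back into a list of dicts."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in item} for item in root]
```

Each from_* function should return the original records when applied to the output of the matching to_* function, which makes the round trip a convenient sanity check when trying out a new format.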

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

Options (attendees choose one from the below):

  • Catching-up classes: assistance is provided to clarify any issue from the previous classes or in setting up your Python environment.

  • Exercises/project: exercises or mini-projects will be provided for practice. Alternatively, attendees can bring their own mini-project to the class and work on it, individually or with others.

(Giovanni Colavizza, Matteo Romanello)

Tuesday 23rd July

 

09:00-10:30

The Python data analysis stack, part I (interactive via notebooks)
 
Pandas, matplotlib, Seaborn
(Giovanni Colavizza, Matteo Romanello)

10:30-11:00

Refreshment break (ARCO building)
11:00-13:00

 

Tidy data, part I (frontal, 30m)

 

  • Basic concepts of (proper/tidy) data modelling

  • Typologies of databases (SQL, noSQL, graph)

 
Tidy data, part II (frontal/interactive, 1h)

 

Manipulating data with pandas

 

 

(Giovanni Colavizza, Matteo Romanello)

13:00-14:30

Lunch (Dining Hall)

 

14:30-15:30

Options (attendees choose one from the below):

  • Catching-up classes: assistance is provided to clarify any issue from the previous classes or in setting up your Python environment.

  • Exercises/project: exercises or mini-projects will be provided for practice. Alternatively, attendees can bring their own mini-project to the class and work on it, individually or with others.

  • Lectures at Text to Tech: attend the invited lectures given as part of the Text to Tech strand

(Giovanni Colavizza, Matteo Romanello)

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

Lectures (various venues)

Wednesday 24th July
 
09:00-10:30

Tidy data, part III (interactive)

 

From messy data to tidy data, step by step

(Giovanni Colavizza, Matteo Romanello)

10:30-11:00

Refreshment break (ARCO building)
11:00-13:00

The Python data analysis stack, part II (interactive via notebooks)

 

Pandas, matplotlib, Seaborn

(Giovanni Colavizza, Matteo Romanello)

 

13:00-14:30

Lunch (Dining Hall)
14:30-15:30

Options (attendees choose one from the below):

  • Catching-up classes: assistance is provided to clarify any issue from the previous classes or in setting up your Python environment.

  • Exercises/project: exercises or mini-projects will be provided for practice. Alternatively, attendees can bring their own mini-project to the class and work on it, individually or with others.

  • Lectures at Text to Tech: attend the invited lectures given as part of the Text to Tech strand

(Giovanni Colavizza, Matteo Romanello)

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

Lectures (various venues)

 

Thursday 25th July
 
09:00-10:30

Applied Data Analysis, part I: Basics

  • Descriptive statistics

  • Sampling and uncertainty

  • Relating two variables

  • Hypothesis testing (if time allows)

(Giovanni Colavizza, Matteo Romanello)

10:30-11:00

Refreshment break (ARCO building)
11:00-13:00

Applied Data Analysis, part II: Visualization

  • Visualization with the Python stack

  • Primer on good and bad dataviz practices

(Giovanni Colavizza, Matteo Romanello)
13:00-14:30

Lunch (Dining Hall)
14:30-15:30

Options (attendees choose one from the below):

  • Catching-up classes: assistance is provided to clarify any issue from the previous classes or in setting up your Python environment.

  • Exercises/project: exercises or mini-projects will be provided for practice. Alternatively, attendees can bring their own mini-project to the class and work on it, individually or with others.

  • Lectures at Text to Tech: attend the invited lectures given as part of the Text to Tech strand

(Giovanni Colavizza, Matteo Romanello)

15:30-16:00

Refreshment break (ARCO building)
16:00-17:00

Lectures (various venues)

 

Friday 26th July
 
09:00-10:30

 

Applied Data Analysis, part III: Advanced topics

  • Variable selection and preparation

  • Statistical analyses

  • Basic modelling

(Giovanni Colavizza, Matteo Romanello)
10:30-11:00

Refreshment break (ARCO building)
11:00-13:00

 

Applied Data Analysis, part IV: Advanced application

Analysis of data resulting from text processing, e.g. topic modelling or text reuse

(Giovanni Colavizza, Matteo Romanello)

13:00-14:30

Lunch (Dining Hall)
14:30-15:30

Closing session
  • Communicating data analysis results

  • Best practices about publishing datasets, licensing issues, reproducibility, data repositories (frontal)

  • Q&A and farewell

(Giovanni Colavizza, Matteo Romanello)

15:30-16:00

Refreshment break (ARCO building)

16:00-17:00

Closing Plenary (O'Reilly lecture theatre)

 

 
 
 
 
 

© 2019 University of Oxford