Bursary Reports 2019
Chenzi Xu, DPhil Candidate in General Linguistics and Comparative Philology, University of Oxford
Workshop: From Text to Tech
I’m very grateful to have attended the Digital Humanities Summer School at an early stage of my DPhil journey. Beyond the new skills and knowledge gained in the workshop, I benefited immensely from a broader vision of humanities research combined with digital technologies. I will no longer think of data collection, processing, analysis, and visualisation in the way I used to.
I attended the From Text to Tech workshop and gained hands-on practice in introductory Python programming and natural language processing. The convenors Barbara McGillivray, Gard Jenset, and Mariona Coll Ardanuy were extremely helpful in explaining some of the arbitrary conventions of Python programming and in clearing up the small but frustrating mistakes that inexperienced beginners tend to make.
We first walked through the basics of Python, including data types and structures, assignment and variables, control structures, and so on. One of my major take-aways from learning the basic syntax of Python was the logical way of thinking it instils: I was amazed at how combinations of simple for-loops and if-statements can be so powerful for executing tasks at scale. We were then introduced to the NLTK library for basic text processing such as tokenisation, stemming, part-of-speech tagging, and lemmatisation. Tom Wood’s presentation on stylometry demonstrated the application of some of these functions, and the more recent deep-learning methods he introduced can also help identify personal patterns or styles of word use.

In the subsequent lectures, we learnt several methods of converting text strings into numerical attributes, such as the bag-of-words model, TF-IDF, and Word2Vec. Semantic relations between words can thus be represented through Euclidean distance or cosine similarity. Projecting word meanings into a vector space seemed abstruse at first, but Barbara’s illustration of a two-dimensional vectorisation of words was illuminating. From quantifying sentiment to visualising networks, there were many more inspiring digital perspectives on textual data. In addition, we had some preliminary practice with useful Python libraries such as BeautifulSoup, for extracting text from HTML and XML documents, and Gensim, for topic modelling.
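To give a flavour of the vector-space idea described above, here is a minimal sketch, written with only the Python standard library (rather than NLTK or Word2Vec, which the workshop actually used): each document is turned into a bag-of-words count vector, and the cosine of the angle between two such vectors measures how similar their vocabularies are. The example texts are invented for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Convert a text into word counts (lower-cased, whitespace-tokenised)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a)          # missing words count as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = bag_of_words("the cat sat on the mat")
doc2 = bag_of_words("the cat lay on the rug")
doc3 = bag_of_words("stock markets fell sharply")

print(cosine_similarity(doc1, doc2))  # 0.75 — shared vocabulary
print(cosine_similarity(doc1, doc3))  # 0.0 — no words in common
```

TF-IDF refines this picture by down-weighting words that appear in most documents, and Word2Vec replaces the counts with dense learned vectors, but the geometric intuition — similar texts point in similar directions — is the same.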
This was a fruitful week in beautiful Keble College, where I interacted with students and scholars from various fields, at different stages of their research, on the lawns during tea breaks and in the grandiose dining hall over lunch. Thanks to the excellent organising team of DHOxSS, we had a great networking opportunity and heard about much exciting ongoing research and many new research ideas.
The workshop demonstrated how Python programming can be an efficient tool for humanities scholars to collect and interpret large-scale data and to perform quantitative analysis on non-numerical data, adding insights to qualitative analysis. I now feel more competent in discussing and applying quantitative methods in my own linguistic research, and I look forward to exploring much more of Python programming, machine learning, and other digital practices.