Practical Text Analysis Pipelines for Humanists with Deadlines

Adriaan Lemmens and Vincent Vandeghinste

Among researchers in the humanities, interest continues to grow in the potential of advanced text analytics. At the same time, we observe that the actual use of such methods remains out of reach for many. We believe that this is partly due to a skill gap: existing tools generally presuppose knowledge that prospective users do not possess and cannot reasonably be expected to have time to acquire.

As part of CLARIAH-VL (a joint effort between CLARIN and DARIAH in Flanders), we are building a web service that aims to address this concern and to put the latest-and-greatest in NLP in the hands of non-technical researchers. Dubbed *Manatee* (short for *MANatee Automates Text Exploration*, and rhyming with the name of our academic target domain), this future contribution to CLARIN-BE’s infrastructure is essentially a dashboard application that lets users manage corpora, execute carefully curated analysis workflows, visualize the results, and export the enriched datasets in one of several standard formats. It is an interactive environment in which, for example, a historian can upload a directory of PDFs, have them analyzed for sentiment with respect to a particular entity, and view the result as a chart plotting sentiment against time. Another example user is a corpus linguist interested in historical varieties of Dutch. After uploading their data, they opt for a specific configuration of part-of-speech tagging and morphological analysis and, by way of our integration with INT’s BlackLab corpus search engine, are able to execute powerful search queries against the analyzed data. Behind the scenes, *Manatee* relies on algorithm implementations and machine learning models that are contributed under an open-source model.
Our initial partners are members of the CLiPS group (University of Antwerp) and the LT3 group (Ghent University).

*Manatee* sets itself apart from more established competitors in the same domain (such as WebLicht and Voyant Tools) by prioritizing simplicity of interaction. Starting at the application level, *Manatee* is designed to avoid overwhelming users with choice. Specifics are hidden insofar as they are not relevant, and defaults are automatically inferred. Guided by non-technically worded questions, users can express the ‘what’ of their goal without being expected to know the ‘how’.
Simplicity is also the guiding principle when it comes to encouraging fellow developers to contribute custom components. Here, *Manatee* defines an easy-to-use Python micro-framework that lets others add new components (e.g. analyzers, workflows, visualizers) at almost no extra development cost. In the future, we aim to address the needs of a third group, namely researchers who prefer to interact with *Manatee* programmatically rather than through its user interface. Once *Manatee*’s internal API has stabilized, work will start on a public web API and Python SDK. However, this is a long-term goal.
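To give a rough idea of what contributing a component through a Python micro-framework could look like, the sketch below registers analyzers with a decorator and chains them into a workflow. It is an illustration only and does not reflect *Manatee*’s actual API: the names `Document`, `analyzer`, `REGISTRY`, and `run_workflow` are all hypothetical, and the whitespace tokenizer stands in for a real NLP component.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names below are hypothetical; this is not Manatee's real interface.


@dataclass
class Document:
    """A text together with the annotations that analyzers attach to it."""
    text: str
    annotations: Dict[str, object] = field(default_factory=dict)


Analyzer = Callable[[Document], Document]

# Maps a component name to the function that implements it.
REGISTRY: Dict[str, Analyzer] = {}


def analyzer(name: str) -> Callable[[Analyzer], Analyzer]:
    """Register the decorated function as a named analyzer component."""
    def decorator(func: Analyzer) -> Analyzer:
        if name in REGISTRY:
            raise ValueError(f"analyzer {name!r} is already registered")
        REGISTRY[name] = func
        return func
    return decorator


@analyzer("tokenize")
def tokenize(doc: Document) -> Document:
    # Naive whitespace tokenizer, used purely for illustration.
    doc.annotations["tokens"] = doc.text.split()
    return doc


@analyzer("token_count")
def token_count(doc: Document) -> Document:
    # Depends on the "tokens" annotation produced by the tokenizer.
    doc.annotations["token_count"] = len(doc.annotations["tokens"])
    return doc


def run_workflow(steps: List[str], doc: Document) -> Document:
    """Apply the named analyzers to the document, in the given order."""
    for step in steps:
        doc = REGISTRY[step](doc)
    return doc


result = run_workflow(["tokenize", "token_count"], Document("De kat zit op de mat"))
print(result.annotations["token_count"])  # 6
```

Under this assumed design, a contributor writes one decorated function per component, and a workflow is simply an ordered list of component names.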

*Manatee* is in active development.
An alpha release is slated for July, with a beta release planned for late November.
The current plan is to support Dutch, French, and English, in both contemporary and historical varieties.
Our poster will highlight the innovative aspects of *Manatee*’s design.