The Index Thomisticus Treebank Project: Annotation, Parsing and Valency Lexicon

Barbara McGillivray*, Marco Passarotti** and Paolo Ruffolo**
*University of Pisa, Italy; b.mcgillivray@ling.unipi.it
**Catholic University of the Sacred Heart, Milan, Italy; marco.passarotti@unicatt.it, paolo.ruffolo@poste.it
Abstract (in English)
We present an overview of the Index Thomisticus Treebank project (IT-TB). The IT-TB consists of around 60,000 tokens from the Index Thomisticus by Roberto Busa SJ, an 11-million-token Latin corpus of the texts by Thomas Aquinas. We briefly describe the annotation guidelines, which are shared with the Latin Dependency Treebank (LDT). We report on the application of data-driven dependency parsers to IT-TB and LDT data, present training and parsing results on several datasets, and evaluate the learning algorithms and techniques used. Furthermore, we introduce the IT-TB valency lexicon, which is extracted from the treebank. We report quantitative data on the lexicon and provide some statistical measures of its subcategorisation structures.