Antal van den Bosch* and Alexander Greefhorst*
*Radboud University Nijmegen - P.O. Box 9103, NL-6500 HD Nijmegen, The Netherlands
Predicting liaison in French is a non-trivial modeling problem. We compare a memory-based machine-learning algorithm with a rule-based baseline. The memory-based learner is trained to predict whether liaison occurs between two words on the basis of lexical, orthographic, morphosyntactic, and sociolinguistic features. The best performance is obtained using only a selection of lexical and syntactic features: a window of the last five letters of a word and the first five letters of the following word, whether the liaison is obligatory or optional, part-of-speech tags, the number of syllables in a word, and the Levenshtein distance to the 20 nearest phonological neighbors. Contrary to our expectations, including sociolinguistic features lowered both the precision and the recall of our predictions. With only the selected lexical and syntactic features, the best overall performance is a precision of .80 and a recall of .85. The F-scores (the harmonic mean of precision and recall) of the memory-based algorithm are higher than those of a baseline based on the rules of Grevisse and Goosse (2011), of IGTree (a decision-tree learner), and of the Naive Bayes classifier. Ripper, a more sophisticated rule-induction algorithm, produces results similar to those of our memory-based algorithm, but in optional liaison contexts Ripper misses more instances in which real speakers would produce a liaison. It appears that predicting liaison benefits from being able to generalize from specific examples in context.
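As an illustration only, the letter-window feature and the F-score mentioned above can be sketched in a few lines of Python. The function names, the `_` padding symbol, and the example word pair are assumptions made for this sketch and are not taken from the study. The F-score is computed here as the balanced harmonic mean (beta = 1), and applying it to the rounded overall precision and recall gives only an indicative value, not a reported result.

```python
def letter_window(left_word, right_word, width=5, pad="_"):
    """Build a fixed-width letter window around a word boundary.

    Takes the last `width` letters of the left word and the first
    `width` letters of the right word. Shorter words are padded with
    `pad` (a hypothetical choice) so every instance has 2 * width slots.
    """
    left = left_word[-width:].rjust(width, pad)
    right = right_word[:width].ljust(width, pad)
    return list(left) + list(right)


def f_score(precision, recall, beta=1.0):
    """F-score; with beta = 1 this is the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical word pair in a liaison context ("les amis").
features = letter_window("les", "amis")

# Indicative F-score from the rounded precision and recall given in the text.
indicative_f = f_score(0.80, 0.85)
```

Each position in the window is one symbolic feature, so a memory-based learner can compare a new word pair with stored examples slot by slot.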